We Re-Audited 8 Major AI SDKs — Here's What Changed
Last week we scanned 4 repositories and found the same three failure modes in all of them. Today we re-ran the analysis with 12 new behavioral checks — and added 4 more codebases. The improvements are real. The patterns are not gone.
This is a follow-up to our March 18 audit.
The original analysis covered vercel/ai, LangChain.js, openai-node, and the MCP Servers reference implementation. Read it first for the full methodology context: We Audited 4 Major AI SDKs
The Numbers
| Metric | Value | Note |
|---|---|---|
| Repos scanned | 8 | +4 new |
| Files analyzed | 4,665 | JS, TS, Python |
| Critical findings | 260 | across 8 repos |
| Total findings | 10,961 | all severities |
Scan methodology: CodeSlick CLI v1.5.4, quick mode (pattern-based; deep TypeScript compiler analysis excluded for speed). Raw scan data available at github.com/VitorLourenco/ai-sdk-security-audits.
Original 4 Repos: What Changed
All four repositories improved on critical findings. That is genuinely good news. The reductions reflect both active security work by maintainer teams and some structural changes in how these repos are organized — test files and example code have been separated more clearly, which removes a significant source of credential findings.
| Repository | Critical (Mar 18) | Critical (Mar 23) | Delta | High (Mar 23) |
|---|---|---|---|---|
| vercel/ai | 17 | 6 | -65% | 245 |
| langchain-ai/langchainjs | 200 | 150 | -25% | 480 |
| openai/openai-node | 2 | 1 | -50% | 89 |
| modelcontextprotocol/servers | 1 | 1 | 0% | 12 |
vercel/ai is the biggest improvement: -65% critical
Down from 17 critical to 6, high findings roughly halved (468 → 245). The command-injection and deserialization findings that remain are concentrated in the codemod tooling and streaming utilities — not the core SDK that ships to users.
LangChain improved but still leads on critical count: 150
Down from 200, but 150 critical findings in a framework that orchestrates production AI workflows remains a significant number. The concentration is in integration test files and provider examples — the same structural problem identified in the first audit. High findings (480) barely moved (-13), indicating the error-handling and unvalidated-input patterns are deeply embedded in the codebase.
MCP Behavioral Checks: What the New Analysis Found
Since the March 18 audit, CodeSlick added 12 behavioral checks specifically targeting MCP server patterns: tool poisoning risk, schema validation bypass, missing authentication in tool handlers, excessive permissions, sensitive data exposure through tools, and unsafe resource access. These checks were designed precisely for codebases like modelcontextprotocol/servers.
The official MCP reference implementation returned identical results with and without the new behavioral checks: 1 critical (hardcoded credential in example config), 12 high, 140 total. The new checks targeting tool poisoning, schema bypass, and missing auth handlers did not surface additional findings in this codebase.
This is the expected result for a reference implementation maintained by a security-aware team. It does not mean MCP servers in the wild are equally clean.
MCP TypeScript SDK: 2 critical, 27 high
This is the SDK used to build MCP servers — the upstream dependency of most TypeScript MCP implementations. Both critical findings are hardcoded credentials in authentication example files (authExtensions.examples.ts). The 27 high findings are spread across 96 files.
The same pattern identified in the March 18 audit — credentials in example code — is present in the SDK that developers clone first when evaluating MCP. Example patterns propagate into production implementations. This is precisely how supply chain contamination starts.
4 New Repos: First-Time Scans
CrewAI and Microsoft AutoGen represent the agent framework layer — code that orchestrates multi-step AI operations. The Anthropic and Google Gemini SDKs add the two remaining major model providers to the picture. Together they give a fuller view of the dependency stack that production AI applications run on.
| Repository | Files | Critical | High | Density (findings/file) |
|---|---|---|---|---|
| crewAIInc/crewAI | 761 | 75 | 82 | 0.57/file |
| modelcontextprotocol/typescript-sdk | 96 | 2 | 27 | 3.04/file |
| anthropics/anthropic-sdk-python* | 547 | 24 | 51 | 0.21/file |
| google-gemini/generative-ai-js | 55 | 1 | 33 | 6.07/file |
* Anthropic Python SDK critical findings are classified as known-malicious-package — a check that matches against a registry of flagged packages. Manual review is required to confirm whether these are true positives or false positives from package name collisions.
CrewAI: 75 critical in 761 files
The highest critical count after LangChain, in a framework that builds multi-agent pipelines where individual agents call tools, access external data, and pass results between each other. The combination of missing error handling (high count) and unvalidated inputs in a multi-agent orchestration context is the highest-risk profile in this audit.
The practical implication: when an agent step fails silently, the orchestrator continues with corrupted or empty context. In a multi-step pipeline with tool calls, that is not a theoretical risk — it is the default behavior when error handling is absent.
Google Gemini JS: Highest finding density — 6.07 per file
The smallest repo in the audit (55 files) with 334 total findings. The single critical finding is a dynamic require in a code transformation utility (samples/utils/insert-import-comments.js) — pattern-matching flagged it as potential require injection, though the context is a developer tool, not production SDK code. The density is driven by high counts of missing error handling and unvalidated inputs. High finding density in a small codebase often indicates systematic omissions rather than isolated bugs.
Full Ecosystem View
| Repository | Layer | Files | Critical | High | Total |
|---|---|---|---|---|---|
| langchain-ai/langchainjs | Agent orchestration | 1,433 | 150 | 480 | 5,347 |
| vercel/ai | Streaming SDK | 1,459 | 6 | 245 | 3,480 |
| anthropics/anthropic-sdk-python | Model provider | 547 | 24 | 51 | 113 |
| crewAIInc/crewAI | Agent framework | 761 | 75 | 82 | 430 |
| openai/openai-node | Model provider | 256 | 1 | 89 | 825 |
| modelcontextprotocol/typescript-sdk | MCP SDK | 96 | 2 | 27 | 292 |
| google-gemini/generative-ai-js | Model provider | 55 | 1 | 33 | 334 |
| modelcontextprotocol/servers | MCP reference | 58 | 1 | 12 | 140 |
| TOTAL (8 repos) | — | 4,665 | 260 | 1,019 | 10,961 |
The 3 Patterns That Persist Across All 8 Repos
Expanding the scope from 4 to 8 repositories did not change the structural findings. The same three categories appear at the top of every codebase's finding list, regardless of language, framework type, or team size.
Hardcoded credentials in example and test code
Every repository in this audit has hardcoded credentials. In every case, the credentials are in example files, integration test fixtures, or documentation samples — not in the production SDK code that ships to users. This distinction matters less than it appears. Developers clone repositories to understand patterns. When the first file they open shows a hardcoded API key, that pattern normalizes. It appears in their own code three weeks later. The supply chain risk is behavioral, not infrastructural.
Missing error handling in async and agent flows
Promise chains without catch handlers and async functions without try-catch are the dominant finding across all repos. In a client SDK this is often tolerable — the application layer handles errors. In an agent orchestration framework, it is not. When LangChain or CrewAI fails to handle an error in an intermediate agent step, the pipeline continues with undefined or empty context. In multi-step reasoning chains, one silent failure corrupts everything downstream.
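The failure mode is easy to reproduce. The sketch below is illustrative, not LangChain or CrewAI code: a fire-and-forget promise with a swallowing `.catch()` lets the next step run on empty context, while the `await`-and-rethrow version surfaces the failure with its step name.

```typescript
// Minimal sketch of the silent-failure pattern (names are illustrative).
type Ctx = { docs?: string[] };

async function fetchDocs(): Promise<string[]> {
  throw new Error("retriever unavailable"); // simulated tool failure
}

async function unsafePipeline(): Promise<string> {
  const ctx: Ctx = {};
  // Anti-pattern: floating promise plus a .catch() that discards the error.
  fetchDocs().then((d) => (ctx.docs = d)).catch(() => {});
  // Step 2 proceeds immediately with ctx.docs === undefined.
  return `summarized ${ctx.docs?.length ?? 0} documents`;
}

async function safePipeline(): Promise<string> {
  const ctx: Ctx = {};
  try {
    ctx.docs = await fetchDocs();
  } catch (err) {
    // Surface the failure instead of continuing with empty context.
    throw new Error(`step 1 (fetchDocs) failed: ${(err as Error).message}`);
  }
  return `summarized ${ctx.docs.length} documents`;
}
```

The unsafe variant happily reports "summarized 0 documents" — exactly the wrong-but-plausible output that flows downstream in a reasoning chain.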
Unvalidated inputs in tool handlers and API boundaries
Tool handlers — the functions that AI models call to interact with external systems — do not validate their inputs in the majority of cases examined. This matters most in MCP contexts: MCP servers receive tool calls from AI models that may process untrusted user input. A missing null check in a tool handler is the precondition for a crash that could be triggered through prompt injection. The risk is not hypothetical — it is the default execution path when validation is absent.
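A minimal sketch of the missing guard, using no framework APIs — the tool name, argument shape, and checks are hypothetical. The point is that validation runs before any external access, so a malformed or adversarial argument fails fast instead of reaching the filesystem:

```typescript
// Illustrative tool handler with entry-point validation (not the MCP SDK API).
type ReadFileArgs = { path: string };

function parseReadFileArgs(raw: unknown): ReadFileArgs {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("arguments must be an object");
  }
  const { path } = raw as Record<string, unknown>;
  if (typeof path !== "string" || path.length === 0) {
    throw new Error("'path' must be a non-empty string");
  }
  if (path.includes("..")) {
    // Reject traversal before the value reaches the filesystem.
    throw new Error("'path' must not contain '..'");
  }
  return { path };
}

function readFileTool(raw: unknown): string {
  const args = parseReadFileArgs(raw); // validate first, execute second
  return `would read: ${args.path}`;   // placeholder for the real file access
}
```

In practice a schema library does this more thoroughly, but even hand-rolled guards like these close the crash-via-prompt-injection path the audit describes.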
What This Means If You Build on These SDKs
Audit your own codebase for patterns you copied from SDK examples
If you cloned a LangChain quickstart or an MCP server example and never audited what you brought in, that is the first place to look. The hardcoded credential pattern spreads through imitation.
Treat agent framework errors as critical paths, not exceptional ones
In multi-agent pipelines, error handling in intermediate steps is not optional. An unhandled rejection in step 3 of a 7-step workflow will produce wrong output, not a visible error. Add explicit error handling at every agent boundary.
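One way to make every boundary explicit is a per-step wrapper. This is a hedged sketch, not a framework feature — `runStep` and `StepError` are names invented for illustration:

```typescript
// Generic error boundary for each step of a multi-step agent pipeline.
class StepError extends Error {
  constructor(public step: string, cause: unknown) {
    super(`agent step '${step}' failed: ${String(cause)}`);
  }
}

async function runStep<T>(step: string, fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    // Fail loudly with the step name instead of passing empty context on.
    throw new StepError(step, err instanceof Error ? err.message : err);
  }
}

async function pipeline(): Promise<string> {
  const query = await runStep("plan", async () => "find recent audits");
  const docs = await runStep("retrieve", async () => [query, "audit-2"]);
  return runStep("summarize", async () => `summary of ${docs.length} docs`);
}
```

A rejection in step 3 of a 7-step workflow then arrives as `agent step 'summarize' failed: …` instead of a silently wrong answer.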
Validate tool inputs before execution, not after
If you build MCP tools or expose functions to AI models, validate all arguments at the entry point. Do not assume the model will only pass valid values — especially when the model processes user input that may include adversarial content.
Methodology
Tool: CodeSlick CLI v1.5.4 — 308 security checks across JavaScript, TypeScript, and Python. Includes 12 new MCP behavioral checks (MCP-JS-001–008, MCP-PY-001–004) added March 8, 2026.
Scan mode: Quick mode (--quick) — pattern-based static analysis. Deep TypeScript compiler type checking excluded for scan speed across 8 repos. All credential, injection, error-handling, and input-validation checks are fully active in this mode.
Scope: Shallow clones (--depth 1) of the default branch as of March 23, 2026. All files scanned including examples, tests, and documentation code. This is intentional: example code in SDKs is how patterns propagate.
Interpretation: Static analysis findings require manual triage before they are treated as confirmed vulnerabilities. Some findings — particularly known-malicious-package in the Anthropic Python SDK — require additional review to distinguish true positives from false positives. All findings are published unfiltered; we report what the tool found, not a curated subset.
Audit your own codebase
The patterns in these repos appear in production applications that use them. Run the same analysis on your own code in under 60 seconds.