We Re-Audited 8 Major AI SDKs — Here's What Changed
Last week we scanned 4 repositories and found the same three failure modes in all of them. Today we re-ran the analysis with 12 new behavioral checks — and added 4 more codebases. The improvements are real. The patterns are not gone.
This is a follow-up to our March 18 audit.
The original analysis covered vercel/ai, LangChain.js, openai-node, and the MCP Servers reference implementation. Read it first for the full methodology context: We Audited 4 Major AI SDKs
The Numbers
| Metric | Value | Note |
|---|---|---|
| Repos scanned | 8 | +4 new |
| Files analyzed | 4,665 | JS, TS, Python |
| Critical findings | 260 | across 8 repos |
| Total findings | 10,961 | all severities |
Scan methodology: CodeSlick CLI v1.5.4, quick mode (pattern-based; deep TypeScript compiler analysis excluded for speed). Raw scan data available at github.com/VitorLourenco/ai-sdk-security-audits.
Original 4 Repos: What Changed
All four repositories improved on critical findings. That is genuinely good news. The reductions reflect both active security work by maintainer teams and some structural changes in how these repos are organized — test files and example code have been separated more clearly, which removes a significant source of credential findings.
| Repository | Critical (Mar 18) | Critical (Mar 23) | Delta | High (Mar 23) |
|---|---|---|---|---|
| vercel/ai | 17 | 6 | -65% | 245 |
| langchain-ai/langchainjs | 200 | 150 | -25% | 480 |
| openai/openai-node | 2 | 1 | -50% | 89 |
| modelcontextprotocol/servers | 1 | 1 | 0% | 12 |
vercel/ai is the biggest improvement: -65% critical
Down from 17 critical to 6, high findings roughly halved (468 → 245). The command-injection and deserialization findings that remain are concentrated in the codemod tooling and streaming utilities — not the core SDK that ships to users.
LangChain improved but still leads on critical count: 150
Down from 200, but 150 critical findings in a framework that orchestrates production AI workflows remains a significant number. The concentration is in integration test files and provider examples — the same structural problem identified in the first audit. High findings (480) barely moved (-13), indicating the error-handling and unvalidated-input patterns are deeply embedded in the codebase.
MCP Behavioral Checks: What the New Analysis Found
Since the March 18 audit, CodeSlick added 12 behavioral checks specifically targeting MCP server patterns: tool poisoning risk, schema validation bypass, missing authentication in tool handlers, excessive permissions, sensitive data exposure through tools, and unsafe resource access. These checks were designed precisely for codebases like modelcontextprotocol/servers.
The official MCP reference implementation returned identical results with and without the new behavioral checks: 1 critical (hardcoded credential in example config), 12 high, 140 total. The new checks targeting tool poisoning, schema bypass, and missing auth handlers did not surface additional findings in this codebase.
This is the expected result for a reference implementation maintained by a security-aware team. It does not mean MCP servers in the wild are equally clean.
MCP TypeScript SDK: 2 critical, 27 high
This is the SDK used to build MCP servers — the upstream dependency of most TypeScript MCP implementations. Both critical findings are hardcoded credentials in authentication example files (authExtensions.examples.ts). The 27 high findings are spread across 96 files.
The same pattern identified in the March 18 audit — credentials in example code — is present in the SDK that developers clone first when evaluating MCP. Example patterns propagate into production implementations. This is precisely how supply chain contamination starts.
4 New Repos: First-Time Scans
CrewAI and Microsoft AutoGen represent the agent framework layer — code that orchestrates multi-step AI operations. The Anthropic and Google Gemini SDKs add the two remaining major model providers to the picture. Together they give a fuller view of the dependency stack that production AI applications run on.
| Repository | Files | Critical | High | Density (findings/file) |
|---|---|---|---|---|
| crewAIInc/crewAI | 761 | 75 | 82 | 0.57/file |
| modelcontextprotocol/typescript-sdk | 96 | 2 | 27 | 3.04/file |
| anthropics/anthropic-sdk-python* | 547 | 24 | 51 | 0.21/file |
| google-gemini/generative-ai-js | 55 | 1 | 33 | 6.07/file |
* Anthropic Python SDK critical findings are classified as known-malicious-package — a check that matches against a registry of flagged packages. Manual review is required to confirm whether these are true positives or false positives from package name collisions.
CrewAI: 75 critical in 761 files
The highest critical count after LangChain, in a framework that builds multi-agent pipelines where individual agents call tools, access external data, and pass results between each other. The combination of missing error handling (high count) and unvalidated inputs in a multi-agent orchestration context is the highest-risk profile in this audit.
The practical implication: when an agent step fails silently, the orchestrator continues with corrupted or empty context. In a multi-step pipeline with tool calls, that is not a theoretical risk — it is the default behavior when error handling is absent.
Google Gemini JS: Highest finding density — 6.07 per file
The smallest repo in the audit (55 files) with 334 total findings. The single critical finding is a dynamic require in a code transformation utility (samples/utils/insert-import-comments.js) — pattern-matching flagged it as potential require injection, though the context is a developer tool, not production SDK code. The density is driven by high counts of missing error handling and unvalidated inputs. High finding density in a small codebase often indicates systematic omissions rather than isolated bugs.
Full Ecosystem View
| Repository | Layer | Files | Critical | High | Total |
|---|---|---|---|---|---|
| langchain-ai/langchainjs | Agent orchestration | 1,433 | 150 | 480 | 5,347 |
| vercel/ai | Streaming SDK | 1,459 | 6 | 245 | 3,480 |
| anthropics/anthropic-sdk-python | Model provider | 547 | 24 | 51 | 113 |
| crewAIInc/crewAI | Agent framework | 761 | 75 | 82 | 430 |
| openai/openai-node | Model provider | 256 | 1 | 89 | 825 |
| modelcontextprotocol/typescript-sdk | MCP SDK | 96 | 2 | 27 | 292 |
| google-gemini/generative-ai-js | Model provider | 55 | 1 | 33 | 334 |
| modelcontextprotocol/servers | MCP reference | 58 | 1 | 12 | 140 |
| TOTAL (8 repos) | — | 4,665 | 260 | 1,019 | 10,961 |
The 3 Patterns That Persist Across All 8 Repos
Expanding the scope from 4 to 8 repositories did not change the structural findings. The same three categories appear at the top of every codebase's finding list, regardless of language, framework type, or team size.
Hardcoded credentials in example and test code
Every repository in this audit has hardcoded credentials. In every case, the credentials are in example files, integration test fixtures, or documentation samples — not in the production SDK code that ships to users. This distinction matters less than it appears. Developers clone repositories to understand patterns. When the first file they open shows a hardcoded API key, that pattern normalizes. It appears in their own code three weeks later. The supply chain risk is behavioral, not infrastructural.
Missing error handling in async and agent flows
Promise chains without catch handlers and async functions without try-catch are the dominant finding across all repos. In a client SDK this is often tolerable — the application layer handles errors. In an agent orchestration framework, it is not. When LangChain or CrewAI fails to handle an error in an intermediate agent step, the pipeline continues with undefined or empty context. In multi-step reasoning chains, one silent failure corrupts everything downstream.
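The failure mode is easy to reproduce. The sketch below is illustrative, not LangChain or CrewAI code: a fire-and-forget promise with a swallowing `.catch()` lets the next step run on empty context, while the `await`-and-rethrow version surfaces the failure with its step name.

```typescript
// Minimal sketch of the silent-failure pattern (names are illustrative).
type Ctx = { docs?: string[] };

async function fetchDocs(): Promise<string[]> {
  throw new Error("retriever unavailable"); // simulated tool failure
}

async function unsafePipeline(): Promise<string> {
  const ctx: Ctx = {};
  // Anti-pattern: floating promise plus a .catch() that discards the error.
  fetchDocs().then((d) => (ctx.docs = d)).catch(() => {});
  // Step 2 proceeds immediately with ctx.docs === undefined.
  return `summarized ${ctx.docs?.length ?? 0} documents`;
}

async function safePipeline(): Promise<string> {
  const ctx: Ctx = {};
  try {
    ctx.docs = await fetchDocs();
  } catch (err) {
    // Surface the failure instead of continuing with empty context.
    throw new Error(`step 1 (fetchDocs) failed: ${(err as Error).message}`);
  }
  return `summarized ${ctx.docs.length} documents`;
}
```

The unsafe variant happily reports "summarized 0 documents" — exactly the wrong-but-plausible output that flows downstream in a reasoning chain.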
Unvalidated inputs in tool handlers and API boundaries
Tool handlers — the functions that AI models call to interact with external systems — do not validate their inputs in the majority of cases examined. This matters most in MCP contexts: MCP servers receive tool calls from AI models that may process untrusted user input. A missing null check in a tool handler is the precondition for a crash that could be triggered through prompt injection. The risk is not hypothetical — it is the default execution path when validation is absent.
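A minimal sketch of the missing guard, using no framework APIs — the tool name, argument shape, and checks are hypothetical. The point is that validation runs before any external access, so a malformed or adversarial argument fails fast instead of reaching the filesystem:

```typescript
// Illustrative tool handler with entry-point validation (not the MCP SDK API).
type ReadFileArgs = { path: string };

function parseReadFileArgs(raw: unknown): ReadFileArgs {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("arguments must be an object");
  }
  const { path } = raw as Record<string, unknown>;
  if (typeof path !== "string" || path.length === 0) {
    throw new Error("'path' must be a non-empty string");
  }
  if (path.includes("..")) {
    // Reject traversal before the value reaches the filesystem.
    throw new Error("'path' must not contain '..'");
  }
  return { path };
}

function readFileTool(raw: unknown): string {
  const args = parseReadFileArgs(raw); // validate first, execute second
  return `would read: ${args.path}`;   // placeholder for the real file access
}
```

In practice a schema library does this more thoroughly, but even hand-rolled guards like these close the crash-via-prompt-injection path the audit describes.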
What This Means If You Build on These SDKs
Audit your own codebase for patterns you copied from SDK examples
If you cloned a LangChain quickstart or an MCP server example and never audited what you brought in, that is the first place to look. The hardcoded credential pattern spreads through imitation.
Treat agent framework errors as critical paths, not exceptional ones
In multi-agent pipelines, error handling in intermediate steps is not optional. An unhandled rejection in step 3 of a 7-step workflow will produce wrong output, not a visible error. Add explicit error handling at every agent boundary.
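One way to make every boundary explicit is a per-step wrapper. This is a hedged sketch, not a framework feature — `runStep` and `StepError` are names invented for illustration:

```typescript
// Generic error boundary for each step of a multi-step agent pipeline.
class StepError extends Error {
  constructor(public step: string, cause: unknown) {
    super(`agent step '${step}' failed: ${String(cause)}`);
  }
}

async function runStep<T>(step: string, fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    // Fail loudly with the step name instead of passing empty context on.
    throw new StepError(step, err instanceof Error ? err.message : err);
  }
}

async function pipeline(): Promise<string> {
  const query = await runStep("plan", async () => "find recent audits");
  const docs = await runStep("retrieve", async () => [query, "audit-2"]);
  return runStep("summarize", async () => `summary of ${docs.length} docs`);
}
```

A rejection in step 3 of a 7-step workflow then arrives as `agent step 'summarize' failed: …` instead of a silently wrong answer.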
Validate tool inputs before execution, not after
If you build MCP tools or expose functions to AI models, validate all arguments at the entry point. Do not assume the model will only pass valid values — especially when the model processes user input that may include adversarial content.
Methodology
Tool: CodeSlick CLI v1.5.4 — 308 security checks across JavaScript, TypeScript, and Python. Includes 12 new MCP behavioral checks (MCP-JS-001–008, MCP-PY-001–004) added March 8, 2026.
Scan mode: Quick mode (--quick) — pattern-based static analysis. Deep TypeScript compiler type checking excluded for scan speed across 8 repos. All credential, injection, error-handling, and input-validation checks are fully active in this mode.
Scope: Shallow clones (--depth 1) of the default branch as of March 23, 2026. All files scanned including examples, tests, and documentation code. This is intentional: example code in SDKs is how patterns propagate.
Interpretation: Static analysis findings require manual triage before they are treated as confirmed vulnerabilities. Some findings — particularly known-malicious-package in the Anthropic Python SDK — require additional review to distinguish true positives from false positives. All findings are published unfiltered; we report what the tool found, not a curated subset.
Audit your own codebase
The patterns in these repos appear in production applications that use them. Run the same analysis on your own code in under 60 seconds.