We ran CodeSlick against the source code of the most-used AI agent frameworks. All of them had serious issues. That would matter in any library, but agent frameworks compound the risk: they run your agents, execute generated code, handle secrets, and persist state between autonomous actions.
New audits: AutoGen and CrewAI. Combined with previously published results for LangChain, Vercel AI, OpenAI, and MCP Servers.
| Framework | Stars | Files | Total findings | Critical | High |
|---|---|---|---|---|---|
| microsoft/autogen (new) | ~40k | 422 | 490 | 61 | 85 |
| crewAIInc/crewAI (new) | ~25k | 759 | 426 | 75 | 81 |
| langchain-ai/langchainjs | ~13k | ~3,200 | 8,650 | 200 | — |
| vercel/ai | ~9k | ~2,100 | 10,460 | 17 | — |
| openai/openai-node | ~8k | — | 1,105 | 1 | — |
| modelcontextprotocol/servers | — | — | 140 | 1 | — |
Scan date: March 2026. CodeSlick v20260319 (306 security checks). Shallow clone, full repo surface.
## exec() as Architecture

AutoGen is Microsoft's multi-agent framework built around one idea: LLM agents write Python code, and the framework runs it. CodeSlick flagged 16 uses of eval() / exec() / compile() as critical, and they're not bugs. They're the entire point of AutoGen. The CodeExecutorAgent literally calls exec() on code generated by the LLM.
```python
exec(generated_code, namespace)  # CodeSlick: CRITICAL — eval-usage (CVSS 9.8, CWE-78)
```
Why this matters when you build on AutoGen: any prompt injection that reaches the code generator can produce exec() payloads that run in your environment. AutoGen has sandboxing options (Docker execution), but many deployments skip them. If your AutoGen agent has file system or network access and you're not using Docker isolation, an adversarial prompt that reaches the LLM can execute arbitrary code.
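To make the threat model concrete, here is a minimal sketch of the unsandboxed path. `run_generated_code` is a hypothetical stand-in for any framework code that passes LLM output to exec(); it is not AutoGen's actual implementation:

```python
def run_generated_code(code: str) -> dict:
    """Illustrative only: LLM output reaches exec() inside the host process."""
    namespace: dict = {}
    exec(code, namespace)  # the generated code runs with full process privileges
    return namespace

# A benign task result works as intended...
print(run_generated_code("result = 2 + 2")["result"])  # 4

# ...but an injected payload runs with exactly the same privileges.
# Here it merely counts environment variables; it could just as easily read them.
payload = "import os; result = len(os.environ)"
print(run_generated_code(payload)["result"])
```

Nothing in this path distinguishes "code the model wrote to solve the task" from "code an attacker steered the model into writing", which is why process-level isolation has to come from outside the exec() call.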
## CrewAI

CrewAI is the "role-playing agents" framework: you define crew members with roles and they collaborate on tasks. It's the second most popular agent framework after AutoGen. The most surprising finding: SQL injection in the framework's storage layer.
```python
query = f"SELECT * FROM tasks WHERE id = {task_id}"
# CodeSlick: CRITICAL — sql-injection (CVSS 9.8, CWE-89)
```

CrewAI stores crew memory, task outputs, and tool results. SQL injection here means an attacker who can influence task outputs, through a malicious tool response for example, can manipulate the crew's memory database. Agent output becomes an injection vector.
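A minimal sketch of the failure mode and its fix, using sqlite3 and a hypothetical `tasks` table (not CrewAI's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT, output TEXT)")
conn.executemany("INSERT INTO tasks VALUES (?, ?)",
                 [("t1", "summary"), ("t2", "private notes")])

# Attacker-influenced value, e.g. smuggled in via a malicious tool response
task_id = "t1' OR '1'='1"

# Vulnerable: the value is interpolated into the SQL text itself
leaked = conn.execute(
    f"SELECT * FROM tasks WHERE id = '{task_id}'").fetchall()
print(len(leaked))  # 2 — the injected OR clause matches every row

# Fixed: parameter binding keeps the value out of the SQL grammar
safe = conn.execute(
    "SELECT * FROM tasks WHERE id = ?", (task_id,)).fetchall()
print(len(safe))  # 0 — no task has that literal id
```

The fix is the standard one: never build the query string from values, always bind them as parameters so the driver treats them as data.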
CrewAI also has eval() / exec() calls in its tool execution and code evaluation paths.

## Silent exception suppression

This is the finding that worries us most for production agent systems:
```python
try:
    result = agent.run(task)
except Exception:
    pass  # CodeSlick: silent-exception-suppression (CWE-390)
```

In a normal web app, a swallowed exception means one request fails silently. In an agent pipeline, it means the agent loop continues with corrupted state. The next agent in the chain receives a None result, infers a default, and the pipeline completes, looking successful to the orchestrator while producing garbage output.
Worse: silent failures in agent loops can create retry storms. If an agent tool silently fails, the LLM may retry indefinitely, consuming tokens and time, before the orchestrator times out.
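One mitigation is to bound retries explicitly. This is a generic sketch, not an API from either framework; `call_with_budget` and `shaky` are hypothetical names:

```python
import time

def call_with_budget(tool, task, max_attempts=3, base_delay=0.05):
    """Cap attempts and back off exponentially so a failing tool
    cannot become an unbounded retry loop burning tokens and time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(task)
        except Exception as exc:
            if attempt == max_attempts:
                raise RuntimeError(
                    f"tool gave up after {max_attempts} attempts") from exc
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.05s, 0.1s, ...

# A tool that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def shaky(task):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return f"done: {task}"

print(call_with_budget(shaky, "fetch docs"))  # done: fetch docs
```

The key properties are a hard attempt cap and a loud terminal failure; both are exactly what the `except: pass` pattern above throws away.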
## The request package

Both repos triggered the known-malicious-package check for "request" (16× in AutoGen, 9× in CrewAI). This is CodeSlick flagging request (singular), a known typosquat of the legitimate requests library, found in example scripts and test fixtures. Worth flagging even in test code, but these are not production dependencies. Post-triage adjusted critical counts: AutoGen 45 critical (down from 61), CrewAI 66 critical (down from 75). Still severe.
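The triage step above can be sketched as a lookup against a denylist of known lookalike names. This is a hypothetical helper, not part of CodeSlick, and the denylist here contains only the one typosquat the audit describes:

```python
# Known lookalike package names mapped to the package they imitate.
# Assumption for illustration: only the "request"/"requests" pair from this audit.
TYPOSQUATS = {"request": "requests"}

def flag_typosquats(imported_names):
    """Return {suspicious_name: intended_name} for known lookalikes."""
    return {name: TYPOSQUATS[name]
            for name in imported_names if name in TYPOSQUATS}

print(flag_typosquats(["os", "request", "numpy"]))  # {'request': 'requests'}
```

A real check would source its denylist from a maintained advisory feed rather than a hard-coded dict, which is the role pip-audit and npm audit play.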
## Takeaways

- The exec() surface is real. Sandboxed execution is not optional for production.
- Run pip-audit / npm audit on your lock files regularly.

All scan results are in our public audit repository: github.com/VitorLourenco/ai-sdk-security-audits

The framework's security posture is a baseline. Your code adds more surface.