We ran CodeSlick against the source code of the most-used AI agent frameworks. All of them had serious issues. That would matter in any library, but agent frameworks compound the risk: they run your agents, execute generated code, handle secrets, and persist state between autonomous actions.
New audits: AutoGen and CrewAI. Combined with previously published results for LangChain, Vercel AI, OpenAI, and MCP Servers.
| Framework | Stars | Files | Total findings | Critical | High |
|---|---|---|---|---|---|
| microsoft/autogen (new) | ~40k | 422 | 490 | 61 | 85 |
| crewAIInc/crewAI (new) | ~25k | 759 | 426 | 75 | 81 |
| langchain-ai/langchainjs | ~13k | ~3,200 | 8,650 | 200 | — |
| vercel/ai | ~9k | ~2,100 | 10,460 | 17 | — |
| openai/openai-node | ~8k | — | 1,105 | 1 | — |
| modelcontextprotocol/servers | — | — | 140 | 1 | — |
Scan date: March 2026. CodeSlick v20260319 (306 security checks). Shallow clone, full repo surface.
## exec() as Architecture

AutoGen is Microsoft's multi-agent framework built around one idea: LLM agents write Python code, and the framework runs it. CodeSlick flagged 16 uses of eval() / exec() / compile() as critical, and they're not bugs. They're the entire point of AutoGen. The CodeExecutorAgent literally calls exec() on code generated by the LLM.
```python
exec(generated_code, namespace)  # CodeSlick: CRITICAL — eval-usage (CVSS 9.8, CWE-78)
```
Why this matters when you build on AutoGen: any prompt injection that reaches the code generator can produce exec() payloads that run in your environment. AutoGen has sandboxing options (Docker execution), but many deployments skip them. If your AutoGen agent has file system or network access and you're not using Docker isolation, an adversarial prompt that reaches the LLM can execute arbitrary code.
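To make the threat model concrete, here is a minimal sketch of the unsandboxed path. `run_generated_code` is a hypothetical stand-in for any framework code that passes LLM output to exec(); it is not AutoGen's actual implementation:

```python
def run_generated_code(code: str) -> dict:
    """Illustrative only: LLM output reaches exec() inside the host process."""
    namespace: dict = {}
    exec(code, namespace)  # the generated code runs with full process privileges
    return namespace

# A benign task result works as intended...
print(run_generated_code("result = 2 + 2")["result"])  # 4

# ...but an injected payload runs with exactly the same privileges.
# Here it merely counts environment variables; it could just as easily read them.
payload = "import os; result = len(os.environ)"
print(run_generated_code(payload)["result"])
```

Nothing in this path distinguishes "code the model wrote to solve the task" from "code an attacker steered the model into writing", which is why process-level isolation has to come from outside the exec() call.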
## CrewAI

CrewAI is the "role-playing agents" framework: you define crew members with roles and they collaborate on tasks. It's the second most popular agent framework after AutoGen. The most surprising finding: SQL injection in the framework's storage layer.
```python
query = f"SELECT * FROM tasks WHERE id = {task_id}"
# CodeSlick: CRITICAL — sql-injection (CVSS 9.8, CWE-89)
```

CrewAI stores crew memory, task outputs, and tool results. SQL injection here means an attacker who can influence task outputs, through a malicious tool response for example, can manipulate the crew's memory database. Agent output becomes an injection vector.
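A minimal sketch of the failure mode and its fix, using sqlite3 and a hypothetical `tasks` table (not CrewAI's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT, output TEXT)")
conn.executemany("INSERT INTO tasks VALUES (?, ?)",
                 [("t1", "summary"), ("t2", "private notes")])

# Attacker-influenced value, e.g. smuggled in via a malicious tool response
task_id = "t1' OR '1'='1"

# Vulnerable: the value is interpolated into the SQL text itself
leaked = conn.execute(
    f"SELECT * FROM tasks WHERE id = '{task_id}'").fetchall()
print(len(leaked))  # 2 — the injected OR clause matches every row

# Fixed: parameter binding keeps the value out of the SQL grammar
safe = conn.execute(
    "SELECT * FROM tasks WHERE id = ?", (task_id,)).fetchall()
print(len(safe))  # 0 — no task has that literal id
```

The fix is the standard one: never build the query string from values, always bind them as parameters so the driver treats them as data.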
CrewAI also has eval() / exec() calls in its tool execution and code evaluation paths.

## Silent exception suppression

This is the finding that worries us most for production agent systems:
```python
try:
    result = agent.run(task)
except Exception:
    pass  # CodeSlick: silent-exception-suppression (CWE-390)
```

In a normal web app, a swallowed exception means one request fails silently. In an agent pipeline, it means the agent loop continues with corrupted state. The next agent in the chain receives a None result, infers a default, and the pipeline completes, looking successful to the orchestrator while producing garbage output.
Worse: silent failures in agent loops can create retry storms. If an agent tool silently fails, the LLM may retry indefinitely, consuming tokens and time, before the orchestrator times out.
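One mitigation is to bound retries explicitly. This is a generic sketch, not an API from either framework; `call_with_budget` and `shaky` are hypothetical names:

```python
import time

def call_with_budget(tool, task, max_attempts=3, base_delay=0.05):
    """Cap attempts and back off exponentially so a failing tool
    cannot become an unbounded retry loop burning tokens and time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(task)
        except Exception as exc:
            if attempt == max_attempts:
                raise RuntimeError(
                    f"tool gave up after {max_attempts} attempts") from exc
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.05s, 0.1s, ...

# A tool that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def shaky(task):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return f"done: {task}"

print(call_with_budget(shaky, "fetch docs"))  # done: fetch docs
```

The key properties are a hard attempt cap and a loud terminal failure; both are exactly what the `except: pass` pattern above throws away.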
## The request package

Both repos triggered the known-malicious-package check for "request" (16× in AutoGen, 9× in CrewAI). This is CodeSlick flagging request (singular), a known typosquat of the legitimate requests library, found in example scripts and test fixtures. Worth flagging even in test code, but these are not production dependencies. Post-triage adjusted critical counts: AutoGen 45 critical (down from 61), CrewAI 66 critical (down from 75). Still severe.
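The triage step above can be sketched as a lookup against a denylist of known lookalike names. This is a hypothetical helper, not part of CodeSlick, and the denylist here contains only the one typosquat the audit describes:

```python
# Known lookalike package names mapped to the package they imitate.
# Assumption for illustration: only the "request"/"requests" pair from this audit.
TYPOSQUATS = {"request": "requests"}

def flag_typosquats(imported_names):
    """Return {suspicious_name: intended_name} for known lookalikes."""
    return {name: TYPOSQUATS[name]
            for name in imported_names if name in TYPOSQUATS}

print(flag_typosquats(["os", "request", "numpy"]))  # {'request': 'requests'}
```

A real check would source its denylist from a maintained advisory feed rather than a hard-coded dict, which is the role pip-audit and npm audit play.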
## Takeaways

- The exec() surface is real. Sandboxed execution is not optional for production.
- Run pip-audit / npm audit on your lock files regularly.

All scan results are in our public audit repository: github.com/VitorLourenco/ai-sdk-security-audits

The framework's security posture is a baseline. Your code adds more surface.