Layered Architecture • Stack • Swap • Built for Scale
- 🧱 Modular Design: Swap agents, tools, datasets, prompts, and evaluators with ease.
- 📦 Built-in Benchmarks: Single/multi-agent datasets for direct comparison.
- 📊 Visual Debugging: Inspect interactions, accuracy, and tool use.
- 🔧 Tool Support: Manage tool selection via pluggable wrappers.
- 🧩 Easy Extensions: Add new agents via subclassing, with no changes to the core (see the sketch after this list).
- 📂 Paired Datasets & Evaluators: Add new benchmarks with minimal effort.
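
To make the "extend via subclassing" idea concrete, here is a minimal sketch of what a custom agent might look like. The base-class name `AgentSystem` and the `run` signature are assumptions for illustration only; consult the MASArena source for the actual extension interface.

```python
# Hypothetical sketch: `AgentSystem` and `run` are assumed names for
# illustration, not MASArena's real API.
from dataclasses import dataclass


@dataclass
class AgentSystem:
    """Assumed minimal base class for an agent system."""
    name: str

    def run(self, problem: str) -> str:
        raise NotImplementedError


class EchoAgent(AgentSystem):
    """A trivial custom agent: answers by restating the problem."""

    def run(self, problem: str) -> str:
        return f"[{self.name}] proposed answer for: {problem}"


if __name__ == "__main__":
    agent = EchoAgent(name="echo_agent")
    print(agent.run("What is 2 + 2?"))
```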
See MASArena in action! This demo showcases the framework's visualization capabilities:
visualization.mp4
We recommend using uv for dependency and virtual environment management.
```bash
# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate
```
Create a `.env` file in the project root and set the following:

```bash
OPENAI_API_KEY=your_openai_api_key
MODEL_NAME=gpt-4o-mini
OPENAI_API_BASE=https://api.openai.com/v1
```
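
To verify that the `.env` values are picked up before running a benchmark, a quick check like the one below can help. It assumes `python-dotenv` and the `openai` package are available in the environment (both are common, but check the project's `pyproject.toml`); the framework may load configuration differently internally.

```python
# Quick sanity check that the .env values are readable and the API key works.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env from the project root

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1"),
)

response = client.chat.completions.create(
    model=os.getenv("MODEL_NAME", "gpt-4o-mini"),
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```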
```bash
./run_benchmark.sh
```
- Supported benchmarks:
  - Math: `math`, `aime`
  - Code: `humaneval`, `mbpp`
  - Reasoning: `drop`, `bbh`, `mmlu_pro`, `ifeval`
- Supported agent systems:
  - Single Agent: `single_agent`
  - Multi-Agent: `supervisor_mas`, `swarm`, `agentverse`, `chateval`, `evoagent`, `jarvis`, `metagpt`
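
A typical run selects one benchmark and one agent system from the lists above. The flag names below are assumptions for illustration only; inspect `run_benchmark.sh` (or the documentation) for the script's actual interface.

```bash
# Hypothetical invocation: --benchmark and --agent-system are assumed flag
# names, not necessarily the script's real options.
./run_benchmark.sh --benchmark math --agent-system supervisor_mas
```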
For comprehensive guides, tutorials, and API references, visit our complete documentation.
- Add asynchronous support for model calls
- Implement failure detection in MAS workflows
- Add more benchmarks emphasizing tool usage
- Improve configuration for MAS and tool integration
- Integrate multiple tools (e.g., browser, video, audio, Docker) into the current evaluation framework
- Optimize the framework's tool management architecture to decouple MCP tool invocation from local tool invocation
- Implement more benchmark evaluations that require tool usage (e.g., WebArena, SWE-bench)
- Reimplement the dynamic-architecture paper on top of the benchmark framework
We warmly welcome contributions from the community!
You can contribute in many ways:
- 🧠 New Agent Systems (MAS): Add novel single- or multi-agent systems to expand the diversity of strategies and coordination models.
- 📊 New Benchmark Datasets: Bring in domain- or task-specific datasets (e.g., reasoning, planning, tool use, collaboration) to broaden the scope of evaluation (see the sketch after this list).
- 🛠 New Tools & Toolkits: Extend the framework's tool ecosystem by integrating domain tools (e.g., search, calculators, code editors) and improving tool-selection strategies.
- ⚙️ Improvements & Utilities: Help with performance optimization, failure handling, asynchronous processing, or new visualizations.
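
As a rough picture of the "paired dataset + evaluator" pattern a new benchmark contribution follows, here is a minimal sketch. The data layout, the evaluator function, and the `evaluate` loop are assumptions for illustration, not MASArena's actual API.

```python
# Hypothetical sketch of pairing a dataset with an evaluator; names are
# assumed for illustration only.
from typing import Callable

# Each benchmark entry pairs a problem with a reference answer.
DATASET = [
    {"problem": "What is 2 + 3?", "answer": "5"},
    {"problem": "What is 10 - 4?", "answer": "6"},
]


def exact_match_evaluator(prediction: str, reference: str) -> bool:
    """Score a single prediction against its reference answer."""
    return prediction.strip() == reference.strip()


def evaluate(agent: Callable[[str], str]) -> float:
    """Run an agent callable over the dataset and report accuracy."""
    correct = sum(
        exact_match_evaluator(agent(item["problem"]), item["answer"])
        for item in DATASET
    )
    return correct / len(DATASET)


if __name__ == "__main__":
    # A stand-in "agent" that always answers "5".
    print(f"accuracy = {evaluate(lambda problem: '5'):.2f}")
```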