Layered Architecture • Stack • Swap • Built for Scale
- 🧱 Modular Design: Swap agents, tools, datasets, prompts, and evaluators with ease.
- 📦 Built-in Benchmarks: Single/multi-agent datasets for direct comparison.
- 📊 Visual Debugging: Inspect interactions, accuracy, and tool use.
- 🔧 Tool Support: Manage tool selection via pluggable wrappers.
- 🧩 Easy Extensions: Add new agents via subclassing, with no changes to the core (see the sketch after this list).
- 📂 Paired Datasets & Evaluators: Add new benchmarks with minimal effort.
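
To make the "extend via subclassing" idea concrete, here is a minimal sketch of what a custom agent might look like. The base-class name `AgentSystem` and the `run` signature are assumptions for illustration only; consult the MASArena source for the actual extension interface.

```python
# Hypothetical sketch: `AgentSystem` and `run` are assumed names for
# illustration, not MASArena's real API.
from dataclasses import dataclass


@dataclass
class AgentSystem:
    """Assumed minimal base class for an agent system."""
    name: str

    def run(self, problem: str) -> str:
        raise NotImplementedError


class EchoAgent(AgentSystem):
    """A trivial custom agent: answers by restating the problem."""

    def run(self, problem: str) -> str:
        return f"[{self.name}] proposed answer for: {problem}"


if __name__ == "__main__":
    agent = EchoAgent(name="echo_agent")
    print(agent.run("What is 2 + 2?"))
```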
See MASArena in action! This demo showcases the framework's visualization capabilities:
visualization.mp4
We recommend using uv for dependency and virtual environment management.
```bash
# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate
```
Create a `.env` file in the project root and set the following:

```bash
OPENAI_API_KEY=your_openai_api_key
MODEL_NAME=gpt-4o-mini
OPENAI_API_BASE=https://api.openai.com/v1
```
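
To verify that the `.env` values are picked up before running a benchmark, a quick check like the one below can help. It assumes `python-dotenv` and the `openai` package are available in the environment (both are common, but check the project's `pyproject.toml`); the framework may load configuration differently internally.

```python
# Quick sanity check that the .env values are readable and the API key works.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env from the project root

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1"),
)

response = client.chat.completions.create(
    model=os.getenv("MODEL_NAME", "gpt-4o-mini"),
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```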
```bash
./run_benchmark.sh
```
- Supported benchmarks:
  - Math: `math`, `aime`
  - Code: `humaneval`, `mbpp`
  - Reasoning: `drop`, `bbh`, `mmlu_pro`, `ifeval`
- Supported agent systems:
  - Single Agent: `single_agent`
  - Multi-Agent: `supervisor_mas`, `swarm`, `agentverse`, `chateval`, `evoagent`, `jarvis`, `metagpt`
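
A typical run selects one benchmark and one agent system from the lists above. The flag names below are assumptions for illustration only; inspect `run_benchmark.sh` (or the documentation) for the script's actual interface.

```bash
# Hypothetical invocation: --benchmark and --agent-system are assumed flag
# names, not necessarily the script's real options.
./run_benchmark.sh --benchmark math --agent-system supervisor_mas
```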
For comprehensive guides, tutorials, and API references, visit our complete documentation.
- Add asynchronous support for model calls
- Implement failure detection in MAS workflows
- Add more benchmarks emphasizing tool usage
- Improve configuration for MAS and tool integration
- Integrate multiple tools (e.g., browser, video, audio, Docker) into the current evaluation framework
- Optimize the framework's tool management architecture to decouple MCP tool invocation from local tool invocation
- Implement more benchmark evaluations that require tool usage (e.g., WebArena, SWE-bench)
- Reimplement the dynamic-architecture paper on top of the benchmark framework
We warmly welcome contributions from the community!
You can contribute in many ways:
- 🧠 New Agent Systems (MAS): Add novel single- or multi-agent systems to expand the diversity of strategies and coordination models.
- 📊 New Benchmark Datasets: Bring in domain- or task-specific datasets (e.g., reasoning, planning, tool use, collaboration) to broaden the scope of evaluation (see the sketch after this list).
- 🛠 New Tools & Toolkits: Extend the framework's tool ecosystem by integrating domain tools (e.g., search, calculators, code editors) and improving tool-selection strategies.
- ⚙️ Improvements & Utilities: Help with performance optimization, failure handling, asynchronous processing, or new visualizations.
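
As a rough picture of the "paired dataset + evaluator" pattern a new benchmark contribution follows, here is a minimal sketch. The data layout, the evaluator function, and the `evaluate` loop are assumptions for illustration, not MASArena's actual API.

```python
# Hypothetical sketch of pairing a dataset with an evaluator; names are
# assumed for illustration only.
from typing import Callable

# Each benchmark entry pairs a problem with a reference answer.
DATASET = [
    {"problem": "What is 2 + 3?", "answer": "5"},
    {"problem": "What is 10 - 4?", "answer": "6"},
]


def exact_match_evaluator(prediction: str, reference: str) -> bool:
    """Score a single prediction against its reference answer."""
    return prediction.strip() == reference.strip()


def evaluate(agent: Callable[[str], str]) -> float:
    """Run an agent callable over the dataset and report accuracy."""
    correct = sum(
        exact_match_evaluator(agent(item["problem"]), item["answer"])
        for item in DATASET
    )
    return correct / len(DATASET)


if __name__ == "__main__":
    # A stand-in "agent" that always answers "5".
    print(f"accuracy = {evaluate(lambda problem: '5'):.2f}")
```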