
A comprehensive framework for benchmarking single and multi-agent systems across a wide range of tasks—evaluating performance, accuracy, and efficiency with built-in visualization and tool integration.


MASArena 🏟️

Python 3.11+ · License: MIT · Documentation · Ask DeepWiki

Layered Architecture · Stack & Swap · Built for Scale

MASArena Architecture

🌟 Core Features

  • 🧱 Modular Design: Swap agents, tools, datasets, prompts, and evaluators with ease.
  • 📦 Built-in Benchmarks: Single/multi-agent datasets for direct comparison.
  • 📊 Visual Debugging: Inspect interactions, accuracy, and tool use.
  • 🔧 Tool Support: Manage tool selection via pluggable wrappers.
  • 🧩 Easy Extensions: Add agents via subclassing—no core changes.
  • 📂 Paired Datasets & Evaluators: Add new benchmarks with minimal effort.

🎬 Demo

See MASArena in action! This demo showcases the framework's visualization capabilities:

visualization.mp4

🚀 Quick Start

1. Setup

We recommend using uv for dependency and virtual environment management.

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate
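
If uv is not already installed, it can be added with pip (one convenient route; the uv documentation also provides a standalone installer):

# Install uv if it is not already available
pip install uv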

2. Configure Environment Variables

Create a .env file in the project root and set the following:

OPENAI_API_KEY=your_openai_api_key
MODEL_NAME=gpt-4o-mini
OPENAI_API_BASE=https://api.openai.com/v1
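
As an optional sanity check (not a required step), you can confirm that the key and endpoint respond before running any benchmarks. The snippet below assumes an OpenAI-compatible API that exposes the standard /models listing:

# Export the variables from .env into the current shell
set -a && source .env && set +a

# List available models to verify the key and base URL
curl -s -H "Authorization: Bearer $OPENAI_API_KEY" "$OPENAI_API_BASE/models"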

3. Running Benchmarks

./run_benchmark.sh

  • Supported benchmarks:
    • Math: math, aime
    • Code: humaneval, mbpp
    • Reasoning: drop, bbh, mmlu_pro, ifeval
  • Supported agent systems:
    • Single Agent: single_agent
    • Multi-Agent: supervisor_mas, swarm, agentverse, chateval, evoagent, jarvis, metagpt
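
To target a specific benchmark or agent system, check run_benchmark.sh (or the documentation) for the options it accepts. The flag names below are illustrative placeholders, not the script's documented interface:

# Hypothetical invocation; the actual options are defined in run_benchmark.sh
./run_benchmark.sh --benchmark humaneval --agent-system supervisor_mas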

📚 Documentation

For comprehensive guides, tutorials, and API references, visit our complete documentation.

✅ TODOs

  • Add asynchronous support for model calls
  • Implement failure detection in MAS workflows
  • Add more benchmarks emphasizing tool usage
  • Improve configuration for MAS and tool integration
  • Integrate multiple tools (e.g., browser, video, audio, Docker) into the current evaluation framework
  • Optimize the framework's tool management architecture to decouple MCP tool invocation from local tool invocation
  • Implement more benchmark evaluations that require tool usage (e.g., WebArena, SWE-bench)
  • Reimplement the dynamic architecture paper on top of the benchmark framework

🙌 Contributing

We warmly welcome contributions from the community!

You can contribute in many ways:

  • 🧠 New Agent Systems (MAS): Add novel single- or multi-agent systems to expand the diversity of strategies and coordination models.

  • 📊 New Benchmark Datasets: Bring in domain-specific or task-specific datasets (e.g., reasoning, planning, tool-use, collaboration) to broaden the scope of evaluation.

  • 🛠 New Tools & Toolkits: Extend the framework's tool ecosystem by integrating domain tools (e.g., search, calculators, code editors) and improving tool selection strategies.

  • ⚙️ Improvements & Utilities: Help with performance optimization, failure handling, asynchronous processing, or new visualizations.
