
ML-Dev-Bench: Testing AI Agents on Real-World ML Workflows
Ever wondered if AI agents can reliably develop new AI models? Look no further!
ML-Dev-Bench is a benchmark for evaluating AI agents on real-world ML development tasks.
The benchmark currently includes 30 tasks covering various aspects of model development, including dataset management, debugging model and code failures, and implementing new ideas to achieve strong performance on various machine learning tasks.
We also introduce Calipers, an evaluation framework that provides the tools and infrastructure for systematically assessing agent performance on these tasks.
Table of Contents
- Highlights
- Features
- Adding New Evaluation Tasks
- Requirements
- Installation
- Usage
- Development
- Project Structure
- Adding new Evaluation Cases
- Adding New Agents
- Contributing
- Evaluation Traces
- License
- Acknowledgments
- Citation
Highlights
What kind of tasks are currently in ml-dev-bench?
ml-dev-bench currently includes 30 tasks across the following categories.
| Category | Description |
|---|---|
| Dataset Handling | Downloading and preprocessing datasets |
| Model Training | Loading pretrained models, fine-tuning |
| Debugging | Addressing errors in training files, exploding gradients, and incorrect implementations |
| Model Implementation | Modifying and implementing on top of existing model architectures |
| API Integration | Integrating logging tools like WandB |
| Performance | Improving baselines and achieving competitive results |
What kind of ML problems do these tasks cover?
The tasks cover ML development across problem domains such as image classification, segmentation, question answering, image generation, and LLM fine-tuning and alignment.
What is the performance of different agents on these tasks?
We currently evaluate 3 agents (ReAct, OpenHands, and AIDE) using 3 models (Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 Flash) on 30 tasks.
What are the common failures across agents?
Agents perform well in easier, well-defined categories such as dataset handling and basic debugging with clear instructions, but struggle on open-ended, long-running tasks such as model performance improvement, where no agent succeeded. Agents also fail on debugging and implementation tasks that require modifying large existing codebases.
Features
- Flexible evaluation framework for AI agents
- Comprehensive metrics tracking and reporting
- Integration with LiteLLM and LangChain
- Configurable task-based evaluation system using Hydra
- Support for parameter sweeps and multi-run evaluations
Adding New Evaluation Tasks
We welcome contributions of new evaluation tasks! The process is:
Propose Your Task
- Create a new issue using our New Evaluation Task template
- This helps gather feedback and ensures the task fits our evaluation framework
Implement Your Task
- After discussion and approval, implement your task following our examples:
  - hello_world for basic task structure
  - nan_losses for tasks with setup files and test scripts
Submit Your Implementation
- Create a pull request using our New Evaluation Task template
- Ensure all validation criteria and tests are implemented
Requirements
- Python 3.12+
- Poetry 1.8+
- Linux, macOS, or Windows Subsystem for Linux (WSL)
Installation
- Clone the repository:
git clone https://github.com/ml-dev-bench/ml-dev-bench.git
cd ml-dev-bench
- Install dependencies:
make build
This will:
- Check system requirements
- Install Python dependencies
- Set up pre-commit hooks
- Configure the development environment
- Install runtime dependencies:
This is needed for running evaluations locally.
make install-runtime-dependencies
Usage
The evaluation framework uses Hydra for configuration management, allowing flexible task and agent configurations.
Basic Usage
Run a single task with a specific agent:
./scripts/eval.sh task=hello_world agent=openhands
Run with configuration overrides:
./scripts/eval.sh task=hello_world agent=openhands num_runs=3
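These key=value arguments are standard Hydra overrides. The sketch below is not the repository's actual entry point, and the config fields (task, agent, num_runs) and conf/ layout are assumptions; it only illustrates how Hydra resolves such overrides into a single config object.
```python
# Minimal Hydra sketch, NOT the repository's actual entry point.
# It only shows how overrides like `task=hello_world agent=openhands num_runs=3`
# are resolved; the config layout under conf/ is an assumption.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg.task, cfg.agent, and cfg.num_runs come from the YAML config groups
    # plus any command-line overrides; --multirun sweeps over them.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```
Hydra's --multirun flag, used in the next subsection, sweeps over comma-separated or glob-matched values of these same overrides.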
Multi-run Evaluations
Create a .env file to store the API keys for the agents and model providers you are using (e.g. ANTHROPIC_API_KEY or OPENAI_API_KEY for LiteLLM-backed models).
Activate the virtual environment for that agent from the root directory (e.g. for OpenHands):
source .venv-openhands/<ml-dev-bench-version>/bin/activate
Run all available tasks with a specific agent:
./scripts/eval.sh --multirun "task=glob(*)" agent=openhands
Run a list of tasks with a specific agent:
./scripts/eval.sh --multirun task=hello_world,shape_mismatch_train agent=react
Development
- Format and lint code:
make lint
Project Structure
.
├── calipers/
│   ├── agents/        # Agent implementations
│   ├── callbacks/     # Callback handlers
│   ├── framework/     # Core evaluation framework
│   ├── metrics/       # Metrics tracking
│   └── scripts/       # CLI tools
│
└── runtime/
    ├── backends/      # Runtime backend implementations
    ├── environments/  # Environment configurations
    └── tools/         # Runtime tools
Adding new Evaluation Cases
Follow the structure of the existing cases in the ml_dev_bench/cases directory: create a new subdirectory there and add your case files.
A case includes a task.txt file describing the task to be run, a config.yaml file with the case configuration, and a Python file that evaluates the case. Optionally, you can add a setup_workspace directory that will be cloned into the workspace for the case.
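For orientation, here is a loose sketch of what the Python evaluation file might look like. The class name, method signature, and return format are assumptions, not the framework's actual API; mirror an existing case such as hello_world or nan_losses for the real interface.
```python
# Hypothetical sketch of a case's evaluation file (names and signatures are
# assumptions). It checks the agent's workspace against the task's criteria.
from pathlib import Path


class HelloWorldEval:
    """Validates the artifact the agent was asked to produce in task.txt."""

    def validate(self, workspace: Path) -> dict:
        target = workspace / "hello_world.txt"  # expected output file
        success = target.exists() and "hello" in target.read_text().lower()
        return {"success": success, "details": f"checked {target}"}
```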
Adding New Agents
Setting up Agent Dependencies using Poetry
- Add a new group in pyproject.toml:
[tool.poetry.group.{your-agent-name}.dependencies]
dependency1 = "^version"
dependency2 = "^version"
- Add a corresponding make target in the Makefile:
install-{your-agent}-dependencies:
@echo "$(GREEN)Installing Python dependencies with {your-agent} in new environment...$(RESET)"
POETRY_VIRTUALENVS_PATH="./.venv-{your-agent}" poetry env use python$(PYTHON_VERSION)
POETRY_VIRTUALENVS_PATH="./.venv-{your-agent}" poetry install --with {your-agent}
This creates a separate virtual environment with a suffix matching your agent name (e.g., .venv-{your-agent}).
Example: The react-agent group is set up with:
make install-react-agent-dependencies
This creates a dedicated environment at .venv-react with all react-agent-specific dependencies.
Adding Agents Code
- Create a new directory under agents/ with your agent name (e.g., agents/my_agent/)
- Add your agent implementation files in this directory (see the sketch after the example structure below)
- Create a Dockerfile in your agent directory that extends the base image
- Add agent configuration in ml_dev_bench/conf/agent/
Example structure:
agents/
├── my_agent/
│   ├── __init__.py
│   ├── my_agent.py    # Your agent implementation
│   └── Dockerfile     # Agent-specific Dockerfile
└── utils.py           # Shared utilities
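As a rough, hypothetical skeleton of my_agent.py (the real agent interface is defined under calipers/agents/, so the class and method names here are assumptions; copy an existing agent such as the ReAct agent for the actual contract):
```python
# agents/my_agent/my_agent.py: hypothetical skeleton only; the real base
# class and run() signature come from calipers/agents/.
class MyAgent:
    """Minimal agent that receives a task prompt and works in a workspace."""

    def __init__(self, model: str = "gpt-4o"):
        self.model = model  # model identifier passed through to LiteLLM

    def run(self, task_prompt: str, workspace_dir: str) -> dict:
        # 1. Read the task prompt and inspect the workspace.
        # 2. Call the LLM (e.g. via LiteLLM) and execute tool actions.
        # 3. Return a result the evaluation framework can record.
        return {"status": "completed", "workspace": workspace_dir}
```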
Agent Docker Setup
The project uses a two-stage Docker build:
- A base image with core dependencies
- Agent-specific images that extend the base image
Building Images
- Build the base image (from project root):
docker build -t ml-dev-bench-base -f docker_base/base.Dockerfile .
- Build your agent's image (from project root):
docker build -t ml-dev-bench-myagent -f agents/my_agent/Dockerfile .
Creating Agent Dockerfile
Your agent's Dockerfile should:
- Extend the base image
- Copy agent-specific code
- Install agent-specific dependencies
Example agent Dockerfile:
FROM ml-dev-bench-base:latest
# Copy the agent code
COPY agents/my_agent/ ./agents/my_agent/
COPY agents/__init__.py ./agents/
COPY agents/utils.py ./agents/
# Install agent-specific dependencies
RUN poetry install --with my-agent
# Set working directory
WORKDIR $WORKDIR/agents/my_agent
# Default command - open a shell with poetry env
CMD ["poetry", "shell"]
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Run linters and tests
- Submit a pull request
Evaluation Traces
Coming Soon
License
MIT License - see the LICENSE file for details
Acknowledgments
- LiteLLM for LLM integration
- Composio for runtime management
- Hydra for configuration management
Citation
If you use ML-Dev-Bench in your research, please cite our paper:
@misc{mldevbench,
  title={ML-Dev-Bench: Comparative Analysis of AI Agents on ML development workflows},
  author={Harshith Padigela and Chintan Shah and Dinkar Juyal},
  year={2025},
  eprint={2502.00964},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2502.00964},
}