CaseStudies

MassGen v0.0.28: AG2 Framework Integration with External Agent Adapters

MassGen is focused on case-driven development. MassGen v0.0.28 introduces AG2 framework integration through a comprehensive external agent adapter system, enabling MassGen to orchestrate agents from other established frameworks while maintaining its unique multi-agent consensus architecture. This case study demonstrates how AG2 agents with code execution capabilities can be seamlessly integrated into MassGen’s collaborative environment.

:depth: 3
:local:

📋 PLANNING PHASE

📝 Evaluation Design

Prompt

The prompt tests whether MassGen can effectively integrate and orchestrate agents from external frameworks (specifically AG2) while maintaining its collaborative consensus mechanism:

Output a summary comparing the differences between AG2 (https://github.com/ag2ai/ag2) and MassGen (https://github.com/Leezekun/MassGen) for LLM agents.

Baseline Config

Prior to v0.0.28, MassGen could only use agents from its native backends (OpenAI, Anthropic, Gemini, etc.). There was no way to integrate agents from external frameworks like AG2 or other multi-agent systems.

A baseline would use only native MassGen agents without external framework integration.

Baseline Command

# Pre-v0.0.28: No AG2 support available
# Would need to use only native backends
uv run python -m massgen.cli --config basic/single/single_gemini2.5pro.yaml "Output a summary comparing the differences between AG2 (https://github.com/ag2ai/ag2) and MassGen (https://github.com/Leezekun/MassGen) for LLM agents."

🔧 Evaluation Analysis

Results & Failure Modes

Before v0.0.28, users attempting to work with external frameworks like AG2 would face:

Cannot Leverage Existing AG2 Work:

Users who already built AG2 agents and workflows have no path to integrate them into MassGen
Forces users to choose between their AG2 investments OR MassGen’s consensus - cannot combine both

Limited Framework Options:

Restricted to MassGen’s native backends (OpenAI, Anthropic, Gemini, etc.)
Cannot bring specialized agents from other established frameworks into the collaboration

Success Criteria

AG2 Workflows Work in MassGen: Users can bring their existing AG2 agents into MassGen orchestration
AG2 Features Preserved: AG2 capabilities (code execution, tool calling) function correctly within MassGen
Full Collaboration: AG2 agents participate in MassGen’s consensus mechanism (generate, vote, refine)
Beyond AG2: The adapter architecture supports other external frameworks, not just AG2
Easy Configuration: Simple YAML setup for AG2 agents without complex integration work

🎯 Desired Features

With these goals defined, the next step was to design a system capable of bridging MassGen with external frameworks.

Adapter Architecture: Base adapter system bridging MassGen and external frameworks
AG2 Adapter: Implementation supporting AG2’s ConversableAgent and AssistantAgent types
External Agent Backend: New backend type routing requests through framework adapters
Code Execution Support: Integration of AG2’s executor types (LocalCommandLine, Docker, Jupyter, YepCode)
Async Execution: Proper handling of AG2’s async operations for autonomous agent behavior
Framework-Agnostic Config: YAML configuration extensible to any external framework

🚀 TESTING PHASE

📦 Implementation Details

Version

MassGen v0.0.28 (October 6, 2025)

✨ New Features

The AG2 integration was realized through an adapter registry pattern that bridges MassGen’s orchestration with external frameworks. The implementation consists of three core components:

1. Adapter Architecture

The integration introduces a base adapter interface (massgen/adapters/base.py) that defines how external frameworks communicate with MassGen. The AG2-specific implementation (massgen/adapters/ag2_adapter.py) supports:

AG2’s ConversableAgent and AssistantAgent types
Multiple code execution environments: local, Docker, Jupyter, and YepCode
Async operations through AG2’s a_generate_reply for autonomous agent behavior
Tool/function calling capabilities

2. External Agent Backend

A new ExternalAgentBackend class (massgen/backends/external.py) serves as the routing layer between MassGen and framework adapters. It handles:

Framework detection and adapter selection
Configuration parsing and validation
Request forwarding to appropriate adapters
Extensibility for future frameworks beyond AG2

3. Validation & Testing

Comprehensive test coverage ensures reliability across the integration:

test_ag2_adapter.py - AG2 adapter functionality validation
test_agent_adapter.py - Base adapter interface compliance
test_external_agent_backend.py - External backend integration tests

Additional v0.0.28 improvements include enhanced MCP circuit breaker logic and better error handling. See the full v0.0.28 release notes for complete details.

New Config

Configuration file: massgen/configs/ag2/ag2_coder_case_study.yaml

Key breakthrough - type: ag2 backend enables AG2 framework integration:

agents:
  - id: "ag2_coder"
    backend:
      type: ag2  # NEW: External agent backend using AG2 adapter
      agent_config:
        type: assistant  # AG2 AssistantAgent
        name: "AG2_coder"
        system_message: |
          You are a helpful coding assistant. When asked to create and run code:
          1. Write Python code in a markdown code block (```python ... ```)
          2. The code will be automatically executed
          3. Always print the results so they are visible

          If you need to access information from the web, create and run a web scraping script.
        llm_config:
          api_type: "openai"
          model: "gpt-5"
        code_execution_config:  # AG2's code execution feature
            executor:
              type: "LocalCommandLineCodeExecutor"
              timeout: 60
              work_dir: "./code_execution_workspace"

Key Innovation: This configuration shows AG2 agents working alongside native MassGen agents (Gemini) in the same coordination workflow!

Command

uv run python -m massgen.cli --config ag2/ag2_coder_case_study.yaml "Output a summary comparing the differences between AG2 (https://github.com/ag2ai/ag2) and MassGen (https://github.com/Leezekun/MassGen) for LLM agents."

🤖 Agents

Agent 1 (ag2_coder): AG2 AssistantAgent with code execution capabilities
- Backend: AG2 framework (external adapter)
- Model: GPT-5 via AG2’s LLM config
- Special capabilities: Python code execution with LocalCommandLineCodeExecutor
Agent 2 (gemini2.5pro): Native Gemini agent with web search
- Backend: Gemini (native MassGen backend)
- Model: gemini-2.5-pro
- Special capabilities: Web search enabled

Both agents participate in MassGen’s collaborative consensus mechanism, with AG2 agent bringing code execution capabilities while Gemini provides web-enhanced reasoning.

🎥 Demo

Watch the v0.0.28 AG2 Framework Integration in action:

Key artifacts from the case study run:

AG2 agent successfully executed web scraping code (GitHub API + README parsing) to gather live repository data
Both AG2 and Gemini agents generated comprehensive comparisons
Agents engaged in multiple iterations of refinement (17 total restarts)
Final consensus reached through MassGen’s voting mechanism

📊 EVALUATION & ANALYSIS

Results

The v0.0.28 AG2 integration successfully achieved all success criteria and demonstrated powerful cross-framework collaboration:

✅ Seamless Integration: AG2 agent configured and executed within MassGen orchestration without issues

✅ Feature Preservation: AG2’s code execution capabilities worked correctly (agent executed web scraping scripts combining GitHub API calls, README parsing, and data analysis)

✅ Collaborative Participation: AG2 agent generated 2 answers, participated in voting, and engaged in 9 restart cycles for refinement

✅ Extensible Architecture: Adapter system (base.py + ag2_adapter.py) designed for easy extension to other frameworks

✅ Configuration Simplicity: Clean YAML configuration with framework-specific settings under backend.agent_config

The Collaborative Process

How agents generated answers with v0.0.28 AG2 integration:

Understanding Answer Labels: MassGen uses a labeling system agent{N}.{attempt} where:

N = Agent number (1, 2, etc.)
attempt = Answer iteration number (increments after each restart/refinement)

For example:

agent2.1 = Agent 2’s 1st answer
agent2.3 = Agent 2’s 3rd answer (after 2 restarts)
agent1.final = Agent 1’s final answer as the winner

Multi-Iteration Refinement Pattern: The coordination log reveals an intensive collaborative refinement process:

Total restarts: 17 (Agent 1: 9 restarts, Agent 2: 8 restarts)
Total answers: 2 final answers (agent1.2, agent2.8 after multiple refinements)
Voting iterations: Multiple voting iterations with agents critiquing and improving

Agent 1 (AG2 agent) - Code Execution Approach:

Initial attempt (agent1.1): Leveraged AG2’s code execution capability to write and execute a Python script that programmatically analyzed both frameworks through:
- GitHub REST API calls for repository metadata (fetched live stats: AG2 3624 stars, MassGen 465 stars)
- Raw README fetching and parsing from both repositories
- Keyword extraction and frequency analysis across documentation
- Heuristic comparison logic based on extracted signals
- Code executed successfully with exitcode: 0 in ./code_execution_workspace
Restart 1-9: Despite having execution results, observed Agent 2’s conceptual analysis and restarted 9 times, ultimately deciding to pivot away from the data-driven approach
Final answer (agent1.2): Abandoned the execution output and created a concise conceptual bullet-point comparison that won selection

Agent 2 (Gemini agent) - Web Search & Analysis:

Initial attempt (agent2.1): Used web search to research both frameworks, generated structured comparison with tables
Restart 1-8: Refined through 8 iterations, reviewing Agent 1’s code-based analysis and incorporating insights
Final answer (agent2.8): Comprehensive multi-layered comparison emphasizing architectural differences

Key v0.0.28 improvement: AG2’s code execution capabilities brought a programmatic, data-driven approach to information gathering, complementing Gemini’s web search capabilities within the same consensus workflow.

The Voting Pattern

Cross-Framework Voting Dynamics:

The coordination table shows sophisticated voting behavior:

Agent 1 (AG2) voting behavior:
- Voted for Agent 2’s answers initially (agent2.1, agent2.3, agent2.4) across multiple iterations
- Later voted for its own refined answer (agent1.2) after multiple improvements
- Reasoning: “Concise, accurate bullet-point comparison focused on differences”
Agent 2 (Gemini) voting behavior:
- Consistently voted for Agent 2’s own refined answer (agent2.8)
- Reasoning: “Excellent, multi-layered comparison”
Winner selection:
- Agent 1 (AG2 agent) selected as winner with answer agent1.2
- Both agents voted for themselves in final iterations, but Agent 1’s answer was selected as the winner

Key v0.0.28 improvement: AG2 agents can fully participate in MassGen’s voting mechanism, demonstrating that external framework agents understand and contribute to consensus building.

Final Results & Answer Comparison

Winner: Agent 1 (ag2_coder) - AG2 AssistantAgent

To provide concrete evidence of the quality difference, here are the actual answers from both agents:

Agent 1’s Answer (agent1.2):

Concise bullet-point format covering 9 key comparison dimensions
Direct technical differences without narrative overhead
Includes actionable “Choose X when…” guidance for decision-making
Scannable hierarchical structure ideal for quick reference

📄 Agent 1 (AG2) - Final Answer (agent1.2) ✓ Selected as Winner

``` Here's a concise differences summary between AG2 and MassGen for LLM agents: - Core purpose - AG2: General-purpose AgentOS/SDK for building, orchestrating, and running multi-agent applications with tools, memory, and human-in-the-loop. - MassGen: Test-time scaling/ensemble system; runs multiple agents/models on the same task in parallel with debate/consensus to improve a single result. - Collaboration pattern - AG2: Role-specialized agents (planner, executor, reviewer, etc.) coordinated via structured workflows or group chats; typically sequential or staged interactions. - MassGen: Peer agents work concurrently on the same prompt; share observations, critique, and converge via voting/aggregation. - Workflow primitives - AG2: Conversable agents, tool/function calling, memory/state, group-chat patterns, workflow orchestration, strong human-in-the-loop options. - MassGen: Round-based parallel runs, self-reflection/debate, cross-agent comparison, consensus/majority selection. - Scope and extensibility - AG2: Broad framework for end-to-end apps (tools, RAG, web/actions, integrations), suitable for production pipelines and complex automations. - MassGen: Focused on quality-boosting ensemble logic; often wrapped around existing prompts/tools to raise single-task answer quality. - Model/provider usage - AG2: Integrates with multiple model providers and tool ecosystems; emphasizes building blocks for complete agent systems. - MassGen: Encourages using diverse LLMs/providers in parallel to gain complementary strengths and better consensus. - Trade-offs - AG2: More scaffolding and control; best for multi-step workflows, reliability, and auditability. Latency depends on workflow depth. - MassGen: Higher per-task cost/latency due to many parallel runs; trades compute for improved accuracy/robustness on a single query. - Typical use cases - AG2: Automated coding/review loops, data/ETL agents, RAG assistants, customer support flows, complex task pipelines. - MassGen: Hard reasoning/coding challenges, creative generation, evaluations where ensemble consensus boosts final quality. - Quick guidance - Choose AG2 when you need to build, orchestrate, and operate agent workflows with tools, memory, and roles. - Choose MassGen when you want to boost the final answer quality on a single task via parallel multi-agent debate/consensus. ```

Agent 2’s Answer (agent2.8):

Rich narrative format with analogies (“Agent Operating System” vs “Parallel Study Group”)
Comprehensive comparison table covering key features
Multi-layered explanation with dedicated sections for each framework
Educational approach providing deeper conceptual understanding

📄 Agent 2 (Gemini) - Final Answer (agent2.8)

```markdown ### Core Distinction: Orchestration vs. Parallel Refinement The fundamental difference between AG2 and MassGen lies in their purpose and agent collaboration model. AG2 is a broad, general-purpose framework for **building and orchestrating** complex, often sequential, workflows with multiple specialized agents. In contrast, MassGen is a specialized system designed for multiple agents to tackle the **same task in parallel** to generate a single, higher-quality result through debate and consensus. * **Analogy for AG2:** Think of it as an **Agent Operating System**. It manages a team of specialists (like a planner, a coder, and a reviewer), each handling a different part of a larger, multi-step process. * **Analogy for MassGen:** Think of it as a **Parallel Study Group**. A group of diverse students simultaneously tackles the same difficult problem, sharing their approaches and critiquing each other's work to arrive at the best possible answer. ### Comparison Summary | Feature | AG2 (ag2ai/ag2) | MassGen (Leezekun/MassGen) | | :--- | :--- | :--- | | **Primary Goal** | A general framework to build, manage, and orchestrate multi-agent applications with complex, multi-step workflows. | A specialized system that runs multiple agents in parallel on a single task to achieve a superior, refined result. | | **Agent Collaboration** | Agents with specialized roles collaborate in structured, often sequential, conversational workflows (e.g., group chats, swarms). | Diverse agents work concurrently on the same problem, engaging in mutual observation, debate, and consensus-building. | | **Typical Use Case** | Automating end-to-end processes like software development, data analysis pipelines, trip planning, or customer support. | Improving the quality of a solution for a single complex query, such as a difficult coding challenge, creative writing, or hard reasoning problems. | | **Human Involvement** | Strong, built-in support for "human-in-the-loop" for oversight and intervention at various workflow stages. | Primarily autonomous agent collaboration, with the human typically receiving the final, converged output. | | **Trade-offs** | Requires more initial setup to define roles and workflows. Best for reliability, control, and multi-step processes. | Higher computational cost and latency per task due to parallel runs. Trades compute resources for higher accuracy on a single query. | | **Licensing** | Apache 2.0 license, evolving from the MIT-licensed AutoGen project. | Apache 2.0 license. | | **Project Origin**| Evolved from the popular Microsoft project AutoGen. | Inspired by advanced AI approaches like Grok and Gemini Deep Think. | --- ### **AG2: The General-Purpose Agent Operating System** AG2 (formerly AutoGen) is an open-source framework designed to be a foundational "AgentOS" for creating sophisticated multi-agent systems. It excels at defining structured workflows where agents with distinct roles and tools collaborate to solve complex problems, much like a team of human specialists. **Key Features of AG2:** * **Workflow Orchestration**: Its primary strength is creating complex, conversational workflows where specialized agents interact to automate a process. * **Conversable Agents**: The framework is built on the concept of a "Conversable Agent," a flexible building block for creating agents that can communicate, use LLMs, and interact with tools or humans. * **Human-in-the-Loop**: It features strong, built-in support for human oversight, allowing for systems that blend autonomous speed with expert human judgment. * **Broad Applicability**: Its general-purpose nature makes it suitable for a wide range of use cases, from automated coding to supply chain optimization and complex Q&A systems. **Best for:** Building applications that require multiple specialized agents to collaborate through a defined, multi-step process to automate a complex task. ### **MassGen: The Parallel Debate for High-Quality Results** Inspired by the methodologies of advanced AI, MassGen's core idea is not to divide a task into sequential parts, but to have multiple, diverse agents tackle the same task simultaneously. They work in parallel, observe each other's work, and refine their own approaches to converge on the best possible answer through a process of debate and consensus. **Key Features of MassGen:** * **Parallel Problem-Solving**: MassGen assigns a task to multiple AI agents who work in parallel, fostering a collaborative and iterative environment to find the best solution. * **Convergence and Refinement**: The system is designed for agents to share insights and debate solutions, intelligently detecting when they have reached a consensus on the best output. * **Diverse Model Support**: It is designed to leverage multiple different LLMs and providers (e.g., GPT-4, Gemini, Grok) at the same time to harness their complementary strengths. * **Enhanced Solution Quality**: The main goal is to achieve a superior, high-quality result for a single complex task by leveraging the diverse reasoning capabilities of multiple AI models working in tandem. **Best for:** Generating the highest quality output for a single, complex request by harnessing the diverse reasoning power of multiple top-tier LLMs working in parallel. ```

Final Presentation Answer:

Hybrid format combining Agent 1’s bullet-point structure with additional insights
Includes “Human involvement” dimension (from Agent 2) not present in Agent 1’s original
More general licensing statement than Agent 1’s original answer
Demonstrates MassGen’s consensus creates refined combinations, not just winner-take-all

🎤 Final Presentation Answer (What Users See)

This is the actual answer presented to users after MassGen's consensus mechanism selected Agent 1 as the winner: ``` Here's a concise differences summary between AG2 and MassGen for LLM agents: - Core purpose - AG2: General-purpose AgentOS/SDK for building, orchestrating, and running multi-agent applications with tools, memory, and human-in-the-loop. - MassGen: Test-time scaling/ensemble system; runs multiple agents/models on the same task in parallel with debate/consensus to improve a single result. - Collaboration pattern - AG2: Role-specialized agents (planner, executor, reviewer, etc.) coordinated via structured workflows or group chats; typically sequential or staged interactions. - MassGen: Peer agents work concurrently on the same prompt; share observations, critique, and converge via voting/aggregation. - Workflow primitives - AG2: Conversable agents, tool/function calling, memory/state, group-chat patterns, workflow orchestration, strong human-in-the-loop options. - MassGen: Round-based parallel runs, self-reflection/debate, cross-agent comparison, consensus/majority selection. - Scope and extensibility - AG2: Broad framework for end-to-end apps (tools, RAG, web/actions, integrations), suitable for production pipelines and complex automations. - MassGen: Focused on quality-boosting ensemble logic; often wrapped around existing prompts/tools to raise single-task answer quality. - Model/provider usage - AG2: Integrates with multiple model providers and tool ecosystems; emphasizes building blocks for complete agent systems. - MassGen: Encourages using diverse LLMs/providers (and settings) in parallel to gain complementary strengths and better consensus. - Human involvement - AG2: Built-in human-in-the-loop checkpoints, oversight, and guardrails. - MassGen: Generally autonomous until final selection; optional manual review. - Trade-offs - AG2: More scaffolding and control; best for multi-step workflows, reliability, and auditability. Latency depends on workflow depth. - MassGen: Higher per-task cost/latency due to many parallel runs; trades compute for improved accuracy/robustness on a single query. - Typical use cases - AG2: Automated coding/review loops, data/ETL agents, RAGassistants, customer support flows, complex task pipelines. - MassGen: Hard reasoning/coding challenges, creative generation, evaluations where ensemble consensus boosts final quality. - Quick guidance - Choose AG2 when you need to build, orchestrate, and operate agent workflows with tools, memory, and roles. - Choose MassGen when you want to boost the final answer quality on a single task via parallel multi-agent debate/consensus. - Licensing/community - Both are open-source; check each repository for current license details and ecosystem maturity. ``` --- **Key Insight:** This final answer is **not identical** to agent1.2 - it appears to be a refined combination: - Takes Agent 1's overall structure and bullet-point format - **Adds "Human involvement" comparison** (a dimension present in Agent 2's table but absent from Agent 1's original answer) - Updates licensing statement to be more general This suggests MassGen incorporate insights from multiple agents even after selecting a winner, creating a hybrid final answer.

Comparative Analysis:

The three answers above reveal MassGen’s sophisticated consensus mechanism:

Agent 1’s Strengths:

Best matched the prompt’s request for a “summary” through concise bullet-point format
Highly scannable structure allows quick reference and comparison
Actionable decision-making guidance with clear “Choose X when…” statements
Covered 9 comprehensive dimensions with balanced depth

Agent 2’s Strengths:

Superior conceptual framing with memorable analogies (“Agent Operating System” vs “Parallel Study Group”)
Rich educational value through detailed explanations and comparison tables
Multi-layered approach providing deeper understanding for learners
Strong narrative context that clarifies architectural differences

Final Answer Innovation:

MassGen didn’t simply select Agent 1’s answer verbatim
Instead, created a hybrid incorporating Agent 2’s “Human involvement” dimension
Refined the licensing statement to be more general
Demonstrates that consensus goes beyond winner-take-all to synthesize best insights

Why Agent 1 Was Selected as Primary: Agent 1’s streamlined, scannable format best matched the user’s request for a “summary comparing differences.” While Agent 2 provided richer context and excellent educational framing, Agent 1’s directness made it more suitable as the foundation for the final answer. Importantly, MassGen’s consensus mechanism still incorporated Agent 2’s valuable insights into the final presentation.

Key v0.0.28 validation: An AG2 agent with code execution successfully served as the final answer provider, proving that external framework agents can deliver high-quality results within MassGen’s orchestration.

Anything Else

Restart Dynamics - Evidence of Deep Refinement:

The case study shows unprecedented refinement activity:

17 total restarts
Both agents continuously improved their answers by:
- Reviewing competitor answers
- Gathering additional information (AG2 via code, Gemini via web)
- Refining clarity and structure

This intensive refinement pattern suggests:

The task complexity warranted multiple iterations
AG2’s code execution enabled programmatic information gathering
Cross-framework collaboration triggered competitive improvement
MassGen’s restart mechanism effectively drove quality enhancement

Framework Compatibility:

AG2’s async execution (a_generate_reply) integrated smoothly with MassGen’s orchestration
Code execution workspace (./code_execution_workspace) managed without conflicts
AG2’s system messages and LLM config preserved within MassGen’s agent configuration
No compatibility issues between AG2’s conversation model and MassGen’s coordination

🎯 Conclusion

The AG2 Framework Integration in v0.0.28 successfully solves the framework interoperability challenge that users faced when trying to combine specialized agent capabilities with MassGen’s consensus approach. The key user benefits specifically enabled by this feature include:

Best of Both Worlds: Users can now leverage AG2’s specialized capabilities (code execution, tool calling) within MassGen’s multi-agent consensus framework
Extensible Future: The adapter architecture opens the door to integrating other frameworks (LangGraph, CrewAI), exponentially expanding MassGen’s capabilities

Broader Implications:

The adapter system represents a paradigm shift for MassGen:

From isolated ecosystem to framework orchestrator: MassGen evolves from a standalone multi-agent system to a meta-orchestration layer
Accelerated capability expansion: Instead of reimplementing features, MassGen can now integrate entire specialized frameworks
Community contribution potential: Users can create adapters for their favorite frameworks, expanding MassGen’s reach

What This Enables:

With v0.0.28, users can now build multi-agent systems that:

Run AG2 code executors alongside Gemini’s web search and Claude’s reasoning
Integrate LangGraph’s stateful workflows within MassGen’s debate mechanism
Mix and match the best agents from different frameworks in a single collaborative workflow

This case study validates that cross-framework multi-agent collaboration is not only possible but highly effective, with AG2’s code execution capabilities complementing native agents’ strengths within MassGen’s proven consensus architecture.

📌 Status Tracker

[✓] Planning phase completed
[✓] Features implemented (v0.0.28)
[✓] Testing completed
[✓] Demo recorded (logs available)
[✓] Results analyzed
[✓] Case study reviewed

Case study conducted: October 5, 2025 MassGen Version: v0.0.28 Configuration: massgen/configs/ag2/ag2_coder_case_study.yaml