CaseStudies

MassGen v0.0.28: AG2 Framework Integration with External Agent Adapters

MassGen is focused on case-driven development. MassGen v0.0.28 introduces AG2 framework integration through a comprehensive external agent adapter system, enabling MassGen to orchestrate agents from other established frameworks while maintaining its unique multi-agent consensus architecture. This case study demonstrates how AG2 agents with code execution capabilities can be seamlessly integrated into MassGen’s collaborative environment.

:depth: 3
:local:

📋 PLANNING PHASE

📝 Evaluation Design

Prompt

The prompt tests whether MassGen can effectively integrate and orchestrate agents from external frameworks (specifically AG2) while maintaining its collaborative consensus mechanism:

Output a summary comparing the differences between AG2 (https://github.com/ag2ai/ag2) and MassGen (https://github.com/Leezekun/MassGen) for LLM agents.

Baseline Config

Prior to v0.0.28, MassGen could only use agents from its native backends (OpenAI, Anthropic, Gemini, etc.). There was no way to integrate agents from external frameworks like AG2 or other multi-agent systems.

A baseline would use only native MassGen agents without external framework integration.

Baseline Command

# Pre-v0.0.28: No AG2 support available
# Would need to use only native backends
uv run python -m massgen.cli --config basic/single/single_gemini2.5pro.yaml "Output a summary comparing the differences between AG2 (https://github.com/ag2ai/ag2) and MassGen (https://github.com/Leezekun/MassGen) for LLM agents."

🔧 Evaluation Analysis

Results & Failure Modes

Before v0.0.28, users attempting to work with external frameworks like AG2 would face:

Cannot Leverage Existing AG2 Work:

Limited Framework Options:

Success Criteria

  1. AG2 Workflows Work in MassGen: Users can bring their existing AG2 agents into MassGen orchestration
  2. AG2 Features Preserved: AG2 capabilities (code execution, tool calling) function correctly within MassGen
  3. Full Collaboration: AG2 agents participate in MassGen’s consensus mechanism (generate, vote, refine)
  4. Beyond AG2: The adapter architecture supports other external frameworks, not just AG2
  5. Easy Configuration: Simple YAML setup for AG2 agents without complex integration work

🎯 Desired Features

With these goals defined, the next step was to design a system capable of bridging MassGen with external frameworks.


🚀 TESTING PHASE

📦 Implementation Details

Version

MassGen v0.0.28 (October 6, 2025)

✨ New Features

The AG2 integration was realized through an adapter registry pattern that bridges MassGen’s orchestration with external frameworks. The implementation consists of three core components:

1. Adapter Architecture

The integration introduces a base adapter interface (massgen/adapters/base.py) that defines how external frameworks communicate with MassGen. The AG2-specific implementation (massgen/adapters/ag2_adapter.py) supports:

2. External Agent Backend

A new ExternalAgentBackend class (massgen/backends/external.py) serves as the routing layer between MassGen and framework adapters. It handles:

3. Validation & Testing

Comprehensive test coverage ensures reliability across the integration:

Additional v0.0.28 improvements include enhanced MCP circuit breaker logic and better error handling. See the full v0.0.28 release notes for complete details.

New Config

Configuration file: massgen/configs/ag2/ag2_coder_case_study.yaml

Key breakthrough - type: ag2 backend enables AG2 framework integration:

agents:
  - id: "ag2_coder"
    backend:
      type: ag2  # NEW: External agent backend using AG2 adapter
      agent_config:
        type: assistant  # AG2 AssistantAgent
        name: "AG2_coder"
        system_message: |
          You are a helpful coding assistant. When asked to create and run code:
          1. Write Python code in a markdown code block (```python ... ```)
          2. The code will be automatically executed
          3. Always print the results so they are visible

          If you need to access information from the web, create and run a web scraping script.
        llm_config:
          api_type: "openai"
          model: "gpt-5"
        code_execution_config:  # AG2's code execution feature
            executor:
              type: "LocalCommandLineCodeExecutor"
              timeout: 60
              work_dir: "./code_execution_workspace"

Key Innovation: This configuration shows AG2 agents working alongside native MassGen agents (Gemini) in the same coordination workflow!

Command

uv run python -m massgen.cli --config ag2/ag2_coder_case_study.yaml "Output a summary comparing the differences between AG2 (https://github.com/ag2ai/ag2) and MassGen (https://github.com/Leezekun/MassGen) for LLM agents."

🤖 Agents

Both agents participate in MassGen’s collaborative consensus mechanism, with AG2 agent bringing code execution capabilities while Gemini provides web-enhanced reasoning.

🎥 Demo

Watch the v0.0.28 AG2 Framework Integration in action:

MassGen v0.0.28 AG2 Integration Demo

Key artifacts from the case study run:


📊 EVALUATION & ANALYSIS

Results

The v0.0.28 AG2 integration successfully achieved all success criteria and demonstrated powerful cross-framework collaboration:

Seamless Integration: AG2 agent configured and executed within MassGen orchestration without issues

Feature Preservation: AG2’s code execution capabilities worked correctly (agent executed web scraping scripts combining GitHub API calls, README parsing, and data analysis)

Collaborative Participation: AG2 agent generated 2 answers, participated in voting, and engaged in 9 restart cycles for refinement

Extensible Architecture: Adapter system (base.py + ag2_adapter.py) designed for easy extension to other frameworks

Configuration Simplicity: Clean YAML configuration with framework-specific settings under backend.agent_config

The Collaborative Process

How agents generated answers with v0.0.28 AG2 integration:

Understanding Answer Labels: MassGen uses a labeling system agent{N}.{attempt} where:

For example:

Multi-Iteration Refinement Pattern: The coordination log reveals an intensive collaborative refinement process:

Agent 1 (AG2 agent) - Code Execution Approach:

  1. Initial attempt (agent1.1): Leveraged AG2’s code execution capability to write and execute a Python script that programmatically analyzed both frameworks through:
    • GitHub REST API calls for repository metadata (fetched live stats: AG2 3624 stars, MassGen 465 stars)
    • Raw README fetching and parsing from both repositories
    • Keyword extraction and frequency analysis across documentation
    • Heuristic comparison logic based on extracted signals
    • Code executed successfully with exitcode: 0 in ./code_execution_workspace
  2. Restart 1-9: Despite having execution results, observed Agent 2’s conceptual analysis and restarted 9 times, ultimately deciding to pivot away from the data-driven approach
  3. Final answer (agent1.2): Abandoned the execution output and created a concise conceptual bullet-point comparison that won selection

Agent 2 (Gemini agent) - Web Search & Analysis:

  1. Initial attempt (agent2.1): Used web search to research both frameworks, generated structured comparison with tables
  2. Restart 1-8: Refined through 8 iterations, reviewing Agent 1’s code-based analysis and incorporating insights
  3. Final answer (agent2.8): Comprehensive multi-layered comparison emphasizing architectural differences

Key v0.0.28 improvement: AG2’s code execution capabilities brought a programmatic, data-driven approach to information gathering, complementing Gemini’s web search capabilities within the same consensus workflow.

The Voting Pattern

Cross-Framework Voting Dynamics:

The coordination table shows sophisticated voting behavior:

  1. Agent 1 (AG2) voting behavior:
    • Voted for Agent 2’s answers initially (agent2.1, agent2.3, agent2.4) across multiple iterations
    • Later voted for its own refined answer (agent1.2) after multiple improvements
    • Reasoning: “Concise, accurate bullet-point comparison focused on differences”
  2. Agent 2 (Gemini) voting behavior:
    • Consistently voted for Agent 2’s own refined answer (agent2.8)
    • Reasoning: “Excellent, multi-layered comparison”
  3. Winner selection:
    • Agent 1 (AG2 agent) selected as winner with answer agent1.2
    • Both agents voted for themselves in final iterations, but Agent 1’s answer was selected as the winner

Key v0.0.28 improvement: AG2 agents can fully participate in MassGen’s voting mechanism, demonstrating that external framework agents understand and contribute to consensus building.

Final Results & Answer Comparison

Winner: Agent 1 (ag2_coder) - AG2 AssistantAgent

To provide concrete evidence of the quality difference, here are the actual answers from both agents:

Agent 1’s Answer (agent1.2):

📄 Agent 1 (AG2) - Final Answer (agent1.2) ✓ Selected as Winner ``` Here's a concise differences summary between AG2 and MassGen for LLM agents: - Core purpose - AG2: General-purpose AgentOS/SDK for building, orchestrating, and running multi-agent applications with tools, memory, and human-in-the-loop. - MassGen: Test-time scaling/ensemble system; runs multiple agents/models on the same task in parallel with debate/consensus to improve a single result. - Collaboration pattern - AG2: Role-specialized agents (planner, executor, reviewer, etc.) coordinated via structured workflows or group chats; typically sequential or staged interactions. - MassGen: Peer agents work concurrently on the same prompt; share observations, critique, and converge via voting/aggregation. - Workflow primitives - AG2: Conversable agents, tool/function calling, memory/state, group-chat patterns, workflow orchestration, strong human-in-the-loop options. - MassGen: Round-based parallel runs, self-reflection/debate, cross-agent comparison, consensus/majority selection. - Scope and extensibility - AG2: Broad framework for end-to-end apps (tools, RAG, web/actions, integrations), suitable for production pipelines and complex automations. - MassGen: Focused on quality-boosting ensemble logic; often wrapped around existing prompts/tools to raise single-task answer quality. - Model/provider usage - AG2: Integrates with multiple model providers and tool ecosystems; emphasizes building blocks for complete agent systems. - MassGen: Encourages using diverse LLMs/providers in parallel to gain complementary strengths and better consensus. - Trade-offs - AG2: More scaffolding and control; best for multi-step workflows, reliability, and auditability. Latency depends on workflow depth. - MassGen: Higher per-task cost/latency due to many parallel runs; trades compute for improved accuracy/robustness on a single query. - Typical use cases - AG2: Automated coding/review loops, data/ETL agents, RAG assistants, customer support flows, complex task pipelines. - MassGen: Hard reasoning/coding challenges, creative generation, evaluations where ensemble consensus boosts final quality. - Quick guidance - Choose AG2 when you need to build, orchestrate, and operate agent workflows with tools, memory, and roles. - Choose MassGen when you want to boost the final answer quality on a single task via parallel multi-agent debate/consensus. ```

Agent 2’s Answer (agent2.8):

📄 Agent 2 (Gemini) - Final Answer (agent2.8)
```markdown ### Core Distinction: Orchestration vs. Parallel Refinement The fundamental difference between AG2 and MassGen lies in their purpose and agent collaboration model. AG2 is a broad, general-purpose framework for **building and orchestrating** complex, often sequential, workflows with multiple specialized agents. In contrast, MassGen is a specialized system designed for multiple agents to tackle the **same task in parallel** to generate a single, higher-quality result through debate and consensus. * **Analogy for AG2:** Think of it as an **Agent Operating System**. It manages a team of specialists (like a planner, a coder, and a reviewer), each handling a different part of a larger, multi-step process. * **Analogy for MassGen:** Think of it as a **Parallel Study Group**. A group of diverse students simultaneously tackles the same difficult problem, sharing their approaches and critiquing each other's work to arrive at the best possible answer. ### Comparison Summary | Feature | AG2 (ag2ai/ag2) | MassGen (Leezekun/MassGen) | | :--- | :--- | :--- | | **Primary Goal** | A general framework to build, manage, and orchestrate multi-agent applications with complex, multi-step workflows. | A specialized system that runs multiple agents in parallel on a single task to achieve a superior, refined result. | | **Agent Collaboration** | Agents with specialized roles collaborate in structured, often sequential, conversational workflows (e.g., group chats, swarms). | Diverse agents work concurrently on the same problem, engaging in mutual observation, debate, and consensus-building. | | **Typical Use Case** | Automating end-to-end processes like software development, data analysis pipelines, trip planning, or customer support. | Improving the quality of a solution for a single complex query, such as a difficult coding challenge, creative writing, or hard reasoning problems. | | **Human Involvement** | Strong, built-in support for "human-in-the-loop" for oversight and intervention at various workflow stages. | Primarily autonomous agent collaboration, with the human typically receiving the final, converged output. | | **Trade-offs** | Requires more initial setup to define roles and workflows. Best for reliability, control, and multi-step processes. | Higher computational cost and latency per task due to parallel runs. Trades compute resources for higher accuracy on a single query. | | **Licensing** | Apache 2.0 license, evolving from the MIT-licensed AutoGen project. | Apache 2.0 license. | | **Project Origin**| Evolved from the popular Microsoft project AutoGen. | Inspired by advanced AI approaches like Grok and Gemini Deep Think. | --- ### **AG2: The General-Purpose Agent Operating System** AG2 (formerly AutoGen) is an open-source framework designed to be a foundational "AgentOS" for creating sophisticated multi-agent systems. It excels at defining structured workflows where agents with distinct roles and tools collaborate to solve complex problems, much like a team of human specialists. **Key Features of AG2:** * **Workflow Orchestration**: Its primary strength is creating complex, conversational workflows where specialized agents interact to automate a process. * **Conversable Agents**: The framework is built on the concept of a "Conversable Agent," a flexible building block for creating agents that can communicate, use LLMs, and interact with tools or humans. * **Human-in-the-Loop**: It features strong, built-in support for human oversight, allowing for systems that blend autonomous speed with expert human judgment. * **Broad Applicability**: Its general-purpose nature makes it suitable for a wide range of use cases, from automated coding to supply chain optimization and complex Q&A systems. **Best for:** Building applications that require multiple specialized agents to collaborate through a defined, multi-step process to automate a complex task. ### **MassGen: The Parallel Debate for High-Quality Results** Inspired by the methodologies of advanced AI, MassGen's core idea is not to divide a task into sequential parts, but to have multiple, diverse agents tackle the same task simultaneously. They work in parallel, observe each other's work, and refine their own approaches to converge on the best possible answer through a process of debate and consensus. **Key Features of MassGen:** * **Parallel Problem-Solving**: MassGen assigns a task to multiple AI agents who work in parallel, fostering a collaborative and iterative environment to find the best solution. * **Convergence and Refinement**: The system is designed for agents to share insights and debate solutions, intelligently detecting when they have reached a consensus on the best output. * **Diverse Model Support**: It is designed to leverage multiple different LLMs and providers (e.g., GPT-4, Gemini, Grok) at the same time to harness their complementary strengths. * **Enhanced Solution Quality**: The main goal is to achieve a superior, high-quality result for a single complex task by leveraging the diverse reasoning capabilities of multiple AI models working in tandem. **Best for:** Generating the highest quality output for a single, complex request by harnessing the diverse reasoning power of multiple top-tier LLMs working in parallel. ```

Final Presentation Answer:

🎤 Final Presentation Answer (What Users See)
This is the actual answer presented to users after MassGen's consensus mechanism selected Agent 1 as the winner: ``` Here's a concise differences summary between AG2 and MassGen for LLM agents: - Core purpose - AG2: General-purpose AgentOS/SDK for building, orchestrating, and running multi-agent applications with tools, memory, and human-in-the-loop. - MassGen: Test-time scaling/ensemble system; runs multiple agents/models on the same task in parallel with debate/consensus to improve a single result. - Collaboration pattern - AG2: Role-specialized agents (planner, executor, reviewer, etc.) coordinated via structured workflows or group chats; typically sequential or staged interactions. - MassGen: Peer agents work concurrently on the same prompt; share observations, critique, and converge via voting/aggregation. - Workflow primitives - AG2: Conversable agents, tool/function calling, memory/state, group-chat patterns, workflow orchestration, strong human-in-the-loop options. - MassGen: Round-based parallel runs, self-reflection/debate, cross-agent comparison, consensus/majority selection. - Scope and extensibility - AG2: Broad framework for end-to-end apps (tools, RAG, web/actions, integrations), suitable for production pipelines and complex automations. - MassGen: Focused on quality-boosting ensemble logic; often wrapped around existing prompts/tools to raise single-task answer quality. - Model/provider usage - AG2: Integrates with multiple model providers and tool ecosystems; emphasizes building blocks for complete agent systems. - MassGen: Encourages using diverse LLMs/providers (and settings) in parallel to gain complementary strengths and better consensus. - Human involvement - AG2: Built-in human-in-the-loop checkpoints, oversight, and guardrails. - MassGen: Generally autonomous until final selection; optional manual review. - Trade-offs - AG2: More scaffolding and control; best for multi-step workflows, reliability, and auditability. Latency depends on workflow depth. - MassGen: Higher per-task cost/latency due to many parallel runs; trades compute for improved accuracy/robustness on a single query. - Typical use cases - AG2: Automated coding/review loops, data/ETL agents, RAGassistants, customer support flows, complex task pipelines. - MassGen: Hard reasoning/coding challenges, creative generation, evaluations where ensemble consensus boosts final quality. - Quick guidance - Choose AG2 when you need to build, orchestrate, and operate agent workflows with tools, memory, and roles. - Choose MassGen when you want to boost the final answer quality on a single task via parallel multi-agent debate/consensus. - Licensing/community - Both are open-source; check each repository for current license details and ecosystem maturity. ``` --- **Key Insight:** This final answer is **not identical** to agent1.2 - it appears to be a refined combination: - Takes Agent 1's overall structure and bullet-point format - **Adds "Human involvement" comparison** (a dimension present in Agent 2's table but absent from Agent 1's original answer) - Updates licensing statement to be more general This suggests MassGen incorporate insights from multiple agents even after selecting a winner, creating a hybrid final answer.

Comparative Analysis:

The three answers above reveal MassGen’s sophisticated consensus mechanism:

Agent 1’s Strengths:

Agent 2’s Strengths:

Final Answer Innovation:

Why Agent 1 Was Selected as Primary: Agent 1’s streamlined, scannable format best matched the user’s request for a “summary comparing differences.” While Agent 2 provided richer context and excellent educational framing, Agent 1’s directness made it more suitable as the foundation for the final answer. Importantly, MassGen’s consensus mechanism still incorporated Agent 2’s valuable insights into the final presentation.

Key v0.0.28 validation: An AG2 agent with code execution successfully served as the final answer provider, proving that external framework agents can deliver high-quality results within MassGen’s orchestration.

Anything Else

Restart Dynamics - Evidence of Deep Refinement:

The case study shows unprecedented refinement activity:

This intensive refinement pattern suggests:

  1. The task complexity warranted multiple iterations
  2. AG2’s code execution enabled programmatic information gathering
  3. Cross-framework collaboration triggered competitive improvement
  4. MassGen’s restart mechanism effectively drove quality enhancement

Framework Compatibility:

🎯 Conclusion

The AG2 Framework Integration in v0.0.28 successfully solves the framework interoperability challenge that users faced when trying to combine specialized agent capabilities with MassGen’s consensus approach. The key user benefits specifically enabled by this feature include:

  1. Best of Both Worlds: Users can now leverage AG2’s specialized capabilities (code execution, tool calling) within MassGen’s multi-agent consensus framework
  2. Extensible Future: The adapter architecture opens the door to integrating other frameworks (LangGraph, CrewAI), exponentially expanding MassGen’s capabilities

Broader Implications:

The adapter system represents a paradigm shift for MassGen:

What This Enables:

With v0.0.28, users can now build multi-agent systems that:

This case study validates that cross-framework multi-agent collaboration is not only possible but highly effective, with AG2’s code execution capabilities complementing native agents’ strengths within MassGen’s proven consensus architecture.


📌 Status Tracker


Case study conducted: October 5, 2025 MassGen Version: v0.0.28 Configuration: massgen/configs/ag2/ag2_coder_case_study.yaml