
MassGen v0.1.8: Automation Mode Enables Meta Self-Analysis

MassGen is focused on case-driven development. This case study demonstrates MassGen v0.1.8's new automation mode (the --automation flag), which provides clean, structured output that enables agents to run nested MassGen experiments, monitor execution, and analyze results, unlocking meta-level self-analysis capabilities.

๐Ÿค Contributing

To guide future versions of MassGen, we encourage anyone to submit an issue using the corresponding case-study issue template, based on the "PLANNING PHASE" section found in this template.



📋 PLANNING PHASE

๐Ÿ“ Evaluation Design

Prompt

The prompt tests whether MassGen agents can autonomously analyze MassGen's own architecture, run controlled experiments, and propose actionable performance improvements:

Read through the attached MassGen code and docs. Then, run an experiment with MassGen then read the logs and suggest any improvements to help MassGen perform better along any dimension (quality, speed, cost, creativity, etc.) and write small code snippets suggesting how to start.

This prompt requires agents to:

  1. Read and understand MassGen's source code (massgen/ directory)
  2. Read and understand MassGen's documentation (docs/ directory)
  3. Run a test experiment using MassGen
  4. Monitor execution in real-time through background code execution
  5. Parse log files and status.json to identify bottlenecks
  6. Propose concrete, prioritized improvements with starter code snippets

Baseline Config

Prior to v0.1.8, running MassGen produced verbose terminal output with ANSI escape codes, progress bars, and unstructured text; even the simple display mode was hard to parse. This made it difficult for agents to run nested experiments, monitor progress, or extract results programmatically.

Baseline Command

# Must be given existing logs; cannot watch MassGen as it executes
uv run massgen \
  --config @examples/tools/todo/example_task_todo.yaml \
  "Read through the attached MassGen code and docs. Then, read the logs and suggest any improvements to help MassGen perform better along any dimension (quality, speed, cost, creativity, etc.) and write small code snippets suggesting how to start."

🔧 Evaluation Analysis

Results & Failure Modes

Without structured output, agents attempting meta-analysis would face:

Unable to Run New Experiments: the 2000+ lines of ANSI-laden terminal output gave agents no reliable way to launch a nested run, watch its progress, or find its results.

Workspace Collisions: parent and nested runs could write to the same workspace and log directories, clobbering each other's state.

Success Criteria

The automation mode would be considered successful if agents can do the following (a minimal launch sketch follows the list):

  1. Run Nested MassGen: Execute MassGen from within MassGen without output conflicts
  2. Parse Structured Output: Receive clean, parseable output (10-20 lines instead of 2000+)
  3. Monitor Asynchronously: Poll a status file for real-time progress updates
  4. Extract Results: Programmatically read final answers from predictable file paths
  5. Parallel Execution: Run multiple MassGen experiments simultaneously without interference
  6. Exit Codes: Detect success/failure through meaningful exit codes
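
As a concrete illustration of criteria 1, 2, and 6, here is a minimal sketch of how an agent might launch a nested run and branch on the outcome. run_nested is a hypothetical helper, not part of MassGen; it assumes only the CLI invocation used throughout this study:

import subprocess

def run_nested(config: str, question: str) -> tuple[int, str]:
    """Launch MassGen in automation mode and capture its output."""
    proc = subprocess.run(
        ["uv", "run", "massgen", "--automation", "--config", config, question],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout

exit_code, output = run_nested(
    "@examples/tools/todo/example_task_todo.yaml",
    "Create a simple HTML page about Bob Dylan",
)
if exit_code != 0:  # non-zero exit codes signal failure (see Desired Features below)
    raise RuntimeError(f"nested run failed with exit code {exit_code}")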

🎯 Desired Features

To enable meta-analysis, MassGen v0.1.8 needs to implement the following (a minimal polling sketch follows the list):

  1. --automation Flag: Suppress verbose output, emit structured information only
  2. Structured Output Format:
    • First line: LOG_DIR: with absolute path to log directory
    • Subsequent lines: Key events only (no progress bars, no ANSI codes)
    • Total output: ~10 lines instead of 2000+
  3. Real-Time Status File: status.json updated every 2 seconds with:
    • Coordination phase and completion percentage
    • Agent states (status, answer_count, times_restarted)
    • Voting results
    • Elapsed time
  4. Predictable Output Paths:
    • Status: {log_dir}/status.json
    • Full logs: {log_dir}/massgen.log
  5. Automatic Workspace Isolation: Each run gets a unique workspace directory and a log directory named with finer timestamp granularity to prevent collisions.
  6. Meaningful Exit Codes:
    • 0: Success
    • 1: Configuration error
    • 2: Execution error
    • 3: Timeout
    • 4: User interrupt
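
A minimal polling sketch against the status file described in item 3; wait_for_completion is a hypothetical helper, and the field names follow the status.json example shown in the Testing Phase below:

import json
import time
from pathlib import Path

def wait_for_completion(log_dir: str, poll_every: float = 2.0, timeout: float = 3600.0) -> dict:
    """Poll status.json until the run reports 100% completion, then return the parsed status."""
    status_path = Path(log_dir) / "status.json"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if status_path.exists():
            status = json.loads(status_path.read_text(encoding="utf-8"))
            if status["coordination"]["completion_percentage"] >= 100:
                return status
        time.sleep(poll_every)  # matches the 2-second update cadence
    raise TimeoutError(f"run in {log_dir} did not finish within {timeout:.0f}s")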

🚀 TESTING PHASE

📦 Implementation Details

Version

MassGen v0.1.8 (November 5, 2025)

✨ New Features

MassGen v0.1.8 introduces Automation Mode for agent-parseable execution:

--automation Flag: suppresses the interactive display and ANSI formatting, emitting only the structured events shown below.

Example v0.1.8 Output:

🤖 Multi-Agent Mode
Agents: agent_a, agent_b
Question: Create a website about Bob Dylan

============================================================
QUESTION: Create a website about Bob Dylan
[Coordination in progress - monitor status.json for real-time updates]
09:48:43 | WARNING  | [FilesystemManager.save_snapshot] Source path ... is empty, skipping snapshot
09:48:44 | WARNING  | [FilesystemManager.save_snapshot] Source path ... is empty, skipping snapshot

WINNER: agent_b
DURATION: 1011.3s
ANSWER_PREVIEW: Following a comprehensive analysis of MassGen's performance...

COMPLETED: 2 agents, 1011.3s total
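
Because each result line uses a KEY: value prefix, extracting the structured fields is a one-pass scan. A small sketch (parse_automation_output is hypothetical and assumes the line format shown above):

def parse_automation_output(stdout: str) -> dict:
    """Collect the KEY: value result lines from automation-mode stdout."""
    keys = ("LOG_DIR", "WINNER", "DURATION", "ANSWER_PREVIEW", "COMPLETED")
    results = {}
    for line in stdout.splitlines():
        for key in keys:
            if line.startswith(key + ":"):
                results[key] = line.split(":", 1)[1].strip()
    return results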

Real-Time status.json File:

{
  "meta": {
    "session_id": "log_20251105_074751_835636",
    "log_dir": ".massgen/massgen_logs/log_20251105_074751_835636",
    "question": "...",
    "start_time": 1762317773.189,
    "elapsed_seconds": 712.337
  },
  "coordination": {
    "phase": "presentation",
    "active_agent": null,
    "completion_percentage": 100,
    "is_final_presentation": true
  },
  "agents": {
    "agent_a": {
      "status": "voted",
      "answer_count": 5,
      "latest_answer_label": "agent1.5",
      "times_restarted": 5
    },
    "agent_b": {
      "status": "voted",
      "answer_count": 5,
      "latest_answer_label": "agent2.5",
      "times_restarted": 7
    }
  },
  "results": {
    "winner": "agent_b",
    "votes": {
      "agent2.5": 2,
      "agent1.1": 2
    }
  }
}
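
Given this schema, an agent can condense a run into a few lines. A sketch that assumes only the fields shown above:

import json
from pathlib import Path

def summarize_status(log_dir: str) -> None:
    """Print one line per agent plus the vote tally from status.json."""
    status = json.loads((Path(log_dir) / "status.json").read_text(encoding="utf-8"))
    for agent_id, info in status["agents"].items():
        print(f"{agent_id}: {info['status']}, {info['answer_count']} answers, "
              f"{info['times_restarted']} restarts")
    print("votes:", status["results"]["votes"], "-> winner:", status["results"]["winner"])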

Automatic Workspace Isolation: each run receives its own workspace and log directory, timestamped down to the microsecond (as in log_20251105_074751_835636), so parent and nested runs never collide.

Meaningful Exit Codes: 0 (success), 1 (configuration error), 2 (execution error), 3 (timeout), 4 (user interrupt).

Benefits: together, these changes make MassGen drivable by other agents: launch, poll, parse, and branch on the exit code without scraping terminal output.

New Config

Configuration file: massgen/configs/meta/massgen_suggests_to_improve_massgen.yaml

Example config section for meta-analysis (it ensures code execution is active and tells each agent about MassGen's automation mode):

agents:
  - id: agent_a
    backend:
      type: openai
      model: gpt-5-mini
      enable_mcp_command_line: true
      command_line_execution_mode: local
    system_message: |
      You have access to MassGen through the command line and can:
      - Run MassGen in automation mode using:
        uv run massgen --automation --config [config] "[question]"
      - Monitor progress by reading status.json files
      - Read final results from log directories

      Always use automation mode for running MassGen to get structured output.
      The status.json file is updated every 2 seconds with real-time progress.

Why this configuration enables meta-analysis: command-line execution gives each agent the ability to invoke MassGen locally, and the system message documents the automation workflow: how to launch runs, where status.json lives, and where results land.

Command

uv run massgen --automation \
  --config @examples/configs/meta/massgen_suggests_to_improve_massgen.yaml \
  "Read through the attached MassGen code and docs. Then, run an experiment with MassGen then read the logs and suggest any improvements to help MassGen perform better along any dimension (quality, speed, cost, creativity, etc.) and write small code snippets suggesting how to start."

What Happens:

  1. Code Exploration: Agents read MassGen source code and documentation
  2. Nested Execution: Agents run uv run massgen --automation --config [config] "[question]"
  3. Monitor Progress: Agents poll {log_dir}/status.json as frequently as they need
  4. Wait for Completion: Agents check completion_percentage until it reaches 100
  5. Extract Results: Agents read {log_dir}/final/{winner}/answer.txt (see the sketch after this list)
  6. Analyze Logs: Agents parse status.json and massgen.log for patterns
  7. Generate Recommendations: Agents produce prioritized improvements with code snippets
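
Steps 3 through 5 compose into a short driver. A sketch that reuses the hypothetical wait_for_completion poller from the Desired Features section and assumes the final/{winner}/answer.txt convention:

from pathlib import Path

def fetch_final_answer(log_dir: str) -> str:
    """Wait for a nested run to finish, then read the winning agent's answer."""
    status = wait_for_completion(log_dir)  # polls until completion_percentage reaches 100
    winner = status["results"]["winner"]
    return (Path(log_dir) / "final" / winner / "answer.txt").read_text(encoding="utf-8")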

🤖 Agents

Session Logs: .massgen/massgen_logs/log_20251105_074751_835636/

Duration: ~17 minutes (1011 seconds)
Winner: agent_b (agent2.5) with 2 votes

🎥 Demo

Watch the v0.1.8 Automation Mode Meta-Analysis demonstration:

MassGen v0.1.8 Meta-Analysis Demo

In this demo, MassGen agents autonomously analyze MassGen itself by running nested experiments, monitoring execution through status.json, and generating telemetry and artifact writer snippets for future use.


📊 EVALUATION & ANALYSIS

Results

The Collaborative Process

Both agents successfully leveraged the new automation mode to perform meta-analysis:

Agent A (gpt-5-mini) - Iterative Deep-Dive: produced five successive answers (culminating in agent1.5), restarting its analysis five times along the way.

Agent B (gemini-2.5-pro) - Solutions: also produced five answers (culminating in the winning agent2.5), packaging complete telemetry and artifact-writer modules plus an integration guide, all described below.

Validation: Automation Mode Works!

Both agents launched nested MassGen runs with --automation, polled status.json for progress, and read final answers from the log directories, with no output conflicts between parent and child runs.

The Voting Pattern

Final Votes: agent2.5 (Agent B) and agent1.1 (Agent A) each received 2 votes.

Winner: agent_b (agent2.5)

Agent B's winning solution included: a telemetry module (telemetry.py), an atomic artifact writer (artifact_writer.py), and a step-by-step integration guide, all reproduced below.

Voting Statistics: four votes were cast in total, split evenly between agent1.1 and agent2.5, with agent_b's answer selected as the winner.

The Final Answer

Agent B's winning analysis focused on creating modules for immediate integration, addressing two core gaps identified through experimental analysis:

Key Findings from Nested Experiment:

The agents ran uv run massgen --automation --config @examples/tools/todo/example_task_todo.yaml "Create a simple HTML page about Bob Dylan" and discovered:

  1. Lack of Observability: No mechanism to track model costs, token usage, or latency
  2. Inefficient File I/O: Redundant file writes creating noise and overhead

Agent B created two complete artifacts:


1. Telemetry Module (telemetry.py) - Cost & Performance Visibility

Provides robust, per-call telemetry for all LLM interactions:

# telemetry.py
import time
import logging
from functools import wraps
from collections import defaultdict
from typing import Dict, Any

logger = logging.getLogger(__name__)

MODEL_PRICING = {
    "gpt-4o-mini": {"prompt": 0.15 / 1_000_000, "completion": 0.60 / 1_000_000},
    "gemini-2.5-pro": {"prompt": 3.50 / 1_000_000, "completion": 10.50 / 1_000_000},
    "default": {"prompt": 1.00 / 1_000_000, "completion": 3.00 / 1_000_000},
}

class RunTelemetry:
    """Aggregates telemetry data for a single MassGen run."""

    def __init__(self):
        self.by_model = defaultdict(lambda: {
            "tokens": 0, "cost": 0.0, "latency": 0.0, "calls": 0
        })
        self.by_agent = defaultdict(lambda: {
            "tokens": 0, "cost": 0.0, "latency": 0.0, "calls": 0
        })
        self.total_calls = 0

    def record(self, model_name: str, agent_id: str, tokens: int, cost: float, latency: float):
        """Records a single model call event."""
        self.by_model[model_name]["tokens"] += tokens
        self.by_model[model_name]["cost"] += cost
        self.by_model[model_name]["latency"] += latency
        self.by_model[model_name]["calls"] += 1

        self.by_agent[agent_id]["tokens"] += tokens
        self.by_agent[agent_id]["cost"] += cost
        self.by_agent[agent_id]["latency"] += latency
        self.by_agent[agent_id]["calls"] += 1

        self.total_calls += 1

    def summary(self) -> Dict[str, Any]:
        """Returns serializable summary of all collected telemetry."""
        return {
            "total_calls": self.total_calls,
            "by_model": dict(self.by_model),
            "by_agent": dict(self.by_agent),
        }

def with_telemetry(telemetry_instance: RunTelemetry, agent_id: str):
    """Decorator to wrap model client calls and record telemetry."""

    def decorator(func):
        @wraps(func)
        def wrapper(model_client, *args, **kwargs):
            model_name = getattr(model_client, 'name', 'unknown_model')
            t0 = time.time()

            response = func(model_client, *args, **kwargs)

            latency = time.time() - t0

            usage = response.get("usage", {})
            prompt_tokens = usage.get("prompt_tokens", 0)
            completion_tokens = usage.get("completion_tokens", 0)
            total_tokens = prompt_tokens + completion_tokens

            pricing = MODEL_PRICING.get(model_name, MODEL_PRICING["default"])
            cost = (prompt_tokens * pricing["prompt"]) + (completion_tokens * pricing["completion"])

            telemetry_instance.record(model_name, agent_id, total_tokens, cost, latency)

            logger.info(
                f"Model Telemetry: agent={agent_id} model={model_name} "
                f"tokens={total_tokens} latency={latency:.2f}s cost=${cost:.6f}"
            )

            return response
        return wrapper
    return decorator
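
A hypothetical usage sketch: wrap whatever function performs the backend call (here a stand-in call_model), provided its return value carries an OpenAI-style usage dict:

telemetry = RunTelemetry()

@with_telemetry(telemetry, agent_id="agent_a")
def call_model(model_client, messages):
    # stand-in for the real backend call; the decorator only needs a response
    # dict containing a "usage" entry with prompt/completion token counts
    return model_client.complete(messages)

# after the run, the aggregate can be logged or merged into status.json
print(telemetry.summary())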

Benefits: per-call token, cost, and latency tracking, aggregated by model and by agent, added at call sites with a single decorator.


2. Artifact Writer Module (artifact_writer.py) - Efficient File Operations

Prevents redundant writes and ensures atomic file operations:

# artifact_writer.py
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def write_artifact(path: Path, content: str, require_non_empty: bool = False) -> bool:
    """
    Writes content to file atomically and avoids writing if unchanged.

    Args:
        path: Target file path
        content: Content to write
        require_non_empty: Skip write if content is empty

    Returns:
        True if file was written, False if skipped
    """
    path.parent.mkdir(parents=True, exist_ok=True)

    # Skip empty writes if required
    if require_non_empty and not content.strip():
        logger.warning(f"Skipping write to {path}: content is empty")
        return False

    # Skip if content unchanged
    if path.exists():
        try:
            if path.read_text(encoding='utf-8') == content:
                logger.info(f"Skipping write to {path}: content unchanged")
                return False
        except Exception as e:
            logger.error(f"Could not read existing file {path}: {e}")

    # Atomic write
    try:
        tmp_path = path.with_suffix(path.suffix + '.tmp')
        tmp_path.write_text(content, encoding='utf-8')
        tmp_path.replace(path)
        logger.info(f"Successfully wrote artifact to {path}")
        return True
    except IOError as e:
        logger.error(f"Failed to write artifact to {path}: {e}")
        return False
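
Usage sketch (paths are illustrative):

from pathlib import Path

page = Path("workspace/index.html")
write_artifact(page, "<h1>Bob Dylan</h1>", require_non_empty=True)  # True: written
write_artifact(page, "<h1>Bob Dylan</h1>")                          # False: unchanged, skipped
write_artifact(page, "", require_non_empty=True)                    # False: empty, rejected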

Benefits: idempotent writes cut redundant I/O and log noise (such as the empty-snapshot warnings seen earlier), and the temp-file-plus-rename pattern ensures readers never see a partially written artifact.


Integration Guide (integration_guide.md)

Complete step-by-step instructions for adopting both modules:

Telemetry Integration:

# In orchestrator.py
from .telemetry import RunTelemetry

class Orchestrator:
    def __init__(self, *args, **kwargs):  # existing parameters elided in this sketch
        self.telemetry = RunTelemetry()

    def _update_status(self):
        status_data = {
            # ... other fields
            "telemetry": self.telemetry.summary()
        }
        # write to status.json

Artifact Writer Integration:

# In filesystem tools
from .artifact_writer import write_artifact
from pathlib import Path

def mcp__filesystem__write_file(path_str: str, content: str):
    was_written = write_artifact(
        path=Path(path_str),
        content=content,
        require_non_empty=True
    )
    return {"success": was_written}

Enhanced status.json with Telemetry

With telemetry integrated, status.json gains real-time cost/performance visibility:

{
  "meta": {"session_id": "log_20251105_081530", "elapsed_seconds": 45.3},
  "telemetry": {
    "total_calls": 24,
    "by_model": {
      "gpt-4o-mini": {"tokens": 15230, "cost": 0.00345, "latency": 45.8, "calls": 18},
      "gemini-2.5-pro": {"tokens": 8100, "cost": 0.04150, "latency": 22.3, "calls": 6}
    },
    "by_agent": {
      "agent_a": {"tokens": 11800, "cost": 0.02350, "calls": 12},
      "agent_b": {"tokens": 11530, "cost": 0.02145, "calls": 12}
    }
  }
}

Use Cases: tracking spend per run, comparing cost and latency across models, and spotting which agent dominates token usage.


Implementation Priority:

  1. Telemetry module (High impact, low effort)
  2. Artifact writer (Quick win, reduces I/O noise)
  3. Integration (Follow provided guide)
  4. Validation (Run experiments, compare metrics)

🎯 Conclusion

This case study demonstrates that MassGen v0.1.8's automation mode successfully enables meta-analysis. Key achievements:

✅ Automation Mode Works: Clean ~10-line output vs verbose terminal output

✅ Nested Execution: Agents successfully ran MassGen from within MassGen

✅ Structured Monitoring: Agents polled status.json for real-time progress

✅ Workspace Isolation: No conflicts between parent and child runs

✅ Exit Codes: Meaningful exit codes enabled success/failure detection

✅ Deliverables: Agent B created complete, tested modules ready for integration

✅ Actionable Improvements: Telemetry and artifact writer modules solve real problems

Impact of Automation Mode:

The --automation flag transforms MassGen from a human-interactive tool to an agent-controllable API:

Before v0.1.8 (verbose output): 2000+ lines of ANSI-decorated terminal output, no machine-readable status, and no reliable way to detect completion or locate results.

After v0.1.8 (automation mode): ~10 lines of structured output, status.json updated every 2 seconds, predictable result paths, and meaningful exit codes.

What Agents Delivered:

Instead of just identifying problems, agents created solutions:

  1. telemetry.py - Complete module with RunTelemetry class and decorator
  2. artifact_writer.py - Atomic, idempotent file writing
  3. integration_guide.md - Step-by-step adoption instructions
  4. Enhanced status.json - Schema with telemetry fields

Broader Implications:

This case study validates a powerful development pattern: AI systems improving themselves. By providing:

  1. Clean structured output (--automation)
  2. Real-time status monitoring (status.json)
  3. Predictable result paths (final/{winner}/answer.txt)
  4. Workspace isolation (no collisions)

We enable agents to run experiments on the system they are part of, observe those experiments as they execute, and turn what they find into concrete, reviewable code.

Future Applications:

Automation mode applies wherever MassGen must be driven programmatically rather than interactively, not just to self-analysis.

Next Steps:

The modules created by agents will be integrated in future versions:

  1. Add telemetry.py to MassGen core
  2. Integrate artifact_writer.py into filesystem operations
  3. Update status.json schema to include telemetry
  4. Validate cost tracking across multiple runs
  5. Document telemetry API for users

📌 Status Tracker

Version: v0.1.8
Date: November 5, 2025
Session ID: log_20251105_074751_835636