
Human-in-the-Loop Safety for Irreversible Actions

Status: 🔄 In Progress
Version: Future
Last Updated: November 15, 2025

Overview

A human approval mechanism for dangerous operations (file deletion, system commands, API calls) that prevents accidental damage while preserving agent autonomy for safe operations.

Feature Description

Goal

Add a safety layer that pauses execution before irreversible actions, allowing humans to review and approve/reject dangerous operations without disrupting the multi-agent workflow.

Key Components

Danger Classification System
- Categorize operations by risk level (safe, warning, dangerous)
- Pattern matching for destructive commands (rm -rf, DROP TABLE, etc.)
- Context-aware risk assessment (production vs. sandbox)

Approval Workflow
- Pause agent execution before dangerous operation
- Display operation details and potential impact
- Accept/Reject/Modify interface for human review
- Timeout and fallback behavior for unattended runs

Audit Trail
- Log all approval requests and decisions
- Track who approved what and when
- Enable post-mortem analysis of incidents

Configurable Safety Levels
- Strict: Require approval for all risky operations
- Moderate: Auto-approve low-risk, prompt for high-risk
- Permissive: Log only, no blocking (for trusted environments)
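
As a rough illustration of how the classification system and safety levels above might fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Risk`, `classify`, and `decide`, the small pattern lists, and the rule that a warning is escalated to dangerous in production. It is not the project's implementation.

```python
import re
from enum import Enum


class Risk(Enum):
    SAFE = "safe"
    WARNING = "warning"
    DANGEROUS = "dangerous"


# Hypothetical patterns; a real deployment would need a much broader set.
DANGEROUS_PATTERNS = [
    # rm with both recursive and force flags, in either order (-rf, -fr)
    re.compile(r"\brm\s+-[a-z]*r[a-z]*f|\brm\s+-[a-z]*f[a-z]*r", re.IGNORECASE),
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
]
WARNING_PATTERNS = [
    re.compile(r"\brm\b"),
    re.compile(r"\bsudo\b"),
    re.compile(r"\bchmod\b"),
]


def classify(command: str, environment: str = "sandbox") -> Risk:
    """Return a risk level for a command string.

    Context-aware rule (an assumption): in production, a warning
    is escalated to dangerous.
    """
    if any(p.search(command) for p in DANGEROUS_PATTERNS):
        return Risk.DANGEROUS
    if any(p.search(command) for p in WARNING_PATTERNS):
        return Risk.DANGEROUS if environment == "production" else Risk.WARNING
    return Risk.SAFE


def decide(risk: Risk, level: str) -> str:
    """Map a risk level and a safety level to 'allow', 'prompt', or 'log'."""
    if risk is Risk.SAFE:
        return "allow"
    if level == "strict":
        return "prompt"  # every risky operation needs approval
    if level == "moderate":
        return "prompt" if risk is Risk.DANGEROUS else "allow"
    if level == "permissive":
        return "log"  # record the operation but never block
    raise ValueError(f"unknown safety level: {level}")
```

For example, `decide(classify("rm -rf /tmp/build"), "moderate")` yields `"prompt"`, while `decide(classify("ls -la"), "strict")` yields `"allow"`. Plain regex matching like this is easy to evade through obfuscation, which is why the bypass tests listed under Security Tests matter.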

Protected Operations

File system: deletion, overwrite, permission changes
System commands: sudo, service management, network config
API calls: POST/DELETE to production endpoints
Database: schema changes, data deletion
MCP tools: irreversible external actions

Test Strategy

Functional Tests

Verify approval prompt appears for dangerous operations
Test accept/reject/modify flows
Validate timeout behavior
Confirm safe operations proceed without prompts

Security Tests

Attempt bypass through command obfuscation
Test privilege escalation scenarios
Verify audit log integrity
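
One possible way to make audit log integrity testable is a hash chain, where each entry stores the hash of the previous one so that later edits become detectable. The sketch below is an assumption about how this could be done, not a description of the planned design; `append_entry`, `verify_log`, and the entry fields are hypothetical names.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, operation: str, reviewer: str,
                 decision: str, timestamp: str) -> dict:
    """Append an approval record that is linked to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {
        "operation": operation,
        "reviewer": reviewer,
        "decision": decision,
        "timestamp": timestamp,
        "prev_hash": prev_hash,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_log(log: list) -> bool:
    """Return True only if every entry's hash and back-link are intact."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this is tamper-evident rather than tamper-proof: someone who can rewrite the whole log can also recompute every hash, so the latest hash would still need to be stored somewhere the agent cannot modify.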

Usability Tests

Measure time to review and approve
Test with real case studies (filesystem, code execution)
Gather feedback on false positive rate

Validation Criteria

✅ Zero false negatives (no dangerous ops slip through)
✅ <10% false positive rate (minimal disruption)
✅ <5 second approval flow for common operations
✅ Full audit trail for compliance
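
The first two criteria can be checked mechanically against a labeled test set. The sketch below assumes that the false positive rate means the share of safe operations that were flagged; the source does not define it, so treat that definition, the `evaluate` name, and the `(is_dangerous, was_flagged)` pair format as assumptions.

```python
def evaluate(results: list) -> dict:
    """Summarize classifier results given (is_dangerous, was_flagged) pairs."""
    false_negatives = sum(1 for dangerous, flagged in results if dangerous and not flagged)
    false_positives = sum(1 for dangerous, flagged in results if not dangerous and flagged)
    safe_total = sum(1 for dangerous, _ in results if not dangerous)
    # Assumed definition: flagged safe operations divided by all safe operations.
    fp_rate = false_positives / safe_total if safe_total else 0.0
    return {
        "false_negatives": false_negatives,
        "false_positive_rate": fp_rate,
        "passes": false_negatives == 0 and fp_rate < 0.10,
    }
```

With 5 dangerous operations all flagged and 1 of 20 safe operations flagged, the false positive rate is 1/20 = 5%, which meets both criteria. A single missed dangerous operation fails the check regardless of the false positive rate.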

Implementation Notes

Integration Points:

MCP tool execution layer
File system manager
Shell command execution (bash_20250124)
Custom tool invocation

Configuration Example:

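safety:
  level: moderate            # one of: strict, moderate, permissive
  protected_operations:
    - file_deletion
    - system_commands
    - production_api_calls
  timeout_seconds: 300       # how long to wait for a human decision
  fallback: reject           # decision applied when the timeout expires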
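
To illustrate how `timeout_seconds` and `fallback` could interact in an unattended run, here is a minimal Python sketch. The `request_approval` function and the `ask` callback are hypothetical, and the sketch handles only accept and reject decisions, leaving the modify flow out for brevity.

```python
import threading
from typing import Callable

# Mirrors the relevant keys of the configuration example above.
SAFETY_CONFIG = {"timeout_seconds": 300, "fallback": "reject"}


def request_approval(ask: Callable[[str], str], description: str,
                     timeout_seconds: float, fallback: str = "reject") -> str:
    """Ask a human for a decision, falling back if no valid answer arrives in time.

    `ask` is a hypothetical callback that blocks until the reviewer responds
    with "accept" or "reject".
    """
    result = {}

    def worker() -> None:
        result["decision"] = ask(description)

    # Daemon thread so an unanswered prompt never keeps the process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)

    decision = result.get("decision")
    if thread.is_alive() or decision not in ("accept", "reject"):
        return fallback  # timed out, failed, or gave an unrecognized answer
    return decision
```

A call such as `request_approval(ask, "rm -rf build/", SAFETY_CONFIG["timeout_seconds"], SAFETY_CONFIG["fallback"])` would then return `"reject"` if nobody responds within 300 seconds. Defaulting to reject is the conservative choice here, because a missed prompt then blocks the operation instead of letting it through.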

See ROADMAP.md for detailed development track.

Related Features:

MCP Planning Mode (v0.0.29) - Preview tool usage without execution
Revert Feature After Final Agent Failure (Issue #325) - Automated recovery