Synthesis: A Federated Capability Ecosystem for Safe AI Self-Extension
Preprint
Overview
Synthesis is a federated capability ecosystem that enables AI agents to safely acquire, compose, and share capabilities. Rather than generating code and hoping it works, Synthesis generates comprehensive tests first, then iteratively refines implementations until all tests pass.
Key Contributions
- Test-Driven Synthesis: Generate tests before code; iterate until all tests pass
- Graduated Trust System: Capabilities earn privileges through demonstrated reliability (UNTRUSTED -> PROBATION -> TRUSTED -> VERIFIED)
- Composition Over Creation: Exhaustively search and compose existing capabilities before synthesizing new ones
- Trust Bootstrapping: Protocol for seeding trust networks with founding validators and seed capabilities
Problem
How do we enable AI systems to extend their own capabilities safely when LLM-generated code is often syntactically correct but logically flawed?
Approach
Four-stage resolution pipeline:
- Exchange Search: Query shared repository for verified capabilities
- Local Cache: Check locally cached capabilities from previous syntheses
- Composition: Attempt to chain existing capabilities (if coverage >= 70%)
- TDD Synthesis: Generate tests -> generate code -> refine until tests pass
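The four stages above can be sketched as a single fall-through resolver. This is an illustrative sketch, not the project's real API: the callable parameters (`exchange_search`, `cache_lookup`, `best_composition`, `synthesize`) and the `Capability` dataclass are assumed names standing in for the actual components.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Capability:
    name: str
    source: str  # which pipeline stage produced it (for illustration)

def resolve(
    requirement: str,
    exchange_search: Callable[[str], Optional[Capability]],
    cache_lookup: Callable[[str], Optional[Capability]],
    best_composition: Callable[[str], Tuple[Optional[Capability], float]],
    synthesize: Callable[[str], Capability],
) -> Capability:
    # Stage 1: Exchange Search - query the shared repository of verified capabilities
    cap = exchange_search(requirement)
    if cap:
        return cap
    # Stage 2: Local Cache - reuse capabilities from previous syntheses
    cap = cache_lookup(requirement)
    if cap:
        return cap
    # Stage 3: Composition - chain existing capabilities if they cover >= 70%
    chain, coverage = best_composition(requirement)
    if chain and coverage >= 0.70:
        return chain
    # Stage 4: TDD Synthesis - generate tests, then refine code until they pass
    return synthesize(requirement)
```

The ordering encodes the "composition over creation" principle: synthesis is the last resort, reached only when search, cache, and composition all fail.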
Core Components
TDD Synthesizer
- Generates 5-10 test cases from requirements (normal cases, edge cases, error conditions)
- Iteratively refines implementation (max 5 iterations)
- Creates validated Capability objects with full test suites
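A minimal sketch of the test-first refinement loop, assuming `generate_tests`, `generate_code`, and `run_tests` as stand-ins for the LLM calls and the sandboxed test runner (these names are assumptions, not the project's actual interfaces):

```python
def tdd_synthesize(requirement, generate_tests, generate_code, run_tests,
                   max_iterations=5):
    """Generate tests first, then refine an implementation until all pass."""
    tests = generate_tests(requirement)               # 5-10 cases: normal, edge, error
    code = generate_code(requirement, tests, failures=None)
    for _ in range(max_iterations):                   # max 5 refinement iterations
        failures = run_tests(code, tests)
        if not failures:                              # every test passed
            return code, tests                        # payload for a validated Capability
        # Feed the failures back so the next attempt targets what broke
        code = generate_code(requirement, tests, failures=failures)
    raise RuntimeError(f"no passing implementation within {max_iterations} iterations")
```

Because the tests are fixed before the first implementation attempt, the model cannot "teach to the test" by weakening the spec between iterations.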
Trust Manager
| Level | Requirements | Permissions |
|---|---|---|
| UNTRUSTED | New capability | Max isolation, no network/files |
| PROBATION | 10+ runs, 70%+ success | Limited resources |
| TRUSTED | 50+ runs, 85%+ success | Standard execution |
| VERIFIED | 200+ runs, 95%+ success, human review | Full privileges |
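The graduated ladder can be expressed as a threshold check. The thresholds below are taken directly from the table; the function and constant names are assumptions for illustration.

```python
# (level, min_runs, min_success_rate) from the trust table, highest first
LEVELS = [
    ("VERIFIED",  200, 0.95),   # additionally requires human review
    ("TRUSTED",    50, 0.85),
    ("PROBATION",  10, 0.70),
    ("UNTRUSTED",   0, 0.00),
]

def trust_level(runs: int, success_rate: float, human_reviewed: bool = False) -> str:
    """Return the highest trust level a capability's track record supports."""
    for name, min_runs, min_rate in LEVELS:
        if runs >= min_runs and success_rate >= min_rate:
            if name == "VERIFIED" and not human_reviewed:
                continue  # without human review, fall through to TRUSTED
            return name
    return "UNTRUSTED"
```

Note that a strong statistical record alone caps out at TRUSTED; the human-review gate is what separates VERIFIED from everything automation can grant.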
Trust Scoring
S_composite = 0.4 * R_execution + 0.4 * S_validation + 0.2 * S_community
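The composite score is a straight transcription of the formula above. Treating each input as normalized to [0, 1] is an assumption here, implied by the weights summing to 1.0 but not stated explicitly.

```python
def composite_score(r_execution: float, s_validation: float, s_community: float) -> float:
    """Weighted composite trust score; weights 0.4 + 0.4 + 0.2 sum to 1.0.

    Inputs are assumed normalized to [0, 1]: execution reliability,
    validation score, and community score respectively.
    """
    return 0.4 * r_execution + 0.4 * s_validation + 0.2 * s_community
```

Execution reliability and validation carry equal weight, with community signal deliberately weighted lowest, so a capability cannot be promoted on popularity alone.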
Honest Metrics
| Metric | Value |
|---|---|
| One-shot synthesis success | 40-60% |
| After refinement (5 iterations) | 70-85% |
| Complex multi-dependency tasks | 50-70% |
| Average iterations to success | 2.3 |
Safety Features
- Static Analysis: AST checks for forbidden imports (os, subprocess, socket, pickle)
- Container Isolation: Docker containers with no host mounts
- Resource Limits: 512MB memory, 30s timeout (trust-adjusted)
- Network Control: Disabled for UNTRUSTED, dependency-install only for PROBATION
- Audit Logging: Complete execution records for forensics
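The static-analysis check can be sketched with Python's standard `ast` module: walk the parsed tree and flag any import of a forbidden top-level module. The forbidden set matches the list above; the helper name is an assumption.

```python
import ast

FORBIDDEN = {"os", "subprocess", "socket", "pickle"}

def forbidden_imports(source: str) -> set:
    """Return the forbidden top-level modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import os.path` still counts as importing `os`
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & FORBIDDEN
```

AST-based screening catches straightforward imports but not dynamic ones (e.g. `__import__` with a computed string), which is why it is layered under container isolation rather than relied on alone.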
Philosophical Foundation
- Capabilities earn trust through demonstrated competence, not assumed correctness
- AI systems as partners: objective validation rather than assumed trust
- Honest metrics: report real success rates, not marketing claims
- Collaborative sharing creates network effects
Links
- Paper: Full PDF
- Code: GitHub Repository