Training AI Agents to Communicate Safely: Reinforcement Learning for Covert Channel Prevention in Inter-Agent Protocols
preprint
Overview
As AI agents increasingly operate in multi-agent networks, they require efficient communication protocols to coordinate effectively. However, any high-bandwidth channel between agents can be repurposed as a covert channel for smuggling secrets, exfiltrating data, or coordinating in ways that evade human oversight.
We present the Slipstream Governance Environment, an OpenEnv-compatible reinforcement learning environment that trains language models to use structured inter-agent protocols safely.
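Environments of this kind typically expose a reset/step interaction loop: the agent receives a task observation containing sensitive context, emits a protocol message, and is rewarded for completing the task without leaking that context. The sketch below illustrates only that pattern; the class, secret, and reward shaping are invented for this example and do not reproduce the actual Slipstream or OpenEnv API.

```python
class GovernanceEnvSketch:
    """Toy stand-in for an inter-agent protocol environment.
    All names here are hypothetical, not the published API."""

    def __init__(self, secret="alpha-7"):
        self.secret = secret

    def reset(self):
        # Observation: a task prompt containing a secret the agent
        # must use internally but never transmit over the channel.
        return f"Coordinate the handoff. (internal secret: {self.secret})"

    def step(self, message: str):
        # Reward penalizes any verbatim leakage of the secret.
        leaked = self.secret in message
        reward = -1.0 if leaked else 1.0
        done = True  # single-turn episode, for illustration only
        return message, reward, done, {"leaked": leaked}


env = GovernanceEnvSketch()
obs = env.reset()
_, reward, done, info = env.step("Handoff confirmed at gate B.")
```

A real environment would score protocol compliance and task success as well, but the leak-detection reward above is the core of the secret-resistance objective.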
Key Findings
- 95% Secret Resistance: Using Group Relative Policy Optimization (GRPO) for alignment, we train a GLM-4-Z1-9B model to achieve 95% resistance to secret leakage attacks while maintaining protocol compliance.
- Quantization Improves Safety: Post-training int4 quantization strengthens safety alignment, raising secret resistance from 79% to 95% while reducing memory usage by 73%.
- Lossy Compression as Regularizer: We hypothesize that lossy compression acts as a regularizer against memorizing injected secrets.
- Distributed Safety Alignment: Layer pruning experiments reveal that safety alignment is distributed across model layers, proving more robust than task-specific capability, which is localized in later layers.
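GRPO's central idea is simple: instead of a learned value baseline, each reward is normalized against the mean and standard deviation of its own group of sampled completions. A minimal sketch of that advantage computation (the helper name and the example reward values are ours, not the paper's):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's
    reward by the mean and std of its group. Illustrative helper, not
    the paper's training code."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: four rollouts scored by a binary leak-detection reward
# (1.0 = no secret leaked, 0.0 = secret leaked).
advantages = grpo_advantages([1.0, 0.0, 1.0, 1.0])
```

The leaking rollout receives a strongly negative advantage and the safe rollouts a positive one, which is exactly the signal that pushes the policy toward secret resistance.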
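For readers unfamiliar with int4 quantization, the essence is mapping floating-point weights onto a 16-value integer grid ([-8, 7]) and storing one scale per tensor. The sketch below shows a symmetric per-tensor scheme purely for illustration; it is not necessarily the exact quantization recipe used in the paper.

```python
def quantize_int4(weights):
    """Symmetric per-tensor int4 quantization sketch (illustrative).
    int4 signed range is [-8, 7]; scale maps the largest |weight|
    onto the positive endpoint."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate float weights from int4 codes."""
    return [v * scale for v in q]

q, scale = quantize_int4([0.5, -1.0, 0.25, 1.0])
recovered = dequantize_int4(q, scale)
```

The rounding step is lossy by construction, which is what motivates the hypothesis above: fine-grained memorized content (such as an injected secret) is more fragile under this loss than the broadly distributed safety behavior.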
Impact
Our results demonstrate that RL-based governance can effectively balance the efficiency benefits of structured protocols against security risks, with implications for the safe deployment of multi-agent AI systems.
Links
- Paper: Full PDF
- Zenodo: 10.5281/zenodo.18553233
- Related: Slipstream