Skip to content
Back to research

Training AI Agents to Communicate Safely: Reinforcement Learning for Covert Channel Prevention in Inter-Agent Protocols

preprint

Overview

As AI agents increasingly operate in multi-agent networks, they require efficient communication protocols to coordinate effectively. However, any high-bandwidth channel between agents can be repurposed as a covert channel for smuggling secrets, exfiltrating data, or coordinating in ways that evade human oversight.

We present the Slipstream Governance Environment, an OpenEnv-compatible reinforcement learning environment that trains language models to use structured inter-agent protocols safely.

Key Findings

  • 95% Secret Resistance: Using Group Relative Policy Optimization (GRPO) for alignment, we train a GLM-4-Z1-9B model to achieve 95% resistance to secret leakage attacks while maintaining protocol compliance.
  • Quantization Improves Safety: Post-training quantization to int4 precision improves safety alignment, with secret resistance increasing from 79% to 95% while reducing memory usage by 73%.
  • Lossy Compression as Regularizer: We hypothesize that lossy compression acts as a regularizer against memorizing injected secrets.
  • Distributed Safety Alignment: Layer pruning experiments reveal that safety alignment is distributed across model layers, proving more robust than task-specific capability, which is localized in later layers.

Impact

Our results demonstrate that RL-based governance can effectively balance the efficiency benefits of structured protocols against security risks, with implications for the safe deployment of multi-agent AI systems.