Multi-agent communication patterns: a technical deep dive
Individual agent intelligence matters less than seamless coordination. Three production-tested patterns for making agents work together.
The challenge of agent coordination
Building multi-agent workflows reveals a fundamental truth: individual agent intelligence matters less than seamless coordination. You can have the most sophisticated LLM agents in the world, but if they cannot communicate effectively, your workflow will fail in production.
Most failures happen at the boundaries between agents, not within the agents themselves. Agents lose context, create race conditions, or fail to pass information in usable formats.
After extensive testing in production environments, three communication patterns have proven to scale effectively.
Pattern 1: message queue with explicit state transitions
The most straightforward scalable approach uses a message queue with explicit state tracking. Each agent publishes results and subscribes to relevant updates, creating clear boundaries and easier debugging.
    from enum import Enum
    from typing import Dict


    class WorkflowState(Enum):
        RESEARCH_PENDING = "research_pending"
        RESEARCH_COMPLETE = "research_complete"
        ANALYSIS_PENDING = "analysis_pending"
        ANALYSIS_COMPLETE = "analysis_complete"
        SYNTHESIS_PENDING = "synthesis_pending"
        SYNTHESIS_COMPLETE = "synthesis_complete"
        FAILED = "failed"


    class WorkflowCoordinator:
        # MessageQueue stands for any broker client exposing subscribe(topic, handler).
        # The handle_* methods are not shown in this excerpt.
        def __init__(self, queue: "MessageQueue"):
            self.queue = queue
            self.workflows: Dict[str, Dict] = {}  # workflow_id -> workflow record
            self.setup_subscriptions()

        def setup_subscriptions(self):
            # Route each completion or error topic to its handler.
            self.queue.subscribe("research_complete", self.handle_research_complete)
            self.queue.subscribe("analysis_complete", self.handle_analysis_complete)
            self.queue.subscribe("synthesis_complete", self.handle_synthesis_complete)
            self.queue.subscribe("agent_error", self.handle_error)
Each agent needs to understand only the message formats it publishes and consumes. Explicit state transitions make debugging straightforward, and the message queue decouples agents from one another.
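To make the flow concrete, here is a minimal sketch of a coordinator reacting to published messages. It is an illustration, not the article's production code: `InMemoryQueue`, `DemoCoordinator`, `NEXT_STATE`, and the `workflow_id`/`payload` message fields are all hypothetical, the queue is a synchronous single-process stand-in for a real broker, and the state enum is trimmed to the states the sketch uses.

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]
Handler = Callable[[Message], None]


class WorkflowState(Enum):
    RESEARCH_PENDING = "research_pending"
    ANALYSIS_PENDING = "analysis_pending"
    SYNTHESIS_PENDING = "synthesis_pending"
    SYNTHESIS_COMPLETE = "synthesis_complete"
    FAILED = "failed"


class InMemoryQueue:
    """Hypothetical stand-in for a real broker: synchronous, single process."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic: str, message: Message) -> None:
        # Deliver to every subscriber of the topic, in subscription order.
        for handler in list(self.handlers[topic]):
            handler(message)


# Assumed mapping: the state a workflow enters when each topic arrives.
NEXT_STATE = {
    "research_complete": WorkflowState.ANALYSIS_PENDING,
    "analysis_complete": WorkflowState.SYNTHESIS_PENDING,
    "synthesis_complete": WorkflowState.SYNTHESIS_COMPLETE,
}


class DemoCoordinator:
    def __init__(self, queue: InMemoryQueue) -> None:
        self.queue = queue
        self.workflows: Dict[str, Dict[str, Any]] = {}
        for topic in NEXT_STATE:
            queue.subscribe(topic, self.handle_complete(topic))
        queue.subscribe("agent_error", self.handle_error)

    def start(self, workflow_id: str) -> None:
        self.workflows[workflow_id] = {
            "state": WorkflowState.RESEARCH_PENDING,
            "results": {},
        }

    def handle_complete(self, topic: str) -> Handler:
        # Build one handler per topic: store the result, then advance the state.
        def handler(message: Message) -> None:
            workflow = self.workflows[message["workflow_id"]]
            workflow["results"][topic] = message["payload"]
            workflow["state"] = NEXT_STATE[topic]

        return handler

    def handle_error(self, message: Message) -> None:
        self.workflows[message["workflow_id"]]["state"] = WorkflowState.FAILED


queue = InMemoryQueue()
coordinator = DemoCoordinator(queue)
coordinator.start("wf-1")
queue.publish("research_complete", {"workflow_id": "wf-1", "payload": "notes"})
```

Because every transition passes through one handler, a single log line or breakpoint there is enough to trace a workflow end to end.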
Trade-offs: Linear workflow progression can create bottlenecks. A slow research agent, for example, delays every stage downstream of it.
Pattern 2: shared state store with optimistic locking
For workflows requiring complex coordination, a shared state store with conflict resolution offers better flexibility.
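    import asyncio
    from collections import defaultdict
    from typing import Any, Dict

    class StateConflictError(Exception):
        """Raised when the caller's expected version is stale."""

    class SharedStateStore:
        def __init__(self):
            # Assumed in-memory fields; the original excerpt omits the constructor.
            self.state: Dict[str, Dict[str, Any]] = defaultdict(dict)
            self.versions: Dict[str, int] = {}
            self.locks: Dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

        async def update_workflow_state(
            self,
            workflow_id: str,
            updates: Dict[str, Any],
            expected_version: int,
            agent_id: str,  # not used in this excerpt
        ) -> bool:
            # The lock makes the version check and the write one atomic step.
            async with self.locks[workflow_id]:
                current_version = self.versions.get(workflow_id, 0)
                if current_version != expected_version:
                    raise StateConflictError(
                        f"Version conflict: expected {expected_version}, "
                        f"got {current_version}"
                    )
                self.state[workflow_id].update(updates)
                self.versions[workflow_id] = current_version + 1
                return True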
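Agents operate more independently and can manage complex dependencies. The version check is the optimistic part: an agent whose expected_version is stale gets a StateConflictError instead of silently overwriting a newer write, while the lock only makes the check and the write atomic. The caller handles a conflict by re-reading the state and retrying with backoff.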
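The retry-with-backoff loop is not shown in the excerpt above, so the following is one possible sketch of it rather than the original implementation. `VersionedStore`, `update_with_retry`, `demo`, and the `max_attempts`/`base_delay` parameters are hypothetical names. The store here is a simplified single-workflow version with synchronous methods, which is safe under asyncio because nothing awaits between the version check and the write.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable, Dict, Tuple

State = Dict[str, Any]


class StateConflictError(Exception):
    """The caller's expected version no longer matches the store."""


class VersionedStore:
    """Hypothetical single-workflow store with an optimistic version check."""

    def __init__(self) -> None:
        self.state: State = {}
        self.version = 0

    def read(self) -> Tuple[State, int]:
        # Return a copy so callers cannot mutate the stored state directly.
        return dict(self.state), self.version

    def update(self, updates: State, expected_version: int) -> None:
        if self.version != expected_version:
            raise StateConflictError(
                f"expected {expected_version}, got {self.version}"
            )
        self.state.update(updates)
        self.version += 1


async def update_with_retry(
    store: VersionedStore,
    compute_updates: Callable[[State], Awaitable[State]],
    max_attempts: int = 5,
    base_delay: float = 0.01,
) -> int:
    """Re-read, recompute, and retry on conflict. Returns the attempts used."""
    for attempt in range(max_attempts):
        snapshot, version = store.read()
        # Other agents may write while this agent is computing.
        updates = await compute_updates(snapshot)
        try:
            store.update(updates, version)
            return attempt + 1
        except StateConflictError:
            # Exponential backoff with full jitter, so agents that collided
            # once are unlikely to wake up and collide again together.
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise StateConflictError(f"gave up after {max_attempts} attempts")


async def demo(agents: int = 3) -> State:
    store = VersionedStore()

    async def increment(snapshot: State) -> State:
        await asyncio.sleep(0)  # stand-in for the agent's real work
        return {"count": snapshot.get("count", 0) + 1}

    await asyncio.gather(
        *(update_with_retry(store, increment) for _ in range(agents))
    )
    return store.state
```

Running `asyncio.run(demo())` lets three agents increment the same counter concurrently. An agent can only lose a round to another agent's successful write, so with three agents and five attempts no increment is lost. The jitter matters more as load grows, since it spreads out the retries that would otherwise cascade.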
Trade-offs: More complex implementation and debugging. State conflicts can trigger cascading retries under high load.
Pattern 3: event-driven with circuit breakers
The most resilient pattern uses event-driven architecture with circuit breakers for comprehensive fault tolerance.
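    import time
    from enum import Enum

    # Assumed definition: the excerpt uses CircuitState without showing it.
    class CircuitState(Enum):
        CLOSED = "closed"        # calls flow normally
        OPEN = "open"            # calls are rejected
        HALF_OPEN = "half_open"  # a trial call is allowed

    class CircuitBreaker:
        # config must expose `timeout`: seconds to stay open before a trial call.
        def __init__(self, config: "CircuitBreakerConfig"):
            self.config = config
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            self.last_failure_time = 0

        def can_execute(self) -> bool:
            if self.state == CircuitState.CLOSED:
                return True
            if self.state == CircuitState.OPEN:
                # Once the cool-down has elapsed, let a trial call through.
                if time.time() - self.last_failure_time >= self.config.timeout:
                    self.state = CircuitState.HALF_OPEN
                    return True
                return False
            return True  # HALF_OPEN: allow the trial call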
Circuit breakers prevent cascading failures and enable automatic recovery. Agents can fail independently without breaking the entire workflow.
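The excerpt shows only the gate (`can_execute`). A breaker also needs to learn from outcomes, so here is a minimal sketch of the rest of the state machine under stated assumptions: `failure_threshold`, `record_success`, `record_failure`, `call`, `CircuitOpenError`, and the injectable `clock` are hypothetical additions, and the default values are placeholders rather than recommendations.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 3  # assumed: consecutive failures before opening
    timeout: float = 30.0       # seconds to stay open before a trial call


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        config: CircuitBreakerConfig,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.config = config
        self.clock = clock  # injectable so tests can control time
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0

    def can_execute(self) -> bool:
        if self.state == CircuitState.OPEN:
            if self.clock() - self.last_failure_time >= self.config.timeout:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        return True  # CLOSED or HALF_OPEN

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure streak.
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = self.clock()
        # A failed trial call reopens immediately; otherwise open at threshold.
        if (
            self.state == CircuitState.HALF_OPEN
            or self.failure_count >= self.config.failure_threshold
        ):
            self.state = CircuitState.OPEN

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if not self.can_execute():
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

Wrapping each agent's LLM call in `breaker.call(...)` means a failing dependency is skipped quickly instead of being retried by every upstream agent. This sketch uses a monotonic clock so that wall-clock adjustments cannot shorten or extend the cool-down.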
Trade-offs: Most complex to implement and monitor. Event ordering can become challenging in complex workflows.
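One common way to tame the event-ordering problem is to stamp each event with a per-workflow sequence number and hold back anything that arrives early. The sketch below assumes the producer assigns consecutive integers starting at 1. `OrderedEventBuffer` and its method names are hypothetical, and a production version would also need a policy for gaps that never fill.

```python
from typing import Any, Callable, Dict


class OrderedEventBuffer:
    """Delivers one workflow's events in sequence order, holding back gaps."""

    def __init__(self, deliver: Callable[[Any], None], first_seq: int = 1) -> None:
        self.deliver = deliver
        self.next_seq = first_seq
        self.pending: Dict[int, Any] = {}

    def receive(self, seq: int, event: Any) -> None:
        # Ignore duplicates: already delivered, or already waiting.
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = event
        # Flush every event that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.deliver(self.pending.pop(self.next_seq))
            self.next_seq += 1


delivered = []
buffer = OrderedEventBuffer(delivered.append)
buffer.receive(2, "analysis_complete")   # early: held back
buffer.receive(1, "research_complete")   # fills the gap, flushes both
buffer.receive(1, "research_complete")   # duplicate: ignored
```

After these three calls, `delivered` holds the research event followed by the analysis event, regardless of arrival order.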
Key lessons from production
After implementing all three patterns in production environments:
- Start with Message Queues for an MVP. Get something working first; optimize later.
- Shared State Stores scale better, but they require careful conflict-resolution design.
- Event-Driven is enterprise-ready. Use it when you need maximum reliability.
- Always implement circuit breakers. LLM APIs will fail; plan for it.
- Context pollution is real. Pass only what agents actually need.
- Monitor state transitions. You cannot debug what you cannot observe.
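The last two lessons are simple to sketch. Below, a per-agent allow-list trims the shared state down to what each agent needs, and a small recorder logs every state transition. All names here (`CONTEXT_KEYS`, `build_context`, `TransitionLog`, and the state keys) are hypothetical, chosen only for illustration.

```python
import logging
from typing import Any, Dict, List, Tuple

logger = logging.getLogger("workflow")

# Hypothetical allow-lists: the only state keys each agent gets to see.
CONTEXT_KEYS: Dict[str, Tuple[str, ...]] = {
    "analysis": ("research_summary", "sources"),
    "synthesis": ("analysis_findings",),
}


def build_context(agent: str, workflow_state: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the allow-listed keys that are present in the state."""
    allowed = CONTEXT_KEYS[agent]
    return {key: workflow_state[key] for key in allowed if key in workflow_state}


class TransitionLog:
    """Keeps every transition in memory and emits one log line for each."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str, str]] = []

    def record(self, workflow_id: str, old_state: str, new_state: str) -> None:
        self.entries.append((workflow_id, old_state, new_state))
        logger.info("workflow=%s %s -> %s", workflow_id, old_state, new_state)


state = {
    "research_summary": "three relevant papers",
    "analysis_findings": ["finding A"],
    "raw_transcript": "a very long transcript the synthesis agent never needs",
}
synthesis_context = build_context("synthesis", state)

log = TransitionLog()
log.record("wf-1", "analysis_complete", "synthesis_pending")
```

Here `synthesis_context` contains only `analysis_findings`, so the long transcript never reaches the synthesis agent's prompt.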
Choosing the right pattern
- Message Queue: Linear workflows, small teams, rapid prototyping
- Shared State Store: Complex dependencies, medium scale, a need for flexibility
- Event-Driven: Mission-critical systems, large scale, multiple failure modes
Beautiful architecture diagrams mean nothing when your workflow breaks in production. These patterns work because they have survived real-world pressure and scale requirements.