Building Distributed FSMs with Event Buses

Intro

At Electric Era, we love diving deep into systems architecture, especially when it means making things more efficient and testable. This is at the core of how we enable industry-leading reliability for our drivers and customers.

Recently, we've been tackling a challenge familiar to developers who work on complex distributed systems: how to scale our finite state machines (FSMs) beyond a single binary and into distributed processes (sometimes on other computers) while keeping everything as clean and testable as possible. Here's how we're doing it.

The Current Setup

Right now, we run FSMs within a single binary using shared memory structures for input and output. This works great because:

  1. Simplicity: A higher-level manager handles shared structs for FSMs, and our control loop is single-threaded.
  2. Reliability: For inter-binary communication, we use a simple inter-process communication (IPC) mechanism to exchange states. Because state is sent as a steady stream, a missed packet isn't a problem: the next packet carries the current state.
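
As a rough illustration of the pattern described above, here is a minimal sketch of an FSM that reads a shared input struct and writes a shared output struct once per control cycle. The names (ChargerFsm, Inputs, Outputs, and the three states) are hypothetical and chosen only for this example; they are not taken from our codebase.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ChargerState(Enum):
    IDLE = auto()
    CHARGING = auto()
    FAULT = auto()


@dataclass
class Inputs:
    """Shared input struct, written by the manager before each tick."""
    vehicle_connected: bool = False
    fault_detected: bool = False


@dataclass
class Outputs:
    """Shared output struct, written by the FSM on each tick."""
    contactor_closed: bool = False


class ChargerFsm:
    def __init__(self) -> None:
        self.state = ChargerState.IDLE

    def tick(self, inputs: Inputs, outputs: Outputs) -> None:
        """Advance one control cycle: read inputs, transition, write outputs."""
        if inputs.fault_detected:
            self.state = ChargerState.FAULT
        elif self.state is ChargerState.IDLE and inputs.vehicle_connected:
            self.state = ChargerState.CHARGING
        elif self.state is ChargerState.CHARGING and not inputs.vehicle_connected:
            self.state = ChargerState.IDLE
        # The output depends only on the state reached in this cycle.
        outputs.contactor_closed = self.state is ChargerState.CHARGING
```

Because the FSM touches nothing except the two structs it is handed, a test can fill in Inputs, call tick, and check the state and Outputs with no other setup.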

This setup has served components that strictly follow the FSM model well: they're a breeze to test. But as our system grows more complex and distributed, the less-structured modules are starting to feel like outliers. We want every module to benefit from strict, testable interfaces.

The Next Big Step: Distributed FSMs with Event Buses

Enter the Event Bus. This new architecture is all about managing state flow cleanly within an application and extending that flow across distributed systems. Here’s the gist:

What’s an Event Bus?

An Event Bus handles state snapshots (aka “Events”) within an application, providing a clean way to publish and subscribe to state changes.

  • Publishers (like sensors) push updates to the bus.
  • Subscribers (like actuators or FSMs) receive the most recent state.

This ensures components always see the latest state on every control cycle (e.g., every 20 ms for our control loop).
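
The latest-state behavior described above can be sketched in a few lines. This is a simplified, single-threaded illustration under our own assumptions, not the actual implementation: the EventBus class and its publish and latest methods are hypothetical names, and events are plain dictionaries here.

```python
from typing import Any, Dict, Optional


class EventBus:
    """Keeps only the most recent snapshot per event name."""

    def __init__(self) -> None:
        self._latest: Dict[str, Any] = {}

    def publish(self, name: str, snapshot: Any) -> None:
        # A newer snapshot overwrites the older one; nothing is queued.
        self._latest[name] = snapshot

    def latest(self, name: str) -> Optional[Any]:
        # Subscribers read the current snapshot once per control cycle.
        return self._latest.get(name)


bus = EventBus()
bus.publish("VehicleRequestedValues", {"current": 10.0, "voltage": 400.0})
bus.publish("VehicleRequestedValues", {"current": 12.5, "voltage": 400.0})
```

After the two publishes, a subscriber calling bus.latest("VehicleRequestedValues") sees only the second snapshot. Storing one slot per event rather than a queue is what makes a dropped update harmless.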

Taking It Distributed

To scale this model across nodes, Event Buses can now talk to each other:

  1. Service Registry: Each node advertises its available events to a registry.
  2. Event Discovery: Nodes discover and subscribe to other nodes' events via the registry.
  3. UDP Multicast: Events are sent over the network with UDP multicast, making the system lightweight and fast.
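
The three steps above might look roughly like the following. This is a sketch under stated assumptions: the registry is an in-memory stand-in for whatever service actually holds the advertisements, and ServiceRegistry, make_multicast_sender, send_event, the group address, and the port are all hypothetical names and values picked for illustration.

```python
import socket
from typing import Dict, List, Set

# Hypothetical multicast group and port, chosen only for this sketch.
MCAST_GROUP = "239.1.2.3"
MCAST_PORT = 50000


class ServiceRegistry:
    """Maps each event name to the set of nodes that advertise it."""

    def __init__(self) -> None:
        self._publishers: Dict[str, Set[str]] = {}

    def advertise(self, node_id: str, event_names: List[str]) -> None:
        # Step 1: a node announces which events it can publish.
        for name in event_names:
            self._publishers.setdefault(name, set()).add(node_id)

    def discover(self, event_name: str) -> List[str]:
        # Step 2: a node looks up who publishes an event it wants.
        return sorted(self._publishers.get(event_name, set()))


def make_multicast_sender(ttl: int = 1) -> socket.socket:
    """Create a UDP socket whose datagrams can be sent to a multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # A TTL of 1 keeps multicast traffic on the local network segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def send_event(sock: socket.socket, payload: bytes) -> None:
    # Step 3: one datagram reaches every subscriber that joined the group.
    sock.sendto(payload, (MCAST_GROUP, MCAST_PORT))


registry = ServiceRegistry()
registry.advertise("node-a", ["VehicleRequestedValues"])
registry.advertise("node-b", ["VehicleRequestedValues", "ContactorState"])
```

With these advertisements, registry.discover("ContactorState") lists only node-b. The receiving side (joining the group and reading datagrams) is omitted to keep the sketch short.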

Each Event includes metadata:

  • Node_ID: Who published it.
  • TimeReceived: When the event was received by the bus.

This metadata helps us detect stale state (and, indirectly, communication loss).
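
A staleness check built on this metadata could be as simple as comparing the receive timestamp against a maximum age. The sketch below assumes a nanosecond timestamp and a tolerance of three missed 20 ms cycles; the tolerance, the EventMeta type, and the function names are our assumptions for illustration.

```python
import time
from dataclasses import dataclass

CONTROL_PERIOD_NS = 20_000_000  # 20 ms control cycle
MAX_MISSED_CYCLES = 3           # assumed tolerance for this sketch


@dataclass(frozen=True)
class EventMeta:
    node_id: str      # who published the event
    time_rx_ns: int   # when the local bus received it


def is_stale(meta: EventMeta, now_ns: int,
             max_age_ns: int = CONTROL_PERIOD_NS * MAX_MISSED_CYCLES) -> bool:
    """Return True if the event is older than the allowed age."""
    return now_ns - meta.time_rx_ns > max_age_ns


def stamp(node_id: str) -> EventMeta:
    # A monotonic clock avoids jumps caused by wall-clock adjustments.
    return EventMeta(node_id, time.monotonic_ns())
```

Since the timestamp is taken when the local bus receives the event, both sides of the comparison use the same clock, and no clock synchronization between nodes is needed for this check. An event that stays stale for long is also a hint that the link to its publisher is down.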

The block diagram below shows how the pieces of the whole system fit together.

Figure 1: Distributed FSM architecture block diagram.

Serialization: FlatBuffers vs. Protobuf

To efficiently handle Events both in memory and over the network, we’re exploring serialization options. The two contenders:

FlatBuffers

  1. Serialization/deserialization overhead is minimal because FlatBuffer fields can be accessed in place. When more state is shared than a subscriber actually uses, map-style access without a deserialization penalty is strongly preferable.
  2. FlatBuffers are a good fit for memory-constrained environments that use fixed ring buffer allocators. They can run efficiently even on microcontrollers, which is a potential use case for us.

Example Schema:

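namespace Disco;

// A table rather than a struct: FlatBuffers structs may hold only
// scalars and other structs, so a string field requires a table.
table EBMetaData {
    __eb_node_id:string;
    __eb_time_rx_ns:uint64;
}

table VehicleRequestedValues {
    metadata:EBMetaData;
    current:float;
    voltage:float;
}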
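
To make the idea of in-place access concrete, here is a small sketch that reads one field straight out of a byte buffer at a known offset, without decoding the rest. Note that this is not the FlatBuffers wire format and does not use the FlatBuffers library; the fixed layout, offsets, and function names are assumptions made purely to illustrate the concept with Python's standard struct module.

```python
import struct

# Assumed fixed layout for this sketch only (little-endian, no padding):
#   offset 0:  time_rx_ns (uint64)
#   offset 8:  current    (float32)
#   offset 12: voltage    (float32)
LAYOUT = struct.Struct("<Qff")
OFF_CURRENT = 8
OFF_VOLTAGE = 12


def pack_event(time_rx_ns: int, current: float, voltage: float) -> bytes:
    """Write all three fields into one contiguous buffer."""
    return LAYOUT.pack(time_rx_ns, current, voltage)


def read_voltage(buf) -> float:
    # Reads 4 bytes at a known offset; the other fields are never decoded.
    return struct.unpack_from("<f", buf, OFF_VOLTAGE)[0]


buf = pack_event(123, 12.5, 400.0)
view = memoryview(buf)  # a view over the same bytes, no copy
```

A subscriber that only cares about voltage can call read_voltage(view) and pay for a single 4-byte read. FlatBuffers generalizes this idea with offset tables, so fields can be optional and variable-length while still being readable in place.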

Protobuf

  • Protobuf is more widely used, so it is familiar to many more developers.
  • On embedded platforms, we need to investigate ArenaAllocator to avoid dynamic memory issues. Some of our platforms run bare metal and have no heap to begin with.

Performance testing will ultimately decide which one wins, but FlatBuffers currently has a slight edge thanks to its speed, its zero-copy potential, and its ability to perform well in resource-constrained environments.

Why We’re Excited

This new approach unlocks several powerful benefits:

  • Testability: Strict FSM structures are now possible across distributed systems.
  • Scalability: Nodes can independently manage and share state.
  • Flexibility: Components can mix and match Events without compromising the system’s architecture.

We’re thrilled to see how this will streamline our systems and are eager to share our findings (and maybe even some open-source tools) with the developer community.

What's Next?

We’re still refining aspects like how to allocate and route Events across nodes and finalizing our serialization format. But we’re confident that this architecture will set a new standard for scalable, distributed FSMs.

If you’re also tackling distributed systems or have thoughts on serialization formats, we’d love to hear from you. Let’s build better systems together! 🚀

Stay tuned for updates, and happy coding!