This means that how one event is handled can depend on the accumulated effect of all the events that came before it.

How the stateful streaming processing works on a distributed cluster?

The set of parallel instances of a stateful operator is effectively a sharded key-value store. Each parallel instance is responsible for handling events for a specific group of keys, and the state for those keys is kept locally.

  1. State are stored and accessed locally by sharded key-value store,
  2. A fully-connected network shuffle will be occurring between all the instances,
  3. All of the events that will be processed together.

State is always accessed locally, which helps Flink applications achieve high throughput and low-latency. You can choose to keep state on the JVM heap, or if it is too large, in efficiently organized on-disk data structures.

Some examples of stateful operations

  • When an application searches for certain event patterns, the state will store the sequence of events encountered so far.
  • When aggregating events per minute/hour/day, the state holds the pending aggregates.
  • When training a machine learning model over a stream of data points, the state holds the current version of the model parameters.
  • When historic data needs to be managed, the state allows efficient access to events that occurred in the past.