Stateful Stream Processing
- tags: Stream processing,Flink
- source:
This means that how one event is handled can depend on the accumulated effect of all the events that came before it.
How the stateful streaming processing works on a distributed cluster?⌗
The set of parallel instances of a stateful operator is effectively a sharded key-value store. Each parallel instance is responsible for handling events for a specific group of keys, and the state for those keys is kept locally.
- State are stored and accessed locally by sharded key-value store,
- A fully-connected network shuffle will be occurring between all the instances,
- All of the events that will be processed together.
State is always accessed locally, which helps Flink applications achieve high throughput and low-latency. You can choose to keep state on the JVM heap, or if it is too large, in efficiently organized on-disk data structures.
Some examples of stateful operations⌗
- When an application searches for certain event patterns, the state will store the sequence of events encountered so far.
- When aggregating events per minute/hour/day, the state holds the pending aggregates.
- When training a machine learning model over a stream of data points, the state holds the current version of the model parameters.
- When historic data needs to be managed, the state allows efficient access to events that occurred in the past.