Workflow:

  1. Checkpoint coordinator (part of the job manager) instructs a task manager to begin a checkpoint.

  2. Insert numbered checkpoint barriers into their streams of all the sources record their offsets.

  3. checkpoint barriers flow through the job graph, indicating the part of the stream before and after each checkpoint.

    Checkpoint n will contain the state of each operator that resulted from having consumed every event before checkpoint barrier n, and none of the events after it.

  4. As each operator in the job graph receives one of these barriers, it records its state.

    Operators with two input streams (such as a CoProcessFunction) perform barrier alignment so that the snapshot will reflect the state resulting from consuming events from both input streams up to (but not past) both barriers.

Asynchronously snapshot

Flink’s state backends use a copy-on-write mechanism to allow stream processing to continue unimpeded while older versions of the state are being asynchronously snapshotted. Only when the snapshots have been durably persisted will these older versions of the state be garbage collected.