Streaming: a type of data processing engine that is designed with infinite data sets in mind.

Other common uses of “streaming” that will be avoid in the rest of the post:

  1. Unbounded data: A type of ever-growing, essentially infinite data set.
  2. Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data.
  3. Low-latency, approximate, and/or speculative results: These types of results are most often associated with streaming engines.

Limitations of streaming

To beat batch at its own game, you really only need two things:

  1. Correctness: exactly-once requires strongly consistent state.
  2. Tools for reasoning about time - This gets you beyond batch.Good tools for reasoning about time are essential for dealing with unbounded, unordered data of varying event-time skew.

Event time vs. processing time

Within any data processing system, there are typically two domains of time we care about:

  • Event time, which is the time at which events actually occurred.
  • Processing time, which is the time at which events are abserved in the system.

Skew always exists between Event time and Processing time.

If you care about event times, you cannot analyzer your data solely within the contxt of when they are abserved in your pipeline.

If you care about correctness and are interested in analyzing your data in the context of their event times, you cannot define those temporal boundaries using processing time.

Data processing patterns

Bounded data

Unbounded data — batch

Fixed windows

Sessions

Sessions are typically defined as periods of activity (e.g., for a specific user) terminated by a gap of inactivity.

Unbounded data — streaming

Four groups of dealing data

Time-agnostic

Filtering

Just look at each record as it arrived, and drop the records that we are not interested. We don’t care about the time.

Inner-joins

When joining two unbounded data sources, if you only care about the results of a join when an element from both sources arrive, there’s no temporal element to the logic.

Approximation algorithms

Windowing by processing time

Windowing by event time