Member-only story
Handling Late Data in Apache Flink: Bounded Intervals vs. Allowed Lateness
In real-time data processing systems, handling late-arriving data is a common challenge. This is particularly true in Internet of Things (IoT) applications, where event times can significantly differ from ingestion times due to network delays, sensor synchronisation issues, or intermittent connectivity. For instance, when aggregating data from multiple sensors, each sensor might report its readings at different times, making it crucial to wait for all data to arrive before making unified decisions.
Apache Flink provides robust mechanisms to handle such late data, ensuring accurate and timely computations. Two primary methods to address lateness in a Flink application are:
- Using Bounded Watermarks (Bounded Intervals)
- Using Allowed Lateness in the Operator
This article explores these two methods, their benefits, and provides examples using PyFlink with tumbling windows.
Understanding Lateness in Flink
Before diving into the methods, it’s essential to understand how Flink processes time:
- Event Time: The time when an event actually occurred.
- Processing Time: The system time when an event is processed by Flink.
- Watermarks: Special timestamps that signal the progress of event time in the data stream. They help handle out-of-order events.