Addressing out-of-memory errors in batch Dataflow Prime jobs 

While Dataflow is a reliable service that supports pipelines which have run unchanged for years, workers can still run short of memory, leading to out-of-memory (OOM) errors. Today, when a work item fails because of an OOM error, it is retried up to four times. If it still fails, the entire job fails, and all successfully processed work is discarded. This not only wastes processing costs, it also produces no output.

Moreover, resolving OOM errors often involves relaunching the job with increased memory capacity, which can be a time-consuming and costly trial-and-error process. 

Google developed vertical autoscaling to address the challenges of OOM errors in Dataflow Prime jobs, so you can focus on your application and business logic. We launched vertical autoscaling for streaming pipelines in August 2022, and we're excited to announce the general availability (GA) of vertical autoscaling for batch Dataflow Prime jobs.

With vertical autoscaling for batch Dataflow Prime, the service monitors OOM events and memory usage over time and automatically scales up worker memory after four OOM errors, preventing the job from failing. Your batch Dataflow Prime jobs become resilient to memory errors without any manual intervention.
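
For reference, here's a minimal sketch of how you might enable Dataflow Prime, and with it vertical autoscaling, on a batch pipeline using the Apache Beam Python SDK. The project ID, bucket paths, and region below are placeholder values; the piece that matters is the enable_prime service option, which turns on Dataflow Prime for the job.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket, and region -- substitute your own values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    # Enabling Dataflow Prime activates vertical autoscaling, which
    # scales up worker memory automatically after OOM errors.
    dataflow_service_options=["enable_prime"],
)

# A simple batch pipeline: count words in text files.
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
    )
```

A Java pipeline would pass the equivalent --dataflowServiceOptions=enable_prime option on the command line. No autoscaling-specific code is needed in the pipeline itself; once Prime is enabled, memory upscaling happens at the service level.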

By using vertical autoscaling, you can reduce the risk of job failures due to memory errors and improve the overall reliability and efficiency of your Dataflow Prime jobs.