This session explores the integrated use of Apache Toree, YuniKorn, Spark, and Airflow to build efficient, scalable data pipelines. We will start with Apache Toree, a Jupyter kernel that provides an interactive Spark environment inside Jupyter Notebook. Next, we'll cover Apache YuniKorn, a resource scheduler that manages and schedules compute resources for these workloads, keeping cluster utilization high. At the core of the talk, we'll examine Apache Spark's role in large-scale data processing and how it integrates with both Toree and YuniKorn. Finally, we'll demonstrate how Apache Airflow orchestrates the complete workflow, managing task dependencies and providing an end-to-end processing solution. Attendees will leave knowing how to leverage these Apache projects together for optimized data processing.