Streaming and Batching Data Processing

tags:

data-engineering
software-engineering
cloud
data-processing
batching
streaming

Introduction to Data Processing Paradigms

What is Data Pipeline? Building a data warehouse pipeline can be complex sometimes. If you are starting in this world, you will soon realize there is no right or wrong way to do it. It always depends on your needs.

Yet, there are a couple of basic processes you should put in place when building a data pipeline to improve its operability and performance. This document will be used to share you with a roadmap that can help as a guide when building a data warehouse pipeline. This roadmap is intended to help people to implement DataOps when building a data warehouse pipeline through a set of processes.

In the roadmap section we talk about five processes that should be implemented to improve your data pipeline operability and performance — Orchestration, Monitoring, Version Control, CI/CD, and Configuration Management. Some of the data warehousing terminology — e.g., data lakes, data warehouses, batching, streaming, ETL, ELT, and so on.

Data Pipeline Roadmap: There are five processes we recommend you should put in place to improve your data pipeline operability and performance. These processes are 1.Orchestration, 2.Monitoring, 3.Version Control, 4.CI/CD, and 5.Configuration Management.

Focussing:

Understanding the difference between streaming and batching data processing.
Overview of common streaming frameworks (e.g., Apache Kafka, Apache Flink) and batch processing frameworks (e.g., Apache Spark, Apache Beam).
Discussing the advantages and disadvantages of each approach.

Choosing Between Stream and Batch Processing

Latency Requirement: Determine the urgency of insights required; real-time needs favor streaming while batch processing offers delayed but comprehensive results.
Data Volume and Frequency: Consider the volume and velocity of incoming data; streaming is ideal for continuous streams, while batch suits larger, accumulated datasets.
Use Case Scenario: Identify if immediate action or continuous monitoring is essential (streaming) or if historical analysis and reporting are the primary goals (batch).
Infrastructure and Resources: Assess your system’s capabilities to handle real-time data processing (streaming) or batch processing’s resource requirements.

Streaming Data Processing

Stream Processing Pipeline

Exploring the characteristics of streaming data and its challenges.
Introduction to stream processing concepts such as windowing, watermarking, and event time processing, late-event handling.
Understanding fault tolerance and exactly-once processing in streaming frameworks.

Building a real-time data pipeline using Apache Kafka and Apache Flink/Beam.

Writing a simple data producer to generate streaming data.

Python function to generate the streaming data to Kafka topic

Creating a basic Spark job to process batch data. This will using Python to consume data from Kafka topic payment_msg.

def produce_to_kafka(topic_name, bootstrap_servers='localhost:9092'):
    producer_conf = {
        'bootstrap.servers': bootstrap_servers,
        'client.id': 'python-producer'
    }

    producer = Producer(producer_conf)

    try:
        while True:
            payment_msg = generate_fake_payment()
            payment_serialized_msg = json.dumps(payment_msg).encode('utf-8')
            producer.produce(topic_name, payment_serialized_msg, callback=delivery_report)
            time.sleep(20)

    except KeyboardInterrupt:
        print("KeyboardInterrupt: Stopping Kafka producer")
    finally:
        producer.flush()
        producer.close()

Note: Make sure you had install Spark on your local machine Implementing windowed operations on streaming data. Handling late events and out-of-order data in a streaming pipeline.

Batch Data Processing

Batch Processing Pipeline

Overview of batch data processing concepts and its applications. Understanding batch processing workflows and data partitioning strategies. Discussing optimization techniques for batch processing jobs.

Developing batch processing jobs using Apache Spark.

Implementing common transformations and actions on batch data.

Optimizing Spark jobs for performance and scalability.


from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("Spark Snowflake Connector") \
    .config("spark.jars", "snowflake-jdbc-3.13.16.jar, \
        spark-snowflake_2.12-2.9.0-spark_3.1.jar") \
    .getOrCreate()

sfOptions = {
    "sfURL" : "<hostname>.snowflakecomputing.",
    "sfUser" : "",
    "sfPassword" : "",
    "sfDatabase" : "",
    "sfSchema" : "",
    "sfWarehouse" : "",
}
jdbc_url = f"jdbc:snowflake://{sfOptions['sfURL']}/"


df = spark.read \
    .format("snowflake") \
    .options(**sfOptions) \
    .option("query", "SELECT COUNT(1) FROM WAREHOUSE.PUBLIC.WEB_EVENTS") \
    .load()

df.show()

Alternatively with SaaS and Low-Code tool, we can use Mage AI for batch processing jobs, it supports batch, stream, cron, orchestrate pipeline

Integrating Streaming and Batch Processing

Exploring hybrid data processing architectures combining streaming and batch processing. Discussing use cases for integrating streaming and batch processing pipelines. Introduction to lambda architecture and kappa architecture.

Lambda architecture

Consists of three layers: batch layer, speed layer, and serving layer12. The batch layer is responsible for processing historical data using batch processing techniques, such as MapReduce or Spark. The speed layer is responsible for processing streaming data using stream processing techniques, such as Storm or Flink. The serving layer is responsible for providing a unified view of the data by combining the results from the batch and speed layers12.

Kappa architecture

Simplifies the lambda architecture by removing the batch layer and using only the stream processing layer13. In kappa architecture, all data is treated as a stream, regardless of whether it is historical or real time. The stream processing layer can handle both batch and stream processing tasks by replaying the data from the beginning or from a specific point in time13.

Comparing Lambda and Kappa architectures

	Advantages	Disadvantages
Lambda architecture	high accuracy, high completeness, scalable, fault-tolerant, flexible, adaptable	complexity, cost, required de-coupling, latency, inconsistency, take to to sync batch and speed data
Kappa architecture	simple, single source of truth, fast, consistent,realtime, near-realtime data, low latency	required expertise, exp with streaming in distributed sys, event ordering and tolerance, not for lake house solution or any batching processing

Designing and implementing a lambda architecture for real-time analytics.

Developing connectors to bridge streaming and batch processing systems.

Testing and monitoring the integrated data processing pipeline.

Is it correct to understand Realtime equally means Streaming?

Common: Real-time and streaming technologies are both used to transmit data continuously over the internet, but they have different characteristics and use cases.

Different:

Aspect	Real-Time	Streaming
Definition	Immediate processing and transmission of data	Continuous transmission of data for playback
Latency	Very low latency, almost immediate	Slightly higher latency due to buffering
Interactivity	High interactivity, suitable for two-way communication	Lower interactivity, typically one-way consumption
Buffering	Minimal to no buffering	Uses buffering to ensure smooth playback
Examples	Video conferencing, online gaming, live broadcasting	Video on demand, music streaming, live streaming services
Use Cases	Telemedicine, financial trading systems, live event coverage	Entertainment, media consumption, educational content delivery

In general, while both technologies are used for continuous data transmission, real-time focuses on immediate processing and interaction with minimal delay, making it suitable for interactive applications. Streaming focuses on delivering a continuous flow of data for consumption, with some buffering to ensure smooth playback, making it ideal for media and content delivery.

Advanced Topics and Project Work

Deep dive into advanced topics such as stateful stream processing, complex event processing, and stream-to-stream joins. Overview of deployment strategies for data processing applications.

Working on a capstone project to apply the concepts learned throughout the bootcamp.

Collaborating in teams to design and implement a data processing solution for a real-world scenario.

Presenting the project to peers and receiving feedback for improvement.

Summary

What is streaming and batching?
What is the upside of streaming vs batching?
Use-cases of batching and streaming
Batching and micro-batch processing
Can you sync the batch and streaming layer and if yes how?
Data reconciliation is done when data had migrated from on-prem to cloud storage
Handling data was changing during the migration
Process Dimension data loading process
Difference between batch and stream processing

Setting up a local development environment with Apache Kafka and Apache Spark.

Using python virtual environment or directly install on the machine, check the requirement, pip install -r requirements.txt

Using docker to create and deploy everything

Writing a simple data producer to generate streaming data, Python function to generate the streaming data to Kafka topic

Payment Producer produces data to Kafka topic Python Payment Producer
userClick Producer produces data to Kafka topic Python User Click Producer

Solid Data Foundations (SDF)