Data Engineering

Handle data skew in Spark joins

1. What is a Spark Join?
When Spark joins two tables, it divides the work across many machines, which is how Spark runs fast.

2. What is Data Skew?
Data skew means some keys have much more data than others. For example, you are joining data on country:

Country    Rows
India      5,000
UK         5,000
Canada     5,000
[…]
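To make the effect of skew concrete, here is a minimal pure-Python sketch (not Spark code) that counts how many rows land in each partition when rows are hash-partitioned by key. Everything in it is an assumption for illustration: the "US" hot key and its row count are invented, `partition_of` and `rows_per_partition` are hypothetical helpers, and crc32 only stands in for a real hash partitioner. It also models key salting, one common mitigation, in which a hot key is split into several salted keys so its rows spread over more partitions. In an actual join the other table would need its matching rows replicated once per salt value.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    # crc32 is a stable stand-in for a hash partitioner
    # (Python's built-in hash() of a str changes between processes).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(row_counts, num_partitions, salts=None):
    """Return the row count that lands in each partition.

    salts maps a key to the number of salt values it is split into;
    keys not listed are left unsalted.
    """
    salts = salts or {}
    load = Counter()
    for key, n_rows in row_counts.items():
        n_salts = salts.get(key, 1)
        for s in range(n_salts):
            salted_key = f"{key}_{s}" if n_salts > 1 else key
            # Spread the key's rows as evenly as possible over its salts.
            share = n_rows // n_salts + (1 if s < n_rows % n_salts else 0)
            load[partition_of(salted_key, num_partitions)] += share
    return [load[p] for p in range(num_partitions)]


# Hypothetical row counts: the first three match the country example,
# "US" is an invented hot key added to make the skew visible.
row_counts = {"India": 5_000, "UK": 5_000, "Canada": 5_000, "US": 1_000_000}

plain = rows_per_partition(row_counts, num_partitions=8)
salted = rows_per_partition(row_counts, num_partitions=8, salts={"US": 16})

print("unsalted:", plain)
print("salted:  ", salted)
```

Without salting, every row of the hot key ends up in a single partition, so one task does almost all the work while the others finish quickly. With salting, the same rows are shared among several partitions.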

Repartition vs Coalesce in Spark

Many slow Spark jobs aren’t slow because of bad business logic, but because data is in the wrong shape at the wrong time. repartition() and coalesce() both change the number of partitions, but they solve different problems and have very different runtime behavior. Think of Spark performance as one core rule: Most Spark time is […]
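The difference in runtime behavior can be pictured with a toy model in plain Python (not Spark code). It assumes partitions are simple lists of rows; the two functions below are hypothetical stand-ins that borrow the Spark method names. The model reflects two properties of the real methods: coalesce() merges existing partitions without a full shuffle and cannot increase the partition count, while repartition() performs a full shuffle and can produce any number of roughly even partitions. Real Spark chooses which partitions to merge based on data locality; the sketch simply groups them by index.

```python
def coalesce(partitions, n):
    """Merge whole partitions into at most n groups.

    Rows never leave the group their partition is assigned to, so no
    full shuffle is modeled, and existing imbalance is preserved.
    """
    n = min(n, len(partitions))  # coalesce cannot add partitions
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row round-robin over n new partitions.

    Models a full shuffle: all rows move, and the result is balanced.
    """
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Hypothetical input: one large partition and three tiny ones.
parts = [list(range(100)), [100], [101], [102]]

print("coalesce(2):   ", [len(p) for p in coalesce(parts, 2)])
print("repartition(2):", [len(p) for p in repartition(parts, 2)])
print("coalesce(8):   ", len(coalesce(parts, 8)), "partitions")
print("repartition(8):", len(repartition(parts, 8)), "partitions")
```

In this model coalesce is cheap but keeps the lopsided sizes, whereas repartition pays for moving every row and returns evenly sized partitions.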

Avro vs Parquet vs Iceberg – Detailed Comparison

This post lays out the differences between Avro, Parquet, and Iceberg in a structured comparison-table format. It covers technology aspects, properties, and real-world use cases to help architects, engineers, and decision-makers choose the right technology for their data platform.

High-Level Classification

Technology  Category                               What It Solves
Avro        Serialization / Row-based file format  Efficient data exchange & streaming with schema enforcement
[…]
