Many slow Spark jobs aren’t slow because of bad business logic, but because data is in the wrong shape at the wrong time.

repartition() and coalesce() both change the number of partitions, but they solve different problems and have very different runtime behavior.

Think of Spark performance as one core rule:

Most Spark time is spent moving data, not computing it.


Repartition — Force balance, pay the cost

What really happens

df.repartition(n) triggers a full shuffle: every row is serialized, pushed across the network, and assigned to one of n new partitions (round-robin by default, or by hash when you pass columns). The output is n roughly equal partitions, regardless of how the input was laid out.

Why it’s expensive

A full shuffle is the single most expensive thing Spark does: it pays for serialization, shuffle files on disk, network transfer, and a new stage boundary. Every byte of the dataset moves.

When repartition is the correct choice

Use it when correctness or scalability depends on balance:

Data skew exists
A wide transformation is coming (join, aggregation, groupBy)
You need parallelism for heavy compute
The upstream source has few partitions (JDBC, S3, a single file)

df.repartition(200)

Corner cases you must know

1️⃣ Repartition can fix silent skew

Skew often doesn’t show in row counts, but shows in:

A handful of straggler tasks running far longer than the rest
Executors spilling to disk while others sit idle
Stage duration dominated by one slow task in the Spark UI

repartition() forces redistribution and can turn a 40-minute job into 5 minutes.
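A quick way to confirm this kind of silent skew is to count rows per partition. A minimal sketch, assuming df is your DataFrame (the RDD round-trip is fine for a one-off diagnostic):

```scala
// Sketch: count rows per partition to make silent skew visible
val sizes = df.rdd
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()

// A few partitions holding most of the rows => skew
sizes.sortBy(-_._2).take(5).foreach { case (idx, n) =>
  println(s"partition $idx: $n rows")
}
```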


2️⃣ repartition(col) is still a shuffle

df.repartition($"user_id")

This:

Hash-partitions rows by user_id, so equal keys land in the same partition
Still shuffles the entire dataset across the network
Does not guarantee balance: that depends entirely on the key distribution

If one key = 40% of data → one partition becomes a hotspot
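One common mitigation for such a hotspot is key salting. A sketch only — the salt width of 8 is an arbitrary illustration, and it assumes import spark.implicits._ is in scope for the $ syntax:

```scala
import org.apache.spark.sql.functions._

// Sketch: spread a hot user_id across 8 sub-partitions with a random salt
// (8 is illustrative; size it to your skew)
val salted = df
  .withColumn("salt", (rand() * 8).cast("int"))
  .repartition($"user_id", $"salt")  // still a full shuffle, but no single hotspot
```

Downstream aggregations then run per (user_id, salt) first and combine the partial results per user_id in a second, much smaller step.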


3️⃣ Too many repartitions kill performance

Doing this repeatedly:

df.repartition(200)
  .filter(...)
  .repartition(300)
  .join(...)

= multiple full shuffles → death by a thousand cuts

Rule:
➡ Repartition once, intentionally, near the compute boundary
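On Spark 3.x, Adaptive Query Execution can take over much of this sizing decision. A sketch of the relevant settings, assuming spark is an active SparkSession (the values are illustrative, not tuned recommendations):

```scala
// Illustrative settings; tune per workload
spark.conf.set("spark.sql.shuffle.partitions", "200")  // default shuffle width
spark.conf.set("spark.sql.adaptive.enabled", "true")   // AQE (on by default since Spark 3.2)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  // auto-merge small shuffle partitions
```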


Coalesce — Reduce work, avoid shuffle

What really happens

df.coalesce(10)

Partitions are merged as-is, keeping original data locality: coalesce is a narrow transformation, so each of the 10 surviving tasks simply reads several upstream partitions where they already live.


Why it’s cheap

No shuffle, no serialization, no network transfer. Spark only changes the task plan so that fewer tasks each read more of the existing partitions. This is also why plain coalesce can only decrease the partition count, never increase it.


When coalesce shines

Use it when data volume is already reduced:

✔ After heavy filter()
✔ After distinct()
✔ Before writing output
✔ To reduce small files
✔ When compute is already done

df
  .filter(...)
  .coalesce(20)
  .write.parquet(...)

The dangerous corner cases of coalesce

⚠️ 1️⃣ Uneven partitions (most common bug)

If input partitions are skewed:

df.coalesce(5)

Result:

The 5 output partitions inherit the imbalance: one may hold most of the data
One long straggler task next to four short ones, then one giant output file next to four tiny ones

This looks like “Spark is slow” — but it’s partition imbalance.
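When balanced output files matter more than avoiding the shuffle, paying for one repartition before the write is the usual fix. A sketch (the output path is illustrative):

```scala
// Sketch: one shuffle buys 5 evenly sized output files
df.repartition(5)
  .write
  .mode("overwrite")
  .parquet("s3://bucket/out/")  // illustrative path
```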


⚠️ 2️⃣ Coalesce before heavy compute = disaster

df.coalesce(10)
  .groupBy(...)
  .agg(...)

You just:

Cut parallelism to 10 tasks before the most expensive stage
Forced the whole aggregation onto 10 cores, no matter how large the cluster is

Never coalesce before joins or aggregations.
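The safe ordering is the reverse: aggregate at full parallelism, then shrink only the (now small) result. A sketch, assuming spark.implicits._ is in scope and hypothetical columns key/value and an illustrative output path:

```scala
import org.apache.spark.sql.functions._

// Aggregate at full shuffle parallelism, shrink only the small result
df.groupBy($"key")
  .agg(sum($"value").as("total"))
  .coalesce(10)                  // merge the small result before writing
  .write.parquet("out/totals/")  // illustrative path
```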


⚠️ 3️⃣ Coalesce doesn’t fix skew

If data is skewed:

Coalesce merges whole partitions, so a hot partition stays hot, or gets merged with neighbors and grows even hotter

Using coalesce to “fix performance” often makes skew worse.


⚠️ 4️⃣ coalesce(n, shuffle = true) ≠ default coalesce

The shuffle flag exists only on the RDD API; Dataset/DataFrame coalesce takes no such parameter:

rdd.coalesce(50, shuffle = true)

This:

Performs a full shuffle, exactly like repartition(50)
Throws away the cheap, narrow behavior that makes coalesce attractive in the first place

If you want a shuffle → use repartition explicitly; it says what it does.


The production-safe pattern (battle-tested)

✅ The pattern that works

df
  .repartition(optimalParallelism)  // before heavy compute
  .transformations(...)
  .coalesce(targetFileCount)         // before write

Why this works

The repartition gives the heavy compute full, balanced parallelism
The coalesce cheaply shrinks the already-small result to a sane file count
You pay for exactly one shuffle, at the point where it buys the most


How to decide quickly (mental checklist)

Situation                        → Correct choice
Join / groupBy / heavy compute   → repartition()
Skew suspected                   → repartition()
After filter / distinct          → coalesce()
Before write                     → coalesce()
Reduce small files               → coalesce()
Fix performance without shuffle  → ❌ (usually impossible)

Final production truth

Repartition buys correctness and parallelism at the cost of shuffle.
Coalesce buys efficiency by avoiding shuffle but assumes data is already balanced.
