Many slow Spark jobs aren’t slow because of bad business logic, but because data is in the wrong shape at the wrong time.

repartition() and coalesce() both change the number of partitions, but they solve different problems and have very different runtime behavior.

Think of Spark performance as one core rule:

Most Spark time is spent moving data, not computing it.


Repartition — Force balance, pay the cost

What really happens

df.repartition(n) triggers a full shuffle: every row is serialized, pushed across the network, and assigned to one of n new partitions (round-robin by default, or by hash when you pass columns). The output is n roughly equal partitions, regardless of how the input was laid out.

Why it’s expensive

A full shuffle is the single most expensive thing Spark does: it pays for serialization, shuffle files on disk, network transfer, and a new stage boundary. Every byte of the dataset moves.

When repartition is the correct choice

Use it when correctness or scalability depends on balance:

Data skew exists
A wide transformation is coming (join, aggregation, groupBy)
You need parallelism for heavy compute
The upstream source has few partitions (JDBC, S3, a single file)

df.repartition(200)

Corner cases you must know

1️⃣ Repartition can fix silent skew

Skew often doesn’t show in row counts, but shows in:

A handful of straggler tasks running far longer than the rest
Executors spilling to disk while others sit idle
Stage duration dominated by one slow task in the Spark UI

repartition() forces redistribution and can turn a 40-minute job into 5 minutes.
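A quick way to confirm this kind of silent skew is to count rows per partition. A minimal sketch, assuming df is your DataFrame (the RDD round-trip is fine for a one-off diagnostic):

```scala
// Sketch: count rows per partition to make silent skew visible
val sizes = df.rdd
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()

// A few partitions holding most of the rows => skew
sizes.sortBy(-_._2).take(5).foreach { case (idx, n) =>
  println(s"partition $idx: $n rows")
}
```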


2️⃣ repartition(col) is still a shuffle

df.repartition($"user_id")

This:

Hash-partitions rows by user_id, so equal keys land in the same partition
Still shuffles the entire dataset across the network
Does not guarantee balance: that depends entirely on the key distribution

If one key = 40% of data → one partition becomes a hotspot
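One common mitigation for such a hotspot is key salting. A sketch only — the salt width of 8 is an arbitrary illustration, and it assumes import spark.implicits._ is in scope for the $ syntax:

```scala
import org.apache.spark.sql.functions._

// Sketch: spread a hot user_id across 8 sub-partitions with a random salt
// (8 is illustrative; size it to your skew)
val salted = df
  .withColumn("salt", (rand() * 8).cast("int"))
  .repartition($"user_id", $"salt")  // still a full shuffle, but no single hotspot
```

Downstream aggregations then run per (user_id, salt) first and combine the partial results per user_id in a second, much smaller step.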


3️⃣ Too many repartitions kill performance

Doing this repeatedly:

df.repartition(200)
  .filter(...)
  .repartition(300)
  .join(...)

= multiple full shuffles → death by a thousand cuts

Rule:
➡ Repartition once, intentionally, near the compute boundary
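On Spark 3.x, Adaptive Query Execution can take over much of this sizing decision. A sketch of the relevant settings, assuming spark is an active SparkSession (the values are illustrative, not tuned recommendations):

```scala
// Illustrative settings; tune per workload
spark.conf.set("spark.sql.shuffle.partitions", "200")  // default shuffle width
spark.conf.set("spark.sql.adaptive.enabled", "true")   // AQE (on by default since Spark 3.2)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  // auto-merge small shuffle partitions
```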


Coalesce — Reduce work, avoid shuffle

What really happens

df.coalesce(10)

Partitions are merged as-is, keeping original data locality: coalesce is a narrow transformation, so each of the 10 surviving tasks simply reads several upstream partitions where they already live.


Why it’s cheap

No shuffle, no serialization, no network transfer. Spark only changes the task plan so that fewer tasks each read more of the existing partitions. This is also why plain coalesce can only decrease the partition count, never increase it.


When coalesce shines

Use it when data volume is already reduced:

✔ After heavy filter()
✔ After distinct()
✔ Before writing output
✔ To reduce small files
✔ When compute is already done

df
  .filter(...)
  .coalesce(20)
  .write.parquet(...)

The dangerous corner cases of coalesce

⚠️ 1️⃣ Uneven partitions (most common bug)

If input partitions are skewed:

df.coalesce(5)

Result:

The 5 output partitions inherit the imbalance: one may hold most of the data
One long straggler task next to four short ones, then one giant output file next to four tiny ones

This looks like “Spark is slow” — but it’s partition imbalance.
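When balanced output files matter more than avoiding the shuffle, paying for one repartition before the write is the usual fix. A sketch (the output path is illustrative):

```scala
// Sketch: one shuffle buys 5 evenly sized output files
df.repartition(5)
  .write
  .mode("overwrite")
  .parquet("s3://bucket/out/")  // illustrative path
```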


⚠️ 2️⃣ Coalesce before heavy compute = disaster

df.coalesce(10)
  .groupBy(...)
  .agg(...)

You just:

Cut parallelism to 10 tasks before the most expensive stage
Forced the whole aggregation onto 10 cores, no matter how large the cluster is

Never coalesce before joins or aggregations.
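The safe ordering is the reverse: aggregate at full parallelism, then shrink only the (now small) result. A sketch, assuming spark.implicits._ is in scope and hypothetical columns key/value and an illustrative output path:

```scala
import org.apache.spark.sql.functions._

// Aggregate at full shuffle parallelism, shrink only the small result
df.groupBy($"key")
  .agg(sum($"value").as("total"))
  .coalesce(10)                  // merge the small result before writing
  .write.parquet("out/totals/")  // illustrative path
```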


⚠️ 3️⃣ Coalesce doesn’t fix skew

If data is skewed:

Coalesce merges whole partitions, so a hot partition stays hot, or gets merged with neighbors and grows even hotter

Using coalesce to “fix performance” often makes skew worse.


⚠️ 4️⃣ coalesce(n, shuffle = true) ≠ default coalesce

The shuffle flag exists only on the RDD API; Dataset/DataFrame coalesce takes no such parameter:

rdd.coalesce(50, shuffle = true)

This:

Performs a full shuffle, exactly like repartition(50)
Throws away the cheap, narrow behavior that makes coalesce attractive in the first place

If you want a shuffle → use repartition explicitly; it says what it does.


The production-safe pattern (battle-tested)

✅ The pattern that works

df
  .repartition(optimalParallelism)  // before heavy compute
  .transformations(...)
  .coalesce(targetFileCount)         // before write

Why this works

The repartition gives the heavy compute full, balanced parallelism
The coalesce cheaply shrinks the already-small result to a sane file count
You pay for exactly one shuffle, at the point where it buys the most


How to decide quickly (mental checklist)

Situation                        → Correct choice
Join / groupBy / heavy compute   → repartition()
Skew suspected                   → repartition()
After filter / distinct          → coalesce()
Before write                     → coalesce()
Reduce small files               → coalesce()
Fix performance without shuffle  → ❌ (usually impossible)

Final production truth

Repartition buys correctness and parallelism at the cost of shuffle.
Coalesce buys efficiency by avoiding shuffle but assumes data is already balanced.
