Many slow Spark jobs aren’t slow because of bad business logic, but because data is in the wrong shape at the wrong time.
repartition() and coalesce() both change the number of partitions, but they solve different problems and have very different runtime behavior.
Think of Spark performance as one core rule:
Most Spark time is spent moving data, not computing it.
## Repartition — Force balance, pay the cost

### What really happens

`repartition()` always triggers a full shuffle. All records are:

- Serialized
- Sent across the network
- Redistributed evenly across partitions

Spark tries hard to balance partition sizes.
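The even redistribution described above can be sketched outside Spark. A minimal Python simulation, assuming repartition's default round-robin assignment of records (`simulate_repartition` is an illustrative stand-in, not a Spark API):

```python
from itertools import chain

def simulate_repartition(partitions, n):
    """Pool every record and deal them out round-robin, the way
    repartition(n) evens out partition sizes after its full shuffle."""
    out = [[] for _ in range(n)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(record)
    return out

skewed_input = [list(range(90)), list(range(5)), list(range(5))]  # 90/5/5 split
balanced = simulate_repartition(skewed_input, 4)
print([len(p) for p in balanced])  # [25, 25, 25, 25]
```

Every record moved, which is the cost; every partition ended up the same size, which is the payoff.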
### Why it’s expensive

- Network I/O
- Disk spills during shuffle
- CPU cost of serialization
- Garbage collection pressure
### When repartition is the correct choice

Use it when correctness or scalability depends on balance:

✔ Data skew exists
✔ Wide transformations are coming (joins, aggregations, groupBy)
✔ You need parallelism for heavy compute
✔ The upstream source has few partitions (JDBC, S3, a single file)

```scala
df.repartition(200)
```
### Corner cases you must know

#### 1️⃣ Repartition can fix silent skew

Skew often doesn’t show up in row counts, but it does show up in:

- Join stages
- Long-running single tasks
- Executor OOMs

repartition() forces redistribution and can turn a 40-minute job into a 5-minute one.
#### 2️⃣ repartition(col) is still a shuffle

```scala
df.repartition($"user_id")
```

This:

- Hashes on user_id
- Sends data across the cluster
- Can still be skewed if the key distribution is skewed

If one key accounts for 40% of the data, one partition becomes a hotspot.
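Why a dominant key survives a keyed repartition can be seen with a tiny hash-partitioning simulation in plain Python (no Spark; the 40% hot key and the `hash_partition_sizes` helper are illustrative):

```python
from collections import Counter

def hash_partition_sizes(keys, n):
    """Each record lands in hash(key) % n -- the essence of repartition(col)."""
    counts = Counter(hash(k) % n for k in keys)
    return [counts.get(i, 0) for i in range(n)]

# Hypothetical workload: one user_id owns 40% of all records.
keys = ["hot_user"] * 40_000 + [f"user_{i}" for i in range(60_000)]
sizes = hash_partition_sizes(keys, 8)
# Every "hot_user" record hashes to the same value, so at least 40,000
# records pile into a single partition: a guaranteed hotspot.
print(max(sizes) >= 40_000)  # True
```

Hashing can only separate *different* keys; identical keys always travel together.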
#### 3️⃣ Too many repartitions kill performance

Doing this repeatedly:

```scala
df.repartition(200)
  .filter(...)
  .repartition(300)
  .join(...)
```

= multiple full shuffles → death by a thousand cuts

Rule:

➡ Repartition once, intentionally, near the compute boundary.
## Coalesce — Reduce work, avoid shuffle

### What really happens

`coalesce()` merges existing partitions:

- No full shuffle (unless you explicitly ask for one)
- Spark does not rebalance data

```scala
df.coalesce(10)
```

Partitions are merged as-is, keeping the original data locality.
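"Merged as-is" can be simulated the same way. A rough sketch that simply concatenates whole partitions into buckets (Spark's real coalescer also weighs executor locality when choosing which partitions to group):

```python
def simulate_coalesce(partitions, n):
    """Concatenate existing partitions into n buckets without touching
    individual records -- no shuffle, no rebalancing."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

even_input = [list(range(10)) for _ in range(6)]  # six 10-record partitions
merged = simulate_coalesce(even_input, 3)
print([len(p) for p in merged])  # [20, 20, 20]
```

With evenly sized inputs the merged output is also even, which is why coalesce looks harmless in small tests.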
### Why it’s cheap

- No network shuffle
- Minimal serialization
- Mostly metadata-level changes
### When coalesce shines

Use it when the data volume has already been reduced:

✔ After a heavy filter()
✔ After distinct()
✔ Before writing output
✔ To reduce small files
✔ When compute is already done

```scala
df
  .filter(...)
  .coalesce(20)
  .write.parquet(...)
```
### The dangerous corner cases of coalesce

#### ⚠️ 1️⃣ Uneven partitions (the most common bug)

If the input partitions are skewed:

```scala
df.coalesce(5)
```

Result:

- Partition 1 → 80% of the data
- The other partitions → almost empty
- One task runs forever
- CPUs sit idle

This looks like “Spark is slow” — but it’s partition imbalance.
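The imbalance is easy to reproduce in a toy simulation, assuming coalesce only ever concatenates whole input partitions (which is exactly why it cannot split the big one):

```python
def simulate_coalesce(partitions, n):
    # Concatenate whole input partitions into n buckets; records never move
    # individually, so a giant input partition stays giant.
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

skewed = [list(range(80)), [1], [2], [3], [4], [5]]  # one 80-record partition
sizes = [len(p) for p in simulate_coalesce(skewed, 3)]
print(sizes)  # [81, 2, 2] -- one bucket inherits the 80-record partition
```

Whichever bucket receives the oversized partition becomes the straggler task.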
#### ⚠️ 2️⃣ Coalesce before heavy compute = disaster

```scala
df.coalesce(10)
  .groupBy(...)
  .agg(...)
```

You just:

- Reduced parallelism
- Forced massive compute into a few tasks
- Increased memory pressure
- Risked executor OOMs

Never coalesce before joins or aggregations.
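A toy model of why this hurts: a stage's wall-clock time is roughly proportional to its largest partition, so cutting the task count before heavy work multiplies the per-task load (the numbers are illustrative):

```python
def round_robin(records, n):
    # Spread records evenly over n partitions.
    out = [[] for _ in range(n)]
    for i, r in enumerate(records):
        out[i % n].append(r)
    return out

records = list(range(1_000))
# Rough proxy for stage duration: work done by the largest partition.
before = max(len(p) for p in round_robin(records, 100))  # 10 records/task
after = max(len(p) for p in round_robin(records, 10))    # 100 records/task
print(before, after)  # dropping 100 -> 10 tasks makes each task 10x heavier
```

The same records now sit behind a tenth of the CPUs, and memory per task grows accordingly.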
#### ⚠️ 3️⃣ Coalesce doesn’t fix skew

If the data is skewed:

- coalesce() preserves skew
- It cannot redistribute records
- It only reduces the task count

Using coalesce to “fix performance” often makes skew worse.
#### ⚠️ 4️⃣ coalesce(n, shuffle = true) ≠ default coalesce

The shuffle flag only exists on the RDD API; Dataset.coalesce takes just a partition count:

```scala
df.rdd.coalesce(50, shuffle = true)
```

This:

- Triggers a full shuffle (on RDDs, repartition(n) is literally coalesce(n, shuffle = true))
- Drops you down to the RDD API, losing the DataFrame optimizer along the way

If you want a shuffle on a DataFrame, use repartition() explicitly.
## The production-safe pattern (battle-tested)

### ✅ The pattern that works

```scala
df
  .repartition(optimalParallelism) // before heavy compute
  .transformations(...)
  .coalesce(targetFileCount)       // before write
```
### Why this works

- Compute runs on balanced partitions
- CPU utilization stays high
- The shuffle cost is paid once
- Output files are controlled

One caveat: coalesce is a narrow transformation. If the transformations between the repartition and the coalesce are all narrow, the coalesce pulls the whole post-shuffle stage down to targetFileCount tasks, undoing the repartition. The pattern pays off when a wide operation (a join or aggregation) sits in between.
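Putting the two simulated operations together shows the shape of the pattern: balanced partitions for the compute, then a cheap merge down to the target file count (a sketch of the mechanics, not Spark code):

```python
from itertools import chain

def simulate_repartition(parts, n):
    # Full-shuffle stand-in: deal every record out round-robin.
    out = [[] for _ in range(n)]
    for i, rec in enumerate(chain.from_iterable(parts)):
        out[i % n].append(rec)
    return out

def simulate_coalesce(parts, n):
    # Shuffle-free stand-in: concatenate whole partitions into n buckets.
    out = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i % n].extend(part)
    return out

source = [list(range(95)), [1], [2], [3]]   # skewed upstream source
computed = simulate_repartition(source, 8)  # balanced for heavy compute
written = simulate_coalesce(computed, 2)    # two output "files", no shuffle
print([len(p) for p in computed])  # sizes differ by at most 1
print([len(p) for p in written])   # [49, 49]
```

The expensive rebalance happens once, up front; the cheap merge happens once, at the end.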
## How to decide quickly (mental checklist)
| Situation | Correct choice |
|---|---|
| Join / groupBy / heavy compute | repartition() |
| Skew suspected | repartition() |
| After filter/distinct | coalesce() |
| Before write | coalesce() |
| Reduce small files | coalesce() |
| Fix performance without shuffle | ❌ (usually impossible) |
## Final production truth

Repartition buys correctness and parallelism at the cost of a shuffle.

Coalesce buys efficiency by avoiding the shuffle, but it assumes the data is already balanced.