What is the ForkJoinPool and how does work-stealing work?

Divide-and-conquer parallelism: the work-stealing ForkJoinPool, RecursiveTask vs RecursiveAction, the common pool trap, and when parallel streams actually help.

Cracked Java

ForkJoinPool is a thread pool tuned for divide-and-conquer workloads where tasks recursively spawn subtasks. Its defining feature is work-stealing: each worker has a private deque, and idle workers steal tasks from busy ones to keep all cores saturated.

Per-worker deques

A standard ThreadPoolExecutor has a single shared queue — every thread contends on it. ForkJoinPool gives each worker thread its own double-ended queue. This is the key to its scalability: most task scheduling touches only the local deque, with no cross-thread synchronization.

When a task calls fork(), the subtask is pushed onto the head of the current worker's own deque. The worker also pops from the head — so its own work is LIFO. LIFO is deliberate: the most recently forked subtask is the one most likely still hot in cache, and in recursive splitting it's usually the smaller, deeper piece.

Stealing from the tail

When a worker's deque is empty, it doesn't sit idle. It picks another worker and steals from the tail of that worker's deque — FIFO from the thief's perspective.

// Inside a RecursiveTask's compute():
if (hi - lo <= THRESHOLD) {
    return computeDirectly(lo, hi);
}
int mid = (lo + hi) >>> 1;
SumTask left  = new SumTask(arr, lo, mid);
SumTask right = new SumTask(arr, mid, hi);
left.fork();                 // pushed to THIS worker's deque head
int rightResult = right.compute();  // run inline, no thread handoff
int leftResult  = left.join();      // help run / wait for the forked half
return leftResult + rightResult;

Stealing the oldest (tail) task is smart: in recursive decomposition, the oldest unstarted task represents the largest remaining chunk, so one steal hands the thief a big slice of work and avoids frequent re-stealing.

Why two ends matter

Owner and thieves operate on opposite ends, so they rarely collide — contention only occurs near an almost-empty deque. The result is near-linear scaling for balanced CPU-bound work, with automatic load-balancing when subtasks turn out uneven.