The Fork/Join framework is Java's answer to divide-and-conquer parallelism: split a big task into smaller subtasks, run them across a pool of worker threads, and combine the results. Its engine is the ForkJoinPool, which uses work-stealing to keep cores busy even when subtasks are unevenly sized. Parallel streams are the high-level, declarative face of the same machinery — under the hood, stream().parallel() schedules its work on a ForkJoinPool.
Work-stealing, not work-sharing
A classic thread pool has one shared queue that every worker pulls from — a contention point. A ForkJoinPool instead gives each worker its own double-ended queue (deque). A worker pushes newly forked subtasks onto the head of its own deque and pops from the head too (LIFO, which is cache-friendly and keeps the most recently split work hot). When a worker runs dry, it steals from the tail of another worker's deque (FIFO), grabbing the oldest, typically largest, unstarted chunk. This minimizes contention and balances load automatically.
Worker A deque: [head] t5 t4 t3 [tail]
^own push/pop ^steal target
Worker B deque: [head] t9 t8 [tail] (B is busy)
Worker C: empty ---> steals t3 from A's tailRecursiveTask, RecursiveAction, and the common pool
You model work as a RecursiveTask<V> (returns a value) or RecursiveAction (no result). The idiom inside compute(): if the problem is below a threshold, solve it directly; otherwise split, fork() one half (push it for async execution), compute() the other half inline, then join() the forked half. Forking both and joining both wastes a thread — always compute one side directly.
Parallel streams: the same engine, declarative
Parallel streams help when data is large, CPU-bound, and cheaply splittable (arrays, ArrayList, IntStream.range). They hurt on small datasets, poorly-splittable sources (LinkedList, Iterator-backed), I/O-bound bodies, or ordered/stateful operations where merge and ordering costs eat the gains. The default rule of thumb: measure before you parallelize.