1. Functional requirements
- Register a job: one-shot (run at time
T) or recurring (cron expression + timezone). - Reliably trigger each due job by invoking a handler (HTTP webhook, message on a queue, or in-process task).
- Retry failed jobs with a backoff policy and a max-attempts cap; route exhausted jobs to a dead-letter queue (DLQ).
- At-least-once execution guaranteed; idempotency support so handlers can dedupe.
- Query job status/history; pause/resume/cancel a job.
- Explicit missed-run policy per job (skip, fire-once-on-recovery, or backfill).
- Monitoring & alerting on failure, on missed SLA (job didn't run within its window), and on stuck/long-running jobs.
2. Non-functional requirements
- Scale: 10M registered jobs; 100K job triggers/minute at peak (≈ 1,600/s).
- Timing accuracy: a job fires within a few seconds of its scheduled time (p99 < 5 s) — this is a scheduler, not a hard-real-time system.
- Durability: a registered job must never be lost; a fired job must never be silently dropped.
- Availability: 99.95%; the scheduler surviving a node failure is the whole point.
- No duplicate side-effects for idempotent handlers; bounded duplicates otherwise.
3. Capacity estimation
- Triggers: 100K/min ≈ ~1,600/s (×3 burst ≈ 5K/s) — modest; the hard part is correctness, not raw QPS.
- Scan load: with a 1-second tick and a B-tree index on
next_run_at, each tick reads only the due slice (jobs wherenext_run_at <= now), typically a few thousand rows — an index range scan, not a full scan. - Storage: 10M jobs × ~1 KB definition ≈ ~10 GB (fits one node, replicated). Execution history at 100K/min × 60 × 24 ≈ 144M rows/day × ~300 B ≈ ~43 GB/day → history goes to a partitioned/TTL'd table or columnar store, never the hot path.
- Workers: if a job averages 200 ms, then 5K/s ÷ (1/0.2) = ~1,000 concurrent slots; size the worker pool and queue accordingly.
4. High-level architecture
5. API design
POST /api/v1/jobs
Body: { "name":"daily-report", "schedule":"0 2 * * *", "timezone":"UTC",
"target":{"type":"http","url":"https://svc/run","method":"POST"},
"retry":{"maxAttempts":5,"backoff":"exponential","baseMs":1000},
"missedRunPolicy":"fire_once" }
201: { "jobId":"job_abc", "nextRunAt":"2026-06-08T02:00:00Z" }
GET /api/v1/jobs/{jobId} -> definition + nextRunAt + status
GET /api/v1/jobs/{jobId}/runs -> execution history (paginated)
POST /api/v1/jobs/{jobId}/pause -> stop scheduling future runs
POST /api/v1/jobs/{jobId}/resume
DELETE /api/v1/jobs/{jobId}
POST /api/v1/jobs/{jobId}/trigger -> manual ad-hoc run (idempotency-key header)
Handler invocations carry a unique runId used by the handler as the idempotency key.
6. Data model
CREATE TABLE job (
job_id UUID PRIMARY KEY,
name TEXT NOT NULL,
cron_expr TEXT, -- null for one-shot
timezone TEXT NOT NULL DEFAULT 'UTC',
target JSONB NOT NULL, -- http/queue handler spec
retry_policy JSONB NOT NULL,
missed_policy TEXT NOT NULL, -- skip | fire_once | backfill
status TEXT NOT NULL, -- active | paused | cancelled
next_run_at TIMESTAMPTZ NOT NULL,
version BIGINT NOT NULL DEFAULT 0
);
CREATE INDEX idx_due ON job (next_run_at) WHERE status = 'active'; -- the due-index
CREATE TABLE job_run (
run_id UUID PRIMARY KEY, -- == idempotency key for the handler
job_id UUID NOT NULL REFERENCES job(job_id),
scheduled_for TIMESTAMPTZ NOT NULL,
attempt INT NOT NULL,
state TEXT NOT NULL, -- claimed | running | succeeded | failed | dead
worker_id TEXT,
lease_until TIMESTAMPTZ, -- fencing: reclaim if exceeded
finished_at TIMESTAMPTZ
);
-- job_run is range-partitioned by scheduled_for and TTL'd; archive to OLAP.
7. Detailed component design
Claiming due jobs. Two designs:
- Leader + partitions (FAANG answer). Schedulers register in Zookeeper/etcd; a leader (or a consistent-hash assignment) owns a subset of the job-id space so two nodes never claim the same job. Each owner ticks every second, range-scans its slice of
idx_due, and enqueues due jobs. Leader election (Raft/ZAB) plus heartbeats handle node death; on a lease expiry the partition is reassigned. - Leaderless
SKIP LOCKED(EU/regional answer). No leader at all: every scheduler runsSELECT ... FROM job WHERE next_run_at <= now() AND status='active' FOR UPDATE SKIP LOCKED LIMIT N, claims disjoint rows atomically, enqueues them, and advancesnext_run_atto the next cron occurrence in the same transaction. The DB is the distributed lock. Simple, correct, scales to thousands/s.
Execution & leases. Workers pull a job_run from the queue, set state='running' with a lease (lease_until = now + timeout), invoke the handler, and on success mark succeeded. A reaper scans for running rows whose lease_until has passed (worker died) and re-queues them — this is the recovery mechanism, and the reason handlers must be idempotent.
Idempotency / exactly-once. True exactly-once is impossible end-to-end; we deliver at-least-once and make effects idempotent: the handler stores processed runIds (or uses a unique constraint) and no-ops on replay. A fencing token (monotonic attempt/lease epoch) lets a downstream system reject a write from a stale, presumed-dead worker that came back to life.
Missed-run handling. Computed from scheduled_for vs now when a scheduler recovers: skip advances next_run_at past the gap; fire_once runs the most recent missed occurrence once; backfill enqueues every missed tick (dangerous for frequent jobs — cap it).
8. Scaling considerations
- Partition the job space by hash(job_id) across N schedulers; each owns a shard, so scan work and lock contention scale horizontally.
- Decouple scan from execution via the queue — bursty fire times are absorbed by Kafka/SQS; workers autoscale on queue depth independently of the scheduler.
- Time-bucketing for huge job counts: maintain per-minute (or per-second) buckets / a hashed timing wheel so a tick reads only the current bucket instead of scanning a global index.
- History offload to partitioned/TTL'd tables or an OLAP store keeps the hot
job/job_runtables small.
9. Trade-offs and alternatives
- At-least-once vs exactly-once. At-least-once + idempotent handlers is the pragmatic, correct default; chasing exactly-once delivery is a rabbit hole — push uniqueness to the handler instead.
- Leaderless
SKIP LOCKEDvs leader election.SKIP LOCKEDis dead simple and needs no Zookeeper, but couples you to one DB and its write throughput; leader+partitions scales further and survives DB sharding but adds ZK/etcd operational weight. - Build vs buy. Quartz (clustered, JDBC job store) or a managed/workflow engine (Temporal, AWS EventBridge Scheduler, Airflow for DAGs) covers most needs. Say so — a strong answer is knowing when not to build a scheduler.
- DB poll vs timing wheel. Polling a due-index is simple and good to thousands/s; a hierarchical timing wheel gives O(1) insert/expire for very high job counts at the cost of in-memory complexity and durability care.
10. Common follow-up questions
- "The scheduler was down for 10 minutes — what runs?" → driven by
missedRunPolicy; show skip vs fire-once vs backfill. - "A worker hangs forever." → lease + reaper re-queues after
lease_until; cap attempts → DLQ. - "How do you guarantee a payment job runs exactly once?" → at-least-once delivery + idempotency key in the payment service; fencing token rejects stale retries.
- "Two schedulers fired the same job." →
SKIP LOCKED/partition ownership prevents double-claim; even so, idempotency makes it safe. - "How do you alert that a job didn't run?" → a watchdog comparing
next_run_at/scheduled_foragainst now, emitting a missed-SLA metric to Alertmanager. - "Cron across DST / timezones?" → store timezone per job; compute next occurrence with a TZ-aware library; document fold/gap behaviour.