Push Notifications at Scale: Infrastructure for Millions of Users

Sending one push notification is trivial. Sending a million in 30 seconds is an infrastructure challenge that breaks everything not designed for the load. The typical picture: a marketer hits "send to all," and within 10 seconds the database crashes, the queue swells to 40 GB, and APNs starts returning 429 on every other request. A monolithic service that handled 10,000 users just fine simply dies at 1,000,000 — it doesn't degrade, it collapses.

APNs and FCM Architecture on the Server Side

Apple Push Notification Service and Firebase Cloud Messaging are not just "send a JSON." Both systems rely on long-lived TCP connections, which fundamentally shapes how you design the server side.

APNs uses HTTP/2 with multiplexing: a single connection can carry on the order of a thousand concurrent streams (APNs advertises the exact ceiling per connection via the HTTP/2 SETTINGS_MAX_CONCURRENT_STREAMS parameter). Authentication is either via a JWT provider token (refreshed at least once an hour, since APNs rejects tokens issued more than 60 minutes ago, with no certificate storage needed on the server) or a TLS certificate tied to a bundle ID. JWT is the preferred option for large-scale systems because one key works for all apps in a developer account.

FCM is different. The legacy APIs (XMPP and the old HTTP endpoint) were deprecated in June 2023 and shut down in June 2024. The current approach is the HTTP v1 API with OAuth 2.0. FCM HTTP v1 has no batch endpoint: each notification is a separate HTTP request. That's Google's architectural decision, and it immediately dictates your worker pool requirements.
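The per-message request shape is worth seeing once. A sketch of building one HTTP v1 send request, assuming an OAuth access token has already been obtained from the service-account credentials (the project ID and token values are placeholders):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// fcmRequest builds the single-message POST that FCM HTTP v1 expects:
// one request per notification, authorized with a short-lived OAuth token.
func fcmRequest(projectID, accessToken, deviceToken, title, body string) (*http.Request, error) {
	payload := map[string]any{
		"message": map[string]any{
			"token": deviceToken,
			"notification": map[string]string{
				"title": title,
				"body":  body,
			},
		},
	}
	buf, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("https://fcm.googleapis.com/v1/projects/%s/messages:send", projectID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(buf))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+accessToken)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, _ := fcmRequest("demo-project", "ya29.example", "device-token", "Hi", "Promo")
	fmt.Println(req.Method, req.URL.Path)
}
```

Since every recipient costs a full round trip, the only lever for throughput is concurrency, which is exactly why the worker-pool sizing below matters.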

# Push service structure
/push-service
  /workers
    apns_worker.go      # HTTP/2 client with connection pool
    fcm_worker.go       # HTTP/1.1 with OAuth token refresh
    huawei_worker.go    # HMS Push Kit for Chinese devices
  /queue
    consumer.go         # Kafka consumer group
  /token_store
    redis_client.go     # Hot tokens
    pg_client.go        # Persistent storage

Key point for APNs: the connection must be established in advance, not at the moment of sending. Every new TLS handshake costs ~200ms and puts load on APNs, which doesn't tolerate connection storms. If you try to open 500 connections simultaneously, APNs will start dropping them.
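One way to avoid a connection storm is to pace the dials at startup. A minimal sketch of a warm-up helper that opens connections with a fixed gap between handshakes; the dial function is injected, so in production it would wrap the real HTTP/2 TLS dial:

```go
package main

import (
	"fmt"
	"time"
)

// warmUp opens n connections with a fixed delay between dials so the remote
// endpoint never sees a handshake burst. dial is injected to keep the pacing
// logic independent of the transport.
func warmUp[T any](n int, gap time.Duration, dial func() (T, error)) ([]T, error) {
	conns := make([]T, 0, n)
	for i := 0; i < n; i++ {
		c, err := dial()
		if err != nil {
			return conns, err
		}
		conns = append(conns, c)
		if i < n-1 {
			time.Sleep(gap) // pace handshakes instead of storming the endpoint
		}
	}
	return conns, nil
}

func main() {
	dials := 0
	conns, _ := warmUp(8, time.Millisecond, func() (int, error) {
		dials++
		return dials, nil
	})
	fmt.Println("connections:", len(conns))
}
```

With a 100–200 ms gap, even a pool of 50 connections is fully warm in about ten seconds, long before the first broadcast of the day.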

The Fan-Out Problem: From One Event to a Million Deliveries

This is where the real pain begins. You have one event — "send a promo notification to all users" — and you need to turn it into a million individual delivery tasks. The synchronous approach kills the system instantly: one HTTP request hangs until all million recipients are processed. This doesn't work under any circumstances.

The right approach is a two-level fan-out through a message queue. First level: the broadcast service reads user segments from the database page by page and publishes tasks to Kafka in batches of 1,000–5,000 records at a time.
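The first-level publisher is essentially a keyset-pagination loop (WHERE user_id > last_id ORDER BY user_id LIMIT n). A sketch with the database read and the Kafka produce injected as stand-in functions:

```go
package main

import "fmt"

// User is one row from the segment query.
type User struct {
	ID    int64
	Token string
}

// fanOut walks the segment with keyset pagination and hands each page to
// publish. fetch and publish stand in for the real database client and
// Kafka producer.
func fanOut(batch int, fetch func(afterID int64, limit int) []User, publish func([]User) error) (int, error) {
	var lastID int64
	total := 0
	for {
		page := fetch(lastID, batch)
		if len(page) == 0 {
			return total, nil // segment exhausted
		}
		if err := publish(page); err != nil {
			return total, err
		}
		total += len(page)
		lastID = page[len(page)-1].ID // keyset cursor, no OFFSET scans
	}
}

func main() {
	// Fake 10,500-row segment served in keyset order.
	fetch := func(afterID int64, limit int) []User {
		var page []User
		for id := afterID + 1; id <= 10500 && len(page) < limit; id++ {
			page = append(page, User{ID: id})
		}
		return page
	}
	batches := 0
	total, _ := fanOut(1000, fetch, func(p []User) error { batches++; return nil })
	fmt.Println(total, batches)
}
```

Keyset pagination matters here: OFFSET-based paging re-scans skipped rows on every page, which gets quadratically slower across a million-row segment.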

kafka-topics.sh --create \
  --bootstrap-server kafka:9092 \
  --topic push.delivery \
  --partitions 64 \
  --replication-factor 3 \
  --config retention.ms=3600000

kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --group push-workers

64 partitions isn't an arbitrary number — it defines maximum parallelism. More workers than partitions won't help. At 50,000 RPS you'll hit FCM's per-project quota before Kafka becomes the bottleneck. Work from target throughput: delivering a million notifications in 60 seconds means ~16,700 messages per second. Multiply by average send latency (~80 ms for FCM) and you get the minimum number of parallel workers: around 1,340.

Second level: workers read from partitions and send in batches. RabbitMQ is an alternative if you already have it, but Kafka's advantage is the ability to re-read messages after a failure without data loss — critical for push.

Worker Pools and Horizontal Scaling

APNs workers and FCM workers are different beasts — don't mix them in one process. APNs requires HTTP/2 with persistent connections, FCM uses HTTP/1.1. Different connection pools, different retry logic, different limits.

# /etc/systemd/system/push-worker-fcm@.service
[Unit]
Description=FCM Push Worker instance %i
After=network.target

[Service]
Type=simple
User=push
WorkingDirectory=/opt/push-service
ExecStart=/opt/push-service/bin/push-worker \
  --platform=fcm \
  --kafka-brokers=kafka-01:9092,kafka-02:9092,kafka-03:9092 \
  --concurrency=200 \
  --connection-pool-size=50
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

systemctl enable push-worker-fcm@{1..8}
systemctl start push-worker-fcm@{1..8}
systemctl status 'push-worker-fcm@*'

Horizontal scaling through Kubernetes with HPA on consumer lag metric: when the queue grows, the controller adds pods. Not magic — just a Prometheus metric with a custom HPA.
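A sketch of what that HPA could look like, assuming consumer lag is exported to Kubernetes through something like prometheus-adapter as an external metric; the metric, label, and deployment names here are illustrative, not a fixed convention:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: push-worker-fcm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: push-worker-fcm
  minReplicas: 4
  maxReplicas: 64          # capped at the partition count: extra pods sit idle
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumer_lag_sum
          selector:
            matchLabels:
              group: push-workers-fcm
        target:
          type: AverageValue
          averageValue: "10000"   # target lag per pod before scaling out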

Token Storage

A device token is 32 bytes for APNs (64 characters in its usual hex serialization) and ~150 characters for FCM. A million users with an average of 1.5 devices is 1.5 million records. Tokens go stale when users reinstall apps, change phones, or revoke notification permission. A store without invalidation quickly becomes a garbage dump.

For persistent storage — PostgreSQL with sharding by user_id:

CREATE TABLE device_tokens (
    id          BIGSERIAL,
    user_id     BIGINT NOT NULL,
    platform    VARCHAR(10) NOT NULL,
    token       VARCHAR(256) NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    last_seen   TIMESTAMPTZ DEFAULT NOW(),
    is_active   BOOLEAN DEFAULT TRUE,
    PRIMARY KEY (id, user_id)  -- the partition key must be part of the PK
) PARTITION BY HASH (user_id);

CREATE TABLE device_tokens_0 PARTITION OF device_tokens
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ...and so on for remainders 1 through 15

-- CONCURRENTLY is not supported on partitioned parents; a plain CREATE INDEX
-- on the parent cascades to every partition
CREATE INDEX idx_tokens_user_active
    ON device_tokens (user_id, is_active)
    WHERE is_active = TRUE;

Redis for hot-cache of active tokens — only users active in the last 30 days. This cuts Redis volume by 3–5x. Reading tokens from PostgreSQL directly during a fan-out to a million users will kill the database — preload tokens in batches into workers beforehand.
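The read path is classic cache-aside. A sketch with the Redis and PostgreSQL clients injected as stand-in functions, which keeps the logic testable without either service:

```go
package main

import "fmt"

// tokensFor implements the cache-aside read path: try the hot cache first,
// fall back to persistent storage on a miss, and backfill the cache.
// cacheGet, dbGet, and cacheSet stand in for Redis and PostgreSQL clients.
func tokensFor(userID int64,
	cacheGet func(int64) ([]string, bool),
	dbGet func(int64) []string,
	cacheSet func(int64, []string),
) []string {
	if tokens, ok := cacheGet(userID); ok {
		return tokens // hot path: no database touch
	}
	tokens := dbGet(userID)
	if len(tokens) > 0 {
		cacheSet(userID, tokens) // backfill so the next read is a hit
	}
	return tokens
}

func main() {
	cache := map[int64][]string{}
	db := map[int64][]string{42: {"fcm-token-a"}}
	get := func(id int64) ([]string, bool) { t, ok := cache[id]; return t, ok }
	set := func(id int64, t []string) { cache[id] = t }
	lookup := func(id int64) []string { return db[id] }

	first := tokensFor(42, get, lookup, set)  // miss: hits the DB and backfills
	second := tokensFor(42, get, lookup, set) // hit: served from cache
	fmt.Println(len(first), len(second), len(cache))
}
```

For the broadcast path, skip this per-user lookup entirely and fetch tokens in the same paginated pass that feeds the queue; cache-aside is for the transactional, one-user-at-a-time path.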

redis-cli SET user:12345:tokens:fcm "token_value" EX 2592000
redis-cli SET user:12345:tokens:apns "token_value" EX 2592000

Delivery Guarantees and Dead Letter Queues

At-least-once or at-most-once — the choice depends on notification type. Transactional ("your payment was processed") — at-least-once, duplicates beat loss. Marketing promo — at-most-once, users shouldn't get the same message three times. This is an architectural decision that needs to be made explicitly.
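For the at-least-once path, deduplication on the consumer side turns redeliveries into no-ops. A minimal sketch keyed on an idempotency key such as event ID plus user ID; in production the seen-set would live in Redis with a TTL (SET NX EX), and the in-process map here is only a stand-in:

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper drops messages whose idempotency key was already delivered,
// turning at-least-once queue semantics into effectively-once sends
// for transactional notifications.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper { return &Deduper{seen: map[string]struct{}{}} }

// ShouldSend reports whether this key is new, recording it as seen.
func (d *Deduper) ShouldSend(key string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[key]; dup {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	fmt.Println(d.ShouldSend("payment-789:user-42")) // first delivery
	fmt.Println(d.ShouldSend("payment-789:user-42")) // redelivered after a retry
}
```

For the at-most-once marketing path the same mechanism isn't needed; there you simply commit the Kafka offset before sending and accept the occasional loss.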

PUSH_RETRY_MAX_ATTEMPTS=3
PUSH_RETRY_INITIAL_DELAY=1s
PUSH_RETRY_MAX_DELAY=30s

kafka-topics.sh --create \
  --bootstrap-server kafka:9092 \
  --topic push.delivery.dlq \
  --partitions 8 \
  --replication-factor 3 \
  --config retention.ms=86400000

Dead letter queue is mandatory. Messages end up there when: a token is invalid (FCM 404 with UNREGISTERED, APNs 410 with Unregistered), the user revoked permissions, or the worker crashed after three retries. DLQ isn't for retry — it's for analysis and dead token invalidation.

kafka-console-consumer.sh \
  --bootstrap-server kafka:9092 \
  --topic push.delivery.dlq \
  --group dlq-processor \
  --from-beginning |
  jq -r '.token' |
  xargs -I{} psql -c "UPDATE device_tokens SET is_active=FALSE WHERE token='{}'"

APNs's 410 response includes a timestamp of when the token became invalid — a signal to deactivate the token immediately. Ignoring these responses means accumulating garbage and wasting resources on provably invalid deliveries.
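The decision of what to do with each failure is worth centralizing in one place rather than scattering across retry loops. A sketch of such a classifier; the reason strings follow the APNs and FCM v1 error vocabularies, and the exact action set is this article's convention, not a standard:

```go
package main

import "fmt"

// Action tells the worker what to do with a failed delivery.
type Action int

const (
	Retry      Action = iota // transient failure: back off and try again
	Deactivate               // token is provably dead: mark inactive via DLQ
	Drop                     // permanent payload error: no retry, no DLQ loop
)

// classify maps provider responses to worker actions.
func classify(platform string, status int, reason string) Action {
	switch {
	case platform == "apns" && status == 410: // Unregistered: token is gone
		return Deactivate
	case platform == "fcm" && reason == "UNREGISTERED":
		return Deactivate
	case platform == "fcm" && reason == "INVALID_ARGUMENT":
		return Drop // malformed payload: retrying only wastes quota
	case status == 429 || status >= 500:
		return Retry // throttled or provider-side failure
	default:
		return Retry
	}
}

func main() {
	fmt.Println(classify("apns", 410, "Unregistered"))
	fmt.Println(classify("fcm", 400, "INVALID_ARGUMENT"))
	fmt.Println(classify("fcm", 429, "QUOTA_EXCEEDED"))
}
```

With this in place, the "Common Mistakes" below about retrying INVALID_ARGUMENT becomes impossible by construction: the payload error never reaches the retry path.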

Delivery Monitoring

Metrics you can't operate blind without:

  • delivery_rate — percentage of successful deliveries. Below 85% is an alarm. Normal for a mature product is 92–96%
  • queue_depth per Kafka topic — linear growth means not enough workers
  • p99 latency from event to delivery — under 5 seconds for transactional notifications
  • fcm_error_rate by error type — UNREGISTERED, INVALID_ARGUMENT, QUOTA_EXCEEDED tracked separately
  • apns_connection_pool_exhausted — if non-zero, the pool is too small for the current load
  • token_cache_hit_rate — should be above 80%

groups:
  - name: push_delivery
    rules:
      - alert: PushDeliveryRateLow
        expr: |
          (sum(rate(push_delivered_total[5m])) /
           sum(rate(push_sent_total[5m]))) < 0.85
        for: 2m
        labels:
          severity: critical
      - alert: PushQueueDepthHigh
        expr: kafka_consumer_lag_sum{group="push-workers-fcm"} > 500000
        for: 5m
        labels:
          severity: warning
      - alert: PushLatencyP99High
        expr: histogram_quantile(0.99, rate(push_e2e_latency_seconds_bucket[5m])) > 10
        for: 3m
        labels:
          severity: warning

Grafana dashboard with three main panels: throughput (sent/delivered per second), error breakdown by platform and error type, consumer lag by partition. Without error type breakdown you won't understand why delivery rate dropped — FCM throttling or a burst of dead tokens after an app update.

Common Mistakes

  • Synchronous sending inside an HTTP handler. The request hangs until all million recipients are processed — thread pool exhausted — 502 across the whole service. The only correct pattern: the HTTP handler puts the task in a queue and immediately returns 202 Accepted
  • Ignoring FCM 400 INVALID_ARGUMENT. The payload is invalid — retry is pointless, but many systems make three attempts and waste FCM quota
  • One giant batch instead of pagination. SELECT without LIMIT on a million rows — full sequential scan that holds the database connection open for minutes and blocks autovacuum
  • No rate limiting on the sender side. FCM's default quota is 600,000 messages per minute. Without speed control, a million-message broadcast burns through the quota partway into the window and then receives 429 on everything until the minute resets
  • Ignoring Retry-After from APNs. When APNs returns 429, the header contains Retry-After. Most hand-rolled clients ignore it and keep hammering the service, making things worse
  • Tokens without TTL or cleanup. An uncleaned store grows 5–10x in a year. 30–40% of records are dead tokens. A daily cron-job that deactivates tokens based on DLQ responses is not optimization — it's basic hygiene

One last thing that's often forgotten: load testing before production. There's no point designing a system for a million users if you don't know where it will break at 200,000. A local test bench with real Kafka, Redis, and mock APNs/FCM servers lets you find bottlenecks before your users do. Pay special attention to partial failure behavior: if one of three Kafka brokers is unavailable during fan-out, your workers should reconnect — not hang.