| Document Owner | Jordan Lee — SRE Lead, Shipping Services |
| Department | Platform Engineering — Shipping |
| Date | |
| Document Type | Document |
| Classification | Internal · Engineering On-call |
When to use this runbook. You're the on-call SRE for Shipping Services, and one of the following is happening:
Expected runtime. Detect & triage: ~5 min. Mitigation: 5–30 min depending on which carrier and whether their incident is acknowledged. Comms close: ~10 min.
On-call contacts. Primary: SRE Shipping rotation (PagerDuty service shipping-sre). Secondary: Tech Lead Priya Mehta. Escalation: VP Eng Marcus Tan after 30 min unresolved.
From the PagerDuty alert, follow the link to the Grafana dashboard shipping/services-overview. The four-up grid shows error rate and p95 latency per carrier. The red panel identifies the carrier in trouble; note it for the rest of the runbook.
Open the carrier status URL (UPS: status.ups.com; FedEx: developer.fedex.com/status; DHL: status.dhl.com; USPS: usps.com/business/web-tools-apis/status). If the carrier has an open incident matching your symptoms, the resolution will likely come from their side — your job becomes mitigation and comms (Phase 2 and Phase 3). If their status page is green, treat it as an Acme-side issue and escalate to the Tech Lead.
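The triage decision above can be sketched as a small function. This is only an illustration: the names NextAction and triage are hypothetical, and the handling of an open carrier incident that does not match your symptoms (escalate) is an assumption, since the runbook only covers the matching-incident and green-page cases.

```python
from enum import Enum


class NextAction(Enum):
    # Carrier-side incident: continue with mitigation and comms (Phase 2 and Phase 3).
    MITIGATE_AND_COMMUNICATE = "mitigate_and_communicate"
    # Carrier looks healthy: treat as an Acme-side issue.
    ESCALATE_TO_TECH_LEAD = "escalate_to_tech_lead"


def triage(carrier_incident_open: bool, matches_symptoms: bool) -> NextAction:
    """Pick the next action from what the carrier status page shows."""
    if carrier_incident_open and matches_symptoms:
        return NextAction.MITIGATE_AND_COMMUNICATE
    # Green status page, or (assumption) an open incident unrelated to our symptoms.
    return NextAction.ESCALATE_TO_TECH_LEAD
```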
The shipping router supports per-carrier circuit breakers via the carrier-modes ConfigMap. Edit the ConfigMap and set the affected carrier to degraded. This does two things: shipping requests for that carrier short-circuit with a 503 (so we stop hammering an API that is already down), and the routing logic prefers other carriers for any shipment whose service level allows substitution.
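To make the effect of degraded mode concrete, here is a minimal sketch of how a router might apply the carrier modes. It is not the shipping router's actual code: the Shipment type, the route function, the in-memory CARRIER_MODES dictionary, and the rule for choosing a substitute (first healthy carrier in dictionary order) are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory copy of the carrier modes read from the ConfigMap.
CARRIER_MODES = {"ups": "degraded", "fedex": "healthy", "dhl": "healthy", "usps": "healthy"}


@dataclass
class Shipment:
    shipment_id: str
    carrier: str               # carrier originally requested
    allows_substitution: bool  # whether the service level permits another carrier


def route(shipment: Shipment, modes: dict[str, str]) -> tuple[int, Optional[str]]:
    """Return (status_code, carrier) for a shipping request."""
    if modes.get(shipment.carrier, "healthy") != "degraded":
        return 200, shipment.carrier
    if shipment.allows_substitution:
        # Assumption: pick the first healthy alternative in dictionary order.
        for carrier, mode in modes.items():
            if carrier != shipment.carrier and mode == "healthy":
                return 200, carrier
    # Short-circuit instead of calling the degraded carrier's API.
    return 503, None
```

In this sketch a UPS shipment that allows substitution is rerouted to another healthy carrier, while one that does not is rejected with a 503 without any call to the UPS API.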
Run scripts/drain-carrier.sh ups (substitute the affected carrier for ups). The script reads the request queue from Redis, marks every request bound for that carrier that is older than 60 seconds as failed (with retryable=true), and emits a Kafka event so the calling service can decide whether to retry against a different carrier. Expected duration ≈ 90 seconds.
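The drain logic described above can be sketched as follows. This is an illustration under assumptions, not the contents of scripts/drain-carrier.sh: the queue is a plain Python list standing in for Redis, the returned dictionaries stand in for Kafka events, and the QueuedRequest fields and event shape are hypothetical.

```python
from dataclasses import dataclass

STALE_AFTER_SECONDS = 60  # age threshold stated in the runbook


@dataclass
class QueuedRequest:
    request_id: str
    carrier: str
    enqueued_at: float  # epoch seconds
    status: str = "pending"
    retryable: bool = False


def drain_carrier(queue, carrier, now):
    """Fail stale pending requests for one carrier and return the events to publish."""
    events = []
    for req in queue:
        is_stale = now - req.enqueued_at > STALE_AFTER_SECONDS
        if req.carrier == carrier and req.status == "pending" and is_stale:
            req.status = "failed"
            req.retryable = True  # lets the caller retry against another carrier
            events.append({
                "request_id": req.request_id,
                "carrier": carrier,
                "status": "failed",
                "retryable": True,
            })
    return events
```

Requests for other carriers, and requests for the drained carrier that are 60 seconds old or younger, are left untouched in this sketch.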
This step requires Tech Lead approval. Page Priya Mehta from PagerDuty. Once approved, set the feature flag shipping.fallback.paper_labels = true in LaunchDarkly. The flag tells the warehouse handheld app to print a 4×6 placeholder label with the shipment ID and barcode; the carrier label gets reconciled by Warehouse Ops later.
From status.acmecorporation.com admin, create a new incident with template carrier-degraded. Set the impacted component to the carrier name and severity to Partial outage. Post a one-line summary in #shipping-ops and #carrier-ops, then email the on-call list ops-leadership@.
The carrier's own status page is the source of truth for whether they've recovered. Once they post "Resolved", confirm it from our side by running the probe for the affected carrier, for example: scripts/probe-carrier.sh ups --count 50 --concurrency 5. The probe issues 50 small label-creation requests over five concurrent connections and reports the success rate. Require ≥ 98% success on each of three consecutive runs before re-enabling.
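The re-enable gate can be expressed as a short check. This is a sketch only: run_passes and safe_to_reenable are hypothetical helpers, and it assumes each probe run is recorded as a (successes, total) pair rather than parsed from the script's real output, whose format is not specified here.

```python
REQUIRED_SUCCESS_PERCENT = 98  # threshold from the runbook
REQUIRED_CONSECUTIVE_RUNS = 3


def run_passes(successes, total):
    """True when a single probe run meets the success threshold."""
    if total <= 0:
        raise ValueError("a probe run must issue at least one request")
    # Integer comparison avoids floating-point rounding at exactly 98%.
    return successes * 100 >= REQUIRED_SUCCESS_PERCENT * total


def safe_to_reenable(runs):
    """True when the three most recent runs all pass.

    runs is a list of (successes, total) pairs, oldest first.
    """
    if len(runs) < REQUIRED_CONSECUTIVE_RUNS:
        return False
    recent = runs[-REQUIRED_CONSECUTIVE_RUNS:]
    return all(run_passes(successes, total) for successes, total in recent)
```

With --count 50, a run tolerates at most one failed request: 49 of 50 is exactly 98%, while 48 of 50 is 96% and resets the count of consecutive passing runs.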
Reverse step 03: edit the carrier-modes ConfigMap and set the carrier back to healthy. Restart the shipping-router deployment. Watch the Grafana dashboard from step 01 — error rate should return to baseline within 2 minutes.
If paper-label fallback was enabled (step 05), Warehouse Ops will have a queue of placeholder labels to reconcile. Trigger the replay job in CI: shipping/jobs/replay-fallback-labels. It re-issues the carrier label for each placeholder, attaches the real tracking number, and notifies the warehouse to re-label the affected packages.
Move the status-page incident to Resolved. Include the start/end timestamps, the carrier that was impacted, and a one-line summary of customer impact. Post the same in #shipping-ops and #carrier-ops.
Within 24 hours, open a PIR document using the carrier-outage template. Include: detection timestamp, carrier, total customer-visible duration, peak error rate, mitigation steps taken, and any action items (e.g. "tune the carrier-api-error-rate alert threshold"). Tag the PIR with carrier-{name} for trend analysis.