Carrier API outage — on-call runbook

What to do when one or more of Acme's downstream carrier APIs (UPS, FedEx, DHL, USPS) start failing label generation or rate lookup. Covers detection, triage, mitigation, and customer communications.

Document Owner	Jordan Lee — SRE Lead, Shipping Services
Department	Platform Engineering — Shipping
Date
Document Type	Document
Classification	Internal · Engineering On-call

Overview

When to use this runbook. You're the on-call SRE for Shipping Services, and one of the following is happening:

PagerDuty fired carrier-api-error-rate or carrier-api-latency-p95.
The customer status page is showing more than 200 5xx responses per minute on /api/shipping/label or /api/shipping/rate.
Vendor Operations has reported in #carrier-ops that labels aren't printing for one or more warehouses.

Expected runtime. Detect & triage: ~5 min. Mitigation: 5–30 min depending on which carrier and whether their incident is acknowledged. Comms close: ~10 min.

On-call contacts. Primary: SRE Shipping rotation (PagerDuty service shipping-sre). Secondary: Tech Lead Priya Mehta. Escalation: VP Eng Marcus Tan after 30 min unresolved.

Step 01

Acknowledge the page, open the Shipping Services dashboard, and identify the impacted carrier.

From PagerDuty, follow the alert link to the Grafana dashboard shipping/services-overview. The four-up grid shows error rate and p95 latency per carrier. Whichever panel is red is the carrier in trouble; note it for the rest of the runbook.

Grafana › Shipping › Services overview (https://grafana.acmecorporation.net/d/shipping/services-overview?from=now-1h)

Carrier API health, last 1h

Carrier	5xx rate	Status
UPS	214 / min	▲ +2080% vs baseline
FedEx	3 / min	normal
DHL	1 / min	normal
USPS	0 / min	normal

Fig 1 — UPS is the impacted carrier in this scenario; the other three are healthy.
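
The "whichever panel is red" rule can be sketched as a small filter over per-carrier 5xx rates. This is only an illustration, not the dashboard's real alert query: the function name impacted_carriers is hypothetical, and applying the 200/min figure from the Overview trigger list as a per-carrier threshold is an assumption. The sample rates are the ones shown in Fig 1.

```python
# Illustrative only: the threshold and function name are assumptions,
# not the dashboard's real alert rule.
ERROR_RATE_THRESHOLD_PER_MIN = 200  # borrowed from the runbook's trigger list


def impacted_carriers(rates_per_min, threshold=ERROR_RATE_THRESHOLD_PER_MIN):
    """Return carriers whose 5xx rate exceeds the threshold, worst first."""
    over = {name: rate for name, rate in rates_per_min.items() if rate > threshold}
    return sorted(over, key=over.get, reverse=True)


# 5xx rates per minute as shown in Fig 1.
fig1_rates = {"ups": 214, "fedex": 3, "dhl": 1, "usps": 0}
print(impacted_carriers(fig1_rates))  # ['ups']
```

An empty result with an active page suggests the problem may not be carrier-specific, which is worth mentioning when escalating.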

Step 02

Check the carrier's own status page to determine whether they've acknowledged the incident.

Open the carrier status URL (UPS: status.ups.com; FedEx: developer.fedex.com/status; DHL: status.dhl.com; USPS: usps.com/business/web-tools-apis/status). If the carrier has an open incident matching your symptoms, the resolution will likely come from their side, and your job becomes mitigation (steps 03 to 05) and comms (step 06). If their status page is green, treat it as an Acme-side issue and escalate to the Tech Lead, Priya Mehta.
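
The decision in this step is simple enough to write down as a lookup plus one branch. The sketch below assumes the status-page check itself is done by hand; the function name next_action and its return strings are hypothetical labels, not part of any Acme tooling.

```python
# Status pages as listed in this step. The function name and return strings
# are hypothetical and only encode the decision rule described above.
CARRIER_STATUS_PAGES = {
    "ups": "status.ups.com",
    "fedex": "developer.fedex.com/status",
    "dhl": "status.dhl.com",
    "usps": "usps.com/business/web-tools-apis/status",
}


def next_action(carrier, carrier_has_matching_incident):
    """Map the outcome of the manual status-page check to the next move."""
    if carrier not in CARRIER_STATUS_PAGES:
        raise ValueError(f"unknown carrier: {carrier}")
    if carrier_has_matching_incident:
        # Carrier-side outage: we mitigate and communicate, they fix.
        return "mitigate-and-communicate"
    # Status page is green: assume the fault is on the Acme side.
    return "escalate-to-tech-lead"
```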

Step 03

Flip the impacted carrier to "degraded" mode in the shipping router config.

The shipping router supports per-carrier circuit breakers via the carrier-modes ConfigMap. Setting a carrier to degraded does two things: shipping requests for that carrier short-circuit with a 503 (so we stop hammering an already-down API), and the routing logic prefers other carriers for any shipment whose service level allows substitution.

sre@laptop:~$ kubectl -n shipping edit configmap carrier-modes
  # change mode for ups: healthy → degraded
  data:
    ups:    "degraded"   # was healthy
    fedex:  "healthy"
    dhl:    "healthy"
    usps:   "healthy"
sre@laptop:~$ kubectl -n shipping rollout restart deployment/shipping-router
  deployment.apps/shipping-router restarted

Fig 2 — Flipping UPS to "degraded" via the router ConfigMap. New pods pick up the mode within ~30 sec.
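
To make the two effects of degraded mode concrete, here is a minimal sketch of how a router could act on the modes map. It is a guess at the behavior described in this step, not the shipping router's actual code: the Shipment and Decision types, the route function, and the "first healthy carrier wins" substitution order are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shipment:
    requested_carrier: str
    allows_substitution: bool  # whether the service level permits another carrier


@dataclass
class Decision:
    carrier: Optional[str]               # carrier to call, or None if rejected
    short_circuit_status: Optional[int]  # 503 when rejected without a carrier call


def route(shipment, carrier_modes):
    """Pick a carrier for the shipment given the current per-carrier modes."""
    if carrier_modes.get(shipment.requested_carrier) == "healthy":
        return Decision(shipment.requested_carrier, None)
    if shipment.allows_substitution:
        # Assumed policy: take the first healthy carrier in map order.
        for carrier, mode in carrier_modes.items():
            if mode == "healthy":
                return Decision(carrier, None)
    # No substitution possible: fail fast instead of calling the degraded API.
    return Decision(None, 503)


modes = {"ups": "degraded", "fedex": "healthy", "dhl": "healthy", "usps": "healthy"}
print(route(Shipment("ups", allows_substitution=False), modes))
```

With the Fig 2 modes, a UPS shipment that allows substitution is sent to another carrier, while one that does not is rejected with a 503 and never reaches UPS.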

Note: Do not flip more than one carrier to degraded at the same time without Tech Lead sign-off. Two simultaneous degraded carriers can saturate the remaining carriers and cascade.
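
The one-degraded-carrier rule could be enforced before the ConfigMap is edited. The sketch below is hypothetical: set_mode is not an existing Acme tool, and treating "healthy" and "degraded" as the only valid modes is an assumption based on the values shown in Fig 2.

```python
VALID_MODES = {"healthy", "degraded"}  # assumption: only the modes seen in Fig 2


def set_mode(carrier_modes, carrier, mode, tech_lead_signoff=False):
    """Return a copy of carrier_modes with one carrier changed.

    Raises PermissionError if the change would leave more than one carrier
    degraded and no Tech Lead sign-off was given.
    """
    if carrier not in carrier_modes or mode not in VALID_MODES:
        raise ValueError(f"unknown carrier or mode: {carrier}={mode}")
    updated = {**carrier_modes, carrier: mode}
    degraded = [name for name, m in updated.items() if m == "degraded"]
    if mode == "degraded" and len(degraded) > 1 and not tech_lead_signoff:
        raise PermissionError(
            f"{degraded} would all be degraded; Tech Lead sign-off required"
        )
    return updated
```

Returning a copy rather than mutating the input keeps the previous modes available for the rollback in step 08.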

Step 04

Drain any in-flight requests stuck waiting on the impacted carrier.

Run scripts/drain-carrier.sh ups. The script reads the request queue from Redis, marks any UPS-bound request older than 60 seconds as failed (with retryable=true), and emits a Kafka event so the calling service can decide whether to retry against a different carrier. Expected duration ≈ 90 seconds.
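
For readers who have not opened the script, its described behavior can be approximated as follows. This is a sketch under stated assumptions, not the contents of scripts/drain-carrier.sh: the queue is modeled as an in-memory list instead of Redis, the returned events stand in for the Kafka messages, and the field names (id, carrier, status, enqueued_at) are invented for illustration.

```python
import time

STALE_AFTER_SECONDS = 60  # from the runbook: requests older than 60 seconds


def drain(queue, carrier, now=None):
    """Mark stale pending requests for `carrier` as failed and retryable.

    Mutates the matching queue entries in place and returns the events that
    would be emitted so callers can retry against a different carrier.
    """
    now = time.time() if now is None else now
    events = []
    for request in queue:
        is_stale = now - request["enqueued_at"] > STALE_AFTER_SECONDS
        if request["carrier"] == carrier and request["status"] == "pending" and is_stale:
            request["status"] = "failed"
            request["retryable"] = True
            events.append(
                {"request_id": request["id"], "carrier": carrier, "retryable": True}
            )
    return events
```

Requests for other carriers, and requests for the impacted carrier that are younger than the cutoff, are left untouched.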

Step 05

If the carrier outage is expected to last more than 30 minutes, enable label fallback to a paper-based process for the affected warehouses.

This step requires Tech Lead approval. Page Priya Mehta from PagerDuty. Once approved, set the feature flag shipping.fallback.paper_labels = true in LaunchDarkly. The flag tells the warehouse handheld app to print a 4×6 placeholder label with the shipment ID and barcode; the carrier label gets reconciled by Warehouse Ops later.
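
Both preconditions for the fallback must hold before the flag is touched. A minimal sketch of that check, with a hypothetical function name and no LaunchDarkly call (the flag itself is still set by hand):

```python
FALLBACK_THRESHOLD_MINUTES = 30  # from the runbook: outage expected to exceed 30 min


def may_enable_paper_labels(expected_outage_minutes, tech_lead_approved):
    """True only if the outage is expected to be long AND approval was given."""
    long_outage = expected_outage_minutes > FALLBACK_THRESHOLD_MINUTES
    return bool(long_outage and tech_lead_approved)
```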

Step 06

Publish a status-page incident and notify internal channels.

From status.acmecorporation.com admin, create a new incident with template carrier-degraded. Set the impacted component to the carrier name and severity to Partial outage. Post a one-line summary in #shipping-ops and #carrier-ops, then email the on-call list ops-leadership@.

Status page › New incident (https://manage.statuspage.io/pages/acme/incidents/new)

Title	UPS label generation degraded
Impact	Partial outage
Affected components	Shipping · UPS label generation
Message	We're seeing elevated errors generating UPS labels. Other carriers (FedEx, DHL, USPS) are unaffected. We're investigating.

Fig 3 — Status-page incident template. Keep wording factual and short for the first post.
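
The first post follows the same pattern for every carrier, so the fields can be generated rather than typed under pressure. The sketch below only builds the text shown in Fig 3; the build_incident function and the dictionary keys are hypothetical and do not correspond to any status-page API.

```python
DISPLAY_NAMES = {"ups": "UPS", "fedex": "FedEx", "dhl": "DHL", "usps": "USPS"}


def build_incident(carrier):
    """Fill in the carrier-degraded template fields for one impacted carrier."""
    name = DISPLAY_NAMES[carrier]
    others = ", ".join(n for key, n in DISPLAY_NAMES.items() if key != carrier)
    return {
        "template": "carrier-degraded",
        "title": f"{name} label generation degraded",
        "impact": "Partial outage",
        "component": f"Shipping · {name} label generation",
        "message": (
            f"We're seeing elevated errors generating {name} labels. "
            f"Other carriers ({others}) are unaffected. We're investigating."
        ),
    }


print(build_incident("ups")["message"])
```

Review the generated message before publishing: the "other carriers are unaffected" sentence is only true while a single carrier is degraded.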

Step 07

Watch the carrier's recovery and run a probe before flipping the carrier back to healthy.

The carrier's own status page is the source of truth for whether they've recovered. Once they post "Resolved", run the probe: scripts/probe-carrier.sh ups --count 50 --concurrency 5. The probe issues 50 small label-creation requests over five concurrent connections and reports the success rate. Require ≥ 98% success across three consecutive runs before re-enabling.
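
The re-enable gate can be expressed as a check over the recorded probe results. This sketch assumes the success counts are copied by hand from the probe output, since the output format of scripts/probe-carrier.sh is not documented here; the function names are hypothetical, and "three consecutive runs" is read as the three most recent runs.

```python
REQUIRED_PERCENT = 98
REQUIRED_CONSECUTIVE_RUNS = 3


def run_passes(successes, count):
    """True if one probe run reached at least 98% success."""
    # Integer comparison avoids floating-point rounding at exactly 98%.
    return count > 0 and successes * 100 >= REQUIRED_PERCENT * count


def gate_open(runs):
    """runs: list of (successes, count) tuples, oldest first.

    The gate opens only when the three most recent runs all pass.
    """
    recent = runs[-REQUIRED_CONSECUTIVE_RUNS:]
    return len(recent) == REQUIRED_CONSECUTIVE_RUNS and all(
        run_passes(successes, count) for successes, count in recent
    )


print(gate_open([(50, 50), (49, 50), (50, 50)]))  # True
```

With --count 50, a run passes with at most one failed request (49 of 50 is exactly 98%). A single failing run resets the streak, so three fresh passing runs are needed afterwards.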

Step 08

Flip the carrier back to healthy and roll out the router.

Reverse step 03: edit the carrier-modes ConfigMap and set the carrier back to healthy. Restart the shipping-router deployment. Watch the Grafana dashboard from step 01 — error rate should return to baseline within 2 minutes.

Do not skip the cool-down. Carriers often recover in a rolling fashion: they declare "Resolved" while still bringing nodes back. The 3× probe gate in step 07 is what protects us from a relapse.

Step 09

Replay any work that was deferred during the outage.

If paper-label fallback was enabled (step 05), Warehouse Ops will have a queue of placeholder labels to reconcile. Trigger the replay job in CI: shipping/jobs/replay-fallback-labels. It re-issues the carrier label for each placeholder, attaches the real tracking number, and notifies the warehouse to re-label the affected packages.
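
Conceptually, the replay job loops over the placeholder queue and swaps each placeholder for a real label. The sketch below shows that shape only and is not the job's source: the record fields and the issue_label callable are hypothetical, and the carrier call is injected so the example runs without network access.

```python
def replay_fallback_labels(placeholders, issue_label):
    """Re-issue a carrier label for each placeholder and flag it for re-labeling.

    `issue_label(shipment_id, carrier)` is a caller-supplied function that
    returns the real tracking number.
    """
    reconciled = []
    for placeholder in placeholders:
        tracking_number = issue_label(
            placeholder["shipment_id"], placeholder["carrier"]
        )
        reconciled.append(
            {
                **placeholder,
                "tracking_number": tracking_number,
                "needs_relabel": True,  # warehouse must re-label the package
            }
        )
    return reconciled
```

Injecting issue_label also makes it easy to dry-run the loop with a stub before pointing it at a carrier that has just recovered.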

Step 10

Close the status-page incident and post final comms.

Move the status-page incident to Resolved. Include the start/end timestamps, the carrier that was impacted, and a one-line summary of customer impact. Post the same in #shipping-ops and #carrier-ops.
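
The closing post needs the same three facts every time, so a small formatter helps keep the status page and Slack messages identical. This is a sketch with a hypothetical function name and wording; timestamps are assumed to be timezone-aware UTC datetimes.

```python
from datetime import datetime, timezone


def resolution_summary(carrier, started_at, ended_at, impact):
    """Build the one-paragraph resolution message with timestamps and duration."""
    if ended_at < started_at:
        raise ValueError("ended_at precedes started_at")
    minutes = int((ended_at - started_at).total_seconds() // 60)
    return (
        f"Resolved: {carrier} label generation degraded. "
        f"Start {started_at:%Y-%m-%d %H:%M} UTC, "
        f"end {ended_at:%Y-%m-%d %H:%M} UTC ({minutes} min). "
        f"Impact: {impact}"
    )
```

Pass the start time from the original status-page incident rather than the PagerDuty acknowledgement time, so the reported duration matches what customers saw.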

Step 11

File the post-incident review.

Within 24 hours, open a PIR document using the carrier-outage template. Include: detection timestamp, carrier, total customer-visible duration, peak error rate, mitigation steps taken, and any action items (e.g. "tune the carrier-api-error-rate alert threshold"). Tag the PIR with carrier-{name} for trend analysis.
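
The PIR fields listed above map naturally onto a small record type, which makes it harder to forget one. The class below is a hypothetical sketch, not the carrier-outage template itself; lowercasing the carrier name in the tag is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostIncidentReview:
    detected_at: str               # detection timestamp
    carrier: str                   # impacted carrier
    customer_visible_minutes: int  # total customer-visible duration
    peak_errors_per_min: int       # peak error rate
    mitigation_steps: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    @property
    def tag(self):
        """Trend-analysis tag in the carrier-{name} form."""
        return f"carrier-{self.carrier.lower()}"
```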

Reference

Grafana dashboard — shipping/services-overview · grafana.acmecorporation.net/d/shipping/services-overview
PagerDuty service — shipping-sre
Slack channels — #shipping-ops · #carrier-ops · #change-control
Carrier status pages — UPS, FedEx, DHL, USPS (see step 02)
LaunchDarkly flag — shipping.fallback.paper_labels (Tech Lead approval required)
Related runbooks — RB-PLAT-007 (Redis queue drain) · RB-PLAT-012 (router rollback)