Carrier API outage — on-call runbook · Document
Carrier API outage — on-call runbook
Acme Corporation
What to do when one or more of Acme's downstream carrier APIs (UPS, FedEx, DHL, USPS) start failing label generation or rate lookup. Covers detection, triage, mitigation, and customer communications.

Document Owner: Jordan Lee, SRE Lead, Shipping Services
Department: Platform Engineering, Shipping
Date
Document Type: Document
Classification: Internal · Engineering On-call
Overview

When to use this runbook. You're the on-call SRE for Shipping Services, and one of the following is happening:

  • PagerDuty fired carrier-api-error-rate or carrier-api-latency-p95.
  • The customer status page is showing more than 200 5xx responses per minute on /api/shipping/label or /api/shipping/rate.
  • Vendor Operations has reported in #carrier-ops that labels aren't printing for one or more warehouses.

Expected runtime. Detect & triage: ~5 min. Mitigation: 5–30 min depending on which carrier and whether their incident is acknowledged. Comms close: ~10 min.

On-call contacts. Primary: SRE Shipping rotation (PagerDuty service shipping-sre). Secondary: Tech Lead Priya Mehta. Escalation: VP Eng Marcus Tan after 30 min unresolved.

Step 01
Acknowledge the page, open the Shipping Services dashboard, and identify the impacted carrier.

From PagerDuty, follow the alert link straight to the Grafana dashboard shipping/services-overview. The four-up grid shows error rate and p95 latency per carrier. Whichever panel is red is the carrier in trouble; note it for the rest of the runbook.

Grafana › Shipping › Services overview (https://grafana.acmecorporation.net/d/shipping/services-overview?from=now-1h)
Carrier API health, last 1h

  Carrier   5xx rate    vs baseline
  UPS       214 / min   ▲ +2080%
  FedEx     3 / min     normal
  DHL       1 / min     normal
  USPS      0 / min     normal

Fig 1 — UPS is the impacted carrier in this scenario; the other three are healthy.
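
The reading in Fig 1 can be expressed as a small calculation. This is a minimal sketch, assuming the per-carrier 5xx rates have already been read off the dashboard. The helper names are hypothetical (not existing Acme tooling), and applying the Overview's 200/min figure as a per-carrier threshold is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the Acme tooling.

# Percent change of a current rate against a baseline rate, rounded to an integer.
# Prints "n/a" when the baseline is zero.
pct_vs_baseline() {
  local current="$1" baseline="$2"
  awk -v c="$current" -v b="$baseline" \
    'BEGIN { if (b == 0) { print "n/a"; exit } printf "%.0f\n", (c - b) / b * 100 }'
}

# Read "<carrier> <5xx-per-minute>" pairs on stdin and print every carrier
# whose rate exceeds the threshold (assumed default: 200 per minute).
impacted_carriers() {
  local threshold="${1:-200}"
  awk -v t="$threshold" '$2 + 0 > t { print $1 }'
}
```

For example, printf 'ups 214\nfedex 3\ndhl 1\nusps 0\n' | impacted_carriers prints ups. The runbook does not state the UPS baseline; with an assumed baseline of 9.8/min, pct_vs_baseline 214 9.8 prints 2084, which is close to the +2080% shown in Fig 1.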
Step 02
Check the carrier's own status page to determine whether they've acknowledged the incident.

Open the carrier status URL (UPS: status.ups.com; FedEx: developer.fedex.com/status; DHL: status.dhl.com; USPS: usps.com/business/web-tools-apis/status). If the carrier has an open incident matching your symptoms, the resolution will likely come from their side — your job becomes mitigation and comms (Phase 2 and Phase 3). If their status page is green, treat it as an Acme-side issue and escalate to the Tech Lead.

Step 03
Flip the impacted carrier to "degraded" mode in the shipping router config.

The shipping router supports per-carrier circuit breakers via the carrier-modes ConfigMap. Setting a carrier to degraded does two things: requests for that carrier short-circuit with a 503 (so we stop hammering an already-down API), and the routing logic prefers other carriers for any shipment whose service level allows substitution.

sre@laptop:~$ kubectl -n shipping edit configmap carrier-modes
# change mode for ups: healthy → degraded
data:
  ups: "degraded"   # was healthy
  fedex: "healthy"
  dhl: "healthy"
  usps: "healthy"
sre@laptop:~$ kubectl -n shipping rollout restart deployment/shipping-router
deployment.apps/shipping-router restarted
Fig 2 — Flipping UPS to "degraded" via the router ConfigMap. New pods pick up the mode within ~30 sec.
Note: Do not flip more than one carrier to degraded at the same time without Tech Lead sign-off. Two simultaneous degraded carriers can saturate the remaining carriers and cascade.
Step 04
Drain any in-flight requests stuck waiting on the impacted carrier.

Run scripts/drain-carrier.sh ups. The script reads the request queue from Redis, marks any UPS-bound request older than 60 seconds as failed (with retryable=true), and emits a Kafka event so the calling service can decide whether to retry against a different carrier. Expected duration ≈ 90 seconds.
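
The 60-second age rule can be illustrated on its own. This is a minimal sketch of the selection step only, not the contents of scripts/drain-carrier.sh. The one-line-per-request input format and the function name are assumptions, and the Redis read and Kafka emit are deliberately left out.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the "older than 60 seconds" selection.

# Print the IDs of requests bound for carrier $1 that have waited longer
# than $2 seconds (default 60). $3 overrides "now" (Unix seconds) for testing.
# Input (stdin): one request per line as "<request_id> <carrier> <enqueued_epoch>".
stale_requests() {
  local carrier="$1" max_age="${2:-60}" now="${3:-$(date +%s)}"
  awk -v c="$carrier" -v max="$max_age" -v now="$now" \
    '$2 == c && (now - $3) > max { print $1 }'
}
```

With now fixed at 1100, the input lines r1 ups 1000, r2 ups 1050, and r3 fedex 900 yield only r1: r2 has waited 50 seconds, and r3 is bound for a different carrier.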

Step 05
If the carrier outage is expected to last more than 30 minutes, enable label fallback to a paper-based process for the affected warehouses.

This step requires Tech Lead approval. Page Priya Mehta from PagerDuty. Once approved, set the feature flag shipping.fallback.paper_labels = true in LaunchDarkly. The flag tells the warehouse handheld app to print a 4×6 placeholder label with the shipment ID and barcode; the carrier label gets reconciled by Warehouse Ops later.

Step 06
Publish a status-page incident and notify internal channels.

From status.acmecorporation.com admin, create a new incident with template carrier-degraded. Set the impacted component to the carrier name and severity to Partial outage. Post a one-line summary in #shipping-ops and #carrier-ops, then email the on-call list ops-leadership@.

Status page › New incident (https://manage.statuspage.io/pages/acme/incidents/new)

  Incident: UPS label generation degraded
  Impact: Partial outage
  Affected components: Shipping · UPS label generation
  Message: We're seeing elevated errors generating UPS labels. Other carriers (FedEx, DHL, USPS) are unaffected. We're investigating.
  [Publish incident]

Fig 3 — Status-page incident template. Keep wording factual and short for the first post.
Step 07
Watch the carrier's recovery and run a probe before flipping the carrier back to healthy.

The carrier's own status page is the source of truth for whether they've recovered. Once they post "Resolved", run the probe: scripts/probe-carrier.sh ups --count 50 --concurrency 5. The probe issues 50 small label-creation requests over five concurrent connections and reports the success rate. Require ≥ 98% success across three consecutive runs before re-enabling.
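
With 50 requests per run, a 98% floor allows at most one failure per run (49 of 50 is exactly 98%). The gate can be sketched as below, assuming you note the success count from each probe run by hand. The function names are hypothetical, and the output format of scripts/probe-carrier.sh is not assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the 3x probe gate; not part of the Acme tooling.

# Success rate in percent for one probe run (integer arithmetic, rounded down).
success_pct() {
  local ok="$1" total="$2"
  echo $(( ok * 100 / total ))
}

# Return 0 only if at least three runs were supplied and every run reached
# 98% success or better. $1 is the request count per run; the remaining
# arguments are the success counts of consecutive runs.
gate_passes() {
  local total="$1"; shift
  [ "$#" -ge 3 ] || return 1
  local ok
  for ok in "$@"; do
    [ "$(success_pct "$ok" "$total")" -ge 98 ] || return 1
  done
  return 0
}
```

For example, gate_passes 50 50 49 50 succeeds, while gate_passes 50 50 48 50 fails because the second run reached only 96%. Rounding down is safe here: a run that truly falls below 98% can never round up into a pass.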

Step 08
Flip the carrier back to healthy and roll out the router.

Reverse step 03: edit the carrier-modes ConfigMap and set the carrier back to healthy. Restart the shipping-router deployment. Watch the Grafana dashboard from step 01 — error rate should return to baseline within 2 minutes.

Do not skip the cool-down. Carriers often have rolling recovery — they declare "Resolved" while still bringing nodes back. The 3× probe gate above is what protects us from a relapse.
Step 09
Replay any work that was deferred during the outage.

If paper-label fallback was enabled (step 05), Warehouse Ops will have a queue of placeholder labels to reconcile. Trigger the replay job in CI: shipping/jobs/replay-fallback-labels. It re-issues the carrier label for each placeholder, attaches the real tracking number, and notifies the warehouse to re-label the affected packages.

Step 10
Close the status-page incident and post final comms.

Move the status-page incident to Resolved. Include the start/end timestamps, the carrier that was impacted, and a one-line summary of customer impact. Post the same in #shipping-ops and #carrier-ops.

Step 11
File the post-incident review.

Within 24 hours, open a PIR document using the carrier-outage template. Include: detection timestamp, carrier, total customer-visible duration, peak error rate, mitigation steps taken, and any action items (e.g. "tune the carrier-api-error-rate alert threshold"). Tag the PIR with carrier-{name} for trend analysis.

Reference
  • Grafana dashboard — shipping/services-overview · grafana.acmecorporation.net/d/shipping/services-overview
  • PagerDuty service — shipping-sre
  • Slack channels — #shipping-ops · #carrier-ops · #change-control
  • Carrier status pages — UPS, FedEx, DHL, USPS (see Phase 1, step 02)
  • LaunchDarkly flag — shipping.fallback.paper_labels (Tech Lead approval required)
  • Related runbooks — RB-PLAT-007 (Redis queue drain) · RB-PLAT-012 (router rollback)