
Incident Response

Severity Levels

| Level | Description | Examples | Response Time | Escalation |
|-------|-------------|----------|---------------|------------|
| P1 | Full service outage | All APIs unreachable, database down, complete data loss | Immediate | All hands, notify leadership within 15 min |
| P2 | Major degradation | Single API down, high error rate (>5%), billing failures | < 30 minutes | On-call engineer + team lead |
| P3 | Partial degradation | Elevated latency, intermittent errors, single feature broken | < 2 hours | On-call engineer |
| P4 | Minor issue | Cosmetic bug, non-critical feature, documentation error | < 24 hours | Normal sprint workflow |

Escalation Criteria

Escalate from P3 to P2 if:

  • Issue persists for more than 30 minutes
  • More than 3 customers report the issue
  • Error rate exceeds 5%

Escalate from P2 to P1 if:

  • Multiple services are affected
  • Data integrity is at risk
  • Issue persists for more than 15 minutes without a mitigation path
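The >5% error-rate threshold can be sanity-checked against ALB metrics. A sketch, assuming the standard AWS/ApplicationELB CloudWatch metrics; the load balancer dimension value below is a placeholder to look up first with `aws elbv2 describe-load-balancers`:

```shell
# Estimate the 5xx rate over the last 30 minutes from ALB metrics.
LB="app/statux-prod-alb/0123456789abcdef"   # placeholder dimension value
# date -d is GNU coreutils; on macOS use `date -u -v-30M` instead
START=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

get_sum() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name "$1" \
    --dimensions Name=LoadBalancer,Value="$LB" \
    --start-time "$START" --end-time "$END" \
    --period 1800 --statistics Sum \
    --query 'Datapoints[0].Sum' --output text
}

errors=$(get_sum HTTPCode_Target_5XX_Count)
total=$(get_sum RequestCount)
echo "5xx: $errors, total: $total"
# Error rate (%) = errors / total * 100; escalate P3 -> P2 above 5
```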

Response Steps

  1. Acknowledge - Note the time and initial symptoms
  2. Assess - Check health endpoints, logs, metrics
  3. Communicate - Update stakeholders and status page
  4. Mitigate - Rollback, restart, or apply fix
  5. Resolve - Confirm service restored
  6. Post-mortem - Document root cause and action items

Step 1: Initial Assessment

Check Health Endpoints

export AWS_PROFILE=statux-main

# Check all API health endpoints
curl -s https://statuspage-api.statux.io/api/v1/health | jq .
curl -s https://alerts-api.statux.io/api/v1/health | jq .
curl -s https://synthetics-api.statux.io/api/v1/health | jq .
curl -s https://insights-api.statux.io/api/v1/health | jq .
curl -s https://platform-api.statux.io/api/v1/health | jq .
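The five checks above can be rolled into a single loop. A sketch that only assumes each health endpoint returns HTTP 200 when healthy:

```shell
# Probe every API health endpoint and print the HTTP status for each.
for svc in statuspage-api alerts-api synthetics-api insights-api platform-api; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    "https://${svc}.statux.io/api/v1/health")
  echo "${svc}: HTTP ${code:-unreachable}"
done
```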

Check ALB Target Health

# Get target group ARNs (check each API)
aws elbv2 describe-target-groups \
--query 'TargetGroups[*].[TargetGroupName,TargetGroupArn]' \
--output table

# Check target health for a specific target group
aws elbv2 describe-target-health \
--target-group-arn <target-group-arn>

Check Docker Logs via SSM

# List running instances
aws ec2 describe-instances \
--filters "Name=tag:aws:autoscaling:groupName,Values=statux-prod-asg-api" \
--query 'Reservations[*].Instances[*].[InstanceId,State.Name,PrivateIpAddress]' \
--output table

# Get Docker logs from an instance
aws ssm send-command \
--instance-ids <instance-id> \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["docker logs statux-api --tail 200 --since 10m"]'

# Retrieve the command output
aws ssm get-command-invocation \
--command-id <command-id> \
--instance-id <instance-id> \
--query 'StandardOutputContent' \
--output text

Docker Container Names

Each API uses a different container name:

  • Statuspages: statux-api
  • Alerting: statux-alerts-api
  • Synthetics: statux-synthetics-api
  • Insights: statux-insights-api
  • Platform: statux-platform-api
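The send-command / get-command-invocation pair above can be wrapped in a small helper that works for any of these containers. A sketch; the `fetch_logs` name and the fixed 5-second wait are our own, not part of the runbook tooling:

```shell
# fetch_logs <container-name> <instance-id>: tail recent Docker logs via SSM.
fetch_logs() {
  container="$1"
  instance="$2"
  cmd_id=$(aws ssm send-command \
    --instance-ids "$instance" \
    --document-name "AWS-RunShellScript" \
    --parameters "commands=[\"docker logs $container --tail 200 --since 10m\"]" \
    --query 'Command.CommandId' --output text)
  sleep 5   # crude wait; poll get-command-invocation Status for anything slower
  aws ssm get-command-invocation \
    --command-id "$cmd_id" \
    --instance-id "$instance" \
    --query 'StandardOutputContent' --output text
}

# Example: fetch_logs statux-alerts-api <instance-id>
```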

Check RDS Connectivity

# Check RDS instance status
aws rds describe-db-instances \
--db-instance-identifier statux-prod-rds \
--query 'DBInstances[0].[DBInstanceStatus,Endpoint.Address,DBInstanceClass]' \
--output table

# Check active connections (via bastion or SSM)
psql -h <rds-endpoint> -U statux_admin -d statux -c \
"SELECT datname, numbackends FROM pg_stat_database WHERE datname = 'statux';"

# Check for long-running queries
psql -h <rds-endpoint> -U statux_admin -d statux -c \
"SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND now() - pg_stat_activity.query_start > interval '30 seconds'
ORDER BY duration DESC;"
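If one of the long-running queries surfaced above is blocking recovery, it can be cancelled from the same session; `pg_cancel_backend` and `pg_terminate_backend` are standard PostgreSQL admin functions (the pid comes from the previous query's output):

```shell
RDS_ENDPOINT="<rds-endpoint>"   # same endpoint as above
PID=12345                       # pid from the long-running-query output

# Ask the backend to cancel its current query (gentler option)
psql -h "$RDS_ENDPOINT" -U statux_admin -d statux -c \
"SELECT pg_cancel_backend($PID);"

# Only if the cancel is ignored: terminate the backend connection entirely
psql -h "$RDS_ENDPOINT" -U statux_admin -d statux -c \
"SELECT pg_terminate_backend($PID);"
```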

Step 2: Per-Service Troubleshooting

Statuspages API (Port 3000)

Common issues:

  • Subscriber email delivery failures (check Resend API status)
  • High traffic on public status pages
  • Incident webhook delivery timeouts

What to check:

# Check the ASG
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names statux-prod-asg-api \
--query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[*].HealthStatus]'

# Docker logs
aws ssm send-command --instance-ids <id> \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["docker logs statux-api --tail 100 --since 5m 2>&1 | grep -i error"]'

Alerting API (Port 3001)

Common issues:

  • Alert delivery delays (Twilio, push notifications)
  • Escalation chain failures
  • High alert volume causing queue backlog

What to check:

# Docker logs
aws ssm send-command --instance-ids <id> \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["docker logs statux-alerts-api --tail 100 --since 5m 2>&1 | grep -i error"]'

Synthetics API (Port 3002)

Common issues:

  • Check execution timeouts
  • Relay disconnections
  • False positives from network issues in check regions

What to check:

# Docker logs
aws ssm send-command --instance-ids <id> \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["docker logs statux-synthetics-api --tail 100 --since 5m 2>&1 | grep -i error"]'

Insights API (Port 3003)

Common issues:

  • AWS Bedrock throttling or timeouts
  • Webhook ingestion failures
  • Usage budget exceeded

What to check:

# Docker logs
aws ssm send-command --instance-ids <id> \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["docker logs statux-insights-api --tail 100 --since 5m 2>&1 | grep -i error"]'

Platform API (Port 3004)

Common issues:

  • Stripe webhook delivery failures
  • Cognito authentication issues
  • SCIM provisioning errors

What to check:

# Docker logs
aws ssm send-command --instance-ids <id> \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["docker logs statux-platform-api --tail 100 --since 5m 2>&1 | grep -i error"]'

Step 3: Rollback Procedures

Quick Rollback (ASG Instance Refresh)

If a recent deployment caused the issue, roll back to the previous Docker image:

# 1. Find the previous working image tag
aws ecr describe-images \
--repository-name <ecr-repo-name> \
--query 'sort_by(imageDetails,&imagePushedAt)[-5:].imageTags' \
--output table

# 2. Tag the previous good image as "latest"
GOOD_TAG="<previous-sha>"
REPO="255982108053.dkr.ecr.us-east-1.amazonaws.com/<ecr-repo-name>"

# Pull, retag, and push
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 255982108053.dkr.ecr.us-east-1.amazonaws.com
docker pull $REPO:$GOOD_TAG
docker tag $REPO:$GOOD_TAG $REPO:latest
docker push $REPO:latest

# 3. Trigger instance refresh
aws autoscaling start-instance-refresh \
--auto-scaling-group-name <asg-name> \
--preferences '{"MinHealthyPercentage":50,"InstanceWarmup":120}'

# 4. Monitor the refresh
watch -n 10 'aws autoscaling describe-instance-refreshes \
--auto-scaling-group-name <asg-name> \
--query "InstanceRefreshes[0].[Status,PercentageComplete]" \
--output text'

ECR Repository and ASG Names

| App | ECR Repo | ASG Name |
|-----|----------|----------|
| Statuspages | statux-api | statux-prod-asg-api |
| Alerting | statux-alerts-api | statux-prod-asg-alerts-api |
| Synthetics | statux-synthetics-api | statux-prod-asg-synthetics-api |
| Insights | statux-insights-api | statux-prod-asg-insights-api |
| Platform | statux-platform-api | statux-prod-asg-platform-api |

Database Rollback

If the issue is caused by a bad migration:

# 1. Take a snapshot before reverting
aws rds create-db-snapshot \
--db-instance-identifier statux-prod-rds \
--db-snapshot-identifier manual-pre-rollback-$(date +%Y%m%d%H%M)

# 2. Revert the migration
cd statux-api
npm run migration:revert:<app-name>

# 3. Verify the database state
psql -h <rds-endpoint> -U statux_admin -d statux -c \
"SELECT * FROM <schema>.migrations ORDER BY id DESC LIMIT 5;"

Database Restore from Snapshot

For severe database issues, restore from the latest automated or manual snapshot:

# List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier statux-prod-rds \
--query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-5:].[DBSnapshotIdentifier,SnapshotCreateTime,Status]' \
--output table

See the Database Restore runbook for the full restore procedure.


Step 4: Communication

Internal Communication

  1. Post in #incidents Slack channel with severity, affected service, and current status
  2. Tag the on-call engineer and team lead
  3. Update every 15 minutes for P1, every 30 minutes for P2
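The initial #incidents post can also be scripted. A sketch assuming a standard Slack incoming webhook; the URL below is a placeholder, pull the real one from the team's secrets store:

```shell
SLACK_WEBHOOK="https://hooks.slack.com/services/T000/B000/XXXXXXXX"  # placeholder

# Severity, affected service, and current status, per the checklist above
payload='{"text":":rotating_light: P2 | Alerting API | Investigating elevated error rate. On-call engineer and team lead tagged in thread."}'

curl -s -X POST -H 'Content-Type: application/json' \
  -d "$payload" "$SLACK_WEBHOOK"
```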

Status Page Updates

Use these templates for public status page updates:

Investigating

[Service Name] - Investigating Issues

We are currently investigating reports of [brief description of symptoms]. Our team is actively working to identify the root cause. We will provide an update within [15/30] minutes.

Identified

[Service Name] - Issue Identified

We have identified the cause of [brief description]. [Brief explanation of root cause]. Our team is implementing a fix. We expect resolution within [estimated time].

Monitoring

[Service Name] - Fix Deployed, Monitoring

A fix has been deployed for [brief description]. We are monitoring the system to confirm the issue is fully resolved. We will provide a final update once we are confident in the resolution.

Resolved

[Service Name] - Resolved

The issue affecting [brief description] has been resolved. [Brief explanation of what happened and what was done]. Total duration: [X hours Y minutes]. We apologize for any inconvenience and will be conducting a post-mortem to prevent recurrence.


Step 5: Post-Mortem

After the incident is resolved, create a post-mortem document within 48 hours:

  1. Timeline: Detailed chronological events
  2. Root cause: Technical explanation of what went wrong
  3. Impact: Number of affected users, duration, data impact
  4. Detection: How was the incident detected? Could we have detected it sooner?
  5. Response: What worked well? What could be improved?
  6. Action items: Concrete tasks to prevent recurrence, each with an owner and due date

RCA in Statux Insights

Use the Statux Insights RCA feature to create and track post-mortem documents with linked incidents and action items.