# Incident Response

## Severity Levels
| Level | Description | Examples | Response Time | Escalation |
|---|---|---|---|---|
| P1 | Full service outage | All APIs unreachable, database down, complete data loss | Immediate | All hands, notify leadership within 15 min |
| P2 | Major degradation | Single API down, high error rate (>5%), billing failures | < 30 minutes | On-call engineer + team lead |
| P3 | Partial degradation | Elevated latency, intermittent errors, single feature broken | < 2 hours | On-call engineer |
| P4 | Minor issue | Cosmetic bug, non-critical feature, documentation error | < 24 hours | Normal sprint workflow |
## Escalation Criteria
Escalate from P3 to P2 if:
- Issue persists for more than 30 minutes
- More than 3 customers report the issue
- Error rate exceeds 5%
Escalate from P2 to P1 if:
- Multiple services are affected
- Data integrity is at risk
- Issue persists for more than 15 minutes without a mitigation path
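The escalation criteria above can be sketched as a small decision helper. This is a minimal illustration, not tooling we ship: the function name is hypothetical, the error rate is an integer percent, and "minutes without a mitigation path" is simplified to elapsed minutes.

```shell
# Sketch: suggest an escalation based on the criteria above.
# Args: current level, minutes elapsed, error rate (%), customer reports,
# affected service count, data-integrity risk ("yes"/"no").
suggest_escalation() {
  local level="$1" minutes="$2" error_rate="$3" reports="$4" services="$5" data_risk="$6"
  case "$level" in
    P3)
      # P3 -> P2: >30 min, >3 customer reports, or >5% error rate
      if [ "$minutes" -gt 30 ] || [ "$reports" -gt 3 ] || [ "$error_rate" -gt 5 ]; then
        echo "escalate to P2"
      else
        echo "stay at P3"
      fi ;;
    P2)
      # P2 -> P1: multiple services, data at risk, or >15 min unmitigated
      if [ "$services" -gt 1 ] || [ "$data_risk" = "yes" ] || [ "$minutes" -gt 15 ]; then
        echo "escalate to P1"
      else
        echo "stay at P2"
      fi ;;
    *) echo "no automatic rule for $level" ;;
  esac
}
```

A rule of thumb like this is only a prompt for the on-call engineer; the final severity call stays with a human.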
## Response Steps

1. **Acknowledge**: Note the time and initial symptoms
2. **Assess**: Check health endpoints, logs, and metrics
3. **Communicate**: Update stakeholders and the status page
4. **Mitigate**: Roll back, restart, or apply a fix
5. **Resolve**: Confirm service is restored
6. **Post-mortem**: Document the root cause and action items
## Step 1: Initial Assessment

### Check Health Endpoints

```shell
export AWS_PROFILE=statux-main

# Check all API health endpoints
curl -s https://statuspage-api.statux.io/api/v1/health | jq .
curl -s https://alerts-api.statux.io/api/v1/health | jq .
curl -s https://synthetics-api.statux.io/api/v1/health | jq .
curl -s https://insights-api.statux.io/api/v1/health | jq .
curl -s https://platform-api.statux.io/api/v1/health | jq .
```
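During an incident it is easy to miss one endpoint when pasting the commands one by one. The sketch below loops over any list of URLs and flags non-200 responses; `check_endpoints` and `probe_url` are hypothetical helper names, and the probe command is injectable so the loop can be exercised without network access.

```shell
# Sketch: check a list of health endpoints and flag non-200 responses.
# CHECK_CMD may name an alternate probe command (useful for testing);
# the default probe curls the URL and prints only the HTTP status code.
check_endpoints() {
  local probe="${CHECK_CMD:-probe_url}" url code
  for url in "$@"; do
    code=$("$probe" "$url")
    if [ "$code" = "200" ]; then
      echo "$url OK"
    else
      echo "$url FAIL ($code)"
    fi
  done
}

probe_url() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Usage (against the endpoints above):
# check_endpoints \
#   https://statuspage-api.statux.io/api/v1/health \
#   https://alerts-api.statux.io/api/v1/health
```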
### Check ALB Target Health

```shell
# Get target group ARNs (check each API)
aws elbv2 describe-target-groups \
  --query 'TargetGroups[*].[TargetGroupName,TargetGroupArn]' \
  --output table

# Check target health for a specific target group
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>
```
### Check Docker Logs via SSM

```shell
# List running instances
aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=statux-prod-asg-api" \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PrivateIpAddress]' \
  --output table

# Get Docker logs from an instance
aws ssm send-command \
  --instance-ids <instance-id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-api --tail 200 --since 10m"]'

# Retrieve the command output
aws ssm get-command-invocation \
  --command-id <command-id> \
  --instance-id <instance-id> \
  --query 'StandardOutputContent' \
  --output text
```
Each API uses a different container name:
- Statuspages: `statux-api`
- Alerting: `statux-alerts-api`
- Synthetics: `statux-synthetics-api`
- Insights: `statux-insights-api`
- Platform: `statux-platform-api`
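To avoid mistyping a container name mid-incident, the mapping above can be captured in a small lookup function. This is a sketch; `container_for` is a hypothetical name, and the values come directly from the list above.

```shell
# Sketch: map an app name to its container name so log commands can be
# parameterized instead of copy-pasted per service.
container_for() {
  case "$1" in
    statuspages) echo "statux-api" ;;
    alerting)    echo "statux-alerts-api" ;;
    synthetics)  echo "statux-synthetics-api" ;;
    insights)    echo "statux-insights-api" ;;
    platform)    echo "statux-platform-api" ;;
    *) echo "unknown app: $1" >&2; return 1 ;;
  esac
}

# Usage inside an SSM command, e.g.:
# docker logs "$(container_for alerting)" --tail 200 --since 10m
```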
### Check RDS Connectivity

```shell
# Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier statux-prod-rds \
  --query 'DBInstances[0].[DBInstanceStatus,Endpoint.Address,DBInstanceClass]' \
  --output table

# Check active connections (via bastion or SSM)
psql -h <rds-endpoint> -U statux_admin -d statux -c \
  "SELECT datname, numbackends FROM pg_stat_database WHERE datname = 'statux';"

# Check for long-running queries
psql -h <rds-endpoint> -U statux_admin -d statux -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state != 'idle' AND now() - pg_stat_activity.query_start > interval '30 seconds'
   ORDER BY duration DESC;"
```
## Step 2: Per-Service Troubleshooting

### Statuspages API (Port 3000)
Common issues:
- Subscriber email delivery failures (check Resend API status)
- High traffic on public status pages
- Incident webhook delivery timeouts
What to check:
```shell
# Check the ASG
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names statux-prod-asg-api \
  --query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[*].HealthStatus]'

# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
### Alerting API (Port 3001)
Common issues:
- Alert delivery delays (Twilio, push notifications)
- Escalation chain failures
- High alert volume causing queue backlog
What to check:
```shell
# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-alerts-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
### Synthetics API (Port 3002)
Common issues:
- Check execution timeouts
- Relay disconnections
- False positives from network issues in check regions
What to check:
```shell
# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-synthetics-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
### Insights API (Port 3003)
Common issues:
- AWS Bedrock throttling or timeouts
- Webhook ingestion failures
- Usage budget exceeded
What to check:
```shell
# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-insights-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
### Platform API (Port 3004)
Common issues:
- Stripe webhook delivery failures
- Cognito authentication issues
- SCIM provisioning errors
What to check:
```shell
# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-platform-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
## Step 3: Rollback Procedures

### Quick Rollback (ASG Instance Refresh)
If a recent deployment caused the issue, roll back to the previous Docker image:
```shell
# 1. Find the previous working image tag
aws ecr describe-images \
  --repository-name <ecr-repo-name> \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:].imageTags' \
  --output table

# 2. Tag the previous good image as "latest"
GOOD_TAG="<previous-sha>"
REPO="255982108053.dkr.ecr.us-east-1.amazonaws.com/<ecr-repo-name>"

# Pull, retag, and push
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 255982108053.dkr.ecr.us-east-1.amazonaws.com
docker pull $REPO:$GOOD_TAG
docker tag $REPO:$GOOD_TAG $REPO:latest
docker push $REPO:latest

# 3. Trigger instance refresh
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name <asg-name> \
  --preferences '{"MinHealthyPercentage":50,"InstanceWarmup":120}'

# 4. Monitor the refresh
watch -n 10 'aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name <asg-name> \
  --query "InstanceRefreshes[0].[Status,PercentageComplete]" \
  --output text'
```
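If you would rather block until the refresh finishes than watch it, the monitoring step can be wrapped in a polling loop. This is a sketch: `wait_for_refresh` is a hypothetical name, and the status command is passed in so the loop can be tested with a stub; in practice it would run the `describe-instance-refreshes` query from step 4.

```shell
# Sketch: poll a status command until the instance refresh completes.
# Args: command printing the refresh Status, poll interval in seconds.
wait_for_refresh() {
  local status_cmd="$1" interval="${2:-10}" status
  while true; do
    status=$("$status_cmd")
    echo "refresh status: $status"
    case "$status" in
      Successful)       return 0 ;;
      Failed|Cancelled) return 1 ;;
    esac
    sleep "$interval"
  done
}

# Usage (assumed query, matching step 4 above):
# refresh_status() {
#   aws autoscaling describe-instance-refreshes \
#     --auto-scaling-group-name <asg-name> \
#     --query 'InstanceRefreshes[0].Status' --output text
# }
# wait_for_refresh refresh_status 10
```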
### ECR Repository and ASG Names
| App | ECR Repo | ASG Name |
|---|---|---|
| Statuspages | statux-api | statux-prod-asg-api |
| Alerting | statux-alerts-api | statux-prod-asg-alerts-api |
| Synthetics | statux-synthetics-api | statux-prod-asg-synthetics-api |
| Insights | statux-insights-api | statux-prod-asg-insights-api |
| Platform | statux-platform-api | statux-prod-asg-platform-api |
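The repo/ASG pairs above can also be looked up programmatically when scripting a rollback. A minimal sketch, with `rollback_targets` as a hypothetical name and the values taken from the table:

```shell
# Sketch: map an app name to "<ecr-repo> <asg-name>" per the table above.
rollback_targets() {
  case "$1" in
    statuspages) echo "statux-api statux-prod-asg-api" ;;
    alerting)    echo "statux-alerts-api statux-prod-asg-alerts-api" ;;
    synthetics)  echo "statux-synthetics-api statux-prod-asg-synthetics-api" ;;
    insights)    echo "statux-insights-api statux-prod-asg-insights-api" ;;
    platform)    echo "statux-platform-api statux-prod-asg-platform-api" ;;
    *) echo "unknown app: $1" >&2; return 1 ;;
  esac
}

# Usage: read -r REPO ASG <<< "$(rollback_targets insights)"
```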
### Database Rollback
If the issue is caused by a bad migration:
```shell
# 1. Take a snapshot before reverting
aws rds create-db-snapshot \
  --db-instance-identifier statux-prod-rds \
  --db-snapshot-identifier manual-pre-rollback-$(date +%Y%m%d%H%M)

# 2. Revert the migration
cd statux-api
npm run migration:revert:<app-name>

# 3. Verify the database state
psql -h <rds-endpoint> -U statux_admin -d statux -c \
  "SELECT * FROM <schema>.migrations ORDER BY id DESC LIMIT 5;"
```
### Database Restore from Snapshot
For severe database issues, restore from the latest automated or manual snapshot:
```shell
# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier statux-prod-rds \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-5:].[DBSnapshotIdentifier,SnapshotCreateTime,Status]' \
  --output table
```
See the Database Restore runbook for the full restore procedure.
## Step 4: Communication

### Internal Communication
- Post in the `#incidents` Slack channel with severity, affected service, and current status
- Tag the on-call engineer and team lead
- Update every 15 minutes for P1, every 30 minutes for P2
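The update cadence can be encoded so a notification script or reminder bot uses the same numbers as the policy. A sketch, with `update_interval_min` as a hypothetical name:

```shell
# Sketch: minutes between status updates per severity, per the policy above.
update_interval_min() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 30 ;;
    *)  echo "no fixed cadence for $1" >&2; return 1 ;;
  esac
}
```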
### Status Page Updates
Use these templates for public status page updates:
#### Investigating

> **[Service Name] - Investigating Issues**
>
> We are currently investigating reports of [brief description of symptoms]. Our team is actively working to identify the root cause. We will provide an update within [15/30] minutes.

#### Identified

> **[Service Name] - Issue Identified**
>
> We have identified the cause of [brief description]. [Brief explanation of root cause]. Our team is implementing a fix. We expect resolution within [estimated time].

#### Monitoring

> **[Service Name] - Fix Deployed, Monitoring**
>
> A fix has been deployed for [brief description]. We are monitoring the system to confirm the issue is fully resolved. We will provide a final update once we are confident in the resolution.

#### Resolved

> **[Service Name] - Resolved**
>
> The issue affecting [brief description] has been resolved. [Brief explanation of what happened and what was done]. Total duration: [X hours Y minutes]. We apologize for any inconvenience and will be conducting a post-mortem to prevent recurrence.
## Step 5: Post-Mortem
After the incident is resolved, create a post-mortem document within 48 hours:
- **Timeline**: Detailed chronological events
- **Root cause**: Technical explanation of what went wrong
- **Impact**: Number of affected users, duration, data impact
- **Detection**: How was the incident detected? Could we have detected it sooner?
- **Response**: What worked well? What could be improved?
- **Action items**: Concrete tasks to prevent recurrence, each with an owner and due date
Use the Statux Insights RCA feature to create and track post-mortem documents with linked incidents and action items.