# Incident Response

## Severity Levels
| Level | Description | Examples | Response Time | Escalation |
|---|---|---|---|---|
| P1 | Full service outage | All APIs unreachable, database down, complete data loss | Immediate | All hands, notify leadership within 15 min |
| P2 | Major degradation | Single API down, high error rate (>5%), billing failures | < 30 minutes | On-call engineer + team lead |
| P3 | Partial degradation | Elevated latency, intermittent errors, single feature broken | < 2 hours | On-call engineer |
| P4 | Minor issue | Cosmetic bug, non-critical feature, documentation error | < 24 hours | Normal sprint workflow |
## Escalation Criteria
Escalate from P3 to P2 if:
- Issue persists for more than 30 minutes
- More than 3 customers report the issue
- Error rate exceeds 5%
Escalate from P2 to P1 if:
- Multiple services are affected
- Data integrity is at risk
- Issue persists for more than 15 minutes without a mitigation path
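The escalation criteria above can be sketched as a small decision helper. This is a minimal illustration, not tooling we ship: the function name is hypothetical, the error rate is an integer percent, and "minutes without a mitigation path" is simplified to elapsed minutes.

```shell
# Sketch: suggest an escalation based on the criteria above.
# Args: current level, minutes elapsed, error rate (%), customer reports,
# affected service count, data-integrity risk ("yes"/"no").
suggest_escalation() {
  local level="$1" minutes="$2" error_rate="$3" reports="$4" services="$5" data_risk="$6"
  case "$level" in
    P3)
      # P3 -> P2: >30 min, >3 customer reports, or >5% error rate
      if [ "$minutes" -gt 30 ] || [ "$reports" -gt 3 ] || [ "$error_rate" -gt 5 ]; then
        echo "escalate to P2"
      else
        echo "stay at P3"
      fi ;;
    P2)
      # P2 -> P1: multiple services, data at risk, or >15 min unmitigated
      if [ "$services" -gt 1 ] || [ "$data_risk" = "yes" ] || [ "$minutes" -gt 15 ]; then
        echo "escalate to P1"
      else
        echo "stay at P2"
      fi ;;
    *) echo "no automatic rule for $level" ;;
  esac
}
```

A rule of thumb like this is only a prompt for the on-call engineer; the final severity call stays with a human.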
## Response Steps

1. **Acknowledge**: Note the time and initial symptoms
2. **Assess**: Check health endpoints, logs, and metrics
3. **Communicate**: Update stakeholders and the status page
4. **Mitigate**: Roll back, restart, or apply a fix
5. **Resolve**: Confirm service is restored
6. **Post-mortem**: Document the root cause and action items
## Step 1: Initial Assessment

### Check Health Endpoints

```shell
export AWS_PROFILE=statux-main

# Check all API health endpoints
curl -s https://statuspage-api.statux.io/api/v1/health | jq .
curl -s https://alerts-api.statux.io/api/v1/health | jq .
curl -s https://synthetics-api.statux.io/api/v1/health | jq .
curl -s https://insights-api.statux.io/api/v1/health | jq .
curl -s https://platform-api.statux.io/api/v1/health | jq .
```
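During an incident it is easy to miss one endpoint when pasting the commands one by one. The sketch below loops over any list of URLs and flags non-200 responses; `check_endpoints` and `probe_url` are hypothetical helper names, and the probe command is injectable so the loop can be exercised without network access.

```shell
# Sketch: check a list of health endpoints and flag non-200 responses.
# CHECK_CMD may name an alternate probe command (useful for testing);
# the default probe curls the URL and prints only the HTTP status code.
check_endpoints() {
  local probe="${CHECK_CMD:-probe_url}" url code
  for url in "$@"; do
    code=$("$probe" "$url")
    if [ "$code" = "200" ]; then
      echo "$url OK"
    else
      echo "$url FAIL ($code)"
    fi
  done
}

probe_url() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Usage (against the endpoints above):
# check_endpoints \
#   https://statuspage-api.statux.io/api/v1/health \
#   https://alerts-api.statux.io/api/v1/health
```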
### Check ALB Target Health

```shell
# Get target group ARNs (check each API)
aws elbv2 describe-target-groups \
  --query 'TargetGroups[*].[TargetGroupName,TargetGroupArn]' \
  --output table

# Check target health for a specific target group
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>
```
### Check Docker Logs via SSM

```shell
# List running instances
aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=statux-prod-asg-api" \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PrivateIpAddress]' \
  --output table

# Get Docker logs from an instance
aws ssm send-command \
  --instance-ids <instance-id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-api --tail 200 --since 10m"]'

# Retrieve the command output
aws ssm get-command-invocation \
  --command-id <command-id> \
  --instance-id <instance-id> \
  --query 'StandardOutputContent' \
  --output text
```
Each API uses a different container name:
- Statuspages: `statux-api`
- Alerting: `statux-alerts-api`
- Synthetics: `statux-synthetics-api`
- Insights: `statux-insights-api`
- Platform: `statux-platform-api`
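To avoid mistyping a container name mid-incident, the mapping above can be captured in a small lookup function. This is a sketch; `container_for` is a hypothetical name, and the values come directly from the list above.

```shell
# Sketch: map an app name to its container name so log commands can be
# parameterized instead of copy-pasted per service.
container_for() {
  case "$1" in
    statuspages) echo "statux-api" ;;
    alerting)    echo "statux-alerts-api" ;;
    synthetics)  echo "statux-synthetics-api" ;;
    insights)    echo "statux-insights-api" ;;
    platform)    echo "statux-platform-api" ;;
    *) echo "unknown app: $1" >&2; return 1 ;;
  esac
}

# Usage inside an SSM command, e.g.:
# docker logs "$(container_for alerting)" --tail 200 --since 10m
```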
### Check RDS Connectivity

```shell
# Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier statux-prod-rds \
  --query 'DBInstances[0].[DBInstanceStatus,Endpoint.Address,DBInstanceClass]' \
  --output table

# Check active connections (via bastion or SSM)
psql -h <rds-endpoint> -U statux_admin -d statux -c \
  "SELECT datname, numbackends FROM pg_stat_database WHERE datname = 'statux';"

# Check for long-running queries
psql -h <rds-endpoint> -U statux_admin -d statux -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state != 'idle' AND now() - pg_stat_activity.query_start > interval '30 seconds'
   ORDER BY duration DESC;"
```
## Step 2: Per-Service Troubleshooting

### Statuspages API (Port 3000)
Common issues:
- Subscriber email delivery failures (check Resend API status)
- High traffic on public status pages
- Incident webhook delivery timeouts
What to check:
```shell
# Check the ASG
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names statux-prod-asg-api \
  --query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[*].HealthStatus]'

# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
### Alerting API (Port 3001)
Common issues:
- Alert delivery delays (Twilio, push notifications)
- Escalation chain failures
- High alert volume causing queue backlog
What to check:
```shell
# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-alerts-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
### Synthetics API (Port 3002)
Common issues:
- Check execution timeouts
- Relay disconnections
- False positives from network issues in check regions
What to check:
```shell
# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-synthetics-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
### Insights API (Port 3003)
Common issues:
- AWS Bedrock throttling or timeouts
- Webhook ingestion failures
- Usage budget exceeded
What to check:
```shell
# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-insights-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
### Platform API (Port 3004)
Common issues:
- Stripe webhook delivery failures
- Cognito authentication issues
- SCIM provisioning errors
What to check:
```shell
# Docker logs
aws ssm send-command --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs statux-platform-api --tail 100 --since 5m 2>&1 | grep -i error"]'
```
## Step 3: Rollback Procedures

### Quick Rollback (ASG Instance Refresh)
If a recent deployment caused the issue, roll back to the previous Docker image:
```shell
# 1. Find the previous working image tag
aws ecr describe-images \
  --repository-name <ecr-repo-name> \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:].imageTags' \
  --output table

# 2. Tag the previous good image as "latest"
GOOD_TAG="<previous-sha>"
REPO="255982108053.dkr.ecr.us-east-1.amazonaws.com/<ecr-repo-name>"

# Pull, retag, and push
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 255982108053.dkr.ecr.us-east-1.amazonaws.com
docker pull $REPO:$GOOD_TAG
docker tag $REPO:$GOOD_TAG $REPO:latest
docker push $REPO:latest

# 3. Trigger instance refresh
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name <asg-name> \
  --preferences '{"MinHealthyPercentage":50,"InstanceWarmup":120}'

# 4. Monitor the refresh
watch -n 10 'aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name <asg-name> \
  --query "InstanceRefreshes[0].[Status,PercentageComplete]" \
  --output text'
```
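If you would rather block until the refresh finishes than watch it, the monitoring step can be wrapped in a polling loop. This is a sketch: `wait_for_refresh` is a hypothetical name, and the status command is passed in so the loop can be tested with a stub; in practice it would run the `describe-instance-refreshes` query from step 4.

```shell
# Sketch: poll a status command until the instance refresh completes.
# Args: command printing the refresh Status, poll interval in seconds.
wait_for_refresh() {
  local status_cmd="$1" interval="${2:-10}" status
  while true; do
    status=$("$status_cmd")
    echo "refresh status: $status"
    case "$status" in
      Successful)       return 0 ;;
      Failed|Cancelled) return 1 ;;
    esac
    sleep "$interval"
  done
}

# Usage (assumed query, matching step 4 above):
# refresh_status() {
#   aws autoscaling describe-instance-refreshes \
#     --auto-scaling-group-name <asg-name> \
#     --query 'InstanceRefreshes[0].Status' --output text
# }
# wait_for_refresh refresh_status 10
```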
### ECR Repository and ASG Names
| App | ECR Repo | ASG Name |
|---|---|---|
| Statuspages | statux-api | statux-prod-asg-api |
| Alerting | statux-alerts-api | statux-prod-asg-alerts-api |
| Synthetics | statux-synthetics-api | statux-prod-asg-synthetics-api |
| Insights | statux-insights-api | statux-prod-asg-insights-api |
| Platform | statux-platform-api | statux-prod-asg-platform-api |
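The repo/ASG pairs above can also be looked up programmatically when scripting a rollback. A minimal sketch, with `rollback_targets` as a hypothetical name and the values taken from the table:

```shell
# Sketch: map an app name to "<ecr-repo> <asg-name>" per the table above.
rollback_targets() {
  case "$1" in
    statuspages) echo "statux-api statux-prod-asg-api" ;;
    alerting)    echo "statux-alerts-api statux-prod-asg-alerts-api" ;;
    synthetics)  echo "statux-synthetics-api statux-prod-asg-synthetics-api" ;;
    insights)    echo "statux-insights-api statux-prod-asg-insights-api" ;;
    platform)    echo "statux-platform-api statux-prod-asg-platform-api" ;;
    *) echo "unknown app: $1" >&2; return 1 ;;
  esac
}

# Usage: read -r REPO ASG <<< "$(rollback_targets insights)"
```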
### Database Rollback
If the issue is caused by a bad migration:
```shell
# 1. Take a snapshot before reverting
aws rds create-db-snapshot \
  --db-instance-identifier statux-prod-rds \
  --db-snapshot-identifier manual-pre-rollback-$(date +%Y%m%d%H%M)

# 2. Revert the migration
cd statux-api
npm run migration:revert:<app-name>

# 3. Verify the database state
psql -h <rds-endpoint> -U statux_admin -d statux -c \
  "SELECT * FROM <schema>.migrations ORDER BY id DESC LIMIT 5;"
```
### Database Restore from Snapshot
For severe database issues, restore from the latest automated or manual snapshot:
```shell
# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier statux-prod-rds \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-5:].[DBSnapshotIdentifier,SnapshotCreateTime,Status]' \
  --output table
```
See the Database Restore runbook for the full restore procedure.
## Step 4: Communication

### Internal Communication
- Post in the `#incidents` Slack channel with severity, affected service, and current status
- Tag the on-call engineer and team lead
- Update every 15 minutes for P1, every 30 minutes for P2
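The update cadence can be encoded so a notification script or reminder bot uses the same numbers as the policy. A sketch, with `update_interval_min` as a hypothetical name:

```shell
# Sketch: minutes between status updates per severity, per the policy above.
update_interval_min() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 30 ;;
    *)  echo "no fixed cadence for $1" >&2; return 1 ;;
  esac
}
```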
### Status Page Updates
Use these templates for public status page updates:
#### Investigating

> **[Service Name] - Investigating Issues**
>
> We are currently investigating reports of [brief description of symptoms]. Our team is actively working to identify the root cause. We will provide an update within [15/30] minutes.

#### Identified

> **[Service Name] - Issue Identified**
>
> We have identified the cause of [brief description]. [Brief explanation of root cause]. Our team is implementing a fix. We expect resolution within [estimated time].

#### Monitoring

> **[Service Name] - Fix Deployed, Monitoring**
>
> A fix has been deployed for [brief description]. We are monitoring the system to confirm the issue is fully resolved. We will provide a final update once we are confident in the resolution.

#### Resolved

> **[Service Name] - Resolved**
>
> The issue affecting [brief description] has been resolved. [Brief explanation of what happened and what was done]. Total duration: [X hours Y minutes]. We apologize for any inconvenience and will be conducting a post-mortem to prevent recurrence.
## Step 5: Post-Mortem
After the incident is resolved, create a post-mortem document within 48 hours:
- **Timeline**: Detailed chronological events
- **Root cause**: Technical explanation of what went wrong
- **Impact**: Number of affected users, duration, data impact
- **Detection**: How was the incident detected? Could we have detected it sooner?
- **Response**: What worked well? What could be improved?
- **Action items**: Concrete tasks to prevent recurrence, each with an owner and due date
Use the Statux Insights RCA feature to create and track post-mortem documents with linked incidents and action items.