Monitoring Setup
Configure comprehensive monitoring and alerting for your Supascale infrastructure with real-time metrics, custom alerts, and performance tracking.
Supascale provides monitoring to help keep your infrastructure running reliably. This guide covers setting up monitoring, configuring alerts, and understanding the metrics available for your servers and instances.
Overview
Supascale monitoring includes:
- Real-time Metrics: CPU, memory, disk, and network monitoring
- Performance Tracking: Historical trends and baselines
- Alert Management: Configurable thresholds and notifications
- Health Checks: Service availability monitoring
- Custom Dashboards: Tailored views for different teams
Monitoring Architecture
Data Collection
Agent-Based Collection: The Supascale agent on each server collects metrics every 60 seconds:
    monitoring:
      enabled: true
      interval: 60s        # Collection frequency
      retention: 30d       # Data retention period
      buffer_size: 1000    # Metric buffer for network issues
      metrics:
        system: true       # CPU, memory, disk, network
        docker: true       # Container metrics
        application: true  # Instance-specific metrics
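The `buffer_size` setting suggests the agent holds samples locally while the network is unavailable. How the agent actually does this is not documented here; the following is only a minimal sketch of one plausible approach, a bounded queue that drops the oldest samples when full. The class name `MetricBuffer` and the drop-oldest policy are assumptions made for illustration.

```python
from collections import deque


class MetricBuffer:
    """Bounded in-memory buffer for samples that could not be sent yet.

    Hypothetical illustration: drop-oldest is an assumed policy,
    not documented Supascale agent behavior.
    """

    def __init__(self, size=1000):
        # A deque with maxlen silently discards the oldest entry when full.
        self._queue = deque(maxlen=size)

    def __len__(self):
        return len(self._queue)

    def add(self, sample):
        self._queue.append(sample)

    def flush(self, send):
        """Send samples oldest-first; stop at the first failure and keep the rest."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent
```

Here `send` is any callable that returns a truthy value on success, so a failed upload leaves the remaining samples in place for the next attempt.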
Collected Metrics:
System Metrics:
- CPU usage percentage (overall and per core)
- Memory usage (used, available, cached, buffers)
- Disk usage (used, available, I/O rates)
- Network I/O (bytes in/out, packets, errors)
- Load average (1m, 5m, 15m)
- System uptime
Container Metrics:
- Container CPU usage
- Container memory usage
- Container disk I/O
- Container network traffic
- Container status and health
Application Metrics:
- Database connections (active, idle, max)
- Database query performance
- API response times
- Authentication requests
- Storage usage and operations
Data Storage
Time-Series Database:
- Metrics stored in PostgreSQL with TimescaleDB
- Automatic data compression and archival
- Configurable retention policies
- Efficient querying for large datasets
Data Retention:
    retention_policies:
      raw_data: 7d            # Full resolution for 7 days
      hourly_aggregates: 30d  # Hourly averages for 30 days
      daily_aggregates: 1y    # Daily averages for 1 year
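The retention tiers imply that raw samples are rolled up into hourly and daily averages before the full-resolution data expires. As a rough sketch of that rollup, assuming samples are `(unix_timestamp, value)` pairs (the function name `rollup` is hypothetical and not part of Supascale):

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


raw = [(0, 10.0), (60, 20.0), (3600, 30.0), (3660, 50.0)]
hourly = rollup(raw, 3600)   # one average per hour
daily = rollup(raw, 86400)   # one average per day
```

In a TimescaleDB-backed store this kind of aggregation would normally run inside the database; the Python version only illustrates the arithmetic.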
Setting Up Monitoring
Enable Monitoring
Monitoring is enabled by default. You can customize it at the server level and the instance level:
Server-Level Configuration
    # /etc/supascale/config.yaml
    monitoring:
      enabled: true
      metrics:
        system: true
        detailed_process: true  # Per-process metrics
        docker_stats: true      # Docker container stats
      collection:
        interval: 60s
        timeout: 30s
        retries: 3
Instance-Level Monitoring
    instance_monitoring:
      database:
        enabled: true
        slow_query_log: true
        connection_tracking: true
      api:
        response_time: true
        error_rate: true
        throughput: true
      storage:
        usage_tracking: true
        upload_monitoring: true
Dashboard Overview
Access monitoring through Dashboard → Monitoring:
Infrastructure Overview:
- Server health status
- Resource utilization summary
- Active alerts count
- Performance trends
Server Details:
- Individual server metrics
- Instance distribution
- Resource consumption
- Historical performance
Instance Metrics:
- Per-instance performance
- Service health status
- Database metrics
- API performance
Alert Configuration
Alert Rules
Create custom alert rules based on metrics:
Navigate to Alerts
- Go to Monitoring → Alert Rules
- Click "Create Alert Rule"
Configure Alert Rule
Basic Settings:
    name: "High CPU Usage"
    description: "Alert when CPU usage exceeds 90%"
    severity: "warning"  # info, warning, critical
    enabled: true
Condition Configuration:
    condition:
      metric: "cpu_usage_percent"
      aggregation: "avg"  # avg, max, min, sum
      time_window: "5m"   # Evaluation window
      threshold: 90       # Alert threshold
      operator: ">"       # >, <, >=, <=, ==, !=
Alert Frequency:
    evaluation:
      interval: "1m"          # Check every minute
      for: "5m"               # Must be true for 5 minutes
      resolve_timeout: "10m"  # Auto-resolve after 10 minutes
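To make the interaction between `time_window`, `interval`, and `for` concrete, here is a minimal sketch of how such a rule could be evaluated. It assumes the condition is re-checked at every evaluation tick and must hold at each tick across the whole `for` period. The function names and the exact semantics are assumptions for illustration, not a description of Supascale's evaluator.

```python
import operator
from statistics import mean

OPERATORS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
             "<=": operator.le, "==": operator.eq, "!=": operator.ne}
AGGREGATIONS = {"avg": mean, "max": max, "min": min, "sum": sum}


def condition_met(samples, now, window_s, aggregation, op, threshold):
    """Aggregate the samples inside the window ending at `now`, then compare."""
    values = [v for ts, v in samples if now - window_s < ts <= now]
    if not values:
        return False  # no data: treat the condition as not met
    return OPERATORS[op](AGGREGATIONS[aggregation](values), threshold)


def should_fire(samples, now, window_s, for_s, interval_s,
                aggregation, op, threshold):
    """True if the condition held at every evaluation tick in the last `for_s` seconds."""
    ticks = range(now - for_s, now + 1, interval_s)
    return all(condition_met(samples, t, window_s, aggregation, op, threshold)
               for t in ticks)


# One sample per minute for 15 minutes, CPU steadily at 95%.
steady = [(t, 95.0) for t in range(0, 901, 60)]
fires = should_fire(steady, now=900, window_s=300, for_s=300,
                    interval_s=60, aggregation="avg", op=">", threshold=90)
```

Under these assumptions a single spike does not fire the rule, because the 5-minute average has to stay above the threshold for the full `for` period.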
Common Alert Rules
Server Health Alerts:
High CPU Usage:
    alert_rules:
      high_cpu:
        condition: cpu_usage_percent > 90
        duration: 5m
        severity: warning
        message: "Server {{server_name}} CPU usage is {{value}}%"
Memory Usage:
    alert_rules:
      high_memory:
        condition: memory_usage_percent > 95
        duration: 2m
        severity: critical
        message: "Server {{server_name}} memory usage is {{value}}%"
Disk Space:
    alert_rules:
      low_disk_space:
        condition: disk_usage_percent > 90
        duration: 1m
        severity: warning
        message: "Server {{server_name}} disk usage is {{value}}%"
Instance Health Alerts:
Database Connection Issues:
    alert_rules:
      db_connection_limit:
        condition: db_active_connections > 80
        duration: 3m
        severity: critical
        message: "Instance {{instance_name}} has {{value}} active connections"
API Response Time:
    alert_rules:
      slow_api_response:
        condition: api_response_time_avg > 2000
        duration: 5m
        severity: warning
        message: "Instance {{instance_name}} API response time is {{value}}ms"
Service Downtime:
    alert_rules:
      service_down:
        condition: service_up == 0
        duration: 1m
        severity: critical
        message: "Service {{service_name}} is down on {{instance_name}}"
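The `message` fields in these rules use `{{placeholder}}` variables. The template engine Supascale uses is not specified, so the following is only a sketch of how such placeholders could be filled in; the `render` helper is hypothetical.

```python
import re


def render(template, context):
    """Replace {{name}} placeholders with values from `context`.

    Unknown placeholders are left untouched so they stay visible in the output.
    """
    def substitute(match):
        key = match.group(1)
        return str(context.get(key, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


message = render(
    "Server {{server_name}} CPU usage is {{value}}%",
    {"server_name": "prod-01", "value": 93},
)
```

Leaving unknown placeholders in place, rather than replacing them with an empty string, makes a misspelled variable name easy to spot in a delivered alert.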
Notification Channels
Configure how alerts are delivered:
Email Notifications:
    notifications:
      email:
        enabled: true
        recipients:
          - "admin@company.com"
          - "ops-team@company.com"
        templates:
          subject: "[{{severity}}] {{alert_name}}"
          body: |
            Alert: {{alert_name}}
            Severity: {{severity}}
            Message: {{message}}
            Time: {{timestamp}}
            View in dashboard: {{dashboard_url}}
Slack Integration:
    notifications:
      slack:
        enabled: true
        webhook_url: "https://hooks.slack.com/services/..."
        channel: "#alerts"
        username: "Supascale"
        message_format: |
          {{emoji}} *{{alert_name}}*
          Severity: {{severity}}
          {{message}}
          <{{dashboard_url}}|View Dashboard>
Webhook Notifications:
    notifications:
      webhook:
        enabled: true
        url: "https://your-api.com/alerts"
        method: "POST"
        headers:
          Authorization: "Bearer token"
          Content-Type: "application/json"
        payload: |
          {
            "alert": "{{alert_name}}",
            "severity": "{{severity}}",
            "message": "{{message}}",
            "timestamp": "{{timestamp}}",
            "server": "{{server_name}}",
            "instance": "{{instance_name}}"
          }
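On the receiving side, the endpoint needs to parse and validate this payload before acting on it. Below is a minimal sketch of that step, assuming the body arrives exactly in the shape of the payload template above. The `parse_alert` helper and the choice to treat every field as required are assumptions for illustration.

```python
import json

# Field names taken from the payload template in the webhook configuration.
REQUIRED_FIELDS = ("alert", "severity", "message",
                   "timestamp", "server", "instance")


def parse_alert(body):
    """Parse a webhook request body and return the alert as a dict.

    Raises ValueError if the body is not a JSON object or lacks a field.
    """
    try:
        alert = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(alert, dict):
        raise ValueError("payload must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in alert]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return alert
```

A handler would call `parse_alert` on the raw request body and respond with a 4xx status when it raises, so that malformed deliveries are visible on the sending side.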
Custom Dashboards
Creating Dashboards
Dashboard Builder
- Go to Monitoring → Dashboards
- Click "Create Dashboard"
- Choose a template or build a custom dashboard
Dashboard Types
Infrastructure Overview:
- Server health summary
- Resource utilization
- Performance trends
- Alert status
Application Performance:
- Instance metrics
- Database performance
- API response times
- User activity
Operational Dashboards:
- SLA tracking
- Capacity planning
- Incident management
- Business metrics
Dashboard Widgets
Metric Visualizations:
Time Series Charts:
    widgets:
      cpu_chart:
        type: "line_chart"
        title: "CPU Usage Over Time"
        metrics:
          - "cpu_usage_percent"
        time_range: "24h"
        aggregation: "avg"
Gauge Charts:
    widgets:
      memory_gauge:
        type: "gauge"
        title: "Current Memory Usage"
        metric: "memory_usage_percent"
        thresholds:
          - value: 80
            color: "yellow"
          - value: 95
            color: "red"
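The gauge `thresholds` list maps value ranges to colors. One plausible reading, sketched below, is that the gauge takes the color of the highest threshold the value has reached. The `gauge_color` function, the inclusive comparison, and the default color `"green"` are assumptions, since the configuration does not state them.

```python
def gauge_color(value, thresholds, default="green"):
    """Return the color of the highest threshold that `value` has reached."""
    color = default
    # Walk thresholds in ascending order so the last match wins.
    for threshold in sorted(thresholds, key=lambda t: t["value"]):
        if value >= threshold["value"]:
            color = threshold["color"]
    return color


thresholds = [
    {"value": 80, "color": "yellow"},
    {"value": 95, "color": "red"},
]
colors = [gauge_color(v, thresholds) for v in (50, 85, 97)]
```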
Status Indicators:
    widgets:
      server_status:
        type: "status_grid"
        title: "Server Health"
        items:
          - server: "prod-01"
            status: "healthy"
          - server: "prod-02"
            status: "warning"
Performance Analysis
Baseline Establishment
Supascale automatically establishes performance baselines:
Learning Period: 7 days of normal operation
Baseline Metrics:
- Normal CPU usage patterns
- Expected memory consumption
- Typical network traffic
- Database performance patterns
Anomaly Detection:
    anomaly_detection:
      enabled: true
      sensitivity: "medium"  # low, medium, high
      algorithms:
        - "seasonal_decomposition"
        - "isolation_forest"
        - "z_score"
      thresholds:
        cpu: 2.5  # Standard deviations from baseline
        memory: 2.0
        disk_io: 3.0
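Of the listed algorithms, `z_score` is the simplest to illustrate: a value is flagged when it lies more than the configured number of standard deviations from the baseline mean. The sketch below shows that calculation only. The function names are hypothetical, and how Supascale combines the three algorithms is not covered here.

```python
from statistics import mean, stdev


def z_score(baseline, value):
    """Number of sample standard deviations `value` lies from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation is infinitely unusual.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma


def is_anomaly(baseline, value, threshold):
    """True if `value` deviates from the baseline by more than `threshold` sigmas."""
    return abs(z_score(baseline, value)) > threshold


# CPU usage observed during the learning period (percent).
cpu_baseline = [40.0, 42.0, 38.0, 41.0, 39.0]
flagged = is_anomaly(cpu_baseline, 45.0, threshold=2.5)
normal = is_anomaly(cpu_baseline, 42.0, threshold=2.5)
```

With this baseline (mean 40, standard deviation about 1.58), a reading of 45 is roughly 3.2 standard deviations out and exceeds the `cpu: 2.5` threshold, while 42 is about 1.3 and does not.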
Trend Analysis
Capacity Planning:
- Growth rate analysis
- Resource consumption trends
- Scalability recommendations
- Infrastructure optimization
Performance Optimization:
- Bottleneck identification
- Resource efficiency analysis
- Cost optimization opportunities
- Architecture recommendations
Troubleshooting Monitoring
Common Issues
Missing Metrics:
    # Check agent monitoring status
    sudo supascale-agent status --monitoring

    # Verify monitoring configuration
    sudo supascale-agent config show monitoring

    # Check for collection errors
    sudo journalctl -u supascale-agent | grep monitoring
Alert Not Firing:
    debugging:
      # Check alert rule syntax
      alert_rules:
        validate: true
        dry_run: true
      # Verify metric data
      metrics:
        query: "cpu_usage_percent[5m]"
        debug: true
Performance Impact:
    optimization:
      monitoring:
        batch_size: 100     # Reduce batch size
        interval: 120s      # Increase collection interval
        compression: true   # Enable data compression
        sampling_rate: 0.1  # Sample 10% of high-frequency metrics
Log Analysis
Monitoring Logs:
    # View monitoring activity
    sudo journalctl -u supascale-agent -f | grep -i monitoring

    # Check metric collection performance
    tail -f /var/log/supascale/metrics.log

    # Analyze alert evaluation
    tail -f /var/log/supascale/alerts.log
Integration with External Tools
Grafana Integration
Export metrics to Grafana for advanced visualization:
    integrations:
      grafana:
        enabled: true
        prometheus_endpoint: "http://localhost:9090"
        datasource_url: "http://supascale-api/metrics"
        dashboards:
          - "infrastructure_overview"
          - "application_performance"
Prometheus Compatibility
Expose metrics in Prometheus format:
    integrations:
      prometheus:
        enabled: true
        endpoint: "/metrics"
        port: 9100
        labels:
          environment: "production"
          region: "us-west-2"
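For reference, a sample in the Prometheus text exposition format is a metric name, an optional set of `key="value"` labels in braces, and a numeric value. The sketch below renders one such line using the labels from the configuration above. The `format_metric` helper is hypothetical, and the metric names Supascale actually exports may differ.

```python
def format_metric(name, value, labels=None):
    """Render one sample as a line of Prometheus text exposition format."""
    def escape(text):
        # Label values must escape backslash, double quote, and newline.
        return (text.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))

    labels = labels or {}
    label_str = ",".join(f'{key}="{escape(val)}"'
                         for key, val in sorted(labels.items()))
    if label_str:
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"


line = format_metric(
    "cpu_usage_percent", 42.5,
    {"environment": "production", "region": "us-west-2"},
)
```

A Prometheus server scraping the configured `/metrics` endpoint on port 9100 would expect one such line per sample.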
PagerDuty Integration
Configure PagerDuty for critical alerts:
    notifications:
      pagerduty:
        enabled: true
        integration_key: "your-integration-key"
        severity_mapping:
          critical: "critical"
          warning: "warning"
          info: "info"
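The `severity_mapping` translates Supascale severities into PagerDuty severities. As an illustration, the sketch below builds a trigger event in the shape used by the PagerDuty Events API v2, where the integration key is sent as `routing_key`. Whether Supascale uses this API internally is an assumption, and `build_event` is a hypothetical helper.

```python
# Mirrors the severity_mapping block in the configuration above.
SEVERITY_MAPPING = {"critical": "critical", "warning": "warning", "info": "info"}


def build_event(integration_key, alert_name, message, source, severity):
    """Build a PagerDuty Events API v2 'trigger' event body as a dict."""
    return {
        "routing_key": integration_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{alert_name}: {message}",
            "source": source,
            # Raises KeyError for a severity missing from the mapping.
            "severity": SEVERITY_MAPPING[severity],
        },
    }


event = build_event(
    "your-integration-key",
    "High CPU Usage",
    "Server prod-01 CPU usage is 93%",
    source="prod-01",
    severity="critical",
)
```

The resulting dict would be serialized to JSON and posted to the Events API; sending it is left out so the example stays free of network calls.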
Next: Learn about Backup Management to protect your data with automated backups.