
Overview

UltraBalancer implements intelligent health checking with automatic failover and circuit breaker patterns to ensure high availability and prevent cascading failures across your infrastructure.

  • Active Health Checks: Periodic HTTP health probes to all backend servers
  • Circuit Breaker: Automatic circuit breaking to prevent cascading failures
  • Auto Failover: Seamless traffic redirection to healthy backends
  • Recovery Detection: Automatic backend recovery and traffic restoration

Health Check Mechanism

UltraBalancer performs active health checks by sending HTTP HEAD requests to each backend server at configurable intervals.
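
The probe loop itself is simple. The sketch below (plain Node.js, illustrative only and not UltraBalancer's internal code; the backend address is a placeholder) sends a HEAD request to each backend's health path on a fixed interval, counts consecutive failures, and compares the count against max_failures:

health-probe-sketch.js
// Illustrative active health checker; not UltraBalancer's actual implementation
const http = require('http');

const INTERVAL_MS = 5000;  // mirrors interval_ms
const TIMEOUT_MS = 2000;   // mirrors timeout_ms
const MAX_FAILURES = 3;    // mirrors max_failures

const backends = [
  { host: '192.168.1.10', port: 8080, path: '/health', failures: 0, healthy: true },
];

function recordFailure(backend) {
  backend.failures += 1;
  if (backend.failures >= MAX_FAILURES) backend.healthy = false; // would be removed from rotation
}

function probe(backend) {
  const req = http.request(
    { host: backend.host, port: backend.port, path: backend.path, method: 'HEAD', timeout: TIMEOUT_MS },
    (res) => {
      res.resume(); // no body expected for HEAD
      if (res.statusCode === 200) {
        backend.failures = 0;
        backend.healthy = true;
      } else {
        recordFailure(backend);
      }
    }
  );
  req.on('timeout', () => req.destroy(new Error('health check timeout'))); // destroy() triggers 'error'
  req.on('error', () => recordFailure(backend));
  req.end();
}

setInterval(() => backends.forEach(probe), INTERVAL_MS);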

Health Check States

1. Healthy (Green): Backend responds successfully to health checks. Traffic is routed normally.
2. Degraded (Yellow): Some health checks fail but the failure threshold is not reached. Backend remains in rotation.
3. Unhealthy (Red): Failure threshold exceeded. Backend removed from rotation automatically.
4. Recovery (Blue): Previously failed backend passes health checks. Gradually restored to rotation.
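
How the four states relate to the failure counter and the max_failures threshold can be summarized in a few lines (illustrative pseudologic, not UltraBalancer's actual code; the field names are made up):

// Illustrative mapping from counters to the four health states
function healthState(backend, maxFailures) {
  if (!backend.inRotationFully && backend.consecutiveSuccesses > 0) return 'recovery'; // passing again, being restored
  if (backend.failures >= maxFailures) return 'unhealthy';                             // removed from rotation
  if (backend.failures > 0) return 'degraded';                                         // failing, still in rotation
  return 'healthy';
}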

Configuration

Basic Configuration

health_check:
  enabled: true
  interval_ms: 5000        # Check every 5 seconds
  timeout_ms: 2000         # 2 second timeout
  max_failures: 3          # Mark unhealthy after 3 failures
  path: "/health"          # Health check endpoint
  expected_status: 200     # Expected HTTP status code
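
With these defaults, the worst-case time to detect a dead backend is roughly interval_ms × max_failures plus one timeout: 5000 ms × 3 + 2000 ms ≈ 17 seconds between a backend going down and its removal from rotation. Tighten interval_ms and max_failures if you need faster failover (see the production recommendations later on this page).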

Advanced Configuration

config.yaml
health_check:
  enabled: true
  interval_ms: 3000
  timeout_ms: 1500
  max_failures: 5
  path: "/api/health"
  expected_status: 200

  # Circuit breaker configuration
  circuit_breaker:
    enabled: true
    failure_threshold: 5      # Open circuit after 5 failures
    success_threshold: 2      # Close after 2 successes
    timeout_seconds: 60       # Try again after 60 seconds
    half_open_requests: 3     # Test with 3 requests in half-open

  # Custom health check headers
  headers:
    User-Agent: "UltraBalancer-HealthCheck/2.0"
    X-Health-Check: "true"

  # Expected response body (optional)
  expected_body: '{"status":"healthy"}'

Health checks run asynchronously and don't block request processing. Once a backend exceeds the failure threshold, it is removed from rotation immediately.

Circuit Breaker Pattern

UltraBalancer implements the circuit breaker pattern to prevent cascading failures and give failing backends time to recover.

Circuit Breaker States

Closed

State: Circuit is closed, requests flow normally.
Behavior:
  • All requests are forwarded to the backend
  • Failures increment the failure counter
  • Successes reset the failure counter
Transition: Opens when the failure threshold is reached.

Open

State: Circuit is open, requests fail immediately.
Behavior:
  • Requests fail fast without contacting the backend
  • Prevents cascading failures
  • A timer starts for the recovery attempt
Transition: Moves to Half-Open after the timeout period.

# Log output when the circuit opens
WARN Backend 192.168.1.10:8080 marked DOWN after 5 failed checks
WARN Circuit breaker opened for backend 192.168.1.10:8080

Half-Open

State: Circuit is testing whether the backend has recovered.
Behavior:
  • A limited number of test requests are allowed through
  • If the tests succeed, the circuit closes
  • If the tests fail, the circuit reopens
Transition: Closes on success, reopens on failure.

# Log output during recovery
INFO Circuit breaker half-open for backend 192.168.1.10:8080
INFO Testing with 3 requests...
INFO Backend 192.168.1.10:8080 is UP - circuit closed
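
The transitions above form a small state machine. A minimal sketch in plain JavaScript (illustrative only, not UltraBalancer's internal code; parameter names mirror the configuration keys shown earlier):

circuit-breaker-sketch.js
// Minimal closed/open/half-open state machine (illustrative)
class CircuitBreaker {
  constructor({ failureThreshold = 5, successThreshold = 2, timeoutSeconds = 60, halfOpenRequests = 3 } = {}) {
    this.failureThreshold = failureThreshold;
    this.successThreshold = successThreshold;
    this.timeoutMs = timeoutSeconds * 1000;
    this.halfOpenRequests = halfOpenRequests;
    this.state = 'closed';
    this.failures = 0;
    this.successes = 0;
    this.openedAt = 0;
    this.halfOpenInFlight = 0;
  }

  // Returns true if a request may be sent to the backend right now
  allowRequest() {
    if (this.state === 'closed') return true;
    if (this.state === 'open') {
      if (Date.now() - this.openedAt >= this.timeoutMs) {
        this.state = 'half-open';   // timeout elapsed: start testing
        this.successes = 0;
        this.halfOpenInFlight = 0;
      } else {
        return false;               // fail fast without contacting the backend
      }
    }
    // half-open: only let a limited number of test requests through
    if (this.halfOpenInFlight < this.halfOpenRequests) {
      this.halfOpenInFlight += 1;
      return true;
    }
    return false;
  }

  recordSuccess() {
    if (this.state === 'half-open') {
      this.successes += 1;
      if (this.successes >= this.successThreshold) {
        this.state = 'closed';      // backend recovered
        this.failures = 0;
      }
    } else {
      this.failures = 0;            // closed: reset failure counter
    }
  }

  recordFailure() {
    if (this.state === 'half-open') {
      this.trip();                  // a failed test reopens the circuit
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.trip();
  }

  trip() {
    this.state = 'open';
    this.openedAt = Date.now();
  }
}

A request path would call allowRequest() before forwarding and report the outcome back through recordSuccess() or recordFailure().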

Circuit Breaker Configuration

backends:
  - host: "192.168.1.10"
    port: 8080
    weight: 100
    circuit_breaker:
      failure_threshold: 5
      success_threshold: 2
      timeout_seconds: 60
      half_open_requests: 3

Failover Strategies

Immediate Failover

When a backend exceeds its failure threshold, it is immediately removed from the rotation:
INFO Health checker started (interval: 5s, max_failures: 3)
WARN Backend 192.168.1.10:8080 health check failed (1/3)
WARN Backend 192.168.1.10:8080 health check failed (2/3)
WARN Backend 192.168.1.10:8080 health check failed (3/3)
WARN Backend 192.168.1.10:8080 marked DOWN
INFO Removing backend from rotation
INFO Active backends: 2/3

Graceful Recovery

When a backend recovers, it’s gradually restored:
INFO Backend 192.168.1.10:8080 health check succeeded
INFO Backend 192.168.1.10:8080 health check succeeded
INFO Backend 192.168.1.10:8080 marked UP
INFO Adding backend to rotation with reduced weight
INFO Active backends: 3/3
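
The "reduced weight" in the log above means a recovered backend is ramped back up rather than given full traffic at once. UltraBalancer's exact ramp is not documented here; a simple linear ramp could look like this sketch (illustrative, with made-up step and interval values):

// Illustrative linear weight ramp for a recovered backend (hypothetical parameters)
const RAMP_STEPS = 5;           // increments back to full weight
const RAMP_INTERVAL_MS = 10000; // time between increments

function restoreGradually(backend, fullWeight) {
  let step = 0;
  backend.weight = Math.ceil(fullWeight / RAMP_STEPS); // re-enter rotation at reduced weight
  const timer = setInterval(() => {
    step += 1;
    backend.weight = Math.min(fullWeight, Math.ceil((fullWeight * (step + 1)) / RAMP_STEPS));
    if (backend.weight >= fullWeight) clearInterval(timer); // fully restored
  }, RAMP_INTERVAL_MS);
}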

Zero-Downtime Failover

1. Failure Detection: Health check fails, failure counter increments.
2. Threshold Check: If failures ≥ max_failures, mark backend unhealthy.
3. Immediate Removal: Remove from load balancer rotation (no new connections).
4. Existing Connections: Allow in-flight requests to complete gracefully.
5. Traffic Redistribution: Redistribute traffic to healthy backends instantly.
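
Step 4, letting in-flight requests finish before the backend is fully dropped, is commonly called connection draining. A rough sketch of the idea (illustrative, not UltraBalancer's implementation; field names are made up):

// Illustrative connection draining: stop new traffic, wait for in-flight requests
async function drainBackend(backend, { pollMs = 500, maxWaitMs = 30000 } = {}) {
  backend.acceptingNew = false;                 // step 3: no new connections routed here
  const deadline = Date.now() + maxWaitMs;
  while (backend.inFlight > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, pollMs)); // step 4: wait for requests to finish
  }
  // step 5: remaining traffic has already been redistributed to healthy backends
}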

Health Check Endpoints

Backend Requirements

Your backend servers should expose a health check endpoint:
express.js
const express = require('express');
const app = express();

// Option 1: simple health check
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'healthy' });
});

// Option 2: advanced health check with dependency probes
// (register only one of the two /health handlers; Express dispatches to the first match)
// `db` and `redis` are assumed to be already-initialized client instances
app.get('/health', async (req, res) => {
  try {
    // Check database connection
    await db.ping();

    // Check Redis connection
    await redis.ping();

    res.status(200).json({
      status: 'healthy',
      timestamp: new Date().toISOString(),
      services: {
        database: 'up',
        cache: 'up'
      }
    });
  } catch (error) {
    res.status(503).json({
      status: 'unhealthy',
      error: error.message
    });
  }
});

app.listen(8080);

Monitoring Health Status

Real-Time Health Monitoring

Query the /metrics endpoint to see backend health status:
curl http://localhost:8080/metrics
{
  "backends": [
    {
      "address": "192.168.1.10:8080",
      "status": "healthy",
      "failures": 0,
      "circuit_state": "closed",
      "last_check": "2024-01-09T10:30:45Z"
    },
    {
      "address": "192.168.1.11:8080",
      "status": "unhealthy",
      "failures": 5,
      "circuit_state": "open",
      "last_check": "2024-01-09T10:30:43Z"
    }
  ]
}
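
To pull out just the health fields from that response, the output can be piped through jq (assuming jq is installed):

curl -s http://localhost:8080/metrics | jq '.backends[] | {address, status, circuit_state}'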

Prometheus Metrics

Monitor health check metrics in Prometheus:
# Health check success rate
ultrabalancer_health_check_success_total{backend="192.168.1.10:8080"}

# Health check failure rate
ultrabalancer_health_check_failure_total{backend="192.168.1.10:8080"}

# Backend availability
ultrabalancer_backend_available{backend="192.168.1.10:8080"}

# Circuit breaker state (0=closed, 1=open, 2=half-open)
ultrabalancer_circuit_breaker_state{backend="192.168.1.10:8080"}
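
Assuming the *_total series above are exported as Prometheus counters, two queries that tend to be useful are a per-backend failure rate and a filter for any circuit that is not closed:

# Health check failures per second over the last 5 minutes, per backend
rate(ultrabalancer_health_check_failure_total[5m])

# Backends whose circuit breaker is open (1) or half-open (2)
ultrabalancer_circuit_breaker_state > 0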

Grafana Dashboard

Create alerts based on health metrics:
grafana-alert.yaml
alert: BackendDown
expr: ultrabalancer_backend_available == 0
for: 1m
labels:
  severity: critical
annotations:
  summary: "Backend {{ $labels.backend }} is down"
  description: "Backend has failed health checks for 1 minute"

Best Practices

Optimal Health Check Intervals
  • Low traffic: 10-30 seconds (reduce backend load)
  • Medium traffic: 5-10 seconds (balanced)
  • High traffic: 2-5 seconds (fast failover)
  • Critical systems: 1-2 seconds (maximum availability)
Avoid These Common Mistakes
  1. Too aggressive intervals - Can overload backends with health checks
  2. Too lenient thresholds - Delays failover and affects user experience
  3. Heavy health checks - Keep endpoints lightweight (avoid DB queries)
  4. Same path as main traffic - Use dedicated /health endpoints
  5. No circuit breaker - Can cause cascading failures

Production Recommendations

config.yaml
health_check:
  # Fast detection for user-facing services
  enabled: true
  interval_ms: 3000
  timeout_ms: 1500
  max_failures: 3

  # Conservative circuit breaker
  circuit_breaker:
    enabled: true
    failure_threshold: 5
    success_threshold: 3
    timeout_seconds: 30

backends:
  - host: "backend1.prod.example.com"
    port: 8080
    weight: 100

  - host: "backend2.prod.example.com"
    port: 8080
    weight: 100

Troubleshooting

Backends marked unhealthy unexpectedly

Possible causes:
  • Firewall blocking health check requests
  • Health endpoint returning wrong status code
  • Timeout too aggressive for slow backends
Solutions:
# Test health endpoint manually
curl -I http://backend:8080/health

# Increase timeout
ultrabalancer -c config.yaml --health-check-timeout 5000

# Check firewall rules
iptables -L | grep 8080

Circuit breaker opens too frequently

Possible causes:
  • Failure threshold set too low
  • Backend genuinely overloaded
  • Network issues
Solutions:
circuit_breaker:
  failure_threshold: 10  # Increase threshold
  timeout_seconds: 120   # Longer recovery time

Backend not returning to rotation

Possible causes:
  • Circuit breaker timeout too long
  • Health check interval too long
  • Backend still returning errors
Solutions:
# Check backend logs
curl http://backend:8080/health -v

# Verify backend is truly healthy
for i in {1..10}; do curl -s -o /dev/null -w "%{http_code}\n" http://backend:8080/health; done