Overview
UltraBalancer implements intelligent health checking with automatic failover and circuit breaker patterns to ensure high availability and prevent cascading failures across your infrastructure.
Active Health Checks: Periodic HTTP health probes to all backend servers
Circuit Breaker: Automatically opens failing circuits to prevent cascading failures
Auto Failover: Seamless traffic redirection to healthy backends
Recovery Detection: Automatic backend recovery and traffic restoration
Health Check Mechanism
UltraBalancer performs active health checks by sending HTTP HEAD requests to each backend server at configurable intervals.
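To make the mechanism concrete, here is a minimal Go sketch of an active checker that probes one backend with HEAD requests, counts consecutive failures, and marks the backend down once the threshold is reached. The types and loop are illustrative only, not UltraBalancer's internal API; the interval, timeout, and threshold mirror the basic configuration shown below.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// backend is an illustrative stand-in for a load balancer's backend entry.
type backend struct {
	address  string
	failures int  // consecutive failed probes
	healthy  bool // whether the backend is currently in rotation
}

// probe sends an HTTP HEAD request and reports success on the expected status.
func probe(client *http.Client, b *backend, path string, expected int) bool {
	resp, err := client.Head("http://" + b.address + path)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == expected
}

func main() {
	b := &backend{address: "192.168.1.10:8080", healthy: true}
	client := &http.Client{Timeout: 2 * time.Second} // timeout_ms: 2000

	// In a real balancer this loop runs in its own goroutine per backend,
	// so probes never block request processing.
	for range time.Tick(5 * time.Second) { // interval_ms: 5000
		if probe(client, b, "/health", 200) {
			b.failures = 0
			b.healthy = true
			continue
		}
		b.failures++
		if b.failures == 3 { // max_failures: 3
			b.healthy = false
			fmt.Printf("backend %s marked DOWN\n", b.address)
		}
	}
}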
Health Check States
Healthy (Green)
Backend responds successfully to health checks. Traffic is routed normally.
Degraded (Yellow)
Some health checks fail but threshold not reached. Backend remains in rotation.
Unhealthy (Red)
Failure threshold exceeded. Backend removed from rotation automatically.
Recovery (Blue)
Previously failed backend passes health checks. Gradually restored to rotation.
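These transitions are driven by the health checker's failure and success counters. The sketch below shows one way the four states could be derived from those counters; the function and its parameters are illustrative, not UltraBalancer's implementation.

package main

import "fmt"

type HealthState int

const (
	Healthy HealthState = iota
	Degraded
	Unhealthy
	Recovering
)

// classify maps a backend's recent probe history onto the four states above.
// maxFailures corresponds to the max_failures setting; recoverySuccesses is
// how many consecutive passing checks a previously failed backend has seen.
func classify(consecutiveFailures, maxFailures, recoverySuccesses, successThreshold int, wasDown bool) HealthState {
	switch {
	case consecutiveFailures >= maxFailures:
		return Unhealthy // failure threshold exceeded: remove from rotation
	case wasDown && recoverySuccesses < successThreshold:
		return Recovering // passing checks again, but not yet fully restored
	case consecutiveFailures > 0:
		return Degraded // some checks failed, threshold not reached
	default:
		return Healthy
	}
}

func main() {
	fmt.Println(classify(0, 3, 0, 2, false)) // 0 = Healthy
	fmt.Println(classify(1, 3, 0, 2, false)) // 1 = Degraded
	fmt.Println(classify(3, 3, 0, 2, false)) // 2 = Unhealthy
	fmt.Println(classify(0, 3, 1, 2, true))  // 3 = Recovering
}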
Configuration
Basic Configuration
config.yaml
health_check:
  enabled: true
  interval_ms: 5000        # Check every 5 seconds
  timeout_ms: 2000         # 2 second timeout
  max_failures: 3          # Mark unhealthy after 3 failures
  path: "/health"          # Health check endpoint
  expected_status: 200     # Expected HTTP status code
Advanced Configuration
health_check:
  enabled: true
  interval_ms: 3000
  timeout_ms: 1500
  max_failures: 5
  path: "/api/health"
  expected_status: 200

  # Circuit breaker configuration
  circuit_breaker:
    enabled: true
    failure_threshold: 5      # Open circuit after 5 failures
    success_threshold: 2      # Close after 2 successes
    timeout_seconds: 60       # Try again after 60 seconds
    half_open_requests: 3     # Test with 3 requests in half-open

  # Custom health check headers
  headers:
    User-Agent: "UltraBalancer-HealthCheck/2.0"
    X-Health-Check: "true"

  # Expected response body (optional)
  expected_body: '{"status":"healthy"}'
Health checks run asynchronously and don’t block request processing. Once a backend reaches the failure threshold, it is removed from rotation immediately.
Circuit Breaker Pattern
UltraBalancer implements the circuit breaker pattern to prevent cascading failures and give failing backends time to recover.
Circuit Breaker States
Closed (Normal Operation)
State: Circuit is closed, requests flow normally
Behavior:
All requests are forwarded to the backend
Failures increment the failure counter
Successes reset the failure counter
Transition: Opens when failure threshold is reached
Open (Failing Fast)
State: Circuit is open, requests fail immediately
Behavior:
Requests fail fast without contacting the backend
Prevents cascading failures
Timer starts for recovery attempt
Transition: Moves to Half-Open after the timeout period

# Log output when circuit opens
WARN Backend 192.168.1.10:8080 marked DOWN after 5 failed checks
WARN Circuit breaker opened for backend 192.168.1.10:8080
Half-Open (Testing Recovery)
State: Circuit is testing whether the backend has recovered
Behavior:
Limited number of test requests allowed through
If tests succeed, circuit closes
If tests fail, circuit reopens
Transition: Closes on success, reopens on failure

# Log output during recovery
INFO Circuit breaker half-open for backend 192.168.1.10:8080
INFO Testing with 3 requests...
INFO Backend 192.168.1.10:8080 is UP - circuit closed
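The transitions above amount to a small state machine. The following Go sketch is illustrative, using the failure_threshold, success_threshold, and timeout_seconds settings; it is not UltraBalancer's internal code, and it omits the half_open_requests cap on concurrent test requests for brevity.

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type circuitState int

const (
	stateClosed circuitState = iota
	stateOpen
	stateHalfOpen
)

// CircuitBreaker is a minimal sketch of the pattern described above.
type CircuitBreaker struct {
	mu               sync.Mutex
	state            circuitState
	failures         int           // consecutive failures while closed
	successes        int           // consecutive successes while half-open
	failureThreshold int           // failure_threshold
	successThreshold int           // success_threshold
	timeout          time.Duration // timeout_seconds
	openedAt         time.Time
}

var ErrCircuitOpen = errors.New("circuit breaker is open")

// Call wraps a request to the backend, failing fast while the circuit is open.
func (cb *CircuitBreaker) Call(req func() error) error {
	cb.mu.Lock()
	if cb.state == stateOpen {
		if time.Since(cb.openedAt) < cb.timeout {
			cb.mu.Unlock()
			return ErrCircuitOpen // fail fast, do not contact the backend
		}
		cb.state = stateHalfOpen // timeout elapsed: allow test requests through
		cb.successes = 0
	}
	cb.mu.Unlock()

	err := req()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.state == stateHalfOpen || cb.failures >= cb.failureThreshold {
			cb.state = stateOpen // open (or reopen) the circuit
			cb.openedAt = time.Now()
			cb.failures = 0
		}
		return err
	}

	switch cb.state {
	case stateHalfOpen:
		cb.successes++
		if cb.successes >= cb.successThreshold {
			cb.state = stateClosed // recovery confirmed
		}
	case stateClosed:
		cb.failures = 0 // a success resets the failure counter
	}
	return nil
}

func main() {
	cb := &CircuitBreaker{
		failureThreshold: 5,                // failure_threshold
		successThreshold: 2,                // success_threshold
		timeout:          60 * time.Second, // timeout_seconds
	}

	// A request that always fails: after the fifth failure the circuit opens,
	// and the sixth call fails fast with ErrCircuitOpen.
	for i := 0; i < 6; i++ {
		err := cb.Call(func() error { return errors.New("backend unavailable") })
		fmt.Println(i, err)
	}
}

Running the example opens the circuit after the fifth consecutive failure and then fails fast, which is the behavior the WARN log lines above describe.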
Circuit Breaker Configuration
config.yaml
backends:
  - host: "192.168.1.10"
    port: 8080
    weight: 100
    circuit_breaker:
      failure_threshold: 5
      success_threshold: 2
      timeout_seconds: 60
      half_open_requests: 3
Failover Strategies
When a backend exceeds its failure threshold, it’s immediately removed from the rotation:
INFO Health checker started (interval: 5s, max_failures: 3)
WARN Backend 192.168.1.10:8080 health check failed (1/3)
WARN Backend 192.168.1.10:8080 health check failed (2/3)
WARN Backend 192.168.1.10:8080 health check failed (3/3)
WARN Backend 192.168.1.10:8080 marked DOWN
INFO Removing backend from rotation
INFO Active backends: 2/3
Graceful Recovery
When a backend recovers, it’s gradually restored:
INFO Backend 192.168.1.10:8080 health check succeeded
INFO Backend 192.168.1.10:8080 health check succeeded
INFO Backend 192.168.1.10:8080 marked UP
INFO Adding backend to rotation with reduced weight
INFO Active backends: 3/3
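The "reduced weight" in the log above is a slow start: the recovered backend receives a growing share of traffic rather than its full weight at once. A minimal sketch of such a ramp is shown below; the 10% starting share and 30-second window are assumptions for illustration, not documented UltraBalancer defaults.

package main

import (
	"fmt"
	"time"
)

// effectiveWeight ramps a recovered backend's weight linearly from 10% to 100%
// of its configured weight over rampDuration, so it is not flooded with traffic
// the moment it passes its health checks.
func effectiveWeight(configured int, recoveredAt time.Time, rampDuration time.Duration) int {
	elapsed := time.Since(recoveredAt)
	if elapsed >= rampDuration {
		return configured // fully restored
	}
	fraction := 0.1 + 0.9*float64(elapsed)/float64(rampDuration)
	return int(float64(configured) * fraction)
}

func main() {
	recoveredAt := time.Now().Add(-15 * time.Second) // recovered 15 seconds ago
	w := effectiveWeight(100, recoveredAt, 30*time.Second)
	fmt.Printf("effective weight: %d/100\n", w) // roughly 55 halfway through the ramp
}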
Zero-Downtime Failover
1. Failure Detection: Health check fails, failure counter increments
2. Threshold Check: If failures ≥ max_failures, mark backend unhealthy
3. Immediate Removal: Remove from load balancer rotation (no new connections)
4. Existing Connections: Allow in-flight requests to complete gracefully (see the sketch below)
5. Traffic Redistribution: Redistribute traffic to healthy backends instantly
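Steps 3-5 hinge on taking the backend out of the candidate set atomically while in-flight requests drain. The sketch below illustrates the idea with a WaitGroup over active requests; it is illustrative, not UltraBalancer's internal connection tracking.

package main

import (
	"fmt"
	"sync"
	"time"
)

// drainableBackend tracks in-flight requests so removal can wait for them to finish.
type drainableBackend struct {
	address   string
	inFlight  sync.WaitGroup
	mu        sync.Mutex
	accepting bool
}

// acquire reserves a slot for a new request; it fails once the backend is draining.
func (b *drainableBackend) acquire() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.accepting {
		return false // removed from rotation: no new connections
	}
	b.inFlight.Add(1)
	return true
}

func (b *drainableBackend) release() { b.inFlight.Done() }

// remove stops new traffic immediately and waits for in-flight requests to complete.
func (b *drainableBackend) remove() {
	b.mu.Lock()
	b.accepting = false
	b.mu.Unlock()
	b.inFlight.Wait() // let existing connections finish gracefully
	fmt.Printf("backend %s fully drained\n", b.address)
}

func main() {
	b := &drainableBackend{address: "192.168.1.10:8080", accepting: true}

	// Simulate one in-flight request that takes a moment to finish.
	if b.acquire() {
		go func() {
			time.Sleep(100 * time.Millisecond)
			b.release()
		}()
	}

	b.remove() // marked unhealthy: drain and take out of rotation
}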
Health Check Endpoints
Backend Requirements
Your backend servers should expose a health check endpoint:
const express = require('express');
const app = express();

// Simple health check
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'healthy' });
});

// Advanced health check with dependencies
// (use in place of the simple handler above; `db` and `redis` are your own clients)
app.get('/health', async (req, res) => {
  try {
    // Check database connection
    await db.ping();
    // Check Redis connection
    await redis.ping();

    res.status(200).json({
      status: 'healthy',
      timestamp: new Date().toISOString(),
      services: {
        database: 'up',
        cache: 'up'
      }
    });
  } catch (error) {
    res.status(503).json({
      status: 'unhealthy',
      error: error.message
    });
  }
});

app.listen(8080);
from datetime import datetime

from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

DATABASE_URL = "postgresql://user:password@localhost/mydb"  # placeholder; use your real connection string

# Simple health check
@app.route('/health')
def health():
    return jsonify(status='healthy'), 200

# Advanced health check (use in place of the simple handler above)
@app.route('/health')
def health_detailed():
    try:
        # Check database
        conn = psycopg2.connect(DATABASE_URL)
        conn.close()
        return jsonify({
            'status': 'healthy',
            'timestamp': datetime.now().isoformat(),
            'services': {
                'database': 'up'
            }
        }), 200
    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'error': str(e)
        }), 503

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
package main

import (
	"database/sql"
	"encoding/json"
	"net/http"
	"time"
)

type HealthResponse struct {
	Status    string    `json:"status"`
	Timestamp time.Time `json:"timestamp"`
}

// db is assumed to be initialized elsewhere (e.g., with sql.Open).
var db *sql.DB

// Simple health check
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(HealthResponse{
		Status:    "healthy",
		Timestamp: time.Now(),
	})
}

// Advanced health check (register it in place of healthHandler)
func healthDetailedHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")

	// Check database
	if err := db.Ping(); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "unhealthy",
			"error":  err.Error(),
		})
		return
	}

	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]interface{}{
		"status":    "healthy",
		"timestamp": time.Now(),
		"services": map[string]string{
			"database": "up",
		},
	})
}

func main() {
	http.HandleFunc("/health", healthHandler)
	http.ListenAndServe(":8080", nil)
}
use axum::{
    extract::State,
    http::StatusCode,
    routing::get,
    Json, Router,
};
use serde_json::{json, Value};
use sqlx::PgPool;

// Simple health check
async fn health() -> Json<Value> {
    Json(json!({
        "status": "healthy"
    }))
}

// Advanced health check: the connection pool is injected via axum's State extractor.
// Register it with `.route("/health", get(health_detailed)).with_state(pool)`.
async fn health_detailed(State(pool): State<PgPool>) -> (StatusCode, Json<Value>) {
    // Check database connection
    match sqlx::query("SELECT 1").fetch_one(&pool).await {
        Ok(_) => (
            StatusCode::OK,
            Json(json!({
                "status": "healthy",
                "timestamp": chrono::Utc::now(),
                "services": {
                    "database": "up"
                }
            })),
        ),
        Err(e) => (
            StatusCode::SERVICE_UNAVAILABLE,
            Json(json!({
                "status": "unhealthy",
                "error": e.to_string()
            })),
        ),
    }
}

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/health", get(health));

    axum::Server::bind(&"0.0.0.0:8080".parse().unwrap())
        .serve(app.into_make_service())
        .await
        .unwrap();
}
Monitoring Health Status
Real-Time Health Monitoring
Query the /metrics endpoint to see backend health status:
curl http://localhost:8080/metrics
{
  "backends": [
    {
      "address": "192.168.1.10:8080",
      "status": "healthy",
      "failures": 0,
      "circuit_state": "closed",
      "last_check": "2024-01-09T10:30:45Z"
    },
    {
      "address": "192.168.1.11:8080",
      "status": "unhealthy",
      "failures": 5,
      "circuit_state": "open",
      "last_check": "2024-01-09T10:30:43Z"
    }
  ]
}
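The same endpoint is easy to consume from scripts or sidecar tooling. The sketch below polls /metrics once and flags any backend that is not healthy or whose circuit is not closed; the struct fields follow the JSON shown above, while the program itself is illustrative.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// metricsResponse mirrors the /metrics payload shown above.
type metricsResponse struct {
	Backends []struct {
		Address      string    `json:"address"`
		Status       string    `json:"status"`
		Failures     int       `json:"failures"`
		CircuitState string    `json:"circuit_state"`
		LastCheck    time.Time `json:"last_check"`
	} `json:"backends"`
}

func main() {
	resp, err := http.Get("http://localhost:8080/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var m metricsResponse
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		panic(err)
	}

	for _, b := range m.Backends {
		if b.Status != "healthy" || b.CircuitState != "closed" {
			fmt.Printf("ALERT: %s is %s (circuit %s, %d failures)\n",
				b.Address, b.Status, b.CircuitState, b.Failures)
		}
	}
}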
Prometheus Metrics
Monitor health check metrics in Prometheus:
# Health check success rate
ultrabalancer_health_check_success_total{backend="192.168.1.10:8080"}
# Health check failure rate
ultrabalancer_health_check_failure_total{backend="192.168.1.10:8080"}
# Backend availability
ultrabalancer_backend_available{backend="192.168.1.10:8080"}
# Circuit breaker state (0=closed, 1=open, 2=half-open)
ultrabalancer_circuit_breaker_state{backend="192.168.1.10:8080"}
Grafana Dashboard
Create alerts based on health metrics:
alert: BackendDown
expr: ultrabalancer_backend_available == 0
for: 1m
labels:
  severity: critical
annotations:
  summary: "Backend {{ $labels.backend }} is down"
  description: "Backend has failed health checks for 1 minute"
Best Practices
Optimal Health Check Intervals
Low traffic: 10-30 seconds (reduce backend load)
Medium traffic: 5-10 seconds (balanced)
High traffic: 2-5 seconds (fast failover)
Critical systems: 1-2 seconds (maximum availability)
As a rule of thumb, worst-case detection time is roughly interval × max_failures: with a 5-second interval and max_failures: 3, a dead backend can go unnoticed for about 15 seconds.
Avoid These Common Mistakes
Too aggressive intervals - Can overload backends with health checks
Too lenient thresholds - Delays failover and affects user experience
Heavy health checks - Keep endpoints lightweight (avoid DB queries)
Same path as main traffic - Use dedicated /health endpoints
No circuit breaker - Can cause cascading failures
Production Recommendations
health_check:
  # Fast detection for user-facing services
  enabled: true
  interval_ms: 3000
  timeout_ms: 1500
  max_failures: 3

  # Conservative circuit breaker
  circuit_breaker:
    enabled: true
    failure_threshold: 5
    success_threshold: 3
    timeout_seconds: 30

backends:
  - host: "backend1.prod.example.com"
    port: 8080
    weight: 100
  - host: "backend2.prod.example.com"
    port: 8080
    weight: 100
Troubleshooting
Health checks failing but backend is healthy
Possible causes :
Firewall blocking health check requests
Health endpoint returning wrong status code
Timeout too aggressive for slow backends
Solutions:

# Test health endpoint manually
curl -I http://backend:8080/health
# Increase timeout
ultrabalancer -c config.yaml --health-check-timeout 5000
# Check firewall rules
iptables -L | grep 8080
Circuit breaker opening too frequently
Possible causes :
Threshold too low
Backend genuinely overloaded
Network issues
Solutions:

circuit_breaker:
  failure_threshold: 10     # Increase threshold
  timeout_seconds: 120      # Longer recovery time
Backend not recovering after coming back online
Possible causes :
Circuit breaker timeout too long
Health check interval too long
Backend still returning errors
Solutions:

# Query the health endpoint directly and inspect the response
curl http://backend:8080/health -v

# Verify backend is truly healthy
for i in {1..10}; do curl -s -o /dev/null -w "%{http_code}\n" http://backend:8080/health; done