High Availability Redis with Automatic Failover

Redis is a powerful key-value store that supports several data structures. Here’s a simple way to set up replication with automatic failover. People sometimes call this a Redis cluster, but in reality it’s just a few servers (preferably three, so that Sentinel can reach a quorum) with one master and several slaves in different configurations (slave of slave, slave by priority, local slave, etc.).

Server Setup

We need three servers, each running Redis and Sentinel. This setup was tested on Redis 2.x and 3.x:

1 - 10.1.1.1
2 - 10.1.1.2
3 - 10.1.1.3

Redis Configuration

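# r1_redis.conf (initial master)
# bind is commented out, so Redis listens on all interfaces
#bind 127.0.0.1
# protected-mode exists only in Redis 3.2+; drop this line on older versions
protected-mode no
port 6379
...

# r2_redis.conf / r3_redis.conf (initial slaves)
#bind 127.0.0.1
protected-mode no
port 6379
...
# replicate from the initial master
slaveof 10.1.1.1 6379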
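
To confirm that a slave really follows the intended master, you can read its INFO replication output. The following is a minimal Python sketch, not part of the original setup: the helper names (fetch_replication_info, parse_info, replicates_from) are made up for illustration, and the sample text is hand-written rather than captured from a real server.

```python
import socket


def fetch_replication_info(host, port=6379, timeout=2.0):
    """Send INFO replication as an inline command and return the reply body."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"INFO replication\r\n")
        reader = sock.makefile("rb")
        header = reader.readline()  # bulk string header: b"$<length>\r\n"
        if not header.startswith(b"$"):
            raise RuntimeError("unexpected reply: %r" % header)
        length = int(header[1:].strip())
        return reader.read(length).decode("utf-8", "replace")


def parse_info(text):
    """Turn INFO output into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replicates_from(info_text, host, port):
    """True if the node is a slave of host:port and its link is up."""
    f = parse_info(info_text)
    return (
        f.get("role") == "slave"
        and f.get("master_host") == host
        and f.get("master_port") == str(port)
        and f.get("master_link_status") == "up"
    )


# Hand-written sample (assumption), shaped like a slave's reply.
SAMPLE_SLAVE_INFO = (
    "# Replication\r\n"
    "role:slave\r\n"
    "master_host:10.1.1.1\r\n"
    "master_port:6379\r\n"
    "master_link_status:up\r\n"
)
```

Against a live node you would call replicates_from(fetch_replication_info("10.1.1.2"), "10.1.1.1", 6379).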

Sentinel Configuration

Commented-out lines are optional bootstrap settings and examples. Sentinel rewrites this file at runtime and fills in values such as myid on its own:

# r1_sentinel.conf / r2_sentinel.conf / r3_sentinel.conf
daemonize yes
pidfile "/var/run/redis/redis-sentinel.pid"
logfile "/var/log/redis/redis-sentinel.log"
port 16379
dir "/var/lib/redis"
protected-mode no
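#sentinel myid ...
# monitor the master named redis-ha; 2 Sentinels must agree that it is down
sentinel monitor redis-ha 10.1.1.1 6379 2
# consider the master down after 3 seconds without a valid reply
sentinel down-after-milliseconds redis-ha 3000
sentinel failover-timeout redis-ha 6000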
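
The quorum value (the trailing 2 in the sentinel monitor line) only decides when the master is considered down. To actually start a failover, a Sentinel also needs votes from a majority of all known Sentinels. Here is a small Python sketch of that arithmetic; the function names are made up for illustration, and it models only the vote counts, not timing or networking.

```python
def majority(total_sentinels):
    """Smallest number of Sentinels that forms a majority."""
    return total_sentinels // 2 + 1


def can_failover(total_sentinels, reachable_sentinels, quorum):
    """True if enough Sentinels are reachable to detect and authorize a failover.

    The failover leader needs votes from a majority of Sentinels, or from
    at least `quorum` of them when the quorum is set above the majority.
    """
    needed = max(quorum, majority(total_sentinels))
    return reachable_sentinels >= needed


# Three Sentinels with quorum 2, as in the configuration above.
for alive in (3, 2, 1):
    print(alive, "reachable ->", can_failover(3, alive, 2))
```

With three Sentinels, a failover can proceed while two of them can still talk to each other, but not with only one left.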

HAProxy Configuration

listen redis-rw
        bind 127.0.0.1:6379
        mode tcp
        balance leastconn
        option tcplog
        option tcp-check
        tcp-check connect
        tcp-check send PING\r\n
        tcp-check expect string +PONG
        tcp-check send info\ replication\r\n
        tcp-check expect string role:master
        tcp-check send QUIT\r\n
        tcp-check expect string +OK
        server redis-1 10.1.1.1:6379 check inter 2s backup
        server redis-2 10.1.1.2:6379 check inter 2s backup
        server redis-3 10.1.1.3:6379 check inter 2s backup

listen redis-ro
        bind 127.0.0.2:6379
        mode tcp
        balance leastconn
        option tcplog
        option tcp-check
        tcp-check connect
        tcp-check send PING\r\n
        tcp-check expect string +PONG
        tcp-check send info\ replication\r\n
        tcp-check expect string master_link_status:up
        tcp-check send QUIT\r\n
        tcp-check expect string +OK
        server redis-1 10.1.1.1:6379 check inter 2s
        server redis-2 10.1.1.2:6379 check inter 2s
        server redis-3 10.1.1.3:6379 check inter 2s
        server redis-rw 127.0.0.1:6379 backup

listen redis-local-ro
        bind 127.0.0.3:6379
        mode tcp
        balance leastconn
        option tcplog
        option tcp-check
        tcp-check connect
        tcp-check send PING\r\n
        tcp-check expect string +PONG
        tcp-check send info\ replication\r\n
        tcp-check expect string master_link_status:up
        tcp-check send QUIT\r\n
        tcp-check expect string +OK
        server redis-local 127.0.0.4:6379 check inter 2s
        server redis-ro 127.0.0.2:6379 backup
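
The health checks above boil down to substring matches on the INFO replication reply: redis-rw accepts only a node reporting role:master, while redis-ro accepts only slaves whose master_link_status is up. The Python sketch below mimics that decision for illustration. The sample replies are hand-written assumptions, and the function name is hypothetical.

```python
def classify(nodes):
    """Split nodes into (rw, ro) lists the way the tcp-check rules would.

    `nodes` maps a server name to the text of its INFO replication reply.
    """
    rw = [name for name, info in nodes.items() if "role:master" in info]
    ro = [name for name, info in nodes.items() if "master_link_status:up" in info]
    return rw, ro


# Hand-written sample replies (assumption): one master, one healthy
# slave, and one slave that has lost its link to the master.
SAMPLE_NODES = {
    "redis-1": "role:master\r\nconnected_slaves:2\r\n",
    "redis-2": "role:slave\r\nmaster_link_status:up\r\n",
    "redis-3": "role:slave\r\nmaster_link_status:down\r\n",
}

rw_pool, ro_pool = classify(SAMPLE_NODES)
print("rw:", rw_pool)
print("ro:", ro_pool)
```

A master never reports master_link_status, so it stays out of the redis-ro pool; reads reach it only through the redis-rw backup entry when no slave is healthy.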

Node Mapping

127.0.0.1:6379 - Redis Cluster RW
127.0.0.2:6379 - Redis Cluster RO
127.0.0.3:6379 - Local Redis RO with fallback to Redis Cluster RO
127.0.0.4:6379 - Local Redis, slave of Redis Cluster RO

Usage

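This configuration tolerates the failure of Redis instances as long as a majority of Sentinels (2 of 3 here) stays reachable to authorize a failover; in that case even the last surviving Redis instance is promoted to master. HAProxy TCP checks run every 2 seconds, so a role change is detected within seconds and traffic is rerouted automatically.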

Applications should connect to 127.0.0.1:6379 for writes and 127.0.0.3:6379 for reads.
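
In application code, this split can be as simple as picking an endpoint per command. The following Python sketch assumes the two HAProxy listeners above; the list of read-only commands is a small illustrative subset, and endpoint_for is a made-up helper, not part of any Redis client library.

```python
WRITE_ENDPOINT = ("127.0.0.1", 6379)  # HAProxy listener redis-rw
READ_ENDPOINT = ("127.0.0.3", 6379)   # HAProxy listener redis-local-ro

# Illustrative subset of commands that never modify data.
READ_ONLY_COMMANDS = frozenset({
    "GET", "MGET", "EXISTS", "TTL", "HGET", "HGETALL",
    "LRANGE", "SMEMBERS", "ZRANGE",
})


def endpoint_for(command):
    """Return the (host, port) a command should be sent to."""
    parts = command.split()
    if not parts:
        raise ValueError("empty command")
    name = parts[0].upper()
    # Anything not known to be read-only goes to the master.
    return READ_ENDPOINT if name in READ_ONLY_COMMANDS else WRITE_ENDPOINT


print(endpoint_for("GET user:1"))
print(endpoint_for("SET user:1 alice"))
```

Sending unknown commands to the write endpoint is the conservative default: a read on the master is always valid, while a write on a slave fails.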

Known Issues

1. Replication lag: a read from a local slave immediately after a write may return stale data.

2. Replication may crash under high RO/RW traffic (>200 Mbps on master, tested on AWS).

3. Problems occur with more than 5–6 slaves; replication may fail permanently with 10+ slaves.
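
To keep an eye on the first issue, you can compare the master's replication offset with the offset each slave has acknowledged. The Python sketch below parses the master's INFO replication text for that. It assumes the slaveN line format used by Redis 2.8 and later (ip=...,port=...,state=...,offset=...,lag=...); the sample text is hand-written and the function name is hypothetical.

```python
import re

SLAVE_LINE = re.compile(r"^slave\d+:(.*)$")


def slave_lag_bytes(master_info):
    """Map "ip:port" of each slave to how many bytes it is behind the master."""
    master_offset = None
    slaves = []
    for line in master_info.splitlines():
        line = line.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
            continue
        match = SLAVE_LINE.match(line)
        if match:
            # "ip=10.1.1.2,port=6379,..." -> {"ip": "10.1.1.2", ...}
            slaves.append(dict(p.split("=", 1) for p in match.group(1).split(",")))
    if master_offset is None:
        raise ValueError("master_repl_offset not found")
    return {
        "%s:%s" % (s["ip"], s["port"]): master_offset - int(s["offset"])
        for s in slaves
    }


# Hand-written sample (assumption), shaped like a master's reply.
SAMPLE_MASTER_INFO = (
    "# Replication\r\n"
    "role:master\r\n"
    "connected_slaves:2\r\n"
    "slave0:ip=10.1.1.2,port=6379,state=online,offset=1000,lag=0\r\n"
    "slave1:ip=10.1.1.3,port=6379,state=online,offset=940,lag=1\r\n"
    "master_repl_offset:1000\r\n"
)

print(slave_lag_bytes(SAMPLE_MASTER_INFO))
```

A byte lag that keeps growing for one slave is an early hint of the replication problems described in issues 2 and 3.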

