Week 12 — Backup, Recovery & High Availability¶

Course Objectives: CO8 | Focus: Backup, Recovery, Replication, HA | Difficulty: ⭐⭐⭐⭐☆

Learning Objectives¶

By the end of this week, you should be able to:

[ ] Distinguish logical backups from physical backups and select the appropriate type for a given scenario
[ ] Execute production-grade mysqldump commands with all critical flags for consistency and completeness
[ ] Describe the Percona XtraBackup full and incremental backup workflow including the prepare phase
[ ] Enable and interpret the MySQL binary log for point-in-time recovery
[ ] Walk through three real recovery scenarios: dropped table, corrupt data, and server crash
[ ] Define RTO and RPO and design a backup strategy that meets stated targets
[ ] Configure basic MySQL replication (source/replica) and verify its health
[ ] Explain GTID-based replication and its advantages over file-position replication
[ ] Describe the InnoDB Cluster architecture and how automatic failover works
[ ] Compare MySQL Router, ProxySQL, and Amazon RDS Multi-AZ for read/write splitting and failover

1. Backup Fundamentals — Types and Trade-offs¶

1.1 Logical vs Physical Backups¶

A logical backup exports data as SQL statements or delimited text — it is portable, human-readable, and schema-version-agnostic. A physical backup copies the raw data files from disk — it is faster for large databases but tied to the MySQL version and binary format.

Dimension	Logical (mysqldump)	Physical (XtraBackup)
Output format	SQL text / CSV	Binary InnoDB files
Speed (backup)	Slow (row-by-row)	Fast (block-level)
Speed (restore)	Slow (re-inserts all rows)	Fast (copy files back)
Portability	Any MySQL version	Same major version
Human readable	✅ Yes	❌ No
Selective restore	✅ Easy (grep/pipe)	❌ Requires full copy
Online (hot) backup	✅ (with `--single-transaction`)	✅ (XtraBackup native)
Point-in-time restore	Needs binlog on top	Needs binlog on top
Typical use case	Smaller DBs, migrations, dev refresh	Multi-GB/TB production

1.2 Hot, Warm, and Cold Backups¶

Type	Database state	Locking	Risk
Hot	Fully online, serving traffic	None (InnoDB) or minimal	Lowest downtime impact
Warm	Online but read-only	Tables locked for reads	Brief read interruption
Cold	Server stopped	N/A — files copied directly	Requires downtime

Cold Backups Are Almost Never Acceptable in Production

Even a 5-minute maintenance window at 2 AM costs money and violates SLAs for global systems. Use hot backup tooling (--single-transaction or XtraBackup) instead.

2. mysqldump Deep Dive¶

2.1 Essential Flags Explained¶

Production mysqldump command

mysqldump \
  --defaults-file=/etc/mysql/backup.cnf \
  --host=127.0.0.1 \
  --port=3306 \
  \
  --single-transaction \
  # InnoDB: START TRANSACTION WITH CONSISTENT SNAPSHOT.
  # No table locks; consistent point-in-time view.
  \
  --master-data=2 \
  # Writes a CHANGE MASTER TO comment at top of dump.
  # Value 2 = comment (safe); value 1 = active statement.
  # Records exact binlog file + position for PITR.
  \
  --flush-logs \
  # Flushes and rotates binary log immediately before dump.
  # Means recovery only needs logs AFTER this point.
  \
  --routines \
  # Export stored procedures and functions.
  \
  --triggers \
  # Export triggers (included by default, but explicit is safer).
  \
  --events \
  # Export scheduled events.
  \
  --add-drop-table \
  # Prepend DROP TABLE IF EXISTS before each CREATE TABLE.
  # Makes restore idempotent.
  \
  --hex-blob \
  # Export BLOB/BINARY columns as hexadecimal strings.
  # Prevents character encoding corruption.
  \
  --compress \
  # Compress client-server traffic (useful for remote dumps).
  \
  --verbose \
  university \
  | gzip -9 > /var/backups/mysql/university_$(date +%Y%m%d_%H%M%S).sql.gz

2.2 Dumping Specific Objects¶

Selective mysqldump

# Single table
mysqldump --defaults-file=/etc/mysql/backup.cnf \
          --single-transaction university students \
          > students_backup.sql

# Multiple databases
mysqldump --defaults-file=/etc/mysql/backup.cnf \
          --databases university hr_system \
          > multi_db_backup.sql

# Schema only (no data)
mysqldump --defaults-file=/etc/mysql/backup.cnf \
          --no-data university > schema_only.sql

# Data only (no schema)
mysqldump --defaults-file=/etc/mysql/backup.cnf \
          --no-create-info university > data_only.sql

2.3 mysqlpump — Parallel Logical Backup¶

mysqlpump (MySQL 5.7+) parallelises the dump across tables and databases:

mysqlpump parallel backup

mysqlpump \
  --default-parallelism=4 \        # 4 parallel dump threads
  --add-drop-database \
  --exclude-databases=sys,mysql \
  --compress-output=ZLIB \
  university > university_pump.sql.zlib

mysqldump vs mysqlpump

mysqlpump can be faster for large multi-table databases but does not support --single-transaction across all tables simultaneously the same way mysqldump does. For strict consistency, mysqldump --single-transaction remains the gold standard for logical backups.

3. Percona XtraBackup¶

XtraBackup is the open-source standard for hot physical backups of InnoDB databases. It works by copying InnoDB data files while MySQL is running and uses the redo log to make the copy consistent.

3.1 Full Backup¶

XtraBackup full backup

# Install
apt-get install percona-xtrabackup-80   # or yum

# Take full backup
xtrabackup \
  --backup \
  --target-dir=/var/backups/xtrabackup/full_$(date +%Y%m%d) \
  --user=xtrabackup_user \
  --password="$XB_PASS" \
  --host=127.0.0.1 \
  --parallel=4 \
  --compress \
  --compress-threads=4

# Output includes: LSN (Log Sequence Number) in xtrabackup_checkpoints
cat /var/backups/xtrabackup/full_20251001/xtrabackup_checkpoints
# backup_type = full-backuped
# from_lsn = 0
# to_lsn = 1234567
# last_lsn = 1234567

3.2 Incremental Backup¶

XtraBackup incremental backup

# After a full backup, take an incremental based on the full's LSN
xtrabackup \
  --backup \
  --target-dir=/var/backups/xtrabackup/inc1_$(date +%Y%m%d) \
  --incremental-basedir=/var/backups/xtrabackup/full_20251001 \
  --user=xtrabackup_user \
  --password="$XB_PASS" \
  --compress

# Second incremental (based on first incremental)
xtrabackup \
  --backup \
  --target-dir=/var/backups/xtrabackup/inc2_$(date +%Y%m%d) \
  --incremental-basedir=/var/backups/xtrabackup/inc1_20251002 \
  --user=xtrabackup_user \
  --password="$XB_PASS"

3.3 Prepare Phase and Restore¶

The prepare phase applies the redo log to the data files, making them consistent. This step happens before you copy files to the target server:

XtraBackup prepare and restore

# Step 1: Decompress
xtrabackup --decompress --target-dir=/var/backups/xtrabackup/full_20251001

# Step 2: Prepare full backup (apply redo log, but keep log open for incrementals)
xtrabackup --prepare --apply-log-only \
  --target-dir=/var/backups/xtrabackup/full_20251001

# Step 3: Apply each incremental in order
xtrabackup --prepare --apply-log-only \
  --target-dir=/var/backups/xtrabackup/full_20251001 \
  --incremental-dir=/var/backups/xtrabackup/inc1_20251002

# Step 4: Final prepare (no --apply-log-only on last incremental)
xtrabackup --prepare \
  --target-dir=/var/backups/xtrabackup/full_20251001 \
  --incremental-dir=/var/backups/xtrabackup/inc2_20251003

# Step 5: Stop MySQL, remove old data directory, copy files
systemctl stop mysql
rm -rf /var/lib/mysql/*
xtrabackup --copy-back \
  --target-dir=/var/backups/xtrabackup/full_20251001
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql

Never Skip the Prepare Phase

Restoring unprepared XtraBackup files will corrupt your database. The data files have partially-applied transactions in the redo log that must be rolled forward (and uncommitted transactions rolled back) before the files are usable.

4. Binary Log and Point-in-Time Recovery¶

4.1 Enabling the Binary Log¶

my.cnf — binary log configuration

[mysqld]
# Enable binary logging
log_bin                  = /var/log/mysql/mysql-bin
binlog_format            = ROW          # ROW | STATEMENT | MIXED
server_id                = 1            # Required; unique across cluster
expire_logs_days         = 7            # Auto-purge after 7 days (or:)
binlog_expire_logs_seconds = 604800     # MySQL 8.0 preferred
max_binlog_size          = 512M         # Rotate when file reaches 512 MB
sync_binlog              = 1            # Flush to disk every write (safest)
binlog_row_image         = FULL         # Log all columns (vs MINIMAL/NOBLOB)

`binlog_format`	What is logged	Pros	Cons
`STATEMENT`	SQL statement text	Small log size	Non-deterministic functions (NOW(), RAND()) can differ on replica
`ROW`	Before/after image of every row	Exact; safe for PITR	Larger log files
`MIXED`	STATEMENT normally; ROW for unsafe statements	Balanced	Complex behaviour

4.2 mysqlbinlog Tool¶

Reading and filtering binary logs

# Show all events in a binlog file
mysqlbinlog /var/log/mysql/mysql-bin.000042

# Filter by time window
mysqlbinlog \
  --start-datetime="2025-10-05 09:00:00" \
  --stop-datetime="2025-10-05 09:30:00" \
  /var/log/mysql/mysql-bin.000042

# Filter by position (from mysqldump --master-data=2 output)
mysqlbinlog \
  --start-position=4 \
  --stop-position=1048576 \
  /var/log/mysql/mysql-bin.000042

# Apply directly to MySQL
mysqlbinlog /var/log/mysql/mysql-bin.000042 \
  | mysql --defaults-file=/etc/mysql/root.cnf

4.3 Point-in-Time Recovery Workflow¶

Dropped Table RecoveryData Corruption RecoveryServer Crash Recovery

Scenario: DROP TABLE grades; was accidentally run at 14:37.

# 1. Restore last nightly backup (e.g., from 02:00)
gunzip < /var/backups/mysql/university_20251005_020000.sql.gz \
  | mysql --defaults-file=/etc/mysql/root.cnf university

# 2. Find the binlog position of the DROP TABLE event
mysqlbinlog \
  --start-datetime="2025-10-05 02:00:00" \
  /var/log/mysql/mysql-bin.000043 \
  | grep -n "DROP TABLE"
# Output: at position 892144 — the DROP TABLE statement

# 3. Replay binlog UP TO the moment just before the DROP
mysqlbinlog \
  --start-datetime="2025-10-05 02:00:01" \
  --stop-position=892140 \
  /var/log/mysql/mysql-bin.000043 \
  | mysql --defaults-file=/etc/mysql/root.cnf university

# 4. Verify recovery
mysql --defaults-file=/etc/mysql/root.cnf university \
  -e "SELECT COUNT(*) FROM grades;"

Scenario: A bad UPDATE statement corrupted 200 rows in enrollments.

# 1. Restore latest clean backup
gunzip < /var/backups/mysql/university_20251005_020000.sql.gz \
  | mysql university

# 2. Find bad UPDATE in binlog (look for suspicious patterns)
mysqlbinlog \
  --start-datetime="2025-10-05 02:00:00" \
  --database=university \
  /var/log/mysql/mysql-bin.000043 \
  | grep -A5 "UPDATE enrollments"
# Found at position 445922

# 3. Replay events BEFORE the bad update
mysqlbinlog \
  --start-datetime="2025-10-05 02:00:01" \
  --stop-position=445918 \
  /var/log/mysql/mysql-bin.000043 \
  | mysql university

# 4. Skip the bad update and continue from after it
mysqlbinlog \
  --start-position=446500 \
  /var/log/mysql/mysql-bin.000043 \
  | mysql university

Scenario: Server lost power; MySQL won't start cleanly.

# InnoDB crash recovery is usually automatic on restart.
# If not, force it:
# In /etc/mysql/mysql.conf.d/mysqld.cnf:
# innodb_force_recovery = 1  (try 1 first, up to 6 if needed)

systemctl start mysql
# Monitor error log:
tail -f /var/log/mysql/error.log

# Once stable, dump data and restore cleanly:
mysqldump --all-databases > emergency_dump.sql
# Set innodb_force_recovery back to 0
# Restart MySQL normally
mysql < emergency_dump.sql

innodb_force_recovery Warning

Never run in production with innodb_force_recovery > 0 longer than necessary to extract data. Higher values (4–6) skip undo logs and background threads — data may be inconsistent. Dump and restore ASAP.

5. RTO, RPO, and Backup Strategy Design¶

5.1 Definitions¶

Metric	Full Name	Meaning	Example
RPO	Recovery Point Objective	Maximum acceptable data loss measured in time	RPO = 1 hour means you can lose at most 1 hour of transactions
RTO	Recovery Time Objective	Maximum acceptable downtime for recovery	RTO = 30 min means system must be back up within 30 minutes

5.2 Mapping Requirements to Technology¶

RPO Target	RTO Target	Strategy
≤ 24 hours	Hours acceptable	Daily mysqldump + binlog
≤ 1 hour	< 2 hours	Hourly XtraBackup incrementals + binlog
≤ 5 minutes	< 30 minutes	Synchronous replication + rapid failover
≤ 0 (zero loss)	< 60 seconds	MySQL Group Replication (synchronous mode)

Calculating Effective RPO

Effective RPO = time of last backup + binlog replay time. If your last backup was at 02:00 and you have continuous binlogs, your RPO at any moment is essentially the time since the last binlog flush (sync_binlog=1 means every transaction).

6. MySQL Replication¶

6.1 Source/Replica Setup¶

Source ConfigurationReplica Configuration

source my.cnf

[mysqld]
server_id   = 1
log_bin     = /var/log/mysql/mysql-bin
binlog_format = ROW

Create replication user on source

CREATE USER 'repl_user'@'192.168.10.%'
  IDENTIFIED WITH caching_sha2_password BY 'repl_strong_pass';

GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'192.168.10.%';
FLUSH PRIVILEGES;

-- Get current binlog position
SHOW MASTER STATUS\G
-- File: mysql-bin.000001, Position: 157

replica my.cnf

[mysqld]
server_id        = 2
relay_log        = /var/log/mysql/relay-bin
read_only        = ON          # Prevent accidental writes to replica
log_slave_updates = ON         # Replica also writes binlog (for chaining)

Point replica at source

CHANGE MASTER TO
  MASTER_HOST      = '192.168.10.10',
  MASTER_PORT      = 3306,
  MASTER_USER      = 'repl_user',
  MASTER_PASSWORD  = 'repl_strong_pass',
  MASTER_LOG_FILE  = 'mysql-bin.000001',
  MASTER_LOG_POS   = 157;

START SLAVE;
SHOW SLAVE STATUS\G

6.2 GTID-Based Replication¶

Global Transaction Identifiers (GTIDs) replace file-position tracking with unique transaction IDs. Each transaction gets a GTID in the format source_uuid:transaction_id.

GTID configuration (both source and replica)

[mysqld]
gtid_mode            = ON
enforce_gtid_consistency = ON
log_slave_updates    = ON

GTID replica setup

CHANGE MASTER TO
  MASTER_HOST     = '192.168.10.10',
  MASTER_USER     = 'repl_user',
  MASTER_PASSWORD = 'repl_strong_pass',
  MASTER_AUTO_POSITION = 1;    -- GTID: no file/position needed!

START SLAVE;

Advantages of GTID replication:

Replicas can auto-reconnect to any source in the topology without knowing binlog coordinates
Easy to promote a replica to source (no coordinate calculation)
Enables multi-source replication with clear provenance per transaction
Required for InnoDB Cluster

6.3 Monitoring Replication Health¶

Key replication monitoring queries

-- Primary status check
SHOW SLAVE STATUS\G

-- Critical fields:
-- Slave_IO_Running: Yes          <- IO thread connected to source
-- Slave_SQL_Running: Yes         <- SQL thread applying events
-- Seconds_Behind_Master: 0       <- Lag in seconds (NULL = disconnected)
-- Last_IO_Error:                 <- Most recent IO error message
-- Last_SQL_Error:                <- Most recent SQL error message

-- Detailed monitoring from Performance Schema (MySQL 8.0)
SELECT
    CHANNEL_NAME,
    SERVICE_STATE,
    LAST_ERROR_MESSAGE,
    LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_connection_status;

SELECT
    CHANNEL_NAME,
    SERVICE_STATE,
    LAST_APPLIED_TRANSACTION,
    APPLYING_TRANSACTION,
    LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;

Replication Lag Cause	Symptom	Fix
Large transactions on source	Lag spikes then recovers	Batch large updates; increase `slave_parallel_workers`
Slow I/O on replica	Persistent lag	Upgrade replica disk (SSD); tune `innodb_flush_method`
Network congestion	IO thread falls behind	Compress replication: `MASTER_COMPRESSION_ALGORITHMS`
CPU-intensive SQL threads	SQL thread queue builds up	Enable parallel replication (`slave_parallel_type = LOGICAL_CLOCK`)

7. InnoDB Cluster and Group Replication¶

7.1 Group Replication Modes¶

MySQL Group Replication provides multi-master synchronous replication with automatic failure detection and primary election.

Single-Primary ModeMulti-Primary Mode

One writable primary at a time; all others are read-only secondaries
Automatic primary election on failure (using group membership protocol)
Simpler conflict resolution (no write-write conflicts possible)
Recommended for most production workloads

-- Check which node is primary
SELECT MEMBER_HOST, MEMBER_ROLE, MEMBER_STATE
FROM performance_schema.replication_group_members;

All nodes accept writes simultaneously
Requires application-level conflict avoidance (same row cannot be written to two nodes simultaneously)
Higher risk; suitable for geographically distributed writes
Certification-based conflict detection: transactions are rejected post-commit if conflicting

7.2 InnoDB Cluster Setup (MySQL Shell)¶

MySQL Shell — InnoDB Cluster setup

# Run in MySQL Shell (mysqlsh)
shell.connect('root@node1:3306')
dba.checkInstanceConfiguration()

# If checks pass:
cluster = dba.createCluster('UniversityCluster')
cluster.addInstance('root@node2:3306')
cluster.addInstance('root@node3:3306')
cluster.status()

7.3 MySQL Router and ProxySQL¶

Feature	MySQL Router	ProxySQL
Read/write splitting	✅ Basic	✅ Advanced (rule-based)
Connection multiplexing	❌	✅
Query caching	❌	✅
Topology discovery	✅ Auto (InnoDB Cluster)	Manual or heartbeat
Config complexity	Low	High
Performance overhead	Very low	Low
Best with	InnoDB Cluster	Any topology

8. Cloud High Availability¶

8.1 Amazon RDS Multi-AZ¶

Amazon RDS Multi-AZ maintains a synchronous standby replica in a different Availability Zone. Failover is automatic (typically 60–120 seconds) and DNS-based — your connection string does not change.

AWS CLI — create RDS Multi-AZ instance

aws rds create-db-instance \
  --db-instance-identifier university-prod \
  --db-instance-class db.r6g.xlarge \
  --engine mysql \
  --engine-version 8.0.35 \
  --master-username admin \
  --master-user-password "$RDS_PASS" \
  --db-name university \
  --allocated-storage 100 \
  --storage-type gp3 \
  --multi-az \
  --backup-retention-period 7 \
  --preferred-backup-window "03:00-04:00" \
  --deletion-protection

8.2 Cloud HA Comparison¶

Service	Engine	HA mechanism	Failover time	Read replicas
Amazon RDS Multi-AZ	MySQL 8.0	Synchronous standby	60–120s	Up to 5
Amazon Aurora MySQL	Aurora	Shared storage; 6-way replica	< 30s	Up to 15
Google Cloud SQL	MySQL 8.0	Failover replica	60–90s	Up to 10
Azure Database for MySQL	MySQL 8.0	Zone-redundant HA	60–120s	Up to 5

Cloud Managed = Automatic Backups

All major cloud managed database services include automated daily backups with point-in-time restore (typically 1–35 day retention). This shifts the operational burden of backup management from the DBA to the cloud provider, but you should still verify backup coverage in your SLA.

Key Vocabulary¶

Term	Definition
Logical backup	Export of data as SQL statements or CSV; portable and human-readable
Physical backup	Copy of raw InnoDB data files; fast but version-dependent
Hot backup	Backup taken while the database is running and serving traffic
`--single-transaction`	mysqldump flag creating a consistent InnoDB snapshot without table locks
`--master-data=2`	Embeds CHANGE MASTER TO as a comment with binlog file and position
XtraBackup	Percona's open-source hot physical backup tool for InnoDB
Prepare phase	XtraBackup step that applies redo log to copied files, making them consistent
Binary log (binlog)	Ordered log of all changes to MySQL data; used for replication and PITR
binlog_format ROW	Logs before/after row images; safest for PITR and replication
GTID	Global Transaction Identifier — unique UUID:sequence for each transaction
Seconds_Behind_Master	Replica lag metric — seconds behind the source's latest event
RPO	Recovery Point Objective — maximum acceptable data loss expressed in time
RTO	Recovery Time Objective — maximum acceptable time to restore service
PITR	Point-In-Time Recovery — restoring a database to a specific moment using binlog
Group Replication	MySQL plugin providing synchronous multi-master replication with auto-failover
InnoDB Cluster	MySQL HA solution combining Group Replication + MySQL Router + MySQL Shell
ProxySQL	Open-source MySQL proxy supporting advanced read/write splitting and pooling
Multi-AZ	AWS RDS feature placing a synchronous standby in a different availability zone
Relay log	File on a replica that buffers events received from the source before applying
`innodb_force_recovery`	Emergency setting (1–6) to start MySQL after severe InnoDB corruption

Self-Assessment¶

Self-Assessment — Week 12

Question 1. Your university database is 800 GB. A mysqldump takes 4 hours to complete. Management wants RPO ≤ 15 minutes and RTO ≤ 30 minutes. Explain why mysqldump alone cannot meet these targets and design a backup architecture (tools, schedule, components) that does.

Question 2. You run SHOW SLAVE STATUS\G and observe: Slave_IO_Running: Yes, Slave_SQL_Running: No, Last_SQL_Error: Error 'Duplicate entry 1234 for key PRIMARY'. Explain what happened, why, and describe two different ways to resolve it — one that skips the error and one that is safer long-term.

Question 3. A developer accidentally ran UPDATE students SET gpa = 0.0; without a WHERE clause at 11:42 AM today. Your last backup was at 2:00 AM. You have continuous binary logging enabled. Outline the exact recovery steps in order, including the specific mysqlbinlog flags you would use and how you would identify the position of the erroneous statement.

Question 4. Compare GTID-based replication with traditional binary log file/position replication across four dimensions: failover complexity, multi-source replication support, compatibility with backup tools, and operational risk during topology changes.

Question 5. Your team is debating between deploying MySQL InnoDB Cluster (with MySQL Router) versus using Amazon RDS Multi-AZ. List three factors that would favor each option, and describe the scenario where you would choose InnoDB Cluster over RDS.