Week 14 — NoSQL, NewSQL & Cloud Database Systems¶

Course Objectives: CO8, CO9 | Focus: NoSQL, NewSQL, Cloud Databases | Difficulty: ⭐⭐⭐☆☆

Learning Objectives¶

By the end of this week, you should be able to:

[ ] Explain the CAP theorem and classify real database systems as CP, AP, or CA
[ ] Describe the four primary NoSQL paradigms and give real-world use cases for each
[ ] Write MongoDB CRUD operations and basic aggregation pipeline stages
[ ] Compare MongoDB schema design patterns — embedding vs referencing — for a given workload
[ ] Explain Redis data structures and implement common caching and session patterns
[ ] Describe how NewSQL systems (Spanner, CockroachDB, TiDB) achieve horizontal scale with ACID
[ ] Use MySQL's JSON data type with JSON_EXTRACT, JSON_TABLE, and generated column indexes
[ ] Compare major cloud database services across pricing, features, and appropriate use cases
[ ] Apply a structured database selection framework to a real scenario
[ ] Design a polyglot persistence architecture for a multi-service application

1. The CAP Theorem¶

1.1 The Three Properties¶

The CAP Theorem (Brewer, 2000; formally proven by Gilbert and Lynch, 2002) states that a distributed data store can provide at most two of the following three guarantees simultaneously:

Property	Meaning
C — Consistency	Every read receives the most recent write or an error. All nodes see the same data at the same time.
A — Availability	Every request receives a (non-error) response, though the response may not contain the most recent data.
P — Partition Tolerance	The system continues to operate even when network messages between nodes are lost or delayed.

Partition Tolerance is Not Optional

In any real distributed system, network partitions will occur. You cannot simply "choose" not to have partition tolerance. The real trade-off in practice is: when a partition occurs, do you sacrifice Consistency or Availability? This makes the practical choice CP vs AP.

1.2 Real System Classifications¶

System	Type	Trade-off
MySQL (single node)	CA	Not distributed; no partition concerns
MySQL InnoDB Cluster	CP	On partition: refuses writes to preserve consistency
MongoDB (default)	CP	Primary stepdown during partition; secondaries not writable
MongoDB (eventually consistent reads)	AP	Reads from stale secondaries; data may lag
Cassandra	AP	Always accepts writes; eventual consistency
Redis Cluster	AP	Accepts reads/writes; partitioned nodes may diverge
Google Spanner	CP (effectively CA)	Uses TrueTime API + synchronised clocks to minimise partition effects
DynamoDB	AP (configurable)	Strong consistency available, but not default
CockroachDB	CP	Raft consensus; partitioned nodes halt writes

1.3 Beyond CAP — the PACELC Model¶

The PACELC extension recognises that even without a partition, there is a latency/consistency trade-off:

PACELC: if Partition → (A vs C);
        Else         → (L vs C)

A system like DynamoDB: PA/EL — in partition favours availability; in normal operation favours low latency over consistency.

2. NoSQL Paradigms¶

2.1 The Four NoSQL Categories¶

Category	Model	Examples	Best For
Document	JSON/BSON documents	MongoDB, CouchDB, Firestore	Flexible schemas, content management, user profiles
Key-Value	Opaque value behind a key	Redis, DynamoDB, Riak	Sessions, caching, shopping carts, real-time data
Column-Family	Rows with sparse columns	Cassandra, HBase, Bigtable	Write-heavy time series, IoT, event logs
Graph	Nodes and edges	Neo4j, Amazon Neptune, TigerGraph	Social networks, recommendations, fraud detection

2.2 When NOT to Use NoSQL¶

NoSQL Is Not a Silver Bullet

NoSQL databases sacrifice ACID guarantees (or implement them differently). Do not use NoSQL when:

You need multi-table transactions (e.g., financial systems with consistent balances)
Your data is highly relational and you need complex joins
Your team's expertise is SQL and the project timeline is tight
Regulatory compliance requires auditable, consistent records (HIPAA, SOX, PCI-DSS)

3. MongoDB Deep Dive¶

3.1 Document Model vs Relational¶

In MongoDB, data is stored as BSON (Binary JSON) documents in collections (not tables). There is no enforced schema by default.

Relational ModelMongoDB Document Model

University enrollment — relational

-- Three separate tables with foreign keys
SELECT s.first_name, s.last_name,
       c.title, c.course_code, g.numeric_grade
FROM   students    s
JOIN   enrollments e ON e.student_id = s.student_id
JOIN   sections    sc ON sc.section_id = e.section_id
JOIN   courses     c  ON c.course_id   = sc.course_id
LEFT JOIN grades   g  ON g.enrollment_id = e.enrollment_id
WHERE  s.student_id = 1001;

MongoDB — embedded document model

// Students collection — one document per student
// Enrollments embedded within the student document
db.students.insertOne({
  _id: ObjectId("64a1b2c3d4e5f6789abcdef0"),
  student_id: 1001,
  first_name: "Maria",
  last_name:  "Hernandez",
  email:      "mhernandez@frostburg.edu",
  gpa:        3.72,
  enrollments: [
    {
      section_id:   201,
      course_code:  "ITEC445",
      title:        "Database Systems II",
      semester:     "Fall2025",
      grade:        { numeric: 94.5, letter: "A" }
    },
    {
      section_id:   187,
      course_code:  "ITEC412",
      title:        "Cloud Computing",
      semester:     "Fall2025",
      grade:        null     // not yet graded
    }
  ]
});

3.2 MongoDB CRUD Operations¶

MongoDB CRUD

// ---- INSERT ------------------------------------------------
db.students.insertOne({
  student_id: 1002,
  first_name: "James",
  last_name:  "Okafor",
  email:      "jokafor@frostburg.edu",
  gpa:        3.15
});

db.students.insertMany([
  { student_id: 1003, first_name: "Priya", last_name: "Nair",   gpa: 3.90 },
  { student_id: 1004, first_name: "David", last_name: "Kim",    gpa: 2.85 }
]);

// ---- READ --------------------------------------------------
// Find all students with GPA >= 3.5
db.students.find(
  { gpa: { $gte: 3.5 } },
  { first_name: 1, last_name: 1, gpa: 1, _id: 0 }   // projection
).sort({ gpa: -1 });

// Find with nested field filter
db.students.find({
  "enrollments.semester": "Fall2025",
  "enrollments.course_code": "ITEC445"
});

// ---- UPDATE ------------------------------------------------
// Update GPA for one student
db.students.updateOne(
  { student_id: 1001 },
  { $set: { gpa: 3.75, last_updated: new Date() } }
);

// Set a grade in an embedded array element (arrayFilters)
db.students.updateOne(
  { student_id: 1001 },
  { $set: { "enrollments.$[elem].grade": { numeric: 96.0, letter: "A" } } },
  { arrayFilters: [ { "elem.section_id": 201 } ] }
);

// ---- DELETE ------------------------------------------------
db.students.deleteOne({ student_id: 1004 });

// Delete all students with GPA below 1.0 (academic suspension)
db.students.deleteMany({ gpa: { $lt: 1.0 } });

3.3 Aggregation Pipeline¶

The MongoDB aggregation pipeline processes documents through a series of stages:

Aggregation pipeline — department GPA summary

db.students.aggregate([
  // Stage 1: Filter active students
  { $match: { status: "active" } },

  // Stage 2: Unwind enrollments array
  { $unwind: "$enrollments" },

  // Stage 3: Filter for Fall2025
  { $match: { "enrollments.semester": "Fall2025" } },

  // Stage 4: Extract department from course code
  { $addFields: {
      dept_code: { $substr: ["$enrollments.course_code", 0, 4] }
  }},

  // Stage 5: Group by department
  { $group: {
      _id:            "$dept_code",
      avg_gpa:        { $avg: "$gpa" },
      total_students: { $addToSet: "$student_id" },
      enrollment_count: { $sum: 1 }
  }},

  // Stage 6: Add computed field
  { $addFields: {
      unique_students: { $size: "$total_students" }
  }},

  // Stage 7: Sort
  { $sort: { avg_gpa: -1 } },

  // Stage 8: Project final shape
  { $project: {
      _id:              0,
      department:       "$_id",
      avg_gpa:          { $round: ["$avg_gpa", 2] },
      unique_students:  1,
      enrollment_count: 1
  }}
]);

3.4 Schema Design — Embedding vs Referencing¶

Criteria	Embedding	Referencing
Relationship cardinality	One-to-few	One-to-many or many-to-many
Access pattern	Child data always accessed with parent	Child data queried independently
Update frequency	Child rarely updated	Child updated frequently
Document size risk	Low	N/A — data in separate collection
Join required?	❌ No ($lookup equivalent available)	✅ `$lookup` pipeline stage
Duplication	Possible	Normalised

MongoDB $lookup (join equivalent)

// Get students with their course details (referenced pattern)
db.enrollments.aggregate([
  { $lookup: {
      from:         "courses",
      localField:   "course_id",
      foreignField: "_id",
      as:           "course_info"
  }},
  { $unwind: "$course_info" },
  { $project: {
      student_id: 1,
      course_code: "$course_info.course_code",
      title:       "$course_info.title"
  }}
]);

4. Redis — Data Structures and Use Cases¶

4.1 Core Data Structures¶

Redis CLI — core data structure operations

# ---- STRING (most basic) -----------------------------------
SET   session:user:1001 '{"userId":1001,"role":"student","exp":1728000000}'
GET   session:user:1001
TTL   session:user:1001          # Remaining TTL in seconds
EXPIRE session:user:1001 3600    # Set TTL of 1 hour
DEL   session:user:1001

# ---- LIST (ordered, allows duplicates) ---------------------
RPUSH notification_queue '{"type":"email","to":"mhernandez@frostburg.edu"}'
RPUSH notification_queue '{"type":"sms","to":"+13015551234"}'
LLEN  notification_queue         # 2
BLPOP notification_queue 30      # Blocking pop (30s timeout) — worker pattern

# ---- SET (unique members, unordered) -----------------------
SADD  enrolled_sections:1001 201 187 203
SISMEMBER enrolled_sections:1001 201  # 1 (true)
SMEMBERS enrolled_sections:1001       # {201, 187, 203}
SCARD enrolled_sections:1001          # 3

# ---- SORTED SET (unique members with score) ----------------
ZADD  course_leaderboard 96.5 "student:1001"
ZADD  course_leaderboard 91.0 "student:1003"
ZADD  course_leaderboard 88.5 "student:1002"
ZREVRANGE course_leaderboard 0 9 WITHSCORES   # Top 10 with scores
ZRANK     course_leaderboard "student:1001"   # Rank (0-indexed)

# ---- HASH (field-value map for an object) ------------------
HSET    student:1001 first_name "Maria" last_name "Hernandez" gpa 3.72
HGET    student:1001 gpa                    # "3.72"
HGETALL student:1001                        # All fields
HINCRBY student:1001 login_count 1          # Atomic increment

4.2 Real-World Use Cases¶

Session StoreRate LimitingPub/Sub Messaging

Redis session management

import redis
import json
import uuid

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

SESSION_TTL = 3600  # 1 hour

def create_session(user_id: int, role: str) -> str:
    token = str(uuid.uuid4())
    session_data = {
        "user_id": user_id,
        "role":    role,
        "created": "2025-10-15T09:00:00"
    }
    r.setex(
        f"session:{token}",
        SESSION_TTL,
        json.dumps(session_data)
    )
    return token

def get_session(token: str) -> dict | None:
    data = r.get(f"session:{token}")
    if data:
        r.expire(f"session:{token}", SESSION_TTL)  # Sliding expiry
        return json.loads(data)
    return None

Redis sliding window rate limiter

import redis, time

r = redis.Redis(host='localhost', port=6379)

def is_rate_limited(user_id: int, limit: int = 100, window_secs: int = 60) -> bool:
    """Allow at most `limit` requests per `window_secs`."""
    key = f"ratelimit:{user_id}"
    now = time.time()
    window_start = now - window_secs

    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)   # Remove old entries
    pipe.zadd(key, {str(now): now})               # Add this request
    pipe.zcard(key)                               # Count in window
    pipe.expire(key, window_secs + 1)             # Auto-cleanup
    results = pipe.execute()

    request_count = results[2]
    return request_count > limit

Redis pub/sub — real-time notifications

import redis, threading, json

r = redis.Redis(host='localhost', decode_responses=True)

def publish_enrollment_event(student_id: int, section_id: int):
    """Publish event for all interested subscribers."""
    message = json.dumps({
        "event":     "enrollment_confirmed",
        "student_id": student_id,
        "section_id": section_id
    })
    r.publish("enrollment_events", message)

def enrollment_listener():
    """Subscribe and process events."""
    pubsub = r.pubsub()
    pubsub.subscribe("enrollment_events")
    for message in pubsub.listen():
        if message["type"] == "message":
            data = json.loads(message["data"])
            print(f"Sending confirmation to student {data['student_id']}")

4.3 Redis Persistence Options¶

Mode	How It Works	Durability	Performance
RDB (snapshot)	Periodic fork + dump to `.rdb` file	May lose data since last snapshot	Fastest reads; compact file
AOF (append-only file)	Log every write command	At most 1 second loss (fsync=everysec)	Slightly slower; larger file
RDB + AOF	Both enabled; AOF used for recovery	Best durability	Moderate overhead
No persistence	In-memory only	Complete loss on restart	Maximum performance

4.4 Eviction Policies¶

Redis memory management

# Set max memory limit
CONFIG SET maxmemory 4gb

# Set eviction policy
CONFIG SET maxmemory-policy allkeys-lru
# Policies:
# noeviction     — return error when memory full (default)
# allkeys-lru    — evict LRU key from all keys
# allkeys-lfu    — evict least-frequently-used key
# volatile-lru   — evict LRU key from keys with TTL
# volatile-ttl   — evict key with shortest TTL
# allkeys-random — evict random key

5. NewSQL — Distributed SQL¶

NewSQL systems provide the scalability of NoSQL with the ACID guarantees and SQL interface of traditional relational databases.

5.1 NewSQL Architecture¶

Traditional SQL:         NewSQL:
┌───────────┐            ┌─────┐ ┌─────┐ ┌─────┐
│  MySQL    │            │Node1│ │Node2│ │Node3│
│ (single   │            └──┬──┘ └──┬──┘ └──┬──┘
│  server)  │               └───────┴───────┘
└───────────┘             Distributed Raft Consensus
                          Horizontal sharding
                          SQL interface preserved
                          ACID via consensus protocol

5.2 NewSQL System Comparison¶

System	Origin	Consistency	SQL Compatibility	Sharding
Google Spanner	Google (2012)	External consistency (stronger than serialisable)	ANSI SQL + extensions	Automatic (F1 driver)
CockroachDB	Cockroach Labs	Serialisable	PostgreSQL-compatible	Automatic (range-based)
TiDB	PingCAP	Snapshot isolation	MySQL-compatible	Automatic (region-based)

CockroachDB — SQL looks like PostgreSQL

-- CockroachDB syntax is nearly identical to PostgreSQL
CREATE TABLE students (
    student_id  INT          DEFAULT unique_rowid() PRIMARY KEY,
    first_name  VARCHAR(50)  NOT NULL,
    last_name   VARCHAR(50)  NOT NULL,
    email       VARCHAR(100) UNIQUE NOT NULL,
    gpa         DECIMAL(3,2) DEFAULT 0.00
);

-- Distribute table data across zones
ALTER TABLE enrollments
PARTITION BY LIST (semester) (
  PARTITION fall2025 VALUES IN ('Fall2025'),
  PARTITION spring2026 VALUES IN ('Spring2026')
);

6. MySQL as a Document Store¶

6.1 JSON Data Type in MySQL 8.0¶

MySQL's native JSON type validates, stores, and indexes JSON without requiring a separate NoSQL system:

MySQL JSON data type

-- Table with JSON column
CREATE TABLE student_profiles (
    student_id   INT          PRIMARY KEY,
    profile_json JSON         NOT NULL,
    FOREIGN KEY (student_id) REFERENCES students(student_id)
);

-- Insert JSON document
INSERT INTO student_profiles (student_id, profile_json)
VALUES (1001, '{
  "major":       "Computer Science",
  "minor":       "Mathematics",
  "advisor_id":  42,
  "preferences": { "notifications": true, "theme": "dark" },
  "tags":        ["honors", "research_track"]
}');

-- JSON_EXTRACT: retrieve nested value
SELECT
    student_id,
    JSON_EXTRACT(profile_json, '$.major')                  AS major,
    JSON_UNQUOTE(JSON_EXTRACT(profile_json, '$.preferences.theme')) AS theme,
    JSON_EXTRACT(profile_json, '$.tags[0]')                AS first_tag
FROM student_profiles
WHERE student_id = 1001;

-- Shorthand arrow operator (->) — equivalent to JSON_EXTRACT
SELECT
    student_id,
    profile_json -> '$.major'            AS major,
    profile_json ->> '$.preferences.theme' AS theme   -- ->> unquotes
FROM student_profiles;

6.2 JSON_TABLE — Shredding JSON into Relational Rows¶

JSON_TABLE converts JSON arrays to rows

SELECT jt.*
FROM   student_profiles sp,
       JSON_TABLE(
           sp.profile_json,
           '$.tags[*]'
           COLUMNS (
               tag VARCHAR(50) PATH '$'
           )
       ) AS jt
WHERE  sp.student_id = 1001;
-- Returns:
-- tag
-- --------
-- honors
-- research_track

6.3 Generated Columns and Functional Indexes on JSON¶

Virtual generated column + index on JSON path

-- Add a virtual column extracting major from JSON
ALTER TABLE student_profiles
ADD COLUMN major VARCHAR(100)
    GENERATED ALWAYS AS (JSON_UNQUOTE(profile_json ->> '$.major'))
    VIRTUAL;

-- Index the generated column for fast lookups
CREATE INDEX idx_major ON student_profiles (major);

-- Now this query uses the index:
EXPLAIN SELECT student_id, major
FROM    student_profiles
WHERE   major = 'Computer Science';

7. Cloud Database Services Comparison¶

7.1 AWS Cloud Databases¶

Amazon RDSAmazon AuroraAmazon DynamoDB

Managed relational DB (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server)

Automated backups, patches, minor version upgrades
Multi-AZ for HA; read replicas for scale-out
Pricing: instance hours + storage + I/O + data transfer
Best for: lift-and-shift of existing relational workloads

RDS read replica creation

aws rds create-db-instance-read-replica \
  --db-instance-identifier university-replica-1 \
  --source-db-instance-identifier university-prod \
  --db-instance-class db.r6g.large \
  --availability-zone us-east-1b

MySQL/PostgreSQL-compatible with proprietary distributed storage

Storage auto-scales to 128 TB; 6-way replication across 3 AZs at storage layer
Up to 15 read replicas with < 10ms replica lag
Aurora Serverless v2: auto-scales capacity in ACU (Aurora Capacity Units)
2–3× performance vs RDS MySQL on same hardware
Best for: high-throughput MySQL workloads, variable traffic

Feature	RDS MySQL	Aurora MySQL
Max storage	64 TB	128 TB
Failover time	60–120s	< 30s
Read replicas	5	15
Replica lag	Seconds	< 100ms
Cost	Lower	~20% more

Fully managed serverless key-value and document DB

Single-digit millisecond latency at any scale
No capacity planning: On-Demand mode scales automatically
Global Tables: multi-region active-active replication
Pricing: read/write capacity units or on-demand per request
Best for: session stores, shopping carts, gaming leaderboards, IoT

DynamoDB boto3 example

import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table    = dynamodb.Table('UserSessions')

# Put item
table.put_item(Item={
    'session_id': 'abc-123',
    'user_id':    1001,
    'ttl':        1728000000   # DynamoDB TTL auto-deletion
})

# Get item
response = table.get_item(Key={'session_id': 'abc-123'})
session  = response.get('Item')

7.2 Multi-Cloud Comparison¶

Service	Provider	Engine	HA	Serverless	Best For
Cloud SQL	GCP	MySQL 8.0, PG 16	Regional failover	❌	Standard relational workloads on GCP
Cloud Spanner	GCP	Spanner (NewSQL)	Multi-region	✅ (autoscale)	Global, strongly-consistent transactional
Firestore	GCP	Document (NoSQL)	Multi-region	✅	Mobile/web apps, hierarchical data
Azure Database for MySQL	Azure	MySQL 8.0	Zone-redundant	❌	MySQL workloads in Azure ecosystem
Azure CosmosDB	Azure	Multi-model	Multi-region	✅	Globally distributed, multiple APIs
Azure SQL	Azure	SQL Server	Zone-redundant	✅ (Serverless)	SQL Server workloads, Azure ecosystem

8. Database Selection Framework¶

8.1 Selection Criteria¶

When selecting a database for a new system, evaluate these dimensions:

┌─────────────────────────────────────────────────────┐
│               DATABASE SELECTION FRAMEWORK          │
│                                                     │
│  1. DATA MODEL                                      │
│     • Relational / Tabular → SQL (MySQL, PostgreSQL)│
│     • Documents / Flexible → MongoDB, Firestore     │
│     • Key-Value / Cache    → Redis, DynamoDB        │
│     • Time Series          → TimescaleDB, InfluxDB  │
│     • Graph                → Neo4j, Neptune         │
│                                                     │
│  2. CONSISTENCY REQUIREMENTS                        │
│     • Strict ACID (financial, medical) → SQL        │
│     • Eventual OK (social, analytics) → AP NoSQL    │
│                                                     │
│  3. SCALE PROFILE                                   │
│     • < 1 TB, moderate load  → Single MySQL server  │
│     • Read scale-out needed  → Replication + proxy  │
│     • Write scale-out needed → NewSQL / NoSQL shard │
│                                                     │
│  4. QUERY COMPLEXITY                                │
│     • Complex joins, reporting → SQL                │
│     • Simple key lookups       → Redis / DynamoDB   │
│     • Aggregation pipelines    → MongoDB            │
│                                                     │
│  5. OPERATIONAL OVERHEAD                            │
│     • Small team / startup → Managed cloud service  │
│     • Large team / on-prem → Self-hosted            │
└─────────────────────────────────────────────────────┘

8.2 Polyglot Persistence in Practice¶

Polyglot persistence means using multiple database systems within one application, each chosen for the type of data it manages best.

University Application — Polyglot Persistence Architecture:

┌─────────────────────────────────────────────────────────┐
│                    Application Services                 │
│                                                         │
│  Enrollment Service ──── MySQL 8.0 ──── Transactional   │
│  (registration, grades, transcripts)    ACID required   │
│                                                         │
│  Session Service ──────── Redis ──────── Fast K/V       │
│  (auth tokens, user state)               < 1ms access   │
│                                                         │
│  Search Service ────── Elasticsearch ── Full-text       │
│  (course catalog search)                 Inverted index  │
│                                                         │
│  Analytics Service ──── ClickHouse ───── OLAP           │
│  (enrollment trends, reporting)          Columnar store  │
│                                                         │
│  Profile Service ──────── MongoDB ────── Flexible docs  │
│  (student preferences, social data)      Schema-free     │
└─────────────────────────────────────────────────────────┘

Polyglot Persistence Trade-offs

Using multiple databases provides performance and flexibility benefits, but adds operational complexity: more systems to monitor, back up, upgrade, and staff expertise in. Apply polyglot persistence only when the benefit of a specialised database demonstrably outweighs the cost of managing it.

8.3 Relational to Document Migration Considerations¶

ETL: Relational to MongoDB (denormalisation)

"""
Migrate university enrollments from MySQL to MongoDB.
Demonstrates denormalisation for document model.
"""
import mysql.connector
from pymongo import MongoClient
import json

# Source: MySQL
mysql_conn = mysql.connector.connect(
    host="localhost", database="university",
    user="etl_user", password="etl_pass"
)
cursor = mysql_conn.cursor(dictionary=True)

# Destination: MongoDB
mongo_client = MongoClient("mongodb://localhost:27017/")
mongo_db     = mongo_client["university_nosql"]
students_col = mongo_db["students"]

# Fetch students with all enrollments in one query
cursor.execute("""
    SELECT
        s.student_id, s.first_name, s.last_name, s.email, s.gpa,
        c.course_code, c.title AS course_title,
        sec.section_num, sec.semester,
        g.numeric_grade, g.letter_grade
    FROM students s
    LEFT JOIN enrollments e  ON e.student_id   = s.student_id
    LEFT JOIN sections    sec ON sec.section_id = e.section_id
    LEFT JOIN courses     c   ON c.course_id   = sec.course_id
    LEFT JOIN grades      g   ON g.enrollment_id = e.enrollment_id
    ORDER BY s.student_id
""")

students = {}
for row in cursor.fetchall():
    sid = row["student_id"]
    if sid not in students:
        students[sid] = {
            "student_id": sid,
            "first_name": row["first_name"],
            "last_name":  row["last_name"],
            "email":      row["email"],
            "gpa":        float(row["gpa"]) if row["gpa"] else None,
            "enrollments": []
        }
    if row["course_code"]:
        students[sid]["enrollments"].append({
            "course_code":   row["course_code"],
            "title":         row["course_title"],
            "section_num":   row["section_num"],
            "semester":      row["semester"],
            "numeric_grade": float(row["numeric_grade"]) if row["numeric_grade"] else None,
            "letter_grade":  row["letter_grade"]
        })

# Insert into MongoDB
docs = list(students.values())
if docs:
    result = students_col.insert_many(docs)
    print(f"Migrated {len(result.inserted_ids)} students to MongoDB")

cursor.close()
mysql_conn.close()
mongo_client.close()

Key Vocabulary¶

Term	Definition
CAP Theorem	Distributed systems cannot guarantee Consistency, Availability, and Partition Tolerance simultaneously
CP system	Chooses Consistency over Availability during a network partition
AP system	Chooses Availability over Consistency during a network partition
NoSQL	Broad category of non-relational databases: document, key-value, column-family, graph
Document database	Stores semi-structured JSON/BSON documents; flexible schema; example: MongoDB
Key-value store	Maps opaque keys to values; ultra-fast lookups; example: Redis
Column-family	Organises data as columns that can be grouped; great for sparse, wide data; example: Cassandra
Graph database	Stores entities as nodes and relationships as edges; example: Neo4j
BSON	Binary JSON — MongoDB's wire and storage format; supports more types than JSON
Aggregation pipeline	MongoDB's data transformation framework; chain of processing stages
Embedding	MongoDB pattern: store related data inside a single document
Referencing	MongoDB pattern: store related data in a separate collection with a foreign key
NewSQL	Database category providing SQL ACID guarantees with horizontal NoSQL scalability
GTID (Spanner)	Google TrueTime — globally ordered timestamps using GPS/atomic clocks
Raft consensus	Distributed consensus algorithm used by CockroachDB, TiDB, etcd for leader election
JSON_EXTRACT	MySQL function to retrieve a value at a JSON path expression
Generated column	MySQL column whose value is computed from an expression; can be indexed
TTL	Time To Live — automatic expiry setting for cache entries
Eviction policy	Redis rule determining which key is removed when memory is full
Polyglot persistence	Architecture using multiple specialised databases within one application

Self-Assessment¶

Self-Assessment — Week 14

Question 1. A social media startup says: "We chose Cassandra because it's always available and never rejects writes." A financial services company says: "We use MySQL because data must always be consistent." Explain, using the CAP theorem and PACELC model, why each choice makes sense for its domain. Then describe a scenario where Cassandra's choice would be unacceptable.

Question 2. You are designing a MongoDB schema for a university system. A professor's record includes basic profile data (name, email, department) and a list of courses they have taught over 15 years (potentially 50+ courses). Explain whether you would use embedding or referencing for the courses list, citing at least three specific technical reasons.

Question 3. Your team's Node.js application uses Redis as a session store with a 1-hour sliding TTL. A QA engineer reports that after 55 minutes of inactivity, a user's session is sometimes preserved and sometimes expired. Explain what is happening (at the Redis data structure level) and how the sliding TTL implementation should be corrected.

Question 4. Compare CockroachDB and traditional MySQL replication for a multinational university with campuses in the US, Europe, and Asia. Each campus needs low-latency reads and writes for its own students, and enrollment must be globally consistent (a seat taken in New York cannot be double-booked in London). Which system architecture serves this use case better, and why?

Question 5. MySQL 8.0's JSON type with generated columns and functional indexes can replace some use cases that would otherwise require MongoDB. Describe a scenario where MySQL JSON is the right choice and a scenario where MongoDB's document model is genuinely superior — justify both answers technically.