whitepapers/notes.md
2024-12-30 15:39:07 -06:00

notes

profiling a warehouse-scale computer

cassandra

bigtable

gfs

  • System design tailored to the use case

    • Optimized for record append and random reads
  • Master-Slave

    • Limitations: fault tolerance despite replicas, throughput
  • Bottlenecks & network optimization

    • Data & Control flow separation
  • State restoration & logging (lots of things I don't get here)

    • Related: OS journaling
  • Weak consistency - "tolerable errors" (e.g. clients reading different states)

  • Garbage Collection

    • Amortized cost w/ FS scans
    • Parallels w/ language design
  • Terms to learn:

    1. Network Bandwidth and per-machine limit
    2. Racks & data centers - how are these managed (i.e. "cross-{rack,DC} replication")?
  • Use the latest {soft,hard}ware or deal with slowdowns (e.g. older Linux kernels where fsync() cost scaled with total file size, not just the appended portion)

  • Getting to know the real numbers: ~440 MB/s effective re-replication throughput after killing two chunkservers, on Google's network

  • Network as the ultimate bottleneck & inefficiency
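The data/control flow separation above can be sketched roughly like this (a hypothetical toy model, not GFS code: class and function names are mine): bulk data is pipelined along a chain of replicas so each machine's outbound bandwidth is used once, while the commit is a separate, tiny control message.

```python
# Toy sketch of GFS-style data/control separation (all names hypothetical).

class Chunkserver:
    def __init__(self, name):
        self.name = name
        self.buffer = None      # staged data, pushed but not yet committed
        self.chunk = b""        # committed chunk contents

    def push(self, data, chain):
        """Data flow: stage locally, then forward along the replica chain."""
        self.buffer = data
        if chain:
            chain[0].push(data, chain[1:])

    def commit(self):
        """Control flow: apply staged data (in GFS the primary orders this)."""
        self.chunk += self.buffer
        self.buffer = None

def write(data, replicas):
    replicas[0].push(data, replicas[1:])   # pipeline the bytes
    for r in replicas:                     # then the small commit messages
        r.commit()

replicas = [Chunkserver(n) for n in ("primary", "r2", "r3")]
write(b"record-1", replicas)
assert all(r.chunk == b"record-1" for r in replicas)
```

The point of the separation: the heavy bytes never fan out from one sender to N receivers; only the cheap control messages do.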
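The "state restoration & logging" idea (and its OS-journaling parallel) is basically write-ahead logging: log the mutation before applying it, and rebuild state after a crash by replaying the log. A minimal sketch, with a list standing in for the log file and a made-up `Master` class:

```python
# Toy write-ahead-log sketch (hypothetical names, not the GFS master's code).
import json

class Master:
    def __init__(self, log):
        self.log = log          # append-only list standing in for a log file
        self.namespace = {}     # in-memory state: path -> chunk handles

    def create(self, path):
        # Log first, then apply: a crash between the two steps loses nothing.
        self.log.append(json.dumps({"op": "create", "path": path}))
        self.namespace[path] = []

    @classmethod
    def recover(cls, log):
        """Rebuild in-memory state by replaying the operation log."""
        m = cls(log=list(log))
        for line in log:
            rec = json.loads(line)
            if rec["op"] == "create":
                m.namespace[rec["path"]] = []
        return m

log = []
m = Master(log)
m.create("/a")
m.create("/b")
m2 = Master.recover(log)        # simulated restart from the log alone
assert m2.namespace == m.namespace
```

Checkpoints (snapshots of the in-memory state) then exist only to bound how much log has to be replayed.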
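The amortized garbage collection bullet can also be sketched: deletion is just a rename to a hidden name, and reclamation happens lazily during the periodic namespace scan the master runs anyway. A toy model (class and field names are mine; the ~3-day grace period is GFS's stated default):

```python
# Toy sketch of lazy, scan-amortized GC (hypothetical names).

GRACE_PERIOD = 3 * 24 * 3600    # seconds; GFS defaulted to ~3 days

class Namespace:
    def __init__(self):
        self.files = {}         # path -> data
        self.hidden = {}        # "deleted" path -> deletion timestamp

    def delete(self, path, now):
        # Fast path: just hide the file; no storage is reclaimed yet,
        # and the file stays recoverable until the grace period expires.
        self.hidden[path] = now
        del self.files[path]

    def gc_scan(self, now):
        """Background scan: drop hidden files older than the grace period."""
        expired = [p for p, t in self.hidden.items()
                   if now - t > GRACE_PERIOD]
        for p in expired:
            del self.hidden[p]
        return expired

ns = Namespace()
ns.files["/tmp/x"] = b"data"
ns.delete("/tmp/x", now=0)
assert ns.gc_scan(now=60) == []                       # still recoverable
assert ns.gc_scan(now=GRACE_PERIOD + 1) == ["/tmp/x"] # reclaimed lazily
```

The language-design parallel from the notes: like a tracing collector, reclamation is batched and decoupled from the moment of "free", trading immediacy for simplicity and amortized cost.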

mapreduce

spark