# notes

## [profiling a warehouse-scale computer](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44271.pdf)

## [cassandra](https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf)

## [bigtable](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf)

## [gfs](https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)

- System design driven by the concrete use case
  - Optimized for record append and random reads
- Master-slave architecture
  - Limitations: fault tolerance despite replicas, throughput
- Bottlenecks & network optimization
  - Data & control flow separation
- State restoration & logging (lots of things I don't get here)
  - Related: OS journaling
- Weak consistency
  - "tolerable errors" (i.e. clients reading different states)
- Garbage collection
  - Amortized cost w/ FS scans
  - Parallels w/ language design
- Terms to learn:
  1. Network bandwidth and its _per-machine_ limit
  2. Racks & data centers - how are these managed (i.e. "cross-{rack,DC} replication")?
- Use the latest {soft,hard}ware or deal with slowdowns (an older kernel's `fsync()` required reading the entire file on append)
- Getting to know the real numbers: 440 MB/s aggregate throughput after a double chunkserver kill, on Google's network
- Network as the ultimate bottleneck & inefficiency

## [mapreduce](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf)

## [spark](https://people.eecs.berkeley.edu/~matei/papers/2016/cacm_apache_spark.pdf)

## [rpc](https://www.h3c.com/en/Support/Resource_Center/EN/Home/Switches/00-Public/Trending/Technology_White_Papers/gRPC_Technology_White_Paper-6W100/)
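The "amortized cost w/ FS scans" point from the gfs notes above can be sketched as code. In GFS, deletion only renames a file to a hidden name stamped with the deletion time; a regular background scan of the namespace reclaims hidden files older than a grace period (three days by default), amortizing reclamation and leaving a window for undelete. A minimal sketch, assuming an in-memory namespace (the class and method names here are illustrative, not from the paper):

```python
GRACE_SECONDS = 3 * 24 * 3600  # GFS reclaims ~3 days after deletion by default

class Namespace:
    """Toy master namespace: file name -> timestamp of creation or hiding."""

    def __init__(self):
        self.files = {}

    def create(self, name, now):
        self.files[name] = now

    def delete(self, name, now):
        # "Deletion" is just a rename to a hidden name plus a timestamp.
        self.files.pop(name)
        self.files[".trash." + name] = now

    def undelete(self, name, now):
        # Until the scan reclaims it, a hidden file can still be restored.
        self.files.pop(".trash." + name)
        self.files[name] = now

    def scan(self, now):
        # Background pass: reclaim hidden files past the grace period.
        expired = [n for n, t in self.files.items()
                   if n.startswith(".trash.") and now - t > GRACE_SECONDS]
        for n in expired:
            del self.files[n]
        return expired
```

Usage: `delete` is cheap and reversible; the actual storage reclaim happens lazily during `scan`, which is the amortization the notes refer to (and, loosely, the parallel with tracing GC in language runtimes).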