diff --git a/notes.md b/notes.md
index 126c0b2..f7b4ecd 100644
--- a/notes.md
+++ b/notes.md
@@ -33,6 +33,17 @@
 ## [mapreduce](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf)
 
+- mapreduce: map[k0, v0] -> [k1, v1] -> reduce[k1, v[]] -> v[]
+- Master assigns map/reduce tasks to workers (master/worker architecture)
+- Separate M & R task counts -> M >> R (usually) -> finer-grained worker allocation
+- Map & reduce individually parallelized, but *not* overall
+  - Reducer is told by the master when maps finish, fetches all its intermediate kv pairs, then sorts them by key -> this is why each reducer's output is sorted
+- RPC remote file read for data transfer from M -> R
+- Fault tolerance via re-execution: in-progress tasks on a failed worker are rescheduled, and its *completed* map tasks are redone too (their output lives on the failed worker's local disk)
+- Backup Tasks: dynamic performance adjustment -> 44% speedup (task slow on one machine -> redundantly schedule it elsewhere)
+- Caching & Network Topology: schedule map tasks on or near the *internal GFS chunkservers* holding the input replicas to conserve network bandwidth
+- Simplicity + abstraction - not optimal, but first of its kind and made waves
+
 ## [spark](https://people.eecs.berkeley.edu/~matei/papers/2016/cacm_apache_spark.pdf)
 
 ## [rpc](https://www.h3c.com/en/Support/Resource_Center/EN/Home/Switches/00-Public/Trending/Technology_White_Papers/gRPC_Technology_White_Paper-6W100/)
 
diff --git a/papers/kafka.pdf b/papers/kafka.pdf
new file mode 100644
index 0000000..c8ae612
Binary files /dev/null and b/papers/kafka.pdf differ
diff --git a/papers/zookeeper.pdf b/papers/zookeeper.pdf
new file mode 100644
index 0000000..4b66849
Binary files /dev/null and b/papers/zookeeper.pdf differ
diff --git a/readme.md b/readme.md
index 65a2d90..924972f 100644
--- a/readme.md
+++ b/readme.md
@@ -8,10 +8,12 @@
 - [x] [cassandra](./notes.md#cassandra)
 - [x] [bigtable](./notes.md#bigtable)
 - [x] [gfs](./notes.md#gfs)
-- [mapreduce](./notes.md#gfs)
-- [spark](./notes.md#spark)
+- [x] [mapreduce](./notes.md#mapreduce)
+- [x] [spark](./notes.md#spark)
 - rpc
+- zookeeper
+- kafka
 - tiktok monolith
 - bloom filters
 - dynamo
 
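The map → shuffle → sort → reduce dataflow in the mapreduce notes above can be sketched in miniature as a single-process word count (the paper's canonical example). This is an illustrative sketch only — the function names and driver are mine, not from the paper, and a real deployment distributes the map and reduce steps across workers:

```python
from collections import defaultdict

def map_fn(_doc_id, text):
    # map(k0, v0) -> [(k1, v1)]: emit (word, 1) for every word in the split.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce(k1, [v1]) -> v2: sum the per-word counts from all mappers.
    return sum(counts)

def run_mapreduce(inputs):
    # Shuffle: group intermediate pairs by key, as the framework would
    # before handing each key's value list to a reducer.
    groups = defaultdict(list)
    for doc_id, text in inputs:
        for k, v in map_fn(doc_id, text):
            groups[k].append(v)
    # Sort by key before reducing, mirroring why each reducer's
    # output partition comes out sorted.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

if __name__ == "__main__":
    docs = [(0, "the quick brown fox"), (1, "the lazy dog the end")]
    print(run_mapreduce(docs))
```

The in-memory `defaultdict` stands in for the intermediate files that real mappers write to local disk and reducers fetch over RPC.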