MapReduce
a way to get things done on clusters of unreliable computers that might fail!
Distributes load over many machines (thousands, not dozens)
- the programmer can't know how many machines will be used, either when writing the program or at runtime
a system that implements MapReduce (e.g., Hadoop) automatically parallelizes the job for you
- this frees the programmer from worrying about system administration
MapReduce is popular, not perfect (good for building web-scale systems, built-in reliability, no database schema required, programming language of your choice, easy to administer)
MapReduce:
- uses files for intermediate data
- so that mappers & reducers can work independently & don't have to remember state
- map function transforms a (key, value) input into a new (key, value) output
- reduce function takes the grouped map output (a key and its list of values) as input, and writes its result to a file on disk
2 vectors: u = [1, 2, 3], v = [4, 5, 6]
dot product (u·v) = 1*4 + 2*5 + 3*6 = 32
mapper function : (index i, vector u, vector v) => list of u[i]*v[i] products
grouper : concatenates the partial products into a single list
reducer function : (list of u[i]*v[i] products) => (sum)
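A minimal sketch of those three stages in plain Python (single process, no Hadoop; the function and variable names here are illustrative, not part of any framework):

from collections import defaultdict

# map: one call per index, emits a partial product under a shared key
def mapper(i, u_i, v_i):
    yield ("dot", u_i * v_i)

# grouper (shuffle): collect all emitted values by key
def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# reduce: sum the list of partial products
def reducer(key, values):
    return (key, sum(values))

u = [1, 2, 3]
v = [4, 5, 6]
intermediate = [p for i in range(len(u)) for p in mapper(i, u[i], v[i])]
print([reducer(k, vals) for k, vals in group(intermediate).items()])
# -> [('dot', 32)]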
MapReduce for word counting (how many times does "dog" appear?)
inputs: (lineNo, textStr) pairs
mapper:
Map(lineNo, textStr) { for each word w in textStr: emit(w, 1) }
Reduce(w, interm-vals) { emit(w, interm-vals.size()) }
with huge documents, you split the input so each mapper parses its own chunk.
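The same pseudocode written out as a runnable single-process Python sketch (the in-memory grouping stands in for the framework's shuffle; the names are my own, not Hadoop's API):

from collections import defaultdict

def map_fn(line_no, text_str):
    for w in text_str.split():          # emit(w, 1) for each word
        yield (w, 1)

def reduce_fn(w, interm_vals):
    return (w, len(interm_vals))        # emit(w, interm-vals.size())

lines = {1: "the dog chased the cat", 2: "the dog barked"}

groups = defaultdict(list)              # shuffle: group the (w, 1) pairs by word
for line_no, text in lines.items():
    for w, one in map_fn(line_no, text):
        groups[w].append(one)

counts = dict(reduce_fn(w, vals) for w, vals in groups.items())
print(counts["dog"])                    # -> 2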
optimization:
- if a map() worker dies, restart the task on a different box
- if a reducer() dies, restart the reducer
- the master detects dead workers by tracking their heartbeat messages
- slow machine? speculative execution! you can't be sure which machine will finish faster, so send the same task to two machines and kill whichever one is slower.
- large data? use "combiner" or a 'local reducer' to compress data between map & reduce
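For instance, a word-count combiner can pre-sum counts on the map node before anything crosses the network; this is only a sketch of the idea (plain Python, not Hadoop's actual Combiner interface):

from collections import Counter

# combiner / "local reducer": runs on the same machine as the mapper and
# collapses many (w, 1) pairs into a few (w, local_count) pairs, so much
# less data has to be shuffled to the reducers
def combine(pairs):
    local = Counter()
    for w, count in pairs:
        local[w] += count
    return list(local.items())

print(combine([("dog", 1), ("the", 1), ("dog", 1)]))
# -> [('dog', 2), ('the', 1)]

This only works because the reduce operation (summing counts) is associative & commutative, so applying it locally first doesn't change the final answer, only how much data gets moved.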
summary:
- scalable
- stateless: everything a task needs to know is in its input files