MapReduce
a way to get things done on clusters of unreliable computers that might fail!
Distributes load over many machines (thousands, not dozens)
- the programmer can't know how many machines will be used, either when writing the program or at runtime
a system that implements MapReduce (e.g., Hadoop) automatically parallelizes the job for you
- this frees the programmer from worrying about system administration
MapReduce is popular, not perfect (good for building web-scale systems, built-in reliability, no database schema required, programming language of your choice, easy to administer)
MapReduce:
- uses files for intermediate data
- so that mappers & reducers can work independently & don't have to remember state
- map function transforms a (key, value) input into a new (key, value) output
- reduce function takes the grouped map output (a key and its list of values) as input, and writes its result to a file on disk
2 vectors: u = [1, 2, 3], v = [4, 5, 6]
dot product (u·v) = 1*4 + 2*5 + 3*6 = 32
mapper function : (index i, vector u, vector v) => list of u[i]*v[i] products
grouper : concatenates the partial products into a single list
reducer function : (list of u[i]*v[i] products) => (sum)
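A minimal sketch of those three stages in plain Python (single process, no Hadoop; the function and variable names here are illustrative, not part of any framework):

from collections import defaultdict

# map: one call per index, emits a partial product under a shared key
def mapper(i, u_i, v_i):
    yield ("dot", u_i * v_i)

# grouper (shuffle): collect all emitted values by key
def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# reduce: sum the list of partial products
def reducer(key, values):
    return (key, sum(values))

u = [1, 2, 3]
v = [4, 5, 6]
intermediate = [p for i in range(len(u)) for p in mapper(i, u[i], v[i])]
print([reducer(k, vals) for k, vals in group(intermediate).items()])
# -> [('dot', 32)]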
MapReduce for word counting (how many times does "dog" appear?)
inputs: (lineNo, textStr) pairs
mapper:
Map(lineNo, textStr) { for each word w in textStr: emit(w, 1) }
Reduce(w, interm-vals) { emit(w, interm-vals.size()) }
with huge documents, you split the input so each mapper parses its own chunk.
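The same pseudocode written out as a runnable single-process Python sketch (the in-memory grouping stands in for the framework's shuffle; the names are my own, not Hadoop's API):

from collections import defaultdict

def map_fn(line_no, text_str):
    for w in text_str.split():          # emit(w, 1) for each word
        yield (w, 1)

def reduce_fn(w, interm_vals):
    return (w, len(interm_vals))        # emit(w, interm-vals.size())

lines = {1: "the dog chased the cat", 2: "the dog barked"}

groups = defaultdict(list)              # shuffle: group the (w, 1) pairs by word
for line_no, text in lines.items():
    for w, one in map_fn(line_no, text):
        groups[w].append(one)

counts = dict(reduce_fn(w, vals) for w, vals in groups.items())
print(counts["dog"])                    # -> 2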
optimization:
- if a map() worker dies, restart the task on a different box
- if a reducer() dies, restart the reducer
- the master detects dead workers by tracking their heartbeat messages
- slow machine? speculative execution! you can't be sure which machine will finish faster, so send the same task to two machines and kill whichever one is slower.
- large data? use "combiner" or a 'local reducer' to compress data between map & reduce
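For instance, a word-count combiner can pre-sum counts on the map node before anything crosses the network; this is only a sketch of the idea (plain Python, not Hadoop's actual Combiner interface):

from collections import Counter

# combiner / "local reducer": runs on the same machine as the mapper and
# collapses many (w, 1) pairs into a few (w, local_count) pairs, so much
# less data has to be shuffled to the reducers
def combine(pairs):
    local = Counter()
    for w, count in pairs:
        local[w] += count
    return list(local.items())

print(combine([("dog", 1), ("the", 1), ("dog", 1)]))
# -> [('dog', 2), ('the', 1)]

This only works because the reduce operation (summing counts) is associative & commutative, so applying it locally first doesn't change the final answer, only how much data gets moved.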
summary:
- scalable
- stateless: everything a task needs to know is in its input files