MapReduce

a way to get things done on unreliable clusters of computers that might fail!

Distributes load over many machines (1000s, not dozens)

  • the programmer cannot know how many machines will be used, at program-time or even at runtime

a system that implements MapReduce (e.g. Hadoop) automatically parallelizes the job for you

  • this lets the programmer not worry about system administration

MapReduce is popular, not perfect (good for building web-scale systems, built-in reliability, no database schema necessary, programming language of your choice, easy to administer)

MapReduce:

  • uses files as intermediate storage
  • so that mappers & reducers can work independently & don't have to remember state
  • map function transforms a (key, value) input into a new (key, value) output
  • reduce function takes the map output as input, and writes its result to a file on disk (see the sketch below)
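A minimal single-process sketch of the whole pipeline (map, group by key, reduce). The driver name run_mapreduce and its shape are made up for illustration; a real system runs the phases on many machines with files in between.

    from itertools import groupby
    from operator import itemgetter

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: each (key, value) input record may emit any number
        # of (key, value) pairs.
        intermediate = []
        for key, value in inputs:
            intermediate.extend(map_fn(key, value))

        # "Shuffle": group intermediate pairs by key. A real system does this
        # by writing partitioned files that the reducer machines then read.
        intermediate.sort(key=itemgetter(0))

        # Reduce phase: each key and its list of values produce output pairs.
        results = []
        for key, group in groupby(intermediate, key=itemgetter(0)):
            values = [value for _, value in group]
            results.extend(reduce_fn(key, values))
        return results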

2 vectors: u = [1, 2, 3] v = [4, 5, 6]

dot product (u · v) = 1·4 + 2·5 + 3·6 = 32

mapper function : (index i, vector u, vector v) => (u[i]*v[i]) list

grouper (can concatenate products into a list)

reducer function : (u[i]*v[i] list) => (sum)
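A standalone sketch of the same dot product as mapper → grouper → reducer, done in one process just to show the data flow; the single shared key "dot" is an illustrative choice so that every product lands at the same reducer.

    u = [1, 2, 3]
    v = [4, 5, 6]

    # Mapper: for each index i, emit ("dot", u[i] * v[i]); one shared key
    # means all elementwise products get grouped together.
    mapped = [("dot", u[i] * v[i]) for i in range(len(u))]  # [("dot", 4), ("dot", 10), ("dot", 18)]

    # Grouper: concatenate the products for each key into a list.
    grouped = {}
    for key, value in mapped:
        grouped.setdefault(key, []).append(value)

    # Reducer: sum the grouped products.
    result = {key: sum(values) for key, values in grouped.items()}
    print(result)  # {'dot': 32}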

MapReduce for word counting (how many times does "dog" appear?). Inputs: (lineNo, textStr) pairs.

mapper:

Map(lineNo, textStr) { for each word w in textStr: emit(w, 1) }

reducer:

Reduce(w, interm-vals) { emit(w, interm-vals.size()) }
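The same word count as a runnable single-process sketch; the sample lines are made up, and a real job would spread the (lineNo, text) records across many mapper machines.

    from collections import defaultdict

    lines = [(0, "the dog chased the cat"), (1, "the dog barked")]

    # Map phase: emit (word, 1) for every word on every line.
    mapped = [(word, 1) for _, text in lines for word in text.split()]

    # Shuffle: group the emitted 1s by word.
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)

    # Reduce phase: a word's count is the size (equivalently the sum) of its group.
    counts = {word: len(ones) for word, ones in grouped.items()}
    print(counts["dog"])  # 2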


with huge doc sizes, you PARSE them into pieces (splits) so many mappers can work on them in parallel.

optimizations & fault tolerance:

  • if a map() worker dies, restart the task on a different box
  • if a reducer() dies, restart the reducer
  • the master keeps track of workers via heartbeat messages
  • slow machine? speculative execution! run the same task on two machines, keep whichever copy finishes first, and kill the slower one.
  • large data? use a "combiner" or 'local reducer' to compress data between map & reduce (see the sketch below)
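A sketch of what a combiner buys you for word count; combine here is a hypothetical local reducer that pre-sums counts on the mapper's machine before anything goes over the network.

    from collections import Counter

    def word_count_map(line):
        # Mapper output for one line: a (word, 1) pair per word.
        return [(word, 1) for word in line.split()]

    def combine(pairs):
        # Combiner / local reducer: pre-sum counts so far fewer
        # (word, count) pairs are shipped to the reducers.
        local = Counter()
        for word, count in pairs:
            local[word] += count
        return list(local.items())

    # Without a combiner, "the dog chased the dog" ships 5 pairs;
    # with one it ships 3, e.g. [("the", 2), ("dog", 2), ("chased", 1)].
    print(combine(word_count_map("the dog chased the dog")))

One catch: once a combiner pre-sums, the reducer has to sum the incoming counts rather than just count how many values arrived.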

summary:

  • scalable
  • stateless, everything you need to know is in the input files