Google File System
crawlers have to download web pages from the internet, store them all, and then index them (a copy of the web). GFS treats a room full of computers like one hard drive
*and they have to be constantly crawling the web and updating search results
how are we gonna store the data?
- local disk? jk our hard drive isn't big enough
- can't use relational database - we want this to work across many computers
can we use an NFS filesystem? no, NFS servers get overwhelmed really fast
use a CLUSTER OF COMPUTERS (100s-1000s) with local disks & some network
- servers go into a rack
- all servers in a rack share a switch
- servers have local processing and local disks
- intra-rack bandwidth is cheap (computers inside the same rack)
- inter-rack bandwidth is expensive (between different racks)
WEIRD Requirements:
- high component failure rates (cheap commodity hardware storing massive, multi-GB files)
- throughput: favored over low latency
Failure rates:
- 2-10% / yr
- MTTF (mean time to failure) = how long a device is expected to last
- useful when a broken device will be replaced: like a single computer
- MTBF (mean time between failures) = average time from one failure to the next
- useful when a broken device should be repaired: like an entire cluster
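a quick back-of-the-envelope sketch of what those failure rates mean at cluster scale (the 1,000-server count and 5%/yr rate are assumed example numbers, picked from the 2-10%/yr range above):

```python
# Hypothetical numbers: 1,000 servers, 5%/yr per-machine failure rate.
servers = 1000
annual_failure_rate = 0.05

failures_per_year = servers * annual_failure_rate
days_between_failures = 365 / failures_per_year

print(failures_per_year)                 # 50.0 machines lost per year
print(round(days_between_failures, 1))   # ~7.3 days between failures
```

at this scale a machine dies roughly every week, so failure has to be treated as the normal case, not the exception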
Design solution:
- files stored as a series of chunks (much larger than most filesystems' blocks... default 64MB)
reliability through replication
- each chunk replicated across 3+ chunkservers
- chunkservers hold & serve chunks
- master holds metadata: it knows how to map a file name to its chunks and their chunkserver hosts
client asks master for metadata, then connects to chunkservers
GFS File Read: client asks master for a file by name, then master responds with the chunk list + locations for each chunk. then, the client fetches each chunk, in sequence, from the chunkservers
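a minimal sketch of that read path in Python; the class and method names here are illustrative, not the real GFS API:

```python
CHUNK_SIZE = 64 * 2**20  # default chunk size: 64MB (real clients map offset // CHUNK_SIZE to a chunk index)

class Master:
    """Holds only metadata: filename -> ordered (chunk handle, replica hosts)."""
    def __init__(self):
        self.metadata = {
            "/crawl/pages-001": [("chunk-aa", ["cs1", "cs7", "cs9"]),
                                 ("chunk-ab", ["cs2", "cs7", "cs8"])],
        }

    def lookup(self, filename):
        return self.metadata[filename]

def fetch_chunk(host, handle):
    """Stub: a real client would contact the chunkserver over the network."""
    return f"<{handle} from {host}>".encode()

def read_file(master, filename):
    """Client: one metadata request to the master, then chunkservers directly."""
    data = b""
    for handle, replicas in master.lookup(filename):
        data += fetch_chunk(replicas[0], handle)  # real clients pick the closest replica
    return data

print(read_file(Master(), "/crawl/pages-001"))
```

the key design point: the master is only on the metadata path, so file bytes never flow through it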
- GFS File Write (aka mutation): client asks master for the file's info, master responds with chunk list + locations, client SENDS NEW DATA to the chunkservers, and the replicas try to apply the changes in the same order
Things can go wrong if (see the toy demo after this list):
- if a chunk-server crashes
- if replicas receive the requests in different orders
- if a chunkserver is already busy when it gets the request
- if multiple clients try to simultaneously write to the same file region
- the network doesn't guarantee the order in which messages arrive
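a toy demonstration of the ordering problem: the same two overlapping writes, applied in different orders at two replicas, leave the replicas holding different bytes:

```python
def apply_writes(region, writes):
    """Apply (offset, data) writes to an in-memory byte region, in order."""
    for offset, data in writes:
        region = region[:offset] + data + region[offset + len(data):]
    return region

w1 = (0, b"AAAA")   # client 1 writes at offset 0
w2 = (2, b"BB")     # client 2 writes at offset 2

replica_a = apply_writes(b"....", [w1, w2])  # b'AABB'
replica_b = apply_writes(b"....", [w2, w1])  # b'AAAA'
print(replica_a == replica_b)                # False: the replicas have diverged
```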
To fix this: CONSISTENCY MODEL
- pertains to one piece of data (file region)
- all clients should see the same data, regardless of which replicas they read from
- sacrificing the guarantee that everyone always sees the same data, in exchange for speed
- you have copies, and sometimes the copies can differ (the consistency model defines when your copies are allowed to differ)
consistency model:
- what happens if multiple writes happen to the same region?
- 3 possibilities:
- serial successful writes: all replicas see the same order of write operations (one write, then another, etc.): consistent & defined
- concurrent successful writes (some may be appends to the same file): all replicas see the same order of write operations: consistent, but undefined (all clients see the same data, but it may not reflect what any one mutation has written... they may see incomplete data if a server hasn't finished writing)
- failed writes: can leave an inconsistent region of the file (different clients can see different data at different times)
Atomic Operation: operation appears to the system as if it's happening instantaneously (even though it's not)
- how can google file system make an operation atomic?
- with a LOCK = the master node saying "nobody can touch this file name until I tell my chunkservers whose job it is, and they acknowledge back that it succeeded". but... slow (not parallel)
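a minimal sketch of that idea in Python (names are assumed; this ignores rollback and retries, it just shows the serialize-then-acknowledge pattern):

```python
import threading

class StubChunkserver:
    def __init__(self):
        self.store = {}
    def write(self, filename, data):
        self.store[filename] = data
        return True                       # acknowledge success back to the master

class LockingMaster:
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}                  # filename -> per-file lock

    def _lock_for(self, filename):
        with self._guard:
            return self._locks.setdefault(filename, threading.Lock())

    def atomic_write(self, filename, replicas, data):
        # Nobody else can touch this file name until every chunkserver acks.
        with self._lock_for(filename):    # serializes writers: safe, but slow
            acks = [cs.write(filename, data) for cs in replicas]
            if not all(acks):
                raise IOError("a replica failed to acknowledge the write")

master = LockingMaster()
master.atomic_write("/crawl/pages-001", [StubChunkserver() for _ in range(3)], b"rec")
```

the lock buys atomicity at the cost of parallelism: all writers to the same file wait in line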
failure modes:
- chunkservers report to the master every few seconds (called a "heartbeat" message... so the master can keep track of which chunkservers are up and which go down)
- master can ask chunkservers to re-replicate data
- master writes a log file with everything it does. SO, if the master dies, a shadow master provides read-only access until the master restarts
- so you just kill the process, and it'll come back up
- diversion: flush & sync (flush from memory, sync to disk)
- GFS replication allows the master to tell a chunkserver to copy some data to another server if it notices one server has stopped heartbeating (see the heartbeat sketch below)
- chunkservers keep most copies within a rack, but one copy outside, in case there's a switch failure
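a sketch of the heartbeat bookkeeping (the timeout value and names are assumptions for illustration):

```python
import time

HEARTBEAT_TIMEOUT = 15.0   # assumed: seconds of silence before a server is presumed dead

class HeartbeatTracker:
    def __init__(self):
        self.last_seen = {}                       # chunkserver id -> last report time

    def heartbeat(self, server_id):
        self.last_seen[server_id] = time.time()   # chunkserver checks in

    def dead_servers(self):
        now = time.time()
        return [s for s, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

tracker = HeartbeatTracker()
tracker.heartbeat("cs1")
# master's periodic check, conceptually:
#   for server in tracker.dead_servers():
#       tell a live replica to copy that server's chunks elsewhere
print(tracker.dead_servers())   # [] -- cs1 just reported in
```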
GFS is great for big files, requires custom-written clients, and protects against failures internally.
- not good for newer (real-time) applications, because it was designed for throughput, and latency became important for Gmail and other real-time things
master restart takes 10 seconds (still too high!)
DISTRIBUTED MASTERS??? it's hard.
PAXOS is a family of protocols for solving consensus in a network of unreliable processors (we want to maintain consistency in a distributed system)
- majority voting: READ:
1) client requests data
2) processes in the system VOTE on what the correct data is, based on their replicas. the agreeing majority is called a quorum.
- WRITE:
1) a "proposal" is sent to the processes, with a logical timestamp: "can I write?"
2) processes accept the proposed change if it is the most recent proposal they have seen for that piece of data
3) the write occurs when a majority agree
(Lamport clocks are logical timestamps that tell you whether one event came before another)
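a heavily simplified sketch of the majority-voting idea (this is the quorum concept only, NOT full Paxos: no prepare/promise phase, no leader election, and all names are illustrative):

```python
from collections import Counter

class Acceptor:
    def __init__(self):
        self.highest_seen = -1    # highest proposal number seen (Lamport-style timestamp)
        self.value = None

    def accept(self, proposal_num, value):
        if proposal_num >= self.highest_seen:     # only the newest proposal wins
            self.highest_seen, self.value = proposal_num, value
            return True
        return False

def quorum_write(acceptors, proposal_num, value):
    acks = sum(a.accept(proposal_num, value) for a in acceptors)
    return acks > len(acceptors) // 2             # committed iff a majority accepted

def quorum_read(acceptors):
    votes = Counter(a.value for a in acceptors)   # replicas "vote" on the answer
    value, count = votes.most_common(1)[0]
    return value if count > len(acceptors) // 2 else None

cluster = [Acceptor() for _ in range(5)]
print(quorum_write(cluster, proposal_num=1, value="x=1"))  # True
print(quorum_read(cluster))                                # 'x=1'
```

with 5 acceptors, any 3 form a quorum, so the system keeps working even if 2 machines are down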