IR3 Lecture 13 - Web Search Implementation
Search Challenges: 1) result relevance 2) processing speed 3) scaling to many documents
Next to discuss: 1) crawler design 2) inverted index construction 3) distributed search architecture 4) deduplication
CRAWLING : review of lecture slides (queue-based, breadth-first)
- crawlers index terms in documents and make them accessible for search
- crawlers index all static web pages (including selected non-HTML pages, unless restricted by robots.txt); dynamic pages are handled inconsistently
- robots.txt tells crawlers not to visit certain pages on a site
- crawlers pull out links and terms for the WebDB and term index
- fetcher: takes the fetch list and divides it into k pieces so pages can be fetched in parallel (crawler sketch below)
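A minimal sketch of the queue-based, breadth-first crawl with a robots.txt check. The function names, regex-based link extraction, and page limit are my own illustrative choices, not from the lecture:

```python
# Minimal breadth-first crawler sketch: FIFO queue, robots.txt check,
# and out-link extraction to feed the fetch list / WebDB.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser
import re

LINK_RE = re.compile(r'href="(.*?)"', re.IGNORECASE)

def allowed(url, robots_cache):
    """Check robots.txt (cached per host) before fetching a URL."""
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if host not in robots_cache:
        rp = RobotFileParser(host + "/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None                      # no robots.txt reachable -> assume allowed
        robots_cache[host] = rp
    rp = robots_cache[host]
    return rp is None or rp.can_fetch("*", url)

def crawl(seeds, max_pages=100):
    queue, seen, robots = deque(seeds), set(seeds), {}
    pages = {}                             # url -> html, handed to the indexer
    while queue and len(pages) < max_pages:
        url = queue.popleft()              # FIFO queue => breadth-first order
        if not allowed(url, robots):
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue
        pages[url] = html
        for link in LINK_RE.findall(html): # pull out-links for the fetch list
            link = urljoin(url, link)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```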
INDEXING : alg: merge sort
- merge sort is used to scale indexing
- instead of outputting every term, only output a value when it appears in both sorted lists (the intersection) = the merge-intersection algorithm (see sketch below)
- keep advancing whichever pointer holds the smaller value; when the two values are equal, output it
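A minimal sketch of the merge-intersection step (my own illustration): both posting lists are sorted, so walk two pointers, advance whichever side is smaller, and emit a doc ID only when both lists agree.

```python
def merge_intersect(postings_a, postings_b):
    """Intersect two sorted posting lists of doc IDs."""
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])   # doc appears in both lists
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1                         # advance the smaller side
        else:
            j += 1
    return result

# e.g. merge_intersect([1, 3, 5, 9], [2, 3, 9, 12]) -> [3, 9]
```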
INVERTED INDEX CONSTRUCTION : the index is very large (every term occurrence appears, so it can be larger than the document set itself)
- 1) read in a block of documents (as much as fits in memory) and build an inverted index from it; 2) do the same for the next block; 3) merge the two block indices, and keep merging as more blocks are processed (see the sketch below)
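A minimal sketch of block-based index construction (the function names and toy documents are illustrative, not the lecture's exact implementation): build an in-memory inverted index per block, then merge block indexes pairwise, merge-sort style.

```python
from collections import defaultdict

def index_block(docs):
    """docs: list of (doc_id, text). Returns term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def merge_indexes(a, b):
    """Merge two block indexes; posting lists for shared terms are unioned."""
    merged = {}
    for term in a.keys() | b.keys():
        merged[term] = sorted(set(a.get(term, [])) | set(b.get(term, [])))
    return merged

# Usage: split the collection into memory-sized blocks, index each block,
# then repeatedly merge block indexes until one full index remains.
blocks = [[(1, "britney spears tour"), (2, "web search engines")],
          [(3, "britney new album"), (4, "inverted index merge")]]
full_index = index_block(blocks[0])
for block in blocks[1:]:
    full_index = merge_indexes(full_index, index_block(block))
print(full_index["britney"])   # -> [1, 3]
```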
DISTRIBUTED SEARCHING
- issues to consider when scaling
- too much data for one machine, so the index must be split across machines
- split the term x document matrix row-wise (by term) or column-wise (by document)?
DOC SEGMENTING: partition by column (document): each machine holds a full index over its own subset of documents. A search for "Britney" is sent to all 5 machines; each machine looks up its fragment of that term's row and sends back matching document IDs. You then take the union of the results and output that (see the sketch after the next item).
- if a machine goes down, users won't notice much: the other machines still answer, so only that machine's share of results is missing
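A minimal sketch of document-partitioned (column-wise) search, assuming each machine's index is reachable as a plain dict here (a hypothetical stand-in for an RPC call): fan the query out to every partition and union the doc IDs; a down machine just drops its share of results.

```python
def search_doc_partitioned(term, partitions):
    """partitions: list of per-machine inverted indexes (term -> doc IDs)."""
    results = set()
    for machine_index in partitions:
        if machine_index is None:          # machine is down: lose only its docs
            continue
        results |= set(machine_index.get(term, []))
    return sorted(results)

partitions = [
    {"britney": [1, 7]},      # machine 1 indexes docs 1-10
    None,                     # machine 2 is down; the others still answer
    {"britney": [23]},        # machine 3 indexes docs 21-30
]
print(search_doc_partitioned("britney", partitions))   # -> [1, 7, 23]
```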
TERM SEGMENTING: segment by row (term) instead of by column. Machine 1 holds terms A-B, machine 2 holds C-D, etc. A search for "britney" goes only to machine 1, which returns the full posting list (routing sketch below).
- if a machine goes down, user gets nothing for that term searched. Bad!
if you partition by document, adding new documents is easy (assign each new doc to one machine); if you partition by term, each new document's terms must be spread across many machines (harder).
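A minimal sketch of term-partitioned (row-wise) routing, using the lecture's alphabetical ranges (the function itself is my own illustration): the term alone determines which single machine owns its posting list, and if that machine is down the term returns nothing.

```python
def route_term(term, letters_per_machine=2):
    """Map a term to the one machine that owns its alphabetical range."""
    return (ord(term[0].lower()) - ord("a")) // letters_per_machine + 1

print(route_term("britney"))   # 'b' falls in A-B -> machine 1
print(route_term("dylan"))     # 'd' falls in C-D -> machine 2
```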
DEDUPLICATION
- how do you find duplicate web pages without comparing every pair in full (too expensive)?
SHINGLING : compute the k-shingles for a page (every run of k consecutive tokens)
- if docs share a lot of shingles, they're dups
- pick random shingles from the union of the two docs' shingle sets, check which set(s) each sample falls in, and extrapolate the overlap (an estimate of the Jaccard similarity; see the sketch below)
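A minimal sketch of shingling plus sampled overlap estimation (my own illustration of the idea above; the sample size and toy texts are arbitrary): sample random shingles from the union of two shingle sets and count how many fall in both, which estimates the Jaccard similarity |A ∩ B| / |A ∪ B|.

```python
import random

def k_shingles(text, k=4):
    """All overlapping runs of k consecutive tokens in the page text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def estimated_similarity(shingles_a, shingles_b, samples=200):
    """Estimate Jaccard similarity by sampling shingles from the union."""
    union = list(shingles_a | shingles_b)
    if not union:
        return 0.0
    picked = random.choices(union, k=samples)          # random draws from the union
    in_both = sum(1 for s in picked if s in shingles_a and s in shingles_b)
    return in_both / samples                           # ~ |A ∩ B| / |A ∪ B|

a = k_shingles("a rose is a rose is a rose")
b = k_shingles("a rose is a rose is not a daisy")
print(round(estimated_similarity(a, b), 2))   # ~0.5: the pages share about half their shingles
```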