IR3 Lecture 13 - Web Search Implementation

Search Challenges: 1) result relevance 2) processing speed 3) scaling to many documents

Next to discuss: 1) crawler design 2) inverted index construction 3) distributed search architecture 4) deduplication

crawler = see the crawler lecture slides (queue-based, breadth-first traversal of the link graph); a minimal crawl-loop sketch follows the bullets below

  • crawlers index terms in documents and make them accessible for search
  • crawlers index all static web pages (including selected non-HTML pages), unless restricted by robots.txt; dynamic pages (generated per request) may or may not get crawled
  • robots.txt tells crawlers not to visit certain pages on a site.
  • crawlers pull out links and terms for the WebDB and term index
  • fetcher = takes a fetch list and splits it into k pieces so that k fetcher processes can download pages in parallel
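
To make the crawl loop concrete, here is a minimal sketch of a queue-based, breadth-first crawler frontier (not the lecture's actual crawler). The fetch_page and extract_links callables are hypothetical stand-ins for real downloading and link-extraction code; robots.txt checking uses Python's standard urllib.robotparser.

```python
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse

_robots_cache = {}  # one RobotFileParser per host

def allowed(url):
    """Check robots.txt for the url's host (one cached parser per host)."""
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if host not in _robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(host, "/robots.txt"))
        try:
            rp.read()
        except OSError:
            pass  # robots.txt unreachable: parser stays unread, so can_fetch() returns False
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch("*", url)

def crawl(seeds, fetch_page, extract_links, max_pages=1000):
    """Breadth-first crawl: FIFO queue of URLs, skipping seen or disallowed pages.

    fetch_page(url) -> page text or None; extract_links(url, text) -> absolute URLs.
    Both are hypothetical helpers supplied by the caller.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed(url):
            continue
        text = fetch_page(url)
        if text is None:
            continue
        pages[url] = text  # in a real system this feeds the WebDB / term index
        for link in extract_links(url, text):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```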

INDEXING : alg: merge sort

  • merge sort is used to scale indexing
  • instead of outputting every element (as in a plain merge), only output a value when it appears in both sorted lists (the intersection) = the merge-intersection algorithm! (sketched below)
  • advance whichever list has the smaller current value; when the two values are equal, output that value and advance both
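
A small sketch of that merge-intersection step on two sorted postings lists (doc IDs assumed to be sorted lists of integers):

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two sorted postings lists of doc IDs in O(len(a) + len(b)).

    Advance the pointer of whichever list has the smaller current doc ID;
    when the IDs are equal, output the ID and advance both pointers.
    """
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# intersect([1, 3, 5, 9], [2, 3, 9, 12]) -> [3, 9]
```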

INVERTED INDEX CONSTRUCTION : the index is very large (every occurrence of every term appears in it, so the raw postings can be larger than the document collection itself)

  • read in a block of documents (as many as fit in memory) and build an inverted index for that block; then read the next block, build its index, and merge the two inverted indices; repeat until every document has been indexed (sketch below)
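
A rough sketch of this block-at-a-time construction, assuming documents are plain strings keyed by integer doc IDs. A real indexer writes each block's index to disk and does a multi-way merge of the files; here everything stays in memory for brevity.

```python
from collections import defaultdict

def build_block_index(block):
    """Build an inverted index {term: sorted doc ID list} for one block of (doc_id, text)."""
    index = defaultdict(set)
    for doc_id, text in block:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def merge_indices(idx_a, idx_b):
    """Merge two inverted indices by merging their postings lists term by term."""
    merged = {}
    for term in set(idx_a) | set(idx_b):
        merged[term] = sorted(set(idx_a.get(term, [])) | set(idx_b.get(term, [])))
    return merged

def build_index(docs, block_size=1000):
    """Index the collection block by block, folding each block's index into the result."""
    full = {}
    for start in range(0, len(docs), block_size):
        block = docs[start:start + block_size]
        full = merge_indices(full, build_block_index(block))
    return full

# build_index([(1, "britney spears tour"), (2, "spears of rome"), (3, "britney news")])
```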

DISTRIBUTED SEARCHING

  • issues to consider when scaling search
  • the index is too big for one machine, so it has to be split across many machines
  • split the term-document matrix row-wise (by term) or column-wise (by document)?
  • DOC SEGMENTING: partition by document (column-wise), so each machine has a full index over its own subset of documents. A search for "Britney" is sent to all 5 machines; each machine looks up its fragment of that term's row and sends back the matching document IDs, and you take the union of the responses as the result (see the sketch after this list).

    • if a machine goes down, users barely notice: the other machines still return their share of the results, so only a fraction of the hits is missing
  • TERM SEGMENTING: partition by term (row-wise instead of column-wise). Machine 1 holds terms A-B, machine 2 holds C-D, etc. A search for "britney" goes only to machine 1, which returns the complete postings for that term.

    • if a machine goes down, the user gets nothing back for any term stored on that machine. Bad!
  • if you partition by document, it's easy to add new documents (just assign each one to a machine); if you partition by term, each new document's postings have to be spread across many machines (harder)
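
A toy sketch of the document-partitioned (doc-segmented) case, with each "machine" modeled as an in-memory index and a broker that sends the query term to every partition and unions the doc IDs it gets back:

```python
def doc_partition(docs, num_machines):
    """Assign each (doc_id, text) to a machine (round-robin by doc ID here)
    and build one inverted index per machine."""
    partitions = [dict() for _ in range(num_machines)]
    for doc_id, text in docs:
        index = partitions[doc_id % num_machines]
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return partitions

def search(partitions, term):
    """Broker side: ask every machine for its postings and union the answers.
    If one partition were unavailable, the rest would still contribute hits."""
    hits = set()
    for index in partitions:
        hits |= index.get(term.lower(), set())
    return sorted(hits)

# parts = doc_partition([(1, "britney spears"), (2, "britney live"), (3, "roman history")], 2)
# search(parts, "britney") -> [1, 2]
```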

DEDUPLICATION

  • how do you find duplicate (and near-duplicate) web pages without comparing every pair of pages in full (too expensive)?

SHINGLING : compute the k-shingles for a page (every run of k consecutive words in the page)

  • if two docs share a large fraction of their shingles, they're (near-)duplicates
  • to estimate the overlap cheaply, pick random shingles from the union of the two docs' shingle sets, check whether each sampled shingle lies in one set or both, and extrapolate the fraction in both (an estimate of the Jaccard similarity); see the sketch below
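
A small sketch of k-shingling and the sampling estimate described above. The function names are illustrative, the exact-Jaccard helper is just for comparison, and in practice the sampling is done with min-hash sketches rather than by materializing the full shingle sets.

```python
import random

def shingles(text, k=4):
    """The set of k-shingles (runs of k consecutive words) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    """Exact resemblance: |intersection| / |union| of the two shingle sets."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def estimated_jaccard(s1, s2, samples=200):
    """Estimate resemblance by sampling shingles from the union and checking
    whether each sample appears in both sets."""
    union = list(s1 | s2)
    if not union:
        return 0.0
    picks = random.choices(union, k=samples)
    return sum(1 for sh in picks if sh in s1 and sh in s2) / samples

# a = shingles("britney spears tour dates for 2009 announced today", k=3)
# b = shingles("britney spears tour dates for 2009 announced yesterday", k=3)
# jaccard(a, b) and estimated_jaccard(a, b) should come out close.
```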