IR3 Lecture 13 - Web Search Implementation

Search Challenges: 1) result relevance 2) processing speed 3) scaling to many documents

Next to discuss: 1) crawler design 2) inverted index construction 3) distributed search architecture 4) deduplication

crawler = see the crawler lecture slides (queue-based, breadth-first traversal of the link graph); a minimal crawl-loop sketch follows the bullets below

  • crawlers index terms in documents and make them accessible for search
  • crawlers index all static web pages (including selected non-HTML pages), unless restricted by robots.txt; dynamic pages (generated per request) may or may not get crawled
  • robots.txt tells crawlers not to visit certain pages on a site.
  • crawlers pull out links and terms for the WebDB and term index
  • fetcher = takes a fetch list and splits it into k pieces so that k fetcher processes can download pages in parallel
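
To make the crawl loop concrete, here is a minimal sketch of a queue-based, breadth-first crawler frontier (not the lecture's actual crawler). The fetch_page and extract_links callables are hypothetical stand-ins for real downloading and link-extraction code; robots.txt checking uses Python's standard urllib.robotparser.

```python
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse

_robots_cache = {}  # one RobotFileParser per host

def allowed(url):
    """Check robots.txt for the url's host (one cached parser per host)."""
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if host not in _robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(host, "/robots.txt"))
        try:
            rp.read()
        except OSError:
            pass  # robots.txt unreachable: parser stays unread, so can_fetch() returns False
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch("*", url)

def crawl(seeds, fetch_page, extract_links, max_pages=1000):
    """Breadth-first crawl: FIFO queue of URLs, skipping seen or disallowed pages.

    fetch_page(url) -> page text or None; extract_links(url, text) -> absolute URLs.
    Both are hypothetical helpers supplied by the caller.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed(url):
            continue
        text = fetch_page(url)
        if text is None:
            continue
        pages[url] = text  # in a real system this feeds the WebDB / term index
        for link in extract_links(url, text):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```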

INDEXING : alg: merge sort

  • merge sort is used to scale indexing
  • instead of outputting every element (as in a plain merge), only output a value when it appears in both sorted lists (the intersection) = the merge-intersection algorithm! (sketched below)
  • advance whichever list has the smaller current value; when the two values are equal, output that value and advance both
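
A small sketch of that merge-intersection step on two sorted postings lists (doc IDs assumed to be sorted lists of integers):

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two sorted postings lists of doc IDs in O(len(a) + len(b)).

    Advance the pointer of whichever list has the smaller current doc ID;
    when the IDs are equal, output the ID and advance both pointers.
    """
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# intersect([1, 3, 5, 9], [2, 3, 9, 12]) -> [3, 9]
```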

INVERTED INDEX CONSTRUCTION : the index is very large (every occurrence of every term appears in it, so the raw postings can be larger than the document collection itself)

  • read in a block of documents (as many as fit in memory) and build an inverted index for that block; then read the next block, build its index, and merge the two inverted indices; repeat until every document has been indexed (sketch below)
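
A rough sketch of this block-at-a-time construction, assuming documents are plain strings keyed by integer doc IDs. A real indexer writes each block's index to disk and does a multi-way merge of the files; here everything stays in memory for brevity.

```python
from collections import defaultdict

def build_block_index(block):
    """Build an inverted index {term: sorted doc ID list} for one block of (doc_id, text)."""
    index = defaultdict(set)
    for doc_id, text in block:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def merge_indices(idx_a, idx_b):
    """Merge two inverted indices by merging their postings lists term by term."""
    merged = {}
    for term in set(idx_a) | set(idx_b):
        merged[term] = sorted(set(idx_a.get(term, [])) | set(idx_b.get(term, [])))
    return merged

def build_index(docs, block_size=1000):
    """Index the collection block by block, folding each block's index into the result."""
    full = {}
    for start in range(0, len(docs), block_size):
        block = docs[start:start + block_size]
        full = merge_indices(full, build_block_index(block))
    return full

# build_index([(1, "britney spears tour"), (2, "spears of rome"), (3, "britney news")])
```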

DISTRIBUTED SEARCHING

  • issues to consider when scaling search
  • the index is too big for one machine, so it has to be split across many machines
  • split the term-document matrix row-wise (by term) or column-wise (by document)?
  • DOC SEGMENTING: partition by document (column-wise), so each machine has a full index over its own subset of documents. A search for "Britney" is sent to all 5 machines; each machine looks up its fragment of that term's row and sends back the matching document IDs, and you take the union of the responses as the result (see the sketch after this list).

    • if a machine goes down, users barely notice: the other machines still return their share of the results, so only a fraction of the hits is missing
  • TERM SEGMENTING: partition by term (row-wise instead of column-wise). Machine 1 holds terms A-B, machine 2 holds C-D, etc. A search for "britney" goes only to machine 1, which returns the complete postings for that term.

    • if a machine goes down, the user gets nothing back for any term stored on that machine. Bad!
  • if you partition by document, it's easy to add new documents (just assign each one to a machine); if you partition by term, each new document's postings have to be spread across many machines (harder)
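
A toy sketch of the document-partitioned (doc-segmented) case, with each "machine" modeled as an in-memory index and a broker that sends the query term to every partition and unions the doc IDs it gets back:

```python
def doc_partition(docs, num_machines):
    """Assign each (doc_id, text) to a machine (round-robin by doc ID here)
    and build one inverted index per machine."""
    partitions = [dict() for _ in range(num_machines)]
    for doc_id, text in docs:
        index = partitions[doc_id % num_machines]
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return partitions

def search(partitions, term):
    """Broker side: ask every machine for its postings and union the answers.
    If one partition were unavailable, the rest would still contribute hits."""
    hits = set()
    for index in partitions:
        hits |= index.get(term.lower(), set())
    return sorted(hits)

# parts = doc_partition([(1, "britney spears"), (2, "britney live"), (3, "roman history")], 2)
# search(parts, "britney") -> [1, 2]
```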

DEDUPLICATION

  • how do you find duplicate (and near-duplicate) web pages without comparing every pair of pages in full (too expensive)?

SHINGLING : compute the k-shingles for a page (every run of k consecutive words in the page)

  • if two docs share a large fraction of their shingles, they're (near-)duplicates
  • to estimate the overlap cheaply, pick random shingles from the union of the two docs' shingle sets, check whether each sampled shingle lies in one set or both, and extrapolate the fraction in both (an estimate of the Jaccard similarity); see the sketch below
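
A small sketch of k-shingling and the sampling estimate described above. The function names are illustrative, the exact-Jaccard helper is just for comparison, and in practice the sampling is done with min-hash sketches rather than by materializing the full shingle sets.

```python
import random

def shingles(text, k=4):
    """The set of k-shingles (runs of k consecutive words) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    """Exact resemblance: |intersection| / |union| of the two shingle sets."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def estimated_jaccard(s1, s2, samples=200):
    """Estimate resemblance by sampling shingles from the union and checking
    whether each sample appears in both sets."""
    union = list(s1 | s2)
    if not union:
        return 0.0
    picks = random.choices(union, k=samples)
    return sum(1 for sh in picks if sh in s1 and sh in s2) / samples

# a = shingles("britney spears tour dates for 2009 announced today", k=3)
# b = shingles("britney spears tour dates for 2009 announced yesterday", k=3)
# jaccard(a, b) and estimated_jaccard(a, b) should come out close.
```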