Web Search & Intro to Info Retrieval

2nd half of the course: how to scale websites / how to do things faster, better

SEARCH:

challenges:

  • finding relevant results (today's and next few lectures)
  • then processing speed
  • then scaling to many documents

Search Document Model:

  • incoming link text
  • title
  • page content
  • unique docid
  • Imagine searching by a SQL query (horribly inefficient)

Searching the web:

  • web search is basically a database problem, but no one uses SQL databases for it. The workload is a bad fit for SQL:
    • every query is a top-k query
    • every query plan is the same
    • there are massive numbers of queries and massive amounts of data
  • so search engines are engineered totally differently from a SQL database

History of ranked search:

oldest: BOOLEAN RETRIEVAL

- give a simple predicate, and each document gets a true or false value for whether it comes up in the search results.
- "exact match" retrieval
- AND, OR, NOT, NEAR
- can be done using a normal matrix with 0/1 values (rows: terms, here character names; cols: plays; a cell is 1 if that character appears in the play); see the sketch at the end of this section

- # of rows you have is based upon the # of TERMS you have in your vocabulary.

((Difference between "term" and "word": searching the term "The White House" should find documents about The White House, not documents about houses that have been painted white. Certain phrases are meaningful terms, which is why you may have more terms than words. What counts as a term is up to you, the 'index builder'))

- Could you also have fewer terms than words? Yes, because there are "stop-words" that are pretty irrelevant and not worth indexing. Ex: "to," "be," "not," "the" are bad to search for on their own. BUT the phrase "to be or not to be" does matter.

Stemming: you only want the root word, rather than every inflected form (past, present, etc.). Ex: I want the root of the word walk; I don't want to have to input "walk," "walked," "walking," "walker," etc.
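A minimal sketch of the idea (this naive suffix-stripper is only for illustration; real indexers use something like the Porter stemmer):

```python
# Naive stemming: strip common suffixes so inflected forms share one root term.
SUFFIXES = ("ing", "ed", "er", "s")

def stem(word):
    for suffix in SUFFIXES:
        # only strip if enough of the word remains to be a plausible root
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# "walk", "walked", "walking", "walker" all collapse to the same index term
print({stem(w) for w in ["walk", "walked", "walking", "walker"]})  # {'walk'}
```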

- Decide what your terms are, and you can create a term-document matrix. 
- If you have "position information," you can do proximity searches. Ex: 'Gates' NEAR 'Microsoft' only matches docs where the two terms appear close together, so searching for Microsoft also surfaces the 'Gates' results you'd expect.
- The titles of documents are more valuable to search over than the body text.
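A minimal sketch of exact-match retrieval over a 0/1 term-document matrix (the plays and terms below are toy data, in the spirit of the characters-in-plays example):

```python
# Term-document incidence matrix: one row per term, one column per document.
# A cell is 1 if the term appears in that document, else 0.
docs = ["Julius Caesar", "Hamlet", "Macbeth"]
incidence = {
    "brutus": [1, 0, 0],
    "caesar": [1, 1, 0],
    "witch":  [0, 0, 1],
}

def boolean_and(term_a, term_b):
    # exact match: each doc either satisfies the predicate or it doesn't
    return [doc for doc, a, b in zip(docs, incidence[term_a], incidence[term_b])
            if a & b]

print(boolean_and("brutus", "caesar"))  # ['Julius Caesar'] -- true/false only, no ranking
```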

next oldest: VECTOR-SPACE MODEL

- Boolean queries are a fine start, but they don't give you **ranking**. Most people only look at the top 10 search results, so it matters a lot.

- "HITS" = count how many times a term appears in a document, and rank search results by hits. 
    - Doing only this is a problem! A malicious user can design a document that will always come up first in search You could just put every term copied & pasted a zillion times in your HTML doc, and just hide a hidden HTML tag. 

- what about... term frequency, term scarcity, and length of documents?
- a document is a vector: a point in a hugely multi-dimensional space, with one axis (dimension) per term in the vocabulary
- for intuition, picture a vocabulary of just 3 terms: then each document is a point in 3-D space
- most documents do not have most words, but they still have a lot of them
- term frequency (tf): how many times the word appears in the current document (if kangaroo is mentioned 10x instead of once, it'll be a better search result)

- document frequency (df): COMPARING DOCS.
     How often the term appears across the doc collection, i.e. how many docs in collection C contain it. Terms with low doc frequency are more meaningful, because the term appears in only a small number of docs. 

**"book" is more common** and has a **higher document frequency** = BAD search term 
"Rumplestiltskin" has a low doc frequency = GOOD search term 

TF x IDF REVIEW FORMULA

(term frequency x inverse document frequency)

COMPUTE THIS IF THE VECTOR IS GIVEN NORMALIZED:

    W = tf x log(N / n)

weight = term frequency x inverse doc frequency
inverse doc frequency = log(total # docs in collection C / # docs in C that contain the term)

COMPUTE THIS TO NORMALIZE A VECTOR (TF-IDF normalization):

    normalized weight = weight of term in document i / sqrt(sum of squared weights in document i)

i.e. divide each weight in document i by the length of document i's vector (the square root of the sum of its squared weights).
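A minimal sketch of both formulas on a toy collection (the docs, terms, and names below are made up for illustration):

```python
import math

# Toy collection: each doc is a list of terms (already tokenized/stemmed)
docs = [
    ["kangaroo", "kangaroo", "book"],
    ["book", "house"],
    ["kangaroo", "house", "book"],
]
N = len(docs)                                        # total # docs in collection C
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}   # n: # docs containing each term

def tfidf_vector(doc):
    # W = tf x log(N / n) for every term in the vocabulary
    weights = [doc.count(t) * math.log(N / df[t]) for t in vocab]
    # normalize: divide by sqrt(sum of squared weights) so the vector has length 1
    length = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / length for w in weights]

print(vocab)
print(tfidf_vector(docs[0]))
```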

next oldest: COSINE DISTANCE REVIEW FORMULA

  • you have to find the difference in angle between two vectors in space
  • if two things are perfectly aligned, give score of 1
  • the farthest apart two vectors can be while staying in the same quadrant is 90 degrees, where cosine is 0

VECTOR SPACE SIMILARITY: determine the angle between 2 vectors i and j

Sim(Di, Dj) = Sum over terms k of (weight(ik) * weight(jk))

if you're dealing with unit vectors, their dot (inner) product is their cosine.

EX: query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7). what is the distance (aka similarity score)?

sim(Q,D2) = [(0.4 x 0.2) + (0.8 x 0.7)] / sqrt([(0.4)^2 + (0.8)^2] * [(0.2)^2 + (0.7)^2])
          = 0.64 / sqrt(0.80 * 0.53)
          ≈ 0.98

*trick question because these vectors are un-normalized, so you have to divide by the normalization factor:

  • normalization = length of first vector * length of second vector
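A minimal sketch that checks the worked example above (vector values from the example; the function name is my own):

```python
import math

def cosine_sim(a, b):
    # dot product divided by the product of the two vectors' lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

Q, D2 = [0.4, 0.8], [0.2, 0.7]
print(round(cosine_sim(Q, D2), 2))  # 0.98 -- the denominator handles un-normalized vectors
```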

newest: ASSESSING RANK QUALITY

- how do you know what you've built is good? Hard, because it's not logical, it's personal: "if a user says it's right, it's right"
- *relevance* is key

Precision & Recall: REVIEW FORMULA

  • precision: fraction of retrieved docs that are relevant = (relevant AND retrieved) / retrieved
  • recall: fraction of relevant docs that are retrieved = (relevant AND retrieved) / relevant

| | Relevant | Not Relevant |
|---|---|---|
| Retrieved | tp | fp |
| Not Retrieved | fn | tn |

Precision P = tp/(tp+fp)

Recall R = tp/(tp+fn)

you can improve precision at the cost of recall, and vice versa; better to have higher recall
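A minimal sketch of both metrics over sets of doc ids (the ids are made up; the names are my own):

```python
# Toy example: docs the engine returned vs. the docs that are actually relevant
retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5, 6, 7}

tp = len(retrieved & relevant)   # relevant docs we retrieved
fp = len(retrieved - relevant)   # junk we retrieved anyway
fn = len(relevant - retrieved)   # relevant docs we missed

precision = tp / (tp + fp)       # 2 / 4 = 0.5
recall = tp / (tp + fn)          # 2 / 5 = 0.4
print(precision, recall)
```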

KENDALL'S TAU: bubble sort distance

take all pairs of items, and see if each pair is in the same order in both rankings

t = [nc (pairs that agree) - nd (pairs that disagree)] / [1/2 * n * (n-1)]

(the denominator is the total number of pairs)

best: 1, worst: -1, uncorrelated: 0
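A minimal sketch of the pairwise computation (O(n²) over all pairs; the example rankings are made up, and ties count as disagreements here for simplicity):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    # rank_a[i] and rank_b[i] are item i's positions in the two rankings
    n = len(rank_a)
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        # the pair agrees if both rankings order items i and j the same way
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0:
            nc += 1
        else:
            nd += 1
    return (nc - nd) / (0.5 * n * (n - 1))

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0  (perfect agreement)
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (perfectly reversed)
```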