IR2

summary: vector spaces

mean reciprocal rank

you don't remember the title, so you google a few terms you know are in the doc. you just need the one best answer: score each search as 1/(rank of that answer), then take the mean over multiple searches

  • there's only one article you care about, not the ranking of the entire result list.
  • so you find the position at which that article is ranked, take the reciprocal (1/rank), and then take the mean over a few searches.
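
a minimal sketch of the computation above, assuming each search is recorded as the 1-based rank of the one article you care about (None if it never showed up, which is scored as 0 here). the function name and input format are made up for illustration.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the target article in each search, or None if missing."""
    # reciprocal rank per search; a missing article contributes 0 (an assumption)
    reciprocal = [1.0 / r if r else 0.0 for r in ranks]
    return sum(reciprocal) / len(reciprocal)


# three searches: target article came up 1st, 3rd, and 2nd
mrr = mean_reciprocal_rank([1, 3, 2])  # (1 + 1/3 + 1/2) / 3 = 11/18
```

an MRR of 1.0 would mean the article was always the top result.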

ranking assessment: precision vs recall (requires relevant vs not relevant judgments)
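
a small sketch of the two measures, assuming the relevance judgments are given as a set of relevant doc ids. precision = fraction of retrieved docs that are relevant; recall = fraction of relevant docs that were retrieved. names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """retrieved: doc ids the engine returned; relevant: doc ids judged relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 docs returned, 3 docs are relevant overall, 2 of them were returned
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
```

here precision is 2/4 and recall is 2/3.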

Kendall's tau: measures correlation between 2 rankings by comparing how many pairs of items they order the same way vs. the opposite way. MAKES SENSE!!
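
a rough sketch of the simplest version (tau-a), assuming both rankings cover the same items and have no ties: tau = (concordant pairs - discordant pairs) / (total pairs). the dict-based input format is an assumption for illustration.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping item -> rank position (same items, no ties)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # positive product: both rankings order x and y the same way
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) / 2
    return (concordant - discordant) / n_pairs


a = {"A": 1, "B": 2, "C": 3}
b = {"A": 1, "B": 3, "C": 2}  # B and C swapped
tau = kendall_tau(a, b)  # 2 concordant, 1 discordant -> 1/3
```

identical rankings give +1, fully reversed rankings give -1.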

Graphs (or networks)

  • prestige & importance : centrality
  • which node is the most important (article)?
  • 1) find each node's shortest-path distance to every other node
  • 2) find the closeness: add up all the distances, then take the reciprocal
  • 3) normalize: multiply closeness by (# nodes - 1)
  • what if certain nodes are more important than others?
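
the three closeness steps above can be sketched like this, assuming an unweighted, connected graph stored as an adjacency dict (so breadth-first search gives the shortest distances). the graph format and function name are assumptions.

```python
from collections import deque


def closeness(graph, source):
    """graph: dict node -> list of neighbours (unweighted, assumed connected)."""
    # step 1: shortest distance from source to every other node via BFS
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    # steps 2 and 3: reciprocal of the distance sum, scaled by (# nodes - 1)
    total = sum(dist.values())
    return (len(graph) - 1) / total if total else 0.0


# path graph a - b - c: the middle node is the most central
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = {node: closeness(path, node) for node in path}
```

in this example b scores 1.0 and the two end nodes score 2/3.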

PageRank: REVIEW formula

  • importance of page = probability the random surfer is on your page
  • suppose P has N forward links; surfer clicks on each link with probability 1/N
  • my page rank = sum over all nodes that point to me of (their page rank * 1/their out degree)
  • recursive, my page rank is dependent on other nodes' page ranks, and vice versa

  • 1 problem with random surfer: a webpage with no outgoing links (sink node) leaves the surfer stuck. fix: restart the surfer at a random page at that point

  • 1 - d = probability that random surfer will type in another URL instead of clicking a link from the webpage (d = damping factor)
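
a sketch of the random-surfer idea as plain power iteration, using the common textbook form PR(p) = (1 - d)/N + d * sum over pages q linking to p of PR(q)/outdegree(q). assumptions not in the notes: d = 0.85, a fixed 50 iterations, sink pages restart at a uniformly random page, and every link target is also a key of the dict. this may not be the exact formula from lecture.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict page -> list of pages it links to (all targets must be keys)."""
    nodes = list(links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}  # start with a uniform distribution
    for _ in range(iterations):
        # rank sitting on sink pages is spread evenly (the "restart")
        sink_mass = sum(rank[p] for p in nodes if not links[p])
        new_rank = {p: (1 - d) / n + d * sink_mass / n for p in nodes}
        for p in nodes:
            for q in links[p]:
                # p passes an equal share of its rank along each outgoing link
                new_rank[q] += d * rank[p] / len(links[p])
        rank = new_rank
    return rank


web = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank(web)
```

the ranks sum to 1, so they can be read as the probability of finding the surfer on each page; here a ends up highest because both b and c point to it.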

building a search engine? REVIEW FORMULA FOR SCORE

1) compute page rank 2) compute search score

HITS

Hubs & Authorities

  • you're a good authority if a lot of good hubs point to you
  • you're a good hub if you point to a lot of good authorities
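
the mutual definition above can be sketched as an iteration: authority score = sum of hub scores of pages pointing to you, hub score = sum of authority scores of pages you point to, normalizing after each step. the iteration count, the L2 normalization, and the dict input format are assumptions.

```python
import math


def _normalize(scores):
    # scale so the squared scores sum to 1 (keeps the values from blowing up)
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=50):
    """links: dict page -> list of pages it links to (all targets must be keys)."""
    nodes = list(links)
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # good authority: lots of good hubs point to you
        auth = {p: 0.0 for p in nodes}
        for p in nodes:
            for q in links[p]:
                auth[q] += hub[p]
        auth = _normalize(auth)
        # good hub: you point to lots of good authorities
        hub = _normalize({p: sum(auth[q] for q in links[p]) for p in nodes})
    return hub, auth


web = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(web)
```

in this toy graph a1 comes out as the best authority and h1 as the best hub.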

Latent Semantic Indexing:

  • documents are about topics, not terms. "this article is about x".
  • Use singular value decomposition (SVD, a generalization of eigendecomposition) of the term-document matrix (???)
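
for reference, the usual way this is written (a sketch; the symbols are assumptions, not from lecture): A is the m x n term-document matrix, and you keep only the k largest singular values, so each of the k dimensions acts like a "topic".

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% a query vector q (over the m terms) is mapped into the same k-dimensional topic space:
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
```

documents (rows of V_k) and the mapped query can then be compared by cosine similarity in topic space instead of term space.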

Relevance Factors :

  • which parts of the page words appear in
  • how close together the words appear
  • synonyms of the query words, even if the user didn't specify them
  • latent semantic indexing
  • guesses of user intent

SEARCH ENGINE OPTIMIZATION:

  • link farms (bad: lots of websites created just to point to you)
  • use text rather than images & flash for important content
  • make your site work with JS, Java, & CSS DISABLED (that's what the search engine sees)
  • avoid links that look like form queries
  • have pages that focus on particular topics
  • have other relevant sites link to yours

Link Spam:

  • when you post your link in the comments of an important blog
  • blog owners should add rel="nofollow" to comment links to make sure search engines don't count them