IR2
summary: vector spaces
mean reciprocal rank
you don't remember the title, so you google a few terms you know are in the doc. you just need the one best answer, so score each search as 1 / (rank of that answer), then take the mean over multiple searches
- there's only one article you care about, not the entire search's ranking.
- so you figure out which rank that article lands at, take the reciprocal, and then find the mean over a few searches.
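quick sketch of the MRR steps above. assumptions (not from the lecture): a search that never returns the article scores 0, and the function names and sample data are made up for illustration.

```python
def reciprocal_rank(results, target):
    """Return 1 / (1-based rank of target in results), or 0.0 if it is missing."""
    for rank, doc in enumerate(results, start=1):
        if doc == target:
            return 1.0 / rank
    return 0.0  # assumption: a miss contributes 0


def mean_reciprocal_rank(searches):
    """searches is a list of (ranked_results, target) pairs."""
    if not searches:
        return 0.0
    return sum(reciprocal_rank(r, t) for r, t in searches) / len(searches)


# Hypothetical searches: the article we want shows up at rank 1, 2, and 4.
searches = [
    (["a", "b", "c"], "a"),       # reciprocal rank 1/1
    (["x", "y", "z"], "y"),       # reciprocal rank 1/2
    (["p", "q", "r", "s"], "s"),  # reciprocal rank 1/4
]
mrr = mean_reciprocal_rank(searches)  # (1 + 1/2 + 1/4) / 3
print(mrr)
```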
ranking assessment: precision vs recall (requires relevant vs not relevant judgments)
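minimal sketch of precision vs recall, assuming we already have the relevant / not relevant judgments as a set of doc ids. the ids and the function name are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """precision = hits / retrieved, recall = hits / relevant."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical search: 4 docs returned, 3 docs judged relevant, 2 overlap.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```

precision asks "how much of what I got back was relevant", recall asks "how much of the relevant stuff did I get back".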
kendall's tau: measures correlation between 2 rankings MAKES SENSE!!
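sketch of kendall's tau for two rankings of the same items. this is the simple no-ties version: (concordant pairs - discordant pairs) / total pairs. the lecture may use a variant that handles ties, so treat this as an assumption. names and data are made up.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """rank_a and rank_b map each item to its rank position (no ties assumed)."""
    items = list(rank_a)
    concordant = 0
    discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: both rankings order x and y the same way.
        product = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings that disagree only on the order of B and C.
ranking_1 = {"A": 1, "B": 2, "C": 3, "D": 4}
ranking_2 = {"A": 1, "B": 3, "C": 2, "D": 4}
tau = kendall_tau(ranking_1, ranking_2)  # 5 concordant, 1 discordant, 6 pairs
print(tau)
```

tau is 1 when the rankings agree completely and -1 when one is the exact reverse of the other.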
Graphs (or networks)
- prestige & importance : centrality
- which node is the most important (article)?
- 1) find each node's distance to every other node
- 2) find the closeness: add up all the distances, then take the reciprocal
- 3) normalize: multiply closeness by (# nodes - 1)
- what if certain nodes are more important than others?
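sketch of the 3 closeness steps above. assumptions: the graph is unweighted, undirected, and connected, so distance = number of hops found by breadth-first search. the graph and function names are made up.

```python
from collections import deque


def bfs_distances(graph, source):
    """Step 1: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(graph, node):
    """Steps 2 and 3: reciprocal of summed distances, times (# nodes - 1)."""
    total = sum(bfs_distances(graph, node).values())
    if total == 0:
        return 0.0  # single-node graph: nothing to be close to
    return (1.0 / total) * (len(graph) - 1)


# Hypothetical path graph: A - B - C - D
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = {node: closeness(graph, node) for node in graph}
print(scores)
```

the middle nodes B and C come out more central than the endpoints A and D, which matches the intuition.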
PageRank: REVIEW formula
- importance of page = probability the random surfer is on your page
- suppose P has N forward links; surfer clicks on each link with probability 1/N
- my page rank = sum over all nodes that point to me of (their page rank * 1/their out degree)
recursive, my page rank is dependent on other nodes' page ranks, and vice versa
1 problem with random surfer: a webpage with no outgoing links (sink node) traps the surfer. fix: restart the surfer at a random page at that point
- (1 - d) = probability that the random surfer types in another URL instead of clicking a link on the current page (d = damping factor)
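sketch of the random surfer as repeated updates. it assumes one commonly used normalized form, PR(p) = (1 - d)/n + d * sum over q pointing to p of PR(q)/outdegree(q), where n = total number of pages (not the N forward links above). check it against the lecture's formula, which may differ. sinks are handled by restarting at a uniformly random page. d = 0.85, the graph, and the names are assumptions.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of pages it links to.

    Assumes every link target also appears as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank sitting on sink pages gets spread evenly (the "restart").
        sink_mass = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1 - d) / n + d * sink_mass / n for p in pages}
        for q in pages:
            for p in links[q]:
                # q passes rank to p, split over q's out degree.
                new_rank[p] += d * rank[q] / len(links[q])
        rank = new_rank
    return rank


# Hypothetical 4-page web: everyone points at C, nobody points at D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(ranks)

# Hypothetical web with a sink: B has no outgoing links.
sink_ranks = pagerank({"A": ["B"], "B": []})
```

the ranks always sum to 1, so they can be read as the probability of finding the surfer on each page.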
building a search engine? REVIEW FORMULA FOR SCORE
1) compute page rank 2) compute search score
HITS
Hubs & Authorities
- you're a good authority if a lot of good hubs point to you
- you're a good hub if you point to a lot of good authorities
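sketch of the hub / authority mutual reinforcement above: authority score = sum of hub scores pointing at you, hub score = sum of authority scores you point at, repeat. normalizing by vector length each round is a common choice and an assumption here. graph and names are made up.

```python
import math


def normalize(scores):
    """Scale scores so the sum of squares is 1 (keeps numbers from blowing up)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Good authority: many good hubs point to you.
        auth = {p: 0.0 for p in pages}
        for q in pages:
            for p in links[q]:
                auth[p] += hub[q]
        auth = normalize(auth)
        # Good hub: you point to many good authorities.
        hub = normalize({q: sum(auth[p] for p in links[q]) for q in pages})
    return hub, auth


# Hypothetical graph: three hub pages pointing at two authority pages.
links = {
    "H1": ["A1", "A2"],
    "H2": ["A1", "A2"],
    "H3": ["A1"],
    "A1": [],
    "A2": [],
}
hub, auth = hits(links)
print(hub, auth)
```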
Latent Semantic Indexing:
- documents are about topics, not terms. "this article is about x".
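- Use SVD (singular value decomposition, a close relative of eigen-decomposition) of the document-term matrix, keeping only the largest singular values as the "topics" (???)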
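small sketch of the LSI idea, assuming numpy is available. it takes the SVD of a tiny made-up document-term matrix, keeps the top k = 2 singular values, and compares documents in that reduced "topic" space. the documents, terms, and k are all hypothetical.

```python
import numpy as np

# Rows = documents, columns = terms:
# car, engine, wheel, tire, banana, fruit
doc_term = np.array([
    [1, 1, 0, 0, 0, 0],  # d0: car engine
    [0, 1, 1, 0, 0, 0],  # d1: engine wheel
    [0, 0, 1, 1, 0, 0],  # d2: wheel tire
    [0, 0, 0, 0, 1, 1],  # d3: banana fruit
    [0, 0, 0, 0, 2, 1],  # d4: banana banana fruit
], dtype=float)

# Thin SVD: doc_term = U @ diag(S) @ Vt, singular values sorted descending.
U, S, Vt = np.linalg.svd(doc_term, full_matrices=False)

k = 2  # number of topics to keep (assumption)
doc_topics = U[:, :k] * S[:k]  # each row = a document in topic space


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


raw_sim = cosine(doc_term[0], doc_term[2])         # d0 and d2 share no terms
topic_sim = cosine(doc_topics[0], doc_topics[2])   # but land on the same topic
cross_sim = cosine(doc_topics[0], doc_topics[3])   # car doc vs fruit doc
print(raw_sim, topic_sim, cross_sim)
```

d0 and d2 have no words in common, so their term-space similarity is 0, but in topic space they sit together because d1 links their vocabularies. that is the "documents are about topics, not terms" point.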
Relevance Factors :
- which parts of the page words appear in
- how close together the words appear
- synonyms of the query words, even when the user didn't specify them
- latent semantic indexing
- guesses of user intent
SEARCH ENGINE OPTIMIZATION:
- link farms (bad - a lot of websites set up just to point to you)
- use text rather than images & flash for important content
- make your site work with JS, Java, & CSS DISABLED (that's what search engine sees)
- avoid links that look like form queries
- have pages that focus on particular topics
- have other relevant sites link to yours
Link Spam:
- when you post your link in a comment on an important blog
- blog owners should put rel="nofollow" on comment links to make sure search engines don't count that link