PageRank on Wikipedia

PageRank is a famous scoring function invented and deployed by the Google guys in early days of websearch. It assigns each webpage a score based on the scores of webpages that link to it. As you can see, it’s a recursive definition, but if you use the right formula, then it’ll converge to something meaningful. The idea is that you can tell how important a webpage is by our important the pages are that link to it.

PR(p)=1dN+dqpPR(q)L(q)\text{PR}(p) = \frac{1-d}{N} + d\sum_{q \to p} \frac{\text{PR}(q)}{L(q)}

Here, d is a dampening factor, N is the number of pages and L(q)L(q) is the number of links out of page q.

The algorithm is pretty straight forward. Assign each page the page rank 1/n and then update each page’s score using the formula above until it converges. I thought it might take a long time to converge, or require special parallel processing, but even with 7 million wikipedia pages, the adjacency list fits in memory. Whenever I update the score for a page and the score changes by more than ϵ=1010\epsilon = 10^{-10}, I throw its outgoing neighbors into a queue for processing (if they aren’t already in the queue). The scores converge on my M2 MacBook within a few minutes.

I had held off on implementing this ranking. I figured that most wikipedia pages would be high quality and PageRank might not be able to distinguish them well. Also, there’s an old wikipedia game where if you take a random page and click on the first link, and then click on the first link of that page and so on, you’ll always get back to philosophy. So, I wasn’t hopeful about how much information I’d get from PageRank. I’m implementing it because I need some kind of ordering on documents so I can terminate retrieval early after I find enough pages that match a query. And, if you’re going to have early termination, you want the best pages to come first.

Well, as it turns out, PageRank works great on wikipedia giving some fairly obviously important pages. Here’s the top 20 wikipedia pages by PageRank:

  1. Association_football
  2. World_War_II
  3. United_States
  4. France
  5. Race_and_ethnicity_in_the_United_States_census
  6. Iran
  7. World_War_I
  8. United_Kingdom
  9. India
  10. Germany
  11. China
  12. Catholic_Church
  13. Village
  14. New_York_City
  15. Australia
  16. The_New_York_Times
  17. Moth
  18. London
  19. Latin
  20. National_Register_of_Historic_Places

Right at the top of the list: football. I can imagine that there’s a ton of pages for teams, players and events which all link to football. If those pages themselves are linked to, then football will get a really high score. The same logic works for the others — events, places, people, businesses will all link to the countries or cities where they happen. I find “Village” and “Moth” particularly interesting. Surely there’s lots of pages for individual villages and moths that link to those.

Now that I’m ordering the documents, that has ripples on how much of the pipeline infrastructure is setup. So, I’m busy rebuilding all the datasets and carrying this order along.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *