PageRank is a famous scoring function invented and deployed by the Google guys in early days of websearch. It assigns each webpage a score based on the scores of webpages that link to it. As you can see, it’s a recursive definition, but if you use the right formula, then it’ll converge to something meaningful. The idea is that you can tell how important a webpage is by our important the pages are that link to it.
Here, d is a dampening factor, N is the number of pages and is the number of links out of page q.
The algorithm is pretty straight forward. Assign each page the page rank 1/n and then update each page’s score using the formula above until it converges. I thought it might take a long time to converge, or require special parallel processing, but even with 7 million wikipedia pages, the adjacency list fits in memory. Whenever I update the score for a page and the score changes by more than , I throw its outgoing neighbors into a queue for processing (if they aren’t already in the queue). The scores converge on my M2 MacBook within a few minutes.
I had held off on implementing this ranking. I figured that most wikipedia pages would be high quality and PageRank might not be able to distinguish them well. Also, there’s an old wikipedia game where if you take a random page and click on the first link, and then click on the first link of that page and so on, you’ll always get back to philosophy. So, I wasn’t hopeful about how much information I’d get from PageRank. I’m implementing it because I need some kind of ordering on documents so I can terminate retrieval early after I find enough pages that match a query. And, if you’re going to have early termination, you want the best pages to come first.
Well, as it turns out, PageRank works great on wikipedia giving some fairly obviously important pages. Here’s the top 20 wikipedia pages by PageRank:
- Association_football
- World_War_II
- United_States
- France
- Race_and_ethnicity_in_the_United_States_census
- Iran
- World_War_I
- United_Kingdom
- India
- Germany
- China
- Catholic_Church
- Village
- New_York_City
- Australia
- The_New_York_Times
- Moth
- London
- Latin
- National_Register_of_Historic_Places
Right at the top of the list: football. I can imagine that there’s a ton of pages for teams, players and events which all link to football. If those pages themselves are linked to, then football will get a really high score. The same logic works for the others — events, places, people, businesses will all link to the countries or cities where they happen. I find “Village” and “Moth” particularly interesting. Surely there’s lots of pages for individual villages and moths that link to those.
Now that I’m ordering the documents, that has ripples on how much of the pipeline infrastructure is setup. So, I’m busy rebuilding all the datasets and carrying this order along.

Leave a Reply