PageRank on Wikipedia

Apr 17, 2026

—

PageRank is a famous scoring function invented and deployed by the Google guys in early days of websearch. It assigns each webpage a score based on the scores of webpages that link to it. As you can see, it’s a recursive definition, but if you use the right formula, then it’ll converge to something meaningful. The idea is that you can tell how important a webpage is by our important the pages are that link to it.

\text{PR}(p) = \frac{1-d}{N} + d\sum_{q \to p} \frac{\text{PR}(q)}{L(q)}

Here, d is a dampening factor, N is the number of pages and $L(q)$ is the number of links out of page q.

The algorithm is pretty straight forward. Assign each page the page rank 1/n and then update each page’s score using the formula above until it converges. I thought it might take a long time to converge, or require special parallel processing, but even with 7 million wikipedia pages, the adjacency list fits in memory. Whenever I update the score for a page and the score changes by more than $\epsilon = 10^{-10}$ , I throw its outgoing neighbors into a queue for processing (if they aren’t already in the queue). The scores converge on my M2 MacBook within a few minutes.

I had held off on implementing this ranking. I figured that most wikipedia pages would be high quality and PageRank might not be able to distinguish them well. Also, there’s an old wikipedia game where if you take a random page and click on the first link, and then click on the first link of that page and so on, you’ll always get back to philosophy. So, I wasn’t hopeful about how much information I’d get from PageRank. I’m implementing it because I need some kind of ordering on documents so I can terminate retrieval early after I find enough pages that match a query. And, if you’re going to have early termination, you want the best pages to come first.

Well, as it turns out, PageRank works great on wikipedia giving some fairly obviously important pages. Here’s the top 20 wikipedia pages by PageRank:

Association_football
World_War_II
United_States
France
Race_and_ethnicity_in_the_United_States_census
Iran
World_War_I
United_Kingdom
India
Germany
China
Catholic_Church
Village
New_York_City
Australia
The_New_York_Times
Moth
London
Latin
National_Register_of_Historic_Places

Right at the top of the list: football. I can imagine that there’s a ton of pages for teams, players and events which all link to football. If those pages themselves are linked to, then football will get a really high score. The same logic works for the others — events, places, people, businesses will all link to the countries or cities where they happen. I find “Village” and “Moth” particularly interesting. Surely there’s lots of pages for individual villages and moths that link to those.

Now that I’m ordering the documents, that has ripples on how much of the pipeline infrastructure is setup. So, I’m busy rebuilding all the datasets and carrying this order along.

pagerank scoring wikipedia

PageRank on Wikipedia

Comments

Leave a Reply Cancel reply