{"id":136,"date":"2026-04-17T17:55:28","date_gmt":"2026-04-17T17:55:28","guid":{"rendered":"https:\/\/byte64.com\/?p=136"},"modified":"2026-04-17T17:59:06","modified_gmt":"2026-04-17T17:59:06","slug":"pagerank-on-wikipedia","status":"publish","type":"post","link":"https:\/\/byte64.com\/?p=136","title":{"rendered":"PageRank on Wikipedia"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/PageRank\">PageRank<\/a> is a famous scoring function invented and deployed by the Google guys in early days of websearch. It assigns each webpage a score based on the scores of webpages that link to it. As you can see, it&#8217;s a recursive definition, but if you use the right formula, then it&#8217;ll converge to something meaningful. The idea is that you can tell how important a webpage is by our important the pages are that link to it. <\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>PR<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>p<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mfrac><mrow><mn>1<\/mn><mo>\u2212<\/mo><mi>d<\/mi><\/mrow><mi>N<\/mi><\/mfrac><mo>+<\/mo><mi>d<\/mi><mrow><munder><mo movablelimits=\"false\">\u2211<\/mo><mrow><mi>q<\/mi><mo>\u2192<\/mo><mi>p<\/mi><\/mrow><\/munder><\/mrow><mfrac><mrow><mtext>PR<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>q<\/mi><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><mrow><mi>L<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>q<\/mi><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/mfrac><\/mrow><annotation encoding=\"application\/x-tex\">\\text{PR}(p) = \\frac{1-d}{N} + d\\sum_{q \\to p} \\frac{\\text{PR}(q)}{L(q)}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Here, d is a dampening factor, N is the number of pages and <math data-latex=\"L(q)\"><semantics><mrow><mi>L<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>q<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">L(q)<\/annotation><\/semantics><\/math> is the number of links out of page q.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The algorithm is pretty straight forward. Assign each page the page rank 1\/n and then update each page&#8217;s score using the formula above until it converges. I thought it might take a long time to converge, or require special parallel processing, but even with 7 million wikipedia pages, the adjacency list fits in memory. Whenever I update the score for a page and the score changes by more than <math data-latex=\"\\epsilon = 10^{-10}\"><semantics><mrow><mi>\u03f5<\/mi><mo>=<\/mo><msup><mn>10<\/mn><mrow><mo lspace=\"0em\" rspace=\"0em\">\u2212<\/mo><mn>10<\/mn><\/mrow><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\epsilon = 10^{-10}<\/annotation><\/semantics><\/math>, I throw its outgoing neighbors into a queue for processing (if they aren&#8217;t already in the queue). The scores converge on my M2 MacBook within a few minutes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I had held off on implementing this ranking. I figured that most wikipedia pages would be high quality and PageRank might not be able to distinguish them well. Also, there&#8217;s an <a href=\"https:\/\/en.wikipedia.org\/wiki\/Wikipedia_philosophy_phenomenon\">old wikipedia game<\/a> where if you take a random page and click on the first link, and then click on the first link of that page and so on, you&#8217;ll always get back to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Philosophy\">philosophy<\/a>. So, I wasn&#8217;t hopeful about how much information I&#8217;d get from PageRank. I&#8217;m implementing it because I need some kind of ordering on documents so I can terminate retrieval early after I find enough pages that match a query. And, if you&#8217;re going to have early termination, you want the best pages to come first.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Well, as it turns out, PageRank works great on wikipedia giving some fairly obviously important pages. Here&#8217;s the top 20 wikipedia pages by PageRank:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Association_football<\/li>\n\n\n\n<li>World_War_II<\/li>\n\n\n\n<li>United_States<\/li>\n\n\n\n<li>France<\/li>\n\n\n\n<li>Race_and_ethnicity_in_the_United_States_census<\/li>\n\n\n\n<li>Iran<\/li>\n\n\n\n<li>World_War_I<\/li>\n\n\n\n<li>United_Kingdom<\/li>\n\n\n\n<li>India<\/li>\n\n\n\n<li>Germany<\/li>\n\n\n\n<li>China<\/li>\n\n\n\n<li>Catholic_Church<\/li>\n\n\n\n<li>Village<\/li>\n\n\n\n<li>New_York_City<\/li>\n\n\n\n<li>Australia<\/li>\n\n\n\n<li>The_New_York_Times<\/li>\n\n\n\n<li>Moth<\/li>\n\n\n\n<li>London<\/li>\n\n\n\n<li>Latin<\/li>\n\n\n\n<li>National_Register_of_Historic_Places<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Right at the top of the list: football. I can imagine that there&#8217;s a ton of pages for teams, players and events which all link to football. If those pages themselves are linked to, then football will get a really high score. The same logic works for the others &#8212; events, places, people, businesses will all link to the countries or cities where they happen. I find &#8220;Village&#8221; and &#8220;Moth&#8221; particularly interesting. Surely there&#8217;s lots of pages for individual villages and moths that link to those.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now that I&#8217;m ordering the documents, that has ripples on how much of the pipeline infrastructure is setup. So, I&#8217;m busy rebuilding all the datasets and carrying this order along.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>PageRank is a famous scoring function invented and deployed by the Google guys in early days of websearch. It assigns each webpage a score based on the scores of webpages that link to it. As you can see, it&#8217;s a recursive definition, but if you use the right formula, then it&#8217;ll converge to something meaningful. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[30,16,18],"tags":[31,27,7],"class_list":["post-136","post","type-post","status-publish","format-standard","hentry","category-algorithms","category-data-processing","category-search","tag-pagerank","tag-scoring","tag-wikipedia"],"_links":{"self":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/136","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=136"}],"version-history":[{"count":4,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/136\/revisions"}],"predecessor-version":[{"id":140,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/136\/revisions\/140"}],"wp:attachment":[{"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=136"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=136"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=136"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}