The wikispeedia graph
When we think of wikipedia as a graph, we imagine nodes (articles) linked to others by directed edges (hyperlinks). We call a link Source a wikipedia page and a link Target a link in a source. Let’s look at the distribution of all the wikispeedia nodes:
Beyond counting links, we can use graph centrality measures to identify influential articles. Here we compute HITS scores (hubs and authorities) and PageRank values for all pages in the Wikipedia graph.
The HITS algorithm identifies two types of central nodes: hubs (pages pointing to many authoritative pages) and authorities (pages pointed to by many hubs). PageRank evaluates the overall importance of a page in the graph, taking into account both the quantity and quality of incoming links. Comparing these measures with the simple link counts shows that some pages are structurally important even if they do not have the highest number of direct links.
HITS confirms that the greatest authorities are geographical but the biggest hubs are lists. They don’t contain a lot of information but acts like directories. “Driving_on_the_left_or_right” is certainly a list of countries is thus very powerful in wikispeedia.
Link position
However, when playing wikispeedia, you are not just given a list of edges. You have a lot of text structured in sections, with some hyperlinks in it. You will always see the links appearing in the lead section or the infobox. But how many times will you scroll down the whole english history page to find the link you need?
It is thus important to look at the links placement in the articles:
Instead of restricting ourselves to the the sections we can look at the position in words and see how links are distributed in the pages in numbers of words
We could scale the y-axis to better understand the distribution. However, the problem might simply be that articles have variables length. So links far away will have a low count as they both need to be in a long article and need to be at the end of that article.
The previous plot, however, gives us a good idea of how far you need to scroll to get certain links. A page is rougly 500 words (the first bin) so 50% of the links are accessible in rougly 2 pages.
To solve the problem about the distribution we need to look at relative distances:
While the high count at the beginning of the articles matches our analysis and hypothesis up to now, we would not expect having more links towards the end.
Actually, at the end of some pages are listed a lot of related articles. Which can be very useful for navigation. For example:
![]() |
|---|
| Example of the bottom page of the United States article |
In the HITS algorithm page, the related PageRank is linked as another example.
But how useful are these links for wikispeedia? The first links could only link to small hubs or be repeatitive.
Hint
- We now want to focus on the top links.
