Wikipedia graph structure

The wikispeedia graph

When we think of wikipedia as a graph, we imagine nodes (articles) linked to others by directed edges (hyperlinks). We call a link Source a wikipedia page and a link Target a link in a source. Let’s look at the distribution of all the wikispeedia nodes:

In the joint distribution which shows a high density with some outliers. The Wikipedia_Text_of_the_GNU_Free_Documentation_License is a link appearing on all pages but having only one outgoing link, itself. People might click on it by curiosity but would get stuck and would have to go back. We can thus safely remove it from the paths and articles as it shifts uselessly the means and other statistics.

Beyond counting links, we can use graph centrality measures to identify influential articles. Here we compute HITS scores (hubs and authorities) and PageRank values for all pages in the Wikipedia graph.

The HITS algorithm identifies two types of central nodes: hubs (pages pointing to many authoritative pages) and authorities (pages pointed to by many hubs). PageRank evaluates the overall importance of a page in the graph, taking into account both the quantity and quality of incoming links. Comparing these measures with the simple link counts shows that some pages are structurally important even if they do not have the highest number of direct links.

HITS confirms that the greatest authorities are geographical but the biggest hubs are lists. They don’t contain a lot of information but acts like directories. “Driving_on_the_left_or_right” is certainly a list of countries is thus very powerful in wikispeedia.

However, when playing wikispeedia, you are not just given a list of edges. You have a lot of text structured in sections, with some hyperlinks in it. You will always see the links appearing in the lead section or the infobox. But how many times will you scroll down the whole english history page to find the link you need?

It is thus important to look at the links placement in the articles:

We see that that most links are placed at the beginning of the pages. The lead section contains 17.5% of the links while it is much smaller than the body.

Instead of restricting ourselves to the the sections we can look at the position in words and see how links are distributed in the pages in numbers of words

The links are more often placed at the beginning and the distribution is heavy tailed and seems to follow a power law.

We could scale the y-axis to better understand the distribution. However, the problem might simply be that articles have variables length. So links far away will have a low count as they both need to be in a long article and need to be at the end of that article.

The previous plot, however, gives us a good idea of how far you need to scroll to get certain links. A page is rougly 500 words (the first bin) so 50% of the links are accessible in rougly 2 pages.

To solve the problem about the distribution we need to look at relative distances:

Now we clearly see that most links are at the beginning of the articles. There are links from the lead even at the end. This means there is no body in some very short articles. Surprisingly, the article counts gets bigger at the end of the articles.

While the high count at the beginning of the articles matches our analysis and hypothesis up to now, we would not expect having more links towards the end.

Actually, at the end of some pages are listed a lot of related articles. Which can be very useful for navigation. For example:

US_bottom
Example of the bottom page of the United States article

In the HITS algorithm page, the related PageRank is linked as another example.

But how useful are these links for wikispeedia? The first links could only link to small hubs or be repeatitive.

Hint
  • We now want to focus on the top links.