Power of the first links

Before looking at the players’ behaviour, let’s see whether the first links are highly connected by looking at how the path possibilities shrink when we restrict each page to its top links.

By only looking at the lead section, we are still able to complete more than half of the paths, even though the lead contains less than a fifth of the links (cf. previous section).

The lead section contains a lot of useful links. By looking solely at the lead section, we still have a good chance of being able to reach the goal article. This is beneficial: the lead is smaller and contains fewer links, so we can make choices faster and not bother scrolling through the whole page.

We can go further and look at how important the first \(n\) links are:

The number of possible paths grows quickly up to the first 10 links, then converges to the maximum number of paths.

We also see that the number of paths with a shortest length of 1 to 3 grows linearly, the number with length 4 grows superlinearly, and the number with length greater than 4 increases up to some point and then decreases.

This could happen because at the beginning, when we add the \(n^{\text{th}}\) hyperlink, a lot of unseen links are added to the subgraph. But as \(n\) increases, the number of new unseen links decreases, and thus so does the number of new paths.

This would explain the shortest-path distribution: adding new unseen links increases the number of paths, but their shortest length is longer as there are only a few edges per node.

When we increase \(n\) further, we are not adding new paths but connectivity! It is like keeping the same subgraph but adding edges to create “shortcuts”.
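To make the restriction to the first \(n\) links concrete, here is a minimal sketch of how the path counts could be computed. It is not the code used for the plots: the `links` dictionary (article name mapped to its outgoing links in page order), the function names and the toy articles are all assumptions made for illustration. A "path" is counted here as an ordered (start, goal) pair for which the goal is reachable.

```python
from collections import Counter, deque


def first_n_subgraph(links, n):
    """Keep only the first n outgoing links of every article (page order)."""
    return {article: targets[:n] for article, targets in links.items()}


def shortest_path_lengths(graph, source):
    """Breadth-first search: map each reachable article to its click distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, []):
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist


def path_length_histogram(links, n):
    """Count reachable (start, goal) pairs by shortest-path length."""
    graph = first_n_subgraph(links, n)
    histogram = Counter()
    for source in graph:
        for target, length in shortest_path_lengths(graph, source).items():
            if target != source:
                histogram[length] += 1
    return histogram


# Hypothetical toy graph, invented for illustration only.
toy_links = {
    "A": ["B", "C", "D"],
    "B": ["C", "A"],
    "C": ["D", "B"],
    "D": ["A", "C"],
}

for n in (1, 2, 3):
    hist = path_length_histogram(toy_links, n)
    print(n, sum(hist.values()), dict(sorted(hist.items())))
```

On this toy graph the total number of reachable pairs is already at its maximum for \(n = 1\). Larger \(n\) only moves pairs from long shortest paths to short ones, which is the “shortcut” effect in miniature.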

We can confirm our hypothesis by plotting the number of nodes in the subgraph made from the first \(n\) links:

The plot confirms our hypothesis. For greater \(n\), the number of added nodes (unseen links) becomes very small.
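The node count can be sketched in the same spirit. We assume here that a "node" is any article that is the target of at least one kept link. The original analysis may count nodes differently, and the `links` structure and toy data are again hypothetical.

```python
def distinct_targets(links, n):
    """Articles reachable as a target of one of the first n links of any page."""
    targets = set()
    for outgoing in links.values():
        targets.update(outgoing[:n])
    return targets


# Hypothetical toy graph: each list is an article's links in page order.
toy_links = {
    "A": ["B", "D", "E", "C"],
    "B": ["C", "A", "D", "E"],
    "C": ["A", "E", "F", "B"],
    "D": ["A", "B", "C"],
}

# Number of distinct targets for n = 1..4.
sizes = [len(distinct_targets(toy_links, n)) for n in range(1, 5)]
# Newly seen articles contributed by each additional link rank.
added = [sizes[0]] + [b - a for a, b in zip(sizes, sizes[1:])]
print(sizes)
print(added)
```

In this toy example the number of newly seen articles drops to zero by \(n = 4\), mirroring the flattening described above.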

We can take a break here in our data story and draw a first conclusion. We expect the 2025 Wikipedia graph to have roughly the same properties as the 2007 graph that we study. The top articles won’t change much, and the distribution of section categories and link positions is likely the same, as the article format is still the same today and people tend to repeat the same link placement. Moreover, we still observe today that the lead section and the infobox put the most important links first. We thus assume that the effects of these properties, which we highlight in the next parts, still hold today.

Now that we have looked into the graph structure and the role of link positions, we can look at the navigation patterns.

Hint
  • Enough of the structure
  • How do people navigate?
  • Where are the interesting links? You might find this one there