Calculating centrality indices with the Paris public transport network
Table of Contents
I thought it would be interesting to try some basic network analysis of the Paris métro (rapid transit) and RER (commuter rail) systems, starting with some centrality indices. These are metrics that rank the nodes in a graph according to their importance, for some definition of importance. For example, in a transport network, the centrality of a station could give an indication of the likelihood that passengers will pass through that station as part of their itinerary. A public transport authority could use this information to gauge how inconvenient it would be if particular stations were closed, e.g. because of flooding or maintenance, and to help it identify the most effective ways of improving the resilience of the network. Here I’ll use Python, networkx, and SQLite.
Constructing the graph #
Here I’ve chosen to combine the métro and RER into a single graph. The first challenge is to construct this graph as an edge list and a node list. The agencies that manage the Paris public transport networks don’t provide such a dataset, but they do provide a GTFS dataset representing the train schedules for the next 30 days. (Thanks to Oliver Blanthorn for pointing this out.) The GTFS data represents the trips (sequences of stops) that trains are scheduled to take along their routes (métro or RER lines). To construct the edge list for the network, we can traverse each trip, storing an edge between each stop and the next one. Some trips go in one direction, others go in the opposite direction, and many don’t cover the whole line, but by traversing them all, we should build up a complete directed graph of the network. We’ll have to traverse about 64,000 trips, containing a total of nearly 1.5 million stops, and a lot of the data will be redundant for our purposes, but this doesn’t matter as long as we can process it in a reasonable amount of time.
We therefore have to implement an ETL (Extract, transform, load) process. I wrote a shell script that downloads the GTFS data, as well as several files from a dataset called Référentiel des lignes de transport en commun d’île-de-France, which we will also need. Since all these files are in CSV format, and use unique identifiers to link together the data in different files, it will be easier to work with this data if we use a relational database. In this shell script, I import the necessary files into an intermediate SQLite database, and create some database indexes to speed up the queries in the transformation step. This takes about 21 seconds on my laptop, and the size of the resulting database is 2.1G.
Finally, I’ve written a Python script for the transform and load phases. It reads the intermediate database and builds the output database, which just contains a table of edges and a table of nodes. Since the output database is small (132K), the script creates it in memory for better performance, and writes it to disk when it’s complete. This script takes about 8 seconds to run.
In the Référentiel, links between the métro and and RER networks are represented by zones de correspondance and pôles d’échange, which can include multiple stations connected by corridors. To include these relations in the graph, I’ve chosen to add edges between all the stations in the same group.
In the output database, the graph is a directed multigraph, because two stations can be connected by more than one line. In such cases, we can see those two stations as having a stronger connection than two stations connected by only one line. Other things being equal, more lines mean more passengers travelling between those two stations. We’ll keep this information about mutiple edges when we do our calculations.
Drawing parts of the graph #
First let’s draw some of the data to see if it looks correct. We can write a function that draws a transport network graph. Here we don’t care about the geographical positions of the stations, we just want to see that we have the right stations with the right links between them. To help us see which line is which, we can use the colours defined for the transport network map, which are given in the GTFS dataset, plus black for the edges I added for the zones de correspondance and the pôles d’échange.
|
|
Let’s try it with the métro lines 1 and 4, and the RER A:
|
|
That looks correct.
Plotting results #
The centrality functions in networkx
return a dictionary whose keys are node IDs and whose values are centrality scores. Let’s define a function that takes such a dictionary and plots a bar chart of the top 25 stations in descending order of centrality.
|
|
Degree centrality #
First let’s read the whole graph:
|
|
Loaded 563 nodes and 1363 edges.
The simplest centrality index is degree centrality: for each node, we can count its inbound links (for in-degree centrality) or its outbound links (for out-degree centrality), and normalise the result by dividing by $n - 1$, where $n$ is the number of nodes. In our case, the results are the same for in-degree and out-degree centrality.
|
|
We can see immediately that the stations with the highest scores are mostly métro stations rather than RER stations. One possible explanation could be the apparently more radial structure of the RER network compared to the métro. Besides being located on more than one line, several of these stations (such as Havre-Caumartin) belong to a pôle d’échange.
Eigenvector centrality #
With eigenvector centrality, each node gets a score that is proportional to the sum of the scores of its neighbours. This way, a node can get a high score by having many neighbours and/or by having high-scoring neighbours. Given an undirected graph of $n$ nodes, for a node $i$ whose set of neighbours is $M(i)$, the eigenvector centrality $x_i$ is defined as
$$ x_i = \frac{1}{\lambda} \sum_{j \in M(i)} x_j $$where $\lambda$ is a constant. We can rewrite this equation using the network’s adjacency matrix $A$, in which $a_{ij}$ is 1 if there is an edge that links nodes $i$ and $j$, and 0 otherwise:
$$ x_i = \frac{1}{\lambda} \sum_{j = 1}^n a_{ij} x_j $$This is equivalent to
$$ \mathbf{x} = \frac{1}{\lambda} A\mathbf{x} $$where $\mathbf{x}$ is a vector whose elements are the centrality scores. We can write this as
$$ A\mathbf{x} = \lambda\mathbf{x} $$which means that $\lambda$ is an eigenvalue of $A$ and $\mathbf{x}$ is the corresponding eigenvector. Given the eigenvalues and eigenvectors of the matrix, choosing the appropriate eigenvector is simple, because (according to the Perron–Frobenius theorem), for a matrix (like the adjacency matrix) that has no negative elements, only the eigenvector corresponding to the largest eigenvalue can be chosen to have no negative elements. With a directed graph like our transport network, we can choose a left eigenvector or a right eigenvector. Usually the left eigenvector is chosen (and this is what networkx
does), so that a node’s score is based on its inbound edges rather than its outbound edges. This doesn’t make much difference in our case, though, because nearly every inbound edge in our network has a matching outbound edge.
When eigenvector centrality is used with a directed graph, the graph must be strongly connected. This is true of our transport network, because every station is reachable from every other station:
|
|
True
Here we use the networkx.eigenvector_centrality_numpy
function, which first constructs the adjacency matrix. In a directed multigraph such as this one, if there are $n$ edges in the same direction between two nodes, networkx
represents them as the value $n$ in the matrix, which is equivalent to a single edge with weight $n$. It then gets the relevant eigenvector, and returns a dictionary containing each node ID and the corresponding element of the eigenvector.
|
|
If we compare these results to the degree centrality results, we can see that Havre-Caumartin, which is in the middle of the six-station Paris Saint-Lazare - Opéra pôle d’échange, gets a higher score because it’s adjacent to the two highly-ranked stations Saint-Lazare and Opéra. Several other stations, such as Saint-Augustin, Auber, and Haussmann Saint-Lazare, also get higher scores for the same reason.
Betweenness centrality #
Betweenness centrality counts the number of times a node is on one of the shortest paths (i.e., those that pass through as few nodes as possible) between each pair of other nodes. For a node $v$, it is defined as
$$ c_B(v) =\sum_{s \neq v \neq t} \frac{\sigma(s, t|v)}{\sigma(s, t)} $$where $s$ and $t$ are any other nodes in the network, $\sigma(s, t)$ is the number of shortest paths from $s$ to $t$, and $\sigma(s, t|v)$ is the number of those paths that pass through $v$. In a transport network, a station with a high value for $c_B(v)$ is one that many passengers are likely to pass through if they choose one of the shortest paths to their destination.
|
|
Most of the stations on this list are RER stations rather than métro stations, which makes sense, because the RER covers more distance between stops, so if the same journey can be done via métro or via RER, the shortest path will usually be via RER. It’s not surprising to see Châtelet les Halles at the top of the list, given its geographically central location and the large number of lines that pass through its pôle d’échange. It also looks like there are other configurations that can give a station a relatively high rank. For example, Juvisy is on a long section of the RER C where few stations have connections to other lines, and it connects to a similar section of the RER D, so it seems likely that many shortest paths have to pass through it.
Conclusion #
This has just been a brief exploration of a few centrality indices for this network, with some examples of how different definitions of a node’s importance result in different rankings. All the scripts used here can be found in this blog’s Git repository, along with the Jupyter notebook from which I generated this page. Please contact me if you have any questions or suggestions.