Benjamin Geer

Calculating centrality indices with the Paris public transport network

Benjamin Geer
Table of Contents

I thought it would be interesting to try some basic network analysis of the Paris métro (rapid transit) and RER (commuter rail) systems, starting with some centrality indices. These are metrics that rank the nodes in a graph according to their importance, for some definition of importance. For example, in a transport network, the centrality of a station could give an indication of the likelihood that passengers will pass through that station as part of their itinerary. A public transport authority could use this information to gauge how inconvenient it would be if particular stations were closed, e.g. because of flooding or maintenance, and to help it identify the most effective ways of improving the resilience of the network. Here I’ll use Python, networkx, and SQLite.

Constructing the graph #

Here I’ve chosen to combine the métro and RER into a single graph. The first challenge is to construct this graph as an edge list and a node list. The agencies that manage the Paris public transport networks don’t provide such a dataset, but they do provide a GTFS dataset representing the train schedules for the next 30 days. (Thanks to Oliver Blanthorn for pointing this out.) The GTFS data represents the trips (sequences of stops) that trains are scheduled to take along their routes (métro or RER lines). To construct the edge list for the network, we can traverse each trip, storing an edge between each stop and the next one. Some trips go in one direction, others go in the opposite direction, and many don’t cover the whole line, but by traversing them all, we should build up a complete directed graph of the network. We’ll have to traverse about 64,000 trips, containing a total of nearly 1.5 million stops, and a lot of the data will be redundant for our purposes, but this doesn’t matter as long as we can process it in a reasonable amount of time.

We therefore have to implement an ETL (Extract, transform, load) process. I wrote a shell script that downloads the GTFS data, as well as several files from a dataset called Référentiel des lignes de transport en commun d’île-de-France, which we will also need. Since all these files are in CSV format, and use unique identifiers to link together the data in different files, it will be easier to work with this data if we use a relational database. In this shell script, I import the necessary files into an intermediate SQLite database, and create some database indexes to speed up the queries in the transformation step. This takes about 21 seconds on my laptop, and the size of the resulting database is 2.1G.

Finally, I’ve written a Python script for the transform and load phases. It reads the intermediate database and builds the output database, which just contains a table of edges and a table of nodes. Since the output database is small (132K), the script creates it in memory for better performance, and writes it to disk when it’s complete. This script takes about 8 seconds to run.

In the Référentiel, links between the métro and and RER networks are represented by zones de correspondance and pôles d’échange, which can include multiple stations connected by corridors. To include these relations in the graph, I’ve chosen to add edges between all the stations in the same group.

In the output database, the graph is a directed multigraph, because two stations can be connected by more than one line. In such cases, we can see those two stations as having a stronger connection than two stations connected by only one line. Other things being equal, more lines mean more passengers travelling between those two stations. We’ll keep this information about mutiple edges when we do our calculations.

Drawing parts of the graph #

First let’s draw some of the data to see if it looks correct. We can write a function that draws a transport network graph. Here we don’t care about the geographical positions of the stations, we just want to see that we have the right stations with the right links between them. To help us see which line is which, we can use the colours defined for the transport network map, which are given in the GTFS dataset, plus black for the edges I added for the zones de correspondance and the pôles d’échange.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
%config InlineBackend.figure_formats = ['svg']
import sqlite3
import pandas as pd
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
import itertools as it


def draw_graph(G, title, height, label_x_offset):
    plt.figure(figsize=(12, height))
    plt.margins(x=0.18, y=0)
    plt.title(title)

    pos = nx.nx_agraph.graphviz_layout(G, prog="dot")

    # See https://networkx.org/documentation/stable/auto_examples/drawing/plot_multigraphs.html
    connectionstyle = [f"arc3,rad={r}" for r in it.accumulate([0.15] * 4)]

    nx.draw_networkx_nodes(G, pos, node_size=30, node_color="#000000")

    edge_colors = [
        f"#{G[source_node_id][target_node_id][edge_key]["route_color"]}"
        for source_node_id, target_node_id, edge_key in G.edges
    ]

    nx.draw_networkx_edges(
        G, pos, edge_color=edge_colors, connectionstyle=connectionstyle
    )

    label_pos = {
        node_id: (node_pos[0] + label_x_offset, node_pos[1])
        for node_id, node_pos in pos.items()
    }

    text = nx.draw_networkx_labels(
        G,
        label_pos,
        labels=nx.get_node_attributes(G, "node_name"),
        horizontalalignment="left",
        verticalalignment="bottom",
        font_size=6,
    )

    for _, t in text.items():
        t.set_rotation(15)

Let’s try it with the métro lines 1 and 4, and the RER A:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
DATABASE = "../../../notebook-data/paris-rail-data/paris-graph.db"

conn = sqlite3.connect(f"file:{DATABASE}?mode=ro")

nodes = pd.read_sql_query(
    """select distinct
        nodes.node_id, nodes.node_name
    from nodes
    inner join edges on
        nodes.node_id = edges.source_node_id or
        nodes.node_id = edges.target_node_id
    where
        (edges.route_type = 1 and edges.route_name = '1') or
        (edges.route_type = 1 and edges.route_name = '4') or
        (edges.route_type = 2 and edges.route_name = 'A') or
        (edges.route_type is null and
            (edges.route_name = 'PdE Châtelet - Les Halles' or
            edges.route_name = 'ZdC Gare de Lyon'))""",
    conn,
    index_col="node_id",
)

edges = pd.read_sql_query(
    """select * from edges where
        (edges.route_type = 1 and edges.route_name = '1') or
        (edges.route_type = 1 and edges.route_name = '4') or
        (edges.route_type = 2 and edges.route_name = 'A') or
        (edges.route_type is null and
            (edges.route_name = 'PdE Châtelet - Les Halles' or
            edges.route_name = 'ZdC Gare de Lyon'))""",
    conn,
).replace({np.nan: None})

G = nx.from_pandas_edgelist(
    edges,
    source="source_node_id",
    target="target_node_id",
    edge_attr=True,
    create_using=nx.MultiDiGraph(),
)

node_attributes = nodes.to_dict(orient="index")
nx.set_node_attributes(G, node_attributes)

draw_graph(
    G,
    title="Métro lines 1 and 4 and the RER A",
    height=20,
    label_x_offset=6,
)

svg

That looks correct.

Plotting results #

The centrality functions in networkx return a dictionary whose keys are node IDs and whose values are centrality scores. Let’s define a function that takes such a dictionary and plots a bar chart of the top 25 stations in descending order of centrality.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
def centrality_chart(centrality_dict, title):
    names_and_scores = [
        (node_attributes[item[0]]["node_name"], item[1])
        for item in centrality_dict.items()
    ]

    top_25 = sorted(names_and_scores, key=lambda t: t[1], reverse=True)[:25]

    top_25_df = pd.DataFrame(
        top_25,
        columns=["Station", "Centrality score"],
    ).set_index("Station")

    ax = top_25_df.plot.barh(title=title)
    ax.invert_yaxis()

Degree centrality #

First let’s read the whole graph:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
nodes = pd.read_sql_query(
    "select * from nodes",
    conn,
    index_col="node_id",
)

edges = pd.read_sql_query(
    "select * from edges",
    conn,
).replace({np.nan: None})

conn.close()

G = nx.from_pandas_edgelist(
    edges,
    source="source_node_id",
    target="target_node_id",
    edge_attr=True,
    create_using=nx.MultiDiGraph(),
)

node_attributes = nodes.to_dict(orient="index")
nx.set_node_attributes(G, node_attributes)
print(f"Loaded {len(G.nodes)} nodes and {len(G.edges)} edges.")
Loaded 563 nodes and 1363 edges.

The simplest centrality index is degree centrality: for each node, we can count its inbound links (for in-degree centrality) or its outbound links (for out-degree centrality), and normalise the result by dividing by $n - 1$, where $n$ is the number of nodes. In our case, the results are the same for in-degree and out-degree centrality.

1
2
in_degree_centrality_dict = nx.in_degree_centrality(G)
centrality_chart(in_degree_centrality_dict, "In-degree centrality")

svg

We can see immediately that the stations with the highest scores are mostly métro stations rather than RER stations. One possible explanation could be the apparently more radial structure of the RER network compared to the métro. Besides being located on more than one line, several of these stations (such as Havre-Caumartin) belong to a pôle d’échange.

Eigenvector centrality #

With eigenvector centrality, each node gets a score that is proportional to the sum of the scores of its neighbours. This way, a node can get a high score by having many neighbours and/or by having high-scoring neighbours. Given an undirected graph of $n$ nodes, for a node $i$ whose set of neighbours is $M(i)$, the eigenvector centrality $x_i$ is defined as

$$ x_i = \frac{1}{\lambda} \sum_{j \in M(i)} x_j $$

where $\lambda$ is a constant. We can rewrite this equation using the network’s adjacency matrix $A$, in which $a_{ij}$ is 1 if there is an edge that links nodes $i$ and $j$, and 0 otherwise:

$$ x_i = \frac{1}{\lambda} \sum_{j = 1}^n a_{ij} x_j $$

This is equivalent to

$$ \mathbf{x} = \frac{1}{\lambda} A\mathbf{x} $$

where $\mathbf{x}$ is a vector whose elements are the centrality scores. We can write this as

$$ A\mathbf{x} = \lambda\mathbf{x} $$

which means that $\lambda$ is an eigenvalue of $A$ and $\mathbf{x}$ is the corresponding eigenvector. Given the eigenvalues and eigenvectors of the matrix, choosing the appropriate eigenvector is simple, because (according to the Perron–Frobenius theorem), for a matrix (like the adjacency matrix) that has no negative elements, only the eigenvector corresponding to the largest eigenvalue can be chosen to have no negative elements. With a directed graph like our transport network, we can choose a left eigenvector or a right eigenvector. Usually the left eigenvector is chosen (and this is what networkx does), so that a node’s score is based on its inbound edges rather than its outbound edges. This doesn’t make much difference in our case, though, because nearly every inbound edge in our network has a matching outbound edge.

When eigenvector centrality is used with a directed graph, the graph must be strongly connected. This is true of our transport network, because every station is reachable from every other station:

1
nx.is_strongly_connected(G)
True

Here we use the networkx.eigenvector_centrality_numpy function, which first constructs the adjacency matrix. In a directed multigraph such as this one, if there are $n$ edges in the same direction between two nodes, networkx represents them as the value $n$ in the matrix, which is equivalent to a single edge with weight $n$. It then gets the relevant eigenvector, and returns a dictionary containing each node ID and the corresponding element of the eigenvector.

1
2
eigenvector_centrality_dict = nx.eigenvector_centrality_numpy(G)
centrality_chart(eigenvector_centrality_dict, "Eigenvector centrality")

svg

If we compare these results to the degree centrality results, we can see that Havre-Caumartin, which is in the middle of the six-station Paris Saint-Lazare - Opéra pôle d’échange, gets a higher score because it’s adjacent to the two highly-ranked stations Saint-Lazare and Opéra. Several other stations, such as Saint-Augustin, Auber, and Haussmann Saint-Lazare, also get higher scores for the same reason.

Betweenness centrality #

Betweenness centrality counts the number of times a node is on one of the shortest paths (i.e., those that pass through as few nodes as possible) between each pair of other nodes. For a node $v$, it is defined as

$$ c_B(v) =\sum_{s \neq v \neq t} \frac{\sigma(s, t|v)}{\sigma(s, t)} $$

where $s$ and $t$ are any other nodes in the network, $\sigma(s, t)$ is the number of shortest paths from $s$ to $t$, and $\sigma(s, t|v)$ is the number of those paths that pass through $v$. In a transport network, a station with a high value for $c_B(v)$ is one that many passengers are likely to pass through if they choose one of the shortest paths to their destination.

1
2
betweenness_centrality_dict = nx.betweenness_centrality(G)
centrality_chart(betweenness_centrality_dict, "Betweenness centrality")

svg

Most of the stations on this list are RER stations rather than métro stations, which makes sense, because the RER covers more distance between stops, so if the same journey can be done via métro or via RER, the shortest path will usually be via RER. It’s not surprising to see Châtelet les Halles at the top of the list, given its geographically central location and the large number of lines that pass through its pôle d’échange. It also looks like there are other configurations that can give a station a relatively high rank. For example, Juvisy is on a long section of the RER C where few stations have connections to other lines, and it connects to a similar section of the RER D, so it seems likely that many shortest paths have to pass through it.

Conclusion #

This has just been a brief exploration of a few centrality indices for this network, with some examples of how different definitions of a node’s importance result in different rankings. All the scripts used here can be found in this blog’s Git repository, along with the Jupyter notebook from which I generated this page. Please contact me if you have any questions or suggestions.

Categories:
Topics: