We’re inundated with data. Ever-expanding databases and spreadsheets are rife with hidden business insights. How can we analyze data and extract conclusions when there’s so much of it? Graphs (networks, not bar graphs) provide an elegant approach.
We often use tables to represent information generically. But graphs use a specialized data structure: Instead of a table row, a node represents an element. An edge connects two nodes to indicate their relationship.
This graph data structure enables us to observe data from unique angles, which is why graph data science is used in every field from molecular biology to the social sciences:
Right image credit: ALBANESE, Federico, et al. “Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter.” (August 24, 2020): arXiv:2008.10749 [cs.SI]
So how can developers leverage graph data science? Let’s turn to the most-used data science programming language: Python.
Getting Started With “Graph Theory” Graphs in Python
Python developers have several graph data libraries available to them, such as NetworkX, igraph, SNAP, and graph-tool. Pros and cons aside, they have very similar interfaces for handling and processing Python graph data structures.
We’ll use the popular NetworkX library. It’s simple to install and use, and supports the community detection algorithm we’ll be using.
Creating a new graph with NetworkX is straightforward:
But G
isn’t much of a graph yet, being devoid of nodes and edges.
How to Add Nodes to a Graph
We can add a node to the network by chaining on the return value of Graph()
with .add_node()
(or .add_nodes_from()
for multiple nodes in a list). We can also add arbitrary characteristics or attributes to the nodes by passing a dictionary as a parameter, as we show with node 4
and node 5
:
This will output:
But without edges between nodes, they’re isolated, and the dataset is no better than a simple table.
How to Add Edges to a Graph
Similar to the technique for nodes, we can use .add_edge()
with the names of two nodes as parameters (or .add_edges_from()
for multiple edges in a list), and optionally include a dictionary of attributes:
The NetworkX library supports graphs like these, where each edge can have a weight. For example, in a social network graph where nodes are users and edges are interactions, weight could signify how many interactions happen between a given pair of users—a highly relevant metric.
NetworkX lists all edges when using G.edges
, but it does not include their attributes. If we want edge attributes, we can use G[node_name]
to get everything that’s connected to a node or G[node_name][connected_node_name]
to get the attributes of a particular edge:
This will output:
But reading our first graph this way is impractical. Thankfully, there’s a much better representation.
How to Generate Images From Graphs (and Weighted Graphs)
Visualizing a graph is essential: It lets us see the relationships between the nodes and the structure of the network quickly and clearly.
A quick call to nx.draw(G)
is all it takes:
Let’s make weightier edges correspondingly thicker via our nx.draw()
call:
We provided a default thickness for weightless edges, as seen in the result:
Our methods and graph algorithms are about to get more complex, so the next step is to use a better-known dataset.
Graph Data Science Using Data From the Movie Star Wars: Episode IV
To make it easier to interpret and understand our results, we’ll use this dataset. Nodes represent important characters, and edges (which aren’t weighted here) signify co-appearance in a scene.
Note: The dataset is from Gabasova, E. (2016). Star Wars social network. DOI: https://doi.org/10.5281/zenodo.1411479.
First, we’ll visualize the data with nx.draw(G_starWars, with_labels = True)
:
Characters that usually appear together, like R2-D2 and C-3PO, appear closely connected. In contrast, we can see that Darth Vader does not share scenes with Owen.
Python NetworkX Visualization Layouts
Why is each node located where it is in the previous graph?
It’s the result of the default spring_layout
algorithm. It simulates the force of a spring, attracting connected nodes and repelling disconnected ones. This helps highlight well-connected nodes, which end up in the center.
NetworkX has other layouts that use different criteria to position nodes, like circular_layout
:
The result:
This layout is neutral in the sense that the location of a node does not depend on its importance—all nodes are represented equally. (The circular layout could also help visualize separate connected components—subgraphs having a path between any two nodes—but here, the whole graph is one big connected component.)
Both of the layouts we’ve seen have a degree of visual clutter because edges are free to cross other edges. But Kamada-Kawai, another force-directed algorithm like spring_layout
, positions the nodes so as to minimize the energy of the system.
This reduces edge-crossing but at a price: It’s slower than other layouts and therefore not highly recommended for graphs with many nodes.
This one has a specialized draw function:
That yields this shape instead:
Without any special intervention, the algorithm put main characters (like Luke, Leia, and C-3PO) in the center, and less prominent ones (like Camie and General Dodonna) by the border.
Visualizing the graph with a specific layout can give us some interesting qualitative results. Still, quantitative results are a vital part of any data science analysis, so we’ll need to define some metrics.
Node Analysis: Degree and PageRank
Now that we can visualize our network clearly, it may be of interest to us to characterize the nodes. There are multiple metrics that describe the characteristics of the nodes and, in our example, of the characters.
One basic metric for a node is its degree: how many edges it has. The degree of a Star Wars character’s node measures how many other characters they shared a scene with.
The degree()
function can calculate the degree of a character or of the entire network:
The output of both commands:
Sorting nodes from highest to lowest according to degree can be done with a single line of code:
The output:
Being just a total, the degree doesn’t take into account details of individual edges. Does a given edge connect to an otherwise isolated node or to a node that is connected with the entire network? Google’s PageRank algorithm aggregates this information to gauge how “important” a node is in a network.
The PageRank metric can be interpreted as an agent moving randomly from one node to another. Better-connected nodes have more paths leading through them, so the agent will tend to visit them more often.
Such nodes will have a higher PageRank, which we can calculate with the NetworkX library:
This prints Luke’s rank and our characters sorted by rank:
Owen is the character with the highest PageRank, surpassing Luke, who had the highest degree. The analysis: Although Owen is not the character who shares the most scenes with other characters, he is a character who shares scenes with many important characters such as Luke himself, R2-D2, and C-3PO.
In greater contrast, C-3PO, the character with the third-highest degree, is the one with the lowest PageRank. Despite C-3PO having many connections, a lot of them are with unimportant characters.
The takeaway: Using multiple metrics can give deeper insight into different characteristics of a graph’s nodes.
Community Detection Algorithms
When analyzing a network it may be important to separate communities: groups of nodes that are highly connected to each other but minimally connected with nodes outside their community.
There are multiple algorithms for this. Most of them are found within unsupervised machine learning algorithms because they assign a label to the nodes without the need for them to have been labeled previously.
One of the most famous is label propagation. In it, each node starts with a unique label, in a community of one. The labels of the nodes are iteratively updated according to the majority of the labels of the neighboring nodes.
The labels diffuse through the network until all nodes share a label with most of their neighbors. Groups of nodes closely connected to each other end up having the same label.
With the NetworkX library, running this algorithm takes a mere three lines of Python:
The output:
In this list of sets, each set represents a community. Readers familiar with the movie will notice the algorithm managed to perfectly separate the “good guys” from the “bad guys,” differentiating the characters meaningfully without using any true (community) label or metadata.
Intelligent Insights Using Graph Data Science in Python
We’ve seen that getting started with graph data science tools is more straightforward than it might sound. Once we represent data as a graph using the NetworkX library in Python, a few short lines of code can be illuminating. We can visualize our dataset, measure and compare node characteristics, and cluster nodes sensibly via community detection algorithms.
Having the skill to extract conclusions and insights from a network using Python enables developers to integrate with tools and methodology commonly found in data science services pipelines. From search engines to flight scheduling to electrical engineering, these methods apply easily to a wide range of contexts.
Recommended Reading on Graph Data Science
Community Detection Algorithms
Zhao Yang, René Algesheimer, and Claudio Tessone. “A Comparative Analysis of Community Detection Algorithms on Artificial Networks.” Scientific Reports, 6, no. 30750 (2016).
Graph Deep Learning
Thomas Kipf. “Graph Convolutional Networks.” September 30, 2016.
Applications of Graph Data Science
Albanese, Federico, Leandro Lombardi, Esteban Feuerstein, and Pablo Balenzuela. “Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter.” (August 24, 2020): arXiv:2008.10749 [cs.SI].
Cohen, Elior. “PyData Tel Aviv Meetup: Node2vec.” YouTube. November 22, 2018. Video, 21:09. https://www.youtube.com/watch?v=828rZgV9t1g.
This content was originally published here.