By the end of this lesson, you will know how to use basic data structures in order to represent graphs in computer memory.
Hi everyone. In this lesson we will talk about how to choose data structures for representing graphs.
Representing data in computer memory
As I’m sure you’ll recall, computer memory is organized into locations in which data are stored. Accessing data therefore requires us to know at which location, or address, they’ve been stored.
This organization is convenient when you know where to look, but it can be tricky if you don’t know the exact address of what you’re looking for.
In the same way that it’s easier to find a house by its address rather than by its description, or a library book by its call number, it’s easier to find data when you know the specific address.
Basic data structures to represent graphs
Let’s imagine that we want to store the adjacency matrix of a graph.
One way to store data in memory is to use an array. Arrays are contiguous data structures used to represent sequences of data, where each piece of data occupies the same amount of memory.
So if you know the address of the first piece of data, the address of any other piece of data can easily be computed. For example, if the address of the first element is 1337, and if each piece of data is stored using one address, then the address of the fourth element is 1340.
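The address computation described above can be sketched in a few lines of Python. The function name `element_address` and its parameters are made up for this illustration; real programs rarely compute addresses by hand, since the language does it for you.

```python
def element_address(base_address, index, element_size=1):
    """Address of the element at 0-based `index` in a contiguous array.

    Hypothetical helper: `base_address` is the address of the first element,
    and every element occupies `element_size` consecutive addresses.
    """
    return base_address + index * element_size


# The lesson's example: first element at address 1337, one address per element.
# The fourth element has 0-based index 3.
fourth_address = element_address(1337, 3)
print(fourth_address)
```

Because the result is a single multiplication and addition, the cost does not depend on how far into the array the element is.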
So, this means that you can rapidly access the i-th piece of data of an array. On the other hand, accessing the i-th non-zero element can be time-consuming, because all cells need to be checked one by one to see whether or not they contain a zero.
In terms of graph adjacency matrices, arrays make it efficient to check whether two vertices u and v are connected by an edge. On the other hand, they are not that efficient if you want to retrieve all the neighbors of u, because you need to test each vertex one by one.
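As a minimal sketch of this trade-off, here is an adjacency matrix stored as nested Python lists. The graph itself (4 vertices numbered from 0, edges 0-1, 0-2 and 2-3) and the function names are assumptions chosen for the illustration.

```python
# Hypothetical undirected graph: vertices 0..3, edges 0-1, 0-2, 2-3.
adjacency_matrix = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]


def has_edge_in_matrix(matrix, u, v):
    """Edge check: a single indexed access, whatever the size of the graph."""
    return matrix[u][v] != 0


def neighbors_in_matrix(matrix, u):
    """Neighbor retrieval: every entry of row u must be tested one by one."""
    return [v for v, entry in enumerate(matrix[u]) if entry != 0]


print(has_edge_in_matrix(adjacency_matrix, 0, 2))
print(neighbors_in_matrix(adjacency_matrix, 2))
```

The edge check touches one cell, while the neighbor retrieval scans a whole row, including all the zeros.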
Another common data structure is a list. The elements of a list are not contiguous in memory, so to reach a piece of data, you must first go through all the previous ones. The principle of lists is that each element stores not only its data, but also the address of the next element. So if you want to access the third piece of data in a list, you must first access the first, look at the address of the second, then look at the second to find the address of the third.
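The list described here is what is usually called a linked list. A minimal sketch follows, assuming a hypothetical `Node` class in which a Python object reference plays the role of "the address of the next piece of data". Note that Python's built-in `list` type does not work this way.

```python
class Node:
    """One cell of a linked list: some data plus a reference to the next cell."""

    def __init__(self, data, next_node=None):
        self.data = data
        self.next_node = next_node


def get_nth(head, n):
    """Return the data at 0-based position n by following the links from `head`."""
    current = head
    for _ in range(n):
        if current is None:
            raise IndexError("list is shorter than the requested position")
        current = current.next_node
    if current is None:
        raise IndexError("list is shorter than the requested position")
    return current.data


# A three-element list: "a" -> "b" -> "c"
head = Node("a", Node("b", Node("c")))
print(get_nth(head, 2))  # reaching the third element requires visiting the first two
```

The loop in `get_nth` makes the cost of an access grow with the position of the element, unlike the array case.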
Accessing the i-th piece of data in a list can therefore be time-consuming. However, this cost can be offset by storing only the pieces of data of interest, which considerably reduces memory usage compared to an array. In terms of graphs, lists can be used to store only the neighbors of each vertex, for example, since non-neighbors are often irrelevant. If each vertex has only a few neighbors, representing only neighbors with lists can save a significant amount of memory. On the other hand, checking whether u and v are connected by an edge may require you to search the full list of u’s neighbors.
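The sketch below stores only the neighbors of each vertex, for a hypothetical graph with 4 vertices numbered from 0 and edges 0-1, 0-2 and 2-3. Python's built-in `list` is used for brevity (it is not a linked list internally); the point illustrated is only that non-neighbors are not stored and that the edge check becomes a scan.

```python
# Hypothetical undirected graph: vertices 0..3, edges 0-1, 0-2, 2-3.
neighbor_lists = [
    [1, 2],  # neighbors of vertex 0
    [0],     # neighbors of vertex 1
    [0, 3],  # neighbors of vertex 2
    [2],     # neighbors of vertex 3
]


def has_edge_in_lists(lists, u, v):
    """Edge check: scans u's neighbors one by one until v is found (or not)."""
    return v in lists[u]


# Number of stored values, compared with a full 4 x 4 adjacency matrix.
stored_in_lists = sum(len(neighbors) for neighbors in neighbor_lists)
stored_in_matrix = len(neighbor_lists) ** 2
print(stored_in_lists, stored_in_matrix)
```

Retrieving all neighbors of a vertex is now immediate (`neighbor_lists[u]`), whereas the edge check costs more as the vertex gains neighbors.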
A final example of a data structure is the dictionary. Dictionaries are elaborate structures that aim to combine the advantages of both arrays and lists, namely fast access and efficient memory usage, respectively.
Dictionaries make use of hash functions, which are basically mechanisms that transform contents into addresses. This makes dictionaries a good default choice for prototyping, because programmers benefit from both fast access and low memory usage.
As a rule of thumb, dictionaries should always be used if you don’t know exactly what you’re doing.
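As an illustration, here is one possible way to store a graph with Python's built-in `dict`, which is hash-based. The vertex names and the choice of a set for each neighborhood are assumptions made for this sketch, not something prescribed by the lesson.

```python
# Hypothetical undirected graph with named vertices: edges A-B, A-C, C-D.
graph = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"C"},
}


def has_edge_in_dict(graph, u, v):
    """Edge check through hashing: no scan over the other vertices is needed."""
    return v in graph.get(u, set())


def neighbors_in_dict(graph, u):
    """Neighbor retrieval: only actual neighbors are stored, so nothing is skipped."""
    return sorted(graph.get(u, set()))


print(has_edge_in_dict(graph, "A", "C"))
print(neighbors_in_dict(graph, "C"))
```

Only existing edges are stored, as with lists, and the edge check does not require a scan, as with arrays. Vertices can also be labelled with anything hashable, not just consecutive integers.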
More details on the list-based solution
We have seen before that an adjacency matrix is a convenient object for representing a graph in memory.
However, in most cases, graphs are sparse objects, i.e., the number of existing edges is low compared to the number of edges of a complete graph. A direct implication is that most of the entries of the adjacency matrix are 0s. Since the number of elements in an adjacency matrix is equal to the square of the graph order, this can quickly consume a lot of memory.
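To give an order of magnitude, the following sketch compares these quantities for a hypothetical undirected graph with 1000 vertices and 2000 edges. Both numbers are made up for the illustration, as are the function names.

```python
def complete_graph_edges(order):
    """Number of edges of a complete undirected graph with `order` vertices."""
    return order * (order - 1) // 2


def matrix_entries(order):
    """Number of entries of the adjacency matrix: the square of the order."""
    return order ** 2


order = 1000   # hypothetical number of vertices
edges = 2000   # hypothetical number of edges

density = edges / complete_graph_edges(order)
nonzero_entries = 2 * edges  # each undirected edge appears twice in the matrix

print(matrix_entries(order), nonzero_entries, density)
```

Under these assumptions, the matrix holds one million entries, of which only 4000 are non-zero.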
A possible solution to circumvent this problem is to use a different data structure: a list of lists. Let us call such an object L, with each L[i] being a list. In this structure, L[i] will represent the edges that can be accessed from vertex i, that is, the neighbors of i.
As an example, consider the following adjacency matrix:
Assuming vertices to be labelled from 1 to , this matrix is equivalent to the list .
We can quickly notice that the number of stored numbers has shrunk from to .
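The conversion from an adjacency matrix to a list of lists can be sketched as follows. The matrix used here is a hypothetical 4-vertex example, not necessarily the one from the lesson, and the function name is an assumption; vertices are labelled from 1, following the convention above.

```python
def matrix_to_neighbor_lists(matrix):
    """Convert an adjacency matrix into a list of neighbor lists.

    Row k of the matrix (0-based in Python) describes vertex k + 1,
    so the stored neighbor labels start from 1.
    """
    return [
        [j + 1 for j, entry in enumerate(row) if entry != 0]
        for row in matrix
    ]


# Hypothetical example: vertex 2 is connected to vertices 1, 3 and 4.
example_matrix = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

example_lists = matrix_to_neighbor_lists(example_matrix)
numbers_in_matrix = sum(len(row) for row in example_matrix)
numbers_in_lists = sum(len(neighbors) for neighbors in example_lists)

print(example_lists)
print(numbers_in_matrix, numbers_in_lists)
```

For this particular matrix, the 16 stored numbers shrink to 6.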
While this solution saves some memory space, it suffers from several limitations:
- Checking the existence of an edge (i, j) requires going through all elements of the list of neighbors of i to verify whether j is one of them. This can take some time if i has a lot of neighbors. In comparison, making the same check with an adjacency matrix A takes a single operation, as one just needs to verify that the entry A[i, j] is non-zero.
- It is not as easy to extend to weighted graphs. In the case of adjacency matrices, entries represent the weight associated with the edge. Here, entries are indices of non-zero elements, which cannot be altered without creating/deleting edges. A possible solution is to replace the lists of indices with lists of couples (j, w), where w is the weight of edge (i, j).
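The weighted variant mentioned in the last point can be sketched like this, with each list holding couples (neighbor, weight). The graph, its weights and the function name are hypothetical; vertices are labelled from 1, so the list for vertex i sits at Python index i - 1.

```python
# Hypothetical weighted undirected graph: edge 1-2 has weight 5.0, edge 2-3 has weight 1.5.
weighted_lists = [
    [(2, 5.0)],            # couples (neighbor, weight) for vertex 1
    [(1, 5.0), (3, 1.5)],  # couples for vertex 2
    [(2, 1.5)],            # couples for vertex 3
]


def edge_weight(lists, i, j):
    """Return the weight of edge (i, j), or None if the edge does not exist.

    The list of vertex i is scanned couple by couple, so the cost grows
    with the number of neighbors of i.
    """
    for neighbor, weight in lists[i - 1]:
        if neighbor == j:
            return weight
    return None


print(edge_weight(weighted_lists, 2, 3))
print(edge_weight(weighted_lists, 1, 3))
```

Changing a weight now means replacing one couple, which no longer creates or deletes an edge.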
To go further
- Understanding the efficiency of GPU algorithms for matrix-matrix multiplication: A research paper illustrating one of the main reasons why matrices are frequently used.
- Graph Processing on FPGAs: Taxonomy, Survey, Challenges: A research paper illustrating the use of specific hardware (here, FPGA) for processing large graphs.