21/10/2014

Graph library design choices

I am currently writing the documentation for my library. The effort of structuring and explaining its underlying logic is a great help in making apparent what the library lacks and what could be done to make it better.

Unfortunately, time is limited, and I will never have the time to make the perfect Java graph library. Here are nevertheless some tips and facts to keep in mind in order to do so.


Graphs are about:
- relations between objects

And possibly:
- information about objects
- information about relations between objects

Yes, but:

What?

i.e. which information?

* about the relations:
- directed or undirected
- normal graph or multigraph
- normal graph or hypergraph
- both directed and undirected edges?

* ad-hoc information:
- weighted? (e.g. for min paths)
- weighted with several weights? (e.g. for max flows)
- with spatial information?
- with any other information?
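
To make "weighted with several weights" concrete, here is a minimal sketch (the Edge class and its field names are illustrative assumptions, not taken from any existing library): an edge carrying both a capacity and a cost, as needed for min-cost flow problems.

class Edge {
   // two independent weights on the same relation, as used in
   // min-cost max-flow problems
   double capacity;   // how much can go through the edge
   double cost;       // what each unit of flow costs
}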

Now, where, when and how to store this information?

Where?

* within the objects: in an information field of the objects

n.name = "toto";
n.name;

+: does not use a hashmap
-: increases the size of a node

* outside of the graph: in a map object -> information that stores the information separately from the objects and the graph

names.put(n,"toto");
names.get(n);

+: the algorithms that modify information sets leave the graph untouched (the graph does not contain state information)
-: uses hashtables (heavy, centralised)

* within the graph but outside the objects: the same map, but stored at the level of the graph

g.names.put(n,"toto");
g.names.get(n);

+: the info about the graph is stored within the graph (easy to access)
-: the graph contains state information
-: uses hashtables
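
To illustrate this third option, a minimal sketch with assumed class and field names (not the API of any existing library): the graph object itself owns the map, mirroring the g.names snippet above.

import java.util.*;

class Graph<N> {
   // adjacency structure omitted; the point is that the map of node
   // information lives inside the graph object itself
   Map<N, String> names = new HashMap<>();
}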

When?

* If the info associated with a node is frequently accessed, it is better to store this info in a dedicated field.

+: fast
-: need to implement ad-hoc classes

* If several algorithms are going to be used on a given graph, it is not possible or desirable to modify the graph, i.e. to change its state. In that case it is better to store the information relevant only to one algorithm outside of the graph.

+: respects functional decomposition
+: easy to implement
-: state info is sometimes interesting or necessary; solution: return it together with the standard result, as a tuple (see the sketch below)
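
A minimal sketch of that last idea, with illustrative names: the algorithm returns its per-node state together with its main result instead of writing it into the graph.

import java.util.*;

class ShortestPathResult<N> {
   // the answer itself
   List<N> path;
   // the per-node state computed along the way, returned to the
   // caller rather than stored in the graph
   Map<N, Double> distance;
}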

How?

* use ad-hoc node objects to contain the information (information stored directly in the node)

+: all the info, plus the neighbours, in one place
-: need to implement specific nodes for each new application

e.g.

class CityNode {
   String name;
   int population;
   List<CityNode> neighbours;
}


CityNode n;


* use generic node objects to contain the ad-hoc information (information pointed to from the node)

+: generic, no need to develop new node classes
-: the ad-hoc information and the neighbour information are separated; from the former it is not possible to access the latter

e.g.

class GenericNode<K> {
   K data;
   List<GenericNode<K>> neighbours;
}

class CityNodeData {
   String name;
   int population;
}

GenericNode<CityNodeData> n;

* use no node objects, only identifiers and separate maps to contain the information (information independent from the nodes)

+: fully respects functional decomposition
-: it is sometimes easier to store all the data in one place
-: edge information needs two levels of indirection with hashmaps, which can be cumbersome

e.g.

class Node {
   List<Node> neighbours;
}

Map<Node, String>  mapNodeCityName;
Map<Node, Integer> mapNodeCityPopulation;

Node n;

BUT

adjacency information is node information... and the same question applies: shall adjacency
- be encoded within the nodes, with an adjacency list field: requiring the creation of additional objects
- be encoded in a map node -> adjacency list: node objects are not even needed in this case (see the sketch below)
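
A minimal sketch of that second variant, assuming plain integer identifiers (nothing here is specific to a particular library): adjacency lives in a map from node ids to lists of neighbour ids, so no node object exists at all.

import java.util.*;

// node 0 is linked to nodes 1 and 2; nodes are bare identifiers
Map<Integer, List<Integer>> adjacency = new HashMap<>();
adjacency.put(0, Arrays.asList(1, 2));
adjacency.put(1, Arrays.asList(0));
adjacency.put(2, Arrays.asList(0));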

And shall lists be encoded with linked lists or array lists?
And shall maps be encoded with hashmaps or carefully managed within arrays?

Arrays are extremely compact and lightning fast for iteration thanks to caching, but slower for modification.
Linked lists and hashmaps are not compact (they require storing pointer information) and are slower for iteration, but faster for modification.
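
As a sketch of the array end of that spectrum (the class and field names are mine, and this is only one possible compact layout): all adjacency lists are concatenated into a single array, with a second array giving where each node's neighbours start.

// compact, cache-friendly encoding: the neighbours of node i are
// neighbours[start[i]] .. neighbours[start[i + 1] - 1]; iterating is
// fast, but adding an edge means rebuilding both arrays
class CompactAdjacency {
   int[] start;        // one entry per node, plus a final sentinel
   int[] neighbours;   // all adjacency lists laid out contiguously
}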

Two distinct use cases appear:
- high-performance graph library: like grph, which takes the pain of optimising every minute detail, to the point of coding in C in order to save time. To go further, data should be arranged to match the cache access pattern of each implemented algorithm.
- user- and developer-friendly graph library: like any other Java library, provide slower-than-C but highly reusable Java code based on Java concepts. That is the point of using Java (and it is strongly typed and faster than Python: that is the point of not using Python).

It is clear that graphs that are static once loaded could benefit from a speedup over dynamic ones if this is known in advance. However, as the two cases encompass different styles of coding algorithms, it is not as transparent as simply providing two different implementations for one abstract data type.
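
One way to picture "two different implementations for one abstract data type", as a hedged sketch with invented names: a common interface, an array-backed implementation frozen at load time, and a map-backed one that accepts modifications. The remark above still holds: the algorithms written against each side would tend to look different, so the interface hides less than it seems.

import java.util.*;

interface SimpleGraph {
   List<Integer> neighbours(int node);
}

// frozen once loaded: backed by plain arrays, fast to iterate
class StaticGraph implements SimpleGraph {
   private final int[][] adjacency;
   StaticGraph(int[][] adjacency) { this.adjacency = adjacency; }
   public List<Integer> neighbours(int node) {
      List<Integer> result = new ArrayList<>();
      for (int m : adjacency[node]) result.add(m);
      return result;
   }
}

// modifiable: backed by maps and lists, easy to change, heavier to scan
class DynamicGraph implements SimpleGraph {
   private final Map<Integer, List<Integer>> adjacency = new HashMap<>();
   public void addEdge(int a, int b) {
      adjacency.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
      adjacency.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
   }
   public List<Integer> neighbours(int node) {
      return adjacency.getOrDefault(node, Collections.emptyList());
   }
}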

* Ease of development

The fact is that it is much easier to implement only a bare relation structure, with the additional information encoded outside of it in maps, than to implement the ad-hoc graph structure that precisely fits one's purpose.

e.g. for a graph of cities with two pieces of edge information and three pieces of node information (including spatial ones), it is easier to specify:

Map<Node, String> mapNames = new HashMap<>();
mapNames.put(n1, name1);
etc...

Graph g = new Graph({mapNames, mapPopulation, mapCoordinates}, {mapDistance, mapCapacity});  // not correct Java code btw, but let your imagination fly to understand

than:

AdHocGraph g = new AdHocGraph();
g.addNode(name1, population1, coordinates1);
etc...

because this second solution requires the development of an AdHocGraph with an AdHocNode able to hold precisely a name, a population and coordinates.

The solution, in order to make development easier, is to be more generic and provide a node class whose nodes all contain information of the same parametrized type, e.g. City, where a City object contains exactly the three pieces of information we are interested in.
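
A minimal sketch of that generic solution, reusing the GenericNode class from above (City and its fields are illustrative):

// the three pieces of information of interest, grouped in one class
class City {
   String name;
   int population;
   double[] coordinates;   // spatial information
}

GenericNode<City> n = new GenericNode<>();
n.data = new City();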

So the rule of thumb is:
Information Within Nodes: for the static information of the graph
Information Within Maps: for ad-hoc algorithm information

* Sequential or parallel

But you know, we are entering the age of distributed and parallel algorithms... What we just discussed is fine for centralized and sequential processing... What about the future?

Distributed algorithms are pieces of computation spread over a network (...a graph...) which interact by exchanging information. If there is no central point of control, the algorithm is more than distributed: it is decentralised.

To be perfectly clear about the meanings:
- sequential: fixed workflow + 1 thread
- parallel: fixed workflow + N threads (same computer)
- distributed: fixed workflow + N threads (different computers) + communication delays + faults
- decentralised: distributed with several independent workflows

Not all algorithms can be distributed, and not all algorithms even need to be distributed: we are more interested in harnessing the power of parallel computation than in coping with the full set of difficulties of distributed computing (server faults, communication faults) and its methods (data replication, process replication, consensus algorithms).

Sometimes an algorithm cannot be fully distributed but can be partially parallelised, with some parts of the problem being computed independently and then possibly combined. For instance, any matrix summation can be fully parallelised, with worker jobs working on independent parts of the problem and combining their solutions by writing them into the result matrix. Matrix multiplication can only be partially parallelised, using Strassen's algorithm for instance, because the computations over independent parts of the problem require additional work to be combined, therefore introducing sequential dependencies over the computations to be performed, which thus cannot all be executed in parallel.

When using a decentralised algorithm there is only one choice for the location of data: inside the nodes of the graph. However, for distribution and parallelisation this is not the case. As in the sequential case, data can be stored inside the graph at the level of nodes, or outside of the graph in a separate map. In distribution, storing the data in the nodes makes it simpler to transmit it to the processors: they just receive and process the nodes; storing the data in maps requires first extracting the relevant parts and transmitting them along with the graph to the processors. In parallelisation, all the processors can access the same memory, which makes the split largely irrelevant.
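
For instance, in the shared-memory case, a sketch of how a per-node computation could be parallelised with Java parallel streams, whatever layout holds the input data (nodes and localComputation are placeholders, not part of any library discussed here):

import java.util.*;
import java.util.concurrent.*;

// every worker thread reads the shared graph and writes its own entry
// of the result map; nodes and localComputation are placeholders
Map<Node, Double> result = new ConcurrentHashMap<>();
nodes.parallelStream()
     .forEach(n -> result.put(n, localComputation(n)));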

The catch here is that which parts of a computation on a graph can be parallelised is determined by the graph structure itself, yet all the code that parallelises, computes and aggregates pieces of the solution has nothing to do with the design of a graph library; still, distribution needs to be taken into account at the core of the algorithms.

Let us consider different kinds of distribution in the problem of computing shortest paths with Dijkstra's algorithm:
- node level: each node executes a local Dijkstra step that updates its own information given the messages it receives, and updates other nodes in turn. The heap, which is global data, must be managed through a consensus algorithm. Such a solution is slow due to the communication overhead and should be used only when strictly necessary.
- group-of-nodes level: each processor is responsible for the computation on a subpart of the graph. This can provide a significant speedup if communication costs or memory contention are low. Note that this would not be the exact Dijkstra algorithm, as splitting the graph and assembling the parts of the solution require additional work to be done in a principled way.
- graph level: no distribution, no speedup.