Most of the questions to the Forum, to Stack Overflow, and to the Issues system are about small-ish graphs. Does that mean that large-graph builders have everything under control?
My intuition says “no”, but my intuition is often confused.
So, if there are any users who build large graphs (say, roughly 1500 edges or more), please share your opinions - how can Graphviz be more supportive?
p.s. I know that faster is always better. What else? Especially in the area of documentation, tips, tools.
One quick win would be the ability to have some sort of “progress” callback (at the API level).
FWIW my main use case involves graphs that are too big to render whole, so I present a tree view + breadcrumbs view of the data and let users pick the “top” cluster to render from…
We might provide more tutorial guidance about what to expect. In terms of layout, if a graph is directed but is not a tree, layered graph layout (dot) isn’t usually practical for more than a few dozen nodes. You can achieve somewhat of the same effect using neato -Gmode=hier (as explained in this report) up to maybe a few hundred nodes. sfdp for undirected graphs was engineered for thousands of nodes.
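As a rough rule of thumb in command form (file names here are placeholders):

    dot   -Tsvg small.gv  -o small.svg               # layered layout; practical up to a few dozen nodes
    neato -Gmode=hier -Tsvg medium.gv -o medium.svg  # hierarchy-like effect; up to a few hundred nodes
    sfdp  -Tsvg large.gv  -o large.svg               # multilevel force-directed; engineered for thousands of nodes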
For large graph viewing, I’m not sure of the state of the art today, but it used to be the case that web clients ran out of gas around 100K DOM objects. A node or edge typically has a couple of DOM objects. (I forget whether piecewise cubic Bezier splines are represented by one DOM object or possibly one per segment.) We like d3-graphviz (which is presently pinned to the top of this forum); it does rely on DOM rendering, and it’s probably good for several tens of thousands of nodes and edges. (Someone should let us know.)
Cytoscape may be more scalable.
When graphs get really large, for undirected graphs (spring models) there are technologies like the t-SNE and UMAP implementations in Python that may be more satisfactory if you don’t need to see the edges. Also, the Graphistry commercial software uses GPUs for layout and rendering and is more scalable, up to the memory limit of the GPU. I don’t think it has the concrete rendering features of Graphviz (like all the node shapes and text layout options), but when I looked at it a few years ago it seemed good.
The vast majority of expensive graphs I’ve profiled are bottlenecked in dfs_range. I wonder if simply plumbing through a progress callback for that single function would be enough.
When I profiled, long ago, I thought the cost of a large layout was fairly well balanced between phases 2-4 (mincross, X coord solving, spline routing); the cost of phase 1 (ranking = Y coord solving) is negligible. dfs_range is only involved in phase 3. A progress bar would probably need to account for this.
Beware inferences like that, though — there’s a strong, strong bias on Stack Exchange sites for questions to be accompanied by a Minimal, Reproducible Example. It’s kind of learned behavior for a lot of support/assistance contexts, really.
So, people dealing with large-graph issues may simply be reducing them down to small-graph examples, for the purpose of discussing their issues with others.
Though I will remind everyone of the graphviz repo issue I filed a while back, wondering what the world.gv and 4elt.gv files in the graphviz distribution were all about. Both of those definitely fall into the “large graph” category, as 4ELT is 15,000+ nodes and over 45,000 edges, while the World map is 148,201 nodes and one fewer edge.
(You were the one who eventually figured out that neato could render either graph in mere seconds, after others of us had hit program crashes and 5+ minute runtimes trying to process them through dot or sfdp.)
Actually, I guess that’s one area that could use expansion.
There are a lot of tools documented as part of the graphviz swiss army knife. I count 29 tools in the Command Line section, plus another 10 in the Layout Engines section. And clearly, they all have different strengths and weaknesses. So, how does a hypothetical large-graph user go about figuring out what criteria they should use when selecting a tool for their needs, never mind actually making that selection?
The documentation doesn’t necessarily offer that much guidance in that regard. In fact, sometimes it’s almost the exact opposite.
neato is a reasonable default tool to use for undirected graphs that aren’t too large (about 100 nodes), when you don’t know anything else about the graph.
neato attempts to minimize a global energy function, which is equivalent to statistical multi-dimensional scaling.
And yet, all evidence in that description to the contrary, it was the only tool capable of rendering those large graphs on less-than-glacial time scales.
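For what it’s worth, and as I understand it, the “global energy function” in that description is the stress function from multidimensional scaling:

    E = \sum_{i<j} w_{ij} \left( \lVert x_i - x_j \rVert - d_{ij} \right)^2, \qquad w_{ij} = d_{ij}^{-2}

where d_{ij} is the graph-theoretic distance between nodes i and j. Minimizing this involves all-pairs shortest paths, which is one reason plain neato is billed for graphs of only about 100 nodes.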
So I guess the questions are,
Why is neato the only capable choice for rendering those example files?
How did you know it was the right choice?
Can more of that logic/guidance be encoded into the documentation, somehow?
If you want real-world examples of large graphs, look for GitLab issues labelled “performance”. These are mostly graphs that are too large (either in runtime or memory) for Graphviz to currently handle.
The documentation could definitely be improved. The tools in Graphviz accumulated organically, so the documentation doesn’t provide a good overview of how everything fits together, or of what to use when.
Several times we tried working on a book, but didn’t get that far.
Someone (who did write a leading visualization textbook) suggested starting differently, by working on a better FAQ or wiki with plenty of examples, and seeing how that goes.
I was going to write a brief explanation of the layout tools and viewers here. Even that started to seem kind of complicated and pointlessly anecdotal.
Why is neato the only capable choice for rendering those example files?
neato (and only neato) has special features that allow it to accept predefined node layouts (the pos attribute) and predefined node + edge layouts (see the Graphviz FAQ). It then takes these pos values as given and produces the finished graph - much faster.
How did you know it was the right choice?
Eventually, I actually looked at the source file. Upon noticing the pos= node attributes, neato -n was the obvious choice.
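To make that concrete, here is a toy version of what such a file looks like (coordinates invented for illustration; call it pinned.gv):

    graph G {
      node [shape=point]
      a [pos="0,0"]      // positions are in points, already computed
      b [pos="72,36"]
      c [pos="144,0"]
      a -- b
      b -- c
    }

Running neato -n -Tsvg pinned.gv -o pinned.svg tells neato to take the pos values as given and only route the edges, which is why it finishes in seconds where a full layout grinds.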
Can more of that logic/guidance be encoded into the documentation, somehow?
Oddly, this (the presence of pos= attributes) is a situation that is reasonably documented - but who gazes at a 1 MB input file?
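(Nobody actually has to gaze; a quick grep answers the question. The file name is a placeholder:)

    # Count lines carrying a pos attribute; a large count hints
    # that the layout is pre-computed and neato -n applies.
    grep -c 'pos="' big.gv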
I like all kinds of documentation styles (see scnorth’s answer below). I also like “active” documentation (my term, not trademarked): systems that volunteer answers to questions the user does not know to ask.
Specifically, one thing we might do is enhance the layout engine front-end to do the following (a sketch of input that would trip these checks appears after the list):
flag language misuse (e.g. rankdir applied to a cluster)
suggest engine usage (e.g. use neato -n if pos attributes are present)
flag implementation weaknesses/bugs (e.g. ortho splines fail if ports used)
flag invalid or dubious attribute values (e.g. height=25)
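Purely as a sketch (these diagnostics do not exist today), a single input that would trip all four checks:

    digraph G {
      splines=ortho                    // weakness: ortho splines can fail when ports are used
      subgraph cluster_a {
        rankdir=LR                     // misuse: rankdir has no effect inside a cluster
        n1 [pos="10,20", height=25]    // pos present: suggest neato -n; height=25 (inches!) is dubious
        n2
      }
      n1 -> n2 [tailport=s]            // the port that clashes with splines=ortho
    }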