I’ve found some mention of a couple of large test case files, bevy.gv and ftzz.gv, but I cannot find these files anywhere. Where might they be? I have my own test case that I stumbled upon, but it’s apparently way too big to allow convenient profiling, and perhaps one of those can be cut down more easily.
Incidentally, has anyone tried to parallelize dot? Some initial work seemed to show promise, but after finding only a small speedup I took a breather to organize testing first. These days computers have a lot of cores, and running single-threaded seems so wasteful. Are there still people working on the project who understand the basic algorithms being used and who might be willing to discuss/review alternate implementation strategies if I try to do this?
You probably want to look at tests/compare_performance.py in the Graphviz repo. ftzz.gv and bevy.gv are checked in as tests/2475_1.dot and tests/2475_2.dot.
I worked on some of the basic algorithms in dotgen, and would be happy to discuss parallelization.
There are four major phases to the layout, as described in the 1993 IEEE TSE paper. There have been some research publications in the past 5-10 years aimed at parallelizing some of the algorithms (specifically, network simplex) that would be worth finding and reading.
Searching for “parallel network simplex” turns up Kara and Özturan 2021, whose abstract mentions “Our experiments show speedups up to four are possible, though they are typically lower.” Well, ok. Coarse-grained parallel crossing reduction might run the heuristic with different randomizations in parallel and keep the best result; parallel spline routing might involve finding independent sets of routes and giving them to different threads.
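To make the randomization idea concrete, below is a minimal sketch of a parallel “portfolio”: each thread runs the same randomized heuristic from a different seed, and the main thread keeps the best result. The objective here is a toy stand-in (counting inversions in a permutation), not Graphviz’s actual crossing count or mincross code; only the parallel shape is the point.

```c
/*
 * Sketch only: a coarse-grained "portfolio" for crossing reduction.
 * Each worker runs the same randomized heuristic with a different
 * seed; the main thread keeps the best result. The objective below
 * (inversions in a permutation) is a toy stand-in, not mincross.
 * Build: cc -pthread portfolio.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8        /* nodes on one rank (toy size) */
#define WORKERS 4

typedef struct {
    unsigned seed;   /* per-thread RNG state */
    int order[N];    /* best ordering this worker found */
    long score;      /* its objective value (lower is better) */
} job_t;

/* Toy objective: count inversions as a stand-in for edge crossings. */
static long score_order(const int *order) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if (order[i] > order[j])
                s++;
    return s;
}

static void *worker(void *arg) {
    job_t *job = arg;
    int cur[N];
    for (int i = 0; i < N; i++)
        cur[i] = i;
    job->score = -1;
    /* A few random restarts per thread; remember the best one seen. */
    for (int trial = 0; trial < 1000; trial++) {
        for (int i = N - 1; i > 0; i--) { /* Fisher-Yates shuffle */
            int j = rand_r(&job->seed) % (i + 1);
            int tmp = cur[i]; cur[i] = cur[j]; cur[j] = tmp;
        }
        long s = score_order(cur);
        if (job->score < 0 || s < job->score) {
            job->score = s;
            for (int i = 0; i < N; i++)
                job->order[i] = cur[i];
        }
    }
    return NULL;
}

int main(void) {
    pthread_t tid[WORKERS];
    job_t jobs[WORKERS];
    for (int w = 0; w < WORKERS; w++) {
        jobs[w].seed = 0x9e3779b9u * (unsigned)(w + 1); /* distinct seeds */
        pthread_create(&tid[w], NULL, worker, &jobs[w]);
    }
    int best = 0;
    for (int w = 0; w < WORKERS; w++) {
        pthread_join(tid[w], NULL);
        if (jobs[w].score < jobs[best].score)
            best = w;
    }
    printf("best: worker %d, score %ld\n", best, jobs[best].score);
    return 0;
}
```

The appeal of this style is that the workers share nothing mutable, so it sidesteps thread-safety questions inside the heuristic itself.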
The somewhat bad news is that running time is balanced among network simplex, crossing reduction, and spline routing, so running much faster means attacking all 3: by Amdahl’s law, a 4x speedup on a phase taking a third of the runtime yields only about a 1.33x overall improvement.
Have you used the tests in compare_performance.py as regression tests, or are they too slow for that? We seem to be just ignoring the outputs there, and I’m seeing some errors.
When I run compare_performance.py, I notice that #1172 has errors with the released version on Ubuntu 24.04 (2.42.2-9ubuntu0.1, which calls itself 2.43 if asked):
    lib/pathplan/shortest.c:93: cannot realloc pnls
    Error: in routesplines, Pshortestpath failed
    ... [repeats many times]
    Error: lost 4970 455 edge
    Error: lost 4971 416 edge
    ... [many more similar errors]
    warning: command returned 1
These errors seem to have disappeared when running from recent GitLab sources.
I also see similar errors in #2064 with the released version.
Also #2475, which took almost 2 hours to run on my (not completely idle) laptop.
Ok, too slow for CI, but an occasional regression run might be a good idea.
I have ~60 input files with >300 nodes & >300 edges (semi-arbitrary cut-off). Many came from the Issues web pages and this forum.
Below is a graph showing the % of time spent in each of the 5 phases (including the drawing step) for some of the smaller files. (I’ll keep timing the larger files, but don’t expect them all to complete.) Absolute timings are available if anyone wants them.
How commonly used are the various layout filters (dot, neato, circo)? I notice you are running mostly dot, but also those other two, in compare_performance.py. Are your runs using different tools?
I’ve managed to do sampling profiles of a few of the faster cases on my AMD Linux laptop. Indeed, the hot spots correspond to cache misses, so memory footprint is key. Neato has very different hotspots from dot, though I have skipped the circo test for now. My own test case was still bogged down in mincross when I gave up profiling.
Btw, do separable subgraphs/components get processed often enough that we might do those in parallel? It’s not immediately clear that the code is thread-safe when processing multiple subgraphs at once. If this is a common use case, and it’s not already done, it’s probably worth duplicating each component’s graph to get better memory locality.
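To illustrate what I mean, here is a minimal sketch under toy assumptions: each “component” is just an array of node positions, each thread gets its own compact private copy, and results are merged back into disjoint ranges. None of this is Graphviz’s actual data model; it only shows the copy-out/compute/merge shape.

```c
/*
 * Sketch only: lay out disconnected components in parallel, with a
 * compact private copy per thread for locality and thread safety.
 * The "graph" is a toy: each component is an array of positions.
 * Build: cc -pthread components.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; } point_t;

typedef struct {
    const point_t *src; /* this component's nodes in the shared graph */
    point_t *out;       /* where merged results go (disjoint per task) */
    size_t n;
} comp_task_t;

static void *layout_component(void *arg) {
    comp_task_t *t = arg;
    /* Copy in: a private, contiguous buffer gives better cache
     * behavior than chasing pointers in a shared structure, and
     * removes any need for locking during the computation. */
    point_t *priv = malloc(t->n * sizeof *priv);
    for (size_t i = 0; i < t->n; i++)
        priv[i] = t->src[i];

    /* Toy "layout": spread nodes on a line. A real engine would run
     * ranking/mincross/splines here, on its private copy only. */
    for (size_t i = 0; i < t->n; i++) {
        priv[i].x = (double)i;
        priv[i].y = 0.0;
    }

    /* Copy out: each task writes a disjoint range, so no
     * synchronization is needed beyond pthread_join. */
    for (size_t i = 0; i < t->n; i++)
        t->out[i] = priv[i];
    free(priv);
    return NULL;
}

int main(void) {
    /* Two disconnected "components" living in one shared array. */
    point_t nodes[7] = {0};
    comp_task_t tasks[2] = {
        { nodes,     nodes,     4 },  /* component 0: nodes[0..3] */
        { nodes + 4, nodes + 4, 3 },  /* component 1: nodes[4..6] */
    };
    pthread_t tid[2];
    for (int c = 0; c < 2; c++)
        pthread_create(&tid[c], NULL, layout_component, &tasks[c]);
    for (int c = 0; c < 2; c++)
        pthread_join(tid[c], NULL);
    printf("node 5 placed at (%g, %g)\n", nodes[5].x, nodes[5].y);
    return 0;
}
```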
Lots of clever code here. I’ll have to dig into the data structures to figure out whether they can be compacted and/or which operations are thread-safe.
I’m suddenly remembering that parallelizing dusty-deck Fortran was made harder by bugs, probably introduced while hand-tuning for old vector machines. At another previous job, we had some very clever hand-coded scheduling code that was eventually surpassed by a simpler approach using a commercial ILP solver, an approach that more than one person could understand (after the one person who understood the original left, that code was unmaintainable).
What do you mean by “edge placement”, apart from the 2nd network simplex phase and spline routing? Is this the “top half” of the spline router, i.e., finding the feasible regions for the splines to be drawn?
The legend definitely needs work.
- edge placement is found by grepping stderr (dot -v …) for “[Graphviz] routespl.c:”
- splines is found by grepping stderr (dot -v …) for “[Graphviz] emit.c:”
Very good question, no simple answer. But the bulk of the performance issues seem to be about dot.
I do not use compare_performance.py. Not an editorial, just a very old dog (me) not wishing to learn a new trick.
> Do separable subgraphs/components get processed frequently?

- All(?) the engines recognize the pack & packmode attributes concerning the layout of disconnected components.
- dot and fdp handle clusters separately.
You can run every input through every layout engine. Some attributes are not used by every engine, but they will be accepted (and maybe ignored). Here is a bash script that produces commands for many engines and interesting options
I have a relatively reviewable fix to ns.c which copies data into a private graph and computes with more locality. In a preliminary comparison on test case 2343.dot, I see a 71% reduction in total run time, due to a 5x improvement in the 2nd ranking pass (on a recent AMD laptop processor). “#2465 1” was -47% in one run, and #2621 was -56%. Other compare_performance.py examples are either very fast or didn’t show much improvement.
Test case #1864 (which takes about 22m23s) and other, shorter tests use neato or circo and are apparently unaffected by this change.
The approach is kind of ugly and a little inefficient (copying data in and out), but less risky, as it barely touches the original algorithm code. I’m still cleaning up a merge request, so we can discuss whether this can be made better when I post that.
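To show the shape of the pattern before the merge request lands (this is not the actual ns.c change; the linked list, the ranks, and the “computation” are all stand-ins), here is a toy copy-in/compute/copy-out example:

```c
/*
 * Sketch only: the copy-in / compute / copy-out pattern on a toy
 * structure. The win comes from the hot loop walking one contiguous
 * array instead of chasing pointers scattered across the heap; the
 * copies at either end are linear passes and usually cheap.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    int rank;
    struct node *next;  /* pointer-rich "original" representation */
} node_t;

void rerank_with_locality(node_t *head, size_t n) {
    /* Copy in: flatten the list into one contiguous array, keeping
     * back-pointers so results can be written home afterwards. */
    int *ranks = malloc(n * sizeof *ranks);
    node_t **back = malloc(n * sizeof *back);
    size_t i = 0;
    for (node_t *p = head; p; p = p->next, i++) {
        ranks[i] = p->rank;
        back[i] = p;
    }

    /* Compute: the hot loop touches only the compact array. A real
     * ranking pass (network simplex iterations) would go here. */
    for (size_t k = 0; k < n; k++)
        ranks[k] += 1;  /* stand-in for the real computation */

    /* Copy out: write results back into the original structure. */
    for (size_t k = 0; k < n; k++)
        back[k]->rank = ranks[k];
    free(ranks);
    free(back);
}

int main(void) {
    node_t c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    rerank_with_locality(&a, 3);
    printf("%d %d %d\n", a.rank, b.rank, c.rank);  /* prints: 2 3 4 */
    return 0;
}
```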
Meanwhile, I’m looking for more benchmarks, both to validate the change and motivate reviewer effort, and to find other hotspots. Since the really big cases get bogged down in other spots, I suspect I need to look at mincross and maybe a few other passes. I’ll start by profiling #1864 for the next round.
@steveroush Do you have numeric data for that graph and/or a collection of the input files and command-line options used to run them?
I only see two graphs reachable from that .md file: 2854_0.gv and 4elt.gv.
They’re pretty big (28170 lines and 9048 lines) but don’t seem to contain 60 examples.