Here’s an e-mail conversation I had with Emden earlier about rtest, before I (temporarily) gave up.
Re: State of automated testing in Graphviz?
You replied on Sat 2020-04-04 19:18
Emden R. Gansner email@example.com
Fri 2020-04-03 23:08
On 3/30/20 2:52 AM, Magnus Jacobsson wrote:
- Making check in rtest: All tests seem to fail, but the failures go unreported in both the summary and the exit status. Do they really fail or not?
- Making check in shapes: All tests pass.
- Making check in vuln: The test fails. Is this a real bug, or does the reference data just need to be regenerated for a newer version of Graphviz?
No other tests seem to run. Should they?
Any tiny piece of information regarding anything of the above would be highly appreciated.
We have never been as dutiful about testing as is expected in current software development regimens. The main difficulty is having a reliable test to determine whether or not output has changed. Ideally, the test would compare two images (bitmaps, PDFs, etc.) and check whether they are essentially the same. We tried XORing bitmaps, but this really doesn’t capture “sameness” as judged by the human eye. This would probably be an ideal application of machine learning, if it hasn’t already been done: given two graph drawings, are they basically identical?
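As a rough illustration of what a fuzzier comparison than a plain XOR might look like, here is a minimal sketch in Python. It assumes Pillow is available; the downscaling size and the RMS threshold are arbitrary illustrative choices, not anything Graphviz actually uses.

```python
from PIL import Image, ImageChops
import math

def images_roughly_equal(path_a, path_b, threshold=10.0):
    """Return True if two bitmaps look 'basically identical'."""
    size = (256, 256)
    # Downscale and convert to grayscale so antialiasing and tiny
    # coordinate jitter are smoothed away before comparing.
    a = Image.open(path_a).convert("L").resize(size)
    b = Image.open(path_b).convert("L").resize(size)
    diff = ImageChops.difference(a, b)
    # RMS of the per-pixel difference: 0 means bit-identical;
    # small values mean "the same to the eye" under this metric.
    hist = diff.histogram()
    sq = sum(count * value ** 2 for value, count in enumerate(hist))
    rms = math.sqrt(sq / (size[0] * size[1]))
    return rms <= threshold
```

Unlike an XOR, which flags any differing pixel, this tolerates small local differences while still catching layouts that have genuinely moved.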
In any case, failing that, the tests in rtest, as well as those in tests, produce some form of text output and then diff the output file against a reference file. As with XORing bitmaps, this is very unreliable. For example, I notice node attributes are now listed one per line, where previously multiple attributes appeared on the same line, so files identical except for formatting would be reported as failures. More commonly, a small change in the Graphviz source changes a floating-point number: 27.2 might become 27.18, causing a test failure. The problem is magnified because text outputs are also very sensitive to OS and machine type. So you have all of these outputs that look visually identical but are reported as failures.
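A diff that tolerated formatting changes and small floating-point drift would avoid most of these false failures. Here is a minimal sketch of the idea; it is an illustration under those assumptions, not the comparison rtest actually performs.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def outputs_match(ref_text, new_text, tol=0.05):
    """Compare two text outputs token by token, treating numbers that
    differ by at most `tol` as equal (so 27.2 matches 27.18).

    Splitting on whitespace also makes the comparison indifferent to
    line breaks, e.g. one attribute per line versus several per line.
    Numbers embedded inside larger tokens (such as pos="27.2,30.5")
    are not handled by this sketch.
    """
    ref_tokens = ref_text.split()
    new_tokens = new_text.split()
    if len(ref_tokens) != len(new_tokens):
        return False
    for r, n in zip(ref_tokens, new_tokens):
        if r == n:
            continue
        if NUMBER.fullmatch(r) and NUMBER.fullmatch(n):
            if abs(float(r) - float(n)) <= tol:
                continue
        return False
    return True
```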
When we were more disciplined, we would run the tests before a major release and I would manually eyeball old and new output to see if they were reasonably identical. If these looked good, we would then refresh the reference outputs, at least for one machine and OS type. Sometimes, a test would crash, so that would identify an actual hard error and we could fix it.
Barring an accurate image diff that allows for insignificant variations, we should at least update the reference outputs to sync with the current version, after manually checking by sight that the new outputs haven’t introduced a bad change. Then, in theory, we could follow best practices and run the tests after each commit. With luck, only a few tests would fail; these could be quickly checked visually and, if satisfactory, their reference outputs would be updated. Usually, the onus would be put on the person making the commit, with the commit rejected if some tests fail.
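The “eyeball, then refresh” step could be as simple as the following sketch: for each new output that a human has inspected, ask whether to accept it and, if so, copy it over the reference. The directory names here are hypothetical placeholders, not the actual rtest layout.

```python
import shutil
from pathlib import Path

def refresh_references(new_dir, ref_dir):
    """Interactively promote accepted new outputs to reference outputs."""
    for new_file in sorted(Path(new_dir).iterdir()):
        answer = input(f"Accept new output for {new_file.name}? [y/N] ")
        if answer.strip().lower() == "y":
            # Overwrite the reference file with the approved new output.
            shutil.copy2(new_file, Path(ref_dir) / new_file.name)
```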
To answer your question about the vuln test: the new output and the reference output are significantly different. At first glance, the new output looks nicer than the old, but that doesn’t mean the new drawing is “correct”. To check that, one would have to peruse the input and verify that all the constraints it specifies are met in the drawing. Fortunately, Graphviz constraints are fairly high level and can be satisfied by many different layouts. Indeed, the vuln test is probably not a good test: the input is so complex that the output is likely to change given any change in the source.