Getting rtest to run in continuous integration

Ref: fix: uniformly treat margin as unsigned in pack.c (!1365) · Merge requests · graphviz / graphviz · GitLab

Here’s an e-mail conversation I had with Emden earlier about rtest before I (temporarily) gave up.

Re: State of automated testing in Graphviz?

You replied on Sat 2020-04-04 19:18

Emden R. Gansner erg@emdenrg.net

Fri 2020-04-03 23:08

On 3/30/20 2:52 AM, Magnus Jacobsson wrote:

Summary:

  • Running make check in rtest: all tests seem to fail, but this goes undetected in the summary and exit status. Do they really fail or not?
  • Running make check in shapes: all tests pass.
  • Running make check in vuln: the test fails. Is this a real bug, or does the reference data just need to be updated for a newer version of Graphviz?

No other tests seem to run. Should they?

Any tiny piece of information regarding anything of the above would be highly appreciated.

We have never been as dutiful about testing, especially as expected in current software development regimens. The main difficulty is having a reliable test to determine whether or not output has changed. Ideally, the test would compare two images (bitmaps, pdfs, etc.) and check if they are essentially the same. We tried xoring bitmaps but this really doesn’t capture “sameness” as determined by the human eye. This would probably be an ideal application of machine learning, if it hasn’t already been done. Given two graph drawings, are they basically identical?
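(An aside, not from the original email: short of an ML model, a perceptual hash is one cheap approximation of “looks basically the same” that is more forgiving than xoring bitmaps. A minimal sketch of the idea, assuming the third-party Pillow and ImageHash packages and placeholder file names and threshold:)

```python
# Rough "do these look the same?" check via perceptual hashing.
# Assumes: pip install Pillow ImageHash; file names are placeholders.
from PIL import Image
import imagehash

def roughly_same(path_a: str, path_b: str, max_distance: int = 5) -> bool:
    """Return True if the two images are perceptually similar.

    max_distance is the allowed Hamming distance between the 64-bit pHashes;
    the value 5 is a tunable assumption, not a Graphviz convention.
    """
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    return hash_a - hash_b <= max_distance

if __name__ == "__main__":
    print(roughly_same("a.png", "b.png"))
```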

In any case, failing this, the tests in rtests as well as those in tests produce some form of text output and then do a diff of the output file with a reference file. As with xoring bitmaps, this is very unreliable. For example, I notice node attributes are now listed one per line, where previously multiple attributes were on the same line. So files identical except for formatting would be reported as failures. More commonly, a small change in the Graphviz source would change a floating point number. So 27.2 might become 27.18, causing a test failure. This problem is magnified because text outputs are also very sensitive to OS and machine type. So you have all of these outputs which visually look identical but are reported as failures.
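(Another aside: one way to take the sting out of the 27.2-vs-27.18 class of failures is to round every floating-point literal in both files before diffing. A minimal sketch of that normalization, my own illustration rather than how rtest actually works, using only the Python standard library:)

```python
# Compare two dot/plain/xdot text outputs while ignoring small floating-point
# drift. The rounding precision and the "equal after normalization" policy are
# assumptions for illustration, not rtest's actual behavior.
import re

FLOAT_RE = re.compile(r"-?\d+\.\d+")

def normalize(text: str, digits: int = 1) -> str:
    """Round every float literal in the text to `digits` decimal places."""
    return FLOAT_RE.sub(lambda m: f"{float(m.group()):.{digits}f}", text)

def outputs_match(new_path: str, ref_path: str) -> bool:
    with open(new_path) as new, open(ref_path) as ref:
        return normalize(new.read()) == normalize(ref.read())
```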

When we were more disciplined, we would run the tests before a major release and I would manually eyeball old and new output to see if they were reasonably identical. If these looked good, we would then refresh the reference outputs, at least for one machine and OS type. Sometimes, a test would crash, so that would identify an actual hard error and we could fix it.

Barring an accurate diff for images allowing insignificant variations, we should at least update the reference outputs to sync with the current version after manually checking by sight that the new outputs haven’t introduced a bad change. Then, in theory, we could follow best practices and run the tests after each commit. With luck, only a few tests would fail. These could be quickly checked visually and if satisfactory, the failing reference outputs would be updated. Usually, the onus would be put on the person making the commit, with the commit failing if some tests fail.

To answer your question about the vuln test, the new output and the reference output are significantly different. At first glance, the new output looks nicer than the old but that doesn’t mean the new drawing is “correct”. To check that, one would have to peruse the input and verify all the constraints specified in the input are met in the drawing. Fortunately, Graphviz constraints are fairly high level and can be satisfied by many different layouts. Indeed, the vuln test is probably not a good test, as the input is so complex, the output is likely to change given any change in the source.


More emails about this. I never got around to doing what I said I was going to do.

Re: State of automated testing in Graphviz?

Magnus Jacobsson

Sat 2020-04-04 19:18

Hi Emden.

Thanks for your extensive answer.

Actually, yesterday I did what you suggested: text diffs where possible and eyeballing the rest on my Ubuntu 18.04 computer. I was very tired after eyeballing 99 images and diffing 28 files, but also very satisfied to see all tests pass :tada:.

I wasn’t really that happy today when I discovered that running the tests on Fedora 33 gave 80 differing files and, even worse, running the tests for Ubuntu 18.04 in a Docker container gave 122 (!) differing files :scream:.

Tomorrow I will look at ImageMagick’s compare (ImageMagick – Command-line Tools: Compare) or some other tool to see if it can help.
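(For reference, ImageMagick’s compare is easy to script from a test harness: it prints the chosen metric to stderr and uses exit status 1 for “images differ” and 2 for a real error. A rough sketch of how it could be wrapped, assuming ImageMagick 6’s compare binary is on PATH (ImageMagick 7 spells it magick compare), with an arbitrary fuzz factor and pixel allowance:)

```python
# Sketch: drive ImageMagick's `compare` from Python and allow a small number
# of differing pixels. The 5% fuzz and the pixel allowance are arbitrary
# assumptions for illustration.
import subprocess

def images_close(path_a: str, path_b: str, diff_path: str,
                 max_differing_pixels: int = 0) -> bool:
    # `compare -metric AE` prints the count of differing pixels to stderr.
    # Exit status 0 means identical, 1 means the images differ, 2 is an error.
    proc = subprocess.run(
        ["compare", "-metric", "AE", "-fuzz", "5%", path_a, path_b, diff_path],
        capture_output=True, text=True,
    )
    if proc.returncode == 2:
        raise RuntimeError(f"compare failed: {proc.stderr}")
    return float(proc.stderr.split()[0]) <= max_differing_pixels
```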


Re: State of automated testing in Graphviz?

Emden R. Gansner erg@emdenrg.net

Sat 2020-04-04 20:40

On 4/4/20 1:18 PM, Magnus Jacobsson wrote:

Tomorrow I will look at ImageMagick’s compare (ImageMagick – Command-line Tools: Compare) or some other tool to see if it can help.

If you can find something that would work, that would be great. But I was serious about some ML-based tool seeming the right way to go, though getting the training data would be tedious.

FWIW, we do some screenshot testing at work, and what works well for us is to try to make the output deterministic (and when we can’t, but still want small diffs to pass, we set a simple threshold on the % of pixels that are allowed to differ before the test is considered failed). Decode the images into bitmaps (make sure they’re saved as lossless PNGs or raw bitmaps) and compare pixel by pixel.

One more trick: if the RGB component isn’t deterministic, you can ignore it for pixels whose alpha channel is fully transparent (though often it’s easier to just make the RGB component deterministic).
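(A minimal Pillow-based sketch of the approach described above; the threshold value is a placeholder, and this is just my reading of the technique, not code from any existing suite:)

```python
# Pixel-by-pixel comparison with a percentage threshold, ignoring RGB values
# wherever both pixels are fully transparent. The 0.1% threshold is a placeholder.
from PIL import Image

def screenshots_match(path_a: str, path_b: str, max_diff_pct: float = 0.1) -> bool:
    img_a = Image.open(path_a).convert("RGBA")
    img_b = Image.open(path_b).convert("RGBA")
    if img_a.size != img_b.size:
        return False
    differing = 0
    for pa, pb in zip(img_a.getdata(), img_b.getdata()):
        if pa[3] == 0 and pb[3] == 0:
            continue  # both fully transparent: RGB is irrelevant
        if pa != pb:
            differing += 1
    total = img_a.size[0] * img_a.size[1]
    return 100.0 * differing / total <= max_diff_pct
```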

The other trick with screenshot testing is to make it really easy to update the golden screenshots - as you mention, minor source changes will sometimes lead to big changes in the pictures. That’s OK, as long as it’s not very hard to update the pictures when they do change – the main goal is to detect regressions, but when you’re making changes to the core rendering logic I think it would be OK to eyeball the tests and bulk-approve changes to them.

Some days (maybe you’ve made a change to a core part of the app) you’re reviewing hundreds of images (it really helps to have a diff tool in your code reviewer that can flash back and forth, or display highlighted differences in pixels) and your eyes bleed a bit ;-), but it’s often worth it for the confidence it gives you that you’re not breaking things.


I suspect we might get more stable results diffing the SVG output, which is slightly abstracted from OS-specific image libraries but still represents the layout. And we can still open the SVG in a browser to see how it looks :slight_smile:


Mark’s suggestion of using the SVG outputs seems like the way to go to me. Maybe we can build something on top of ImageMagick that suppresses fuzzy differences, like minor floating point discrepancies. I’d definitely be more comfortable with a systematic automated check than relying on a human to eyeball two images and form a subjective opinion.
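(In the same spirit, if we diff the SVGs themselves, the “fuzzy differences” to suppress are mostly coordinate values. A sketch of a structural comparison that parses both SVGs and rounds numbers inside attribute values before comparing; the precision and the overall approach are assumptions for illustration:)

```python
# Sketch: structural SVG comparison -- same elements, attributes and text,
# with float literals inside values rounded before comparing. One decimal
# place of precision is an arbitrary choice.
import re
import xml.etree.ElementTree as ET

FLOAT_RE = re.compile(r"-?\d+\.\d+")

def _canon(value: str) -> str:
    return FLOAT_RE.sub(lambda m: f"{float(m.group()):.1f}", value)

def _same_element(a: ET.Element, b: ET.Element) -> bool:
    if a.tag != b.tag or len(a) != len(b):
        return False
    attrs_a = {k: _canon(v) for k, v in a.attrib.items()}
    attrs_b = {k: _canon(v) for k, v in b.attrib.items()}
    if attrs_a != attrs_b:
        return False
    if _canon(a.text or "").strip() != _canon(b.text or "").strip():
        return False
    return all(_same_element(ca, cb) for ca, cb in zip(a, b))

def svgs_equivalent(path_a: str, path_b: str) -> bool:
    return _same_element(ET.parse(path_a).getroot(), ET.parse(path_b).getroot())
```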

I’m inclined to prioritize this (getting the full test suite running) over all the other outstanding work. Without these tests running in CI we have limited visibility over whether our other changes are breaking things.

By the way, does this mean contrib/diffimg is deprecated/unused?

I agree. The tests do however already utilize the SVG format (among others). I have WIP in a local branch. I’ll recapture where I was with this almost two months ago and make a short write-up during the weekend so that we don’t reinvent the wheel.


I’m currently working on getting a subset of the rtest test cases that produce identical results on all platforms up and running in CI. My intention is that, in the short term, this will benefit the fixes and refactoring that primarily @smattr is doing, and that later on we find a way to make smarter comparisons so that all of them can be enabled.

For all CentOS* and Fedora* platforms I’m getting:

Format: "png" not recognized. Use one of: canon cmap cmapx cmapx_np dot dot_json eps fig gv imap imap_np ismap json json0 mp pic plain plain-ext pov ps ps2 svg svgz tk vml vmlz xdot xdot1.2 xdot1.4 xdot_json

Any input as to why would be appreciated; it would cut down the time it takes me to figure it out myself.

EDIT: It works when graphviz-gd-*.rpm is installed.
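(For what it’s worth, a test harness can turn this failure mode into a skip rather than a hard failure by probing which formats the dot on the machine actually supports. A sketch of my own, based only on the error text above:)

```python
# Sketch: probe whether this dot build can render a given format, so a test
# can be skipped (rather than failed) when an optional plugin such as
# graphviz-gd is not installed.
import os
import subprocess
import pytest

def dot_supports(fmt: str) -> bool:
    proc = subprocess.run(
        ["dot", f"-T{fmt}", "-o", os.devnull],
        input="graph {}", capture_output=True, text=True,
    )
    return proc.returncode == 0 and "not recognized" not in proc.stderr

@pytest.mark.skipif(not dot_supports("png"), reason="png output not available")
def test_something_that_needs_png():
    ...
```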

Note to self: Consider John’s comment #4 from https://forum.graphviz.org/t/new-test-stage-in-the-pipeline/41/2

With Add rtest (!1395) · Merge requests · graphviz / graphviz · GitLab we will have rudimentary rtest regression tests in CI. The end result is kinda lame, since the only test cases whose output is checked against reference files are the 10 (out of 135) test cases that give the same results on all platforms. Still, it’s better than nothing: we actually run all the test cases, so we would notice if any of them caused a crash, and we now have something that can be improved upon. I think the next step would be to replace the rtest.sh script with pytest (or similar) test cases before we start actually improving the comparisons or take a completely different approach.

I’ve used pytest for the overall test framework, for no particular reason. I could just as well have chosen Python’s unittest, which I’ve used before. I’m open to picking another if you have one you prefer. It’s very tiny as it is now, so a change is still easy.
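(For a sense of scale, an individual regression check in pytest is only a few lines; something along these lines, where the paths and the test case are placeholders rather than the actual layout of the MR:)

```python
# Minimal pytest-style regression check: run dot on an input graph and compare
# against a stored reference. Directory names and the graph are placeholders.
import subprocess
from pathlib import Path

GRAPHS = Path("graphs")        # hypothetical input directory
REFERENCE = Path("reference")  # hypothetical reference-output directory

def test_abstract_gv():
    # check=True makes a dot crash surface as a test error.
    result = subprocess.run(
        ["dot", "-Tplain", str(GRAPHS / "abstract.gv")],
        capture_output=True, text=True, check=True,
    )
    expected = (REFERENCE / "abstract.plain").read_text()
    assert result.stdout == expected
```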

This was a rough experience :cold_sweat:. I’m glad it’s over. At least for now. And, hey! I found a regression in the process :superhero: :champagne: even though it would still go undetected if it was re-introduced :disappointed:


Great work, @magjac! This looks like it would have taken quite a lot of time, so thanks for spending your effort on this.

I get the impression pytest is more popular, but unittest has the advantage of being in the Python standard library so we don’t need to install an extra package. I’m OK with either.


I’m fine with either pytest or unittest (or even just if-statements and print() and exit(1) on failure!). All good to me. Ideally we can get useful error messages on failure (a lot of the assertEqual-style tests have pretty spammy output, showing the entire left-hand and right-hand strings when really you want a diff), but that’s pretty far up my hierarchy of needs. Getting any CI working is a great place to start!
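(On the spammy-output point, one way to get a readable failure is to compute the unified diff ourselves and put only that in the assertion message. A sketch of such a helper; it is illustrative and not part of the current tests:)

```python
# Make failures print a unified diff rather than both full strings.
import difflib

def assert_text_equal(actual: str, expected: str) -> None:
    if actual == expected:
        return
    diff = "\n".join(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected", tofile="actual", lineterm="",
    ))
    raise AssertionError("output differs from reference:\n" + diff)
```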


Thanks. If there are no suggestions for something else, I’m going to stick with pytest out of pure laziness.

Merged to master with https://gitlab.com/graphviz/graphviz/-/commit/c9f756613347bfcaceab2999d467e9c9a6a5b81d


Here are 3 programs that will “diff” image files and produce a text analysis:

I would expect that any of these (with a bit of tuning) would be able to quickly (?) produce yes/no/questionable reports.


Whoops! Wrong again. I just used the 3 programs listed above to diff 756 .gv files run through dot -Tpng on my Fedora system with this version of dot - graphviz version 2.45.20200611.1412 (20200611.1412) - and an Ubuntu system running dot - graphviz version 2.45.20200611.1412 (20200611.1412).
748 differed! Ouch. Though I’ve only eyeballed a few dozen, font differences were the biggest problem by far. I’ve only seen 1 real layout difference (and it was pretty modest).
Single characters were clearly different and longer labels caused differences in graph height and/or width.


I deciphered tests.tcl in rtest2 and have begun a proof-of-concept migration to Python. I converted TESTS to JSON for portability and to avoid writing a special-purpose parser. Once I have input and output generation working, I’ll look into the Python pixelmatch library (GitHub - whtsky/pixelmatch-py: A fast pixel-level image comparison library, originally created to compare screenshots in tests.). If all goes well, the plan is to add pytest or unittest support, then add it to CMake. If pixelmatch doesn’t work out, I’ve seen other methods using OpenCV.
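(For the record, the way pixelmatch-py is meant to be used looks roughly like this, going by the usage example in its README; the exact signature may differ, so treat it as a sketch:)

```python
# Sketch of how pixelmatch-py could slot into the comparison step, based on
# the usage shown in its README (details may differ in practice).
from PIL import Image
from pixelmatch.contrib.PIL import pixelmatch

def count_mismatched_pixels(path_a: str, path_b: str, diff_path: str) -> int:
    img_a = Image.open(path_a).convert("RGBA")
    img_b = Image.open(path_b).convert("RGBA")
    img_diff = Image.new("RGBA", img_a.size)
    mismatch = pixelmatch(img_a, img_b, img_diff, includeAA=True)
    img_diff.save(diff_path)  # visual diff for the results directory
    return mismatch
```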

See Python version of rtest2; related to 1772 (!1479) · Merge requests · graphviz / graphviz · GitLab, which provides a Python replacement for test.tcl. The original TESTS file was converted to JSON so it’s a little more understandable and portable and doesn’t need its own parser. File differences are written to the results directory - unified diffs for text formats and PNGs for image differences. Output is in TAP format, which should be digestible by pytest and other frameworks, and exit codes are compatible with CTest’s zero-means-success expectation.
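(For anyone unfamiliar with TAP, the output shape is simple enough to show in a few lines; a sketch of an emitter with CTest-friendly exit codes, which the real script in the MR differs from in detail:)

```python
# Sketch of the TAP ("Test Anything Protocol") output shape referred to above.
def emit_tap(results):
    """results: iterable of (test_name, passed) pairs. Returns an exit code."""
    results = list(results)
    print(f"1..{len(results)}")       # the TAP plan line
    failures = 0
    for i, (name, passed) in enumerate(results, start=1):
        status = "ok" if passed else "not ok"
        failures += not passed
        print(f"{status} {i} - {name}")
    return 1 if failures else 0       # CTest-style: zero means success
```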

The main issue I’m having is that dot complains about needing to be initialized with dot -c. I’ve run into this with other tests, so I must be missing something in the pre-test setup of dot - is there a way to run dot -c to register plugins locally/temporarily instead of writing configuration info to a system directory?
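(One possible workaround, if I remember the plugin loader correctly: Graphviz consults the GVBINDIR environment variable when looking for its plugin directory and config file, so pointing it at a scratch directory that contains the libgvplugin_* libraries should let dot -c write its config there instead of in the system location. A sketch of that idea, unverified and Unix-only because of the symlinks:)

```python
# Possible workaround (unverified): point GVBINDIR at a scratch directory that
# contains symlinks to the libgvplugin_* libraries, then run `dot -c` so the
# generated config lands there instead of in the system lib directory.
import os
import subprocess
import tempfile

def make_local_plugin_config(system_plugin_dir: str) -> dict:
    """Return an environment for running dot without touching the system
    config. `system_plugin_dir` is wherever the build installed the
    libgvplugin_* libraries (the path varies by platform)."""
    scratch = tempfile.mkdtemp(prefix="gv-plugins-")
    for name in os.listdir(system_plugin_dir):
        if name.startswith("libgvplugin_"):
            os.symlink(os.path.join(system_plugin_dir, name),
                       os.path.join(scratch, name))
    env = dict(os.environ, GVBINDIR=scratch)
    subprocess.run(["dot", "-c"], env=env, check=True)  # writes config here
    return env
```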