How do I properly escape arbitrary text for use in labels?

I am trying to “export” a hierarchical document into DOT.
The nodes in my document may, in principle, have any content in them: for example, LaTeX, images, or code (even Graphviz code).

I want to somehow escape this text so that it is rendered “safely” as Graphviz node content, apart from some elements which I might later decide to implement. (For example, I might want to render those LaTeX formulae and insert them into the graph as images.)

Is there some “reliable” way to make sure that a string is “safe” to be used in a node label, other than base64-encoding it? (Base64 is not readable even for the ASCII part.)

URLs turned out to be particularly troublesome: they have http:// in them, and // is widely treated as the start of a comment.

[This suggestion is untested, beware]
Consider making every node label an image. This might have two advantages:

  • It would allow you to put the escape problem on a “real” text processor (latex, svg, html, Word, troff, …) that is better designed for the “global escape” problem.
  • As a side benefit, it would make it much easier to use line-wrap / word-wrap to manage the width of the text in your nodes.

Unfortunately, we have realized that there is no way to safely put arbitrary text in Graphviz strings, because of past mistakes we made in handling quotes and escapes.

For simple cases like URLs that don’t contain escapes, just put the string inside quotes, and escape any quotes inside the string.
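As a rough sketch of that simple case (the function name quote_simple_label is made up for illustration, and this deliberately ignores backslashes and anything else that might be treated as an escape):

```python
def quote_simple_label(text: str) -> str:
    """Wrap text in double quotes and escape embedded double quotes.

    Only intended for simple text without backslashes, such as plain URLs.
    """
    return '"' + text.replace('"', '\\"') + '"'


url = "http://example.com/a?b=c"
# Inside the quotes, the // should no longer be read as a comment.
print(f"n1 [label={quote_simple_label(url)}];")
# prints: n1 [label="http://example.com/a?b=c"];
```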

Wow, that is brutally honest. No chance of this changing in the foreseeable future?

I guess this really makes images a “correct” solution.

The escaping rules are indeed bizarre and ambiguous, but I’m a little more optimistic than others. You can see something approximating my current understanding of them in the commit “fix double escaping during canonicalization” (3b81ca48) in the graphviz/graphviz repository on GitLab.

I think the rule would be something like:

  • \ → \\
  • " → \"
  • \n, \f, \r → something (need to think more and/or look at the code wrt this)
  • invalid UTF-8 → something…
  • anything else → as-is

And, as Stephen said, "-quote all label text.
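A minimal Python sketch of that tentative rule list might look like the following. It has not been checked against Graphviz itself. The function name escape_label is hypothetical, and the handling of line breaks is purely an assumption, since the list above leaves it open. Form feeds and invalid UTF-8 are not handled at all.

```python
def escape_label(text: str) -> str:
    """Apply the tentative escaping rules and double-quote the result."""
    # Double backslashes first, so the backslashes added by the later
    # replacements are not doubled again.
    out = text.replace("\\", "\\\\")
    # Escape embedded double quotes.
    out = out.replace('"', '\\"')
    # ASSUMPTION: real line breaks become the two-character sequence \n.
    # The rule list above only says "something" for this case.
    out = out.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\\n")
    # Anything else is passed through as-is.
    return '"' + out + '"'


print(escape_label('C:\\temp and "quotes"'))
```

The order of the replacements is the only subtle part: escaping quotes before doubling backslashes would double the backslash that was just added in front of each quote.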

I agree with Matthew; we intended to cover common cases. Also, strings that are processed through the lexer can potentially get different treatment than strings presented through the C/C++ API.

As an alternative, strings can be encoded in the Graphviz HTML-ish syntax, in which one can map any special character into an “entity encoding” like &amp;amp;.
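For what it is worth, a readable version of this could lean on Python's standard html.escape, which replaces &, < and > with named entities. This is only a sketch: the function name readable_html_label is made up, and it assumes that escaping those three characters is enough for plain text inside an HTML-like label. Line breaks are not handled here.

```python
import html


def readable_html_label(text: str) -> str:
    """Entity-encode &, < and >, then wrap the text in < > delimiters."""
    # quote=False leaves " and ' alone; only &, < and > are replaced.
    return "<" + html.escape(text, quote=False) + ">"


print(f"n1 [label={readable_html_label('AT&T <rocks>')}];")
# prints: n1 [label=<AT&amp;T &lt;rocks&gt;>];
```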

What escape rules apply (or don’t apply) if the strings come in through the C/C++ API?
Might a simple C/C++/Python program that just shoves text into label fields and then executes the desired engine “solve” this problem?

You mean which sequences are recognized as something other than their literal text? Basically all the ones that is_escape, which I linked to before, is trying to dodge. This stuff is also (incompletely) documented on the “escString” page of the Graphviz documentation.

I think what Stephen is getting at is that the lexer (lib/cgraph/scan.l) does some of its own escaping of strings present in dot file input before it reaches the C/C++ API.

It hadn’t occurred to me, but Stephen’s suggestion of using an HTML-like string is a good idea and may be the simplest. Use < and > instead of " delimiters. Encode every character using &#…; syntax for maximum paranoia, at the expense of any readability.
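A sketch of the maximum-paranoia variant, with every character written as a decimal &#…; reference. The function name paranoid_html_label is hypothetical. I have not tested how Graphviz treats references to line breaks or control characters, so treat those as unsupported here.

```python
def paranoid_html_label(text: str) -> str:
    """Encode every character as a numeric reference inside < > delimiters."""
    # ord() gives the Unicode code point, so non-ASCII text is covered too.
    encoded = "".join(f"&#{ord(ch)};" for ch in text)
    return "<" + encoded + ">"


print(f"n1 [label={paranoid_html_label('a<b')}];")
# prints: n1 [label=<&#97;&#60;&#98;>];
```

The output is unreadable in the .dot file, as noted above, but nothing in it can be mistaken for a delimiter, comment, or escape sequence.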