How do I properly escape arbitrary text for use in labels?

I am trying to “export” a hierarchical document into Dot.
The nodes in my document may, in principle, have any content in them, for example, latex, images, code (even graphviz code).

I want to somehow escape this text so that it is rendered “safely” as graphviz node content, apart from some elements which I might later decide to implement. (For example, I might want to render those LaTeX formulae and insert them into the graph as images.)

Is there some “reliable” way to make sure that a string is “safe” to be used in a node label, other than base64-encoding it? (As base64 is not readable even for the ascii part.)

URLs turned out to be particularly troublesome: they contain http://, and // is almost universally treated as the start of a comment.

[This suggestion is untested, beware]
Consider making every node label an image. This might have two advantages:

  • It would allow you to put the escape problem on a “real” text processor (latex, svg, html, Word, troff, …) that is better designed for the “global escape” problem.
  • As a side benefit, it would make it much easier to use line-wrap / word-wrap to manage the width of the text in your nodes.

Unfortunately, we realized there is no way to safely put arbitrary text in graphviz strings, because of past mistakes we made in handling quotes and escapes.

For simple cases like URLs that don’t contain escapes, just put the string inside quotes, and escape any quotes inside the string.
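A minimal Python sketch of that simple case, assuming the text contains no backslashes. The function name quote_simple_label and the example URL are made up for illustration and are not part of Graphviz:

```python
def quote_simple_label(text: str) -> str:
    """Wrap text in double quotes, backslash-escaping embedded quotes.

    Assumption: the text has no backslashes. Those fall outside the
    "simple case" and are rejected here rather than guessed at.
    """
    if "\\" in text:
        raise ValueError("backslashes are outside this simple case")
    return '"' + text.replace('"', '\\"') + '"'


# Hypothetical example URL; inside the quotes, // is just text.
url = "http://example.com/a?q=1"
print(f"n1 [label={quote_simple_label(url)}];")
```

Anything with backslashes or control characters needs more care than this.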

Wow, that is brutally honest. No chance of this changing in the foreseeable future?

I guess this really makes images a “correct” solution.

The escaping rules are indeed bizarre and ambiguous, but I’m a little more optimistic than others. You can see something approximating my current understanding of them in fix double escaping during canonicalization (3b81ca48) · Commits · graphviz / graphviz · GitLab.

I think the rule would be something like:

  • \ → \\
  • " → \"
  • \n, \f, \r → something (need to think more and/or look at the code wrt this)
  • invalid UTF-8 → something…
  • anything else → as-is

And, as Stephen said, "-quote all label text.

I agree with Matthew; we intended to cover common cases. Also, strings that are processed through the lexer can potentially get different treatment than strings presented through the C/C++ API.

As an alternative, strings can be encoded in the Graphviz HTML-ish syntax, in which one can map any special character into an “entity encoding” such as &amp; for &.
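A rough sketch of the entity-encoding idea in Python, using the standard library's html.escape. The function name html_like_label is hypothetical, and the sketch assumes plain text with no markup you want to keep:

```python
import html


def html_like_label(text: str) -> str:
    """Return text as an HTML-like label delimited by < and >.

    html.escape with quote=False rewrites & as &amp;, < as &lt; and
    > as &gt;, which are the characters that would otherwise be read
    as markup.
    """
    return "<" + html.escape(text, quote=False) + ">"


print(f"n1 [label={html_like_label('if a < b && b > c')}];")
```

Line breaks probably need separate treatment, since HTML-like labels use <BR/> rather than a newline character.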

What escape rules apply (or don’t apply) if they come in through the C/C++ API?
Might a simple C/C++/Python program that just shoves text into label fields and then executes the desired engine “solve” this problem?

You mean which sequences are recognized as something other than their literal text? Basically all the ones that is_escape, which I linked to before, is trying to dodge. This stuff is also (incompletely) documented at escString | Graphviz.

I think what Stephen is getting at is that the lexer (lib/cgraph/scan.l) does some of its own escaping of strings present in dot file input before it reaches the C/C++ API.

It hadn’t occurred to me, but Stephen’s suggestion of using an HTML-like string is a good idea and may be the simplest. Use < and > instead of " delimiters. Encode every character using &#…; syntax for maximum paranoia, at the expense of any readability.
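For what it's worth, the “maximum paranoia” variant can be sketched in a few lines of Python. The name paranoid_html_label is made up, and I have not checked how every renderer treats encoded control characters such as newlines:

```python
def paranoid_html_label(text: str) -> str:
    """Encode every character as a decimal &#...; reference.

    The result is wrapped in < and > so it can be used directly as an
    HTML-like label value. Readability is sacrificed entirely.
    """
    return "<" + "".join(f"&#{ord(ch)};" for ch in text) + ">"


print(f"n1 [label={paranoid_html_label('a/b')}];")
```

Every character, including quotes, backslashes and slashes, ends up as digits between &# and ;, so nothing is left for the lexer to misread.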

If you want to ensure that a string is safe to be used as a node label in Graphviz, you can perform a process called “escaping” or “quoting” the string. This process involves replacing special characters or sequences with their corresponding escape codes so that they are rendered correctly.

In the case of Graphviz, the characters you need to be cautious about are double quotes (") and backslashes (\). To escape these characters, you can use backslashes as well. Here’s an example of how you can escape a string in Python:

def escape_string(string):
    # Escape backslashes first so that the backslashes added by the
    # later replacements are not doubled again.
    escaped_string = (
        string.replace('\\', '\\\\')
        .replace('"', '\\"')
        .replace('\n', '\\n')
        .replace('\r', '\\r')
        .replace('\t', '\\t')
        .replace('\b', '\\b')
        .replace('\f', '\\f')
    )
    return escaped_string

In the above code, the backslash character ("\") is escaped by replacing it with two backslashes ("\\"). Double quotes (") are escaped by preceding them with a backslash (\"). Additionally, other special characters such as newline ("\n"), carriage return ("\r"), tab ("\t"), backspace ("\b"), and form feed ("\f") are also escaped.

You can use this escape_string function to escape your hierarchical document’s content before using it as a node label in Graphviz. By doing so, you can ensure that the content is rendered correctly and safely within the node labels.

Regarding URLs, if you want to display them as part of a node label, you can escape them using the same escape_string function. This will prevent Graphviz from treating the “//” sequence as a comment. However, keep in mind that Graphviz doesn’t provide built-in support for rendering URLs as clickable links within node labels.

As mentioned above, if you are accepting arbitrary input you also need to anticipate malformed UTF-8. Depending on where your input is coming from, you also need to consider alternate encodings (Graphviz understands more than just UTF-8).
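One possible way to deal with malformed UTF-8 before any escaping happens, sketched in Python. It assumes the input arrives as bytes that are meant to be UTF-8; the function name decode_for_label is hypothetical, and substituting U+FFFD is just one policy (rejecting the input is another):

```python
def decode_for_label(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing malformed sequences.

    errors="replace" substitutes U+FFFD (the replacement character)
    for bytes that are not valid UTF-8, so the result is always a
    well-formed string.
    """
    return raw.decode("utf-8", errors="replace")


print(decode_for_label(b"caf\xc3\xa9"))  # valid UTF-8 passes through
print(decode_for_label(b"bad\xff"))      # invalid byte is replaced
```

If the input might be in some other encoding, this sketch would mangle it, so the source encoding needs to be known first.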

Why would Graphviz treat // within a string as a comment?

Yes, it does. See e.g. HREF uses in HTML-like labels. It’s certainly constrained, but it’s not non-existent.