Should identifiers be quoted when using characters from the “Miscellaneous Symbols and Pictographs” unicode block?

magjac · February 12, 2024, 6:11pm

Apparently this renders fine, but is it legal DOT?
[dot verbose=true]

digraph G {
  charset="UTF-8"
  🍔 -> 💩
}

[/dot]

I assumed that I would have had to write this with quotes, like so:
[dot verbose=true]

digraph G {
  charset="UTF-8"
  "🍔" -> "💩"
}

[/dot]

The DOT spec has this to say:

An ID is one of the following:

Any string of alphabetic ([a-zA-Z\200-\377]) characters, underscores (‘_’) or digits([0-9]), not beginning with a digit;
…

I’m not sure how to interpret \200-\377 in this context since “DOT assumes the UTF-8 character encoding”. Does it mean the Unicode block “Latin-1 Supplement”? That seems odd, but it also seems to work:

[dot verbose=true]

digraph G {
  charset="UTF-8"
  ¨ -> ¸
}

[/dot]

Surely it cannot mean the first byte of a UTF-8 sequence?

I’m asking because of this Graphviz Visual Editor issue:

Since it was written, we have changed the example in the official documentation through this commit:

scnorth · February 13, 2024, 1:02pm

I don’t know, and I don’t think anyone knows, but the answer is in lib/cgraph/scan.l, which includes this:

LETTER  [A-Za-z_\200-\377]
DIGIT   [0-9]
NAME    {LETTER}({LETTER}|{DIGIT})*
NUMBER  [-]?(({DIGIT}+(\.{DIGIT}*)?)|(\.{DIGIT}+))(\.|{LETTER})?
ID      ({NAME}|{NUMBER})

There’s other code that handles the case of quoted strings. Also HTML strings. Inside quoted strings, the lexer just soaks up ([^"\\\n]*|[\\]) until it sees ["].

There’s no awareness of charset settings or anything like that.

When this was written, not much thought was given to anything except ASCII. It appears we just noticed, hey, if we define LETTER to include \200-\377, we can process UTF-8. Hold my beer…
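
To see why the unquoted emoji get through, the LETTER, DIGIT and NAME rules quoted above can be tried out byte by byte. The following is only a rough Python sketch, not the actual lexer: it assumes the scanner sees its input as raw 8-bit bytes, so that \200-\377 stands for byte values 0x80-0xFF rather than Unicode code points. The helper name is_unquoted_name is made up for illustration, and the NUMBER, quoted-string and HTML-string rules are left out.

```python
import re

# Byte-level transcription of the LETTER/DIGIT/NAME rules quoted from scan.l.
# Assumption: the scanner works on raw 8-bit bytes, so \200-\377 means the
# byte values 0x80-0xFF, not the code points U+0080-U+00FF.
NAME = re.compile(rb"[A-Za-z_\x80-\xff][A-Za-z_\x80-\xff0-9]*")


def is_unquoted_name(text: str) -> bool:
    """Hypothetical helper: True if the UTF-8 bytes of `text` fully match NAME."""
    return NAME.fullmatch(text.encode("utf-8")) is not None


if __name__ == "__main__":
    # Print each candidate, its UTF-8 bytes in hex, and whether NAME accepts it.
    for candidate in ["🍔", "💩", "¨", "node_1", "1node", "a b"]:
        encoded = candidate.encode("utf-8")
        print(repr(candidate), encoded.hex(" "), is_unquoted_name(candidate))
```

Under that assumption, every byte of a multi-byte UTF-8 sequence, lead byte and continuation bytes alike, is 0x80 or above, so each one counts as a LETTER. That would explain why both the emoji and the Latin-1 Supplement characters are accepted without quotes, while a name that starts with a digit or contains a space is not.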
