Should identifiers be quoted when using characters from the “Miscellaneous Symbols and Pictographs” Unicode block?

Apparently this renders fine, but is it legal DOT?
[dot verbose=true]

digraph G {
  charset="UTF-8"
  🍔 -> 💩
}

[/dot]

I assumed that I would have had to write this with quotes, like so:
[dot verbose=true]

digraph G {
  charset="UTF-8"
  "🍔" -> "💩"
}

[/dot]

The DOT spec has this to say:

An ID is one of the following:

  • Any string of alphabetic ([a-zA-Z\200-\377]) characters, underscores (‘_’) or digits([0-9]), not beginning with a digit;

I’m not sure how to interpret \200-\377 in this context, since “DOT assumes the UTF-8 character encoding”. Does it mean the Unicode block “Latin-1 Supplement”? That seems odd, but it also seems to work:

[dot verbose=true]

digraph G {
  charset="UTF-8"
  ¨ -> ¸
}

[/dot]

Surely it cannot mean the first byte of a UTF-8 sequence?

The reason I’m asking is because of this Graphviz Visual Editor issue:

Since it was written, we have changed the example in the official documentation through this commit:

I don’t know, and I don’t think anyone knows, but the answer is in lib/cgraph/scan.l, which includes this:

LETTER  [A-Za-z_\200-\377]
DIGIT   [0-9]
NAME    {LETTER}({LETTER}|{DIGIT})*
NUMBER  [-]?(({DIGIT}+(\.{DIGIT}*)?)|(\.{DIGIT}+))(\.|{LETTER})?
ID      ({NAME}|{NUMBER})

There’s other code that handles quoted strings, and HTML strings too. Inside quoted strings, the lexer just soaks up ([^"\\\n]*|[\\]) until it sees ["].

There’s no awareness of charset settings or anything like that.
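
To make the byte-level reading concrete, here is a minimal Python sketch that mirrors the LETTER, DIGIT and NAME rules quoted above. It is not Graphviz code: the helper name is_unquoted_name is hypothetical, it assumes the input reaches the lexer as UTF-8 bytes, and it ignores the NUMBER alternative of ID as well as quoted and HTML strings.

```python
import re

# Byte-level mirror of the LETTER/DIGIT/NAME rules quoted from scan.l.
# Octal \200-\377 is 0x80-0xFF, i.e. every byte with the high bit set.
LETTER = rb"[A-Za-z_\x80-\xff]"
DIGIT = rb"[0-9]"
NAME = re.compile(LETTER + rb"(?:" + LETTER + rb"|" + DIGIT + rb")*")


def is_unquoted_name(text: str) -> bool:
    """Hypothetical helper: True if the UTF-8 bytes of `text` match NAME in full."""
    return NAME.fullmatch(text.encode("utf-8")) is not None


if __name__ == "__main__":
    for candidate in ["🍔", "💩", "¨", "node_1", "a b", "1abc"]:
        # Show the raw bytes next to the verdict to make the byte view visible.
        print(candidate, candidate.encode("utf-8"), is_unquoted_name(candidate))
```

Under this reading, every byte of a multi-byte UTF-8 sequence (lead and continuation alike) falls in 0x80-0xFF and therefore counts as a LETTER, so the sketch accepts 🍔, 💩 and ¨ as bare names and rejects "a b" because of the space.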

When this was written, not much thought was given to anything except ASCII. It appears we just noticed, hey, if we define LETTER to include \200-\377, we can process UTF-8. Hold my beer…
