Any string of alphabetic ([a-zA-Z\200-\377]) characters, underscores (‘_’) or digits([0-9]), not beginning with a digit;
…
I’m not sure how to interpret \200-\377 in this context, since “DOT assumes the UTF-8 character encoding”. Does it mean the Unicode block “Latin-1 Supplement”? That seems odd, but it also seems to work:
[dot verbose=true]
digraph G {
charset="UTF-8"
¨ -> ¸
}
[/dot]
Surely it cannot mean the first byte of a UTF-8 sequence?
The reason I’m asking is because of this Graphviz Visual Editor issue:
I don’t know, and I don’t think anyone knows, but the answer is in lib/cgraph/scan.l, which includes this:
LETTER [A-Za-z_\200-\377]
DIGIT [0-9]
NAME {LETTER}({LETTER}|{DIGIT})*
NUMBER [-]?(({DIGIT}+(\.{DIGIT}*)?)|(\.{DIGIT}+))(\.|{LETTER})?
ID ({NAME}|{NUMBER})
There’s other code that handles quoted strings, and HTML strings as well. Inside a quoted string, the lexer just soaks up ([^"\\\n]*|[\\]) until it sees ["].
There’s no awareness of charset settings or anything like that.
When this was written, not much thought was given to anything except ASCII. It appears we just noticed, hey, if we define LETTER to include \200-\377, we can process UTF-8. Hold my beer…
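To make the byte-level reading concrete, here is a minimal sketch in Python (not flex, and not the actual Graphviz code) that applies the LETTER, DIGIT and NAME definitions quoted above to raw bytes. The helper names `is_name` and `bytes_are_name` are made up for illustration, and it assumes the lexer sees the input as plain 8-bit bytes. It only covers NAME, not NUMBER, quoted strings or HTML strings.

```python
import re

# Byte-level translation of the flex definitions quoted above.
# Octal \200-\377 is the byte range 0x80-0xFF.
LETTER = rb"[A-Za-z_\x80-\xff]"
DIGIT = rb"[0-9]"
NAME = re.compile(LETTER + rb"(?:" + LETTER + rb"|" + DIGIT + rb")*")


def is_name(text: str) -> bool:
    """Return True if the UTF-8 bytes of `text` match NAME as a whole."""
    return NAME.fullmatch(text.encode("utf-8")) is not None


def bytes_are_name(raw: bytes) -> bool:
    """Same check on raw bytes, with no UTF-8 validation at all."""
    return NAME.fullmatch(raw) is not None


print(is_name("\u00a8"))            # U+00A8 encodes to C2 A8, both >= 0x80
print(is_name("\u2192x1"))          # U+2192 encodes to E2 86 92
print(is_name("1abc"))              # starts with a digit, so not a NAME
print(bytes_are_name(b"\xff\xfe"))  # not valid UTF-8, still LETTER LETTER
```

This prints True, True, False, True. Under that reading, \200-\377 is neither the Latin-1 Supplement block nor "the first byte" of a sequence: every byte of every multi-byte UTF-8 sequence falls in 0x80-0xFF, so any non-ASCII character passes as a run of LETTERs, and so does a byte string that is not valid UTF-8.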