Strange graphviz syntax in pydot parser -- valid/useful anywhere?

Apologies that this isn’t directly about graphviz, but rather one of the third-party wrapper libraries — specifically, the Python pydot package.

Pydot includes a limited parser for graphviz syntax, as it supports parsing from .dot/.gv files into API graph definitions. It doesn’t parse the bits of the code that it doesn’t directly interpret (HTML-like strings, for instance, are just passed through untouched), but it does have the ability to interpret basic graph, node, edge, and subgraph syntax.

I’ve discovered a few odd rules in the parsing for port components of node IDs and edge endpoints. There are definitions for syntax that, as far as I can tell, aren’t valid graphviz statements at all. While pydot appears to support them, at least in its parser, actually writing code in that form just results in syntax errors from dot, neato, and any other engine I’ve tried to feed it to.

These rules are so old they date back to the very first commit in the pydot repo, so there’s no hope of tracking down any sort of explanation for them via commit history.

So, I’m wondering if anyone recognizes any of this syntax, knows if there’s anywhere it might possibly be valid, remembers if it was ever valid, or… well, really, can offer any other insights.

If not, and since the code is at odds with both the documented grammar and the observed behavior of graphviz, I’m probably going to rip it out of the parser on the grounds that accepting it would only be creating invalid graph definitions that would break when they were fed back into the graphviz tools.

Pydot’s dot_parser.py currently includes the following parser rules, all as components of the node_id rule, which can appear either at the start of a node statement or on either side of an edge operation (-- or ->):

  1. port_angle, which is defined as a literal @ sign followed by a valid identifier.
  2. port_location, which is defined as either:
    1. a possibly-repeating series of a literal : followed by an ID (correct, although too permissive)
      OR
    2. a literal :, followed by a literal (, then an ID, a literal ,, another ID, and a literal ). So, :(ID,ID)
  3. Those are assembled into a port definition that’s either:
    1. A port_location (:ID:ID or :(ID,ID)), followed by an optional port_angle (@ID)
      OR
    2. A port_angle followed by an optional port_location

Except for 2.1, I can’t find any sign that any of those are permitted by the graphviz grammar or existing parsers. If they are, it appears to be totally undocumented.
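To make the rules listed above concrete, here is a rough sketch of the same grammar as regular expressions. This is not pydot's actual code (dot_parser.py is built on pyparsing); the names `ID`, `PORT_ANGLE`, `PORT_LOCATION`, `PORT`, and `split_node_id` are made up for illustration, and the identifier pattern is an assumption that covers only plain alphanumeric IDs, not quoted, numeric, or HTML-like ones.

```python
import re

# Assumption: only the plain alphanumeric form of a DOT ID is handled here.
ID = r"[A-Za-z_][A-Za-z_0-9]*"

# port_angle: a literal '@' followed by an ID
PORT_ANGLE = rf"@{ID}"

# port_location: one or more ':' ID groups, or the ':(ID,ID)' form
PORT_LOCATION = rf"(?:(?::{ID})+|:\({ID},{ID}\))"

# port: port_location [port_angle] | port_angle [port_location]
PORT = rf"(?:{PORT_LOCATION}(?:{PORT_ANGLE})?|{PORT_ANGLE}(?:{PORT_LOCATION})?)"

# node_id: an ID with an optional port attached
NODE_ID = re.compile(rf"(?P<name>{ID})(?P<port>{PORT})?")


def split_node_id(text):
    """Return (name, port) if text matches the sketched rules, else None."""
    match = NODE_ID.fullmatch(text)
    if match is None:
        return None
    return match.group("name"), match.group("port")
```

Under these rules, strings such as `a:p:n`, `a:(x,y)`, `a@ang`, and `a@ang:p` are all accepted as node IDs, even though only the first form appears to be something the graphviz tools will take.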

What gives? Is it syntax for some esoteric graphviz tool? Some other “graphviz-like” software that extended the syntax? Legacy syntax that used to be supported in graphviz, but was dropped so long ago there isn’t even any record of it in the gitlab repo’s history? Pydot’s parser just straight-up inventing its own, incompatible syntax?

…Any thoughts/guesses/wisdom?

I consider anything Graphviz-adjacent fine for this forum, including even discussion of theoretical non-Graphviz algorithms that may be of interest to Graphviz devs/users.

We’ve had several discussions in the past with third parties implementing their own DOT parsers. I don’t recall if this includes the pydot authors, but possibly. As people who’ve attempted this have discovered, the documented DOT grammar is not exactly what is parsed by Graphviz. My personal opinion is that the only way to implement a bug-for-bug compatible parser is to look at Graphviz’ scan.l and copy its logic exactly. I have never seen a third-party parser that actually does this, though I have not done an exhaustive survey.

Surprised me!
page 34 of https://graphviz.org/pdf/dotguide.pdf

port → port-location [port-angle] | port-angle [port-location]
port-location → ':' id | ':' '(' id ',' id ')'
port-angle → '@' id

My 2c: I feel like documenting the grammar/syntax of graphviz is probably a long-term strategic problem we have: for graphviz to be successful long-term, we need interop with other programs; but we haven’t started on solving it, and probably don’t have the time.

We meant these grammar hooks for features that for the most part we never got around to developing. Sorry.

Whoa, freaky!

None of that’s actually supported in the current parser, though, right?

Because, black-box testing from the outside, I definitely could not make dot recognize anything like that. It reports a syntax error at either the @ or the (.
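For anyone who wants to repeat that kind of black-box check, something like the following sketch should work. The helper names `make_probe` and `dot_accepts` are hypothetical, and it assumes that a `dot` executable is on the PATH and that it exits with a nonzero status when it hits a syntax error. If dot isn't installed, the check reports None rather than guessing.

```python
import shutil
import subprocess


def make_probe(port_suffix):
    """Build a tiny DOT graph whose edge tail carries the given port suffix."""
    return f"digraph G {{ a{port_suffix} -> b; }}\n"


def dot_accepts(source):
    """Return True/False depending on dot's exit status, or None if dot is missing."""
    exe = shutil.which("dot")
    if exe is None:
        return None  # no graphviz install found, so nothing to test against
    result = subprocess.run(
        [exe, "-Tcanon"],  # canonical DOT output; no rendering plugins needed
        input=source,
        capture_output=True,
        text=True,
        timeout=30,
    )
    # Assumption: a syntax error makes dot exit with a nonzero status.
    return result.returncode == 0


if __name__ == "__main__":
    for suffix in ["", ":p", ":p:n", ":(x,y)", "@ang", ":p@ang", "@ang:p"]:
        print(repr(suffix), dot_accepts(make_probe(suffix)))
```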

I suppose there are a few different ways to go about that, really. Documenting things like the formal grammar for the language is one approach. I encountered a project recently that took a different tack entirely (wish I could remember which) and simply mandated that any software supporting their features must use the official parser. Not “had to support the same syntax”, not “had to produce identical results”… no, it was “You. must. use. the parser distributed by the upstream project itself. Directly.”

…Oh! I’m 90% sure it was AppStream. Apparently anyone who wants to support AppStream metadata has two choices:

  1. Link with libappstream.so.5 and call its functions / use one of the available sets of bindings.
  2. Not support AppStream metadata.

So, apparently that’s an option, too. (#WhoKnew?)

But in the graphviz case that’d require turning the parser and the data structures it outputs into stable, public API… and I suspect it’d require making the thing thread-safe, too (in exactly the way it is currently very not), if people are going to be expected to call into it from, e.g., Rust or JavaScript code.

…You know, now that I say it, that alone might be enough of an argument in favor of this approach, if it forces the parser to be scrubbed of all its global variables and fragile state-keeping as a side effect! [1]

Because, if pydot had easy access to a libgvparser.so with some Python bindings wrapped around it, I’d happily git rm dot_parser.py in a hot microsecond. [2]

Notes

  1. (That project justification just turned into the climate argument: “What if it turns out we fixed the environment and made the world a cleaner, healthier, better place for humanity to thrive… ‘for no reason’?” Oh, no. The. Horror.)
  2. Don’t worry, though. I’m not going to be holding my breath or anything. It’s just nice to dream, sometimes…

I don’t see how this is feasible for Graphviz. There is no licensing or gating standardisation body. Anyone can write a parser that claims to parse DOT. Regardless of whether it is actually conformant, if their tool is useful people are going to start using it.

There’s work ongoing on that front: see “Plan to Remove global library state” (#2558) in the graphviz GitLab issue tracker.

Oh, it’s not. It’s not feasible for AppStream, either. (Perhaps I was insufficiently sarcastic. …Not a problem I usually have!) It’s a bonkers stance for any open-source effort to take.

Heck, the only way I even know about the… policy? preference? …is because of the quixotic game of whack-a-mole their devs end up playing with other projects that write their own parsing.

(Those other parsers invariably get some detail wrong, or fail to follow the latest changes in the ever-evolving spec. Which gives upstream leverage to report the very existence of the custom parser as a bug.)

But, that being said, the fact that they have a reference parser available, in shared-library form with a C interface that can be wired into basically any language with fairly minimal work, DOES give them a much stronger position from which to say, “Don’t write a parser. Just use ours, dummies.”

Even if they can never enforce that, a strong enough argument will often win out. (And, “you can maintain way less code” is pretty compelling to most OSS projects.) Or, better yet, the even more effective approach: Just file a PR in the downstream repo that does all the work of replacing the custom parser with the upstream’s.

They still have absolutely no hope of eliminating all of the “rogue” parsers that exist, or will ever come to be. But when it’s done in that manner, so far it seems like it’s been a lot more effective than I would’ve expected.

To be clear, I was not objecting to exposing Graphviz’ parser in the public API. This would be almost trivial to do, as the only real barrier is that its prototypes are in cghdr.h rather than cgraph.h. But I suspect third-party tools might want something that gives them a raw AST rather than fully processed Graphviz data structures.