I'm trying to use gvpr. Is that a mistake?

tl;dr - Is there some tool I should be using instead of gvpr to do what gvpr does?

I am trying to use gvpr to take a very large graph and reduce it to something narrowly focused on a few nodes, but I’m finding gvpr very frustrating.

The problem:

I’m trying to debug a problem with

This part of closure-compiler creates a very large directed graph of all the property names in use in a JavaScript program. The goal is to find property names that never appear on the same object, so they can safely be renamed by the compiler to use the same name. A bug is causing this graph to be wrong such that 2 properties that do appear on the same object get renamed to the same name, so one overwrites the other.

I can modify the code just a bit to make it spit out the whole graph as a dot file.
The graph is far too large to reasonably analyze manually, so I want to use gvpr
to take this graph, find the nodes that refer to the problem property names, explore edges
in both directions to find all nodes that are on paths that lead to or away from them,
and output just those nodes and their connecting edges.

I struggled a lot to figure out exactly how $tvroot and $tvnext work to control traversals.
The man page is great at giving details about individual methods, but
falls down on giving me a clear model of how each traversal actually runs.

  1. What happens if you change $tvtype somewhere other than in a BEG_G clause?
  2. If I set $tvnext or $tvroot to a node that gets visited before gvpr decides it needs a new starting point, will it just pick an arbitrary, unvisited node to start from instead?
  3. When is the appropriate time to use $tvnext instead of just setting $tvroot to the next root you want?

My intention was to

  1. Do a flat traversal to grab the nodes of interest and stick them in a subgraph.
  2. Do a forward traversal for each of those nodes to find all nodes reachable starting from them - another subgraph.
  3. Do a backward traversal for each of the nodes of interest to find all nodes that can reach them - another subgraph.
  4. Clone all those subgraphs into $T & call the method that adds all the edges that connect them from the original graph.

But, I can’t see a really clean way to do this.
I had something sort of working that involved me having to double-check whether the next node of interest has already been visited, or discover when I visit it without actually switching to its own traversal, and then update $tvroot or $tvnext to the next node on my list.

Surely there’s some standard coding pattern to say “traverse DFS forward starting from each node in this list / subgraph”?

Still, I was sort of getting there, when gvpr started doing some just plain buggy things.

  1. I’m setting one of the nodes of interest to be colored red $.color = colorString;, but it’s actually getting color="" in the output dot file.

I’ve even put stderr prints in the script to verify that I really set it to “red” just before printing out.
In the script the same line also sets other nodes to be “green”, and they look just fine.

  1. The label strings in the output dot file have started including characters that aren’t there in the input graph. Garbage characters.

Is gvpr just not capable of handling my really long strings?
This started happening when I changed my strategy a bit to use subgraphs and clone() at the end instead of trying
to just add all my nodes directly to $T as I found them.
Maybe the clone() is the problem?

So, as my tl;dr says above, am I just using the wrong tool here?
Is there some newer, better, and better-supported way to do what I’m trying to do?

Thanks,
Bradford

@steveroush is the gvpr master and can probably tell you some tips and tricks.

wrt string stuff… the design of gvpr has some fairly questionnable memory safety properties. E.g. the language allows nested expressions and something like a call stack but it does not seem to have any internal mechanisms for correctly representing this. Strings are a particular problem area, with a lot of code patterns I would call unsafe at any speed. Not trying to throw anyone under the bus, but IMHO there are numerous known and unknown bugs in its implementation. Having said that, some manage to be extremely productive with it.

If your graph is massive and your end goal is not explicitly visualization, you may be better off trying something that ties into a more general purpose scripting language. I’m thinking NetworkX, though I have no specific experience with it.

[note that you did not provide the input file or the gvpr program, so we’re guessing here.
How large is your graph? The gc (gc | Graphviz) program will give us sizes.]

find the nodes that refer to the problem property names, 
explore edges in both directions to find all nodes that are 
on paths that lead to or away from them, and output 
just those nodes and their connecting edges.

Maybe this is all you need:
ccomps (https://graphviz.org/pdf/ccomps.1.pdf) decomposes graphs into their connected components, printing the components to standard output. i.e. separates the input into independent graphs

If that is not sufficient, maybe the following gvpr program will help, though it only colors “downstream”, not “upstream”. [On a quick read of the code, I think it would be easy to modify it toward your needs by changing fstout & nxtout to fstedge & nxtedge. Handling multiple nodes requires more thought + caffeine]

/*
  given the name of a node (via the command line -a argument)
    color that node yellow
    color all descendents and connecting edges green
    color all other nodes grey
*/
BEGIN{
  node_t   aNode,  Start;
  graph_t  aGraph, Root;
  int      seenE[], seenN[];
  int      Ecnt=0;
  string   start;

  /////////////////////////////////////////////////////////////////////
  // the anEdge argument is just for bookkeeping
  //   each call creates a new instance
  //   so the nxtedge call does not over-write
  void nodeTraverse(node_t thisNode,   edge_t   anEdge){
    print("// NODE: ", thisNode.name, "  seen: ", seenN[thisNode]);
    if (seenN[thisNode]!=1){
      seenN[thisNode]=1;
      thisNode.fillcolor="green";
      thisNode.style="filled";      
      for (anEdge = fstout(thisNode); anEdge; anEdge = nxtout(anEdge)){
        print("// edge: ", anEdge.name, "  ", anEdge.tail, "  ", anEdge.head, "  seen: ", seenE[anEdge]);
        if (seenE[anEdge]==0){
	  anEdge.color="green";
          seenE[anEdge]=1;
          print("// recurse: ", anEdge.head);
          nodeTraverse(anEdge.head, anEdge);
        }
      }
    }
    print ("//  DONE: ", thisNode.name);
  }  // end of nodeTraverse
  //////////////////////////////////////////////
}
BEG_G{
  Root=$G;
  start=ARGV[0];
  print ("//   start: ", start);
  Start=isNode($G, start);
  if (Start==NULL){
    printf(2, "Error: unknown node >%s<\n", start);
    Ecnt++;
  }
  if (Ecnt>0) exit(9);
}
//
// grey out all nodes & edges
// we will color the ones we want later
//
N{
  $.style="filled";
  $.fillcolor="lightgrey";
}
E{
  $.color="lightgrey";
}
// now find all the children & color them & their edges
END_G{
  nodeTraverse(Start);  // no edge parameter on this call
  print("//  Start: ", Start.name,Start.color);
  Start.fillcolor="yellow";
}

FYI:

  • I can’t immediately answer the tvroot & tvnext questiuons, I seldom use them
  • If I remember correctly, there is an (avoidable) bug in $T (whoops, I haven’t filed a bug report), but I don’t remember exactly what it is this early AM. Let me know if you need more info.
  • Rather that creating new graphs & cloning nodes & edges, I’d just keep track of all the uninteresting nodes and then delete them at the very end. If objects are deleted during a traversal, bad things sometimes happen.
  • $.color=colorString - I think that is your bug, stderr not withstanding; feel free to provide your source & prove me wrong, everyone else does!
  • gvpr is a different language - some bugs for sure, often very powerful, sometimes just exasperating.

Thanks so much for the advice!

I think the most important piece of advice I gleaned from it was:

Write your own traversal methods rather than trying to use $tvroot, $tvtype, and friends.

Here’s the script I ended up writing that appears to work.

It includes notes pointing out a really egregious gvpr bug I discovered along the way.

#!/usr/bin/gvpr -f

BEGIN {
// Declare the output graph so we can refer to it in methods here.
// We will create it as a subgraph of $G in the first BEG_G clause.
// $G isn't available in the BEGIN clause.
graph_t outputGraph;

/**
* Associative array from a display color to a property name.
 * key = color name - color to paint the nodes in the output graph
   * value = substring to match on a node's label
   */
  string propToColor[string];
  propToColor["interesting_property_name_1"] = "green";
  propToColor["interesting_property_name_2"] = "red";

  /**
   * Map from a Node Of Interest to the color name to use for that node.
   *
   * We will fill this array in the first traversal.
   */
  string noiToColor[node_t];

  /**
   * If the given node is a node of interest, record it in noiToColor.
   */
  void updateNoi2Color(node_t argNode,    string propName) {
    for (propToColor[propName]) {
      // NOTE: We're assuming property names are unique enough that simply
      // finding them in the label guarantees that we're actually matching
      // the part of the label that records properties and not some other bit.
      if (index(argNode.label, propName) >= 0) {
        noiToColor[argNode] = propToColor[propName];
      }
    }
  }

  /**
   * Return the color string we want the node to have in the output graph.
   */
  string getColorForNode(node_t argNode) {
    if (argNode in noiToColor) {
      return noiToColor[argNode];
    }
    return ""; // use default color
  }

  /**
   * Add the anode to outputGraph and set its color correctly,
   * unless it's already been added.
   */
  void addNodeToOutput(node_t argNode) {
    if (!isIn(outputGraph, argNode)) {
      aset(argNode, "color", getColorForNode(argNode));
      subnode(outputGraph, argNode);
    }
  }

  /**
   * isVisitedFwd[node] => 1 if addFwdPaths has already visited it.
   */
  int isVisitedFwd[node_t];

  /**
   * Add argNode and all nodes and edges found by following its reverse edges
   * to outputGraph.
   */
  void addFwdPaths(node_t argNode,  edge_t fwdEdge) {
    if ( ! isVisitedFwd[argNode] ) {
      isVisitedFwd[argNode] = 1;
      addNodeToOutput(argNode);
      for (fwdEdge = fstout(argNode); fwdEdge; fwdEdge = nxtout(fwdEdge)) {
        // NOTE: GVPR BUG! I must explicitly pass a value for the parameter
        // that is only supposed to be a local variable. If I don't, then
        // the recursive call shares and modifies the variable here!
        addFwdPaths(opp(fwdEdge, argNode), NULL);
        // Add the edge after adding the node.
        // We don't want the edge addition to also automatically add
        // the opposite node, avoiding our addNodeToOutput() method.
        subedge(outputGraph, fwdEdge);
      }
    }
  }

  /**
   * isVisitedRev[node] => 1 if addRevPaths has already visited it.
   */
  int isVisitedRev[node_t];

  /**
   * Add argNode and all nodes and edges found by following its reverse edges
   * to outputGraph.
   */
  void addRevPaths(node_t argNode,  edge_t revEdge) {
    if ( ! isVisitedRev[argNode] ) {
      isVisitedRev[argNode] = 1;
      addNodeToOutput(argNode);
      for (revEdge = fstin(argNode); revEdge != NULL ; revEdge = nxtin(revEdge)) {
        // NOTE: GVPR BUG! I must explicitly pass a value for the parameter
        // that is only supposed to be a local variable. If I don't, then
        // the recursive call shares and modifies the variable here!
        addRevPaths(opp(revEdge, argNode), NULL);
        // Add the edge after adding the node.
        // We don't want the edge addition to also automatically add
        // the opposite node, avoiding our addNodeToOutput() method.
        subedge(outputGraph, revEdge);
      }
    }
  }

  void addNoiPaths(node_t noi) {
    for (noiToColor[noi]) {
      addFwdPaths(noi);
      addRevPaths(noi);
    }
  }
}

/**
 * The first traversal of the graph will just look for the nodes that
 * represent color-types that have the interesting properties.
 *
 * NOTE: Each `BEG_G` clause starts a full graph traversal. gvpr resets
 * all nodes to an "unvisited" state. It will then try to visit all nodes
 * and edges subject to any limitations set up in the `BEG_G clause`.
 */
BEG_G {
  outputGraph = subg($G, "outputGraph");
  setDflt(outputGraph, "N", "style", "filled");
  setDflt(outputGraph, "N", "color", "lightgrey");
}

N {
  updateNoi2Color($);
}

END_G {
  addNoiPaths();
  $O = outputGraph;
}

Note that I don’t actually want connected components here. that gets ~all of the original graph.
I just want paths that can end with or start from my nodes of interest.

For those who were curious, the specific graph I’m dealing with has 5075 nodes.
I guess that’s not really huge, but it’s much bigger than I want to read or try to visualize.
The result of running my script is a graph with just 20 nodes.

The gvpr tool definitely fulfills a need that I occasionally have,
but it seems really lacking in many respects.
What do people who have to work with large GraphViz graphs all the time do?
Do most folks just find an appropriate library for Java or C++ or something?

Thanks again,
Bradford

Somehow my email response was cut off just a short way through my script, so I’ve fixed that on the forum website.

We could retire gvpr due to its insufficiencies. Maybe someone would implement something in the same spirit as gvpr using python.

In the early graphviz days we thought that generic graph programming systems and databases could eventually become important, but did not have the resources or focus to do that ourselves. So I think a lot of people probably do their work in a system that provides a broader set of features (like Neo4J, or NetworkX) and occasionally render graphs in Graphviz at the end of an analysis if they need it at all.