I have a “pre-requisite” graph that I use in my class to show my students “We need to learn [this] in order to learn [that].” I will paste the full graph at the bottom. I could make a more minimal example if desired, but I am pasting the complete graph in case its complexity matters for my question below.
I would like to (programmatically) “highlight” a particular path in my graph. Specifically, I would like to pass the name of a node (e.g., “variance”) to a script, and the script would color that node yellow and color all of its descendant nodes (its children, their children, and so on) and the edges leading to them green. All of the other nodes and edges would be greyed out. In the particular example where I input “variance”, I would want the node “variance” colored yellow and the following nodes (as well as the relevant edges) colored green: “standardization”, “confidence interval”, “inference”, “causal inference”, “standard error”, “test statistic”, “hypothesis testing”, “p-value”, “efficiency”. All other nodes would be greyed out.
I’d like to keep the position of everything the same as the original graph, in order to get the “highlight” effect, when switching slides (I’m planning to incorporate these into a beamer presentation).
I will do this often in my class, for each topic I introduce, so I would like to be able to change “variance” to the name of a different node and get a similar result. That’s why I want a scripted approach where I input the name of the node.
digraph econometrics {
label = "Prereq chart: it is necessary (solid lines) or helpful (dashed lines) to understand source nodes in order to understand target nodes.";
labelloc = "top";
subgraph prob {
label = "Probability"
node [color = red]
"experiment"
"random variable" -> "random interval"
"random variable" -> { "probability distribution", "i.i.d." }
"i.i.d."
"normal distribution"
"expected value"
"conditional expected value"
"variance"
"probability distribution" -> "i.i.d."
"probability distribution" -> "expected value"
"variance" -> "standardization"
"Law of Large Numbers"
"Central Limit Theorem"
}
subgraph stats {
label = "Statistics"
node [color = teal]
// used to have this in probability, but I like this here.
"sample"
"parameter"
"random interval" -> "confidence interval"
"random variable" -> "estimator"
"estimator"
"population"
"inference"
"causal effect"
"causal effect estimator"
"causal effect assumption"
"causal inference"
"regression function"
"OLS"
"error term"
"estimator" -> { "unbiasedness", "efficiency" }
"p-value"
"test statistic"
"hypothesis testing"
"story telling"
"counterfactual outcome"
"standard error"
}
"variance" -> "efficiency"
"expected value" -> "unbiasedness"
"estimator" -> "standard error"
"standard error" -> "confidence interval"
"standard error" -> "test statistic"
"expected value" -> "variance"
"expected value" -> "conditional expected value"
"conditional expected value" -> "regression function" -> "OLS"
"parameter" -> "regression function"
"error term" -> "causal effect assumption"
// TODO: currently I treat "parameter" only in statistical sense....
// this link is key and connects the worlds of probability to
// statistics
// Maybe have separate nodes "parameter of a distribution" and "parameter of a population" (?).
"population" -> "parameter"
// TODO: this is for catching uncategorized nodes.
// node [color = pink]
"estimator" -> "OLS";
"sample" -> "estimator"
"parameter" -> "estimator"
// TODO: have a slide explaining this edge.
// it is about replacing sigma^2 with \hat{sigma^2}!
"estimator" -> "confidence interval"
"probability distribution" -> "normal distribution"
"normal distribution" -> "confidence interval"
"normal distribution" -> "Central Limit Theorem"
"Central Limit Theorem" -> { "hypothesis testing" , "confidence interval" }
"normal distribution" -> "standardization"
"standardization" -> { "confidence interval", "test statistic" }
"i.i.d." -> "sample"
// not strictly necessary. e.g., time series.
"i.i.d." -> "estimator" [style = dashed]
"experiment" -> "sample"
"experiment" -> "random variable"
"parameter" -> "causal effect"
"OLS" -> "causal effect estimator" [style = dashed]
"story telling" -> "causal effect assumption"
"counterfactual outcome" -> "causal effect assumption"
// causal effect is *defined* in terms of counterfactual outcomes
"counterfactual outcome" -> "causal effect"
"causal effect assumption" -> "causal effect estimator"
"causal effect" -> "causal effect estimator"
// to emphasize that a test statistic *is* a r.v.
"random variable" -> "test statistic"
// a test statistic is usually based on an estimator
// but not always.
"estimator" -> "test statistic" [style = dashed]
"parameter" -> "hypothesis testing"
"test statistic" -> "hypothesis testing"
"test statistic" -> "p-value"
"probability distribution" -> "test statistic"
"confidence interval" -> "inference"
"hypothesis testing" -> "inference"
"estimator" -> "inference"
"Central Limit Theorem"
"i.i.d." -> "Law of Large Numbers" [style = dashed]
"Law of Large Numbers" -> "estimator" [style = dashed]
"inference" -> "causal inference"
"OLS" -> "causal inference" [style = dashed]
"causal effect estimator" -> "causal inference"
"variance" -> "standard error"
}
Thank you for those ideas, smattr! I had been wondering how to do preprocessing like you mentioned in (2). I will find that useful for other uses. I wasn’t sure how useful it would be for this present goal, but I see that you employ (2) as part of (3). Nice idea!
I’m interested in using Python to accomplish my goal, but I don’t think I would want to embed the graph inside of the Python script. However, I’m guessing there’s a way to import the graph and then process it and then export the “highlighted” graph.
Thanks for getting me started on different approaches!
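For the Python route mentioned above, here is a minimal sketch of just the highlighting logic, not a complete tool. It assumes the edges have already been read from the .gv file by some DOT parser (that step is left open here), and the names descendants, highlight_colors and EDGES are hypothetical, introduced only for illustration. EDGES is a hand-copied excerpt of the graph above, not the full graph. Since only colors change, the layout that dot computes should stay the same.

```python
from collections import deque


def descendants(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    children = {}
    for tail, head in edges:
        children.setdefault(tail, []).append(head)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            # Skip nodes already visited; never count the start node itself.
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen


def highlight_colors(edges, start):
    """Assign yellow to the start node, green to its descendants and to
    every edge leaving the start node or a descendant, lightgrey otherwise."""
    reach = descendants(edges, start)
    nodes = {name for edge in edges for name in edge}

    node_color = {}
    for name in nodes:
        if name == start:
            node_color[name] = "yellow"
        elif name in reach:
            node_color[name] = "green"
        else:
            node_color[name] = "lightgrey"

    edge_color = {}
    for tail, head in edges:
        highlighted = tail == start or tail in reach
        edge_color[(tail, head)] = "green" if highlighted else "lightgrey"
    return node_color, edge_color


# Hypothetical excerpt of the prerequisite graph (assumption: a real script
# would obtain these pairs from the .gv file instead of hard-coding them).
EDGES = [
    ("expected value", "variance"),
    ("variance", "standardization"),
    ("variance", "efficiency"),
    ("variance", "standard error"),
    ("standardization", "confidence interval"),
    ("standardization", "test statistic"),
    ("standard error", "confidence interval"),
    ("standard error", "test statistic"),
    ("test statistic", "hypothesis testing"),
    ("test statistic", "p-value"),
    ("confidence interval", "inference"),
    ("hypothesis testing", "inference"),
    ("inference", "causal inference"),
    ("estimator", "inference"),
]

node_color, edge_color = highlight_colors(EDGES, "variance")
for name in sorted(node_color):
    print(name, "->", node_color[name])
```

The resulting colors would still need to be written back as fillcolor and color attributes on the existing nodes and edges. Appending new edge statements to the file would not work, because in a non-strict digraph that adds duplicate edges rather than recoloring the existing ones.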
Here is a gvpr (http://www.graphviz.org/pdf/gvpr.1.pdf) program that takes a standard Graphviz input file and outputs the starting node in yellow, children (and edges) in green and other nodes and edges in lightgrey.
BEGIN{
  node_t aNode, Start;
  graph_t aGraph, Root;
  int seenE[], seenN[];
  int Ecnt=0;
  string start;
  /////////////////////////////////////////////////////////////////////
  // the anEdge argument is just for bookkeeping
  // each call creates a new instance
  // so the nxtedge call does not over-write
  void nodeTraverse(node_t thisNode, edge_t anEdge){
    print("// NODE: ", thisNode.name, " seen: ", seenN[thisNode]);
    if (seenN[thisNode]!=1){
      seenN[thisNode]=1;
      thisNode.fillcolor="green";
      thisNode.style="filled";
      for (anEdge = fstout(thisNode); anEdge; anEdge = nxtout(anEdge)){
        print("// edge: ", anEdge.name, " ", anEdge.tail, " ", anEdge.head, " seen: ", seenE[anEdge]);
        if (seenE[anEdge]==0){
          anEdge.color="green";
          seenE[anEdge]=1;
          print("// recurse: ", anEdge.head);
          nodeTraverse(anEdge.head, anEdge);
        }
      }
    }
    print ("// DONE: ", thisNode.name);
  } // end of nodeTraverse
  //////////////////////////////////////////////
}
BEG_G{
  Root=$G;
  start=ARGV[0];
  print ("// start: ", start);
  Start=isNode($G, start);
  if (Start==NULL){
    printf(2, "Error: unknown node >%s<\n", start);
    Ecnt++;
  }
  if (Ecnt>0) exit(9);
}
//
// grey out all nodes & edges
// we will color the ones we want later
//
N{
  $.style="filled";
  $.fillcolor="lightgrey";
}
E{
  $.color="lightgrey";
}
// now find all the children & color them & their edges
END_G{
  nodeTraverse(Start); // no edge parameter on this call
  // Start=isNode($G, start);
  print("// Start: ", Start.name,Start.color);
  Start.fillcolor="yellow";
}
The (Linux) command line is something like this. Note the -a estimator piece of the command, which specifies the starting node.
This is great! It worked just beautifully. I look forward to studying and learning from your code.
Thanks a lot, steveroush!
A note to anyone else who comes across this: to specify a starting node that has a space in it, use double-quotes and escape the space. e.g., for expected value, I put -a "expected\ value".
The examples in the cmd/gvpr/lib directory would be a good starting place. I refer to them frequently.
I probably have some of my own gvpr programs that are small enough & useful (?) enough to contribute.
I have mixed thoughts about gvpr in general:
+ delivered with the Graphviz package, so available on every OS. Makes it easy to share a program
+ pretty well documented
- no local variables. Makes it hard to write larger programs, makes it almost impossible to share chunks of code between programs.
- usually not useful to users who use Graphviz indirectly (via another language or package)
Are we certain there are no local variables? I tried to read lib/expr/exparse.y and found some code that appears to open local dictionaries for procedure stack frames, but the code is uncommented and hard to understand.
Ahh, I stand corrected! The gvpr documentation says:
There is a single global scope, except for formal function parameters, and even these can interfere with the type system. Also, the extent of all variables is the entire life of the program. It might be preferable for scope to reflect the natural nesting of the clauses, or for the program to at least reset locally declared variables. For now, it is advisable to use distinct names for all variables.
However, an experiment shows that individual subroutines can / do have local variables, but they can’t share names with global variables!
This gvpr program (note the clever reuse of the variables I & J):
BEGIN{
  //string I, J; // <<< can't have globals & locals w/ same name!!
  void sub1(){
    int I,J;
    for (I=1;I<=5;I++) print("// I: ", I);
  }
  void sub2(){
    float I,J;
    for (I=2.2;I<=5;I++) print("// I: ", I);
  }
  void sub3(){
    string I,J;
    I="yes";
    J="no";
    print("// ", I, " ",J);
  }
  void sub4(node_t I){
    print("// Node name: ", I.name);
  }
}
BEG_G{
  sub1();
  sub2();
  sub3();
}
N{
  sub4($);
}
Thanks for pointing me to that. I don’t think I would use that for this task, since I don’t want interactivity. But I will follow that thread since it seems useful for other things. Thanks