
Should stack graphs just be Forth programs?

2022-07-31

Two anecdotes from chatting with Jason about writing stack graph rules for Rust:

  1. I described partial paths as the unit of data that we actually store in the data store, and that we stitch together at query time to satisfy Code Nav queries. His immediate response was that THAT was what he was thinking about when constructing gadgets for different snippets of Rust code, and he wished that he could author those directly instead of having to figure out which stack graph nodes and edges to create to bring about a particular partial path. (There's a sketch of what that data looks like after this list.)
  2. We were discussing similarities with Forth and other concatenative languages, and he pointed out that there are several stack manipulation primitives that we don't have. Things like dup, drop, rollX, etc.
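
To make (1) concrete, here's a minimal sketch, in Rust, of the shape of data we're talking about. The type and field names are invented for illustration rather than being the actual stack-graphs API, and real partial paths also have scope stack pre- and postconditions, which are omitted here.

```rust
// Illustrative only: these are not the real stack-graphs types, just the rough
// shape of the data. A partial path says: starting at `start` with a symbol
// stack matching `precondition`, you can reach `end`, leaving the symbol stack
// matching `postcondition`.

type NodeID = u32;
type Symbol = String;

#[derive(Debug)]
struct PartialPath {
    start: NodeID,
    end: NodeID,
    /// Symbols that must be on top of the symbol stack to take this path.
    precondition: Vec<Symbol>,
    /// Symbols left on top of the symbol stack afterwards.
    postcondition: Vec<Symbol>,
}

fn main() {
    // Hypothetical example: a path that resolves a reference to `foo` by
    // consuming the symbol `foo` from the stack.
    let resolve_foo = PartialPath {
        start: 1,
        end: 2,
        precondition: vec!["foo".to_string()],
        postcondition: vec![],
    };
    println!("{resolve_foo:?}");
}
```

Authoring a partial path directly would mean writing down something of this shape, instead of arranging nodes and edges so that the path-finding algorithm discovers it.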

On the one hand, we could possibly “fix” both of these in the small. For (1), add a way to construct partial paths directly, or maybe add a new node type that lets you provide an arbitrary pre- and postcondition. For (2), add those as new node types that you can construct, or implement them via the “arbitrary pre- and postcondition” node type.

But maybe...we should just be writing Forth programs? Is there a way to apply something analogous to the path finding and stitching algorithms to Forth programs? I can see the stack pre- and postcondition translating over pretty well, but stack graphs use the graph structure to encode quite a lot of the underlying semantics. So I'm not sure. But it's an interesting idea!

2022-08-10

Came across a good overview of concatenative programming languages that really reinforces this possibility. In particular, it calls out how functions in a concatenative language are typed in a way that strongly resembles a partial path's symbol stack pre- and post-conditions.

Concatenative programming languages

Digging further, the Factor language [Pestov2010] includes a “stack effect” notation, which describes how each word in the language affects the stack. This is also very much like partial path pre- and post-conditions, including a notion of “row variables” that lines up with our symbol stack variables, and the ability for elements of a stack effect to be nested stack effects, modeling how values on the stack can be quotations of code that should be executed later.
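
For a flavor of the parallel, here's a small sketch. The Factor notation in the comments is quoted from memory, so treat the exact syntax as approximate, and the Rust type is an invented illustration rather than the real stack-graphs representation of pre- and post-conditions.

```rust
// Factor's stack effect notation, quoted from memory (treat the exact syntax
// as approximate):
//
//   dup   ( x -- x x )
//   call  ( ..a quot: ( ..a -- ..b ) -- ..b )   where `..a`, `..b` are row variables
//
// A pre- or post-condition in a partial path has the same shape: a list of
// concrete symbols, optionally capped by a variable that stands for "whatever
// else is on the stack".

#[derive(Debug)]
struct RowVariable(u32);

#[derive(Debug)]
struct StackCondition {
    /// Concrete symbols on top of the stack.
    symbols: Vec<String>,
    /// Optional variable matching the rest of the stack, like Factor's `..a`.
    rest: Option<RowVariable>,
}

fn main() {
    // "The symbol `foo`, followed by whatever else was already on the stack."
    let pre = StackCondition {
        symbols: vec!["foo".to_string()],
        rest: Some(RowVariable(1)),
    };
    println!("{pre:?}");
}
```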

Pestov2010

Concatenative languages have multiple stacks; the minimum seems to be two, a control stack and a data stack. I think that in the stack graphs world, the symbol stack corresponds to the data stack, and the scope stack corresponds to the control stack. In a concatenative language, a continuation is “just a snapshot of the current stacks”:

Continuations in concatenative languages

2022-08-11

After talking with Beka, I think that “just write Forth” is the wrong goal to aim for, at least at first, because it conflates too many things. What if we were still using TSG to construct some kind of graph structure, but the underlying formalism that the paths represent is more like the execution formalism of a concatenative language like Forth or Factor?

As mentioned above, in the simplest execution model the data stack corresponds to our symbol stack, and the control stack corresponds to our scope stack. If we were to make this kind of change, we'd need to think about what kinds of values can appear on these stacks.

The scope stack would actually stay mostly the same. The control stack in a concatenative language typically holds some kind of “instruction pointer”. For us, nodes still identify states, or points in time, in the execution of the “program”, so node identifiers are still a good equivalent of an “instruction pointer”.
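
As an illustrative sketch (again, invented names rather than the real implementation), the execution state under this model is just two stacks, with node identifiers standing in for instruction pointers on the control side:

```rust
// An illustrative sketch of the execution state under this model. The control
// stack holds node identifiers, which serve as our "instruction pointers";
// what the data stack holds is the question tackled below, so it's left
// generic here.

type NodeID = u32;

#[derive(Debug, Default)]
struct State<Value> {
    /// Data stack: corresponds to our symbol stack.
    data: Vec<Value>,
    /// Control stack: corresponds to our scope stack.
    control: Vec<NodeID>,
}

fn main() {
    // Today the data stack effectively holds only symbols.
    let mut state: State<String> = State::default();
    state.data.push("foo".to_string());
    state.control.push(42);
    println!("{state:?}");
}
```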

But what about the symbol stack? A concatenative data stack contains the values that your program can operate on. We'd need several kinds of values, some of which already exist or have analogues (there's a sketch of these as a Rust type after the list):

  • symbols (actual data; right now we can't do anything with them except compare them for equality)
  • “quotations” are how you pass around code to be executed later. Note that they are not closures, and do not carry a reference to an environment! I think that means that they are just code pointers, aka instruction pointers, aka node references just like we'd store on the control stack.
  • “continuations”: in a concatenative language, a continuation is just a snapshot of the stacks. So we'd do exactly that: a continuation value would be a tuple of a symbol/data stack and a scope/control stack. Nothing more.
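
Put together as an illustrative Rust type (the names are mine, not anything from the stack-graphs crate), the value kinds from the list above might look like this:

```rust
// A sketch of the kinds of values the symbol/data stack could hold under this
// model. Illustrative only; these are not the real stack-graphs types.

type NodeID = u32;

#[derive(Clone, Debug)]
enum Value {
    /// Plain symbols: opaque data that we can only compare for equality.
    Symbol(String),
    /// Quotations: bare code pointers, i.e. node references. No environment
    /// is captured, so these are not closures.
    Quotation(NodeID),
    /// Continuations: nothing more than a snapshot of both stacks.
    Continuation {
        data: Vec<Value>,     // symbol/data stack
        control: Vec<NodeID>, // scope/control stack
    },
}

fn main() {
    let quot = Value::Quotation(42);
    let k = Value::Continuation {
        data: vec![quot.clone(), Value::Symbol("foo".to_string())],
        control: vec![7],
    };
    println!("{k:?}");
}
```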

With this change, we still need to think about what other “node types” we'd need to model the various stack manipulation primitives that Forth/Factor/etc. provide (dup, roll, drop, etc.). But I think this model is enough to let us implement callcc.
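
As a rough sketch of how that could hang together, here are dup, drop, and a callcc built out of nothing but those value kinds, with the types repeated so the example stands alone. None of this is real stack-graphs code; in the real system each of these would presumably be a node type whose partial paths carry the corresponding pre- and post-conditions.

```rust
// Types repeated from the sketches above so this example stands alone.
type NodeID = u32;

#[derive(Clone, Debug)]
enum Value {
    Symbol(String),
    Quotation(NodeID),
    Continuation { data: Vec<Value>, control: Vec<NodeID> },
}

#[derive(Clone, Debug, Default)]
struct State {
    data: Vec<Value>,     // symbol/data stack
    control: Vec<NodeID>, // scope/control stack
}

impl State {
    /// dup: duplicate the top of the data stack.
    fn dup(&mut self) {
        let top = self.data.last().expect("dup on an empty stack").clone();
        self.data.push(top);
    }

    /// drop: discard the top of the data stack.
    fn drop_top(&mut self) {
        self.data.pop().expect("drop on an empty stack");
    }

    /// callcc: pop a quotation, snapshot the current stacks as a continuation,
    /// push that continuation, and jump to the quotation. `return_to` is the
    /// node to resume from if the quotation finishes normally.
    fn callcc(&mut self, return_to: NodeID) -> NodeID {
        let quot = match self.data.pop() {
            Some(Value::Quotation(node)) => node,
            other => panic!("callcc expects a quotation on top, got {other:?}"),
        };
        let snapshot = Value::Continuation {
            data: self.data.clone(),
            control: self.control.clone(),
        };
        self.data.push(snapshot);
        self.control.push(return_to);
        quot // the next "instruction pointer"
    }

    /// Invoking a continuation just restores the snapshot.
    fn invoke_continuation(&mut self) {
        match self.data.pop() {
            Some(Value::Continuation { data, control }) => {
                self.data = data;
                self.control = control;
            }
            other => panic!("expected a continuation on top, got {other:?}"),
        }
    }
}

fn main() {
    let mut state = State::default();
    state.data.push(Value::Symbol("foo".to_string()));
    state.dup();
    state.drop_top();
    state.data.push(Value::Quotation(42));
    let next = state.callcc(7);
    println!("jumped to node {next}; state: {state:?}");
    state.invoke_continuation();
    println!("after invoking the continuation: {state:?}");
}
```

The notable thing is that callcc needs nothing beyond cloning the two stacks: the captured continuation really is just a snapshot, and invoking it just swaps the snapshot back in.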

This also helps crystallize what our partial paths are. They're not types in the way that Factor's stack effect annotations are. Instead, they're a concise summary of a set of execution traces. That is, there is a programming language down there (and these changes would make it properly Turing-complete), but programs in this language aren't written as a set of instructions; they're defined by the set of execution traces (the partial paths) that we want the program to produce. (And in particular, if we were to write the programs out in a more traditional instruction-based way, there'd be a lot of use of amb.)

Still-open questions:

  • Do all of the existing node types work with this new formalism? (i.e., we really really need to make sure our existing stack graph rules and data files work as-is. This would only be a way for us to add node types that new stack graph rules could choose to use.)
  • Can we still create a good representation for pre- and post-conditions that works with the new kinds of values, and that still supports unification and concatenation, so that our path-stitching algorithm still works quickly?