WIP: turn macro output into untyped AST #508

zerbina · 2023-01-02T00:24:56Z

This PR implements a transformation pass that turns untrusted AST as output from a macro into untyped AST. In this context, "untyped" currently means that the resulting AST:

has no type information
has no node flags
apart from nkSym, nkOpenSymChoice, and nkClosedSymChoice contains only nodes that are output by the parser
only contains syntax that is valid according to the grammar

The translation tries to approximate constructs present in typed AST with their untyped AST counterpart where possible. For example a Conv (Type "int") (Sym "x") is turned into Call (Sym "int") (Sym "x").

For each input symbol node, it is validated that the symbol is in scope (reachable from the current scope). If it is, the symbol is also used in the output AST -- it's turned into an identifier node otherwise, so that semantic analysis can decide what to do with it.

The overall goal is to move syntax and grammar checking out of the semantic analysis procedures by preventing ill-formed AST from ever reaching them. This should simplify sem a bit, as it can now focus on semantic analysis and no longer has to also validate the structure of the AST (separation of concerns). Illformed AST error can now also be encoded as nkError nodes and the checkSonsLen procedure becomes obsolete.

Removing the syntax/grammar checks from sem is not done as part of this PR and should happen as a follow up.

Details

For the symbol reachability validation, it is first checked if the symbol is present in the current scope (i.e. the one where the macro is expanded) or one of the scopes enclosing it. Then, it is tested whether the symbol is part of the top-level scope of the module it belongs to. If that's also not the case, the symbol is treated as not reachable. Since instantiated generics are not added to the symbol table, their symbols are never treated as reachable, and thus turned into identifier by the sanitizer.

As part of fixing the issues that are blocking the MIR, I've looked a bit more into sem recently, and one thing that I identified as adding a significant amount of complexity is that due to macros, sem cannot trust the AST it operates on. This inspired me to look into solutions, which lead me to experimenting with this sanitizer layer.

Notes for Reviewers

this a prototype for the most part
the compiler is able to bootstrap, but a good amount of constructs are not supported yet
gensym handling is not figured out yet

saem

Noted some comments for the sanitizer core branches.

Thanks again, this is awesome, it inspired me to write some thoughts down, I'll share them shortly via discussions.

saem · 2023-01-02T19:18:04Z

compiler/sem/sanitizer.nim

+    # can't be expressed in untyped AST. Also only present in post-transform
+    # AST
+    # XXX: we could throw away the 'env' part and only use the 'prc' -- I'm
+    #      not sure if that's a good idea however
+    invalid()


For this PR, I think dropping the env var might not be the way to go, maybe we want to go with:

write out locally the:

env type

procdef

call it to simulate it

my thinking

There are two paths here when it comes to defining what a "raisings" means:

go back in time

use "reverse" transforms that emit roughly equivalent untyped code

I think 1 is nice where we can reasonably rewind thing, but I don't believe this is a property we can preserve unless we keep a) a full history of transforms, b) forward transforms need to preserve things, c) that last point discounts fusing, and d) I don't think the user always wants a time machine.

I believe the correct answer is phase dependent, so from post-transf is likely too big a transform (hlo has run) and history rewind is likely what people do not want. We can't handle this sort of decisioning right now. If it's early sem (macro/template expansion), back to untyped is better as a history rewind.

We don't track enough data to do this, hence my suggestion above for now.

The problem is that there is no equivalent for nkClosure in the untyped AST. A nkClosure is essentially a nkTupleConstr that always has two elements and is always of tyProc type. Consider the following code:

proc aux(x: proc(p: int)) = discard proc prc() = var x = 0 var y: int proc inner(p: int) = x = 1 y = p inner(1) # 1 aux(inner) # 2 let cl = inner # 3 cl(1)

After transf the body of prc will look like this:

var :env internalNew(:env) var :env.x1 = 0 var :env.y2: int proc inner(p: int; :envP) = ... (inner, :env)(1) # 1 # ^ `nkClosure` aux((inner, :env)) # 2 # ^ `nkClosure` let cl = (inner, :env) # 3 # ^ `nkClosure` cl(1)

For 1 and 3 it's possible to turn the nkClosure back into untyped AST (knowledge of sibling nodes is required, however), bit for 2 it's not.

Right now, what a node unwinds to depends only on the node kind plus the context it is used in. Supporting unwinding the above transformed code would mean that what a node translates to is also dependent on the sub-nodes (and sometimes even sibling nodes).

My terrible idea here was that we wouldn't do that much recovery. Instead we'd emit the creation of the env etc... but that runs into all sorts of issues, with internalNew and a bunch more stuff leaking out.

Keeping it invalid for now makes sense.

Thanks for the detailed explanation with the before and after transformation.

saem · 2023-01-02T19:19:37Z

compiler/sem/sanitizer.nim

+    nkRecCase,
+    nkRecWhen:
+    # context dependent
+    invalid()


If the when is eliminated then no worries, but if not, wouldn't we continue sanitizing recursively?

The sanitizer logic for everything type definition related is not yet implemented, but once it is, both the condition and body AST of the nkRecWhen will also be considered.

saem · 2023-01-02T19:20:36Z

compiler/sem/sanitizer.nim

+  of nkStmtList:
+    result = stmts(c, n)
+  of nkImportStmt, nkImportExceptStmt, nkExportStmt, nkExportExceptStmt, nkFromStmt, nkIncludeStmt:
+    # QUESTION: should we perform any additional validation here or leave that


most of the validation should remain in sem, doing some basics here for sanity is fine, but I wouldn't "hurt" ourselves to do it here.

saem · 2023-01-02T19:21:50Z

compiler/sem/sanitizer.nim

+  of nkForStmt:
+    checkMinSonsLen(n, 3)
+    result = prepareFrom(n)
+    # note: this allows for code that the parser doesn't


Hmm, the syntax ast and the untyped ast spec diverging is concerning. 🤔

With how it's currently implemented, the sanitizer allows the following code:

for (a, b), (c, d), (e, f) in items: ...

while the parser doesn't. For what it's worth, with the recent semForVar refactoring, this would also be correctly analysed by sem.

saem · 2023-01-02T19:22:52Z

compiler/sem/sanitizer.nim

+    checkSonsLen(n, 1)
+    result = expr(c, n[0])
+  of nkObjDownConv, nkObjUpConv:
+    # XXX: these don't exist in typed AST prior to ``transf``. How should they


The types might change so the conversions should be deduced again, IMO.

Okay, I'm going to check how well the following unwinding works then:

ObjDownConv (x) -> Call (type-of-destination) (x)

saem · 2023-01-02T19:24:24Z

compiler/sem/sanitizer.nim

+        of nkElifExpr, nkElifBranch:
+          parseBranch(c, it, n.kind == nkIfExpr)
+        of nkElseExpr, nkElse:
+          # XXX: hm, also make sure that the 'else' is the last item? Or leave


Yes, I think this is the type of validation that we likely want to do here.

Although, this level of validation is likely at the edge of complexity we want to entertain, however.

The parser also only allows nkElse/nkElseExpr as the last item, so it makes additional sense to enforce that here too.

saem · 2023-01-02T19:25:31Z

compiler/sem/sanitizer.nim

+    result = newTreeI(nkInfix, n.info, ident(c, ".."), expr(c, n[0]), expr(c, n[1]))
+  of nkDotExpr:
+    result = parseDotExpr(c, n)
+  of nkCheckedFieldExpr:


This would be deduced again by sem, no?

Yep. The comment is a remnant of the time were it wasn't fully clear yet what the sanitizer should actually do, i.e. whether it should perform unwinding or translate into untyped AST that is roughly semantically equivalent.

saem · 2023-01-02T19:27:27Z

compiler/sem/sanitizer.nim

+  of nkNilLit:
+    result = newNodeI(nkNilLit, n.info)
+  of nkDotCall:
+    # XXX: hm, turn into an ``nkDotExpr`` + ``nkCall`` instead?


Yes, I think that would be a more true unwind.

The reason I hesitated here is because nkDotCall can appear in neither typed nor untyped AST a macro gets from sem. Looking at nkDotCall a bit closer, I think the more proper unwind would be also be only nkDotExpr. That is:

DotCall (x, y) -> DotExpr (y, x)

Good catch, that's indeed a truer reversal based on what dotTransformation does.

saem · 2023-01-02T19:33:45Z

compiler/sem/sanitizer.nim

+        if it.kind == nkExprColonExpr: parseColonExpr(c, it)
+        else:                          expr(c, it)
+
+  of nkError:


"technically" this could be:

if user error, return to error pragma

otherwise, extract the untyped wrongNode as that should always be a straight swap

If we enable error pragma blocks, then we might have an easier time:

{.error.}: this is not valid syntax without commas

Thoughts?

saem · 2023-01-04T09:03:08Z

Reports is a blight, there I've made my position absolute clear.

Right now this PR is stuck, because said trash has gummed up the works with Concepts and explain reporting, and the overly far reaching structured report hook. I have no idea why help messages and general CLI output really need to go through this overwrought machinery.

So, that means no further major changes that make this PR hard/impossible to merge because they poke at too much of the existing "reports" shite in the code base. Sorry. I hate type this out as much as you do reading it, I'm really not happy. The problem is that this blight is literally everywhere and has completely locked everything together.

I'm not a fan of doing of pulling the andon cord, but at present I'm not seeing any non-big bang changes to unblock things. In my latest commit I've written a bit of a plan for how to push through, along with starting down that road.

The options for what to do next are as follows:

push through with the PR, it doesn't have to be pretty but maximum pushback of anything "reports", the downside is this can take a while
disable concepts and explain tests so CI passes, they're flakey features and the downside is the output sucks for a while, but the upside is that the merge block is free

I'll cover the second option first as it's fairly straightforward: there are two or three concepts and {.explain.} related tests, everything else is passing. I just disable them, clean-up the PR, and merge it. Then continue excising this stuff from the codebase.

The first one means:

I've disabled localReport for SemReport, I use that to rip out SemReport except for an overwrought text rendering layer
I repeat the above for VMReports
I liberate help messages, regular program/CLI output, compiler tracing/internals telemetry etc... from this stuff, it should never have been a part of this
create logging procs on PContext and other more specific things, so diagnostics/telemetry/events go through those, with greater context hopefully that makes when to write/not easier to grok

Next steps, ideally we can squeeze some time to discuss tomorrow which strategy to take. I'm good to unblock merges as long as there is a path forward. So just need agreement on which option to take, 1, 2 or mystery option 3 which I'm hoping someone else can suggest.

If we can't discuss it tomorrow, I'm likely to go with Option 2 as I don't like blocking for any lengthy period of time.

@zerbina this is the message I alluded to in the other PR.

disruptek · 2023-01-04T11:00:18Z

You lose nothing by disabling the concepts stuff. Why not take that route in the service of progress; you can still take the same time to re-enable it later.

Does it really matter how many PRs this work goes into or how long it takes? The main thing is that you're improving the code for everyone. Good job.

saem · 2023-01-04T19:03:59Z

Apologies, I posted the above, comment in the wrong PR -- 1AM tiredness. Should have posted it here. Nevertheless, thank you all very much for the encouragement and support.

Correct PR: #505

saem · 2023-01-06T23:31:58Z

compiler/sem/sanitizer.nim

+  # IDEA: the new node kind ``nkSymSuggestion`` (or similar) could be
+  #       introduced. It communicates to sem that the symbol must be checked
+  #       for reachability (i.e. that the defining scope is reachable from the
+  #       usage site) first. This would allow to move the scope check out of the
+  #       sanitizer, in which case it no longer needs access to an ``PContext``
+  #       instance


Check my Understanding of the Problem

If I understand things correctly, what a macro can do to break visibility is around shadowed or overloaded symbols, some examples:

# example 1 let foo = 10 block: let foo = 11 echo foo # <-- a macro could choose to make this the outer foo

In the above a macro wants to emit an ast thingy, that is some path from the position of foo in echo foo to it's destination (relative to it) and/or some way to define the path from a top level/containing scope module/proc body, etc.

# example 2 proc foo(a: int) = echo "singular" proc foo(aaaaaa: varargs[int]) = echo "plural" foo 1 # <-- a macro could make this dispatch on the varargs and get plural

In the above a macro wants to emit an ast thingy, that is some way to unambiguously refer to a symbol that may or may not be overloaded.

Solution Idea(s)

A solution for the first example might look like an nkIdentQuery "start" "path" "parts" "to" "navigate". This might turn into something very akin to jq, xpath, etc... but I don't mind. To be useful the macro API likely needs a facility to get paths, eg:

proc getIdentPath(tree, child: NimNode): NimNode{nkIdentPath}, and

proc getIdentPathFrom(tree, start, dest: NimNode): NimNode{nkIdentPath}

In both those cases all nodes must be contained within tree.

A few "special" paths can be considered if we find them useful, like "{module}" (I just randomly chose {} to mean interpolate) for the containing module. Or {proc} for the containing proc, etc. Bonus points if we can use labels.

A solution for the the second example might be always give stable identifiers for routines that are derivable, even for a human running the algorithm by hand in at least simple cases, and then a macro author can refer to it either by formulating the identifier manually or querying for it. So perhaps it could be an nkIdentQuery <to be determined> as well, but it'll need a bit more of a think on the algorithm, it's basically name mangling. 😆

Aside, Kotlin has a facility like this, I didn't bring it up until now because I don't want to get caught up in syntax, it might not cover all cases, but does a fair few:

https://kotlinlang.org/spec/expressions.html#this-expressions (even though it's this I'm just focusing on the identifier resolution)

https://kotlinlang.org/docs/returns.html (note the 3rd example under Return to labels where things get implicit labels)

erlang is untyped, so they get away with disambiguating functions based on arity, so their imports look like -import(somemodule, [foo/1, foo/2]).; that's not specific enough for us I'm afraid

Is my understanding correct in that the examples refer to AST that the macro would generate and that the problem lies in how to allow the macro to refer to the mentioned entities (which are not the ones sem would normally choose)? If yes, then the I think the current way would be genSym, which has the problem of producing a symbol instead of an unresolved identifier. genSym'ed symbols are a blocker for the sanitizer right now.

My original idea was to make genSym produce a unique identifier instead of an sfGenSym symbol, but maybe that's a solution orthogonal to nkIdentQuery.

If you're referring to the IDEA comment: that was me thinking about ways to get around the PContext dependency. Apologies if I'm completely misunderstanding what you mean.

Is my understanding correct in that the examples refer to AST that the macro would generate and that the problem lies in how to allow the macro to refer to the mentioned entities (which are not the ones sem would normally choose)? If yes, then the I think the current way would be genSym, which has the problem of producing a symbol instead of an unresolved identifier. genSym'ed symbols are a blocker for the sanitizer right now.

Not just AST a macro generates, but I think example 1 is better solved by not having the issue in the first place. In other words, genSym is bad, but generating a unique identifier is what we want. At which point, we only get generated symbols for things the compiler has already figured out.

For example 2, that's a user (import, call site, etc) or macro trying to unambiguously refer to the symbol. The unique identifier doesn't solve that.

My original idea was to make genSym produce a unique identifier instead of an sfGenSym symbol, but maybe that's a solution orthogonal to nkIdentQuery.

After the unique identifier part which solves example 1 well, outside of letting the user disambiguate, but let's not bother yet. I think we're down to example 2, where query or "unambiguous fully qualified path" is required.

If you're referring to the IDEA comment: that was me thinking about ways to get around the PContext dependency. Apologies if I'm completely misunderstanding what you mean.

It's related, but I think that's an extra bonus from solving other issues.

In order to be able to eventually remove `considerQuotedIdent`, the sanitizer does a bit more than just sanitize and evaluates identifier construction expressions.

In addition, explicitly allow the type slot to be empty (`parseTypeExpr` is going to reject empty nodes).

In addition, `process` now dispatches to `parsePragmaList` for processing `nkPragma`.

* basic processing for formal and generic parameter lists * lambda-like expressions work

The implementation currently supports: * records (list, case, when) * enums * tuples * type-classes * var, ptr, ref, distinct, proc

While not entirely correct (yet), doing so is better than calling `process`.

The result is now properly initialized and `parsePragmaList` is used instead of the outdated `parsePragma`.

A typed `nkEmpty` was previously put into the name slot for `nkObjConstr` nodes. While `semTypeNode` allows this (it likely shouldn't), the sanitizer does not.

WIP: turn macro output into untyped AST

fdaad2c

haxscramper added this to the Semantic analysis phase refactoring milestone Jan 2, 2023

haxscramper added refactor Implementation refactor compiler General compiler tag compiler/sem Related to semantic-analysis system of the compiler lang/macro labels Jan 2, 2023

saem reviewed Jan 2, 2023

View reviewed changes

saem reviewed Jan 6, 2023

View reviewed changes

zerbina added 19 commits March 16, 2023 00:34

sanitizer: implement nkAccQuoted processing

4eec6a9

In order to be able to eventually remove `considerQuotedIdent`, the sanitizer does a bit more than just sanitize and evaluates identifier construction expressions.

sanitizer: make parseColonExpr applicable in general

7637d19

sanitizer: consider pragmas in identdefs

9506d06

In addition, explicitly allow the type slot to be empty (`parseTypeExpr` is going to reject empty nodes).

sanitizer: preliminary pragma list processing

d513b0e

sanitizer: implement parseTry

21421d5

sanitizer: fix nkType processing

71a9f11

sanitizer: fix nkPar/nkObjConstr processing

830e870

sanitizer: implement pragma-block processing

eee6d75

In addition, `process` now dispatches to `parsePragmaList` for processing `nkPragma`.

sanitizer: implement nkConstSection processing

91c969a

sanitizer: progress on callable definitions

d4d7b9b

* basic processing for formal and generic parameter lists * lambda-like expressions work

sanitizer: type sections are supported

6b959d4

sanitizer: basic implementation of parseTypeNode

62e950d

The implementation currently supports: * records (list, case, when) * enums * tuples * type-classes * var, ptr, ref, distinct, proc

sanitizer: call parseTypeNode for type expressions

e4e6e96

While not entirely correct (yet), doing so is better than calling `process`.

sanitizer: implement parseIdentVis

bbd9feb

sanitizer: implement nkDefer support

94784d9

sanitizer: fix the nkStaticStmt processing

3013d9d

sanitizer: allow basic type expressions in process

0df4a2c

sanitizer: fix processing logic for nkAsmStmt

57692a6

The result is now properly initialized and `parsePragmaList` is used instead of the outdated `parsePragma`.

vmcompilerserdes: use an nkType for object constructors

7b05d95

A typed `nkEmpty` was previously put into the name slot for `nkObjConstr` nodes. While `semTypeNode` allows this (it likely shouldn't), the sanitizer does not.

zerbina mentioned this pull request Oct 1, 2023

fix(sem): crash due to ill-formed if AST #933

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: turn macro output into untyped AST #508

WIP: turn macro output into untyped AST #508

zerbina commented Jan 2, 2023

saem left a comment

saem Jan 2, 2023

zerbina Jan 4, 2023

saem Jan 4, 2023

saem Jan 4, 2023

saem Jan 2, 2023

zerbina Jan 4, 2023

saem Jan 2, 2023

saem Jan 2, 2023

zerbina Jan 4, 2023

saem Jan 2, 2023

zerbina Jan 4, 2023

saem Jan 2, 2023

zerbina Jan 4, 2023

saem Jan 2, 2023

zerbina Jan 4, 2023

saem Jan 2, 2023

zerbina Jan 4, 2023

saem Jan 4, 2023

saem Jan 2, 2023

zerbina Jan 4, 2023

saem commented Jan 4, 2023

disruptek commented Jan 4, 2023

saem commented Jan 4, 2023

saem Jan 6, 2023

zerbina Jan 18, 2023

saem Jan 24, 2023

WIP: turn macro output into untyped AST #508

Are you sure you want to change the base?

WIP: turn macro output into untyped AST #508

Conversation

zerbina commented Jan 2, 2023

Details

Notes for Reviewers

saem left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

my thinking

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

saem commented Jan 4, 2023

disruptek commented Jan 4, 2023

saem commented Jan 4, 2023

Choose a reason for hiding this comment

Check my Understanding of the Problem

Solution Idea(s)

Choose a reason for hiding this comment

Choose a reason for hiding this comment