Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: turn macro output into untyped AST #508

Draft
wants to merge 20 commits into
base: devel
Choose a base branch
from

Conversation

zerbina
Copy link
Collaborator

@zerbina zerbina commented Jan 2, 2023

This PR implements a transformation pass that turns untrusted AST as output from a macro into untyped AST. In this context, "untyped" currently means that the resulting AST:

  • has no type information
  • has no node flags
  • apart from nkSym, nkOpenSymChoice, and nkClosedSymChoice contains only nodes that are output by the parser
  • only contains syntax that is valid according to the grammar

The translation tries to approximate constructs present in typed AST with their untyped AST counterpart where possible. For example a Conv (Type "int") (Sym "x") is turned into Call (Sym "int") (Sym "x").

For each input symbol node, it is validated that the symbol is in scope (reachable from the current scope). If it is, the symbol is also used in the output AST -- it's turned into an identifier node otherwise, so that semantic analysis can decide what to do with it.

The overall goal is to move syntax and grammar checking out of the semantic analysis procedures by preventing ill-formed AST from ever reaching them. This should simplify sem a bit, as it can now focus on semantic analysis and no longer has to also validate the structure of the AST (separation of concerns). Illformed AST error can now also be encoded as nkError nodes and the checkSonsLen procedure becomes obsolete.

Removing the syntax/grammar checks from sem is not done as part of this PR and should happen as a follow up.

Details

For the symbol reachability validation, it is first checked if the symbol is present in the current scope (i.e. the one where the macro is expanded) or one of the scopes enclosing it. Then, it is tested whether the symbol is part of the top-level scope of the module it belongs to. If that's also not the case, the symbol is treated as not reachable. Since instantiated generics are not added to the symbol table, their symbols are never treated as reachable, and thus turned into identifier by the sanitizer.


As part of fixing the issues that are blocking the MIR, I've looked a bit more into sem recently, and one thing that I identified as adding a significant amount of complexity is that due to macros, sem cannot trust the AST it operates on. This inspired me to look into solutions, which lead me to experimenting with this sanitizer layer.

Notes for Reviewers

  • this a prototype for the most part
  • the compiler is able to bootstrap, but a good amount of constructs are not supported yet
  • gensym handling is not figured out yet

@haxscramper haxscramper added refactor Implementation refactor compiler General compiler tag compiler/sem Related to semantic-analysis system of the compiler lang/macro labels Jan 2, 2023
Copy link
Collaborator

@saem saem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted some comments for the sanitizer core branches.

Thanks again, this is awesome, it inspired me to write some thoughts down, I'll share them shortly via discussions.

Comment on lines +948 to +952
# can't be expressed in untyped AST. Also only present in post-transform
# AST
# XXX: we could throw away the 'env' part and only use the 'prc' -- I'm
# not sure if that's a good idea however
invalid()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR, I think dropping the env var might not be the way to go, maybe we want to go with:

  1. write out locally the:
    1. env type
    2. procdef
  2. call it to simulate it

my thinking

There are two paths here when it comes to defining what a "raisings" means:

  1. go back in time
  2. use "reverse" transforms that emit roughly equivalent untyped code

I think 1 is nice where we can reasonably rewind thing, but I don't believe this is a property we can preserve unless we keep a) a full history of transforms, b) forward transforms need to preserve things, c) that last point discounts fusing, and d) I don't think the user always wants a time machine.

I believe the correct answer is phase dependent, so from post-transf is likely too big a transform (hlo has run) and history rewind is likely what people do not want. We can't handle this sort of decisioning right now. If it's early sem (macro/template expansion), back to untyped is better as a history rewind.

We don't track enough data to do this, hence my suggestion above for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that there is no equivalent for nkClosure in the untyped AST. A nkClosure is essentially a nkTupleConstr that always has two elements and is always of tyProc type. Consider the following code:

proc aux(x: proc(p: int)) = discard

proc prc() =
  var x = 0
  var y: int
  
  proc inner(p: int) =
    x = 1
    y = p

  inner(1)       # 1
  aux(inner)     # 2
  let cl = inner # 3
  cl(1)

After transf the body of prc will look like this:

var :env
internalNew(:env)

var :env.x1 = 0
var :env.y2: int

proc inner(p: int; :envP) = ...

(inner, :env)(1)       # 1
# ^ `nkClosure`
aux((inner, :env))     # 2
#   ^ `nkClosure`
let cl = (inner, :env) # 3
#        ^ `nkClosure`
cl(1)

For 1 and 3 it's possible to turn the nkClosure back into untyped AST (knowledge of sibling nodes is required, however), bit for 2 it's not.

Right now, what a node unwinds to depends only on the node kind plus the context it is used in. Supporting unwinding the above transformed code would mean that what a node translates to is also dependent on the sub-nodes (and sometimes even sibling nodes).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My terrible idea here was that we wouldn't do that much recovery. Instead we'd emit the creation of the env etc... but that runs into all sorts of issues, with internalNew and a bunch more stuff leaking out.

Keeping it invalid for now makes sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation with the before and after transformation.

nkRecCase,
nkRecWhen:
# context dependent
invalid()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the when is eliminated then no worries, but if not, wouldn't we continue sanitizing recursively?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sanitizer logic for everything type definition related is not yet implemented, but once it is, both the condition and body AST of the nkRecWhen will also be considered.

of nkStmtList:
result = stmts(c, n)
of nkImportStmt, nkImportExceptStmt, nkExportStmt, nkExportExceptStmt, nkFromStmt, nkIncludeStmt:
# QUESTION: should we perform any additional validation here or leave that
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the validation should remain in sem, doing some basics here for sanity is fine, but I wouldn't "hurt" ourselves to do it here.

of nkForStmt:
checkMinSonsLen(n, 3)
result = prepareFrom(n)
# note: this allows for code that the parser doesn't
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the syntax ast and the untyped ast spec diverging is concerning. 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With how it's currently implemented, the sanitizer allows the following code:

for (a, b), (c, d), (e, f) in items: ...

while the parser doesn't. For what it's worth, with the recent semForVar refactoring, this would also be correctly analysed by sem.

checkSonsLen(n, 1)
result = expr(c, n[0])
of nkObjDownConv, nkObjUpConv:
# XXX: these don't exist in typed AST prior to ``transf``. How should they
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The types might change so the conversions should be deduced again, IMO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm going to check how well the following unwinding works then:

ObjDownConv (x) -> Call (type-of-destination) (x) 

of nkElifExpr, nkElifBranch:
parseBranch(c, it, n.kind == nkIfExpr)
of nkElseExpr, nkElse:
# XXX: hm, also make sure that the 'else' is the last item? Or leave
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is the type of validation that we likely want to do here.

Although, this level of validation is likely at the edge of complexity we want to entertain, however.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser also only allows nkElse/nkElseExpr as the last item, so it makes additional sense to enforce that here too.

result = newTreeI(nkInfix, n.info, ident(c, ".."), expr(c, n[0]), expr(c, n[1]))
of nkDotExpr:
result = parseDotExpr(c, n)
of nkCheckedFieldExpr:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be deduced again by sem, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. The comment is a remnant of the time were it wasn't fully clear yet what the sanitizer should actually do, i.e. whether it should perform unwinding or translate into untyped AST that is roughly semantically equivalent.

of nkNilLit:
result = newNodeI(nkNilLit, n.info)
of nkDotCall:
# XXX: hm, turn into an ``nkDotExpr`` + ``nkCall`` instead?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that would be a more true unwind.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I hesitated here is because nkDotCall can appear in neither typed nor untyped AST a macro gets from sem. Looking at nkDotCall a bit closer, I think the more proper unwind would be also be only nkDotExpr. That is:

DotCall (x, y) -> DotExpr (y, x)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, that's indeed a truer reversal based on what dotTransformation does.

if it.kind == nkExprColonExpr: parseColonExpr(c, it)
else: expr(c, it)

of nkError:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"technically" this could be:

  • if user error, return to error pragma
  • otherwise, extract the untyped wrongNode as that should always be a straight swap

If we enable error pragma blocks, then we might have an easier time:

{.error.}:
  this is not valid syntax without commas

Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it.

@saem
Copy link
Collaborator

saem commented Jan 4, 2023

Reports is a blight, there I've made my position absolute clear.

Right now this PR is stuck, because said trash has gummed up the works with Concepts and explain reporting, and the overly far reaching structured report hook. I have no idea why help messages and general CLI output really need to go through this overwrought machinery.

So, that means no further major changes that make this PR hard/impossible to merge because they poke at too much of the existing "reports" shite in the code base. Sorry. I hate type this out as much as you do reading it, I'm really not happy. The problem is that this blight is literally everywhere and has completely locked everything together.

I'm not a fan of doing of pulling the andon cord, but at present I'm not seeing any non-big bang changes to unblock things. In my latest commit I've written a bit of a plan for how to push through, along with starting down that road.

The options for what to do next are as follows:

  1. push through with the PR, it doesn't have to be pretty but maximum pushback of anything "reports", the downside is this can take a while
  2. disable concepts and explain tests so CI passes, they're flakey features and the downside is the output sucks for a while, but the upside is that the merge block is free

I'll cover the second option first as it's fairly straightforward: there are two or three concepts and {.explain.} related tests, everything else is passing. I just disable them, clean-up the PR, and merge it. Then continue excising this stuff from the codebase.

The first one means:

  • I've disabled localReport for SemReport, I use that to rip out SemReport except for an overwrought text rendering layer
  • I repeat the above for VMReports
  • I liberate help messages, regular program/CLI output, compiler tracing/internals telemetry etc... from this stuff, it should never have been a part of this
  • create logging procs on PContext and other more specific things, so diagnostics/telemetry/events go through those, with greater context hopefully that makes when to write/not easier to grok

Next steps, ideally we can squeeze some time to discuss tomorrow which strategy to take. I'm good to unblock merges as long as there is a path forward. So just need agreement on which option to take, 1, 2 or mystery option 3 which I'm hoping someone else can suggest.

If we can't discuss it tomorrow, I'm likely to go with Option 2 as I don't like blocking for any lengthy period of time.

@zerbina this is the message I alluded to in the other PR.

@disruptek
Copy link
Contributor

You lose nothing by disabling the concepts stuff. Why not take that route in the service of progress; you can still take the same time to re-enable it later.

Does it really matter how many PRs this work goes into or how long it takes? The main thing is that you're improving the code for everyone. Good job.

@saem
Copy link
Collaborator

saem commented Jan 4, 2023

Apologies, I posted the above, comment in the wrong PR -- 1AM tiredness. Should have posted it here. Nevertheless, thank you all very much for the encouragement and support.

Correct PR: #505

Comment on lines +179 to +184
# IDEA: the new node kind ``nkSymSuggestion`` (or similar) could be
# introduced. It communicates to sem that the symbol must be checked
# for reachability (i.e. that the defining scope is reachable from the
# usage site) first. This would allow to move the scope check out of the
# sanitizer, in which case it no longer needs access to an ``PContext``
# instance
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check my Understanding of the Problem

If I understand things correctly, what a macro can do to break visibility is around shadowed or overloaded symbols, some examples:

# example 1
let foo = 10
block:
  let foo = 11
  echo foo       # <-- a macro could choose to make this the outer foo

In the above a macro wants to emit an ast thingy, that is some path from the position of foo in echo foo to it's destination (relative to it) and/or some way to define the path from a top level/containing scope module/proc body, etc.

# example 2
proc foo(a: int) = echo "singular"

proc foo(aaaaaa: varargs[int]) = echo "plural"

foo 1 # <-- a macro could make this dispatch on the varargs and get plural

In the above a macro wants to emit an ast thingy, that is some way to unambiguously refer to a symbol that may or may not be overloaded.

Solution Idea(s)

A solution for the first example might look like an nkIdentQuery "start" "path" "parts" "to" "navigate". This might turn into something very akin to jq, xpath, etc... but I don't mind. To be useful the macro API likely needs a facility to get paths, eg:

  • proc getIdentPath(tree, child: NimNode): NimNode{nkIdentPath}, and
  • proc getIdentPathFrom(tree, start, dest: NimNode): NimNode{nkIdentPath}

In both those cases all nodes must be contained within tree.

A few "special" paths can be considered if we find them useful, like "{module}" (I just randomly chose {} to mean interpolate) for the containing module. Or {proc} for the containing proc, etc. Bonus points if we can use labels.

A solution for the the second example might be always give stable identifiers for routines that are derivable, even for a human running the algorithm by hand in at least simple cases, and then a macro author can refer to it either by formulating the identifier manually or querying for it. So perhaps it could be an nkIdentQuery <to be determined> as well, but it'll need a bit more of a think on the algorithm, it's basically name mangling. 😆


Aside, Kotlin has a facility like this, I didn't bring it up until now because I don't want to get caught up in syntax, it might not cover all cases, but does a fair few:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding correct in that the examples refer to AST that the macro would generate and that the problem lies in how to allow the macro to refer to the mentioned entities (which are not the ones sem would normally choose)? If yes, then the I think the current way would be genSym, which has the problem of producing a symbol instead of an unresolved identifier. genSym'ed symbols are a blocker for the sanitizer right now.

My original idea was to make genSym produce a unique identifier instead of an sfGenSym symbol, but maybe that's a solution orthogonal to nkIdentQuery.

If you're referring to the IDEA comment: that was me thinking about ways to get around the PContext dependency. Apologies if I'm completely misunderstanding what you mean.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding correct in that the examples refer to AST that the macro would generate and that the problem lies in how to allow the macro to refer to the mentioned entities (which are not the ones sem would normally choose)? If yes, then the I think the current way would be genSym, which has the problem of producing a symbol instead of an unresolved identifier. genSym'ed symbols are a blocker for the sanitizer right now.

Not just AST a macro generates, but I think example 1 is better solved by not having the issue in the first place. In other words, genSym is bad, but generating a unique identifier is what we want. At which point, we only get generated symbols for things the compiler has already figured out.

For example 2, that's a user (import, call site, etc) or macro trying to unambiguously refer to the symbol. The unique identifier doesn't solve that.

My original idea was to make genSym produce a unique identifier instead of an sfGenSym symbol, but maybe that's a solution orthogonal to nkIdentQuery.

After the unique identifier part which solves example 1 well, outside of letting the user disambiguate, but let's not bother yet. I think we're down to example 2, where query or "unambiguous fully qualified path" is required.

If you're referring to the IDEA comment: that was me thinking about ways to get around the PContext dependency. Apologies if I'm completely misunderstanding what you mean.

It's related, but I think that's an extra bonus from solving other issues.

In order to be able to eventually remove `considerQuotedIdent`, the
sanitizer does a bit more than just sanitize and evaluates identifier
construction expressions.
In addition, explicitly allow the type slot to be empty (`parseTypeExpr`
is going to reject empty nodes).
In addition, `process` now dispatches to `parsePragmaList` for
processing `nkPragma`.
* basic processing for formal and generic parameter lists
* lambda-like expressions work
The implementation currently supports:
* records (list, case, when)
* enums
* tuples
* type-classes
* var, ptr, ref, distinct, proc
While not entirely correct (yet), doing so is better than calling
`process`.
The result is now properly initialized and `parsePragmaList` is used
instead of the outdated `parsePragma`.
A typed `nkEmpty` was previously put into the name slot for
`nkObjConstr` nodes. While `semTypeNode` allows this (it likely
shouldn't), the sanitizer does not.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/sem Related to semantic-analysis system of the compiler compiler General compiler tag lang/macro refactor Implementation refactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants