Tuesday, May 28, 2013

Onward

After much soul-searching... actually, not that much... I've decided to continue working on Loyc and EC# even though I discovered a really nice language in Nemerle. Personally, I wouldn't mind switching development to Nemerle, but I didn't want to throw away all the work I'd already done on LLLPG, and I still think that the goals of the Loyc project cover a lot of ground that Nemerle is not intended to cover, so continuing to work on Loyc is not simply a waste of time.

Nemerle has a great macro system, and I have a lot to learn about how it works. Whatever I learn, I will document it and apply that knowledge to EC#. Assuming I'm smart enough to decipher how it works, EC# will end up with a great macro system like Nemerle's.

Loyc development will continue in C#, since (A) there is so much C# code already; (B) I'm going to write a EC# compiler and it should be self-hosting, which is impossible if I introduce Nemerle code; and (C) I want to attract new developers, and requiring them to use a new programming language might drive them away.

I wrote a big long blog post about LLLPG and a new syntax I'm developing called LES, but just as I was about to post it I discovered a serious design flaw in LES. I'll see if I can iron that out before the post... by which time LLLPG will be almost done! I just got simple syntactic predicates working this morning, so all that is left is the parser which will finally allow you to easily give LLLPG an input grammar.

Saturday, May 25, 2013

Status update: Loyc tree rewritten

I have made and implemented the changes that I planned to make to the Loyc tree. The code was rewritten from scratch; the new code is not as fancy or complete, but it's easier to use.
  • Red nodes are gone (and GreenNode is now called LNode, for "Loyc Node")
  • Mutable nodes are gone, and the data type for Args and Attrs has changed to RVList<T> (Reverse V-List--since this tends to make people think of Recreational Vehicles, I'm thinking about changing the name to simply VList, with the forward VList still being named FVList). RVList provides a natural way to support the common pattern of gathering a list of statements and then freezing them into a read-only node; you can create a list with RWList<LNode>, mutate it however you like (similar performance to List<T>), and then when the list is complete, call ToRVList() to convert it to RVList<LNode> instantly.
  • There are now three kinds of nodes: symbols (now called Ids, i.e. identifiers), literals, and calls. I'm considering whether to add a fourth type for nodes that have both a name and a value, perhaps calling it the Token kind (StdTriviaNode already has both a Name and Value, but it is currently a subclass of StdIdNode.).
  • An empty Name is now allowed. A literal now has a blank name (instead of #literal) and a method that calls anything other than a simple symbol will also have a blank Name. Note: the property will still never return null.
  • I decided to represent expressions in parenthesis using a call with the empty target @``, that is, the empty identifier. But now I am considering whether to eliminate the representation of parenthesis completely, because it's a significant inconvenience for node printers: if parenthesis are represented explicitly, a printer cannot add parenthesis to clarify precedence without changing the syntax tree. So far the EC# printer has managed to avoid adding parenthesis in almost all cases, but there's still the issue of "tightly bound attributes": if the "f" part of "f(x)" has an attribute, it must be printed with the special "grouping parenthesis", ##(), as in ##([attr] f)(x). It's clunky, but it works*. But now I'm designing another language (more on that in my next post) and I haven't figured out what to do about this yet. Perhaps I should just define an exception where "the parenthesis don't count if there are attributes inside"; we'll see.
During the redesign I decided on some small changes to the representation of certain expressions in EC#:
  • The '.' operator is now treated more like a normal binary operator; a.b.c is now represented #.(#.(a, b), c) rather than #.(a, b, c) mainly because it's easier that way, and because the second representation doesn't buy anything significant other than a need for special-casing.
  • int x = 0 will now be represented #var(int, x = 0) rather than #var(int, x(0)). I chose the latter representation initially because it is slightly more convenient, because you can always learn the name of the declared variable by calling var.Args[1].Name. However, I decided that it was more important for the syntax tree to be predictable, with obvious connections between normal and prefix notations. Since I decided that alias X = Y; was to be represented #alias(X = Y, #()), it made sense for the syntax tree of a variable declaration to also resemble its C# syntax. There's another small reason: C++ has both styles Foo x(y) and Foo x = y; if Loyc were to ever support C++, it would make sense to use #var(Foo, x(y)) and #var(Foo, x = y) for these two cases, and I believe C#'s variable declarations are semantically closer to the latter. (Note: another possibility was #var(int, x) = 0, but I decided this wasn't an improvement, it would just shift the pain around.)
  • An constructor argument list is required on all types using the #new operator, e.g. new int[] { x } must have an empty set of arguments on int[], i.e. #new(#of(#[],int)(), x); this rule makes the different kinds of new expressions easier to interpret by making them consistent with each other.
  • A missing syntax element is now represented by an empty symbol instead of the symbol #missing.
  • I've decided to adopt the "in-expression" generics syntax from Nemerle as an unambiguous alternative to angle brackets: List.[int] means List<int> and the printer will use this syntax in cases where angle brackets are ambiguous.
  • By popular demand, constructors will be written this(...) instead of new(...), since both D and Nemerle use the latter notation.
  • The \ and $ characters have been swapped; \S now denotes a symbol S, while $S now denotes a substitution. Originally EC# was designed just as an extension of C#, so \ made sense as a substitution operator for string interpolation because it doesn't hurt backward compatibility: "Loaded '\(filename)' successfully". But now that my focus has shifted to multi-language interoperability, $ makes more sense, as it is used for string interpolation in at least five other languages and it makes sense to use the same character for both string substitution and code substitution.
LLLPG is also improved; I have implemented two more major features:
  • Implemented support for terminals of unknown value. I observed that if the inputs are integers (or enums) LLLPG shouldn't have to know the actual integer value of each integer. The user should be able to use a symbolic name like "NUMBER" or "DQString" without actually telling LLLPG what the value is. So I eliminated the code-snippet generator for Symbols and replaced it with GeneralCodeGenHelper, which supports inputs of any data type: integers, enums, Symbols, strings, whatever. Note that switch() statements require constant cases, so switch() must be disabled for non-constant possibilities such as Symbols. A constructor argument controls whether GeneralCodeGenHelper is allowed to generate switch() statements (ideally individual terminal symbols could be marked constant or non-constant, but a global switch will suffice for now.)
  • Replaced redundant Match() calls with Skip(), i.e. if prediction learns that "LA(0) is in the set ('0'..'9')" then a test like MatchRange('0', '9', '.', '.') is redundant and can be replaced by Skip(). Also, I eliminated redundant Check() calls. These two features together are called "prematch analysis".
Only two major features are left and LLLPG 1.0 will be ready:
  • Syntactic predicate support (only semantic predicates work right now)
  • Input from source code, and a command-line interface. Currently grammars have to be constructed with overloaded C# operators and printed at run-time, which of course is clumsy.

Tuesday, May 7, 2013

Redesigning the Loyc tree code

The original (and current) design for the classes that represent Loyc trees has some problematic properties:
  1. You can't tell if a node is mutable or immutable without a runtime check.
  2. Green nodes are hard to work with because they use relative positioning.
  3. Red nodes (the Node class) are hard to work with because they have parents.
  4. The "f" part of "f(x)" is not (in general) an independent node--although it could be.
First of all, it is a hybrid mutable-immutable design: you can create trees that are mutable, and then freeze them to get immutable trees. This way, they support multiple programming styles depending on what you prefer. If you'd like to modify trees in-place, you can; if you like to work with immutable trees and contruct them bottom-up, you can. In theory, it sounds useful and flexible.

However, the original design does not benefit from type safety. You cannot say "this method requires a mutable tree" or "this method requires an immutable tree". It's non-obvious when a method takes or returns a "Node", what exactly that means.

I created an optionally-persistent hash-set recently that overcomes this problem by having two separate data types for mutable and immutable sets: Set<T> (immutable) and MSet<T> (mutable); both of these are wrappers around a helper data type InternalSet<T> which contains the algorithms, and the user can convert a set from mutable to immutable (or vice versa) in O(1) time, so I've been considering doing something comparable for Loyc nodes.

Parenting is another problem. My current implementation is inspired by Roslyn and has two kinds of nodes, green and red, the meaning of which is explained by Eric Lippert on his blog. The green nodes are cachable, so that for example the subtree for "Console.WriteLine" can be re-used to save memory if it appears in multiple places in a source file; they also use relative positioning to help support incremental parsing if I ever get around to implementing that. These factors allow Loyc trees to be very lightweight, but the latter fact makes it very hard to manipulate green trees without losing all information about positioning (or worse, causing some calculated positions to be incorrect). Therefore, end-users are generally supposed to use red nodes instead of green ones, leaving green nodes mainly as carefully-constructed products of the parser.

However, each red node has a parent, and this turns out to be very inconvenient when transforming syntax trees. Each node can only have a single parent, and if you insert a Node that has a parent as a child of another Node, you get an InvalidOperationException. The compiler cannot help detect the problem, and correcting the problem requires that you first "detach" the Node from its parent. To make matters worse, detaching is an O(N) operation (in general) because the array of Nodes in the parent has to be shifted left (after the spot where the detached Node used to be); plus, if the Node's parent is frozen, detaching causes a ReadOnlyException. Now, you don't actually have to detach; you can clone the node instead, but cloning all over the place could also hurt performance.

The way I have defined Loyc trees also makes me slightly uncomfortable. Recall that a node essentially has three parts:
  1. The attribute list (may be empty)
  2. The head (one of: a Symbol, a literal Value, or a Head node)
  3. The argument list (optional, and if present, may be empty)
(There are also cosmetic and diagnostic parts, such as the node's location in a source file, but these three parts are the "semantically important" parts.)

A couple of things make me uncomfortable about #2, the head portion. First of all, a call such as "f(x)" normally consists of two nodes: the call "f(x)" is one Node, and the symbol "x" is another Node. "f" is not a Node but merely a Symbol. The advantage of this representation is that it saves memory, since a separate Node does not have to be allocated for "f", we only need memory for the Symbol "f", and this memory is shared every time "f" appears in any source file. But I can certainly imagine that you might want to do something with "just the head part" of a node, ignoring the argument list and the attributes; for example you might want to print "just the head part" as a string, and there is no convenient way to do this right now.

A second problem is that there is a weird oddity in the code right now, because the Head part can be a Node rather than a Symbol, and my code currently treats the Node "f" as equivalent to the Symbol "f". This makes me uncomfortable because Loyc trees are not fully round-trippable like they are supposed to be; if you take the tree "f(x)" where "f" is a Node, print it as text and re-parse it, the output is "f(x)" where "f" is a Symbol--a slightly different tree that prints out the same way.

An alternative way to interpret this case is that if a Node serves as a Head and has no attributes or arguments, it should be interpreted as being in parenthesis. In that case, when "f" is a Node serving as a Head, it must have parenthesis around it: "(f)(x)". The reason I did not take this route is because allowing "f" to be a Node (without parenthesis) allows it to have its own positioning information. So in the case of code like "2 + 3" (which is "#+(2, 3)" as a Loyc tree in prefix notation), if "#+" is a separate node then it can have its own positioning information, indicating that the "+" sign is two characters to the right of its parent node "2 + 3". On the other hand, this positioning information is perhaps not needed for all method calls, which seems to be an argument in favor of allowing simple Symbols to be heads; and we cannot simply remove the "Symbol" part of a Node, for how would we represent a Node that is just a Symbol?

All of the above issues are "design smells", but I am not yet confident about how to eliminate the stink.

Here are my thoughts so far. First of all, I am convinced that immutable nodes are more convenient to work with and should be the default choice. Mutable nodes--assuming that I keep support for mutable nodes at all--should have their own data type, but I'm not sure if it should be a completely separate data type or a derived class. Certainly, it's good if code that analyzes immutable nodes can work with mutable nodes "for free", although some code will want to know "for sure" that no other code will modify a tree. The common interface between immutable and mutable nodes could be placed in an INodeReader interface, which allows mutable and immutable nodes to be completely separate, but code that operates on INodeReader instead of a base class would be slower, and implementing INodeReader is a burden on me because the interface needs its own special argument and attribute lists (it must return lists of INodeReader instead of the actual node type).

I have an idea that I think would allow nodes to be cached and relocated, without the inconvenience that currently makes green nodes problematic. If this idea works, the green nodes will become the preferred form in which to manipulate syntax trees, while red nodes would only exist as a way to track parents.

Dealing with parenting has been annoying and I want the problem to go away. I'm thinking about getting rid of parents--and red nodes--completely, at least temporarily while I finish off LLLPG, the parser generator. Perhaps the parentable nodes (red nodes) could be added later as a special kind of node, derived from the same base class as the green nodes; the red nodes may be simple wrappers, consisting of a reference to a green node plus a reference to a parent. All the methods would simply be forwarded to the green node.

Changing the subject now, I need to find a new way to look at Loyc nodes. Clearly, all Loyc nodes can have attributes and position/width/style information, so it's natural to have a common base class with common data. Apart from this common information, nodes represent the following forms:
  • Just an identifier: simple_symbol
  • Call an identifier: simple_symbol()
  • Just a literal: "literal"
  • Call a literal: "literal"() (weird, but syntactically legal)
  • Call a complex head: Console.WriteLine(), Foo(x)(y), Foo<x>(y)
  • Node in parenthesis: (x + 1). I'd like to represent parens explicitly in order to more faithfully represent the input, even though Loyc trees are not designed to support everything the way Roslyn and NRefactory can; for example, there's no obvious way to support round-tripping of syntax errors.
If there were algebraic data types in C#, I could represent the head and arguments using something like
data NodeData = Symbol
              | Literal object
              | InParens Node
              | Call Node List<Node>
In this representation, 'Node' is position information, plus attributes, plus a NodeData. A "Call" consists of the thing being called, plus a list of arguments. Or we could use the LISP representation of a call:
data LispNode = Symbol
              | Literal object
              | InParens Node
              | Call List<Node>
In the LISP style, the thing being called is the first item in the List (and in LISP, the List is a linked list rather than an array.) The LISP style definitely has advantages, but I'm not sure I'm comfortable mixing the call target with the call's arguments like this.

So what should I do in C#? I can certainly map the ADT to C# with something like:
public class Node {
    /* position information and attributes go here */
}
public class SymbolNode : Node {
    public Symbol Name { get; }
    ...
}
public class LiteralNode : Node {
    public object Value { get; }
    ...
}
public class ParensNode : Node {
    public Node Child { get; }
    ...
}
public class CallNode : Node {
    public Node Head { get; }
    public IList Args { get; }
    ...
}
The issues with this mapping are:
  • C# doesn't have features for deconstructing nodes easily; there's no 'match' statement
  • Lots of run-time conversions are required: you'd be downcasting constantly, harming performance. You could use visitors though, avoiding some casts (I expect casting to have a significant penalty because the classes above would not be sealed--in fact they'd probably be abstract, to allow support for red nodes, repositioning to account for editing in an IDE, etc.--and casting is rumored to cost more when the target type does not exactly match the actual type). I've never been a fan of the visitor pattern by the way, but it's grown on me recently as I finally grok it. The visitor pattern is brittle with respect to changes in the class hierarchy, but that's not really a problem for Loyc trees, because fundamentally new kinds of nodes should never be needed.
  • When you add red nodes and other wrapper classes, the total number of classes could be quite high.
  • A simple call such as "f()" will require two nodes, whereas the current implementation only needs a single node. It occurs to me, however, that the child node "f" could be implemented as a temporary object, constructed on-demand and immediately discarded (calling the Name property rather than Target would avoid constructing the temporary object).
  • Currently, a mutable node can be converted in-place between any of the possible node types, e.g. you can simply write node.IsCall = true to add an empty argument list. That kind of thing is not possible when we use these separate classes.
I guess I should keep in mind that EC# will support deconstruction and pattern-matching when it is complete, and the current mutation abilities are not very important to keep. The performance implications still matter, however, and I always love to save memory.