Saturday, April 20, 2013

The Loyc tree and prefix notation in EC#

Update: the Loyc tree code has been rewritten and the original concept has been modified from the concept described in this post (for example, the red/green tree concept has been dropped for the sake of simplicity and other reasons).

As I designed the macro system for Enhanced C#, I thought a lot about what kind of syntax tree would be appropriate--not just for EC# but for tools that could process multiple languages. One of Loyc's goals is conversion between languages, and another goal is to support multiple language syntaxes (perhaps each with a macro system!) in a single project. A third goal is to assist IDE tools (in particular, to be compatible with incremental parsing). And, as always, I wanted a data structure that is fast and uses memory efficiently.

I considered using LISP-style lists, but they didn't fit into C# very well. In particular I didn't feel that they were the right way to model complex declarations such as
 [Serializable] struct X<T> : IX<T> where T : class, Z { ... }
Besides, LISP lists don't hold any metadata such as the original source file and line number of a piece of code. And how would I address the needs of IDEs--incremental parsing, for example? So I fiddled around with different ideas until I found something that felt right for C# and other languages descended from Algol. Eventually I decided to call my idea the "Loyc node" or "Loyc tree". The Loyc tree is really two independent ideas that I developed concurrently: one idea for the concept of the tree as viewed by an end-user (i.e. other developers), and another idea for how the implementation works.

The concept of a Loyc node involves three main parts: a "head", an "argument list" and "attributes". The implementation involves two parallel trees, "red" and "green", inspired by a similar idea in Microsoft Roslyn.

A Loyc tree is basically "XML for code"--but wait a minute, I hate XML. Maybe "JSON for code"? Well, just as all XML elements have the same structure, all Loyc nodes have the same structure, also. Just as there is a clear and obvious mapping from XML to the code representation of XML (such as an XML DOM or XElement), the mapping from EC# to a Loyc tree is similarly transparent, when the EC# code is written in prefix notation.

In EC# prefix notation, the above struct "X<Y>" looks like this:
 [Serializable]
 #struct(#of(X, [#where(#class, Z)] T),
         #(#of(IX, T)),
         #{}( ... ));
But wait, let's start with some simpler examples:
 x += percent * 0.01;       // normal EC#
 #+=(x, #*(percent, 0.01)); // prefix notation
 
 public int X;              // normal EC#
 [#public] #var(#int, X);   // prefix notation
 
 using System.Collections.Generic;              // normal
 #import(#.(#.(System, Collections), Generic)); // prefix
     // (#using is reserved for the using(...) statement)

 if (condition) { y(); }    // normal EC#
 @#if(condition, #{}(y())); // prefix notation
 
 Foo(x, y, z);              // normal EC#
 Foo(x, y, z);              // prefix notation
In the pairs above, the EC# parser produces the same syntax tree for the two lines in each pair. EC# accepts prefix notation anywhere that a normal statement or expression is allowed. Prefix notation is based on function-call notation, as you can clearly see in the last example.

Note: the "@" in "@#if" avoids confusion between if-statements and the preprocessor directive "#if". The preprocessor in EC# is obsolete and not necessary, but it still exists. I wanted to use the "#" character to indicate a "special name" in prefix notation, but somehow we must avoid conflicts with the preprocessor. The parser converts the identifier "@#if" to "#if" internally, just as "@foo" actually means "foo" in plain C#.

The Loyc syntax tree

In most compilers, the syntax tree is very strongly typed, with separate classes or data structures for, say, variable declarations, binary operators, method calls, method declarations, unary operators, and so forth. Loyc, however, only has a single data type, Node, for all nodes*. There are several reasons for this:
  • Simplicity. Many projects have thousands of lines of code dedicated to the AST (abstract syntax tree) data structure itself, because each kind of AST node has its own class. Simplicity means I write less code and users learn to use it faster.
  • Serializability. Loyc nodes can always be serialized to a plain text "prefix tree" and deserialized back to objects, even by programs that are not designed to handle the language that the tree represents**. This makes it easy to visualize syntax trees or exchange them between programs.
  • Extensibility. Loyc nodes can represent any programming language imaginable, and they are suitable for embedded DSLs (domain-specific languages). Since nodes do not enforce a particular structure, they can be used in different ways than originally envisioned. For example, most languages only have "+" as a binary operator, that is, with two arguments. If Loyc had a separate class for each AST, there would probably be a PlusOperator class derived from BinaryOperator, with a LeftChild and a RightChild. But since there is only one node class, a "+" operator with three arguments is easy; this is denoted by #+(a, b, c) in EC# source code. The EC# compiler won't understand it, but it might be meaningful to another compiler or to a macro.
* In fact, there are a family of node classes, but this is just an implementation detail.
** Currently, the only supported syntax for plain-text Loyc trees is EC# syntax, either normal EC# or prefix-tree notation.

EC# syntax trees are stored in a universal format that I call a "Loyc tree". All nodes in a Loyc tree consist of up to four parts:
  1. An attribute list (the Attrs property)
  2. A Value
  3. A Head or a Name (if a node has a Head, Name refers to Head.Name)
  4. An argument list (the Args property)
The EC# language does not allow (2) and (3) to appear together (specifically, a Value can only be represented in source code if the Name is "#literal"). Therefore, you can think of Value, Head and Name as a discriminated union known as "the head part of a node". There is no easy and efficient way to represent a discriminated union in .NET, so all five properties (Attrs, Value, Head, Name, Args) are present on all nodes.

As you've seen, almost any Loyc node can be expressed in EC# using either "prefix notation" or ordinary code. The basic syntax of prefix notation is
 [attributes] head(argument_list)
where the [attributes] and (argument_list) are both optional, and the head part could be a simple name. For example, the EC# statement
 [Foo] Console.WriteLine("Hello");
is a single Node object with three children: Foo, Console.WriteLine, and "Hello". Foo is an attribute, Console.WriteLine is a Head, and "Hello" is an argument. Each of these children is a Node too, but neither Foo nor "Hello" have children of their own. The Head, Console.WriteLine, is a Node named "#." with two arguments, Console and WriteLine. The above statement could be expressed equivalently as
 [Foo] #.(Console, WriteLine)("Hello");
This makes its structure explicit, but the infix dot notation is preferred. Finally, Console and WriteLine are nodes that only have a Name (no Attrs, no Args, no Head, no Value).

Conceptually, Loyc trees have either a Head node or a Name symbol but not both. Foo, Console, WriteLine, and #. are all node names, while Console.WriteLine is a head node. However, you can always ask a node what its Name is; if the node has a Head rather than a Name, Name returns Head.Name. Thus, #. is the Name of the entire statement.

Attributes can only appear at the beginning of an expression or statement. Use parenthesis to clarify your intention if necessary, but please note that parenthesis are represented explicitly in the syntax tree, not discarded by the parser. Parenthesis cause a node to be inserted into the head of another node, so
 (x())
is a node with no arguments, that has a Head that points to another node that represents x(). Attributes have lower precedence than everything else, so
 [Attr] x = y;
associates the attribute Attr with the "#=" node, not with the "x" node.

Unlike C# attributes, EC# attributes can be any list of expressions, and do not imply any particular semantics. You can attach any expression as an attribute to any other statement or expression, e.g.
 [4 * y << z()]
 Console.WriteLine("What is this attribute I see before me?");
When the time comes to generate code, the compiler will warn you that it does not understand what the hell "4 * y << z()" is supposed to mean, but otherwise this statement is legal. Attributes serve as an information side-channel, used for instructions to macros or to the compiler. Macros can use attributes to receive information from users, to store information in a syntax tree temporarily, or to communicate with other macros.

You can mix prefix notation with "normal" EC# in various ways. For example, all of the following versions of "MulDiv" are equivalent:
 int MulDiv(int a, int b, int c) { 
     return (int)((long)a * b / c); 
 }
 int MulDiv(int a, int b, int c) { 
     #return((int)(#cast(a, long) * b / c)); 
 }
 #def(#int, MulDiv, #(int a, int b, int c), { 
     return (int)((long)a * b / c); 
 });
 #def(#int, MulDiv, #(#var(#int, a), #var(int, b), #var(int, c)), 
     #{}(#return(
         #cast((#/(#*(#cast(a, #long), b), c)), #int)
     )) 
 );

Statement/expression equivalence

An important feature of EC# is that the difference between "statements" and "expressions" is only syntactic, not semantic, i.e. the difference is only skin-deep. Any EC# expression can be a statement, and any statement can be written in the form of an expression. The EC# parser needs to know whether to expect expressions or statements in a given location, but once a syntax tree is built, the distinction between statements and expressions disappears. Here is an example that mixes statements and expressions:
 if (foo.Bar > 0 && #{
     var stats = GatherStatistics(foo);
     stats.Avg > foo.Bar && stats.StdDev < 1.0
 })
 {
     Frobnicate(foo, stats);
 }
Roughly, "#{" causes the parser to switch to statement mode, allowing you to write statements where an expression is expected. Of course, plain C# doesn't work this way. Therefore, the conversion to plain C# can sometimes be messy and difficult.

The EC# parser also does not distinguish between "executable places" and "places for declarations". For example, this code can be parsed:
    Console.WriteLine("This statement is illegal!");
    using System.Collections.Generic;

    void foo() {
        WriteLine("This statement makes more sense!");

        int Ecks {
            int x = 0;
            get { x }
            set { x = value; }
            namespace WTF {
                Console.WriteLine("This is pretty weird!");
            }
        }
    }
Although the parser understands this code, the EC# compiler (as a whole) would reject it, because code cannot run outside of methods (except macros), and namespaces cannot be located inside methods or properties.

Note: EC# is not usable yet. This post exists because I want to publish the idea of Loyc trees. I'll update this when the parser is ready. But once the parser is ready, you will still not be able to write code in EC#. Making a programming language by myself is a long process!

Thursday, April 18, 2013

Loyc and Nemerle

I created this blog to be about Loyc, the Language of Your Choice project, which is about making general tools for creating and extending programming language, converting code between languages, and creating tools for IDEs. However, quite simply Loyc is too big a project for one person. I couldn't find the time for it, and I couldn't quite figure out how to make an extensible programming system.

Last August I decided to reboot with something much more modest: to design and implement a series of improvements to C#. That way I hoped to gain enough experience to do something bigger, later. To increase my chance of success, I changed to a part-time worker at work, leaving more time for Loyc.

I was going to start by modifying NRefactory, and my "Enhanced C#" would simply compile down to plain C#. I couldn't figure out how to accomplish my end goal--a fully extensible language--so I decided to simply improve the existing C# language with a series of specific and modest features such as the "null dot" or safe navigation operator (now denoted ??.), the quick-binding operator (::), forwarding clauses, return value covariance, C++-style templates, "static if", blah blah blah.

However, once I was done drafting the new language, I noticed that despite all the work I'd put into merely the draft alone, the new language still didn't address one of the most requested features from C# users: "Provide a way for INotifyPropertyChanged to be implemented for you automatically on classes".

INotifyPropertyChanged is a simple interface for allowing code to subscribe to an object to find out when it changes. For example:
interface INotifyPropertyChanged {
 event PropertyChangedEventHandler PropertyChanged;
}
class Person : INotifyPropertyChanged {
 public event PropertyChangedEventHandler PropertyChanged;
 
 string _name;
 public string Name
 {
  get { return _name; }
  set {
   if (_name != value) {
    _name = Value;
    Changed("Name");
   }
  }
 }
 string _address;
 public string Address
 {
  ... same thing ...
 }
 DateTime _dateOfBirth;
 public DateTime DateOfBirth
 {
  ... same thing ...
 }
 void Changed(string prop)
 {
  if (PropertyChanged != null)
   // The .NET "EventArgs" concept is stupid, but I digress
   PropertyChanged(this, new PropertyChangedEventArgs(prop));
 }
}
The problem is that you need 10 new lines of code for every new property in your class. Couldn't the language have some kind of feature to make this easier? But I didn't want to have a feature that was specifically designed for INotifyPropertyChanged and nothing else. Besides, there are different ways that users might want the feature to work: maybe some people want the event to fire even if the value is not changing. Maybe some people want to do a reference comparison while others want to do object.Equals(). Maybe it can't be fully automatic--some properties may need some additional processing besides just firing the event. How could a feature built into the language be flexible enough for all scenarios?

I decided at that point that it was time to learn LISP. After studying it briefly, I figured out that what would really help in this case is macros. Macros would allow users to provide a code "template" that is expanded for each property. To make a medium-length story short, I came up with an interchange format for syntax trees called "Loyc trees", and I dropped my EC# 1.0 draft--which was never made--to create EC# 2.0; many of the features of EC# 1.0 could be accomplished with macros anyhow. But since I wasn't satisfied with any of the readily-available parser generators, I also decided to make my own LL(k) parser generator. The initial version is about 75% complete right now (no, no, just the parser generator, sorry); the main missing features are syntactic "and" predicates (an optional feature that provides unlimited lookahead), and the ability for the parser generator to parse its own input (I'm using C# operator overloading for now.)

So the other day I finally started writing the parser for EC# 2.0, which would also be the parser generator's native language... when I rediscovered Nemerle.

I'd looked at Nemerle a few years back, but it didn't seem very impressive at the time; it just appeared to be a C#-like language with some type inference, algebraic data types, and pattern matching (and my memory is bad but I think it mainly targeted Linux (Mono)). Those are all good things, but my interest has always been in extensibility. I never knew until now that in fact Nemerle already has LISP-style macros (well, not quite LISP-style: a Nemerle macro cannot exist in the same project as the code to which it is applied; but that's okay because EC# wasn't going to support such an advanced feature either (not at first, although I want to support it eventually).)

I don't know whether Nemerle has always been extensible, and I simply didn't realize it, or whether macros are a new feature. In either case, this discovery will certainly impact my development of Loyc and EC#. In principle, Nemerle is so similar to EC# that I should perhaps even halt development of EC# and just use Nemerle.

My first impression is that Nemerle's macros seem to lack the simplicity of my design, but they are clearly more powerful. So far, I have only figured out how to support syntactic macros in EC#. I haven't figured out how to allow a macro to access semantic knowledge of a program--to look up types and members, or to introspect the "meaning" of some code and then to modify that code. Nemerle seems to have multiple "flavors" of macros that can accomplish different things, which is very exciting. Another neat fact is that both myself and Nemerle's people decided to use what I call a "tree parser"--the lexer output is grouped by parenthesis, square brackets and braces, before it is sent to the main parser. This makes it easier to support extensibility (although EC#'s parser will not be extensible), and makes it easy for the parser to look beyond anything in parentheses.

So where do I go from here? I'm not sure yet. I'll see what I can learn about the Nemerle compiler and macro system, before I can decide on a course of action. I am particularly interested in the base language--the language before any macros are added--but I have found only one page on that topic, and the beginning of that article is hard to understand. Oh well. (Argh! They keep saying that basic features of the language such as the "if" expression are defined by macros, but since Nemerle has no "goto" statement, I can't imagine how the "if" expression could decompose into anything simpler.) I also found this "extended course" in macros, and other information.

Wikipedia's Nemerle page claims that "the core developers of Nemerle were hired by the Czech software development company JetBrains." However, the core developers appeared to be Michał Moskal and Kamil Skalski, and Google seemed to indicate that those two worked for Microsoft and Google respectively, not JetBrains. So I contacted Michał, who told me that neither of them were working on Nemerle! He said there "were a few people, mostly from Russia, still working on it." But who exactly is left? I have no idea.

The webpage is in disarray. The first time I tried to download Nemerle, I found the download link on the front page of the wiki, but this points to downloads that are 2 years old, making me wonder if Nemerle was dead (I found the correct download link later). I wanted to talk to Nemerle's developers, but the link to the "Forum" on Nemerle's front page was broken, so next I found this contact page, but the links to the mailing list archives do not work, and the subscription link requires login credentials. Finally I found this "news" page, which hasn't been updated since 2004! Finally, Michał pointed me to Nemerle's Google Group, but clicking "New Topic" caused the error "An error occurred while communicating with the server." Yikes!

Finally, I was able to ask a couple of questions about Nemerle--but so far, the only response is from someone that cannot understand my English. It turns out that the first language of Nemerle is Russian! Could it be that I won't be able to work with the Nemerle people due to a language barrier?

In conclusion, we should all learn Esperanto or something.
Por finiĝi, ni ĉiuj devus lerni Esperanton aŭ ion.

(I used to be very interested in the language boo, but the boo people wouldn't talk to me, and in general did not document boo's advanced features at all. I don't know if the Nemerle people will talk to me, but at least Nemerle offers far more articles about its features than boo ever did, even if you only count the English articles (note, it's been a couple of years now since I checked on boo).)