Tuesday, December 17, 2013

Bogus ambiguity warnings in LLLPG

Sometimes LLLPG gives me a warning that I just can't fathom. I'm currently writing the parser for Enhanced C#, and I got a very puzzling warning:
using TT = TokenType;

partial class EcsParser {
[DefaultK(3)]
LLLPG parser(laType(TokenType), matchType(int), setType(HashSet!int))
{
   alias("(" = TT.LParen);
   alias(")" = TT.RParen);
   alias("[" = TT.LBrack);
   alias("]" = TT.RBrack);
   alias("." = TT.Dot);
   alias("!" = TT.Not);
   alias("<" = TT.LT);
   alias(">" = TT.GT);
   alias("," = TT.Comma);

   // Complex identifier, e.g. accepts x.y or x but rejects x
   token ComplexId() @[
      IdAtom 
      RestOfId()
   ];
   rule IdAtom @[
      (TT.Id|TT.TypeKeyword)
   ];
   rule RestOfId() @[
      // Warning: Optional branch is ambiguous for input such as «TT.Dot TT.Id _»
      // ((TT.Dot|TT.Not|TT.LT), (TT.Id|TT.TypeKeyword|TT.LParen|TT.LBrack|TT.GT), ~()) 
      TParams()?
      DotRestOfId()?
   ];
   rule DotRestOfId() @[
      "." IdAtom
      RestOfId()
   ];
   rule TParams() @[ // type parameters
      ( "<" (ComplexId ("," ComplexId)*)? ">"
      // Also allow Nemerle-style type parameters (e.g. Dictionary.[int,string])
      | "." "[" "]"
      // D-style type parameters (Dictionary!(int, string))
      //| "!" "(" ")"
      )
   ];
}};
I actually started with a 600-line parser and slowly reduced it to this minimal test case. In fact the grammar is unambiguous, so why does this warning occur? (and by the way, LLLPG's example case varies depending on small details of the grammar; at first it always said «TT.LT TT.GT _» but would give some other example if the aliases were removed.) If you can't figure it out, I don't blame you; I actually attached a debugger to see LLLPG's internal state at the time of the warning, and I couldn't figure it out either.

The generated code is wrong for RestOfId(), but correct for the other rules. How is it wrong?

Notice that the code for RestOfId(), as written, accepts ".>" or "<[" as a way of starting TParams(); this isn't really why the rule is wrong though, because ".>" or "<[" are invalid inputs, so it isn't super important which path is taken for those cases. This illustrates what LLLPG is "thinking" though: it has "mixed together" the two arms of TParams, "<" (ComplexId ("," ComplexId)*)? ">" and "." "[" "]". If it sees "<" (from arm 1 of TParams) and then "[" (from arm 2 of TParams), it will call TParams.
void RestOfId()
{
   TokenType la0, la1;
   la0 = LA0;
   if (la0 == TT.Dot || la0 == TT.LT) {
      switch (LA(1)) {
      case TT.LBrack:
      case TT.GT:
      case TT.Id:
      case TT.TypeKeyword:
         TParams();
         break;
      }
   }
   la0 = LA0;
   if (la0 == TT.Dot) {
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword)
         DotRestOfId();
   }
}
The real problem happens when it sees "." (from arm 2 of TParams) followed by TT.Id (from arm 1 of TParams). In that case RestOfId() should skip TParams() and call DotRestOfId instead. But no, it calls TParams.

This explains why the warning occurs. If, inside RestOfId, LLLPG does not separate the logic for the two arms of TParams, LLLPG gets confused and thinks that "." followed by an identifier is a valid input for TParams as well as DotRestOfId. Therefore it can't decide which rule to call, prints a warning, and (because it is greedy by default) calls TParams.

This is exactly the same sort of problem that you used to get in ANTLR 2 with ANTLR's linear approximate lookahead. Although Terrance Parr (author of ANTLR) clearly documented how it works and what it means, users (even me!) find its behavior unexpected and confusing.

By default, LLLPG's lookahead is more powerful than linear approximate lookahead; the example given by Terrance, (A B | C D) | A D, is no problem for LLLPG. I could go into some detail about how my lookahead differs from ANTLR 2, but I think most readers just want to know "how do I fix this" and the answer is simple: add the [FullLLk] attribute, either just to the rule that is giving you trouble, or the entire grammar (the "LLLPG" statement):
partial class EcsParser {
[DefaultK(3)]
LLLPG parser(laType(TokenType), matchType(int), setType(HashSet!int))
{
   ...
   [FullLLk] // eliminates the bogus warning, and fixes the generated code
   rule RestOfId() @[
      TParams()?
      DotRestOfId()?
   ];
   ...
}};
The FullLLk option enables "true" LL(k) prediction, and forces LLLPG to analyze sub-decisions independently. So the decision of whether to call TParams requires not one, but two separate sequences of tests, for the two branches inside TParams:
void RestOfId()
{
   TokenType la0, la1, la2;
   la0 = LA0;
   if (la0 == TT.LT) { // branch 1 of TParams
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword) {
         switch (LA(2)) {
         case TT.GT:
         case TT.Comma:
         case TT.Dot:
         case TT.LT:
            TParams();
            break;
         }
      } else if (la1 == TT.GT)
         TParams();
   } else if (la0 == TT.Dot) { // branch 2 of TParams
      la1 = LA(1);
      if (la1 == TT.LBrack) {
         la2 = LA(2);
         if (la2 == TT.RBrack)
            TParams();
      }
   }
   la0 = LA0;
   if (la0 == TT.Dot) {
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword)
         DotRestOfId();
   }
}
Full LL(k) mode doesn't always work perfectly, and may make LLLPG run slower, which is why it is not enabled by default. But usually, it works fine. If you still get the same ambiguity warning after enabling Full LL(k), check over your grammar carefully, because it is probably genuinely ambiguous.

Update: By the way, I later noticed that ComplexId, RestOfId and DotRestOfId can be combined into a single rule:
[FullLLk]
token ComplexId() @[
    IdAtom
    ( "::" IdAtom )? // omitted from simplified grammar
    TParams()?
    (   "." IdAtom
        TParams()?
    )*
];