Monthly Archives: December 2013

Antlr 4 with C# and Visual Studio 2012

It’s been more than a year since I posted anything to Programming Pages, so I figure I should rectify that (been busy mainly with my alter-ego, Physics Pages).

Recently, I had another look at the parsing package Antlr and discovered that a whole new version has come out (Antlr 4), along with a book The Definitive Antlr 4 Reference, written by Antlr’s author, Terence Parr. However, as with Antlr 3, the book is written exclusively in Java, so a fair bit of detective work is needed to discover how to use it with C# in Visual Studio. The code required both to specify the lexer and parser, and to write the supporting C#, has pretty well completely changed from Antlr 3, so a fresh tutorial is needed.

Installation in Visual Studio

Support for Antlr4 exists for Visual Studio 2012 (and presumably 2010, although I no longer have this installed so I can’t test it), but not, it seems, for Visual Studio 2013. To get the files you need, visit the Antlr download page and get the ‘ANTLR 4 C#.dll’ zip file. Next, click on the ‘C# releases’ link. On the page that loads, scroll down to ‘Visual Studio 2010 Extensions’ and click on the ‘ANTLR Language Support’ link. ‘Download’ the Visual Studio plugin from that page and install it, making sure you’ve selected the correct version of Visual Studio in the dialog.

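To incorporate Antlr 4 into a VS project, create a new project in VS. In the project directory, create a folder named Reference and within the Reference folder create another folder called Antlr4. Unzip the entire contents of the ‘ANTLR 4 C#.dll’ zip file into the Antlr4 folder. You’ll need to do this for each project you create. Note that the paths added to the .csproj file below begin with $(ProjectDir)..\, so they expect the Reference folder to sit one level above the folder that holds the .csproj file (normally the solution folder). If you put the folder somewhere else, adjust those paths to match.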

Then, in Solution Explorer, right-click on the project and select ‘Unload project’. Then right-click on the project name (it’ll say ‘unavailable’ after it, but that’s OK) and select ‘Edit <Your project name>.csproj’. Scroll to near the end of the file, where you should find the line:

  <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />

Insert the following lines directly after this line:

  <PropertyGroup>
    <!-- Folder containing Antlr4BuildTasks.dll -->
    <Antlr4BuildTaskPath>$(ProjectDir)..\Reference\Antlr4</Antlr4BuildTaskPath>
    <!-- Path to the ANTLR Tool itself. -->
    <Antlr4ToolPath>$(ProjectDir)..\Reference\Antlr4\antlr4-csharp-4.0.1-SNAPSHOT-complete.jar</Antlr4ToolPath>
  </PropertyGroup>
  <Import Project="$(ProjectDir)..\Reference\Antlr4\Antlr4.targets" />

Then right-click on the project name and select ‘Reload project’.

Finally, you’ll need to add a reference to the Antlr4 runtime in your project to provide access to the API. Right-click on References, select ‘Add Reference’, then browse to the Antlr4 folder you created above and pick the correct version of the runtime dll. The version should match the version of .NET that you’re using in your project, so if you’re using .NET 4.5, load the file ‘Antlr4.Runtime.v4.5.dll’. Now your project should be all set.

One final note: you’ll need Java to be installed in order to run Antlr4! This is true even if you’re coding in C#, since the Antlr4 tool is itself written in Java, and the build runs it to generate the code for your lexer and parser.

Writing an Antlr 4 grammar

You should now be in a position to start work on the actual code. If you installed the VS extension above, you can add an Antlr4 grammar to your project by right-clicking on the project name and selecting Add –> New Item. In the dialog you should see 7 Antlr items; 3 of these are for Antlr 4 files, with the other 4 being for Antlr 3. Since we’ll be working with both a lexer and a parser, select ‘ANTLR 4 Combined Grammar’, change the name of the file to whatever you like, and click ‘Add’.

As an illustration, we’ll generate a grammar for the good old four-function calculator, so we’ll call the project Calculator. In Solution Explorer you should find a file called Calculator.g4; this is where you write your parser and lexer rules.  The initial code in this file looks like this:

grammar Calculator;

@parser::members
{
	protected const int EOF = Eof;
}

@lexer::members
{
	protected const int EOF = Eof;
	protected const int HIDDEN = Hidden;
}

/*
 * Parser Rules
 */

compileUnit
	:	EOF
	;

/*
 * Lexer Rules
 */

WS
	:	' ' -> channel(HIDDEN)
	;

Don’t worry about the @parser::members and @lexer::members blocks at the top. What we’re interested in are the parser and lexer rules. The syntax for these has changed significantly from Antlr 3, to the extent that any grammar files you may have written for the earlier version very probably won’t work in Antlr 4. However, the new syntax is much closer to the more standard ways of representing these rules in other systems like lex and yacc (if you’ve never heard of these, don’t worry; we won’t be using them).

Here’s the grammar file as modified for our calculator:

grammar Calculator;

@parser::members
{
	protected const int EOF = Eof;
}

@lexer::members
{
	protected const int EOF = Eof;
	protected const int HIDDEN = Hidden;
}

/*
 * Parser Rules
 */

prog: expr+ ;

expr : expr op=('*'|'/') expr   # MulDiv
     | expr op=('+'|'-') expr   # AddSub
     | INT                      # int
     | '(' expr ')'             # parens
     ;

/*
 * Lexer Rules
 */
INT : [0-9]+;
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
WS
	:	(' ' | '\r' | '\n') -> channel(HIDDEN)
	;

First, look at the lexer rules. The INT token is defined using the usual regular expression for one or more digits. The four arithmetic operators are given names that we’ll use later. Finally, we’ve modified the WS (whitespace) rule so that it matches blanks, returns and newlines. The ‘-> channel(HIDDEN)’ sends these whitespace tokens to a hidden channel that the parser doesn’t read, so in effect whitespace is ignored.
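To make the lexer rules concrete, here is a minimal hand-written sketch of the same tokenization using a regular expression. It is not the code Antlr generates, and the names SketchLexer and Tokenize are invented for this illustration. It only approximates what the real lexer does (it tries the rules in order at each position), but for these six rules the resulting tokens should be the same.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hand-written stand-in for the generated lexer; NOT Antlr code.
// SketchLexer and Tokenize are hypothetical names used only in this sketch.
static class SketchLexer
{
    // One named group per lexer rule, anchored (\G) at the current position.
    private static readonly Regex TokenPattern = new Regex(
        @"\G(?:(?<INT>[0-9]+)|(?<MUL>\*)|(?<DIV>/)|(?<ADD>\+)|(?<SUB>-)|(?<WS>[ \r\n]))");

    private static readonly string[] RuleNames = { "INT", "MUL", "DIV", "ADD", "SUB", "WS" };

    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            Match m = TokenPattern.Match(text, pos);
            if (!m.Success)
                throw new FormatException("Unexpected character at position " + pos);

            foreach (string rule in RuleNames)
            {
                if (m.Groups[rule].Success)
                {
                    // Whitespace is recognized but not passed on, which is
                    // roughly the effect of the hidden channel.
                    if (rule != "WS")
                        tokens.Add(rule + ":" + m.Value);
                    break;
                }
            }
            pos += m.Length; // every rule consumes at least one character
        }
        return tokens;
    }
}
```

Calling SketchLexer.Tokenize("12 + 3*4") returns the five tokens INT:12, ADD:+, INT:3, MUL:* and INT:4, with the blanks dropped.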

Now look at the parser rules. The first rule is ‘prog’, which is defined as one or more ‘expr’s. The ‘expr’ is defined as a single INT, or a single expr enclosed in parentheses, or two exprs separated by one of the four arithmetic operators.
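The order of the alternatives in ‘expr’ matters: when more than one alternative could apply, Antlr 4 prefers the one listed first, so MulDiv binds more tightly than AddSub. To show the precedence this is meant to produce, here is a small hand-written recursive-descent evaluator for a single expression. It is a different technique from the one Antlr uses, it is not generated code, and SketchEvaluator is a hypothetical name; treat it as a way of checking your expectations rather than as part of the project.

```csharp
using System;

// Hand-written recursive-descent evaluator for the same expression language.
// NOT what Antlr generates; SketchEvaluator is a hypothetical name.
class SketchEvaluator
{
    private readonly string text;
    private int pos;

    private SketchEvaluator(string text) { this.text = text; }

    public static int Evaluate(string text)
    {
        var e = new SketchEvaluator(text);
        int value = e.AddSub();
        e.SkipSpaces();
        if (e.pos != e.text.Length)
            throw new FormatException("Unexpected input at position " + e.pos);
        return value;
    }

    // Lowest precedence: '+' and '-' combine complete MulDiv results.
    private int AddSub()
    {
        int value = MulDiv();
        while (true)
        {
            if (Accept('+')) value += MulDiv();
            else if (Accept('-')) value -= MulDiv();
            else return value;
        }
    }

    // Higher precedence: '*' and '/' are applied before AddSub sees the result.
    private int MulDiv()
    {
        int value = Atom();
        while (true)
        {
            if (Accept('*')) value *= Atom();
            else if (Accept('/')) value /= Atom();
            else return value;
        }
    }

    // An integer literal or a parenthesized expression.
    private int Atom()
    {
        if (Accept('('))
        {
            int inner = AddSub();
            if (!Accept(')'))
                throw new FormatException("Expected ')' at position " + pos);
            return inner;
        }
        SkipSpaces();
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (start == pos)
            throw new FormatException("Expected a number at position " + pos);
        return int.Parse(text.Substring(start, pos - start));
    }

    private bool Accept(char c)
    {
        SkipSpaces();
        if (pos < text.Length && text[pos] == c) { pos++; return true; }
        return false;
    }

    private void SkipSpaces()
    {
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
    }
}
```

With this sketch, SketchEvaluator.Evaluate("1+2*3") gives 7 and SketchEvaluator.Evaluate("(1+2)*3") gives 9, which is the behaviour the Antlr grammar and visitor should reproduce.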

Note that each line in the expr declaration has a label preceded by a # symbol at the end. These are not comments; rather they are tags that are used in writing the code that tells the parser what to do when each of these expressions is found.

This is where the biggest difference between Antlr 3 and Antlr 4 occurs: in Antlr 4, there is no code in the target language (C# here) written in the grammar file. All such code is moved elsewhere in the program. This makes the grammar file language-independent, so once you’ve written it you could drop it into another project and use it to support a Java application, or any other language for which Antlr 4 is defined.

Using the grammar in a C# program

So how exactly do you use the grammar to interpret an input string? Before we can use the lexer and parser, we need to write the code that should be run when each type of expression is parsed from the input. To do this, we need to write a C# class called a visitor. Antlr 4 has provided a base class for your visitor class, and it is named CalculatorBaseVisitor<Result>.  It is a generic class, and ‘Result’ is the data type that is returned by the various methods inside the class. Your job is to override some or all of the methods in the base class so that they run the code you want in response to each bit of parsed input.

In our case, we want the calculator to return an int for each expression it calculates, so create a new class called (say) CalculatorVisitor and make it inherit CalculatorBaseVisitor<int>. Your skeleton class looks like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Calculator
{
  class CalculatorVisitor : CalculatorBaseVisitor<int>
  {
  }
}

To see what methods you need to override, it’s easiest to use VS’s IntelliSense. Within the class, type the keywords ‘public override’, after which IntelliSense should pop up with a list of methods you can override. Look at the methods that start with ‘Visit’. Among these you should find a method for each label you assigned in the ‘expr’ definition in the grammar file above. Thus you should have VisitInt, VisitParens, VisitMulDiv and VisitAddSub. These are the methods you need to override. The other Visit methods you can ignore, as the versions provided in the base class will work fine.

Here’s the complete class. We’ll discuss the code in a minute:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Calculator
{
  class CalculatorVisitor : CalculatorBaseVisitor<int>
  {
    public override int VisitInt(CalculatorParser.IntContext context)
    {
      return int.Parse(context.INT().GetText());
    }

    public override int VisitAddSub(CalculatorParser.AddSubContext context)
    {
      int left = Visit(context.expr(0));
      int right = Visit(context.expr(1));
      if (context.op.Type == CalculatorParser.ADD)
      {
        return left + right;
      }
      else
      {
        return left - right;
      }
    }

    public override int VisitMulDiv(CalculatorParser.MulDivContext context)
    {
      int left = Visit(context.expr(0));
      int right = Visit(context.expr(1));
      if (context.op.Type == CalculatorParser.MUL)
      {
        return left * right;
      }
      else
      {
        return left / right;
      }
    }

    public override int VisitParens(CalculatorParser.ParensContext context)
    {
      return Visit(context.expr());
    }
  }
}

First, notice that the argument of each method (called ‘context’) is different in each case, and corresponds to the tag used to define the method. Each ‘context’ object contains information required to evaluate it, and this content can be determined by how you defined the various expr lines in the grammar file above. (BTW, you may need to rebuild the project to get VS’s Intellisense to work here.)

Take the VisitInt() method first. This is called when the object being parsed is a single integer, so what you want to return is the value of this integer. We can get this by calling the INT() method of the context object, and then GetText() from that. This returns the integer as a string, so we need to use int.Parse to convert it to an int.
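One caveat with int.Parse: the INT rule accepts any number of digits, and int.Parse throws an OverflowException if the text doesn’t fit in a 32-bit int. A minimal sketch of a more defensive conversion is shown here, assuming you would rather report one clear error. The helper class IntText is hypothetical and is not part of Antlr or of the visitor above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of Antlr or of the original visitor.
static class IntText
{
    // Converts the text of an INT token to an int, reporting a single
    // kind of error when the digits are malformed or do not fit in 32 bits.
    public static int Parse(string digits)
    {
        int value;
        // NumberStyles.None allows decimal digits only: no sign, no spaces.
        if (int.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out value))
            return value;
        throw new FormatException("Integer literal out of range or malformed: " + digits);
    }
}
```

Inside VisitInt() you could then write IntText.Parse(context.INT().GetText()) in place of the int.Parse call.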

Now look at the VisitParens() method at the bottom. Here, the contents of the parentheses could be an expr of arbitrary complexity, but we want to return whatever that expr evaluates to as the result. This is what the inherited Visit() method does: it takes an expr as an argument and calls the correct method depending on the type of the expr. The expr() method being called from the ‘context’ returns this expr (which will have a tag attached to it to say whether it’s an Int, or an AddSub, or whatever) and Visit will call the correct method to evaluate this expr.
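The way Visit() finds the right method is a form of double dispatch: the visitor hands itself to the node, and the node calls back the method that matches its own type. Here is a self-contained sketch of that mechanism. The classes Node, IntNode, ParensNode and SketchVisitor are simplified, hypothetical stand-ins for the generated context classes and base visitor, not the real Antlr types.

```csharp
using System;

// Simplified stand-ins for the generated context classes and base visitor.
// All names here are hypothetical and exist only for this sketch.
abstract class Node
{
    // Each node type knows which visitor method handles it.
    public abstract T Accept<T>(SketchVisitor<T> visitor);
}

class IntNode : Node
{
    public readonly string Text;
    public IntNode(string text) { Text = text; }
    public override T Accept<T>(SketchVisitor<T> visitor) { return visitor.VisitInt(this); }
}

class ParensNode : Node
{
    public readonly Node Inner;
    public ParensNode(Node inner) { Inner = inner; }
    public override T Accept<T>(SketchVisitor<T> visitor) { return visitor.VisitParens(this); }
}

abstract class SketchVisitor<T>
{
    // Visit does not inspect the node type; it asks the node to call back.
    public T Visit(Node node) { return node.Accept(this); }
    public abstract T VisitInt(IntNode node);
    public abstract T VisitParens(ParensNode node);
}

class SketchCalculator : SketchVisitor<int>
{
    public override int VisitInt(IntNode node) { return int.Parse(node.Text); }

    // Same shape as VisitParens in the real visitor: evaluate whatever is inside.
    public override int VisitParens(ParensNode node) { return Visit(node.Inner); }
}
```

For example, new SketchCalculator().Visit(new ParensNode(new IntNode("5"))) goes through VisitParens and then VisitInt, and returns 5.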

Finally, VisitAddSub and VisitMulDiv work pretty much the same way. Both of these represent binary operators, so there are two subsidiary exprs to evaluate before the operator is applied. In each case, we evaluate the left and right operands by calling Visit(context.expr(0)) and Visit(context.expr(1)) respectively. Then we check which operator is in the ‘context’. Note that in the grammar file above we gave the operator the label ‘op’. Also note that the names we gave to the four arithmetic operators in the lexer rules show up as fields within the CalculatorParser class, so we can compare the Type of ‘op’ with these names to find out which operator we’re dealing with. Once we know that, we can return the correct calculation.
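Because the visitor returns int, the arithmetic follows C#’s integer rules: ‘/’ truncates toward zero, and dividing by zero throws a DivideByZeroException. The following sketch isolates that arithmetic so the behaviour is easy to check. IntArithmetic and Apply are hypothetical names, and the operator is passed as a char here purely for illustration, whereas the visitor itself compares token types.

```csharp
using System;

// Hypothetical helper mirroring the arithmetic done in VisitAddSub and VisitMulDiv.
static class IntArithmetic
{
    public static int Apply(char op, int left, int right)
    {
        switch (op)
        {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            case '/':
                // Fail with an explicit message instead of relying on the runtime's default.
                if (right == 0)
                    throw new DivideByZeroException("Division by zero in expression");
                return left / right; // int / int truncates toward zero
            default:
                throw new ArgumentException("Unknown operator: " + op);
        }
    }
}
```

So IntArithmetic.Apply('/', 7, 2) is 3, not 3.5. If you want fractional results, one option is to make the visitor inherit CalculatorBaseVisitor<double> and parse the integers with double.Parse.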

At long last, we’re ready to look at the code that uses all this stuff. Here’s the Main() function:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using Antlr4.Runtime;
using Antlr4.Runtime.Misc;
using Antlr4.Runtime.Tree;

namespace Calculator
{
  class Program
  {
    static void Main(string[] args)
    {
      StreamReader inputStream = new StreamReader(Console.OpenStandardInput());
      AntlrInputStream input = new AntlrInputStream(inputStream.ReadToEnd());
      CalculatorLexer lexer = new CalculatorLexer(input);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      CalculatorParser parser = new CalculatorParser(tokens);
      IParseTree tree = parser.prog();
      Console.WriteLine(tree.ToStringTree(parser));
      CalculatorVisitor visitor = new CalculatorVisitor();
      Console.WriteLine(visitor.Visit(tree));
    }
  }
}

Note the ‘using’ statements at the top, required to access Antlr4.

We create a StreamReader from the console in the first line of Main() so we can type in our expressions. On the next line we pass the string read from the stream (via inputStream.ReadToEnd()) to an AntlrInputStream.

Important note: AntlrInputStream is supposed to accept a raw Stream object for its input, but I couldn’t get this to work. The program just hung when attempting to read from the Stream directly. It seems this is a bug in Antlr 4, so it may be fixed in a future release.

The lexer is then created from the AntlrInputStream, and the tokens it produces are collected in a CommonTokenStream, which is passed to the parser. The parser is run by calling parser.prog(), which invokes the ‘prog’ rule (defined in the grammar file above), the starting rule of the grammar. The output from the parser is saved in an IParseTree, which is first printed in text form and then passed to a CalculatorVisitor to do the evaluations. Finally, the result of the evaluation is printed.

To run the program, type one or more expressions in the console (you can separate them by newlines or blanks). When you’re done, type control-Z on a line by itself and the output from the program should then be displayed.
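If you would rather see a result after each line instead of waiting for control-Z, one option is to read the console a line at a time and run the lexer, parser and visitor on each line separately. The sketch below covers only the line-reading part, so that it stays self-contained. InputLines and NonEmptyLines are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; not part of Antlr.
static class InputLines
{
    // Yields each non-blank line as soon as it has been read,
    // and stops at end of input (control-Z on the Windows console).
    public static IEnumerable<string> NonEmptyLines(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length > 0)
                yield return line;
        }
    }
}
```

In Main() you could loop with foreach (string line in InputLines.NonEmptyLines(Console.In)) and, inside the loop, build the AntlrInputStream, lexer, token stream and parser from ‘line’ exactly as in the listing above.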