ANTLR | Programming pages

Tag Archives: ANTLR

ANTLR with C# – using an Abstract Syntax Tree (AST)

July 1, 2012 – 3:38 PM

The ANTLR example we’ve seen in the last two posts (part 1 and part 2) produced a simple calculator that accepted integers and the four arithmetic operators and calculated the answer. This sort of parser is fine if you need to parse the input once only. However, suppose we wanted to write an application that allowed the user to enter a mathematical function f(x) as an algebraic expression and then draw a graph of that function between two values of x. In that case, we’d need to generate a number of values of f(x) for various values of x between the endpoints. One way of doing this is to parse the input string repeatedly, each time passing in the required value of x. However, this is rather inefficient, and ANTLR offers a better alternative: generating an abstract syntax tree or AST, and then using the AST to evaluate the input expression.

So what is an AST? To understand this, we need to understand what a parser does when it parses an input string. For example, with our simple calculator, we can give the input string 3+4*5. Since multiplication has a higher precedence than addition, 4*5 is done first. The parser treats the input as a tree like this:

To evaluate the tree, it is traversed or ‘walked’ in a depth first manner, which means that we start at the root ‘+’ and go as deep into the tree as we can along each child, then evaluate the operations as we come back out of the tree. Thus going down the left branch we encounter 3, which is saved until the result of the right branch is found. Going down the right branch as far as possible we encounter * then 4. Backing up from 4 to * we then go down the right branch and find 5. Both children of * are now determined so we can apply the * operator to get 20. This determines the right branch of the +, so we can now apply that operator and get 3+20=23.

In the more general situation mentioned above, where we need to evaluate the input several times, perhaps with different values at some of the nodes, it would clearly be less work if we didn’t have to generate the tree afresh each time before we traverse it. ANTLR allows us to generate an AST from a grammar file and then define a tree grammar which uses the AST as input, rather than the original string. Essentially what we need to do is write a grammar in the usual way and then write a second grammar that operates on the tree nodes rather than the original grammar rules. This might sound like more work (OK, it is), but usually writing the tree grammar is easier than writing the original grammar, and we are rewarded with a more efficient program.

To illustrate how this is done, we’ll expand our calculator grammar to an algebraic function evaluator. To this end, we want the numbers it uses to be doubles instead of ints. We also want to support a few more operations, such as raising to a power (using the ^ operator) and the unary minus for negating an expression. The user can enter an algebraic expression using x as the variable, and then specify the range of x values over which f(x) is to be calculated. Here’s the complete grammar file which implements this, and which generates the AST. We’ll explain the new syntax required for AST generation afterwards.

grammar Polynomial;

options {
    language=CSharp3;
    TokenLabelType=CommonToken;
    output=AST;
    ASTLabelType=CommonTree;
}

tokens {
  UNARYMINUS;
}

@lexer::namespace{Polynomial}
@parser::namespace{Polynomial}

// START:expr
public startProg
  : expr { System.Console.WriteLine($expr.tree.ToStringTree());};

expr
	:  multExpr
		( '+'^ multExpr
		| '-'^ multExpr
		)*
    ;

multExpr
    :   powerExpr
      ('*'^ powerExpr
      | '/'^ powerExpr
      )*
    ;

powerExpr
    :   unaryExpr ('^'^ unaryExpr)?
    ;

unaryExpr
    :   '-' atom -> ^(UNARYMINUS atom)
    |   atom
    ;

atom
	:   DOUBLE
    |   ID
    |   '('! expr ')'!
    ;
// END:expr

// START:tokens
ID  :  'x' ;
DOUBLE :   '-'? '0'..'9'+ ('.' '0'..'9'*)?;
// Use uppercase Skip() for C#
WS  :   (' '|'\t'|'\r'|'\n')+ {Skip();} ;
// END:tokens

(If you’re wondering about the grammar name, I originally designed the grammar for calculating polynomials, but it grew in the telling.)

Note that we’ve added 3 lines to the options section on lines 5 to 7. Actually, the options are those generated by default by the Visual Studio plugin, so you can just leave them as they are if you’re starting a new grammar file.

On line 10 we have a ‘tokens’ section, in which we define a single token, UNARYMINUS. Since the – sign is used for two purposes (subtraction and negation), the parser gets confused at the tree stage, so I’ve used a token as a label for the unary minus. We’ll see how this works when we define the tree grammar in a minute.

The remainder of the grammar should be fairly self-explanatory if you’ve read the earlier posts, except for the bits that generate the AST. This bit does require some careful thought as to what nodes you want to be in the AST and how they should be structured.

We’ve inserted a startProg rule at line 18. This isn’t really needed here, but when you’re starting out with ASTs it’s useful to see that actual tree that is produced, so that’s what this rule does in addition to calling the expr rule which is where the actual parsing takes place.

Now look at the ‘expr’ rule on line 21. We’ve defined it as a multExpr on its own, or as two multExprs joined by + or -. The solo multExpr on line 22 is unadorned since we want this node to be placed in the AST as it is.

The + rule on line 23 however has 3 parts to it (the left and right multExpr operands and the + operator). Comparing with the diagram above, we see that we want this expression to be placed in the tree with the + as the root and the two multExprs as its children. This is indicated on line 23 by placing a ^ after the term that is to be the root of the node, in this case, the + sign.

The same technique is used in creating the other AST nodes in the expr, multExpr and powerExpr rules. The unaryExpr on line 39 is a bit different. In order to distinguish this use of ‘-‘ from the subtraction rule in expr, we want the node in the AST to use the UNARYMINUS token as the root node rather than a bare ‘-‘ symbol. To do this we’ve used a rewrite rule. This is defined by using the arrow -> followed by the form we want the node in the AST to have for this rule. Nodes in the AST always have the form (root child1 child2…), that is, the first node is the root and the other nodes are its children. Thus here UNARYMINUS is the root and the atom is its single child.

Finally, look at the atom rule on line 44. An atom consists of a DOUBLE, which matches a double floating point number, or an ID, which here we’ve restricted to the single variable name ‘x’, or an expr in parentheses. The parentheses serve only to fence off an expression from other terms on either side, and once the expr has been identified, the parenthses are no longer needed. We therefore don’t need them in the AST. To exclude a term from the AST, place a ! after it, as we’ve done here.

Now we can look at the tree grammar. To create a tree grammar file, you can use Visual Studio’s Add New Item dialog and select ANTLR Tree Grammar. Here’s the complete file:

tree grammar PolynomialTree;

options {
    language=CSharp3;
    ASTLabelType=CommonTree;
    tokenVocab=Polynomial;
}

@namespace{Polynomial}
@header {
using System;
}

// START:node
public start[double x] returns [double value]
: a = node[x] {$value = a;};

node [double x] returns [double value]
  :   ^('*' a = node[x] b = node[x]) {$value = a * b;}
	|   ^('/' a = node[x] b = node[x]) {$value = a / b;}
  |   ^('+' a = node[x] b = node[x]) {$value = a + b;}
  |   ^('-' a = node[x] b = node[x]) {$value = a - b;}
  |   ^('^' a = node[x] b = node[x]) {$value = Math.Pow(a, b);}
  |   ^(UNARYMINUS a = node[x]) {$value = -a;}
  |   ID {$value = x;}
  |   DOUBLE {$value = Double.Parse($DOUBLE.text);}
  ;

// END:node

Notice this is defined as a ‘tree grammar’ (not just a ‘grammar’) on line 1. In the options on line 6 we’ve specified a ‘tokenVocab’. When ANTLR processes the original grammar file it produces a file containing the tokens used in that file, and to ensure that the tree grammar uses the same tokens, we load in that file. The tokens file has the same name as the grammar with the suffix ‘.tokens’. If you want to see it, it’s located under your project folder in obj\x86\Debug.

The start rule on line 15 is the entry point into the tree. Note that rule names in the tree grammar don’t have to match those in the original grammar, since you’re effectively defining a new grammar with tree nodes as input instead of a string. You might want to make the names the same, but I’ve kept them different here to show that they are in fact separate entities.

Since we want to walk the tree with various values for x, we need to define the start rule so that it accepts a parameter. This is done by enclosing the parameter in square brackets (NOT parentheses, as you’d do in a normal C# method call). The start rule also returns the result of the calculation as ‘value’. Its only action (on line 16) is to call a ‘node’ and pass x along to that node. Again, note that square brackets are used to call a rule with a parameter.

The meat of the tree grammar is in the ‘node’ rule on line 18. We list all the node types that can occur and define an action for each. We must begin each compound node (one that contains more than one term) with a ^ and enclose the node in parentheses; apart from that, it’s much the same as a rule in the original grammar.

On line 24, we make use of the UNARYMINUS token to recognize a unary minus operator. On line 25, the value of the parameter x is assigned whenever an ID is encountered in the tree, and on line 26 we parse a double floating point number.

We don’t have any references to the various rules like expr, multExpr and so on that were in the original grammar, since the original grammar took care of all that and built an AST where the nodes had a much more uniform structure. The precedence of the various operators is built into the AST (remember that lower down nodes are processed first in a depth-first traversal), so we don’t need to specify that either.

Finally, we need some C# code to use the AST to do some real calculations. Here’s the program:

using Antlr.Runtime;
using Antlr.Runtime.Tree;
using System;
using System.IO;
using System.Text;

namespace Polynomial
{
  class Program
  {
    static void Main(string[] args)
    {
      Console.Write("Enter expression: ");
      string expression = Console.ReadLine();
      Console.Write("Enter low value of x: ");
      double xLow = Double.Parse(Console.ReadLine());
      Console.Write("Enter high value of x: ");
      double xHigh = Double.Parse(Console.ReadLine());
      Stream exprStream = new MemoryStream(ASCIIEncoding.Default.GetBytes(expression));

      // Use the parser to build the AST
      ANTLRInputStream input = new ANTLRInputStream(exprStream);
      PolynomialLexer lexer = new PolynomialLexer(input);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      PolynomialParser parser = new PolynomialParser(tokens);
      var result = parser.startProg();

      // Use the tree to do the evaluation
      CommonTree tree = (CommonTree)result.Tree;
      CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
      PolynomialTree walker = new PolynomialTree(nodes);
      double xStep = (xHigh - xLow) / 10.0;
      for (double x = xLow; x <= xHigh; x += xStep)
      {
        nodes.Reset();
        double value = walker.start(x);
        Console.WriteLine("f(" + x + ") = " + value);
      }
    }
  }
}

On lines 13 to 18 we read in the data from the user. The parser needs to read its input from an ANTLRInputStream, and that requires a C# Stream object, so we need to convert the string containing the algebraic expression into a Stream, which is done on line 19.

Next we need to call the parser based on the original grammar to build the AST. This is done on lines 22 to 26. These steps are pretty much the same as in the original use of the parser from the previous post. The difference is that the ‘result’ returned from the parser on line 26 is an AST rather than the result of a calculation. (Its actual data type is pretty horrendous, so we’ve used a ‘var’ to declare it.) This call to startProg() will print out the AST.

Once we’ve got the AST, we build the tree parser on lines 28 to 31. The CommonTreeNodeStream on line 30 acts as an input stream for the tree parser, and the PolynomialTree object ‘walker’ on line 31 is the parser that will do the evaluation.

The loop on line 33 calls the walker for each value of x and prints out the result. Note that we need reset the ‘nodes’ stream before each call since after the walker processes this stream, the marker in the stream is at the end of the input.

Here’s a typical run of the program:

Enter expression: x^2 - (-3.14 + (4.5 - 9.03/x))*(5.43 - x)
Enter low value of x: 10
Enter high value of x: 30
(- (^ x 2) (* (+ -3.14 (- 4.5 (/ 9.03 x))) (- 5.43 x)))
f(10) = 102.08849
f(12) = 147.991275
f(14) = 202.12755
f(16) = 264.40975625
f(18) = 334.78925
f(20) = 413.236845
f(22) = 499.733968181818
f(24) = 594.2682375
f(26) = 696.831080769231
f(28) = 807.416375
f(30) = 926.01963

There’s no error checking, so if the user makes a mistake entering the expression or enters a value of x that causes a math error (like division by zero), the program will just crash, but error handling is a whole new ball game (and is probably harder than writing the original grammar), so we’ll leave it here for now.

By growescience | Posted in C# | Also tagged Abstract syntax trees, AST, tree grammars, unary minus | Comments (0)

ANTLR with C# – adding actions to rules

June 30, 2012 – 4:52 PM

In the last post, we saw how to install ANTLR in Visual Studio and write a simple grammar. We could run the parser and get it to check some input for errors, but so far nothing happens when a correct string is input. In this post, we’ll see how to add some actions for each rule in the grammar.

Although the grammar rules are written in ANTLR’s special syntax, actions are written in the language for which ANTLR produces its code, which in this case is C#. We’ll begin with a quick review of our grammar file so far:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

/*
 * Parser Rules
 */

public addSubExpr
	: multDivExpr (( '+' | '-' ) multDivExpr)*;

multDivExpr
  : INT (( '*' | '/' ) INT)*;

/*
 * Lexer Rules
 */

INT : '0'..'9'+;
WS :  (' '|'\t'|'\r'|'\n')+ {Skip();} ;

The grammar accepts a single line of input consisting of the four arithmetic operations acting on non-negative integers. No variables or parentheses are allowed. As such, there are four operations for which we need to add actions. Well, actually there is one more action, since we need to get the program to convert a string representation of an integer to C#’s int data type, and we’re allowing a single integer as valid input.

Code for an action is inserted directly into the rules at the point where we want it to be run. The easiest way to see what’s going on is to have a look at the modified grammar file with actions included. Here it is:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

@header {
using System;
}
/*
 * Parser Rules
 */

public addSubExpr returns [int value]
	: a = multDivExpr {$value = a;}
  ( '+' b = multDivExpr {$value += b;}
  | '-' b = multDivExpr {$value -= b;})*;

multDivExpr returns [int value]
  : a = INT {$value = Int32.Parse($a.text);}
  ( '*' b = INT {$value *= Int32.Parse($b.text);}
  | '/' b = INT {$value /= Int32.Parse($b.text);})*;

/*
 * Lexer Rules
 */

INT : '0'..'9'+;
WS :  (' '|'\t'|'\r'|'\n')+ {Skip();} ;

We’re going to have the grammar return the result of the calculation so it can be used elsewhere in our master program.

First, notice the @header statement on line 10. You place any intitalization statements that the C# code will need in here. Since we’ll need C#’s System.Int32.Parse() method to convert from a string to an int, we put a ‘using System’ statement in the header. We could, of course, just write all calls to Parse() using its full path, but as with regular C# code, the using statement saves a lot of typing.

Now have a look at the parser and lexer rules themselves. The lexer rules haven’t changed at all, since all the lexer does is extract tokens from the input and feed them to the parser, and the parser is where all the work is done on interpreting the input.

Next, look at the multDivExpr rule on line 22. This rule should calculate the product or quotient of two integers and return the result. We’ve added a ‘returns [int value]’ phrase to this rule. This is ANTLR’s syntax for defining the return value from a rule. (Actually, the latest version of ANTLR allows more than one value to be returned (it encapsulates all return values into a new class), but we won’t need that here.) Parameter lists in ANTLR are enclosed in square brackets rather than parentheses as in C#. The ‘value’ parameter is a local variable whose scope is the entire rule in which it’s defined.

On line 23, we deal with the part of the rule which matches the first integer. We define another local variable, ‘a’, to hold the contents of the token INT. Remember that at this stage, INT is a string of digits, and not an int. Thus we need to add an action that converts INT to a C# int, and that’s what the code in braces on line 23 does. The $a.text is ANTLR syntax which extracts the string from the local variable ‘a’. When you need to access an ANTLR variable, you prefix it with $.

Line 24 deals with multiplication. We save the second INT in the variable ‘b’, and return the product of the current $value with the int extracted from ‘b’. The variable ‘a’ is still in scope on line 24, so we could have written for this line:

  ( '*' b = INT {$value = Int32.Parse($a.text) * Int32.Parse($b.text);}

However, using the current value of $value is easier.

Line 25 handles division in the same way.

The addSubExpr rule on line 17 works in pretty much the same way, although there are a couple of points to note.

First, we’ve declared addSubExpr as ‘public’. By default, the code ANTLR produces for each rule is private, so you won’t be able to access it from outside the parser class. Since addSubExpr is the top level rule in our grammar, it’s the one we’ll need to call from outside to parse the input, so we need to make sure it’s public.

Also, note on line 18 that we save the result of the first multDivExpr in a local variable ‘a’, and we assign ‘a’ directly to $value without using a $ in front of the ‘a’ or accessing any of its properties. We can do this because multDivExpr was defined as returning an int, and an int is just what we need in the calculation. In more complex cases, we might use some non-primitive data type as the return type, in which case we’d need to access one of its fields.

Lines 19 and 20 use the same technique for performing the addition and subtraction.

The code that calls this modified version of the parser is

using System;
using System.IO;
using Antlr.Runtime;
using Antlr.Runtime.Misc;

namespace AntlrTest01
{
  class Program
  {
    public static void Main(string[] args)
    {
      Stream inputStream = Console.OpenStandardInput();
      ANTLRInputStream input = new ANTLRInputStream(inputStream);
      CalculatorLexer lexer = new CalculatorLexer(input);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      CalculatorParser parser = new CalculatorParser(tokens);
      int answer = parser.addSubExpr();
      Console.WriteLine("Answer = " + answer);
    }
  }
}

This is the same as in the previous post, except that this time we save the return value from the call to the parser’s addSubExpr() method and print it out as the answer.

In the next post, we’ll have a look at ANTLR’s other method defining grammars: the abstract syntax tree or AST.

By growescience | Posted in C# | Also tagged grammar, grammar actions | Comments (2)

ANTLR with C# – a simple grammar

June 28, 2012 – 2:05 PM

This post describes how to get started with Antlr version 3. For version 4, see here (note that Antlr4 breaks most of the functionality in previous versions).

ANTLR (ANother Tool for Language Recognition) is a tool for defining and implementing your own computer languages. Most of us won’t ever create a fully-fledged language like C# or Java, but it is often useful and efficient to create little languages for specialist situations, rather than writing clumsy string-parsing code. ANTLR is an ideal tool for handling situations like that, although it can also tackle large problems.

ANTLR is written in Java, and most of the examples (and the only book I know of; there is a new edition of this book due out later this year, but from the blurb it seems it will still concentrate on Java) are given in Java. However, ANTLR is available for quite a few languages and since I use mainly C# I thought a post on getting started with ANTLR in C# and Visual Studio would be useful.

First, we’ll look at getting ANTLR installed in Visual Studio. Start by visiting the C# download page. Download the antlr-dotnet-csharpbootstrap file and the ANTLR language support listed under Visual Studio 2010 Extensions. You should also get the documentation using the link at the top of the download page, although this documentation is a bit sketchy.

Install the VS extension (it’s a self-installing file). I had a problem with the installation in that the Intellisense for grammar files wouldn’t work, and Visual Studio complained of an exception whenever a grammar file was loaded. Eventually I discovered this was due to one of the files in the extension installation not being put in the right place. If you get this problem, open up the extension installer file Tvl.VisualStudio.Language.Antlr3.vsix (it’s in standard zip format so WinZip or WinRar should open it) and copy the file Tvl.VisualStudio.Shell.OutputWindow.Interfaces.dll to C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\CommonExtensions\Microsoft\Editor. (The (x86) in the Program Files folder will be there if you’re running a 32-bit version of Visual Studio under 64-bit Windows 7. If you’re using 32-bit Windows 7, it’s just the Program Files folder.)

Setting up an ANTLR project in Visual Studio

There are a few steps that need to be performed each time you create an ANTLR project in VS2010. These steps are described in the documentation for a project called CoolTool, but I’ll run through them here for completeness.

First, create your new project (we’ll call it AntlrTest01) in VS. Then create a new folder called Reference in the solution’s directory, and unzip the contents of the bootstrap file you downloaded above into this folder. You’ll need to copy this folder to each new project unless you want to store it in a central location and refer to it in the next step, but be aware that if you want to distribute your program, these files will need to go along with it, so it’s often handier to just have a copy of them with your project.

Back in VS, unload your project by right-clicking on its name in Solution Explorer and selecting ‘Unload project’. Then edit the csproj file by right-clicking on the project name again and selecting ‘Edit AntlrTest.csproj’. Scroll down to the bottom of this file (it’s in XML, so moderately readable) and find the line

  <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />

Just after this line, add the lines:

  <PropertyGroup>
    <!-- Folder containing AntlrBuildTask.dll -->
    <AntlrBuildTaskPath>$(ProjectDir)..\Reference\Antlr</AntlrBuildTaskPath>
    <!-- Path to the ANTLR Tool itself. -->
    <AntlrToolPath>$(ProjectDir)..\Reference\Antlr\Antlr3.exe</AntlrToolPath>
  </PropertyGroup>
  <Import Project="$(ProjectDir)..\Reference\Antlr\Antlr3.targets" />

This points VS at the ANTLR files you unzipped earlier. You can see the files’ location is given relative to the ProjectDir, so if you want to install your files somewhere else, you’ll need to change path in the csproj file.

When you’re done, save the csproj file and then right click on the project name and select ‘Reload project’.

One final task is needed before we can start coding. You’ll need to add a reference to the ANTLR dll in your project. In Solution Explorer, right click on References and select Add Reference, then click the Browse tab. The file you want is Antlr3.Runtime.dll, and it’s in the Antlr folder which is inside the References folder that you unzipped the file into earlier. You’re now ready to write your ANTLR project.

A simple grammar

An ANTLR grammar is defined by a grammar, which is a description of the allowable words and symbols and of the rules that combine these words and symbols. A grammar in ANTLR is written using a special syntax (that is, not in C#) so you’ll need to get familiar with at least some of that syntax to write a simple grammar.

There are actually two ways you can get ANTLR to implement a language. The first is by writing the C# code that is to be executed as part of each grammar rule. Although this is the simplest way, it’s not very efficient since it requires that the input string (the code in which language statements are written) be parsed from scratch each time it is to be interpreted. A more efficient way is to convert the input code into an abstract syntax tree or AST, and then use the AST for future operations. We’ll look only at the first method in this post.

To start with, we’ll write a grammar that implements a simple four-function calculator. We won’t add any frills, so we won’t include things like variable names or even parentheses.

Begin by adding a combined lexer and parser to your project. Right click on the project name and select Add New Item. If you’ve installed the VS extension above properly, you should see four ANTLR items in the list of Visual C# Items. Select ANTLR Combined Parser, change its name to Calculator.g3 and click Add. You’ll see that three files have been added to your project: the grammar file Calculator.g3 and two C# source files, one for the lexer and one for the parser.

By default, the grammar file generated in VS produces the AST version of the parser, which isn’t what we want in this simple example. To change it, edit the ‘options’ section so that the start of the file looks like this:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

ANTLR generates two classes (one for the lexer and one for the parser) from the grammar file, and the last two lines above tell it to put these classes in the AntlrTest01 namespace (the same namespace that contains the rest of the project).

We’ve mentioned the words ‘lexer’ and ‘parser’ a few times without defining them. These two beasts are fundamental to creating a language generator. Basically a lexer reads the input stream (usually a string) and breaks it up into tokens. It is the job of the lexer to identify which are valid tokens and which aren’t, and to generate error messages if it finds invalid input at the token level. For example, in our simple grammar we will accept integers and the four operators +, -, * and / as valid tokens. Anything else should produce an error.

The lexer feeds the tokens to the parser, which attempts to match the tokens to the rules we will define in the grammar file. We will need to add code to the parser so that it performs some function when some tokens match a valid rule (for example, when the expression 3 + 4 is fed into the parser), or generates an error message if the tokens don’t match a rule.

The first step in writing a language is defining the grammar. At that stage, the only function of the lexer and parser is to check the input to see if it’s valid, without doing any actual calculations. The easiest way to see how this works is to look at the complete grammar file:

grammar Calculator;

options {
    language=CSharp3;
}

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

/*
 * Parser Rules
 */

public addSubExpr
	: multDivExpr (( '+' | '-' ) multDivExpr)*;

multDivExpr
  : INT (( '*' | '/' ) INT)*;

/*
 * Lexer Rules
 */

INT : '0'..'9'+;
WS :  (' '|'\t'|'\r'|'\n')+ {Skip();} ;

Look at the lexer rules (at the bottom) first. We’ve defined two tokens. INT is the token for an integer, and consists of the characters 0 through 9 repeated one or more times. The + sign at the end is the usual regular expression syntax for ‘one or more occurrences’. WS stands for ‘whitespace’ and defines which characters are considered as blanks. Here we’ve include the blank itself, and the tab, return and newline characters. The Skip() function in braces at the end tells the lexer to skip over whitespace when it’s reading the input.

Now have a look at the parser. We’ve defined two rules: addSubExpr and multDivExpr. multDivExpr is defined as an INT followed by zero or more expressions of the form: * or / followed by another INT. (The * at the end is again regular expression syntax for ‘zero or more occurrences’.) Thus a valid multDivExpr would be a single integer like 42, or expressions like 42*12 or 42*12/6. Notice that since we’ve told the lexer to ignore whitespace, we can insert blanks between the tokens and it won’t affect the result.

Finally, look at the addSubExpr rule. It consists of a single multDivExpr, or a multDivExpr followed by zero or more expressions of the form + or – followed by another multDivExpr. So a valid addSubExpr would be, again, a single integer, or a single multDivExpr like those in the previous paragraph, or an expression like 42*12+34, or 42*12+34/2 or 42*12+34-22 and so on.

If you haven’t made any errors typing this in, you should now be able to build the project in VS without errors. However, we haven’t actually written any C# code yet, so if you run the program nothing will happen.

Before we add some C#, though, it’s worth looking at the Class View in VS. You’ll see that there are two classes called CalculatorLexer and CalculatorParser and if you look at their list of methods and properties, you’ll see a whole pile of entries. This is the code that ANTLR generates from your grammar file. Feel free to have a look at the code, but don’t modify any of it since (a) it will probably break it and (b) any changes you make will get overwritten the next time you change the grammar file.

Finally, how do we incorporate the grammar into a C# program? Here’s what is needed to use the lexer and parser in C#:

using System;
using System.IO;
using Antlr.Runtime;
using Antlr.Runtime.Misc;

namespace AntlrTest01
{
  class Program
  {
    public static void Main(string[] args)
    {
      Stream inputStream = Console.OpenStandardInput();
      ANTLRInputStream input = new ANTLRInputStream(inputStream);
      CalculatorLexer lexer = new CalculatorLexer(input);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      CalculatorParser parser = new CalculatorParser(tokens);
      parser.addSubExpr();
    }
  }
}

Note the ‘using’ statements at the top that we need to get access to the ANTLR libraries.

First we open the standard input (via the console) as the input stream and use this to create an ANTLRInputStream. This stream is passed to the lexer, which produces a stream of tokens which is then passed to the parser. Finally we call addSubExpr(), which tries to apply the addSubExpr rule to the input.

We can run this program and then type some input into the console. When you’re done with the input, type Control-Z on a line by itself. Control-Z is the Windows character for ‘end of file’ and tells the parser the input is complete.

What you’ll find is that, no matter what you type in (even it’s incorrect), you’ll get no output. This is because we haven’t provided either code for handling correct input or error messages for incorrect input. We’ll leave the former for the next post, but we can provide some rudimentary error handling to finish off this post.

When a parser or lexer method detects invalid input, it calls the method ReportError(). This method is provided only as a virtual stub in the generated code, so if you want some output when an error is detected, you’ll need to override it. We can place the following method in Calculator.g3.lexer.cs:

using Antlr.Runtime;
using System;

namespace AntlrTest01
{
  partial class CalculatorLexer
  {
    public override void ReportError(RecognitionException e)
    {
      base.ReportError(e);
      Console.WriteLine("Error in lexer at line " + e.Line + ":" + e.CharPositionInLine);
    }
  }
}

And a similar method override can be placed in Calculator.g3.Parser.cs:

using Antlr.Runtime;
using System;

namespace AntlrTest01
{
  partial class CalculatorParser
  {
    public override void ReportError(RecognitionException e)
    {
      base.ReportError(e);
      Console.WriteLine("Error in parser at line " + e.Line + ":" + e.CharPositionInLine);
    }
  }
}

Now if you enter some incorrect input like 3+d you’ll get an error message:

3+d
^Z
Error in lexer at line 1:2
Error in parser at line 2:0

The lexer complains about character 2 (the location is zero-based so that’s the third character), since ‘d’ isn’t a valid token (remember we are accepting only integers, whitespace and the four operators). The parser complains about an error at the start of line 2, since the lexer has fed it only the two tokens 3 and +, which doesn’t match either of the parser’s rules.

Hopefully that’s enough to get started with ANTLR. Next time round we’ll look at adding some code to get the parser to actually do the calculations.

By growescience | Posted in C# | Also tagged computer languages, grammar, lexer, parser | Comments (8)

Programming pages

Tag Archives: ANTLR

ANTLR with C# – using an Abstract Syntax Tree (AST)

ANTLR with C# – adding actions to rules

ANTLR with C# – a simple grammar

Setting up an ANTLR project in Visual Studio

A simple grammar

Pages

Archives

Blog Stats

Meta

Tags