Writing a parser: Calculator example

at 2008-02-10 in Examples by friebe (0 comments)

In many situations, we need to transform user input to machine readable data. Mostly, a regular expression, sscanf or even very simple handling via string functions will be sufficient. For more complex (and forgiving) input, and when we want to tell users what went wrong where these methods may not be enough.

Imagine a situation in which you would like to evaluate mathematical expressions, so, for example, the user would be inputting something like 1 + 2 or -1 / (3 * 6) and you would like output the expression's result. Something like this:

Calculator console screenshot

Using eval is not an option because you're not in control over what is passed - this creates security problems!

To get started with a parser, we will first have to write a grammar. The XP framework comes with a technology to generate so-called LALR parsers from these - don't worry about its technical details for the moment, though.



The grammar
The grammar files you need to write are easy to read and resemble the syntax used in Yacc. Here's a basic skeleton:
  %{
// Top: Package and uses go here
%}

// Tokens and associativity definitions

%%
// Rules go here
%%

Top
Sourcecode included in top is placed above the generated parser's class declaration statement. Here's what we'll start with:
  %{
$package= 'math';

uses(
'math.Constant',
'lang.types.Integer',
'lang.types.Double'
);
%}

Tokens
The tokens section contains a list of tokens that consist of more than one character. A character on its own has a token number of its ASCII value and the character as value (e.g. the "+" sign is #43). The lexer is responsible for producing the tokens, so it can decide to return "mod" as one token. In this case, we need to declare this token (the name doesn't matter, but is suggested to be written in all-uppercase).

We'll define two tokens by adding a line as follows:
  %token T_INTEGER T_DOUBLE

Rules
This is the interesting part! Here's the general skeleton for a rule:
  name:
rule
| rule1 token1 { $$= new Rule1($1, $2); }
| rule2
;
This can be read as: "The rule name contains either a rule, a rule1 (followed by the token token1, in which case a Rule1 object containing the token values of rule1 and token1 is created and returned - via $$) or a rule2."

Confused?:-)


Our first grammar
The first rule needs to be named "start". This is for the parser generator to know where to set the entry point. So we'll start out as follows:
  start:
expression { /* May be omitted: $$= $1; */ }
;

expression:
T_INTEGER { $$= new math·Constant(new Integer($1)); }
| T_DOUBLE { $$= new math·Constant(new Double($1)); }
| '(' expression ')' { $$= $2; }
;

Read this as follows: "At the start, we expect an expression. An expression may be an integer number (in which case we create - and return - an Integer instance wrapped by a Constant), a double number (same procedure, but use a Double instead), or an expression surrounded by brackets (which will simply return the expression inside, to be used for grouping).

Putting it together
This command will generate our parser:
$ sh ~/devel/xp/trunk/ports/technologies/opt/jay/generate.sh grammar/calculator.jay php5 'math·' > math/Parser.class.php

After that, we'll have to create a lexer which will tokenize the input into tokens. For the expression 1 + 2.5, it should return T_INTEGER(1), 43('+') and T_DOUBLE(2.5) (no tokens are returned for whitespace).

Here's an abbreviated version:
  class  extends AbstractLexer {
const DELIMITERS=
" +-*/()\r\n\t";
protected
$tokenizer= NULL;

public
function __construct($expression) {
$this->tokenizer= new StringTokenizer($expression, self::DELIMITERS, TRUE);
}

public
function advance() {
do
{
if
(!$this->tokenizer->hasMoreTokens()) return FALSE;
$token= $this->tokenizer->nextToken(self::DELIMITERS);

// Ignore whitespace
if (FALSE !== strpos(" \n\r\t", $token)) continue;

if
(FALSE !== strpos(self::DELIMITERS, $token) && 1 == strlen($token)) {
$this->token= ord($token);
$this->value= $token;
} else if (ctype_digit($token)) {
$this->token= ::T_INTEGER;
$this->value= intval($token);
} else if (2 == sscanf($token, '%d.%d', $whole, $fraction)) {
$this->token= ::T_DOUBLE;
$this->value= doubleval($token);
} else {
throw
new IllegalStateException('Unknown token "'.$token.'"');
}

break;
} while (1);

return TRUE;
}
}

More rules
By now, we can only parse number constants as mathematical expressions, which of course isn't satisfying yet. So we'll extend the grammar to allow for simple mathematical operations:
expression:
T_INTEGER { $$= new math·Constant(new Integer($1)); }
| T_DOUBLE { $$= new math·Constant(new Double($1)); }
| expression '+' expression { $$= new math·Addition($1, $3); }
| expression '-' expression { $$= new math·Subtraction($1, $3); }
| expression '*' expression { $$= new math·Multiplication($1, $3); }
| expression '/' expression { $$= new math·Division($1, $3); }
| '(' expression ')' { $$= $2; }
;

This looks rather intuitive, doesn't it? And it should just work out fine, right? Unfortunately not: When regenerating the parser, we'll see the following appear: phpJay: 16 shift/reduce conflicts. What the hell?

The problem is here that this does not completely specify how to handle the following input: a + b + c - is that (a + b) + c or is it a + (b + c)? Of course, it doesn't matter here (it would if we had a + b * c, though!)

So we will need to tell the parser generator which way to structure the input. We will use the following:
  %left '+' '-' 
%left '*' '/'
...and will see the conflicts have disappeared again:-)

Quoting The Lex & Yacc Page:
This describes the precedence and associativity of the four arithmetic operators. Plus and minus are left associative, and have lower precedence than star and slash, which are also left associative.


Because we also want to support negative numbers (the tokenizer will split "-3" into #45("-") and T_NUMBER(3)) and also want something like "-(4 * 3)" to work, we will add a rule reading:
  | '-' expression { $$= new math·Multiplication(new math·Constant(new Integer(-1)), $2); }

The complete grammar can be seen here.


Making it work
Now that we have the parser and lexer in place, we will want to try out the functionality. Therefore we will have to create the classes referenced in the grammar. The inheritance tree looks like this:
  + lang.Object
|--- math.Constant
`--+ math.Operation
|--- math.Addition
|--- math.Subtraction
|--- math.Multiplication
`--- math.Division
Both Constant and Operation classes implement the interface math.Expression (empty for the moment, we will need it later on!)

The Constant class is simply a wrapper around a lang.types.Number:
  class  extends Object implements  {
protected
$number= NULL;

public
function __construct(Number $number) {
$this->number= $number;
}

public
function toString() {
return
$this->getClassName().'('.xp::stringOf($this->number).')';
}
}

The Operation class is an abstract base class for the concrete operation implementations:
  abstract class  extends Object implements  {
protected
$lhs= NULL,
$rhs= NULL;

public
function __construct( $lhs, $rhs) {
$this->lhs= $lhs;
$this->rhs= $rhs;
}

public
function toString() {
return
$this->getClassName().'('.xp::stringOf($this->lhs).', '.xp::stringOf($this->rhs).')';
}
}

The concrete implementation for addition, subtraction, multiplication and divisions are empty subclasses of the above class:
  class  extends  {
}

As the parser's parse() method returns whatever is returned from the start rule, we can use the following to see the parsing results:
  echo xp::stringOf(create(new ())->parse(new ($input)));

Here's what will be output:
  • For the input string "1", we will see math.Constant(lang.types.Integer(1))
  • For the input string "1.5", we will see math.Constant(lang.types.Double(1.5))
  • Same applies if we surround input by brackets as in "(1)"
  • Given an expression "3 * 5", we will read math.Multiplication(math.Constant(lang.types.Integer(3)), math.Constant(lang.types.Integer(5)))
  • If we do not pass anything, we'll receive a syntax error reading "Unexpected end-of-file" - our grammar does not allow for empty expressions!



Making it useful
We have now succeeded in parsing expressions from the input source. Of course this is no real help just to see the object created but we'd like to see some results. For this purpose, we'll add an evaluate() method to the Expression interface as follows:
  interface  {
public
function evaluate();
}
...and then add functionality in the implementing classes.

The Constant class will simply return the contained number:
  class  extends Object implements  {
// ...
public function evaluate() {
return
$this->number;
}
}

The Operation class will unbox the left- and righthand sides' evaluation results to primitives, pass them to a perform method implemented by the concrete implementations and box the value back to a lang.types.Number.
  abstract class  extends Object implements  {
// ...
protected abstract function perform($lhs, $rhs);

public
function evaluate() {
return Primitive::boxed
($this->perform(
Primitive::unboxed
($this->lhs->evaluate()),
Primitive::unboxed
($this->rhs->evaluate())
));
}
}

This way, the code in the Addition class is quite simple (same goes for all the other operations):
  class  extends  {
protected
function perform($lhs, $rhs) {
return
$lhs + $rhs;
}
}



Command line
Finally, we'll create a command line class to provide a calculator utility:
  uses('util.cmd.Command', 'math.Lexer', 'math.Parser');

class Calculator extends Command {

#[@arg(position= 0)]
public function setExpression($expr) {
$this->expr= $expr;
}

public
function run() {
$this->out->writeLine(create(new ())->parse(new ($this->expr))->evaluate());
}
}

This can now be called from the command line $ xpcli cmd.Calculator [expression] and will output its evaluation's result.


Of course, this is a very simple example for a very simple situation, but has hopefully given you a head start at using the parser generator distributed with the XP framework and maybe it's sparked some ideas on how to integrate it in your application:-)

The complete sourcecode can be found in the experiments directory



Subscribe

You can subscribe to the XP framework's news by using RSS syndication.


Categories

News
General
PHP5
Announcements
RFCs
Further reading
Examples
Editorial
EASC
Experiments
Unittests
Databases
5.8-SERIES
Unicode
Language
5.9-SERIES
6.0-SERIES

Related

Find related articles by a search for «Writing».