at 2008-02-10
in Examples
by friebe
(0 comments)
In many situations, we need to transform user input to machine readable data. Mostly, a regular expression, sscanf or even very simple handling via string functions will be sufficient. For more complex (and forgiving) input, and when we want to tell users what went wrong where these methods may not be enough.
Imagine a situation in which you would like to evaluate mathematical expressions, so, for example, the user would be inputting something like 1 + 2 or -1 / (3 * 6) and you would like output the expression's result. Something like this:
 Using eval is not an option because you're not in control over what is passed - this creates security problems!To get started with a parser, we will first have to write a grammar. The XP framework comes with a technology to generate so-called LALR parsers from these - don't worry about its technical details for the moment, though.
The grammar The grammar files you need to write are easy to read and resemble the syntax used in Yacc. Here's a basic skeleton:
%{ // Top: Package and uses go here %} // Tokens and associativity definitions %% // Rules go here %% Top Sourcecode included in top is placed above the generated parser's class declaration statement. Here's what we'll start with:
%{ $package= 'math'; uses( 'math.Constant', 'lang.types.Integer', 'lang.types.Double' ); %} Tokens The tokens section contains a list of tokens that consist of more than one character. A character on its own has a token number of its ASCII value and the character as value (e.g. the "+" sign is #43). The lexer is responsible for producing the tokens, so it can decide to return "mod" as one token. In this case, we need to declare this token (the name doesn't matter, but is suggested to be written in all-uppercase).
We'll define two tokens by adding a line as follows:
%token T_INTEGER T_DOUBLE Rules This is the interesting part! Here's the general skeleton for a rule:
name: rule | rule1 token1 { $$= new Rule1($1, $2); } | rule2 ; This can be read as: "The rule name contains either a rule, a rule1 (followed by the token token1, in which case a Rule1 object containing the token values of rule1 and token1 is created and returned - via $$) or a rule2."
Confused?
Our first grammar The first rule needs to be named "start". This is for the parser generator to know where to set the entry point. So we'll start out as follows:
start: expression { /* May be omitted: $$= $1; */ } ;
expression: T_INTEGER { $$= new math·Constant(new Integer($1)); } | T_DOUBLE { $$= new math·Constant(new Double($1)); } | '(' expression ')' { $$= $2; } ; Read this as follows: "At the start, we expect an expression. An expression may be an integer number (in which case we create - and return - an Integer instance wrapped by a Constant), a double number (same procedure, but use a Double instead), or an expression surrounded by brackets (which will simply return the expression inside, to be used for grouping).
Putting it together This command will generate our parser: $ sh ~/devel/xp/trunk/ports/technologies/opt/jay/generate.sh grammar/calculator.jay php5 'math·' > math/Parser.class.php
After that, we'll have to create a lexer which will tokenize the input into tokens. For the expression 1 + 2.5, it should return T_INTEGER(1), 43('+') and T_DOUBLE(2.5) (no tokens are returned for whitespace).
Here's an abbreviated version:
class extends AbstractLexer { const DELIMITERS= " +-*/()\r\n\t"; protected $tokenizer= NULL; public function __construct($expression) { $this->tokenizer= new StringTokenizer($expression, self::DELIMITERS, TRUE); } public function advance() { do { if (!$this->tokenizer->hasMoreTokens()) return FALSE; $token= $this->tokenizer->nextToken(self::DELIMITERS); if (FALSE !== strpos(" \n\r\t", $token)) continue; if (FALSE !== strpos(self::DELIMITERS, $token) && 1 == strlen($token)) { $this->token= ord($token); $this->value= $token; } else if (ctype_digit($token)) { $this->token= ::T_INTEGER; $this->value= intval($token); } else if (2 == sscanf($token, '%d.%d', $whole, $fraction)) { $this->token= ::T_DOUBLE; $this->value= doubleval($token); } else { throw new IllegalStateException('Unknown token "'.$token.'"'); } break; } while (1);
return TRUE; } } More rules By now, we can only parse number constants as mathematical expressions, which of course isn't satisfying yet. So we'll extend the grammar to allow for simple mathematical operations:
expression: T_INTEGER { $$= new math·Constant(new Integer($1)); } | T_DOUBLE { $$= new math·Constant(new Double($1)); } | expression '+' expression { $$= new math·Addition($1, $3); } | expression '-' expression { $$= new math·Subtraction($1, $3); } | expression '*' expression { $$= new math·Multiplication($1, $3); } | expression '/' expression { $$= new math·Division($1, $3); } | '(' expression ')' { $$= $2; } ; This looks rather intuitive, doesn't it? And it should just work out fine, right? Unfortunately not: When regenerating the parser, we'll see the following appear: phpJay: 16 shift/reduce conflicts. What the hell?
The problem is here that this does not completely specify how to handle the following input: a + b + c - is that (a + b) + c or is it a + (b + c)? Of course, it doesn't matter here (it would if we had a + b * c, though!)
So we will need to tell the parser generator which way to structure the input. We will use the following: %left '+' '-' %left '*' '/' ...and will see the conflicts have disappeared again
Quoting The Lex & Yacc Page:
This describes the precedence and associativity of the four arithmetic operators. Plus and minus are left associative, and have lower precedence than star and slash, which are also left associative.
Because we also want to support negative numbers (the tokenizer will split "-3" into #45("-") and T_NUMBER(3)) and also want something like "-(4 * 3)" to work, we will add a rule reading:
| '-' expression { $$= new math·Multiplication(new math·Constant(new Integer(-1)), $2); } The complete grammar can be seen here.
Making it work Now that we have the parser and lexer in place, we will want to try out the functionality. Therefore we will have to create the classes referenced in the grammar. The inheritance tree looks like this:
+ lang.Object |--- math.Constant `--+ math.Operation |--- math.Addition |--- math.Subtraction |--- math.Multiplication `--- math.Division Both Constant and Operation classes implement the interface math.Expression (empty for the moment, we will need it later on!)
The Constant class is simply a wrapper around a lang.types.Number:
class extends Object implements { protected $number= NULL; public function __construct(Number $number) { $this->number= $number; } public function toString() { return $this->getClassName().'('.xp::stringOf($this->number).')'; } } The Operation class is an abstract base class for the concrete operation implementations:
abstract class extends Object implements { protected $lhs= NULL, $rhs= NULL; public function __construct( $lhs, $rhs) { $this->lhs= $lhs; $this->rhs= $rhs; } public function toString() { return $this->getClassName().'('.xp::stringOf($this->lhs).', '.xp::stringOf($this->rhs).')'; } } The concrete implementation for addition, subtraction, multiplication and divisions are empty subclasses of the above class:
class extends { } As the parser's parse() method returns whatever is returned from the start rule, we can use the following to see the parsing results:
echo xp::stringOf(create(new ())->parse(new ($input))); Here's what will be output:
- For the input string "1", we will see math.Constant(lang.types.Integer(1))
- For the input string "1.5", we will see math.Constant(lang.types.Double(1.5))
- Same applies if we surround input by brackets as in "(1)"
- Given an expression "3 * 5", we will read math.Multiplication(math.Constant(lang.types.Integer(3)), math.Constant(lang.types.Integer(5)))
- If we do not pass anything, we'll receive a syntax error reading "Unexpected end-of-file" - our grammar does not allow for empty expressions!
Making it useful We have now succeeded in parsing expressions from the input source. Of course this is no real help just to see the object created but we'd like to see some results. For this purpose, we'll add an evaluate() method to the Expression interface as follows:
interface { public function evaluate(); }...and then add functionality in the implementing classes.
The Constant class will simply return the contained number:
class extends Object implements { public function evaluate() { return $this->number; } } The Operation class will unbox the left- and righthand sides' evaluation results to primitives, pass them to a perform method implemented by the concrete implementations and box the value back to a lang.types.Number.
abstract class extends Object implements { protected abstract function perform($lhs, $rhs); public function evaluate() { return Primitive::boxed($this->perform( Primitive::unboxed($this->lhs->evaluate()), Primitive::unboxed($this->rhs->evaluate()) )); } } This way, the code in the Addition class is quite simple (same goes for all the other operations):
class extends { protected function perform($lhs, $rhs) { return $lhs + $rhs; } }
Command line Finally, we'll create a command line class to provide a calculator utility:
uses('util.cmd.Command', 'math.Lexer', 'math.Parser'); class Calculator extends Command { public function setExpression($expr) { $this->expr= $expr; } public function run() { $this->out->writeLine(create(new ())->parse(new ($this->expr))->evaluate()); } } This can now be called from the command line $ xpcli cmd.Calculator [expression] and will output its evaluation's result.
Of course, this is a very simple example for a very simple situation, but has hopefully given you a head start at using the parser generator distributed with the XP framework and maybe it's sparked some ideas on how to integrate it in your application
The complete sourcecode can be found in the experiments directory
|
Subscribe
You can subscribe to the XP framework's news by using RSS syndication.
CategoriesNews General PHP5 Announcements RFCs Further reading Examples Editorial EASC Experiments Unittests Databases 5.8-SERIES Unicode Language 5.9-SERIES
RelatedFind related articles by a search for «Writing».
|