Sunday, November 15, 2015

Combining lex and yacc

In the previous post, I posted an example of a simple lexer with lex, where the action for each rule was to print out what rule it matched.

When lex is used with yacc, however, each lex rule may just return an integer which corresponds to a given token type, defined in a header file output by yacc. For example:

File: logic.l
%{
#include "y.tab.h"
%}

%%

not { return NOT; }
or  { return OR; }
and { return AND; }
\(  { return LPAREN; }
\)  { return RPAREN; }
false { return FALSE; }
true  { return TRUE; }

\n { return 0; }  // End marker

. ;  // Ignore everything else.

%%

In this example, the header generated by yacc is included in the header section of the file; each of the rules just returns one of the token IDs defined in y.tab.h.

Of special note is the special token which has a return value of 0, which indicates the end of input.

The yacc file looks like this:

File: logic.y
%{
#include 
%}

%token TRUE FALSE AND OR NOT LPAREN RPAREN

%%

expr    : literal
        | unary
        | binary
        | parens
        { printf("Parsed expr.\n"); }
        ;

literal : TRUE
        | FALSE
        { printf("Parsed literal.\n"); }
        ;

unary   : NOT expr
        { printf("Parsed not expr.\n"); }
        ;

binary  : expr AND expr
        | expr OR expr
        { printf("Parsed binary expr.\n"); }
        ;

parens  : LPAREN expr RPAREN
        { printf("Parsed paren expr.\n"); }
        ;

%%

int main() {
  yyparse();
  printf("Called yyparse once; terminating.\n"); 
}

yyerror(s) char *s; {
  fprintf(stderr, "yyerror called with: %s\n", s);
}

As with the lex file, the yacc file has a header section which can contain C code, and then a sequence of rules. In this header section we can include libraries like stdio.h, which allows us to use printf below.

After the header section (delimited by %{ %}) we have a list of token types. Token type names are traditionally written in all upper-case, although this is just a convention.

In this example, expr, literal, unary, binary and parens all match a particular types of expressions. The first rule listed is the "start" symbol; if there is a sequence of tokens that can be parsed using the first rule after the "end marker" token (0), then the expression is considered parsed (accepted); otherwise there is a syntax error.

Below the list of rules we have some C code, which gets copied verbatim into the C source file that yacc outputs. The above lex and yacc file can be made into a working binary by running the commands:

$ lex logic.l
$ yacc -d logic.y
$ cc -c -o lex.yy.o lex.yy.c
$ cc -c -o y.tab.o y.tab.c
$ cc -o calc lex.yy.o y.tab.o -ly -ll

Lex by default produces lex.yy.c, as before; yacc produces y.tab.c, and since the option -d is passed, it also produces y.tab.h which contains the token definitions. Then lex.yy.c and y.tab.c can be separately compiled into object files, and these object files can be linked together into a binary.

To make this more convenient, it's a good idea to use a Makefile:

File: Makefile
all: calc

clean:
        rm *.c *.h *.o calc

calc: lex.yy.o y.tab.o
        cc -o calc lex.yy.o y.tab.o -ly -ll

lex.yy.o: lex.yy.c y.tab.h

y.tab.o: y.tab.c y.tab.h

y.tab.h y.tab.c: logic.y
        yacc -d logic.y

lex.yy.c: logic.l
        lex logic.l

Finally, we can run the program:

$ ./calc
true and (not false)
Parsed literal.
Parsed not expr.
Parsed paren expr.
Parsed expr.
Called yyparse once; terminating.

And if there's a syntax error (no complete expression before the end marker token (0)):

false and
Parsed literal.
yyerror called with: syntax error
Called yyparse once; terminating.

The next step would be to build some kind of parse tree and evaluate it.

No comments:

Post a Comment