Saturday, November 7, 2015

Starting to look at lex

Last week I came across a copy of lex & yacc and started to read it. For my first toy project to learn the content, I'd like to make some kind of a "logic calculator", which might, for example, behave like this:

> true and false
false
> false or not false
true
> a = true, b = false, c = false
> (b or not a) and c
false

The first step in this task is to write a lexer to tokenize the input -- that is, to convert the input into a sequence of tokens. With lex, a lexer is defined in a special domain-specific language. Example 1:

File: logic.l
%%

not { printf("NOT: %s\n", yytext); }
or  { printf("OR: %s\n", yytext); }
and { printf("AND: %s\n", yytext); }
\(  { printf("LPAREN: %s\n", yytext); }
\)  { printf("RPAREN: %s\n", yytext); }
false { printf("FALSE: %s\n", yytext); }
true  { printf("TRUE: %s\n", yytext); }

. { /* Ignore everything else. */ }

%%

Things to note: the lex file contains a rules section, delimited at the start and end by %%. In this minimal example, the lex file contains nothing but a sequence of rules. Each rule consists of a regular expression followed by a C block containing the code to execute when the lexer finds input that matches that regular expression. Inside that C block, the string that matched the regular expression is available in the variable yytext.
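For context, a lex file can actually have up to three sections, separated by %% lines: definitions, rules, and user code. The first and last may be empty, which is why the example above starts straight in with %%. A sketch of the full layout (the #include here is just illustrative):

%{
/* Definitions section: C code between %{ and %}
   is copied verbatim into the generated lexer. */
#include <stdio.h>
%}
%%
/* Rules section: pattern/action pairs go here. */
%%
/* User code section: copied to the end of the generated
   file; this is where a custom main() or yywrap() can live. */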

Once we have a lex file like the one above, we can use lex to make a C source file from it:

$ lex logic.l

This creates a file called lex.yy.c (1791 lines!) which can then be compiled and run.

$ cc lex.yy.c -o lexer -ll

(The -ll flag links the lex library, which supplies a default main that calls yylex in a loop, plus a default yywrap.)

$ ./lexer
(true and false) ...
LPAREN: (
TRUE: true
AND: and
FALSE: false
RPAREN: )

^C

In this example, each of the rules in the lexer file prints some text to stdout and doesn't actually return a token type, but when a lex lexer is used with a parser, the block of code for each rule will usually return an integer representing the token's type.
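As a rough sketch of what that variant might look like (the TOK_* names and values are made up here for illustration -- when lex is paired with yacc, the token definitions would instead come from the generated y.tab.h):

%{
/* Hypothetical token codes; yacc conventionally starts
   its own token numbering at 257. */
enum { TOK_NOT = 258, TOK_OR, TOK_AND, TOK_LPAREN,
       TOK_RPAREN, TOK_FALSE, TOK_TRUE };
%}
%%
not   { return TOK_NOT; }
or    { return TOK_OR; }
and   { return TOK_AND; }
\(    { return TOK_LPAREN; }
\)    { return TOK_RPAREN; }
false { return TOK_FALSE; }
true  { return TOK_TRUE; }
.     { /* Ignore everything else. */ }
%%

Each call to yylex() then returns the code of the next token, and returns 0 at end of input, which is exactly the interface a yacc-generated parser expects.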
