Tuesday, December 29, 2015

Writing New Year's Resolutions

As time passes, we all become different versions of ourselves, though not necessarily very different ones. This year, I want to become a better version of myself, and I believe that requires changing habits.

According to The Power of Habit, every habit has a cue, a routine, and a reward. Changing a habit usually means swapping out the routine while keeping the cue and reward, which requires setting specific new routines to replace the old ones.

So, with that in mind, here's what I wish for this new year:

  1. I want to be patient and considerate. Whenever I notice myself becoming frustrated, I will reflect on why it happened and what a better attitude would have been.
  2. I want to be physically healthy, and be conscious of my eating and exercise habits.
  3. I want to regularly learn new things.

And at work:

  1. I want to be reliable by being systematic in task management.
  2. I want to produce quality work.
  3. I want to take new opportunities.

Monday, December 21, 2015

Simple symbol table with yacc

In the last post about yacc, we got yacc to evaluate expressions and print out the evaluated value.

In programs that involve setting and looking up variable values, there is usually some kind of symbol table. At the top of the yacc file, I've added the following global symbol table and related functions:

%{
#include <stdio.h>

#define TABLE_SIZE 26
#define UNSET -1

int SYMBOL_TABLE[TABLE_SIZE];
int lookup(int symbol_num);
void set(int symbol_num, int value);
void reset_table();

%}

In this example, the symbol table is just big enough to fit one variable per letter of the alphabet. The table is expected to be initialized with the special "unset" value before use.

The rules section of the yacc file has been modified to include an "assignment statement" type of expression. When an assignment statement is encountered, the symbol table is modified; when a variable is being used for its value, the value is looked up in the symbol table.

%%

start   : stmt { printf($1 ? "true\n" : "false\n"); }
        | expr { printf($1 ? "true\n" : "false\n"); }
        ;

expr    : NOT expr { $$ = ! $2; }
        | expr AND expr { $$ = ($1 && $3); }
        | expr OR expr { $$ = ($1 || $3); }
        | LPAREN expr RPAREN { $$ = $2; }
        | NAME { $$ = lookup($1); }
        | FALSE { $$ = 0; }
        | TRUE { $$ = 1; }
        ;

stmt    : NAME ASSIGN expr { set($1, $3); $$ = $3; }
        ;

%%

We're using two new tokens, so rules for them have to be added to the lex file's rules section (and ASSIGN and NAME declared in the yacc file's %token list):

=   { return ASSIGN; }
[a-z] { yylval = yytext[0] - 'a'; return NAME; }

The set and lookup functions still have to be defined, and they can be defined at the bottom of the yacc file after the rules section:

int lookup(int symbol_num) {
  // Check bounds before indexing into the table.
  if (symbol_num < 0 || symbol_num >= TABLE_SIZE ||
      SYMBOL_TABLE[symbol_num] == UNSET) {
    return 0;  // false by default
  }
  return SYMBOL_TABLE[symbol_num];
}

void set(int symbol_num, int value) {
  if (symbol_num < 0 || symbol_num >= TABLE_SIZE) {
    return;  // do nothing by default
  }
  SYMBOL_TABLE[symbol_num] = value;
}

After this is added, the parser also accepts assignment expressions and looks up variables in a table.

> a = not true      
false
> b = (true and true)
true
> a or b
true
> a and b
false

Sunday, December 13, 2015

Ningbonese Tones

How many tones does Ningbonese (Ningbo dialect) have?

This question is more complicated than it appears, and to explain why, one must go back to the history of tones in Chinese language:

In the beginning (Middle Chinese), there were four tones: level (平), rising (上), departing (去), and checked (入). By Late Middle Chinese, however, each tone had split into two, yin (阴) and yang (阳), according to whether the initial consonant was voiced.

Because the specific tone contour depended only on which of the four tones a word belonged to, together with the voicing of its initial consonant, it could still be said that there were four tones, but eight possible "tone contours". Wu Chinese, including Ningbo dialect, preserves both the voicing distinction and the yin/yang split -- so it, too, can be said to have four tones but more possible tone contours.

So how many "tone contours" are there? First of all, looking at single characters, the tones according to online sources are as follows:

Tone name | Tone contour | Example | English
阴平 yin level | 53 High falling | 刀 tau | knife
阳平 yang level | 24 Mid rising | 逃 dau | escape
阴上 yin rising | 35 Mid rising | 岛 tau | island
阳上 yang rising | 213 Low rising | 导 dau | lead
阴去 yin departing | 44 Mid-high level | 到 tau | arrive
阳去 yang departing | 213 Low rising | 道 dau | path
阴入 yin checked | 55 High level | 得 tah | get
阳入 yang checked | 12 Low rising | 达 dah | reach

Yang rising and yang departing share the same contour above, so according to this table there are seven tone contours for individual syllables. To my untrained ears, however, several of them sound alike, and I can only distinguish six:

Tone contour | Examples
Falling |
Rising | 逃岛导道
High |
High with stop |
Rising with stop |

Wu Chinese also has Tone Sandhi, so the actual tone contour of a word depends on the words surrounding it -- and that's a topic for another post!

Sunday, December 6, 2015

Ningbonese Vowels

Wu Chinese exhibits several features which Mandarin doesn’t have -- rounded front vowels (like German schön) and nasal vowels (like French bon). Let's take a look!

Firstly, there are the basic simple vowel sounds found in most languages:

Romanization | IPA | Example | English
a | a | 买 ma | buy
e | e | 海 he | sea
ae | ɛ | 蛋 dae | egg
i | i | 天 thi | heaven
o | o | 沙 so | sand
u | u | 古 ku | ancient

It's worth noting that the "o" is fairly close to "u". There are also some straightforward diphthongs:

Romanization | IPA | Example | English
ei | ɐi | 对 tei | correct
ua | ua | 快 khua | fast
ia | ia | 爹 tia | dad
io | io | 小 shio | small

There is also the very short vowel found in the Mandarin reading of words like 子, 次, or 四. In Ningbo dialect there is also a rounded version of this vowel, found in some words like 水:

Romanization | IPA | Example | English
y | ɿ | 四 sy | four
yu | ʮ | 水 syu | water

Two vowels which may be easy to confuse are the vowels in 好 and 火; the vowel in 好 is just the open-mid back rounded vowel ɔ, whereas the vowel in 火 sounds more like the sound in house.

Romanization | IPA | Example | English
au | ɔ | 好 hau | good
ou | əu | 火 hou | fire

The vowels in Ningbo dialect which involve front rounding are:

Romanization | IPA | Example | English
iu | y | 区 chiu | district
eu | œʏ | 头 deu | head
ieu | | 手 shieu | hand
oe | ø | 短 toe | short

A subset of the vowels can be followed by a nasal consonant -- unlike in Mandarin, there is no distinction between front (alveolar) and back (velar) nasal endings. Note that for some speakers, oen may rhyme with on.

Romanization | IPA | Example | English
on | oŋ | 东 ton | east
en | əŋ | 村 tsen | village
oen | øŋ | 春 tshoen | spring
in | iŋ | 冰 pin | ice

For two of the vowels, instead of being followed by a nasal consonant, the vowel itself becomes nasalized! Note that in the Wu Chinese Association's romanization system these are also spelled with an "n"; there is no ambiguity, because any given vowel is either one that nasalizes or one that takes a nasal consonant, never both.

Romanization | IPA | Example | English
an | ã | 生 san | birth
aon | ɔ̃ | 汤 thaon | soup

Finally: Wu Chinese preserves the "checked tone" of Old Chinese, where syllables can end with stop consonants. The only stop consonant that words can end with is the glottal stop (ʔ).

Romanization | IPA | Example | English
ah | aʔ | 法 fah | law
oeh | øʔ | 雪 soeh | snow
ih | iʔ | 笔 pih | pen
oh | oʔ | 木 moh | wood
iueh | yəʔ | 月 yueh | moon
ioh | yoʔ | 吃 chioh | eat

Primary source: the Wu Chinese Association website and this post on Baidu Tieba.

Sunday, November 29, 2015

Ningbonese Consonants

In general, Wu Chinese dialects have a similar set of initial consonants as Mandarin, with one major difference: Wu Chinese preserves the voiced initial consonants of Old Chinese -- so there's a distinction between 白 (voiced) and 百 (unvoiced). Additionally, Wu Chinese does not have retroflex consonants (Mandarin zh/ch/sh/r).

It may be difficult for learners to distinguish the voiced and voiceless un-aspirated consonants; it's helpful to note that syllables with voiced initial consonants always have light (阳) tones, while syllables with voiceless initial consonants always have dark (阴) tones.

There are three groups of stop consonants -- p, t, and k -- each with a voiceless un-aspirated, an aspirated, and a voiced version:
Romanization | IPA | Example | English
p | p | 八 pah | eight
ph | pʰ | 漂亮 phio lian | beautiful
b | b | 朋友 ban yeu | friend
t | t | 多少 tou shio | how much
th | tʰ | 天亮 thi nyan | morning
d | d | 头发 deu fah | hair
k | k | 交关 chio kuae | very
kh | kʰ | 困觉 khuen kau | sleep
g | g | 番茄 fae ga | tomato

There are also two groups of affricates: "c/ch" is like Mandarin "j/q", and is usually followed by an "i" sound; "ts/tsh" is like Mandarin "z/c"; "dz" and "j" are their voiced counterparts.

Romanization | IPA | Example | English
ts | ts | 钟头 tson deu | hour
tsh | tsʰ | 出 tshoeh | out
dz | dz | 茶 dzo | tea
c | tɕ | 今末 cih mah | today
ch | tɕʰ | 去 chiu | go
j | dʑ | 其 ji | him/her

There are four groups of fricative consonants, each with voiced and unvoiced versions. It can be particularly difficult to hear the voiced glottal fricative "gh" and the voiced palatal fricative "zh", since these sounds do not appear in English or Mandarin.

Romanization | IPA | Example | English
f | f | 飞机 fi ci | airplane
v | v | 不好 vah hau | not good
h | h | 好 hau | good
gh | ɦ | 后 gheu | back
s | s | 水 syu | water
z | z | 辰光 zoen kuaon | time
sh | ɕ | 晓得 shio tah | know
zh | ʑ | 前 zhi | front

Finally, there are the sonorant consonants -- unlike Mandarin, this includes a velar nasal (ng) and a palatal nasal (ny).

Romanization | IPA | Example | English
m | m | 明朝 min cio | tomorrow
ny | ȵ | 人 nyin | person
ng | ŋ | 我 ngo | me
n | n | 侬 nou | you
l | l | 来 le | come

Reference: Wu-chinese.com.

Saturday, November 21, 2015

Giving values to expressions in yacc

In the last post, I started to use yacc, but in that example, the action for each rule was just to print out a message; the tokens were not combined or evaluated in any way.

Inside an action block after a grammar rule expr : X Y Z, $$ is the value of the left-hand side (the output when that rule is used to reduce a sequence of symbols), and the values of the right-hand-side symbols are available as $1, $2, $3, and so on.

For example, in our logic expression grammar we can have:

expr    : NOT expr { $$ = ! $2; }
        | expr AND expr { $$ = ($1 && $3); }
        | expr OR expr { $$ = ($1 || $3); }
        | LPAREN expr RPAREN { $$ = $2; }
        | TRUE { $$ = 1; }
        | FALSE { $$ = 0; }
        ;

Each alternative (separated by |) can have an action associated with it; each of these rules can combine the right hand side values in a different way to set $$.

With this, we can now parse and evaluate complex expressions:

File: logic.y
%{
#include <stdio.h>
%}

%token TRUE FALSE AND OR NOT LPAREN RPAREN
%left OR
%left AND
%left NOT

%%

start   : expr {
          if ($1) {
            printf("true\n");
          } else {
            printf("false\n");
          }
        }
        ;

expr    : NOT expr { $$ = ! $2; }
        | expr AND expr { $$ = ($1 && $3); }
        | expr OR expr { $$ = ($1 || $3); }
        | LPAREN expr RPAREN { $$ = $2; }
        | TRUE { $$ = 1; }
        | FALSE { $$ = 0; }
        ;

%%

extern FILE *yyin;

int main() {
  do {
    printf("> ");
    yyparse();
  } while (!feof(yyin));
}

int yyerror(char *s) {
  fprintf(stderr, "yyerror called with: %s\n", s);
  return 0;
}

The %token section at the top, as before, specifies the tokens and allows yacc to generate a header file that's also used by the lexer. The %left declarations specify operator precedence and associativity so that the grammar is not ambiguous.

The first rule, start, has one action associated with it: printing the value of a parsed expression. In this grammar, there are several possible patterns that can comprise an expr; all of these rules are recursive except for TRUE and FALSE, which have primitive values.

In this example, the main function now prints a prompt, reads and parses a sequence of tokens, and repeats until EOF (Ctrl-D at the terminal) is entered.

Example:
$ ./calc
> false and false and false and false 
false
> true and (false or not false)
true
> true and
yyerror called with: syntax error
> 

Sunday, November 15, 2015

Combining lex and yacc

In the previous post, I posted an example of a simple lexer with lex, where the action for each rule was to print out what rule it matched.

When lex is used with yacc, however, each lex rule may just return an integer which corresponds to a given token type, defined in a header file output by yacc. For example:

File: logic.l
%{
#include "y.tab.h"
%}

%%

not { return NOT; }
or  { return OR; }
and { return AND; }
\(  { return LPAREN; }
\)  { return RPAREN; }
false { return FALSE; }
true  { return TRUE; }

\n { return 0; }  // End marker

. ;  // Ignore everything else.

%%

In this example, the header generated by yacc is included in the header section of the file; each of the rules just returns one of the token IDs defined in y.tab.h.

Of special note is the \n rule: returning 0 is the special value that signals the end of input to the parser.

The yacc file looks like this:

File: logic.y
%{
#include <stdio.h>
%}

%token TRUE FALSE AND OR NOT LPAREN RPAREN

%%

expr    : literal
        | unary
        | binary
        | parens
        { printf("Parsed expr.\n"); }
        ;

literal : TRUE
        | FALSE
        { printf("Parsed literal.\n"); }
        ;

unary   : NOT expr
        { printf("Parsed not expr.\n"); }
        ;

binary  : expr AND expr
        | expr OR expr
        { printf("Parsed binary expr.\n"); }
        ;

parens  : LPAREN expr RPAREN
        { printf("Parsed paren expr.\n"); }
        ;

%%

int main() {
  yyparse();
  printf("Called yyparse once; terminating.\n"); 
}

int yyerror(char *s) {
  fprintf(stderr, "yyerror called with: %s\n", s);
  return 0;
}

As with the lex file, the yacc file has a header section which can contain C code, and then a sequence of rules. In this header section we can include libraries like stdio.h, which allows us to use printf below.

After the header section (delimited by %{ %}) we have a list of token types. Token type names are traditionally written in all upper-case, although this is just a convention.

In this example, expr, literal, unary, binary and parens each match a particular type of expression. The first rule listed is the "start" symbol; if the sequence of tokens before the "end marker" token (0) can be parsed using the first rule, the input is accepted; otherwise there is a syntax error. Note that an action attaches only to the alternative it directly follows -- which is why, in the session below, "Parsed literal." is printed for false but not for true.

Below the list of rules we have some C code, which gets copied verbatim into the C source file that yacc outputs. The above lex and yacc file can be made into a working binary by running the commands:

$ lex logic.l
$ yacc -d logic.y
$ cc -c -o lex.yy.o lex.yy.c
$ cc -c -o y.tab.o y.tab.c
$ cc -o calc lex.yy.o y.tab.o -ly -ll

Lex by default produces lex.yy.c, as before; yacc produces y.tab.c, and since the option -d is passed, it also produces y.tab.h which contains the token definitions. Then lex.yy.c and y.tab.c can be separately compiled into object files, and these object files can be linked together into a binary.

To make this more convenient, it's a good idea to use a Makefile:

File: Makefile
all: calc

clean:
        rm -f lex.yy.c y.tab.c y.tab.h *.o calc

calc: lex.yy.o y.tab.o
        cc -o calc lex.yy.o y.tab.o -ly -ll

lex.yy.o: lex.yy.c y.tab.h

y.tab.o: y.tab.c y.tab.h

y.tab.h y.tab.c: logic.y
        yacc -d logic.y

lex.yy.c: logic.l
        lex logic.l

Finally, we can run the program:

$ ./calc
true and (not false)
Parsed literal.
Parsed not expr.
Parsed paren expr.
Parsed expr.
Called yyparse once; terminating.

And if there's a syntax error (no complete expression before the end marker token (0)):

false and
Parsed literal.
yyerror called with: syntax error
Called yyparse once; terminating.

The next step would be to build some kind of parse tree and evaluate it.

Saturday, November 7, 2015

Starting to look at lex

Last week I came across a copy of lex & yacc and started to read it. For my first toy project to learn the content, I'd like to make some kind of a "logic calculator", which might, for example, behave like this:

> true and false
false
> false or not false
true
> a = true, b = false, c = false
> (b or not a) and c
false

The first step in this task is to make a lexer to tokenize the input -- to convert the input into a sequence of tokens. With lex, a lexer is defined in a special domain specific language. Example 1:

File: logic.l
%%

not { printf("NOT: %s\n", yytext); }
or  { printf("OR: %s\n", yytext); }
and { printf("AND: %s\n", yytext); }
\(  { printf("LPAREN: %s\n", yytext); }
\)  { printf("RPAREN: %s\n", yytext); }
false { printf("FALSE: %s\n", yytext); }
true  { printf("TRUE: %s\n", yytext); }

. ;  // Ignore everything else.

%%

Things to note: the lex file contains a rules section, marked at the start and end with %%. In a minimal example like this one, the file can consist of just this sequence of rules. Each rule consists of a regular expression followed by a C block containing the code to execute when the lexer finds input matching that expression. Inside the C block, the matched string is available as yytext.

Once we have a lex file like the one above, we can use lex to make a C source file from it:

$ lex logic.l

This creates a file called lex.yy.c (1791 lines!) which can then be compiled and run.

$ cc lex.yy.c -o lexer -ll

$ ./lexer
(true and false) ...
LPAREN: (
TRUE: true
AND: and
FALSE: false
RPAREN: )

^C

In this example, each of the rules in the lex file prints some text to stdout and doesn't return a token type; but when a lex lexer is used with a parser, the block of code for each rule will typically return an integer representing the token type.