Friday, August 9, 2019

Ningbonese Study Resources

The resources available for Wu Chinese varieties in general are very limited compared to other large language groups, and most of them are targeted at Chinese speakers, not English speakers.

General resources

The Wu Chinese Society (wu-chinese.com) has put together an excellent website with dictionaries, a forum for discussion, and a guide to romanization. There are many romanization schemes for Wu; the one used by the Wu Chinese Society is Tongyong Wuyu Pinyin (通用吴语拼音).

Another group is 吴语学堂, for which there are two sites: www.wugniu.com and www.goetian.net. These seem to be relatively new, somewhat grassroots sites.

Books

There are a number of good Chinese books for learning Ningbonese, but they may be hard to obtain outside of China, such as:


Audio

There are some audio resources available on Ximalaya FM, the podcast and audio show platform:

Video

The biggest YouTube channel that has resources for Wu Chinese is 吳儂軟語 Wu Chinese. You may also have great luck finding videos on the many Chinese-language video sites, e.g. Bilibili.

Wikipedia

There is a Wu Chinese edition of Wikipedia. Due to the relatively non-standardized and non-unified nature of Wu Chinese, there is a usage guide for editors, which is fairly useful. The article 吴语 from the Chinese Wikipedia is also well-written.

Sunday, December 3, 2017

Udacity 120 Part 1: Intro to Classification Methods

The first part of Udacity UD120 introduces classification: the problem of predicting which category new objects belong to based on prior examples.

General concepts from the course

Often in the course, the objects to be classified are shown as 2D points in a scatter plot. Such scatter plots are easy to visualize, so they are often used for instructional purposes. In such a plot, the features are the x and y values, and each point has a label, or class. In real data, there are usually more than two features (and thus more than two dimensions in the feature space). Classification is usually a matter of finding features that have predictive power.

The decision surface, or decision boundary, is a boundary separating class regions in the feature space. With n features there is an n-dimensional feature space, which is divided into class regions by (n-1)-dimensional decision surfaces.

In machine learning, the best practice is to train and test on different sets of data to avoid over-fitting to the training data. A general rule: hold out about 10% of the data and use it as the testing set.
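That rule can be sketched in a few lines of plain Python (a minimal hold-out split; real projects would typically use a library helper):

```python
import random

def holdout_split(data, test_fraction=0.1, seed=0):
    """Shuffle a copy of the data and hold out the last fraction for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

train, test = holdout_split(list(range(100)))
print(len(train), len(test))  # 90 10
```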

A supervised machine learning algorithm may be more or less sensitive to individual new examples, and there is often a trade-off between over-fitting and under-fitting, or between variance and bias.

Four Classification Algorithms


Naive Bayes

The Gaussian Naive Bayes classifier is based (somehow) on Bayes Rule. Some background about Bayes Rule was discussed in class, in the context of a medical test for cancer, where Pos and Neg mean a positive or negative test result, and C means "patient has cancer":

The sensitivity P(Pos|C) is also known as true positive rate, whereas specificity P(Neg|~C) is the true negative rate. The prior P(C) is the probability before the test, and the posterior P(C|Pos), which is often what we're interested in, is the probability given the result of the test.

In order to calculate the posterior, we first calculate the joint probability P(C, Pos) = P(C)P(Pos|C), then divide by the normalizer P(Pos). That is, the posterior P(C|Pos) = P(C, Pos) / P(Pos) = P(C)P(Pos|C) / P(Pos). This is how Bayes Rule is often stated.
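For example (the specific numbers below are illustrative, not taken from the course): with a prior of 1%, sensitivity of 90%, and specificity of 90%, the posterior after one positive test is still only about 8.3%:

```python
def bayes_posterior(prior, sensitivity, specificity):
    """P(C|Pos): the joint probability P(C, Pos) divided by the normalizer P(Pos)."""
    joint = prior * sensitivity                           # P(C, Pos) = P(C) P(Pos|C)
    normalizer = joint + (1 - prior) * (1 - specificity)  # P(Pos)
    return joint / normalizer

print(round(bayes_posterior(0.01, 0.9, 0.9), 4))  # 0.0833
```

The counter-intuitive smallness of the result comes from the normalizer: with a rare condition, false positives from the healthy 99% swamp the true positives.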

SVM

SVMs (Support Vector Machines) are another classification algorithm. They separate the regions of different classes in a way that maximizes the margin -- the distance between the decision boundary and the nearest points of the different classes. At the same time, some outliers may be permitted.

Non-linear boundaries are possible via the kernel trick. Even without understanding how it works, users of machine learning libraries can create SVM models with different types of boundaries by specifying different kernels.
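For instance, with scikit-learn (one such library; this example is mine, not from the course notes), the kernel is just a constructor parameter:

```python
from sklearn.svm import SVC

# Two well-separated 2D clusters: class 0 near the origin, class 1 far away.
features = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
labels = [0, 0, 0, 1, 1, 1]

linear_clf = SVC(kernel="linear").fit(features, labels)
rbf_clf = SVC(kernel="rbf").fit(features, labels)  # non-linear boundary

print(linear_clf.predict([[0.5, 0.5], [9.5, 9.5]]))  # [0 1]
print(rbf_clf.predict([[0.5, 0.5], [9.5, 9.5]]))     # [0 1]
```

On data this cleanly separated both kernels agree; the difference only shows up when no straight line can separate the classes.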

Decision Trees

Decision trees are binary trees that can be used to classify objects by answering a series of yes/no questions. Decision trees are an old standard for the problem of classification, which have the advantage of being more understandable by humans than other learned models.

The problem of learning a decision tree from a training set involves finding the best way to split the sample points. The method discussed in the class was to calculate the information gain for each possible split under consideration, where information gain is the difference between the entropy before splitting and the weighted average entropy after splitting.
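That computation is easy to sketch in plain Python (function names here are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy before splitting minus the weighted entropy after splitting."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# A perfect split of a 50/50 parent gains a full bit of information.
print(information_gain(["a", "a", "b", "b"], ["a", "a"], ["b", "b"]))  # 1.0
```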

k-Nearest Neighbors

k-Nearest Neighbors, abbreviated as k-NN, is a classification scheme where an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.

To predict the label of a new point in the feature space, we find the k nearest neighbors in the space. Whatever is the most common label among those points is the predicted label of the new point.

For k-nearest neighbors, much of the work of classification is deferred until the time of classification. That is, "training" can be super fast, and classification can be slower, especially if there are many features and the number of neighbors to find is high.
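A brute-force sketch of the voting scheme in plain Python (the course itself uses a library implementation):

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_predict(points, labels, (1, 1)))  # red
print(knn_predict(points, labels, (5, 4)))  # blue
```

Note that all the distance computation happens at prediction time, which is exactly the deferred-work property described above.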

Sunday, October 8, 2017

Udacity CS259: Software Debugging Course Notes

I recently finished going through Udacity's CS259: Software Debugging class.

One insight from the beginning of the course was that we should treat bugs as mysteries to be solved, which we can explore systematically using the scientific method.

To help us reason about the state of the program, the teacher, Andreas Zeller, uses some interesting terms:
  • defect: The teacher prefers the term defect rather than bug; bug suggests something external, whereas defect suggests that the program was constructed incorrectly.
  • infection: An infection is when, during the execution of a program, the state is incorrect.
  • infection origin: The first point of infection.
  • failure: The failure is the incorrect output that we actually observe on the surface when the program is executed.
A defect causes infection, which then spreads and causes failure. Debugging is working backward from the failure to the infection origin to the defect itself.

7 Stages of Debugging

This course gives the following description of the process of debugging after discovering a failure.

  1. Track the problem using a bug tracker. Bug trackers can yield useful statistics.
  2. Reproduce the problem. The failure may only occur in a certain environment.
  3. Automate and simplify to make a test case that reproduces the problem.
  4. Find possible infection origins. That is, sources where the incorrectness started.
  5. Focus on most likely origins. Code smells, past problems, etc.
  6. Isolate the infection chain. Use the scientific method. Set up experiments.
  7. Correct the defect. Then verify and follow up.

"Automating the Boring Tasks"

Usually, when it comes to debugging, the most basic tools that we have are print debugging and step debuggers.

But one of the main messages of this class was that we can use programs to help us reason about programs: aspects of debugging can be automated. For example:

  • Finding the minimal input that causes a particular output (e.g. delta debugging)
  • Finding differences between a failing and passing run of a program
  • Finding lines in a project that are correlated with failure
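As a toy version of the first idea, a greedy loop in the spirit of delta debugging (far simpler than the real ddmin algorithm) repeatedly drops characters as long as the failure persists:

```python
def minimize(failing_input, fails):
    """Greedily drop characters while the input still triggers the failure."""
    current = failing_input
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):   # still failing without this character?
                current = candidate
                changed = True
                break
    return current

# Pretend the program fails whenever its input contains "42".
print(minimize("abc42def", lambda s: "42" in s))  # 42
```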

Possible project inspiration

Zeller analyzed the Mozilla codebase by mining the Bugzilla issue tracker: the issues, the patches that fixed them, and the files those patches touch. This could be done for most large software projects by mining the issue tracker data. As a simple way to start:

  • List all fixed bugs of some type.
  • List commits associated with those bugs.
  • List files touched by those commits.
  • This should yield a listing of bug-prone files.
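Once the bug and commit data has been extracted, the final tally is a simple join and count (the data layout here is my own invention):

```python
from collections import Counter

def bug_prone_files(bug_to_commits, commit_to_files):
    """Count, per file, how many distinct fixed bugs touched it."""
    counts = Counter()
    for bug, commits in bug_to_commits.items():
        touched = set()
        for commit in commits:
            touched.update(commit_to_files.get(commit, []))
        counts.update(touched)  # each bug counts a file at most once
    return counts

bug_to_commits = {"BUG-1": ["c1", "c2"], "BUG-2": ["c3"]}
commit_to_files = {"c1": ["parser.c"], "c2": ["parser.c", "lexer.c"],
                   "c3": ["parser.c"]}
print(bug_prone_files(bug_to_commits, commit_to_files).most_common())
# [('parser.c', 2), ('lexer.c', 1)]
```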

Tuesday, December 29, 2015

Writing New Year's Resolutions

As time passes, we all become different versions of ourselves, although not necessarily much different. This year, I want to become a better version of myself, and I believe that this requires changing habits.

According to The Power of Habit, each habit has a cue, a routine, and a reward; changing habits often involves changing the routine part of an existing habit, which requires setting specific new habits to replace the old ones.

So, with that in mind, here's what I wish for this new year:

  1. I want to be patient and considerate. Whenever I notice myself becoming frustrated, I will reflect on why it happened and what a better attitude would have been.
  2. I want to be physically healthy, and be conscious of my eating and exercise habits.
  3. I want to regularly learn new things.
And in work:

  1. I want to be reliable by being systematic in task management.
  2. I want to produce quality work.
  3. I want to take new opportunities.

Monday, December 21, 2015

Simple symbol table with yacc

In the last post about yacc, we got yacc to evaluate expressions and print out the evaluated value.

In programs that involve setting and looking up variable values, there is usually some kind of symbol table. At the top of the yacc file, I've added the following global symbol table and related functions:

%{
#include <stdio.h>

#define TABLE_SIZE 26
#define UNSET -1

int SYMBOL_TABLE[TABLE_SIZE];
int lookup(int symbol_num);
void set(int symbol_num, int value);
void reset_table();

%}

In this example, the symbol table is just big enough to fit one variable per letter of the alphabet. The table is expected to be initialized with a special "unset" value before use.

The rule section of the yacc file has been modified to include an "assignment statement" type expression. When an assignment statement is encountered, the symbol table is modified. When a variable is encountered and is being used for its value, the value of the expression is looked up in the symbol table.

%%

start   : stmt { printf($1 ? "true\n" : "false\n"); }
        | expr { printf($1 ? "true\n" : "false\n"); }
        ;

expr    : NOT expr { $$ = ! $2; }
        | expr AND expr { $$ = ($1 && $3); }
        | expr OR expr { $$ = ($1 || $3); }
        | LPAREN expr RPAREN { $$ = $2; }
        | NAME { $$ = lookup($1); }
        | FALSE { $$ = 0; }
        | TRUE { $$ = 1; }
        ;

stmt    : NAME ASSIGN expr { set($1, $3); $$ = $3; }
        ;

%%

We're using two new tokens, so they have to be included in the lex file rules section:

=   { return ASSIGN; }
[a-z] { yylval = yytext[0] - 'a'; return NAME; }

The set and lookup functions still have to be defined, and they can be defined at the bottom of the yacc file after the rules section:

int lookup(int symbol_num) {
  printf("lookup %d\n", symbol_num);
  // Check bounds before reading the table.
  if (symbol_num < 0 || symbol_num >= TABLE_SIZE ||
      SYMBOL_TABLE[symbol_num] == UNSET) {
    return 0;  // false by default
  }
  return SYMBOL_TABLE[symbol_num];
}

void set(int symbol_num, int value) {
  printf("set %d %d\n", symbol_num, value);
  if (symbol_num < 0 || symbol_num >= TABLE_SIZE) {
    return;  // do nothing by default
  }
  SYMBOL_TABLE[symbol_num] = value;
}
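One function declared at the top of the file but not shown here is reset_table, which marks every slot unset before parsing begins. A minimal sketch (repeating the globals so the snippet stands alone; in the yacc file only the function itself would be added):

```c
#define TABLE_SIZE 26
#define UNSET -1

int SYMBOL_TABLE[TABLE_SIZE];

// Fill the table with the UNSET marker so lookup() can tell
// "never assigned" apart from a stored value.
void reset_table() {
  for (int i = 0; i < TABLE_SIZE; i++) {
    SYMBOL_TABLE[i] = UNSET;
  }
}
```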

After this is added, the parser also accepts assignment expressions and looks up variables in a table.

> a = not true      
false
> b = (true and true)
true
> a or b
true
> a and b
false

Sunday, December 13, 2015

Ningbonese Tones

How many tones does Ningbonese (Ningbo dialect) have?

This question is more complicated than it appears, and to explain why, one must go back to the history of tones in Chinese language:

In the beginning (Middle Chinese), there were four tones: level (平), rising (上), departing (去) and checked (入). By Late Middle Chinese, however, each tone had split into two, yin (阴) and yang (阳), according to whether the initial consonant was voiceless or voiced.

Because the specific tone contour depended only on which of the four tones a word belonged to, together with its initial consonant, it could still be said that there are four tones, but eight possible "tone contours". Wu Chinese, including Ningbo dialect, preserves the voicing distinction in initial consonants, and with it the yin/yang split -- so it, too, can be described as having four tones, but more possible tone contours.

So how many "tone contours" are there? First of all, looking at single characters in isolation, the tones according to online sources are as follows:

Tone name            Tone contour        Example    English
阴平 yin level        53 High falling     刀 tau     knife
阳平 yang level       24 Mid rising       逃 dau     escape
阴上 yin rising       35 Mid rising       岛 tau     island
阳上 yang rising      213 Low rising      导 dau     lead
阴去 yin departing    44 Mid-high level   到 tau     arrive
阳去 yang departing   213 Low rising      道 dau     path
阴入 yin checked      55 High level       得 tah     get
阳入 yang checked     12 Low rising       达 dah     reach

Yang rising and yang departing have the same contour (213); so, according to this table, there are seven tone contours for individual syllables. However, to my untrained ears I can't tell several of them apart, so I can only distinguish six tone contours:

Tone contour        Examples
Falling
Rising              逃 岛 导 道
High
High with stop
Rising with stop

Wu Chinese also has Tone Sandhi, so the actual tone contour of a word depends on the words surrounding it -- and that's a topic for another post!

Sunday, December 6, 2015

Ningbonese Vowels

Wu Chinese exhibits several features which Mandarin doesn’t have -- rounded front vowels (like German schön) and nasal vowels (like French bon). Let's take a look!

Firstly, there are the basic simple vowel sounds found in most languages:

Romanization   IPA   Example   English
a              a     买 ma     buy
e              e     海 he     sea
ae             ɛ     蛋 dae    egg
i              i     天 thi    heaven
o              o     沙 so     sand
u              u     古 ku     ancient

It's worth noting that the "o" is fairly close to "u". There are also some straightforward diphthongs:

ei             ɐi    对 tei    correct
ua             ua    快 khua   fast
ia             ia    爹 tia    dad
io             io    小 shio   small

There is also the very short vowel which is found in the Mandarin reading of words like 子, 次, or 四. And in Ningbo dialect there is also a rounded version of this vowel, found in some words like 水:

y              ɿ     四 sy     four
yu             ʮ     水 syu    water

Two vowels which may be easy to confuse are the vowels in 好 and 火; the vowel in 好 is just the open-mid back rounded vowel ɔ, whereas the vowel in 火 sounds more like the sound in house.

au             ɔ     好 hau    good
ou             əu    火 hou    fire

The vowels in Ningbo Dialect which involve rounding are:

iu             y     区 chiu    district
eu             œʏ    头 deu     head
ieu                  手 shieu   hand
oe             ø     短 toe     short

A subset of the vowels can be followed by a nasal consonant -- unlike Mandarin, there is no distinction between front (alveolar) and back (velar) nasal endings. Note that for some people, oen may rhyme with on.

on                   东 ton      east
en             əŋ    村 tsen     village
oen            øŋ    春 tshoen   spring
in                   冰 pin      ice

For two of the vowels, instead of being followed by a nasal consonant, the vowels themselves become nasalized! Note that in the Wu Chinese Society's romanization system, these are also spelled with an "n"; there is no ambiguity, because any given vowel can either be nasalized or followed by a nasal consonant, but not both.

an             ã     生 san      birth
aon            ɔ̃     汤 thaon    soup

Finally: Wu Chinese preserves the "checked tone" of Middle Chinese, where syllables can end with a stop consonant. The only stop consonant that words can end with is the glottal stop (ʔ).

ah                   法 fah      law
oeh            øʔ    雪 soeh     snow
ih                   笔 pih      pen
oh                   木 moh      wood
iueh           yəʔ   月 yueh     moon
ioh            yoʔ   吃 chioh    eat

Primary source: the Wu Chinese Society website and this post on Baidu Tieba.