Sunday, December 3, 2017

Udacity 120 Part 1: Intro to Classification Methods

The first part of Udacity UD120 introduces classification: the problem of predicting which category new objects belong to based on prior examples.

General concepts from the course

Often in the course, the objects to be classified are shown as 2D points in a scatter plot. Such scatter plots are easy to visualize, so they are often used for instructional purposes. In a scatter plot like this, the features are the x and y values, and each point has a label, or class. In real data there are usually more than two features (and thus more than two dimensions in the feature space). Classification is largely a matter of finding features that have predictive power.

The decision surface, or decision boundary, is a boundary separating class regions in the feature space. With n features, the feature space is n-dimensional, and it is divided into class regions by (n-1)-dimensional decision surfaces.

In machine learning, the best practice is to train and test on different sets of data, to avoid over-fitting to the training data. A general rule: save about 10% of the data and use it as the testing set.
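As a minimal sketch of the hold-out idea, using scikit-learn and a made-up dataset (the dataset and parameter values here are just placeholders, not taken from the course):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical 2-feature dataset, only to make the sketch runnable.
features, labels = make_classification(n_samples=200, n_features=2,
                                       n_informative=2, n_redundant=0,
                                       random_state=0)

# Hold out about 10% of the points as the testing set.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)
```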

A supervised machine learning algorithm may be more or less sensitive to individual training examples, and there is often a trade-off between over-fitting and under-fitting, or equivalently between variance and bias.

Four Classification Algorithms


Naive Bayes

The Gaussian Naive Bayes classifier is based (somehow) on Bayes Rule. Some background on Bayes Rule was discussed in class, in the context of a medical test for cancer, where Pos and Neg mean positive and negative test results, and C means "patient has cancer":

The sensitivity P(Pos|C) is also known as true positive rate, whereas specificity P(Neg|~C) is the true negative rate. The prior P(C) is the probability before the test, and the posterior P(C|Pos), which is often what we're interested in, is the probability given the result of the test.

In order to calculate the posterior, we first calculate the joint probability P(C, Pos) = P(C)P(Pos|C), then divide by the normalizer P(Pos). That is, the posterior P(C|Pos) = P(C, Pos) / P(Pos) = P(C)P(Pos|C) / P(Pos). This is how Bayes Rule is often stated.
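As a concrete sketch of that calculation, with made-up numbers (a low prior, and a test with 90% sensitivity and 90% specificity; these values are illustrative, not quoted from the course):

```python
# Hypothetical numbers for the cancer-test example.
p_c = 0.01          # prior P(C)
sensitivity = 0.90  # P(Pos|C)
specificity = 0.90  # P(Neg|~C)

# Joint probabilities of a positive result with and without cancer.
joint_pos_c = p_c * sensitivity                  # P(C, Pos) = P(C) P(Pos|C)
joint_pos_not_c = (1 - p_c) * (1 - specificity)  # P(~C, Pos)

# Normalizer: total probability of a positive test.
p_pos = joint_pos_c + joint_pos_not_c            # P(Pos)

posterior = joint_pos_c / p_pos                  # P(C|Pos)
print(posterior)                                 # about 0.083
```

With these numbers the posterior is only about 8%, far below the test's sensitivity, because the prior probability of cancer is so low.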

SVM

SVMs (Support Vector Machines) are another classification algorithm. An SVM separates the regions of different classes in a way that maximizes the margin -- the distance between the decision boundary and the nearest points of each class. At the same time, some outliers may be permitted on the wrong side of the boundary.

Non-linear boundaries are possible via the kernel trick. Even without understanding how it works, users of machine learning libraries such as scikit-learn can create SVM models with different types of boundaries by specifying different kernels.
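A minimal sketch with scikit-learn's SVC, reusing the same kind of made-up dataset as above; the kernel parameter controls the shape of the boundary, and C controls how much outliers are tolerated (the specific values here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical 2-feature dataset, only to make the sketch runnable.
features, labels = make_classification(n_samples=200, n_features=2,
                                       n_informative=2, n_redundant=0,
                                       random_state=0)
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)

# kernel='linear' gives a straight-line boundary; kernel='rbf' allows curved
# boundaries via the kernel trick. C trades margin width against outliers.
clf = SVC(kernel='rbf', C=1.0)
clf.fit(features_train, labels_train)
print(clf.score(features_test, labels_test))  # accuracy on the held-out set
```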

Decision Trees

Decision trees are binary trees that classify objects by answering a series of yes/no questions. They are an old standard for classification, and have the advantage of being more understandable to humans than other learned models.

Learning a decision tree from a training set involves finding the best way to split the sample points. The method discussed in class for choosing a split is to calculate the information gain for each possible split under consideration, where information gain is the difference between the entropy before splitting and the weighted average entropy of the branches after splitting.
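A rough sketch of that calculation, using a hypothetical list of labels and an imaginary candidate split into two branches:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(parent_labels, branches):
    """Entropy before splitting minus the weighted average entropy after."""
    total = len(parent_labels)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent_labels) - weighted

# Hypothetical split: a candidate question sends the samples into two branches.
parent = ['slow', 'slow', 'fast', 'fast']
left, right = ['slow', 'slow'], ['fast', 'fast']
print(information_gain(parent, [left, right]))  # 1.0 -- a perfect split
```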

k-Nearest Neighbors

k-Nearest Neighbors, abbreviated as k-NN, is a classification scheme where an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.

To predict the label of a new point in the feature space, we find its k nearest neighbors in the space. The most common label among those neighbors is the predicted label of the new point.
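A sketch with scikit-learn's KNeighborsClassifier (again on a made-up dataset), where n_neighbors is the k in k-NN:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-feature dataset, only to make the sketch runnable.
features, labels = make_classification(n_samples=200, n_features=2,
                                       n_informative=2, n_redundant=0,
                                       random_state=0)
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)

# k = 5 nearest neighbors vote on the label of each test point.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(features_train, labels_train)   # "training" mostly just stores the points
print(clf.predict(features_test[:3]))   # predicted labels for three new points
print(clf.score(features_test, labels_test))
```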

For k-nearest neighbors, much of the work of classification is deferred until prediction time. That is, "training" can be super fast, while classification can be slower, especially when there are many features and the number of neighbors to find is high.