Linear Regression


Notebook producing the figures in the Linear Regression chapter of Linear Regression to Neural Networks.

For all source files, see the GitHub repository.

Dataset

First, we need some data. A simple dataset that still has enough complexity for our purposes is the Penguin dataset from Antarctica LTER (Gorman et al. 2014). Some scientists working in Antarctica have measured the flipper length, body mass, nose depth and more of some penguins living there. The dataset includes three species.
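
Below is a minimal sketch of how such a dataset could be loaded. It uses the copy bundled with seaborn, which is an assumption: the notebook may load the data from another source.

```python
# Sketch: load the Palmer Penguins dataset via seaborn's bundled copy and
# drop rows with missing values.
import seaborn as sns

penguins = sns.load_dataset("penguins").dropna()
print(penguins[["species", "island", "sex", "bill_length_mm",
                "bill_depth_mm", "flipper_length_mm", "body_mass_g"]].head())
```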

We can observe there are 4 quantitative columns describing the penguins. The penguin sex, island and species are categorical columns. The bill is the penguin's snout: we have variables bill_depth_mm and bill_length_mm.

Figure: penguin bill dimensions. Artwork courtesy of @allison_horst.

The other quantitative variables are the flipper_length_mm (length of penguin flipper in millimeters) and the body_mass_g (penguin body mass in grams).

Also, before conducting any experiments, let's split the dataset into a training and a testing subset. Splitting the dataset into a training set and a held-out test set is a common practice (a simple form of cross-validation) that gives us a more reliable estimate of the estimator's generalization performance. We hold out 1/5th of the data for testing and use the rest for training.
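
A sketch of such a split, assuming scikit-learn's train_test_split and the flipper length and body mass columns used further on; the random seed is arbitrary.

```python
# Sketch: hold out 1/5th of the data for testing, keep the rest for training.
from sklearn.model_selection import train_test_split

train, test = train_test_split(penguins, test_size=0.2, random_state=42)
X_train = train[["flipper_length_mm"]].to_numpy()
y_train = train["body_mass_g"].to_numpy()
X_test = test[["flipper_length_mm"]].to_numpy()
y_test = test["body_mass_g"].to_numpy()
```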

Linear Regression

First, let's explore linear regression. Presumably, there is a relation between some of the quantitative variables and penguin body mass.

Body mass seems to correlate with flipper length: generally speaking, the longer a penguin's flippers, the heavier it is. Let's see whether we can predict penguin body mass from flipper length by modeling the relationship with linear regression: penguin flipper length will be the independent variable, whilst penguin body mass will be the dependent variable.

We can analytically solve Linear Regression by minimizing the Residual Sum-of-Squares cost function:

$RSS(\beta) = (y - X \beta)^T (y - X \beta)$

In which $X$ is our design matrix. Fitting by minimizing this loss is also referred to as "Ordinary Least Squares".

Ordinary Least Squares

First, we must build our design matrix. The design matrix contains all explanatory variables with which we would like to predict $y$. To incorporate the intercept, we include a bias column of ones in the design matrix. An alternative would be to first center the data so as to get rid of the intercept entirely. We define the design matrix as follows.
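
A minimal sketch of such a design matrix with a bias column, assuming the train/test arrays from the split above.

```python
# Sketch: prepend a column of ones so the intercept is absorbed into beta.
import numpy as np

def design_matrix(x):
    x = np.asarray(x).reshape(-1, 1)
    return np.hstack([np.ones((x.shape[0], 1)), x])

X = design_matrix(X_train)  # shape (n_samples, 2): [1, flipper_length_mm]
```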

To minimize the cost function, we differentiate RSS with respect to $\beta$ and set the derivative to zero. Assuming $X^T X$ is nonsingular, this gives the unique minimum:

$\hat{\beta} = (X^T X)^{-1} X^T y$

This yields the estimated least-squares coefficients given the training data. We can then predict by simply multiplying input data with the estimated coefficients: $\hat{y} = X \hat{\beta}$.
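
A sketch of this closed-form solution, assuming the design matrix X and targets y_train from above. Solving the normal equations with np.linalg.solve avoids forming the inverse explicitly, which is numerically preferable.

```python
# Sketch: least-squares estimate via the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_train)

# Predictions are then simply y_hat = X beta_hat.
y_pred_train = X @ beta_hat
```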

Let's see how well we did, quantitatively. We can do this by evaluating the Residual Sum-of-Squares of the regression predictions against our observations.

An easier measure to interpret is the mean of the squared errors over all samples, i.e. the MSE.

We will also evaluate the MSE for our testing data:
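
A sketch of both measures, reusing the design_matrix helper and beta_hat from the sketches above (hypothetical names, not necessarily those used in the notebook).

```python
# Sketch: Residual Sum-of-Squares and Mean Squared Error on train and test data.
def rss(y_true, y_pred):
    residuals = y_true - y_pred
    return residuals @ residuals

def mse(y_true, y_pred):
    return rss(y_true, y_pred) / len(y_true)

y_pred_test = design_matrix(X_test) @ beta_hat
print("train MSE:", mse(y_train, y_pred_train))
print("test MSE: ", mse(y_test, y_pred_test))
```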

Let's observe our fitted regression line onto the data:
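
One way to produce such a plot, sketched with matplotlib and the fitted coefficients from above.

```python
# Sketch: overlay the fitted regression line on the training observations.
import matplotlib.pyplot as plt

xs = np.linspace(X_train.min(), X_train.max(), 100)
ys = design_matrix(xs) @ beta_hat

plt.scatter(X_train.ravel(), y_train, alpha=0.5, label="observations")
plt.plot(xs, ys, color="red", label="fitted regression line")
plt.xlabel("flipper_length_mm")
plt.ylabel("body_mass_g")
plt.legend()
plt.show()
```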

Ridge Regression

Let's see whether Ridge Regression is any good for this dataset, and explore what several different values of regularization strength do to our result.
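
A sketch using scikit-learn's Ridge estimator; the regularization strengths are illustrative, not necessarily the values used in the notebook.

```python
# Sketch: fit Ridge Regression for several regularization strengths and
# compare the resulting test error.
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:>8}: test MSE = "
          f"{mean_squared_error(y_test, ridge.predict(X_test)):.1f}")
```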

Polynomial Regression

Let's also see what happens when we fit a more flexible model: polynomial regression. Does the added flexibility improve our results, or does it only cause overfitting?
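
A sketch of polynomial regression as a pipeline of a polynomial feature expansion followed by ordinary least squares; the degrees are illustrative.

```python
# Sketch: polynomial regression of increasing degree on the flipper length.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

for degree in [1, 2, 5, 10]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:>2}: test MSE = "
          f"{mean_squared_error(y_test, model.predict(X_test)):.1f}")
```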

High-dimensional data: Analytic versus SGD solution

The dataset we fit the regression on was pretty easy - perhaps a drastic oversimplification in comparison to the real world. In the real world, datasets might be of very high dimensionality: think of images, speech, or a biomedical dataset storing DNA sequences.

So, let's simulate a high-dimensional situation, in which the number of dimensions far exceeds the number of samples ($p \gg n$). We will simply copy the dataset columns and add some noise to make them different.
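
A sketch of one way to construct such data; the number of copies and the noise scale are assumptions.

```python
# Sketch: blow up the dimensionality by copying the feature column many times
# and adding Gaussian noise, so that p >> n.
import numpy as np

rng = np.random.default_rng(0)
n_copies = 5000

X_high = np.hstack([X_train + rng.normal(scale=1.0, size=X_train.shape)
                    for _ in range(n_copies)])
print(X_high.shape)  # (n_samples, n_copies)
```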

We will write a little function to benchmark an estimator. Let's see how long our regular analytic fit takes.
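
A hypothetical benchmark helper, sketched with scikit-learn's LinearRegression standing in for the analytic fit.

```python
# Sketch: time a single fit of an estimator.
import time

from sklearn.linear_model import LinearRegression

def benchmark(estimator, X, y):
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start

print(f"analytic fit: {benchmark(LinearRegression(), X_high, y_train):.2f} s")
```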

Ouch. That took quite a while. Computing the inverse of a high-dimensional matrix is a very slow operation. Let's compare against the lower-dimensional dataset by running each fit 10 times:

A clear difference. But we can go faster by using a more approximate, iterative procedure: Gradient Descent. Gradient Descent works by computing the gradient of the cost function with respect to the model weights, so that we can move in the opposite direction of the gradient in parameter space. We can again run a small experiment to benchmark the estimator runtime on the high-dimensional dataset:
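
A sketch using scikit-learn's SGDRegressor together with the benchmark helper above; the hyperparameters are illustrative, and the inputs are standardized first since SGD is sensitive to feature scale.

```python
# Sketch: iterative fit with Stochastic Gradient Descent on the same data.
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sgd = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
print(f"SGD fit: {benchmark(sgd, X_high, y_train):.2f} s")
```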

Let's run the former benchmark multiple times to better compare the two approaches.

We can observe SGD is much faster for such a high-dimensional dataset. Let's compare both in a plot.

We can extend our benchmark even further and vary the number of noise dimensions added. We'll conduct 10 benchmarks for each dimensionality configuration.
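
A compact sketch of such an experiment, reusing the helpers from the sketches above; the dimensionalities and the number of repetitions per configuration are illustrative.

```python
# Sketch: benchmark both estimators while varying the number of noise dimensions.
results = []
for n_dims in [100, 500, 1000, 2000, 5000]:
    X_p = np.hstack([X_train + rng.normal(size=X_train.shape)
                     for _ in range(n_dims)])
    analytic = np.mean([benchmark(LinearRegression(), X_p, y_train)
                        for _ in range(10)])
    iterative = np.mean([benchmark(make_pipeline(StandardScaler(), SGDRegressor()),
                                   X_p, y_train) for _ in range(10)])
    results.append((n_dims, analytic, iterative))
```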

Interesting results. Increasing the number of dimensions has a drastic effect on the fitting time of the analytic solution, which is expected: the matrix inversion it relies on scales roughly cubically with the matrix dimension. Actually, this benchmark has been run on many more intermediate steps, on a more powerful computer. Let's check out those results.

Our analytic solution shows a clear, steeply super-linear trend.

Indeed, this is a trade-off between the time complexity and the accuracy of our solutions, as we see more often in the Machine Learning field. Given the dataset at hand and the time requirements of the solution, an ML practitioner has to choose between the two.

Author

Code written by Jeroen Overschie. MIT licensed.