IA (III): Regression

Regression is one of the techniques we can find in the supervised learning paradigm.

Let’s suppose we have some historical data about the participants in a trial on the effects of alcohol, including the amount of alcohol each of them ingested before showing symptoms of drunkenness. In addition, we have some data about the participants themselves, such as weight and height.

Now we want to explore how we could use machine learning to predict how much alcohol a person can ingest before getting drunk.

When we need to predict a numeric value, like an amount of money, a temperature or, in this case, a number of milliliters, we use a supervised learning technique called regression.

Let’s take one of the participants in our study and look at the data that is interesting for us. To keep it simple, let’s take just age, weight, height and percentage of body fat.

What we want is to find a model that can calculate the amount of alcohol a person can drink before showing symptoms of drunkenness.

Age: 30. Weight: 90 kg. Height: 180 cm. Body fat: 23%. Alcohol: 125 ml.

ƒ([30, 90, 180, 23]) = 125

So we need our algorithm to learn the function that operates on all of the participant’s features to give us the amount of alcohol in milliliters.

Of course, a sample of only one person is not likely to give us a function that generalizes well. We need to gather the same sort of data from lots of diverse participants and train our model based on this larger set of data.

ƒ([X1, X2, X3, X4]) = Y

After we have trained the model, we have a generalized function that can be used to calculate our label Y. We can then plot the values of Y calculated for specific values of the features X on a chart, and we can interpolate any new values of X to predict an unknown Y.
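As a rough sketch of what "training" could look like here, the snippet below fits a linear function to the four features with ordinary least squares. The linear form, the synthetic participants and the TRUE_WEIGHTS values are all assumptions made up for illustration; they do not come from any real study.

```python
import random

def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_linear(rows, labels):
    """Least-squares fit of Y ≈ w0 + w1*X1 + ... + w4*X4 (normal equations)."""
    design = [[1.0] + list(r) for r in rows]  # the leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, labels)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(weights, row):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))

# Synthetic participants: [age, weight in kg, height in cm, body fat in %].
rng = random.Random(0)
TRUE_WEIGHTS = [20.0, -0.2, 0.9, 0.3, -0.8]  # invented, for illustration only
rows = [[rng.uniform(18, 70), rng.uniform(50, 120),
         rng.uniform(150, 200), rng.uniform(8, 40)] for _ in range(200)]
labels = [predict(TRUE_WEIGHTS, r) + rng.gauss(0, 3) for r in rows]

weights = fit_linear(rows, labels)
estimate = predict(weights, [30, 90, 180, 23])  # predicted ml for a new person
```

With enough participants the fitted weights land close to the invented ones, which is the sense in which the model "learns" ƒ from examples.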

We can use part of our study data to train the model and withhold the rest of the data for evaluating model performance.

Now we can use the model to predict ƒ(X) for the evaluation data, and compare the predictions, or scored labels, to the actual labels that we know to be true.

There can be differences between the predicted and actual labels. These differences are what we call the residuals, and they can tell us something about the level of error in the model.
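The split-then-compare procedure can be sketched as follows. To keep it short, this assumes a single feature (weight), a 70/30 split and made-up data; none of those choices come from the study described above.

```python
import random

rng = random.Random(1)
# Hypothetical study data: (weight in kg, tolerated alcohol in ml).
data = [(w, 40 + 0.9 * w + rng.gauss(0, 4))
        for w in (rng.uniform(50, 120) for _ in range(100))]

# Shuffle, then train on 70% and withhold 30% for evaluation.
rng.shuffle(data)
cut = int(0.7 * len(data))
train, test = data[:cut], data[cut:]

# Closed-form least squares for one feature: y ≈ intercept + slope * x.
mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

# Score the withheld rows and compare with the labels known to be true.
scores = [intercept + slope * x for x, _ in test]
residuals = [score - y for score, (_, y) in zip(scores, test)]
```

Each residual is score minus label for one withheld participant, so the list has one entry per evaluation row.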

There are a few ways we can measure the error in the model, and these include root-mean-square error, or RMSE, and mean absolute error, or MAE. Both of these are absolute measures of error in the model.

RMSE = √(1/n ∑ (score - label)^2)
MAE = 1/n ∑ abs(score - label)
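A small worked example may help here. It computes both measures as averages over the n evaluation rows; the four scores and labels are made-up numbers.

```python
import math

def rmse(scores, labels):
    """Root-mean-square error: square root of the mean squared residual."""
    n = len(labels)
    return math.sqrt(sum((s - y) ** 2 for s, y in zip(scores, labels)) / n)

def mae(scores, labels):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(labels)
    return sum(abs(s - y) for s, y in zip(scores, labels)) / n

labels = [125.0, 110.0, 140.0, 95.0]  # made-up actual values, in ml
scores = [120.0, 115.0, 150.0, 95.0]  # made-up predictions, in ml

print(round(rmse(scores, labels), 2))  # 6.12
print(mae(scores, labels))             # 5.0
```

The residuals are -5, 5, 10 and 0. RMSE comes out larger than MAE because squaring gives the single 10 ml miss more weight.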

For example, an RMSE value of 5 would mean that the standard deviation of the errors on our test data is roughly 5 milliliters.

The problem is that absolute values can vary wildly depending on what you are predicting. An error of 5 can be negligible in one model and a big difference in another. So we might want to evaluate the model using relative metrics, which indicate a more general level of error as a relative value, typically between 0 and 1.

Relative absolute error, or RAE, and relative squared error, or RSE, produce a metric where the closer to 0 the error, the better the model.

RAE = ∑ abs(score - label) / ∑ abs(label - mean(label))
RSE = ∑ (score - label)^2 / ∑ (label - mean(label))^2
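As an illustration, the sketch below uses the common definition in which both metrics are normalized by how far the labels deviate from their own mean. That is the same as comparing the model against a baseline that always predicts the mean label. The scores and labels are made-up numbers.

```python
def rae(scores, labels):
    """Relative absolute error against a predict-the-mean baseline."""
    mean_label = sum(labels) / len(labels)
    model_error = sum(abs(s - y) for s, y in zip(scores, labels))
    baseline_error = sum(abs(y - mean_label) for y in labels)
    return model_error / baseline_error

def rse(scores, labels):
    """Relative squared error against a predict-the-mean baseline."""
    mean_label = sum(labels) / len(labels)
    model_error = sum((s - y) ** 2 for s, y in zip(scores, labels))
    baseline_error = sum((y - mean_label) ** 2 for y in labels)
    return model_error / baseline_error

labels = [125.0, 110.0, 140.0, 95.0]  # made-up actual values, in ml
scores = [120.0, 115.0, 150.0, 95.0]  # made-up predictions, in ml

print(round(rae(scores, labels), 3))  # 0.333
print(round(rse(scores, labels), 3))  # 0.133
```

A value of 1 would mean the model is no better than always guessing the mean, and values above 1 are possible for a model that does worse than that.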

And the coefficient of determination, which we sometimes call R squared, is another relative metric, but this time a value closer to 1 indicates a good fit for the model.

CoD (R^2) = 1 - var(score - label) / var(label)
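A minimal sketch of this calculation, again on made-up numbers: it takes one minus the ratio of the residual variance to the label variance. For comparison it also computes the sum-of-squares form often seen elsewhere. The two agree when the residuals average to zero, which is not the case in this tiny example.

```python
import statistics

def r2_variance_form(scores, labels):
    """1 - var(score - label) / var(label), using population variances."""
    residuals = [s - y for s, y in zip(scores, labels)]
    return 1 - statistics.pvariance(residuals) / statistics.pvariance(labels)

def r2_sum_of_squares_form(scores, labels):
    """1 - SSE / SST, the form based on raw squared residuals."""
    mean_label = sum(labels) / len(labels)
    sse = sum((s - y) ** 2 for s, y in zip(scores, labels))
    sst = sum((y - mean_label) ** 2 for y in labels)
    return 1 - sse / sst

labels = [125.0, 110.0, 140.0, 95.0]  # made-up actual values, in ml
scores = [120.0, 115.0, 150.0, 95.0]  # made-up predictions, in ml

print(round(r2_variance_form(scores, labels), 3))        # 0.889
print(round(r2_sum_of_squares_form(scores, labels), 3))  # 0.867
```

The gap between the two numbers comes from the average residual of 2.5 ml, which the variance form ignores.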

[Updated: Correction in the second error chart, where there was a typo: MAE -> RSE. Thanks to Michael Mora for the comment]

Machine learning branches

In machine learning we can find three main branches into which we can classify the algorithms:

  • Supervised learning.
  • Unsupervised learning.
  • Reinforcement learning.

Supervised learning

In supervised algorithms you know the input and the output that you need from your model. You do not know how the output is derived from the input data or what the inner relations among your data are, but you definitely know the output data.

As an example, we can take a magazine publisher that has the subscription data of a certain number of current and former customers, let’s say 100,000 customers. The company in charge of the magazine knows that half of these customers (50,000) have cancelled their subscriptions and the other half (50,000) are still subscribed, and they want a model to predict which customers will cancel their subscriptions.

We know the input, the customers’ subscription data, and the output, cancelled or not.

We can then build our training data set with the data of 90,000 customers, half of them cancelled and half of them still active, and train our system with it. After that, we will try to predict the result for the other 10,000 customers we left out of the training data to check the accuracy of our model.
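The workflow above could be sketched like this. The two customer features, their distributions and the nearest-centroid classifier are assumptions chosen only to keep the example small; a real churn model would use the magazine's actual subscription data and probably a stronger algorithm.

```python
import random

rng = random.Random(42)

def make_customer(cancelled):
    """Invented features: months subscribed and issues read per year."""
    if cancelled:
        return [rng.gauss(10, 6), rng.gauss(4, 2)], 1
    return [rng.gauss(30, 8), rng.gauss(9, 2)], 0

# 50,000 cancelled and 50,000 active customers, as in the example.
cancelled = [make_customer(True) for _ in range(50_000)]
active = [make_customer(False) for _ in range(50_000)]

# Keep the 50/50 balance: 45,000 + 45,000 to train, the remaining 10,000 to test.
train = cancelled[:45_000] + active[:45_000]
test = cancelled[45_000:] + active[45_000:]

def centroid(rows):
    """Column-wise mean of a list of feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# "Training" a nearest-centroid classifier: one mean point per class.
centroids = {
    label: centroid([x for x, y in train if y == label]) for label in (0, 1)
}

def predict(x):
    """Return the label (1 = cancelled) whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Accuracy: fraction of withheld customers whose outcome was predicted correctly.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
```

Because the 10,000 test customers were never seen during training, the accuracy measured on them is an estimate of how the model would behave on new customers.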

Unsupervised learning

In unsupervised learning algorithms you do not know what the output of your model is. You may know there is some kind of relation or correlation in your data, but the data may be too complex to guess it.

In this kind of algorithm, you normalize your data so that it can be compared and you wait for the model to find some of these relationships. One of the special characteristics of these models is that, while the model can suggest different ways to categorize or order your data, it is up to you to do further research on these to unveil something useful.

For example, we can have a company selling a huge number of products that wants to improve its system to target customers with useful advertising campaigns. We can give our algorithm the customer data and it can suggest some relations: age range, location, …
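One possible sketch of this idea normalizes two customer attributes and groups them with k-means. The text above does not name k-means; it is simply one common clustering algorithm. The attributes, their values and the choice of two clusters are invented for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Invented customers: [age in years, yearly spend in euros], two loose groups.
customers = ([[rng.gauss(25, 3), rng.gauss(300, 50)] for _ in range(100)]
             + [[rng.gauss(55, 4), rng.gauss(1500, 200)] for _ in range(100)])

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stdevs = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(rows, k, iterations=20):
    """Return a cluster index for every row; no labels are used anywhere."""
    centers = rng.sample(rows, k)
    for _ in range(iterations):
        # Assign each row to its nearest center.
        assignment = [min(range(k), key=lambda i: sq_dist(r, centers[i]))
                      for r in rows]
        # Move each center to the mean of its rows (keep it if it has none).
        for i in range(k):
            members = [r for r, a in zip(rows, assignment) if a == i]
            if members:
                centers[i] = [statistics.fmean(c) for c in zip(*members)]
    return assignment

clusters = k_means(standardize(customers), k=2)
```

Without the normalization step, yearly spend (hundreds of euros) would dominate age (tens of years) in the distance calculation. The algorithm only returns group numbers; deciding what each group means for an advertising campaign is still up to you.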

Reinforcement learning

In reinforcement learning, the algorithm does not immediately receive the reward for its actions, and it needs to accumulate several consecutive decisions to know whether those actions or decisions are correct. In this scenario there is no supervisor, the feedback about a decision is delayed, and the agent’s actions affect the subsequent data it receives.

One example of this is the game of chess, where the algorithm makes decisions but, until the end of the game, it cannot know whether these decisions were correct and, obviously, previous decisions affect subsequent ones.
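To make the idea of delayed feedback concrete, here is a small sketch of one common device, the discounted return, which spreads a reward received at the end of a game back over the earlier moves. The discount factor gamma and the four-move game are assumptions for illustration. This is not a full reinforcement learning algorithm.

```python
def discounted_returns(rewards, gamma=0.5):
    """For each step, the discounted sum of rewards from that step onwards."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# A four-move game: no feedback until the final move, which wins (+1).
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards))  # [0.125, 0.25, 0.5, 1.0]
```

Every move ends up with some credit for the win, and moves closer to the end of the game get more of it.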
