ML – Python (VII) – Overfitting & Underfitting

In the previous article, we mentioned overfitting and underfitting but did not go into detail about them. Let's take a deeper dive into both concepts.

When we work with a data set to solve a prediction or classification problem, we try to achieve our goals by implementing a model using the training data and testing it with the testing data. We can make adjustments based on the features we are using or on the model itself.

By modifying the model we can end up with one that is too simple or one that is too complex. This is where the concepts of overfitting and underfitting come in.

Underfitting

As we can see in the image, the underfitting concept refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model, and this will be obvious because it will perform poorly even on the training data.

It happens when we do not have enough data to build a precise model or when we try to fit a linear model to non-linear data.

There are a few techniques we can try to prevent underfitting:

  • Add more features: sometimes the model underfits because the available features are insufficient to capture the underlying pattern, so adding new, informative features can help.
  • Add polynomial features, which are commonly used in machine learning pipelines. For example, a linear model becomes more expressive when quadratic or cubic terms are added (see the sketch after this list).
  • Reduce the regularization parameter. The purpose of regularization is to prevent overfitting, but if the model is underfitting, we have to decrease the regularization strength.
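
As a small illustration of the polynomial-features point, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The data is synthetic (a cubic curve plus noise) and only meant to show how adding polynomial terms lets a linear model fit a pattern that a plain straight line underfits.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y follows a cubic curve plus a little noise
rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(scale=2, size=100)

# A plain linear model underfits this data
linear = LinearRegression().fit(X, y)
print("Linear model R^2:", linear.score(X, y))

# Adding polynomial features makes the same linear model expressive enough
poly = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                     LinearRegression()).fit(X, y)
print("Cubic model R^2:", poly.score(X, y))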

Overfitting

On the opposite side, the overfitting concept refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.

Overfitting is more probable in non-parametric and non-linear models.

There are a few techniques we can try to prevent overfitting:

  • Cross-validation: it uses our initial training data to generate multiple mini train-test splits and uses these splits to tune our model. Cross-validation lets us tune hyperparameters with only the original training set, keeping the test set as truly unseen data for selecting the final model (a minimal sketch follows this list).
  • Train with more data: Training with more data can help algorithms detect the signal better but, if we just add more noisy data, this technique won’t help. That’s why we should always ensure our data is clean and relevant.
  • Remove features: we can improve an algorithm's ability to generalize by manually removing irrelevant input features. A good rule of thumb is that if a feature does not make sense, or is hard to justify, it is a good candidate for removal.
  • Early stopping: When training an algorithm, we can measure how well each iteration of the model performs. Up until a certain number of iterations, new iterations improve the model. After that point, the model’s ability to generalize can weaken as it begins to overfit the training data. Early stopping refers to stopping the training process before the learner passes that point.
  • Regularization: Regularization refers to a broad range of techniques for artificially forcing our model to be simpler.
  • Ensembling: Ensembles are machine learning methods for combining predictions from multiple separate models.
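
As a small illustration of the cross-validation point, here is a minimal sketch, assuming scikit-learn is installed; the Iris data set and the decision tree estimator are just placeholders for whatever model we are tuning.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the training data is split into 5 mini train-test
# folds and the model is scored on each of them
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())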

The good one

Finally, the middle graph shows a pretty good prediction line. It covers the majority of the points in the graph and also keeps the balance between bias and variance.

That is all for today.

ML – Python (VI) – Bias & Variance

Now, it is time to start digging into the theory of Machine Learning.

In the machine learning world, precision is everything. When we develop a model, we try to make it as accurate as possible by playing with the different parameters. But the hard truth is that we can never build a one-hundred-per-cent accurate model, because we cannot build a model that is free of errors. What we can do is try to understand the possible sources of error, and this will help us obtain a more precise model.

Types of errors

When we are talking about errors, we can find reducible and irreducible errors.

Irreducible errors are errors that cannot be reduced no matter what algorithm you apply. They are usually known as noise and they can appear in our models due to multiple factors such as an unknown variable, incomplete features or a wrongly defined problem. It is important to mention that, no matter how good our model is, our data will always have some noise component or irreducible error we can never remove.

Reducible errors have two components, bias and variance. These errors derive from the algorithm selection, and the presence of bias or variance causes overfitting or underfitting of the data.

Bias

Bias error is the difference between the expected prediction of our model and the real values or, put differently, how far the predicted values are from the actual values. High bias, where predicted values are far off from the actual values, causes the algorithm to miss the relevant relationships between the input and output variables. When a model has high bias, it implies that the model is too simple and does not capture the complexity of the data, thus underfitting it. An example is trying to fit a linear regression to a data set that has a non-linear pattern.

High bias implies that the model is too simple and does not capture the complexity of the data, thus underfitting it. Examples are linear regression, logistic regression and linear discriminant analysis.

Low bias implies the opposite and offers more flexibility. Examples are decision trees, k-nearest neighbours (KNN) and support vector machines.

Variance

It refers to the differences in the estimate of the function when using different training data or, put differently, it tells us how scattered the predicted values are around the actual values. Variance problems occur when the model performs well on the training data set but does not do well on data it was not trained on. Ideally, the result should not change too much from one data set to another.

High variance causes overfitting, which implies that the algorithm models random noise present in the training data, or that the algorithm is strongly dependent on the input data. It means big changes in the estimate of the function when the data changes. Examples are decision trees, k-nearest neighbours (KNN) and support vector machines.

Low variance means small changes in the estimate of the function when the data changes. Examples are linear regression, linear discriminant analysis and logistic regression.

Bias–variance tradeoff

The objective of any machine learning algorithm is to achieve low bias and low variance while at the same time performing well at predicting results. The bias-variance dilemma, or bias-variance problem, is the conflict in trying to simultaneously minimize these two sources of error, which prevents supervised learning algorithms from generalizing beyond their training set. Bias versus variance can be read as accuracy versus consistency of the trained models. Considering the combinations, we can have:

  • High Bias, Low Variance: models are consistent but inaccurate on average. They tend to be less complex, with a simple or rigid structure, like linear regression or Bayesian linear regression.
  • Low Bias, High Variance: models are somewhat accurate but inconsistent on average. They tend to be more complex, with a flexible structure, like decision trees or k-nearest neighbours (KNN).
  • High Bias, High Variance: models are inaccurate and also inconsistent on average.
  • Low Bias, Low Variance: this is the unicorn.

To build a good model we need to find a balance between bias and variance that helps us minimise the total error. This is why understanding bias and variance is fundamental to understanding the model's behaviour.

Detecting high bias or high variance

High Bias can be identified when we have:

  • High training error.
  • The validation or test error is close to the training error.

High Variance can be identified when we have (both symptoms are illustrated in the sketch after these lists):

  • Low training error.
  • High validation error or high test error.
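
To make these symptoms concrete, here is a minimal sketch, assuming scikit-learn is available; the synthetic data and the two pipelines (a degree-1 and a degree-15 polynomial regression) are only there to show a high-bias and a high-variance case side by side.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, model.predict(X_train))
    val_error = mean_squared_error(y_val, model.predict(X_val))
    # degree 1: high training and validation error -> high bias
    # degree 15: low training error, higher validation error -> high variance
    print(f"degree {degree}: train {train_error:.3f}, validation {val_error:.3f}")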

Fixing it

High bias is due to a model that is too simple, and we also see a high training error. To fix it we can do the following things:

  • Add more input features.
  • Add more complexity by introducing polynomial features.
  • Decrease the regularization term.

High variance is due to a model that tries to fit most of the points in the training data set and hence becomes more complex. To resolve the high variance issue we can:

  • Get more training data.
  • Reduce the number of input features.
  • Increase the regularization term.

That is all for today. I hope the first theory article was not too hard to read. I will try to keep them short and as concise as possible.

ML – Python (V) – scikit-learn

Finally, the last library we are going to see for now that will help us with our machine learning programs is going to be scikit-learn. This is probably one of the most useful libraries for Machine Learning in Python. It is an open-source library and it brings us a range of supervised and unsupervised learning algorithms.

This library builds on the following libraries and packages:

  • NumPy: N-dimensional array library.
  • pandas: Data structures and analysis.
  • SciPy: Essential library for scientific computing.
  • Matplotlib: 2D data representation.
  • IPython: Enhanced interactive console.
  • SymPy: Symbolic mathematics.

Considering its breadth and the fact that it builds on some of the libraries we have already explored, this article is going to be very short and without code examples. Basically, we are going to see a short list of basic features or benefits the library offers us:

  • Supervised learning algorithms: it brings us a variety of supervised algorithms.
  • Cross-validation: the library provides functions to implement several methods for checking the accuracy of our models.
  • Unsupervised learning algorithms: it brings us a variety of unsupervised algorithms.
  • Data sets: a miscellaneous collection of data sets.
  • Feature extraction and selection: very useful for extracting features from images and text, and it can help us identify significant attributes.
  • Community: it has an active community behind it improving the library.

That’s all. Quick and simple. Let’s save some energy for the next article where we are going to start digging a little bit on Machine Learning theory.

ML – Python (IV) – Matplotlib

Continuing with the useful libraries we can find in the Python ecosystem, we have Matplotlib. It is a 2D graphics library that will help us present our data.

With Matplotlib we can use both native Python and NumPy data structures but, in general, the NumPy data structures are recommended.

Like the previous libraries we saw, Matplotlib does not come with the default Python installation and we need to install it on our system.

Installing Matplotlib

The installation is as simple as executing a command:

pip install -U matplotlib

After that, we will be able to draw some nice plots. As an example we can draw a basic one:

import matplotlib.pyplot as plt

a = [1, 2, 3, 4]
b = [11, 22, 33, 44]

plt.plot(a, b, color='blue', linewidth=3, label='line')
plt.legend()
plt.show()

You can find the code example here.

The result should be something like:

Matplotlib basic example

Details about the result view

The resulting view (see the picture above) can contain different elements:

  • The main object is the figure (the window or page); it is the top-level object for the rest of the elements.
  • You can create multiple independent figures.
  • Figures can have subtitles, legends and colour bars, among others.
  • We can create plotting areas (axes) within a figure. They are where the data is represented with methods like plot() or scatter(), and they can have associated labels.
  • Every plotting area has an X axis and a Y axis representing numerical values. They have a scale, a title and labels, among others.

Matplotlib package structure

  • Matplotlib: The whole Python data visualization package.
  • Pyplot: a module of the Matplotlib package. It provides an interface to create figures and axes.
  • Pylab: a module of the Matplotlib package that combines pyplot with NumPy in a single namespace to work with arrays. Its use is no longer recommended with modern IDEs and kernels.

Most common plot types

The most common plot types we can find are line plots, scatter plots, bar charts, histograms and pie charts; a few of them are sketched below.

You can see more examples of available plots here.
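
As a small illustration, here is a minimal sketch, assuming Matplotlib and NumPy are installed, drawing three of these common plot types side by side; the data is random and only there to show the calls.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Scatter plot of random points
ax1.scatter(rng.rand(50), rng.rand(50), color='blue')
ax1.set_title('scatter')

# Bar chart of a few categories
ax2.bar(['a', 'b', 'c', 'd'], [3, 7, 1, 5], color='green')
ax2.set_title('bar')

# Histogram of normally distributed values
ax3.hist(rng.normal(size=1000), bins=30, color='orange')
ax3.set_title('histogram')

plt.show()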

With this, we finish a short overview of Matplotlib and the main plots it can offer us. It is a very handy way to draw them easily.

ML – Python (III) – pandas

Another library in the Python ecosystem is pandas (PANel DAta). This library can help us to execute five common steps in data analysis:

  • Load data.
  • Data preparation.
  • Data manipulation.
  • Data modelling.
  • Data analysis.

The main pandas structure is the DataFrame: a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes. It is composed of three elements: the data, the index and the columns. In addition, the names of the columns and the index can be specified.

Main library characteristics

  • The DataFrame object is fast and efficient.
  • Tools to load data in memory from different formats.
  • Data alignment and missing data management.
  • Reshaping and pivoting data sets.
  • Labelling, slicing and indexing of large amounts of data.
  • Columns can be removed or inserted.
  • Data grouping for aggregation and transformation.
  • High-performance joining and merging of data.
  • Time series functionality.
  • It has three main structures:
    • Series: 1D structures.
    • DataFrame: 2D structures.
    • Panel: 3D structures.

Installing pandas

The pandas library is not present in the default Python installation and it needs to be installed:

pip install -U pandas

pandas useful methods

Creating a Series

import pandas as pd

series = pd.Series({"UK": "London",
                    "Germany": "Berlin",
                    "France": "Paris",
                    "Spain": "Madrid"})

Creating a DataFrame

data = np.array([['', 'Col1', 'Col2'], ['Row1', 11, 22], ['Row2', 33, 44]])
df = pd.DataFrame(data=data[1:, 1:], index=data[1:, 0], columns=data[0, 1:])

You can find the code example here.

Without the boilerplate code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))

Exploring a DataFrame

  • df.shape – DataFrame shape.
  • len(df.index) – DataFrame height (number of rows).
  • df.describe() – DataFrame numeric statistics (count, mean, std, min, 25%, 50%, 75%, max).
  • df.mean() – Return the mean of the values for the requested axis.
  • df.corr() – Correlation of columns.
  • df.count() – Count of non-null values per column.
  • df.max() – Maximum value per column.
  • df.min() – Minimum value per column.
  • df.median() – Median value per column.
  • df.std() – Standard deviation per column.
  • df[[0]] – Select a DataFrame column (returned as a new DataFrame).
  • df[[1, 2]] – Select two DataFrame columns (returned as a new DataFrame).
  • df.iloc[0][2] – Select a value.
  • df.loc[0] – Select a row by its index label.
  • df.iloc[0, :] – Select a row by its position.
  • pd.read_<file_type>() – Read from a file (e.g. pd.read_csv('train.csv')).
  • df.to_<file_type>() – Write to a file (e.g. df.to_csv('new_train.csv')).
  • df.isnull() – Check whether there are null values in the data set.
  • df.isnull().sum() – Return the sum of null values per column in the data set.
  • df.dropna() or df.dropna(axis = 1) – Remove rows or columns with missing data.
  • df.fillna(x) – Replace missing values with x (e.g. df.fillna(df.mean())). Several of these methods are exercised in the short sketch after this list.
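
A minimal sketch, assuming pandas and NumPy are installed; the small DataFrame is made up only to show a few of the calls listed above.

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 32, np.nan, 41],
                   'salary': [30000, 45000, 38000, np.nan]})

print(df.shape)              # (4, 2)
print(df.describe())         # count, mean, std, min, quartiles, max per column
print(df.isnull().sum())     # number of missing values per column
print(df.fillna(df.mean()))  # replace missing values with the column mean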

And this is all. This has been just a quick, very quick, review of the pandas library. I recommend you play around with it a bit more; we will use it again in the future.

ML – Python (II) – NumPy

As I have said before, one of the biggest advantages of Python is the huge community and the amount of resources that support it. One of these resources is the NumPy (NUMerical PYthon) library.

It is one of the main libraries supporting scientific work in Python. It brings powerful data structures, implementing arrays and multidimensional matrices.

As a short example, we can see how to create a one-dimensional structure and a two-dimensional structure:

import numpy as np

a = np.array([1, 2, 3])
...
b = np.array([(1, 2, 3), (4, 5, 6)])
...

You can find the code example here.

But, why should we use NumPy structures instead of Python structures?

There are a couple of main reasons:

  • NumPy arrays consume less memory than Python lists.
  • Operations on NumPy arrays are faster to execute.

But you do not need to take my word for it; let's play a little bit with the code and run some informal benchmarks.

Let’s start with the memory assumption:

import sys
import numpy as np

s = range(1000)
print(sys.getsizeof(5) * len(s))
...
d = np.arange(1000)
print(d.size * d.itemsize)

You can find the code example here.

This gives us the following result:

Python list: 
28000
NumPy array: 
8000

As we can see, there is a big difference in memory consumption.

Now, let’s do the same for the execution time. Again, we are going to write a small code snippet and execute an informal benchmark:

import time
import numpy as np

SIZE = 1_000_000

L1 = range(SIZE)
L2 = range(SIZE)
A1 = np.arange(SIZE)
A2 = np.arange(SIZE)

start = time.time()
result = [x + y for x, y in zip(L1, L2)]
print((time.time() - start) * 1000)
...
start = time.time()
result = A1 + A2
print((time.time() - start) * 1000)

You can find the code example here.

This gives us the following result:

Python list: 
316.49184226989746
NumPy array: 
65.60492515563965

Again, as we can see, the execution time for the NumPy structures is much better.

In addition to the speed and memory improvements, it is worth pointing out the difference in syntax between Python and NumPy when writing the addition operation:

  • Python: [x + y for x, y in zip(L1, L2)]
  • NumPy: A1 + A2

As we can see, the difference is quite big. The second case, even if you know nothing about Python or NumPy, is very easy to understand.

Quick review of the NumPy API

  • Creating matrices
    • import numpy as np – Import the NumPy dependency.
    • np.array() – Creates a matrix.
    • np.ones((3, 4)) – Creates a matrix with a one in every position.
    • np.zeros((3, 4)) – Creates a matrix with a zero in every position.
    • np.random.random((3, 4)) – Creates a matrix with random values in every position.
    • np.empty((3, 4)) – Creates a matrix without initializing its entries.
    • np.full((3, 4), 8) – Creates a matrix with a specified value in every position.
    • np.arange(0, 30, 5) – Creates a matrix with a distribution of values (from 0 to 30 every 5).
    • np.linspace(0, 2, 5) – Creates a matrix with a distribution of values (5 elements from 0 to 2).
    • np.eye(4, 4) – Creates an identity matrix.
    • np.identity(4) – Creates an identity matrix.
  • Inspect matrices
    • a.ndim – Matrix dimension.
    • a.dtype – Matrix data type.
    • a.size – Matrix size.
    • a.shape – Matrix shape.
    • a.reshape(3, 2) – Change the shape of a matrix.
    • a[3, 2] – Select a single element of the matrix.
    • a[0:, 2] – Extract the values in column 2 for every row.
    • a.min(), a.max() and a.sum() – Basic operations over the matrix.
    • np.sqrt(a) – Square root of the matrix.
    • np.std(a) – Standard deviation of the matrix.
    • a + b, a - b, a * b and a / b – Basic element-wise operations between matrices. Several of these calls are exercised in the short sketch after this list.
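
A minimal sketch, assuming NumPy is installed, exercising a few of the calls listed above; the shapes and values are arbitrary.

import numpy as np

a = np.arange(0, 30, 5).reshape(3, 2)   # matrix with values 0, 5, 10, 15, 20, 25
b = np.full((3, 2), 8)                  # matrix filled with the value 8

print(a.shape, a.ndim, a.dtype)  # (3, 2) 2 and the integer dtype
print(a + b)                     # element-wise addition
print(a * b)                     # element-wise multiplication
print(np.sqrt(a))                # element-wise square root
print(a.min(), a.max(), a.sum(), np.std(a))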

And this is all. This has been just a quick, very quick, review of the NumPy library. I recommend you play around with it a bit more; we will use it again in the future.

ML – Python (I) – Introduction

We have talked about Machine Learning here on the blog from time to time. The purpose of this series of articles is to go a little bit further and explore the Machine Learning space and its relation with Python in more depth.

All the information in a more technical shape and the small scripts can be found at my GitHub account under the project python-ml.

One of the questions worth discussing is: why Python?

Available languages for Machine Learning

It is clear that you can use a lot of different languages to implement Machine Learning algorithms and programs but, looking at the space and its popularity, you can easily see a tendency and preference for four of them.

  • Python
    • It is the leader of the race right now due to its simplicity and gentle learning curve.
    • It is especially good and successful for beginners, in both programming and Machine Learning.
    • The libraries ecosystem and community support are huge.
  • R
    • It is designed for statistical analysis and visualization, and it is frequently used to uncover patterns in large blocks of data.
    • With RStudio, developers can easily build algorithms and statistical visualizations.
    • It is a free alternative to more expensive software like Matlab.
  • Matlab
    • It is fast, stable and secure for complex mathematics.
    • It is considered a hardcore language for mathematicians and scientists.
  • Julia
    • Designed to deal with numerical analysis needs and computational science.
    • The base Julia library was integrated with C and Fortran open-source libraries.
    • The collaboration between the Jupyter and Julia communities gives Julia a powerful UI.

Some important metrics to consider when choosing a language should be:

  • Speed.
  • Learning curve.
  • Cost.
  • Community support.
  • Productivity.

Here we can classify our languages as follows:

  • Speed: R is basically a statistical language and it is difficult to beat in this context.
  • Learning curve: this depends on the person's background. R is closer to functional languages, as opposed to Python, which is closer to object-oriented languages.
  • Cost: Only Matlab is not a free language. The other languages are open source.
  • Community: All of them are very popular but, Python has a bigger community and amount of resources available.
  • Productivity: R for statistical analysis, Matlab for computer vision, bioinformatics and biology are the playground of Julia, and Python is the king for general tasks and multiple uses.

The decision, at the end of the day, is about a balance between all the characteristics seen above, our skills, and the field we are in or the tasks we want to implement.

In my case, I am going to choose Python, as probably all of you have assumed, because it is like a Swiss Army knife and, at this point, the beginning, I think this is important. There is always time later to focus on other things or to reduce the scope.

IDEs

There are multiple IDEs that support Python. As a very widespread language, there are multiple tools and environments we can use. Just take the one you like the most.

If you do not know any IDE or platform, there are two of them that a lot of Data Scientists use:

I do not know them. As a developer, I am more familiar with Visual Studio Code or IntelliJ, and I will probably be using one of them unless I discover some exciting functionality or advantage in one of the others.

The Machine Learning process

To build our machine learning system, regardless of the field where we want to apply it, we need to follow a similar process or set of steps. Every step is important, and the quality we achieve in each of them will affect the quality of the whole system at the end.

Depending on the literature we check, these steps receive one name or another. The list I present here is just one way of describing the process but, I hope, with the short descriptions, we will be able to match the steps with any other version out there.

Understanding the problem

The first thing we need to define is what we want to achieve when implementing our machine learning system. We should define the problem we want to solve and the final objective.

It is not just the definition of these two points; we should add context too: what resources we have, what costs and benefits the project is going to have, and the kind of criteria we should evaluate every time we start a new project.

Understanding the data

By now, if we are starting a machine learning project, we should know that data is one of the most important things we need, if not the most important. This step includes two points around data:

  • Gathering data: we need to identify our data sources or, if they do not exist, decide how we are going to generate the data. We need to define how we are going to collect and store the data, which usually involves writing some kind of code. And we need to define how we are going to integrate the data, especially if we are gathering it from multiple sources.
  • Exploring data: we need to do a preliminary exam of the data, decide what data we are going to use, note whether something already calls our attention, and check that the data is going to allow us to keep progressing. For example, if we are building a classification system but we are missing data for one or more of the classes, or the data for one of them is not enough, we should realise it here and try to solve it.

Pre-processing the data

After collecting the data we need to normalise it so that it can be processed and we can aim for optimum results. Removing null values or finding a common scale for numeric values are two common tasks applied here. Another important task we should perform here is anonymizing the data to comply with any data protection legislation.

Extracting the characteristics

This is one of the most important steps of the process. We should bring in some domain experts (if we do not have them on our team) to help us define the characteristics we are going to use in our model and that are going to help us solve the problem. We then need to identify these characteristics in our data. For example, for a property valuation model, things like the number of bedrooms, the location, the size or how old the property is. All these characteristics will help us solve our problem.

Selecting the characteristics

Once we have extracted a list of characteristics, we need to find a balance between them and the cost. Computational resources are expensive: the more characteristics we try to process, the more expensive processing becomes and the worse the performance we achieve. The challenge here is to select as few characteristics as possible without affecting the final result, or affecting it as little as possible. Ideal candidates for removal are irrelevant, redundant or correlated characteristics.

There are three main types of algorithms to select characteristics:

  • Wrappers: they are tied to the algorithm we are going to use and evaluate the benefit of including new characteristics. Their downside is that they consume a large amount of resources.
  • Filters: they are independent of the algorithm we are going to use and rely on mathematical and statistical techniques to help us select characteristics. They consume fewer resources than the previous category.
  • Hybrids: they are built during the training step, which is why they are a mix of both previous categories.

Training the algorithm

This is the step where the algorithm starts learning and the model is built. During training, the algorithm learns the model. We need to decide what approach we are going to take (classification, regression, …) and, depending on this, choose the modelling technique we are going to use or, maybe, try several of them. Finally, we build our model and make any necessary tweaks.

Evaluating the algorithm

Once the model is finished and our algorithm is ready, we need to evaluate how accurate it is. This is usually done with some data we have reserved and never used to train the algorithm. With this data, we check whether the predictions made by the algorithm are good enough.
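
A minimal sketch of the training and evaluation steps, assuming scikit-learn is installed; the Iris data set and the logistic regression estimator are just placeholders for whatever problem and modelling technique we have chosen.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Reserve part of the data; it is never used during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Training the algorithm
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluating the algorithm on the reserved data
print("Accuracy on unseen data:", model.score(X_test, y_test))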

Analyzing the results

Now that we have results, we need to check them against the success criteria defined at the beginning. Here we should look not just at accuracy; we should also check whether the solution stays within our constraints: performance, costs, …

If everything aligns, we proceed to the next step; if it does not, we need to review the process and the decisions taken, based on the results we have.

Deploying the system

This is the broadest step. Communicating results, generating reports, generating documentation, making the system available to the required users, everything else we can think of around these areas, and any other procedural or administrative task needed to be able to add the seal of "done" to our system.

As I have said before, you will probably find these steps under different names in other books, articles and publications, but I hope that, with these little explanations, we will be able to match and reference them.

See you.

AI (I): Machine learning

Machine learning provides the foundation for artificial intelligence. So, what is it?

Machine learning is a technique in which we train a software model using data. The model learns from the training cases and then, we can use the trained model to make predictions for new data cases. To have a computer make intelligent predictions from the data, we just need a way to train it to perform the correct calculations.

We usually start with a data set that contains historical records, often called cases or observations. Each observation includes numeric features that quantify a characteristic of the item we are working with. We can call it ‘X’. In addition, we also have some value that we are trying to predict, we can call it ‘Y’. The purpose is to use our training cases to train a machine learning model so it can calculate a value for ‘Y’ from the features in ‘X’. As a simplification, we are creating a function that operates on a set of features ‘X’, to produce predictions ‘Y’.

Generally speaking, there are two broad kinds of machine learning, supervised and unsupervised.

In supervised learning scenarios, we start with observations that include known values for the variable we want to predict; these known values are called labels. Because we already know the label we are trying to predict, the first thing we need to do is split our data. This way, we can train the model using half of the data and keep the rest to test the performance of our model. When we obtain the desired results and we are confident our model works, we can use it with new observations for which the label is unknown, and generate new predicted values.

Unsupervised learning is different from supervised learning, in that this time we do not have known label values in the training data set. We train the model by finding similarities between the observations. After the model is trained, each new observation is assigned to the cluster of observations with the most similar characteristics.
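 
A minimal sketch of the unsupervised case, assuming scikit-learn is installed; the k-means algorithm and the synthetic blobs are just one possible choice of clustering technique and data.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic observations with no labels
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Train the model by grouping similar observations into clusters
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(X)

# A new observation is assigned to the cluster with the most similar characteristics
print(model.predict([[0.0, 2.0]]))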

Machine learning branches

In machine learning we can find three main branches into which we can classify the algorithms:

  • Supervised learning.
  • Unsupervised learning.
  • Reinforcement learning.

Supervised learning

In supervised algorithms you know the input and the output that you need from your model. You do not know how the output is derived from the input data or what the inner relations in your data are, but you definitely know the output data.

As an example, we can take a magazine that has the subscription data of a certain number of current and former customers, let's say 100,000. The company behind the magazine knows that half of these customers (50,000) have cancelled their subscriptions and the other half (50,000) are still subscribed, and they want a model to predict which customers will cancel their subscriptions.

We know the input (the customers' subscription data) and the output (cancelled or not).

We can then build our training data set with the data of 90,000 customers, half of them cancelled and half of them still active, and train our system with this set. After that, we will try to predict the result for the other 10,000 we left out of the training data to check the accuracy of our model.

Unsupervised learning

In unsupervised learning algorithms you do not know what the output of your model is; you may know there is some kind of relation or correlation in your data but, maybe, the data is too complex to guess it.

In this kind of algorithm, you normalize your data so that it can be compared, and you wait for the model to find some of these relationships. One of the special characteristics of these models is that, while the model can suggest different ways to categorize or order your data, it is up to you to do further research on them to unveil something useful.

For example, we can have a company selling a huge number of products that wants to improve its system for targeting customers with useful advertising campaigns. We can give our algorithm the customer data, and the algorithm can suggest some relations: age range, location, …

Reinforcement learning

In reinforcement learning, algorithms do not receive the reward for their actions immediately; they need to accumulate a number of consecutive decisions to know whether those actions and decisions are correct or not. In this scenario there is no supervisor, the feedback about the decisions is delayed, and the agent's actions affect the subsequent data it receives.

One example of this can be the game of chess, where the algorithm is going to be making decisions but, until the end of the game, it is not going to know whether those decisions were correct or not and, obviously, previous decisions affect subsequent ones.
