To build a machine learning system, regardless of the field where we want to apply it, we need to follow a similar set of steps. Every step matters, and the quality we achieve in each one will affect the quality of the whole system at the end.
Depending on the literature you check, these steps receive one name or another. The list I present here is just one way of describing the process but, with the short descriptions, I hope you will be able to match it with any other version out there.
Understanding the problem
The first thing we need to define is what we want to achieve when implementing our machine learning system: what problem we want to solve and what the objective is at the end.
It is not just the definition of these two points; we should add context too: what resources we have, what costs and benefits the project is going to have, and the kind of criteria we should evaluate every time we start a new project.
Understanding the data
By now, if we are starting a machine learning project, we should know that data is one of the most important things we need, if not the most important one. This step includes two points around data:
- Gathering data: We need to identify our data sources or, if they do not exist, decide how we are going to generate the data. We need to define how we are going to collect and store the data, which usually involves writing some kind of code. And we need to define how we are going to integrate the data, especially if we are gathering it from multiple sources.
- Exploring data: We need to do a preliminary examination of the data, decide what data we are going to use, and see whether anything already calls our attention and whether the data is going to allow us to keep progressing. For example, if we are building a classification system but we are missing data for one or more of the classes, or there is not enough data for one of them, we should realise it here and try to solve it (a small exploration sketch follows this list).
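Here is a minimal sketch of what that preliminary exploration could look like with pandas. The file name "properties.csv" and the "label" column are assumptions for illustration, not part of any concrete project.

```python
# A minimal data exploration sketch, assuming a hypothetical "properties.csv"
# file with a "label" column that holds the class of each row.
import pandas as pd

df = pd.read_csv("properties.csv")

print(df.shape)         # how many rows and columns we actually have
print(df.dtypes)        # the type of each column
print(df.isna().sum())  # missing values per column
print(df.describe())    # basic statistics for the numeric columns

# For a classification problem, check that every class has enough examples.
print(df["label"].value_counts())
```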
Pre-processing the data
After collecting the data we need to normalise it so we can process it and achieve optimal results. Removing null values and bringing numeric values to a common scale are two common tasks applied here. Another important task that should be performed here is anonymising the data to comply with any data protection legislation.
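As a small sketch of those two common tasks, assuming the same hypothetical "properties.csv" file, something like this would drop the rows with null values and bring the numeric columns to a common scale:

```python
# A pre-processing sketch: remove null values and scale numeric columns.
# The file name is an assumption used only for illustration.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("properties.csv")

# Remove rows that still contain null values after gathering the data.
df = df.dropna()

# Bring every numeric column to the same 0-1 scale.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])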
Extracting the characteristics
This is one of the most important steps of the process. We should bring in some experts (if we do not have them on our team) to help us define the characteristics (the features) that we are going to use in our model and that are going to help us solve the problem. Then we need to identify these characteristics in our data. For example, for a property valuation model, things like the number of bedrooms, the location, the size or how old the property is. All these characteristics will help us to solve our problem.
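Continuing with the property valuation example, this is a small sketch of how some of those characteristics could be derived from the raw data. The column names (year_built, postcode, size_m2, bedrooms) are assumptions.

```python
# A sketch of deriving characteristics for the property valuation example;
# all column names here are hypothetical.
from datetime import date

import pandas as pd

df = pd.read_csv("properties.csv")

# Derive the age of the property from the construction year.
df["age"] = date.today().year - df["year_built"]

# Encode the location so a model can consume it: one indicator column per postcode.
df = pd.get_dummies(df, columns=["postcode"])

# Size and number of bedrooms can usually be used as they come.
print(df[["age", "size_m2", "bedrooms"]].head())
```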
Selecting the characteristics
Once we have extracted a list of characteristics, we need to find a balance between them and the cost. Computational resources are expensive: the more characteristics we try to process, the more expensive it is going to be to process them and the worse performance we are going to get. The challenge here is to select as few characteristics as possible without affecting the final result, or affecting it as little as possible. Ideal candidates to be removed are irrelevant, redundant or correlated characteristics.
There are three main types of algorithms for selecting characteristics:
- Wrappers: They are tied to the algorithm we are going to use and evaluate how effective it is to include new characteristics. Their downside is that they consume a large amount of resources.
- Filters: They are independent of the algorithm we are going to use; they apply mathematical and statistical techniques to help us select characteristics. They consume fewer resources than the previous case (a small sketch of this approach follows the list).
- Hybrids: They are built into the training step, which is why they are a mix of both previous categories.
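As an illustration of the filter approach, here is a minimal sketch using scikit-learn's SelectKBest on synthetic data; the data and the choice of keeping 4 characteristics are assumptions for the example.

```python
# A filter-type selection sketch: keep the characteristics with the strongest
# statistical relation to the target, independently of the final algorithm.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                          # 10 candidate characteristics
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)    # only two of them really matter

# Keep the 4 characteristics with the best scores.
selector = SelectKBest(score_func=f_regression, k=4)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the kept characteristics
```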
Training the algorithm
This is the step where the algorithm starts learning and the model is built. We need to decide what approach we are going to take (classification, regression, …) and, depending on this, choose the modelling technique we are going to use or, maybe, try several of them. And, finally, build our model, applying any necessary tweaks along the way.
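A minimal training sketch, assuming we chose a regression approach and a linear model as the modelling technique, and using synthetic data in place of the real characteristics and prices:

```python
# Training sketch: fit a regression model to the selected characteristics.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                            # selected characteristics
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)     # known target values

model = LinearRegression()
model.fit(X, y)      # this is where the algorithm "learns" the model

print(model.coef_)   # the parameters estimated during training
```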
Evaluating the algorithm
Once the model is finished and our algorithm is ready, we need to evaluate how accurate it is. This is usually done with some data we have reserved and never used to train the algorithm. With this data, we check whether the predictions made by the algorithm are good enough.
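Continuing the same sketch, this is one way to reserve part of the data for evaluation only and measure the quality of the predictions; the 20% split and the mean absolute error metric are choices made for the example.

```python
# Evaluation sketch: hold out data the algorithm never sees during training.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)

# Reserve 20% of the data for evaluation only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Check the predictions against data the algorithm has never seen.
print(mean_absolute_error(y_test, model.predict(X_test)))
```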
Analyzing the results
Now that we have results, we need to check them against the success criteria defined initially. Here we should look not just at accuracy; we should also check that the solution stays within our restrictions: performance, costs, …
If everything aligns, we proceed to the next step; if it does not, we need to review the process and the decisions we took, based on the results we have.
Deploying the system
This is the widest step. Notify results, generate reports, write documentation, make the system available to the required users, everything else we can think of around these areas, plus any other procedural or administrative task needed to add the seal of done to our system.
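As one small technical piece of this step, a sketch of persisting the trained model so another script or service can load it and serve predictions; the file name and the use of joblib are assumptions, not a prescribed deployment method.

```python
# Deployment sketch: save the trained model and load it where it will be used.
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)

model = LinearRegression().fit(X, y)

joblib.dump(model, "valuation_model.joblib")    # persist the trained model
loaded = joblib.load("valuation_model.joblib")  # load it in the serving code
print(loaded.predict(X[:1]))
```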
As I said before, you will probably find these steps under different names in other books, articles and publications, but I hope that with these short explanations you will be able to match and reference them.
See you.