The Machine Learning process

To build a machine learning system, regardless of the field where we want to apply it, we need to follow a similar set of steps. Every step is important, and the quality we achieve in each of them will affect the quality of the whole system at the end.

Depending on the literature we check, these steps receive one name or another. The list I present here is just one way of describing the process but, with the short descriptions, I hope we will be able to match the steps with any other version out there.

Understanding the problem

The first thing we need to define is what we want to achieve with our machine learning system: what problem we want to solve and what the final objective is.

It is not just about defining these two points; we should add context too: what resources we have, what costs and benefits the project is going to have, and what criteria we should use to evaluate it every time we start a new project.

Understanding the data

By now, if we are starting a machine learning project, we should know that data is one of the most important things we need, if not the most important one. This step includes two points around data:

  • Gathering data: We need to identify our data sources or, if they do not exist, how we are going to generate the data. We need to define how we are going to collect and store the data, which usually involves writing some kind of code. And we need to define how we are going to integrate the data, especially if we are gathering it from multiple sources.
  • Exploring data: We need to do a preliminary examination of the data, decide what data we are going to use, spot anything that already calls our attention and confirm the data is going to allow us to keep progressing. For example, if we are building a classification system but we are missing data for one or more of the classes, or there is not enough data for one of them, we should realise it here and try to solve it.

Pre-processing the data

After collecting the data, we need to normalise it so we can process it and achieve optimum results. Removing null values and finding a common scale for numeric values are two common tasks applied here. Another important task we should perform here is to anonymise the data to comply with any data protection legislation.
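As a minimal illustration of the scaling part, here is a sketch in plain Java (the property sizes are made-up values): min-max scaling brings every numeric value into a common [0, 1] range.

import java.util.Arrays;

public class MinMaxScaler {

    // Scales every value to the [0, 1] range: (x - min) / (max - min).
    // Assumes at least two distinct values, so max > min.
    public static double[] scale(double[] values) {
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        return Arrays.stream(values)
                .map(v -> (v - min) / (max - min))
                .toArray();
    }

    public static void main(String[] args) {
        // Made-up property sizes in square metres
        double[] sizes = {45.0, 120.0, 80.0, 60.0};
        System.out.println(Arrays.toString(scale(sizes)));
    }
}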

Extracting the characteristics

This is one of the most important steps of the process. We should bring in some experts (if we do not have them on our team) to help us define the characteristics we are going to use in our model and that are going to help us solve the problem. We need to identify these characteristics in our data. For example, for a property valuation model: the number of bedrooms, the location, the size, how old the property is. All these characteristics will help us solve our problem.

Selecting the characteristics

Once we have extracted a list of characteristics, we need to find a balance between them and their cost. Computational resources are expensive: the more characteristics we try to process, the more expensive the processing is going to be and the worse the performance we are going to achieve. The challenge here is to select as few characteristics as possible without affecting the final result, or affecting it as little as possible. Ideal candidates for removal are irrelevant, redundant or correlated characteristics.

There are three main types of algorithms to select characteristics:

  • Wrappers: They are linked to the algorithm we are going to use and evaluate the effect of including new characteristics. Their downside is that they consume a large amount of resources.
  • Filters: They are independent of the algorithm we are going to use; they use mathematical and statistical techniques to help us select characteristics (see the sketch after this list). They consume fewer resources than wrappers.
  • Hybrids: They are built during the training step, which is why they are a mix of the two previous categories.
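As a small sketch of the filter idea, in plain Java and with made-up data: the Pearson correlation between two characteristics (or between a characteristic and the target) can flag redundant candidates; values close to 1 or -1 suggest that one of the pair can be removed.

import java.util.Arrays;

public class CorrelationFilter {

    // Pearson correlation coefficient between two characteristics,
    // in the range [-1, 1]; values near the extremes mean redundancy.
    public static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().getAsDouble();
        double meanY = Arrays.stream(y).average().getAsDouble();
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Made-up data: property size in square metres vs number of bedrooms
        double[] size = {45, 120, 80, 60, 150};
        double[] bedrooms = {1, 4, 3, 2, 5};
        System.out.println(pearson(size, bedrooms));
    }
}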

Training the algorithm

This step is when the algorithm starts learning and the model is built. We need to decide what approach we are going to take (classification, regression, …) and, depending on this, we will choose the modelling technique we are going to use or, maybe, we will try several of them. And, finally, we build our model, applying any necessary tweaks along the way.

Evaluating the algorithm

Once the model is finished and our algorithm is ready, we need to evaluate how accurate it is. This is usually done with some data we have reserved and never used to train the algorithm. With this data, we check whether the predictions made by the algorithm are good enough.
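For a classification problem, the simplest version of this check is the accuracy over the reserved (test) data; here is a minimal sketch in plain Java, with made-up labels:

public class Accuracy {

    // Fraction of reserved examples where the model's prediction
    // matches the expected label.
    public static double accuracy(int[] predicted, int[] expected) {
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == expected[i]) {
                correct++;
            }
        }
        return (double) correct / predicted.length;
    }

    public static void main(String[] args) {
        // Made-up labels for five reserved examples
        int[] predicted = {1, 0, 1, 1, 0};
        int[] expected = {1, 0, 0, 1, 0};
        System.out.println(accuracy(predicted, expected)); // 0.8
    }
}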

Analyzing the results

Now that we have results, we need to check them against the success criteria defined initially. Here we should look not just at the accuracy but also at whether the solution stays within our restrictions: performance, costs, …

If everything aligns, we proceed to the next step; if it does not, we need to review the process and the decisions taken, based on the results we have.

Deploying the system

This is the broadest step: notifying results, generating reports, generating documentation, making the system available to the required users, everything else we can think of around these areas and any other procedural or administrative task needed to add the seal of 'done' to our system.

As I said before, you will probably find these steps under different names in other books, articles and publications but I hope that, with these short explanations, we will be able to match and reference them.

See you.


Starting a project

There are multiple ways to learn how to code. Some people do it through some kind of formal education like high school, university or a master's degree… Other people go through bootcamps or the more modern initiatives we have seen lately. And, finally, there are people who learn by self-studying. No matter which one is your case, at the end of the day, the best way to learn and acquire coding skills is to code.

As developers, we code (we do other things too, not just code). Usually, if we do it professionally, enterprises have their own tools and procedures. Those are not the scope of this article. This article is going to focus on small projects we start outside these corporate environments, just for fun, for learning purposes or, why not, just because. And I am talking about projects, not just code snippets or small demos trying something we have read in an article or blog, or testing that crazy idea we have had in mind for the last few days.

The purpose of the article is to offer some guidance on free tools we can use to work on a project, following more or less a methodology and using tools similar to, if not the same as, the ones you can find in a corporate environment.

The article is aimed at people learning how to code, to give them a bigger picture, at people starting a long-term open-source project, or at just anyone curious. It is going to focus not on the coding part but on the areas around the project.

Every project, when it starts, needs a way to manage the code and a way to manage the effort. I am certain all of you agree about the first one but I can hear from here people questioning the second one. Well, initially, and especially if we are the only developer, we can think it is not necessary but, in the long run, even more if we expect contributions in the future, it is going to be a very useful thing to have. It will keep our focus, it will make us think in advance and do some planning and it will give us a history of the project: why we took a certain decision at a certain point or why we added a concrete piece of functionality. And, if you are learning how to code, it will give you the bigger picture I was talking about before.

To manage our code we need some kind of version control system for tracking changes. There are a few of them out there, like Git, Subversion, Perforce, Team Foundation Version Control or Mercurial. If you stay long enough in the industry, you will see all of them but, in this case, my preference is Git. There are some cloud platforms that offer you an account to use it (GitHub, GitLab, Bitbucket). All of them are similar and, at this level, there is no big difference; I invite you to test all of them but, in this case, I am going to recommend GitHub. I like it, I am used to it, it is hugely extended among the open-source community and it integrates easily and smoothly with the other tools we are going to see in this article.

To manage our effort we need some kind of project management tool for tracking tasks and their progress. As in the previous case, and here even more, there are a lot of them out there. One that is very simple to use and very widespread is Trello. Trello offers customizable boards we can use to track effort and progress and to plan in advance. In addition, there are a lot of useful plugins to improve and highly customize the boards and the cards (tasks) there. A quick mention here for the 'Projects' tab in GitHub, which allows you to create some automated Kanban boards; it is interesting to play with but I have never seen it in a corporate environment, where I have seen Trello multiple times. The first place there, though, goes to JIRA.

Once we start coding, creating pull requests and merging code into our repository, it is nice to have a CI/CD environment in place. There are multiple advantages to this but, even if we are just learning, it will keep our code healthy by making sure that any change made still compiles and passes all our tests. Again, in this category, we can find some cloud platforms and on-premises solutions but, for the article, I have chosen Travis CI (the .org one). It is simple to register, it has great integration with GitHub, it is powerful enough and it is well documented.

One thing developers should worry about is the quality and maintainability of the code they write. And I am not talking just about whether our code passes all the tests; I am talking about bugs, vulnerabilities, test coverage, code duplication and format (we should be using our IDE's auto-format or save actions for the last one). To cover all of this list we can find the tool SonarQube and its cloud solution, SonarCloud. These tools will report all the problems found every time a build is done, allowing us to correct them as soon as possible and not let them pile up, only to be found when there is a code audit or similar. Again, it is an easy tool to manage and to integrate with GitHub and Travis CI.

Are these tools the best ones? The most useful ones? Yes, no, maybe. I am a strong believer that there are no perfect tools, only tools perfect for a job and, sometimes, that is what we as developers need to decide: which tool fits the job best. The tools in this article are just examples and they were perfect for the article.


Docker Java Client

There is no question about containers being one of the latest big things. They are everywhere, everyone uses them or wants to use them and the truth is they are very useful and they have changed the way we develop.

We have Docker, for example, which provides us with the docker and docker-compose command line tools to run one or multiple containers with just a few lines or a configuration file. But we developers tend to be lazy and we like to automate things. For example, what if, when we run our application in our favourite IDE, the application started the necessary containers for us? That would be nice.

The point of this article is to play a little bit with a Docker Java client library; the case described above is just an example and you readers can probably think of better ones but, for now, it is enough for my learning purposes.

Searching around, I found two different Docker Java client libraries:

I have not analysed or compared them; I just found the GitHub one first and it looks mature and usable enough. For this reason, it is the one I am going to use for the article.

Let’s start.

First, we need to add the dependency to our pom.xml file:

<dependency>
    <groupId>com.github.docker-java</groupId>
    <artifactId>docker-java</artifactId>
    <version>3.1.1</version>
</dependency>

The main class we are going to use to execute the different instructions is the DockerClient class. This is the class that establishes the communication between our application and the Docker engine/daemon on our machine. The library offers us a very intuitive builder to generate the object:

DockerClient dockerClient = DockerClientBuilder.getInstance().build();

There are some options we can configure but, for a simple example, it is not necessary. I am just going to say that there is a class called DefaultDockerClientConfig where a lot of different properties can be set. After that, we just need to call the getInstance method with our configuration object.
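For example, a minimal sketch (the Docker host value is an assumption; adjust it to your environment):

// Hypothetical host value; on Linux the local daemon usually listens here
DefaultDockerClientConfig config = DefaultDockerClientConfig.createDefaultConfigBuilder()
        .withDockerHost("unix:///var/run/docker.sock")
        .build();

DockerClient dockerClient = DockerClientBuilder.getInstance(config).build();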

Image management

Listing images

List<Image> images = dockerClient.listImagesCmd().exec();

Pulling images

dockerClient.pullImageCmd("postgres")
                .withTag("11.2")
                .exec(new PullImageResultCallback())
                .awaitCompletion(30, TimeUnit.SECONDS);

Container management

Listing containers

// List running containers
dockerClient.listContainersCmd().exec();

// List all containers (running and stopped)
dockerClient.listContainersCmd().withShowAll(true).exec();

Creating containers

CreateContainerResponse container = dockerClient.createContainerCmd("postgres:11.2")
                .withName("postgres-11.2-db")
                .withExposedPorts(new ExposedPort(5432, InternetProtocol.TCP))
                .exec();

Starting containers

dockerClient.startContainerCmd(container.getId()).exec();
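And, when we no longer need the container, the clean-up is symmetrical; a small sketch (removeContainerCmd belongs to the same library, although I do not include it in the list below):

// Stop the container gracefully, then remove it
dockerClient.stopContainerCmd(container.getId()).exec();
dockerClient.removeContainerCmd(container.getId()).exec();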

Other

There are multiple operations we can perform; the above is just a short list of examples but we can extend it with:

  • Image management
    • Listing images: listImagesCmd()
    • Building images: buildImageCmd()
    • Inspecting images: inspectImageCmd("id")
    • Tagging images: tagImageCmd("id", "repository", "tag")
    • Pushing images: pushImageCmd("repository")
    • Pulling images: pullImageCmd("repository")
    • Removing images: removeImageCmd("id")
    • Search in registry: searchImagesCmd("text")
  • Container management
    • Listing containers: listContainersCmd()
    • Creating containers: createContainerCmd("repository:tag")
    • Starting containers: startContainerCmd("id")
    • Stopping containers: stopContainerCmd("id")
    • Killing containers: killContainerCmd("id")
    • Inspecting containers: inspectContainerCmd("id")
    • Creating a snapshot: commitCmd("id")
  • Volume management
    • Listing volumes: listVolumesCmd()
    • Inspecting volumes: inspectVolumeCmd("id")
    • Creating volumes: createVolumeCmd()
    • Removing volumes: removeVolumeCmd("id")
  • Network management
    • Listing networks: listNetworksCmd()
    • Creating networks: createNetworkCmd()
    • Inspecting networks: inspectNetworkCmd().withNetworkId("id")
    • Removing networks: removeNetworkCmd("id")

And that is all. There is a project using some of the operations listed here. It is called country, it is one of my learning projects, and you can find it here.

Concretely, you can find the code using the Docker Java client library here and, more specifically, the code using this library here, in the class PopulationDevelopmentConfig.

I hope you find it useful.


VirtualBox: Increase space

No one can dispute that virtual machines are a very important tool. Maybe nowadays, with all the container solutions, they are a bit less important but they are still very useful.

When we are using a virtual machine, one of the possible problems we can find at some point is that our hard drive reaches its maximum capacity. Luckily, this is not the end of the world and we can expand our HDs.

Important note: we are going to lose the snapshots we have (but it is a small price to pay to avoid starting a new machine from scratch).

This quick manual is based on VirtualBox; I guess that for other virtualisation solutions the process must be similar using the appropriate tools.

The first thing we need to do is to stop our virtual machine.

After that, we have some command line tools that are going to make this process “simple”.

The first command we are going to execute clones our HD in “vmdk” format to a new one in “vdi” format:

VBoxManage clonehd <virtual_machine_path>/<hd_name>.vmdk <new_name>.vdi --format vdi

This process will take some time depending on the HD size.

Once the command has finished its execution, we are going to increase the size of the cloned HD. Let's imagine the initial size of the HD was 80 GB and we want to double it; the --resize option takes the new size in megabytes, so 160 GB = 160 × 1024 = 163840 MB:

VBoxManage modifyhd <new_name>.vdi --resize 163840

Again, once the operation is finished, we need to clone the new HD back from “vdi” format to “vmdk” format:

VBoxManage clonehd <new_name>.vdi <hd_name_new>.vmdk --format vmdk

After waiting for the operation to finish, we will have our new HD ready to plug into our virtual machine. That is the next step: go to the VirtualBox user interface and replace the old HD device with the new one.

Now, if we start our virtual machine, we will still see the old size and we will not be able to find the newly added 80 GB. This is because we are missing one step. Turn off your virtual machine again if you turned it on before reading these lines and follow the next step.

We need a tool to edit our HD partitions. In this case, I am going to use a live ISO called GParted (Wikipedia).

In a similar way to how we replaced the old HD with the new one, we are going to load the GParted live ISO in the CD-ROM unit.

Now we should run our virtual machine again but, instead of leaving it to boot as usual, we will press F12 on startup to be able to choose the unit we want to boot the virtual machine from. CD-ROM will be one of the offered options. After this, and a few options selected during GParted's start-up, GParted will boot. Now we just need to expand the current partition to cover the newly added space.

And that is all. We can shut down the virtual machine, remove the live ISO from the devices attached to the virtual machine and boot it again. Now we will be able to see the 160 GB HD.


The Sec in SecDevOps

More or less everyone who works in the software industry should be aware by now of the term DevOps and of the life cycle of an application. Something similar to:

  • Plan: Gathering the requirements.
  • Code: Writing the code.
  • Test: Executing the tests.
  • Package: Packaging our application for the release.
  • Release: Generating the application artifacts.
  • Deploy: Making the application available to its users (QA, final users, product, …).
  • Operate: Taking care of the application's needs.
  • Monitor: Keeping an eye on the application.

All these bullet points are a rough description of the main stages of an application's life cycle, no matter what methodologies we are using.

The thing is that, nowadays, we should not just be aware of the term DevOps; we should be aware of, and be using or introducing, the term SecDevOps in our environments.

The ‘Sec’ stands for ‘Security’ and it basically means adding the security component, and the help of security professionals, to all the stages of the cycle, not just after deployment as usually happens.

In addition to the common tasks we have in the different stages, if we add the ‘Sec’ approach, we should be adding:

  • Plan: We should analyse the threat model: for example, user authentication, external servers, encryption, … Here we will generate security-focused stories.
  • Code: In this stage, the security professionals should be training and helping the team with security concepts: tools, terminology, attacks. This will help developers be aware of possible problems and write defensive measures integrated into their code, or follow best practices and recommendations.
  • Test: In this stage, specific tests to check security capabilities will be created, allowing us to detect possible problems early.
  • Package: Here we will check for vulnerabilities in external libraries and in our containers, or run static code analysers (see the sketch after this list).
  • Release: This stage can be combined with the previous one from a security point of view. Some of the platforms where containers are stored, for example, integrate vulnerability scanners.
  • Deploy: Here we can run dynamic security tests with typical ethical hacking tools.
  • Operate: This is when the Red Team steps up, in addition to resilience checks.
  • Monitor: We can analyse all our logs (hopefully centralised beforehand) trying to discover security problems.
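As an example of the ‘Package’ point, here is a sketch of how we could check external libraries for known vulnerabilities from our pom.xml using the OWASP dependency-check Maven plugin (the version number is just an illustrative assumption; check the latest one available):

<plugin>
    <groupId>org.owasp</groupId>
    <artifactId>dependency-check-maven</artifactId>
    <!-- Illustrative version; use the latest release -->
    <version>5.2.4</version>
    <executions>
        <execution>
            <goals>
                <goal>check</goal>
            </goals>
        </execution>
    </executions>
</plugin>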

I hope this gives us a simplified view of how we can integrate security awareness and processes into our application life cycles.


Random enum

As software engineers or software developers, we need to test what we are implementing (you have a description of the types of tests here). When implementing these tests, we can hardcode the data we are using or we can generate it randomly which, despite what we may think about it making our life more difficult when debugging errors, is going to make it easier in the long run. Our code should not be tied to the data it is processing; it should be generic for the type of data it is expecting.

The easiest way to do this is to implement, or to use libraries that implement, random generators. One of the most interesting, and one that is always forgotten, is the random generator for our enums.

In Java, there is an easy way to implement it.

It can be just for one enum class:

private static CarBrand randomCarBrand() {
    return CarBrand.class.getEnumConstants()[new Random().nextInt(CarBrand.class.getEnumConstants().length)];
}

Or, even more interesting, it can be a generic random generator that receives an enum class as a parameter and returns a random value (note that it relies on a shared Random instance, declared once and reused):

private static final Random random = new Random();

public static <T extends Enum<?>> T randomEnum(Class<T> clazz) {
    int x = random.nextInt(clazz.getEnumConstants().length);
    return clazz.getEnumConstants()[x];
}
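A quick usage sketch, with CarBrand being the example enum from the full listing below (any enum class works, even the JDK ones):

CarBrand carBrand = randomEnum(CarBrand.class);
java.time.DayOfWeek randomDay = randomEnum(java.time.DayOfWeek.class);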

We can see an example here:

import java.util.Random;

public class RandomEnum {

    public static void main(String[] args) {

        for (int i = 0; i < 10; i++) {
            System.out.println(randomCarBrand());
        }
    }

    private static CarBrand randomCarBrand() {
        return CarBrand.class.getEnumConstants()[new Random().nextInt(CarBrand.class.getEnumConstants().length)];
    }

    enum CarBrand {
        MARUTI_SUZUKI,
        TATA,
        HONDA,
        HYUNDAI,
        FORD,
        MAHINDRA,
        SKODA,
        ARIEL,
        ASHOK_LEYLAND,
        ASTON_MARTIN,
        AUDI,
        BAJAJ,
        BENTLEY,
        BMW,
    }
}

Do not forget, from now on, if you are not doing it already, to try to make all your tests random, except when you want to test an edge case, and see how it goes.


Checking certificate dates

Sometimes, when we are working or doing some investigation in our spare time, we need to check the dates of a web certificate, especially the expiration date. Obviously, we can go to our browser, enter the desired URL and, with a few clicks, check the issue and expiration dates.

But there is another way that is simpler, easier and, in case we need it, scriptable:

echo | openssl s_client -servername www.google.co.uk -connect www.google.co.uk:443 2>/dev/null | openssl x509 -noout -dates

This simple command gives us the information we want.
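And, if we prefer to do the same from Java instead of the command line, here is a minimal sketch with the standard javax.net.ssl classes (no extra dependencies; the host name is just the same example as above):

import java.security.cert.X509Certificate;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class CertificateDates {

    public static void main(String[] args) throws Exception {
        String host = "www.google.co.uk";
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, 443)) {
            // Forces the TLS handshake so the peer certificates are available
            socket.startHandshake();
            X509Certificate certificate =
                    (X509Certificate) socket.getSession().getPeerCertificates()[0];
            System.out.println("notBefore: " + certificate.getNotBefore());
            System.out.println("notAfter:  " + certificate.getNotAfter());
        }
    }
}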

I hope it is useful.
