ML – Python (IV) – Matplotlib

Continuing with the useful libraries we can find in the Python ecosystem, we have Matplotlib. It is a 2D plotting library that helps us present our data.

With Matplotlib, we can plot both plain Python and NumPy data structures but, it is recommendable to use the NumPy data structures.

Like the previous libraries we saw, Matplotlib does not come with the default Python installation and we need to install it on our system.

Installing Matplotlib

The installation is as simple as executing a command:

pip install -U matplotlib

After that, we will be able to draw some nice plots. As an example we can draw a basic one:

import matplotlib.pyplot as plt

a = [1, 2, 3, 4]
b = [11, 22, 33, 44]

plt.plot(a, b, color='blue', linewidth=3, label='line')
plt.legend()
plt.show()

You can find the code example here.

The result should be something like:

Matplotlib basic example

Details about the result view

The resulting view, see picture above, can contain different elements (a minimal sketch of them follows the list):

  • The main object is the figure (the window or main page); it is the top-level container for the rest of the elements.
  • You can create multiple independent figures.
  • Figures can have titles, legends and colour bars among others.
  • Within a figure we can generate plotting areas (axes). They are where the data is represented with methods like 'plot()' or 'scatter()' and they can have associated labels.
  • Every plotting area has an X-axis and a Y-axis representing numerical values. Each axis has a scale, title and labels among others.
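
To see these elements in action, here is a minimal sketch (the titles, labels and data are just placeholder values) that creates a figure with a single plotting area and configures some of its properties:

import matplotlib.pyplot as plt

# Create a figure with a single plotting area (axes)
fig, ax = plt.subplots()
fig.suptitle('Main figure title')

# Represent some data in the plotting area and label it
ax.plot([1, 2, 3, 4], [11, 22, 33, 44], label='line')
ax.scatter([1, 2, 3, 4], [44, 33, 22, 11], label='points')

# Configure the axes: labels and legend
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.legend()

plt.show()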

Matplotlib package structure

  • Matplotlib: The whole Python data visualization package.
  • Pyplot: It is a module of the Matplotlib package. It provides an interface to create figures and axes.
  • Pylab: It is a module of the Matplotlib package used to work with matrices in a MATLAB-like way. Its use is not recommended any more with the new IDEs and kernels.

Most common plot types

The most common plot types we can find are line plots, scatter plots, bar charts, histograms and pie charts.

You can see more examples of available plots here.
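
As a quick taste, here is a minimal sketch (with made-up data) drawing some of these common plot types:

import matplotlib.pyplot as plt

data_x = [1, 2, 3, 4]
data_y = [11, 22, 33, 44]

# Scatter plot
plt.scatter(data_x, data_y)
plt.show()

# Bar chart
plt.bar(data_x, data_y)
plt.show()

# Histogram
plt.hist([1, 1, 2, 2, 2, 3, 4, 4])
plt.show()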

With this, we finish a short overview of Matplotlib and the main plots it can offer us. It is a very convenient way to draw them easily.

ML – Python (III) – pandas

Another library in the Python ecosystem is pandas (PANel DAta). This library can help us to execute five common steps in data analysis:

  • Load data.
  • Data preparation.
  • Data manipulation.
  • Data modelling.
  • Data analysis.

The main pandas structure is the DataFrame: a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes. It is composed of three elements: the data, the index and the columns. In addition, the names of the columns and indexes can be specified.

Main library characteristics

  • The DataFrame object is fast and efficient.
  • Tools to load data in memory from different formats.
  • Data alignment and missing data management.
  • Reshaping and pivoting data sets.
  • Label-based slicing, indexing and subsetting of large data sets.
  • Columns can be removed or inserted.
  • Data grouping for aggregation and transformation.
  • High performance for data union and merge.
  • Time-based series functionality.
  • It has three main structures:
    • Series: 1D structures.
    • DataFrame: 2D structures.
    • Panel: 3D structures.

Installing pandas

The pandas library is not present in the default Python installation and needs to be installed:

pip install -U pandas

pandas useful methods

Creating a Series

import pandas as pd

series = pd.Series({"UK": "London",
                    "Germany": "Berlin",
                    "France": "Paris",
                    "Spain": "Madrid"})

Creating a DataFrame

data = np.array([['', 'Col1', 'Col2'], ['Row1', 11, 22], ['Row2', 33, 44]])
# Use the first row as the column names and the first column as the index
df = pd.DataFrame(data=data[1:, 1:], index=data[1:, 0], columns=data[0, 1:])

You can find the code example here.

Without the boilerplate code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))

Exploring a DataFrame

  • df.shape – DataFrame shape.
  • len(df.index) – DataFrame height (number of rows).
  • df.describe() – DataFrame numeric statistics (count, mean, std, min, 25%, 50%, 75%, max).
  • df.mean() – Return the mean of the values for the requested axis.
  • df.corr() – Correlation of columns.
  • df.count() – Count of non-null values per column.
  • df.max() – Maximum value per column.
  • df.min() – Minimum value per column.
  • df.median() – Median value per column.
  • df.std() – Standard deviation per column.
  • df[0] – Select a single DataFrame column (returned as a Series).
  • df[[1, 2]] – Select two DataFrame columns (returned as a new DataFrame).
  • df.iloc[0][2] – Select a value.
  • df.loc[0] – Select a row using its label.
  • df.iloc[0, :] – Select a row using its position.
  • pd.read_<file_type>() – Read from a file (pd.read_csv('train.csv')).
  • df.to_<file_type>() – Write to a file (df.to_csv('new_train.csv')).
  • df.isnull() – Verify if there are null values in the data set.
  • df.isnull().sum() – Return the sum of null values per column in the data set.
  • df.dropna() or df.dropna(axis=1) – Remove rows or columns with missing data.
  • df.fillna(x) – Replace missing values with x (df.fillna(df.mean())).
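
A minimal sketch (with made-up values) putting a few of these methods together:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, np.nan, 9]]))

print(df.shape)              # (3, 3)
print(df.describe())         # Numeric statistics per column
print(df.isnull().sum())     # Null values per column
print(df.fillna(df.mean()))  # Replace missing values with the column mean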

And, this is all. This has been just a quick, very quick, review of the pandas library. I recommend you play around with it a bit more; we will use it again in the future.

ML – Python (II) – NumPy

As I have said before, one of the biggest advantages of Python is the huge community and the amount of resources that support it. One of these resources is the NumPy (NUMerical PYthon) library.

It is one of the main libraries supporting scientific work in Python. It brings powerful data structures, implementing matrices and multidimensional arrays.

As a short example, we can see how to create a one-dimensional structure and a two-dimensional structure:

import numpy as np

a = np.array([1, 2, 3])
...
b = np.array([(1, 2, 3), (4, 5, 6)])
...

You can find the code example here.

But, why should we use NumPy structures instead of Python structures?

There are a couple of main reasons:

  • NumPy arrays consume less memory than Python lists.
  • NumPy arrays are faster in execution terms.

But you do not need to trust me; let's play a little bit with the code and run some informal benchmarks.

Let’s start with the memory assumption:

import sys
import numpy as np

s = range(1000)
print(sys.getsizeof(5) * len(s))
...
d = np.arange(1000)
print(d.size * d.itemsize)

You can find the code example here.

This gives us the next result:

Python list: 
28000
NumPy array: 
8000

As we can see, there is a big difference in memory consumption.

Now, let’s do the same for the execution time. Again, we are going to write a small code snippet and execute an informal benchmark:

import time
import numpy as np

SIZE = 1_000_000

L1 = range(SIZE)
L2 = range(SIZE)
A1 = np.arange(SIZE)
A2 = np.arange(SIZE)

start = time.time()
result = [(x + y) for x, y in zip(L1, L2)]
print((time.time() - start) * 1000)
...
start = time.time()
result = A1 + A2
print((time.time() - start) * 1000)

You can find the code example here.

This gives us the next result:

Python list: 
316.49184226989746
NumPy array: 
65.60492515563965

Again, as we can see, the execution time for the NumPy structures is much better.

In addition to the speed and memory improvements, it is worth pointing out the different syntax between Python and NumPy when writing the addition operation:

  • Python: [(x + y) for x, y in zip(L1, L2)]
  • NumPy: A1 + A2

As we can see, the difference is quite big. The second case, even if you know nothing about Python or NumPy, is very easy to understand.

Quick review of the NumPy API

  • Creating matrices
    • import numpy as np – Import the NumPy dependency.
    • np.array() – Creates a matrix.
    • np.ones((3, 4)) – Creates a matrix with a one in every position.
    • np.zeros((3, 4)) – Creates a matrix with a zero in every position.
    • np.random.random((3, 4)) – Creates a matrix with random values in every position.
    • np.empty((3, 4)) – Creates a matrix without initializing its entries.
    • np.full((3, 4), 8) – Creates a matrix with a specified value in every position.
    • np.arange(0, 30, 5) – Creates a matrix with a distribution of values (from 0 up to 30, exclusive, in steps of 5).
    • np.linspace(0, 2, 5) – Creates a matrix with a distribution of values (5 elements from 0 to 2).
    • np.eye(4, 4) – Creates an identity matrix.
    • np.identity(4) – Creates an identity matrix.
  • Inspect matrices
    • a.ndim – Matrix dimension.
    • a.dtype – Matrix data type.
    • a.size – Matrix size.
    • a.shape – Matrix shape.
    • a.reshape(3, 2) – Change the shape of a matrix.
    • a[3, 2] – Select a single element of the matrix.
    • a[0:, 2] – Extract the values in column 2 from every row.
    • a.min(), a.max() and a.sum() – Basic operations over the matrix.
    • np.sqrt(a) – Square root of the matrix.
    • np.std(a) – Standard deviation of the matrix.
    • a + b, a - b, a * b and a / b – Basic operations between matrices.
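
A minimal sketch (with arbitrary values) exercising a few of the calls above:

import numpy as np

a = np.arange(0, 30, 5)         # array([ 0,  5, 10, 15, 20, 25])
b = a.reshape(3, 2)             # Change the shape to 3 rows x 2 columns

print(b.ndim, b.shape, b.size)  # 2 (3, 2) 6
print(b[0:, 1])                 # Values in column 1 from every row
print(b.min(), b.max(), b.sum())
print(b + b)                    # Element-wise addition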

And, this is all. This has been just a quick, very quick, review of the NumPy library. I recommend you play around with it a bit more; we will use it again in the future.

Container Security: Anchore Engine

Nowadays, containers are taking over the world. We still have big systems and legacy systems and, obviously, not every company out there is fast enough to migrate to containerized solutions but, wherever you look, people are talking about containers.

And, if you look in the opposite direction, people are talking about security: breaches, vulnerabilities, systems not properly patched, all kinds of problems that put enterprise security and user data at risk.

With all of this, and it is not new, the number of projects involving both topics has been growing and growing. The ecosystem is huge, and the number of options is starting to be overwhelming.

We have projects like:

  • Docker Bench for Security: The Docker Bench for Security is a script that checks for dozens of common best-practices around deploying Docker containers in production. The tests are all automated and are inspired by the CIS Docker Benchmark v1.2.0.
  • Clair: Clair is an open-source project for the static analysis of vulnerabilities in application containers (currently including appc and docker).
  • Cilium: Cilium is open source software for providing and transparently securing network connectivity and load-balancing between application workloads such as application containers or processes. Cilium operates at Layer 3/4 to provide traditional networking and security services as well as Layer 7 to protect and secure use of modern application protocols such as HTTP, gRPC and Kafka. Cilium is integrated into common orchestration frameworks such as Kubernetes and Mesos.
  • Anchore Engine: The Anchore Engine is an open-source project that provides a centralized service for inspection, analysis and certification of container images. The Anchore Engine is provided as a Docker container image that can be run standalone or within an orchestration platform such as Kubernetes, Docker Swarm, Rancher, Amazon ECS, and other container orchestration platforms.
  • OpenSCAP: The OpenSCAP ecosystem provides multiple tools to assist administrators and auditors with assessment, measurement, and enforcement of security baselines. We maintain great flexibility and interoperability, reducing the costs of performing security audits.
  • Dagda: Dagda is a tool to perform static analysis of known vulnerabilities, trojans, viruses, malware & other malicious threats in docker images/containers and to monitor the docker daemon and running docker containers for detecting anomalous activities.
  • Notary: The Notary project comprises a server and a client for running and interacting with trusted collections. See the service architecture documentation for more information.
  • Grafeas: An open artefact metadata API to audit and govern your software supply chain.
  • Sysdig Falco: Falco is a behavioural activity monitor designed to detect anomalous activity in your applications. Powered by sysdig’s system call capture infrastructure, Falco lets you continuously monitor and detect container, application, host, and network activity – all in one place – from one source of data, with one set of rules.
  • Banyan Collector: Banyan Collector is a light-weight, easy to use, and modular system that allows you to launch containers from a registry, run arbitrary scripts inside them, and gather useful information.

As we can see, there are multiple tools within the container security scope. These are just some examples.

In this article, we are going to explore Anchore Engine a bit more. We are going to create a basic Jenkins pipeline to scan one container. For this, we are going to need:

  • A repository in GitHub with a simple dockerized project. In my case, I will be using this one. It’s a simple Spring Boot app with a hello endpoint and a very simple ‘Dockerfile’.
  • We are going to need a Docker Hub repository to store our image. I will be using this one.
  • Docker and docker-compose.

And, that’s all. Let’s go.

We can see in the next image the pipeline we are going to implement:

Install Anchore Engine

We just need to execute a few commands to have Anchore Engine up and running.

mkdir -p ~/aevolume/config 
mkdir -p ~/aevolume/db/
cd ~/aevolume/config && curl -O https://raw.githubusercontent.com/anchore/anchore-engine/master/scripts/docker-compose/config.yaml && cd - 
cd ~/aevolume
curl -O https://raw.githubusercontent.com/anchore/anchore-engine/master/scripts/docker-compose/docker-compose.yaml

After that, we should see a folder ‘aevolume’ with a content similar to:

Running Anchore Engine

As we can see, the previous step has provided us with a docker-compose file to run Anchore Engine in an easy way. We just need to execute the command:

docker-compose up -d

When docker-compose finishes, we should be able to see the two containers for Anchore Engine running: one for the application itself and one for the database.
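
A quick way to check it, from the ~/aevolume folder where the docker-compose.yaml file lives, is:

# List the containers managed by the docker-compose file
docker-compose ps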

Install the Anchore CLI

It is not necessary but, it is going to be very useful to debug integration problems if we have any (I had a few the first time). For this, we just need to execute a simple command that will make the executable ‘anchore-cli’ available in our system.

pip install anchorecli
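
Once installed, we can point the CLI at the engine to check the service status or analyse images manually. A small sketch, assuming the engine is listening on localhost:8228 and still uses the default admin credentials from the generated config.yaml (adjust the values to your own setup):

# Check that all the Anchore Engine services are up
anchore-cli --url http://localhost:8228/v1 --u admin --p foobar system status

# Analyse an image and list its known vulnerabilities
anchore-cli --url http://localhost:8228/v1 --u admin --p foobar image add docker.io/library/debian:latest
anchore-cli --url http://localhost:8228/v1 --u admin --p foobar image vuln docker.io/library/debian:latest all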

Install the Jenkins plugin

Now, we start working on the integration with Jenkins. The first step is to install the Anchore integration on Jenkins. We just need to go to the Jenkins management plugin area and install one called ‘Anchore Container Image Scanner Plugin’.

Configure Anchore in Jenkins

There is one more step we need to take to configure the Anchore plugin in Jenkins. We need to provide the engine URL and the access credentials. These credentials can be found in the file ‘~/aevolume/config/config.yaml’.

Configure Docker Hub repository

The last configuration we need to do is to add the access credentials for our Docker Hub repository. I recommend generating an access token here instead of using our real credentials. Once we have the access credentials, we just need to add them to Jenkins.

Create a Jenkins pipeline

To be able to run our builds and to analyze our containers, we need to create a Jenkins pipeline. We are going to use the script feature for this. The script will look like this:

pipeline {
    environment {
        registry = "fjavierm/anchore_demo"
        registryCredential = 'DOCKER_HUB'
        dockerImage = ''
    }
    agent any
    stages {
        stage('Cloning Git') {
            steps {
                git 'https://github.com/fjavierm/demo.git'
            }
        }

        stage('Building image') {
            steps {
                script {
                    dockerImage = docker.build registry + ":$BUILD_NUMBER"
                }
            }
        }

        stage('Container Security Scan') {
            steps {
                sh 'echo "docker.io/fjavierm/anchore_demo:latest `pwd`/Dockerfile" > anchore_images'
                anchore name: 'anchore_images'
            }
        }

        stage('Deploy Image') {
            steps {
                script {
                    docker.withRegistry('', registryCredential) {
                        dockerImage.push()
                    }
                }
            }
        }

        stage('Cleanup') {
            steps {
                sh '''
                    for i in `cat anchore_images | awk '{print $1}'`; do docker rmi $i; done
                '''
            }
        }
    }
}

This will create a pipeline like:

Execute the build

Now, we just need to execute the build and see the results:

Conclusion

With this, we finish the demo. We have installed Anchore Engine, integrated it with Jenkins, run a build and checked the analysis results.

I hope it is useful.

ML – Python (I) – Introduction

We have talked about Machine Learning here on the blog a few times. The purpose of this series of articles is to go a little bit further and explore the Machine Learning space and its relation with Python in a bit more depth.

All the information in a more technical shape and the small scripts can be found at my GitHub account under the project python-ml.

One of the questions worth discussing is: why Python?

Available languages for Machine Learning

It is clear that you can use a lot of different languages to implement Machine Learning algorithms and programs but, looking at the space and their popularity, you can easily see a tendency and preference towards four of them.

  • Python
    • It is the leader of the race right now due to its simplicity and gentle learning curve.
    • It is especially good and successful for beginners, in both programming and Machine Learning.
    • The libraries ecosystem and community support are huge.
  • R
    • It is designed for statistical analysis and visualization and it is frequently used to unlock patterns in large blocks of data.
    • With RStudio, developers can easily build algorithms and statistical visualizations.
    • It is a free alternative to more expensive software like Matlab.
  • Matlab
    • It is fast, stable and secure for complex mathematics.
    • It is considered a hardcore language for mathematicians and scientists.
  • Julia
    • Designed to deal with numerical analysis needs and computational science.
    • The base Julia library was integrated with C and Fortran open-source libraries.
    • The collaboration between the Jupyter and Julia communities gives Julia a powerful UI.

Some important metrics to consider when choosing a language should be:

  • Speed.
  • Learning curve.
  • Cost.
  • Community support.
  • Productivity.

Here we can classify our languages as follows:

  • Speed: R is basically a statistical language and it is difficult to beat in this context.
  • Learning curve: This depends on the person’s previous knowledge. R is closer to functional languages, as opposed to Python, which is closer to object-oriented languages.
  • Cost: Only Matlab is not a free language. The other languages are open source.
  • Community: All of them are very popular but, Python has a bigger community and amount of resources available.
  • Productivity: R shines for statistical analysis, Matlab for computer vision, bio-informatics and biology are the playground of Julia and, Python is the king for general tasks and multiple usages.

The decision, at the end of the day, is about a balance between all the characteristics seen above, our skills and the field we are in or the tasks we want to implement.

In my case, I am going to choose Python, as probably all of you have assumed, because it is like a Swiss Army knife and, at this point, the beginning, I think this is important. There is always time later to focus on other things or reduce the scope.

IDEs

There are multiple IDEs that support Python. As it is a very widespread language, there are multiple tools and environments we can use. Here, just take the one you like the most.

If you do not know any IDE or platform, there are a couple of them that a lot of Data Scientists use:

I do not know them. As a developer, I am more familiar with Visual Studio Code or IntelliJ, and I will probably be using one of them unless I discover some exciting functionality or advantage in one of the others.

Embracing remote work

Recently, due to some unexpected and undesired circumstances, I have been forced to work remotely for a long period of time, enough to form some opinions and put to the test some ideas around remote work that I read about in the past. Here are some conclusions. Some of them can probably be extrapolated to any kind of role but, in this case, they are from a software developer’s point of view.

Up front, I will say that I am someone who likes to be at the office: the whiteboard discussions with multiple participants or, even, just grabbing a notebook to draw or write something and discuss it with people, this kind of thing. Until now, I have not had the chance or the will to work remotely for a long period of time but, life sometimes demands adaptation.

A few things I have learned or that have worked during this period:

Trust

One of the main concerns is usually that people working remotely are not going to perform at the same level as if they were at the office. I must say that this is completely false. We are adults, we have our tasks, responsibilities and deadlines, and we must learn to trust each other. Without trust, this is not going to work. Remote workers should not be asked to constantly prove they are working. We should measure their performance based on the same metrics they would have if they were at the office. Managers and colleagues need to be open to trusting remote workers, and remote workers need to honour this trust.

Instant messages

Instant messaging tools like Slack are great: they allow us to communicate in an easy way, ask and solve questions, send tasks, send deadlines, share files, etc. But, they are not a control tool; do not expect your remote workers to always answer a few seconds after you have sent them a message, in the same way that you would not expect an immediate answer from someone who is at the office.

And, in addition to the instant messaging tools, try using some videoconference tools like Zoom, talk with your remote workers, share your screen, catch up with them when it is necessary, even if it is just for a few minutes. Have a quick chat when resolving questions or planning. You will see how useful it is and it will create a closer relationship than the one just built using instant messages.

Tools

Make sure your company offers the appropriate tools to remote workers; they do not need extra tools, just to have available the same tools the people at the office have. If you provide laptops to them, make sure they can perform properly under the expected workload for their role. Have your IT teams configure the necessary tools like VPNs, 2FA, access keys, etc.

Availability

Managers, face it: the fact that remote workers are working from home and have their work environment available does not mean they should be available at all times of the day. Respect their schedule. If they deliver, and if they are flexible enough to be available for meetings that are worth it, do not push them to work off-hours and do not expect them to be answering emails or messages. Let them rest and be productive the next day.

Culture

One of the most repeated things is that remote workers are not around and do not fit in the company culture but, what are they actually missing by not being there? Some trivial conversation around the coffee machines, a few pizza or beer events. That is nice to have but, it is not what defines the company culture. Include them in the big meetings like kickoffs; transmitting this kind of meeting using a videoconferencing app is not difficult. Invite them, when it is possible, to the big parties, the Christmas dinner for example. Just try to think a little about them when it is possible and do not discourage people from using videoconference with them.

Meetings

I will say that “If you have just one remote worker that needs to attend a meeting, the meeting should be online”. If this is not possible, book a room with a speaker and a microphone, share your screen during the meeting, ask for the participants’ opinions and even, sometimes, ask for the collaboration of the people on the line. Try it, it is easy, it is useful and, it works.

Use the tools available for remote working. There are excellent whiteboards and project management tools and, even, just a shared Excel or Word document where you can interact with people will make everything so much easier.

Communication

You will probably not realise it initially but, there is a big chance that your remote worker is going to be an excellent communicator or, if they are just starting to work remotely, you will see how, over time, they will improve their communication skills.

They are going to learn a few things:

  • Picking up different tones. When you are face-to-face you are able to obtain non-verbal communication signals from the people around you but, in general, when you are working remotely, you have just the voices to identify different things and use them to improve the way you are communicating.
  • Confidence. When you are face-to-face, part of your communication is done by your presence, the fact that you are there. When you are working remotely, you need to create this presence with your voice and your confidence.
  • Taking the tools you have to the next level. A remote worker will use the available tools in ways you have never thought of. They will learn to use them more effectively and, probably, even find usages they were not initially designed for. Everything just to improve the way they communicate.
  • Making people understand them. They will learn to be assertive and clear, not rambling around. They will learn how to get to the point. And, they will learn to overcommunicate in order to establish better communication, explain themselves and ask for feedback.

In general, I think that it has been a great experience. I must say that the first week I felt a bit weird, I imagine it was the sudden and unplanned change of situation but, this feeling went away soon and easily, leaving a great experience.

Maven archetypes

We live in a micro-services world lately; it does not matter where you go, big, medium or small companies or start-ups, everyone is trying to implement micro-services or migrate to them.

Maybe not initially, but when companies achieve a certain level of maturity, they start having a set of common practices, libraries or dependencies they apply or use in all the micro-services they build. Let’s say, for example, authentication or authorization libraries, metrics libraries, … or any other component they use.

When this level of maturity is achieved, starting a project usually means taking the “How-To” article in our wiki and copying and pasting common code and configurations, and creating a concrete structure in the new project. After that, everything is set to start implementing the business logic.

This copy and paste process is not something that usually takes a long time but, it is a bit tedious and prone to human error. To make our lives easier and to try to avoid unnecessary mistakes, we can use Maven archetypes.

Taken from the Maven website, an archetype is:

In short, Archetype is a Maven project templating toolkit. An archetype is defined as an original pattern or model from which all other things of the same kind are made. The name fits as we are trying to provide a system that provides a consistent means of generating Maven projects. Archetype will help authors create Maven project templates for users, and provides users with the means to generate parameterized versions of those project templates.

In the next two sections, we are going to learn how to build some basic archetypes and how to build a more complex one.

Creating a basic archetype

Following the Maven documentation page, we can see there are a few ways to create our archetype:

From scratch

I am not going to go into details here because the Maven documentation is good enough and because it is the method we are going to use in the “Creating a complex archetype” section below. You just need to follow the four steps the documentation shows:

  1. Create a new project and pom.xml for the archetype artefact.
  2. Create the archetype descriptor.
  3. Create the prototype files and the prototype pom.xml.
  4. Install the archetype and run the archetype plugin

Generating our archetype

This is a very simple option, also described in the Maven documentation. Basically, you use Maven to generate the archetype structure for you:

mvn archetype:generate \
    -DgroupId=[your project's group id] \
    -DartifactId=[your project's artifact id] \
    -DarchetypeGroupId=org.apache.maven.archetypes \
    -DarchetypeArtifactId=maven-archetype-archetype

As simple as that. After executing the command, we can add our personalisations to the project and proceed to install it as mentioned before (the install command is shown below).
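
For reference, installing the archetype in the local Maven repository is just a standard build executed from the archetype's root folder:

# Run from the root folder of the archetype project
mvn clean install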

From an existing project

This option allows us to create a project and, when we are happy with how it is, transform it into an archetype. Basically, we need to follow the next steps:

  1. Build the project layout from scratch and add files as needed.
  2. Run the Maven archetype plugin on an existing project and configure from there.
mvn archetype:create-from-project

This will generate an “archetype” folder into the “target” folder:

target/generated-sources/archetype

We just need to copy this folder structure to the desired location and we will have our archetype ready to go. It needs to be installed as usual to be able to use it.

Using our archetype

Once we have installed our archetype, we can start using it:

mvn archetype:generate \
    -DarchetypeGroupId=dev.binarycoders \
    -DarchetypeArtifactId=simple-archetype \
    -DarchetypeVersion=1.0-SNAPSHOT \
    -DgroupId=org.example \
    -DartifactId=project1

This will create a new project using the archetype. The information we need to modify in the previous command is:

  • archetypeGroupId: It is the archetype group id we have defined when we created the archetype.
  • archetypeArtifactId: It is basically the name of our archetype.
  • archetypeVersion: It is the version of the archetype we want to use in case the archetype has been evolving over time and we have different versions.
  • groupId: It is the group id our new project is going to have.
  • artifactId: It is the name of our new project.

Deleting our archetype

Right now, after installing our archetype, it is only available in our local repository. This fact allows us to delete the archetype in a very simple way. We just need to take a look at the archetype catalogue in our repository and manually remove the archetype. We can find this file at:

~/.m2/repository/archetype-catalog.xml
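
Inside that file, the entry to remove looks roughly like the following sketch (the coordinates correspond to the simple-archetype example used in this article):

<archetype>
    <groupId>dev.binarycoders</groupId>
    <artifactId>simple-archetype</artifactId>
    <version>1.0-SNAPSHOT</version>
    <description>simple-archetype</description>
</archetype>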

Creating a complex archetype

For most cases, the ways to create archetypes we have already reviewed should be enough but, not for all of them. What happens if we need to define some modules whose names we want to specify when creating the project? Or classes? Or some other customisations?

Luckily, Maven gives us some level of flexibility allowing us to define some variables and use some concrete patterns to define folders and files in our archetypes in a way they will be replaced when the projects using the archetype are created.

As a general rule, we will be using two kinds of notation for our dynamic elements:

  • Defined in files: ${varName}
  • Defined in file system: __varName__ (two underscores)

This will help us to achieve our goals.

As an example, I am going to create a small complex archetype to be able to see this in action. The projects created with the archetype are going to have:

  • A parent project with the <artifactId> name.
  • Two modules called <artifactId>-one and <artifactId>-two.
  • Main classes called <classPrefix>OneApp and <classPrefix>TwoApp respectively.
  • The classes will be located in the packages <package>.one and <package>.two respectively.
  • The module one will have a properties file stored in the resources folder.

The code of the archetype can be found at the GitHub repository.

The first file we can check is archetype-metadata.xml, located in META-INF/maven.

We can see here the definition of the variables classPrefix and groupId, the latter with a default value assigned.

<requiredProperties>
    <requiredProperty key="classPrefix" />
    <requiredProperty key="groupId">
        <defaultValue>dev.binarycoders</defaultValue>
    </requiredProperty>
</requiredProperties>

After that, we can see the definition of the project structure we want to achieve. In this case, we have the fileSets node with the files on the parent project and, after that, the definition of the modules we want to include. Here we should pay special attention to the way the module attributes are defined:

<module id="${rootArtifactId}-one"
         dir="__rootArtifactId__-one"
         name="${rootArtifactId}-one">

As we can see, they use the notation described before: the “${}” notation for variables in files and the “__” (two underscores) notation for file system elements. The rest of the file is pretty simple.

If we explore the folder structure, we can see a few elements defined with this two-underscores notation, like the module names and the class names. These are dynamic elements that will take their names from the variables defined when the project is created.

We can define different filesets for the files we want to be copied to our generated project. For example, we can copy all the .java files we can find inside the path src/main/java:

<fileSet filtered="true" packaged="true" encoding="UTF-8">
    <directory>src/main/java</directory>
        <includes>
            <include>**/*.java</include>
        </includes>
</fileSet>

Finally, if we explore one of the classes, we can see the next content:

#set( $symbol_pound = '#' )
#set( $symbol_dollar = '$' )
#set( $symbol_escape = '\' )
package ${package}.one;

public class ${classPrefix}OneApp {
}

The first three lines are just aliases that allow using symbols that have a specific meaning in the template as plain literals.

After that, we can see the package definition, which is going to be built with one part dynamically added and one part statically defined. We can see the class name follows the same pattern.

It deserves special attention that, despite defining packages in the classes, we are not replicating this structure in the project layout; Maven will take care of that for us. This is because, when we defined the fileSet, we set the attribute packaged equal to true. If this attribute is set to false, we will be in charge of defining the desired structure.

It is worth mentioning that, because the files in the Maven archetype act as Velocity templates, we can introduce some logic and some dynamic content in our files. For example, printing something or not in a given file:

<requiredProperty key="greeting">
    <defaultValue>y</defaultValue>
</requiredProperty>
#if (${greeting} == 'y')
    // Hello, welcome here!
#end

This variable can be set using the command line when we generate our new project:

-Dgreeting=n

Finally, there is one more interesting thing we can do. We can use a post-generation script written in Groovy to execute some actions after the project has been generated. One interesting use is to remove undesired files based on some variables defined when generating the project. This script is located in the folder src/main/resources/META-INF with the name archetype-post-generate.groovy.

import groovy.io.FileType

def rootDir = new File(request.getOutputDirectory() + "/"
    + request.getArtifactId())
def oneBundle = new File(rootDir, request.getArtifactId()
    + "-one")

def projectPackage = request.getProperties().get("package")

assert new File(oneBundle, "src/main/java/" 
    + projectPackage.split("\\.").join('/')
    + "/toDelete.txt").delete()

With this, every time that we use the archetype to create a new project we will obtain the desired results.

We can use our recently created archetype with:

mvn archetype:generate \
    -DarchetypeGroupId=dev.binarycoders \
    -DarchetypeArtifactId=simple-archetype \
    -DarchetypeVersion=1.0-SNAPSHOT \
    -DgroupId=org.example \
    -DartifactId=project

And the result:

And, one of the classes:

This is all. I hope it is useful.

See you.
