Apache Kafka (II): Interesting learning resources

As explained in my last article, I needed to learn Apache Kafka at a fast pace. My plan after that was to write a series of articles organising my learnings, thoughts and ideas, but something has changed. Although I am confident in my communication skills, I found a series of videos by Confluent (I have no relation with them) that I would have loved to discover when I was starting. They are four mini-courses that explain Apache Kafka, Kafka Streams and ksqlDB, and they are clear, concise and platform agnostic (not tied to the Confluent platform).

For all these reasons, I have decided it is better to point people to these resources instead of writing a few more articles. Everyone who follows the blog knows that I do not share other materials too often, but they do a good job explaining all the Apache Kafka concepts with examples, and there is no point rewriting what they already cover. There is, maybe, room for one last article in this series with some configurations, tips and pitfalls (I have a long list of those).

The courses are:

  • Apache Kafka 101: A brief introduction, an explanation of the basic concepts and the basic usage.
  • Kafka Streams 101: A nice, example-rich course about Kafka Streams and related concepts.
  • ksqlDB 101: Same format as the previous one, focused on ksqlDB.
  • Inside ksqlDB: Going one step further into ksqlDB, showing some of its operations.

Apache Kafka (I): Intro + Installation

Due to some recent changes, I have been forced to learn about Apache Kafka. It has been sudden and, despite the fact that I love learning new things, a bit too rushed for my taste. For this reason, I plan to write a few articles trying to sort my learnings and ideas and put some order into the chaos. During this extremely quick learning period, I have learned complex concepts before basic ones, and tricks and exceptions before rules. I hope these articles help me get a better understanding of the technology, and help anyone else in the same position.

What is Apache Kafka?

On the project’s webpage, we can see the following definition:

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Maybe a more elaborate definition is that Apache Kafka is an event streaming platform: a highly scalable and durable system capable of continuously ingesting gigabytes of events per second from various sources and making them available in milliseconds, used to collect, store and process real-time data streams at scale. It has numerous use cases, including distributed logging, stream processing and Pub-Sub messaging.

At this point, you are probably still confused. I know how you feel; it happened to me the first time. Let's see if we can dig a bit more into all of this. Splitting the definition into smaller parts, we have:

  • An event is something that happens in a system. It can be anything from a particular user logging in, to a sensor registering a temperature.
  • A data pipeline reliably processes and moves data from one system to another, in this case, events.
  • A streaming application is an application that consumes streams of data.

How does Apache Kafka work?

Apache Kafka combines two messaging models:

  • Queueing: Queuing allows for data processing to be distributed across many consumer instances, making it highly scalable. However, traditional queues are not multi-subscriber.
  • Publish-subscribe: The publish-subscribe approach is multi-subscriber, but because every message goes to every subscriber it cannot be used to distribute work across multiple worker processes.

To combine both approaches, Apache Kafka uses a partitioned log model, where a log is an ordered sequence of events split into different partitions. This allows multiple subscribers (within a consumer group, each partition is consumed by one consumer) and, while delivery order is guaranteed within a partition, it is not guaranteed across the whole log.

This model provides replayability, which allows multiple independent applications to process the events in a log (data stream) at their own pace.
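To make the model a bit more tangible, here is a minimal sketch (not taken from the official documentation) using the Kafka Java client, assuming a broker listening on localhost:9092 and an existing topic called 'events'. Records sharing a key are hashed to the same partition, which is what gives us the per-partition ordering mentioned above, while consumers in the same group split the partitions among themselves.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Records with the same key always land on the same partition,
        // so their relative order is preserved for any consumer.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "user-42", "logged-in"));
            producer.send(new ProducerRecord<>("events", "user-42", "viewed-profile"));
            producer.send(new ProducerRecord<>("events", "user-7", "logged-in"));
        }
    }
}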

RabbitMQ vs Apache Kafka

One of the first questions that popped into my mind when reviewing Apache Kafka for the first time was “why not use RabbitMQ?” (I have more extensive experience with it). Over time, I discovered that this was not the right question; a more accurate one is “what does each service excel at?“. The second question is closer to the mentality we should have as software engineers.

The one sentence to remember when deciding is that RabbitMQ's message broker design excels in use cases with specific routing needs and per-message guarantees, whereas Kafka's append-only log gives developers access to the stream history and more direct stream processing.

RabbitMQ seems better suited for long-running tasks, when reliable background jobs need to be run, and for communication and integration within and between applications, i.e. as a middleman between microservices. Apache Kafka, on the other hand, seems ideal when a framework for storing, reading (and re-reading), and analyzing streaming data is needed.

Some more concrete scenarios are:

  • Apache Kafka:
    • High-throughput activity tracking: Kafka can be used for a variety of high-volume, high-throughput activity-tracking applications. For example, you can use Kafka to track website activity (its original use case), ingest data from IoT sensors, monitor patients in hospital settings, or keep tabs on shipments.
    • Stream processing: Kafka enables you to implement application logic based on streams of events. You might keep a running count of types of events or calculate an average value over the course of an event that lasts several minutes. For example, if you have an IoT application that incorporates automated thermometers, you could keep track of the average temperature over time and trigger alerts if readings deviate from a target temperature range.
    • Event sourcing: Kafka can be used to support event sourcing, in which changes to an app state are stored as a sequence of events. So, for example, you might use Kafka with a banking app. If the account balance is somehow corrupted, you can recalculate the balance based on the stored history of transactions.
    • Log aggregation: Similar to event sourcing, you can use Kafka to collect log files and store them in a centralized place. These stored log files can then provide a single source of truth for your app.
  • RabbitMQ:
    • Complex routing: RabbitMQ can be the best fit when you need to route messages among multiple consuming apps, such as in a microservices architecture. RabbitMQ consistent hash exchange can be used to balance load processing across a distributed monitoring service, for example. Alternative exchanges can also be used to route a portion of events to specific services for A/B testing.
    • Legacy applications: Using available plug-ins (or developing your own), you can deploy RabbitMQ as a way to connect consumer apps with legacy apps. For example, you can use a Java Message Service (JMS) plug-in and JMS client library to communicate with JMS apps.

Apache Kafka installation (on macOS)

The articles are going to be focused on a macOS system, but the steps to install Apache Kafka on other systems should be easy enough to adapt, and the rest of the concepts should be the same from one system to another.

We are going to install Apache Kafka using Homebrew, a package manager for macOS. The installation is as simple as running:

brew install kafka

This command will install two services on our system:

  • Apache Kafka
  • Zookeeper

We can run them as services using the Homebrew commands ‘brew services’ (similar to the ‘systemctl’ if you are a Linux user), or execute them manually.

To run them as services:

brew services start zookeeper
brew services start kafka

To run them manually:

zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
kafka-server-start /usr/local/etc/kafka/server.properties

Everything should work out of the box. With this, we will have a simple Apache Kafka server running locally.
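If we want a quick programmatic check that the broker is reachable, a minimal sketch using the Kafka Java AdminClient (assuming the default localhost:9092 listener and the kafka-clients library on the classpath) could be:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Properties;

public class BrokerCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Listing the topics is enough to prove the broker is up and reachable.
        try (AdminClient admin = AdminClient.create(props)) {
            System.out.println("Topics: " + admin.listTopics().names().get());
        }
    }
}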

I have seen some people having connection problems when starting Apache Kafka. If this happens, we can edit the file ‘/usr/local/etc/kafka/server.properties‘, find the line ‘#listeners=PLAINTEXT://:9092‘, uncomment it and start the server again.

With these steps, we should be ready to start making some progress.


Maintaining compatibility

Most companies nowadays are implementing, or want to implement, architectures based on microservices. While this can help companies overcome multiple challenges, it brings its own new challenges to the table.

In this article, we are going to discuss a very concrete one: maintaining compatibility when we change the objects used to communicate between the different microservices in our systems. Sometimes they are called API objects, Data Transfer Objects (DTOs) or similar. More concretely, we are going to use Jackson and JSON as the serialisation (marshalling and unmarshalling) mechanism.

There are other methods, technologies and ways to achieve this, but this is one more tool to keep on our belt and be aware of, so we can make informed decisions when faced with this challenge in the future.

Maintaining compatibility is a very broad term. To establish what we are talking about and ensure we are on the same page, let us see a few examples of real situations we are trying to mitigate:

  • Deploying breaking changes when releasing new features or services due to changes in the objects used to communicate between the different services, especially if, at deployment time, there is a short period where the old and the new versions are running at the same time (almost impossible to avoid unless you stop both services, deploy them and restart them again).
  • Being forced to follow a strict deployment order for our different services. We should be able to deploy in any order and whenever it best suits the business and the different teams involved.
  • Being forced, due to a change to one object in one concrete service, to deploy multiple services not directly involved or affected by the change.
  • Related to the previous point, being forced to change other services because of a small data or structural change. An example would be objects that travel through different systems, being, for example, enriched with extra information and finally shown to the user by the last one.

To exemplify the kind of situation we can find ourselves in, let us take a look at the image below. In this scenario, we have four different services:

  • Service A: It stores some basic user information such as the first name and last name of a user.
  • Service B: It enriches the user information with extra information about the job position of the user.
  • Service C: It adds some extra administrative information such as the number of complaints open against the user.
  • Service D: It finally uses all the information about the user to, for example, calculate some advice based on performance and area of work.

All of this is deployed and working on our production environment using the first version of our User object.

At some point, product managers decided the age field should be considered on the calculations to be able to offer users extra advice based on proximity of retirement. This added requirement is going to create a second version of our User object where the field age is present.

One last comment: for simplicity, let us say the communication between services is asynchronous, based on queues.

As we can see in the image, in this situation only services A and D should need to be modified and deployed. This is what we are trying to achieve and what I mean by maintaining compatibility. But first, let us explore the options we have at this point:

  1. Upgrade all services to the second version of the object User before we start sending messages.
  2. Avoid sending the User from service A to service D, send just an id, and perform a call from service D to recover the User information based on the id.
  3. Keep the unknown fields on an object, even if the service processing the message at this point does not know anything about them.
  4. Fail the message, and store it for re-processing until we perform the upgrade to all services involved. This option is not valid on synchronous communications.

Option 1

As we have described, it implies updating the dependency service-a-user in all the projects. This is possible, but it quickly brings some problems to the table:

  • We not only need to update direct dependencies but indirect dependencies too, which can be hard to track and easy to miss. In addition, a decision needs to be made about what to do when a dependency is missed: should an error be thrown? Should we fail silently?
  • We have a problem with scenarios where we need to roll back a deployment due to something going wrong. Should we roll back everything? Good luck! Should we try to fix the problem while our system is not behaving properly?
  • Heavy refactoring operations or modifications can make upgrades very hard to perform.

Option 2

Instead of sending the object information in the message, we just send an id and later recover the object information using a REST call. This option, while very useful in multiple cases, is not exempt from problems:

  • What if, instead of just a couple of enrichers, we have a dozen of them and they all need the user information? Should we consolidate all the services and send ids for the enriched information, creating stores on the enrichers?
  • If, instead of a queue, other communication mechanisms such as RPC are used, do all the services now need to call service A to recover the User information to do their job? This just creates a cascade of calls.
  • And, under this scenario, we can have inconsistent data if there is any update while the different services are recovering a User.

Option 3

This is the desired option and the one we are going to deep dive into in this article: using Jackson and JSON to keep the fields even if the processing service does not know everything about them.

As always, there are no silver bullets; there are problems that not even this solution can solve, but it will mitigate most of the ones named in the previous lines.

One problem we are not able to solve with this approach – especially if your company performs “all at once” releases instead of independent ones – is when service B, once deployed, tries to persist some information on service A before the new version of A has been deployed, or tries to perform a search on service A using a criterion A does not know yet, in this case the field age. In this scenario, the only thing we can do is throw an error.

Option 4

This option, especially in asynchronous situations where messages can be stored and retried later, can be a possible way to propagate the upgrade. It will slow down our processing capabilities temporarily, and a retry mechanism needs to be in place, but it is doable.

Using Jackson to solve versioning

Renaming a field

Plain and simple: do not do it, especially if it is a client-facing API and not an internal one. Avoiding it will save you a lot of trouble and headaches. And, if we are persisting JSON in our databases, renaming will also require some migrations.

If it needs to be done, think about it again. Really, rethink it. If, after rethinking it, it still needs to be done, a few steps need to be taken:

  1. Update the API object with the new field name, using @JsonAlias for the old one (see the sketch after this list).
  2. Release and update everything so it uses the renamed field, while @JsonAlias keeps accepting the old field name.
  3. Remove @JsonAlias for the old field name. This is a cleanup step; everything should already work after step two.
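A minimal sketch of steps one and two, assuming we are renaming the field firstname to givenName on the User object used in this article (the new name is just an invented example):

import com.fasterxml.jackson.annotation.JsonAlias;

class User {
    private Long id;

    // The new name is used when serialising; @JsonAlias still accepts the old
    // name when deserialising payloads produced by not-yet-updated services.
    @JsonAlias("firstname")
    private String givenName;

    private String lastname;
}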

Removing a field

Like in the previous case, do not do it, or think very hard about it before you do it. Again, if you finally must, a few steps need to be followed.

First, consider deprecating the field instead of removing it.
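A sketch of what deprecation could look like is simply marking the field while still accepting and emitting it, so consumers get a clear signal without anything breaking (the nickname field is a hypothetical example, not one from the scenario above):

class User {
    private Long id;
    private String firstname;
    private String lastname;

    /**
     * @deprecated No longer used by the business logic; kept only for
     * compatibility with older producers and consumers.
     */
    @Deprecated
    private String nickname; // hypothetical field on its way out
}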

If it must be removed:

  1. Explicitly ignore the old property with @JsonIgnoreProperties (see the sketch after this list).
  2. Remove @JsonIgnoreProperties for the old field name once nothing sends it any more.
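A minimal sketch of step one, reusing the hypothetical nickname field from the previous sketch:

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;

// Older services may still send "nickname"; explicitly ignoring it keeps
// deserialisation from failing while the field is being phased out.
@JsonIgnoreProperties({"nickname"})
class User {
    private Long id;
    private String firstname;
    private String lastname;
}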

Unknown fields (adding a field)

Ignoring them

The first option is the simplest one: we do not care about new fields. It is a rare situation, but it can happen. We just ignore them.
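With Jackson, this is essentially the ignoreUnknown switch, which can be applied per class or globally on the ObjectMapper; a minimal sketch:

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

// Per class: unknown JSON properties are silently dropped when deserialising.
@JsonIgnoreProperties(ignoreUnknown = true)
class User {
    private Long id;
    private String firstname;
    private String lastname;
}

class MapperFactory {
    // Or globally, for every type handled by this mapper.
    static ObjectMapper newMapper() {
        return new ObjectMapper()
            .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
    }
}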

A note of caution in this scenario: we need to be completely sure we want to ignore all unknown properties. As an example, with APIs that return errors as HTTP 200 OK and map the error into the response body, we can silently miss the error if we are not aware of that, whereas in other circumstances the deserialisation would just crash and make us aware.

Ignoring enums

In a similar way to how we ignore fields, we can ignore unknown enum values or, more appropriately, map them to an UNKNOWN value.
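One way to get this behaviour with Jackson is the @JsonEnumDefaultValue annotation combined with the READ_UNKNOWN_ENUM_VALUES_USING_DEFAULT_VALUE feature; a minimal sketch (the State enum mirrors the one used later in this article):

import com.fasterxml.jackson.annotation.JsonEnumDefaultValue;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

enum State {
    READY,
    IN_PROGRESS,
    COMPLETED,

    // Any value this service does not know about is mapped here.
    @JsonEnumDefaultValue
    UNKNOWN
}

class EnumMapperFactory {
    static ObjectMapper newMapper() {
        return new ObjectMapper().enable(
            DeserializationFeature.READ_UNKNOWN_ENUM_VALUES_USING_DEFAULT_VALUE);
    }
}

Note that this approach loses the original raw value if the object is serialised again, which is exactly the problem the “Keeping enums” section below addresses.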

Keeping them

The most common situation is that we want to keep the fields even if they do not mean anything to the service currently processing the object, because they will be needed up or down the stream.

Jackson offers us two interesting annotations:

  • @JsonAnySetter
  • @JsonAnyGetter

These two annotations help us read and write fields even if the service does not know what they are:

class User {
    @JsonAnySetter
    private final Map<String, Object> unknownFields = new LinkedHashMap<>();
    
    private Long id;
    private String firstname;
    private String lastname;

    @JsonAnyGetter
    public Map<String, Object> getUnknownFields() {
        return unknownFields;
    }
}

Keeping enums

In a similar way to how we keep the fields, we can keep the enum values. The best way to achieve that is to store them as strings internally but keep the getters and setters working with the enum type.

@JsonAutoDetect(
    fieldVisibility = Visibility.ANY,
    getterVisibility = Visibility.NONE,
    setterVisibility = Visibility.NONE)
class Process {
    private Long id;
    private String state;

    public void setState(State state) {
        this.state = nameOrNull(state);
    }

    public State getState() {
        return nameOrDefault(State.class, state, State.UNKNOWN);
    }

    public String getStateRaw() {
        return state;
    }
}

enum State {
    READY,
    IN_PROGRESS,
    COMPLETED,
    UNKNOWN
}
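The nameOrNull and nameOrDefault helpers are not shown in the snippet; a plausible implementation, which is an assumption rather than the author's actual code, could be:

final class EnumUtils {

    private EnumUtils() {
    }

    // Null-safe: returns the enum constant name, or null when nothing is set.
    static String nameOrNull(Enum<?> value) {
        return value == null ? null : value.name();
    }

    // Maps a missing or unknown raw value to the given default constant.
    static <E extends Enum<E>> E nameOrDefault(Class<E> type, String raw, E defaultValue) {
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Enum.valueOf(type, raw);
        } catch (IllegalArgumentException e) {
            return defaultValue;
        }
    }
}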

It is worth pointing out that the @JsonAutoDetect annotation tells Jackson to ignore the getters and setters and perform the serialisation based on the fields defined.

Unknown types

One of the things Jackson can manage is polymorphism, but this implies we sometimes need to deal with unknown types. We have a few options for this:

Error when unknown type

We prepare Jackson to read and deal with the known types, but it will throw an error when an unknown type is given, this being the default behaviour:

@JsonTypeInfo(
    use = JsonTypeInfo.Id.NAME,
    include = JsonTypeInfo.As.PROPERTY)
@JsonSubTypes({
    @JsonSubTypes.Type(value = SelectionProcess.class, name = "SELECTION_PROCESS"),
})
interface Process {
}

Keeping the new type

In a way very similar to what we have done for fields, Jackson allows us to define a default or fallback type when the given type is not found which, put together with our previous unknown-fields implementation, can solve our problem.

@JsonTypeInfo(
    use = JsonTypeInfo.Id.NAME,
    include = JsonTypeInfo.As.PROPERTY,
    property = "@type",
    defaultImpl = AnyProcess.class)
@JsonSubTypes({
    @JsonSubTypes.Type(value = SelectionProcess.class, name = "SELECTION_PROCESS"),
    @JsonSubTypes.Type(value = SelectionProcess.class, name = "VALIDATION_PROCESS"),
})
interface Process {
    String getType();
}

class AnyProcess implements Process {
    @JsonAnySetter
    private final Map<String, Object> unknownFields = new LinkedHashMap<>();

    @JsonProperty("@type")
    private String type;

    @Override
    public String getType() {
        return type;
    }

    @JsonAnyGetter
    public Map<String, Object> getUnknownFields() {
        return unknownFields;
    }
}

And, with all of this, we have decent compatibility implemented, all provided by the Jackson serialisation.

We can go one step further and implement a base class with the common code, e.g. unknownFields, and make our API objects extend it for simplicity, to avoid boilerplate code and follow some good practices. Something similar to:

class Compatibility {
    ....
}

class MyApiObject extends Compatibility {
    ...
}
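A minimal sketch of what such a base class could contain, based on the unknown-fields pattern shown earlier (the names are illustrative, not a prescribed API):

import com.fasterxml.jackson.annotation.JsonAnyGetter;
import com.fasterxml.jackson.annotation.JsonAnySetter;

import java.util.LinkedHashMap;
import java.util.Map;

abstract class Compatibility {

    // Any property the service does not model explicitly ends up here and is
    // written back out untouched when the object is serialised again.
    private final Map<String, Object> unknownFields = new LinkedHashMap<>();

    @JsonAnySetter
    public void setUnknownField(String name, Object value) {
        unknownFields.put(name, value);
    }

    @JsonAnyGetter
    public Map<String, Object> getUnknownFields() {
        return unknownFields;
    }
}

class MyApiObject extends Compatibility {
    private Long id;
    private String firstname;
}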

With this, we have a new tool under our belt that we can consider and use whenever necessary.


Container attack vectors

We live in a containerised world. Container solutions like Docker are now so widespread that they are not a niche thing or a buzzword any more; they are mainstream. Many companies use them and the ones that do not are probably dreaming of doing so.

The only problem is that they are still something new. Their adoption has been fast, arriving like a storm to every kind of industry that uses technology. The problem is that, from a security point of view, we as an industry do not have all the awareness we should have. Containers and, especially, containers running on cloud environments partially hide the fact that they exist and that they need to be part of our security considerations. Some companies use them thinking they are completely secure, trusting that the cloud providers or the companies that build the images take care of everything and, for less technology-focused businesses, they are an abstraction rather than something real and tangible. They are not the old bare-metal servers, desktop machines or virtual machines those businesses were used to and which, up to a certain point, they worried about because they were things that could be touched.

All of that means that, while security concerns for web applications are first-class citizens (not as much as they should be, but the situation has improved a lot in the last few years), security concerns about containers seem to be the black sheep of the family: no one talks about them. And this is not right. They should receive the same level of concern and attention, and be part of the development life cycle.

In the same way that web applications can be attacked in multiple ways, containers have their own attack vectors, some of which we are going to see here. We will see that some of them can easily be compared with known attack vectors in spaces we are more aware of, like web applications.

Vulnerable application code

Containers package applications and third-party dependencies that can contain known flaws or vulnerabilities. There are thousands of published vulnerabilities that attackers can take advantage of to exploit our systems if they are found in the applications running inside the containers.

The best way to avoid running containers with known vulnerabilities is to scan the images we are going to deploy, and not just as a one-time thing: this should be part of our delivery pipelines and the scans should run all the time. In addition to known vulnerabilities, scanners should try to find out-of-date packages that need an update. Some available scanners even try to find possible malware in the images.

Badly configured container images

When configuring how a container image is going to be built, vulnerabilities that can later be exploited by attackers can be introduced by mistake or because not enough attention is paid to the build process. A very common example is configuring the container to run as root unnecessarily, giving it more privileges on the host than it really needs.

Build machine attacks

Like any piece of software, the software we use to run CI/CD pipelines and build container images can be successfully attacked. Attackers can add malicious code to our containers during the build phase, obtaining access to our production environment once the containers have been deployed and even using these compromised containers to pivot to other parts of our systems or networks.

Supply chain attacks

Once container images have been built, they are stored in registries and retrieved or “pulled” when they are going to be run. Unfortunately, no one can guarantee the security of these registries, and an attacker can compromise a registry and replace the original image with a modified one including a few surprises.

Badly configured containers

When creating configuration files for our containers, e.g. a YAML file, we can make mistakes and add configurations the containers do not need. Some possible examples are unnecessary access privileges or unnecessary open ports.

Vulnerable host

Containers run on host machines and, in the same way we try to ensure containers are secure, hosts should be too. Sometimes they run old versions of orchestration components with known vulnerabilities, or other components used for monitoring. A good idea is to minimise the number of components installed on the host, configure them correctly and apply security best practices.

Exposed secrets

Credentials, tokens and passwords are all necessary if we want one part of our system to be able to communicate with other parts. One risk is the way we supply these secret values to the container and the applications running in it. There are different approaches, with varying levels of security, that can be used to prevent any leakage.

Insecure networking

The same as for non-containerised applications, containers need to communicate over networks. Some level of attention is necessary to set up secure connections between components.

Container escape vulnerabilities

Containers are designed to run in isolation from the hosts where they are running. In general, container runtimes like “containerd” or “CRI-O” have been heavily tested and are quite reliable but, as always, there are vulnerabilities still to be discovered. Some of these vulnerabilities can let malicious code running inside a container escape out into the host. Due to the severity of this, some stronger isolation mechanisms can be worth considering.

Some other risks related to containers, but not directly about the containers themselves, can be:

  • Attacks on the code repositories of applications deployed in the containers, poisoning them with malicious code.
  • Hosts accessible from the Internet should be protected, as expected, with other tools like firewalls, identity and access management systems, secure network configurations and others.
  • When containers run under an orchestrator, e.g. Kubernetes, a door to new attack vectors is opened. Configurations, permissions or access not controlled properly can give attackers access to our systems.

As we can see, some of the attack vectors are similar to the ones existing in more mature areas like networking or web applications but, due to the abstraction and the easy-to-use approach, container security is unfortunately often left out of the considerations.

Reference: “Container Security by Liz Rice (O’Reilly). Copyright 2020 Vertical Shift Ltd., 978-1-492-05670-6”


Defining Software Architecture

The definition of Software Architecture is something that, for a long time, the industry as a whole has not been able to agree on. In some cases, it is defined as the blueprint of a system and, in others, as the roadmap for developing a system, with all the options in between.

The truth is that it is both things and, probably, much more than that. To try to figure out what it is (I think we are still far from a formal definition), we can focus on what is analysed when we take a look at concrete architectures:

  • Structure
  • Architecture characteristics
  • Architecture decisions
  • Design principles

Structure

When we talk about the structure, we are referring to the type or types of architecture styles selected to implement a system, such as microservices, layered, or microkernel. These styles do not describe an architecture but its structure.

Architecture characteristics

The architecture characteristics define the quality attributes of a system, the “-ilities” the system must support. These characteristics are not related to the business functionality of the system but to its proper functioning. They are sometimes known as non-functional requirements. Some of them are:

  • Availability, Reliability, Testability
  • Scalability, Security, Agility
  • Fault Tolerance, Elasticity, Recoverability
  • Performance, Deployability, Learnability

A long list of them, maybe too long, can be found in one of the Wikipedia articles: List of system quality attributes.

Architecture decisions

Architecture decisions define the rules of how a system should be built. They form the constraints of a system and inform the development teams of what is and is not allowed when building the system.

An example is the decision of who should have access to the databases in the system, deciding that only the business and service layers can access them and excluding the presentation layer.

When some of these decisions need to be broken due to constraints at one part of the system, this can be done using a variance.

Design principles

Design principles are guidelines rather than hard rules to follow; things like preferring synchronous or asynchronous communication within a microservices architecture. They describe a preferred way of doing things, but this does not mean developers cannot take different approaches in concrete situations.

Reference: “Fundamentals of Software Architecture by Mark Richards and Neal Ford (O’Reilly). Copyright 2020 Mark Richards, Neal Ford, 978-1-492-04345-4″


Container Security: Anchore Engine

Nowadays, containers are taking over the world. We still have big systems and legacy systems and, obviously, not every company out there can move fast enough to migrate to containerized solutions but, wherever you look, people are talking about containers.

And, if you look in the opposite direction, people are talking about security. Breaches, vulnerabilities, systems not properly patched; all kinds of problems that put enterprise security and user data at risk.

With all of this, and it is not new, projects involving both topics have been growing and growing. The ecosystem is huge, and the number of options is starting to be overwhelming.

We have projects like:

  • Docker Bench for Security: The Docker Bench for Security is a script that checks for dozens of common best-practices around deploying Docker containers in production. The tests are all automated and are inspired by the CIS Docker Benchmark v1.2.0.
  • Clair: Clair is an open-source project for the static analysis of vulnerabilities in application containers (currently including apps and docker).
  • Cilium: Cilium is open source software for providing and transparently securing network connectivity and load-balancing between application workloads such as application containers or processes. Cilium operates at Layer 3/4 to provide traditional networking and security services as well as Layer 7 to protect and secure use of modern application protocols such as HTTP, gRPC and Kafka. Cilium is integrated into common orchestration frameworks such as Kubernetes and Mesos.
  • Anchore Engine: The Anchore Engine is an open-source project that provides a centralized service for inspection, analysis and certification of container images. The Anchore Engine is provided as a Docker container image that can be run standalone or within an orchestration platform such as Kubernetes, Docker Swarm, Rancher, Amazon ECS, and other container orchestration platforms.
  • OpenSCAP: The OpenSCAP ecosystem provides multiple tools to assist administrators and auditors with assessment, measurement, and enforcement of security baselines. We maintain great flexibility and interoperability, reducing the costs of performing security audits.
  • Dagda: Dagda is a tool to perform static analysis of known vulnerabilities, trojans, viruses, malware & other malicious threats in docker images/containers and to monitor the docker daemon and running docker containers for detecting anomalous activities.
  • Notary: The Notary project comprises a server and a client for running and interacting with trusted collections. See the service architecture documentation for more information.
  • Grafaes: An open artefact metadata API to audit and govern your software supply chain.
  • Sysdig Falco: Falco is a behavioural activity monitor designed to detect anomalous activity in your applications. Powered by sysdig’s system call capture infrastructure, Falco lets you continuously monitor and detect container, application, host, and network activity – all in one place – from one source of data, with one set of rules.
  • Banyan Collector: Banyan Collector is a light-weight, easy to use, and modular system that allows you to launch containers from a registry, run arbitrary scripts inside them, and gather useful information.

As we can see, there are multiple tools within the container security scope. These are just some examples.

In this article, we are going to explore Anchore Engine a bit more. We are going to create a basic Jenkins pipeline to scan one container image. For this, we are going to need:

  • A repository in GitHub with a simple dockerized project. In my case, I will be using this one. It’s a simple Spring Boot app with a hello endpoint and a very simple ‘Dockerfile’.
  • We are going to need a Docker Hub repository to store our image. I will be using this one.
  • Docker and docker-compose.

And, that’s all. Let’s go.

We can see in the next image the pipeline we are going to implement:

Install Anchore Engine

We just need to execute a few commands to have Anchore Engine up and running.

mkdir -p ~/aevolume/config 
mkdir -p ~/aevolume/db/
cd ~/aevolume/config && curl -O https://raw.githubusercontent.com/anchore/anchore-engine/master/scripts/docker-compose/config.yaml && cd - 
cd ~/aevolume
curl -O https://raw.githubusercontent.com/anchore/anchore-engine/master/scripts/docker-compose/docker-compose.yaml

After that, we should see a folder ‘aevolume’ with a content similar to:

Running Anchore Engine

As we can see, the previous step has provided us with a docker-compose file to run Anchore Engine in an easy way. We just need to execute the command:

docker-compose up -d

When docker-compose finishes, we should be able to see the two Anchore Engine containers running: one for the application itself and one for the database.

Install the Anchore CLI

It is not necessary but it is going to be very useful to debug integration problems if we have any (I had a few the first time). For this, we just need to execute a simple command that will make the ‘anchore-cli’ executable available on our system.

pip install anchorecli

Install the Jenkins plugin

Now, we start working on the integration with Jenkins. The first step is to install the Anchore integration in Jenkins. We just need to go to the Jenkins plugin management area and install the one called ‘Anchore Container Image Scanner Plugin’.

Configure Anchore in Jenkins

There is one more step we need to take to configure the Anchore plugin in Jenkins: we need to provide the engine URL and the access credentials. These credentials can be found in the file ‘~/aevolume/config/config.yaml’.

Configure Docker Hub repository

The last configuration we need to do is to add the access credentials for our Docker Hub repository. I recommend generating an access token rather than using our real credentials. Once we have the access credentials, we just need to add them to Jenkins.

Create a Jenkins pipeline

To be able to run our builds and to analyze our containers, we need to create a Jenkins pipeline. We are going to use the script feature for this. The script will look like this:

pipeline {
    environment {
        registry = "fjavierm/anchore_demo"
        registryCredential = 'DOCKER_HUB'
        dockerImage = ''
    }
    agent any
    stages {
        stage('Cloning Git') {
            steps {
                git 'https://github.com/fjavierm/demo.git'
            }
        }
        stage('Building image') {
            steps {
                script {
                    dockerImage = docker.build registry + ":$BUILD_NUMBER"
                }
            }
        }
        stage('Container Security Scan') {
            steps {
                sh 'echo "docker.io/fjavierm/anchore_demo:latest `pwd`/Dockerfile" > anchore_images'
                anchore name: 'anchore_images'
            }
        }
        stage('Deploy Image') {
            steps {
                script {
                    docker.withRegistry('', registryCredential) {
                        dockerImage.push()
                    }
                }
            }
        }
        stage('Cleanup') {
            steps {
                sh '''
                    for i in `cat anchore_images | awk '{print $1}'`; do docker rmi $i; done
                '''
            }
        }
    }
}

This will create a pipeline like:

Execute the build

Now, we just need to execute the build and see the results:

Conclusion

With this, we finish the demo. We have installed Anchore Engine, integrated it with Jenkins, run a build and checked the analysis results.

I hope it is useful.


Maven archetypes

We live in a microservices world. Lately, it does not matter where you go, big, medium or small companies or start-ups, everyone is trying to implement microservices or migrate to them.

Maybe not initially, but when companies achieve a certain level of maturity, they start having a set of common practices, libraries or dependencies they apply or use in all the micro-services they build. Let’s say, for example, authentication or authorization libraries, metrics libraries, … or any other component they use.

When this level of maturity is achieved, starting a project usually means taking the “How-To” article in our wiki and copying and pasting common code and configurations, and creating a concrete structure in the new project. After that, everything is set to start implementing business logic.

This copy-and-paste process does not usually take a long time but it is a bit tedious and prone to human error. To make our lives easier and to try to avoid unnecessary mistakes, we can use Maven archetypes.

Taken from the maven website, an archetype is:

In short, Archetype is a Maven project templating toolkit. An archetype is defined as an original pattern or model from which all other things of the same kind are made. The name fits as we are trying to provide a system that provides a consistent means of generating Maven projects. Archetype will help authors create Maven project templates for users, and provides users with the means to generate parameterized versions of those project templates.

In the next two sections, we are going to learn how to build some basic archetypes and how to build a more complex one.

Creating a basic archetype

Following the maven documentation page we can see there are a few ways to create our archetype:

From scratch

I am not going to go into details here because the maven documentation is good enough and because it is the method we are going to use in the “Creating a complex archetype” section below. You just need to follow the four steps the documentation is showing:

  1. Create a new project and pom.xml for the archetype artefact.
  2. Create the archetype descriptor.
  3. Create the prototype files and the prototype pom.xml.
  4. Install the archetype and run the archetype plugin

Generating our archetype

This is a very simple one, also described in the Maven documentation. Basically, you use Maven to generate the archetype structure for you:

mvn archetype:generate \
    -DgroupId=[your project's group id] \
    -DartifactId=[your project's artifact id] \
    -DarchetypeGroupId=org.apache.maven.archetypes \
    -DarchetypeArtifactId=maven-archetype-archetype

As simple as that. After executing the command, we can add our personalisations to the project and proceed to install it as seen before.

From an existing project

This option allows us to create a project and, when we are happy with how it looks, transform it into an archetype. Basically, we need to follow these steps:

  1. Build the project layout from scratch and add files as needed.
  2. Run the Maven archetype plugin on the existing project and configure it from there:

mvn archetype:create-from-project

This will generate an “archetype” folder inside the “target” folder:

target/generated-sources/archetype

We just need to copy this folder structure to the desired location and we will have our archetype ready to go. It needs to be installed as usual to be able to use it.

Using our archetype

Once we have installed our archetype, we can start using it:

mvn archetype:generate \
    -DarchetypeGroupId=dev.binarycoders \
    -DarchetypeArtifactId=simple-archetype \
    -DarchetypeVersion=1.0-SNAPSHOT \
    -DgroupId=org.example \
    -DartifactId=project1

This will create a new project using the archetype. The information we need to modify in the previous command is:

  • archetypeGroupId: It is the archetype group id we have defined when we created the archetype.
  • archetypeArtifactId: It is basically the name of our archetype.
  • archetypeVersion: It is the version of the archetype we want to use in case the archetype has been evolving over time and we have different versions.
  • groupId: It is the group id our new project is going to have.
  • artifactId: It is the name of our new project.

Deleting our archetype

Right now, after installing our archetype, it is only available in our local repository. This fact allows us to delete the archetype in a very simple way. We just need to take a look at the archetype catalogue in our repository and manually remove the archetype. We can find this file at:

~/.m2/repository/archetype-catalog.xml

Creating a complex archetype

For most cases, the ways to create archetypes already reviewed should be enough, but not for all of them. What happens if we need to define modules whose names are decided when the project is created? Or classes? Or some other customisations?

Luckily, Maven gives us some flexibility, allowing us to define variables and use concrete patterns for folders and files in our archetypes so that they are replaced when the projects using the archetype are created.

As a general rule, we will be using two kinds of notation for our dynamic elements:

  • Defined in files: ${varName}
  • Defined in file system: __varName__ (two underscores)

This will help us to achieve our goals.

As an example, I am going to create a small complex archetype to be able to see this in action. The projects created with the archetype are going to have:

  • A parent project with <artifactId> as its name.
  • Two modules called <artifactId>-one and <artifactId>-two.
  • Main classes called <classPrefix>OneApp and <classPrefix>TwoApp respectively.
  • The classes will be located in the packages <package>.one and <package>.two respectively.
  • The module One will have a properties class stored in the resources folder.

The code of the archetype can be found at the GitHub repository.

The first file we can check is archetype-metadata.xml, located in META-INF/maven.

Here we can see the definition of the variables classPrefix and groupId, the latter with a default value assigned.

<requiredProperties>
    <requiredProperty key="classPrefix" />
    <requiredProperty key="groupId">
        <defaultValue>dev.binarycoders</defaultValue>
    </requiredProperty>
</requiredProperties>

After that, we can see the definition of the project structure we want to achieve. In this case, we have the fileSets node with the files of the parent project and, after that, the definition of the modules we want to include. Here we should pay special attention to the way the module attributes are defined:

<module id="${rootArtifactId}-one"
         dir="__rootArtifactId__-one"
         name="${rootArtifactId}-one">

As we can see they use the notation described before, using the “${}” notation for variables in files and the notation “__” (two underscores) for file system elements. The rest of the file is pretty simple.

If we explore the folder structure, we can see a few elements defined with these two underscores notation like the module names and the class names. This will be dynamic elements that will take the name from the variable defined when the project is created.

We can define different filesets for the files we want to be copied to our generated project. For example, we can copy all the .java files we can find inside the path src/main/java:

<fileSet filtered="true" packaged="true" encoding="UTF-8">
    <directory>src/main/java</directory>
    <includes>
        <include>**/*.java</include>
    </includes>
</fileSet>

Finally, if we explore one of the classes, we can see the next content:

#set( $symbol_pound = '#' )
#set( $symbol_dollar = '$' )
#set( $symbol_escape = '\' )
package ${package}.one;

public class ${classPrefix}OneApp {
}

The first three lines are just aliases that allow us to use symbols that have a specific meaning for the template engine as plain literals.

After that, we can see the package declaration, which is going to be built with one part dynamically added and one part statically defined. The class name follows the same pattern.

It deserves special attention that, despite defining packages in the classes, we are not replicating that structure in the project layout; Maven will take care of it for us. This is because, when we defined the fileSet, we set the attribute packaged equal to true. If this attribute is set to false, we are in charge of defining the desired structure.

It is worth mentioning that, because the files in the Maven archetype act as Velocity templates, we can introduce some logic and dynamic content into our files. For example, printing something or not in a given file:

<requiredProperty key="greeting">
    <defaultValue>y</defaultValue>
</requiredProperty>
#if ( ${greeting} == 'y' )
    // Hello, welcome here!
#end

This variable can be set using the command line when we generate our new project:

-Dgreeting=n

Finally, there is one more interesting thing we can do: we can use a post-generation script written in Groovy to execute some actions after the project has been generated. One interesting use is to remove unwanted files based on variables defined when generating the project. This script is located in the folder src/main/resources/META-INF with the name archetype-post-generate.groovy.

import groovy.io.FileType

def rootDir = new File(request.getOutputDirectory() + "/"
    + request.getArtifactId())
def oneBundle = new File(rootDir, request.getArtifactId()
    + "-one")

def projectPackage = request.getProperties().get("package")

assert new File(oneBundle, "src/main/java/" 
    + projectPackage.split("\\.").join('/')
    + "/toDelete.txt").delete()

With this, every time that we use the archetype to create a new project we will obtain the desired results.

We can use our recently created archetype with:

mvn archetype:generate \
    -DarchetypeGroupId=dev.binarycoders \
    -DarchetypeArtifactId=simple-archetype \
    -DarchetypeVersion=1.0-SNAPSHOT \
    -DgroupId=org.example \
    -DartifactId=project

And the result:

And, one of the classes:

This is all. I hope it is useful.

See you.


Starting a project

There are multiple ways to learn how to code. Some people do it through some kind of formal education like high school, university or a master's degree; other people through bootcamps or the more modern initiatives we have seen lately. And, finally, there are people who learn by self-studying. No matter which one is your case, at the end of the day, the best way to learn and acquire coding skills is to code.

As developers, we code (we do other things too, not just code). Usually, if we do it professionally, enterprises have their own tools and procedures. That is not the scope of this article. This article is going to focus on the small projects we start outside these corporate environments, just for fun, for learning purposes or, simply, because why not? And I am talking about projects, not just code snippets or small demos trying something we have read in an article or blog, or testing that crazy idea we have had in mind for the last few days.

The purpose of the article is to offer some guidance on free tools we can use to work on a project following, more or less, a methodology and using tools similar to, if not the same as, the ones you can find in a corporate environment.

The focus of the article is people learning how to code, to give them a bigger picture, people starting a long-term open-source project, or just anyone curious. It is going to focus not on the coding part but on the areas around the project.

Every project, when it starts, needs a way to manage the code and a way to manage the effort. I am certain all of you agree about the first one but, I can hear from here people questioning the second one. Initially, and especially if we are the only developer, we can think it is not necessary but, in the long run, and even more if we expect contributions in the future, it is going to be a very useful thing to have. It will keep our focus, it will make us think in advance and do some planning, and it will give us a history of the project and why we took a certain decision at a certain point or why we added a concrete functionality. And, if you are learning how to code, it will give you the bigger picture I was talking about before.

To manage our code, we need some kind of version control system for tracking changes. There are a few of them out there, like Git, Subversion, Perforce, Team Foundation Version Control or Mercurial. If you stay long enough in the industry, you will see all of them but, in this case, my preference is Git. There are several cloud platforms that offer you an account to use it (GitHub, GitLab, Bitbucket). All of them are similar and, at this level, there is no big difference; I invite you to test all of them but, in this case, I am going to recommend GitHub. I like it, I am used to it, it is hugely popular in the open-source community, and it integrates easily and smoothly with the other tools we are going to see in this article.

To manage our effort, we need some kind of project management tool for tracking tasks and their progress. As in the previous case, and here even more, there are a lot of them out there. One that is very simple to use and very widespread is Trello. Trello offers customizable boards we can use to track effort and progress and to plan in advance. In addition, there are a lot of useful plugins to improve and highly customize the boards and the cards (tasks) there. A quick mention also for the 'Projects' tab in GitHub, which allows you to create some automated Kanban boards; it is interesting to play with, but I have never seen it in a corporate environment, where I have seen Trello multiple times. The first place there goes to JIRA.

Once we start coding, creating pull requests and merging code into our repository, it is nice to have a CI/CD environment in place. There are multiple advantages to this but, even if we are just learning, it will keep our code healthy, making sure that any change made still compiles and passes all our tests. Again, in this category we can find cloud platforms and on-premises solutions but, for the article, I have chosen Travis CI (the dot org). It is simple to register, it has great integration with GitHub, it is powerful enough and it is well documented.

One thing developers should worry about is the quality and maintainability of the code they write. And I am not talking just about whether our code passes all the tests; I am talking about bugs, vulnerabilities, test coverage, code duplication and formatting (we should be using our IDE auto-format or save actions for the last one). To cover all of this list we can find the tool SonarQube, and its cloud solution SonarCloud. These tools report all the problems found every time a build is done, allowing us to correct them as soon as possible and not let them pile up, only to be found when there is a code audit or similar. Again, it is an easy tool to manage and to integrate with GitHub and Travis CI.

Are these tools the best ones? The most useful ones? Yes, no, maybe. I am a strong believer that there are no perfect tools, only tools perfect for a job and, sometimes, this is what we as developers need to decide: which tool best fits the job. The tools in the article are just examples and they were perfect for the article.


The Sec in SecDevOps

More or less everyone who works in the software industry should be aware by now of the term DevOps and the life cycle of any application. Something similar to:

  • Plan: Gathering requirements.
  • Code: Write the code.
  • Test: Executing the tests.
  • Package: Package our application for the release.
  • Release: When the application artifacts are generated.
  • Deploy: We make it available to the users (QA, final users, product, …).
  • Operate: Take care of the applications needs.
  • Monitor: Keep an eye on the application.

All these bullet points are a rough description of the main stages of the application no matter what methodologies we are using.

The thing is that, nowadays, we should not just be aware of the term DevOps; we should be aware of, and be using or introducing, the term SecDevOps in our environments.

The ‘Sec’ stands for ‘Security’ and it basically means adding the security component and the help of security professionals to all the stages of the cycle, not just after deployment as usually happens.

In addition to the common tasks we have in the different stages, if we add the ‘Sec’ approach, we should be adding:

  • Plan: We should be analyzing the Threat Model. For example, user authentication, external servers, encryption, … Here, we will be generating security-focused stories.
  • Code: In this stage, the security professional should be training and helping the team with the security concepts: tools, terminology, attacks. This will help developers be aware of possible problems and write defensive measures integrated into their code, or follow best practices and recommendations.
  • Test: In this stage, specific tests to check security capabilities will be created, allowing us to detect possible problems early.
  • Package: Here, we will check for vulnerabilities in external libraries and in our containers, or run static code analyzers.
  • Release: This stage can be combined with the previous one from a security point of view. Some platforms where the containers are stored, for example, integrate vulnerability scanners.
  • Deploy: Here, we can run dynamic security tests with typical ethical hacking tools.
  • Operate: This is when the Red Team steps in, in addition to resilience checks.
  • Monitor: We can analyze all our logs (hopefully previously centralized) trying to discover security problems.

I hope this gives us a simplified view of how we can integrate security awareness and processes into our application cycles.


CD: Continuous Deployment

At this point, we are an organization that uses Continuous Integration and Continuous Delivery, but we can go one step further and start using Continuous Deployment.

Now we work in a continuous delivery environment, but we still need to push a button to deploy things to our production environments; not a big deal, but still not moving as fast as we want. Deploying to production can be a risky and costly exercise that sometimes requires putting all development on hold. And we are still dependent on a deployment cycle: the more commits we fit into a release, the riskier it is and the more chances we have of introducing an error.

Continuous deployment tries to solve all these drawbacks by automatically shipping every change pushed to the main repository to production. We can obtain benefits like:

  • No one is required to drop their work to make a deployment because everything is automated.
  • Releases become smaller and easier to understand.
  • The feedback loop with our customers is faster: new features and improvements go straight to production when they are ready.

STEPS WE NEED TO TAKE

We are already applying continuous integration, which means we have plenty of tests; they cover our functionality and guarantee that we are not introducing breaking changes but, are they good enough?

One of the things we need to keep in mind is that the quality of our test suite will determine the level of risk for our releases, and our team will need to make automated testing a priority during development. This was important before but it now takes on a new level of importance. It means implementing tests for every new feature, as well as adding tests for any regressions discovered after release.

Another thing: fixing a broken build on the main branch should also be first on the list. Why? Because if we do not do that, changes are going to accumulate and we are going to end up killing the benefits of our new continuous deployment environment.

We need to start using some kind of coverage tool. We can do a rough estimation of our coverage, but this does not work; just try it. If we do the rough estimation and afterwards use a coverage tool, we will be surprised. It is said that a good goal is to aim for 80% coverage but there is a big but: coverage must be meaningful. It is not enough to write tests that go through every line of code, the tests need to challenge the code. They need to cover business cases, edge conditions and any possible problem we can think of. Code reviews help with this, in addition to helping to transfer knowledge. Our test suite is now the keeper of our production environment, and we want the best and strongest keeper we can achieve.

Once our code has been deployed to production, we need to monitor the situation: we need to check that everything is working as expected and that we are serving our customers appropriately. And this monitoring needs to be real-time monitoring. There are tools out there that can help us with this step. We do not just need to monitor that our system is up and running; we need to monitor CPU, memory consumption, average request time… all the parameters we decide we need to keep our systems healthy, and to receive alerts if something is off.

No matter how confident we are in our pre-deployment process and our monitoring capabilities, after every deployment we should have some kind of tests running. These tests are smoke tests, and they now gain a lot of importance. We do not need to run anything big: just a few simple tests loading some static pages, and a few more that exercise all the production services, microservices, third parties, databases… Our smoke tests should be able to guarantee that everything is working properly.
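As an illustration of how small a smoke test can be, here is a minimal sketch in Java using the standard HttpClient; the health endpoint URL is, of course, an assumption:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SmokeTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://example.com/health")) // assumed health endpoint
            .build();

        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());

        // Anything other than a healthy response should fail the deployment.
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Smoke test failed: " + response.statusCode());
        }
        System.out.println("Smoke test passed");
    }
}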

At this point, I am sure some of you (I was doing it) are thinking: what happens to my QA team now? Fair question. Now the QA team should be working closely with the product manager and the development team to define the risks associated with new improvements. They can help define the cases that need to be tested, not just increasing coverage but increasing the quality of the coverage. They can work on automation and they can respond to and properly triage, together with the development team, possible bugs found in production.

If any of you readers is a project manager, delivery manager or similar, I am sure by now you are thinking: what about the release notes? There is no space for release notes unless we want to spam our clients. In a big continuous deployment environment, we can release code hundreds or thousands of times per day. The advice here is to focus on key announcements for features or enhancements. If you fix a bug that is particularly affecting one of your customers, just inform that customer. For everything else, use one of the infinite number of management tools out there to handle changes, like JIRA or Trello.

It can be a hard path to move from continuous delivery to continuous deployment. Whether the effort is worth it or not is a decision that companies and teams need to take. The only advice here is to start small and build up your continuous deployment knowledge and experience. If you have a new project, try to apply everything we have mentioned here; you can even start building your production infrastructure before you write any code, and push your changes after that. Once you have managed to use continuous deployment in one project, try to apply all the lessons learned to the rest of your projects, the existing ones or even the legacy ones if it is worth it.
