Running X Apps on Mac with Docker

Today, just a quick article about how to run X apps on Mac using Docker. Why? While it cannot be considered a sandbox, it is a handy middle ground between running something directly on our system and running it inside a virtual machine. Again, why? Let’s say we want to open a PDF; we have checked it with VirusTotal and, while it looks like an innocuous file, it triggers some YARA rule warnings and we do not want to take any risk. We can run a PDF viewer inside a container.

To do this we are going to need a few things:

  • Docker installed and able to run containers (for example, Docker Desktop).
  • XQuartz, the X server implementation for macOS.

That is all the prerequisites we need. Once we have them installed, we can proceed to create our container.

In this case, I am going to use an Ubuntu container and I am going to install the Evince document viewer, but the steps should work for any container flavour you like, any document viewer you prefer and, in general, any X app we want.

First, we need to create our Dockerfile:

# Use a Linux base image
FROM ubuntu:latest

# Install necessary packages
RUN apt-get update && apt-get install -y \
    evince \
    x11-apps

# Set the display environment variable
ENV DISPLAY=:0

# Create a non-root user
RUN groupadd -g 1000 myuser && useradd -u 1000 -g myuser -s /bin/bash -m myuser

# Copy the entrypoint script
COPY entrypoint.sh /

# Set the entrypoint script as executable
RUN chmod +x /entrypoint.sh

# Switch to the non-root user
USER myuser

# Set the entrypoint
ENTRYPOINT ["/entrypoint.sh"]

Second, we need the entry point file to execute our app and pass the argument, in this case, the file we want to open:

#!/bin/bash
evince "$1"
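An optional tweak (my own suggestion, not required for the steps in this article) is to use exec in the entry point, so the viewer replaces the shell process and receives container signals, such as the one sent by docker stop, directly:

#!/bin/bash
# exec replaces the shell with evince, so signals sent to the container reach the viewer
exec evince "$1"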

Now, we can build our container by executing the proper docker command in the folder that contains both files:

docker build -t pdf-viewer .

Now, we just need to run it:

  1. Ensure Docker is running on our system.
  2. Start the XQuartz application.
  3. Execute xhost +localhost; this command grants the Docker container permission to access the X server display.
  4. Run our container with the following command:
docker run --rm -it \
    -e DISPLAY=$DISPLAY \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    -v /Users/username/pdf-location:/data \
    --entrypoint /entrypoint.sh pdf-viewer /data/file.pdf

Let’s elaborate on the command a bit:

  • docker run: This command is used to run a Docker container.
  • --rm: This flag specifies that the container should be automatically removed once it exits. This helps keep the system clean by removing the container after it finishes execution.
  • -it: These options allocate a pseudo-TTY and enable interactive mode, allowing you to interact with the container.
  • -e DISPLAY=$DISPLAY: This option sets the DISPLAY environment variable in the container to the value of the DISPLAY variable on your local machine. It enables the container to connect to the X server display.
  • -v /tmp/.X11-unix:/tmp/.X11-unix: This option mounts the X server socket directory (/tmp/.X11-unix) from your local machine to the same path (/tmp/.X11-unix) within the container. It allows the container to access the X server for display forwarding.
  • -v /Users/username/pdf-location:/data: This option mounts the /Users/username/pdf-location directory on your local machine to the /data directory within the container. It provides access to the PDF file located at /Users/username/pdf-location inside the container.
  • --entrypoint /entrypoint.sh: This option sets the entry point for the container to /entrypoint.sh, which is a script that will be executed when the container starts.
  • pdf-viewer: This is the name of the Docker image you want to run the container from.
  • /data/file.pdf: This specifies the path to the PDF file inside the container that you want to open with the PDF viewer.

With this, we are already done. After executing this command, the document viewer will open with our PDF. Happy reading!

There is one extra, optional step we can take: creating an alias to facilitate the execution:

alias run-pdf-viewer='function _pdf_viewer() { xhost +localhost >/dev/null 2>&1;docker run -it --rm --env="DISPLAY=host.docker.internal:0" -v $1:/data --entrypoint /entrypoint.sh pdf-viewer /data/$2 };_pdf_viewer'

The above alias allows us to open a PDF file inside the container without remembering the more complex command; we only need to ensure Docker is running. As an example, we can open a file with:

run-pdf-viewer ~/Downloads example.pdf


Hadoop: Apache Pig

Continuing our Apache Hadoop discovery trip, the next thing we are going to check is Apache Pig. Apache Pig is built on top of Hadoop and MapReduce and allows us to use a scripting language called Pig Latin to create MapReduce jobs without the need to write MapReduce code.

The definition on the project page is:

Apache Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. […] 

At the present time, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, […]. Pig’s language layer currently consists of a textual language called Pig Latin, which has the following key properties: 

  • Ease of programming. […]
  • Optimisation opportunities. […]
  • Extensibility. […]
Apache Pig product description

The purpose of this article is to execute a basic script on Apache Pig using the data used in the previous article. Concretely, in this case, we are going to need the files: u.data and u.item.

As mentioned before, the first file has the following format:

user id | item id | rating | timestamp

While the second file has the following format (only the fields to be used are shown):

movie id | movie title | release date | video release date | ...

The first thing we need to do is upload the data files to the Hadoop file system. Note that the first file will already be there if you followed the previous article.

$ hadoop fs -mkdir /ml-100k
$ hadoop fs -copyFromLocal u.data /ml-100k/u.data
$ hadoop fs -copyFromLocal u.item /ml-100k/u.item
$ hadoop fs -ls /ml-100k

The next thing we need to do is to write our first Pig Latin script. While this article does not aim to teach you how to write scripts, but rather how to use and run them, let’s see a brief introduction to the scripting language commands. If you are familiar with any programming language or SQL, you will not find any difficulty understanding them. For more information, we can use the official documentation.

- LOAD, STORE, DUMP
- FILTER, DISTINCT, FOREACH/GENERATE, MAPREDUCE, STREAM, SAMPLE
- JOIN, COGROUP, GROUP, CROSS, CUBE
- ORDER, RANK, LIMIT
- UNION, SPLIT

We will understand them a bit more with an example.

Now, let’s build something. Based on the data we have, we can try to answer the question “What is the most-voted one-star movie?”. To do so, we can build the following script:

-- We load the ratings data from a file
-- It is worth mentioning that a relative path will work too
-- For example, LOAD '/ml-100k/u.data'
ratings = LOAD 'hdfs://localhost:9000/ml-100k/u.data'
    AS (userID:int, movieID:int, rating:int, ratingTime:int);

-- We load the movie metadata from a file
metadata = LOAD 'hdfs://localhost:9000/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);

-- For each loaded row we keep the id and title association
nameLookup = FOREACH metadata GENERATE movieID, movieTitle;

-- We group the loaded ratings by movie
groupedRatings = GROUP ratings BY movieID;

-- We calculate the average rating and the number of ratings per movie
averageRatings = FOREACH groupedRatings GENERATE group AS movieID,
    AVG(ratings.rating) AS avgRating, COUNT(ratings.rating) AS numRatings;

-- We are just interested in the worst movies
badMovies = FILTER averageRatings BY avgRating < 2.0;

-- Once we have the worst movies, we resolve their names using the lookup built earlier
namedBadMovies = JOIN badMovies BY movieID, nameLookup BY movieID;

-- We generate the response rows
finalResults = FOREACH namedBadMovies GENERATE nameLookup::movieTitle AS movieName,
    badMovies::avgRating AS avgRating, badMovies::numRatings AS numRatings;

-- We sort the response rows
finalResultsSorted = ORDER finalResults BY numRatings DESC;

-- We print the response
DUMP finalResultsSorted;
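As a side note (my own addition, not part of the original script), if we preferred to persist the results to HDFS instead of printing them to the console, the final DUMP could be replaced with a STORE, one of the operators listed above; the output path here is just an example:

-- Write the sorted results to HDFS using '|' as the field separator
STORE finalResultsSorted INTO 'hdfs://localhost:9000/pig-output/worst-movies' USING PigStorage('|');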

The only thing remaining is to check our results and see whether our script does its job properly. For this, we just need to execute it and see what happens.

Hadoop history server

Before we can execute our script, we need to start the Hadoop history server that offers REST APIs allowing users to get the status of finished applications.

Why does it not run when the ‘start-all.sh‘ command is run? I do not know why it is not the default option; I assume it is because it is not necessary to work with HDFS or MapReduce. We have a few options here:

  1. To run the deprecated command: mr-jobhistory-daemon.sh start|stop historyserver
  2. To run the new command: mapred --daemon start|stop historyserver
  3. Add it to the start-all.sh and stop-all.sh scripts:
# add to the start-all.sh script
# start jobhistory daemons if jobhistory is present
if [ -f "${hadoop}"/sbin/mr-jobhistory-daemon.sh ]; then
  "${hadoop}"/sbin/mr-jobhistory-daemon.sh start historyserver
fi

# add to the stop-all.sh script
# stop jobhistory daemons if jobhistory is present
if [ -f "${hadoop}"/sbin/mr-jobhistory-daemon.sh ]; then
  "${hadoop}"/sbin/mr-jobhistory-daemon.sh stop historyserver
fi

If we forget to run the history server, there will be a pretty clear error in the logs when executing the script: we will see multiple connection attempts to localhost:10020.
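A quick way to confirm the history server is up (my own habit, not required) is to check that a JobHistoryServer process shows up in the JVM process listing:

jps | grep JobHistoryServer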

Apache Pig has different execution modes. We are going to see the Local Mode execution and the MapReduce Mode execution.

Local Mode

This execution allows us to execute the script using our local files. As we have written the script, we can just go and run it, but it would not technically be a local execution because we are using the files stored in our HDFS. To run with all local files, we can use a relative path to the files on our local hard drive. I find it easier to just upload all my test data to HDFS and not worry about where my data lives when executing locally or remotely. But, it is good to know we have the option to do a quick test with small files or files covering specific test cases. For my execution, I am going to use the script as it is, but feel free to find the way of working you are most comfortable with. To run the script we execute:

pig -x local most-rated-one-star-movie.pig

We can see the execution has been completed successfully and the results are dumped into the console.

MapReduce Mode

This execution will run with the files stored on HDFS. We already have the data there but we need to upload the script.

$ hadoop fs -mkdir /pig-scripts
$ hadoop fs -copyFromLocal most-rated-one-star-movie.pig /pig-scripts/most-rated-one-star-movie.pig

After that, we just need to execute our script:

pig -x mapreduce hdfs://localhost:9000/pig-scripts/most-rated-one-star-movie.pig

The results should be the same.

With this, we can finish the introduction to Apache Pig.


Hadoop: First MapReduce example

In the previous article, we installed Hadoop, now it is time to play with it. In this article, we are going to take some example data, store it in Hadoop FS, and see how we can run a simple MapReduce task.

The data we are going to use is called MovieLens. It is a list of one hundred thousand movie ratings from one thousand users on one thousand seven hundred movies. The dataset is quite old, but the advantage is that it has been tested multiple times and it is very stable which will prevent any differences when following this article.

The dataset can be found here. We just need to download the zip file and extract its content; we are interested in the file called u.data.

Using the file system

Let’s check what our file system looks like. We can do this using the UI or from the console. In this article, I am going to focus on working with the console, but I will try to point to the UI when something is interesting or can easily be checked there in addition to the console command.

As I have said, let’s check the file system. To use the UI, we can open the NameNode Web UI, probably at http://localhost:9870, and go to “Menu -> Utilities -> Browse the file system“. We should see an empty listing because our file system is empty.

To achieve the same on the console, we can execute the following command:

hadoop fs -ls /

It is important to notice that we need to specify the path. If it is not specified it will throw an error. For more information about the error, we can read this link.

After executing the command, we will see that nothing is listed, but we know it is working. Let’s now upload our data. First, we will create a folder, and after that, we will upload the data file:

$ hadoop fs -mkdir /ml-100k
$ hadoop fs -copyFromLocal u.data /ml-100k/u.data
$ hadoop fs -ls /
$ hadoop fs -ls /ml-100k

Easy and simple. As we can see it works as we would expect from any file system. We can remove the file and folder we have just added:

$ hadoop fs -rm /ml-100k/u.data
$ hadoop fs -rmdir /ml-100k

Using MapReduce

The file we are using has the format:

user id | item id | rating | timestamp. 

To process it, we are going to be using Python and a library called MRJob. More information can be found here.

We need to install it to be able to use it:

pip3 install mrjob

And, we need to build a very basic MapReduce job, for example, one that counts how many times each rating value appears in the data.

The code should look like this:

from mrjob.job import MRJob

class RatingsBreakdown(MRJob):

  def mapper(self, _, line):
    # Each input line is: user id, movie id, rating, timestamp (tab-separated)
    (userID, movieID, rating, timestamp) = line.split('\t')
    yield rating, 1

  def reducer(self, key, values):
    yield key, sum(values)

if __name__ == '__main__':
  RatingsBreakdown.run()

The code is pretty simple and easy to understand. A job is defined by a class that inherits from MRJob. This class contains methods that define the steps of our job. A “step” consists of a mapper, a combiner, and a reducer; all of those are optional, though we must have at least one. The mapper() method takes a key and a value as arguments (in this case, the key is ignored and a single line of text input is the value) and yields as many key-value pairs as it likes. The reducer() method takes a key and an iterator of values and also yields as many key-value pairs as it likes (in this case, it sums the values for each key). The final required component of a job file is to invoke the run() method.
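Because the reducer here is a plain sum, the same aggregation can also be registered as a combiner, an optional optimisation (my own addition, not part of the original example) that merges partial counts on each mapper node before the shuffle:

from mrjob.job import MRJob

class RatingsBreakdownWithCombiner(MRJob):

  def mapper(self, _, line):
    (userID, movieID, rating, timestamp) = line.split('\t')
    yield rating, 1

  # The combiner runs after the mapper on the same node and pre-aggregates
  # the counts, reducing the amount of data shuffled to the reducers
  def combiner(self, key, values):
    yield key, sum(values)

  def reducer(self, key, values):
    yield key, sum(values)

if __name__ == '__main__':
  RatingsBreakdownWithCombiner.run()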

We can run it in different ways, which are very useful depending on the type of goal we have. For example, if we have just a small file with a small sample of the data we are using to test the code, we can run the Python script with a local file. If we want to run a similar sample of data but see how it behaves in Hadoop, we can upload a local file on-demand while executing the script. Finally, if we want to execute the whole shebang, we can run the script against the big bulky file we have on the file system. For the different options, we have the following commands. Considering the whole file, in this case, is small, we can run the three types of executions using the same file.

$ python3 ratings-breakdown.py u.data
$ python3 ratings-breakdown.py -r hadoop u.data
$ python3 ratings-breakdown.py -r hadoop hdfs:///ml-100k/u.data

The result, no matter what execution we have decided to use, would be:

"1"	6110
"2"	11370
"3"	27145
"4"	34174
"5"	21201

As a curiosity, if we now take a look at the ResourceManager Web UI, probably at http://localhost:8088, we will see the jobs executed for the last two commands.

This is all for today. We have stored something in our Hadoop file system and we have executed our first MapReduce job.


Hadoop: Installation on macOS

In previous articles, we have seen a quick introduction to Big Data and Hadoop and its ecosystem. Now it is time to install Hadoop in our local machines to be able to start playing with Hadoop and with some of the tools in its ecosystem.

In my case, I am going to install it on a macOS system. If any of you are running a GNU/Linux system, you can ignore the initial step (the homebrew installation) and take a look at the configurations, because they should be similar.

Installing Hadoop

As I have said before, we are going to use homebrew to install Hadoop; concretely, version 3.3.3 is going to be installed. Because we are using the package manager, the installation is going to be pretty simple but, if you prefer to do it manually, you just need to go to the Hadoop download page, and download and extract the desired version. The configuration steps after that should be the same.

To install Hadoop using homebrew we just need to execute:

brew install hadoop

A very important consideration is that Hadoop does not work with the latest versions of Java, neither 17 nor 18. It is necessary to have Java 11 installed. We can still use whatever version we want on our system, but to execute the Hadoop commands we should set the correct Java version in our terminal. As a side note, as far as I know, it should work with Java 8 too, but I have not tested it.
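If several JDK versions are installed, one possible way (an assumption on my side, any mechanism to select the JDK works) to point the current terminal session at Java 11 on macOS is the java_home utility:

# Select a Java 11 JDK for the current shell session and verify it
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"
java -version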

Allowing SSH connections to localhost

One previous consideration we need to have is that, for Hadoop to start all nodes properly, it needs to be able to connect using SSH to our machine, in this case, localhost. By default, in macOS this setting is disabled; to enable it, we just need to go to “System Preferences -> Sharing -> Remote Login“.

Setting environment variables

Reading about this, I have only found confusion: some tutorials do not even mention the addition of the environment variables, others only declare the HADOOP_HOME variable, and others declare a bunch of stuff. After a few tests, I have collected a list of environment variables we can (and should) define. They allow us to start Hadoop without problems. Is it possible my list has more than required? Yes, absolutely. But, for now, I think the list is comprehensive and short enough; my plan, as I have said before, is to play with Hadoop and the tools around it, not to become a Hadoop administrator (yet). Some of them I have just as a reminder that they exist, for example, the two related to the native libraries: HADOOP_COMMON_LIB_NATIVE_DIR and HADOOP_OPTS. They can be skipped if we want; there will be a warning when running Hadoop but nothing important (see the comment below about native libraries). We can add the variables to our “~/.zshrc” file.

The list contains the following variables:

# HADOOP env variables
export HADOOP_HOME="/usr/local/Cellar/hadoop/3.3.3/libexec"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
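After editing the file, we need to reload it in the current terminal, or open a new one, so the new variables are picked up:

source ~/.zshrc
echo $HADOOP_HOME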

Configuring Hadoop

File $HADOOP_HOME/etc/hadoop/hadoop-env.sh

In this file, we need to find the JAVA_HOME path and fill it with the appropriate value but, if we carefully read the file, one of the comments says “variable is REQUIRED on ALL platforms except OS X!“. Despite the comment, we are going to set it. If we do not set it, we will have problems running the NodeManager: it will not start, and it will show in the logs an error related to the Java version in use.
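For reference, a minimal sketch of the line, assuming the java_home utility mentioned above is available (adjust to your own JDK path otherwise):

# $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"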

File $HADOOP_HOME/etc/hadoop/core-site.xml

Here, we need to configure the HDFS port and a folder for temporary directories. This folder can be wherever you want, in my case, it will reside in a sub-folder of my $HOME directory.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/user/.hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories</description>
  </property>
</configuration>

File $HADOOP_HOME/etc/hadoop/mapred-site.xml

We need to add a couple of extra properties:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

File $HADOOP_HOME/etc/hadoop/yarn-site.xml

In a similar way, we need to add a couple of extra properties:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

File $HADOOP_HOME/etc/hadoop/hdfs-site.xml

And, again, one more property:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Formatting the HDFS Filesystem

Before we can start Hadoop to be able to use HDFS, we need to format the file system. We can do this by running the following command:

hdfs namenode -format

Starting Hadoop

Now, finally, we can start Hadoop. If we have added $HADOOP_HOME/bin and $HADOOP_HOME/sbin properly to our $PATH, we should be able to execute the command start-all.sh. If the command is not available, have you restarted the terminal after editing your .zshrc file or executed the command source .zshrc?

If everything works, we should see the different daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager) starting up.

And, to verify that everything is running accordingly, we can execute the jps command, which should list all of the daemons above.

Some useful URLs are:

NameNode Web UI: http://localhost:9870
ResourceManager Web UI: http://localhost:8088
NodeManager Web UI: http://localhost:8042

What are the native libraries?

If someone has paid close attention to the output when starting Hadoop, I am sure they have observed a warning line like:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Despite us setting the variables HADOOP_COMMON_LIB_NATIVE_DIR and HADOOP_OPTS on the environment, the truth is the lib folder is not present in our installation.

Hadoop has native implementations of certain components for performance reasons and due to the non-availability of Java implementations. These components are available in a single, dynamically-linked native library called the native Hadoop library. On the *nix platforms the library is named libhadoop.so. On Mac systems it is named libhadoop.a. Hadoop runs fine while using the Java classes, but you will gain speed with the native ones. More importantly, some compression codecs are only supported natively; if any other application depends on Hadoop, or your jars depend on these codecs, then jobs will fail unless the native libraries are available.

The installation of the native libraries is beyond the scope of this article.

The last thing is to stop Hadoop; to do this, we just need to run the command stop-all.sh.


Hadoop and its ecosystem

In the previous article, we had a brief introduction to Big Data. To be able to take advantage of the emerging opportunities it creates for organisations, some frameworks and tools have appeared in the last few years. One of these frameworks is Hadoop, and all the ecosystem of tools created around it.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop website

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Hadoop Distributed File System (HDFS™)

HDFS creates an abstraction that allows users to see HDFS logically as a single storage unit, but the data is stored across multiple nodes in a distributed fashion. HDFS follows master-slave architecture where the master is the NameNode and the slaves – there can be multiple slaves – are known as DataNodes.

The NameNode has the following characteristics or functions:

  • It is the master daemon that maintains and manages the DataNodes.
  • It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
  • It records every change that takes place to the file system metadata.
  • If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
  • It regularly receives a heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
  • It keeps a record of all the blocks in the HDFS and DataNode in which they are stored.
  • It has high availability and federation features.

DataNodes have the following characteristics or functions:

  • It is the slave daemon which runs on each slave machine.
  • The actual data is stored on DataNodes.
  • It is responsible for serving read and write requests from the clients.
  • It is also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode.
  • It sends heartbeats to the NameNode periodically to report the overall health of HDFS, by default, this frequency is set to 3 seconds.

Some of the problems that it fixes are:

  • Storage space problems: HDFS provides a distributed way of storing Big Data. The data is stored in blocks across the available DataNodes and, in addition, the data blocks are replicated on different DataNodes. This architecture allows for horizontal scaling: extra DataNodes can be added to the HDFS cluster when required, increasing the available storage space.
  • Variety of data: HDFS can store all kinds of data whether it is structured, semi-structured or unstructured.
  • Access and process velocity: To increase the access and process velocity HDFS moves processing to data and not data to processing. Instead of moving data to the master nodes and then processing the data, the processing logic is sent to the various data nodes where the data is processed in parallel across different nodes. Once they are done, the processed results are sent to the master node where the results are merged and the response is sent back to the client.

Hadoop YARN

YARN performs all the processing activities by allocating resources and scheduling tasks. YARN has a ResourceManager and one or more NodeManagers.

The ResourceManager has the following characteristics or functions:

  • It is a cluster-level (one for each cluster) component and runs on the master machine.
  • It manages resources and schedules applications running on top of YARN.
  • It has two components: Scheduler & ApplicationManager.
  • The Scheduler is responsible for allocating resources to the various running applications.
  • The ApplicationManager is responsible for accepting job submissions and negotiating the first container for executing the application.
  • It keeps a track of the heartbeats from the Node Manager.

The NodeManagers have the following characteristics or functions:

  • It is a node-level component (one on each node) and runs on each slave machine.
  • It is responsible for managing containers and monitoring resource utilisation in each container.
  • It also keeps track of node health and log management.
  • It continuously communicates with ResourceManager to remain up-to-date.

Despite all the benefits Hadoop offers, it is not a silver bullet and its use needs to be carefully considered. Some of the use cases where it is not recommended are:

  • Low Latency data access: Queries that have a maximum and short response time.
  • Constant data modifications: Hadoop is a better fit when the primary concern is reading data.
  • Lots of small files: While Hadoop can store multiple small datasets, it is much more suitable for scenarios where there are few but large files.

Hadoop MapReduce

Hadoop MapReduce is the core Hadoop ecosystem component which provides data processing. MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System. MapReduce programs are parallel in nature, and thus are very useful for performing large-scale data analysis using multiple machines in the cluster. Thus, this parallel processing improves the speed and reliability of the cluster.

Hadoop ecosystem

Hadoop was designed as both a computing (MapReduce) and storage (HDFS) platform from the very beginning. With the increasing need for big data analysis, Hadoop attracts lots of other software to resolve big data questions and merges into a Hadoop-centric big data ecosystem. The following diagram gives a brief overview of the Hadoop big data ecosystem:

O’Reilly – Overview of the Hadoop ecosystem

We will go into more detail on some of the elements in future articles.


Big Data

Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. In general, the data contains greater variety, originates from more sources and arrives in increasing volumes and with more velocity. The reason Big Data is important is that these massive volumes of data can be used to address business problems we would not have been able to tackle before, and to gain insights and make predictions that were previously out of our reach.

While initially Big Data was defined around the concepts of volume, velocity and variety, lately two more concepts have been added: value and veracity. All data can initially be considered noise until its intrinsic value has been discovered. In addition, to be able to discover this value and utilise the data, we need to consider how reliable and truthful it is.

A few factors have influenced the growth of the importance of Big Data:

  • The cheap storage possibilities.
  • Cloud Computing offers truly elastic scalability.
  • The advent of the Internet of Things (IoT) and the gathering of data related to customer usage patterns and product performance.
  • The emergence of Machine Learning.

Big Data offers challenges not only around data storage but also around curation. For data to be useful, it needs to be clean and relevant to the organisation in a way that enables meaningful analysis. Data scientists spend a big chunk of their time curating the data to allow data analysts to extract proper conclusions from it.

Part of this curation process is influenced by the source of the data and its format. We can categorise our data into three main types:

  • Structured data: Any data that can be processed, accessed and stored in a fixed format is named structured data.
  • Unstructured data: Any data which has an unfamiliar structure or model is named unstructured data.
  • Semi-structured data: It refers to data that, even though it has not been organised under a specific database model, contains essential tags or information that isolate individual components inside the data.

As we have mentioned before, Big data gives us new insights that open up new opportunities and business models. Getting started involves three key actions:

  • Integration: Big data brings together data from many disparate sources and applications. Traditional data integration mechanisms, such as extract, transform, and load (ETL) generally are not up to the task. It requires new strategies and technologies to analyze big data sets at terabyte, or even petabyte, scale. During integration, you need to bring in the data, process it, and make sure it’s formatted and available in a form that your business analysts can get started with.
  • Management: Big data requires storage. Where to store the data? In the cloud, on-premises or a hybrid solution. How to store the data? In which form the data is going to be stored that is going to allow on-demand processes to work with it.
  • Analysis: To obtain any kind of insight, the data needs to be analysed and acted on. We can build visualisations to clarify the meaning of data sets, mix and explore different datasets to make new discoveries, share our findings with others, or build data models with machine learning and artificial intelligence.

Some best practices when working in the Big Data space are:

  • Align big data with specific business goals: More extensive data sets enable us to make new discoveries. To that end, it is important to base new investments in skills, organization, or infrastructure with a strong business-driven context to guarantee ongoing project investments and funding. To determine if we are on the right track, ask how big data supports and enables your top business and IT priorities. Examples include understanding how to filter web logs to understand e-commerce behaviour, deriving sentiment from social media and customer support interactions, and understanding statistical correlation methods and their relevance for the customer, product, manufacturing, and engineering data.
  • Ease skills shortage with standards and governance: One of the biggest obstacles to benefiting from our investment in big data is a skills shortage. You can mitigate this risk by ensuring that big data technologies, considerations, and decisions are added to your IT governance program. Standardizing your approach will allow you to manage costs and leverage resources. Organizations implementing big data solutions and strategies should assess their skill requirements early and often and should proactively identify any potential skill gaps. These can be addressed by training/cross-training existing resources, hiring new resources, and leveraging consulting firms.
  • Optimise knowledge transfer with a centre of excellence: Use a centre of excellence approach to sharing knowledge, control oversight, and manage project communications. Whether big data is a new or expanding investment, the soft and hard costs can be shared across the enterprise. Leveraging this approach can help increase big data capabilities and overall information architecture maturity in a more structured and systematic way.
  • The top payoff is aligning unstructured with structured data: It is certainly valuable to analyse big data on its own. But you can bring even greater business insights by connecting and integrating low-density big data with the structured data you are already using today. Whether you are capturing customer, product, equipment, or environmental big data, the goal is to add more relevant data points to your core master and analytical summaries, leading to better conclusions. For example, there is a difference in distinguishing all customer sentiment from that of only your best customers. This is why many see big data as an integral extension of their existing business intelligence capabilities, data warehousing platform, and information architecture. Keep in mind that the big data analytical processes and models can be both human – and machine-based. Big data analytical capabilities include statistics, spatial analysis, semantics, interactive discovery, and visualization. Using analytical models, you can correlate different types and sources of data to make associations and meaningful discoveries.
  • Plan your discovery lab for performance: Discovering meaning in your data is not always straightforward. Sometimes we do not even know what we are looking for. That is expected. Management and IT need to support this “lack of direction” or “lack of clear requirements”.
  • Align with the cloud operating model: Big data processes and users require access to a broad array of resources for both iterative experimentation and running production jobs. A big data solution includes all data realms including transactions, master data, reference data, and summarized data. Analytical sandboxes should be created on demand. Resource management is critical to ensure control of the entire data flow including pre- and post-processing, integration, in-database summarization, and analytical modelling. A well-planned private and public cloud provisioning and security strategy plays an integral role in supporting these changing requirements.

Spring Boot layered deployment

For a long time, when doing cloud-native development in Java, Spring Boot has been one of the go-to frameworks. Despite being a very popular and excellent option, it is not exempt from criticism. One of the biggest complaints I have heard over the years is the enforcement of fat jars.

While the idea of having all dependencies packaged in a big fat jar, or uber jar, that you can just include in your container images sounds amazing, and it is, the truth is that it increases release times: while our application itself is composed of just a few megabytes, fat jars can be in the order of hundreds of megabytes.

As part of the 2.3 release, Spring Boot added the possibility of “Build layered jars for inclusion in a Docker image“. In their own words: “The layering separates the jar’s contents based on how frequently they will change. This separation allows more efficient Docker images to be built. Existing layers that have not changed can be reused with the layers that have changed being placed on top.“.

This feature allows us to walk away from the long release times of fat jars, making it possible to deploy an application even in milliseconds, depending on the changes. Basically, it uses the Docker layering capabilities to split our application and its dependencies into different layers, and only re-builds the ones required due to changes.

By default, the following layers are defined:

  • dependencies for any dependency whose version does not contain SNAPSHOT.
  • spring-boot-loader for the jar loader classes.
  • snapshot-dependencies for any dependency whose version contains SNAPSHOT.
  • application for application classes and resources.

The default order is dependencies, spring-boot-loader, snapshot-dependencies, and application. This is important as it determines how likely previous layers are to be cached when part of the application changes. If, for example, we put the application as the first layer, we will not save any release time at all, because every time we introduce a change all layers will be re-built. The layers should be sorted with the least likely to change added first, followed by layers that are more likely to change.

The Spring Boot Gradle Plugin documentation contains a section explicitly referring to this point that can be located here. And, it should be enough to start experimenting with the feature but, as you all know, I am a hands-on person, so let’s make a quick implementation.

The first thing we are going to need is a simple Spring Boot project; in my case, it is going to be based on Java 18 and Gradle. To easily get one ready to go, we can use the Spring Initializr.

To be able to build our first application using this feature, only two simple changes are required.

Modify the Gradle configuration to specify we want a layered build:

bootJar {
  layered()
}

The second modification is going to be in the Dockerfile:

FROM openjdk:18-alpine as builder
ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} application.jar
RUN java -Djarmode=layertools -jar application.jar extract

FROM openjdk:18-alpine
COPY --from=builder dependencies/ ./
COPY --from=builder spring-boot-loader/ ./
COPY --from=builder snapshot-dependencies/ ./
COPY --from=builder application/ ./
ENTRYPOINT ["java", "org.springframework.boot.loader.JarLauncher"]

With only this, we will already see good improvements in our release times, but any seasoned developer will soon point out that not all dependencies are equal. There is usually a huge difference in the release cycles between internal and external dependencies. The former tend to change quite often, especially if they are just modules of the same project, while the latter tend to be more stable and their versions are upgraded more carefully. In the previous scenario, if the dependencies layer is the first one and it contains internal dependencies, the most likely outcome is that all layers are going to be re-built all the time, and that is bad.

For that reason, the layered jars feature allows the specification of custom layers. To tackle the problem we have just mentioned, we can add just a few modifications to our files.

We can extend the recently added Gradle configuration with the specification of layers we want to use:

bootJar {
    layered {
        application {
            intoLayer("spring-boot-loader") {
                include("org/springframework/boot/loader/**")
            }
            intoLayer("application")
        }
        dependencies {
            intoLayer("snapshot-dependencies") {
                include("*:*:*SNAPSHOT")
            }
            intoLayer("internal-dependencies") {
                include("dev.binarycoders:*:*")
            }
            intoLayer("dependencies")
        }
        layerOrder = ["dependencies",
                      "spring-boot-loader",
                      "internal-dependencies",
                      "snapshot-dependencies",
                      "application"
        ]
    }
}

As we can see, we have added a few more layers and ordered them at our convenience.

  • internal-dependencies will contain all our internal modules if we are using proper artefact names.
  • snapshot-dependencies will contain any SNAPSHOT dependencies we are using.
  • spring-boot-loader will contain the loader classes.

The second change is to update our Dockerfile to reflect these changes:

FROM openjdk:18-alpine as builder
ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} application.jar
RUN java -Djarmode=layertools -jar application.jar extract

FROM openjdk:18-alpine
COPY --from=builder dependencies/ ./
COPY --from=builder spring-boot-loader/ ./
COPY --from=builder internal-dependencies/ ./
COPY --from=builder snapshot-dependencies/ ./
COPY --from=builder application/ ./
ENTRYPOINT ["java", "org.springframework.boot.loader.JarLauncher"]

Basically, it has the same content but uses the new layers.


Apache Kafka (II): Interesting learning resources

As explained in my last article, I needed to learn Apache Kafka at a fast pace. My plan after that was to write a series of articles ordering my learnings, thoughts and ideas, but something has changed. Although I am confident in my communication skills, I found a series of videos by Confluent (I have no relation with them) that I would have loved to discover when I was starting. They are four mini-courses that explain Apache Kafka, Kafka Streams and ksqlDB. I would have loved to have them before I started because I think they are clear enough, concise and platform agnostic (not linked to their platform).

For all these reasons, I have decided it is better to point people to these resources instead of just writing a few articles. Everyone who follows the blog knows that I do not do this (share other materials) too often, but I think they do a good job explaining all the concepts with examples related to Apache Kafka, and there is no point rewriting what they are doing. There is, maybe, space for one last article in this series with some configurations, tips and pitfalls (I have a long list of those).

The series of videos are:

  • Apache Kafka 101: A brief introduction, the explanation of basic concepts related to it and the basic use.
  • Kafka Streams 101: A nice, full of examples, course about Kafka Streams and related concepts.
  • ksqlDB 101: Same format as the previous one, focused on ksqlDB.
  • Inside ksqlDB: Going one step further on ksqlDB, showing some of its operations.

Apache Kafka (I): Intro + Installation

Due to some recent changes, I have been forced to learn about Apache Kafka. It has been sudden and, despite the fact that I love to learn new things, a bit in a hurry for my taste. For this reason, I have planned to write a few articles trying to sort my learnings and my ideas and put some order into the chaos. During this extremely quick learning period, I have learned more complex concepts before basic ones, and tricks and exceptions before rules. I hope these articles help me to get a better understanding of the technology, and to be able to help anyone else in the same position.

What is Apache Kafka?

On the project’s webpage, we can see the following definition:

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Maybe a more elaborate definition is that Apache Kafka is an event streaming platform: a highly scalable and durable system capable of continuously ingesting gigabytes of events per second from various sources and making them available in milliseconds, used to collect, store and process real-time data streams at scale. It has numerous use cases, including distributed logging, stream processing and pub-sub messaging.

At this point, you are probably still confused. I know how you feel; it happened to me the first time. Let’s see if we can dig a bit more into all of this. Splitting the definition into smaller parts, we have:

  • An event is something that happens in a system. It can be anything from a particular user logging in, to a sensor registering a temperature.
  • A data pipeline reliably processes and moves data from one system to another, in this case, events.
  • A streaming application is an application that consumes streams of data.

How does Apache Kafka work?

Apache Kafka combines two messaging models:

  • Queueing: Queuing allows for data processing to be distributed across many consumer instances, making it highly scalable. However, traditional queues are not multi-subscriber.
  • Publish-subscribe: The publish-subscribe approach is multi-subscriber, but because every message goes to every subscriber it cannot be used to distribute work across multiple worker processes.

To be able to combine both approaches, Apache Kafka uses a partitioned log model, considering a log as an ordered sequence of events distributed into different partitions. This allows the existence of multiple subscribers (one per partition) and, while the delivery order is guaranteed per partition, it is not guaranteed per log.

This model provides replayability, which allows multiple independent applications to process the events in a log (data stream) at their own pace.

RabbitMQ vs Apache Kafka

One of the first questions that popped into my mind when reviewing Apache Kafka for the first time was “why not use RabbitMQ?” (I have more extensive experience with it). Over time, I discovered that this was not the right question; a more accurate one would be “what does each service excel at?”. The second question is closer to the mentality we should have as software engineers.

The one sentence to remember when deciding is that RabbitMQ’s message broker design excelled in use cases that had specific routing needs and per-message guarantees, whereas Kafka’s append-only log allowed developers access to the stream history and more direct stream processing.

RabbitMQ seems better suited for long-running tasks where reliable background jobs are needed, and for communication and integration within, and between, applications, i.e. as a middleman between microservices. Apache Kafka, on the other hand, seems ideal when a framework for storing, reading (and re-reading), and analysing streaming data is needed.

Some more concrete scenarios are:

  • Apache Kafka:
    • High-throughput activity tracking: Kafka can be used for a variety of high-volume, high-throughput activity-tracking applications. For example, you can use Kafka to track website activity (its original use case), ingest data from IoT sensors, monitor patients in hospital settings, or keep tabs on shipments.
    • Stream processing: Kafka enables you to implement application logic based on streams of events. You might keep a running count of types of events or calculate an average value over the course of an event that lasts several minutes. For example, if you have an IoT application that incorporates automated thermometers, you could keep track of the average temperature over time and trigger alerts if readings deviate from a target temperature range.
    • Event sourcing: Kafka can be used to support event sourcing, in which changes to an app state are stored as a sequence of events. So, for example, you might use Kafka with a banking app. If the account balance is somehow corrupted, you can recalculate the balance based on the stored history of transactions.
    • Log aggregation: Similar to event sourcing, you can use Kafka to collect log files and store them in a centralized place. These stored log files can then provide a single source of truth for your app.
  • RabbitMQ:
    • Complex routing: RabbitMQ can be the best fit when you need to route messages among multiple consuming apps, such as in a microservices architecture. RabbitMQ consistent hash exchange can be used to balance load processing across a distributed monitoring service, for example. Alternative exchanges can also be used to route a portion of events to specific services for A/B testing.
    • Legacy applications: Using available plug-ins (or developing your own), you can deploy RabbitMQ as a way to connect consumer apps with legacy apps. For example, you can use a Java Message Service (JMS) plug-in and JMS client library to communicate with JMS apps.

Apache Kafka installation (on macOS)

The articles are going to be focused on a macOS system, but steps to install Apache Kafka in other systems should be easy enough, and the rest of the concepts should be indistinguishable from one system to another.

We are going to install Apache Kafka using Homebrew, a package manager for macOS. The installation is as simple as running:

brew install kafka

This command will install two services on our system:

  • Apache Kafka
  • Zookeeper

We can run them as services using the Homebrew command ‘brew services’ (similar to ‘systemctl’ if you are a Linux user), or execute them manually.

To run them as services:

brew services start zookeeper
brew services start kafka

To run them manually:

zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
kafka-server-start /usr/local/etc/kafka/server.properties

Everything should work out of the box. With this, we will have a simple Apache Kafka server running locally.

I have seen some people having connection problems when starting Apache Kafka. If this happens, we can edit the file ‘/usr/local/etc/kafka/server.properties‘, find the line ‘#listeners=PLAINTEXT://:9092‘, uncomment it, and start it again.

With these steps, we should be ready to start making some progress.
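Before moving on, a quick smoke test (the topic name is an arbitrary choice of mine) to verify the broker answers, using the CLI tools installed alongside Kafka:

# Create a test topic
kafka-topics --bootstrap-server localhost:9092 --create --topic test-topic

# Send a few messages (type them and press enter, Ctrl+C to finish)
kafka-console-producer --bootstrap-server localhost:9092 --topic test-topic

# Read them back from the beginning
kafka-console-consumer --bootstrap-server localhost:9092 --topic test-topic --from-beginning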


Blockchain, Ethereum and Smart Contracts

Usually, I try to make all the content I create original, based on readings, investigations, or just on situations or questions I have heard or have been asked while working. Today, this article is a little bit different. It is not a copy from somewhere, but it has pieces that have been taken from other articles. You can find all the references at the end of this article. This article was built as part of some hack days I enjoyed recently, where I decided to dig a little bit deeper into Blockchain, Ethereum, and Smart Contracts. These are my field notes, just in case someone else finds them useful.

Blockchain

What is Blockchain?

A blockchain is essentially an immutable digital ledger of transactions that is duplicated and distributed (shared) across the entire network of computer systems on the blockchain. Each block in the chain contains several transactions, and every time a new transaction occurs on the blockchain, a record of that transaction is added to every participant’s ledger. Blockchain facilitates the process of recording transactions and tracking assets in a business network. An asset can be tangible (i.e., a house, car, cash, land) or intangible (i.e., intellectual property, patents, copyrights, branding). Virtually anything of value can be tracked and traded on a blockchain network, reducing risk and cutting costs for all involved.

A decentralised database managed by multiple participants is known as Distributed Ledger Technology (DLT).

Blockchain is a type of DLT in which transactions are recorded with an immutable cryptographic signature called a hash.

The properties of Distributed Ledger Technology are:

  • Programmable: A blockchain is programmable, e.g., through smart contracts.
  • Secure: All records are individually encrypted.
  • Anonymous: The identity of participants is either anonymous or pseudo-anonymous.
  • Unanimous: All members of the network agree to the validity of each of the records.
  • Time-stamped: A transaction timestamp is recorded on a block.
  • Immutable: Any validated records are irreversible and cannot be changed.
  • Distributed: All network participants have a copy of the ledger for complete transparency.

How is Blockchain used?

Blockchain technology is used for many different purposes, from providing financial services to administering voting systems.

Cryptocurrency

The most common use of blockchain today is as the backbone of cryptocurrencies, like Bitcoin or Ethereum. When people buy, exchange or spend cryptocurrency, the transactions are recorded on a blockchain. The more people use cryptocurrency, the more widespread blockchain could become.

Banking

Beyond cryptocurrency, blockchain is being used to process transactions in fiat currency, like dollars and euros. This could be faster than sending money through a bank or other financial institution as the transactions can be verified more quickly and processed outside of normal business hours.

Asset Transfers

Blockchain can also be used to record and transfer the ownership of different assets. This is currently very popular with digital assets like NFTs, a representation of ownership of digital art and videos.

However, blockchain could also be used to process the ownership of real-life assets, like the deed to real estate and vehicles. The two parties would first use the blockchain to verify that one owns the property and the other has the money to buy; then they could complete and record the sale on the blockchain. Using this process, they could transfer the property deed without manually submitting paperwork to update the local county government’s records; it would be instantaneously updated in the blockchain.

Smart Contracts

Another blockchain innovation is self-executing contracts commonly called “smart contracts”. These digital contracts are enacted automatically once conditions are met. For instance, a payment for a good might be released instantly once the buyer and seller have met all specified parameters for a deal. Another example is automating legal contracts: a properly coded smart legal contract on a distributed ledger can minimize, or preferably eliminate, the need for outside third parties to verify performance.

Supply Chain Monitoring

Supply chains involve massive amounts of information, especially as goods go from one part of the world to the other. With traditional data storage methods, it can be hard to trace the source of problems, like which vendor poor-quality goods came from. Storing this information on the blockchain would make it easier to go back and monitor the supply chain, such as with IBM’s Food Trust, which uses blockchain technology to track food from its harvest to its consumption.

Voting

Experts are looking into ways to apply blockchain to prevent fraud in voting. In theory, blockchain voting would allow people to submit votes that could not be tampered with as well as would remove the need to have people manually collect and verify paper ballots.

Healthcare

Health care providers can leverage blockchain to securely store their patients’ medical records. When a medical record is generated and signed, it can be written into the blockchain, which provides patients with the proof and confidence that the record cannot be changed. These personal health records could be encoded and stored on the blockchain with a private key, so that they are only accessible by certain individuals, thereby ensuring privacy.

Advantages of Blockchain

Higher Accuracy of Transactions

Because a blockchain transaction must be verified by multiple nodes, this can reduce errors. If one node has a mistake in the database, the others would see that it is different and catch the error.

In contrast, in a traditional database, if someone makes a mistake, it may be more likely to go through. In addition, every asset is individually identified and tracked on the blockchain ledger, so there is no chance of double spending it (like a person overdrawing their bank account, thereby spending money twice).

No Need for Intermediaries

Using blockchain, two parties in a transaction can confirm and complete something without working through a third party. This saves time as well as the cost of paying for an intermediary like a bank. This can bring greater efficiency to all digital commerce, increase financial empowerment to the unbanked or underbanked populations of the world and power a new generation of internet applications as a result.

Extra Security

Theoretically, a decentralized network, like blockchain, makes it nearly impossible for someone to make fraudulent transactions. To enter forged transactions, they would need to hack every node and change every ledger. While this is not necessarily impossible, many cryptocurrency blockchain systems use proof-of-stake or proof-of-work transaction verification methods that make it difficult, as well as not in participants’ best interests, to add fraudulent transactions.

More Efficient Transfers

Since blockchains operate 24/7, people can make more efficient financial and asset transfers, especially internationally. They do not need to wait days for a bank or a government agency to manually confirm everything.

Disadvantages of Blockchain

Limit on Transactions per Second

Given that blockchain depends on a larger network to approve transactions, there is a limit to how quickly it can move. For example, Bitcoin can only process 4.6 transactions per second versus 1,700 per second with Visa. In addition, increasing numbers of transactions can create network speed issues. Until this improves, scalability is a challenge.

High Energy Costs

Having all the nodes working to verify transactions takes significantly more electricity than a single database or spreadsheet. Not only does this make blockchain-based transactions more expensive, but it also creates a large carbon burden for the environment.

Risk of Asset Loss

Some digital assets are secured using a cryptographic key, like cryptocurrency in a blockchain wallet. You need to guard this key carefully: there is no centralised entity that can be called to recover the access key.

Potential for Illegal Activity

Blockchain’s decentralization adds more privacy and confidentiality, which unfortunately makes it appealing to criminals. It is harder to track illicit transactions on blockchain than through bank transactions that are tied to a name.

Common misconception: Blockchain vs Bitcoin

The goal of blockchain is to allow digital information to be recorded and distributed, but not edited. Blockchain technology was first outlined in 1991 but it was not until almost two decades later, with the launch of Bitcoin in January 2009, that blockchain had its first real-world application.

The key thing to understand here is that Bitcoin merely uses blockchain as a means to transparently record a ledger of payments, but blockchain can, in theory, be used to immutably record any number of data points. As discussed above, this could be in the form of transactions, votes in an election, product inventories, state identifications, deeds to homes, and much more.

How does a transaction get into the blockchain?

For a new transaction to be added to the blockchain a few steps need to happen:

Authentication

The original blockchain was designed to operate without a central authority (i.e. with no bank or regulator controlling who transacts), but transactions still have to be authenticated.

This is done using cryptographic keys, a string of data (like a password) that identifies a user and gives access to their “account” or “wallet” of value on the system.

Each user has their own private key and a public key that everyone can see. Using them both creates a secure digital identity to authenticate the user via digital signatures and to “unlock” the transaction they want to perform.

Authorisation

Once the transaction is agreed between the users, it needs to be approved, or authorised, before it is added to a block in the chain.

For a public blockchain, the decision to add a transaction to the chain is made by consensus. This means that the majority of “nodes” (or computers in the network) must agree that the transaction is valid. The people who own the computers in the network are incentivised to verify transactions through rewards. This process is known as “proof of work”.

Proof of Work

Proof of Work requires the people who own the computers in the network to solve a complex mathematical problem to be able to add a block to the chain. Solving the problem is known as mining, and “miners” are usually rewarded for their work in cryptocurrency.

But mining is not easy. The mathematical problem can only be solved by trial and error and the odds of solving the problem are about 1 in 5.9 trillion. It requires substantial computing power which uses considerable amounts of energy. This means the rewards for undertaking the mining must outweigh the cost of the computers and the electricity cost of running them, as one computer alone would take years to find a solution to the mathematical problem.

The Problem with Proof of Work

To create economies of scale, miners often pool their resources together through companies that aggregate a large group of miners. These miners then share the rewards and fees offered by the blockchain network.

As a blockchain grows, more computers join to try and solve the problem, the problem gets harder and the network gets larger, theoretically distributing the chain further and making it even more difficult to sabotage or hack. In practice though, mining power has become concentrated in the hands of a few mining pools. These large organisations have the vast computing and electrical power now needed to maintain and grow a blockchain network based around Proof of Work validation.

Proof of Stake

Later blockchain networks have adopted “Proof of Stake” validation consensus protocols, where participants must have a stake in the blockchain – usually by owning some of the cryptocurrency – to be in with a chance of selecting, verifying, and validating transactions. This saves substantial computing power resources because no mining is required.

In addition, blockchain technologies have evolved to include “Smart Contracts” which automatically execute transactions when certain conditions have been met.

Risk in public blockchains

51% Attacks

Where blockchains have consensus rules based on a simple majority, there is a risk that malign actors will act together to influence the outcomes of the system. In the case of a cryptocurrency, this would mean a group of miners controlling more than 50% of the mining computing power can influence what transactions are validated and added (or omitted) from the chain. On a blockchain that uses the Proof of Work (PoW) consensus protocol system, a 51% attack can also take the form of a “rival” chain – including fraudulent transactions – being created by malicious parties.

Through their superior mining capacity, these fraudsters can build an alternative chain that ends up being longer than the “true” chain, and therefore – because part of the Bitcoin Nakamoto consensus protocol is “the longest chain wins” – all participants must follow the fraudulent chain going forward.

In a large blockchain like Bitcoin this is increasingly difficult, but where a blockchain has “split” and the pool of miners is smaller, as in the case of Bitcoin Gold, a 51% attack is possible.

A 51% double spend attack was successfully executed on the Bitcoin Gold and Ethereum Classic blockchains in 2018, where fraudsters misappropriated millions of dollars of value.

Proof of Work vs Proof of Stake

A 51% attack on a new blockchain called Ethereum Classic in January 2019 prompted a change in strategic direction from Proof-of-Work (PoW) mining to Proof-of-Stake (PoS) voting for the Ethereum blockchain.

However, Proof of Stake is more vulnerable to schisms or splits known as “forks”, where large stakeholders make different decisions about the transactions that should comprise blocks and end up creating yet another new currency. Ethereum briefly tried this validation method but, due to forking issues, reverted back to Proof of Work. It is expected to introduce a revised Proof of Stake validation system in 2020.

Double Spending

There is a risk that a participant with, for example, one bitcoin can spend it twice and fraudulently receive goods to the value of two bitcoins before one of the providers of goods or services realises that the money has already been spent. But this is, in fact, an issue with any system of electronic money, and is one of the principal reasons behind clearing and settlement systems in traditional currency systems.

How Does Blockchain Work?

Blockchain consists of three important concepts: blocks, nodes, and miners.

Blocks

Every chain consists of multiple blocks and each block has three basic elements:

  • The data in the block.
  • A 32-bit whole number called a nonce. The nonce is randomly generated when a block is created, which then generates a block header hash.
  • A 256-bit number called a hash, wedded to the nonce. It must start with a huge number of zeroes (i.e., be extremely small).

When the first block of a chain is created, a nonce generates the cryptographic hash. The data in the block is considered signed and forever tied to the nonce and hash unless it is mined.

Miners

Miners create new blocks on the chain through a process called mining.

In a blockchain every block has its own unique nonce and hash, but also references the hash of the previous block in the chain, so mining a block isn’t easy, especially on large chains.

Miners use special software to solve the incredibly complex math problem of finding a nonce that generates an accepted hash. Because the nonce is only 32 bits and the hash is 256, there are roughly four billion possible nonce-hash combinations that must be mined before the right one is found. When that happens miners are said to have found the “golden nonce” and their block is added to the chain.
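
As a purely illustrative toy (not the real Bitcoin algorithm, which double-hashes an 80-byte block header with SHA-256 and demands far more leading zeroes), the brute-force nonce search can be sketched in a few lines of shell (bash/zsh syntax):

# Toy proof-of-work: find a nonce so that sha256(data + nonce) starts with "000".
# Every extra required zero makes the search roughly 16 times harder.
data="some block data"
nonce=0
while true; do
  hash=$(printf '%s%s' "$data" "$nonce" | shasum -a 256 | awk '{print $1}')
  if [ "${hash:0:3}" = "000" ]; then
    echo "golden nonce: $nonce hash: $hash"
    break
  fi
  nonce=$((nonce + 1))
done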

Making a change to any block earlier in the chain requires re-mining not just the block with the change, but all of the blocks that come after. This is why it’s extremely difficult to manipulate blockchain technology. Think of it as “safety in maths”, since finding golden nonces requires an enormous amount of time and computing power.

When a block is successfully mined, the change is accepted by all of the nodes on the network and the miner is rewarded financially.

Nodes

One of the most important concepts in blockchain technology is decentralization. No one computer or organization can own the chain. Instead, it is a distributed ledger via the nodes connected to the chain. Nodes can be any kind of electronic device that maintains copies of the blockchain and keeps the network functioning.

Every node has its own copy of the blockchain and the network must algorithmically approve any newly mined block for the chain to be updated, trusted, and verified. Since blockchains are transparent, every action in the ledger can be easily checked and viewed. Each participant is given a unique alphanumeric identification number that shows their transactions.

Combining public information with a system of checks and balances helps the blockchain maintain integrity and creates trust among users. Essentially, blockchains can be thought of as the scalability of trust via technology.

Ethereum

What is Ethereum?

Ethereum is often referred to as the second most popular cryptocurrency, after Bitcoin. But unlike Bitcoin and most other virtual currencies, Ethereum is intended to be much more than simply a medium of exchange or a store of value. Instead, Ethereum calls itself a decentralized computing network built on blockchain technology.

It is distributed in the sense that everyone participating in the Ethereum network holds an identical copy of this ledger, letting them see all past transactions. It is decentralised in that the network is not operated or managed by any centralised entity – instead, it is managed by all of the distributed ledger holders.

In Ethereum’s case, participants, when mining, are rewarded with cryptocurrency tokens called Ether (ETH).

Ether can be used to buy and sell goods and services, like Bitcoin. It’s also seen rapid gains in price over recent years, making it a de-facto speculative investment. But what’s unique about Ethereum is that users can build applications that “run” on the blockchain like software “runs” on a computer. These applications can store and transfer personal data or handle complex financial transactions. This is one of the big differences with Bitcoin: the Ethereum network can perform computations as part of the mining process, and this basic computational capability turns a store of value and medium of exchange into a decentralized global computing engine and openly verifiable data store.

Although it can be inferred from the previous paragraphs, it is worth emphasising here that the terms Ethereum and Ether mean different things: Ether is the digital currency used for financial transactions, while Ethereum is the blockchain network on which Ether is held and exchanged, and the network offers a variety of other functions beyond ETH.

As said before, the Ethereum network can also be used to store data and run decentralized applications. Rather than hosting software on a server owned and operated by Google or Amazon, where the one company controls the data, people can host applications on the Ethereum blockchain. This gives users control over their data and they have open use of the app as there’s no central authority managing everything.

Perhaps one of the most intriguing use cases involving Ether and Ethereum are self-executing contracts or so-called smart contracts. Like any other contract, two parties make an agreement about the delivery of goods or services in the future. Unlike conventional contracts, lawyers are not necessary: The parties code the contract on the Ethereum blockchain, and once the conditions of the contract are met, it self-executes and delivers Ether to the appropriate party.

Ethereum Benefits

  • Large, existing network: Ethereum is a tried-and-true network that has been tested through years of operation and billions of value trading hands. It has a large and committed global community and the largest ecosystem in blockchain and cryptocurrency.
  • Wide range of functions: Besides being used as a digital currency, Ethereum can also be used to process other types of financial transactions, execute smart contracts and store data for third-party applications.
  • Constant innovation: A large community of Ethereum developers is constantly looking for new ways to improve the network and develop new applications. Because of Ethereum’s popularity, it tends to be the preferred blockchain network for new and exciting (and sometimes risky) decentralised applications.
  • Avoids intermediaries: Ethereum’s decentralised network promises to let users leave behind third-party intermediaries, like lawyers who write and interpret contracts, banks that are intermediaries in financial transactions or third-party web hosting services.

Ethereum Disadvantages

  • Rising transaction costs: Ethereum’s growing popularity has led to higher transaction costs. Ethereum transaction fees, also known as “gas,” hit a record $23 per transaction in February 2021, which is great if you’re earning money as a miner but less so if you’re trying to use the network. This is because unlike Bitcoin, where the network itself rewards transaction verifiers, Ethereum requires those participating in the transaction to cover the fee.
  • Potential for crypto inflation: While Ethereum has an annual limit of releasing 18 million Ether per year, there’s no lifetime limit on the potential number of coins. This could mean that as an investment, Ethereum might function more like dollars and may not appreciate as much as Bitcoin, which has a strict lifetime limit on the number of coins.
  • The steep learning curve for developers: Ethereum can be difficult for developers to pick up as they migrate from centralised processing to decentralised networks.
  • Unknown future: Ethereum continues to evolve and improve, and the release of Ethereum 2.0 (1st of December 2020) holds out the promise of new functions and greater efficiency. This major update to the network, however, is creating uncertainty for apps and deals currently in use. Is someone familiar with migrations?

Ethereum as a Platform for Applications

Single Source of Truth

The Ethereum blockchain establishes a single source of truth for the ecosystem by maintaining a public transaction database. This shared database captures all transactions that occur between users and applications. Unique virtual addresses identify actors, and each transaction captures participating addresses. These addresses do not reveal any personal information, allowing users to remain anonymous. Transactions are batched together into blocks and validated by thousands of computers or “nodes” before joining the public ledger.

Once posted, no one can remove or alter transactions. Because records are sealed, bad actors cannot revert a transaction after the fact or tamper with the public record. Balances are known and transactions are settled in real-time, so the entire ecosystem is on the same page. This append-only approach assures users and applications that the current state of the blockchain is final and trustworthy.

Platform for Applications

In addition to providing a shared source of truth, the Ethereum blockchain provides a platform for applications. The ability to store and power these applications sets the Ethereum blockchain apart from the Bitcoin blockchain.

The Ethereum blockchain provides critical application infrastructure similar to web services. Thousands of nodes that maintain the single source of truth also supply resources like storage, processing power, and bandwidth. Because people run these nodes all over the globe and the collective resources function as a single machine, Ethereum is referred to as the “world computer”.

Ethereum is different from centralized web services in that transaction data and applications are distributed across thousands of nodes, rather than a few data centres controlled by a corporation. This feature, known as decentralization, leads to a highly redundant and resilient ecosystem that cannot be controlled or censored by a single entity.

The code for these applications lives on the blockchain just like the transactions created by users and other applications. As a result, applications deployed on Ethereum are open and auditable. These apps are also designed to interoperate with other apps in the ecosystem (a stark departure from the traditional “black box” approach for software). Once applications are deployed to Ethereum, they operate autonomously, meaning that they will execute the programs they are designed to run without manual intervention. They are controlled by code, not by individuals or companies. For this reason, applications are referred to as “smart contracts”.

A common analogy for smart contracts is vending machines. Vending machines are programmed to automatically deliver specific items based on specific inputs. Users punch in a code and receive the corresponding item. Likewise, smart contracts receive inputs from users, execute their programmed code, and produce an output.

Decentralized Finance (DeFi)

Because the most fundamental data recorded on the Ethereum blockchain is accounts, balances, and transactions, it makes sense to build financial applications on top of Ethereum. Users and other applications can freely interact with these financial applications because they are public and permissionless by default.

Simple financial services like lending and borrowing can be programmed and deployed on the blockchain as applications, allowing users (and other applications) to earn interest on digital assets or take out loans. Order book exchanges can autonomously pair buyers and sellers at no charge. Automated market makers can minimise spreads by creating liquidity pools that automatically rebalance according to predefined logic. Derivatives can be deployed on the blockchain so that contract terms are known and the underlying assets are priced in real-time.

Smart contracts

A smart contract is a self-executing contract with the terms of the agreement between buyer and seller being directly written into lines of code. The code and the agreements contained therein exist across a distributed, decentralised blockchain network. The code controls the execution, and transactions are trackable and irreversible. In other words, a smart contract is simply a piece of code that is running on Ethereum and can control valuable things like ETH or other digital assets.

Smart contracts permit trusted transactions and agreements to be carried out among disparate, anonymous parties without the need for a central authority, legal system, or external enforcement mechanism.

The fact that smart contracts are computer programs deployed on a blockchain network brings to the table some inherent characteristics:

  • Immutable: Once deployed, the code of a smart contract cannot change. Unlike traditional software, the only way to modify a smart contract is to deploy a new instance.
  • Deterministic: The outcome of the execution of a smart contract is the same for everyone who runs it, given the context of the transaction that initiated its execution and the state of the Ethereum blockchain at the moment of execution.
  • Ethereum Virtual Machine (EVM) context: Smart contracts operate with a very limited execution context. They can access their own state, the context of the transaction that called them, and some information about the most recent blocks.
  • Decentralized world computer: The EVM runs as a local instance on every Ethereum node, but because all instances of the EVM operate on the same initial state and produce the same final state, the system as a whole operates as a single “world computer” (as mentioned before).

Languages to write smart contracts

  • LLL: A functional (declarative) programming language, with Lisp-like syntax. It was the first high-level language for Ethereum smart contracts but is rarely used today.
  • Serpent: A procedural (imperative) programming language with a syntax similar to Python. Can also be used to write functional (declarative) code, though it is not entirely free of side effects.
  • Solidity: A procedural (imperative) programming language with a syntax similar to JavaScript, C++, or Java. The most popular and frequently used language for Ethereum smart contracts. (We will be using this one)
  • Vyper: A more recently developed language, similar to Serpent and again with Python-like syntax. Intended to get closer to a pure-functional Python-like language than Serpent, but not to replace Serpent.
  • Bamboo: A newly developed language, influenced by Erlang, with explicit state transitions and without iterative flows (loops). Intended to reduce side effects and increase auditability. Very new and yet to be widely adopted.

Non-fungible tokens (NFT)

NFTs are tokens that we can use to represent ownership of unique items. They let us tokenise things like art, collectables, even real estate. They can only have one official owner at a time and they’re secured by the Ethereum blockchain – no one can modify the record of ownership or copy/paste a new NFT into existence.

NFT stands for non-fungible token. Non-fungible is an economic term that you could use to describe things like your furniture, a song file, or your computer. These things are not interchangeable with other items because they have unique properties.

Fungible items, on the other hand, can be exchanged because their value defines them rather than their unique properties. For example, ETH or dollars are fungible because 1 ETH / 1 USD is exchangeable for another 1 ETH / 1 USD.

Comparison

An NFT internet | The internet today
NFTs are digitally unique, no two NFTs are the same. | A copy of a file, like a .mp3 or .jpg, is the same as the original.
Every NFT must have an owner and this is of public record and easy for anyone to verify. | Ownership records of digital items are stored on servers controlled by institutions – you must take their word for it.
NFTs are compatible with anything built using Ethereum. An NFT ticket for an event can be traded on every Ethereum marketplace, for an entirely different NFT. You could trade a piece of art for a ticket! | Companies with digital items must build their own infrastructure. For example, an app that issues digital tickets for events would have to build its own ticket exchange.
Content creators can sell their work anywhere and can access a global market. | Creators rely on the infrastructure and distribution of the platforms they use. These are often subject to terms of use and geographical restrictions.
Creators can retain ownership rights over their own work, and claim resale royalties directly. | Platforms, such as music streaming services, retain the majority of profits from sales.
Items can be used in surprising ways. For example, you can use digital artwork as collateral in a decentralised loan. |

Solidity

Solidity is a contract-oriented, high-level language for implementing smart contracts. It was influenced by C++, Python, and JavaScript and is designed to target the Ethereum Virtual Machine (EVM).

Solidity is statically typed, supports inheritance, libraries, and complex user-defined types among other features.

Let’s see an example. Let’s build a simple token contract:

pragma solidity ^0.4.0;

contract SimpleToken {
    int64 constant TOTAL_UNITS = 100000;
    int64 outstanding_tokens;
    address owner;
    mapping(address => int64) holdings;

    function SimpleToken() public { // Constructor
        outstanding_tokens = TOTAL_UNITS;
        owner = msg.sender; // msg.sender represents the address that initiated this contract call
    }

    // Declaring some events
    event TokenAllocation(address holder, int64 number, int64 remaining);
    event TokenMovement(address from, address to, int64 value); // value is a token amount, not an address
    event InvalidTokenUsage(string reason);

    function getOwner() public constant returns (address) {
        return owner;
    }

    // Allocate tokens
    function allocate(address newHolder, int64 value) public {
        if (msg.sender != owner) {
            InvalidTokenUsage('Only owner can allocate tokens');
            return;
        }

        if (value < 0) {
            InvalidTokenUsage('Cannot allocate negative value');
            return; // stop here so a negative value is never applied
        }

        if (value <= outstanding_tokens) {
            holdings[newHolder] += value;
            outstanding_tokens -= value;
            TokenAllocation(newHolder, value, outstanding_tokens);
        } else {
            InvalidTokenUsage('Value to allocate larger than outstanding tokens');
        }
    }

    // Move tokens
    function move(address destination, int64 value) public {
        address source = msg.sender;

        if (value < 0) {
            InvalidTokenUsage('Must move value greater than zero');
            return; // stop here so a negative value is never moved
        }

        if (holdings[source] >= value) {
            holdings[destination] += value;
            holdings[source] -= value;
            TokenMovement(source, destination, value);
        } else {
            InvalidTokenUsage('Value to move larger than holdings');
        }
    }

    // Getters & fallback
    function myBalance() constant public returns (int64) {
        return holdings[msg.sender];
    }

    function holderBalance(address holder) constant public returns (int64) {
        if (msg.sender != owner) {
            return;
        }

        return holdings[holder];
    }

    function outstandingBalance() constant public returns (int64) {
        if (msg.sender != owner) {
            return;
        }

        return outstanding_tokens;
    }

    function() public {
        revert();
    }
}

The environment where I am working is a bit restricted, so the contract does not follow the latest style described in the documentation of the most recent compiler version, but as an example it should work. You can find the latest version here.

Building stuff

Setting up the environment

Dependencies required:

  • nodejs
  • Truffle framework: Framework to create Ethereum smart contracts
  • Ganache: Quickly fire up a personal Ethereum blockchain which you can use to run tests, execute commands, and inspect state while controlling how the chain operates.
  • Metamask: Configures our fake Ether addresses to interact with the apps we are building. It is a Chrome extension.

Installing dependencies

# Installing node
$ sudo apt install nodejs
$ node -v
 
 
# Install Truffle
$ sudo npm install -g truffle
 
 
# Installing Ganache
$ wget https://github.com/trufflesuite/ganache/releases/download/v2.5.4/ganache-2.5.4-linux-x86_64.AppImage
$ chmod a+x ganache-2.5.4-linux-x86_64.AppImage
$ ./ganache-2.5.4-linux-x86_64.AppImage
 
 
# Installing Solidity compiler
$ sudo add-apt-repository ppa:ethereum/ethereum
$ sudo apt update
$ sudo apt install solc
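
With the dependencies in place, the workflow looks roughly like this. It is only a sketch under some assumptions: the SimpleToken contract above is saved under contracts/, Ganache is running, the ‘development’ network in truffle-config.js points at Ganache’s RPC endpoint, and the installed solc version matches the contract’s pragma:

# Compile the contract standalone with solc (writes the ABI and bytecode to ./build)
$ solc --abi --bin contracts/SimpleToken.sol -o build/

# Or let Truffle manage the whole lifecycle
$ truffle init                          # scaffold contracts/, migrations/ and truffle-config.js
$ truffle compile                       # compile everything under contracts/
$ truffle migrate --network development # deploy to the network configured for Ganache
$ truffle console --network development # interact with the deployed contract from a REPL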

Project 1: Memory Game with Blockchain

Memory Game, also known as the Concentration card game or Matching Game, is a simple card game where you need to match pairs by turning over 2 cards at a time. Once a match has been made, we can keep the card forever by adding it to the blockchain.

Elements involved:

  • Smart contract
  • NFT

Source code can be found here.

Project 2: Decentralised Twitter

Just a very basic decentralized Twitter.

Elements involved:

  • Drizzle: A collection of front-end libraries that make writing Dapp user interfaces easier and more predictable.
  • Smart contract

Source code can be found here.

Notes

Ether, Gas, Gas Cost, Fees

  • Ether – the cryptocurrency underpinning Ethereum.
  • Gas – the unit used to measure the execution of your transaction.
  • Gas Cost – the price of one “gas unit” that you are prepared to pay; setting a higher gas cost gets your transaction confirmed faster.
  • Fee – the total (gas * gasCost) you pay to run your transaction (see the example below).
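
For example, a plain ETH transfer consumes 21,000 gas; at a gas cost of 50 gwei, the fee would be 21,000 * 50 gwei = 1,050,000 gwei, which is 0.00105 ETH (1 gwei = 10^-9 ETH). The 50 gwei figure is only illustrative; actual gas prices fluctuate with network demand.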

Tools

  • Ethereum nodes
    • Geth: an Ethereum client, which means that we can run our own private blockchain with it. Command-line based.
    • Parity: Parity Ethereum is a software stack that lets you run blockchains based on the Ethereum Virtual Machine (EVM) with several different consensus engines.
    • Ganache: it allows you to create your own private blockchain, mainly for testing purposes. It has a UI.
  • Cloud environments
    • Infura.io: Infura’s development suite provides instant, scalable API access to the Ethereum and IPFS networks.
    • Microsoft Azure
  • IDEs
    • Normal IDEs. IntelliJ has a plugin for Solidity.
    • JavaScript editors are good for building the tests and any app.
  • Dev environment
    • Web3j: Web3j is a library for working with Smart Contracts and integrating with Ethereum blockchains. This allows you to work with Ethereum blockchains, without the additional overhead of having to write your own integration code for the platform.
    • Embark: The all-in-one developer platform for building and deploying decentralized applications
    • Truffle: Framework to create Ethereum smart contracts
    • Brownie: Brownie is a Python-based development and testing framework for smart contracts targeting the EVM.
  • Tools
    • Etherchain: makes the Ethereum blockchain accessible to non-technical end-users.
    • remix: Remix IDE is an open-source web and desktop application. It fosters a fast development cycle and has a rich set of plugins with intuitive GUIs. Remix is used for the entire journey of contract development as well as being a playground for learning and teaching Ethereum.
    • etherscan: Etherscan is a Block Explorer and Analytics Platform for Ethereum, a decentralized smart contracts platform
    • EthGasStation: ETH Gas Station aims to increase the transparency of gas prices, transaction confirmation times, and miner policies on the Ethereum network.
    • Metamask: MetaMask is a software cryptocurrency wallet used to interact with the Ethereum blockchain. It allows users to access their Ethereum wallets through a browser extension or mobile app.

Blockchain for Development

Type | Tool
Emulators | Ganache, Embark
Lightweight nodes | Ethereumjs-vm, Pyethereum
Local Regular Blockchains | Geth, Parity
Hosted Nodes or Chains | Infura, Azure
Public Testing Blockchains | Rinkeby, Ropsten
Public Blockchain | Mainnet

References

What Is Blockchain?

Blockchain explained

Blockchain technology defined

Mastering Ethereum (book)

Ethereum Whitepaper

What is Ethereum and How does it work?

Understanding Ethereum

Intro to Ethereum programming (video)

Blockchain, Ethereum and Smart Contracts