Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. In general, the data has greater variety, originates from more sources, and arrives in increasing volumes and at higher velocity. Big Data matters because these massive volumes of data can be used to address business problems that were previously intractable, and to gain insights and make predictions that were out of reach before.
While Big Data was initially defined around the concepts of volume, velocity and variety, two more concepts have since been added: value and veracity. All data can initially be considered noise until its intrinsic value has been discovered. In addition, to discover this value and utilise the data, we need to consider how reliable and truthful it is.
Several factors have driven the growing importance of Big Data:
- The availability of cheap storage.
- Cloud Computing offers truly elastic scalability.
- The advent of the Internet of Things (IoT) and the gathering of data related to customer usage patterns and product performance.
- The emergence of Machine Learning.
Big Data poses challenges not only around data storage but also around curation. For data to be useful, it needs to be clean and relevant to the organisation in a way that enables meaningful analysis. Data scientists spend a large share of their time curating data so that data analysts can draw sound conclusions from it.
Part of this curation process is influenced by the source of the data and its format. We can categorise our data into three main types:
- Structured data: Any data that can be stored, accessed and processed in a fixed format is called structured data.
- Unstructured data: Any data with no predefined structure or model, such as free text, images or audio, is called unstructured data.
- Semi-structured data: This refers to data that does not conform to a fixed database schema but nevertheless contains tags or markers that isolate individual elements within the data; JSON and XML documents are common examples.
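The three categories above can be made concrete with a minimal Python sketch; the sample records (a CSV sales row, a JSON customer note, a free-text complaint) are hypothetical and purely illustrative:

```python
import csv
import io
import json

# Structured: a fixed schema -- every record has the same known columns
structured = io.StringIO("customer_id,amount\n42,19.99\n")
rows = list(csv.DictReader(structured))

# Semi-structured: no rigid schema, but self-describing tags (JSON keys)
# still isolate the individual elements inside the data
semi = json.loads('{"customer_id": 42, "tags": ["vip"], "note": "called twice"}')

# Unstructured: free text with no internal markers at all
unstructured = "Customer called to complain about a late delivery."

print(rows[0]["amount"])  # fields addressable by column name
print(semi["tags"])       # elements addressable by key
```

The practical difference is how much work is needed before analysis: structured data is query-ready, semi-structured data needs parsing but keeps its labels, and unstructured data needs interpretation before any element can be isolated.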
As mentioned before, Big Data gives us new insights that open up new opportunities and business models. Getting started involves three key actions:
- Integration: Big data brings together data from many disparate sources and applications. Traditional data integration mechanisms, such as extract, transform, and load (ETL), are generally not up to the task; analysing big data sets at terabyte, or even petabyte, scale requires new strategies and technologies. During integration, you need to bring in the data, process it, and make sure it is formatted and available in a form that your business analysts can get started with.
- Management: Big data requires storage. Where should the data be stored: in the cloud, on-premises, or in a hybrid solution? And in what form should it be stored so that on-demand processes can work with it?
- Analysis: To obtain any kind of insight, the data needs to be analysed and acted on. We can build visualisations to clarify the meaning of data sets, mix and explore different data sets to make new discoveries, share our findings with others, or build data models with machine learning and artificial intelligence.
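The integration step can be sketched as a small extract-transform-load pipeline. This is a minimal illustration, not a production design: the raw events, field names (`user`, `spend`, `ts`) and the dropping of malformed records are all assumptions standing in for real sources and real curation rules:

```python
import json

# Hypothetical raw events arriving from disparate sources as JSON lines
raw_events = [
    '{"user": "alice", "spend": "12.50", "ts": "2023-01-05"}',
    '{"user": "bob", "spend": "bad-value", "ts": "2023-01-06"}',
    '{"user": "alice", "spend": "7.25", "ts": "2023-01-07"}',
]

def extract(lines):
    """Bring the data in: parse each raw line into a record."""
    return [json.loads(line) for line in lines]

def transform(records):
    """Curate: enforce consistent types and drop malformed records."""
    clean = []
    for r in records:
        try:
            r["spend"] = float(r["spend"])
        except ValueError:
            continue  # unusable record; a real pipeline might quarantine it
        clean.append(r)
    return clean

def load(records):
    """Make the data available in an analyst-friendly form: spend per user."""
    totals = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["spend"]
    return totals

print(load(transform(extract(raw_events))))  # → {'alice': 19.75}
```

At terabyte or petabyte scale the same three stages survive, but each one is distributed across many machines, which is exactly where the traditional single-node ETL tools fall short.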
Some best practices when working in the Big Data space are:
- Align big data with specific business goals: More extensive data sets enable us to make new discoveries. To that end, it is important to ground new investments in skills, organisation, or infrastructure in a strong business-driven context to guarantee ongoing project investments and funding. To determine if we are on the right track, ask how big data supports and enables your top business and IT priorities. Examples include understanding how to filter web logs to understand e-commerce behaviour, deriving sentiment from social media and customer support interactions, and understanding statistical correlation methods and their relevance for customer, product, manufacturing, and engineering data.
- Ease skills shortage with standards and governance: One of the biggest obstacles to benefiting from our investment in big data is a skills shortage. You can mitigate this risk by ensuring that big data technologies, considerations, and decisions are added to your IT governance program. Standardizing your approach will allow you to manage costs and leverage resources. Organizations implementing big data solutions and strategies should assess their skill requirements early and often and should proactively identify any potential skill gaps. These can be addressed by training/cross-training existing resources, hiring new resources, and leveraging consulting firms.
- Optimise knowledge transfer with a centre of excellence: Use a centre of excellence approach to share knowledge, control oversight, and manage project communications. Whether big data is a new or expanding investment, the soft and hard costs can be shared across the enterprise. Leveraging this approach can help increase big data capabilities and overall information architecture maturity in a more structured and systematic way.
- The top payoff is aligning unstructured with structured data: It is certainly valuable to analyse big data on its own. But you can bring even greater business insights by connecting and integrating low-density big data with the structured data you are already using today. Whether you are capturing customer, product, equipment, or environmental big data, the goal is to add more relevant data points to your core master and analytical summaries, leading to better conclusions. For example, there is a difference in distinguishing all customer sentiment from that of only your best customers. This is why many see big data as an integral extension of their existing business intelligence capabilities, data warehousing platform, and information architecture. Keep in mind that the big data analytical processes and models can be both human- and machine-based. Big data analytical capabilities include statistics, spatial analysis, semantics, interactive discovery, and visualisation. Using analytical models, you can correlate different types and sources of data to make associations and meaningful discoveries.
- Plan your discovery lab for performance: Discovering meaning in your data is not always straightforward. Sometimes we do not even know what we are looking for, and that is expected. Management and IT need to support this "lack of direction" or "lack of clear requirement".
- Align with the cloud operating model: Big data processes and users require access to a broad array of resources for both iterative experimentation and running production jobs. A big data solution includes all data realms including transactions, master data, reference data, and summarized data. Analytical sandboxes should be created on demand. Resource management is critical to ensure control of the entire data flow including pre- and post-processing, integration, in-database summarization, and analytical modelling. A well-planned private and public cloud provisioning and security strategy plays an integral role in supporting these changing requirements.
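The idea of aligning unstructured with structured data can be illustrated with a toy example: enriching structured customer master records with a sentiment score derived from unstructured support messages, so that "all customer sentiment" can be distinguished from that of only the best customers. The customer records, messages, and the deliberately naive word-list scorer are all hypothetical; a real pipeline would use a proper sentiment model:

```python
# Structured master data (illustrative records)
customers = {
    101: {"name": "Acme Ltd", "tier": "best"},
    102: {"name": "Widget Co", "tier": "standard"},
}

# Unstructured support messages, keyed to the master record they concern
messages = [
    (101, "great service, very happy"),
    (102, "slow and disappointing experience"),
]

POSITIVE = {"great", "happy", "good"}
NEGATIVE = {"slow", "disappointing", "bad"}

def naive_sentiment(text):
    """Toy scorer: positive word count minus negative word count."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

# Integration: attach the derived score as a new data point on the master record
for cid, text in messages:
    customers[cid]["sentiment"] = naive_sentiment(text)

# Now we can segment sentiment by an attribute only the structured data holds
best = [c["sentiment"] for c in customers.values() if c["tier"] == "best"]
print(best)  # → [2]
```

The payoff is in the last line: the sentiment score comes from the unstructured text, but the ability to restrict it to "best" customers comes from the structured master data; neither source alone supports that question.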