Scalable Software and Big Data Architecture - Big Data and Analytics Architectural Patterns

Articles in This Series

  1. Application Types, Requirements, and Components
  2. Software Architectural Patterns and Design Patterns
  3. Big Data and Analytics Architectural Patterns

Introduction

Welcome to the third and final article in a multi-part series about the design and architecture of scalable software and big data solutions.

In this article, we’ll focus on architectural patterns associated with big data and analytics applications.

Big Data and Analytics, An Overview

Big data is a bit of an overused buzzword, but it’s definitely a useful term. With the explosion of high volume, high variety, and high velocity data sources and streams (i.e., the 3 Vs), the term big data has become popularized to represent the architectures, tools, and techniques created to handle these increasingly intensive requirements.

Most architectural patterns associated with big data involve data acquisition, integration, ingestion, processing (transformation, aggregation, …), storage, access (e.g., querying), and analytics. The ordered combination of these stages and associated components is usually called a data pipeline.
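
To make the pipeline idea concrete, here is a minimal sketch in Python in which each stage is a simple function; the stage boundaries and the record format are illustrative assumptions, not a prescription.

```python
# A minimal, illustrative data pipeline: ingest -> process -> store -> access.
# The record format and stage boundaries here are simplified assumptions.

def ingest():
    # Acquisition/ingestion: in practice this might read from a queue, API, or log.
    return [{"user": "a", "amount": 10.0}, {"user": "b", "amount": -1.0}]

def process(records):
    # Processing: cleaning (drop bad values) and transformation (aggregate per user).
    clean = [r for r in records if r["amount"] >= 0]
    totals = {}
    for r in clean:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

def store(totals, sink):
    # Storage: persist the results; here the "sink" is just an in-memory dict.
    sink.update(totals)

def query(sink, user):
    # Access/analytics: answer a question against the stored data.
    return sink.get(user, 0.0)

sink = {}
store(process(ingest()), sink)
print(query(sink, "a"))  # -> 10.0
```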

The field of data analytics is not new, but it’s probably more important than ever. Data is being generated in unprecedented quantities by businesses, sensors, applications, and so on. Once data is generated and stored (i.e., persisted), it’s usually used for one of two primary purposes.

The first is that data is used in a transactional or operational sense, which is best described by the acronym CRUD. Data records, objects, and so on are created, read, updated, and deleted as needed to support a given application and its users’ intended interactions.
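
As a minimal illustration of CRUD, the sketch below uses Python's built-in sqlite3 module; the table and column names are purely illustrative.

```python
import sqlite3

# Create, read, update, and delete a record in a toy "users" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))                 # Create
print(conn.execute("SELECT id, name FROM users").fetchall())                  # Read
conn.execute("UPDATE users SET name = ? WHERE name = ?", ("Grace", "Ada"))    # Update
conn.execute("DELETE FROM users WHERE name = ?", ("Grace",))                  # Delete
conn.commit()
conn.close()
```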

The second, and part of the focus of this article, is when data is used to:

  • Extract useful information and actionable insights
  • Make predictions and recommendations
  • Identify patterns and trends
  • Drive business decisions
  • Track and report key performance indicators (KPIs)
  • And much more.


With that, let’s look at specific architectural patterns associated with big data and analytics.

Big Data and Analytics Architectural Patterns

Some solution-level architectural patterns include polyglot, lambda, kappa, and IoT-A, while other patterns are specific to particular technologies, such as data management systems (e.g., databases), and so on.

Given the so-called data pipeline and different stages mentioned, let’s go over specific patterns grouped by category.

Data storage and modeling

All data must be stored. It can be stored on physical disk (e.g., flat files, B-trees), in memory, in distributed file systems (e.g., HDFS), and so on.

Data storage systems are usually classified as being either relational (RDBMS), NoSQL, or NewSQL. This designation is made based on the data model, physical storage strategy, query language, querying capabilities, CAP tradeoffs, and so on (see How to Choose the Right Database System: RDBMS vs. NoSql vs. NewSQL).

Databases can also be further classified by their usage in an overall solution. Here is a list of common data storage classifications.

  • Operational data store (ODS)
  • Data warehouse
  • Data lake
  • Data mart
  • OLAP and OLTP/TDS
  • Master data store
  • In-memory
  • File system
  • Distributed file system (e.g., HDFS)


Some of these systems are better suited for transactional and event-driven data storage (e.g., OLTP), while others are more analytics focused (OLAP, data warehouse, …).

Each type of storage typically involves one or more data models. The term data model describes the logical representation of the data, as opposed to the physical representation and storage of the data, which is handled by the data management system.

Typical data models and schemas include the following; a short sketch of a few of these models follows the list:

  • Relational for RDBMS systems
  • NoSQL
    • Key value
    • Document
    • Wide-column
    • Graph
  • NewSQL
    • Hybrid of relational and NoSQL
  • Data Warehouse and OLAP
    • Star schema
    • Snowflake schema
    • OLAP cube


Data acquisition, ingestion, and integration

Data can be acquired and collected in many different ways, including physical sensor measurements (e.g., IoT devices), software transactions and events, server logging, web and mobile app usage tracking, and so on.

Acquired data is usually stored in a data storage system (i.e., database) of some sort, and may also need to be ingested into a data pipeline and ultimately another data store (e.g., data warehouse). The data may also need to be combined (integrated) with data from other sources as well.

Once data has been generated physically by sensors or by software code, data can be ingested into a database or data pipeline in a variety of ways. Here is a list of some common data architectural patterns.

  • Messaging and message queues
    • Using patterns like message oriented middleware (MOM), pub/sub, …
  • Query-based (i.e., query and extract data from other data sources)
  • Event-based
  • APIs
  • Change data capture (CDC)


In some patterns, data is pushed to a database or pipeline using an event-based messaging system or through API calls, whereas in others data is pulled by querying other data sources.
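
As a minimal sketch of the push-based case, the example below uses an in-process queue as a stand-in for a message broker (a real pipeline would use something like Kafka or RabbitMQ); the event and store names are illustrative.

```python
import queue

# A toy stand-in for a message broker topic; producers push, consumers pull.
topic = queue.Ueue() if False else queue.Queue()

def produce(event):
    # Push-based ingestion: the source publishes events as they occur.
    topic.put(event)

def consume_into(store):
    # The pipeline subscribes and ingests each event into a data store.
    while not topic.empty():
        store.append(topic.get())

produce({"type": "page_view", "user": "a"})
produce({"type": "purchase", "user": "b", "amount": 9.99})

landing_store = []
consume_into(landing_store)
print(landing_store)
```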

Data availability, performance, and scalability

Once data is stored, applications and/or end users will likely need to access it. Three of the most important requirements of a data system at the solution level, particularly given the expected load on it, are availability, performance, and scalability. These three requirements are related and impact one another.

Availability means that every request receives a response, without a guarantee that it contains the most recent version of the information. Availability here implicitly refers to database nodes that are in a non-failing state and that return the requested information rather than an error.

Performance describes the speed with which data operations (e.g., reads, writes, …) are executed successfully. Research has shown that slow application speeds can cause significant losses in revenue, as well as degrade customers’ experience and satisfaction, ultimately leading to abandonment and churn. Highly performant systems are therefore very important.

Lastly, scalability refers to a system’s ability to maintain a given degree of availability and performance as the load (e.g., concurrent user requests) imposed on the system grows, typically by adding resources.

There are many architectural patterns used to address these three requirements. Here is a list of some of them, followed by a short sharding sketch.

  • Shared-nothing and shared-everything/shared-disk architectures
  • Sharding
  • Replication
  • Distributed and parallel computing and storage
  • Load balancing
  • Horizontal and/or vertical scaling


Data processing and movement

Data processing is a stage of the data pipeline that involves data cleaning, dealing with missing or bad values, transformations, metrics calculations, and so on.

Data processing is typically categorized as batch, real-time, near real-time, or streaming. Batch processing involves processing data in bulk, either as a single job or as a recurring process; depending on the amount of data involved, it is usually not a fast process.

Real-time processing, on the other hand, deals with data as it moves through the pipeline, and the data often flows directly from the processing stage into some form of presentation (dashboard, notification, report, …), persistent storage, and/or a response to a request (point-of-sale credit card approval, bank ATM transaction, …). Real-time processing is characterized by very fast processing where minimal delay is critical.

Near real-time (NRT) processing differs in that the delay introduced by data processing and movement is large enough that the term real-time is no longer accurate. Some sources characterize near real-time by a delay of several seconds to several minutes, where that delay is acceptable for the given application.

The data processing stage can be accompanied by a data movement stage, usually called extract, transform, load (ETL) or extract, load, transform (ELT) depending on the situation. In the case of ETL, data is extracted from one or more data sources, often loaded into temporary storage (staging), transformed by processing logic, and then loaded into its final storage place. In the case of ELT, the raw data is loaded into the target system first and transformed there.
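
Here is a minimal ETL sketch along those lines using only Python's standard library; the source data, staging step, and target table are illustrative assumptions.

```python
import csv, io, sqlite3

# Extract: read raw rows from a source (a CSV string standing in for a real source).
raw = io.StringIO("user,amount\na,10.0\nb,not_a_number\na,5.0\n")
staged = list(csv.DictReader(raw))  # staging area

# Transform: clean bad values and aggregate per user.
totals = {}
for row in staged:
    try:
        totals[row["user"]] = totals.get(row["user"], 0.0) + float(row["amount"])
    except ValueError:
        continue  # drop rows that fail cleaning

# Load: write the transformed results into the target (warehouse-style) table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE user_totals (user TEXT, total REAL)")
warehouse.executemany("INSERT INTO user_totals VALUES (?, ?)", totals.items())
print(warehouse.execute("SELECT * FROM user_totals").fetchall())
```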

The final data store is usually a data warehouse, data lake, OLAP system, or analytics database of some sort.

Data access, querying, analytics, and business intelligence (BI)

We finish the data architecture discussion with patterns associated with data access, querying, analytics, and business intelligence.

Data isn’t really useful if it’s generated, collected, and then stored and never seen again. Most data contains very useful information that can be extracted in order to gain actionable insights, drive business decisions, make predictions and recommendations, and so on.

Data is typically accessed by creating a connection to a datastore, querying the data (e.g., using SQL) to retrieve a specific subset of it, and then finally performing analytics of some type on the data. Analytics can be in the form of statistical analysis, data visualization, machine learning, artificial intelligence, predictive analytics, and so on.
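
A minimal sketch of that connect, query, and analyze flow, again using sqlite3 along with Python's statistics module; the schema and the sample rows are illustrative.

```python
import sqlite3, statistics

# Connect to a data store (in-memory here) and load a few sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 10.0), ("b", 20.0), ("a", 30.0)])

# Query: retrieve the subset of data we care about.
amounts = [row[0] for row in conn.execute("SELECT amount FROM orders WHERE amount > 5")]

# Analyze: a simple descriptive statistic over the retrieved data.
print(statistics.mean(amounts))  # -> 20.0
```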

Business intelligence refers to the extraction of useful information from data to drive business decisions that help achieve business goals. Usually business intelligence is carried out using a so-called BI tool, which is software that can connect to different data sources in order to query, visualize, and perform analysis on data.

The job roles that typically carry out the described analytics tasks depend on the complexity (statistical, mathematical, and algorithmic) of the work and the amount of computer programming required. Data scientists usually perform very technical analytics tasks that require significant programming, whereas roles with a variation of an analyst title (e.g., data analyst) usually leverage business intelligence tools (Excel, Tableau, Looker, …).

Here is a list of some common architectural patterns in this category, followed by a short Spark sketch.

  • Distributed and parallel querying, processing, and analytics (e.g., Apache Spark)
  • Access patterns, languages, and tools
    • Structured query language (SQL)
    • MapReduce
    • Spark resilient distributed datasets (RDD), SparkSQL, and Spark DataFrames
    • Hive
    • Pig
    • Impala
    • Sqoop
  • Processing categories
    • Streaming
    • Real-time
    • Near real-time
    • Batch
  • Analytics types
    • Descriptive
    • Predictive
    • Prescriptive
    • Business intelligence (BI)
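
Here is the Spark sketch referenced above, showing a simple distributed aggregation with PySpark; it assumes the pyspark package is installed locally, and the input file path is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (in a cluster this would point at the cluster master).
spark = SparkSession.builder.appName("analytics-sketch").getOrCreate()

# Read a (hypothetical) CSV file into a distributed DataFrame.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# A simple distributed aggregation: total amount per customer.
totals = orders.groupBy("customer").agg(F.sum("amount").alias("total"))
totals.show()

spark.stop()
```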


To learn more about analytics, and machine learning in particular, check out my five-part series called Machine Learning: An In-Depth Guide.

Summary

At this point, and to conclude this series, we’ve covered virtually all major aspects of scalable software and big data architecture at a high level.

We’ve discussed different application types, requirements, and components. We also covered many software architectural and design patterns. We finished this series with a solid overview of many of the widely used architectural patterns and styles found in scalable big data and analytics solutions.

I will discuss each of the patterns given in this article in greater depth in an upcoming article outside of this series.

Thank you, and happy learning!