
Advanced Analytics Packages, Frameworks, and Platforms by Scenario or Task

Articles in This Series

  1. Python vs R for Artificial Intelligence, Machine Learning, and Data Science
  2. Production vs Development Artificial Intelligence and Machine Learning
  3. Advanced Analytics Packages, Frameworks, and Platforms by Scenario or Task

Introduction

Welcome to the third and final article in this three-part series. In the second article, we had an in-depth discussion of production vs. development artificial intelligence and machine learning.


In this article, we’ll cover which programming languages, software packages (aka libraries), frameworks, and/or platforms to use in the context of tasks and considerations such as:

  • Local or remote development
  • Production vs. development environments
  • Single server vs. scalable and/or distributed computing
  • Statistical and exploratory data analysis
  • Advanced analytics, including artificial intelligence and machine learning
  • Big data and real-time analytics
  • Data mining and business intelligence
  • Data instrumentation and logging


The article is broken down by different use cases (aka scenarios) or tasks, and recommendations will be given where applicable.


These use cases cover situations where data science tasks (analytics, machine learning, artificial intelligence) are carried out on a local physical machine (desktop, laptop, or server) by a data scientist, for instance, or on remote physical or virtual machines (single server or distributed system) via a cloud infrastructure such as Amazon Web Services (AWS).

We’ll further differentiate between performing these tasks for development or production purposes, where production implementations and deployments have a unique set of considerations, challenges, and requirements. For more on that, check out the previous article in this series, and Dataiku has written multiple white papers on this topic that are definitely worth checking out as well.

Note that these lists aren’t exhaustive, and there is no particular preference implied by the order of the items below. Also, everything listed leans more towards open-source software, as opposed to corporate or enterprise solutions (with some exceptions).

Given that discussion or comparison of many of the items below could warrant an entire article, this article is meant to provide a stepping stone for the reader to do further research based on their particular interest and needs.

Single Machine (Local or Remote) for Analysis: Statistical, Data, and Exploratory (EDA)

In this section, we are focused mainly on statistical analysis, exploratory data analysis (EDA), and data analysis in general.

Performing these tasks, and those in the sections throughout this article, involves using the languages and packages below to carry out common data operations such as loading, querying, parsing, munging/wrangling, filtering, sorting, aggregating, visualizing, and exploratory analysis.

Both Python and R are well suited to this scenario; here are some recommended packages grouped by language, followed by a short Python example.

Python

  • Reproducible research and notebooks
    • Jupyter
  • Data reading, parsing, munging, wrangling, and numerical
    • pandas
    • numpy
    • scipy
  • Data visualization
    • matplotlib
    • seaborn
    • bokeh
    • Altair
  • Advanced statistical analysis
    • statsmodels
  • Data scraping
    • Scrapy
    • Beautiful Soup
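To make this concrete, here is a minimal sketch of this kind of workflow using a few of the Python packages above. The file name and column names are hypothetical, included only for illustration:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load a hypothetical CSV file (file and column names are made up for illustration)
df = pd.read_csv("sales_data.csv", parse_dates=["order_date"])

# Quick structural checks: shape, types, summary statistics, missing values
print(df.shape)
print(df.dtypes)
print(df.describe())
print(df.isna().sum())

# Filtering, aggregating, and sorting
monthly_revenue = (
    df.loc[df["revenue"] > 0]
      .assign(month=lambda d: d["order_date"].dt.to_period("M"))
      .groupby("month")["revenue"]
      .sum()
      .sort_values(ascending=False)
)
print(monthly_revenue.head())

# A quick visualization of the revenue distribution
sns.histplot(df["revenue"], bins=30)
plt.show()
```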


R

  • Reproducible research and notebooks
    • R Markdown
    • Knitr
  • Data reading, parsing, munging, and wrangling
    • dplyr
    • plyr
    • stringr
    • tidyr
    • lubridate
    • reshape2
  • Data visualization
    • ggplot2
    • ggvis
    • htmlwidgets
    • lattice
    • googleVis
    • rCharts
  • Advanced statistical analysis
    • zoo
  • Interactive visualization and analysis
    • Shiny


Single Machine (Local or Remote) For Artificial Intelligence, Machine Learning, and Advanced Analytics Development

This use case is where a person such as a data scientist performs artificial intelligence, machine learning, and advanced analytics tasks in a development environment.

In many cases, this scenario is intended to extract information, answer questions, identify patterns and trends, or derive actionable insights, but without necessarily creating a deliverable to be deployed to a production environment.

That said, everything listed in this section can certainly apply to tasks with the goal of developing and deploying a high-performing model to a production environment as well. In addition to the languages (Python and R) and packages listed above (e.g. pandas, numpy, …), the following are well suited to this scenario; a short example follows the Python list.

Python

  • General machine learning and advanced statistics
    • scikit-learn
    • statsmodels
    • PyMC3
  • Neural networks, deep learning, and general AI
    • Keras
    • TensorFlow
    • Theano
    • Caffe and Caffe2
    • MXNet
    • Torch
    • PyBrain
    • Nervana neon
  • Natural language processing (NLP)
    • NLTK
    • spaCy
    • TextBlob
    • Gensim
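As a small, hedged sketch of the sort of development workflow these packages support, here is an illustrative example using scikit-learn on synthetic data (the model and parameters are arbitrary choices, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# Synthetic data standing in for a real feature matrix and target
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set for final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a baseline model and estimate generalization performance with cross-validation
model = RandomForestClassifier(n_estimators=200, random_state=42)
print(cross_val_score(model, X_train, y_train, cv=5).mean())

# Final fit and evaluation on the held-out test set
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```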


R

  • General machine learning and advanced statistics
    • caret
    • randomForest
    • e1071
    • glmnet
    • rpart
    • gbm
    • kernlab
    • tree
    • party
  • Neural networks, deep learning, and general AI
    • nnet
    • neuralnet
  • Natural language processing (NLP)
    • wordcloud
    • tm


Single or Multiple Production Servers (Non-distributed) For Artificial Intelligence, Machine Learning, and Advanced Analytics

This scenario covers the situation where artificial intelligence, machine learning, and advanced analytics solutions are deployed to one or more production servers. In the case of multiple servers, this scenario does not cover the distributed case, which is discussed in the next section. All languages and packages covered thus far are applicable to this scenario.

Non-distributed, multiple production servers can be employed for scalability purposes in order to handle a large number of concurrent requests that require processing data through a production model (e.g. to make a prediction). Architecturally, this can be done by deploying the same model to many different machines and distributing the requests through routing (e.g. load balancing), or by spinning up ephemeral, serverless workers using a technology such as AWS Lambda.

We briefly mentioned in the previous article that deploying data science code and predictive models to production is a matter of either plugging functionality into an existing framework (e.g. as a service or microservice), or translating code into something more compatible with what is already in use. Keep in mind, however, that the latter option can be very costly and time-consuming.

There are certainly benefits (e.g. consistency and simplicity) to deploying solutions using the same languages and tools used during development. One way to do this is to modularize, or package, functionality as a service or microservice, which can be leveraged by a wide variety of applications such as a web application, SaaS platform, and so on. In this case, deploying new or updated production code can be a relatively simple matter of deploying and loading an encoded file containing new model coefficients for an existing solution in production.
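As one possible (hypothetical) sketch of this approach, a trained model serialized during development could be exposed as a small prediction service, for example with Flask and joblib; the model file name and JSON payload shape below are assumptions for illustration:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained and serialized model artifact at startup
# ("model.joblib" is a hypothetical file produced during development)
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload like {"features": [[1.0, 2.0, ...], ...]}
    payload = request.get_json()
    predictions = model.predict(payload["features"]).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

With this pattern, updating the production model can be as simple as replacing the serialized artifact and restarting or redeploying the service.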

Regardless of the language ecosystem chosen, environment and package management is critical when deploying Python and R solutions, since both will likely rely on a significant number of dependent and versioned packages. Making sure code and packages are properly and regularly updated is a definite consideration. There are many ways to handle this, including using packaging systems to wrap and deploy code and all dependent libraries.

Package and environment managers such as conda (bundled with the Anaconda and Miniconda distributions) can help with this, and also allow the user to export and import so-called virtual environments, which define the versions of languages and packages to install in a given environment. Python also has a virtualenv tool that can create isolated virtual environments if the user is not using a tool like conda.

Other options include pushing trusted packages to production from a curated and maintained master repository for additional environment control, or creating builds or binaries that are then deployed to production. There are many binary-building tools available for this.

In all the cases described, it’s worth noting that there is a relatively significant amount of devops-related work associated with deploying and maintaining solutions in production.

Scalable Production Machine Learning, Artificial Intelligence, and Analytics

As previously discussed in this series and other articles of mine, scalability can be achieved in many different ways (e.g. distributed computing, load balancing).

In addition to using inherently distributed software frameworks like Hadoop or Spark, there are multiple other options for achieving scalability. One is to create and implement a custom solution, including the required devops and system administration work. This can be very time- and cost-intensive, and can require specialized skill sets.

Another option is to leverage a platform as a service (PaaS) or infrastructure as a service (IaaS) provider like Amazon Web Services (AWS) or Google Cloud Platform (GCP). These platforms help abstract away many of the complexities associated with devops, site reliability engineering (SRE), system and network administration, and so on.

A particularly interesting and useful technology to consider is on-demand, scalable, serverless computing such as AWS Lambda. This allows for highly scalable and dynamic computing power, but without the creation, maintenance, and general overhead of running a complete server application (e.g. written in Node.js).
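As a rough, hypothetical sketch, a Lambda-style prediction function can be little more than a single handler; the model path and event shape below are assumptions (in practice the model artifact would typically be pulled from object storage such as S3):

```python
import json
import joblib

# Load the model once per container, outside the handler, so warm invocations reuse it
# ("/opt/model.joblib" is a hypothetical path, e.g. from a layer or downloaded from S3)
model = joblib.load("/opt/model.joblib")

def lambda_handler(event, context):
    # Assume the event body carries a JSON payload like {"features": [[...], ...]}
    payload = json.loads(event["body"])
    predictions = model.predict(payload["features"]).tolist()
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": predictions}),
    }
```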

Lastly, there is a growing number of scalable, specialized APIs being made available as a service that provide various artificial intelligence, machine learning, and advanced analytics-related functionality.

Here are some common packages, frameworks, platforms, and APIs used for scalable, production machine learning and artificial intelligence applications.

Packages, Frameworks, and Platforms

  • Apache Spark MLlib
  • Databricks
  • H2O
  • Apache Mahout
  • Apache SystemML
  • DataRobot
  • BigML
  • OpenNN
  • VELES
  • AWS Deep Learning AMIs
  • Amazon Machine Learning
  • Amazon EMR
  • Microsoft Azure Machine Learning


APIs

  • AWS
    • Lex
    • Rekognition
    • Polly
  • Google Cloud Platform (GCP)
    • Machine Learning Engine
    • Jobs
    • Video Intelligence
    • Vision
    • Speech
    • Natural Language
    • Translation
  • Microsoft Cognitive Services
    • Vision
    • Speech
    • Language
    • Knowledge
    • Search
  • Miscellaneous
    • Houndify
    • API.AI
    • Wit.ai
    • Clarifai
    • AlchemyAPI
    • Diffbot
    • IBM Watson
    • PredictionIO


SDKs

  • ai-one


Big Data and Real-Time Analytics

This section discusses scenarios that fall under the admittedly overhyped term big data, as well as real-time analytics solutions.

Big data is a relative term that applies to data sets large enough to require an entity (person, company, etc.) to leverage specialized hardware, software, processing techniques, visualization, and database technologies in order to solve the problems associated with the 3Vs (volume, variety, and velocity), as well as extended models that add further Vs.

Here is a list of popular and powerful software packages, frameworks, and databases (aka database management systems) categorized by different components of typical big data and real-time analytics solutions.

Data Acquisition, Ingestion, Integration, and Messaging

  • Apache Kafka
  • Apache Spark Streaming
  • Apache Sqoop
  • Apache Flume


Data Processing and Streaming

  • Apache Spark
  • AWS Kinesis
  • Apache Storm
  • Apache Flink


Data Storage and Management

  • Big data stores
    • Apache Hadoop
    • Enterprise and service-based Hadoop
      • Cloudera
      • Hortonworks
      • MapR
    • Apache HBase
  • Relational (RDBMS)
    • PostgreSQL
    • MySQL
    • Amazon RDS
    • Amazon Aurora
  • NoSQL
    • MongoDB
    • Cassandra
    • Redis
    • Amazon DynamoDB
    • Couchbase
    • CouchDB
    • Google Cloud Bigtable
  • Data warehouse and analytics databases
    • Amazon Redshift
    • Snowflake
    • Google BigQuery
  • Search engine
    • Elasticsearch
    • Solr
    • Splunk
  • Graph
    • Neo4j
    • Titan
    • Giraph
  • Time-series
    • InfluxDB
    • Druid
    • Prometheus
  • In-memory (as covered in The Forrester Wave™: In-Memory Databases, Q1 2017)
    • Oracle TimesTen
    • SAP HANA
    • Teradata Intelligent Memory (TIM)
    • Microsoft SQL Server
    • IBM DB2
    • Red Hat JBoss Data Grid
    • Redis
    • Couchbase Server
    • Aerospike
    • DataStax
    • VoltDB
    • MemSQL
    • Starcounter


Data Access and Querying

  • Apache Hadoop MapReduce
  • Apache Hive
  • Apache Pig
  • Apache Drill
  • Presto
  • Apache Lens
  • Apache Kylin


ETL/ELT

  • Stitch
  • Fivetran
  • Singer


Data Mining and Business Intelligence (BI)

Data scientists tend to perform much of their data analysis and other tasks by writing custom software or code. Analysts, data miners, and other similar roles tend to use pre-packaged software instead. Note that data scientists may also use some of the software listed below for quick-and-dirty analysis and visualization.

Data mining is a relatively broad term that includes components of machine learning, statistical and general data analysis, data visualization, and so on. Since data is required, data mining also involves databases, data management and processing, and other data handling techniques as already discussed.

Wikipedia highlights the goal of data mining as the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. Many of the techniques involved are of the unsupervised machine learning variety, and include clustering, anomaly detection, and so on.
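As a small, hedged illustration of these unsupervised techniques, here is a sketch using scikit-learn on synthetic data (the data and cluster count are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

# Synthetic data standing in for records mined from a large data set
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))

# Clustering: group similar records together
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])

# Anomaly detection: flag records that look unlike the rest
iso = IsolationForest(random_state=42).fit(X)
print((iso.predict(X) == -1).sum(), "records flagged as anomalies")
```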

Business intelligence is similar to data mining in that specialized software is used for the extraction of patterns and knowledge from data, but where the data is specific to a business (e.g. operations, CRM, marketing, ERP, sales). BI’s ultimate goals are to generate actionable insights, drive data-based decision making, make key predictions, and achieve top-level business goals and KPIs in general (e.g. increase revenue, increase customer acquisition).

Because of the volume of data generated, and the many different data sources available to most businesses, data is often collected, integrated, and stored in a database system known as a data warehouse or data lake. This results in a comprehensive, single “source of truth” database that can be leveraged for analytics, decision support, and so on.

In many cases, subsets of the data from a data warehouse or data lake can be exposed independently to different functional groups, or departments of an organization. These independent subsets of data are usually called data marts, and can include data specific to sales, marketing, operations, and so on.

Data Mining

  • RapidMiner
  • KNIME
  • Weka
  • Orange


Business Intelligence

  • Plot.ly
  • Tableau
  • Looker
  • Periscope
  • Mode Analytics
  • Amazon QuickSight
  • Qlik
  • SlamData (for MongoDB)
  • Google Data Studio


Data Instrumentation and Logging

This final section lists technologies available for instrumentation (i.e., data measurement) and data logging. Data instrumentation and logging can be implemented on the front end (UI) and/or back end (server) of any software application, with the goal of storing the data so it can be made available for various forms of analytics. Instrumentation and measurement also apply to data generated by sensors in the case of the internet of things (IoT).

For web and mobile applications, a common technique is to place a tracking tag (snippet) within the pages of the application’s front end and to track user actions and events within the app (e.g. click, swipe, pageview) using code written in JavaScript.

Server-side and other application execution and operational data is often logged as well. This logging can be very useful for application/server monitoring and health status, application performance metrics (e.g. response time), API usage and processing, transaction and CRUD auditing, user actions and engagement auditing, error handling, application troubleshooting, and the list goes on.
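On the back end, a minimal sketch of this kind of operational logging in Python might look like the following (the event names and fields are assumptions for illustration):

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("app")

def log_event(event_type, **fields):
    # Emit a structured (JSON) log line so downstream tools (e.g. Logstash, Splunk) can parse it
    logger.info(json.dumps({"event": event_type, "timestamp": time.time(), **fields}))

# Hypothetical usage within an API endpoint
start = time.time()
# ... handle the request ...
log_event("api_request", endpoint="/predict", status=200,
          response_time_ms=(time.time() - start) * 1000)
```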

Here are some relevant technologies, and note that some of these also offer functionality like data integration, processing, and analytics.

  • Google Analytics
  • Segment
  • Snowplow
  • Mixpanel
  • Heap
  • Kissmetrics
  • Keen IO
  • Splunk
  • Sumo Logic
  • New Relic
  • Loggly
  • ELK stack
    • Elasticsearch
    • Logstash
    • Kibana


Summary

This article series has covered a lot of information.

We’ve now had a solid overview of the many aspects of developing and deploying production-ready machine learning and artificial intelligence-based solutions. This includes covering relevant programming languages, packages, libraries, techniques, considerations, and so on.

For a much more thorough resource on the topics discussed in this article series, and for a more in-depth listing of software and packages by category, please feel free to check out my related GitHub resources repository.

Happy learning!