Articles in This Series
- Python vs R for Artificial Intelligence, Machine Learning, and Data Science
- Production vs Development Artificial Intelligence and Machine Learning
- Advanced Analytics Packages, Frameworks, and Platforms by Scenario or Task
Ah yes, the debate about which programming language, Python or R, is better for data science. In this series, I am considering machine learning and artificial intelligence as included in the term data science. This is almost the data science equivalent of tabs vs spaces for software engineers, at least at the time of this writing.
This series is intended to be a somewhat definitive guide on this topic, including recommendations for languages and packages (aka libraries) applicable to different use cases, including data science in production and big data scenarios. This series is not intended to give side-by-side code comparisons, as there are plenty of other articles covering that.
From my experience, which language to use is one of the first questions, if not the first, that someone interested in learning data science wants answered. The world of data science and analytics is vast and can be quite daunting for newcomers. I find that any guidance in determining what languages, tools, and specific tasks to start with is invaluable and highly appreciated.
Given that, I’ve written this series to give guidance to those who want to start learning more about data science, machine learning, and/or artificial intelligence, and who need help choosing a language. This series is also intended for practitioners who wonder which language and packages work best in certain scenarios.
Although we’ll cover most considerations in this series, including fundamental computer science concepts, the short answer is that you should learn Python and R, and should definitely learn SQL too. If you’re really feeling ambitious, give Java, C++, and Scala a shot as well. While not specific to data science, the TIOBE Index is a great, up-to-date way to assess the popularity and relevance of different programming languages.
When I say ‘learn’, I mean learn fundamental programming concepts and control flow structures, which are applicable to any computer programming language. You should also learn to carry out common tasks with data such as loading, querying, parsing, munging/wrangling, filtering, sorting, aggregating, visualizing, and performing exploratory data analysis (EDA). Part of this includes learning one or more packages that help with these tasks, which we’ll discuss later in the series.
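To make a few of these tasks concrete (loading, filtering, sorting, and aggregating), here is a minimal sketch that uses only Python's standard library. The inline data, the column names, and the helper function names are made up for illustration; in practice you would more likely reach for a dedicated data package.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data, inlined so the example is self-contained.
RAW_CSV = """make,model,year,price
Audi,A3,2017,20000
Honda,Fit,2012,7000
Honda,Civic,2015,11000
"""


def load_rows(text):
    """Load CSV text into a list of dicts, converting the numeric columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["year"] = int(row["year"])
        row["price"] = float(row["price"])
        rows.append(row)
    return rows


def average_price_by_make(rows):
    """Aggregate: compute the mean price for each make."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["make"]].append(row["price"])
    return {make: sum(values) / len(values) for make, values in prices.items()}


rows = load_rows(RAW_CSV)                            # loading and parsing
recent = [r for r in rows if r["year"] >= 2015]      # filtering
by_price = sorted(rows, key=lambda r: r["price"])    # sorting
averages = average_price_by_make(rows)               # aggregating
```

The same four steps exist in some form in nearly every language and data package, which is why they are worth learning as concepts first.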
The question isn’t which programming language to learn, it’s why to learn a specific language, when to learn it, and to what degree of expertise. It then becomes how and when to use a certain language and package to achieve certain goals, i.e., the why. For those of you who have read other articles of mine, we will be emphasizing the why throughout our discussion.
The rest of this series is intended to help those interested in figuring this all out.
Programming Languages, Characteristics, and Paradigms
Most programming languages implement many concepts, paradigms, and algorithmic flow structures as taught in the field of computer science. This includes things like literals, data types, keywords, operators, statements, expressions, assignments, variables, conditionals, loops, and so on.
Therefore, the goal in my opinion is less about mastering any specific programming language than about gaining a deep command of these concepts and structures. Once that’s accomplished, all languages should be relatively easy to pick up as needed, and it ultimately just becomes a matter of differing syntax.
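Several of the concepts listed above fit into a few lines of code. The snippet below is only a sketch in Python, with variable names chosen for illustration; the same ideas carry over to R and most other languages with different syntax.

```python
# A literal (3) assigned to a variable; the value's data type is int.
count = 3

# An expression built from operators, a variable, and literals.
total = count * 2 + 1

# A conditional chooses between two branches.
if total > 5:
    label = "large"
else:
    label = "small"

# A loop repeats a statement for each item in a sequence.
squares = []
for n in range(count):
    squares.append(n * n)
```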
Programming languages differ by design and are characterized by many different attributes. While a full discussion of this is out of scope, there are a few characteristics worth mentioning. Those are code processing and execution, typing, and paradigm. Let’s discuss these in that order.
Once written, code is sometimes pre-processed before execution, and other times it is executed directly. Compiled languages pass code written by a programmer through software known as a compiler, which translates high-level, abstract syntax into low-level, optimized code targeted at a specific machine. Once the code is compiled, it is executed by a code execution engine or runtime.
Languages that are executed on the fly and without requiring a compilation step are known as interpreted languages, and the code execution engine is called an interpreter. These languages tend to be faster to prototype code with, but may not enjoy some of the optimization and performance benefits of running compiled code.
The term typing refers to the way in which a programming language handles different types, where a type is simply a designation given to data represented in the written code, and available to a running application (or program), so that the compiler or interpreter knows the data’s intended usage. Types can include numbers (e.g., integers, floats), booleans (e.g., true/false), strings, arrays, objects, and so on.
From there, a language’s specific typing implementation can be further characterized as either statically typed or dynamically typed. Statically typed languages require that the type of every variable be known before the program runs, usually through explicit declarations, and that types remain consistent throughout code flow and execution.
Usually compiled languages are statically typed, and have checks in place at compile time to catch errors and inconsistencies. This mechanism promotes type safety. Many software engineers view this as a benefit.
Dynamically typed languages, on the other hand, do not require type declarations, and types are typically determined implicitly at runtime. As a result, it’s easier to introduce bugs and other issues, since fewer type errors are caught before the code runs.
Despite the reduced type safety, this approach is often thought to be much more flexible, and it often speeds up development; some say it results in increased programmer productivity, efficiency, enjoyment, and so on. As with statically typed languages, many software engineers view this as a benefit.
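Here is a small sketch of what dynamic typing looks like in practice, using Python. The function and variable names are made up for illustration. Note that Python's optional type hints document intent but are not enforced when the code runs.

```python
def describe(value):
    """Return the name of the runtime type of value."""
    return type(value).__name__


x = 42              # x currently refers to an int
first = describe(x)
x = "forty-two"     # the same name can later refer to a str
second = describe(x)


def add(a: int, b: int) -> int:
    """The hints say int, but Python does not check them at runtime."""
    return a + b


# Passing strings does not raise; + simply concatenates them.
joined = add("type ", "safety")

# Mixing incompatible types fails only when the line actually executes.
try:
    add(1, "2")
    failed = False
except TypeError:
    failed = True
```

A compiler with type checks would typically reject the last two calls before the program ever ran, which is the trade-off between safety and flexibility described above.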
The last important things to note before moving forward are three of the most prominent and common paradigms of programming languages: scripting, procedural, and object-oriented. Note that many programming languages can be characterized by more than one of these paradigms simultaneously.
Other common paradigms include imperative, declarative, functional, symbolic, and logic, although we will not be covering those here. Let’s now take a closer look at the three paradigms that we will cover.
Scripting or script programming is a relatively simple paradigm. In this paradigm, code is written in script files that are meant to be run or executed by the language’s execution engine, and are usually intended to automate regularly occurring tasks. For scripting languages, the execution engine is usually an interpreter.
Scripting languages are executed by loading and running a script file through an interpreter, or interactively in a terminal running the language’s read–eval–print loop (REPL). This environment allows the programmer to type statements directly at the prompt in a terminal and see each result immediately.
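As a rough example of the scripting style, here is a short Python script that automates a small recurring task: counting the files in a directory by extension. The file name count_ext.py and the function name are hypothetical.

```python
"""count_ext.py: a hypothetical script that counts files by extension."""
from collections import Counter
from pathlib import Path


def count_extensions(directory):
    """Count the files directly inside directory, grouped by extension."""
    return Counter(
        p.suffix or "(none)"
        for p in Path(directory).iterdir()
        if p.is_file()
    )


# This part runs only when the file is executed as a script.
if __name__ == "__main__":
    for ext, n in sorted(count_extensions(".").items()):
        print(f"{ext}: {n}")
```

Saved under that name, the script could be run from a terminal with python count_ext.py, or the function could be called interactively at the REPL prompt.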
Procedural programming is characterized primarily by code organization, coupling, and data scope. Procedural code is written in such a way as to promote code reuse through well-defined functions and modularity, i.e., modules that work together to provide the functionality of a larger program or application. Each module typically is concerned with a certain group of related functionality.
Modules contain functions, which are also known as procedures (or routine, subroutine, …), hence the name procedural. These functions can be reused throughout the code, and usually take inputs in order to generate one or more outputs.
The more the code is organized into modules, each containing their own independent functionality and having well-defined interfaces (i.e., APIs), the more the code is considered to be loosely coupled. This should definitely be a goal of any software engineer or data scientist.
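To illustrate, here is a minimal procedural sketch in Python. The three functions are made up for this example; in a larger program they might live together in their own module. Each one takes inputs, returns an output, and is reused by the next.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Return the population variance, reusing mean()."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def standardize(values):
    """Rescale values to zero mean and unit variance, reusing both helpers."""
    m = mean(values)
    sd = variance(values) ** 0.5
    return [(v - m) / sd for v in values]


scores = [2.0, 4.0, 6.0]
z_scores = standardize(scores)
```

Because callers only depend on each function's inputs and outputs, the internals of any one of them can change without touching the others, which is the loose coupling described above.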
Lastly, data scope refers to the visibility and access of data to individual modules and functions. Often one defines data (e.g., variables) that are either local or global in scope. Data that’s globally scoped is able to be seen and used by code anywhere in the program, whereas locally scoped data is understood by, and available to a single function for example.
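A quick sketch of local versus global scope in Python, with names invented for the example:

```python
TAX_RATE = 0.08  # global scope: visible to any function in this module


def price_with_tax(price):
    """Return price plus tax, using the globally scoped rate."""
    tax = price * TAX_RATE  # 'tax' is local: it exists only inside this call
    return price + tax


total = price_with_tax(100.0)

# The local name 'tax' is not visible outside the function.
try:
    tax
    tax_is_visible = True
except NameError:
    tax_is_visible = False
```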
Data scope, and the related concepts of free and bound variables, can be further characterized using terms such as lexical, block, function, module, dynamic, local, global, closure, and immediately invoked function expression (IIFE). Further discussion is out of scope for this article, no pun intended :)
Object-oriented programming (OOP) centers on the idea of objects in code, which are created from templates known as classes. These objects have self-contained (i.e., encapsulated) properties and methods.
For example, let’s say you’re building an application to sell used cars. The code could be written to define a car class as a template for individual car objects created (i.e., instantiated) from the class. Object properties are data values that characterize the object, whereas object methods (functions) represent functionality or actions that the object can perform.
Continuing with the car example, one could instantiate a car object from the car class, and give it properties like year, make, model, and selling price. As you can see, while all cars in this application will have these properties (as defined by the class), each car object will have different values (e.g., 2017 Audi A3: $20K, 2012 Honda Fit: $7K).
Object methods associated with cars could represent functionality like set year, mark as sold, add review, and so on. Methods usually follow naming conventions such as SetYear, setYear, or set_year. They are also often used to modify the values of certain object properties rather than allowing the properties to be modified directly.
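The car example might look something like this in Python. This is only a sketch: the class layout, the validation rule in set_year, and the review text are my own assumptions for illustration.

```python
class Car:
    """Template for the used cars in the hypothetical sales application."""

    def __init__(self, year, make, model, price):
        # Properties: every car has these, each with its own values.
        self.year = year
        self.make = make
        self.model = model
        self.price = price
        self.sold = False
        self.reviews = []

    def set_year(self, year):
        """Change the year through a method so the value can be validated."""
        if year <= 0:
            raise ValueError("year must be a positive number")
        self.year = year

    def mark_as_sold(self):
        """Record that this car has been sold."""
        self.sold = True

    def add_review(self, text):
        """Attach a customer review to this car."""
        self.reviews.append(text)


# Two objects instantiated from the same class, with different values.
audi = Car(2017, "Audi", "A3", 20_000)
honda = Car(2012, "Honda", "Fit", 7_000)

honda.mark_as_sold()
audi.add_review("Clean interior, drives well.")
```

Routing changes through set_year rather than assigning to the property directly gives the class one place to reject bad values.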
Object-oriented programming also involves many more concepts that are out of scope here, including abstraction, polymorphism, inheritance, composition, interface, and so on.
Python is a multi-paradigm programming language that can be characterized as a dynamically-typed, scripting, procedural, interpreted, and object-oriented language. It comes with a very comprehensive built-in library called the standard library. Python’s built-in functionality, power, and flexibility are strong reasons for learning it.
Python is also multi-purpose, and can be used for everything from data science, to system and network administration, building web applications, running utility scripts on your local machine, and so on.
Python is also a relatively simple language as compared to some others (e.g., Java, C#), and should be easy to learn for those completely new to computer programming. With the help of an online tutorial or two, someone new to the language can write simple working Python code in no time flat.
Python also enjoys a massive community, both online and in general. Whatever one wishes to do with the language, learn about, or have answered, appropriate resources should be quick and easy to find.
Python has become a formidable language in the data science, artificial intelligence, and machine learning spheres. This is largely due to the language’s flexibility and community, but it’s also a direct result of the production of many ultra-powerful, high-quality packages and modules.
These packages are fully capable of carrying out tasks such as exploratory data analysis (EDA), statistical analysis, predictive analytics, machine learning, artificial intelligence (neural networks and deep learning), recommender systems, and the list goes on.
One nice aspect of Python’s package ecosystem, depending on your perspective, is that while there are many packages available, certain ones stand out and are much more popular than others (scikit-learn, for example). Given that, users can accomplish a tremendous amount with a relatively small group of very well-known packages (discussed more later).
On the other hand, Python’s usefulness for data science was previously criticized for lack of packages and functionality as compared to R, although that gap has certainly continued to close.
Python packages are typically installed from PyPI (the Python Package Index) using pip, or through conda, the package manager that ships with the Anaconda and Miniconda distributions.
Unlike CRAN for R, PyPI is not a curated repository, and it is not the only installation channel. That isn’t necessarily a bad thing, although it does make package installation and maintenance (e.g., updating) a little more involved.
Lastly, many point out that Python is much more performant (i.e., faster) as compared to R, which some people believe is significantly slower for certain tasks. That said, I’ve also read articles claiming that an enhanced R distribution, Microsoft R, is now faster, but I have not verified that.
Like Python, R is a multi-paradigm language that can be characterized as a dynamically-typed, scripting, procedural, and interpreted language. It can also support a type of object-oriented programming, but is less known for that as compared to Python.
R is considered statistical software (similar to SAS and SPSS) and is very specialized and well-suited for statistics, data analysis, and data visualization. It is therefore a less flexible and less general-purpose language than Python. That said, due to its specialization, R enjoys a vast community of people also specialized in these fields.
R has a relatively strange syntax in my opinion, but that’s not a bad thing. It’s not a particularly difficult language to learn once you understand the CS fundamentals mentioned earlier, but it’s gotten the reputation of having a very steep learning curve from some people.
One way that R definitely sets itself apart from Python is in its natural implementation and support of matrix arithmetic and associated data structures such as vectors and matrices. It’s on par with MATLAB and Octave in that regard, whereas a similar implementation in Python usually involves the NumPy package, which some consider clumsier.
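To see why a package gets involved on the Python side, consider what the built-in list type does with arithmetic. In R, multiplying a vector by 2 scales each element, and NumPy arrays behave the same way. Plain Python lists do not, as this sketch shows (the mat_vec helper is a made-up function for illustration, not part of any library):

```python
prices = [20_000, 7_000]

# Multiplying a built-in list repeats it rather than scaling each element.
repeated = prices * 2

# Element-wise arithmetic needs an explicit loop or comprehension.
doubled = [p * 2 for p in prices]


def mat_vec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector using plain Python."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


product = mat_vec([[1, 2], [3, 4]], [10, 1])
```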
It’s also worth noting that many view R as being superior to Python for both statistics and data visualization in terms of packages, implementation details, and final aesthetic results for visualizations.
R also has a significant and very large repository of curated packages known as CRAN. CRAN is a centralized and well-maintained repository of all packages available to the R language, and includes many very powerful and useful packages applicable to a large number of tasks.
While popular Python packages are straightforward and easy to identify and get started with, there seem to be way more packages available for R, and many of them highly specialized and relatively obscure. I find it to be a little less clear what packages are best for what tasks, but there are plenty of great ones to choose from.
Integrated Development Environments (IDEs)
Both Python and R have extensive command line interfaces (CLIs) which can be leveraged via a terminal running the language’s read–eval–print loop (REPL) interactive environment.
The most popular integrated development environment (IDE) for R programming is RStudio, and I must say, it’s pretty awesome. R programs and tasks can also be run from plain text files in combination with a terminal running the R REPL environment and CLI.
The concept of reproducible research and reporting, and subsequently the many notebook options used to produce it, is very popular with both R and Python. For R, the most common notebook and reporting tools are R Markdown and knitr.
Unlike R, Python doesn’t really have a somewhat “official” IDE like RStudio. Two of the most common Python IDEs are Rodeo and Spyder, although any text editor can be considered an IDE in the same way as with R. Python code and scripts can also be run via the Python REPL environment and CLI.
Reproducible research and notebooks-wise, you’d be hard pressed to find a data scientist writing Python code that isn’t also using Jupyter notebooks. In fact, Jupyter is often used as a primary IDE for Python programming in general.
Which Language to Use, When, and Why
We’ve had a thorough discussion of both languages, along with their characteristics and associated paradigms. Now let’s talk about which language to use, when, and why.
The first thing worth mentioning is that both languages, and many packages written for each, are very solid and highly capable. You aren’t making an incorrect choice going one way or the other, at least initially. In certain situations that we’ll discuss here, one language or the other may be better suited for a certain situation or use case, or perhaps even a different language such as Scala (for Spark), C++, etc.
Cutting to the chase, I recommend that most beginners start with Python and Jupyter notebooks. It’s very easy to learn, and you can get up and running relatively quickly. Learning Python will also expose newcomers to object-oriented programming, although one can certainly write Python code in a more scripting/procedural fashion without delving into OOP.
For a very comprehensive and useful infographic comparing both languages, I highly recommend taking a look at DataCamp’s article on the subject.
Further considerations should include the situations where data science tasks (analytics, machine learning, artificial intelligence) are carried out either on a local desktop or laptop machine by a data scientist (for example), or where these tasks are performed on servers (usually in the cloud). In the latter case, servers can either be a single server, or a distributed system (described below).
I further differentiate between performing these tasks for development or production purposes, where production implementations and deployments have a unique set of considerations, challenges, and requirements (e.g., packaging, devops, site reliability, rollback/failover, monitoring, …).
Lastly, and from a requirements perspective, we must also consider whether production deployments produce results in real-time or near real-time, and also whether a given task’s deliverable (e.g., predictive model, classifier, recommender system) is trained to a target level of performance either offline or online.
The next InnoArchiTech article will cover all of this in depth, along with suggestions and recommendations for which languages, packages, and platforms should be considered for each scenario.
I hope you have been able to learn something about the most popular programming languages used for artificial intelligence, machine learning, and data science tasks.
Keep in mind that things change over time, and there’s usually not one single perfect solution for every use case. Sometimes experimentation and testing are needed to find the optimal solution.
The next article will be a continuation of this guide, and will focus on which languages, packages, and platforms are best suited for certain use cases and environments.
Good luck on your data science pursuits, and stay tuned!