Comparing R, Python, SAS, SPSS from the perspective of European data scientists

As I mentioned in the article Third Wave-Democratization of AI, Machine Learning, and Data Science, data is now available anywhere in the world. Open source programming languages and tools such as R or Python have become widespread in the world of science.

In fact, we work with many different kinds of customers, especially in Silicon Valley, and these days the enormous cost of using older enterprise data analysis and statistics tools such as SAS or SPSS is coming under scrutiny. The pressure to review that spending seems to grow day by day, and it is becoming difficult, if not practically impossible, for new projects or newly hired people to get such expenses approved. After all, whatever you are trying to do with SAS or SPSS can be done with R or Python, with access to even more state-of-the-art algorithms, so a transition appears to be under way. That is why demand is starting to rise for training in those programming languages, and for tools like our Exploratory that give you access to R's features without programming.

Even so, the reality is that many long-established and large companies still pay a great deal of money to use SAS or SPSS, because those costs have not yet become visible enough to be seen as a problem. So today I would like to introduce an article by Jeroem Kromme, a data scientist who consults on data analysis in Europe. Drawing on his team's experience, he compares the major data science tools R, Python, SAS, and SPSS in an easy-to-understand way and offers advice on adopting them. What follows is that article (2017/03/18/python-r-vs-spss-sas/), translated into Japanese with the author's consent.

I often see SAS, SPSS, R, and Python in my data analysis projects, and of these SAS and SPSS are probably still the most used. However, customers' interest in the open source languages R and Python is growing day by day, and some of our customers have recently moved from SAS or SPSS to R or Python. Moreover, recent commercial software (including SAS and SPSS) can call R and Python internally, so even customers who have not migrated are, in practice, already using R and Python.

SAS was originally developed at North Carolina State University, primarily to analyze numerical agricultural data. The name SAS stands for Statistical Analysis System. In response to the demand for such software at the time, SAS was founded as a company in 1976.

The Statistical Package for the Social Sciences, or SPSS, was originally developed for analysis in the social sciences, and it was also the world's first statistical programming language for the PC. Development began at Stanford University; eight years later the company SPSS was founded, and it was acquired by IBM in 2009.

The R language was originally developed at the University of Auckland, New Zealand, with the main purpose of analysis using statistical models, and the first version was released as open source in 2000.

Of the tools discussed here, Python is the only one that did not start at a university. Python was created by a Dutchman who is a fan of Monty Python (hence the name). He was looking for a project to work on over one Christmas and created a new language, Python, based on a programming language called ABC. ABC, which he had also worked on, was designed to teach programming to non-programmers. Like C++ and Java, Python is a general-purpose programming language, but it is far easier to learn than either. Since then programmers have developed numerous modules, including tools for a wide range of statistical analyses, and today Python deserves to be listed alongside the statistical tools above.

In this article, I would like to compare these four languages in terms of methods and techniques, ease of learning, visualization, support, and cost. The comparison focuses on programming languages, so UI products such as SAS Enterprise Miner or SPSS Modeler are not included.

Statistical methods and techniques

When analyzing data, you go back and forth between **analysis for explanation** and **analysis for prediction**. Which one to use depends on your goal at the time. Take customer churn (customers leaving or canceling) as an example. You may ask why customers are leaving, or you may ask which customers are going to leave. The purpose of the first question is to **explain** why customers leave; the purpose of the second is to **predict** which customers will leave. These are fundamentally different types of questions, and the difference determines which analysis method you will use. Prediction of this kind is more commonly referred to as data mining or machine learning.
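
To make the distinction concrete, here is a minimal sketch in plain Python. The data is synthetic and the feature names (`tickets`, `tenure`) are hypothetical, invented only for illustration; the small gradient-descent logistic regression simply stands in for whatever method you would actually use. Reading the fitted coefficients addresses the "why" question, while ranking held-out customers by predicted probability addresses the "which" question.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, x):
    """Predicted probability of churn for one customer."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(rows, labels, lr=0.5, epochs=1000):
    """Fit a logistic regression by batch gradient descent.

    Returns [intercept, coef_1, ..., coef_k].
    """
    k = len(rows[0])
    weights = [0.0] * (k + 1)
    for _ in range(epochs):
        grads = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y
            grads[0] += err
            for j, value in enumerate(x):
                grads[j + 1] += err * value
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Hypothetical synthetic customers: more support tickets raise the chance of
# churn, longer tenure lowers it. Both features are scaled to the range 0..1.
rng = random.Random(0)


def make_customer():
    tickets, tenure = rng.random(), rng.random()
    churned = 1 if rng.random() < sigmoid(3.0 * tickets - 2.0 * tenure) else 0
    return [tickets, tenure], churned


customers = [make_customer() for _ in range(600)]
train, holdout = customers[:400], customers[400:]
weights = fit_logistic([x for x, _ in train], [y for _, y in train])

# Explanation: the sign and size of each coefficient hint at why customers leave.
print("coefficients (intercept, tickets, tenure):", [round(w, 2) for w in weights])

# Prediction: rank unseen customers by churn probability to decide whom to contact.
ranked = sorted(holdout, key=lambda c: predict_proba(weights, c[0]), reverse=True)
print("highest-risk customer features:", [round(v, 2) for v in ranked[0][0]])
```

The same fitted model serves both purposes here, but in practice the two goals pull in different directions: explanation favors models whose coefficients can be interpreted and tested, while prediction is judged only by how well it performs on data the model has not seen.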

SAS and SPSS have their roots in **data analysis for explanation**, partly because they were developed in academic settings where hypothesis testing is central. As a result, they offer far fewer machine learning and AI algorithms than R and Python. Both vendors do offer products suited to building predictive models, such as SAS Enterprise Miner and SPSS Modeler, but these are separate tools from the main product and require a separate license.

There are many reasons why open source software is great, but two are worth noting: it is constantly improved by the community, and new features keep being added. R was created by people in academia and research, such as universities, who wanted as many people as possible around the world to use the algorithms they had developed. As a result, in terms of the variety and number of algorithms, R is strong in both **explanatory analysis** and **predictive analysis**.

Python, on the other hand, was developed with a focus on building applications rather than from an academic, research, or statistical point of view, so it is especially strong when you want to use a model directly inside an application. For that reason its statistical capabilities lean toward prediction. Python is often used in machine learning applications that run without a data analyst in the loop, and it is also strong at processing and analyzing images and video. For example, last summer we used Python to build our own autonomous radio-controlled car. Python is also probably the easiest language to use with big data frameworks like Spark.

Ease of learning

Both SAS and SPSS have a comprehensive UI, so users do not have to write code all the time. In addition, SPSS has a paste function that generates code from the steps you perform in the UI, and SAS has Proc SQL, which may be relatively easy for people who know SQL. However, each tool uses its own syntax, unlike that of any other programming language, so I sympathize with anyone who finds themselves having to learn one of them.

R has some older UI-based solutions such as Rattle, but they are nowhere near comparable to those of SAS and SPSS. R is an easy language for programmers to learn, but most data analysts have no programming experience, and for them R will be difficult to learn. Once you have the basics, however, it becomes relatively easy. With that in mind, we offer an educational course for those who want to become data scientists, in which you can experience programming in R.

As mentioned earlier, Python is based on ABC, a language created to teach programming to non-programmers, so Python scripts are easy to read and it is the easiest of these languages to learn. Python is a general-purpose programming language, so it has no UI.

In summary, SAS and SPSS may be the easiest to get started with, in that you can begin analyzing data without programming.

Support

Both SAS and SPSS are commercial software, so formal support is available. This may be one reason some companies choose these tools: there is a sense of security in knowing you can get help if you run into problems.

Many people assume that you cannot expect much support for open source tools, but that is not the case. True, these tools may not be supported directly by their creators, but both R and Python have large communities that formed spontaneously, and most of the time many people there are willing to help. More than 99% of the time, someone has already reported the problem or question you are facing and someone has already answered it, for example on Stack Overflow (a technical Q&A site). In addition, some companies already offer support services for R and Python. So, ironically, although R and Python have no formal support, when you actually have a question you can often find the answer much faster than with SAS or SPSS.

Data visualization

As for data visualization, SAS and SPSS do offer it, but only in a strictly functional form. You can make small changes, but truly customizing a chart is cumbersome or impossible. R and Python allow far more customization. In the R world the widely used visualization package is ggplot2, in which almost anything can be adjusted, and you can also build interactive applications with a tool called Shiny.

R and Python are constantly learning from each other; one of the best examples is that Python now offers an equivalent of ggplot. Another module often used for visualizing data in Python is Matplotlib.

Cost

Both R and Python are open source, so anyone can use them for free. The downside, as mentioned above, is that they may be harder to learn than tools with a UI such as SAS or SPSS. As a result, data analysts who are proficient in R and Python are paid considerably more than those who are not, and the cost of training people who lack these skills is not negligible. So open source does not mean entirely free in practice. Even taking that into account, R and Python are still by far the cheaper option compared with tools like SAS and SPSS, which still carry extremely high license fees.

My choice

“Software is like sex, it’s better when it’s free” 
- Linus Torvalds (creator of Linux)

The tools I use most when analyzing data are R and Python. I like being able to use them anytime, anywhere, without having to buy a license or deal with licensing hassles. Apart from licensing, my main reason for using R or Python is the wide range of statistical methods available: you can choose whichever algorithm best suits the analysis you are doing at the time.

Whether to use R or Python depends on the ultimate purpose. As I said earlier, Python is a general-purpose language created with application development in mind, so it is strong in machine learning and AI; it is the language to use when building applications that recognize faces and objects or perform deep learning. R, on the other hand, is used when the analysis needs to explain something, for example understanding customer behavior patterns to learn why customers cancel rather than merely predicting which customers will cancel.

That said, the two programming languages are complementary to each other. For example, there is a tool (reticulate, rPython) for executing Python code from within R, and conversely a tool (rpy2) for executing R code from within Python. So if you can mix and match the two languages, it's an even more powerful solution.
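
As a rough illustration of mixing the two languages, the sketch below calls R from Python. Note that it does not use reticulate, rPython, or rpy2: to stay dependency-free it swaps in a simpler technique, launching the `Rscript` command-line program through Python's standard `subprocess` module. The helper name `r_mean` is hypothetical, and the function returns `None` when no R installation is found on the PATH.

```python
import shutil
import subprocess


def r_mean(values):
    """Compute the mean of `values` in R by shelling out to Rscript.

    Hypothetical helper, not part of any library. Returns None if the
    Rscript executable cannot be found.
    """
    rscript = shutil.which("Rscript")
    if rscript is None:
        return None
    # Build a small R expression such as: cat(mean(c(1.0, 2.0, 3.0)))
    vector = ", ".join(repr(float(v)) for v in values)
    expression = f"cat(mean(c({vector})))"
    completed = subprocess.run(
        [rscript, "-e", expression],
        capture_output=True,
        text=True,
        check=True,  # raise if R itself reports an error
    )
    return float(completed.stdout.strip())


result = r_mean([1, 2, 3])
print("R is not available" if result is None else f"mean computed by R: {result}")
```

Passing values as text like this is fine for a quick one-off call. For anything larger, the packages named above are the better fit, since they exchange data between the two languages directly.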


End of translation

So, I translated the whole article in one go. What did you think?

I think the following points are particularly noteworthy.

  • SAS and SPSS have a UI, so they are easy for people with no programming experience to get started with. However, the license fees are very high.
  • R and Python are hard to get started with for those who have no programming experience. R in particular may be harder to pick up at first than Python.
  • Compared to SAS and SPSS, R and Python have far more algorithms and gain new functionality much faster.
  • As for support, SAS and SPSS have formal commercial support, while R and Python have communities formed spontaneously by passionate users and developers around the world. Thanks to those communities it is easy to find the information you are looking for, and most questions and problems have already been solved by others and shared online.
  • R is especially well suited to explanatory analysis and to situations where the analyst wants to explore the data interactively. Because it grew out of academia and research, its statistics and machine learning algorithms cover many areas and are very numerous.
  • Python is a general-purpose programming language, well suited to application development, and it shows its true value when building applications that use machine learning and AI.

Articles that make this kind of comparison often read like a pitch from the vendor selling the product. This one is fair because it is a comparison from the perspective of someone who has analyzed data in the field for many years. It also summarizes the key points in an easy-to-understand way, so I think it is especially useful for anyone who is unsure which of these tools to choose or who is curious about tools other than the ones they currently use.

Data science boot camp training

Finally, for those who would like to improve their business with cutting-edge open source data science tools and algorithms such as R and Python but have been held back by the programming barrier, we will hold **Bootcamp Training** in Tokyo next month and in June, which enables you to do data science without programming.

If you are interested, please see this page for details.

By Kan (Twitter)
