[PYTHON] Organizing basic procedures for data analysis and statistical processing (2)

This is a continuation of yesterday's post, in which I began organizing the basic procedure of data analysis.

Select tools for analysis

At PyCon JP 2014, held over the weekend, there was talk that many attendees are still Python 2 users, and in his day-one keynote Heroku's Kenneth Reitz argued that there is "no benefit" to Python 3. In fact, though, if you are adopting Python as of 2014, you can safely choose the 3.x series. Incidentally, I believe most of the Japanese talks that followed recommended Python 3.

Why use Python 3

Here is my thinking on why I use Python 3 for data analysis. There are two main reasons.

The readers of this article are probably overwhelmingly Japanese, and of course the author is Japanese too. English speakers may not feel the benefit of Unicode, but today there is hardly a modern language in which UTF-8 is not the default. Once you have touched Python 3, no Japanese speaker will ever want to go back to Python 2's string handling. In an era when the Internet has all but erased the distance between countries, a language that is hard to use in a particular region should not be left that way; it should comply with global standards.
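To make the Unicode point concrete, here is a minimal sketch: in Python 3, str is Unicode by default, so Japanese text is counted by character rather than by byte, and bytes appear only when you explicitly encode.

s = "日本語"
len(s)             # => 3: three characters, not a byte count
s.encode("utf-8")  # => b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e': bytes only on request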

Also, no new features will be added to the Python 2 series. In contrast, powerful libraries such as pandas, which appeared around 2008, have developed rapidly over the past few years and, combined with the big data boom, have elevated Python to the status of a major language for data analysis. Since Python 3 itself was also introduced in 2008, the assets that predate it count for little in this field, or are obsolete, so we should move to the Python 3 series immediately. The cost of migration is unlikely to be that high.

The three sacred treasures of data analysis

When trying to perform numerical analysis in the style of MATLAB or the R language, the focus is on computing with vectors and matrices. If I pick the required tools at my own discretion, the "three sacred treasures" are as follows.

NumPy / SciPy
pandas
matplotlib

I have dealt with all of these many times in the articles so far, so no additional explanation should be needed.

Do you really need Hadoop?

When big data comes up at private companies, the name Hadoop seems to surface immediately, whether as a keyword or as a product name. But given the content of the analysis, is it really necessary? Think carefully each time about the nature of the computation as well as the size of the data.

MapReduce and Spark are very powerful tools when used correctly, for example for sampling from the population or [when the job can be written in a simple language like Pig or Hive](http://qiita.com/ynakayama/items/d2a8c125360e053d5a2f), but they apply to only a small part of the data analysis phase. These days the computer at hand is quite sufficient for analyzing and visualizing sampled data, and for that you can use not only Python but also R or even spreadsheet software such as Excel.
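For instance, once you are working with a sample that fits in memory, drawing it is a one-liner in NumPy. A minimal sketch (the population array here is just a stand-in for real data):

import numpy as np

population = np.arange(1000000)  # stand-in for a large dataset
sample = np.random.choice(population, size=1000, replace=False)  # simple random sample without replacement
print(sample.mean())  # statistics on the sample approximate the population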

Make sure that using certain software or tools is not an end in itself.

Write sample code

The best way to appreciate the usefulness of a data analysis tool is simply to get your hands moving and write code.

File I/O

First of all, analysis starts with reading the prepared data from a file. NumPy/SciPy has high-level functions such as [genfromtxt](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html) and [loadtxt](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt), but if pandas is available, it is basically a good idea to use pandas functions such as [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html).
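For example, a minimal sketch of both (data.csv is a hypothetical comma-separated file):

import numpy as np
import pandas as pd

# NumPy: loadtxt handles plain numeric text files
arr = np.loadtxt("data.csv", delimiter=",")

# pandas: read_csv also takes care of headers, mixed types, and missing values
df = pd.read_csv("data.csv")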

Generate random numbers according to a specific distribution

Random numbers are often needed when writing and trying out sample code. NumPy/SciPy is convenient here because it has various random number generators built in. In particular, [numpy.random.normal](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html), which generates random numbers following a normal (Gaussian) distribution with an arbitrary mean and standard deviation, comes in handy.

As a test, let's generate two variables from random numbers following normal distributions with chosen means and standard deviations, find the basic statistics, calculate the covariance and correlation coefficient, and visualize the pair with a scatter plot.

import numpy as np
import matplotlib.pyplot as plt

# Generate 50 random numbers following a normal distribution with mean 10 and standard deviation 5
a = np.random.normal(10, 5, size=50)
# Generate 50 random numbers following a normal distribution with mean 20 and standard deviation 8
b = np.random.normal(20, 8, size=50)
# Visualize with a scatter plot
plt.figure()
plt.scatter(a, b)
plt.savefig("image.png")

(Figure: scatter plot of a against b, saved as image.png)

We were able to draw a scatter plot that visualizes the relationship between the two variables.
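Incidentally, the basic statistics, covariance, and correlation coefficient mentioned above can already be computed in plain NumPy; a quick sketch using the arrays a and b generated above:

print(a.mean(), a.std(ddof=1))  # mean and unbiased standard deviation of a
print(b.mean(), b.std(ddof=1))  # same for b (ddof=1 matches pandas' default)
print(np.cov(a, b))             # 2x2 covariance matrix
print(np.corrcoef(a, b))        # 2x2 correlation matrix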

Use a data frame

If you do the same thing with pandas, it will be like this.

import pandas as pd

# Create a data frame with the two random variables as columns
df = pd.DataFrame({
    "a": np.random.normal(10, 5, size=50),
    "b": np.random.normal(20, 8, size=50)
})

# Visualize
plt.scatter(df['a'], df['b'])

The generated image looks the same as before, so it is omitted.

Matrix reference and manipulation

Using pandas data frames in this way has the advantage of making statistics on, and manipulation of, the matrix easier.

# Take a look at the contents of the generated random numbers
df
#=>
#             a          b
# 0   14.104370  13.301508
# 1    9.053707  32.101631
# 2    7.731780  14.075792
# 3    8.629411  38.876371
# 4   15.604993  24.380662
# 5   13.678605  16.517300
# ... (rows omitted)
# 46  12.229324  24.926788
# 47  16.650234  23.308550
# 48   8.101379  20.404972
# 49   0.807786  34.109284

# Calculate the key basic statistics
df.describe()
#=>
#                a          b
# count  50.000000  50.000000
# mean   10.517972  20.291032
# std     4.229104   8.104303
# min    -0.618973   4.451698
# 25%     8.006941  14.085385
# 50%    11.442714  20.018789
# 75%    13.828762  24.770515
# max    17.617583  39.991811

# Find the covariance matrix
df.cov()
#=>
#            a          b
# a  17.885320  -5.215284
# b  -5.215284  65.679722

# Find the correlation coefficient
df.corr()
#=>
#           a         b
# a  1.000000 -0.152165
# b -0.152165  1.000000

# Find the transposed matrix
df.T
#=>
#           0          1          2          3          4          5   \
# a  14.104370   9.053707   7.731780   8.629411  15.604993  13.678605   
# b  13.301508  32.101631  14.075792  38.876371  24.380662  16.517300   

#           6          7          8          9     ...             40  \
# a  12.321283   3.325300   5.439189  15.693431    ...      15.220284   
# b  30.198993  24.853103  10.381890  32.567924    ...      15.801350   

#           41         42         43         44         45         46  \
# a  13.493986   6.756807   9.030604  11.044724  11.443239  12.229324   
# b  14.278252  20.388216  20.582722  25.731553  18.479491  24.926788   

#           47         48         49  
# a  16.650234   8.101379   0.807786  
# b  23.308550  20.404972  34.109284  

It's easy.

How about in other languages

As an aside, at companies whose main business is contract development, day-to-day work is often in, say, Java. But whatever your line of business, it is best to use a language suited to data analysis when doing data analysis.

As a trial, take the simple problem from the sample code above: "generate two variables from random numbers following normal distributions with chosen means and standard deviations, find the basic statistics, calculate the covariance and correlation coefficient, and visualize them with a scatter plot." It is easy to see that attempting even this quickly becomes difficult in a general-purpose language, as opposed to a statistics-oriented language such as R.

Three points of social statistics in data analysis

The preliminaries have run a bit long, but in a series of analyses of survey data there are three major points of social statistics.

1. Understand the distribution of data

How the values of a variable are distributed is a very important premise. To describe the state of a distribution, certain **statistics** are used, for example the "mean" and the "variance".

To capture the state of a distribution, use a frequency distribution table, which presents the distribution of the values in tabular form, or a boxplot, which visualizes the summary statistics.
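A minimal sketch of both with pandas and matplotlib, assuming the data frame df from the sample code above (the bin count of 10 is an arbitrary choice):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# df as in the earlier sample code, regenerated so the sketch stands alone
df = pd.DataFrame({
    "a": np.random.normal(10, 5, size=50),
    "b": np.random.normal(20, 8, size=50)
})

# Frequency distribution table: bin the values of column a, then count per bin
freq = pd.cut(df['a'], bins=10).value_counts().sort_index()
print(freq)

# Boxplot: the summary statistics of both columns at a glance
df.boxplot()
plt.savefig("boxplot.png")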

2. Infer the state of the population from the sample

The next point is to infer the state of the whole, called the **population**, based on the statistics of the data. Most social research is based on **samples** extracted from part of a population.

Since a sample is taken from only part of the population, its data does not necessarily reflect the values in the population exactly and will contain some error. Interpreting an analysis result obtained from a sample as the state of the population it was drawn from requires the concepts and techniques of inferential statistics. In inferential statistics, after allowing for the probability of error, we perform "estimation," which pins down a value in the population, and "testing," which sets up a hypothesis and judges whether it holds.
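A minimal sketch of both with SciPy, reusing the sample a from the earlier code (the 95% confidence level and the hypothesized mean of 10 are arbitrary choices):

import numpy as np
from scipy import stats

# a is the sample from the earlier code, regenerated so the sketch stands alone
a = np.random.normal(10, 5, size=50)

# Estimation: a 95% confidence interval for the population mean
low, high = stats.t.interval(0.95, len(a) - 1, loc=a.mean(), scale=stats.sem(a))

# Test: is the population mean 10? A small p-value argues against the hypothesis
t_stat, p_value = stats.ttest_1samp(a, 10)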

3. Examine the association between multiple variables

Finally, find out whether, and to what extent, multiple variables are related to one another. In this way we clarify which factors affect what. It is also common to infer from the data whether such a relationship holds in the population the data was drawn from.
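As a sketch of this step, scipy.stats.pearsonr returns the correlation coefficient together with the p-value for the hypothesis that the correlation in the population is zero (a and b as in the earlier sample code):

import numpy as np
from scipy import stats

a = np.random.normal(10, 5, size=50)
b = np.random.normal(20, 8, size=50)

r, p_value = stats.pearsonr(a, b)  # correlation coefficient and its p-value
# a small p_value suggests the association is unlikely to be zero in the population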

Summary

Each of these three points focuses on some viewpoint in order to present a straightforward indicator (a statistic) or conclusion (an inference). Statistical analysis plays the role of extracting from a huge amount of data, and presenting in summarized form, the information that suits the purpose.

In most cases, statistical analysis methods and statistics can only be understood by building up the prerequisite material step by step. So even if you blindly throw data into a computer and look at the results, you cannot proceed to the next step without sufficient knowledge.

From next time onward, we will take steps to flesh out the three points above.
