Yesterday I organized the basic procedure of data analysis; this post is its continuation.

At PyCon JP 2014, held over the weekend, there was talk that many attendees are still Python 2 users, and in his Day 1 keynote Kenneth Reitz of Heroku argued that Python 3 offers "no benefit." In fact, though, if you are starting with Python as of 2014, you can simply choose the 3.x series. For what it's worth, I believe most of the subsequent talks by Japanese speakers recommended Python 3.

Here is my reasoning for why I use Python 3 for data analysis.

- String encoding defaults to Unicode
- No new features have been added to Python 2, and upcoming libraries will be based on Python 3.

These two points alone are already sufficient reason.

The readers of this article are probably overwhelmingly Japanese, and of course so is the author. Native English speakers may not feel the benefits of Unicode, but today there is hardly an environment where UTF-8 is not the default. Once a Japanese developer has touched Python 3, they will never want to go back to Python 2. In an era when the Internet leaves almost no distance between countries, a language version that is hard to use in a specific region should not be left as it is; it should comply with the global standard.

Also, the Python 2 series will receive no new features. By contrast, powerful libraries such as pandas, which appeared around 2008, have developed rapidly in the last few years and, combined with the big data boom, have elevated Python to the status of a major language for data analysis. Since Python 3 was also introduced in 2008, the assets that predate it are either few or obsolete in this area, so we should move to the Python 3 series immediately. The cost of migration is unlikely to be that high.

As with MATLAB and the R language, numerical analysis in Python centers on computing with vectors and matrices. The tools required for this, listed at my own discretion as the three sacred treasures, are as follows.

- pandas (data structure and its manipulation)
- matplotlib (data visualization)
- IPython (Interactive Data Analysis Environment)

As I have covered each of these extensively in previous articles, no further explanation should be needed.

When big data comes up at private companies, I suspect the name Hadoop appears immediately, whether as a keyword or a product name. But given the content of the analysis, is it really necessary? Each time, think carefully about both the size of the data and the nature of the computation.

MapReduce and Spark are very powerful tools when used correctly, for example for sampling from a population, and you can [do it in a simple language like Pig or Hive](http://qiita.com/ynakayama/items/d2a8c125360e053d5a2f), but they apply to only a small part of the data analysis phase. These days the computer at hand is sufficient for analyzing and visualizing sampled data, and for that purpose you can use not only Python but also R or spreadsheet software such as Excel.

Make sure that using certain software or tools is not an end in itself.

The best way to appreciate the usefulness of a data analysis tool is simply to get your hands moving and write code.

First of all, analysis starts with reading prepared data from a file. NumPy/SciPy also provide high-level functions such as genfromtxt and [loadtxt](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt), but if pandas is available, it is basically a good idea to use pandas functions such as [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html).
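To make this concrete, here is a minimal sketch of reading CSV data with read_csv. The inline CSV content is hypothetical sample data of my own, and io.StringIO stands in for an actual file path.

```
import io
import pandas as pd

#Hypothetical CSV content; in practice you would pass a file path instead
csv_text = """a,b
1,10
2,20
3,30
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

read_csv infers column names from the header row and numeric types from the values, so the result is immediately ready for computation.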

Random number generation is often used when writing and trying out sample code. NumPy/SciPy are useful here because they implement various random number generators. In particular, [numpy.random.normal](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html), which generates random numbers following a normal (Gaussian) distribution with an arbitrary mean and standard deviation, comes in handy.

As a test, let's generate two variates from random numbers that follow normal distributions with given means and standard deviations, find the basic statistics, calculate the covariance and correlation coefficient, and visualize them with a scatter plot.

```
import numpy as np
import matplotlib.pyplot as plt

#Generate 50 random numbers following a normal distribution with mean 10 and standard deviation 5
a = np.random.normal(10, 5, size=50)
#Generate 50 random numbers following a normal distribution with mean 20 and standard deviation 8
b = np.random.normal(20, 8, size=50)
#Visualize with a scatter plot
plt.figure()
plt.scatter(a, b)
plt.savefig("image.png")
```

This produces a scatter plot visualizing the relationship between the two variates.

If you do the same thing with pandas, it will be like this.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Create a data frame with two columns of normally distributed random numbers
df = pd.DataFrame({
    "a": np.random.normal(10, 5, size=50),
    "b": np.random.normal(20, 8, size=50)
})
#Visualize
plt.scatter(df['a'], df['b'])
```

The generated image is the same, so it is omitted.

Using data frames with pandas in this way has the advantage of making matrix statistics and operations easier.

```
#Take a look at the contents of the generated random numbers
df
#=>
# a b
# 0 14.104370 13.301508
# 1 9.053707 32.101631
# 2 7.731780 14.075792
# 3 8.629411 38.876371
# 4 15.604993 24.380662
# 5 13.678605 16.517300
# ... (Omitted on the way)
# 46 12.229324 24.926788
# 47 16.650234 23.308550
# 48 8.101379 20.404972
# 49 0.807786 34.109284
#Calculate key basic statistics
df.describe()
#=>
# a b
# count 50.000000 50.000000
# mean 10.517972 20.291032
# std 4.229104 8.104303
# min -0.618973 4.451698
# 25% 8.006941 14.085385
# 50% 11.442714 20.018789
# 75% 13.828762 24.770515
# max 17.617583 39.991811
#Find the covariance matrix
df.cov()
#=>
# a b
# a 17.885320 -5.215284
# b -5.215284 65.679722
#Find the correlation coefficient
df.corr()
#=>
# a b
# a 1.000000 -0.152165
# b -0.152165 1.000000
#Find the transposed matrix
df.T
#=>
# 0 1 2 3 4 5 \
# a 14.104370 9.053707 7.731780 8.629411 15.604993 13.678605
# b 13.301508 32.101631 14.075792 38.876371 24.380662 16.517300
# 6 7 8 9 ... 40 \
# a 12.321283 3.325300 5.439189 15.693431 ... 15.220284
# b 30.198993 24.853103 10.381890 32.567924 ... 15.801350
# 41 42 43 44 45 46 \
# a 13.493986 6.756807 9.030604 11.044724 11.443239 12.229324
# b 14.278252 20.388216 20.582722 25.731553 18.479491 24.926788
# 47 48 49
# a 16.650234 8.101379 0.807786
# b 23.308550 20.404972 34.109284
```

It's easy.

As an aside, at companies whose main business is contract-style development, there are many cases where, for example, development in Java is the bread and butter. But no matter what your main business is, for data analysis it is best to use a language suited to data analysis.

Try solving the simple problem from the sample code above, "generate two variates from random numbers following normal distributions with given means and standard deviations, find the basic statistics, calculate the covariance and correlation coefficient, and visualize them with a scatter plot," in a common language other than a statistical language such as R, and you will quickly see how difficult it becomes.

The preamble has grown a bit long, but in social statistics there are three major points in a series of analyses of data obtained from surveys.

How the values of a variate are distributed is a very important premise. To represent the state of a distribution, we use **statistics** such as the "mean" and "variance."

To capture the state of a distribution, use a frequency distribution table, which presents the distribution of a variate's values in tabular form, or a boxplot, which visualizes the summary statistics.
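As a sketch of both presentations in pandas, `cut` and `value_counts` build a frequency distribution table, and `describe` gives the summary statistics that a boxplot visualizes. The sample data, the random seed, and the choice of 5 classes are my own assumptions.

```
import numpy as np
import pandas as pd

np.random.seed(0)  #Hypothetical sample data for illustration
values = pd.Series(np.random.normal(50, 10, size=100))

#Frequency distribution table: divide the range into 5 classes and count values in each
classes = pd.cut(values, bins=5)
freq_table = classes.value_counts().sort_index()
print(freq_table)

#Summary statistics (min, quartiles, max) that a boxplot would visualize
print(values.describe())
```

With matplotlib available, `values.plot(kind="box")` draws the boxplot itself from the same data.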

The next point is to infer, from the statistics of the data, the state of the entire group under study, called the **population**. Most social research is based on **samples** extracted from part of a population.

Since a sample is taken from only part of the population, its data does not necessarily reflect the values in the population accurately and will contain some error. Interpreting analysis results obtained from a sample as the state of the population it was drawn from requires the concepts and techniques of "inferential statistics." In inferential statistics, after allowing for the probability of error, we perform "estimation," which specifies values in the population, and "testing," which sets up a hypothesis and judges whether it holds.
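As an illustration of estimation and testing, here is a sketch using scipy.stats, which goes beyond the libraries named above; the sample data, the 95% confidence level, and the hypothesized mean of 10 are all my own assumptions.

```
import numpy as np
from scipy import stats

np.random.seed(1)  #Hypothetical sample drawn from a population with mean 10
sample = np.random.normal(10, 5, size=50)

#Estimation: 95% confidence interval for the population mean, based on the t distribution
mean = sample.mean()
sem = stats.sem(sample)
lower, upper = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(lower, upper)

#Testing: does the population mean differ from the hypothesized value 10?
t_stat, p_value = stats.ttest_1samp(sample, 10)
print(t_stat, p_value)
```

The interval quantifies the error around the point estimate, while the p-value judges the hypothesis after allowing for that error.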

Finally, find out whether, and to what extent, there are relationships among multiple variables. In this way we clarify which factors affect what. It is also common to infer whether a relationship between variables exists in the population from which the data was obtained.
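For the relationship between two variables, scipy.stats.pearsonr returns both the correlation coefficient and a p-value for whether the correlation is zero in the population, covering the descriptive and inferential sides at once. The data below is hypothetical, with a linear relationship built in by assumption.

```
import numpy as np
from scipy import stats

np.random.seed(2)  #Hypothetical data with a built-in linear relationship
x = np.random.normal(0, 1, size=100)
y = 2 * x + np.random.normal(0, 0.5, size=100)

#Correlation coefficient and p-value for the hypothesis "no correlation in the population"
r, p_value = stats.pearsonr(x, y)
print(r, p_value)
```

A small p-value here suggests the observed correlation is unlikely to be an artifact of sampling error.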

Each of these three points focuses on some viewpoint and aims to present the data as easy-to-understand indicators (= statistics) or conclusions (= inferences). Statistical analysis plays the role of extracting information from a huge amount of data and presenting it in summarized form according to the purpose.

In most cases, statistical analysis methods and statistics can only be understood by building up the prerequisite topics one by one. Throwing data at a computer blindly and looking at the results will not, without sufficient knowledge, let you proceed to the next step.

From the next installment onward, we will work step by step through the three points above.
