I passed the Python data analysis test, so I summarized the points

About this article

I passed the Python Data Analysis Test, so I summarized the points.

1. Role of data analysis engineer

Supervised learning and unsupervised learning

Supervised learning is a learning method that uses labels giving the correct answer. The data that serves as the correct label is called the objective variable, and all other data is called the explanatory variables. Supervised learning **predicts the objective variable from the explanatory variables**.

Unsupervised learning, on the other hand, is a learning method that does not use correct labels. Because there are no correct labels, it is **a learning method without an objective variable**.

Classification and clustering

In classification, which is a supervised learning task, **the number of groups is clearly defined in advance**. For example, if you want to classify dogs and cats, you divide the data into two groups.

Clustering, on the other hand, is categorized as unsupervised learning, and **it is not known in advance how many groups there will be**. Maybe there are 3 groups, maybe 5.

Machine learning processing procedure

Machine learning proceeds in the following order.

Get data -> Data processing -> Data visualization -> Algorithm selection -> Learning process -> Accuracy evaluation -> Trial operation -> Result use (service operation)

Above all, machine learning needs **data**.

Data analysis package

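The main packages for data analysis are the ones covered in section 4 below: NumPy, pandas, Matplotlib, and scikit-learn.

Django is never the right answer here: it is a web framework, not a data analysis package. SciPy gets little coverage in the reference book, but it is also a package used for data analysis.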

2. Python and environment

pip command

Adding the -U option to the pip command updates an already installed library to the latest version. To explicitly install the latest versions, run:

$ pip install -U numpy pandas

Removing whitespace characters

Use the strip method to remove **whitespace characters on the left and right ends** of a string.

in


bird = '   Condor Penguin Duck    '
print("before strip: {}".format(bird))
print("after strip: {}".format(bird.strip()))

out


before strip:    Condor Penguin Duck    
after strip: Condor Penguin Duck
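
As a small supplement, the related lstrip and rstrip string methods remove whitespace from one side only. A minimal sketch that reuses the bird string from the example; wrapping the results in repr is only there to make the remaining spaces visible:

```python
bird = '   Condor Penguin Duck    '

# strip() removes whitespace from both ends; spaces between words are kept.
both = bird.strip()
# lstrip() removes whitespace from the left end only.
left_only = bird.lstrip()
# rstrip() removes whitespace from the right end only.
right_only = bird.rstrip()

print(repr(both))        # 'Condor Penguin Duck'
print(repr(left_only))   # 'Condor Penguin Duck    '
print(repr(right_only))  # '   Condor Penguin Duck'
```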

pickle module

The **pickle module** serializes Python objects so that they can be written to and read back from files.
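
A minimal sketch of a pickle round trip. The dictionary contents and the file name bird.pickle are made up for illustration, and a temporary directory is used so that nothing is left on disk:

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical example object; any picklable Python object works.
data = {"name": "Condor", "weights": [9.5, 11.0], "flies": True}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "bird.pickle"  # hypothetical file name

    # Serialize: pickle files must be opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(data, f)

    # Deserialize into a new, equal object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == data)  # True
```

Only unpickle files you trust, because loading a pickle can execute arbitrary code.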

pathlib module

To work with file system paths in Python, use the **pathlib module**.
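
A minimal sketch of what pathlib offers. The path data/2020/birds.csv is hypothetical and does not need to exist, because these operations only manipulate the path itself:

```python
from pathlib import Path

# Build a path with the / operator (hypothetical directory and file names).
path = Path("data") / "2020" / "birds.csv"

print(path.name)         # birds.csv
print(path.stem)         # birds
print(path.suffix)       # .csv
print(path.parent.name)  # 2020

# Derive a sibling path with a different extension.
pickle_path = path.with_suffix(".pickle")
print(pickle_path.name)  # birds.pickle
```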

Magic command

Jupyter Notebook has commands called **magic commands**, for example %timeit and %%timeit. Both execute code multiple times and measure the execution time.

%timeit measures the time for a single line of code. %%timeit, on the other hand, measures the processing time of the entire cell.

in


%%timeit
# Assumes numpy (np) and matplotlib.pyplot (plt) were imported in an earlier cell.
x = np.arange(10000)
fig, ax = plt.subplots()
ax.pie(x, shadow=True)
ax.axis('equal')
plt.show()

out


#Output of figures is omitted
12 s ± 418 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
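
Outside a notebook, a similar measurement can be made with the standard library timeit module. This is only a rough sketch: the function build_squares and the repeat counts are arbitrary choices for illustration, and the measured numbers depend on the machine:

```python
import timeit

def build_squares(n=1000):
    """Hypothetical workload to measure."""
    return [i * i for i in range(n)]

# Run the function 100 times per measurement and take 7 measurements,
# similar in spirit to the "7 runs" reported by the magic command.
runs = timeit.repeat(build_squares, number=100, repeat=7)

# Each entry is the total time for 100 calls, so divide to get per-call time.
best = min(runs) / 100
print(f"best of {len(runs)} runs: {best * 1e6:.1f} microseconds per call")
```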

3. Basic knowledge for reading mathematical formulas

Writing formulas on Qiita takes time, so I will only introduce them briefly. It is worth looking closely at the graph of each function to get a feel for how it behaves.

Logarithmic function

A function expressed by a formula like the following is called a **logarithmic function**.

f\left( x\right) =\log _{2}x
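
To get a feel for this function, it can be evaluated with math.log2 from the standard library. A minimal sketch with arbitrarily chosen inputs:

```python
import math

# log2(x) answers the question "2 to what power gives x?"
values = {x: math.log2(x) for x in [1, 2, 8, 1024]}
for x, y in values.items():
    print(x, y)  # 1 0.0 / 2 1.0 / 8 3.0 / 1024 10.0

# The logarithm turns multiplication into addition.
product_rule_holds = math.isclose(math.log2(4 * 8), math.log2(4) + math.log2(8))
print(product_rule_holds)  # True
```

The output grows very slowly: x has to double each time for the result to increase by 1.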

Euclidean distance

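The **Euclidean distance** is one way to express the magnitude of a vector as a scalar, that is, to compute its norm.

\left\| x\right\| _{2}=\sqrt{x_{1}^{2}+x_{2}^{2}+\ldots +x_{n}^{2}}

Simply put, the elements of the vector are squared and summed, and the square root of the sum is taken. Adding up the absolute values of the elements instead gives a different norm, the Manhattan distance:

\left\| x\right\| _{1}=\left| x_{1}\right| +\left| x_{2}\right| +\ldots +\left| x_{n}\right|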
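
To make the difference between the sum of absolute values (the L1 norm, also called the Manhattan distance) and the Euclidean distance (the L2 norm) concrete, both can be computed with np.linalg.norm. A minimal sketch; the vector is an arbitrary example:

```python
import numpy as np

x = np.array([3.0, -4.0])

# L1 norm (Manhattan distance): |3| + |-4|
l1 = np.linalg.norm(x, ord=1)
# L2 norm (Euclidean distance): sqrt(3^2 + (-4)^2); this is the default for vectors.
l2 = np.linalg.norm(x)

print(l1)  # 7.0
print(l2)  # 5.0
```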

Matrix multiplication

Multiplying an m × s matrix by an s × n matrix gives an m × n matrix.

Two matrices cannot be multiplied unless the number of columns of the first equals the number of rows of the second; for example, an m × s matrix and a t × n matrix with s ≠ t cannot be multiplied. Also, unlike multiplication of ordinary numbers, matrix multiplication generally gives a different result when the order is swapped.
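
Both rules can be checked with NumPy's @ operator. A minimal sketch; the matrices are arbitrary examples:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # 2 x 3 matrix
b = np.arange(12).reshape(3, 4)  # 3 x 4 matrix

# (2 x 3) @ (3 x 4) -> 2 x 4, because the inner sizes (3 and 3) match.
product = a @ b
print(product.shape)  # (2, 4)

# (3 x 4) @ (2 x 3) fails, because the inner sizes (4 and 2) differ.
try:
    b @ a
    shape_error = False
except ValueError:
    shape_error = True
print(shape_error)  # True

# Even for square matrices, swapping the order changes the result.
p = np.array([[1, 2], [3, 4]])
q = np.array([[0, 1], [1, 0]])
commutes = np.array_equal(p @ q, q @ p)
print(commutes)  # False
```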

Differentiation of the exponential function

$f\left( x\right) =e^{x}$ **does not change even when it is differentiated**.

f'\left( x\right) =e^{x}
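
This property can be checked numerically. The sketch below approximates the derivative with a central difference; the helper numerical_derivative and the step size h are illustrative choices, not part of any library:

```python
import math

def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# The slope of e^x at each point should match the value of e^x itself.
checks = []
for x in [0.0, 1.0, 2.0]:
    slope = numerical_derivative(math.exp, x)
    checks.append(math.isclose(slope, math.exp(x), rel_tol=1e-6))
    print(x, slope, math.exp(x))

print(all(checks))  # True
```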

4.1 NumPy

dtype attribute

You can check the **data type of the elements** of a NumPy array (ndarray) with the dtype attribute. Incidentally, Python's built-in type function returns the type of the array itself (ndarray).

in


import numpy as np

a = np.array([1, 2, 3])
print("ndarray dtype: {}".format(a.dtype))
print("ndarray type: {}".format(type(a)))

out


ndarray dtype: int32
ndarray type: <class 'numpy.ndarray'>

Copy and reference

For an ndarray, the assignment b = a creates a reference. (If you change a value in b, **the value in a also changes**.) The assignment b = a.copy() creates a copy. (Changing a value in b **does not change the value in a**.)

Slicing a standard Python list gives you a **copy**, but slicing a NumPy array gives you a **reference** (a view of the same data).

If you try various combinations, you will get a better understanding.
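
A minimal sketch of such combinations; the variable names and values are arbitrary:

```python
import numpy as np

a = np.array([1, 2, 3])

# Plain assignment: b is just another name for the same array.
b = a
b[0] = 100
print(a.tolist())  # [100, 2, 3]

# copy(): c has its own data, so a is unaffected.
c = a.copy()
c[1] = 200
print(a.tolist())  # [100, 2, 3]

# Slicing a Python list returns a copy.
py_list = [1, 2, 3]
list_slice = py_list[:2]
list_slice[0] = -1
print(py_list)     # [1, 2, 3]

# Slicing an ndarray returns a view onto the same data.
view = a[:2]
view[0] = -1
print(a.tolist())  # [-1, 2, 3]
```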

nan

Use np.nan to represent a non-numeric value (Not a Number) in NumPy.

in


a = np.array([1, np.nan, 3])
print(a)

out


[ 1. nan  3.]
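
A few properties of np.nan are worth knowing when working with such arrays. A minimal sketch, reusing the same array:

```python
import numpy as np

a = np.array([1, np.nan, 3])

# nan is a floating point value, so the whole array becomes float64.
print(a.dtype)           # float64

# nan is not equal to anything, including itself, so use np.isnan to find it.
print(np.nan == np.nan)  # False
mask = np.isnan(a)
print(mask.tolist())     # [False, True, False]

# Ordinary aggregations propagate nan; the nan-aware variants skip it.
plain_mean = a.mean()
nan_aware_mean = np.nanmean(a)
print(plain_mean)        # nan
print(nan_aware_mean)    # 2.0
```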

Splitting a matrix

The vsplit function splits a matrix in the **row direction**, and the hsplit function splits it in the **column direction**.

in


a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
first1, second1 = np.vsplit(a, [2])
first2, second2 = np.hsplit(second1, [2])
print(second2)

out


[[9]]

Average value

Use the mean method to find the mean of all elements of the matrix.

in


a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a.mean()

out


5.0
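
As a supplement, the mean method also accepts an axis argument to average along one direction only. A minimal sketch with the same matrix:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=0 averages down the rows, giving one mean per column.
col_means = a.mean(axis=0)
# axis=1 averages across the columns, giving one mean per row.
row_means = a.mean(axis=1)

print(col_means.tolist())  # [4.0, 5.0, 6.0]
print(row_means.tolist())  # [2.0, 5.0, 8.0]
```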

Logical value

Comparing an ndarray with a comparison operator returns an array of True / False values, one for each element.

in


a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a > 4

out


array([[False, False, False],
       [False,  True,  True],
       [ True,  True,  True]])
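
Such a boolean array is typically used as a mask to select or count elements. A minimal sketch with the same matrix:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mask = a > 4

# Indexing with the mask keeps only the elements where it is True (as a 1D array).
selected = a[mask]
print(selected.tolist())  # [5, 6, 7, 8, 9]

# True counts as 1, so summing the mask counts the matching elements.
count = int(mask.sum())
print(count)              # 5
```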

4.2 Pandas

Index / column name specification

Use **loc / iloc** to extract data from a DataFrame by specifying rows and columns.

loc specifies rows and columns by **index name and column name**. iloc specifies rows and columns by **integer position or range**.

in


import pandas as pd

df = pd.DataFrame([[1, 2, 3], [5, 7, 11], [13, 17, 19]])
df.index = ["01", "02", "03"]
df.columns = ["A", "B", "C"]

display(df.loc[["01", "03"], ["A", "C"]])
display(df.iloc[[0, 2], [0, 2]])

(Figure: both calls display the same table, rows 01 and 03 with columns A and C.)

Write / read data

Data is written with to_xxx and read with read_xxx. Formats such as Excel, CSV, and pickle are supported.

in


df.to_excel("FileName.xlsx")
df = pd.read_excel("FileName.xlsx")
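
The same to_xxx / read_xxx pairing applies to CSV. A minimal sketch of a round trip with to_csv and read_csv; the DataFrame contents are made up, and an in-memory io.StringIO buffer stands in for a file so that nothing is written to disk:

```python
import io

import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"A": [1, 5, 13], "B": [2, 7, 17]}, index=["x", "y", "z"])

# Write: to_csv accepts a file path or a file-like object.
buffer = io.StringIO()
df.to_csv(buffer)

# Read: rewind the buffer, then tell read_csv that the first column is the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Without index_col=0, the saved index would come back as an ordinary unnamed column.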

Sorting data

Data is sorted with the sort_values method. **By default, the sort is in ascending order.** Pass `ascending=False` as an argument to **sort in descending order**.

in


df = pd.DataFrame([[1, 2, 3], [5, 7, 11], [13, 17, 19]])
df.index = ["01", "02", "03"]
df.columns = ["A", "B", "C"]

df.sort_values(by="C", ascending=False)

(Figure: the table sorted by column C in descending order, with rows in the order 03, 02, 01.)

One-hot encoding

You can convert data to one-hot encoding with the get_dummies function. One-hot encoding **adds one column for each category** of a categorical variable.
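
A minimal sketch of pd.get_dummies. The column names animal and weight and their values are made up for illustration:

```python
import pandas as pd

# Hypothetical data with one categorical column and one numeric column.
df = pd.DataFrame({
    "animal": ["dog", "cat", "dog"],
    "weight": [10.0, 4.0, 12.0],
})

# Encode only the categorical column; one new column is created per category.
encoded = pd.get_dummies(df, columns=["animal"])

print(encoded.columns.tolist())
print(encoded)
```

The animal column is replaced by animal_cat and animal_dog, while the numeric weight column is left untouched. Depending on the pandas version, the new columns hold booleans or 0/1 integers.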

Date array

Use the date_range function to get an array of dates. You can set the **first and last dates** with the arguments **start and end**.

in


dates = pd.date_range(start="2020-01-01", end="2020-12-31")
print(dates)

out


DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
               '2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
               '2020-12-30', '2020-12-31'],
              dtype='datetime64[ns]', length=366, freq='D')

4.3 Matplotlib

Subplot

Specify the number of subplots to place as an argument of the subplots function. **Passing a single number such as 2 places the subplots in two rows, and passing ncols=2 places them in two columns**.

in


import matplotlib.pyplot as plt

fig, axes = plt.subplots(2)
plt.show()

(Figure: two empty subplots stacked vertically.)

in


fig, axes = plt.subplots(ncols=2)
plt.show()

(Figure: two empty subplots side by side.)

Scatter plot

Scatter plots can be drawn with the scatter method.

Histogram

A histogram can be drawn with the hist method. You can specify **the number of bins** with the **bins argument**.

Pie chart

Pie charts can be drawn with the pie method. By default, the wedges are drawn **counterclockwise** starting from the right.
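
The three chart types can be sketched together as follows. The data is randomly generated for illustration only, and the Agg backend is selected purely so that the script runs without a display; in a notebook, drop that line and call plt.show() instead of saving:

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for running as a script
import matplotlib.pyplot as plt
import numpy as np

# Made-up example data.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)

fig, axes = plt.subplots(ncols=3, figsize=(12, 4))

# Scatter plot of y against x.
axes[0].scatter(x, y)

# Histogram with 20 bins; hist returns the counts and the bin edges.
counts, edges, _ = axes[1].hist(x, bins=20)

# Pie chart; wedges start at the right and go counterclockwise by default.
axes[2].pie([50, 30, 20], labels=["A", "B", "C"])
axes[2].axis("equal")

# Render to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```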

Style

For colors, you can specify **color names defined in HTML, X11, or CSS4**. Font styles can be **defined in a dictionary and applied collectively, or applied individually**.

4.4 scikit-learn

Classification model

When building a classification model, the dataset is split into **training data** and **test data**. This is because the model's **generalization ability**, that is, how well it handles data it has not seen, needs to be evaluated.
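
A minimal sketch of such a split, assuming scikit-learn's train_test_split. The toy arrays, the 25% test size, and the random_state value are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 12 samples with 2 explanatory variables and a binary objective variable.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

# Hold out 25% of the samples as test data; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)  # (9, 2) (3, 2)
```

The model is fitted on X_train and y_train only, and its accuracy is then measured on X_test and y_test.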

Decision tree

A decision tree has the advantage that the model can be visualized and its contents are easy to understand. Its parameters must be set by the user. The goal when building a decision tree is to **maximize information gain**, or equivalently to minimize **impurity**. (Both mean the same thing.)
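
The relationship between impurity and information gain can be illustrated with a small calculation. This sketch assumes Gini impurity as the impurity measure (entropy is another common choice); the helper functions and the cat/dog labels are made up for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for a more mixed group."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["cat"] * 4 + ["dog"] * 4

# A split that separates the classes perfectly.
good_gain = information_gain(parent, ["cat"] * 4, ["dog"] * 4)
# A split that leaves both children as mixed as the parent.
mixed = ["cat", "cat", "dog", "dog"]
poor_gain = information_gain(parent, mixed, mixed)

print(gini(parent))  # 0.5
print(good_gain)     # 0.5
print(poor_gain)     # 0.0
```

The split with the lowest child impurity is the one with the highest information gain, which is why the two goals coincide.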

Dimensionality reduction

Dimensionality reduction is the task of reducing the number of dimensions while losing as little information as possible. For example, given 2D data with X and Y, you can drop the unimportant Y to obtain 1D data with X only.
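
One common technique for this is principal component analysis. A minimal sketch, assuming scikit-learn's PCA class; the data is randomly generated so that almost all of the variation lies along X and very little along Y:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up 2D data: X varies a lot, Y barely varies.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.1 * rng.normal(size=100)
data = np.column_stack([x, y])

# Reduce from 2 dimensions to 1.
pca = PCA(n_components=1)
reduced = pca.fit_transform(data)

print(reduced.shape)  # (100, 1)
# Fraction of the original variance kept by the single remaining dimension.
print(pca.explained_variance_ratio_)
```

Because Y carries so little variation here, the one remaining dimension keeps nearly all of the information.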

ROC curve and AUC

To draw the ROC curve, arrange the samples in descending order of predicted probability and, taking the probability of each sample in turn as a threshold, predict every sample at or above that probability as a positive example; plotting the resulting true positive rate against the false positive rate gives the curve. AUC is the area under this curve. As the AUC value approaches 1, samples with relatively high probability tend to be positive examples and samples with relatively low probability tend to be negative examples. In other words, AUC can be used to compare how good different models are.
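
AUC can also be read as the fraction of (positive, negative) pairs in which the positive sample receives the higher probability, which equals the area under the ROC curve. The sketch below computes it that way in plain Python; the function auc_by_pairs, the labels, and the scores are made up for illustration and this is not how a library would implement it:

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 1, 0, 0]

# Model A ranks every positive above every negative.
scores_a = [0.9, 0.8, 0.2, 0.7, 0.3, 0.1]
# Model B ranks one negative (0.7) above one positive (0.6).
scores_b = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

auc_a = auc_by_pairs(labels, scores_a)
auc_b = auc_by_pairs(labels, scores_b)
print(auc_a)  # 1.0
print(auc_b)  # 8 of 9 pairs are ordered correctly, about 0.889
```

Model A has the higher AUC, so by this measure it separates positive and negative examples better than model B.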

Reference / Citation

A new textbook for data analysis using Python
