[PYTHON] Data Scientist Training Course Chapter 2 Day 2

Today I continued with Chapter 2. As usual, the environment runs on Docker.

The main libraries used are as follows:

Numpy
Scipy
Pandas
Matplotlib

Numpy memo

I had already used Numpy a little before this session, but there are still points I don't understand, so I'm checking things as I go.

For calculations on tabular data, you are probably more likely to use Pandas DataFrames. With that in mind, I think Numpy is mostly used here for numerical calculations and for generating random numbers.

For random number generation in Numpy,

np.random.randn()

seems to be used often. The randn function returns random numbers drawn from the standard normal distribution, that is, a normal distribution with mean 0 and standard deviation 1.

There are several random number generators besides randn, and uniform appears at the end of the chapter. To generate multiple random numbers at once, write

np.random.randn(1000)

This creates 1000 random numbers. In this case, the returned value is an array.
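As a small sketch of the above (the seed value and variable names are my own assumptions, not taken from the course), the following draws 1000 standard normal values and checks that the result is an array whose mean and standard deviation come out close to 0 and 1. It also draws from uniform for comparison.

```python
import numpy as np

np.random.seed(0)  # assumed seed, only so that repeated runs give the same numbers

samples = np.random.randn(1000)  # 1000 draws from the standard normal distribution
print(type(samples), samples.shape)  # a NumPy ndarray holding 1000 elements
print(samples.mean(), samples.std())  # should be close to 0 and 1

# uniform draws values evenly from the half-open interval [0.0, 1.0)
uniform_samples = np.random.uniform(0.0, 1.0, 1000)
print(uniform_samples.min(), uniform_samples.max())
```

With only 1000 samples, the mean and standard deviation will not be exactly 0 and 1, just near them.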

To get a sequence of consecutive integers, there is also

np.arange(1000)

In the above case, an array of the integers from 0 to 999 is returned (1000 values, with the stop value 1000 itself excluded). It is used as the X-axis values when drawing a graph.
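A quick check of what this call returns may help here. np.arange(1000) starts at 0 and stops before 1000; the variable names below are just for illustration.

```python
import numpy as np

x = np.arange(1000)  # integers starting at 0, stopping before 1000
print(x[0], x[-1], len(x))  # first value, last value, number of elements

# If a sequence starting at 1 is wanted, give both start and stop explicitly
x_from_one = np.arange(1, 1001)
print(x_from_one[0], x_from_one[-1], len(x_from_one))
```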

Scipy memo

In this chapter it is mainly used for matrix calculations. I have forgotten most of the matrix math itself, so I'm reading up as I go, but I do remember that there were such things as the eigenvalues of a matrix and the inverse matrix. I'll need to come back and relearn this area if it becomes necessary.
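As a reminder of what those two operations look like in code, here is a minimal sketch using scipy.linalg. The 2x2 matrix is a made-up example, not one from the course.

```python
import numpy as np
from scipy import linalg

# Hypothetical example matrix (symmetric, so its eigenvalues are real)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = linalg.eig(A)  # eigenvalues come back as complex numbers
A_inv = linalg.inv(A)                      # inverse matrix

print(eigenvalues.real)  # real parts of the eigenvalues
print(A_inv)
print(A @ A_inv)         # should be (numerically) the identity matrix
```

Multiplying a matrix by its inverse and checking for the identity matrix is an easy way to confirm the result.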

I also learned Newton's method as a way of solving equations numerically.

from scipy.optimize import newton
newton(sample_function, 0)

Written like this, the second argument 0 is the initial guess, and newton searches from there for a root of sample_function. In other words, it finds an x such that f(x) = 0.
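A minimal runnable sketch of that call, assuming sample_function is the quadratic x ** 2 + 2 * x + 1 used as the example in this post. Note that when no derivative is passed, SciPy's newton falls back to the secant method.

```python
from scipy.optimize import newton


def sample_function(x):
    # f(x) = x^2 + 2x + 1 = (x + 1)^2, whose only real root is x = -1
    return x ** 2 + 2 * x + 1


# 0 is the starting guess for the search, not the value f should take
root = newton(sample_function, 0)
print(root)  # approximately -1
```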

As an example, I passed x ** 2 + 2 * x + 1, that is, f(x) = x^2 + 2x + 1, to the newton function and got the answer. However, when I tried to feed f(x) = 2x^2 + 2x + 1 to the newton function, an error occurred.

Failed to converge after 50 iterations, value is 0.6246914113887032

It tried 50 iterations, but apparently it did not converge.

newton(sample_function2, 0, maxiter=1000)

The newton function lets you specify the maximum number of iterations as an argument, so I tried running about 1000 iterations, but in the end that didn't help either. I don't really understand the characteristics of Newton's method yet, so there is probably a reason it can't work here, but I'm left wondering what happened.
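One likely explanation (my own reading, not something stated in the course) is that f(x) = 2x^2 + 2x + 1 has no real root at all, so an iteration that starts from a real number has nothing to converge to, no matter how many iterations are allowed. The sketch below checks this with the discriminant and with np.roots, which also returns complex roots.

```python
import numpy as np

# Coefficients of f(x) = 2x^2 + 2x + 1
a, b, c = 2, 2, 1

discriminant = b ** 2 - 4 * a * c
print(discriminant)  # negative, so the parabola never crosses the x-axis

roots = np.roots([a, b, c])  # all roots of the polynomial, complex ones included
print(roots)  # a complex conjugate pair, with no real root among them

# The smallest value f takes on the real line is at x = -b / (2a)
x_min = -b / (2 * a)
print(x_min, a * x_min ** 2 + b * x_min + c)  # the minimum is above zero
```

By contrast, x^2 + 2x + 1 = (x + 1)^2 touches the x-axis at x = -1, which is why that case worked.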

Pandas memo

I skimmed through the basics. The merge function is provided as a way to join multiple DataFrames like RDB tables, but the chapter only shows the automatic join, and I could not find a way to specify the join condition manually.

This is covered in the official reference, with detailed examples.

pandas.DataFrame.merge

Looking more closely, the chapter writes it as pd.merge(data_frame1, data_frame2), while the reference writes it as data_frame1.merge(data_frame2). The first is a module-level function and the second is a DataFrame method whose first argument is self, so both ways of writing it are supported.
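As a sketch of specifying the join condition by hand, the example below uses the left_on, right_on, and how parameters described in the reference. The two tables and their column names are hypothetical, made up only for illustration.

```python
import pandas as pd

# Hypothetical tables; the key column has a different name on each side
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3, 4], "amount": [100, 250, 80, 40]})

# Function form: name the key column on each side explicitly
joined_func = pd.merge(
    customers, orders, left_on="customer_id", right_on="cust_id", how="inner"
)

# Method form: the same join written as a DataFrame method
joined_method = customers.merge(
    orders, left_on="customer_id", right_on="cust_id", how="inner"
)

print(joined_func)
print(joined_func.equals(joined_method))  # both forms give the same result

# how="left" keeps customers that have no orders; their order columns become NaN
left_joined = customers.merge(
    orders, left_on="customer_id", right_on="cust_id", how="left"
)
print(left_joined)
```

When the key column has the same name in both tables, on="column_name" can be used instead of left_on and right_on.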

When I search, I mostly hit Japanese translations and articles, and I feel the urge to rely on them, but I think it's better to get into the habit of reading the official documentation properly. First of all, I will try to take on that challenge. While shedding tears.

Matplotlib memo

Whether or not I properly understand what it means, seeing something drawn as a graph makes me a little happy, probably because I'm getting tired.

import matplotlib.pyplot as plt
plt.plot(x, y, "o")

I drew the plot like this, and when I changed the "o" part, the plotted marks changed. As expected, "x" gave crosses, but when I tried "g", for example, it became a line graph. Looking at the reference, "g" does not turn the plot into a line graph; a line graph is the default in the first place, and "g" specifies the color green.

matplotlib.pyplot.plot

It seems you can specify various other plot markers, so it's fun to try them out.

The end

I managed to finish up to Chapter 2. I can keep going because so far it is only a matter of remembering things, but it hurts that I can't set aside enough time to get through a chapter in one go.

Up to here it has been basic usage of Python and the libraries. From the next chapter, we start on actual statistics and analysis, so the difficulty level will jump all at once.