[Python] Make predictions using regression on actual data ~ Part 1

Last time I produced a scatter plot, so this time I would like to use somewhat more concrete data and perform a regression analysis in addition to the scatter plot.

This time, we will start with regression on a single feature.

Data to be used

I will use the CSV file that can be downloaded from https://vincentarelbundock.github.io/Rdatasets/csv/datasets/airmiles.csv (listed on the site https://sites.google.com/site/datajackets/data-jackets/list2/dj0514).

This is data on the revenue passenger miles flown by commercial airlines in the United States for each year from 1937 to 1960.

First, let's draw a scatter plot.

Let's output a scatter plot using the previous code.

from matplotlib import pyplot as plt
import numpy as np

def main():
    # Load the CSV, skipping the header row (genfromtxt uses skip_header, not skiprows)
    data = np.genfromtxt("airmiles.csv", delimiter=",", skip_header=1)
    # Column 1 is the year, column 2 is the passenger miles
    plt.scatter(data[:, 1], data[:, 2])
    plt.xlabel('year')
    plt.ylabel('airmiles')
    plt.show()

if __name__ == '__main__':
    main()

Here is the output:

(Screenshot: scatter plot of airmiles by year)

It should look something like this. Now that we have a scatter plot, let's try drawing a regression line.

Let's draw a regression line!

Let's start with the code.


from matplotlib import pyplot as plt
import numpy as np

def main():
    # Load the CSV, skipping the header row (genfromtxt uses skip_header, not skiprows)
    data = np.genfromtxt("airmiles.csv", delimiter=",", skip_header=1)

    x = data[:, 1]  # year
    y = data[:, 2]  # passenger miles
    # Design matrix: a column of x values and a column of ones (for the intercept)
    A = np.array([x, np.ones(len(x))])
    A = A.T
    # Least-squares fit: the first element of the returned tuple is the solution [slope, intercept]
    m, c = np.linalg.lstsq(A, y)[0]
    # Print the fitted coefficients (these are the values quoted below)
    print("slope:", m, "intercept:", c)

    plt.scatter(x, y)
    plt.xlabel('year')
    plt.ylabel('airmiles')
    plt.plot(x, m * x + c)
    plt.show()

if __name__ == '__main__':
    main()

Here is the output:

(Screenshot: scatter plot with the fitted regression line)

This draws a straight line that fits the data: regression by the least-squares method. Now, you will notice that there is some new code here.

    A = np.array([x, np.ones(len(x))])
    A = A.T
    m, c = np.linalg.lstsq(A, y)[0]

If you want the full details, please look this part up yourself (I am not completely sure about it, so I will add an explanation once I understand it better).
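Roughly, though, here is my current understanding: we want to fit the model y = m*x + c, so A stacks the x values and a column of ones (one row per data point), and np.linalg.lstsq finds the m and c that minimize the squared error between A·[m, c] and y. Below is a minimal sketch with made-up numbers (the values are only an illustration, not the airmiles data):

import numpy as np

# Toy data (illustration only): y is roughly 2*x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

# Design matrix: one row per sample, columns are [x, 1]
A = np.array([x, np.ones(len(x))]).T

# Solve for [m, c] that minimize the sum of squared residuals
m, c = np.linalg.lstsq(A, y)[0]
print(m, c)  # m should come out close to 2 and c close to 1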

np.linalg.lstsq

Let me explain this function. np.linalg is the linear algebra module included in NumPy, and what it provides is documented in detail here: http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.linalg.html
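One thing worth noting about how it is called: np.linalg.lstsq(A, y) returns four values, (solution, residuals, rank, singular values), which is why the code above takes [0] to pick out just the solution vector. A short sketch, reusing the A and y from the code above:

# lstsq returns four values; we only need the first one (the solution vector)
solution, residuals, rank, singular_values = np.linalg.lstsq(A, y)
m, c = solution       # slope and intercept of the fitted line
print(residuals)      # sum of squared residuals of the fit (can be empty in degenerate cases)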

lstsq is an abbreviation for least squares, i.e. the least-squares method.

"When approximating a set of numerical values obtained by measurement using a specific function such as a linear function or logarithmic curve assumed from an appropriate model, the assumed function should be a good approximation to the measured value. To determine the coefficient that minimizes the sum of squares of the residuals. "
(From Wikipedia)

This time, the least-squares fit gave a slope of 1350.28173913 and an intercept of -2620496.13536.
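With the slope and intercept in hand, making a prediction is just a matter of plugging a year into m*x + c. For example (continuing from the variables in the code above; the year 1961 is just an illustration I picked, it is not in the data):

# Predict passenger miles for a year outside the data range (illustration only)
year = 1961
predicted = m * year + c
print("predicted airmiles for", year, ":", predicted)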

Next time, I would like to cover analysis with multiple features (multiple regression).
