Yesterday, I covered common pitfalls in statistics as five viewpoints for detecting statistical lies.

Today, I would like to return to the data itself, that is, the data we are actually trying to analyze.

A **KPI (key performance indicator)** is a numerical value that quantitatively expresses what is required to achieve a goal. "I want to get in shape and look beautiful" is not a KPI; "lose 10 kilos within three months" or "raise the bridge of my nose by 1.5 centimeters within three months" is.

If you do not understand exactly what kind of data you are using as a KPI, you are likely to choose the wrong KPI and end up with a meaningless analysis.

What is a variable?

In fields such as social surveys and medical statistics, we record the condition of the people surveyed from various angles. For example, imagine a questionnaire or a medical record: you are asked your gender, your age, and so on. The values recorded can be broadly divided into **discrete variables** and **continuous variables**. Discrete variables are those whose possible values are clearly separated from one another, such as gender. Discrete variables can be further divided into two kinds depending on whether their values can be ordered, which gives three types in total:

Unorderable discrete variables (e.g., gender, nationality, company affiliation)
Orderable discrete variables (e.g., grades: 1. excellent, 2. good, 3. pass, 4. fail)
Continuous variables

What is a scale?

Variables are categorized by their scale level as follows:

**Nominal scale** Variables given as simple categories that cannot be ordered are on a nominal scale. Examples include place of origin and company affiliation. A nominal scale cannot be averaged, but you can take the mode.
**Ordinal scale** The scale for the orderable discrete variables mentioned above. Rankings are an example. Because it expresses only order, you cannot say that first place is twice as good as second place. The intervals are not constant: first and second place may differ only slightly, while third place and below may differ greatly. For this reason the mean and variance are not meaningful.
**Interval scale** A continuous variable with no true zero point. Examples include time of day and temperature. You cannot say that 20:00 is twice as late as 10:00, or that 30 degrees is twice as hot as 15 degrees. Unlike the ordinal scale, however, the intervals themselves are meaningful. You can compute summary statistics such as the mean and variance, but not ratios.
**Ratio scale (proportional scale)** A continuous variable with a true zero point. Examples include sales, price, number of users, days elapsed since a certain day, and test scores out of 100. Ratios are meaningful here: sales of 200 are twice sales of 100.
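The difference between these scales can be sketched in a few lines of Python. This is only an illustration: the company names are hypothetical sample data, and the helper `celsius_to_fahrenheit` is introduced here for the example. It shows that the mode works on nominal data, and that a "ratio" on an interval scale changes when the unit changes, which is why it carries no meaning.

```python
from statistics import mode

# Nominal scale: only counting makes sense, so the mode is available.
# The company names are hypothetical sample data.
companies = ["A Corp", "B Inc", "A Corp", "C Ltd", "A Corp"]
most_common_company = mode(companies)

# Interval scale: differences are meaningful, ratios are not.
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

warm_c, cool_c = 30.0, 15.0
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)

print(most_common_company)
print(ratio_celsius)     # "twice as hot" in Celsius...
print(ratio_fahrenheit)  # ...but a different ratio for the same two temperatures
```

The same two temperatures give a ratio of 2 in Celsius and about 1.46 in Fahrenheit, so the statement "twice as hot" depends only on the choice of unit.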

What is a representative value?

A summary statistic is a number obtained by applying a statistical operation that summarizes the data. Representative values are the most commonly used summary statistics. The average is probably the most familiar: it is used every day, for example when splitting the bill per person at a drinking party.

Average value

Also called the arithmetic mean, it is the sum of all observed values divided by the number of observations. Because it is computed from all the data, it has the advantage of reflecting every observation. The disadvantage is that it is sensitive to outliers. For this reason a trimmed mean, which averages the data after excluding the top and bottom few percent, is sometimes used.
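As a rough illustration of how an outlier pulls the mean and how trimming helps, here is a minimal sketch. The function `trimmed_mean` and the `spending` data are hypothetical and written for this example only; the function drops a fixed number of values from each end rather than a percentage.

```python
def trimmed_mean(values, drop):
    """Average after dropping the `drop` smallest and `drop` largest values."""
    if drop < 0 or 2 * drop >= len(values):
        raise ValueError("drop must leave at least one value")
    ordered = sorted(values)
    kept = ordered[drop:len(ordered) - drop]
    return sum(kept) / len(kept)

# Hypothetical per-person spending with one extreme value (400)
spending = [28, 30, 31, 33, 35, 36, 38, 40, 42, 400]

plain = sum(spending) / len(spending)
trimmed = trimmed_mean(spending, 1)

print(plain)    # pulled upward by the single outlier
print(trimmed)  # close to the typical values
```

Here the plain mean is 71.3, while the mean after dropping one value from each end is 35.625.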

Median

It is the value located exactly in the middle when the observed values are sorted (with an even number of observations, the average of the two middle values). It is useful when the shape of the distribution is unknown or when many outliers are expected.

Mode

As the name implies, it is the most frequently observed value.

In every case, keep in mind that a summary discards some information.

Example

The table below shows the age distribution of the people belonging to a kindergarten.

age	Number of people
3	15
4	28
5	31
6	15
22	1
25	1
46	1
49	1
70	1
75	1

Consider which of the above is most appropriate to apply as a representative value in this case.
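One way to explore this question is to compute all three representative values from the table. The sketch below uses only Python's standard `statistics` module; the dictionary `age_counts` is just the table above typed in by hand.

```python
from statistics import mean, median, mode

# Age -> number of people, copied from the table above
age_counts = {3: 15, 4: 28, 5: 31, 6: 15,
              22: 1, 25: 1, 46: 1, 49: 1, 70: 1, 75: 1}

# Expand the frequency table into one entry per person
ages = [age for age, count in age_counts.items() for _ in range(count)]

age_mean = mean(ages)
age_median = median(ages)
age_mode = mode(ages)

print(len(ages))             # total number of people
print(round(age_mean, 2))    # mean age
print(age_median, age_mode)  # median and mode
```

The mean comes out at about 7.25, an age that nobody in the table actually has, because the handful of adults pulls it upward. The median and the mode are both 5, which describes the children far better.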

Linear regression

I have talked about variables for a long time, so let's do something that looks more like statistics. Suppose we have the following data: a table of customer ages and purchase amounts at a cosmetics store.

Age	Price (unit: 100 yen)
24	236
27	330
29	375
34	392
42	460
43	525
51	578

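Here, price is on a ratio (proportional) scale. Age is treated as an interval scale in this example, although strictly speaking it has a true zero and also qualifies as a ratio scale.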

Plot the data

Plotting simply means drawing a graph from variables. Why draw a graph? A graph expresses the variables with visual symbols, which makes the data easier to understand and helps you form hypotheses.

Python's NumPy and matplotlib are excellent libraries that are used very often for statistical computation and for visualizing the results. Their functions are powerful and easy to use. To start, let's draw a scatter plot.

import numpy as np  # Load NumPy
import matplotlib.pyplot as plt  # Load matplotlib

v1 = np.array([24, 27, 29, 34, 42, 43, 51])  # Customer ages
v2 = np.array([236, 330, 375, 392, 460, 525, 578])  # Purchase prices

plt.xlim(20, 55)  # X-axis range
plt.ylim(200, 600)  # Y-axis range
plt.xlabel('Age')  # X-axis label
plt.ylabel('Price')  # Y-axis label
plt.plot(v1, v2, 'o', color="blue")  # Draw the points as a scatter plot
plt.savefig("image.png")  # Save before show(); saving afterwards can write a blank image
plt.show()  # Display the figure on the screen

This produces a scatter plot of price against age.

Find a linear function

Looking at the figure, the purchase amount seems to increase with the customer's age. Intuitively, one wants to draw a straight line rising to the right.

Mathematically, regression analysis approximates the data with a specific function, such as a linear function (y = 2x, etc.) or a logarithmic curve, chosen from an assumed model. Linear regression is the case in which the model is linear in its coefficients.

First, let's actually perform linear regression programmatically.

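import numpy as np
import matplotlib.pyplot as plt

v1 = np.array([24, 27, 29, 34, 42, 43, 51])  # Customer ages
v2 = np.array([236, 330, 375, 392, 460, 525, 578])  # Purchase prices

def phi(x):
    # Basis functions: a cubic polynomial in x
    return [1, x, x**2, x**3]

def f(w, x):
    # Model prediction: weighted sum of the basis functions
    return np.dot(w, phi(x))

# Design matrix built from the prices; the ages are the targets.
# dtype=float avoids integer overflow in the products below.
PHI = np.array([phi(x) for x in v2], dtype=float)
# Solve the normal equations (PHI^T PHI) w = PHI^T v1 for the weights
w = np.linalg.solve(np.dot(PHI.T, PHI), np.dot(PHI.T, v1))

ylist = np.arange(200, 600, 10)  # Prices at which to evaluate the model
xlist = [f(w, x) for x in ylist]  # Predicted age for each price

plt.plot(xlist, ylist, color="red")  # Fitted curve
plt.xlim(20, 55)
plt.ylim(200, 600)
plt.xlabel('Age')
plt.ylabel('Price')
plt.plot(v1, v2, 'o', color="blue")  # Original data points
plt.savefig("image2.png")  # Save before show(); saving afterwards can write a blank image
plt.show()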

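In this way, a fitted curve was obtained, and it looks like a reasonable approximation of the data. Note that with the basis [1, x, x**2, x**3] the code fits a cubic polynomial that predicts age from price; it is still called linear regression because the model is linear in the weights w. Replacing the basis with [1, x] would give a straight line (= linear function).

I will leave the detailed theory of linear regression to textbooks, but next time I would like to look at linear regression and its applications.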
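For comparison, a plain straight-line fit of price against age can be sketched with NumPy's `np.polyfit` at degree 1. This is a minimal sketch and not the method used in the listing above: the names `ages`, `prices`, `slope`, `intercept`, and `predict_price` are my own, and it regresses price on age, which is the reverse direction of the code above.

```python
import numpy as np

ages = np.array([24, 27, 29, 34, 42, 43, 51])
prices = np.array([236, 330, 375, 392, 460, 525, 578])

# Degree-1 least-squares fit: price ≈ slope * age + intercept
slope, intercept = np.polyfit(ages, prices, 1)

def predict_price(age):
    """Predicted purchase price for a given age under the fitted line."""
    return slope * age + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted price at age 40: {predict_price(40):.1f}")
```

With this data the slope is roughly 11.4, meaning the fitted purchase amount rises by about 11.4 units per year of age. With only seven data points this is a description of the sample, not a reliable prediction.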

References

Introduction to Social Statistics (The Open University of Japan Teaching Materials) http://www.amazon.co.jp/dp/4595313705

[PDF] Introduction to Statistics-Hideo Konami http://ruby.kyoto-wu.ac.jp/~konami/Text/Statistics.pdf

Data Visualization for Engineers [Practice] Introduction ~ Web Visualization with D3.js http://www.amazon.co.jp/dp/4774163260

[PYTHON] Understanding data types and beginning linear regression