[PYTHON] Obtain statistics etc. from the extracted sample

Previously, I wrote about extracting a sample by scanning the entire population with Hadoop. When you have little prior information about the data and want to explore it by trial and error, you first analyze the extracted sample ad hoc from various angles to grasp the characteristics and trends of the data.

Make full use of pandas functions

Sampling with Hadoop and pandas go together very well. As introduced previously, the combination of pandas and matplotlib lets you analyze data with two data structures, Series and DataFrame, and visualize the results.

Loading samples extracted with Hadoop

Since Hadoop output is standard tab-delimited data, it can be read as is with the pd.read_table() function.

import pandas as pd
df = pd.read_table('hadoop-out.txt')
df.describe()  # compute several summary statistics at once

#=> count              38156219  # total number of records
#   unique              6536847  # number of distinct values
#   top      0024D69XXXXX,Area9  # most frequent value
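
To try the loading step without a real Hadoop output file, the following minimal sketch reads a small in-memory tab-delimited string instead. The column names "key" and "count" and all sample rows except the 0024D69XXXXX,Area9 key are made up for illustration, and it assumes the output has no header line, which may not match your job.

```python
import io
import pandas as pd

# Hypothetical tab-delimited sample standing in for a Hadoop output file.
# Assumption: no header line, so column names are supplied explicitly.
raw = (
    "0024D69XXXXX,Area9\t3\n"
    "0024D69YYYYY,Area2\t1\n"
    "0024D69XXXXX,Area9\t5\n"
)

df = pd.read_table(io.StringIO(raw), header=None, names=["key", "count"])

# On a string column, describe() reports count, unique, top, and freq
key_summary = df["key"].describe()

# On a numeric column, describe() reports count, mean, std, min, quartiles, max
count_summary = df["count"].describe()

print(key_summary)
print(count_summary)
```

Which statistics describe() returns depends on the column type, which is why the output shown above contains count, unique, and top rather than a mean.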

You can also convert a dictionary object into a data frame as follows:

df = pd.DataFrame(list(self.dic.values()), index=list(self.dic.keys()))
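
As a self-contained illustration of the same conversion, here is a small sketch with a hypothetical dictionary; the keys, values, and column names are invented, and a plain variable dic stands in for self.dic.

```python
import pandas as pd

# Hypothetical dictionary: each key maps to the values of one row
dic = {"Area1": [10, 0.5], "Area2": [20, 0.25]}

# Same pattern as above: values become rows, keys become the index
df = pd.DataFrame(
    list(dic.values()),
    index=list(dic.keys()),
    columns=["count", "ratio"],
)

# DataFrame.from_dict with orient="index" is a more direct route to the same result
df2 = pd.DataFrame.from_dict(dic, orient="index", columns=["count", "ratio"])

print(df)
```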

In any case, data is usually already structured by the time Hadoop processes it, for example after being collected with Fluentd, so it is a natural fit for pandas, which is designed to handle structured data.

Convenient functions for series and data frames

The value_counts() function is useful for further aggregating results such as word counts. It computes how often each value occurs in a one-dimensional data structure such as a series, an array, or a sequence.
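
A minimal sketch of counting value frequencies, using a hypothetical series of area labels:

```python
import pandas as pd

# Hypothetical series of area labels taken from the extracted sample
areas = pd.Series(["Area9", "Area2", "Area9", "Area5", "Area9", "Area2"])

# Absolute frequencies, sorted from most to least frequent
freq = areas.value_counts()

# Relative frequencies (each count divided by the total)
share = areas.value_counts(normalize=True)

print(freq)
print(share)
```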

pandas also provides fillna(), a function that fills in missing values, so holes left by the extraction process can be filled with a value of your choice.

Argument  Description
value     Scalar value used to fill the gaps (a dictionary is also accepted)
axis      Axis along which to fill: 0 for rows, 1 for columns
limit     Maximum number of consecutive gaps to fill
method    Fill method such as 'ffill' (forward fill) or 'bfill' (backward fill); to fill with the mean or median, pass that statistic as value instead
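
The following sketch shows a few common ways to fill gaps, using made-up data. It uses ffill() for forward filling rather than the method argument, on the assumption that you are running a recent pandas version where ffill() and bfill() are the preferred spelling.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Fill every gap with a fixed scalar
filled_zero = s.fillna(0)

# Fill with the mean of the non-missing values (here 2.5)
filled_mean = s.fillna(s.mean())

# Forward fill, but fill at most one consecutive gap
filled_fwd = s.ffill(limit=1)

# A dictionary gives a separate fill value per column of a data frame
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})
filled_dict = df.fillna({"a": 0.0, "b": -1.0})

print(filled_fwd)
print(filled_dict)
```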

The duplicated() function of a data frame returns a Boolean series. It is True for each row whose values have already appeared earlier in that data frame, so it can be used to check for duplicates.
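
A short sketch of duplicate detection on a hypothetical data frame; the column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "b", "a", "a"],
        "area": ["Area1", "Area2", "Area1", "Area3"],
    }
)

# True where the whole row repeats an earlier row
dup_mask = df.duplicated()

# True where only the "key" column repeats an earlier value
dup_key = df.duplicated(subset=["key"])

# Keep only the first occurrence of each distinct row
deduped = df.drop_duplicates()

print(dup_mask)
print(deduped)
```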

The replace() function replaces values. For example, to treat 99999 as a missing value and replace it with NaN:

import numpy as np
series.replace('99999', np.nan)
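
As a runnable sketch with made-up data: the line above matches the string '99999', which suits values read in as text. The example below assumes the column has already been parsed as numbers, so it passes the numeric 99999 instead.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric series where 99999 is a sentinel for "missing"
series = pd.Series([12.0, 99999.0, 7.0, 99999.0])

# replace() returns a new series; the original is left unchanged
cleaned = series.replace(99999, np.nan)

# Several sentinel values can be replaced in one call by passing a list
multi = pd.Series([1, -1, 99999, 3]).replace([-1, 99999], np.nan)

print(cleaned)
print(multi)
```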

It is also easy to remove or cap outliers that fall outside a reference range.

# Set values whose absolute value exceeds 3 (outside the range -3 to 3) to NaN
data[np.abs(data) > 3] = np.nan
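
The two options, removing outliers and capping them, can be sketched side by side as follows. The data values are invented, and clip() is used here as one possible way to cap values at the boundary of the range.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.5, -4.2, 2.9, 3.7, -1.0])

# Option 1: remove outliers by turning them into NaN
masked = data.copy()
masked[np.abs(masked) > 3] = np.nan

# Option 2: cap outliers at the boundaries -3 and 3
capped = data.clip(lower=-3, upper=3)

print(masked)
print(capped)
```

Masking keeps the sample honest about what was observed, while capping keeps every row available for later aggregation; which one fits depends on the analysis.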

Summary

Using pandas functions helps you narrow down what to analyze in the extracted sample. Because it works so well with Hadoop output, pandas is essential for running a fast PDCA cycle of analysis.
