Start studying: Saturday, December 7th
Teaching materials, etc.:
・ Miyuki Oshige "Details! Python3 Introductory Note" (Sotec, 2017): Completed on Thursday, December 19th
・ Progate Python course (5 courses in total): Finished on Saturday, December 21st
・ Andreas C. Müller, Sarah Guido "(Japanese title) Machine learning starting with Python" (O'Reilly Japan, 2017): Completed on Saturday, December 23
・ Kaggle: Real or Not? NLP with Disaster Tweets: Submitted and adjusted from Saturday, December 28th to Friday, January 3rd
・ **Wes McKinney "(Japanese title) Introduction to data analysis by Python" (O'Reilly Japan, 2018)**: January 4th (Sat) ~
Finished reading up to p.346, the end of Chapter 10, Data Aggregation and Group Operations.
・ Explanation of data visualization libraries such as matplotlib and seaborn. Settings such as line styles are listed in the **docstring (function name + '?')**, which IPython and Jupyter display. (If you import matplotlib.pyplot as plt, type **plt.plot?**.)
-Basically, use matplotlib as the foundation, and use add-on libraries such as pandas and seaborn as needed.
Plot preparation
import matplotlib.pyplot as plt
fig = plt.figure()  # Figure object: the container that holds the plots
ax1 = fig.add_subplot(1, 1, 1)  # Add a subplot to draw on (1 row x 1 column grid, 1st position)
# Next, specify the plot style and the input data.
・ Overview of what you can do: margin adjustment, axis sharing, titles, legends and their position (loc='best' picks the best position automatically), label rotation (rotation), annotations (annotate), adding shapes (add_patch), and changing matplotlib's default settings (rc method).
Batch setting of attributes using the set method of the Axes class (AxesSubplot)
props = {'title': 'namae no ikkatsu settei', 'xlabel': 'aiueo'}
ax1.set(**props)  # ax1 is the subplot created in "Plot preparation"
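The preparation steps and the batch attribute setting above can be combined into one small sketch. The random-walk data, the label text, and the Agg backend are assumptions made for illustration only; they are not from the book.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)

# Hypothetical input data: a random walk of 100 steps
rng = np.random.default_rng(0)
walk = rng.standard_normal(100).cumsum()

# Style options such as linestyle and color are passed to the plotting method
ax1.plot(walk, linestyle="--", color="k", label="random walk")

# Set several attributes at once with Axes.set
props = {"title": "Random walk", "xlabel": "Step", "ylabel": "Value"}
ax1.set(**props)

# loc="best" lets matplotlib choose the legend position
ax1.legend(loc="best")
```

In a notebook you would normally drop the matplotlib.use line and look at the figure inline.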
-A DataFrame also has a plot method, so a data frame can be plotted as is.
Visualization of value frequency
s.value_counts().plot.bar()  # use plot.barh() for horizontal bars
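As a runnable version of the one-liner above, here is a minimal sketch. The Series s and its values are made up for illustration, and the Agg backend is only an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data
s = pd.Series(["a", "b", "a", "c", "a", "b"])

# Frequency of each value, sorted from most to least frequent
counts = s.value_counts()

# Draw vertical bars on the top subplot and horizontal bars on the bottom one
fig, axes = plt.subplots(2, 1)
counts.plot.bar(ax=axes[0])
counts.plot.barh(ax=axes[1])
```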
The seaborn package makes it easy to visualize data that needs to be aggregated or summarized before plotting. Pass the data frame to the data argument, and specify column names of that data frame in x and y.
・ Histogram: A type of bar graph that shows the frequency of values after splitting them into discrete bins.
Density plot: An estimate of the continuous probability distribution that is presumed to have produced the observed data. Usually, this distribution is approximated as a sum of simple distributions called kernels, such as the normal distribution. Therefore, the density plot is also called the "kernel density estimation (KDE) plot". (plot.kde)
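To see what "a sum of kernels" means, the sketch below builds a density estimate by hand with NumPy only. The function name gaussian_kde_1d, the fixed bandwidth of 0.4, and the sample data are assumptions for illustration; this is not how pandas implements plot.kde (which relies on SciPy).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical observed data
rng = np.random.default_rng(0)
samples = rng.standard_normal(200)

# Histogram: frequencies in 10 discrete bins
counts, edges = np.histogram(samples, bins=10)

# Density estimate: a smooth curve that integrates to roughly 1
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)
```

A smaller bandwidth follows the samples more closely, while a larger one gives a smoother curve.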
・ Methods that are likely to be used very often:
-seaborn.distplot (creates a histogram and a density estimation plot at the same time; deprecated since seaborn 0.11 in favor of histplot and displot)
-seaborn.regplot (creates a scatter plot and fits a regression line by linear regression)
-seaborn.pairplot (visualizes a scatter plot matrix comparing every pair of variables at once)
・ pandas groupby method: lets you run arbitrary processing on a dataset after grouping its elements by key (this is my rough understanding).
-Group operations follow the split-apply-combine flow: split the data by key, apply a function to each group, and combine the results.
-Multiple keys can be specified for one dataset. As far as I can tell, you can also select particular values, process them (mean, count, etc.), and then group the result again.
-Groups can also be defined through mapping information held in a dictionary.
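The three points above can be sketched as follows. The data frames df and people, their column names, and the team mapping are made-up examples, not data from the book.

```python
import pandas as pd

# Hypothetical dataset with two key columns
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# split by key1 -> apply mean to data1 -> combine into one Series
mean_by_key1 = df.groupby("key1")["data1"].mean()

# Several keys at once: the result is indexed by (key1, key2)
mean_by_both = df.groupby(["key1", "key2"])["data1"].mean()

# Grouping with a dictionary: index labels are mapped to group names
people = pd.DataFrame({"score": [1, 2, 3, 4]},
                      index=["Joe", "Steve", "Wes", "Jim"])
team = {"Joe": "red", "Steve": "blue", "Wes": "red", "Jim": "blue"}
score_by_team = people.groupby(team)["score"].sum()
```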
・ Aggregation methods of groupby (count, sum, mean, median, ...): make sure to know the basic ones.
-The column names produced when aggregating with groupby can be changed by passing (name, function) tuples. Passing as_index=False returns the result without the group keys as the index.
-apply splits the object into pieces, **applies the passed function to each piece**, and then joins the results. It takes some imagination, because you have to write the function passed to apply yourself.
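A small sketch of these three features, assuming a made-up tips table. The column names and the helper function top are hypothetical and only loosely modeled on the kind of example used in the book.

```python
import pandas as pd

# Hypothetical data
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
    "tip": [1.0, 3.0, 2.0, 4.0, 5.0],
})

# (name, function) tuples set the names of the aggregated columns
renamed = tips.groupby("day")["tip"].agg([("average", "mean"), ("n", "count")])

# as_index=False keeps "day" as an ordinary column instead of the index
flat = tips.groupby("day", as_index=False)["tip"].mean()

def top(group, n=1, column="tip"):
    """Return the n rows with the largest values in column."""
    return group.sort_values(column, ascending=False).head(n)

# apply: split by smoker, call top on each piece, then join the pieces
top_tips = tips.groupby("smoker")[["day", "tip"]].apply(top, n=1)
```

Selecting the columns before apply keeps the grouping column out of each piece, which is a choice made here so that the sketch behaves the same way across recent pandas versions.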
・ Pivot tables and cross-tabulation: both can be produced either with the dedicated functions (pivot_table, crosstab) or with groupby. Being able to handle these will be useful for data cleaning, modeling, and statistical analysis.
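As a minimal sketch of both routes, the example below computes the same table with pivot_table and with groupby, then counts combinations with crosstab. The tips data is made up for illustration.

```python
import pandas as pd

# Hypothetical data
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
    "tip": [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Pivot table: mean tip for each (day, smoker) combination
pivot = tips.pivot_table(values="tip", index="day", columns="smoker",
                         aggfunc="mean")

# The same table built with groupby followed by unstack
same_by_groupby = tips.groupby(["day", "smoker"])["tip"].mean().unstack()

# Cross-tabulation: how many rows fall into each (day, smoker) combination
counts = pd.crosstab(tips["day"], tips["smoker"])
```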