A memo of the pandas and seaborn features I use most often for basic tabulation. I plan to update it from time to time.
from sklearn import datasets
import pandas as pd
from collections import OrderedDict
iris = datasets.load_iris()
df = pd.concat([pd.DataFrame(iris.data,columns=iris.feature_names),pd.DataFrame(iris.target,columns=["species"])],axis=1)
I'm glad that a recent update made pandas output easier to read.
describe
df.describe()
Outputs basic summary statistics (count, mean, std, min, quartiles, max) for each numeric column.
df["petal length (cm)"].describe()
It also works on a single Series.
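As a minimal sketch of how the describe output can be used afterwards, the example below runs it on a small hand-made frame (the column names and values are illustrative, not the iris data) and picks out one statistic by label.

```python
import pandas as pd

# Small hand-made frame standing in for the iris data (illustrative values).
toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [0.5, 0.5, 1.5, 1.5],
})

# DataFrame.describe(): rows are statistics, columns are the numeric variables.
stats = toy.describe()
mean_length = stats.loc["mean", "length"]

# Series.describe(): a Series indexed by statistic name.
series_stats = toy["length"].describe()

print(stats)
print(mean_length)
```

Because the result is an ordinary DataFrame (or Series), individual statistics can be looked up with loc like any other value.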
value_counts
df["species"].value_counts()
Counts how many times each unique value occurs.
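If proportions are more useful than raw counts, value_counts also accepts normalize=True. A small sketch on a hand-made Series (the values are illustrative):

```python
import pandas as pd

# Illustrative label column; not the real iris target.
labels = pd.Series([0, 0, 1, 2, 2, 2], name="species")

# Counts per unique value, sorted from most to least frequent.
counts = labels.value_counts()

# The same tabulation expressed as proportions that sum to 1.
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```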
get_dummies
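pd.get_dummies(df["species"]).iloc[[0,1,2,50,51,52,100,101,102]]
Creates so-called dummy variables (one-hot encoding). A few rows of each species are selected with iloc to make the result easier to see; the older .ix indexer has been removed from pandas.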
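In practice the dummy columns are often joined back onto the other features. The sketch below assumes a toy frame with made-up values; the prefix and dtype arguments are optional and are used here only to get readable column names and 0/1 integers.

```python
import pandas as pd

# Toy frame with an integer-coded category column (illustrative values).
toy = pd.DataFrame({
    "species": [0, 1, 2, 1],
    "length": [5.1, 7.0, 6.3, 6.4],
})

# One column per category value, named species_0, species_1, species_2.
dummies = pd.get_dummies(toy["species"], prefix="species", dtype=int)

# Replace the original category column with its dummy columns.
encoded = pd.concat([toy.drop(columns="species"), dummies], axis=1)

print(encoded)
```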
sort_values
df.sort_values("sepal length (cm)",ascending=False)
Sorts the DataFrame by a column. The ascending argument switches between ascending (True, the default) and descending (False) order.
df.sort_values(["sepal length (cm)","sepal width (cm)"],ascending=False)
Multiple columns can be specified. Columns that appear earlier in the list take priority.
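The sort direction can also be set per column by passing a list to ascending. A small sketch on made-up data, sorting the first column descending and breaking ties with the second column ascending:

```python
import pandas as pd

# Illustrative values with ties in the first column.
toy = pd.DataFrame({
    "length": [5.0, 6.0, 6.0, 5.0],
    "width": [3.0, 2.0, 1.0, 4.0],
})

# length descending first; within equal length, width ascending.
ordered = toy.sort_values(["length", "width"], ascending=[False, True])

print(ordered)
```

The original row labels are kept, so the index of the result shows where each row came from.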
groupby
df_groupby = df.groupby("species",as_index=False)
df_groupby.mean()
The groupby object can be reused, so it is convenient to store it in a variable when you want to apply several similar aggregations to the same grouping.
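A minimal sketch of that reuse, on a toy frame with illustrative values: the grouping is defined once and two different aggregations are taken from it.

```python
import pandas as pd

# Illustrative data: two groups of two rows each.
toy = pd.DataFrame({
    "species": [0, 0, 1, 1],
    "length": [1.0, 3.0, 10.0, 20.0],
})

# Define the grouping once ...
grouped = toy.groupby("species", as_index=False)

# ... and reuse it for several aggregations.
means = grouped.mean()
maxima = grouped.max()

print(means)
print(maxima)
```

With as_index=False the group key stays an ordinary column in each result instead of becoming the index.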
groupby.agg
df_groupby.agg({"sepal length (cm)": "mean",
"sepal width (cm)": ["mean","count"],
"petal length (cm)": ["max","min"],
"petal width (cm)": ["sum","var","std"]})
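Passing a dictionary lets you choose the aggregation for each column individually. Be careful, though: if you specify multiple aggregations for one column, the result has multi-level (MultiIndex) columns.
Older Python versions did not guarantee the order of a plain dict, so OrderedDict was used when the column order mattered. From Python 3.7 on, a plain dict keeps insertion order, so OrderedDict is only needed in older environments.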
df_groupby.agg(OrderedDict((["sepal length (cm)", "mean"],
["sepal width (cm)", ["mean","count"]],
["petal length (cm)", ["max","min"]],
["petal width (cm)", ["sum","var","std"]])))
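If the multi-level columns produced by the dictionary form are inconvenient, one alternative is named aggregation, available in pandas 0.25 and later. The sketch below uses a toy frame, and the output names (length_mean and so on) are made up for illustration.

```python
import pandas as pd

# Illustrative data: two groups of two rows each.
toy = pd.DataFrame({
    "species": [0, 0, 1, 1],
    "length": [1.0, 3.0, 10.0, 20.0],
    "width": [1.0, 1.0, 2.0, 4.0],
})

# Each keyword is an output column name; the value is (input column, aggregation).
summary = toy.groupby("species", as_index=False).agg(
    length_mean=("length", "mean"),
    width_max=("width", "max"),
    width_min=("width", "min"),
)

print(summary)
```

The result has flat column names in the order the keywords were written, so neither OrderedDict nor column flattening is needed.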
to_csv
df.to_csv("test.csv",index=False,encoding="utf8")
pd.read_csv("test.csv")
With index=False the row index is not written, which makes the file easier to read back in. Sometimes the file cannot be read unless the encoding is specified (especially on Windows).
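To see why index=False makes re-reading easier, the sketch below round-trips a toy frame through CSV text in memory (io.StringIO is used instead of a file only to keep the example self-contained).

```python
import io

import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# to_csv() with no path returns the CSV text as a string.
text_with_index = toy.to_csv()
text_without_index = toy.to_csv(index=False)

# Reading back: the written index shows up as an extra unnamed column.
reread_with = pd.read_csv(io.StringIO(text_with_index))
reread_without = pd.read_csv(io.StringIO(text_without_index))

print(list(reread_with.columns))
print(list(reread_without.columns))
```

If the index does need to be saved, passing index_col=0 to read_csv restores it on the way back in.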
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline is a magic command that displays plots inline in Jupyter.
boxplot
sns.boxplot(data=df, x="species", y="sepal length (cm)")
pairplot
sns.pairplot(data=df)
sns.pairplot(data=df, hue="species")
Passing hue colors the plots by group (here, by species).
jointplot
sns.jointplot(data=df, x="sepal length (cm)", y="sepal width (cm)", kind="kde")
distplot
sns.distplot(df["sepal length (cm)"], rug=True)
Note that distplot has been deprecated since seaborn 0.11 in favor of displot and histplot.