[PYTHON] Quickly try to visualize datasets with pandas

So far I have covered plotting with pandas + matplotlib in articles such as [Various data plotting with pandas + matplotlib](http://qiita.com/ynakayama/items/68eff3cb146181329b48) and Data visualization method by matplotlib (+ pandas).

This time, let's reorganize and walk through the flow from extracting and processing data to visualizing it when you want a bird's-eye view of the data.

Make the dataset a pandas object

First, bring the dataset into the world of pandas. There are two main ways to do this.

  1. Read from an external file such as a CSV file using functions such as pd.read_csv and pd.read_table
  2. Convert objects such as associative arrays (dictionaries) to a DataFrame

Use the first approach when an external file already contains structured data that can be used as is. For example, if you have a file called iris.csv, make it a pandas object as follows.

import pandas as pd
df = pd.read_csv("iris.csv")
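
To try the same call without a file on disk, pd.read_csv also accepts a file-like object. The following is a minimal sketch under that assumption; the CSV text and its column names are sample values made up for illustration and may not match your own iris.csv.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for iris.csv (illustrative rows only)
csv_text = """sepal length,sepal width,petal length,petal width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # column types inferred by pandas
```

Checking shape and dtypes right after loading is a quick way to confirm that the header row and numeric columns were parsed as intended.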

Use the second approach when you want to hand data generated by extraction or processing in Python code over to pandas. pandas has rich documentation, so it is worth consulting. The from_dict function converts a dictionary object as is into a data frame. If you want to specify the index explicitly, the from_records function is convenient.

df = pd.DataFrame.from_records(my_dic, index=my_array)
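
As a rough illustration of the two constructors, the sketch below builds small data frames from in-memory objects. The names my_dic, my_array, and records and all values are hypothetical; from_records is given a list of row dictionaries here, which is only one of the input shapes it accepts.

```python
import pandas as pd

# Hypothetical data produced by earlier processing (illustrative values)
my_dic = {"width": [3.5, 3.2, 3.3], "length": [5.1, 7.0, 6.3]}
my_array = ["a", "b", "c"]  # labels to use as the index
records = [
    {"width": 3.5, "length": 5.1},
    {"width": 3.2, "length": 7.0},
    {"width": 3.3, "length": 6.3},
]

# from_dict: dictionary keys become columns by default
df_cols = pd.DataFrame.from_dict(my_dic)

# orient="index": dictionary keys become row labels instead
df_rows = pd.DataFrame.from_dict(my_dic, orient="index")

# from_records: one row per record, with an explicit index
df = pd.DataFrame.from_records(records, index=my_array)

print(df_cols.shape)  # (3, 2)
print(df_rows.shape)  # (2, 3)
print(df)
```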

Get the transposed matrix

In a dataset, the X and Y axes are often the reverse of what the observer wants. Even then, with a pandas data frame you can always get the [transposed matrix](http://ja.wikipedia.org/wiki/%E8%BB%A2%E7%BD%AE%E8%A1%8C%E5%88%97) easily through the .T attribute. This is used very often and is worth remembering.

dft = df.T

pandas textbooks often seem to write df = df.T, but I prefer to keep the original and assign the result to a new name, as above.
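
A small sketch of what the transposition does to the row and column labels; the data frame here is a toy example with made-up values.

```python
import pandas as pd

# Toy data frame: 3 rows, 2 columns (illustrative values)
df = pd.DataFrame(
    {"width": [3.5, 3.2, 3.3], "length": [5.1, 7.0, 6.3]},
    index=["a", "b", "c"],
)

dft = df.T  # new object; df itself keeps its original orientation

print(df.shape)         # (3, 2)
print(dft.shape)        # (2, 3)
print(list(dft.index))  # ['width', 'length']
```

Keeping both df and dft around makes it easy to plot either orientation during trial and error.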

Interactively plot with IPython

Writing code that uses matplotlib also involves trial and error. In that situation it is efficient to quickly draw a chart of the data frame in IPython, check it, and repeat.

The ipython -i option takes a Python script as an argument, runs it, and then leaves you in the interactive shell with everything the script defined. This is very convenient.

For example, if you have a class like this:

class MyClass:
    def __init__(self, args):
        self.my_var = args[1]
        self.my_array = []
        self.my_dic = {}

    def my_method(self):
        ...

If you start the shell with ipython -i my_class.py, MyClass is loaded, and you can create an instance and retrieve its data as follows.

# __init__ reads args[1], so pass a sequence; these values are placeholders
my_instance = MyClass(["my_class.py", "iris.csv"])
arr = my_instance.my_array
dic = my_instance.my_dic

If my_method stores data in instance variables such as self.my_dic, you can retrieve the data from them as above and plot from there for interactive visualization.
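
The workflow above might look like the following sketch. SampleCollector, its collect method, and the toy data are hypothetical stand-ins, since the body of my_method is not shown in this article; the point is only how data held in instance variables is handed to pandas.

```python
import pandas as pd


class SampleCollector:
    """Hypothetical stand-in for a class like MyClass."""

    def __init__(self, args):
        self.source = args[1]
        self.labels = []
        self.values = {}

    def collect(self):
        # Placeholder processing: fill the instance variables with toy data
        self.labels = ["a", "b", "c"]
        self.values = {"x": [1, 2, 3], "y": [4, 5, 6]}


collector = SampleCollector(["script.py", "toy-source"])
collector.collect()

# Pull the data out of the instance and hand it to pandas
df = pd.DataFrame(collector.values, index=collector.labels)
print(df)
```

After loading a script like this with ipython -i, df is available in the shell, so calls such as df.plot() can be tried repeatedly without rerunning the collection step.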

Typical ways to visualize a data frame

Anything that can be converted to a data frame is ordinary two-dimensional data, so with the steps so far, what remains to be done is largely clear.

Here are some visualization methods to try first.

The well-known Iris data is used as the dataset.

The details of each chart type have been covered many times already, so please refer to past articles.

Scatterplot matrix

First is the standard scatter plot matrix.

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix  # pandas.tools.plotting in old pandas versions

plt.figure() #Prepare the canvas

scatter_matrix(df) #Draw a scatterplot matrix

plt.savefig("1.png") #When outputting to an image file (call before plt.show())
plt.show() #When displaying images interactively

(Figure: scatter plot matrix of df)

This is an excellent tool that gives you a bird's-eye view of the correlation between every pair of columns. Once looking at a scatter plot matrix puts your mind at ease, you have gotten used to it.
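
For a version that runs without the Iris file, here is a minimal sketch using random data. The column names, the random data, and the Agg backend are assumptions made for this example, not part of the article's setup.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit it in IPython
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Toy data: 50 rows, 3 numeric columns (random values, for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# Returns an n x n array of Axes, one per pair of columns;
# the diagonal shows a histogram of each column
axes = scatter_matrix(df, diagonal="hist")
print(axes.shape)  # (3, 3)
```

Because the return value is an array of Axes, individual panels can be adjusted afterwards, for example by setting a title on axes[0, 0].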

Simple plot

From here on, the steps of preparing the canvas and outputting the image are omitted.

df.plot(legend=True)

(Figure: line plot of df with a legend)

As I have mentioned many times, legend defaults to True in pandas. If the legend gets in the way of the figure, set legend=False.
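
The effect of the legend argument can be checked on the returned Axes, as in this small sketch. The data frame is a toy example, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit it in IPython
import pandas as pd

# Toy data with two columns (illustrative values)
df = pd.DataFrame({"a": [1, 3, 2], "b": [2, 1, 3]})

ax_with = df.plot(legend=True)      # same as the default
ax_without = df.plot(legend=False)  # no legend box is drawn

print(ax_with.get_legend() is not None)  # True
print(ax_without.get_legend() is None)   # True
```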

Stacked bar graph

Plotting the first 10 rows of the data frame looks like this.

df10 = df.head(10)
df10.plot(kind='barh', stacked=True, alpha=0.5, legend=True)

(Figure: stacked horizontal bar chart of the first 10 rows)
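
A self-contained sketch of the same call on toy data follows. The random data and column names are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit it in IPython
import numpy as np
import pandas as pd

# Toy data: 30 rows, 3 non-negative columns (random values)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((30, 3)), columns=["a", "b", "c"])

df10 = df.head(10)  # first 10 rows only
ax = df10.plot(kind="barh", stacked=True, alpha=0.5, legend=True)

print(len(df10))           # 10
print(len(ax.containers))  # one bar container per column
```

Limiting the rows with head before plotting keeps the bars readable; a stacked bar chart of the full data frame quickly becomes too dense.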

Histogram

A histogram is useful when you narrow the data down to a single column (a one-dimensional vector) and visualize its distribution.

df['sepal width'].hist()

(Figure: histogram of the sepal width column)
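
The number of bins often needs adjusting during trial and error. The sketch below passes bins explicitly to Series.hist; the random series merely stands in for a real column, and the name and parameters are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit it in IPython
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy series standing in for one numeric column (random values)
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=3.0, scale=0.4, size=150), name="sepal width")

fig, ax = plt.subplots()
s.hist(bins=20, ax=ax)  # draw 20 bins on the given Axes

print(len(ax.patches))  # 20, one rectangle per bin
```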

Area chart

This is useful for tracking how multiple series change over time.

df.plot(kind='area', legend=True)

(Figure: area chart of df)
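
Area charts are stacked by default, and in that mode each column must be entirely non-negative or entirely non-positive. Here is a minimal sketch with toy non-negative data, showing both the stacked form and an overlaid alternative; the data and column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit it in IPython
import numpy as np
import pandas as pd

# Toy data: 12 time steps, 3 series, all values in [0, 1)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 3)), columns=["a", "b", "c"])

# Stacked (the default): series are piled on top of each other
ax_stacked = df.plot(kind="area", legend=True)

# Overlaid: each series is drawn from zero, with transparency
ax_overlay = df.plot(kind="area", stacked=False, alpha=0.5)
```

The stacked form emphasizes the total, while the overlaid form makes it easier to compare individual series.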

Summary

How was it? As you get used to this, you will find yourself reaching for the interactive shell to plot whenever you face new data. IPython allows quick trial and error, and pandas + matplotlib work seamlessly from Python; together they are highly productive tools.
