[PYTHON] Try to aggregate doujin music data with pandas

Hello. I'm writing this first draft while smoking shisha, so the room is hazy. It's been about three months since I last posted to Qiita. Long time no see.

Synopsis

Do you know Chata? She is a doujin singer with a relaxed style, and since she sang "Dango Daikazoku" ("Dango Big Family"), you may have heard her voice even if you don't know her name. An acquaintance built a music database of Chata's work and I was able to get the data, so I would like to analyze it using pandas.

What to do this time

This is not much of an analysis: I simply count the CDs that contain Chata's songs by release year. I will try two methods, a simple one using Seaborn's *countplot*, and a slightly roundabout one that aggregates by year first and then draws the result with *barplot*.

About the dataset

I use the data organized by CD. Each record consists of "CD name", "Circle name 1", "Circle name 2", "Release date", "Major label flag", and "Remarks", and the file is saved in CSV format.
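As a rough sketch of what loading a file with this layout might look like: only ReleaseYmd is a column name that actually appears in this article's code. The other headers, the sample rows, and the yyyymmdd layout of the date are assumptions I made up for illustration.

```python
import io

import pandas as pd

# Made-up rows for illustration. Only "ReleaseYmd" is a real column name
# from this article; the other headers and all values are hypothetical.
sample_csv = io.StringIO(
    "CdName,Circle1,Circle2,ReleaseYmd,MajorFlag,Remarks\n"
    "Sample CD A,Circle X,,20050812,0,\n"
    "Sample CD B,Circle Y,Circle Z,20071231,1,reissue\n"
)

# Reading the release date as a string keeps it sliceable without str().
df_sample = pd.read_csv(sample_csv, dtype={"ReleaseYmd": str})

print(df_sample.dtypes)
print(df_sample["ReleaseYmd"].str[:4].tolist())
```

Passing dtype for the date column is optional; it just avoids pandas guessing a numeric type for a value that is really a label.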

Environment

Ubuntu 16.04
Python 3.5.2 :: Anaconda custom (64-bit)

Let's get started.

#coding:utf-8

import pandas as pd
import seaborn as sns

if __name__ == "__main__":

    #Read the CSV
    dfCD = pd.read_csv("ChataData_CD.csv")

    #Release years
    releaseYear = []
    #Iterate over the rows of the data frame read from the CSV to get each release date
    # i: index label, row: Series holding that row's values
    for i,row in dfCD.iterrows():
        ymd = str(row['ReleaseYmd'])
        #Slice the first four characters of the release date to get the release year
        releaseYear.append(ymd[0:4])

    #Put the list of release years into a data frame
    chataCD = pd.DataFrame({'year':releaseYear})

    #Set a Japanese font for seaborn
    sns.set(font='TakaoPGothic')

***iterrows*** iterates over the data frame that holds the original data to get each CD's release date. It yields tuples of the **row label** (index) and the **row values** (as a Series), so you can picture it as walking down the data frame row by row. (Reference: http://sinhrks.hatenablog.com/entry/2015/06/18/221747) The dataset stores a full release date, so I slice each value down to "yyyy" and put it in the list.
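To make the (index, row) pairs concrete, here is a minimal sketch on toy data. The three dates are invented and the yyyymmdd layout is my assumption. It also shows a loop-free way to get the same list with the column's string methods, which is an alternative and not what the script in this article does.

```python
import pandas as pd

# Toy data; the yyyymmdd layout of ReleaseYmd is an assumption.
df = pd.DataFrame({"ReleaseYmd": [20050812, 20071231, 20070429]})

# iterrows yields (index label, row as a Series) pairs.
years_loop = []
for idx, row in df.iterrows():
    years_loop.append(str(row["ReleaseYmd"])[0:4])

# The same result without an explicit loop, using string methods on the column.
years_vectorized = df["ReleaseYmd"].astype(str).str[:4].tolist()

print(years_loop)
print(years_vectorized)
```

For a file this small either form is fine; the column-wise version mainly saves a few lines.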

pd.DataFrame({**label name**: ***values***}) puts the list of release years into a data frame under that column label. Each column of the resulting data frame is a Series, a one-dimensional labeled array. At first I tried to add each release year (a string) to the data frame directly while iterating over the original data, but that failed.
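A small sketch of that construction, using made-up years in place of the real data: the dict key becomes the column label, and the column you get back is a Series.

```python
import pandas as pd

release_years = ["2005", "2007", "2007"]  # toy values, not the real data

# Each dict key becomes a column label; each list becomes that column.
chata_cd = pd.DataFrame({"year": release_years})

# Pulling the column back out gives a Series with a default 0..n-1 index.
year_column = chata_cd["year"]

print(type(year_column).__name__)
print(chata_cd.shape)
```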

Set the Japanese font in Seaborn and you're ready to go. Here, TakaoPGothic is specified. I referred to the following for specifying Japanese fonts. [Seaborn] Display Japanese (change font)

Let's visualize it using Seaborn. Let's start with a simple method.

    fig = sns.countplot(x='year',data=chataCD,palette='Greens_r').get_figure()
    fig.suptitle("Changes in the number of CD releases of Chata's songs (2000-2016)")
    #Save through the figure itself; sns.plt is not available in newer seaborn
    fig.savefig('countByYear_simple.png')

***countplot*** is a function that counts how many times each value appears in the column given as the x or y axis and draws those counts as bars. It's easy because you only need to pass a data frame containing the release year of each CD. (Reference: [Beautiful graph drawing with python -seaborn makes data analysis and visualization easier Part 2](http://qiita.com/hik0107/items/7233ca334b2a5e1ca924)) ***palette*** specifies the color palette; the article "How to choose Seaborn color palette" was helpful. The following is the output graph.
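To see what is being counted, here is a pandas-only sketch on toy years (not the real data) that computes the same per-year tallies a count plot draws as bar heights. It also builds a sorted list of years; countplot has an order argument, and passing a list like this is one way to get the bars in chronological order if the default order does not come out that way.

```python
import pandas as pd

# Toy values, deliberately not in chronological order.
chata_cd = pd.DataFrame({"year": ["2007", "2005", "2007", "2006", "2007"]})

# One count per distinct year: these are the bar heights a count plot shows.
bar_heights = chata_cd["year"].value_counts()

# A chronological category order that could be passed to countplot's
# `order` argument so the bars do not depend on row order.
year_order = sorted(chata_cd["year"].unique())

print(bar_heights.to_dict())
print(year_order)
```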

Next is the slightly roundabout method. It takes more effort, but it is better practice for learning pandas.

    #Find the number of releases by year
    yrCount = chataCD['year'].value_counts(ascending=True).sort_index()
    year = []
    count = []
    #items() yields (index, value) pairs; older pandas also offered iteritems()
    for row in yrCount.items():
        year.append(row[0])
        count.append(row[1])
    dfCount = pd.DataFrame({'year':year,'count':count})

    barplot = sns.barplot(x='year',y='count',data=dfCount,palette='Greens_r').get_figure()
    barplot.suptitle("Changes in the number of CD releases of Chata's songs (2000-2016)")
    #Save through the figure itself; sns.plt is not available in newer seaborn
    barplot.savefig('countByYear.png')

In the roundabout method, we build a data frame with two columns, the release year and the number of releases. First, use **Series.value_counts** to count how many times each year appears.

**Series.value_counts** returns a *Series*. In this case, the *index* is the year and the *values* are the number of CDs. Prepare one list for the release years and one for the release counts, and fill them in a loop. ***iteritems*** (***items*** in newer pandas, where iteritems has been removed) yields tuples of (*index*, *value*), so append *row[0]* to the release years and *row[1]* to the release counts.

After creating a data frame from the lists of release years and release counts, visualize it with Seaborn. ***barplot*** draws an ordinary bar graph.
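For comparison, the same year/count data frame can probably be built without the two lists and the loop. This is a sketch on toy years, not the article's script: rename_axis and reset_index are standard pandas methods, but using them here is my substitution.

```python
import pandas as pd

# Toy values, not the real data.
chata_cd = pd.DataFrame({"year": ["2007", "2005", "2007", "2006", "2007"]})

# Count per year, sorted chronologically by the index (the year strings).
yr_count = chata_cd["year"].value_counts().sort_index()

# rename_axis names the index; reset_index turns it into a regular column
# and names the counts column "count".
df_count = yr_count.rename_axis("year").reset_index(name="count")

print(df_count)
```

The result has the same two columns as dfCount above, so it can be handed to barplot in the same way.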

A digression

When I import Seaborn and run the script, I get the following warning, and I'm having trouble working out the cause.

/.pyenv/versions/anaconda3-4.1.0/lib/python3.5/site-packages/PIL/Image.py:85: RuntimeWarning: The _imaging extension was built for another  version of Pillow or PIL
  warnings.warn(str(v), RuntimeWarning)

I would appreciate it if you could tell me the cause in the comments.