[PYTHON] Common data preprocessing headaches

Hello, this is sunfish. Does everyone have a favorite YouTuber? Are you curious about how their subscriber count is growing? If so, let's take a look at the data.


52 channels in total

The data was acquired through the YouTube API and accumulated.
↓ Channel information [Screenshot 2020-10-26 19.22.03.png]
↓ Posted video information [Screenshot 2020-10-26 19.22.46.png]

Common preprocessing case 1 -- A format with strong quirks

[Screenshot 2020-10-26 19.24.20.png] This data represents the length of each video and is in the ISO 8601 duration format. If you are familiar with it, you will recognize that **"PT24M18S"** means 24 minutes 18 seconds. Videos of one hour or more are written like **"PT2H24M57S"**. The string cannot be handled as it is, so it has to be converted into a numerical value such as seconds or minutes.


In the analysis tool nehan, it takes four steps to get the number of minutes from this string (I ignored the seconds this time). The idea is to take the run of digits that ends in **M** or **H** from the format **(hours)H(minutes)M(seconds)S**. [Screenshot 2020-10-26 19.29.52.png]

The key is the step that extracts the minutes and hours with **Extract character string**; it can be done very easily with the following settings. [Screenshot 2020-10-26 19.36.00.png]

I multiplied the hours by 60 to convert them to minutes and added the extracted minutes, which gave the total number of minutes. [Screenshot 2020-10-26 19.38.07.png] Some programming languages seem to handle this format easily, but doing it without programming is quite difficult.
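For readers who do want to see the same idea in Python, here is a minimal sketch using only the standard library. The function name `duration_to_minutes` is made up for this example, and the pattern assumes the duration contains only hours, minutes, and seconds (no days), as in the two examples shown earlier. Like the nehan steps, it ignores the seconds.

```python
import re

# Hypothetical helper for illustration. The pattern covers only the
# hour/minute/second part of an ISO 8601 duration (an assumption).
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_minutes(text):
    """Return whole minutes for a string such as 'PT2H24M57S', ignoring seconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    # Missing parts (for example no hours) come back as None and count as 0.
    hours, minutes, _seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


print(duration_to_minutes("PT24M18S"))    # 24
print(duration_to_minutes("PT2H24M57S"))  # 144
```

The hours are multiplied by 60 and added to the minutes, which mirrors the last nehan step.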

Common preprocessing case 2 -- I want only the latest data

Since we collect channel information every day, data for the same channel naturally accumulates. That makes it possible to draw a graph like this. (Channel: Hidetaka Kano [Official Channel] EIKO! GO !!) [Screenshot 2020-10-26 19.44.13.png] However, to compare many channels, you only need the single latest record for each channel.


This is done in one step. Use **Select n lines from beginning / end**. ↓ Sort in descending order by data acquisition date and take the first line for each channel name (Title). [Screenshot 2020-10-26 19.47.33.png]

With that, I was able to make a graph like this from the latest data. [Screenshot 2020-10-26 19.48.44.png]
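The same "sort descending by date, keep the first row per channel" idea can be sketched in plain Python. The sample rows and the field names `title`, `acquired_on`, and `subscribers` are invented for this example and are not the real column names.

```python
from datetime import date

# Made-up sample rows; the field names are assumptions for illustration.
rows = [
    {"title": "Channel A", "acquired_on": date(2020, 10, 24), "subscribers": 100},
    {"title": "Channel B", "acquired_on": date(2020, 10, 25), "subscribers": 50},
    {"title": "Channel A", "acquired_on": date(2020, 10, 26), "subscribers": 120},
    {"title": "Channel B", "acquired_on": date(2020, 10, 26), "subscribers": 55},
]


def latest_per_channel(rows):
    """Sort by acquisition date (newest first) and keep the first row per title."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["acquired_on"], reverse=True):
        # setdefault stores a row only the first time a title is seen,
        # which is the newest row because of the descending sort.
        latest.setdefault(row["title"], row)
    return list(latest.values())


for row in latest_per_channel(rows):
    print(row["title"], row["acquired_on"], row["subscribers"])
```

Each channel ends up with exactly one row, the one with the most recent acquisition date.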

Common preprocessing case 3 -- Many words stuck together in one string

Multiple keywords can be set for a channel, and they are stored in the data separated by spaces. [Screenshot 2020-10-26 19.54.37.png] As it stands, the words cannot be counted, so each word needs to be separated.


This is also completed in one step. Use **Split String**. ↓ Enter a space as the delimiter string, and check the option to hold the split strings vertically (one per row). [Screenshot 2020-10-26 19.58.31.png]

Then the keywords are broken down into words and stacked vertically. [Screenshot 2020-10-26 19.57.58.png] I tried aggregating the words, but it seems there are no words common to many channels... Since the data contains a lot of cooking channels, cooking is the most frequent keyword. [Screenshot 2020-10-26 20.42.59.png]
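As a rough Python equivalent of splitting the keyword string and holding the pieces vertically, the sketch below turns each channel into one (channel, keyword) pair per word and then counts the words. The sample data and the function name `explode_keywords` are invented for illustration, and the sketch assumes that no single keyword contains a space.

```python
from collections import Counter

# Made-up sample data; real keyword strings come from the channel information.
channels = {
    "Channel A": "cooking recipe bento",
    "Channel B": "cooking travel",
    "Channel C": "music",
}


def explode_keywords(channels):
    """Return one (channel, keyword) pair per keyword, like a vertical split."""
    return [
        (title, word)
        for title, keywords in channels.items()
        for word in keywords.split()  # no argument: split on any whitespace
    ]


pairs = explode_keywords(channels)
counts = Counter(word for _, word in pairs)
print(counts.most_common(1))  # [('cooking', 2)]
```

Once the words are one per row, counting them is a simple aggregation.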


How was it? Did any of these sound familiar? The analysis tool nehan was created to make preprocessing easier. I hope this conveyed at least some of that concept.

