[PYTHON] I want to talk about common data preprocessing situations

Hello, this is sunfish. Does everyone have a favorite YouTuber? Are you curious about how their subscriber count is growing? If so, let's take a look at the data.

Data

52 channels in total

Channel information (number of subscribers, channel setting keywords, etc.)
Posted video information (number of views, number of likes, number of comments, etc.)

This data was collected through the YouTube API and accumulated over time.
↓ Channel information
[Screenshot: channel information]
↓ Posted video information
[Screenshot: posted video information]

Common preprocessing situation 1 - A format with strong quirks

[Screenshot: video duration column]
This is the data that represents the length of each video, and it is in the ISO 8601 duration format. If you are familiar with it, you will notice that **"PT24M18S"** means **24 minutes 18 seconds**. By the way, videos of 1 hour or more are written like **"PT2H24M57S"**. And yes, I can't work with it as it is, so I have to convert it into a numerical value such as seconds or minutes.

Processing

In the analysis tool nehan, it takes 4 steps to get the number of minutes from this string (I ignored the seconds this time). The idea is to extract the run of digits that ends right before **M** or **H** in the format **(hours)H(minutes)M(seconds)S**.
[Screenshot: the 4-step flow]

The key is the step that extracts the minutes and hours with **Extract character string**, which can be done very easily with the following settings.
[Screenshot: Extract character string settings]

I multiplied the number of hours by 60 to convert it to minutes, added the minutes, and got the total number of minutes.
[Screenshot: resulting total minutes]
In some programming languages this format seems to be easy to handle, but if you try to do it without programming, it is quite difficult.
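For comparison, here is a minimal Python sketch of the same idea: pull out the digits right before H and M, then compute hours × 60 + minutes, ignoring seconds just as above. The function name `duration_to_minutes` is my own, and I am assuming the durations only use the time part after "T" (hours, minutes, seconds); anything before "T", such as days, is simply ignored here.

```python
import re


def duration_to_minutes(duration: str) -> int:
    """Convert a duration such as 'PT2H24M57S' to whole minutes (seconds are ignored)."""
    # Only look at the part after 'T'; if there is no 'T', there is no time part.
    _, _, time_part = duration.partition("T")

    # Take the run of digits that ends right before 'H' and right before 'M'.
    hours = re.search(r"(\d+)H", time_part)
    minutes = re.search(r"(\d+)M", time_part)

    total = 0
    if hours:
        total += int(hours.group(1)) * 60  # convert hours to minutes
    if minutes:
        total += int(minutes.group(1))
    return total


print(duration_to_minutes("PT24M18S"))    # 24
print(duration_to_minutes("PT2H24M57S"))  # 144
```

A video shorter than one minute (for example "PT45S") comes out as 0 with this approach, which is the price of ignoring the seconds.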

Common preprocessing situation 2 - I want only the latest data

Since I collect channel information every day, data for the same channel naturally accumulates. So you can make a graph like this. (Channel: Hidetaka Kano [Official Channel] EIKO! GO !!)
[Screenshot: subscriber trend for one channel]
However, if you want to compare many channels, you only need the single latest record for each channel.

Processing

This is done in one step. Use **Select n lines from beginning / end**.
↓ Sort in descending order by data acquisition date, and take the first row for each channel name (Title).
[Screenshot: Select n lines from beginning / end settings]

With that, I was able to make a graph like this from the latest data.
[Screenshot: comparison across channels using the latest data]
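The same "sort newest first, then take the first row per channel" idea can be sketched in plain Python as below. Only the column name `Title` comes from the data above; `acquired_on`, `subscribers`, the function name, and the sample rows are made up for illustration.

```python
from datetime import date


def latest_per_channel(rows):
    """Keep only the newest row for each channel name (Title)."""
    latest = {}
    # Sort in descending order by acquisition date, then keep the first row seen per Title.
    for row in sorted(rows, key=lambda r: r["acquired_on"], reverse=True):
        latest.setdefault(row["Title"], row)
    return list(latest.values())


# Made-up sample: two channels, each collected on two different days.
rows = [
    {"Title": "Channel A", "acquired_on": date(2020, 10, 24), "subscribers": 1000},
    {"Title": "Channel A", "acquired_on": date(2020, 10, 26), "subscribers": 1040},
    {"Title": "Channel B", "acquired_on": date(2020, 10, 25), "subscribers": 500},
    {"Title": "Channel B", "acquired_on": date(2020, 10, 26), "subscribers": 520},
]

result = latest_per_channel(rows)
for row in result:
    print(row["Title"], row["acquired_on"], row["subscribers"])
```

Each channel ends up with exactly one row, the one with the most recent acquisition date.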

Common preprocessing situation 3 - Many words stuck together

Multiple keywords can be set for a channel, and they are stored in the data separated by spaces.
[Screenshot: channel keyword column]
As it stands, the words cannot be counted, so each word needs to be separated.

Processing

This is also completed in one step. Use **Split String**.
↓ Set a space as the delimiter, and check the option to stack the split strings vertically.
[Screenshot: Split String settings]

Then the keywords are broken down into words, one word per row.
[Screenshot: keywords split into rows]
I tried aggregating the words, but it seems there are no words common to many channels... Since the data contains a lot of cooking channels, cooking comes out on top.
[Screenshot: word counts]
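In Python, splitting on spaces and counting could look roughly like this. The column name `keywords` and the sample channels are hypothetical, and I am assuming every keyword is a single word separated by plain spaces.

```python
from collections import Counter

# Made-up sample: one space-separated keyword string per channel.
channels = [
    {"Title": "Channel A", "keywords": "cooking recipe bento"},
    {"Title": "Channel B", "keywords": "cooking camping"},
    {"Title": "Channel C", "keywords": "games"},
]

# Split each keyword string and hold the result vertically: one (Title, word) row per word.
long_rows = [(ch["Title"], word) for ch in channels for word in ch["keywords"].split()]

# Count how many times each word appears across all channels.
word_counts = Counter(word for _, word in long_rows)

for word, count in word_counts.most_common():
    print(word, count)
```

Once the words are one per row, counting them is just an ordinary aggregation.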

Summary

How was it? Did any of these sound familiar? The analysis tool nehan was created to make preprocessing easier. I hope I was able to convey at least a little of that concept.

Click here for an introduction to the analysis tool nehan (https://nehan.io/product/).