[PYTHON] Let me share some common data preprocessing situations ~

Hello, this is sunfish. Does everyone have a favorite YouTuber? Are you curious about how their subscriber count is growing? If so, let's take a look at the data.


Data for 52 channels in total was acquired through the YouTube API and accumulated.

↓ Channel information
[Screenshot: channel information]

↓ Posted video information
[Screenshot: posted video information]

Common preprocessing 1: A format with strong quirks

[Screenshot: video length column]

This data represents the length of each video and is in the ISO 8601 duration format. If you are familiar with it, you will recognize that **"PT24M18S" -> 24 minutes 18 seconds**. Videos of one hour or longer are written like **"PT2H24M57S"**. It cannot be used as it is, so it has to be converted into a numerical value such as seconds or minutes.


In the analysis tool nehan, it takes 4 steps to get the number of minutes from this string. (I ignored the seconds this time.) The idea is to take the run of digits **ending in** M or H from the format **(hours)H(minutes)M(seconds)S**.

[Screenshot: the 4-step flow in nehan]

The key step is extracting the minutes and hours with **Extract character string**, which is very easy with the following settings.

[Screenshot: Extract character string settings]

I multiplied the number of hours by 60 to convert it to minutes, added the minutes, and got the total number of minutes.

[Screenshot: resulting total minutes]

Some programming languages apparently handle this format easily, but doing it without programming is usually quite a hassle.
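For comparison, here is a minimal Python sketch of the same idea, not what nehan does internally. It takes the digits ending in H and M, ignores the seconds, and multiplies the hours by 60. The function name `duration_to_minutes` is made up for illustration, and it assumes the durations look like the `PT...` strings shown above.

```python
import re


def duration_to_minutes(duration: str) -> int:
    """Convert a string like "PT2H24M57S" to whole minutes, ignoring seconds."""
    # Look only at the time part after "T", so a date-part "M" (months)
    # is never mistaken for minutes. Without a "T", time_part is empty.
    _, _, time_part = duration.partition("T")

    # Take the run of digits that ends in H (hours) and in M (minutes).
    hours = re.search(r"(\d+)H", time_part)
    minutes = re.search(r"(\d+)M", time_part)

    h = int(hours.group(1)) if hours else 0
    m = int(minutes.group(1)) if minutes else 0

    # Convert hours to minutes and add them up.
    return h * 60 + m


print(duration_to_minutes("PT24M18S"))    # 24
print(duration_to_minutes("PT2H24M57S"))  # 144
```

A video shorter than one minute, such as "PT45S", comes out as 0 here because the seconds are dropped, just as in the steps above.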

Common preprocessing 2: I want only the latest data

Since the channel information is acquired every day, data for the same channel naturally accumulates. That makes it possible to draw a graph like this. (Channel: Hidetaka Kano [Official Channel] EIKO! GO !!)

[Screenshot: graph of one channel over time]

However, to compare many channels, you only need the single latest record for each channel.


This is done in one step. Use **Select n lines from beginning / end**.

↓ Sort in descending order by data acquisition date, and take the first line for each channel name (Title).

[Screenshot: Select n lines from beginning / end settings]

With that, I was able to make a graph like this from only the latest data.

[Screenshot: graph comparing channels on the latest data]
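The same "sort by date descending, then take the first row per channel" step could look like this in plain Python. This is only a sketch: `Title` is the channel name column mentioned above, while the `acquired_on` and `subscribers` keys and all sample values are made up for illustration.

```python
def latest_per_channel(rows):
    """Keep only the most recent row for each channel Title."""
    # Sort in descending order by acquisition date.
    # Dates written as YYYY-MM-DD sort correctly as plain strings.
    ordered = sorted(rows, key=lambda r: r["acquired_on"], reverse=True)

    latest = {}
    for row in ordered:
        # setdefault keeps the first row seen per Title, i.e. the newest one.
        latest.setdefault(row["Title"], row)
    return list(latest.values())


# Made-up sample rows (hypothetical column names and values).
rows = [
    {"Title": "Channel A", "acquired_on": "2020-10-24", "subscribers": 100},
    {"Title": "Channel A", "acquired_on": "2020-10-26", "subscribers": 120},
    {"Title": "Channel B", "acquired_on": "2020-10-25", "subscribers": 50},
]

for row in latest_per_channel(rows):
    print(row["Title"], row["acquired_on"], row["subscribers"])
```

With the sample rows, only the 2020-10-26 row survives for Channel A, and Channel B keeps its single row.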

Common preprocessing 3: Lots of words stuck together

Multiple keywords can be set for a channel, and they are stored in the data separated by spaces.

[Screenshot: keyword column]

As it stands, the words cannot be counted, so each word has to be separated out.


This is also completed in one step. Use **Split String**.

↓ Set a space as the delimiter, and check the option to hold the split strings vertically (one per row).

[Screenshot: Split String settings]

Then the text is broken down into words, one per row.

[Screenshot: keywords split into rows]

I tried aggregating the words, but it seems there are no words common to many channels... Since much of the data comes from cooking channels, cooking-related words come out on top.

[Screenshot: word counts]
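As a rough Python counterpart to splitting and aggregating the keywords, the sketch below splits each keyword string on whitespace and counts the words with `collections.Counter`. The function name `count_keywords` and the sample strings are made up for illustration, and it assumes the keywords are separated purely by spaces as described above.

```python
from collections import Counter


def count_keywords(keyword_strings):
    """Count how often each word appears across all keyword strings."""
    counts = Counter()
    for text in keyword_strings:
        # split() with no argument splits on any run of whitespace,
        # so repeated spaces do not produce empty words.
        counts.update(text.split())
    return counts


# Made-up sample keyword strings, one per channel.
samples = ["cooking recipe", "cooking  travel", "music"]

counts = count_keywords(samples)
for word, n in counts.most_common():
    print(word, n)
```

`most_common()` lists the words from most to least frequent, which matches the aggregation step described above.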


How was it? Did any of these sound familiar? The analysis tool nehan was created to make preprocessing easier. I hope this conveys at least some of the concept.
