[PYTHON] Beginning of Nico Nico Pedia analysis ~ Converting the provided data to JSON and taking a first look ~

Use the Nico Nico Pedia dataset

The Nico Nico Pedia dataset is a collection of Nico Nico Pedia articles from 2008-2014, published through NII's IDR, together with the comments posted on those articles.

It is well suited to research on natural language systems, such as knowledge extraction, but it is not a well-behaved dataset like Wikipedia; it is a rather quirky one.

For example, nearly half of the sentences in Nico Nico Pedia lack an explicit subject, the writing style is far from uniform, and ASCII art (AA) is mixed in as well.

In this article, I will introduce the contents of the data together with a simple preprocessing tool, in the hope of finding **interesting people** who want to analyze this dataset.

Preprocessing the provided Nico Nico Pedia data

The provided data is a **slightly unusual CSV**, which can be turned into standard CSV with the right preprocessing. The HTML is also a bit of a pain to parse because of some troublesome tags. For these reasons, the plan for this article is to:

  1. Preprocess CSV
  2. Transform Nico Nico Pedia article (HTML) into JSON
  3. Take a first look at the result

Let's work through these steps in order.

Preprocessing

The critical resource for preprocessing is not memory but disk space. If you carelessly start with only about 50 GB of free space, preprocessing will fail with an error.

Also, if you use Python, you will want plenty of CPU and memory. ~~Or rather, Pandas performance is just not that great...~~

Application for data use

https://www.nii.ac.jp/dsc/idr/nico/nicopedia-apply.html

Apply from the page above. After you apply, you will receive the download URL within a few days at the earliest, so keep it safe.

Download and extract the compressed files

Download the archives from that URL and extract them so that the directory layout looks like this:

.
└── nico-dict
    └── zips
        ├── download.txt
        ├── head
        │   ├── head2008.csv
        │   ├── ...
        │   └── head2014.csv
        ├── head.zip
        ├── res
        │   ├── res2008.csv
        │   ├── ...
        │   └── res2014.csv
        ├── res.zip
        ├── rev2008.zip
        ├── rev2009
        │   ├── rev200901.csv
        │   ├── rev200902.csv
        │   ├── rev200903.csv
        │   ├── ...
        │   └── rev200912.csv
        ├── rev2009.zip
        ├──...
        ├── rev2013.zip
        ├── rev2014
        │   ├── rev201401.csv
        │   └── rev201402.csv
        └── rev2014.zip

Clone the repository

Originally I used Clojure (a Lisp) for this analysis because lazy evaluation makes preprocessing convenient, but I built an HTML -> JSON tool that does as little processing as possible so that the data can also be analyzed in Python.

https://github.com/MokkeMeguru/niconico-parser

Please clone it from the repository above:

git clone https://github.com/MokkeMeguru/niconico-parser

Format the CSV

https://github.com/MokkeMeguru/niconico-parser/blob/master/resources/preprocess.sh

Copy the script above to zips/preprocess.sh and run:

sh preprocess.sh

This script performs the conversion needed to bring the CSV escaping in line with the modern convention. (Behind the scenes: I have tested this step quite a bit, but there may still be bugs. If you find one, please leave a comment.)
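For reference, here is a minimal sketch of the kind of conversion involved. This is only my assumption about what "modernizing the escaping" means (backslash-escaped quotes rewritten as RFC 4180-style doubled quotes); the real logic lives in preprocess.sh, so treat this as illustration only.

# Illustration only (assumption, not the actual contents of preprocess.sh):
# convert backslash-escaped quotes (\") into RFC 4180-style doubled quotes ("")
# so that pandas.read_csv can parse the fields with its default settings.
def modernize_escapes(src_path: str, dst_path: str) -> None:
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(line.replace('\\"', '""'))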

Save the article header information in a database

The Nico Nico Pedia dataset can be broadly divided into:

  1. Header (article ID, title, title reading, category, etc.)
  2. Article body
  3. Comments on the article

Of these, 1. is small enough to load into a database without trouble, so let's build one.

The required files are https://github.com/MokkeMeguru/niconico-parser/blob/master/resources/create-table.sql and https://github.com/MokkeMeguru/niconico-parser/blob/master/resources/import-dir.sh. Place them so that they end up at zips/head/<file>, then run:

sh import-dir.sh

Running this will produce an sqlite3 database called header.db.

Let's try accessing it.

sqlite3 headers.db
sqlite3 > select * from article_header limit 10
 ...> ;
 1|Nico Nico Pedia|Nico Nico Daihakka|a|20080512173939
 4|curry|curry|a|20080512182423
 5|I asked Hatsune Miku to sing the original song "You have flowers and I sing".|\N|v|20080719234213
 9|Go Go Curry|Go Go Curry|a|20080512183606
 13|Authentic Gachimuchi Pants Wrestling|\N|v|20080513225239
 27|The head is pan(P)┗(^o^ )┓3|\N|v|20080529215132
 33|[Hatsune Miku] "A little fun time report" [Arranged song]|\N|v|20080810020937
 37|【 SYNC.ART'S × U.N.Is Owen her? ] -Sweets Time-|\N|v|20080616003242
 46|Nico Nico Douga Meteor Shower|\N|v|20080513210124
 47|I made a high potion.|\N|v|20090102150209

It has a distinct Nico Nico Pedia flavor, and it looks like you can pick up knowledge here that Wikipedia does not have.
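As a side note, you can of course hit the same database from Python with the standard library. A minimal sketch (the file name follows the CLI example above; see create-table.sql for the actual schema):

import sqlite3

# Minimal sketch: query the header database built above from Python.
# The table name article_header comes from the CLI example above.
with sqlite3.connect('headers.db') as conn:
    for row in conn.execute('select * from article_header limit 10'):
        print(row)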

HTML -> JSON!

One of the big problems with Nico Nico Pedia articles is the sheer number of odd tags. Unlike Wikipedia, there are lots of <br> and <span> tags used purely for formatting, and my personal impression is that extracting sentences took a lot of effort because of them. (~~Also, the AA is almost always broken. Please introduce a dedicated tag for AA...~~)

The easiest way to handle HTML parsing like this is to use a DSL (domain-specific language). A well-known example is Kotlin's HTML parsing tool.

This time I processed it fairly casually using Lisp. As for the detailed code... well ()...

lein preprocess-corpus -r /path/to/nico-dict/zips

Please run it like this. (For running it from the jar, and for bug reports, see the repository.) It takes roughly 10 to 15 minutes and eats up about 20 to 30 GB of disk.

Let's take a quick look at the contents.

head -n 1 rev2008-jsoned.csv 
1,"{""type"":""element"",""attrs"":null,""tag"":""body"",""content"":[{""type"":""element"",""attrs"":null,""tag"":""h2"",""content"":[""Overview""]},{""type"":""element"",""attrs"":null,""tag"":""p"",""content"":[""What is Nico Nico Pedia?(abridgement)Is.""]}]}",200xxxxxxxx939,[],Nico Nico Pedia,Nico Nico Daihakka,a

To explain the fields one at a time:

  1. Article ID
  2. Article body (JSON-converted + preprocessed)
  3. Article update date
  4. A list of links (<a> tags) contained within the page
  5. Title
  6. Reading the title
  7. Category ("a" = word, "v" = video, "i" = product, "l" = live broadcast, "c" = presumably community articles (this one is not in the spec))

I cannot fully demonstrate the benefits of the JSON conversion + preprocessing here, but for example it becomes much easier to handle fragments like <p>hoge<span/>hoge<br/>bar</p>, easier to turn an article into a graph, and easier to apply tools such as Snorkel.
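For example, extracting the plain text of an article is just a walk over the {"type", "attrs", "tag", "content"} tree shown above. A minimal sketch (this helper is my own, not part of niconico-parser):

import json

# Minimal sketch (my own helper, not part of niconico-parser):
# recursively collect the strings in the {"type", "attrs", "tag", "content"} tree.
def extract_text(node) -> str:
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        return ''.join(extract_text(child) for child in node)
    if isinstance(node, dict):
        return extract_text(node.get('content') or [])
    return ''

# usage: extract_text(json.loads(article_json_string))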

Let's do some statistics

"I made a preprocessing tool!" is not very interesting on its own, so let's compute some simple statistics. For data wrangling the standard choice seems to be Python + Pandas, so that is what I will use. (However, Pandas is quite heavy and slow, so please use a different tool for any serious analysis.)

The following steps assume you are working in something like a Jupyter Notebook.

Dependency import

import pandas as pd
import json
from pathlib import Path
from pprint import pprint
from typing import List

Declaration of global variables

Change these to suit your environment.

############################
# Global variables (change as appropriate) #
############################
# CSV header
header_name = ('article_id', 'article', 'update-date',
               'links', 'title', 'title_yomi', 'category')
dtypes = {'article_id': 'uint16',
          'article': 'object',
          'update-date': 'object',
          'links': 'object',
          'title': 'object',
          'title_yomi': 'object',
          'category': 'object'
}

#Sample CSV
sample_filepath = "/home/meguru/Documents/nico-dict/zips/rev2014/rev201402-jsoned.csv"
sample_filepath = Path(sample_filepath)

#Sample CSVs
fileparent = Path("/home/meguru/Documents/nico-dict/zips")
filepaths = [
    "rev2014/rev201401-jsoned.csv",
    "rev2014/rev201402-jsoned.csv",
    "rev2013/rev201301-jsoned.csv",
    "rev2013/rev201302-jsoned.csv",
    "rev2013/rev201303-jsoned.csv",
    "rev2013/rev201304-jsoned.csv",
    "rev2013/rev201305-jsoned.csv",
    "rev2013/rev201306-jsoned.csv",
    "rev2013/rev201307-jsoned.csv",
    "rev2013/rev201308-jsoned.csv",
    "rev2013/rev201309-jsoned.csv",
    "rev2013/rev201310-jsoned.csv",
    "rev2013/rev201311-jsoned.csv",
    "rev2013/rev201312-jsoned.csv",
]
filepaths = list(filter(lambda path: path.exists(), map(
    lambda fpath: fileparent / Path(fpath), filepaths)))
##################

Define functions for reading the preprocessed CSV

def read_df(csvfile: Path, with_info: bool = False):
    """Read a single jsoned.csv file.
    args:
    - csvfile: Path
        the file path you want to read
    - with_info: bool
        whether to print the data frame's info
    returns:
    - df
        the loaded data frame
    notes:
    calling this function prints a short log message
    """
    df = pd.read_csv(csvfile, names=header_name, dtype=dtypes)
    print('[Info] read a file {}'.format(csvfile))
    if with_info:
        df.info()
    return df


def read_dfs(fileparent: Path, csvfiles: List[Path]):
    """Read and concatenate several jsoned.csv files.
    args:
    - fileparent: Path
        the parent directory of the files you want to read
    - csvfiles: List[Path]
        the file paths you want to read (relative to fileparent)
    returns:
    - dfl
        the concatenated data frame
    note:
    given
        fileparent = "/path/to"
        csvfiles[0] = "file"
    then
        the file read is "/path/to/file"
    """
    dfl = []
    for fpath in csvfiles:
        dfi = pd.read_csv(fileparent / fpath,
                          index_col=None, names=header_name, dtype=dtypes)
        dfl.append(dfi)
    dfl = pd.concat(dfl, axis=0, ignore_index=True)
    return dfl

Load one file as a sample and look at it

This time, let's look at how the links (<a> tags) in the HTML are distributed across the different article types.

df = read_df(sample_filepath, True)
# [Info] read a file /home/meguru/Documents/nico-dict/zips/rev2014/rev201402-jsoned.csv
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 6499 entries, 0 to 6498
# Data columns (total 7 columns):
# article_id     6499 non-null int64
# article        6499 non-null object
# update-date    6499 non-null int64
# links          6499 non-null object
# title          6491 non-null object
# title_yomi     6491 non-null object
# category       6491 non-null object
# dtypes: int64(2), object(5)
# memory usage: 355.5+ KB

For now, we can confirm that this one file contains about 6.5k articles.

Next, let's parse the JSON-encoded link information and count the links per article.

# Check the raw data
df['links'][0]
# => '[{"type":"element","attrs":{"href":"http://wwwxxxxhtml"},"tag":"a","content":["Kochi xxxx site"]}]'
dfs = pd.DataFrame()
dfs['links'] = df['links'].map(lambda x: len(json.loads(x)))
dfs['links'][0]
# => 1

Let's compute some quick statistics.

dfs['category'] = df['category']
dfsg = dfs.groupby('category')
dfsg.describe()
#            links                                                      
#            count       mean         std  min  25%   50%    75%     max
# category                                                              
# a         5558.0  41.687298  209.005652  0.0  0.0   2.0  11.00  2064.0
# c           36.0  54.305556  109.339529  0.0  2.0   2.0  38.25   376.0
# i            4.0   7.500000    5.507571  2.0  3.5   7.0  11.00    14.0
# l          786.0  22.760814  106.608535  0.0  0.0   2.0   9.00  1309.0
# v          107.0  32.887850   46.052744  0.0  3.0  11.0  37.00   153.0

"a" = word "v" = video "i" = product "l" = live broadcast "c" = community article, so on average there are many ** community article links **. However, if you look at the median and maximum values, you can observe that it seems necessary to look at (classify) the word articles in more detail.

Increase the sample data and check again

6k articles is not enough, so let's add more data.

dfl = read_dfs(fileparent, filepaths)
# >>>         article_id                                            article  ...             title_yomi category
# 0             8576  {"type":"element","attrs":null,"tag":"body","c...  ...Kabekick a
# [223849 rows x 7 columns]
dfls = pd.DataFrame()
dfls['links'] = dfl['links'].map(lambda x: len(json.loads(x)))
dfls['category'] = dfl['category']
dflsg = dfls.groupby('category')
dflsg.describe()
#              links
#              count       mean         std  min  25%  50%   75%     max
# category
# a         193264.0  32.400566  153.923988  0.0  0.0  2.0  10.0  4986.0
# c           1019.0  34.667321   77.390967  0.0  1.0  2.0  34.0   449.0
# i            247.0   6.137652    6.675194  0.0  1.0  3.0  10.0    28.0
# l          24929.0  20.266477  100.640253  0.0  0.0  1.0   5.0  1309.0
# v           3414.0  14.620387   22.969974  0.0  1.0  6.0  16.0   176.0

Overall, with more data the mean number of links for video articles drops, so the ordering of live broadcast and video articles reverses. As with the single-file sample, we can also confirm that the spread in the number of links for word articles is extremely large. Another counter-intuitive point is that **for word articles even the third quartile is below the mean**.

From these results, it is clear that the number of links varies considerably by article type, and it seems better to first observe the properties of each article type individually before any serious modeling. (How to take it from here and produce results is left to the reader.)

Is there a correlation between the number of article links and the article size?

The previous experiment showed that the variance is large, especially for word articles. **From my experience and intuition from regularly reading Nico Nico Pedia**, my hypothesis is that article size and the number of links are correlated. So let's treat the number of characters in the JSON-converted article as the article size and check the correlation.
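The dfts frame used below is not constructed in the snippets above, so here is a minimal sketch of how to build it, taking the character count of the JSON-converted article column as the article size as just described:

# Minimal sketch: link count, category, and article size per article,
# where article size = number of characters of the JSON-converted article.
dfts = pd.DataFrame()
dfts['links'] = dfl['links'].map(lambda x: len(json.loads(x)))
dfts['category'] = dfl['category']
dfts['article_size'] = dfl['article'].map(len)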

dfts.corr()
#                  links  article_size
# links         1.000000      0.713465
# article_size  0.713465      1.000000

Well, there does at least seem to be a fairly strong positive correlation.

Breaking it down by category, it looks like this.

# Word articles
dfts[dfts['category'] == "a"].loc[:, ["links", "article_size"]].corr()
#                  links  article_size
# links         1.000000      0.724774
# article_size  0.724774      1.000000

# Community articles
dfts[dfts['category'] == "c"].loc[:, ["links", "article_size"]].corr()
#                links  article_size
# links        1.00000       0.63424
# article_size 0.63424       1.00000

# Product articles
dfts[dfts['category'] == "i"].loc[:, ["links", "article_size"]].corr()
#                  links  article_size
# links         1.000000      0.254031
# article_size  0.254031      1.000000

# Live broadcast articles
dfts[dfts['category'] == "l"].loc[:, ["links", "article_size"]].corr()
#                 links  article_size
# links         1.00000       0.58073
# article_size  0.58073       1.00000

# Video articles
dfts[dfts['category'] == "v"].loc[:, ["links", "article_size"]].corr()
#                  links  article_size
# links         1.000000      0.428443
# article_size  0.428443      1.000000
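Correlation coefficients alone can hide a lot, so it is worth eyeballing the relationship as well. A minimal sketch of a scatter plot, assuming matplotlib is installed and dfts was built as above:

import matplotlib.pyplot as plt

# Minimal sketch: article size vs. link count for word articles ("a").
words = dfts[dfts['category'] == 'a']
plt.scatter(words['article_size'], words['links'], s=2, alpha=0.3)
plt.xlabel('article_size (characters of JSON)')
plt.ylabel('links')
plt.show()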

News

I have made a CLI for parsing articles published on the web.

lein parse-from-web -u https://dic.nicovideo.jp/a/<contents-title>

You can obtain JSON-converted article data this way. See the repository for an example of what you get.

However, this **puts load on the remote server**, so please restrict it to things like briefly trying the tool out. Whatever you do, do not do carpet-bombing-style scraping from your university's IP.
