[PYTHON] Introduction to Quiz Statistics (1) -Mathematical analysis of question sentences to know the tendency of questions-

Introduction

Do you all play quizzes? Of course you do, right? And when it comes to quizzes, the star of the show is the buzzer quiz! I imagine everyone is hard at work on buzzer quizzes every day, online and offline, in all sorts of ways.

When you play buzzer quizzes, you have probably wondered about things like these:

- *About how often does a parallel question (the so-called "..., but" question) or an etymology question (one of the form "means □□ in the ◯◯ language") come up: once in every how many questions?*
- *I want to know the keywords where "once you hear this word, the answer is settled!"*

If you are a quiz player, you have surely had questions like these at least once, right? ~~Or rather, go on and have them ☆~~ The aim of this article is to answer these questions head-on with technical and mathematical methods.

By the way, this article is littered with source code and mathematical formulas. Of course I could omit all of that and jump straight to the conclusions, but I want to give it a bit of an "authentic" feel ~~something like a secret lab or a research paper~~, and that is also a point of pride for me (a real pain, I know), so I deliberately write out the complicated details. I will try to explain the parts essential to quizzes themselves as simply as possible, but please understand that I omit detailed explanations of the specialized parts (especially the technical aspects). In any case, **if you don't understand something, just skip it and read on**.

Also, just in case: the program code posted here is not guaranteed to work correctly, so it may cause unexpected problems. The author takes no responsibility for such problems, so please use it at your own risk. In addition, the results and discussion are only my personal impressions, so please do not take them too seriously.

Now, let me take you to the gateway of the academic discipline of **quiz statistics**.

**[Note]** Below, the quiz question data is organized through the flow HTML → SQL database → CSV file, but of course there is no problem at all with skipping the SQL database and going straight from HTML to a CSV file. For that matter, you don't even need to write a program: you can simply copy the HTML into Excel or the like, and that is more efficient. The approach described below simply reflects the following:

- I have a vague feeling that once the data is in a database, it is easier to play with and organize in various ways.
- I am personally not very good with Excel, so I want to avoid using it as much as possible.

These are purely my personal feelings and circumstances, so if you say a database is unnecessary, you are right, and if you say Excel is enough, you are right too. **What follows is just a record of what I tried as someone with little quiz experience, so please be aware that it contains inherently redundant parts and detours.**

Preparation of quiz question data to analyze

Extraction of question sentences and answers and storage in database

Now we will analyze quiz question sentences, but of course we cannot start without questions to analyze. This time, out of the quiz questions published by Quiz Forest, I used only the questions of the "abc series", which seem to be the most standard quiz question sentences.

The questions are published in HTML format, which is a little awkward to handle as is, so I start by cleanly extracting the question sentences and answers from the HTML files. While we are at it, let's load them into a database to make them easier to work with later.

After saving all the HTML files of the abc questions in the `abcquiz` directory, first unify the character encoding of the HTML files to UTF-8. You can batch-convert the encoding with the following command.

nkf -w --overwrite ./abcquiz/*.htm
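
If nkf is not available, the same conversion could be sketched in Python. This is only a minimal sketch in Python 3 (not the Python 2 used by the article's scripts), and it assumes the saved files are in Shift_JIS; unlike nkf, it does not auto-detect the input encoding. The function names `convert_to_utf8` and `convert_directory` are hypothetical.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="shift_jis"):
    """Re-encode one text file to UTF-8 in place.

    source_encoding is an assumption: this sketch does not detect the
    input encoding the way nkf does.
    """
    path = Path(path)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


def convert_directory(directory, pattern="*.htm", source_encoding="shift_jis"):
    """Convert every file matching pattern and return the converted file names."""
    converted = []
    for path in sorted(Path(directory).glob(pattern)):
        convert_to_utf8(path, source_encoding)
        converted.append(path.name)
    return converted
```

Calling `convert_directory("./abcquiz")` would then cover the same files as the nkf command. Because the source encoding is fixed rather than detected, running it twice on the same files would corrupt them, so it is safer to work on a copy.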

After the encoding conversion is complete, write a Python script and run it. The packages MySQLdb and bs4 used in the script below are not included with Python by default, so you need to install them separately. For example, you can install them with the pip command like this:

sudo pip install MySQL-python
sudo pip install beautifulsoup4

Then run the following Python program. The Python scripts in this article assume Python 2.x.

extract_abcquiz_into_mysqldb.py


# coding: utf-8

import os
import re
import codecs
import MySQLdb
from bs4 import BeautifulSoup

# Objects for MySQL operations
connector = MySQLdb.connect(host="localhost", db="quiz", user="(User name)", passwd="(password)", charset="utf8")
cursor = connector.cursor()

# Collect the names of all HTML files in the target folder
file_names = os.listdir("./abcquiz/")
# Iterate over a slice copy, because the list itself is modified inside the loop
for f in file_names[:]:
    if ".htm" not in f:
        file_names.remove(f)

# Read each file and extract the question sentences (already unicode at this point)
for f in file_names:
    open_file = codecs.open("./abcquiz/" + f, "r", "utf8")
    whole_soup = BeautifulSoup(open_file)
    # First, split the soup into one piece per question (one table row each)
    each_quiz_soups = whole_soup.find_all("tr")
    for qs in each_quiz_soups:
        current_quiz = {"question":u"", "answer":u""}
        # Then extract the td cells of the row
        tds = qs.find_all("td")
        is_next_of_question = False
        for td in tds:
            # A td cell ending with "?" is the question sentence, so extract it
            if re.match(ur".*\?$", td.text):
                current_quiz["question"] = td.text
                is_next_of_question = True
                continue
            # The td cell right after the question sentence is the answer
            if is_next_of_question:
                current_quiz["answer"] = td.text
                break
        # Let the driver escape the values instead of concatenating them into the SQL string
        sql_query = u"insert into abcquiz (question, answer) values (%s, %s)"
        try:
            cursor.execute(sql_query, (current_quiz["question"], current_quiz["answer"]))
        except MySQLdb.Error:
            # Skip rows that fail to insert
            continue

connector.commit()
cursor.close()
connector.close()

This extracts all question and answer pairs from the HTML files and stores them in the `abcquiz` table of the `quiz` database, in the columns `question` and `answer`. Note that you need to install MySQL, prepare a MySQL user account, and create the `quiz` database and the `abcquiz` table in advance.
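
To make the table layout and the insert step concrete, here is a minimal sketch in Python 3 that uses the standard-library sqlite3 module as a stand-in for MySQL, so it runs without a database server. The column types are my assumption (the article only gives the column names), and the two rows are taken from the sample data shown later. It also shows why bound parameters are worth using: an answer such as "Sparrow's tears" contains an apostrophe, which would break a query assembled by string concatenation.

```python
import sqlite3

# Stand-in for the MySQL "quiz" database: an in-memory SQLite database.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Assumed table layout: two text columns named as in the insert statement.
cursor.execute("create table abcquiz (question text, answer text)")

rows = [
    ("What kind of tears do you think of a small wage or share as a bird?", "Sparrow's tears"),
    ("What color is used for the American flag but not for the Japanese flag?", "Blue"),
]

# Placeholders let the driver escape values such as the apostrophe above.
# sqlite3 uses "?" placeholders, while MySQLdb uses "%s".
cursor.executemany("insert into abcquiz (question, answer) values (?, ?)", rows)
connection.commit()

cursor.execute("select count(*) from abcquiz")
stored_count = cursor.fetchone()[0]
cursor.execute("select answer from abcquiz where question like ?", ("%tears%",))
stored_answer = cursor.fetchone()[0]
print(stored_count, stored_answer)
connection.close()
```

In MySQL the equivalent table would be created once with a `create table` statement before running the extraction script.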

This completes, for the time being, a quiz question database of 11184 questions.

Export to CSV file format

Analyzing the data while it sits in the database is a little awkward, so let's export the stored questions in a form that is easy to analyze. I often use the CSV file format in such cases, so this time I export to CSV. Of course it doesn't have to be CSV; JSON would work just as well.

First, log in to MySQL.

mysql -u (User name) -p

After entering the password and logging in, switch to the `quiz` database and export the contents of the `abcquiz` table to a CSV file.

use quiz;
select * from abcquiz into outfile 'abcquiz.csv' fields terminated by ',';

When you try to generate the CSV file, you may get an error mentioning secure-file-priv and the export may fail. The fix depends on the situation and would take too long to describe, so I omit it here, but in many cases the value of secure-file-priv seems to be NULL, so setting a value for `secure_file_priv` in the `my.cnf` that is actually loaded will probably solve it.

(Reference) Mysql making --secure-file-priv option to NULL - Stack Overflow http://stackoverflow.com/questions/37543177/mysql-making-secure-file-priv-option-to-null
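
As an alternative that sidesteps secure-file-priv entirely, the rows could also be written out from Python with the standard csv module. The following is a minimal sketch in Python 3; the `rows` list is a hypothetical stand-in for what `cursor.fetchall()` would return for `select question, answer from abcquiz`, and `rows_to_csv` is a name I made up. A side benefit is that the csv module quotes any field that itself contains a comma, which the plain `fields terminated by ','` export does not do.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (question, answer) pairs as CSV text, quoting fields when needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in for the rows fetched from the abcquiz table.
rows = [
    ("What color is used for the American flag but not for the Japanese flag?", "Blue"),
    ("The element with the element symbol B is boron, but what is the element with the element symbol C?", "carbon"),
]

csv_text = rows_to_csv(rows)
# Reading the text back recovers the original two columns per row.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
```

To write a real file, the same writer can be pointed at `open("abcquiz.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.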

The steps so far produce the following CSV file, that is, a text file of 11184 lines in which each question sentence and its answer are separated by a single-byte comma.

The element with the element symbol B is boron, but what is the element with the element symbol C?,carbon
What kind of tears do you think of a small wage or share as a bird?,Sparrow's tears
The first shogun of the Kamakura Shogunate was Minamoto no Yoritomo, but who is the first shogun of the Muromachi Shogunate?,Takauji Ashikaga
Who is the current coach of the Japan national football team, nicknamed "skinny"?,Zico
What color is used for the American flag but not for the Japanese flag?,Blue
・
・
・

Using this file as material, we will carry out the actual analysis in the next section.

Problem analysis by pattern matching

Let's do a little analysis. Here I will answer one of the quiz player's questions raised at the beginning of the article:

- *About how often do parallel questions and etymology questions come up: once in every how many questions?*

For this purpose, the parallel question and the etymology question are defined as follows.

- A parallel question is a question in which the character string "da ga" ("..., but") appears anywhere in the question sentence.
- An etymology question is a question in which a character string matching the pattern "in the ◯◯ language, (arbitrary character string) meaning" appears anywhere in the question sentence.
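
The two definitions above can be sketched as regular expressions. This is only an illustration in Python 3: the patterns below are English stand-ins that I chose to fit the translated sample sentences, not the patterns used on the actual data, and `classify` is a hypothetical helper name.

```python
import re

# English stand-ins for the two definitions; these exact patterns are
# assumptions chosen to fit the translated sample sentences.
PARALLEL_PATTERN = re.compile(r"\bbut\b")
ETYMOLOGY_PATTERN = re.compile(r"\bmean(?:s|ing) .* in [A-Z][a-z]+")

samples = [
    "The largest fish in the world is the whale shark, but what is the largest amphibian in the world?",
    "What is a cake that literally imitates a tree ring, which means \"wooden cake\" in German?",
    "Who is the current coach of the Japan national football team, nicknamed \"skinny\"?",
]


def classify(question):
    """Return the labels of every pattern that occurs somewhere in the question."""
    labels = []
    if PARALLEL_PATTERN.search(question):
        labels.append("parallel")
    if ETYMOLOGY_PATTERN.search(question):
        labels.append("etymology")
    return labels


for question in samples:
    print(classify(question), question)
```

Using `search` rather than `match` is what implements "appears anywhere in the question sentence": `match` would only test the beginning of the string.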

Now, let's create a program that reads the CSV file generated earlier, counts the number of problems that match these patterns, and calculates the appearance ratio to the whole.

count_parallel_gogen_question.py


# coding: utf-8

import csv
import re

#Read the question sentence from the CSV file created in advance (first column of CSV)
questions = []
with open("./abcquiz.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        questions.append(row[0].decode("utf8"))

#Specify the character string pattern to search with a regular expression
parallel_pattern = re.compile(ur"but")
gogen_pattern = re.compile(ur"In words.*means")

#Count the number of questions that match the specified pattern
parallel_num = 0
gogen_num = 0
for q in questions:
    if re.search(parallel_pattern, q):
        parallel_num += 1
    if re.search(gogen_pattern, q):
        gogen_num += 1

#Calculate the appearance ratio based on the count result
parallel_probability = float(parallel_num) / float(len(questions))
gogen_probability = float(gogen_num) / float(len(questions))

#Result output
print u"Total number of problems= ", unicode(len(questions))
print u"Number of parallel problems= ", unicode(parallel_num), u"(Question ratio= ", u"{:.2f}".format(parallel_probability * 100.0), u"%, 1 in every", u"{:.2f}".format(1.0 / parallel_probability), u"questions)"
print u"Etymological problems= ", unicode(gogen_num), u"(Question ratio= ", u"{:.2f}".format(gogen_probability * 100.0), u"%, 1 in every", u"{:.2f}".format(1.0 / gogen_probability), u"questions)"

Running this prints the following result to the screen.

Total number of problems=  11184
Number of parallel problems=  1083 (Question ratio=  9.68 %, 1 in every 10.33 questions)
Etymological problems=  244 (Question ratio=  2.18 %, 1 in every 45.84 questions)

And there we have the answer to our question!

- Parallel questions come up at a rate of **1 in every 10.33 questions**
- Etymology questions come up at a rate of **1 in every 45.84 questions**
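
The arithmetic behind these two figures is simple enough to check by hand: the question ratio is the match count divided by the total, and "1 in every N questions" is the reciprocal of that ratio. A minimal sketch in Python 3, using the counts reported above (`frequency_summary` is a hypothetical helper name):

```python
def frequency_summary(match_count, total_count):
    """Return (percentage of matching questions, questions per one match)."""
    ratio = match_count / total_count
    return ratio * 100.0, 1.0 / ratio


total = 11184
for label, count in [("parallel", 1083), ("etymology", 244)]:
    percent, one_in = frequency_summary(count, total)
    print("{}: {:.2f} %, 1 in every {:.2f} questions".format(label, percent, one_in))
```

For example, 1083 / 11184 is about 0.0968, so parallel questions make up about 9.68 % of the total, and 11184 / 1083 is about 10.33.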

Just in case, let's check if we are properly picking up parallel problems and etymological problems. In the place where the parallel problem is counted in the previous code, add one line of print statement that outputs the problem statement to the screen.

    if re.search(parallel_pattern, q):
        parallel_num += 1
        print q

Then, the question sentences judged as parallel problems will be displayed on the screen in a row, so let's take a look.

The element with the element symbol B is boron, but what is the element with the element symbol C?
The first shogun of the Kamakura Shogunate was Minamoto no Yoritomo, but who is the first shogun of the Muromachi Shogunate?
The nickname of the Akita Shinkansen is "Komachi", but what is the nickname of the Yamagata Shinkansen?
Among sports nicknames, ice hockey is called "martial arts on ice", but which sport is called "chess on ice"?
The largest fish in the world is the whale shark, but what is the largest amphibian in the world?
・
・
・

As expected, parallel questions are being detected and counted. If you keep reading through this list, you will also come across a question sentence like the following.

Among plants used as condiments in cooking, Welsh onion belongs to the lily family and ginger belongs to the ginger family, but which family does wasabi belong to?

This is a so-called "three-para" question, in which two items appear before the "but" and the question then asks about a third. Under the definition of a parallel question used here, these three-para questions are counted as well. With a more refined definition, you could extract only the three-para questions.
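
As a rough idea of what such a refined definition might look like, here is a minimal sketch in Python 3. The heuristic (an "and" joining two items somewhere before the "but") is purely my assumption for the English sample sentences and is not a definition given in this article; it would misfire on any parallel question that merely happens to contain the word "and".

```python
import re

# Crude English stand-in: an "and" joining two items somewhere before the "but".
# This heuristic is an assumption for illustration, not the article's definition.
THREE_PARA_PATTERN = re.compile(r"\band\b.*\bbut\b")

questions = [
    "The element with the element symbol B is boron, but what is the element with the element symbol C?",
    "Welsh onion belongs to the lily family and ginger belongs to the ginger family, but what kind of wasabi is it?",
]

# Keep only the questions that look like three-para questions.
three_para = [q for q in questions if THREE_PARA_PATTERN.search(q)]
print(len(three_para))
```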

Similarly, let's check the etymology questions. Move the print statement added earlier to the place where the etymology questions are counted.

    if re.search(gogen_pattern, q):
        gogen_num += 1
        print q

Then run it.

What is a cake that literally imitates a tree ring, which means "wooden cake" in German?
What is the movement of Christians to recapture the Muslim-occupied Iberian Peninsula, which means "reconquest" in Spanish?
What is the stand-up style at parties, which means "tableware cupboard" in French?
What is the policy of the former Soviet Union to develop open politics and economy, which means "information disclosure" in Russian?
What is a cold, sweet dessert that means "perfect" in French and has a variety of fruits, chocolates and more?
・
・
・

Yes, this looks good too! Reading through the list, though, you will find some question sentences of a slightly different flavor.

What is the farm equipment that has the name "combined" in English because it can be cut and threshed at once?
What is a Western confectionery made by sprinkling chocolate on a strip of cream puff, which has the name meaning "lightning bolt" in French?

These follow a pattern in which the expression "meaning □□ in the ◯◯ language" appears in the middle of the sentence. With questions like these, knowing the etymology may not be much of an advantage (for example, I think the questions above can be buzzed in on early even without knowing the English or French meaning), so it may be a bit of a stretch to call them "etymology questions" in the usual sense.

One question sentence in a slightly unexpected form was also detected.

What term, rendered in English as "Confidential", is added to a letter to mean that only the addressee should read it?

This question does indeed contain the pattern defined here for etymology questions. In practice, however, it puts a translation into another language up front, so it may not really be an etymology question. This time I only wanted a rough idea of the question ratio, and this single question does not affect the overall result, but a more detailed analysis would need to take such cases into account.

In conclusion

In this article I covered the technical handling of the past questions of the abc series, the actual analysis results, and some discussion of them. As a result, we found that:

- Parallel questions come up at a rate of **1 in every 10.33 questions**
- Etymology questions come up at a rate of **1 in every 45.84 questions**

However, this time we only dealt with abc series questions from about 10 years ago, so this ratio may well have changed considerably in recent quiz question trends.

At the beginning I mentioned the questions that quiz players have, and besides the one we have just answered, there was one more, right?

- *I want to know the keywords where "once you hear this word, the answer is settled!"*

Answering this question was actually the main goal, but the article is getting rather long, so I will stop here for today and publish the rest another day as "Quiz Statistics (2)". The analysis in this article could well be done with Excel's COUNTIF function as long as you have the question data, but next time I plan to deal properly with material that cannot be solved without specialized programming techniques. If you don't mind, please join me again.

I would like to express my sincere gratitude to everyone who has read this far, and with that I conclude "Quiz Statistics (1)". See you again in "Quiz Statistics (2)".
