[PYTHON] Create a data frame from the acquired boat race text data

Introduction

This article is an explanation of the internal code of the boat race triple prediction site "Today, do you have a good prediction?" that was created by myself and released on the Web. It will be. This time, I will summarize about data frame creation. In the first place, I will create a separate article regarding data acquisition, so please refer to that.

Original data to be converted to DataFrame

The lower snap is text data that can be downloaded from the boat race formula, but since we want to use this race result for ranking learning, this time we will format the content enclosed in yellow as a data frame. image.png

Handling unnecessary words and blank spaces

For example, the above figure is the information of the 1st race of Toda Racecourse, but I am concerned about the fact that "Toda" is also "Toda" and the existence of full-width space. So, first of all, we will do the processing around that.

import pandas as pd
import numpy as np

filename_txt = 'bs_2020_0622.txt' #Original text data
output_raw = '2020_0622_raw.txt' #After processing part 1_Text data after blank processing

with open(filename_txt, mode='r', encoding='utf-8') as f:
    lines = f.readlines()

for i in range(len(lines)):
    with open(output_raw, mode='a', encoding='utf-8') as f:
         f.writelines(lines[i].replace("Player name","Player name").replace("m wave","")\
                      .replace("cm","").replace("Fixed approach","")\
                      .replace("〜","").replace("!","").replace("(","")\
                      .replace(")","").replace("・","").replace("−","")\
                      .replace(" ","").replace("     "," ").replace("    "," ")\
                      .replace("   "," ").replace("  "," ").lstrip())

You might think this mountain of replace .. I would appreciate it if you could let me know if you have any better ideas or writing styles. In any case, by reading the original data line by line and applying the replace process, the text file becomes as shown in the figure below. You've managed to handle the extra whitespace (+ words).

image.png

Extracting data using regular expressions, etc.

As shown in the yellow frame of the first snap, the desired information is scattered in each text line. As the final process of creating a data frame, devise so that the desired information is described in all lines.

import re

output_txt= '2020_0622_txt.txt' #After processing part 2_Text data after regular expression processing

place_list = ['Fukuoka [Results]','Omura [Results]','Kiryu [Results]','Toda [Results]'\
              ,'Tama River [Results]','Lake Hamana [Results]','Gamagori [Results]'\
             ,'Tokoname [Results]','Tsu [Results]','Lake Biwa [Results]','Suminoe [Results]'\
              ,'Naruto [Results]','Marugame [Results]','Miyajima [Results]'\
             ,'Tokuyama [Results]','Shimonoseki [Results]','Wakamatsu [Results]','Mikuni [Results]'\
              ,'Heiwajima [Results]','Ashiya [Results]','Kojima [Results]',\
             'Edogawa [Results]','Amagasaki [Results]','Karatsu [Results]']

pattern_list = ['TOP 1R 2R 3R 4R 5R 6R 7R 8R 9R 10R 11R 12R']
pattern = 'TOP 1R 2R 3R 4R 5R 6R 7R 8R 9R 10R 11R 12R (\w+ \w+ \w+ \w+ \w+ \w+ \w+ \w+)'
repattern = re.compile(pattern)

pattern2_list = ['Arrive boat No. Player name Motor boat Exhibition Entry timing time']
pattern2 = 'Arrive boat No. Player name Motor boat Exhibition Entry timing time(\w+)' #' 
repattern2 = re.compile(pattern2)

pattern3_list = ['No.']

#Read the race result with blank processing line by line
with open(output_raw, mode='r', encoding='utf-8') as f:
    lines = f.readlines()

#Prepare empty text data
text = ''
text2 = ''
text3 = ''

#Check if there is a line that matches the pattern line by line.
for i in range(0,len(lines)):
    with open(output_txt, mode='a', encoding='utf-8') as f:
        if pattern_list[0] in lines[i]:
            #Use the search function to see if it applies
            match = re.search(repattern, lines[i])
            #Outputs the extracted character string
            try:
                text = match.group(1)
            except:
                print(str(lines[i]),i,"This guy has a little error")
        
        if pattern2_list[0] in lines[i]:
            #Use the search function to see if it applies
            match = re.search(repattern2, lines[i])
            #Outputs the extracted character string
            try:
                text2 = match.group(1)
            except:
                print(str(lines[i]),i,"This guy a little error 2")
                text2 = "blank"
            
        if pattern3_list[0] in lines[i]:
            #Outputs the extracted character string
            text3 = lines[i].replace("No.","").replace("/"," ")
            
        for j in place_list:
            if j in lines[i]:
                k = j.replace("[Results]","")
                f.writelines(lines[i])
                break
                
        f.writelines(lines[i].replace('. .','.').rstrip()+" "+k+" "\
                     +text+" "+text2+" "+text3+"\n")

I wonder if there is a better way .. but this will create the following text data. Can you see that the weather and date are included in the race information you want this time?

image.png

Finally, data frame! !!

When you get here, it's almost over. The information I want in this text data is only the range of the red frame written in the above figure. In other words, the line where the first character of the text line starts with "0" is taken in as a data frame.

df = pd.read_table(output_txt, names='A') #Column name"A"Create a data frame with
df_racer = df[df['A'].str.startswith('0')] #"0"Get only lines starting with
df_racer = df_racer['A'].str.split(' +', expand=True)\
    .rename(columns={0:'Position',1:'Lane',2:'Register',3:'Name',4:'Motor',5:'Boat'\
                     ,6:'Tenji',7:'Entry',8:'Start',9:'Time',10:'Place'\
                     ,11:'Round',12:'RaceType',13:'Course',14:'Wether'\
                     ,15:'Wind',16:'WindDir',17:'WindStr',18:'WaveHgt'\
                     ,19:'RaceResult',20:'RaceDay',21:'Year',22:'Month'\
                     ,23:'Day',24:'Location'})

df_racer = df_racer.reset_index()

Since each information is separated by a half-width space, the meaning is defined by split ('+') and rename processing. And the completed data frame looks like this. ** You did it! ** **

df_racer.head()
image.png

in conclusion

The process of converting this text data into a data frame was very difficult ... Is there a difference between a format that is easy to see and a format that is easy to process data? How many text data are you making? !! I also feel that I want to write smarter. Will it be updated someday ..?

Recommended Posts

Create a data frame from the acquired boat race text data
Create multiple line charts from a data frame at once using Matplotlib
Generate a vertical image of a novel from text data
Try to create a battle record table with matplotlib from the data of "Schedule-kun"
[Go] Create a tool that returns the Pokemon race value received from the standard input
Find out the maximum number of characters in multi-line text stored in a data frame
A program that removes specific characters from the entered text
Remove the frame from the image
Create a dummy data file
Create a correlation diagram from the conversation history of twitter
Create a summary table by product and time by processing the data extracted from a certain POS system
Create an API that returns data from a model using turicreate
[numpy] Create a moving window matrix from multidimensional time series data
Outputs a line containing the specified character string from a text file
[Spark Data Frame] Change a column from horizontal to vertical (Scala)
[Mac] Create a Python3 execution environment from the fully initialized state
Prepare a high-speed analysis environment by hitting mysql from the data analysis environment
Create a pandas Dataframe from a string.
How to divide and process a data frame using the groupby function
Extract lines that match the conditions from a text file with python
[Development environment] How to create a data set close to the production DB
How to create a clone from Github
Create a graph using the Sympy module
Create a large text file with shellscript
Create a standings from JFL match results
Create a deb file from a python package
How to create a repository from media
Create a dataframe from excel using pandas
Create a matrix with PythonGUI (text box)
Get only the text from the Django form.
Python --Read data from a numeric data file and find the multiple regression line.
Aggregate steps by day from iPhone healthcare data to create a CSV file
[Django] Create a form that automatically fills in the address from the zip code
Change the data frame of pandas purchase data (id x product) to a dictionary