Introduction

This article is an explanation of the internal code of the boat race triple prediction site "Today, do you have a good prediction?" that was created by myself and released on the Web. It will be. This time, I will summarize about data frame creation. In the first place, I will create a separate article regarding data acquisition, so please refer to that.

The code is written in my own way, so I would appreciate it if you could give me some advice.

Original data to be converted to DataFrame

The lower snap is text data that can be downloaded from the boat race formula, but since we want to use this race result for ranking learning, this time we will format the content enclosed in yellow as a data frame.

Handling unnecessary words and blank spaces

For example, the above figure is the information of the 1st race of Toda Racecourse, but I am concerned about the fact that "Toda" is also "Toda" and the existence of full-width space. So, first of all, we will do the processing around that.

import pandas as pd
import numpy as np

filename_txt = 'bs_2020_0622.txt' #Original text data
output_raw = '2020_0622_raw.txt' #After processing part 1_Text data after blank processing

with open(filename_txt, mode='r', encoding='utf-8') as f:
    lines = f.readlines()

for i in range(len(lines)):
    with open(output_raw, mode='a', encoding='utf-8') as f:
         f.writelines(lines[i].replace("Player name","Player name").replace("m wave","")\
                      .replace("cm","").replace("Fixed approach","")\
                      .replace("〜","").replace("！","").replace("（","")\
                      .replace("）","").replace("・","").replace("−","")\
                      .replace("　","").replace("     "," ").replace("    "," ")\
                      .replace("   "," ").replace("  "," ").lstrip())

You might think this mountain of replace .. I would appreciate it if you could let me know if you have any better ideas or writing styles. In any case, by reading the original data line by line and applying the replace process, the text file becomes as shown in the figure below. You've managed to handle the extra whitespace (+ words).

Extracting data using regular expressions, etc.

As shown in the yellow frame of the first snap, the desired information is scattered in each text line. As the final process of creating a data frame, devise so that the desired information is described in all lines.

import re

output_txt= '2020_0622_txt.txt' #After processing part 2_Text data after regular expression processing

place_list = ['Fukuoka [Results]','Omura [Results]','Kiryu [Results]','Toda [Results]'\
              ,'Tama River [Results]','Lake Hamana [Results]','Gamagori [Results]'\
             ,'Tokoname [Results]','Tsu [Results]','Lake Biwa [Results]','Suminoe [Results]'\
              ,'Naruto [Results]','Marugame [Results]','Miyajima [Results]'\
             ,'Tokuyama [Results]','Shimonoseki [Results]','Wakamatsu [Results]','Mikuni [Results]'\
              ,'Heiwajima [Results]','Ashiya [Results]','Kojima [Results]',\
             'Edogawa [Results]','Amagasaki [Results]','Karatsu [Results]']

pattern_list = ['TOP 1R 2R 3R 4R 5R 6R 7R 8R 9R 10R 11R 12R']
pattern = 'TOP 1R 2R 3R 4R 5R 6R 7R 8R 9R 10R 11R 12R (\w+ \w+ \w+ \w+ \w+ \w+ \w+ \w+)'
repattern = re.compile(pattern)

pattern2_list = ['Arrive boat No. Player name Motor boat Exhibition Entry timing time']
pattern2 = 'Arrive boat No. Player name Motor boat Exhibition Entry timing time(\w+)' #' 
repattern2 = re.compile(pattern2)

pattern3_list = ['No.']

#Read the race result with blank processing line by line
with open(output_raw, mode='r', encoding='utf-8') as f:
    lines = f.readlines()

#Prepare empty text data
text = ''
text2 = ''
text3 = ''

#Check if there is a line that matches the pattern line by line.
for i in range(0,len(lines)):
    with open(output_txt, mode='a', encoding='utf-8') as f:
        if pattern_list[0] in lines[i]:
            #Use the search function to see if it applies
            match = re.search(repattern, lines[i])
            #Outputs the extracted character string
            try:
                text = match.group(1)
            except:
                print(str(lines[i]),i,"This guy has a little error")
        
        if pattern2_list[0] in lines[i]:
            #Use the search function to see if it applies
            match = re.search(repattern2, lines[i])
            #Outputs the extracted character string
            try:
                text2 = match.group(1)
            except:
                print(str(lines[i]),i,"This guy a little error 2")
                text2 = "blank"
            
        if pattern3_list[0] in lines[i]:
            #Outputs the extracted character string
            text3 = lines[i].replace("No.","").replace("/"," ")
            
        for j in place_list:
            if j in lines[i]:
                k = j.replace("[Results]","")
                f.writelines(lines[i])
                break
                
        f.writelines(lines[i].replace('. .','.').rstrip()+" "+k+" "\
                     +text+" "+text2+" "+text3+"\n")

I wonder if there is a better way .. but this will create the following text data. Can you see that the weather and date are included in the race information you want this time?

Finally, data frame! !!

When you get here, it's almost over. The information I want in this text data is only the range of the red frame written in the above figure. In other words, the line where the first character of the text line starts with "0" is taken in as a data frame.

df = pd.read_table(output_txt, names='A') #Column name"A"Create a data frame with
df_racer = df[df['A'].str.startswith('0')] #"0"Get only lines starting with
df_racer = df_racer['A'].str.split(' +', expand=True)\
    .rename(columns={0:'Position',1:'Lane',2:'Register',3:'Name',4:'Motor',5:'Boat'\
                     ,6:'Tenji',7:'Entry',8:'Start',9:'Time',10:'Place'\
                     ,11:'Round',12:'RaceType',13:'Course',14:'Wether'\
                     ,15:'Wind',16:'WindDir',17:'WindStr',18:'WaveHgt'\
                     ,19:'RaceResult',20:'RaceDay',21:'Year',22:'Month'\
                     ,23:'Day',24:'Location'})

df_racer = df_racer.reset_index()

Since each information is separated by a half-width space, the meaning is defined by split ('+') and rename processing. And the completed data frame looks like this. ** You did it! ** **

df_racer.head()

in conclusion

The process of converting this text data into a data frame was very difficult ... Is there a difference between a format that is easy to see and a format that is easy to process data? How many text data are you making? !! I also feel that I want to write smarter. Will it be updated someday ..?