I wanted a way to clean up text when the characters embedded in a PDF come out strangely. The goal is shown below: when the same character is repeated in succession, collapse the run into a single character.
aa → a
Aiuueo → Aiueo
ABCABCABC → ABCABCABC
Yui Yui consent → Yui Yui consent
import re

# Assumes result already holds a string
result = re.sub(r"(.)\1+", r"\g<1>", result)  # Collapse runs of the same character
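As a quick check, the pattern can be run against a few sample strings. This is only a minimal sketch: the helper name `collapse_repeats` is made up for illustration and does not appear in the original code.

```python
import re

def collapse_repeats(text: str) -> str:
    """Collapse each run of one repeated character into a single character."""
    # (.) captures one character; \1+ matches one or more further copies of it.
    return re.sub(r"(.)\1+", r"\g<1>", text)

print(collapse_repeats("Aiuueo"))     # Aiueo
print(collapse_repeats("ABCABCABC"))  # ABCABCABC (a repeated group, not a repeated character)
print(collapse_repeats("!! !!"))      # ! !
```

Only runs of one identical character are affected, which is why a repeated group such as ABCABCABC passes through unchanged.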
Text formatting
import re
from unicodedata import normalize

def clean_text(txt: str) -> str:
    result = re.sub(r"\s", "", txt)  # Remove whitespace first to lighten later processing
    result = normalize("NFKC", result)  # Unicode normalization
    result = re.sub(r"(.)\1+", r"\g<1>", result)  # Collapse runs of the same character
    if ")(cid:" in result:  # PDF with embedded characters: discard text containing consecutive (cid:...) tokens
        return ""
    return result
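A usage sketch for the function above, restated so the block runs on its own. The sample inputs are invented for illustration. The `(cid:...)` strings assume that the PDF extractor emits that notation for characters it cannot map, which is what the check in the function suggests.

```python
import re
from unicodedata import normalize

def clean_text(txt: str) -> str:
    result = re.sub(r"\s", "", txt)               # drop all whitespace, including full-width spaces
    result = normalize("NFKC", result)            # fold full-width forms into their ASCII equivalents
    result = re.sub(r"(.)\1+", r"\g<1>", result)  # collapse runs of the same character
    if ")(cid:" in result:                        # consecutive (cid:...) tokens: treat as unreadable
        return ""
    return result

print(clean_text("ＡＢＣ　１２３"))            # ABC123
print(clean_text("Heeello   wooorld"))   # Heloworld
print(repr(clean_text("(cid:12)(cid:34)")))  # ''
```

Because whitespace is removed before the repeated-character step, identical characters on either side of a space are merged as well.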
Louise
import re
text = "Louise! Louise! Louise! Ruizuuuuuuuuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa !! !!\n\
Ah ah ah ... ah ... ah! Ah ah ah ah! !! !! Louise Louise Louise Wow Wow Ah Ah! !! !!\n\
Ah Kunka Kunka! Kunka Kunka! Suha Suha! Suha Suha! It smells good ... Kun\n\
Hmm! I want to squeeze the pink blonde hair of Louise Francoise-tan! Kunka Kunka! Aa! !!\n\
mistook! I want to be fluffy! Mofumofu! Mofumofu! Hair Mofumofu! Crispy Mofumofu ... Kyun Kyun Kyu! !!\n\
The 12th volume of the novel, Louise, was cute! !! Ah ah ... ah ... ah ah ah! !! Fahhhhh! !!\n\
I'm glad that the second season of the anime was broadcast, Ruiz-tan! Oh Oh Oh Oh! cute! Louise! cute! A-aa ~ aa!"
print(re.sub(r"(.)\1+", r"\g<1>", text))
#Louise! Louise! Louise! Ruizua ! !
#Ah ah ah . ah . ah! Ah ah ah ah! ! ! Louise Louise Louise Wow Wow Ah Ah! ! !
#Ah Kunka Kunka! Kunka Kunka! Suha Suha! Suha Suha! It smels god . Kun
#Hm! I want to squeze the pink blonde hair of Louise Francoise-tan! Kunka Kunka! Aa! !
#mistok! I want to be flufy! Mofumofu! Mofumofu! Hair Mofumofu! Crispy Mofumofu . Kyun Kyun Kyu! !
#The 12th volume of the novel, Louise, was cute! ! Ah ah . ah . ah ah ah! ! Fah! !
#I'm glad that the second season of the anime was broadcast, Ruiz-tan! Oh Oh Oh Oh! cute! Louise! cute! A-a ~ a!
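One side effect worth keeping in mind: the pattern also collapses legitimate double letters, so on English text "good" becomes "god". A possible variation, which is not part of the original code, is to collapse only runs of three or more identical characters. The function name `collapse_long_runs` and the `min_run` parameter below are hypothetical.

```python
import re

def collapse_long_runs(text: str, min_run: int = 3) -> str:
    """Collapse runs of at least min_run identical characters into one character."""
    # After the captured character, require at least (min_run - 1) further copies.
    pattern = r"(.)\1{%d,}" % (min_run - 1)
    return re.sub(pattern, r"\g<1>", text)

sample = "It smells good ... Fahhhhh!"
print(collapse_long_runs(sample))             # It smells good . Fah!
print(collapse_long_runs(sample, min_run=2))  # It smels god . Fah!
```

With `min_run=2` the behavior matches the original pattern. Whether a higher threshold is appropriate depends on how the garbled PDF text actually repeats characters.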
Reverse replacement. I looked at various references, but this one seemed to cover everything I needed.
Grouping when using regular expressions in Python. It took me a while to realize that in Python the replacement has to be written \g<1> rather than $1.
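To make the note about \g<1> concrete: in the replacement string of Python's `re.sub`, both `\1` and `\g<1>` refer to group 1, while `$1` is inserted literally. The `\g<1>` form matters when a digit follows the reference, because `\10` would be read as group 10. A small sketch with a made-up sample string:

```python
import re

text = "aabbcc"
pattern = r"(.)\1+"

print(re.sub(pattern, r"\1", text))      # abc
print(re.sub(pattern, r"\g<1>", text))   # abc
print(re.sub(pattern, "$1", text))       # $1$1$1  ($1 has no special meaning in Python)
print(re.sub(pattern, r"\g<1>0", text))  # a0b0c0  (group 1 followed by a literal 0)

# r"\10" is interpreted as a reference to group 10, which this pattern does not have.
try:
    re.sub(pattern, r"\10", text)
except re.error as exc:
    print("error:", exc)
```

Writing the replacement as a raw string (`r"..."`) also keeps Python from treating `\g` as an invalid string escape.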