This is my record of knock 24, "Extracting file references", from "Chapter 3: Regular expressions" (.ac.jp/nlp100/#ch3) of Language Processing 100 Knocks 2015. This time we learn **union (or)**, which is very simple and easy to understand.
Link | Remarks |
---|---|
024.Extracting file references.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:24 | The source I copied and pasted many parts of the code from |
Python regular expression basics and tips to learn from scratch | Where I organized what I learned in this knock |
Regular expression HOWTO | Official Python regular expression how-to |
re --- Regular expression operations | Official Python description of the re package |
Help:Simplified chart | Quick reference chart of typical Wikipedia markup |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running as a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
On top of the above environment, I use the following additional Python package. It can be installed with plain pip.
type | version |
---|---|
pandas | 0.25.3 |
By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.
Keywords: regular expressions, JSON, Wikipedia, InfoBox, web services
There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.

- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
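The format described above (one JSON object per line, gzipped as a whole) can be read directly with the standard library. This is only a minimal sketch: the helper `iter_articles` and the tiny stand-in file `sample.json.gz` are hypothetical names I made up so that the block runs without the real data, and the answer program in this article reads an already decompressed `jawiki-country.json` with pandas instead.

```python
import gzip
import json
import os
import tempfile


def iter_articles(path):
    """Yield one article dict per line of a gzipped JSON Lines file."""
    # mode='rt' decompresses and decodes to text in one step
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)


# Build a small stand-in file in the same format (hypothetical sample data)
sample = [
    {'title': 'A', 'text': 'body of A'},
    {'title': 'B', 'text': 'body of B'},
]
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(sample_path, mode='wt', encoding='utf-8') as f:
    for article in sample:
        f.write(json.dumps(article, ensure_ascii=False) + '\n')

titles = [article['title'] for article in iter_articles(sample_path)]
print(titles)  # ['A', 'B']
```

Reading line by line like this avoids loading the whole file at once, which may matter for larger dumps.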
Create a program that performs the following processing.
Extract all the media files referenced from the article.
According to Help:Simplified chart, a "file" reference has the following format: [[File:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]]
The goal is to extract the file name from parts like the following with a regular expression.
Excerpt of the "file" parts of the article:
|National emblem image= [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]\n
[[File:Battle of Waterloo 1815.PNG|thumb|left|Victory in the [[Battle of Waterloo]] put an end to the [[Napoleonic Wars]], and the era of [[Pax Britannica]] arrived.]]\n
[[File:The British Empire.png|thumb|250px|Countries and regions that were once governed by the [[British Empire]]. Current [[British overseas territory|British overseas territories]] are underlined in red.]]\n
```python
import re
from pprint import pprint

import pandas as pd


def extract_by_title(title):
    # Each line of the file is one JSON object, hence lines=True
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]


# Titles in jawiki-country.json are in Japanese; 'イギリス' is the United Kingdom article
wiki_body = extract_by_title('イギリス')

# The r prefix makes a raw string, so escape sequences are not interpreted
# Triple quotes let the pattern span multiple lines
# The re.VERBOSE option ignores whitespace and comments inside the pattern
# The non-greedy match stops at the shortest possible string
pprint(re.findall(r'''
    (?:File|ファイル)  # Non-capturing group: 'File' or 'ファイル'
    :                  # Not captured
    (.+?)              # Captured: one or more arbitrary characters, non-greedy
    \|                 # Not captured: an escaped '|'
    ''', wiki_body, re.VERBOSE))
```
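To see why the capture group is written as the non-greedy `(.+?)`, here is a minimal sketch comparing it with the greedy `(.+)` on a single line. The sample line is a shortened version of one of the excerpts (the caption `caption` is a placeholder I made up), and the prefix is fixed to `File:` to keep the comparison small.

```python
import re

# Shortened sample line; the caption text is a placeholder
line = '[[File:The British Empire.png|thumb|250px|caption]]'

# Greedy: .+ runs as far as it can and backs off only to the LAST '|'
greedy = re.findall(r'File:(.+)\|', line)

# Non-greedy: .+? stops at the FIRST '|', which ends the file name
non_greedy = re.findall(r'File:(.+?)\|', line)

print(greedy)      # ['The British Empire.png|thumb|250px']
print(non_greedy)  # ['The British Empire.png']
```

Since a file reference usually carries several `|`-separated options, the greedy version drags those options into the result, so the non-greedy form is the one that fits here.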
The main part this time is the following.

```python
pprint(re.findall(r'''
    (?:File|ファイル)  # Non-capturing group: 'File' or 'ファイル'
    :                  # Not captured
    (.+?)              # Captured: one or more arbitrary characters, non-greedy
    \|                 # Not captured: an escaped '|'
    ''', wiki_body, re.VERBOSE))
```
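The `|` in `(?:File|ファイル)` is the union (or) symbol. Here it means "either `File` or `ファイル`", because Japanese Wikipedia accepts both spellings of the file prefix. The `(?:...)` around it only groups the alternatives and captures nothing.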
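The effect of the union can be checked on a small self-contained string. This is a sketch under the assumption that the two alternatives are the English prefix `File` and the Japanese prefix `ファイル`; the three sample lines (including the `Image:` line, which is there only to show a non-match) are made up for illustration.

```python
import re

# Hypothetical sample: two accepted prefixes and one that is not in the union
sample = (
    '[[File:Uk topo en.jpg|thumb|200px|Topographic map]]\n'
    '[[ファイル:BenNevis2005.jpg|thumb|Ben Nevis]]\n'
    '[[Image:Not matched.png|thumb|Other prefix]]\n'
)

pattern = re.compile(r'''
    (?:File|ファイル)  # union: either prefix, grouped without capturing
    :
    (.+?)              # the file name, non-greedy
    \|
    ''', re.VERBOSE)

names = pattern.findall(sample)
print(names)  # ['Uk topo en.jpg', 'BenNevis2005.jpg']
```

Because the group is non-capturing, `findall` returns only the file names rather than tuples that also contain the matched prefix.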
When the program is executed, the following result is output.

```text
['Royal Coat of Arms of the United Kingdom.svg',
 'Battle of Waterloo 1815.PNG',
 'The British Empire.png',
 'Uk topo en.jpg',
 'BenNevis2005.jpg',
 'Elizabeth II greets NASA GSFC employees, May 8, 2007 edit.jpg',
 'Palace of Westminster, London - Feb 2007.jpg',
 'David Cameron and Barack Obama at the G20 Summit in Toronto.jpg',
 'Soldiers Trooping the Colour, 16th June 2007.jpg',
 'Scotland Parliament Holyrood.jpg',
 'London.bankofengland.arp.jpg',
 'City of London skyline from London City Hall - Oct 2008.jpg',
 'Oil platform in the North SeaPros.jpg',
 'Eurostar at St Pancras Jan 2008.jpg',
 'Heathrow T5.jpg',
 'Anglospeak.svg',
 'CHANDOS3.jpg',
 'The Fabs.JPG',
 'PalaceOfWestminsterAtNight.jpg',
 'Westminster Abbey - West Door.jpg',
 'Edinburgh Cockburn St dsc06789.jpg',
 'Canterbury Cathedral - Portal Nave Cross-spire.jpeg',
 'Kew Gardens Palm House, London - July 2009.jpg',
 '2005-06-27 - United Kingdom - England - London - Greenwich.jpg',
 'Stonehenge2007 07 30.jpg',
 'Yard2.jpg',
 'Durham Kathedrale Nahaufnahme.jpg',
 'Roman Baths in Bath Spa, England - July 2006.jpg',
 'Fountains Abbey view02 2005-08-27.jpg',
 'Blenheim Palace IMG 3673.JPG',
 'Liverpool Pier Head by night.jpg',
 "Hadrian's Wall view near Greenhead.jpg ",
 'London Tower (1).JPG',
 'Wembley Stadium, illuminated.jpg']
```
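One entry in the output, the Hadrian's Wall one, ends with a space, because the pattern captures everything up to the `|`. If that is unwanted, one possible follow-up (not part of the answer program, just a sketch) is to strip the whitespace afterwards. The list below reuses two names from the output as sample input.

```python
# Two names taken from the output above; the first has a trailing space
names = ["Hadrian's Wall view near Greenhead.jpg ", 'London Tower (1).JPG']

# str.strip() removes leading and trailing whitespace only
cleaned = [name.strip() for name in names]
print(cleaned)
```

Spaces inside a file name are left untouched, so names such as `London Tower (1).JPG` stay as they are.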