This is my record of problem 24, "Extracting file references", from "Chapter 3: Regular expressions" (.ac.jp/nlp100/#ch3) of Language Processing 100 Knocks 2015. This time we learn **union (or)**, which is very simple and easy to understand.

Reference link

Link	Remarks
024.Extracting file references.ipynb	GitHub link to the answer program
100 amateur language processing knocks:24	Source from which I copied and pasted much of the code
Python regular expression basics and tips to learn from scratch	An article where I organized what I learned from these knocks
Regular expression HOWTO	The official Python regular expression HOWTO
re --- Regular expression operations	Official Python documentation for the re package
Help:Simplified chart	Wikipedia's quick reference chart of typical markup

Environment

type	version	Contents
OS	Ubuntu 18.04.01 LTS	Running as a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv

In addition to the above environment, I use the following Python package. It can be installed with regular pip.

type	version
pandas	0.25.3

Chapter 3: Regular Expressions

Content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

There is a file, jawiki-country.json.gz, that exports Wikipedia articles in the following format.

- One article is stored per line in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
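The format described above (gzip-compressed, one JSON object per line) can also be read with the standard library alone. The following is a minimal sketch rather than the approach of the answer program: the helper name `iter_articles` and the sample data are hypothetical, and a tiny stand-in file is written first so that the snippet runs without the real dump.

```python
import gzip
import json
import os
import tempfile


def iter_articles(path):
    """Yield one dict per line from a gzipped JSON Lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Build a tiny stand-in file (hypothetical data, not the real dump)
sample = [
    {'title': 'A', 'text': '[[File:a.png|thumb|caption]]'},
    {'title': 'B', 'text': 'no files here'},
]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.json.gz')
with gzip.open(sample_path, 'wt', encoding='utf-8') as f:
    for article in sample:
        f.write(json.dumps(article, ensure_ascii=False) + '\n')

# Each line comes back as a dict with "title" and "text" keys
titles = [article['title'] for article in iter_articles(sample_path)]
print(titles)
```

Reading line by line like this avoids loading every article into memory when only one of them is needed.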

Create a program that performs the following processing.

24. Extracting file references

Extract all the media files referenced from the article.

Problem supplement (about "file")

According to Help:Simplified chart, a "file" is written in the format [[File:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]]. The task is to extract the file name part of markup like the following with a regular expression.

`Excerpt from the "file" part of the file`


|National emblem image= [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]\n
[[File:Battle of Waterloo 1815.PNG|thumb|left|Victory in the [[Battle of Waterloo]] brought the [[Napoleonic Wars]] to an end, and the era of [[Pax Britannica]] arrived.]]\n
[[File:The British Empire.png|thumb|250px|Countries and regions that were once governed by the [[British Empire]]. The current [[British overseas territory]] areas are underlined in red.]]\n

Answer

Answer program: [024. Extracting file references.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/03.%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%8F%BE/024.%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%8F%82%E7%85%A7%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import re
from pprint import pprint

import pandas as pd

def extract_by_title(title):
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]

# Titles in the Japanese dump are in Japanese; 'イギリス' is the United Kingdom article
wiki_body = extract_by_title('イギリス')


# The r prefix makes this a raw string, so backslash escapes are passed to re unchanged
# Triple quotes let the pattern span multiple lines
# re.VERBOSE makes re ignore whitespace and comments inside the pattern
# The non-greedy match keeps each captured string as short as possible
pprint(re.findall(r'''
                  (?:File|ファイル)   # Non-capturing group: 'File' or 'ファイル'
                  :                  # Literal colon (not captured)
                  (.+?)              # Capture target: one or more arbitrary characters, non-greedy
                  \|                 # Literal | (escaped, not captured)
                  ''', wiki_body, re.VERBOSE))

Answer commentary

The main part this time is the following.

pprint(re.findall(r'''
                  (?:File|ファイル)   # Non-capturing group: 'File' or 'ファイル'
                  :                  # Literal colon (not captured)
                  (.+?)              # Capture target: one or more arbitrary characters, non-greedy
                  \|                 # Literal | (escaped, not captured)
                  ''', wiki_body, re.VERBOSE))
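To see why the non-greedy `.+?` matters, and that `re.VERBOSE` changes only how the pattern is written, here is a small sketch. The sample line is made up for illustration (it is not taken from the article), and the simplified patterns use only the `File` prefix.

```python
import re

# Hypothetical sample line in the same shape as the article markup
line = '[[File:The British Empire.png|thumb|250px|caption]]'

# Greedy: .+ runs as far as it can, so the match ends at the LAST '|'
greedy = re.findall(r'File:(.+)\|', line)

# Non-greedy: .+? stops at the FIRST '|', leaving only the file name
lazy = re.findall(r'File:(.+?)\|', line)

# The same non-greedy pattern written with re.VERBOSE;
# whitespace and comments inside the pattern are ignored
verbose = re.findall(r'''
    File    # literal prefix
    :       # literal colon
    (.+?)   # file name, as short as possible
    \|      # literal pipe
    ''', line, re.VERBOSE)

print(greedy)   # ['The British Empire.png|thumb|250px']
print(lazy)     # ['The British Empire.png']
print(verbose)  # ['The British Empire.png']
```

The greedy version swallows the display options as well, which is why the answer uses `.+?`.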

Union (or)

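The `|` inside `(?:File|ファイル)` is the or symbol. Here it means "match either File or ファイル". ファイル is the Japanese spelling of the prefix, and Japanese Wikipedia markup accepts both, so the pattern has to cover both. Wrapping the alternatives in the non-capturing group `(?:...)` limits the or to the prefix and keeps the prefix out of the results.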
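The effect of the grouping can be checked with a short experiment. This is only a sketch with a made-up sample string; it compares the grouped pattern with an ungrouped one and with an ordinary capturing group.

```python
import re

# Hypothetical sample containing both prefixes
text = '[[File:a.png|thumb]] [[ファイル:b.jpg|100px]]'

# Grouped: the or applies only to the prefix
grouped = re.findall(r'(?:File|ファイル):(.+?)\|', text)

# Ungrouped: | splits the WHOLE pattern into 'File' or 'ファイル:(.+?)\|',
# so a bare 'File' matches with an empty capture
ungrouped = re.findall(r'File|ファイル:(.+?)\|', text)

# Capturing group instead of (?:...): findall returns tuples
# that include the prefix as well
capturing = re.findall(r'(File|ファイル):(.+?)\|', text)

print(grouped)    # ['a.png', 'b.jpg']
print(ungrouped)  # ['', 'b.jpg']
print(capturing)  # [('File', 'a.png'), ('ファイル', 'b.jpg')]
```

Only the non-capturing form returns a flat list of file names, which is what this knock needs.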

Output result (execution result)

When the program is executed, the following results will be output.

`Output result`


['Royal Coat of Arms of the United Kingdom.svg',
 'Battle of Waterloo 1815.PNG',
 'The British Empire.png',
 'Uk topo en.jpg',
 'BenNevis2005.jpg',
 'Elizabeth II greets NASA GSFC employees, May 8, 2007 edit.jpg',
 'Palace of Westminster, London - Feb 2007.jpg',
 'David Cameron and Barack Obama at the G20 Summit in Toronto.jpg',
 'Soldiers Trooping the Colour, 16th June 2007.jpg',
 'Scotland Parliament Holyrood.jpg',
 'London.bankofengland.arp.jpg',
 'City of London skyline from London City Hall - Oct 2008.jpg',
 'Oil platform in the North SeaPros.jpg',
 'Eurostar at St Pancras Jan 2008.jpg',
 'Heathrow T5.jpg',
 'Anglospeak.svg',
 'CHANDOS3.jpg',
 'The Fabs.JPG',
 'PalaceOfWestminsterAtNight.jpg',
 'Westminster Abbey - West Door.jpg',
 'Edinburgh Cockburn St dsc06789.jpg',
 'Canterbury Cathedral - Portal Nave Cross-spire.jpeg',
 'Kew Gardens Palm House, London - July 2009.jpg',
 '2005-06-27 - United Kingdom - England - London - Greenwich.jpg',
 'Stonehenge2007 07 30.jpg',
 'Yard2.jpg',
 'Durham Kathedrale Nahaufnahme.jpg',
 'Roman Baths in Bath Spa, England - July 2006.jpg',
 'Fountains Abbey view02 2005-08-27.jpg',
 'Blenheim Palace IMG 3673.JPG',
 'Liverpool Pier Head by night.jpg',
 "Hadrian's Wall view near Greenhead.jpg ",
 'London Tower (1).JPG',
 'Wembley Stadium, illuminated.jpg']
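One caveat: the pattern captures everything up to the next `|`, so whitespace before the pipe ends up in the result (the Hadrian's Wall entry above appears to show this), and a reference written without any options would run on until some later pipe on the same line. The following is a possible refinement, not part of the answer program; the sample text is hypothetical and the character class is my own assumption about where a file name should end.

```python
import re

# Hypothetical sample, not taken from the real article
sample = (
    "[[File:with options.jpg |thumb|caption]]\n"
    "[[ファイル:no options.png]] and a table cell | here\n"
)

# Pattern from the answer: needs a '|' somewhere after the file name
original = re.findall(r'(?:File|ファイル):(.+?)\|', sample)

# Variant: stop at '|', ']' or a newline, whichever comes first,
# then strip surrounding whitespace
variant = [
    name.strip()
    for name in re.findall(r'(?:File|ファイル):([^|\]\n]+)', sample)
]

print(original)  # ['with options.jpg ', 'no options.png]] and a table cell ']
print(variant)   # ['with options.jpg', 'no options.png']
```

Whether this matters depends on the article. For the markup shown in this knock, every reference carries at least one option, so the simpler pattern is enough.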
