[PYTHON] 100 Language Processing Knock-24: Extract File Reference

Language processing 100 knocks 2015 "Chapter 3: Regular expressions" .ac.jp/nlp100/#ch3) 24th "Extracting File Reference" record. This time we will learn ** union (or) **. "Union (or)" is very simple and easy to understand.

Reference link

Link Remarks
024.Extracting file references.ipynb Answer program GitHub link
100 amateur language processing knocks:24 Copy and paste source of many source parts
Python regular expression basics and tips to learn from scratch I organized what I learned in this knock
Regular expression HOWTO Python Official Regular Expression How To
re ---Regular expression operation Python official re package description
Help:Simplified chart Wikipediaの代表的なマークアップのSimplified chart

environment

type version Contents
OS Ubuntu18.04.01 LTS It is running virtually
pyenv 1.2.15 I use pyenv because I sometimes use multiple Python environments
Python 3.6.9 python3 on pyenv.6.I'm using 9
3.7 or 3.There is no deep reason not to use 8 series
Packages are managed using venv

In the above environment, I am using the following additional Python packages. Just install with regular pip.

type version
pandas 0.25.3

Chapter 3: Regular Expressions

content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

File jawiki-country.json.gz that exports Wikipedia articles in the following format There is.

--One article information per line is stored in JSON format --In each line, the article name is stored in the "title" key and the article body is stored in the dictionary object with the "text" key, and that object is written out in JSON format. --The entire file is gzipped

Create a program that performs the following processing.

24. Extracting file references

Extract all the media files referenced from the article.

Problem supplement (about "file")

Help:Simplified chartAccording to the "file" [[File:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]]The format. Extract the file name of the following part with a regular expression.

Excerpt from the "file" part of the file


|National emblem image= [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]\n
[[File:Battle of Waterloo 1815.PNG|thumb|left|[[Battle of Waterloo]]By victory in[[Napoleonic Wars]]Is put to an end,[[Pax Britannica]]The era has arrived.]]\n
[[File:The British Empire.png|thumb|250px|[[British Empire]]Countries / regions with experience under governance. Current[[British overseas territory]]Is underlined in red.]]\n

Answer

Answer program [024. Extracting file references.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/03.%E6%AD%A3%E8%A6%8F%E8%A1%A8% E7% 8F% BE / 024.% E3% 83% 95% E3% 82% A1% E3% 82% A4% E3% 83% AB% E5% 8F% 82% E7% 85% A7% E3% 81% AE % E6% 8A% BD% E5% 87% BA.ipynb)

import re
from pprint import pprint

import pandas as pd

def extract_by_title(title):
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]

wiki_body = extract_by_title('England')


#Ignore escape sequences in raw string with r at the beginning
#Ignore line breaks in the middle with triple quotes
# re.Ignore whitespace and comments by using the VERBOSE option
#Search for short strings by making it a non-greedy match
pprint(re.findall(r'''
                  (?:File|File)   #Non-capture,'File'Or'File'
                  :                  #Non-capture
                  (.+?)              #Capture target, one or more arbitrary characters, non-greedy
                  \|                 #Non-capture,|Escape
                  ''', wiki_body, re.VERBOSE))

Answer commentary

The main part of this time is the following part.

python


pprint(re.findall(r'''
                  (?:File|File)   #Non-capture,'File'Or'File'
                  :                  #Non-capture
                  (.+?)              #Capture target, one or more arbitrary characters, non-greedy
                  \|                 #Non-capture,|Escape
                  ''', wiki_body, re.VERBOSE))

Union (or)

(?:File|File)of|Is the or symbol. This time it means "if it was File or File".

Output result (execution result)

When the program is executed, the following results will be output.

Output result


['Royal Coat of Arms of the United Kingdom.svg',
 'Battle of Waterloo 1815.PNG',
 'The British Empire.png',
 'Uk topo en.jpg',
 'BenNevis2005.jpg',
 'Elizabeth II greets NASA GSFC employees, May 8, 2007 edit.jpg',
 'Palace of Westminster, London - Feb 2007.jpg',
 'David Cameron and Barack Obama at the G20 Summit in Toronto.jpg',
 'Soldiers Trooping the Colour, 16th June 2007.jpg',
 'Scotland Parliament Holyrood.jpg',
 'London.bankofengland.arp.jpg',
 'City of London skyline from London City Hall - Oct 2008.jpg',
 'Oil platform in the North SeaPros.jpg',
 'Eurostar at St Pancras Jan 2008.jpg',
 'Heathrow T5.jpg',
 'Anglospeak.svg',
 'CHANDOS3.jpg',
 'The Fabs.JPG',
 'PalaceOfWestminsterAtNight.jpg',
 'Westminster Abbey - West Door.jpg',
 'Edinburgh Cockburn St dsc06789.jpg',
 'Canterbury Cathedral - Portal Nave Cross-spire.jpeg',
 'Kew Gardens Palm House, London - July 2009.jpg',
 '2005-06-27 - United Kingdom - England - London - Greenwich.jpg',
 'Stonehenge2007 07 30.jpg',
 'Yard2.jpg',
 'Durham Kathedrale Nahaufnahme.jpg',
 'Roman Baths in Bath Spa, England - July 2006.jpg',
 'Fountains Abbey view02 2005-08-27.jpg',
 'Blenheim Palace IMG 3673.JPG',
 'Liverpool Pier Head by night.jpg',
 "Hadrian's Wall view near Greenhead.jpg ",
 'London Tower (1).JPG',
 'Wembley Stadium, illuminated.jpg']

Recommended Posts

100 Language Processing Knock-24: Extract File Reference
100 Language Processing Knock (2020): 28
100 Language Processing Knock (2020): 38
100 language processing knock 00 ~ 02
100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 Language Processing Knock 2020 Chapter 1
100 Amateur Language Processing Knock: 17
100 language processing knock 2020 [00 ~ 49 answer]
100 Language Processing Knock-52: Stemming
100 Language Processing Knock Chapter 1
100 Amateur Language Processing Knock: 07
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization
100 Amateur Language Processing Knock: 97
100 language processing knock 2020 [00 ~ 59 answer]
100 Amateur Language Processing Knock: 67
100 Language Processing with Python Knock 2015
100 Language Processing Knock-51: Word Clipping
100 Language Processing Knock-58: Tuple Extraction
100 Language Processing Knock-57: Dependency Analysis
100 language processing knock-50: sentence break
100 Language Processing Knock Chapter 1 (Python)
100 Language Processing Knock Chapter 2 (Python)
100 Language Processing Knock-25: Template Extraction
I tried 100 language processing knock 2020
100 language processing knock-56: co-reference analysis
Solving 100 Language Processing Knock 2020 (01. "Patatokukashi")
100 Amateur Language Processing Knock: Summary
100 Language Processing Knock 2020 Chapter 2: UNIX Commands
100 Language Processing Knock 2015 Chapter 5 Dependency Analysis (40-49)
100 Language Processing Knock with Python (Chapter 1)
100 Language Processing Knock Chapter 1 in Python
100 Language Processing Knock 2020 Chapter 4: Morphological Analysis
100 Language Processing Knock 2020 Chapter 9: RNN, CNN
100 language processing knock-76 (using scikit-learn): labeling
100 language processing knock-55: named entity extraction
I tried 100 language processing knock 2020: Chapter 3
100 Language Processing Knock-82 (Context Word): Context Extraction
100 Language Processing Knock with Python (Chapter 3)
100 Language Processing Knock: Chapter 1 Preparatory Movement
100 Language Processing Knock 2020 Chapter 6: Machine Learning
Language processing 100 knock-86: Word vector display
100 Language Processing Knock 2020 Chapter 10: Machine Translation (90-98)
100 Language Processing Knock 2020 Chapter 5: Dependency Analysis
100 Language Processing Knock-28: MediaWiki Markup Removal
100 Language Processing Knock 2020 Chapter 7: Word Vector
100 Language Processing Knock 2020 Chapter 8: Neural Net
100 Language Processing Knock-59: Analysis of S-expressions
Python beginner tried 100 language processing knock 2015 (05 ~ 09)
100 language processing knock 2020 "for Google Colaboratory"
I tried 100 language processing knock 2020: Chapter 1
100 Language Processing Knock 2020 Chapter 1: Preparatory Movement
100 language processing knock-73 (using scikit-learn): learning
100 Language Processing Knock 2020 Chapter 3: Regular Expressions
100 Language Processing Knock 2015 Chapter 4 Morphological Analysis (30-39)
100 language processing knock-74 (using scikit-learn): Prediction
I tried 100 language processing knock 2020: Chapter 2