[PYTHON] 100 amateur language processing knocks: 28

This is a record of my attempt at the Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). [Click here for a list of past knocks](http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

- One article is stored per line in JSON format.
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format.
- The entire file is compressed with gzip.

Create a program that performs the following processing.

28. Removal of MediaWiki markup

In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

The finished code:

`main.py`


# coding: utf-8
import gzip
import json
import re
fname = 'jawiki-country.json.gz'


def extract_UK():
	'''Get the body of the article about the UK

	Return value:
	text of the UK article
	'''

	with gzip.open(fname, 'rt', encoding='utf-8') as data_file:
		for line in data_file:
			data_json = json.loads(line)
			if data_json['title'] == 'イギリス':	# jawiki title of the UK article
				return data_json['text']

	raise ValueError("Could not find the UK article")


def remove_markup(target):
	'''Markup removal
	Remove MediaWiki markup as much as possible

	Argument:
	target -- target string
	Return value:
	string with markup removed
	'''

	# Remove emphasis markup (italic, bold, bold italic)
	pattern = re.compile(r'''
		(\'{2,5})	# 2 to 5 apostrophes (start of markup)
		(.*?)		# zero or more of any character (the target string), non-greedy
		(\1)		# same as the first capture (end of markup)
		''', re.MULTILINE + re.VERBOSE)
	target = pattern.sub(r'\2', target)

	# Remove internal links and files
	pattern = re.compile(r'''
		\[\[		# '[[' (start of markup)
		(?:			# start of non-capturing group
			[^|]*?	# zero or more characters other than '|', non-greedy
			\|		# '|'
		)*?			# end of group; it repeats zero or more times, non-greedy
		([^|]*?)	# capture: zero or more characters other than '|', non-greedy (the display text)
		\]\]		# ']]' (end of markup)
		''', re.MULTILINE + re.VERBOSE)
	target = pattern.sub(r'\1', target)

	# Remove Template:Lang, written as {{lang|language tag|string}}
	pattern = re.compile(r'''
		\{\{lang	# '{{lang' (start of markup)
		(?:			# start of non-capturing group
			[^|]*?	# zero or more characters other than '|', non-greedy
			\|		# '|'
		)*?			# end of group; it repeats zero or more times, non-greedy
		([^|]*?)	# capture: zero or more characters other than '|', non-greedy (the string to keep)
		\}\}		# '}}' (end of markup)
		''', re.MULTILINE + re.VERBOSE)
	target = pattern.sub(r'\1', target)

	# Remove external links: [http://xxxx] and [http://xxx xxx]
	pattern = re.compile(r'''
		\[http:\/\/	# '[http://' (start of markup)
		(?:			# start of non-capturing group
			[^\s]*?	# zero or more non-whitespace characters, non-greedy
			\s		# whitespace
		)?			# end of group; it appears zero or one time
		([^]]*?)	# capture: zero or more characters other than ']', non-greedy (the display text)
		\]			# ']' (end of markup)
		''', re.MULTILINE + re.VERBOSE)
	target = pattern.sub(r'\1', target)

	# Remove <br> and <ref> tags
	pattern = re.compile(r'''
		<			# '<' (start of markup)
		\/?			# optional '/' (present in a closing tag)
		(?:br|ref)	# 'br' or 'ref'
		[^>]*?		# zero or more characters other than '>', non-greedy
		>			# '>' (end of markup)
		''', re.MULTILINE + re.VERBOSE)
	target = pattern.sub('', target)

	return target


# Compile the pattern that extracts the basic information template
pattern = re.compile(r'''
	^\{\{基礎情報.*?$	# line starting with '{{基礎情報' (the basic information template)
	(.*?)		# capture: zero or more of any character, non-greedy
	^\}\}$		# a line consisting of '}}'
	''', re.MULTILINE + re.VERBOSE + re.DOTALL)

# Extract the basic information template
contents = pattern.findall(extract_UK())

# Compile the pattern that extracts field names and values from the template
pattern = re.compile(r'''
	^\|			# line starting with '|'
	(.+?)		# capture (field name): one or more of any character, non-greedy
	\s*			# zero or more whitespace characters
	=
	\s*			# zero or more whitespace characters
	(.+?)		# capture (value): one or more of any character, non-greedy
	(?:			# start of non-capturing group
		(?=\n\|) 	# followed by a newline and '|' (positive lookahead)
		| (?=\n$)	# or followed by a newline and the end (positive lookahead)
	)			# end of group
	''', re.MULTILINE + re.VERBOSE + re.DOTALL)

# Extract the field names and values
fields = pattern.findall(contents[0])

# Store them in a dictionary
result = {}
keys_test = []		# field names in order of appearance, for checking
for field in fields:
	result[field[0]] = remove_markup(field[1])
	keys_test.append(field[0])

# Print for checking (sorted by order of field-name appearance, using keys_test)
for item in sorted(result.items(),
		key=lambda field: keys_test.index(field[0])):
	print(item)

Execution result:

`Terminal`


('Abbreviated name', 'England')
('Japanese country name', 'United Kingdom of Great Britain and Northern Ireland')
('Official country name', 'United Kingdom of Great Britain and Northern Ireland Official country name in non-English:\n*An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath (Scottish Gaelic)\n*Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon (Welsh)\n*Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann (Irish)\n*An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh (Cornish)\n*Unitit Kinrick o Great Breetain an Northren Ireland (Scots)\n**Claught Kängrick o Docht Brätain an Norlin Airlann, Unitet Kängdom o Great Brittain an Norlin Airlann (Ulster Scots)')
('National flag image', 'Flag of the United Kingdom.svg')
('National emblem image', 'British coat of arms')
('National emblem link', '(National emblem)')
('Motto', 'Dieu et mon droit (French:God and my rights)')
('National anthem', 'God Save the Queen')
('Position image', 'Location_UK_EU_Europe_001.svg')
('Official terminology', 'English (virtually)')
('capital', 'London')
('Largest city', 'London')
('Head of state title', 'Queen')
('Name of head of state', 'Elizabeth II')
("Prime Minister's title", 'Prime Minister')
("Prime Minister's name", 'David Cameron')
('Area ranking', '76')
('Area size', '1 E11')
('Area value', '244,820')
('Water area ratio', '1.3%')
('Demographic year', '2011')
('Population ranking', '22')
('Population size', '1 E7')
('Population value', '63,181,775United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population')
('Population density value', '246')
('GDP statistics year yuan', '2012')
('GDP value source', '1,547.8 billion IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom')
('GDP Statistics Year MER', '2012')
('GDP ranking MER', '5')
('GDP value MER', '2,433.7 billion')
('GDP statistical year', '2012')
('GDP ranking', '6')
('GDP value', '2,316.2 billion')
('GDP/Man', '36,727')
('Founding form', 'Founding of the country')
('Established form 1', 'Kingdom of England / Kingdom of Scotland (both until the Act of Union 1707)')
('Date of establishment 1', '927/843')
('Established form 2', 'Founding of the Kingdom of Great Britain (Acts of Union 1707)')
('Date of establishment 2', '1707')
('Established form 3', 'United Kingdom of Great Britain and Ireland founded (Act of Union 1800)')
('Date of establishment 3', '1801')
('Established form 4', 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"')
('Date of establishment 4', '1927')
('currency', 'UK pounds(&pound;)')
('Currency code', 'GBP')
('Time zone', '±0')
('Daylight saving time', '+1')
('ISO 3166-1', 'GB / GBR')
('ccTLD', '.uk / .gb Use of .gb is overwhelmingly rare compared to .uk.')
('International call number', '44')
('Note', '')

File removal

I modified remove_markup() from the [previous question](http://qiita.com/segavvy/items/9a8137f045852bc299d6). In the previous question, internal links were removed only when zero or one | appeared between [[ and ]], so that files would not be caught. This time the pattern also accepts two or more |, so files are removed together with internal links.
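
As a rough illustration of this behavior, the sketch below applies the same internal-link pattern to three hand-written samples. The names `internal_link`, `samples` and `results`, and the sample strings themselves, are made up for this example and do not come from the corpus.

```python
import re

# Same shape as the internal-link pattern in remove_markup():
# skip every '|'-terminated part and keep only the last one.
internal_link = re.compile(r'''
    \[\[            # opening '[['
    (?:             # non-capturing group
        [^|]*?      # zero or more characters other than '|'
        \|          # a '|' separator
    )*?             # repeated zero or more times
    ([^|]*?)        # the last part (the display text)
    \]\]            # closing ']]'
    ''', re.VERBOSE)

# Hypothetical samples: a plain link, a piped link, and a file with options
samples = [
    '[[London]]',
    '[[United Kingdom|UK]]',
    '[[File:Example.jpg|thumb|A caption]]',
]
results = [internal_link.sub(r'\1', s) for s in samples]
print(results)  # ['London', 'UK', 'A caption']
```

Because any number of `|`-separated parts is skipped, a file reference is reduced to its last part, in the same way as a piped link.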

Removal of `Template:Lang`

According to Template:Lang, the format appears to be `{{lang|language tag|string}}`, so I keep only the string part.
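
A minimal sketch of that substitution, using the motto value as a hand-written sample. The names `lang_template`, `text` and `cleaned` are illustrative only, and the input string is an assumption about what the raw value looks like.

```python
import re

# Same shape as the Template:Lang pattern in remove_markup()
lang_template = re.compile(r'''
    \{\{lang        # opening '{{lang'
    (?:             # non-capturing group
        [^|]*?      # zero or more characters other than '|'
        \|          # a '|' separator
    )*?             # repeated zero or more times
    ([^|]*?)        # the string to keep
    \}\}            # closing '}}'
    ''', re.VERBOSE)

# Hypothetical input in the {{lang|language tag|string}} form
text = '{{lang|fr|Dieu et mon droit}} (motto)'
cleaned = lang_template.sub(r'\1', text)
print(cleaned)  # Dieu et mon droit (motto)
```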

Removal of external links

According to the external link entry in the [markup quick reference table](https://ja.wikipedia.org/wiki/Help:早見表), the format is `[http://www.example.org display text]`. The display text apparently needs to be kept, so the pattern keeps it.
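
The sketch below tries the same external-link pattern on two hand-written samples, one with display text and one without. The names `external_link`, `samples` and `results` are illustrative, and the URLs are placeholders.

```python
import re

# Same shape as the external-link pattern in remove_markup()
external_link = re.compile(r'''
    \[http://       # opening '[http://'
    (?:             # non-capturing group
        [^\s]*?     # the rest of the URL (non-whitespace characters)
        \s          # whitespace between the URL and the display text
    )?              # the URL-plus-whitespace part is optional
    ([^]]*?)        # display text, up to the closing ']'
    \]              # closing ']'
    ''', re.VERBOSE)

samples = [
    '[http://www.example.org Example site]',  # with display text
    '[http://www.example.org]',               # URL only
]
results = [external_link.sub(r'\1', s) for s in samples]
print(results)  # ['Example site', 'www.example.org']
```

With display text, only that text survives. Without it, the URL minus its `http://` prefix is left behind. As written, the pattern matches only links starting with `http://`, so `https://` links would pass through unchanged.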

Removal of other markup

The problem says to remove MediaWiki markup as much as possible. Looking at the results after the removals so far, only `<br>` and `<ref>` seemed to remain, so I remove these two as well.
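
One detail worth checking when writing the tag pattern is the difference between alternation and a character class. The sketch below compares the two on a made-up string. The names `tag`, `loose`, `text`, `cleaned` and `loose_cleaned` are illustrative only.

```python
import re

# Alternation: the tag name must be 'br' or 'ref'
tag = re.compile(r'''
    <
    /?              # optional '/' for a closing tag
    (?:br|ref)      # 'br' or 'ref'
    [^>]*?          # anything up to the closing '>'
    >
    ''', re.VERBOSE)

# Character class: matches ONE character out of b, r, |, e, f,
# so it also matches unrelated tags such as <b>
loose = re.compile(r'</?[br|ref][^>]*?>')

text = 'London<ref name="a">source</ref><br />Edinburgh <b>bold</b>'
cleaned = tag.sub('', text)
loose_cleaned = loose.sub('', text)
print(cleaned)        # LondonsourceEdinburgh <b>bold</b>
print(loose_cleaned)  # LondonsourceEdinburgh bold
```

In both cases only the tags are removed. The text between `<ref>` and `</ref>` stays, which is why reference text still appears in some of the values in the execution result.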

That's all for the 29th knock. If you find any mistakes, I would appreciate it if you could point them out.

The execution result includes part of the data distributed as the corpus data used in the 100 knocks. The data used in this Chapter 3 is licensed under [Creative Commons Attribution-ShareAlike 3.0 Unported](https://creativecommons.org/licenses/by-sa/3.0/deed.ja).