[PYTHON] 100 amateur language processing knocks: 61

This is a record of my attempt at the Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is available here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 7: Database

artist.json.gz is a file from the open music database MusicBrainz, converted to JSON format and compressed with gzip. Each line of the file holds the information for one artist in JSON format. The JSON format is outlined below.

| Field | Description | Type | Example |
|---|---|---|---|
| id | Unique identifier | Integer | 20660 |
| gid | Global identifier | String | "ecf9f3a3-35e9-4c58-acaa-e707fba45060" |
| name | Artist name | String | "Oasis" |
| sort_name | Artist name (for dictionary-order sorting) | String | "Oasis" |
| area | Place of activity | String | "United Kingdom" |
| aliases | Aliases | List of dictionary objects | |
| aliases[].name | Alias | String | "oasis" |
| aliases[].sort_name | Alias (for sorting) | String | "oasis" |
| begin | Activity start date | Dictionary | |
| begin.year | Activity start year | Integer | 1991 |
| begin.month | Activity start month | Integer | |
| begin.date | Activity start day | Integer | |
| end | Activity end date | Dictionary | |
| end.year | Activity end year | Integer | 2009 |
| end.month | Activity end month | Integer | 8 |
| end.date | Activity end day | Integer | 28 |
| tags | Tags | List of dictionary objects | |
| tags[].count | Number of times tagged | Integer | 1 |
| tags[].value | Tag content | String | "rock" |
| rating | Rating | Dictionary object | |
| rating.count | Number of rating votes | Integer | 13 |
| rating.value | Rating value (average) | Integer | 86 |

Consider storing and retrieving the artist.json.gz data in a key-value store (KVS) and in a document-oriented database. Use LevelDB, Redis, Kyoto Cabinet, etc. as the KVS. MongoDB is used as the document-oriented database, but CouchDB, RethinkDB, etc. may also be used.

61. Searching the KVS

Use the database built in problem 60 to get the activity location of a specified artist.

The finished code:

`main.py`


# coding: utf-8
import re
import leveldb

fname_db = 'test_db'

# Regular expression for splitting a key into name and id
pattern = re.compile(r'''
    ^
    (.*)    # name
    \t      # separator
    (\d+)   # id
    $
    ''', re.VERBOSE | re.DOTALL)

# Open LevelDB
db = leveldb.LevelDB(fname_db)

# Read the search condition
clue = input('Please enter the artist name--> ')
hit = False

# Start iterating at artist name + '\t'
for key, value in db.RangeIter(key_from=(clue + '\t').encode()):

    # Split the key back into name and id
    match = pattern.match(key.decode())
    name = match.group(1)
    artist_id = match.group(2)

    # Stop once the keys belong to a different artist
    if name != clue:
        break

    # Check and display the activity location
    area = value.decode()
    if area != '':
        print('{}(id:{})Activity place:{}'.format(name, artist_id, area))
    else:
        print('{}(id:{})The place of activity is not registered'.format(name, artist_id))
    hit = True

if not hit:
    print('{} is not registered'.format(clue))

Execution result:

`Execution result`


Please enter the artist name--> Oasis
Oasis(id:20660)Activity place:United Kingdom
Oasis(id:286198)Activity place:United States
Oasis(id:377879)Activity place:United Kingdom

There were three artists named Oasis.

`Execution result`


Please enter the artist name--> SMAP
SMAP(id:265728)Activity place:Japan

In this database, SMAP is still active in Japan.

Even when an artist is registered, the activity location may be missing.

`Execution result`


Please enter the artist name--> Mayuko Higa
Mayuko Higa(id:1075206)The place of activity is not registered

I picked Mayuko Higa as the example because, while I was looking through the list of artists with no registered activity location, the name happened to stand out, being written in kanji. She appears to be from Okinawa.

If the artist is not registered at all, the output is as follows.

`Execution result`


Please enter the artist name--> segavvy
segavvy is not registered

Searching LevelDB

For a plain key lookup, passing the key to LevelDB.Get() returns the value and that is all there is to it. However, in the database created in the previous problem, the key is artist name + '\t' + unique identifier so that duplicate artist names can be stored. Since the unique identifier is not known in advance, I obtained an iterator and checked the keys.
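The key layout described above can be sketched as follows. This is a minimal illustration, not the code from the previous problem: the helper names `to_key_value` and `split_key` are hypothetical, the sample line is assembled from the example values in the field table, and storing an empty value when `area` is missing is an assumption that matches the `area != ''` check in main.py.

```python
import json

# Sample record assembled from the example column of the field table
# (illustrative; a real line in artist.json.gz has more fields).
line = '{"id": 20660, "name": "Oasis", "area": "United Kingdom"}'


def to_key_value(json_line):
    """Build the (key, value) pair: name + TAB + id -> area, as bytes."""
    record = json.loads(json_line)
    key = '{}\t{}'.format(record['name'], record['id']).encode()
    # Assumption: a missing area is stored as an empty value
    value = record.get('area', '').encode()
    return key, value


def split_key(key):
    """Split a key back into (name, id) at the last TAB."""
    name, _, artist_id = key.decode().rpartition('\t')
    return name, artist_id


key, value = to_key_value(line)
print(key, value, split_key(key))
```

Because the identifier is part of the key, two artists with the same name get distinct keys, which is why a lookup by name alone has to iterate rather than call Get().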

The iterator is obtained with LevelDB.RangeIter(), but scanning every entry would throw away the benefit of using a KVS, so I rely on the fact that LevelDB keys are always sorted. RangeIter() accepts key_from as the starting point of the iteration, so by specifying the artist name + '\t' and checking only the entries from there on, I can reach the target artist's values directly, whatever the unique identifier is. An end condition can also be given with key_to, but I did not use it this time because I did not understand LevelDB's sort logic well enough to work out what to specify. Instead, the loop breaks as soon as the artist name is no longer the one being searched for.

Notation variants and Unicode code points
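On the sort order: LevelDB's default comparator orders keys bytewise, so every key that begins with name + '\t' forms one contiguous run. The sketch below imitates that with a sorted Python list and `bisect` rather than a real database, so it runs without LevelDB installed. The upper bound name + '\n' is my own suggestion (0x0A is the byte immediately after TAB, 0x09), and the extra keys `OasisX` and `Oasi` are hypothetical entries added to show that near-matches are excluded.

```python
import bisect


def make_key(name, artist_id):
    """Key format used in this article: name + TAB + id, as bytes."""
    return '{}\t{}'.format(name, artist_id).encode()


def prefix_scan(sorted_keys, name):
    """Return the keys for `name` from a bytewise-sorted list of keys."""
    start = (name + '\t').encode()
    # '\n' (0x0A) is the byte right after '\t' (0x09), so this is the
    # smallest key that sorts after every key starting with name + TAB.
    stop = (name + '\n').encode()
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Python sorts bytes objects bytewise, like LevelDB's default comparator.
keys = sorted([
    make_key('Oasis', 20660),
    make_key('Oasis', 286198),
    make_key('Oasis', 377879),
    make_key('OasisX', 1),   # hypothetical near-match
    make_key('Oasi', 2),     # hypothetical near-match
    make_key('SMAP', 265728),
])

hits = prefix_scan(keys, 'Oasis')
print(hits)
```

With a real database the same bounds could in principle be passed as key_from and key_to, but whether key_to is inclusive depends on the binding and is worth checking in its documentation. Breaking out of the loop, as main.py does, avoids the question entirely.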

As an aside, my favorite fusion band T-SQUARE is not registered for some reason.

`Execution result`


Please enter the artist name--> T-SQUARE
T-SQUARE is not registered

Puzzled, I took a quick look at the data and found that the hyphen was not the usual character (Unicode code point 45) but a different one (code point 8208). Entering that character produces a hit.

`Execution result`


Please enter the artist name--> T‐SQUARE
T‐SQUARE(id:9707)Activity place:Japan

Notation variants like this cause missed hits, so they are likely to be a headache when building a search feature. The earlier problems covered techniques such as morphological analysis and reducing words to their base form or stem, but there seem to be many other kinds of variation to handle: upper and lower case, full-width and half-width characters, variant characters, old and new kanji forms, writing kanji out in hiragana, and full spellings versus abbreviations.
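As a rough idea of how some of these variants could be absorbed, here is a minimal sketch using only the standard library. It is not part of the knock's answer: `normalize_name` and the `HYPHENS` table are hypothetical, and the list of hyphen-like characters is deliberately incomplete. NFKC normalization folds full-width letters to their ASCII forms but leaves U+2010 as it is, so the hyphens are mapped explicitly.

```python
import unicodedata

# Hyphen-like characters folded to the ASCII hyphen-minus.
# Illustrative and incomplete: HYPHEN, NON-BREAKING HYPHEN,
# FIGURE DASH, EN DASH, MINUS SIGN.
HYPHENS = {ord(c): '-' for c in '\u2010\u2011\u2012\u2013\u2212'}


def normalize_name(text):
    """Fold width, hyphen, and case variants into one comparable form."""
    text = unicodedata.normalize('NFKC', text)  # full-width -> half-width
    text = text.translate(HYPHENS)              # unify hyphen-like characters
    return text.casefold()                      # ignore case


print(normalize_name('T\u2010SQUARE'))                  # U+2010 hyphen
print(normalize_name('T-SQUARE'))                       # ASCII hyphen-minus
print(normalize_name('\uff2f\uff41\uff53\uff49\uff53')) # full-width "Oasis"
```

For this to help the search, the same function would have to be applied to the artist names when the database is built, not only to the input. Kanji variants and abbreviations are out of reach for this kind of character-level folding.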

By the way, you can look up a Unicode code point in Python with `ord()`.

For those who want to know what a Unicode code point actually is, I found the article "Differences between Unicode and UTF-8 understood from the concept of character codes" easy to follow.
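For example, the two hyphens in the T-SQUARE case can be told apart like this. The use of `unicodedata.name()` to print the official character names is my addition; the second character is written as the escape `'\u2010'` so that it cannot be confused with the first.

```python
import unicodedata

# ASCII hyphen-minus and the Unicode HYPHEN character
code_points = [(ord(ch), unicodedata.name(ch)) for ch in ('-', '\u2010')]

for number, name in code_points:
    print(number, hex(number), name)
# 45 0x2d HYPHEN-MINUS
# 8208 0x2010 HYPHEN
```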

That's all for the 62nd knock. If you have any mistakes, I would appreciate it if you could point them out.

The execution results include part of the data distributed as the corpus data used in the 100 knocks. The data used in Chapter 7 is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported ([Japanese translation](https://creativecommons.org/licenses/by-nc-sa/3.0/deed.ja)).