[PYTHON] 100 amateur language processing knocks: 61

This is a record of my attempt at the Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 7: Database

artist.json.gz is a file from the open music database MusicBrainz, converted to JSON format and compressed with gzip. Each line of the file stores the information for one artist in JSON format. The JSON format is outlined below.

| Field | Contents | Type | Example |
|---|---|---|---|
| id | Unique identifier | Integer | 20660 |
| gid | Global identifier | String | "ecf9f3a3-35e9-4c58-acaa-e707fba45060" |
| name | Artist name | String | "Oasis" |
| sort_name | Artist name (for dictionary-order sorting) | String | "Oasis" |
| area | Place of activity | String | "United Kingdom" |
| aliases | Aliases | List of dictionary objects | |
| aliases[].name | Alias | String | "oasis" |
| aliases[].sort_name | Alias (for sorting) | String | "oasis" |
| begin | Activity start date | Dictionary | |
| begin.year | Activity start year | Integer | 1991 |
| begin.month | Activity start month | Integer | |
| begin.date | Activity start day | Integer | |
| end | Activity end date | Dictionary | |
| end.year | Activity end year | Integer | 2009 |
| end.month | Activity end month | Integer | 8 |
| end.date | Activity end day | Integer | 28 |
| tags | Tags | List of dictionary objects | |
| tags[].count | Number of times tagged | Integer | 1 |
| tags[].value | Tag content | String | "rock" |
| rating | Rating | Dictionary object | |
| rating.count | Number of rating votes | Integer | 13 |
| rating.value | Rating value (average) | Integer | 86 |
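
To make the "one artist per line" layout concrete, here is a minimal sketch of reading such a file. It does not touch the real artist.json.gz: it builds a tiny gzip stream in memory, and the helper name `iter_artists` and the second sample record are made up for illustration.

```python
import gzip
import io
import json


def iter_artists(fileobj):
    """Yield one dict per non-empty line of a JSON-lines text stream."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two hand-written sample lines standing in for artist.json.gz.
# The second record is hypothetical and deliberately has no "area" field.
sample = (
    '{"id": 20660, "name": "Oasis", "area": "United Kingdom"}\n'
    '{"id": 1, "name": "Example Artist"}\n'
)
buf = io.BytesIO(gzip.compress(sample.encode('utf-8')))

with gzip.open(buf, 'rt', encoding='utf-8') as f:
    artists = list(iter_artists(f))

for artist in artists:
    # Fields other than id and name may be missing, so use dict.get()
    print(artist['id'], artist['name'], artist.get('area', ''))
```

With the real file, the `buf` argument would simply be replaced by the file name.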

Consider storing and retrieving artist.json.gz data in a key-value-store (KVS) and document-oriented database. Use LevelDB, Redis, Kyoto Cabinet, etc. as KVS. MongoDB was adopted as the document-oriented database, but CouchDB, RethinkDB, etc. may also be used.

61. Searching the KVS

Use the database built in problem 60 to get the place of activity of a specific (designated) artist.

The finished code:

main.py


# coding: utf-8
import re
import leveldb

fname_db = 'test_db'

# Regular expression for splitting a key into name and id
pattern = re.compile(r'''
    ^
    (.*)    # name
    \t      # separator
    (\d+)   # id
    $
    ''', re.VERBOSE | re.DOTALL)

# Open LevelDB
db = leveldb.LevelDB(fname_db)

# Read the search condition
clue = input('Please enter the artist name--> ')
hit = False

# Start iterating at artist name + '\t'
for key, value in db.RangeIter(key_from=(clue + '\t').encode()):

    # Split the key back into name and id
    match = pattern.match(key.decode())
    name = match.group(1)
    artist_id = match.group(2)

    # Stop once the keys belong to a different artist
    if name != clue:
        break

    # Check and display the place of activity
    area = value.decode()
    if area != '':
        print('{}(id:{})Activity place:{}'.format(name, artist_id, area))
    else:
        print('{}(id:{})The place of activity is not registered'.format(name, artist_id))
    hit = True

if not hit:
    print('{} is not registered'.format(clue))

Execution result:

Execution result


Please enter the artist name--> Oasis
Oasis(id:20660)Activity place:United Kingdom
Oasis(id:286198)Activity place:United States
Oasis(id:377879)Activity place:United Kingdom

There were three artists named Oasis.

Execution result


Please enter the artist name--> SMAP
SMAP(id:265728)Activity place:Japan

In this database, SMAP is still active in Japan.

Even when an artist is registered, the place of activity may not be.

Execution result


Please enter the artist name--> Mayuko Higa
Mayuko Higa(id:1075206)The place of activity is not registered

I came across Mayuko Higa while looking through the list of artists with no registered place of activity. The name happened to stand out because it was written in kanji, so I used it as an example. The artist seems to be from Okinawa.

If the artist is not registered at all, the output is as follows.

Execution result


Please enter the artist name--> segavvy
segavvy is not registered

Searching LevelDB

For a plain lookup by key, you pass the key to LevelDB.Get(), the value is returned, and you are done. In the database created in the previous problem, however, the key is artist name + '\t' + unique identifier, in order to cope with duplicate artist names. The unique identifier is not known in advance, so I obtained an iterator and checked the keys.
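
As a small illustration of that key layout, the sketch below composes and decomposes a key of the form artist name + '\t' + unique identifier. The helper names `make_key` and `split_key` are made up for this example and are not taken from the problem 60 code; splitting on the last tab is used here as a simpler stand-in for the regular expression in main.py.

```python
def make_key(name, artist_id):
    """Build the bytes key: artist name + TAB + unique identifier."""
    return '{}\t{}'.format(name, artist_id).encode()


def split_key(key):
    """Split a bytes key back into (name, id) at the last TAB."""
    name, _, artist_id = key.decode().rpartition('\t')
    return name, int(artist_id)


key = make_key('Oasis', 20660)
print(key)             # the raw key as stored in the KVS
print(split_key(key))  # back to name and id
```

Because the identifier consists only of digits, splitting at the last tab still works if the artist name itself happens to contain a tab.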

The iterator is obtained with LevelDB.RangeIter(), but fetching and checking every entry would throw away the benefit of using a KVS, so I take advantage of the fact that LevelDB keys are always sorted. RangeIter() lets you specify where the iterator starts with key_from, so if you specify the artist name you are searching for + '\t' and check only the entries from that point on, you reach the values for that artist directly, whatever the unique identifier is. You can also specify where the iterator ends with key_to, but I did not use it this time: I did not clearly understand LevelDB's sort logic, so I could not work out what to specify. Instead, the loop breaks when the artist name is no longer the one being searched for.
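
The idea of scanning only a slice of sorted keys can be sketched without LevelDB at all. The example below emulates a sorted key space with a Python list and bisect. It assumes keys are compared byte by byte, which is how LevelDB's default comparator orders them; under that assumption, artist name + '\n' is one possible upper bound, because '\n' (0x0A) is the byte immediately after '\t' (0x09). The function name `prefix_scan` and all keys other than the three Oasis entries are invented for illustration, and how key_to in the leveldb module treats its boundary is not checked here.

```python
import bisect


def prefix_scan(sorted_keys, name):
    """Return the keys starting with name + TAB from a bytewise-sorted list."""
    lower = (name + '\t').encode()   # first possible key for this artist
    upper = (name + '\n').encode()   # just past the last possible key
    start = bisect.bisect_left(sorted_keys, lower)
    end = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[start:end]


# bytes objects sort bytewise, like the assumed key order of the KVS
keys = sorted(k.encode() for k in [
    'Oasis\t20660', 'Oasis\t286198', 'Oasis\t377879',
    'Oasis Academy\t1', 'OasisX\t2', 'Nirvana\t3', 'SMAP\t265728',
])

for key in prefix_scan(keys, 'Oasis'):
    print(key)
```

Names that merely begin with "Oasis", such as the made-up "Oasis Academy" and "OasisX", fall outside the range because the byte following "Oasis" is not a tab.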

Spelling variations and Unicode code points

As an aside, my favorite fusion band T-SQUARE is not registered for some reason.

Execution result


Please enter the artist name--> T-SQUARE
T-SQUARE is not registered

Puzzled, I took a quick look at the data and found that the - (hyphen) was not the usual character (Unicode code point 45) but a different one (code point 8208). If you enter that character, you get a hit.

Execution result


Please enter the artist name--> T‐SQUARE
T‐SQUARE(id:9707)Activity place:Japan

Spelling variations like this cause search misses, so they are likely to be a headache when you build a search mechanism. In the problems so far I have learned techniques such as morphological analysis and handling words by their base form or stem, but there seem to be many other things to deal with: upper and lower case, full-width versus half-width characters, variant characters, old and new kanji forms, kanji "opened up" (written in hiragana), and full spellings versus abbreviations.
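
A few of these variations can be folded together with the standard library. The sketch below is only one possible approach and is not part of the original solution: it applies NFKC normalization (which turns full-width letters into their ordinary forms), maps a small, hand-picked table of hyphen look-alikes to the ASCII hyphen, and ignores case. The function name `normalize_name` and the contents of `HYPHEN_LIKE` are assumptions for illustration. Kanji variants and abbreviations are out of its reach.

```python
import unicodedata

# Hypothetical table of hyphen look-alikes folded into the ASCII hyphen-minus
HYPHEN_LIKE = {0x2010: '-', 0x2011: '-', 0x2212: '-'}


def normalize_name(name):
    """Fold width, hyphen, and case differences for comparison purposes."""
    folded = unicodedata.normalize('NFKC', name)   # full-width -> ordinary forms
    folded = folded.translate(HYPHEN_LIKE)         # unify hyphen-like characters
    return folded.casefold()                       # ignore case


print(normalize_name('T\u2010SQUARE') == normalize_name('T-SQUARE'))
print(normalize_name('\uff33\uff2d\uff21\uff30'))  # full-width "SMAP"
```

For a search to benefit from this, the same normalization would have to be applied both when the keys are stored and when the query is entered.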

By the way, in Python you can look up a Unicode code point with ord().
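
For example, the two hyphens from the T-SQUARE case can be told apart like this. The variable names are made up for the sketch, and unicodedata.name() is added only to show the official character names.

```python
import unicodedata

ascii_hyphen = '-'        # the usual hyphen typed on a keyboard
other_hyphen = '\u2010'   # the character found in the data

for ch in (ascii_hyphen, other_hyphen):
    # code point in decimal and hex, plus the Unicode character name
    print(ord(ch), hex(ord(ch)), unicodedata.name(ch))

# Listing the code points of a whole string makes the odd character easy to spot
code_points = [ord(ch) for ch in 'T\u2010SQUARE']
print(code_points)
```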

What is a Unicode code point? For aspiring geeks, I think the article "Differences between Unicode and UTF-8 understood from the concept of character codes" gives an easy-to-follow explanation.

That's all for the 62nd knock. If you find any mistakes, I would appreciate it if you could point them out.

