This is a record of my attempt at the 100 Language Processing Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).
artist.json.gz is a file from the open music database MusicBrainz, converted to JSON format and compressed with gzip. Each line of the file holds the information for one artist in JSON format. The outline of the JSON format is as follows.
| Field | Contents | Type | Example |
|---|---|---|---|
| id | Unique identifier | Integer | 20660 |
| gid | Global identifier | String | "ecf9f3a3-35e9-4c58-acaa-e707fba45060" |
| name | Artist name | String | "Oasis" |
| sort_name | Artist name (for dictionary order) | String | "Oasis" |
| area | Place of activity | String | "United Kingdom" |
| aliases | Aliases | List of dictionary objects | |
| aliases[].name | Alias | String | "oasis" |
| aliases[].sort_name | Alias (for sorting) | String | "oasis" |
| begin | Activity start date | Dictionary | |
| begin.year | Activity start year | Integer | 1991 |
| begin.month | Activity start month | Integer | |
| begin.date | Activity start day | Integer | |
| end | Activity end date | Dictionary | |
| end.year | Activity end year | Integer | 2009 |
| end.month | Activity end month | Integer | 8 |
| end.date | Activity end day | Integer | 28 |
| tags | Tags | List of dictionary objects | |
| tags[].count | Number of times tagged | Integer | 1 |
| tags[].value | Tag content | String | "rock" |
| rating | Rating | Dictionary object | |
| rating.count | Number of rating votes | Integer | 13 |
| rating.value | Rating value (average) | Integer | 86 |

Consider storing and retrieving the artist.json.gz data in a key-value store (KVS) and in a document-oriented database. Use LevelDB, Redis, Kyoto Cabinet, etc. as the KVS. MongoDB was adopted as the document-oriented database, but CouchDB, RethinkDB, etc. may also be used.
Use the database built in problem 60 to get the place of activity of a specified artist.
main.py
# coding: utf-8
import re
import leveldb

fname_db = 'test_db'

# Regular expression for decomposing a key into name and id
pattern = re.compile(r'''
    ^
    (.*)    # name
    \t      # separator
    (\d+)   # id
    $
    ''', re.VERBOSE + re.DOTALL)

# Open LevelDB
db = leveldb.LevelDB(fname_db)

# Read the search condition
clue = input('Please enter the artist name--> ')
hit = False

# Start scanning at artist name + '\t'
for key, value in db.RangeIter(key_from=(clue + '\t').encode()):

    # Split the key back into name and id
    match = pattern.match(key.decode())
    name = match.group(1)
    id = match.group(2)

    # Stop once we reach a different artist
    if name != clue:
        break

    # Check and display the place of activity
    area = value.decode()
    if area != '':
        print('{}(id:{})Activity place:{}'.format(name, id, area))
    else:
        print('{}(id:{})The place of activity is not registered'.format(name, id))
    hit = True

if not hit:
    print('{} is not registered'.format(clue))
Execution result
Please enter the artist name--> Oasis
Oasis(id:20660)Activity place:United Kingdom
Oasis(id:286198)Activity place:United States
Oasis(id:377879)Activity place:United Kingdom
There were three artists named Oasis.
Execution result
Please enter the artist name--> SMAP
SMAP(id:265728)Activity place:Japan
In this database, SMAP is still active in Japan.
Even when an artist is registered, the place of activity may be missing.
Execution result
Please enter the artist name--> Mayuko Higa
Mayuko Higa(id:1075206)The place of activity is not registered
I was looking through the list of artists with no registered place of activity, and Mayuko Higa happened to stand out because the name is written in kanji, so I used it as an example. She seems to be from Okinawa.
If the artist is not registered at all, the output is as follows.
Execution result
Please enter the artist name--> segavvy
segavvy is not registered
If you only need to look up a known key, LevelDB.Get() returns the value and you are done. However, in the database created in the previous problem, the key is artist name + '\t' + unique identifier, in order to cope with duplicate artist names. Since the unique identifier is not known in advance, I obtain an iterator and check the entries.
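To make the key layout concrete, here is a minimal sketch of how such a key and value could be built from one JSON line and decomposed again. The helper names make_entry and split_key are hypothetical, and the sample line is a cut-down record using the example values from the field table, so treat it as an illustration rather than the actual code of the previous problem.

```python
import json
import re

# Same key layout as described above: name + '\t' + id
KEY_PATTERN = re.compile(r'^(.*)\t(\d+)$', re.DOTALL)


def make_entry(json_line):
    """Build a (key, value) pair of bytes from one JSON line (hypothetical helper)."""
    record = json.loads(json_line)
    key = '{}\t{}'.format(record['name'], record['id'])
    value = record.get('area', '')   # empty string when no area is registered
    return key.encode(), value.encode()


def split_key(key):
    """Decompose a key back into (name, id) (hypothetical helper)."""
    match = KEY_PATTERN.match(key.decode())
    return match.group(1), int(match.group(2))


line = '{"id": 20660, "name": "Oasis", "area": "United Kingdom"}'
key, value = make_entry(line)
print(key, value)        # b'Oasis\t20660' b'United Kingdom'
print(split_key(key))    # ('Oasis', 20660)
```

Because the id sits after the last tab, the greedy (.*) still recovers the full name even if a name itself happened to contain a tab.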
The iterator is obtained with LevelDB.RangeIter(), but scanning every entry through it would throw away the benefit of using a KVS, so I take advantage of the fact that LevelDB keys are always kept sorted.
LevelDB.RangeIter() lets you specify where the iterator starts with key_from. If you pass the artist name you want to search for + '\t' and check only the entries from that point on, you reach the matching artist's values directly, whatever the unique identifier is.
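The idea can be tried without LevelDB. The following is a rough sketch in which a sorted Python list and bisect stand in for the database; range_iter is a hypothetical stand-in for RangeIter(key_from=...), and the entries are sample data (the OasisX and Nirvana rows are made up to show where the scan stops).

```python
import bisect

# Sorted (key, value) pairs standing in for the LevelDB contents (sample data)
entries = sorted([
    (b'Oasis\t20660', b'United Kingdom'),
    (b'Oasis\t286198', b'United States'),
    (b'Oasis\t377879', b'United Kingdom'),
    (b'OasisX\t1', b''),                    # made-up neighbour
    (b'Nirvana\t2', b'United States'),      # made-up neighbour
    (b'SMAP\t265728', b'Japan'),
])
keys = [k for k, _ in entries]


def range_iter(key_from):
    """Yield entries whose key is >= key_from, in key order."""
    start = bisect.bisect_left(keys, key_from)
    for item in entries[start:]:
        yield item


def find_areas(clue):
    """Collect (id, area) for every artist whose name equals clue."""
    result = []
    for key, value in range_iter((clue + '\t').encode()):
        name, artist_id = key.decode().rsplit('\t', 1)
        if name != clue:
            break                           # reached a different artist
        result.append((artist_id, value.decode()))
    return result


print(find_areas('Oasis'))
print(find_areas('segavvy'))
```

Only the entries from the first candidate key onward are visited, and the loop stops at the first key whose name differs, so the cost does not depend on the total number of artists.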
You can also specify where the iterator ends with key_to, but I did not use it this time: I do not understand LevelDB's sort logic well enough to work out what value to pass. Instead, the loop breaks as soon as the artist name is no longer the one being searched for.
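For reference, LevelDB's default comparator orders keys bytewise, which is the same order Python uses when comparing bytes. Under that assumption, one way to derive an upper bound is to increment the last byte of the prefix, so name + '\t' is bounded by name + '\n'. The sketch below checks this on a sorted list; prefix_upper_bound is a hypothetical helper, and I have not verified whether py-leveldb treats key_to as inclusive or exclusive, so this is not a drop-in replacement for the break.

```python
import bisect


def prefix_upper_bound(prefix):
    """Smallest byte string greater than every key starting with prefix.

    Assumes plain bytewise ordering and a last byte below 0xFF.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


# Sample keys in bytewise order (OasisX and Nirvana are made up)
keys = sorted([
    b'Nirvana\t2',
    b'Oasis\t20660',
    b'Oasis\t286198',
    b'Oasis\t377879',
    b'OasisX\t1',
    b'SMAP\t265728',
])

key_from = 'Oasis\t'.encode()
key_to = prefix_upper_bound(key_from)     # b'Oasis\n'

lo = bisect.bisect_left(keys, key_from)
hi = bisect.bisect_left(keys, key_to)     # exclusive upper end
print(key_to)
print(keys[lo:hi])
```

A name that itself contains a tab could still fall inside this range, so keeping the name comparison from the main script remains a sensible safeguard.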
As an aside, my favorite fusion band T-SQUARE is not registered for some reason.
Execution result
Please enter the artist name--> T-SQUARE
T-SQUARE is not registered
This seemed odd, so I took a quick look at the data: the - (hyphen) was not the usual character (Unicode code point: 45) but a different one (code point: 8208). If you enter that character, the search hits.
Execution result
Please enter the artist name--> T‐SQUARE
T‐SQUARE(id:9707)Activity place:Japan
Variations in notation like this cause missed search results, so building a search mechanism can be a headache. The problems so far have covered techniques such as morphological analysis and reducing words to their base form or stem, but there seem to be many other variations to deal with: upper and lower case, full-width and half-width characters, variant characters, old and new kanji forms, writing a word in hiragana instead of kanji, and full spellings versus abbreviations.
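As a small illustration of handling a few of these variations, the sketch below folds names with Unicode NFKC normalization, an explicit hyphen table, and casefold(). The function normalize_name and the contents of HYPHEN_TABLE are my own assumptions for this example; NFKC alone does not turn U+2010 into the ordinary hyphen, which is why the table is needed.

```python
import unicodedata

# Hyphen-like characters folded to HYPHEN-MINUS (U+002D); this list is an
# illustrative assumption, not an exhaustive one
HYPHEN_TABLE = str.maketrans({'\u2010': '-', '\u2011': '-', '\u2212': '-'})


def normalize_name(name):
    """Fold width, case and a few hyphen variants for lookup purposes."""
    folded = unicodedata.normalize('NFKC', name)   # e.g. full-width -> half-width
    folded = folded.translate(HYPHEN_TABLE)
    return folded.casefold()


print(normalize_name('T\u2010SQUARE') == normalize_name('T-SQUARE'))   # True
print(normalize_name('ＯＡＳＩＳ'))                                      # oasis
```

For this to help a lookup, the same normalization would have to be applied both when the keys are stored and when the search term is entered.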
By the way, you can look up a Unicode code point in Python with ord().
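For example, the two hyphens from the T-SQUARE case can be compared like this. The use of unicodedata.name() to show the official character names is an addition of mine for illustration.

```python
import unicodedata

# The ordinary hyphen and the one found in the data
for ch in ('-', '\u2010'):
    print(ord(ch), hex(ord(ch)), unicodedata.name(ch))
# 45 0x2d HYPHEN-MINUS
# 8208 0x2010 HYPHEN
```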
For those aiming for geekdom who are wondering what a Unicode code point is, I think the article "Differences between Unicode and UTF-8 understood from the concept of character codes" gives an easy-to-follow explanation.
That's all for the 62nd knock. If you find any mistakes, I would appreciate it if you could point them out.