[PYTHON] How do you make those "the first three words you see" puzzles?

Have you ever seen something like this on Twitter or Facebook? (Image: スナップショット17.jpeg) This one I made myself; you can find plenty more by searching.

**How do you make these?!** I was curious, so I gave it some thought. All the code here is Python 3. If you are still on 2.x and work with Japanese a lot, please switch: as long as you only pay attention to character encodings at input and output, you never have to worry about unicode or codecs, which makes things very easy.

Strategy 1: Choose hiragana at random

Strategy 1


import random

h, w = 16, 16
USE_HIRAGANAS = ('あいうえおかきくけこさしすせそたちつてと' +
        'なにぬねのはひふへほまみむめもやゆよらりるれろわん' +
        'がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ')

for _ in range(h):
    print(''.join(random.choice(USE_HIRAGANAS) for _ in range(w)))

Results of Strategy 1


Ruzuru Po Nuzu Jijiranuyo Audo Pepo
Geyojiname Hikepee no Patsume Abo
Zuko Pozazago Gibabora Hiwa Chiro Yubu
It ’s amazing.
Sillaginidae Sillaginidae
Noupi Homi ni Reze Unazamomu Obotsu
Robuza Hopozae Shipusachi Yomu
Nato is Goza Koto Dometsuse Pizozuso
No tanning and unevenness
Ntepabesoyusupibi is a stranger
Meba and Hechinagake Yasuzumihoha
Yabudo Soewa ni Sajiya no Mobonunu
Mabiro Gozuru Yupiruji Wazumosa
Gaze and Sozuma Hoa Osaji Gezu Hameki
Chikoshige Bikinebo Pezazubo
I'm afraid to be a hoe

This doesn't exactly make you want to hunt for words. Sorry.

Strategy 2: Weight by frequency of appearance, so that commonly used hiragana are selected with higher probability

As a sample of hiragana strings, I used gcanna.ctd from cannadic Kai, which serves as a dictionary for open-source Japanese input systems.

By the way, the frequency of appearance of each hiragana looked like this. (Image: スナップショット19.jpeg) (See here for how to display Japanese in matplotlib without the labels becoming garbled.)
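The article doesn't show the counting step itself, so here is a minimal sketch of how such a frequency table (`hist`) could be built with `collections.Counter`. `count_hiragana` is a hypothetical helper, not the author's code; the only assumption about the dictionary format is the one the article itself uses later, namely that the hiragana reading is the first whitespace-separated field of each line.

```python
from collections import Counter

# The set of hiragana we care about (same set as USE_HIRAGANAS below).
HIRAGANA = set('あいうえおかきくけこさしすせそたちつてと'
               'なにぬねのはひふへほまみむめもやゆよらりるれろわん'
               'がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ')

def count_hiragana(lines):
    '''Count hiragana occurrences in the reading column of each line.'''
    hist = Counter()
    for line in lines:
        reading = line.split()[0]  # first column is the hiragana reading
        hist += Counter(c for c in reading if c in HIRAGANA)
    return hist
```

Feeding it the open dictionary file line by line yields the `hist` used in the code below.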

Here is the code; it selects each hiragana with probability proportional to its frequency.

Strategy 2


import random

def make_text1(hist, height, width):
    '''Build height lines of width hiragana, sampled by frequency.'''
    n = sum(hist.values())
    keyvalues = hist.most_common()
    # Cumulative counts: a[i] is the total count of the i+1 most common hiragana.
    a = [keyvalues[0][1]]
    for x in keyvalues[1:]:
        a.append(a[-1] + x[1])

    def _get():
        # Draw a position in [0, n) and return the hiragana whose
        # cumulative range contains it.
        x = random.randrange(n)
        for i, y in enumerate(a):
            if x < y:
                return keyvalues[i][0]

    return '\n'.join([''.join([_get() for _ in range(width)])
                     for _ in range(height)])

By the way, hist is assumed to be a collections.Counter. It is a subclass of dict and feels much like defaultdict(lambda: 0) in use, but it is very convenient: you can add Counter instances together, and Counter.most_common() gives you ((key, value), ...) pairs in descending order of count.
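As a quick illustration of those two Counter conveniences:

```python
from collections import Counter

a = Counter('ああい')   # counts characters: {'あ': 2, 'い': 1}
b = Counter('いう')     # {'い': 1, 'う': 1}

# Counters add element-wise.
c = a + b               # {'あ': 2, 'い': 2, 'う': 1}

# most_common() returns (key, count) pairs in descending order of count.
print(c.most_common())
```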

And as a result, I got a list of hiragana like this.

Strategy 2 results


Kigo Eshiratsu Osen
Migodakote Azuchi Karigo no Ukuri
Bubuchi Dona Chinkefui Odakoshikofu
Yugitado Cancer is good
Mesagen Aki de Biuro Hiroro
Maybe it's light and light
Fuwanrosa Hiuta is Ujimi Kakubo
Chitsuke-n-mezan
Jiriyoji and Parugiko
Chita Chita, a nice apple
Toseda Sarago is Sadamu Umomusin
Fuzoki Sezori Kuki Sachi Aiku Kafu
Riuuchi no Sato Toshito Mawachi
Chiira Utafuse Suma Biseki
Sohachi squeaky and squeaky
Nana Bakon Uta Hogiki Kuukoan

It has improved considerably; it's pretty good. This one might actually give you the energy to look for words. But can't we do a little better still?

Strategy 3: Take into account the probability of transition from one hiragana to another

In short, it's a first-order Markov chain. Here is an excerpt of the code. By the way, markov is a collections.defaultdict(collections.Counter), where markov['a']['i'] counts how many times "i" appears after "a". Strictly, each element should be divided by sum(markov['a'].values()) so that the values form a probability distribution, but I skipped that when building the counts, so this function does the normalization itself.

Strategy 3


import functools
import random
from collections import Counter

def tee(func, arg):
    func(arg)
    return arg

def make_text2(markov, hist, height, width):
    '''A grid of hiragana that also considers the neighboring hiragana.'''
    def _get(up, left):
        '''Pick a character, considering the transition from the character
        above and the transition from the character to the left.'''
        c = norm_markov[up] + norm_markov[left]
        x = random.random() * 2.0
        y = 0.0
        for k, v in c.most_common():
            y += v
            if x <= y:
                #print('_get:', up, left, '-->', k)
                return k

    # Because we look at the character above and the character to the left,
    # we need a column before column 0 and a row before row 0. For
    # convenience, both are generated from the plain hiragana frequencies.
    firstcol = make_text1(hist, 1, width)
    prevline = make_text1(hist, 1, width).strip()
    #print('firstcol', firstcol)
    #print('prevline', prevline)

    # Normalize markov so that the transition from the character above and
    # the transition from the character to the left carry equal weight.
    norm_markov = {}
    for k1, cnt in markov.items():
        c = Counter()
        n = sum(cnt.values())
        for k2, v in cnt.items():
            c[k2] = v / n
        norm_markov[k1] = c

    a = []
    for i in range(height):
        line = []
        functools.reduce(lambda x, j: tee(line.append, _get(prevline[j], x)),
                         range(width), firstcol[i])
        a.append(''.join(line))
        prevline = line
    return '\n'.join(a)

By the way, markov itself can be built with code like this. (Heavily abridged.)

Building markov


from collections import deque
from collections import defaultdict
from collections import Counter

def shiftiter(iterable, n):
    '''Generator which yields iterable[0:n], iterable[1:n+1], ...

    Example: shiftiter(range(5), 3) yields (0, 1, 2), (1, 2, 3), (2, 3, 4)'''
    # I used deque, but I'm not sure it is fastest way:(
    it = iter(iterable)
    try:
        a = deque([next(it) for _ in range(n)])
        yield tuple(a)
    except StopIteration:
        pass
    for x in it:
        a.popleft()
        a.append(x)
        yield tuple(a)


USE_HIRAGANAS = ('あいうえおかきくけこさしすせそたちつてと' +
            'なにぬねのはひふへほまみむめもやゆよらりるれろわん' +
            'がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ')

markov = defaultdict(Counter)
with open(FILENAME, encoding=FILEENCODING) as f:
    for line in f:
        hiragana = line.split()[0]
        for x,y in shiftiter(hiragana, 2):
            if x in USE_HIRAGANAS and y in USE_HIRAGANAS:
                markov[x][y] += 1

As a result, it looks like this.

Strategy 3 results


Kisetsu Gejinkatonta Saikigose
Small Shizun Rakui Otsubishi Kimebui
Nikaho Nikaho
I'm so nice
Ruiga Hokkokine Ushin When Gouri
Kehowaun Negoru Egipai Toushi
You're a stranger
What is Sankeda Ichichi?
Nigini Omotoyoi Heuri Kochise
Lego Kasenka Funuya Shinkuso
Ken is a shiroka
Naki Funamisai may be good
Grated chair chair
Tokei Jio-san or Sonkaki Ogoto
Ngayama Yoshigoki Hojimami Shizu
Bichikuji Kobarichi Ichizumu Konguka

Wouldn't you look for words in this one? It's random, so you have to generate several and pick a good one, but my impression is that it's quite usable. Not bad.

Future ideas, for those who are not yet satisfied

Second-order (or higher) Markov chains

You might want to try it.
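A minimal sketch of what "second-order" would mean here: key the transition table on the previous two characters instead of one. This mirrors the article's markov-building loop, but `build_markov2` is a hypothetical helper, not the author's code.

```python
from collections import defaultdict, Counter

def build_markov2(words):
    '''Second-order transition counts: markov2[pair][c] counts how often
    character c follows the two-character string pair.'''
    markov2 = defaultdict(Counter)
    for w in words:
        for i in range(len(w) - 2):
            markov2[w[i:i+2]][w[i+2]] += 1
    return markov2
```

Sampling would then condition on the previous two characters, at the cost of a much sparser table.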

How to choose letters

Currently, the probabilities are accumulated from the largest down, normalized to the range 0 to 1, and a character is picked with a random number in [0, 1). (The actual code differs slightly, but that's the gist.) But should a hiragana that appears only 1% of the time really even be a candidate? Shouldn't it just be ignored? However, when I modified the code so that only the top 80% or so of the probability mass could be selected, very frequent hiragana such as "ん" and "う" dominated the output. It seems some adjustment is needed on the high-frequency side as well.
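The "top 80% only" modification described above could be sketched like this; `truncated_choice` and the `cover` parameter are hypothetical names, not the article's code.

```python
import random
from collections import Counter

def truncated_choice(hist, cover=0.8):
    '''Draw a character only from the most frequent entries whose cumulative
    probability reaches `cover` of the total mass; rarer ones are ignored.'''
    n = sum(hist.values())
    pool, acc = [], 0
    # Keep the most common entries until they cover `cover` of the mass.
    for k, v in hist.most_common():
        pool.append((k, v))
        acc += v
        if acc / n >= cover:
            break
    # Sample proportionally within the truncated pool.
    total = sum(v for _, v in pool)
    x = random.randrange(total)
    for k, v in pool:
        x -= v
        if x < 0:
            return k
```

As the article observes, this truncation on its own over-concentrates the output on the very frequent hiragana, so the top of the distribution would need damping too.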

The code is on GitHub

For those who are interested, the code is posted on GitHub: https://github.com/gyu-don/jikken/tree/master/markov I haven't uploaded gcanna.ctd itself because displaying its license (GPL) was a hassle, but the aggregated data is included, so the code runs without it. (If anyone wants to use this code under the GPL, that's fine by me.)

You can download gcanna.ctd here. https://osdn.jp/projects/alt-cannadic/releases/
