[PYTHON] Google Tech Dev Guide Translated the explanation of the first issue into Japanese

An amateur worked on the learning material Google Tech Dev Guide provided by Google, but since it is an outsider, the content does not come in while reading in English. I translated it into Japanese because it would be more efficient to translate it all into Japanese and then read it. I hope it helps someone.

This time I worked on the following:

■ Title Former Coding Interview Question: Find longest word in dictionary that is a subsequence of a given string

■ Link https://techdevguide.withgoogle.com/paths/foundational/find-longest-word-in-dictionary-that-subsequence-of-given-string/#code-challenge

What i did

1. First, solve it yourself

・ Implemented with Python 3.6 ・ I made something equivalent to the linked Greedy algorithm without any special effort.

def find_longest_word_in_string(letter, words):
    i = 0
    ok = ""
    l = len(letter) - 1

    for w in words:
        j = 0
        while j < l:
            if w[0] == letter[j]:
                j += 1
                w = w[1:]
            else:
                j +=1
        if w == "":
            if len(ok) < len(words[i]):
                ok = words[i]
        i += 1

    return(ok)

S = "abppplee"
D = ["jam", "jamboleeee", "able", "ale", "apple", "bale", "kangaroo"]
result = find_longest_word_in_string(S, D)
print(result)

2. Reading the commentary

Japanese translation of the explanation below.

■ Explanation Various methods can be considered as a problem-solving approach (For example, there are multiple algorithms that can be applied). In other words, your problem-solving ability and knowledge will be tested.

■ Forced solution Generate all possible substrings of 2 ^ n from string S and check if they match the words in dictionary D. As a small optimization, make the string to be compared represented by a hash table or a prefix tree to make the search more efficient.

■ Check each word in the dictionary with greedy algorithm It can be confirmed by the greedy method whether the word w in the dictionary D is a substring of S. Scan S for w [0] from the beginning. When w [0] is found in S, then scan for w [1] from the position where w [0] was found, until the entire S is scanned or all the characters contained in w are found. to continue.

After sorting in descending order by the length of the words included in D, the above procedure can be performed, and the first word that becomes a substring of S can be used as the detection result. The execution time is from the number W of words contained in the dictionary and the number of characters N included in S to O (N * W).

When the length of the word contained in D is close to the length of S, the execution time O (N * W) is asymptotically optimal (because the input is O (N * W)). However, in the worst case, if the string S is long and the words contained in D are very short, it is far from optimal.

Let L be the total length of the letters of all the words contained in the dictionary. In this case, the best case of calculation time is O (N + L) (considering the case where the word is shorter than S). However, the execution time of the above algorithm is O (N * W), and it may be O (N * L) when the words contained in the dictionary have a certain small length.

■ How to improve the greedy algorithm In order to check the word w in the dictionary, it is necessary to know the minimum i for which the letter c contained in w has S [i] == c larger than the position j where the letter matches S before. Please be aware of that. The greedy algorithm is obviously slow because it scans this i without thinking.

Preprocessing S can make such operations much faster. Make a map of characters-> Make a list sorted by the presence of characters. For example, if S == "abppplee":

a -> [0] b -> [1] p -> [2, 3, 4] l -> [5] e -> [6, 7]

When you want to know "The previous character was X, where does the next character Y match?", Search for Y in the map, and then do a binary search to find the smallest position of X in the list. Find the number or find that it doesn't exist (in which case the words are out of order in the S).

This method requires one binary search for each character contained in D. Therefore, in this case, since only O (N) is required for the preprocessing of S, the total processing time becomes O (N + logN) * 1 and approaches the best possible processing time.

A more advanced approach: Since this is a classic "successor-value" problem, you may find that there are special data structures that can perform searches even faster. For example, using vEB tree improves the execution time to O (N + L * loglogN).

■ Optimal approach of O (N + L) to small size strings (I'm not sure here) If the size of the string is only a-z or a certain small number, it can be solved by the execution time of O (N + L) mentioned above.

The actual execution time is O (N * A + L), where A is the number of alphabets. Instead of creating a sparse expression * 2 (you could make it dense). ← I'm not sure. Instead of p-> [2, 3, 4] you can do p-> [2, 2, 3, 4, -1, -1, -1, -1] and the i-th index goes directly to the query. Generate an answer. This method requires an O (N) space for each character (O (NA) in total), and preprocessing takes time. Therefore, it is not suitable for large character strings.

■ The best way for any string Process all words at the same time instead of processing word by word.

First, create a tuple containing w and 0 for each word w contained in D (eg (w, 0)). This number is the letter contained in w found in S, so none was found at first. About tuple t Let the tuple word be t.w and the number contained in the tuple be t.i (because it is an index).

Group words by t.w [t.i](the initial value of t.i is 0, so the initial value is the first letter). For example, in the example, D = {"able", "ale", "apple", "bale", "kangaroo"}, so

a -> [("able", 0), ("ale", 0), ("apple", 0)] b -> [("bale", 0)] k -> [("kangaroo", 0)]

What we're doing here is what words to look for next for the letters that might be found in the S.

Scan character by character for S. If the scanned character is c, increment t.i by 1 for each tuple contained in map [c] and move it to map [t.w [t.i]] * 3. If t.i == t.w.length, move it to the special word "found" list. The final program output is the longest word in the found list.

3. Copying the answer example

While looking at the screen, copy the sutras and add comments to check the operation. (The original answer example needs to correct the indentation of the if statement on line 35)

#!/usr/bin/env python
import collections
import sys
import logging

logging.basicConfig(level = logging.DEBUG, format = "%(asctime)s - %(levelname)s - %(message)s")

logging.disable(logging.DEBUG)

def find_longest_word_in_string(letters, words):
    letter_positions = collections.defaultdict(list)
#To create a list that does not require initialization. Use defaultdict.


    for index, letter in enumerate(letters):
        letter_positions[letter].append(index)
#Extract characters from letters as dictionary keys.
#Store the position of letter in letters in a dictionary.
#key is letter.

    for word in sorted(words, key=lambda w: len(w), reverse=True):
        #Sort words in words in descending order by length
        pos = 0
        for letter in word:
            logging.debug("word is {}, letter is {}." .format(word, letter))
            if letter not in letter_positions:
                break
            #The character is letter_break if not in positions

            possible_positions = [p for p in letter_positions[letter] if p >= pos]
            #p is the position of the target letter. Even if the if statement in the middle of the for loop is broken, pos is added, so
            #The next character is the judgment target.
            logging.debug("possible_positions are {}." .format(possible_positions))
            logging.debug("pos is {}." .format(pos))
            if not possible_positions:
                break
            logging.debug("possible_positions[0] is {}." .format(possible_positions[0]))
            pos = possible_positions[0] + 1
            #possible_positin[0]Returns the first number in the list that stores the letter position.
            #By ↑, the character to be judged in letters is set to the character after the current letter.
            #Example: letters= "aaapplle"、word = "apple"in the case of,
            #The letter of the loop on the first lap is"a"Because it is possible_positions = (0, 1, 2)
            #The pos of the next loop is 0+ 1 = 1
            #In the next loop the letter is"p"Because it is possible_positions = (3, 4)
            #The pos of the next loop is 3+ 1 = 4
            #In the next loop the letter is"p"But p>=Because the condition is 4 or more than pos
            #possible_positions = (4)
            
        #If for turns until there are no more characters in the word, it will not be broken and that word will be returned.
        else:
            return word

    #print(find_longest_word_in_string(sys.argv[1], sys.argv[2:]))
    #Execute by specifying a character string and a word from the command line.

if __name__ == '__main__':

    letters = "abppplee"
    words = ["able", "ale", "apple", "bale", "kangaroo"]
    print(find_longest_word_in_string(letters, words))

#If you run the script file directly from the command line,__name__Automatically in the variable (attribute)
# __main__The "substitute" operation is performed at the very beginning.

4. Read the explanation of the link and look back roughly

・ It takes time by the number of characters x the number of words, it is slow ・ At least it should have been sorted by word length and broken in the middle ・ It is important to devise data preprocessing

Recommended Posts

Google Tech Dev Guide Translated the explanation of the first issue into Japanese
[Implementation explanation] How to use the Japanese version of BERT in Google Colaboratory (PyTorch)
The one that just translated the Py2app option into Japanese
Completely translated the site of "The Hitchhiker's Guide to Python"
Translated the explanation of the upper model of kaggle's brain wave detection competition