[PYTHON] String manipulation, maybe?

While looking a few things up, this turned into a miscellaneous memo on various kinds of string splitting.

Camel case and snake case
Japanese name split (family name and given name)
Japanese address split
Chinese names, etc.

These are loosely organized notes.

Camel case and snake case string conversion

For usage, see https://kagamihoge.hatenablog.com/entry/2017/01/03/225054. I considered measuring speed and comparing the implementations, but the article "Convert CamelCase and snake_case to each other" already does that, so refer to it. Here I only quote the relevant source code.

Google Guava

https://github.com/google/guava/blob/master/guava/src/com/google/common/base/CaseFormat.java#L84

  private static String firstCharOnlyToUpper(String word) {
    return word.isEmpty()
        ? word
        : Ascii.toUpperCase(word.charAt(0)) + Ascii.toLowerCase(word.substring(1));
  }
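
The helper above is the core of the conversion: every word gets an upper-case first letter and a lower-case remainder. A minimal Python sketch of the same idea for snake_case input follows. The function names are hypothetical, and `str.upper()` / `str.lower()` are Unicode-aware, whereas Guava's `Ascii` helpers only change ASCII letters, so the results can differ for non-ASCII text.

```python
def first_char_only_to_upper(word):
    """Upper-case the first character and lower-case the rest."""
    # Slicing keeps the empty string safe: "" -> "" without an index error.
    return word[:1].upper() + word[1:].lower()


def snake_to_upper_camel(name):
    """Convert snake_case to UpperCamel (hypothetical helper)."""
    return "".join(first_char_only_to_upper(w) for w in name.split("_"))


def snake_to_lower_camel(name):
    """Convert snake_case to lowerCamel (hypothetical helper)."""
    head, *tail = name.split("_")
    return head.lower() + "".join(first_char_only_to_upper(w) for w in tail)


if __name__ == "__main__":
    print(snake_to_upper_camel("hello_world"))  # HelloWorld
    print(snake_to_lower_camel("hello_world"))  # helloWorld
```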

Commons Lang

https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L7539

    private static String[] splitByCharacterType(final String str, final boolean camelCase) {
        if (str == null) {
            return null;
        }
        if (str.isEmpty()) {
            return ArrayUtils.EMPTY_STRING_ARRAY;
        }
        final char[] c = str.toCharArray();
        final List<String> list = new ArrayList<>();
        int tokenStart = 0;
        int currentType = Character.getType(c[tokenStart]);
        for (int pos = tokenStart + 1; pos < c.length; pos++) {
            final int type = Character.getType(c[pos]);
            if (type == currentType) {
                continue;
            }
            if (camelCase && type == Character.LOWERCASE_LETTER && currentType == Character.UPPERCASE_LETTER) {
                final int newTokenStart = pos - 1;
                if (newTokenStart != tokenStart) {
                    list.add(new String(c, tokenStart, newTokenStart - tokenStart));
                    tokenStart = newTokenStart;
                }
            } else {
                list.add(new String(c, tokenStart, pos - tokenStart));
                tokenStart = pos;
            }
            currentType = type;
        }
        list.add(new String(c, tokenStart, c.length - tokenStart));
        return list.toArray(ArrayUtils.EMPTY_STRING_ARRAY);
    }
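
The method above starts a new token whenever the Unicode character type changes, and in camel-case mode it moves the last upper-case letter of a run into the following lower-case token. A rough Python port is sketched below, assuming `unicodedata.category()` as a stand-in for `Character.getType()`. The function names are hypothetical, and unlike the Java version this sketch returns an empty list for `None` as well as for an empty string.

```python
import unicodedata


def split_by_character_type(text, camel_case=False):
    """Split text where the Unicode general category changes."""
    if not text:
        return []
    tokens = []
    token_start = 0
    current_type = unicodedata.category(text[0])
    for pos in range(1, len(text)):
        char_type = unicodedata.category(text[pos])
        if char_type == current_type:
            continue
        if camel_case and char_type == "Ll" and current_type == "Lu":
            # Upper -> lower: the last upper-case letter starts the new token.
            new_token_start = pos - 1
            if new_token_start != token_start:
                tokens.append(text[token_start:new_token_start])
                token_start = new_token_start
        else:
            tokens.append(text[token_start:pos])
            token_start = pos
        current_type = char_type
    tokens.append(text[token_start:])
    return tokens


def to_snake_case(text):
    """Join the camel-case tokens with underscores (hypothetical helper)."""
    return "_".join(t.lower() for t in split_by_character_type(text, True))


if __name__ == "__main__":
    print(split_by_character_type("ASFRules", camel_case=True))  # ['ASF', 'Rules']
    print(split_by_character_type("fooBar"))  # ['foo', 'B', 'ar']
    print(to_snake_case("fooBarBaz"))  # foo_bar_baz
```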

ModeShape

https://github.com/HexarA/Json2Pojo/blob/master/src/org/jboss/dna/common/text/Inflector.java#L325

    public String underscore( String camelCaseWord,
                              char... delimiterChars ) {
        if (camelCaseWord == null) return null;
        String result = camelCaseWord.trim();
        if (result.length() == 0) return "";
        result = result.replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2");
        result = result.replaceAll("([a-z\\d])([A-Z])", "$1_$2");
        result = result.replace('-', '_');
        if (delimiterChars != null) {
            for (char delimiterChar : delimiterChars) {
                result = result.replace(delimiterChar, '_');
            }
        }
        return result.toLowerCase();
    }
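
The two regular expressions above do the real work: the first separates an acronym from the word that follows it, and the second inserts an underscore between a lower-case letter or digit and an upper-case letter. A Python sketch with `re.sub` is shown below. It is a port under assumptions: `delimiter_chars` is passed as a plain string rather than a varargs list, `[a-z0-9]` replaces `[a-z\d]` so that only ASCII digits match, and `str.strip()` does not trim exactly the same characters as Java's `trim()`.

```python
import re


def underscore(camel_case_word, delimiter_chars=""):
    """Convert camelCase to snake_case using the two-pass regex approach."""
    if camel_case_word is None:
        return None
    result = camel_case_word.strip()
    if not result:
        return ""
    # "HTTPServer" -> "HTTP_Server": split an acronym from the next word.
    result = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", result)
    # "camelCase" -> "camel_Case": split at a lower/digit -> upper boundary.
    result = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", result)
    result = result.replace("-", "_")
    for ch in delimiter_chars:
        result = result.replace(ch, "_")
    return result.lower()


if __name__ == "__main__":
    print(underscore("camelCase"))  # camel_case
    print(underscore("HTTPServer"))  # http_server
    print(underscore("activeRecord.Errors", "."))  # active_record_errors
```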

Java Case Converter

https://github.com/toolpage/java-case-converter
https://en.toolpage.org/cat/case-converter
https://github.com/toolpage/java-case-converter/blob/master/src/org/toolpage/util/text/CaseConverter.java#L121

	public static String convertToSnakeCase(String value) {
		String throwAwayChars = "()[]{}=?!.:,-_+\\\"#~/";
		value = value.replaceAll("[" + Pattern.quote(throwAwayChars) + "]", " ");
		value = CaseConverter.convertToStartCase(value);
		return value.trim().replaceAll("\\s+", "_");
	}

Netbeans Case Converter

A NetBeans plugin.

https://github.com/eviweb/netbeans-case-converter

Japanese name split (family name and given name)

While researching the English-language conventions above, I suddenly remembered the article "Creating a perfect Yubaba with NameDivider", and that prompted this section.

NameDivider

https://internet.watch.impress.co.jp/docs/yajiuma/1289735.html
https://github.com/rskmoi/namedivider-python
https://github.com/rskmoi/NameDivider

        example:
        -----------------------------------------------------
        >>> namedivider = NameDivider()
        >>> divided_name = namedivider.divide_name("菅義偉")
        >>> print(divided_name)
        "菅 義偉"
        >>> print(divided_name.to_dict())
        {'family': '菅', 'given': '義偉', 'separator': ' ', 'score': 0.6328842762252201, 'algorithm': 'kanji_feature'}
        -----------------------------------------------------

name-divider

Browsing GitHub turned up a few similar projects, so I list one here as well.

https://github.com/iszk/name-divider

Japanese address split

That made me wonder whether anyone had tried splitting Japanese addresses, and of course someone already had. There are pioneers everywhere.

"An extreme sport: splitting an address into prefecture / municipality / the rest with the shortest possible regular expression"
"Splitting an address into prefecture, municipality, and the rest"
Municipalities whose names contain one of the characters 都, 道, 府, 県 (https://uub.jp/zat/todofukenmoji.html)

For those who find it too long to read, here is the final result first:

(...??[都道府県])((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村|十日町|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松|岩国|田川|大村)市|.+?郡(?:玉村|大町|.+?)[町村]|.+?市.+?区|.+?[市区町村])(.+)
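
To see how the three capture groups behave, here is a minimal Python sketch that applies the pattern to a few addresses. It writes the pattern with the Japanese characters that real addresses contain. The kanji spellings of the city names are my reconstruction from their romanized forms, so treat them as an assumption and check them against the original article. `split_address` is a hypothetical helper name.

```python
import re

# Group 1: prefecture, group 2: municipality, group 3: the rest.
# The explicit city list handles names that themselves contain
# characters such as 市, 町, 村 or 郡 (for example 四日市市).
ADDRESS_PATTERN = re.compile(
    r"(...??[都道府県])"
    r"((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村|十日町"
    r"|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松|岩国|田川|大村)市"
    r"|.+?郡(?:玉村|大町|.+?)[町村]"
    r"|.+?市.+?区"
    r"|.+?[市区町村])"
    r"(.+)"
)


def split_address(address):
    """Return (prefecture, municipality, rest), or None if nothing matches."""
    match = ADDRESS_PATTERN.fullmatch(address)
    return match.groups() if match else None


if __name__ == "__main__":
    for sample in (
        "東京都千代田区永田町1-7-1",
        "群馬県佐波郡玉村町下新田201",
        "三重県四日市市諏訪町1-5",
    ):
        print(split_address(sample))
```

The second and third samples show why the special cases exist: without `玉村` in the county branch the match would stop at `佐波郡玉村`, and without `四日市` in the city list it would stop at `四日市`.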

Chinese name split

After the investigation above, I wondered whether the same thing exists for Chinese, but the following was about all I could find.

It splits the Chinese names in an address book into family name and given name and saves them in the corresponding family-name and given-name fields.

https://github.com/chengyin/chinese-contact-name-separator

Chinese name judgment

mingpipe is a name matcher for Chinese names. It takes two names and predicts whether they can refer to the same entity (person, organization, or location).

Example: 佛罗伦萨 (Florence) and 翡冷翠 (also Florence, an older transliteration of "Firenze") → true

https://github.com/hltcoe/mingpipe

Conclusion: Strings in multibyte characters are difficult

Today's survey ends with the following as the punch line.

https://github.com/derek73/python-nameparser
https://github.com/derek73/python-nameparser/issues/83

The parser seems to parse incorrectly for Chinese names in English. (below uses Malaysia's Chinese name) Names without nickname.

Current:

>>> name = HumanName('Tham Jun Hoe')
>>> name
<HumanName : [
        title: ''
        first: 'Tham'
        middle: 'Jun'
        last: 'Hoe'
        suffix: ''
        nickname: ''
]>
Expected:

>>> name = HumanName('Tham Jun Hoe')
>>> name
<HumanName : [
        title: ''
        first: 'Jun Hoe'
        middle: ''
        last: 'Tham'
        suffix: ''
        nickname: ''
]>
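
The expected result above amounts to treating the first token as the family name and everything after it as the given name. A minimal sketch of that rule in plain Python follows, without using python-nameparser. `split_family_first` is a hypothetical helper, and the rule only holds when the caller already knows the name is written family name first, which is exactly the information a generic parser does not have.

```python
def split_family_first(full_name):
    """Split a romanized name written in family-name-first order."""
    parts = full_name.split()
    if not parts:
        return {"first": "", "last": ""}
    # First token is the family name; the remaining tokens form the given name.
    return {"first": " ".join(parts[1:]), "last": parts[0]}


if __name__ == "__main__":
    print(split_family_first("Tham Jun Hoe"))
    # {'first': 'Jun Hoe', 'last': 'Tham'}
```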