[PYTHON] Is it a character string operation?

I started looking things up and ended up with a miscellaneous memo on various ways of splitting and converting character strings.

  1. Camel case and snake case conversion
  2. Japanese surname split
  3. Japanese address split
  4. Chinese names, etc.

These are rough, loosely organized notes.

Camel case and snake case string conversion

For usage, see https://kagamihoge.hatenablog.com/entry/2017/01/03/225054. I considered measuring speed and comparing the implementations, but the article "Convert CamelCase and snake_case to each other" already does that, so refer to it. Here I only quote the relevant source code from each library.

Google Guava

https://github.com/google/guava/blob/master/guava/src/com/google/common/base/CaseFormat.java#L84

  private static String firstCharOnlyToUpper(String word) {
    return word.isEmpty()
        ? word
        : Ascii.toUpperCase(word.charAt(0)) + Ascii.toLowerCase(word.substring(1));
  }
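
For comparison, the same idea (uppercase the first character, lowercase the rest) can be sketched in Python. This is only a rough illustration, not a port of Guava: both function names are made up for this memo, and `str.upper()` / `str.lower()` are Unicode-aware, whereas Guava's `Ascii` helpers only touch ASCII letters.

```python
def first_char_only_to_upper(word: str) -> str:
    """Uppercase the first character and lowercase the rest (hypothetical helper)."""
    # Slicing keeps this safe for the empty string, like the isEmpty() check in Java.
    return word[:1].upper() + word[1:].lower()


def snake_to_upper_camel(text: str) -> str:
    """Capitalize each underscore-separated word and join them (hypothetical helper)."""
    return "".join(first_char_only_to_upper(word) for word in text.split("_"))


print(snake_to_upper_camel("character_string_operation"))  # CharacterStringOperation
```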

Commons Lang

https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L7539

    private static String[] splitByCharacterType(final String str, final boolean camelCase) {
        if (str == null) {
            return null;
        }
        if (str.isEmpty()) {
            return ArrayUtils.EMPTY_STRING_ARRAY;
        }
        final char[] c = str.toCharArray();
        final List<String> list = new ArrayList<>();
        int tokenStart = 0;
        int currentType = Character.getType(c[tokenStart]);
        for (int pos = tokenStart + 1; pos < c.length; pos++) {
            final int type = Character.getType(c[pos]);
            if (type == currentType) {
                continue;
            }
            if (camelCase && type == Character.LOWERCASE_LETTER && currentType == Character.UPPERCASE_LETTER) {
                final int newTokenStart = pos - 1;
                if (newTokenStart != tokenStart) {
                    list.add(new String(c, tokenStart, newTokenStart - tokenStart));
                    tokenStart = newTokenStart;
                }
            } else {
                list.add(new String(c, tokenStart, pos - tokenStart));
                tokenStart = pos;
            }
            currentType = type;
        }
        list.add(new String(c, tokenStart, c.length - tokenStart));
        return list.toArray(ArrayUtils.EMPTY_STRING_ARRAY);
    }
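
The loop above can be followed more easily in a Python sketch. It assumes that `unicodedata.category()` is a close enough stand-in for Java's `Character.getType()` (both report the Unicode general category), and the function name is my own. Unlike the Java version it returns an empty list for empty input and iterates over code points rather than UTF-16 units.

```python
import unicodedata


def split_by_character_type(text: str, camel_case: bool = False) -> list:
    """Split text wherever the Unicode general category changes (sketch)."""
    if not text:
        return []
    tokens = []
    token_start = 0
    current_type = unicodedata.category(text[0])
    for pos in range(1, len(text)):
        char_type = unicodedata.category(text[pos])
        if char_type == current_type:
            continue
        if camel_case and char_type == "Ll" and current_type == "Lu":
            # Upper -> lower transition: the last uppercase letter
            # belongs to the token that starts here ("ASFRules" -> "ASF", "Rules").
            new_token_start = pos - 1
            if new_token_start != token_start:
                tokens.append(text[token_start:new_token_start])
                token_start = new_token_start
        else:
            tokens.append(text[token_start:pos])
            token_start = pos
        current_type = char_type
    tokens.append(text[token_start:])
    return tokens


print(split_by_character_type("fooBar"))                    # ['foo', 'B', 'ar']
print(split_by_character_type("ASFRules", camel_case=True))  # ['ASF', 'Rules']
```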

ModeShape

https://github.com/HexarA/Json2Pojo/blob/master/src/org/jboss/dna/common/text/Inflector.java#L325

    public String underscore( String camelCaseWord,
                              char... delimiterChars ) {
        if (camelCaseWord == null) return null;
        String result = camelCaseWord.trim();
        if (result.length() == 0) return "";
        result = result.replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2");
        result = result.replaceAll("([a-z\\d])([A-Z])", "$1_$2");
        result = result.replace('-', '_');
        if (delimiterChars != null) {
            for (char delimiterChar : delimiterChars) {
                result = result.replace(delimiterChar, '_');
            }
        }
        return result.toLowerCase();
    }
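
The two regular expressions above carry over to Python almost unchanged. The following is a sketch under a few assumptions: the function is my own transcription rather than part of any library, `re.ASCII` is used so that `\d` stays ASCII-only as in Java, and `str.strip()` is treated as equivalent to Java's `trim()` although the two differ slightly in which characters they remove.

```python
import re


def underscore(camel_case_word, *delimiter_chars):
    """Convert CamelCase to snake_case with the two-step regex approach (sketch)."""
    if camel_case_word is None:
        return None
    result = camel_case_word.strip()
    if not result:
        return ""
    # Step 1: split an acronym from the following word ("HTTPServer" -> "HTTP_Server").
    result = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", result, flags=re.ASCII)
    # Step 2: split a lowercase letter or digit from the next uppercase letter.
    result = re.sub(r"([a-z\d])([A-Z])", r"\1_\2", result, flags=re.ASCII)
    result = result.replace("-", "_")
    for delimiter_char in delimiter_chars:
        result = result.replace(delimiter_char, "_")
    return result.lower()


print(underscore("HTTPServerError"))  # http_server_error
print(underscore("camelCaseWord"))    # camel_case_word
```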

Java Case Converter

https://github.com/toolpage/java-case-converter https://en.toolpage.org/cat/case-converter https://github.com/toolpage/java-case-converter/blob/master/src/org/toolpage/util/text/CaseConverter.java#L121

	public static String convertToSnakeCase(String value) {
		String throwAwayChars = "()[]{}=?!.:,-_+\\\"#~/";
		value = value.replaceAll("[" + Pattern.quote(throwAwayChars) + "]", " ");
		value = CaseConverter.convertToStartCase(value);
		return value.trim().replaceAll("\\s+", "_");
	}

NetBeans Case Converter

A NetBeans plugin. https://github.com/eviweb/netbeans-case-converter

Japanese surname split

While researching the conventions of the English-speaking world above, I suddenly remembered the article "Creating a perfect Yubaba with NameDivider", which prompted this section.

NameDivider

https://internet.watch.impress.co.jp/docs/yajiuma/1289735.html
https://github.com/rskmoi/namedivider-python
https://github.com/rskmoi/NameDivider

    >>> from namedivider import NameDivider
    >>> namedivider = NameDivider()
    >>> divided_name = namedivider.divide_name("菅義偉")
    >>> print(divided_name)
    菅 義偉
    >>> print(divided_name.to_dict())
    {'family': '菅', 'given': '義偉', 'separator': ' ', 'score': 0.6328842762252201, 'algorithm': 'kanji_feature'}

name-divider

Browsing GitHub turned up a few similar projects, so I list this one as well. https://github.com/iszk/name-divider

Japanese municipal division

That made me wonder whether anyone had tried splitting Japanese addresses, and of course someone already had. There are pioneers everywhere.

Extreme sport: splitting an address into "prefecture / municipality / the rest" with as short a regular expression as possible
Splitting an address into "prefecture", "municipality", and "the rest"
Municipalities whose names contain one of the characters for "prefecture" (都, 道, 府, 県) (https://uub.jp/zat/todofukenmoji.html)

Here is the final result first, for those who would rather not read the details:

(...??[都道府県])((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村|十日町|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松|岩国|田川|大村)市|.+?郡(?:玉村|大町|.+?)[町村]|.+?市.+?区|.+?[市区町村])(.+)
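
To see what the pattern does, here is a minimal Python sketch that applies it. It assumes the place names in the pattern are written in kanji, as they would be in a real Japanese address; the function name and the sample addresses are only for illustration.

```python
import re

# The pattern is split across lines only for readability.
ADDRESS_PATTERN = re.compile(
    r"(...??[都道府県])"
    r"((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村"
    r"|十日町|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松"
    r"|岩国|田川|大村)市"
    r"|.+?郡(?:玉村|大町|.+?)[町村]"
    r"|.+?市.+?区"
    r"|.+?[市区町村])"
    r"(.+)"
)


def split_address(address: str):
    """Return (prefecture, municipality, rest), or None if nothing matches."""
    match = ADDRESS_PATTERN.match(address)
    return match.groups() if match else None


print(split_address("東京都千代田区丸の内1-1-1"))
# ('東京都', '千代田区', '丸の内1-1-1')
print(split_address("三重県四日市市諏訪町1-5"))
# ('三重県', '四日市市', '諏訪町1-5')
```

The explicit city list exists because a lazy `.+?[市区町村]` would otherwise stop too early: 四日市市 would be cut at the first 市, giving 四日市.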

Chinese surname split

After looking into the above, I wondered whether something similar exists for Chinese, but the following is about all I found.

This tool splits the Chinese names in an address book into family name and given name and saves them in the corresponding family-name and given-name fields.

https://github.com/chengyin/chinese-contact-name-separator

Chinese name matching

mingpipe is a name matcher for Chinese names. It takes two names and predicts whether they can refer to the same entity (person, organization, or location).

Example: 佛罗伦萨 (Florence) and 翡冷翠 (an older transliteration of Florence) → true

https://github.com/hltcoe/mingpipe

Conclusion: Strings in multibyte characters are difficult

Today's survey ends with the following as a punch line.

https://github.com/derek73/python-nameparser https://github.com/derek73/python-nameparser/issues/83

The issue reports that the parser seems to parse Chinese names written in English incorrectly (the example below uses a Malaysian Chinese name). For a name without a nickname, the current result is:


>>> name = HumanName('Tham Jun Hoe')
>>> name
<HumanName : [
        title: ''
        first: 'Tham'
        middle: 'Jun'
        last: 'Hoe'
        suffix: ''
        nickname: ''
]>

Expected:

>>> name = HumanName('Tham Jun Hoe')
>>> name
<HumanName : [
        title: ''
        first: 'Jun Hoe'
        middle: ''
        last: 'Tham'
        suffix: ''
        nickname: ''
]>
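
The expected result amounts to treating the first token as the family name. A minimal sketch of that rule follows. It is a hypothetical helper, not part of python-nameparser, and it assumes the caller already knows the name is written family name first, which cannot be determined from the string alone.

```python
def split_family_name_first(full_name: str) -> dict:
    """Treat the first whitespace-separated token as the family name (hypothetical helper)."""
    parts = full_name.split()
    if not parts:
        return {"first": "", "last": ""}
    # Everything after the family name is kept together as the given name.
    return {"first": " ".join(parts[1:]), "last": parts[0]}


print(split_family_name_first("Tham Jun Hoe"))  # {'first': 'Jun Hoe', 'last': 'Tham'}
```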
