[PYTHON] Decompose a hostname with a co.jp-style TLD using a regular expression

The problem I want to solve

I want to extract three parts from a hostname: the subdomain, the domain, and the top-level domain (TLD). (For background on how domain names are structured, I recommend GoDaddy's explanation on YouTube.)

For example, the hostname www.facebook.com breaks down as follows:

Subdomain: (www) This carries no meaning, so I omit it.
Domain: facebook
Top Level Domain: com

However, for a suffix such as co.jp, which is common in Japan, I want to treat the two labels together as a single TLD here, even though strictly speaking that is not accurate. For example, for news.yahoo.co.jp I want this result:

Subdomain: news
Domain: yahoo
Top Level Domain: co.jp

Incidentally, Japan is not the only country that uses this co.xx form.

Some of the countries using .co as a second-level domain include India (.in), Indonesia (.id), Israel (.il), the United Kingdom (.uk), South Africa (.za), Costa Rica (.cr), New Zealand (.nz), Japan (.jp), South Korea (.kr) and Cook Islands (.ck).

ref: Wikipedia
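As a minimal sketch, the country list quoted above could be turned into a pattern that only accepts co.xx for those country codes. The names CO_COUNTRIES, P_CO_TLD and find_tld are hypothetical and chosen for illustration; the list itself is not exhaustive, so treat this as one possible approach rather than a complete solution.

```python
import re

# Country codes taken from the list quoted above; the list is not exhaustive.
CO_COUNTRIES = ["in", "id", "il", "uk", "za", "cr", "nz", "jp", "kr", "ck"]

# Hypothetical stricter pattern: either "co.<listed country code>" or a plain
# final label of 2-5 lowercase letters.
P_CO_TLD = re.compile(
    r"\.(?P<tld>co\.(?:" + "|".join(CO_COUNTRIES) + r")|[a-z]{2,5})$"
)


def find_tld(hostname):
    """Return the matched suffix, or None if nothing matches."""
    m = P_CO_TLD.search(hostname.lower())
    return m.group("tld") if m else None


print(find_tld("news.yahoo.co.jp"))  # co.jp
print(find_tld("www.bbc.co.uk"))     # co.uk
print(find_tld("news.yahoo.com"))    # com
```

The trade-off is that any compound suffix not in the list (for example one that uses a label other than co) is reported as just its last label.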

Python 3 code

import re

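# The TLD is the last label (2-5 characters), optionally preceded by a
# two-character second-level label such as "co", so "co.jp" is captured as one unit.
p_tld = re.compile(r"\.(?P<tld>(?:\w{2}\.)?\w{2,5})$")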

test = [
"amazon.co.jp",
"amazon.com",
"news.yahoo.co.jp",
"news.yahoo.jp",
"news.yahoo.com",
"google.jp",
"google.co.jp",
"google.com",
"www.microsoft.com"
]

for t in test:
    print(t)

    # Strip a leading "www."
    t = re.sub(r"^www\.", "", t)

    # Find the TLD part
    m = p_tld.search(t)
    if m is not None:
        print("tld:", m.group("tld"))

    # Cut off the TLD part
    t = p_tld.sub("", t)

    # What remains is either "subdomain.domain" or just "domain"
    labels = t.split('.')
    if len(labels) > 1:
        print("subdomain:", labels[0])
        print("domain:", labels[1])
    else:
        print("domain:", labels[0])
    print("--------")
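The loop above prints its results directly. If the parts are needed elsewhere, the same steps could be wrapped in a function that returns them. This is only a sketch: split_hostname and P_TLD are hypothetical names, and unlike the loop it uses rpartition so that a multi-level subdomain such as a.b stays together instead of being cut at the first dot.

```python
import re

# Same pattern as in the code above.
P_TLD = re.compile(r"\.(?P<tld>(?:\w{2}\.)?\w{2,5})$")


def split_hostname(hostname):
    """Return a dict with subdomain, domain and tld, or None if no TLD is found."""
    # Strip a leading "www."
    host = re.sub(r"^www\.", "", hostname)

    m = P_TLD.search(host)
    if m is None:
        return None

    # Everything before the matched suffix is "[subdomain.]domain".
    rest = host[:m.start()]
    subdomain, _, domain = rest.rpartition(".")
    return {
        "subdomain": subdomain or None,  # None when there is no subdomain
        "domain": domain,
        "tld": m.group("tld"),
    }


print(split_hostname("news.yahoo.co.jp"))
print(split_hostname("www.microsoft.com"))
```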

Test results

amazon.co.jp
tld: co.jp
domain: amazon
--------
amazon.com
tld: com
domain: amazon
--------
news.yahoo.co.jp
tld: co.jp
subdomain: news
domain: yahoo
--------
news.yahoo.jp
tld: jp
subdomain: news
domain: yahoo
--------
news.yahoo.com
tld: com
subdomain: news
domain: yahoo
--------
google.jp
tld: jp
domain: google
--------
google.co.jp
tld: co.jp
domain: google
--------
google.com
tld: com
domain: google
--------
www.microsoft.com
tld: com
domain: microsoft
--------
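One case the test list does not exercise: because the optional group accepts any two-character label, a two-character domain name that sits behind a subdomain is absorbed into the TLD. The quick check below illustrates this with news.ab.com, a hostname made up for this example.

```python
import re

# Same pattern as in the article's code.
p_tld = re.compile(r"\.(?P<tld>(?:\w{2}\.)?\w{2,5})$")

# "ab" is a two-character domain name, not a second-level label like "co",
# but the pattern cannot tell the difference.
m = p_tld.search("news.ab.com")
print(m.group("tld"))  # ab.com

# The pattern needs a dot in front of the captured suffix. Here the only dot
# is the one before "com", so the same domain is split as expected.
m = p_tld.search("ab.com")
print(m.group("tld"))  # com
```

Restricting the optional group to known second-level labels would avoid this, at the cost of missing suffixes that are not listed.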
