I want to extract three parts from a hostname: the subdomain, the domain, and the top-level domain (TLD). (For background on domain configuration, I recommend GoDaddy's explanation on YouTube.)
For example, take the hostname www.facebook.com:
Subdomain: www (it carries no meaning, so I omit it), Domain: facebook, Top-level domain: com
However, I also want to handle suffixes like co.jp, which are common in Japan, and treat them here as a single **combined TLD** (although strictly speaking co.jp is not a TLD). For example, for news.yahoo.co.jp
Subdomain: news, Domain: yahoo, Top-level domain: co.jp
is the result I want.
Incidentally, Japan is not the only country that uses this co.xx pattern.
Some of the countries using .co as a second-level domain include India (.in), Indonesia (.id), Israel (.il), the United Kingdom (.uk), South Africa (.za), Costa Rica (.cr), New Zealand (.nz), Japan (.jp), South Korea (.kr) and Cook Islands (.ck).
Ref: Wikipedia
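The country list above can be turned into a simple lookup. The following is a minimal sketch and not part of the original script: the names `CO_COUNTRY_CODES` and `has_co_suffix` are hypothetical, and the set contains only the countries quoted above, so it should not be read as exhaustive.

```python
# Country-code TLDs quoted above as using "co" as a second-level domain.
# Assumption: this set mirrors the quoted list only and is not exhaustive.
CO_COUNTRY_CODES = {"in", "id", "il", "uk", "za", "cr", "nz", "jp", "kr", "ck"}


def has_co_suffix(hostname):
    """Return True if hostname ends with co.<cc> for a listed country code."""
    labels = hostname.lower().split(".")
    # Need at least "<name>.co.<cc>", i.e. three labels.
    if len(labels) < 3:
        return False
    return labels[-2] == "co" and labels[-1] in CO_COUNTRY_CODES


for host in ["news.yahoo.co.jp", "www.bbc.co.uk", "google.com", "google.jp"]:
    print(host, has_co_suffix(host))
```

A lookup like this is stricter than a pattern match, but it has to be maintained by hand. The script in this post uses a regular expression instead.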
import re

# TLD pattern: an optional two-character label (such as "co") followed by
# a final label of 2 to 5 characters, anchored at the end of the hostname.
p_tld = re.compile(r"\.(?P<tld>(?:\w{2}\.)?\w{2,5})$")

test = [
    "amazon.co.jp",
    "amazon.com",
    "news.yahoo.co.jp",
    "news.yahoo.jp",
    "news.yahoo.com",
    "google.jp",
    "google.co.jp",
    "google.com",
    "www.microsoft.com",
]

for t in test:
    print(t)
    # Drop the leading www
    t = re.sub(r"^www\.", "", t)
    # Find the TLD part
    m = p_tld.search(t)
    if m is not None:
        print("tld:", m.group("tld"))
    # Cut the TLD part
    t = p_tld.sub("", t)
    # What remains is either "subdomain.domain" or just "domain"
    subdomain = t.split(".")
    if len(subdomain) > 1:
        print("subdomain:", subdomain[0])
        print("domain:", subdomain[1])
    else:
        print("domain:", subdomain[0])
    print("--------")

Execution result:
amazon.co.jp
tld: co.jp
domain: amazon
--------
amazon.com
tld: com
domain: amazon
--------
news.yahoo.co.jp
tld: co.jp
subdomain: news
domain: yahoo
--------
news.yahoo.jp
tld: jp
subdomain: news
domain: yahoo
--------
news.yahoo.com
tld: com
subdomain: news
domain: yahoo
--------
google.jp
tld: jp
domain: google
--------
google.co.jp
tld: co.jp
domain: google
--------
google.com
tld: com
domain: google
--------
www.microsoft.com
tld: com
domain: microsoft
--------
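For reuse, the same steps can be wrapped in a function that returns the parts instead of printing them. This is only a sketch under the same assumptions as the script above: `split_hostname` and `P_TLD` are hypothetical names, the regex is copied unchanged, and returning `None` for a missing part is my own choice. Note that the pattern treats any two-character label before the final label as part of the TLD, so a hostname such as news.ab.com would come back with domain `news` and TLD `ab.com`.

```python
import re

# Same pattern as in the script above: an optional two-character label
# (such as "co") followed by a final label of 2 to 5 characters.
P_TLD = re.compile(r"\.(?P<tld>(?:\w{2}\.)?\w{2,5})$")


def split_hostname(hostname):
    """Return (subdomain, domain, tld); subdomain is None when absent."""
    # Drop a leading "www." because it carries no information here.
    host = re.sub(r"^www\.", "", hostname)
    m = P_TLD.search(host)
    if m is None:
        # No TLD matched: return the whole string as the domain.
        return None, host, None
    tld = m.group("tld")
    # Everything before the matched TLD is "subdomain.domain" or "domain".
    labels = host[:m.start()].split(".")
    if len(labels) > 1:
        # Only the first two labels are used, as in the script above.
        return labels[0], labels[1], tld
    return None, labels[0], tld


print(split_hostname("news.yahoo.co.jp"))   # ('news', 'yahoo', 'co.jp')
print(split_hostname("www.microsoft.com"))  # (None, 'microsoft', 'com')
```

Returning a tuple keeps the parsing separate from the printing, which makes the function easier to check against the test hostnames listed earlier.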