Python regular expressions: How to identify and extract only valid date representations from input data

What is a regular expression?

You can use regular expressions to search for and replace complex patterns and strings. This time, I will use regular expressions to **read the input data and extract only the strings that represent dates**.

Since these are the basics, I will skip the details. Please read the official documentation listed in the references below.

Personally, I find regular expressions somewhat complicated and hard to grasp, but I'm gradually getting used to them.

Use the standard library re

import re

Preparation of input data

First, prepare the date data. I stored a variety of strings in a list named DATE. Some have nothing to do with dates, some are near misses, and some differ only in their delimiters.

date.py


DATE = ["2020/01/05",
        "2020/1/5",
        "2020年1月5日",
        "2020-1-5",
        "2020/1/5",
        "2020.1.5",
        "2020/20/20",
        "2020 1 5",
        "2020 01 05",
        "1995w44w47",
        "Thank you",
        "1998/33/52",
        "3020/1/1",
        ]

Regular expressions for dates

For example, if you type "today" on a smartphone or computer, predictive conversion may offer candidates such as "2020年1月5日", "2020/01/05", and "令和2年1月5日" (Reiwa era notation). This time, we will stick to the Western calendar and handle dates written in year-month-day order, as in YYYY-MM-DD.

A commonly used sample regular expression for dates looks like this: ^\d{4}-\d{1,2}-\d{1,2}$.

But this is still too naive. The only notation it supports is a string separated by -. That's where \D comes in.

\D represents any character that is not a digit. With the re.ASCII flag it is equivalent to [^0-9]. You can therefore use it to match anything other than a digit, such as a hyphen, a letter, a space, or a dot.
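As a quick check of that difference, the sketch below compares the hyphen-only pattern with a version that uses \D for the separators. The names hyphen_only and any_separator are made up for this illustration and do not appear in the rest of the article.

```python
import re

# Hypothetical names for illustration only.
hyphen_only = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$")      # separators must be "-"
any_separator = re.compile(r"^\d{4}\D\d{1,2}\D\d{1,2}$")  # separators: any non-digit

samples = ["2020-1-5", "2020/1/5", "2020.1.5", "2020 1 5"]

# For each sample, record (hyphen-only matched?, \D version matched?).
results = {
    s: (bool(hyphen_only.search(s)), bool(any_separator.search(s)))
    for s in samples
}

for text, (hyphen_hit, any_hit) in results.items():
    print(f"{text!r}: hyphen only = {hyphen_hit}, any separator = {any_hit}")
```

Only "2020-1-5" matches the hyphen-only pattern, while all four samples match the \D version.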

Let's write the whole pattern in one go.

date_type.py


date_type = re.compile(r"""(
    (^\d{4})        # 4-digit number at the start (year)
    (\D)            # Any single non-digit character
    (\d{1,2})       # 1- or 2-digit number (month)
    (\D)            # Any single non-digit character
    (\d{1,2})       # 1- or 2-digit number (day)
    )""", re.VERBOSE)

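That's the pattern. It is compiled with re.compile(), and the re.VERBOSE flag lets us spread it over several lines with comments.

Compared with the sample pattern shown earlier, the $ is gone. $ checks that the match reaches the end of the string, but this time the string does not necessarily end with the \d{1,2} for the day. The input data contains a Japanese-style date, 2020年1月5日, which ends with the character 日 ("day"), and other strings may carry trailing characters as well, so a fixed $ cannot be used.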
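The effect of the trailing $ can be seen with a small sketch. It uses a string that ends in a non-digit character (the Japanese-style date 2020年1月5日) and compares the same pattern with and without $. The variable names are hypothetical and chosen only for this example.

```python
import re

# Hypothetical names: the same pattern with and without a trailing "$".
with_end_anchor = re.compile(r"^\d{4}\D\d{1,2}\D\d{1,2}$")
without_end_anchor = re.compile(r"^\d{4}\D\d{1,2}\D\d{1,2}")

text = "2020年1月5日"

anchored = with_end_anchor.search(text)        # "$" requires a digit at the very end
unanchored = without_end_anchor.search(text)   # trailing characters are ignored

print(anchored)            # None
print(unanchored.group())  # 2020年1月5
```

One side effect worth knowing: without $, anything after the day is ignored, so a string such as "2020/1/5xyz" also matches.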

Extract the date

Now that the pattern is ready, let's extract the dates. First, use the .search() method to print the part of each string that matches the regular expression.

hit_data_1.py


for date in DATE:
    # Store the match result in hit_date (None if there is no match)
    hit_date = date_type.search(date)
    print(hit_date)

Output result_1.py


<re.Match object; span=(0, 10), match='2020/01/05'>
<re.Match object; span=(0, 8), match='2020/1/5'>
<re.Match object; span=(0, 8), match='2020年1月5'>
<re.Match object; span=(0, 8), match='2020-1-5'>
<re.Match object; span=(0, 8), match='2020/1/5'>
<re.Match object; span=(0, 8), match='2020.1.5'>
<re.Match object; span=(0, 10), match='2020/20/20'>
<re.Match object; span=(0, 8), match='2020 1 5'>
<re.Match object; span=(0, 10), match='2020 01 05'>
<re.Match object; span=(0, 10), match='1995w44w47'>
None
<re.Match object; span=(0, 10), match='1998/33/52'>
<re.Match object; span=(0, 8), match='3020/1/1'>

Naturally, None was returned for "Thank you". The other strings all still match.

Next, convert the result to bool to skip None, and when it is True, get the captured groups as a tuple with .groups(). Let's improve the script a little.

hit_data_2.py


for date in DATE:
    # Store the match result in hit_date (None if there is no match)
    hit_date = date_type.search(date)
    bool_value = bool(hit_date)
    if bool_value:
        # Tuple of all captured groups
        split = hit_date.groups()
        print(split)

Output result_2.py


('2020/01/05', '2020', '/', '01', '/', '05')
('2020/1/5', '2020', '/', '1', '/', '5')
('2020年1月5', '2020', '年', '1', '月', '5')
('2020-1-5', '2020', '-', '1', '-', '5')
('2020/1/5', '2020', '/', '1', '/', '5')
('2020.1.5', '2020', '.', '1', '.', '5')
('2020/20/20', '2020', '/', '20', '/', '20')
('2020 1 5', '2020', ' ', '1', ' ', '5')
('2020 01 05', '2020', ' ', '01', ' ', '05')
('1995w44w47', '1995', 'w', '44', 'w', '47')
('1998/33/52', '1998', '/', '33', '/', '52')
('3020/1/1', '3020', '/', '1', '/', '1')

Good, we're almost there. The information we want, the year, month, and day, is stored at indexes [1], [3], and [5] respectively. Use tuple unpacking to separate them.

The values in the tuple are of type <class 'str'>, so let's convert them to int. That makes the range checks easier.
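As an alternative to positional indexes, named groups can make the unpacking easier to read. This is only a sketch of that design choice, not the pattern used in this article: the name named_date and the group names year, month, and day are assumptions made for the example.

```python
import re

# Sketch only: same shape as the article's pattern, but with named groups.
named_date = re.compile(r"""
    ^(?P<year>\d{4})     # 4-digit year at the start
    \D                   # any single non-digit separator
    (?P<month>\d{1,2})   # 1- or 2-digit month
    \D                   # any single non-digit separator
    (?P<day>\d{1,2})     # 1- or 2-digit day
    """, re.VERBOSE)

m = named_date.search("2020/01/05")

# groupdict() maps each group name to the string it captured.
parts = {name: int(value) for name, value in m.groupdict().items()}
year, month, day = parts["year"], parts["month"], parts["day"]

print(year, month, day)  # 2020 1 5
```

Because the separators are not captured here, there is no need to remember that the useful values sit at every other index.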

Next, check whether the int year, month, and day are implausible values. I exclude years beyond 3000 because they rarely come up in everyday use. A month of 13 or more and a day of 32 or more cannot exist, so those are rejected too. Note that this rough check does not reject a month or day of 0, and a stricter check would have to consider month lengths and leap years, so feel free to change the conditions here.

Putting this together gives the following.

hit_data_3.py


for date in DATE:
    # Store the match result in hit_date (None if there is no match)
    hit_date = date_type.search(date)
    bool_value = bool(hit_date)
    if bool_value:
        split = hit_date.groups()

        # Tuple unpacking: year, month, day are at indexes 1, 3, 5
        year, month, day = int(split[1]), int(split[3]), int(split[5])

        # Rough range check (does not consider month lengths or leap years)
        if year > 3000 or month > 12 or day > 31:
            print("False")
        else:
            print(year, month, day)

Output result_3.py


2020 1 5
2020 1 5
2020 1 5
2020 1 5
2020 1 5
2020 1 5
False
2020 1 5
2020 1 5
False
False
False

I think this extracts only the strings that look like dates.
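The range check in the script still accepts impossible dates such as February 30. If a stricter check is needed, one option is to let the standard library's datetime.date do the validation, since it raises ValueError for dates that do not exist. The helper name is_real_date below is hypothetical, and this is a sketch rather than part of the original script.

```python
from datetime import date


def is_real_date(year, month, day):
    """Return True if the given year, month, and day form a real calendar date.

    Hypothetical helper for illustration.
    """
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


print(is_real_date(2020, 2, 29))   # True: 2020 is a leap year
print(is_real_date(2021, 2, 29))   # False: 2021 is not a leap year
print(is_real_date(2020, 20, 20))  # False: there is no month 20
print(is_real_date(2020, 4, 31))   # False: April has 30 days
```

datetime.date accepts years from 1 to 9999, so a cutoff such as "no years beyond 3000" would still need its own condition.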

Completed sample code

main.py


import re

# data of date
DATE = ["2020/01/05",
        "2020/1/5",
        "2020年1月5日",
        "2020-1-5",
        "2020/1/5",
        "2020.1.5",
        "2020/20/20",
        "2020 1 5",
        "2020 01 05",
        "1995w44w47",
        "Thank you",
        "1998/33/52",
        "3020/1/1",
        ]

# Regular expression for date-like strings
date_type = re.compile(r"""(
    (^\d{4})        # 4-digit number at the start (year)
    (\D)            # Any single non-digit character
    (\d{1,2})       # 1- or 2-digit number (month)
    (\D)            # Any single non-digit character
    (\d{1,2})       # 1- or 2-digit number (day)
    )""", re.VERBOSE)

for date in DATE:
    # Store the match result in hit_date (None if there is no match)
    hit_date = date_type.search(date)
    bool_value = bool(hit_date)
    if bool_value:
        split = hit_date.groups()

        # Tuple unpacking: year, month, day are at indexes 1, 3, 5
        year, month, day = int(split[1]), int(split[3]), int(split[5])

        # Rough range check (does not consider month lengths or leap years)
        if year > 3000 or month > 12 or day > 31:
            print("False")
        else:
            print(year, month, day)

Output result.py


2020 1 5
2020 1 5
2020 1 5
2020 1 5
2020 1 5
2020 1 5
False
2020 1 5
2020 1 5
False
False
False
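If the results are needed as data rather than printed output, the same steps can be wrapped in a function. The following is a minimal sketch under a few assumptions: the function name extract_valid_dates and the max_year parameter are hypothetical, only the three numeric parts are captured, and unlike the script above it silently skips invalid entries and also rejects a month or day of 0.

```python
import re

# Same idea as the article's pattern, capturing only year, month, and day.
date_type = re.compile(r"^(\d{4})\D(\d{1,2})\D(\d{1,2})")


def extract_valid_dates(items, max_year=3000):
    """Return (year, month, day) tuples for strings that look like plausible dates."""
    found = []
    for item in items:
        hit = date_type.search(item)
        if hit is None:
            continue  # not date-like at all
        year, month, day = (int(part) for part in hit.groups())
        # Rough range check; month lengths and leap years are not considered.
        if year <= max_year and 1 <= month <= 12 and 1 <= day <= 31:
            found.append((year, month, day))
    return found


sample = ["2020/01/05", "2020.1.5", "2020/20/20", "Thank you", "3020/1/1"]
print(extract_valid_dates(sample))  # [(2020, 1, 5), (2020, 1, 5)]
```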

Summary

How did it go? There may well be better ways, but this is the best I could come up with.

I ran into this problem while developing a tool to streamline my daily work, so I wrote it up on Qiita as well.

I hope it helps. The code is also on GitHub.

References

- Python standard library: re --- Regular expression operations
- I don't want to google the regular expressions I use often!
