[PYTHON] Extract rows that meet the conditions from Excel containing date data (% Y /% m /% d)

Overview

-Pd.to_datetime is convenient. -When doing pd.read_csv, you need to be careful about the character code. ・ Sea born may be good.

About the data to be used

Excel file ・ Including date data ・ There are various rows ・ The first line contains useless data ○○,○○,2016/8/11,○○,○○ Data like. The element name of the date is assumed to be "deadline". To write a little more about the data,

..., subject (string), ..., deadline (% Y /% m /% d), ...,

Load data (pandas)

First of all, I exported it from Excel in csv format in preparation for reading it from pandas as a csv file. Next, read as a DataFrame.

data = pd.read_csv('./File.csv', encording='Shift-jis', skiprows=1)

It took a while to encording ='Shift-jis'. With Python3, I had the image that no error would occur even if I didn't care about the character code, but that wasn't the case.

Extraction of the target location

This time, let's extract the place where the date part is in the future. Therefore, the data extraction of the date part is compared with today's date and held as a filter.

myfilter = pd.to_datetime(data['Deadline']. format='%Y/%m/%d', error='coerce' > datetime.datetime.today()

If you want to extract the part that has passed the deadline, you can afford to modify it.

myfilter = pd.to_datetime(data['Deadline']. format='%Y/%m/%d', error='coerce' < datetime.datetime.today()

It took me a while to get to pd.to_datetime. Pandas has a very convenient date manipulation method, right? http://sinhrks.hatenablog.com/entry/2014/11/09/183603 I referred to.

error ='coerce' also got caught a little. There was a line in the date data where I did not enter the date but entered the symbol, which caused the error. This time, I decided to set the error to NaT. There are two other methods. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html Please refer to.

Other patterns (calculation of days)

If you want to know how many days have passed from the set deadline to today, you can subtract the datetime. For example

days = datetime.datetime.today() - pd.to_datetime(data['Deadline'], format=%Y/%m/%d, errors='coerce')

If you want to plot this, you can not output it as it is (it is said that it is not numerical data), so for example

days = days / np.timedelta64(1, 'D')

It is necessary to make a float type by performing a calculation. It took me some time to get here.

Other filter creation patterns (extract NaT parts)

If you set errors ='coerce' when doing pd.to_datetime (), the error will be NaT, but if you want to extract NaT, for example, if you want to extract data with no date data ,

filter = pd.to_datetime(...).isnull()

is. NaT is isnull (). This also took some time to search and find.

Creating data for output

After that, I just output using a filter, but I only had to output specific information (columns), so

print(data[myfilter]['Name of a particular column'])

It was made. After that, if you change it appropriately, you can output it as a file, so I will extract specific data (lines) from a big Excel file and visualize the days until the deadline.

Graphing

Data of DataFrame of pandas seems to be good to graph with seaborn. If you want a bar graph with the subject on the vertical axis and the horizontal axis on the number of days,

sns.barplot(x='Days', y='subject',data=data)

Feeling like that. If you specify the column name you want to use for the graph with x and y, seaborn will organize it nicely.

For reference, https://stanford.edu/~mwaskom/software/seaborn/examples/horizontal_boxplot.html is.

If you want to fine-tune the graph position of seaborn (matplotlib),

plt.subplot_adjust(left=*, right=*, top=*, bottom=*)To use.


 This can be felt by actually displaying the graph and making fine adjustments, but it seems to be the position of the left end and the right end of the graph when the rightmost end of the visible area of the graph is 1.0.
 Therefore, if you make it larger than right = 1.0, the graph will extend beyond the visible area. For example, if left = 0.5 and right = 1.0, half of the graph area will be used and the left side will be a lot of empty graph. And always left <right. The relationship between top and bottom is similar.

## If you want to use Japanese fonts in graphs

```python
fp = FontProperties(fname='C:\Windows\Fonts\YUGOTHL.TTC`. size=10)

Somehow,

plt.xtics(fontproperties=fp)
plt.ytics(fontproperties=fp)

If you do, Japanese characters will not be garbled on either the x-axis or the y-axis. You can use any font you like. You can find it by looking it up in Explorer. However, since it will be a GUI for viewing fonts, it may be a little troublesome to see the path.

If you want to use Japanese on the label,

plt.ylabel('label', fontproperties=fp)

is.

Display or save the graph

The display of the plotted graph is

plt.show()

If you want to save

plt.savefig('file name')

When saving a file, you can also specify a resolution such as ``` plt.savefig ('filename', dpi = 300)` ``. If you want to use the date in the file name, for example

today = datetime.datetime.now()
plt.savefig("file name_{}-{}-{}.png ".format(today.year, today.month, today.day))

There is a way to do this (example: file name_2016-08-14.png). I think there is a smarter way, but I only know this. However, the format function is convenient, isn't it?

Recommended Posts

Extract rows that meet the conditions from Excel containing date data (% Y /% m /% d)
Extract lines that match the conditions from a text file with python
[python] Move files that meet the conditions
python> Get yyyymmdd from date type> parsed = time.strptime (mddt) / yyyymmdd = time.strftime ("% Y% m% d", parsed)
Delete vertices that meet the conditions in networkx
The one that graphs the one that got the fitbit data from the server
Extract only elements that meet specific conditions in Python