[Python] Learn from accounting data and predict the account from the description when entering journal entries

Introduction

When it comes time to file a tax return and you recheck the accounting data, you always find a few entries posted to the wrong account. So I thought it would be useful to train a machine-learning model on the accounting data and have it predict the account.

Reading accounting data

Export the data from the accounting software in CSV format. In this example the data comes from the software "JDL IBEX Treasurer".

http://www.jdl.co.jp/co/soft/ibex-ab/

Load the data with the following code.

python


import pandas as pd

filename = "JDL account book-xxxx-xxxx-Journal.csv"
df = pd.read_csv(filename, encoding="Shift-JIS", skiprows=3)

Narrow down the read data to the columns needed here: the description, and the debit account's code and official name.

python


columns = ["Description", "Debit subject", "Official name of the debit item"]
df_counts = df[columns].dropna()

Morphological analysis

The description text has to be turned into numbers before it can be used as input. Morphological analysis splits each description into words so the text can then be vectorized.

A library called Janome is used for morphological analysis.

http://mocobeta.github.io/janome/

If it is not installed, you need to install it with the following command.

shell


$ pip install janome

The following code converts the description data into token data.

python


from janome.tokenizer import Tokenizer

t = Tokenizer()

notes = []
for ix in df_counts.index:
    note = df_counts.loc[ix, "Description"]
    # normalize full-width spaces (U+3000) to half-width before tokenizing
    tokens = t.tokenize(note.replace('\u3000', ' '))
    words = ""
    for token in tokens:
        words += " " + token.surface
    # drop any full-width space tokens that survive tokenization
    notes.append(words.replace(' \u3000', ''))

As a result, the conversion below is performed, producing a string with a half-width space between words.

Original description: "Souvenir fee BLUE SKY Haneda"
After conversion: "Souvenir fee BLUESKY Haneda"

This character string is vectorized with the following code and used as input data.

python


from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
vect.fit(notes)

X = vect.transform(notes)

The debit account code is used as the teacher (target) data.

python


y = df_counts["Debit subject"].astype(int).values  # .values replaces the removed as_matrix()

Machine learning

The numerical data is split into training data and validation data (a simple 80/20 holdout split).

python


from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20

test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

Learn using the divided data. Here we use LinearSVC as the model.

python


from sklearn.svm import LinearSVC

clf = LinearSVC(C=120.0, random_state=42)
clf.fit(X_train, y_train)

clf.score(X_test, y_test)

The score was 0.89932885906040272, i.e. roughly 90% of the validation data was classified into the correct account.
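The value C=120.0 above looks hand-tuned; a common way to choose it is a small grid search over candidate values. A sketch on synthetic data (the real TF-IDF features are not available here, so make_classification stands in for them):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# synthetic stand-in for the TF-IDF matrix and account codes
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# try a few candidate values of C, scored by 5-fold cross validation
grid = GridSearchCV(
    LinearSVC(random_state=42, max_iter=10000),
    param_grid={"C": [1.0, 10.0, 100.0, 120.0]},
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```

The best C found on synthetic data means nothing for the journal data, of course; the point is only the search pattern.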

Forecast

Using the trained model, feed it test data like the following and see which accounts it predicts.

python


tests = [
    "Expressway usage fee",
    "PC parts cost",
    "Stamp fee"
]

notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)

X = vect.transform(notes)

result = clf.predict(X)

df_rs = df_counts[["Official name of the debit item", "Debit subject"]]
df_rs.index = df_counts["Debit subject"].astype("int")
df_rs = df_rs[~df_rs.index.duplicated()]["Official name of the debit item"]

for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")  # .loc replaces the removed .ix

The output result is ...

text


Expressway usage fee 	[ Travel expenses transportation ]
PC parts cost 	[ supplies expense ]
Stamp fee 	[ Communication costs ]

It feels pretty good (^-^)

Transfer slips, by the way, will need a little more ingenuity.

I suspect it would do even better with extra information such as the month, the day of the week, and which financial statement the account belongs to, but I wasn't sure how to fold that into the training data, so I'll look into it later.
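For what it's worth, one possible way to combine such extra features with the text (a sketch I have not tried on the real data): one-hot encode the categorical feature, e.g. the month, and stack it next to the TF-IDF matrix with scipy.sparse.hstack. The notes and months below are made up.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# hypothetical descriptions and the month each entry was booked in
notes = ["taxi fare", "office supplies", "postage stamp"]
months = np.array([[1], [2], [1]])

vect = TfidfVectorizer()
X_text = vect.fit_transform(notes)    # (3, 6) TF-IDF text features

enc = OneHotEncoder(handle_unknown="ignore")
X_month = enc.fit_transform(months)   # (3, 2) one-hot month features

X = hstack([X_text, X_month])         # combined (3, 8) feature matrix
print(X.shape)
```

The combined sparse matrix can be passed to LinearSVC exactly like the text-only features.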

Well, what should I do next?

(Bonus) Separation of learning and prediction

If you use the program above as is, every run re-reads the raw data and re-trains the model before predicting, which is inefficient. Instead, save the trained model once, and have the prediction step simply load it.

Saving learning results

You can save the trained model by adding the following code to the end of the program above.

python


import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

joblib.dump(vect, 'data/vect.pkl')
joblib.dump(clf, 'data/clf.pkl')
df_rs.to_csv("data/code.csv", header=False)  # no header row, to match the loader

Reading learning results

In the new program, load the trained model as follows.

python


import pandas as pd

filename = "data/code.csv"
df = pd.read_csv(filename, header=None)
df.index = df.pop(0)
df_rs = df.pop(1)

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

clf = joblib.load('data/clf.pkl')
vect = joblib.load('data/vect.pkl')

Forecast

After loading the trained model, run the prediction as before.

python


from janome.tokenizer import Tokenizer

t = Tokenizer()
tests = [
    "Expressway usage fee",
    "PC parts cost",
    "Stamp fee",
]

notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)

X = vect.transform(notes)

result = clf.predict(X)

for i in range(len(tests)):
    print(tests[i], "\t[",df_rs.loc[result[i]], "]")

The execution result is ...

text


Expressway usage fee 	[ Travel expenses transportation ]
PC parts cost 	[ supplies expense ]
Stamp fee 	[ Communication costs ]

I did it!
