[PYTHON] I tried the data preparation tool Paxata

I work as a contractor analyzing data for client companies. The other day, a client who was considering adopting Paxata asked me to evaluate it, which gave me the chance to use it on a trial basis. Paxata is a data preparation tool that was acquired by DataRobot in 2019 [^1]. It can be used in two ways: as a subscription service, or installed on a VM in Azure / AWS. This time it was the latter.

Impressions

These are only my impressions. Whether each point counts as an advantage or a disadvantage depends on the situation.

- Even though it is a no-code tool, some programming-style thinking is required
  - No-code tools are not magic
  - The developer still has to design how the parts are combined to process the data, so it is a high hurdle for people who cannot program at all
- Processing is highly visible, so it is easy to see what is being done
  - If you have been writing detailed design documents, they become unnecessary
- There is a preview of processing results and automatic name matching for strings
  - With replacement operations, the preview lets you compare values before and after the replacement
  - If strings differ only in notation (for example, two ways of writing "Co., Ltd."), they are matched automatically
- You can't do very complicated processing
  - Processing is serial and cannot be nested, branched, or repeated
  - No matter who builds it, the result comes out at the same level (~~SIers would probably like this~~)
- You can't export the processing you built to Python
  - Vendor lock-in
  - Currently, only DataRobot is supported for machine learning integration
  - To use the processed data with scikit-learn, you first have to export it to a DB or a file
- It is difficult to incorporate review and deployment processes
  - There is no concept of deployment such as development / production environments, so during maintenance you work directly on what is running in production
  - Reviewing is difficult because you can't open pull requests or view diffs as you can with git
  - There is no test feature like pytest or JUnit
  - There is a version control feature, so you can revert to a previous version
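Two of the points above, name matching and the lack of a pytest-style test feature, can be illustrated in plain Python. This is only a minimal sketch and not how Paxata does it: `normalize_company_name` and its suffix pattern are hypothetical, and the check is written as a plain `assert`-based function of the kind pytest would collect.

```python
import re

# Hypothetical pattern covering a few spellings of "Co., Ltd." (illustrative only).
SUFFIX_PATTERN = re.compile(r"\bco\.?,?\s*ltd\.?$", re.IGNORECASE)


def normalize_company_name(name: str) -> str:
    """Collapse whitespace and unify known spellings of the company suffix."""
    cleaned = re.sub(r"\s+", " ", name).strip()
    match = SUFFIX_PATTERN.search(cleaned)
    if match is None:
        # No known suffix: only whitespace is normalized.
        return cleaned
    stem = cleaned[: match.start()].rstrip()
    return f"{stem} Co., Ltd.".strip()


def test_normalize_company_name():
    # Different spellings of the same suffix should end up identical.
    assert normalize_company_name("Example Co.,Ltd.") == "Example Co., Ltd."
    assert normalize_company_name("Example  co ltd") == "Example Co., Ltd."
    # Names without the suffix are left alone apart from whitespace.
    assert normalize_company_name(" Example  Inc. ") == "Example Inc."


test_normalize_company_name()
```

With code like this, the normalization rule lives in a file that can be diffed, reviewed in a pull request, and tested, which is exactly what is hard to do inside a GUI tool.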

Actual operation

Paxata consists of three components:

| # | Component | Description |
|---|--------------|-------------|
| 1 | Library | Manage datasets (project output is also managed here) |
| 2 | Project | Definition of data processing |
| 3 | Project flow | Definition of project processing flow and execution schedule |

The general development flow is:

  1. Import the dataset into the library
  2. Define the processing in a project
  3. Schedule the processing in a project flow
  4. Check the processing result in the library

Import dataset to library

If you try importing a CSV file, it will look like this. The data was borrowed from here. image.png image.png

A feature called "Profile" will give you information about basic statistics and categories for each column. image.png
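For comparison, a rough equivalent of this kind of per-column profile can be sketched with the Python standard library. The function `profile_columns` and the sample data are made up for illustration; what Paxata actually computes in its Profile view may differ.

```python
import csv
import io
import statistics
from collections import Counter


def profile_columns(rows):
    """Return a per-column summary: numeric statistics or category counts."""
    profile = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows if row[column] != ""]
        if not values:
            profile[column] = {"type": "empty"}
            continue
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            # At least one value is not a number: treat the column as categorical.
            profile[column] = {
                "type": "category",
                "distinct": len(set(values)),
                "top": Counter(values).most_common(3),
            }
        else:
            profile[column] = {
                "type": "numeric",
                "count": len(numbers),
                "min": min(numbers),
                "max": max(numbers),
                "mean": statistics.mean(numbers),
            }
    return profile


# Made-up sample data standing in for an imported CSV file.
SAMPLE_CSV = """city,price
Tokyo,100
Osaka,250
Tokyo,175
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column, summary in profile_columns(rows).items():
    print(column, summary)
```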

Profile results are also managed in the library. image.png

Define processing in project

Let's create a project with the imported data. image.png

When you change a column's data type or replace its values, you get a preview of the processing result like this. image.png image.png
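The idea of previewing a type change before applying it can be sketched in Python as follows. The helper `preview_conversion` and the sample values are hypothetical and only mimic the concept, not Paxata's behavior.

```python
def preview_conversion(values, convert, limit=5):
    """Pair each original value with its converted form, or None on failure."""
    preview = []
    for value in values[:limit]:
        try:
            preview.append((value, convert(value)))
        except ValueError:
            # Keep the failure visible instead of aborting the whole preview.
            preview.append((value, None))
    return preview


# Made-up string column that should become integers.
raw_prices = ["100", "250", "n/a", "175"]
for before, after in preview_conversion(raw_prices, int):
    print(f"{before!r:>8} -> {after!r}")
```

Showing the before and after values side by side makes it obvious which rows would fail the conversion before anything is overwritten.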

You can also create a new column using Excel-like functions with a tool called "Calculation". image.png

The syntax is quite strict. image.png image.png

You can also aggregate with a tool called "aggregate". However, this is the kind of aggregation whose result is added as a new column, as in count encoding. image.png
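As a reminder of what count encoding means, here is a minimal Python sketch. The function `count_encode` and the sample rows are assumptions for illustration: each row keeps its place, and a new column holds how often that row's category appears in the data.

```python
from collections import Counter


def count_encode(rows, column, new_column):
    """Add a column holding how many rows share each row's value in `column`."""
    counts = Counter(row[column] for row in rows)
    # The number of rows is unchanged; only a column is added.
    return [{**row, new_column: counts[row[column]]} for row in rows]


# Made-up sample rows.
rows = [{"city": "Tokyo"}, {"city": "Osaka"}, {"city": "Tokyo"}]
encoded = count_encode(rows, "city", "city_count")
for row in encoded:
    print(row)
```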

For ordinary (?) aggregation, where rows are collapsed into groups, use a tool called "Shape". image.png
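By "ordinary" aggregation I mean the group-by style, where the number of rows shrinks to one per group. A small Python sketch of that idea follows; `group_sum` and the sample rows are hypothetical and say nothing about how the Shape tool is implemented.

```python
from collections import defaultdict


def group_sum(rows, key, value):
    """Collapse rows into one row per distinct `key`, summing `value`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return [{key: group, f"{value}_sum": total} for group, total in totals.items()]


# Made-up sample rows.
rows = [
    {"city": "Tokyo", "price": 100},
    {"city": "Osaka", "price": 250},
    {"city": "Tokyo", "price": 175},
]
for row in group_sum(rows, "city", "price"):
    print(row)
```

Unlike the column-adding style of aggregation, the output here has fewer rows than the input.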

Schedule processing in the project flow

Let's schedule the project we created. In addition to a time interval, you can also specify the schedule in crontab format. image.png

Displayed as a graph, it looks like this. Admittedly, there is only one project ... image.png
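For readers unfamiliar with crontab format, the sketch below checks whether a point in time matches a schedule expression. It assumes the common five-field layout (minute, hour, day of month, month, day of week) and supports only `*`, `*/n`, ranges, and lists. Whether Paxata accepts exactly this dialect is not something I verified, and the function names are my own.

```python
from datetime import datetime


def field_matches(field: str, value: int, minimum: int) -> bool:
    """Check one cron field against a value (supports *, */n, a-b, and a,b)."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if (value - minimum) % int(part[2:]) == 0:
                return True
        elif "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` satisfies a five-field crontab expression.

    Simplification: all five fields must match, whereas real cron treats
    day-of-month and day-of-week as alternatives when both are restricted.
    """
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, while isoweekday() counts Sunday as 7.
    cron_weekday = moment.isoweekday() % 7
    return (
        field_matches(minute, moment.minute, 0)
        and field_matches(hour, moment.hour, 0)
        and field_matches(day, moment.day, 1)
        and field_matches(month, moment.month, 1)
        and field_matches(weekday, cron_weekday, 0)
    )


# "Every 15 minutes, from 9:00 to 17:59, Monday to Friday."
print(cron_matches("*/15 9-17 * * 1-5", datetime(2024, 1, 15, 9, 30)))
```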

When executed, it looks like this. image.png

The processing result is managed in the library as an answer set. image.png

In closing

This article was written with permission from our client company and the Paxata distributor.
