[PYTHON] I looked into what BigGorilla is

What I did

While looking for material on data preprocessing, I came across a press release: [Recruit Artificial Intelligence Laboratory started offering "Big Gorilla", an open source ecosystem for data integration and preparation | Recruit Holdings](http://www.recruit.jp/news_data/release/2017/0630_17541.html).

At first glance I could not tell what it actually was, so I looked into it.

What I found

What is BigGorilla

BigGorilla - Data Integration & Preparation in Python

- A Python environment with the recommended data preprocessing libraries preinstalled
- Includes a few libraries developed by the project itself

From the name and the diagram on the official website it looked like a huge framework, but it is really an assortment of libraries. (It does not seem to be the kind of framework where you inherit from BigGorilla-specific classes.)

To actually do the preprocessing, you still write ordinary Python code.

Recommendation for building a portable Python environment with conda - Qiita

Installation method

# Install Anaconda first
$ conda env create biggorilla/py3gorilla
# If you use pyenv, call conda's activate script by its full path. A plain `source activate Py3Gorilla` makes the shell exit.
$ source /Users/kkanazaw/.pyenv/versions/anaconda3-4.2.0/envs/Py3Gorilla/bin/activate Py3Gorilla

Reference: Let's get started | BigGorilla

~~Addendum: When I tried it on July 12, 2017, this method failed with the error below. (An older yml seems to be applied, probably because the file updated in June has an odd name. It will probably be fixed in a future update.)~~

2017/07/21 postscript: The file has been updated. This should work as documented.

Log of the workaround installation as of 7/12

$ conda env create biggorilla/py3gorilla
Collecting urllib==1.21.1
Downloading urllib-1.21.1.tar.gz (226kB)
100% |████████████████████████████████| 235kB 640kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/bx/k4yrl_bd3nb0v8pz7fm60t8r0000gp/T/pip-build-58rsg5li/urllib/setup.py", line 191
s.connect((base64.b64decode(rip), 017620))
                                  ^
SyntaxError: invalid token
 ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/bx/k4yrl_bd3nb0v8pz7fm60t8r0000gp/T/pip-build-58rsg5li/urllib/
CondaValueError: Value error: pip returned an error.
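The SyntaxError in the log points at the literal `017620`. An integer written with a leading zero is the Python 2 spelling of an octal number, and Python 3 rejects it, so the failing setup.py looks like Python 2 only code being built in a Python 3 environment. The check below only illustrates that rule; `parses_in_python3` is a helper name made up for this sketch.

```python
import ast


def parses_in_python3(source):
    """Return True if `source` is a valid Python 3 expression."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


print(parses_in_python3("017620"))   # Python 2 octal spelling
print(parses_in_python3("0o17620"))  # Python 3 octal spelling
print(int("17620", 8))               # the value the literal stands for
```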

You can work around this by downloading the yml from Files :: Anaconda Cloud and removing the line that specifies urllib.
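Removing the line by hand in an editor is enough, but the same step can be sketched in Python. The function name `drop_pip_requirement` and the sample yml excerpt are assumptions made for illustration; the real Py3Gorilla.yml lists many more packages and may be laid out differently.

```python
import re


def drop_pip_requirement(yml_text, package):
    """Return yml_text without list entries that name exactly `package`."""
    # Matches "- urllib" and "- urllib==1.21.1"; the lookahead keeps
    # longer names such as "urllib3" untouched.
    pattern = re.compile(r"^\s*-\s*" + re.escape(package) + r"(?=[=<>!~\s]|$)")
    kept = [line for line in yml_text.splitlines() if not pattern.match(line)]
    return "\n".join(kept) + "\n"


# Hypothetical excerpt, not the actual contents of Py3Gorilla.yml.
sample = """name: Py3Gorilla
dependencies:
- pip:
  - urllib==1.21.1
  - urllib3==1.21.1
"""

cleaned = drop_pip_requirement(sample, "urllib")
print(cleaned)
```

Filtering line by line avoids needing a YAML parser, at the cost of only handling simple one-entry-per-line files.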

# Remove the environment first
$ conda env remove -n Py3Gorilla

# Recreate the environment from the locally modified yml file
$ conda env create --name test --file ~/Downloads/Py3Gorilla.yml

# If you use pyenv, call conda's activate script by its full path. A plain `source activate` makes the shell exit.
$ source /Users/kkanazaw/.pyenv/versions/anaconda3-4.2.0/envs/test/bin/activate test

# Download the notebook for checking the installation and start it
$ anaconda download biggorilla/hi_gorilla
$ jupyter notebook hi_gorilla.ipynb

What can it do? A look at the bundled libraries

The list of packages that get installed is in Files :: Anaconda Cloud, so I took a look. Only a small part of what the component list on the official website introduces is actually installed; compared to the description on the site, the environment is surprisingly minimal. Anything that is missing you have to pip install yourself.
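Besides reading the yml, you can also ask the activated environment which libraries it can actually import. This is a minimal sketch using the standard library; the import names on the right-hand side are my assumptions about each project's usual top-level module, and `installed_report` is a name made up here.

```python
from importlib.util import find_spec

# package name -> assumed top-level import name
CANDIDATES = {
    "requests": "requests",
    "scrapy": "scrapy",
    "tweepy": "tweepy",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "nltk": "nltk",
    "flexmatcher": "flexmatcher",
    "py-entitymatching": "py_entitymatching",
    "py-stringmatching": "py_stringmatching",
    "xlrd": "xlrd",
}


def installed_report(candidates):
    """Map each package name to True if its top-level module can be found."""
    return {pkg: find_spec(module) is not None for pkg, module in candidates.items()}


for pkg, found in installed_report(CANDIDATES).items():
    print(f"{pkg:20s} {'installed' if found else 'missing'}")
```

`find_spec` only looks the module up without importing it, so the check stays fast even for heavy libraries.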

Data collection

- urllib: standard library for HTTP access
- requests: HTTP access library with a richer interface than urllib
- scrapy: scraping
- (Tweepy is not included)

Data extraction

- beautifulsoup4: loading and parsing web pages
- lxml: XML parser
- nltk: natural language processing (morphological analysis, etc.)

Schema matching & merging

- FlexMatcher (made by Recruit's research lab, RIT)

Data matching & merging

- Magellan (developed at the University of Wisconsin)
  - Provided as the libraries py-entitymatching and py-stringmatching

Data conversion

- xlrd: working with Excel files
- json and csv from the standard library

Schema mapping

- Not included, perhaps because only commercial tools exist?

Workflow management

- This does not seem to be included either?

Other

scikit-learn and jupyter-notebook are also included, perhaps as dependencies.

About the libraries developed by the project itself

According to the press release, the following three libraries are original implementations.

Currently, RIT is developing packages called KOKO and FlexMatcher, and Professor Doan's team is developing a package called Magellan.

FlexMatcher
- Schema matching library made by Recruit's research lab (RIT)
- Automatically finds which columns correspond even when the item names differ between two datasets
- It seems to estimate similarity using the contents of the data as training data?
- (I am personally interested in this one)
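The idea of using column contents as training data can be illustrated with a toy matcher. This is not FlexMatcher's API or algorithm, which I have not verified; `match_columns`, `profile` and the sample tables are all made up for this sketch, which simply compares character bigram profiles with cosine similarity.

```python
from collections import Counter


def char_ngrams(value, n=2):
    """Count the character n-grams in the string form of one value."""
    text = str(value).lower()
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))


def profile(values, n=2):
    """Aggregate n-gram counts over all values of one column."""
    total = Counter()
    for value in values:
        total.update(char_ngrams(value, n))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sum(x * x for x in a.values()) ** 0.5 * sum(x * x for x in b.values()) ** 0.5
    return dot / norm if norm else 0.0


def match_columns(training, new_table):
    """Map each new column to the labeled column whose contents look most alike."""
    profiles = {label: profile(values) for label, values in training.items()}
    return {
        column: max(profiles, key=lambda label: cosine(profile(values), profiles[label]))
        for column, values in new_table.items()
    }


# Labeled columns act as the training data; the names in new_table differ.
training = {
    "year": [1994, 2001, 2017, 1987],
    "title": ["The Matrix", "Toy Story", "Spirited Away", "Blade Runner"],
}
new_table = {
    "released": [1999, 2010, 2004],
    "movie_name": ["The Lion King", "The Godfather", "Toy Story 3"],
}

mapping = match_columns(training, new_table)
print(mapping)
```

Here the column names `released` and `movie_name` share nothing with `year` and `title`, so any correspondence has to come from the values themselves.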

Magellan
- Data matching library developed at the University of Wisconsin
- It can probably merge records that differ only in notation, or do something like address deduplication?
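To make "merging records that differ only in notation" concrete, here is a rough sketch using only the standard library's difflib. It is not the py-entitymatching or py-stringmatching API; `normalize`, `match_records`, the 0.8 threshold and the sample records are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_records(left, right, threshold=0.8):
    """Pair each left record with its most similar right record, if close enough."""
    pairs = []
    for record in left:
        best = max(right, key=lambda candidate: similarity(record, candidate))
        if similarity(record, best) >= threshold:
            pairs.append((record, best))
    return pairs


left = ["Recruit Holdings Co., Ltd.", "University of Wisconsin-Madison"]
right = ["recruit holdings co ltd", "Univ. of Wisconsin Madison", "Anaconda Inc"]

for a, b in match_records(left, right):
    print(f"{a!r} <-> {b!r}")
```

A real entity matching library also has to deal with blocking (avoiding the all-pairs comparison) and learned matchers, which this sketch leaves out.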

KOKO
- Only the name appears in the press release. ~~Unpublished?~~
- The repository turned out to be public

What to do next

- Actually build the environment with conda env
- Try out FlexMatcher and Magellan
