[PYTHON] Use indicator with pd.merge

Overview

There is an argument ʻindicator` when merging with pandas, but although it is very convenient, I feel that there are few people using it, so this is an article for the purpose of spreading.

The Documentation states as follows.

indicator : bool or str, default False If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

Generally, if you join left, all records that exist only in the data on the left will be lost at the time of merge. However, when analyzing data, it is often the case that you want to see how the records are linked and which records are not linked. If you merge normally, it will be quite troublesome to check this, or later you will notice that it is not tied at all and it will be a disaster, but if you use ʻindicator` introduced in this article, you can easily check this. I will.

There is no particular problem if the target to be analyzed is only in-house data and all the data is completely linked with the key, but it is useful when pulling unfamiliar data from the outside and linking it with the company's own data. To do.

Let's take a concrete example.

Concrete example

Suppose you want to attach the weight data b picked up from the outside to the height data a you have.

a = pd.DataFrame({
    "ID" : [1, 2, 3, 4, 6],
    "name": ["Tom", "Bob", "Alex", "Mike", "John"],
    "heigt": [169, 173, 163, 170, 182]
})

b = pd.DataFrame({
    "ID" : [1, 2, 3, 4, 5],
    "weight": [65, 70, 72, 58, 61]
})
a
スクリーンショット 2019-12-22 11.58.39.png
b
スクリーンショット 2019-12-22 11.58.45.png

If you want to analyze without thinking

c = a.merge(b, on="ID", how="left")
# or
c = a.merge(b, on="ID", how="inner")

I think you should do it according to your purpose.

Here, b is the data brought from the outside, so we want to know the following.

――How much are a and b linked? --I want to see the details of the record where a (b) does not stick

You can easily check this with the ʻindicator option. It's easy to use, just add ʻindicator = True as shown below.

(Duplicate check of join key is not mentioned here)

c = a.merge(b, on="ID", how="outer", indicator=True)
c
スクリーンショット 2019-12-22 12.14.25.png

Then the _merge column will be added along with the merged dataset as above. The _merge column contains one of the values both, left_only, or right_only, which tells you where the record came from. After that, use this column

c_left_only = c[c["_merge"]=="left_only"]
c_left_only
スクリーンショット 2019-12-22 12.21.01.png Then you can get a list of people who are tall but not weight.

It is convenient to set how = "outer" at the time of merging and later extract the data with the value of "_merge" according to the purpose.

Also, ʻindicator` supports not only bool but also str type,

d = a.merge(b, on="ID", how="outer", indicator="flg_weight")
d
スクリーンショット 2019-12-22 12.22.58.png The `_merge` part can be the specified string.

It's a convenient function, so please use it! Let's have a good data analysis life ~~

Recommended Posts

Use indicator with pd.merge
Use RTX 3090 with PyTorch
Use ansible with cygwin
Use pipdeptree with virtualenv
[Python] Use JSON with Python
Use Mock with pytest
Use Gentelella with django
Use mecab with Python3
Use tensorboard with Chainer
Use DynamoDB with Python
Use pip with MSYS2
Use Python 3.8 with Anaconda
Use pyright with Spacemacs
Use python with docker
Use TypeScript with django-compressor
Use LESS with Django
Use MySQL with Django
Use Enums with SQLAlchemy
Use tensorboard with NNabla
Use GPS with Edison
Use nim with Jupyter
Use Trello API with python
Use shared memory with shared libraries
Use "$ in" operator with mongo-go-driver
Use directional graphs with networkx
Use TensorFlow with Intellij IDEA
Use Twitter API with Python
Use pip with Jupyter Notebook
Use DATE_FORMAT with SQLAlchemy filter
Use TUN / TAP with Python
Use sqlite3 with NAO (Pepper)
Use sqlite load_extensions with Pyramid
Use Windows 10 fonts with WSL
Use chainer with Jetson TK1
Use SSL with Celery + Redis
Use Cython with Jupyter Notebook
Use Maxout + CNN with Pylearn2
Use WDC-433SU2M2 with Manjaro Linux
Use OpenBLAS with numpy, scipy
Use subsonic API with python3
Use Sonicwall NetExtener with Systemd
Use prefetch_related conveniently with Django
Use AWS interpreter with Pycharm
Use Bokeh with IPython Notebook
Use Python-like range with Rust
Use MLflow with Databricks ④ --Call model -
Python: How to use async with
Use PointGrey camera with Python (PyCapture2)
Use vl53l0x with Raspberry Pi (python)
Use PX-S1UD / PX-Q1UD with Jetson nano
Use the preview feature with aws-cli
How to use virtualenv with PowerShell
[Python] Use Basic/Digest authentication with Flask
Use NAIF SPICE TOOLKIT with Python
Use rospy with virtualenv in Python3
Use markdown with jupyter notebook (with shortcut)
Use Python in pyenv with NeoVim
Use Tensorflow 2.1.0 with Anaconda on Windows 10!
How to use FTP with Python
Use Windows 10 speech synthesis with Python
Use curl / jq library with Go