[LINUX] Exact behavior of diff --ignore-matching-lines=RE

Introduction

The diff command has an option, --ignore-matching-lines=RE (short form: -I RE). The man page describes it as "ignore changes where all lines match RE", which reads as if any difference whose line matches the regular expression is ignored. The Japanese translation likewise says, "Ignore changes that just insert / delete lines that match REGEXP."

I think these descriptions are quite misleading. (Rereading them after understanding the behavior, I can see what they are trying to say, but ...) This article describes the exact behavior of the option.

Concrete example

Consider the difference between the following two files.

file1.txt


Date: 2020/4/7
Time: 20:00:00

file2.txt


Date: 2020/4/6
Time: 21:00:00

A plain diff gives the following:

$ diff file1.txt file2.txt
1,2c1,2
< Date: 2020/4/7
< Time: 20:00:00
---
> Date: 2020/4/6
> Time: 21:00:00

Now suppose you want to ignore the difference in Date and see only the difference in Time.

$ diff -I Date file1.txt file2.txt
1,2c1,2
< Date: 2020/4/7
< Time: 20:00:00
---
> Date: 2020/4/6
> Time: 21:00:00

But the output is exactly the same as before. Now insert a blank line between the Date and Time lines in both file1.txt and file2.txt.

file1.txt


Date: 2020/4/7

Time: 20:00:00

file2.txt


Date: 2020/4/6

Time: 21:00:00

Running the same diff command again gives:

$ diff -I Date file1.txt file2.txt
3c3
< Time: 20:00:00
---
> Time: 21:00:00

This time only the Time difference is reported, as intended.

This is because --ignore-matching-lines is applied to hunks, not to individual lines. A hunk is a contiguous group of changed lines. In the first example, the following is a single hunk:

1,2c1,2
< Date: 2020/4/7
< Time: 20:00:00
---
> Date: 2020/4/6
> Time: 21:00:00

With the blank line inserted, a plain diff produces two hunks:

$ diff file1.txt file2.txt
1c1
< Date: 2020/4/7
---
> Date: 2020/4/6
3c3
< Time: 20:00:00
---
> Time: 21:00:00

What --ignore-matching-lines actually does is test the regular expression against each hunk as a whole. Here every changed line in the first hunk matches Date, so that hunk is ignored, while the second hunk is still printed.
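The per-hunk rule can be checked with a small script. This is only a sketch that assumes GNU diff is installed; the temporary directory and the file names adj1.txt, adj2.txt, sep1.txt and sep2.txt are hypothetical names chosen for illustration. It also relies on the fact that GNU diff accepts several -I options and tries each regular expression on every line.

```shell
#!/bin/sh
# Sketch: check the "every changed line in the hunk must match" rule.
# Assumes GNU diff; the file names under $demo are made up for this demo.
demo=$(mktemp -d)

# Date and Time are adjacent: both changes fall into a single hunk.
printf 'Date: 2020/4/7\nTime: 20:00:00\n' > "$demo/adj1.txt"
printf 'Date: 2020/4/6\nTime: 21:00:00\n' > "$demo/adj2.txt"

# A blank line in between: the changes form two separate hunks.
printf 'Date: 2020/4/7\n\nTime: 20:00:00\n' > "$demo/sep1.txt"
printf 'Date: 2020/4/6\n\nTime: 21:00:00\n' > "$demo/sep2.txt"

# diff exits with status 1 when it reports differences, hence "|| true".
one_regex=$(diff -I Date "$demo/adj1.txt" "$demo/adj2.txt" || true)
two_regex=$(diff -I Date -I Time "$demo/adj1.txt" "$demo/adj2.txt" || true)
separated=$(diff -I Date "$demo/sep1.txt" "$demo/sep2.txt" || true)

printf 'adjacent, -I Date:\n%s\n\n' "$one_regex"
printf 'adjacent, -I Date -I Time:\n%s\n\n' "$two_regex"
printf 'separated, -I Date:\n%s\n' "$separated"

rm -rf "$demo"
```

With GNU diff, the first and third results should reproduce the outputs shown in this article. The second should be empty, because every changed line in the single hunk matches one of the two regular expressions.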

In the documentation

The GNU diffutils documentation describes the exact behavior, including this point:

However, -I only ignores the insertion or deletion of lines that contain the regular expression if every changed line in the hunk—every insertion and every deletion—matches the regular expression. In other words, for each nonignorable change, diff prints the complete set of changes in its vicinity, including the ignorable ones.

As "every changed line in the hunk" indicates, a hunk is not ignored unless every changed line in it matches the regular expression. Once you understand this, you can see that the "all" in the man page and the "just" in the Japanese translation refer to the hunk as a whole, but that is quite hard to notice at first glance.
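If the goal is simply to hide the Date lines even when they sit right next to other changes, one possible workaround is to filter them out before diffing. This is my own sketch and not something taken from the GNU documentation. The temporary directory and the .filtered file names are hypothetical, and the pattern ^Date: assumes the line layout used in this article.

```shell
#!/bin/sh
# Sketch of a workaround: drop the Date lines first, then run a plain diff.
# The directory and the *.filtered names are made up for illustration.
work=$(mktemp -d)

printf 'Date: 2020/4/7\nTime: 20:00:00\n' > "$work/file1.txt"
printf 'Date: 2020/4/6\nTime: 21:00:00\n' > "$work/file2.txt"

# grep -v keeps only the lines that do NOT match the pattern.
grep -v '^Date:' "$work/file1.txt" > "$work/file1.filtered"
grep -v '^Date:' "$work/file2.txt" > "$work/file2.filtered"

# diff exits with status 1 when the inputs differ, hence "|| true".
filtered_diff=$(diff "$work/file1.filtered" "$work/file2.filtered" || true)
printf '%s\n' "$filtered_diff"

rm -rf "$work"
```

Note that the line numbers in the output then refer to the filtered files, not to the original ones, so this only fits cases where the original positions do not matter.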
