[PYTHON] Getting Started with Git (1) History Storage

The goal of this story

Introduction

Git is a version control system. What is a version control system?

It is a system that records the change history of files and lets you return to any recorded state. Here, "change history" refers to a series of information about changes, recorded in chronological order.

In addition to Git, there are other version control systems such as CVS and Subversion.

Content storage

First, let's take a look at how Git works as a content store. Start by creating an empty Git repository:

% mkdir repo1
% cd repo1
% git init
Initialized empty Git repository in repo1/.git/

Create a file called readme.txt, store its contents in the repository, overwrite the file with other contents, and store it in the repository.

% echo aaa > readme.txt
% git hash-object -w readme.txt
72943a16fb2c8f38f9dde202b7a70ccc19c52f34
% echo bbb > readme.txt
% git hash-object -w readme.txt
f761ec192d9f0dca3329044b96ebdb12839dbff6
% rm -f readme.txt

The stored contents can be retrieved by using the string that the hash-object command printed as a key.

% git cat-file -p 72943a16fb2c8f38f9dde202b7a70ccc19c52f34 > readme.txt
% cat readme.txt
aaa
% git cat-file -p f761ec192d9f0dca3329044b96ebdb12839dbff6 > readme.txt
% cat readme.txt
bbb

Note that what we have so far is only a content store whose keys are opaque strings. Once the missing pieces are filled in, it becomes possible to keep a history of changes. Before looking at that, let's confirm that the opaque key is the SHA1 hash value of the file contents (+ α).

hash-object.py


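import hashlib
import sys

if len(sys.argv) != 2:
    print("usage: %s file" % sys.argv[0])
    sys.exit(-1)

try:
    # Read raw bytes: the hash is computed over the exact file contents.
    with open(sys.argv[1], "rb") as f:
        data = f.read()
except OSError:
    print("open %s failed" % sys.argv[1])
    sys.exit(-1)

# The "+ α" is the header "blob <size>\0" placed before the contents.
header = b"blob %d\0" % len(data)
sha1 = hashlib.sha1(header + data).hexdigest()
print(sha1)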

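This Python program calculates the SHA1 hash value of the concatenation of the string "blob", a space, the size of the contents in bytes as a decimal number, a NUL byte, and the contents themselves, and outputs its hexadecimal representation. Running it on the two contents above gives: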

% echo aaa | python hash-object.py /dev/stdin
72943a16fb2c8f38f9dde202b7a70ccc19c52f34
% echo bbb | python hash-object.py /dev/stdin
f761ec192d9f0dca3329044b96ebdb12839dbff6

You can see that the result matches the key under which the content was stored earlier: the content is stored using the SHA1 hash value of the file content + α as the key. Now, where is the content stored in the repository?

% find .git/objects -type f
.git/objects/72/943a16fb2c8f38f9dde202b7a70ccc19c52f34
.git/objects/f7/61ec192d9f0dca3329044b96ebdb12839dbff6

The SHA1 hash value is split into a directory name (the first two hex digits) and a file name (the remaining 38 digits). If you take the SHA1 value of these files themselves:

% sha1sum `find .git/objects -type f`
cf6e4f80cfae36e20ae7eb1a90919ca48f59514b  .git/objects/72/943a16fb2c8f38f9dde202b7a70ccc19c52f34
cdb05607e2e073287a81a908564d9d901ccdd687  .git/objects/f7/61ec192d9f0dca3329044b96ebdb12839dbff6

The values are different. This is because the contents are compressed before being stored. For example, take the following program.

decompress_sha1.py


import hashlib
import sys
import zlib

if len(sys.argv) != 2:
    print("usage: %s git_object_file" % sys.argv[0])
    sys.exit(-1)

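path = sys.argv[1]
try:
    # Object files are zlib-compressed binary data, so read them as bytes.
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
except OSError:
    print("open %s failed" % path)
    sys.exit(-1)

# Hash the decompressed data (header + contents).
sha1 = hashlib.sha1(data).hexdigest()
print("%s: %s" % (path, sha1))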

If you calculate the hash value after decompressing with this program:

% for i in `find .git/objects -type f`; do python ../decompress_sha1.py $i; done
.git/objects/72/943a16fb2c8f38f9dde202b7a70ccc19c52f34: 72943a16fb2c8f38f9dde202b7a70ccc19c52f34
.git/objects/f7/61ec192d9f0dca3329044b96ebdb12839dbff6: f761ec192d9f0dca3329044b96ebdb12839dbff6

You can see that they match properly (the hash values match, so the contents are expected to match).
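The storage scheme seen so far (header + contents, SHA1 as the key, zlib compression, two-digit directory) can be put together in a small sketch. The function names write_loose_object and read_loose_object are my own, and the scratch directory stands in for .git/objects. The compressed bytes are not guaranteed to be identical to what Git writes, since that depends on the compression settings. Only the decompressed data and the key should agree.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir, content, obj_type="blob"):
    """Store content under its SHA1 key and return the key (hypothetical helper)."""
    # Header "<type> <size>\0" followed by the raw content.
    store = b"%s %d\0" % (obj_type.encode("ascii"), len(content)) + content
    key = hashlib.sha1(store).hexdigest()
    # The first two hex digits name the directory, the other 38 the file.
    directory = os.path.join(objects_dir, key[:2])
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, key[2:]), "wb") as f:
        f.write(zlib.compress(store))
    return key


def read_loose_object(objects_dir, key):
    """Return the content stored under key, verifying the hash on the way."""
    with open(os.path.join(objects_dir, key[:2], key[2:]), "rb") as f:
        store = zlib.decompress(f.read())
    if hashlib.sha1(store).hexdigest() != key:
        raise ValueError("hash mismatch for %s" % key)
    # Drop the header: everything up to and including the first NUL byte.
    return store[store.index(b"\0") + 1:]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as objects_dir:
        key = write_loose_object(objects_dir, b"aaa\n")
        print(key)
        print(read_loose_object(objects_dir, key))
```

Verifying the hash on read is a design choice of this sketch. It makes the "key equals hash of the decompressed data" property checked above explicit.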

Directory tree storage

We saw how the contents of a file are stored under .git/objects/. Git also stores information such as file names and commit log messages in files under .git/objects/. These files are called Git objects.

At the time of creating the repository, no objects are stored.

% mkdir repo2
% cd repo2
% git init
Initialized empty Git repository in repo2/.git/
% ls .git
HEAD         config       hooks/       objects/
branches/    description  info/        refs/
% find .git/objects -type f

Let's add a file to the staging area with git add.

% echo aaa > readme.txt
% git add readme.txt
% find .git/objects -type f
.git/objects/72/943a16fb2c8f38f9dde202b7a70ccc19c52f34
% ls .git
HEAD         config       hooks/       info/        refs/
branches/    description  index        objects/

One object has been added and a file called index has been created. You can check the contents of the Git object with git cat-file.

% git cat-file -t 729
fatal: Not a valid object name 729
% git cat-file -t 7294
blob
% git cat-file -s 7294
4
% wc -c readme.txt
4 readme.txt
% git cat-file -p 7294
aaa
% cat readme.txt
aaa

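As for how to use cat-file: -t shows the type of the object, -s shows the size of its content in bytes, and -p prints the content in a readable form. The object name can be abbreviated to a prefix of the hash value, but as the first attempt above shows, the prefix must be at least four characters long.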
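What the three cat-file options report can be read straight off the decompressed object data. The following is a minimal sketch, not how Git itself is implemented. The function name parse_object is hypothetical, and the compressed bytes are built in memory instead of being read from .git/objects/72/.

```python
import zlib


def parse_object(raw):
    """Split a zlib-compressed object into (type, size, content)."""
    store = zlib.decompress(raw)
    # The header "<type> <size>" ends at the first NUL byte.
    header, _, content = store.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size mismatch")
    return obj_type, int(size), content


# Stand-in for the object file of the blob that holds "aaa\n".
raw = zlib.compress(b"blob 4\0aaa\n")

obj_type, size, content = parse_object(raw)
print(obj_type)  # counterpart of cat-file -t
print(size)      # counterpart of cat-file -s
print(content)   # counterpart of cat-file -p (for a blob)
```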

Next, let's write out the information contained in the index as an object.

% git write-tree
580c73c39691399d09ad01152ad0a691ce80bccf
% find .git/objects -type f
.git/objects/58/0c73c39691399d09ad01152ad0a691ce80bccf
.git/objects/72/943a16fb2c8f38f9dde202b7a70ccc19c52f34
% git cat-file -t 580c
tree
% git cat-file -p 580c
100644 blob 72943a16fb2c8f38f9dde202b7a70ccc19c52f34    readme.txt

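At this point we can see that a tree object has been added, and that it refers to the blob object 7294 together with its mode (100644) and its file name (readme.txt). In other words, the file name is kept in the tree object, not in the blob.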

tree-1.png

Next, create a directory and a file under it and try git add.

% mkdir tmp
% echo bbb > tmp/bbb.txt
% git add tmp/bbb.txt
% find .git/objects -type f
.git/objects/58/0c73c39691399d09ad01152ad0a691ce80bccf
.git/objects/72/943a16fb2c8f38f9dde202b7a70ccc19c52f34
.git/objects/f7/61ec192d9f0dca3329044b96ebdb12839dbff6
% git cat-file -t f761
blob
% git cat-file -p f761
bbb

The newly added object is a blob object that contains the contents of bbb.txt. If you write the index out as an object again in this state:

% git write-tree
6434b2415497a42647800c7e828038a2fb6fbbaf
% find .git/objects -type f
.git/objects/58/0c73c39691399d09ad01152ad0a691ce80bccf
.git/objects/5c/40d98927de9cdb27df5b3a7bd4f7ee95dbfc85
.git/objects/64/34b2415497a42647800c7e828038a2fb6fbbaf
.git/objects/72/943a16fb2c8f38f9dde202b7a70ccc19c52f34
.git/objects/f7/61ec192d9f0dca3329044b96ebdb12839dbff6
% git cat-file -t 6434
tree
% git cat-file -p 6434
100644 blob 72943a16fb2c8f38f9dde202b7a70ccc19c52f34    readme.txt
040000 tree 5c40d98927de9cdb27df5b3a7bd4f7ee95dbfc85    tmp
% git cat-file -t 5c40
tree
% git cat-file -p 5c40
100644 blob f761ec192d9f0dca3329044b96ebdb12839dbff6    bbb.txt

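Here you can see that two tree objects have been added. 6434 represents the top-level directory and refers to the blob for readme.txt and to the tree 5c40. 5c40 represents the tmp directory and refers to the blob for bbb.txt. A directory is thus represented by a tree object that is referenced from its parent tree.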

tree-2.png
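The nesting of tree objects can be followed recursively to recover full file paths. The sketch below copies the two trees from the cat-file output above into a plain dictionary by hand. The names TREES and list_files are my own, and nothing is read from a real repository.

```python
# Tree objects from the walkthrough, written out by hand as
# {tree hash: [(mode, name, object hash), ...]}.
TREES = {
    "6434b2415497a42647800c7e828038a2fb6fbbaf": [
        ("100644", "readme.txt", "72943a16fb2c8f38f9dde202b7a70ccc19c52f34"),
        ("40000", "tmp", "5c40d98927de9cdb27df5b3a7bd4f7ee95dbfc85"),
    ],
    "5c40d98927de9cdb27df5b3a7bd4f7ee95dbfc85": [
        ("100644", "bbb.txt", "f761ec192d9f0dca3329044b96ebdb12839dbff6"),
    ],
}


def list_files(tree_hash, prefix=""):
    """Yield (path, blob hash) for every file reachable from a tree."""
    for mode, name, obj_hash in TREES[tree_hash]:
        if mode == "40000":
            # A directory entry: descend into the referenced tree.
            yield from list_files(obj_hash, prefix + name + "/")
        else:
            yield prefix + name, obj_hash


for path, blob in list_files("6434b2415497a42647800c7e828038a2fb6fbbaf"):
    print(blob, path)
```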

Since we have come this far, I wrote a parser for tree objects.

parse_tree.py


import hashlib
import sys
import zlib

if len(sys.argv) != 2:
    print("usage: %s git_object_file" % sys.argv[0])
    sys.exit(-1)

try:
    # Object files are zlib-compressed binary data, so read them as bytes.
    with open(sys.argv[1], "rb") as f:
        data = zlib.decompress(f.read())
except OSError:
    print("open %s failed" % sys.argv[1])
    sys.exit(-1)

# The header "<type> <size>" ends at the first NUL byte.
eoh = data.find(b"\0")
if eoh < 0:
    print("no end of header")
    sys.exit(-1)

header = data[:eoh].decode("ascii")
t, n = header.split(" ")

if len(data) - eoh - 1 != int(n):
    print("size mismatch %d,%d" % (len(data) - eoh - 1, int(n)))
    sys.exit(-1)
if t != "tree":
    print("not tree: %s" % t)
    sys.exit(-1)

dsize = hashlib.sha1().digest_size  # 20 bytes for SHA1
ptr = eoh + 1
while ptr < len(data):
    # Each entry is "<mode> <name>\0" followed by the binary SHA1.
    eorh = data.find(b"\0", ptr)
    if eorh < 0:
        print("no end of reference header")
        sys.exit(-1)
    # Split only once so that names containing spaces stay intact.
    mode, name = data[ptr:eorh].decode().split(" ", 1)
    sha1_ = data[eorh+1:eorh+1+dsize].hex()
    print("%s (%6s) %s" % (sha1_, mode, name))
    ptr = eorh + 1 + dsize

% python parse_tree.py .git/objects/64/34b2415497a42647800c7e828038a2fb6fbbaf
72943a16fb2c8f38f9dde202b7a70ccc19c52f34 (100644) readme.txt
5c40d98927de9cdb27df5b3a7bd4f7ee95dbfc85 ( 40000) tmp

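The data structure of a tree object, like that of a blob, is zlib-compressed data consisting of the string "tree", a space, the size of the content in bytes as a decimal number, a NUL byte, and the content.

The content part is a repetition of the following record, one per entry: the mode as an octal string, a space, the file or directory name, a NUL byte, and the SHA1 hash value of the referenced object as 20 raw bytes (not as hexadecimal text).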
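Going the other way from the parser, a tree object can be assembled from its entries and hashed. This is a sketch under the layout described above. The helper names tree_object and object_hash are my own, and the entries are passed in the order Git listed them. If the layout is right, the three values printed should correspond to the tree hashes 580c, 5c40 and 6434 seen earlier.

```python
import hashlib


def tree_object(entries):
    """Serialize (mode, name, hex sha1) entries into an uncompressed tree object."""
    content = b"".join(
        # "<mode> <name>\0" followed by the SHA1 as 20 raw bytes.
        b"%s %s\0" % (mode.encode("ascii"), name.encode("utf-8"))
        + bytes.fromhex(sha1)
        for mode, name, sha1 in entries
    )
    return b"tree %d\0" % len(content) + content


def object_hash(store):
    """SHA1 of the uncompressed object data, as a hex string."""
    return hashlib.sha1(store).hexdigest()


blob_aaa = "72943a16fb2c8f38f9dde202b7a70ccc19c52f34"
blob_bbb = "f761ec192d9f0dca3329044b96ebdb12839dbff6"

first = object_hash(tree_object([("100644", "readme.txt", blob_aaa)]))
tmp = object_hash(tree_object([("100644", "bbb.txt", blob_bbb)]))
second = object_hash(tree_object([
    ("100644", "readme.txt", blob_aaa),
    ("40000", "tmp", tmp),
]))
print(first)
print(tmp)
print(second)
```

Note that the hash of the top-level tree depends on the hash of the tmp tree, which in turn depends on the hash of the blob. A change to any file therefore changes the hash of every tree above it.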

Commit history storage

Now that we've seen two types of objects, blob and tree, let's finally look at the commit object. Try creating a commit that references the tree object 580c.

% git commit-tree -m "initial commit" 580c
7a5c786478f17fd96b385c725c95d10fa74e4576
% ls .git/objects/7a/5c786478f17fd96b385c725c95d10fa74e4576
.git/objects/7a/5c786478f17fd96b385c725c95d10fa74e4576
% git cat-file -t 7a5c
commit
% git cat-file -p 7a5c
tree 580c73c39691399d09ad01152ad0a691ce80bccf
author Yoichi Nakayama <[email protected]> 1447772602 +0900
committer Yoichi Nakayama <[email protected]> 1447772602 +0900

initial commit

Next, let's create a commit that references the tree object 6434, with that commit object 7a5c as the parent.

% git commit-tree -p 7a5c -m "second commit" 6434
88470d975c1875e2e03a46877c13dde9ed2fd1ea
% ls .git/objects/88/470d975c1875e2e03a46877c13dde9ed2fd1ea
.git/objects/88/470d975c1875e2e03a46877c13dde9ed2fd1ea
% git cat-file -t 8847
commit
% git cat-file -p 8847
tree 6434b2415497a42647800c7e828038a2fb6fbbaf
parent 7a5c786478f17fd96b385c725c95d10fa74e4576
author Yoichi Nakayama <[email protected]> 1447772754 +0900
committer Yoichi Nakayama <[email protected]> 1447772754 +0900

second commit

commits.png

If you write the hash value of this commit object into the master ref that HEAD points to, you can see the history with git log.

% cat .git/HEAD
ref: refs/heads/master
% echo 88470d975c1875e2e03a46877c13dde9ed2fd1ea > .git/refs/heads/master
% git log
commit 88470d975c1875e2e03a46877c13dde9ed2fd1ea
Author: Yoichi Nakayama <[email protected]>
Date:   Wed Nov 18 00:05:54 2015 +0900

    second commit

commit 7a5c786478f17fd96b385c725c95d10fa74e4576
Author: Yoichi Nakayama <[email protected]>
Date:   Wed Nov 18 00:03:22 2015 +0900

    initial commit

You can now see the history you normally see after git commit. You can also give git diff the hash value of the target commit object to see the diff.

% git diff 7a5c 8847
diff --git a/tmp/bbb.txt b/tmp/bbb.txt
new file mode 100644
index 0000000..f761ec1
--- /dev/null
+++ b/tmp/bbb.txt
@@ -0,0 +1 @@
+bbb
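How git log gets from HEAD to the list of commits can be sketched in two steps: resolve HEAD to a commit hash, then follow parent links. The code below is an illustration, not Git's implementation. It recreates only the HEAD and refs/heads/master files in a scratch directory, takes the parent link from the cat-file output above as a hand-written dictionary, and ignores cases such as packed refs and merge commits. The names resolve_head, history and PARENTS are my own.

```python
import os
import tempfile


def resolve_head(git_dir):
    """Follow HEAD to a commit hash, assuming any ref it names is a plain file."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        # Symbolic ref: the named file holds the commit hash.
        with open(os.path.join(git_dir, head[len("ref: "):])) as f:
            return f.read().strip()
    return head  # otherwise HEAD holds the hash itself


def history(commit, parents):
    """Yield commit hashes from commit back to the root (single parent only)."""
    while commit is not None:
        yield commit
        commit = parents.get(commit)


# Recreate just the two files touched above in a scratch directory.
with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("88470d975c1875e2e03a46877c13dde9ed2fd1ea\n")
    tip = resolve_head(git_dir)

# Parent link copied from the cat-file output of the second commit.
PARENTS = {
    "88470d975c1875e2e03a46877c13dde9ed2fd1ea":
        "7a5c786478f17fd96b385c725c95d10fa74e4576",
}
for commit in history(tip, PARENTS):
    print("commit", commit)
```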

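The data structure of a commit object is the same as that of a blob object except that it starts with "commit". Its content includes the hash value of the tree object, the hash values of any parent commits, the author and committer (name, email address, timestamp, and time zone), and, after a blank line, the commit message.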
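A commit object can be assembled in the same way as the other objects. The sketch below follows the field order shown by cat-file -p above. The function name commit_object is my own, and the identity is a made-up example rather than the one used in this article, so the resulting hash will not match 7a5c.

```python
import hashlib


def commit_object(tree, parents, author, committer, message):
    """Serialize a commit into uncompressed object data (hypothetical helper)."""
    lines = ["tree %s" % tree]
    lines += ["parent %s" % p for p in parents]
    # A blank line separates the header fields from the message.
    lines += ["author %s" % author, "committer %s" % committer, "", message]
    content = ("\n".join(lines) + "\n").encode("utf-8")
    return b"commit %d\0" % len(content) + content


# Hypothetical identity; name, address, and timestamp are assumptions.
who = "Example User <user@example.com> 1447772602 +0900"
store = commit_object(
    "580c73c39691399d09ad01152ad0a691ce80bccf", [], who, who, "initial commit"
)
# Show the header and the content on separate lines.
print(store.decode("utf-8").replace("\0", "\n", 1))
print(hashlib.sha1(store).hexdigest())
```

Because the author, committer, and timestamps are part of the hashed data, two commits of the same tree by different people or at different times get different hash values.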

Summary

