Git as a NoSql database

Git’s man pages describe it as “the stupid content tracker”. It is probably the most widely used version control system in the world, which is remarkable given that it doesn’t describe itself as a source control system. In fact, you can use git to track any type of content. You can create a Git NoSQL database, for example.

The reason why it says stupid in the man-pages is that it makes no assumptions about what content you store in it. The underlying git model is rather basic. In this post I want to explore the possibilities of using git as a NoSQL database (a key-value store). You could use the file system as a data store and then use git add and git commit to save your files:

# saving a document
echo '{"id": 1, "name": "kenneth"}' > 1.json
git add 1.json
git commit -m "added a file"
# reading a document
git show master:1.json
=> {"id": 1, "name": "kenneth"}

That works, but you’re now using the file system as a database: paths are the keys, values are whatever you store in them. There are a few disadvantages:

- We need to write all our data to disk before we can save it into git
- We’re saving data multiple times
- File storage is not deduplicated, so we lose the automatic data deduplication git provides
- If we want to work on multiple branches at the same time, we need multiple checked-out directories

What we want rather is a bare repository, one where none of the files exist in the file system, but only in the git database. Let’s have a look at git’s data model and the plumbing commands to make this work.

Git as a NoSQL database

Git is a content-addressable file system . This means that it’s a simple key-value store. Whenever you insert content into it, it will give you back a key to retrieve that content later.

Let’s create some content:

#Initialize a repository
mkdir MyRepo
cd MyRepo
git init
# Save some content
echo {"id": 1, "name": "kenneth"} | git hash-object -w --stdin
da95f8264a0ffe3df10e94eed6371ea83aee9a4d

hash-object is a git plumbing command which takes content, stores it in the database and returns the key.

The -w switch tells it to store the content; without it, the command only calculates the hash. The --stdin switch tells git to read the content from standard input instead of from a file.
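The effect of the -w switch can be checked directly. The following is a minimal sketch, assuming a POSIX shell with git installed; the temporary directory and variable names are only for illustration, and printf is used instead of echo so the JSON quotes reach git unchanged.

```shell
# Throwaway repository so the experiment leaves nothing behind
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

doc='{"id": 1, "name": "kenneth"}'

# Without -w: git only computes the key, nothing is stored
key_dry=$(printf '%s\n' "$doc" | git hash-object --stdin)
if git cat-file -e "$key_dry" 2>/dev/null; then stored_before=yes; else stored_before=no; fi

# With -w: the same key comes back, and the blob is now in the object database
key=$(printf '%s\n' "$doc" | git hash-object -w --stdin)
if git cat-file -e "$key" 2>/dev/null; then stored_after=yes; else stored_after=no; fi

echo "stored before -w: $stored_before, after -w: $stored_after"
```

Both calls return the same key, because the key depends only on the content and not on whether it was stored.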

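The key it returns is a SHA-1 hash based on the content. If you run the above commands on your machine, you’ll get exactly the same SHA-1, provided your shell passes git the same bytes (quoting and line endings differ between shells, and any difference changes the key). Now that we have some content in the database, we can read it back: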

git cat-file -p da95f8264a0ffe3df10e94eed6371ea83aee9a4d
{"id": 1, "name": "kenneth"}

Git Blobs

We now have a key-value store with one object, a blob.

There’s only one problem: we can’t update this, because if we update the content, the key will change. That would mean that for every version of our file, we’d have to remember a different key. What we want instead, is to specify our own key which we can use to track the versions.
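To make the problem concrete, here is a small sketch (assuming a POSIX shell with git installed; the variable names are illustrative) that stores two versions of the same document and inspects the result with git cat-file.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Two versions of the same logical document
v1=$(printf '%s\n' '{"id": 1, "name": "kenneth"}' | git hash-object -w --stdin)
v2=$(printf '%s\n' '{"id": 1, "name": "kenneth truyers"}' | git hash-object -w --stdin)

# -t prints the object type, -s its size in bytes
type=$(git cat-file -t "$v1")
size=$(git cat-file -s "$v1")

echo "version 1: $v1 ($type, $size bytes)"
echo "version 2: $v2"
```

The two keys are unrelated, and nothing in the object database says that the second blob is a newer version of the first.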

Git Trees

Trees solve two problems:

- the need to remember the hashes of our objects and their versions
- the ability to store groups of files

The best way to think about a tree is like a folder in the file system. To create a tree you have to follow two steps:

# Create and populate a staging area
git update-index --add --cacheinfo 100644 da95f8264a0ffe3df10e94eed6371ea83aee9a4d 1.json
# write the tree
git write-tree
d6916d3e27baa9ef2742c2ba09696f22e41011a1

This also gives you back a sha. Now we can read back that tree:

git cat-file -p d6916d3e27baa9ef2742c2ba09696f22e41011a1
100644 blob da95f8264a0ffe3df10e94eed6371ea83aee9a4d 1.json
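Once a blob is part of a tree, it can be looked up by name rather than by hash. The sketch below assumes a POSIX shell with git installed. It builds the same tree in two ways: through the staging area as above, and with git mktree, a plumbing command that reads a textual tree listing and does not need a staging area. The second route is not used in this post; it is shown only as an alternative.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

doc='{"id": 1, "name": "kenneth"}'
blob=$(printf '%s\n' "$doc" | git hash-object -w --stdin)

# Route 1: stage the blob under a name, then write the index out as a tree
git update-index --add --cacheinfo 100644 "$blob" 1.json
tree_from_index=$(git write-tree)

# Route 2: describe the tree as "<mode> <type> <hash><TAB><name>" and let mktree store it
tree_from_mktree=$(printf '100644 blob %s\t%s\n' "$blob" 1.json | git mktree)

# Look the document up by name (tree:path) instead of by blob hash
content=$(git cat-file -p "$tree_from_mktree:1.json")
echo "$content"
```

Both routes produce the same tree hash, since a tree is addressed by its content just like a blob.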

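At this point our object database contains two objects: the tree d6916d3, which has a single entry named 1.json, and the blob da95f82 that this entry points to.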

To modify the file, we follow the same steps:

# Add a blob
echo {"id": 1, "name": "kenneth truyers"} | git hash-object -w --stdin
42d0d209ecf70a96666f5a4c8ed97f3fd2b75dda
# Create and populate a staging area
git update-index --add --cacheinfo 100644 42d0d209ecf70a96666f5a4c8ed97f3fd2b75dda 1.json
# Write the tree
git write-tree
2c59068b29c38db26eda42def74b7142de392212

That leaves us with two trees, d6916d3 and 2c59068, each containing a 1.json entry that points to its own blob (da95f82 and 42d0d20 respectively).

These two trees represent the different states of our files. That doesn’t help much, though, since we still need to remember the SHA-1 values of the trees to get to our content.
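Even without commits, two trees can already be compared. This is a minimal sketch assuming a POSIX shell with git installed; store_tree is a hypothetical helper written for this example, and it uses git mktree instead of the staging area to keep the script short.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Hypothetical helper: store one document as 1.json and print the tree hash
store_tree() {
  blob=$(printf '%s\n' "$1" | git hash-object -w --stdin)
  printf '100644 blob %s\t1.json\n' "$blob" | git mktree
}

tree1=$(store_tree '{"id": 1, "name": "kenneth"}')
tree2=$(store_tree '{"id": 1, "name": "kenneth truyers"}')

# List the names that differ between the two states (M = modified)
changed=$(git diff-tree -r --name-status "$tree1" "$tree2")
echo "$changed"
```

Replacing --name-status with -p prints the full patch between the two states.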

Git Commits

One level up, we get to commits. A commit holds 5 pieces of key information:

- Author of the commit
- Date it was created
- Why it was created (message)
- A single tree object it points to
- Zero or more parent commits (the first commit has none; for now we’ll only consider commits with at most one parent, since commits with multiple parents are merge commits)

Let’s commit the above trees:

# Commit the first tree (without a parent)
echo "Commit 1st version" | git commit-tree d6916d3
05c1cec5685bbb84e806886dba0de5e2f120ab2a
# Commit the second tree with the first commit as a parent
echo "Commit 2nd version" | git commit-tree 2c59068 -p 05c1cec5
9918e46dfc4241f0782265285970a7c16bf499e4
# Note: your commit hashes will differ, since a commit also records the author, committer and timestamps

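This leaves us with two commits: 05c1cec points to the first tree (d6916d3) and has no parent, while 9918e46 points to the second tree (2c59068) and has 05c1cec as its parent.

Now we have built up a complete history of our file. You could open the repository with any git client and you’ll see how 1.json is being tracked correctly. To demonstrate that, this is the output of running git log: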

git log --stat 9918e46
9918e46dfc4241f0782265285970a7c16bf499e4 "Commit 2nd version"
1.json | 1 +
1 file changed, 1 insertions(+)
05c1cec5685bbb84e806886dba0de5e2f120ab2a "Commit 1st version"
1.json | 1 +
1 file changed, 1 insertion(+)

And to get the content of the file at the last commit:

git show 9918e46:1.json
{"id": 1, "name": "kenneth truyers"}
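The whole chain from blob to commit can be replayed in one script. The following is a sketch under a few assumptions: a POSIX shell with git installed, a placeholder identity supplied through environment variables (commit-tree refuses to run without one), the hypothetical store_tree helper, and the -m option of commit-tree in place of piping the message through echo.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Placeholder identity, only for this experiment
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: store one document as 1.json and print the tree hash
store_tree() {
  blob=$(printf '%s\n' "$1" | git hash-object -w --stdin)
  printf '100644 blob %s\t1.json\n' "$blob" | git mktree
}

tree1=$(store_tree '{"id": 1, "name": "kenneth"}')
tree2=$(store_tree '{"id": 1, "name": "kenneth truyers"}')

# First commit has no parent, second commit names the first as its parent
c1=$(git commit-tree "$tree1" -m "Commit 1st version")
c2=$(git commit-tree "$tree2" -p "$c1" -m "Commit 2nd version")

# Walk the links: commit -> parent commit, commit -> tree
parent=$(git rev-parse "$c2^")
tree_of_c2=$(git rev-parse "$c2^{tree}")

git log --format='%h %s' "$c2"
git show "$c1:1.json"
git show "$c2:1.json"
```

Every version of the document stays reachable through the commit that recorded it, as long as the hash of the newest commit is known.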

We’re still not there though, because we have to remember the hash of the last commit. Up until now, all objects we have created are part of git’s object database. One characteristic of that database is that it stores only immutable objects. Once you write a blob, a tree or a commit, you can never modify it without changing the key. You also cannot delete them, at least not directly: the git gc command eventually removes objects that are dangling, meaning nothing references them anymore.

Git References

Yet another level up are Git references. References are not part of the object database; they are part of the reference database and are mutable. There are different types of references, such as branches, tags and remotes. They are similar in nature, with a few minor differences. For the moment, let’s just consider branches. A branch is a pointer to a commit. To create a branch we can write the hash of the commit to the file system:

echo 05c1cec5685bbb84e806886dba0de5e2f120ab2a > .git/refs/heads/master

We now have a branch, master, pointing at our first commit. To move the branch, we issue the following command:

git update-ref refs/heads/master 9918e46

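This leaves us with a complete graph: the master branch points at the second commit (9918e46), which points to its tree and, through its parent link, back to the first commit.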
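Putting the four layers together (blob, tree, commit, reference) gives a small versioned key-value store. The sketch below is one possible arrangement, not a definitive implementation. It assumes a POSIX shell with git and awk installed, a bare repository addressed through GIT_DIR, keys that are plain file names without tabs or slashes, and a placeholder identity. kv_put and kv_get are hypothetical helper names, and git mktree is used instead of a staging area so that nothing touches the file system outside the git database.

```shell
db=$(mktemp -d)
git init -q --bare "$db"
export GIT_DIR="$db"

# Placeholder identity, only for this experiment
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

ref=refs/heads/master

# Hypothetical helper: kv_put <key> <value> records a new version under the branch
kv_put() {
  key=$1
  value=$2
  blob=$(printf '%s\n' "$value" | git hash-object -w --stdin)
  parent=$(git rev-parse -q --verify "$ref" || true)
  tree=$(
    {
      # Keep every existing entry except the key being overwritten
      [ -n "$parent" ] && git ls-tree "$parent" | awk -F'\t' -v k="$key" '$2 != k'
      printf '100644 blob %s\t%s\n' "$blob" "$key"
    } | git mktree
  )
  if [ -n "$parent" ]; then
    commit=$(git commit-tree "$tree" -p "$parent" -m "put $key")
  else
    commit=$(git commit-tree "$tree" -m "put $key")
  fi
  # Move the branch so the newest commit never has to be remembered by hash
  git update-ref "$ref" "$commit"
}

# Hypothetical helper: kv_get <key> reads the latest value
kv_get() {
  git cat-file -p "$ref:$1"
}

kv_put 1.json '{"id": 1, "name": "kenneth"}'
kv_put 2.json '{"id": 2, "name": "example"}'
kv_put 1.json '{"id": 1, "name": "kenneth truyers"}'

kv_get 1.json                      # latest version
git cat-file -p "$ref~2:1.json"    # the version from two commits ago
```

Reads go through the branch name, so callers only need to know the key. Older versions remain available through the usual revision syntax such as master~2.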
