Git Object Hashing and Content Addressability

Summary:
Understand how Git uses SHA-1 hashes to identify content.

Git, the distributed version control system, is renowned for its speed, efficiency, and reliability in tracking changes across projects. At the core of Git’s effectiveness lies a fundamental concept: content addressability via cryptographic hashes. This post will explain what content addressability means, how Git uses SHA-1 hashes to identify and store content, and why this design is central to Git’s power and flexibility.

What is Content Addressability?

Content addressability is a way of referencing data by its content rather than its location or name. Instead of "here’s file X," we say, "here’s the data with hash Y." This hash is typically generated by applying a cryptographic hash function (in Git's case, SHA-1) to the content itself. The outcome? If two pieces of content are identical, they receive the same identifier, and any change, no matter how minor, produces a new hash.
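
To make the idea concrete, here is a minimal sketch of a content-addressed store in Python. The ContentStore class and its put/get methods are hypothetical names chosen for illustration, and the sketch hashes the raw bytes directly rather than using Git's object format.

```python
import hashlib


class ContentStore:
    """A toy in-memory store that keys data by the SHA-1 of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, not chosen by the caller.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
first = store.put(b"same content")
second = store.put(b"same content")    # identical bytes, identical address
changed = store.put(b"same content.")  # one extra byte, different address

print(first == second)   # True
print(first == changed)  # False
```

Note that the caller never picks a name: the store hands back the address, and the only way to get a different address is to store different bytes.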

Advantages

Integrity: Hashes act as fingerprints, ensuring what you access is exactly what was stored.
Duplicate Elimination: Identical content doesn’t consume extra space.
Immutability: Data, once stored and addressed by its hash, cannot be altered undetected.

Git’s Use of SHA-1 Hashes

Every object stored in a Git repository is addressed by the SHA-1 hash of its content, prefixed with a short header that records the object's type and size. There are four object types (blob, tree, commit, and tag), and each serves a role:

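Blob: Stores a file's contents, but not its name or mode; those live in the tree that references it.
Tree: Stores a directory listing, mapping each entry's name and mode to the hash of a blob or another tree.
Commit: Records a snapshot by pointing to a tree and to its parent commits, along with author, committer, and message.
Tag: Marks a specific object, usually a commit, for reference (this is the object behind an annotated tag).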

How Is the Hash Computed?

Take, for example, a file you'd like to place under version control. Before storing, Git creates a blob object by:

Preparing the Content:
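Prepend a header containing the object type and the content size in bytes, e.g.:
```
blob 14\0Hello, world!\n
```
(Here 14 counts the 13 characters of "Hello, world!" plus the trailing newline, and \0 is the null byte that terminates the header.)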
Computing the Hash:
Apply SHA-1 to this data:
```
SHA1("blob 14\0Hello, world!\n") = af5626b4a114abcb82d63db7c8082c3c4756e51b
```
Now, this 40-character hash becomes the address of the blob.
Storing the Object:
Git compresses the header and content with zlib and stores the result under .git/objects/, using the first two characters of the hash as the directory name and the remaining 38 as the file name, e.g.:
```
.git/objects/af/5626b4a114abcb82d63db7c8082c3c4756e51b
```
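
The three steps above can be sketched in a few lines of Python using only the standard library. The function names (blob_object, object_id, loose_object_path) are hypothetical, and the sketch only computes what would be written rather than touching a real repository. The compressed bytes may also differ from what Git writes, since that depends on the zlib compression level, but they decompress to the same object.

```python
import hashlib
import zlib


def blob_object(content: bytes) -> bytes:
    """Return the bytes Git hashes for a blob: header, null byte, content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return header + content


def object_id(obj: bytes) -> str:
    return hashlib.sha1(obj).hexdigest()


def loose_object_path(oid: str) -> str:
    # First two hex characters name the directory, the rest name the file.
    return f".git/objects/{oid[:2]}/{oid[2:]}"


obj = blob_object(b"Hello, world!\n")
oid = object_id(obj)
compressed = zlib.compress(obj)  # roughly what would be written to disk

print(obj[:8])                      # b'blob 14\x00'
print(loose_object_path(oid))
print(object_id(blob_object(b"")))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The last line prints the well-known id of the empty blob, which is a handy sanity check that the header format is right.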

Why the SHA-1 Hash?

SHA-1 (Secure Hash Algorithm 1) produces a 160-bit digest, written as 40 hexadecimal digits. SHA-1 is no longer considered collision-resistant (a practical collision was demonstrated in 2017), but for content addressing and duplicate detection it remains practical for most use cases in Git. Git has also shipped experimental support for SHA-256 repositories since version 2.29.
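
As a quick illustration of the size difference, the snippet below hashes the same bytes with both algorithms using Python's hashlib. It assumes, for the sake of the comparison, that the input bytes stay the same and only the hash function changes.

```python
import hashlib

data = b"blob 0\x00"  # the object bytes for an empty blob

sha1_hex = hashlib.sha1(data).hexdigest()
sha256_hex = hashlib.sha256(data).hexdigest()

# Each hex digit encodes 4 bits.
print(len(sha1_hex) * 4, len(sha1_hex))      # 160 40
print(len(sha256_hex) * 4, len(sha256_hex))  # 256 64
```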

How Content Addressability Powers Git

Efficient Storage

De-duplication:
Multiple files with the same content, or repeated commits with unchanged files, point to the same blob object. Storage is conserved automatically.
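
Here is a rough sketch of why this falls out for free. The objects dictionary stands in for .git/objects, and snapshot is a hypothetical helper that records a path-to-blob listing; real Git stores trees and commits as well, which this ignores.

```python
import hashlib


def blob_id(content: bytes) -> str:
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


objects = {}  # object id -> content, standing in for .git/objects


def snapshot(files: dict) -> dict:
    """Store each file's content and return a path -> blob id mapping."""
    listing = {}
    for path, content in files.items():
        oid = blob_id(content)
        objects.setdefault(oid, content)  # no-op if already stored
        listing[path] = oid
    return listing


first = snapshot({"README": b"docs\n", "LICENSE": b"MIT\n", "COPYING": b"MIT\n"})
second = snapshot({"README": b"docs v2\n", "LICENSE": b"MIT\n", "COPYING": b"MIT\n"})

print(len(objects))  # 3
```

Six file entries across two snapshots end up as three stored blobs: the two identical license files share one, and only the edited README adds a new one.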

Lightning-Fast Integrity Checks

Error Detection:
If a file in .git/objects/ is corrupted or changed, its content no longer hashes to its name, and Git can detect the mismatch, for example when you run git fsck.
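
The check itself is simple enough to sketch. The verify function below is a hypothetical stand-in, not Git's actual code path: it decompresses a stored object, re-hashes it, and compares the result with the id the object is filed under.

```python
import hashlib
import zlib


def verify(oid: str, stored: bytes) -> bool:
    """Decompress a stored object and check that it still hashes to its id."""
    try:
        obj = zlib.decompress(stored)
    except zlib.error:
        return False  # damaged so badly it no longer decompresses
    return hashlib.sha1(obj).hexdigest() == oid


obj = b"blob 14\x00Hello, world!\n"
oid = hashlib.sha1(obj).hexdigest()
good = zlib.compress(obj)

# Simulate silent tampering: same size, one character changed.
tampered = zlib.compress(obj.replace(b"world", b"wor1d"))

print(verify(oid, good))      # True
print(verify(oid, tampered))  # False
```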

Rapid Comparison & Synchronization

Object Identity:
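Instead of comparing entire files, Git can simply compare SHA-1 hashes: if two hashes match, the contents are identical (barring a hash collision). Because a tree's hash covers everything beneath it, an entire unchanged directory can be skipped with a single comparison.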
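
As a sketch of the idea, the hypothetical changed_paths function below compares two path-to-hash listings without reading a single byte of file content. The ids are made up and shortened for readability.

```python
def changed_paths(old: dict, new: dict) -> list:
    """Compare two path -> object id listings without reading any file content."""
    paths = sorted(set(old) | set(new))
    return [p for p in paths if old.get(p) != new.get(p)]


# Hypothetical, shortened ids for readability; real ids are 40 hex characters.
old = {"src/main.c": "1a2b3c", "src/util.c": "4d5e6f", "README": "778899"}
new = {"src/main.c": "1a2b3c", "src/util.c": "aabbcc", "NEWS": "ddeeff"}

print(changed_paths(old, new))  # ['NEWS', 'README', 'src/util.c']
```

Added and removed paths show up too, because a missing entry never equals a real id.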

Exchanging Data

Pushing, Fetching, Cloning:
During remote operations, the two sides exchange hashes to work out which objects the receiver is missing, and only those objects are transmitted.
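
At its simplest, this is a set difference over object ids. The sketch below is a deliberate simplification with hypothetical names and made-up ids: real Git negotiates over the commit graph rather than listing every object, and it bundles what it sends into packfiles.

```python
def objects_to_send(sender: dict, receiver_ids: set) -> dict:
    """Pick the objects whose ids the receiver does not already have."""
    return {oid: data for oid, data in sender.items() if oid not in receiver_ids}


# Hypothetical, shortened ids and placeholder payloads.
local = {"aa11": b"commit", "bb22": b"tree", "cc33": b"blob", "dd44": b"blob"}
remote_has = {"cc33", "dd44"}

to_send = objects_to_send(local, remote_has)
print(sorted(to_send))  # ['aa11', 'bb22']
```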

Real-World Example

Suppose you edit a file and commit it. Git creates a new blob object, computes its SHA-1, and references it in a new tree and commit object. If you revert the file to its original contents and commit again, Git recognizes the blob already exists and simply reuses it: no duplicate storage is consumed.
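
The same effect carries up to trees. The sketch below hashes a one-file directory before an edit, after it, and after a revert. It assumes a regular, non-executable file (mode 100644) and a hypothetical file name, and it sorts entries with Python's default string ordering, which matches Git only for simple cases like this one.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    header = kind + b" " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: dict) -> bytes:
    """Serialize name -> blob id entries as regular files (mode 100644)."""
    body = b""
    for name in sorted(entries):
        # Each entry: mode, space, name, null byte, then the raw 20-byte id.
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(entries[name])
    return body


original = object_id(b"blob", b"v1\n")
edited = object_id(b"blob", b"v2\n")
reverted = object_id(b"blob", b"v1\n")

tree_before = object_id(b"tree", tree_body({"notes.txt": original}))
tree_edited = object_id(b"tree", tree_body({"notes.txt": edited}))
tree_after = object_id(b"tree", tree_body({"notes.txt": reverted}))

print(reverted == original)        # True
print(tree_after == tree_before)   # True
print(tree_edited == tree_before)  # False
```

The revert commit itself would still be a new object, since it has a different parent and timestamp, but it points at a tree and blob that are already in the store.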

You can even manually inspect Git objects:

```
# Compute the SHA-1 of the content as Git does, and write the blob to the object store (-w)
echo "Hello, world!" | git hash-object -w --stdin
# Output: af5626b4a114abcb82d63db7c8082c3c4756e51b

# Check the stored object
git cat-file -p af5626b4a114abcb82d63db7c8082c3c4756e51b
# Output: Hello, world!
```

Looking Forward

The content-addressable model built on SHA-1 hashing defined Git from its very first commit, enabling remarkable efficiency, integrity, and scalability. While the move to stronger hashes like SHA-256 is on the horizon for security reasons, the principles you’ve learned remain foundational to Git’s design.

Conclusion

Git’s use of SHA-1 hashing for content addressability underpins much of its reliability and power. By understanding how Git stores and references everything by hash, you gain deeper insight into why Git works so well — and how it can manage some of the world’s largest codebases with apparent ease.
