Commit | Line | Data |
---|---|---|
e83c5163 LT |
1 | |
2 | GIT - the stupid content tracker | |
3 | ||
4 | "git" can mean anything, depending on your mood. | |
5 | ||
6 | - random three-letter combination that is pronounceable, and not | |
7 | actually used by any common UNIX command. The fact that it is a | |
8 | mispronunciation of "get" may or may not be relevant. | |
9 | - stupid. contemptible and despicable. simple. Take your pick from the | |
10 | dictionary of slang. | |
11 | - "global information tracker": you're in a good mood, and it actually | |
12 | works for you. Angels sing, and a light suddenly fills the room. | |
13 | - "goddamn idiotic truckload of sh*t": when it breaks | |
14 | ||
15 | This is a stupid (but extremely fast) directory content manager. It | |
16 | doesn't do a whole lot, but what it _does_ do is track directory | |
17 | contents efficiently. | |
18 | ||
19 | There are two object abstractions: the "object database", and the | |
20 | "current directory cache". | |
21 | ||
22 | The Object Database (SHA1_FILE_DIRECTORY) | |
23 | ||
24 | The object database is literally just a content-addressable collection | |
25 | of objects. All objects are named by their content, which is | |
26 | approximated by the SHA1 hash of the object itself. Objects may refer | |
27 | to other objects (by referencing their SHA1 hash), and so you can build | |
28 | up a hierarchy of objects. | |
29 | ||
30 | There are several kinds of objects in the content-addressable collection | |
31 | database. They are all deflated with zlib, and start off with a tag | |
32 | of their type, and size information about the data. The SHA1 hash is | |
33 | always the hash of the _compressed_ object, not the original one. | |
34 | ||
35 | In particular, the consistency of an object can always be tested | |
36 | independently of the contents or the type of the object: all objects can | |
37 | be validated by verifying that (a) their hashes match the content of the | |
38 | file and (b) the object successfully inflates to a stream of bytes that | |
39 | forms a sequence of <ascii tag without space> + <space> + <ascii decimal | |
40 | size> + <byte\0> + <binary object data>. | |
41 | ||
42 | BLOB: A "blob" object is nothing but a binary blob of data, and doesn't | |
43 | refer to anything else. There is no signature or any other verification | |
44 | of the data, so while the object is consistent (it _is_ indexed by its | |
45 | sha1 hash, so the data itself is certainly correct), it has absolutely | |
46 | no other attributes. No name associations, no permissions. It is | |
47 | purely a blob of data (i.e. normally "file contents"). | |
48 | ||
49 | TREE: The next hierarchical object type is the "tree" object. A tree | |
50 | object is a list of permission/name/blob data, sorted by name. In other | |
51 | words the tree object is uniquely determined by the set contents, and so | |
52 | two separate but identical trees will always share the exact same | |
53 | object. | |
54 | ||
55 | Again, a "tree" object is just a pure data abstraction: it has no | |
56 | history, no signatures, no verification of validity, except that the | |
57 | contents are again protected by the hash itself. So you can trust the | |
58 | contents of a tree, the same way you can trust the contents of a blob, | |
59 | but you don't know where those contents _came_ from. | |
60 | ||
61 | Side note on trees: since a "tree" object is a sorted list of | |
62 | "filename+content", you can create a diff between two trees without | |
63 | actually having to unpack two trees. Just ignore all common parts, and | |
64 | your diff will look right. In other words, you can effectively (and | |
65 | efficiently) tell the difference between any two random trees in O(n) | |
66 | where "n" is the size of the difference, rather than the size of the | |
67 | tree. | |
68 | ||
69 | Side note 2 on trees: since the name of a "blob" depends entirely and | |
70 | exclusively on its contents (i.e. there are no names or permissions | |
71 | involved), you can see trivial renames or permission changes by noticing | |
72 | that the blob stayed the same. However, renames with data changes need | |
73 | a smarter "diff" implementation. | |
74 | ||
75 | CHANGESET: The "changeset" object is an object that introduces the | |
76 | notion of history into the picture. In contrast to the other objects, | |
77 | it doesn't just describe the physical state of a tree, it describes how | |
78 | we got there, and why. | |
79 | ||
80 | A "changeset" is defined by the tree-object that it results in, the | |
81 | parent changesets (zero, one or more) that led up to that point, and a | |
82 | comment on what happened. Again, a changeset is not trusted per se: | |
83 | the contents are well-defined and "safe" due to the cryptographically | |
84 | strong signatures at all levels, but there is no reason to believe that | |
85 | the tree is "good" or that the merge information makes sense. The | |
86 | parents do not have to actually have any relationship with the result, | |
87 | for example. | |
88 | ||
89 | Note on changesets: unlike real SCMs, changesets do not contain rename | |
90 | information or file mode change information. All of that is implicit in | |
91 | the trees involved (the result tree, and the result trees of the | |
92 | parents), and describing that makes no sense in this idiotic file | |
93 | manager. | |
94 | ||
95 | TRUST: The notion of "trust" is really outside the scope of "git", but | |
96 | it's worth noting a few things. First off, since everything is hashed | |
97 | with SHA1, you _can_ trust that an object is intact and has not been | |
98 | messed with by external sources. So the name of an object uniquely | |
99 | identifies a known state - just not a state that you may want to trust. | |
100 | ||
101 | Furthermore, since the SHA1 signature of a changeset refers to the | |
102 | SHA1 signatures of the tree it is associated with and the signatures | |
103 | of the parent, a single named changeset specifies uniquely a whole | |
104 | set of history, with full contents. You can't later fake any step of | |
105 | the way once you have the name of a changeset. | |
106 | ||
107 | So to introduce some real trust in the system, the only thing you need | |
108 | to do is to digitally sign just _one_ special note, which includes the | |
109 | name of a top-level changeset. Your digital signature shows others that | |
110 | you trust that changeset, and the immutability of the history of | |
111 | changesets tells others that they can trust the whole history. | |
112 | ||
113 | In other words, you can easily validate a whole archive by just sending | |
114 | out a single email that tells people the name (SHA1 hash) of the top | |
115 | changeset, and digitally signing that email using something like GPG/PGP. | |
116 | ||
117 | In particular, you can also have a separate archive of "trust points" or | |
118 | tags, which document your (and other people's) trust. You may, of | |
119 | course, archive these "certificates of trust" using "git" itself, but | |
120 | it's not something "git" does for you. | |
121 | ||
122 | Another way of saying the same thing: "git" itself only handles content | |
123 | integrity, the trust has to come from outside. | |
124 | ||
125 | Current Directory Cache (".dircache/index") | |
126 | ||
127 | The "current directory cache" is a simple binary file, which contains an | |
128 | efficient representation of a virtual directory content at some random | |
129 | time. It does so by a simple array that associates a set of names, | |
130 | dates, permissions and content (aka "blob") objects together. The cache | |
131 | is always kept ordered by name, and names are unique at any point in | |
132 | time, but the cache has no long-term meaning, and can be partially | |
133 | updated at any time. | |
134 | ||
135 | In particular, the "current directory cache" certainly does not need to | |
136 | be consistent with the current directory contents, but it has two very | |
137 | important attributes: | |
138 | ||
139 | (a) it can re-generate the full state it caches (not just the directory | |
140 | structure: through the "blob" object it can regenerate the data too) | |
141 | ||
142 | As a special case, there is a clear and unambiguous one-way mapping | |
143 | from a current directory cache to a "tree object", which can be | |
144 | efficiently created from just the current directory cache without | |
145 | actually looking at any other data. So a directory cache at any | |
146 | one time uniquely specifies one and only one "tree" object (but | |
147 | has additional data to make it easy to match up that tree object | |
148 | with what has happened in the directory) | |
149 | ||
150 | ||
151 | and | |
152 | ||
153 | (b) it has efficient methods for finding inconsistencies between that | |
154 | cached state ("tree object waiting to be instantiated") and the | |
155 | current state. | |
156 | ||
157 | Those are the two ONLY things that the directory cache does. It's a | |
158 | cache, and the normal operation is to re-generate it completely from a | |
159 | known tree object, or update/compare it with a live tree that is being | |
160 | developed. If you blow the directory cache away entirely, you haven't | |
161 | lost any information as long as you have the name of the tree that it | |
162 | described. | |
163 | ||
164 | (But directory caches can also have real information in them: in | |
165 | particular, they can have the representation of an intermediate tree | |
166 | that has not yet been instantiated. So they do have meaning and usage | |
167 | outside of caching - in one sense you can think of the current directory | |
168 | cache as being the "work in progress" towards a tree commit). |
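
The object layout and the two-step consistency check described under "The Object Database" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the function names `make_object` and `validate_object` are hypothetical, and the sketch follows this document's description literally, so the object name is the SHA1 of the _compressed_ bytes.

```python
import hashlib
import zlib
from typing import Tuple


def make_object(tag: str, data: bytes) -> Tuple[str, bytes]:
    # Hypothetical helper: build <tag> <space> <decimal size> <NUL> <data>,
    # deflate it with zlib, and name it by the SHA1 of the deflated bytes
    # (as this document describes).
    if not tag or " " in tag or not tag.isascii():
        raise ValueError("tag must be non-empty ASCII without spaces")
    raw = tag.encode("ascii") + b" " + str(len(data)).encode("ascii") + b"\0" + data
    compressed = zlib.compress(raw)
    name = hashlib.sha1(compressed).hexdigest()
    return name, compressed


def validate_object(name: str, compressed: bytes) -> Tuple[str, bytes]:
    # Hypothetical helper: check an object without knowing its type.
    # (a) the hash must match the content of the stored file
    if hashlib.sha1(compressed).hexdigest() != name:
        raise ValueError("hash does not match content")
    # (b) the object must inflate to tag + space + size + NUL + data
    try:
        raw = zlib.decompress(compressed)
    except zlib.error as exc:
        raise ValueError("object does not inflate") from exc
    header, nul, data = raw.partition(b"\0")
    if not nul:
        raise ValueError("missing NUL after header")
    tag, space, size = header.partition(b" ")
    if not tag or not space or not size.isdigit() or int(size) != len(data):
        raise ValueError("malformed header")
    return tag.decode("ascii"), data


if __name__ == "__main__":
    name, stored = make_object("blob", b"hello\n")
    print(name, validate_object(name, stored))
```

Because the name is derived from the deflated bytes in this sketch, two writers only agree on an object's name if they produce identical zlib output; the validation itself never needs to interpret the tag or the data.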