# Tar.jl
The `Tar` package can list, extract and create POSIX TAR archives ("tarballs")
as specified in [POSIX
1003.1-2001](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html).
It is designed to support using the TAR format as a mechanism for sending trees
of files from one system to another, rather than for the historical use case of
backing up files for restoration to the same system. Because of this design
goal, `Tar` intentionally ignores much of the metadata included in the TAR
format, which does not make sense for the data transfer use case. The package
also does not aim to read or create legacy non-POSIX variants of the TAR format,
although it does support reading GNU long name and long link extensions.
## API & Usage
The public API of `Tar` includes five functions and one type:
* `create` — creates a tarball from an on-disk file tree
* `extract` — extracts a tarball to an on-disk file tree
* `list` — lists the contents of a tarball as a vector of `Header` objects
* `rewrite` — rewrite a tarball to the standard format `create` produces
* `tree_hash` — compute a tree hash of the content of a tarball (default: git
SHA1)
* `Header` — struct representing metadata that `Tar` considers important in a
TAR entry
None of these are exported, however: the recommended usage is to do `import Tar`
and then access all of these names fully qualified as `Tar.create`,
`Tar.extract` and so on.
<!-- BEGIN: copied from inline doc strings -->
### Tar.create
```jl
create([ predicate, ] dir, [ tarball ]; [ skeleton ]) -> tarball
```
* `predicate :: String --> Bool`
* `dir :: AbstractString`
* `tarball :: Union{AbstractString, AbstractCmd, IO}`
* `skeleton :: Union{AbstractString, AbstractCmd, IO}`
Create a tar archive ("tarball") of the directory `dir`. The resulting archive
is written to the path `tarball` or if no path is specified, a temporary path is
created and returned by the function call. If `tarball` is an IO object then the
tarball content is written to that handle instead (the handle is left open).
If a `predicate` function is passed, it is called on each system path that is
encountered while recursively searching `dir` and `path` is only included in the
tarball if `predicate(path)` is true. If `predicate(path)` returns false for a
directory, then the directory is excluded entirely: nothing under that directory
will be included in the archive.
If the `skeleton` keyword is passed then the file or IO handle given is used as
a "skeleton" to generate the tarball. You create a skeleton file by passing the
`skeleton` keyword to the `extract` command. If `create` is called with that
skeleton file and the extracted files haven't changed, an identical tarball is
recreated. The `skeleton` and `predicate` arguments cannot be used together.
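As a minimal sketch of predicate-based filtering, the snippet below archives a throwaway tree while leaving out `.log` files. The file names and the temporary directory are hypothetical and chosen only for illustration.

```jl
import Tar

# Build a small throwaway tree (hypothetical file names).
dir = mktempdir()
write(joinpath(dir, "keep.txt"), "hello")
write(joinpath(dir, "skip.log"), "noise")
mkdir(joinpath(dir, "sub"))
write(joinpath(dir, "sub", "data.txt"), "world")

# The predicate receives each system path found under `dir`;
# returning false excludes that path from the tarball.
tarball = Tar.create(path -> !endswith(path, ".log"), dir)

# Collect the plain-file entries that ended up in the archive.
files = sort([hdr.path for hdr in Tar.list(tarball) if hdr.type == :file])
```

Since no `tarball` argument is given, `create` writes to a temporary path and returns it.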
### Tar.extract
```jl
extract([ predicate, ] tarball, [ dir ];
[ skeleton, ] [ copy_symlinks ]) -> dir
```
* `predicate :: Header --> Bool`
* `tarball :: Union{AbstractString, AbstractCmd, IO}`
* `dir :: AbstractString`
* `skeleton :: Union{AbstractString, AbstractCmd, IO}`
* `copy_symlinks :: Bool`
Extract a tar archive ("tarball") located at the path `tarball` into the
directory `dir`. If `tarball` is an IO object instead of a path, then the
archive contents will be read from that IO stream. The archive is extracted to
`dir` which must either be an existing empty directory or a non-existent path
which can be created as a new directory. If `dir` is not specified, the archive
is extracted into a temporary directory which is returned by `extract`.
If a `predicate` function is passed, it is called on each `Header` object that
is encountered while extracting `tarball` and the entry is only extracted if the
`predicate(hdr)` is true. This can be used to selectively extract only parts of
an archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the extraction process.
If the `skeleton` keyword is passed then a "skeleton" of the extracted tarball
is written to the file or IO handle given. This skeleton file can be used to
recreate an identical tarball by passing the `skeleton` keyword to the `create`
function. The `skeleton` and `predicate` arguments cannot be used together.
If `copy_symlinks` is `true` then instead of extracting symbolic links as such,
they will be extracted as copies of what they link to if they are internal to
the tarball and if it is possible to do so. Non-internal symlinks, such as a
link to `/etc/passwd` will not be copied. Symlinks which are in any way cyclic
will also not be copied and will instead be skipped. By default, `extract` will
detect whether symlinks can be created in `dir` or not and will automatically
copy symlinks if they cannot be created.
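A short sketch of selective extraction, assuming a tarball built from a hypothetical two-file tree: the predicate inspects each `Header` and only entries whose path ends in `.txt` are written to disk.

```jl
import Tar

# Create a tarball from a throwaway tree (hypothetical file names).
src = mktempdir()
write(joinpath(src, "a.txt"), "A")
write(joinpath(src, "b.md"), "B")
tarball = Tar.create(src)

# Extract only the entries the predicate accepts; with no `dir` argument
# the archive is extracted into a temporary directory that is returned.
out = Tar.extract(hdr -> endswith(hdr.path, ".txt"), tarball)

extracted = readdir(out)
```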
### Tar.list
```jl
list(tarball; [ strict = true ]) -> Vector{Header}
list(callback, tarball; [ strict = true ])
```
* `callback :: Header --> Bool`
* `tarball :: Union{AbstractString, AbstractCmd, IO}`
* `strict :: Bool`
List the contents of a tar archive ("tarball") located at the path `tarball`. If
`tarball` is an IO handle, read the tar contents from that stream. Returns a
vector of `Header` structs. See [`Header`](@ref) for details. If a `callback` is
provided then instead of returning a vector of headers, the callback is called
on each `Header`. This can be useful if the number of items in the tarball is
large or if you want to examine items prior to an error in the tarball.
By default `list` will error if it encounters any tarball contents which the
`extract` function would refuse to extract. With `strict=false` it will skip
these checks and list all the contents of the tar file whether `extract`
would extract them or not. Beware that malicious tarballs can do all sorts of
crafty and unexpected things to try to trick you into doing something bad.
If the `tarball` argument is a skeleton file (see `extract` and `create`) then
`list` will detect that from the file header and appropriately list or iterate
the headers of the skeleton file.
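The callback form can be sketched as follows, assuming a small hypothetical tree. Rather than collecting a vector of headers, the callback accumulates the total size of the record data as each `Header` is read.

```jl
import Tar

# Create a tarball from a throwaway tree (hypothetical file names).
src = mktempdir()
write(joinpath(src, "one.txt"), "1")
write(joinpath(src, "two.txt"), "22")
tarball = Tar.create(src)

# Sum the data size of every entry without building a vector of headers.
total = Ref(0)
Tar.list(tarball) do hdr
    total[] += hdr.size
    true  # return a Bool, matching the documented callback signature
end
```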
### Tar.rewrite
```jl
rewrite([ predicate, ] old_tarball, [ new_tarball ]) -> new_tarball
```
* `predicate :: Header --> Bool`
* `old_tarball :: Union{AbstractString, AbstractCmd, IO}`
* `new_tarball :: Union{AbstractString, AbstractCmd, IO}`
Rewrite `old_tarball` to the standard format that `create` generates, while also
checking that it doesn't contain anything that would cause `extract` to raise an
error. This is functionally equivalent to doing
```jl
Tar.create(Tar.extract(predicate, old_tarball), new_tarball)
```
However, it never extracts anything to disk and instead uses the `seek` function
to navigate the old tarball's data. If no `new_tarball` argument is passed, the
new tarball is written to a temporary file whose path is returned.
If a `predicate` function is passed, it is called on each `Header` object that
is encountered while extracting `old_tarball` and the entry is skipped unless
`predicate(hdr)` is true. This can be used to selectively rewrite only parts of
an archive, to skip entries that would cause `extract` to throw an error, or to
record what content is encountered during the rewrite process.
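A minimal sketch of filtering during a rewrite, assuming a hypothetical tree containing a scratch file that should not survive: entries rejected by the predicate are dropped from the new tarball and nothing is extracted to disk.

```jl
import Tar

# Create a tarball from a throwaway tree (hypothetical file names).
src = mktempdir()
write(joinpath(src, "keep.txt"), "data")
write(joinpath(src, "drop.tmp"), "scratch")
old_tarball = Tar.create(src)

# Rewrite the archive, skipping every entry whose path ends in ".tmp".
# With no `new_tarball` argument, a temporary file path is returned.
new_tarball = Tar.rewrite(hdr -> !endswith(hdr.path, ".tmp"), old_tarball)

kept = [hdr.path for hdr in Tar.list(new_tarball)]
```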
### Tar.tree_hash
```jl
tree_hash([ predicate, ] tarball;
[ algorithm = "git-sha1", ]
[ skip_empty = false ]) -> hash::String
```
* `predicate :: Header --> Bool`
* `tarball :: Union{AbstractString, AbstractCmd, IO}`
* `algorithm :: AbstractString`
* `skip_empty :: Bool`
Compute a tree hash value for the file tree that the tarball contains. By
default, this uses git's tree hashing algorithm with the SHA1 secure hash
function (like current versions of git). This means that for any tarball whose
file tree git can represent—i.e. one with only files, symlinks and non-empty
directories—the hash value computed by this function will be the same as the
hash value git would compute for that file tree. Note that tarballs can
represent file trees with empty directories, which git cannot store, and this
function can generate hashes for those, which will, by default (see `skip_empty`
below for how to change this behavior), differ from the hash of a tarball which
omits those empty directories. In short, the hash function agrees with git on
all trees which git can represent, but extends (in a consistent way) the domain
of hashable trees to other trees which git cannot represent.
If a `predicate` function is passed, it is called on each `Header` object that
is encountered while processing `tarball` and an entry is only hashed if
`predicate(hdr)` is true. This can be used to selectively hash only parts of an
archive, to skip entries that cause `extract` to throw an error, or to record
what content is encountered during the hashing process.
Currently supported values for `algorithm` are `git-sha1` (the default) and
`git-sha256`, which uses the same basic algorithm as `git-sha1` but replaces the
SHA1 hash function with SHA2-256, the hash function that git will transition to
using in the future (due to known attacks on SHA1). Support for other file tree
hashing algorithms may be added in the future.
The `skip_empty` option controls whether directories in the tarball which
recursively contain no files or symlinks are included in the hash or ignored.
In general, if you are hashing the content of a tarball or a file tree, you care
about all directories, not just non-empty ones, so including these in the
computed hash is the default. So why does this function even provide the option
to skip empty directories? Because git refuses to store empty directories and
will ignore them if you try to add them to a repo. So if you compute a reference
tree hash by adding files to a git repo and then asking git for the tree
hash, the hash value that you get will match the hash value computed by
`tree_hash` with `skip_empty=true`. In other words, this option allows
`tree_hash` to emulate how git would hash a tree with empty directories. If you
are hashing trees that may contain empty directories (i.e. do not come from a
git repo), however, it is recommended that you hash them using a tool (such as
this one) that does not ignore empty directories.
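The effect of `algorithm` and `skip_empty` can be sketched with a hypothetical tree that contains one file and one empty directory. The hashes are compared rather than printed, since their exact values depend on the tree contents.

```jl
import Tar

# A throwaway tree with one file and one empty directory (hypothetical names).
src = mktempdir()
write(joinpath(src, "file.txt"), "content")
mkdir(joinpath(src, "empty"))
tarball = Tar.create(src)

h_default = Tar.tree_hash(tarball)                          # git-sha1, empty dirs included
h_git     = Tar.tree_hash(tarball, skip_empty=true)         # emulates what git would hash
h_sha256  = Tar.tree_hash(tarball, algorithm="git-sha256")  # same scheme with SHA2-256

# The empty directory contributes to the default hash only.
differs = h_default != h_git
```

The hashes are returned as hexadecimal strings, so the SHA1 variant is 40 characters long and the SHA2-256 variant is 64.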
### Tar.Header
The `Header` type is a struct representing the essential metadata for a single
record in a tar file with this definition:
```jl
struct Header
path :: String # path relative to the root
type :: Symbol # type indicator (see below)
mode :: UInt16 # mode/permissions (best viewed in octal)
size :: Int64 # size of record data in bytes
link :: String # target path of a symlink
end
```
Types are represented with the following symbols: `file`, `hardlink`, `symlink`,
`chardev`, `blockdev`, `directory`, `fifo`, or for unknown types, the typeflag
character as a symbol. Note that [`extract`](#Tarextract) refuses to extract
record types other than `file`, `symlink` and `directory`; [`list`](#Tarlist)
will only list other kinds of records if called with `strict=false`.
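As an illustration of how these fields might be inspected, the sketch below lists a tarball built from a hypothetical one-file tree and prints each header with its mode in octal.

```jl
import Tar

# Create a tarball from a throwaway tree (hypothetical file name).
src = mktempdir()
write(joinpath(src, "script.sh"), "#!/bin/sh\n")
tarball = Tar.create(src)

headers = Tar.list(tarball)
for hdr in headers
    # Print path, type, octal mode and data size for each entry.
    println(rpad(hdr.path, 12), " ", hdr.type, " ",
            string(hdr.mode, base=8), " ", hdr.size)
end
```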
<!-- END: copied from inline doc strings -->
### Compression
It is typical to compress tarballs when saving or transferring them. In the UNIX
tradition of doing one thing and doing it well, the `Tar` package does not do
any kind of compression and instead makes it easy to compose its API with
external compression tools. The simplest way to read a compressed archive is to
use a command-line tool to decompress it. For example:
```jl
Tar.list(`gzcat $tarball`)
Tar.extract(`gzcat $tarball`)
```
This will spawn the `gzcat $tarball` command, read the uncompressed tarball data
from the output of that process, and then close the process. Creating a tarball
with the `gzip` command is nearly as easy:
```jl
Tar.create(dir, pipeline(`gzip -9`, tarball))
```
This assumes that `dir` is the directory you want to archive and `tarball` is
the path you want to create as a compressed archive.
If you want to compress or decompress a tarball in the same process, you can
use various
[TranscodingStreams](https://github.com/JuliaIO/TranscodingStreams.jl)
packages:
```jl
using CodecZlib
tar_gz = open(tarball, write=true)
tar = GzipCompressorStream(tar_gz)
Tar.create(dir, tar)
close(tar)
```
This assumes that `dir` is the directory you want to archive and `tarball` is
the path you want to create as a compressed archive. You can decompress
in-process in a similar manner:
```jl
using CodecZlib
tar_gz = open(tarball)
tar = GzipDecompressorStream(tar_gz)
dir = Tar.extract(tar)
close(tar)
```
This assumes that `tarball` is the path of the compressed archive you want to
extract.
### API comparison with command-line tar
It might be helpful to compare the `Tar` API with command-line `tar`. Unlike
`tar -c` the `Tar.create` function does not include any of the path you tell it
to bundle in the resulting TAR file: the location of the data is not part of the
data. Doing `Tar.create(dir, tarball)` is roughly equivalent to running the
following `tar` command:
```sh
tar -f $tarball -C $dir -c $(cd $dir; ls -A)
```
In other words, `tar` is told to change into the directory `dir` before
constructing the tarball and then include all the top-level items in that
directory without any path prefix. Note that the above command does not fully
emulate the behavior of `Tar.create`: it does not sort entries in the same order
and it still records user and group information, modification times and exact
permissions. Coaxing command-line `tar` programs to omit this non-portable
information and use a portable (and `git`-compatible) sort order is non-trivial.
Another difference from command-line `tar`: non-empty directories are also
omitted from the tarballs that `Tar` creates since no metadata is recorded about
directories aside from the fact that they exist and the existence of non-empty
directories is already implied by the fact that they contain something else. If,
in the future, the ability to record metadata about directories is added,
tarballs will record entries for non-empty directories with non-default
metadata.
On the extraction side of things, doing `Tar.extract(tarball, dir)` is roughly
equivalent to the following commands:
```sh
test -d $dir || mkdir $dir
tar -f $tarball -C $dir -mx
```
Again, `tar` is told to change into the directory `dir` before extracting the
tarball and to extract each path relative to that directory. The `-m` option
tells `tar` to ignore the modification times recorded in the tarball and just
let files and directories have their natural modification times.
If the current user has elevated privileges, the `tar` command will attempt to
change the owner and group of files to what is recorded in the tarball, whereas
`Tar.extract` will never do that. The `tar` command may also try to restore
permissions without respecting the current `umask` if the user is an
administrator. Again, `Tar.extract` will never do that—it behaves the same way
for any users: by ignoring any user/group/permission information, aside from
whether plain files are executable by their owner or not. To suppress these
behaviors with GNU tar, you can use the `--no-same-owner` and
`--no-same-permissions` options; these options are not broadly supported by
other `tar` commands, which may not have options to support these behaviors.
## Design & Features
Unlike the `tar` command line tool, which was originally designed to archive
data in order to restore it back to the same system or to a replica thereof, the
`Tar` package is designed for using the TAR format to transfer trees of files
and directories from one system to another. This design goal means that some
metadata fields supported by the TAR format and used by default by historical
`tar` tools are not used or supported by `Tar`. In short, the choice of features
and defaults for `Tar` are designed to support transfer of data, rather than
backup and restoration.
The TAR format can, for example, record the name and ID of the user that owns
each file. Recording this information makes perfect sense when using tarballs
for backup: the `tar` program should run as root when restoring data, so it can
restore the original owner of each file and directory. On the other hand, this
ownership information is of no use when using the TAR format to transfer data
from one system to another: the user names and IDs will not generally be the
same on different systems, and the tool should _not_ be run as `root`, so it
cannot change the owner of anything it extracts. For data transfer, ownership
metadata should be disregarded and need not be recorded in the first place.
Similarly, it makes little sense, when using tarballs for data transfer, to copy
the modification time of each file from the source system. Those time stamps are
unlikely to be relevant on the destination system, and in some cases, clock skew
between the systems could mean that time stamps from the source appear to be in
the future at the destination. This can confuse some programs and may even be
perceived as an attempted security breach; most `tar` command line tools print
warnings when extracting files with time stamps from the future. When using the
TAR format for data transfer, it is better to ignore time stamps and just let
the extracted contents have natural modification times.
The features and defaults of the `Tar` package are guided by the principle that
it uses the TAR format for transmitting data, not as a tool for backup and
restoration. If you want to use the TAR format for archival purposes, you are
likely better off using a traditional command line tool like [GNU
tar](https://www.gnu.org/software/tar/). If, on the other hand, you want to use
the TAR format to transmit data from one system to another, then you've come to
the right place.
### File Types
Since `Tar` is designed for transmission of file and directory trees, it
supports only the following file types:
* plain files
* directories
* symlinks
The `Tar` package does not support other file types that the TAR format can
represent, including: hard links, character devices, block devices, and FIFOs.
If you attempt to create or extract an archive that contains any of these kinds
of entries, `Tar` will raise an error. You can, however, list the contents of a
tarball containing other kinds of entries by passing the `strict=false` flag to
the `list` function; without this option, `list` raises the same error as
`extract` would.
In the future, optional support may be added for using hard links within
archives to avoid duplicating identical files.
### Time Stamps
Also in accordance with its design goal as a data transfer tool, the `Tar`
package does not record or set modification times upon tarball creation and
extraction. When creating a tarball, it sets the time stamp of each entry to
`0`, representing the UNIX epoch (Jan 1st, 1970). When extracting a tarball, it
ignores the time stamps of entries and lets all extracted content have "natural"
modification times based on when each file or directory is extracted.
In the future, optional support may be added for recording and restoring time
stamps.
### Users & Groups
`Tar` ignores user and group names and IDs when creating and extracting
tarballs. This is due to two facts:
* names and IDs on source and destination systems will generally not match;
* names and IDs can only be changed if `Tar` is run with elevated privileges.
The first fact means that it probably doesn't make sense to try to restore
ownership when transferring data, while the second fact means that it's probably
not possible. Accordingly, `Tar` disregards user and group names and IDs when
creating and extracting tarballs. During creation, the ID fields are recorded as
`0` and names fields are recorded as the empty string. When extracting a
tarball, the user and group fields are ignored entirely and all extracted
content is owned by the current user.
It is unlikely that support will be added for recording or restoring ownership
of files or directories since that functionality only makes sense when using the
TAR format for backup, a purpose better served by using a command line `tar`
tool.
### Permissions
Upon tarball extraction, `Tar` respects the permissions recorded for each file.
When creating a tarball, however, it ignores most permission information and
normalizes permissions as follows:
* files that are not executable by the owner are archived with mode `0o644`;
* files that are executable by the owner are archived with mode `0o755`;
* directories and symlinks are always archived with mode `0o755`.
In other words, `Tar` records only one significant bit of information: whether
plain files are executable by their owner or not. No permission information for
directories or symlinks is considered significant. This one bit of information
is the only one which makes sense across all platforms, so this choice makes
`Tar`'s behavior as portable as possible. On systems (like Windows) that do not
use POSIX modes, whatever permission mechanism exists (_e.g._ ACLs) should be
queried/modified to determine whether each file is executable by its owner or
not. Unfortunately, this is currently broken on Windows since `libuv` does not
correctly support querying or changing the user executable "bit"; this is
actively being worked on, however, and should be fixed in future versions of
Julia.
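The normalization rules can be sketched as follows, assuming a POSIX system where `chmod` works as expected (per the caveat above, this may not hold on Windows). The file names and the restrictive source modes are hypothetical.

```jl
import Tar

src = mktempdir()

# An owner-executable file and a non-executable file, both with
# restrictive on-disk modes.
exe = joinpath(src, "run.sh")
write(exe, "#!/bin/sh\n")
chmod(exe, 0o700)

txt = joinpath(src, "notes.txt")
write(txt, "hi")
chmod(txt, 0o600)

tarball = Tar.create(src)

# Map each archived path to the mode recorded in its header.
modes = Dict(hdr.path => hdr.mode for hdr in Tar.list(tarball))
```

Only the owner-executable bit should survive: `run.sh` is recorded as `0o755` and `notes.txt` as `0o644`, regardless of the original group and other bits.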
In the future, optional support may be added for recording exact permission
modes on POSIX systems, and possibly for normalizing permissions on extraction
in the same way that they are normalized upon archive creation.
### Reproducibility
The information that `Tar` records about permissions is the same information
that `git` considers to be significant when recording and hashing tree contents
(admittedly not by coincidence). As a result, an important and useful
consequence of `Tar`'s design is that it has the following properties:
* if you create a tarball from a file tree and extract it, the new tree will
have the same `git` tree hash as the original;
* if you `git checkout` a file tree and archive it using `Tar`, the resulting
TAR archive file is always the same.
One important caveat to keep in mind is that `git` ignores directories that
recursively contain only directories—_i.e._ unless there's a file or a symlink
somewhere, `git` will not acknowledge the existence of a subdirectory. This
means that two trees with the same `git` tree hash can produce different
tarballs if they differ by subdirectories containing no files or symlinks: `git`
will ignore those subdirectories, while `Tar` will not. Therefore, they will
have the same `git` tree hash, but produce different tarballs. Two _identical_
file trees will always produce identical tarballs, however, and that tarball
should remain stable in future versions of the `Tar` package.
The `tree_hash` function can be used to compute a git-style tree hash of the
contents of a tarball (without needing to extract it). Moreover, two tarballs
created by the `Tar` package will have the same hash if and only if they contain
the same file tree, which is true if and only if they are identical tarballs.
You can, however, hash tarballs not created by `Tar` this way to see if they
represent the same file tree, and you can use the `skip_empty=true` option to
`tree_hash` to compute the hash that `git` would assign the tree, ignoring empty
directories.
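A round trip illustrating these properties might look like the sketch below, assuming a hypothetical tree with one nested file: the tree is archived, extracted, and archived again, and the two tarballs are compared by tree hash and by content.

```jl
import Tar

# A throwaway tree with a nested file (hypothetical names).
src = mktempdir()
write(joinpath(src, "a.txt"), "alpha")
mkdir(joinpath(src, "sub"))
write(joinpath(src, "sub", "b.txt"), "beta")

first_tarball  = Tar.create(src)
extracted      = Tar.extract(first_tarball)   # extracted into a temporary directory
second_tarball = Tar.create(extracted)

same_hash  = Tar.tree_hash(first_tarball) == Tar.tree_hash(second_tarball)
same_bytes = read(first_tarball) == read(second_tarball)
```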