This is git-annex's todo list. Link items to done when done.

git-annex unused eats memory
Posted Sat Sep 22 03:36:59 2012

parallel possibilities
Posted Tue Jul 17 17:54:57 2012

wishlist: swift backend
Posted Tue Jul 17 17:54:57 2012

tahoe lfs for reals
Posted Tue Jul 17 17:54:57 2012

union mounting
Posted Tue Jul 17 17:54:57 2012

hidden files
Posted Tue Jul 17 17:54:57 2012

optimise git-annex merge
Posted Tue Jul 17 17:54:57 2012

cache key info
Posted Tue Jul 17 17:54:57 2012

smudge
Posted Tue Jul 17 17:54:57 2012

add -all option
Posted Tue Jul 17 17:54:57 2012

windows support
Posted Tue Jul 17 17:54:57 2012

redundancy stats in status
Posted Tue Jul 17 17:54:57 2012

automatic bookkeeping watch command
Posted Tue Jul 17 17:54:57 2012

wishlist: special-case handling of Youtube URLs in Web special remote
Posted Tue Jul 17 17:54:57 2012

support S3 multipart uploads
Posted Tue Jul 17 17:54:57 2012

I have the same use case as Asheesh but I want to be able to see which filenames point to the same objects and then decide which of the duplicates to drop myself. I think

git annex drop --by-contents

would be the wrong approach because how does git-annex know which ones to drop? There's too much potential for error.

Instead it would be great to have something like

git annex finddups

While it's easy enough to knock up a bit of shell or Perl to achieve this, that relies on knowledge of the annex symlink structure, so I think really it belongs inside git-annex.

If this command gave output similar to the excellent fastdup utility:

Scanning for files... 672 files in 10.439 seconds
Comparing 2 sets of files...

2 files (70.71 MB/ea)
        /home/adam/media/flat/tour/flat-tour.3gp
        /home/adam/videos/tour.3gp

Found 1 duplicate of 1 file (70.71 MB wasted)
Scanned 672 files (1.96 GB) in 11.415 seconds

then you could do stuff like

git annex finddups | grep /home/adam/media/flat | xargs rm
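
In the meantime, a rough sketch of the sort of symlink-reading hack alluded to above (assuming GNU find, that every file is annexed as a symlink, and no tabs in filenames) might be:

  # list "key <TAB> file" pairs by reading each annex symlink's target,
  # then sort so that files sharing a key end up on adjacent lines
  find . -type l -lname '*annex/objects/*' -printf '%l\t%p\n' \
    | awk -F'\t' '{ n = split($1, part, "/"); print part[n] "\t" $2 }' \
    | sort

Duplicate contents then show up as runs of adjacent lines sharing the first field.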
Comment by Adam Thu Dec 22 12:31:29 2011

My main concern with putting this in git-annex is that finding duplicates necessarily involves storing a list of every key and file in the repository, and git-annex is very carefully built to avoid things that require non-constant memory use, so that it can scale to very big repositories. (The only exception is the unused command, and reducing its memory usage is a continuing goal.)

So I would rather come at this from a different angle.. like providing a way to output a list of files and their associated keys, which the user can then use in their own shell pipelines to find duplicate keys:

  git annex find --include '*' --format='${file} ${key}\n' | sort --key 2 | uniq --all-repeated --skip-fields=1

Which is implemented now!

(Making that pipeline properly handle filenames with spaces is left as an exercise for the reader..)
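
(One possible answer to that exercise, assuming the same ${key}/${file} format escapes as above: put the key first and group on it with awk, so the filename is never split on spaces.)

  git annex find --include '*' --format='${key} ${file}\n' | sort | \
    awk '{ if ($1 == prev) { if (last != "") print last; print; last = "" } else last = $0; prev = $1 }'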

Comment by joey Thu Dec 22 16:39:24 2011

Well, I spent a few hours playing this evening in the 'reorg' branch in git. It seems to be shaping up pretty well; type-based refactoring in haskell makes these kind of big systematic changes a matter of editing until it compiles. And it compiles and test suite passes. But, so far I've only covered 1. 3. and 4. on the list, and have yet to deal with upgrades.

I'd recommend you not wait before using git-annex. I am committed to providing upgradability between annexes created with all versions of git-annex, going forward. This is important because we can have offline archival drives that sit unused for years. Git-annex will upgrade a repository to the current standard the first time it sees it, and I hope the upgrade will be pretty smooth; it was not bad for the annex.version 0 to 1 upgrade earlier. The only annoyance with upgrades is that they will result in some big commits to git, as every symlink in the repo gets changed and log files get moved to new names.

(The metadata being stored with keys is data that a particular backend can use, and is static to a given key, so there are no merge issues (and it won't be used to preserve mtimes, etc).)

Comment by joey Wed Mar 16 03:22:45 2011

What about Cygwin? It emulates POSIX fairly well under Windows (including signals, forking, the filesystem (things like /dev/null and /proc), and unix file permissions), and has all the standard GNU utilities. It also emulates symlinks, but they are unfortunately incompatible with the NTFS symlinks introduced in Vista, due to some stupid restrictions on Windows.

If git-annex could be modified to not require symlinks to work, then it would be a pretty neat solution (and you get a real shell, not some command.com on drugs (aka cmd.exe)).

Comment by Zoltán Tue May 15 00:14:08 2012

What is the potential time-frame for this change? As I am not using git-annex for production yet, I can see myself waiting to avoid any potential hassle.

Supporting generic metadata seems like a great idea. Though if you are going down this path, wouldn't it make sense to avoid metastore for mtime etc. and support this natively, without outside dependencies?

-- RichiH

Comment by Richard Tue Mar 15 14:08:41 2011

The mtime cannot be stored for all keys. Consider a SHA1 key. The mtime is irrelevant; 2 files with different mtimes, when added to the SHA1 backend, should get the same key.
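
A trivial illustration (GNU touch assumed), with plain sha1sum standing in for the SHA1 backend:

  # identical content, different mtimes -- the digest, and hence the key, is the same
  printf 'hello\n' > a
  printf 'hello\n' > b
  touch -d '2001-01-01' b
  sha1sum a b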

Probably our spam filter doesn't like your work IP.

Comment by joey Wed Mar 16 16:32:52 2011

Windows support is a must. In my experience, binary file means proprietary editor, which means Windows.

Unfortunately, there's not much overlap between people who use graphical editors in Windows all day and people who are willing to tolerate Cygwin's setup.exe, compile a Haskell program, learn git and git-annex's 90-odd subcommands, and use a mintty terminal to manage their repository, especially now that there's a sexy GitHub app for Windows.

That aside, I think Windows-based content producers are still the audience for git-annex. First Windows support, then a GUI, then the world.

Comment by Michael Wed May 23 19:30:21 2012

For what it's worth, yes, I want to actually forget I ever had the same file in the filesystem with a duplicated name. I'm not just aiming to clean up the disk's space usage; I'm also aiming to clean things up so that navigating the filesystem is easier.

I can write my own script to do that based on the symlinks' target (and I wrote something along those lines), but I still think it'd be nicer if git-annex supported this use case.

Perhaps:

git annex drop --by-contents

could let me remove a file from git-annex if the contents are available through a different name. (Right now, "git annex drop" requires the name and contents match.)

-- Asheesh.

Comment by Asheesh Fri Apr 29 11:48:22 2011

Ah, OK. I assumed the metadata would be attached to a key, not part of the key. This seems to make upgrades/extensions down the line harder than they need to be, but you are right that this way, merges are not, and never will be, an issue.

Though with the SHA1 backend, changing files can be tracked. This means that tracking changes in mtime or other is possible. It also means that there are potential merge issues. But I won't argue the point endlessly. I can accept design decisions :)

The prefix at work is from a university netblock so yes, it might be on a few hundred proxy lists etc.

Comment by Richard Wed Mar 16 21:05:38 2011

I agree with Christian.

One should first make better use of connections to remotes before exploring parallel possibilities. One should pipeline the requests and answers.

Of course this could be implemented using the parallelism and concurrency features of Haskell.

Comment by npouillard Fri May 20 20:14:15 2011
Thank you for your answer and the link!
Comment by jbd Sat Feb 26 10:26:12 2011
Another thought - an SHA1 digest is 20 bytes. That means you can fit over 50 million keys into 1GB of RAM. Granted, you also need memory to store the values (pathnames), which in many cases will be longer, and some users may also choose more expensive backends than SHA1 ... but even so, it seems to me that you are at risk of throwing the baby out with the bath water.
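
(Back-of-envelope, counting raw digests only and ignoring hash-table and pathname overhead:)

  echo $(( 1024 * 1024 * 1024 / 20 ))   # 53687091, i.e. ~53.7 million 20-byte digests per GiB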
Comment by Adam Thu Dec 22 20:15:22 2011
It does offer an S3 compatibility layer, but that is de facto non-functional as of right now.
Comment by Richard Sat May 14 15:00:51 2011

I really do want just one filename per file, at least for some cases.

For my photos, there's no benefit to having a few filenames point to the same file. As I'm putting them all into the git-annex, that is a good time to remove the pure duplicates so that I don't e.g. see them twice when browsing the directory as a gallery. Also, I am uploading my photos to the web, and I want to avoid uploading the same photo (by content) twice.

I hope that makes things clearer!

For now I'm just doing this:

  paulproteus@renaissance:/mnt/backups-terabyte/paulproteus/sd-card-from-2011-01-06/sd-cards/DCIM/100CANON $
  for file in *; do
    # take just the digest; sha1sum prints "digest  filename"
    hash=$(sha1sum "$file" | cut -d' ' -f1)
    if ls /home/paulproteus/Photos/in-flickr/.git-annex | grep -q "$hash"; then
      echo already annexed
    else
      flickr_upload "$file" &&
        mv "$file" "/home/paulproteus/Photos/in-flickr/2011-01-28/from-some-nested-sd-card-bk" &&
        (cd /home/paulproteus/Photos/in-flickr/2011-01-28/from-some-nested-sd-card-bk &&
         git annex add . && git commit -m ...)
    fi
  done

(Yeah, Flickr for my photos for now. I feel sad about betraying the principle of autonomo.us-ness.)

Comment by Asheesh Fri Jan 28 07:30:05 2011

Sounds like a good idea.

  • git annex fsck (or similar) should check/rebuild the caches
  • I would simply require a clean tree, with a verbose error. 80/20 rule and defaulting to safe actions.
Comment by Richard Tue May 17 07:27:02 2011
Comment by joey Fri Feb 25 19:54:28 2011

> Your perl script is not O(n). Inserting into perl hash tables has overhead of minimum O(n log n).

What's your source for this assertion? I would expect an amortized average of O(1) per insertion, i.e. O(n) for full population.

> Not counting the overhead of resizing hash tables, the grievous slowdown if the bucket size is overcome by data (it probably falls back to a linked list or something then), and the overhead of traversing the hash tables to get data out.

None of which necessarily change the algorithmic complexity. However real benchmarks are far more useful here than complexity analysis, and the dangers of premature optimization should not be forgotten.

> Your memory size calculations ignore the overhead of a hash table or other data structure to store the data in, which will tend to be more than the actual data size it's storing. I estimate your 50 million number is off by at least one order of magnitude, and more likely two;

Sure, I was aware of that, but my point still stands. Even 500k keys per 1GB of RAM does not sound expensive to me.

> in any case I don't want git-annex to use 1 gb of ram.

Why not? What's the maximum it should use? 512MB? 256MB? 32MB? I don't see the sense in the author of a program dictating thresholds which are entirely dependent on the context in which the program is run, not the context in which it's written. That's why systems have files such as /etc/security/limits.conf.

You said you want git-annex to scale to enormous repositories. If you impose an arbitrary memory restriction such as the above, that means avoiding implementing any kind of functionality which requires O(n) memory or worse. Isn't it reasonable to assume that many users use git-annex on repositories which are not enormous? Even when they do work with enormous repositories, just like with any other program, they would naturally expect certain operations to take longer or become impractical without sufficient RAM. That's why I say that this restriction amounts to throwing out the baby with the bathwater. It just means that those who need the functionality would have to reimplement it themselves, assuming they are able, which is likely to result in more wheel reinventions. I've already shared my implementation but how many people are likely to find it, let alone get it working?

> Little known fact: sort(1) will use a temp file as a buffer if too much memory is needed to hold the data to sort.

Interesting. Presumably you are referring to some undocumented behaviour, rather than --batch-size which only applies when merging multiple files, and not when only sorting STDIN.

> It's also written in the most efficient language possible and has been ruthlessly optimised for 30 years, so I would be very surprised if it was not the best choice.

It's the best choice for sorting. But sorting purely to detect duplicates is a dismally bad choice.

Comment by Adam Fri Dec 23 17:22:11 2011
I don't suppose this Swift API is compatible with the Eucalyptus Walrus API?
Comment by Jimmy Sat May 14 10:04:36 2011

I'd expect the checksumming to be disk bound, not CPU bound, on most systems.

I suggest you start off on the WORM backend, and then you can run a job later to migrate to the SHA1 backend.
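
For example, the later migration could be as simple as the following (a sketch; check git-annex migrate's documentation for the exact backend option syntax):

  cd /path/to/annex
  git annex migrate --backend=SHA1 .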

Comment by joey Fri Feb 25 19:12:42 2011
I don't think that the tip about finding duplicate files is hard to find, and the multiple different ways it shows to deal with the duplicate files show the flexibility of the unix pipeline approach.
Comment by joey Fri Dec 23 17:52:21 2011
Hmm, I added quite a few comments at work, but they are stuck in moderation. Maybe I forgot to log in before adding them. I am surprised this one appeared immediately. -- RichiH
Comment by Richard Wed Mar 16 01:19:25 2011
BTW, sort -S '90%' benchmarks consistently 2x as fast as perl's hashes all the way up to 1 million files. Of course the pipeline approach allows you to swap in perl or whatever else is best for you at scale.
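
For reference, that slots into the earlier pipeline as something like this (GNU sort assumed; -S only raises the in-memory buffer before sort spills to temp files):

  git annex find --include '*' --format='${file} ${key}\n' \
    | sort -S '90%' --key 2 \
    | uniq --all-repeated --skip-fields=1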
Comment by joey Fri Dec 23 18:02:24 2011

> (Sadly, it cannot create a symlink, as git still wants to write the file afterwards. So the nice current behavior of unavailable files being clearly missing due to dangling symlinks, would be lost when using smudge/clean filters. (Contact git developers to get an interface to do this?))

Have you checked what the smudge filter sees when the input is a symlink? Because git supports tracking symlinks, so it should also support pushing symlinks through a smudge filter, right? Either way: yes, contact the git devs, one can only ask and hope. And if you can demonstrate the awesomeness of git-annex they might get more interested :)
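
For reference, this is the generic way a smudge/clean filter pair is wired up in git; the "annex" filter name and the two filter commands below are purely hypothetical placeholders, not an existing git-annex interface:

  # tell git which paths go through the (hypothetical) filter
  echo '* filter=annex' >> .gitattributes
  # wire up the hypothetical commands; %f is git's placeholder for the filename
  git config filter.annex.clean  'annex-clean-filter %f'
  git config filter.annex.smudge 'annex-smudge-filter %f'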

Comment by dieter Sun Apr 3 20:30:21 2011

Hashing & segmenting seems to be around the corner, which is nice :)

Is there a chance that you will optionally add mtime to your native metadata store? If yes, I'd rather wait for v2 to start with the native system from the start. If not, I will probably set it up tonight.

PS: While posting from work, my comments are held for moderation once again. I am somewhat confused as to why this happens when I can just submit directly from home. And yes, I am using the same auth provider and user in both cases.

Comment by Richard Wed Mar 16 15:51:30 2011

https://github.com/aspiers/git-config/blob/master/bin/git-annex-finddups

but it would be better in git-annex itself ...

Comment by Adam Thu Dec 22 15:43:51 2011
If you support generic metadata, keep in mind that you will need to do conflict resolution. Timestamps may not be in sync across all systems, so you could keep a log of old metadata, sort it by history, and use the latest entry. That still leaves the situation of two incompatible changes, which would probably mean manual conflict resolution. You will probably have thought of this already, but I still wanted to make sure this is recorded. -- RichiH
Comment by Richard Wed Mar 16 01:16:48 2011

> My main concern with putting this in git-annex is that finding duplicates necessarily involves storing a list of every key and file in the repository

Only if you want to search the whole repository for duplicates, and if you do, then you're necessarily going to have to chew up memory in some process anyway, so what difference does it make whether it's git-annex or (say) a Perl wrapper?

> and git-annex is very carefully built to avoid things that require non-constant memory use, so that it can scale to very big repositories.

That's a worthy goal, but if everything could be implemented with an O(1) memory footprint then we'd be in a much more pleasant world :-) Even O(n) isn't that bad ...

That aside, I like your --format="%f %k\n" idea a lot. That opens up the "black box" of .git/annex/objects and makes nice things possible, as your pipeline already demonstrates. However, I'm not sure why you think git annex find | sort | uniq would be more efficient. Not only does the sort require the very thing you were trying to avoid (i.e. the whole list in memory), but it's also O(n log n) which is significantly slower than my O(n) Perl script linked above.

More considerations about this pipeline:

  • Doesn't it only include locally available files? Ideally it should spot duplicates even when the backing blob is not available locally.
  • What's the point of --include '*' ? Doesn't git annex find with no arguments already include all files, modulo the requirement above that they're locally available?
  • Any user using this git annex find | ... approach is likely to run up against its limitations sooner rather than later, because they're already used to the plethora of options find(1) provides. Rather than reinventing the wheel, is there some way git annex find could harness the power of find(1) ?

Those considerations aside, a combined approach would be to implement

git annex find --format=...

and then alter my Perl wrapper to popen(2) from that rather than using File::Find. But I doubt you would want to ship Perl wrappers in the distribution, so if you don't provide a Haskell equivalent then users who can't code are left high and dry.

Comment by Adam Thu Dec 22 20:04:14 2011

Hm... O(N^2)? I think it just takes O(N). To read an entry out of a directory you have to download the entire directory (and store it in RAM and parse it). The constants are basically "too big to be good but not big enough to be prohibitive", I think. jctang has reported that his special remote hook performs well enough to use, but it would be nice if it were faster.

The Tahoe-LAFS folks are working on speeding up mutable files, by the way, after which we would be able to speed up directories.

Comment by zooko Tue May 17 19:20:39 2011

Whoops! You'd only told me O(N) twice before..

So this is not too high priority. I think I would like to get the per-remote storage sorted out anyway, since probably it will be the thing needed to convert the URL backend into a special remote, which would then allow ripping out the otherwise unused pluggable backend infrastructure.

Update: Per-remote storage is now sorted out, so this could be implemented if it actually made sense to do so.

Comment by joey Tue May 17 19:57:33 2011

Adam, to answer a lot of points briefly..

  • --include='*' makes find list files whether their contents are present or not
  • Your perl script is not O(n). Inserting into perl hash tables has overhead of minimum O(n log n). Not counting the overhead of resizing hash tables, the grievous slowdown if the bucket size is overcome by data (it probably falls back to a linked list or something then), and the overhead of traversing the hash tables to get data out.
  • I think that git-annex's set of file matching options is coming along nicely, and new ones can easily be added, so see no need to pull in unix find(1).
  • Your memory size calculations ignore the overhead of a hash table or other data structure to store the data in, which will tend to be more than the actual data size it's storing. I estimate your 50 million number is off by at least one order of magnitude, and more likely two; in any case I don't want git-annex to use 1 gb of ram.
  • Little known fact: sort(1) will use a temp file as a buffer if too much memory is needed to hold the data to sort. It's also written in the most efficient language possible and has been ruthlessly optimised for 30 years, so I would be very surprised if it was not the best choice.
Comment by joey Fri Dec 23 16:07:39 2011

Hey Asheesh, I'm happy you're finding git-annex useful.

So, there are two forms of duplication going on here. There's duplication of the content, and duplication of the filenames pointing at that content.

Duplication of the filenames is probably not a concern, although it's what I thought you were talking about at first. It's probably info worth recording that backup-2010/some_dir/foo and backup-2009/other_dir/foo are two names you've used for the same content in the past. If you really wanted to remove backup-2009/foo, you could do it by writing a script that looks at the basenames of the symlink targets and removes files that point to the same content as other files.

Using SHA1 ensures that the same key is used for identical files, so generally avoids duplication of content. But if you have 2 disks with an identical file on each, and make them both into annexes, then git-annex will happily retain both copies of the content, one per disk. It generally considers keeping copies of content a good thing. :)

So, what if you want to remove the unnecessary copies? Well, there's a really simple way:

cd /media/usb-1
git remote add other-disk /media/usb-0
git annex add
git annex drop

This asks git-annex to add everything to the annex, but then remove any file contents that it can safely remove. What can it safely remove? Well, anything that it can verify is on another repository such as "other-disk"! So, this will happily drop any duplicated file contents, while leaving all the rest alone.

In practice, you might not want to have all your old backup disks mounted at the same time and configured as remotes. Look into configuring trust to avoid needing to do that. If usb-0 is already a trusted disk, all you need is a simple "git annex drop" on usb-1.
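
For example (a sketch; the remote name comes from the earlier snippet, and it assumes usb-0's location log has already been merged into usb-1):

  cd /media/usb-1
  git annex trust other-disk   # declare the other disk trusted
  git annex drop .             # drops content the trusted repository is recorded to have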

Comment by joey Thu Jan 27 18:29:44 2011
Would there, with these modifications in place, still be a way to really git-add a file? (My main repository contains both normal git and git-annex files.)
Comment by chrysn Sat Feb 26 21:43:21 2011
I also think that fetching keys via rsync could be done by one rsync process when the keys are fetched from one host. This would avoid establishing a new TCP connection for every file.
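
For illustration only (this is plain rsync batching, not how git-annex currently fetches): a single rsync invocation can pull a whole list of keys over one connection using --files-from.

  # keys.txt lists object paths relative to the remote's annex objects directory
  rsync -av --files-from=keys.txt host:/path/to/repo/.git/annex/objects/ .git/annex/objects/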
Comment by Christian Fri Apr 8 12:41:43 2011

Unless you are forced to use a password, you should really be using an ssh key.

ssh-keygen
#put local .ssh/id_?sa.pub into remote .ssh/authorized_keys (which needs to be chmod 600)
ssh-add
git annex whatever
Comment by Richard Fri May 6 18:30:02 2011