This is a place to discuss using git-annex. If you need help, advice, or anything, post about it here. (But post bug reports over here.)
[Installation] base-3.0.3.2 requires syb ==0.1.0.2 however syb-0.1.0.2 was excluded because json-0.5 requires syb >=0.3.3
Posted Sat Sep 22 03:36:59 2012
exporting annexed files
Posted Wed Sep 19 16:46:44 2012
Wishlist: automatic reinject
Posted Wed Sep 19 16:46:44 2012
Wishlist: getting the disk used by a subtree of files
Posted Wed Sep 19 16:46:44 2012
Wishlist: logging to file when running as a daemon (for the assistant)
Posted Wed Sep 19 16:46:44 2012
autobuilders for git-annex to aid development
Posted Wed Sep 19 16:46:44 2012
migration to git-annex and rsync
Posted Tue Jul 17 17:54:57 2012
post-copy/sync hook
Posted Tue Jul 17 17:54:57 2012
wishlist: do round robin downloading of data
Posted Tue Jul 17 17:54:57 2012
migrate existing git repository to git-annex
Posted Tue Jul 17 17:54:57 2012
How to expire old versions of files that have been edited?
Posted Tue Jul 17 17:54:57 2012
Error while adding a file "createSymbolicLink: already exists"
Posted Tue Jul 17 17:54:57 2012
wishlist: define remotes that must have all files
Posted Tue Jul 17 17:54:57 2012
Wishlist: Is it possible to "unlock" files without copying the file data?
Posted Tue Jul 17 17:54:57 2012
seems to build fine on haskell platform 2011
Posted Tue Jul 17 17:54:57 2012
Can I store normal files in the git-annex git repository?
Posted Tue Jul 17 17:54:57 2012
Auto archiving
Posted Tue Jul 17 17:54:57 2012
version 3 upgrade
Posted Tue Jul 17 17:54:57 2012
fsck gives false positives
Posted Tue Jul 17 17:54:57 2012
working without git-annex commits
Posted Tue Jul 17 17:54:57 2012
wishlist: traffic accounting for git-annex
Posted Tue Jul 17 17:54:57 2012
tips: special_remotes/hook with tahoe-lafs
Posted Tue Jul 17 17:54:57 2012
wishlist: git-annex replicate
Posted Tue Jul 17 17:54:57 2012
wishlist: push to cia.vc from the website's repo, not your personal one
Posted Tue Jul 17 17:54:57 2012
"permission denied" in fsck on shared repo
Posted Tue Jul 17 17:54:57 2012
Git Annex Transfer Protocols
Posted Tue Jul 17 17:54:57 2012
OSX's haskell-platform statically links things
Posted Tue Jul 17 17:54:57 2012
vlc and git-annex
Posted Tue Jul 17 17:54:57 2012
example of massively disconnected operation
Posted Tue Jul 17 17:54:57 2012
Debugging Git Annex
Posted Tue Jul 17 17:54:57 2012
Please fix compatibility with ghc 7.0
Posted Tue Jul 17 17:54:57 2012
wishlist: git annex status
Posted Tue Jul 17 17:54:57 2012
pure git-annex only workflow
Posted Tue Jul 17 17:54:57 2012
Moving older version's file content without doing checkout
Posted Tue Jul 17 17:54:57 2012
Is an automagic upgrade of the object directory safe?
Posted Tue Jul 17 17:54:57 2012
What can be done in case of conflict
Posted Tue Jul 17 17:54:57 2012
new microfeatures
Posted Tue Jul 17 17:54:57 2012
retrieving previous versions
Posted Tue Jul 17 17:54:57 2012
Wishlist: Ways of selecting files based on meta-information
Posted Tue Jul 17 17:54:57 2012
Need new build instructions for Debian stable
Posted Tue Jul 17 17:54:57 2012
unannex alternatives
Posted Tue Jul 17 17:54:57 2012
hashing objects directories
Posted Tue Jul 17 17:54:57 2012
windows port?
Posted Tue Jul 17 17:54:57 2012
Handling web special remote when content changes?
Posted Tue Jul 17 17:54:57 2012
git-annex communication channels
Posted Tue Jul 17 17:54:57 2012
cloud services to support
Posted Tue Jul 17 17:54:57 2012
Behaviour of fsck
Posted Tue Jul 17 17:54:57 2012
advantages of SHA* over WORM
Posted Tue Jul 17 17:54:57 2012
error in installation of base-4.5.0.0
Posted Tue Jul 17 17:54:57 2012
wishlist: git backend for git-annex
Posted Tue Jul 17 17:54:57 2012
performance improvement: git on ssd, annex on spindle disk
Posted Tue Jul 17 17:54:57 2012
Will git annex work on a FAT32 formatted key?
Posted Tue Jul 17 17:54:57 2012
Recommended number of repositories
Posted Tue Jul 17 17:54:57 2012
Windows support
Posted Tue Jul 17 17:54:57 2012
wishlist: command options changes
Posted Tue Jul 17 17:54:57 2012
Sharing annex with local clones
Posted Tue Jul 17 17:54:57 2012
getting git annex to do a force copy to a remote
Posted Tue Jul 17 17:54:57 2012
How to handle the git-annex branch?
Posted Tue Jul 17 17:54:57 2012
A really stupid question
Posted Tue Jul 17 17:54:57 2012
incompatible versions?
Posted Tue Jul 17 17:54:57 2012
using git annex to merge and synchronize 2 directories (like unison)
Posted Tue Jul 17 17:54:57 2012
brainstorming: git annex push & pull
Posted Tue Jul 17 17:54:57 2012
--print0 option as in "find"
Posted Tue Jul 17 17:54:57 2012
can git-annex replace ddm?
Posted Tue Jul 17 17:54:57 2012
confusion with remotes, map
Posted Tue Jul 17 17:54:57 2012
What happened to the walkthrough?
Posted Tue Jul 17 17:54:57 2012
location tracking cleanup
Posted Tue Jul 17 17:54:57 2012
unlock/lock always gets me
Posted Tue Jul 17 17:54:57 2012
wishlist: git annex put -- same as get, but for defaults
Posted Tue Jul 17 17:54:57 2012
wishlist: simpler gpg usage
Posted Tue Jul 17 17:54:57 2012
Preserving file access rights in directory tree below objects/
Posted Tue Jul 17 17:54:57 2012
git-annex on OSX
Posted Tue Jul 17 17:54:57 2012
syncing non-git trees with git-annex
Posted Tue Jul 17 17:54:57 2012
batch check on remote when using copy
Posted Tue Jul 17 17:54:57 2012
sparse git checkouts with annex
Posted Tue Jul 17 17:54:57 2012
nfs mounted repo results in errors on drop/move
Posted Tue Jul 17 17:54:57 2012
relying on git for numcopies
Posted Tue Jul 17 17:54:57 2012
git-subtree support?
Posted Tue Jul 17 17:54:57 2012
git annex ls / metadata in git annex whereis
Posted Tue Jul 17 17:54:57 2012
wishlist:alias system
Posted Tue Jul 17 17:54:57 2012
OSX's default sshd behaviour has limited paths set
Posted Tue Jul 17 17:54:57 2012
git pull remote git-annex
Posted Tue Jul 17 17:54:57 2012
tell us how you're using git-annex
Posted Tue Jul 17 17:54:57 2012
Automatic commit messages for git annex sync
Posted Tue Jul 17 17:54:57 2012
rsync over ssh?
Posted Tue Jul 17 17:54:57 2012
Automatic `git annex get` after invalidation of local files due to external modification?
Posted Tue Jul 17 17:54:57 2012
fail to git annex add some files: getFileStatus: does not exist(v 3.20111231)
Posted Tue Jul 17 17:54:57 2012
Problem with bup: cannot lock refs
Posted Tue Jul 17 17:54:57 2012
wishlist: special remote for sftp or rsync
Posted Tue Jul 17 17:54:57 2012
"git annex lock" very slow for big repo
Posted Tue Jul 17 17:54:57 2012
git tag missing for 3.20111011
Posted Tue Jul 17 17:54:57 2012
Getting started with Amazon S3
Posted Tue Jul 17 17:54:57 2012
Making git-annex less necessary
Posted Tue Jul 17 17:54:57 2012
Problems with large numbers of files
Posted Tue Jul 17 17:54:57 2012
Podcast syncing use-case
Posted Tue Jul 17 17:54:57 2012
git annex add crash and subsequent recovery
Posted Tue Jul 17 17:54:57 2012
This is now about different build failure than the bug you reported, which was already fixed. Conflating the two is just confusing.
The error message about `syb` is because, by using cabal-install on an Ubuntu system from 2010, you're mixing the very old versions of some haskell libraries in Ubuntu with the new versions cabal wants to install. The solution is to stop mixing two package management systems: `apt-get remove ghc`, and manually install a current version of The Haskell Platform and use cabal.

Running `git checkout` by hand is fine, of course.

The underlying problem is that git has some O(N) scalability of operations on the index with regard to the number of files in the repo. So a repo with a whole lot of files will have a big index, and any operation that changes the index, like the `git reset` this needs to do, has to read in the entire index and write out a new, modified version. It seems that git could be much smarter about its index data structures here, but I confess I don't understand the index's data structures at all. I hope someone takes it on, as git's scalability to the number of files in the repo is becoming a new pain point, now that scalability to large files is "solved". ;)

Still, it is possible to speed this up at git-annex's level. Rather than doing a `git reset` followed by a `git checkout`, it can just `git checkout HEAD -- file`, and since that's one command, it can be fed into the queueing machinery in git-annex (which exists mostly to work around this git malfeasance), so only a single git command needs to be run to lock multiple files.

I've just implemented the above. In my music repo, this changed a lock of a CD's worth of files from taking ctrl-c long to 1.75 seconds. Enjoy!

(Hey, this even speeds up the one file case greatly, since `git reset -- file` is slooooow -- it seems to scan the entire repository tree. Yipes.)
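To make the batching concrete, here is a sketch of the difference (the file names are hypothetical):

    # before: two index-touching commands per file
    git reset -- foo.mp3
    git checkout foo.mp3
    # after: one queued command locks many files at once
    git checkout HEAD -- foo.mp3 bar.mp3 baz.mp3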
@joey

OK, I'll try increasing the stack size and see if that helps.
For reference, I was running `git annex add .` on a directory containing about 100k files spread over many nested subdirectories. I actually have more than a dozen projects like this that I plan to keep in git annex, possibly in separate repositories if necessary. I could probably tar the data and then archive that, but I like the idea of being able to see the structure of my data even though the contents of the files are on a different machine.
After the crash, running `git annex unannex` does nothing and returns instantly. What exactly is 'git annex add' doing? I know that it's moving files into the key-value store and adding symlinks, but I don't know what else it does.
--Justin
Are the files identical or different? I did something like that today with similar, but not identical, directories containing media files, and git happily merged them. But there, the same files had the same content.

Also, make sure you use the same backend. In my case, one of the machines runs Debian stable, so I use the WORM backend, not the SHA backend.
Thanks for the tips so far. I guess a bare-only repo helps, but it is also something that I don’t need (for my use case), and only have to do because git works like this.
Also, if I have a mobile device that I want to push to, then I’d have to have two repositories on the device, as I might not be able to reach my main bare repository when traveling, but I cannot push to the „real“ repo on the mobile device from my computer. I guess I am spoiled by darcs, which will happily push to a checked out remote repository, updating the checkout if possible without conflict.
If I introduce a central bare repository to push to and from; I’d still have to have the other non-bare repos as remotes, so that git-annex will know about them and their files, right?
I’d appreciate a "git annex sync" that does what you described (commit all, pull, merge, push). Especially if it comes in a "git annex sync --all" variant that syncs all reachable repositories.
Right, I have thought about untrusting all but a few remotes to achieve something similar before and I'm sure it would kind of work. It would be more of an ugly workaround, however, because I would have to untrust remotes that are, in reality, at least semi-trusted. That's why an extra option/attribute for that kind of purpose/remote would be nice.
Obviously I didn't see the scalability problem though. Good Point. Maybe I can achieve the same thing by writing a log parsing script for myself?
Can't you just use an underscore instead of a colon?
Would it be feasible to split directories dynamically? I.e. start with SHA1_123456789abcdef0123456789abcdef012345678/SHA1_123456789abcdef0123456789abcdef012345678 and, at a certain cut-off point, switch to shorter directory names? This could even be done per subdirectory and based purely on a locally-configured number. Different annexes on different file systems or with different file subsets might even have different thresholds. This would ensure scale while not forcing you to segment from the start. Also, while segmenting with longer directory names means a flatter tree, segments longer than four characters might not make too much sense. Segmenting too often could lead to some directories becoming too populated, bringing us back to the dynamic segmentation.
All of the above would make merging annexes by hand a lot harder, but I don't know if this is a valid use case. And if all else fails, one could merge everything with the unsegmented directory names and start again from there.
-- RichiH
windows support has everything I know about making a windows port. This badly needs someone who understands Windows to dive into it. The question of how to create a symbolic link (or the relevant Windows equivalent) from haskell on Windows is a good starting point.
+1 for a generic user-configurable backend that a user can put shell commands in, with a disclaimer such that if a user hangs themselves with misconfiguration then it's their own fault :P

I would love to be able to quickly plug in an irods/sector set of put/get/delete/stat (get info) commands into git-annex to access my private clouds, which aren't s3 compatible.
I think that's because the SSH was successful (I entered the password and let it connect), so it got the UUID and put that in the .dot instead. The same UUID (for psychosis) then ended up in two different "subgraph" stanzas, and Graphviz just plotted them together as one node.
Maybe this will clarify:
On psychosis, run "git annex map" and press ^C at the ssh password prompt: map-nossh.dot
On psychosis, run "git annex map" and type the correct password: map-goodssh.dot
As I see it:
The git tweak-fetch hook that I have been developing, and hope will be accepted into git soon, provides some abilities that could be used to make "git pull remote" always merge remote/master. Normally, git can only be configured to do that merge automatically for one remote (ie, origin). But the tweak-fetch hook can flag arbitrary branches as needing merge.
So, it could always flag tracking branches of the currently checked out branch for merge. This would be enabled by some setting, probably, since it's not necessarily the case that everyone wants to auto-merge when they pull like this. (Which is why git doesn't do it by default after all.)
(The tweak-fetch hook will also entirely eliminate the need to run git annex merge manually, since it can always take care of merging the git-annex branch.)
This was already asked here, but I have a use case where I need to unlock with the files being hardlinked instead of copied (my fs does not support CoW), even though 'git annex lock' is now much faster ;-). The idea is that 1) I want the external world to see my repo "as if" it wasn't annexed (because of its own limitations in dealing with soft links), and 2) I know what I am doing, and am sure that files won't be written to, but only read.

My case is: the repo contains a snapshot A1 of a certain remote directory. Later I want to rsync this dir into a new snapshot A2. Of course, I want to transfer only new or changed files, with rsync's --copy-dest=A1 (or --compare-dest) options. Unfortunately, rsync won't recognize soft-links from git-annex, and will re-transfer everything.

Maybe I'm overusing git-annex ;-) but still, I find it is a legitimate use case, and even though there are workarounds (I don't even remember what I had to do), it would be much more straightforward to have 'git annex unlock --readonly' (or '--readonly-unsafe'?) ... or have rsync take soft-links into account, but I did not see the author ask for microfeature ideas :) (it was discussed, and only some convoluted workarounds were proposed). Thanks.
@justin, I discovered that "git annex describe" did what I wanted
@joey, yep that is the behaviour of "tahoe ls", thanks for the tip on removing the file from the remote.
It seems to be working okay for now, the only concern is that on the remote everything is dumped into the same directory, but I can live with that, since I want to track biggish blobs and not lots of small little files.
Hey Jimmy: how's this working for you now? I would expect it to go slower and slower since Tahoe-LAFS has an O(N) algorithm for reading or updating directories.
Of course, if it is still fast enough for your uses then that's okay. :-)
(We're working on optimizations of this for future releases of Tahoe-LAFS.)
I'd like to understand the desired behavior of store-hook and retrieve-hook better, in order to see if there is a more efficient way to use Tahoe-LAFS for this.
Off to look for docs.
Regards,
Zooko
If `tahoe ls` outputs only the key, on its own line, and exits nonzero if it's not present, then I think you did the right thing.

To remove a file, use `git annex move file --from tahoe` and then you can drop it locally.

Suppose you do that to repos A and B. Now, in A, you `git annex drop` a file that is only present in those repositories. A checks B to make sure it still has a copy of the file. It sees the (same) file there, so assumes it's safe to drop. The file is removed from A, also removing it from B, and losing data. It is possible to configure A and B to mutually distrust one another and avoid this problem, but there will be other problems too.

Instead, git-annex supports using `cp --reflink=auto`, which on filesystems supporting Copy On Write (eg, btrfs) avoids duplicating contents when A and B are on the same filesystem.

My last comment is a bit confused. The "git fetch" command allows getting all the information from a remote, and it is then possible to merge while being offline (without access to the remote). I would like a "git annex fetch remote" command to be able to get all annexed files from the remote, so that if I later merge with the remote, all annexed files are already here. And "git annex fetch" could (optionally) call "git fetch" before getting the files.

It seems also that in my last post I should have written "git annex get --from=remote" instead of "git annex copy --from=remote", because "annex copy --from" copies all files, even if the local repo already has them (is this the case? if yes, when is it useful?).
`git annex add` recover when run a second time.

This begs the question: what is the default remote? It's probably not the same repository that git's master branch is tracking (ie, origin/master). It seems there would have to be an annex.defaultremote setting.
BTW, mr can easily be configured on a per-repo basis so that "mr push" copies to somewhere:
    push = git push; git annex push wherever
First, you need a bare git repository that you can push to and pull from. This simplifies most git workflow.

Secondly, I use mr, with this in `.mrconfig`:
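(The config block itself did not survive on this page. A minimal sketch of the kind of `.mrconfig` entry meant here; the repository path is hypothetical:)

    [media/sound]
    # commit local changes, pull, merge git-annex metadata, push back
    update =
        git commit -a -m update || true
        git pull "$@"
        git annex merge
        git push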
Which makes "mr update", in repositories where I rarely care about git details, take care of syncing my changes.

I also make "mr update" do a "git annex get" of some files in some repositories that I want to always populate. git-annex and mr go well together. :)
Perhaps my annexupdate above should be available as "git annex sync"?
Going one step further, a --min-copy option could copy files around until numcopies is satisfied. --all could push to all available remotes.
To take everything another step further, if it was possible to group remotes, one could act on the groups. "all" would be an obvious choice for a group that always exists, everything else would be set up by the user.
Git-annex's commit hook does not prevent unannex being used. The file you unannex will not be checked into git anymore and will be a regular file again, not a git-annex symlink.
For example, here's a transcript:
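(The transcript itself was lost from this page; a hypothetical reconstruction of what it would show:)

    $ git annex unannex myfile
    unannex myfile ok
    $ ls -l myfile
    -rw-r--r-- 1 joey joey 104857600 Jul 17 12:00 myfile
    $ git status myfile
    # myfile is now a regular file, no longer a git-annex symlink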
My guess is that psychosis has not pulled the git-annex branch since bacon was set up (or that bacon's git-annex branch has not been pushed to origin). git-annex status only shows remotes present in git-annex:uuid.log. This may be a bug.
The duplicate links in the map I don't quite understand. I only see duplicate links in my maps when I have the same repository configured as two different git remotes (for example, because the same repository can be accessed two different ways). You don't seem to have that in your config.
Yes, there is value in layering something over git-annex to use a policy to choose what goes where.
I use mr to update and manage all my repositories, and since mr can be made to run arbitrary commands when doing eg, an update, I use its config file as such a policy layer. For example, my podcasts are pulled into my sound repository in a subdirectory; boxes that consume podcasts run "git pull; git annex get podcasts --exclude="/out/"; git annex drop podcasts/*/out". I move podcasts to "out" directories once done with them (I have yet to teach mpd to do that for me..), and the next time I run "mr update" to update everything, it pulls down new ones and removes old ones.
I don't see any obstacle to doing what you want. May be that you'd need better querying facilities in git-annex (so the policy layer can know what is available where), or finer control (--exclude is a good enough hammer for me, but maybe not for you).
`doc/walkthrough/`

That's awesome, I had not heard of git sparse checkouts before.
It does not make sense to tie the log files to the directory of the corresponding files, as then the logs would have to move when the files are moved, which would be a PITA and likely make merging log file changes very complex. Also, of course, multiple files in different locations can point at the same content, which has the same log file. And, to cap it off, git-annex can need to access the log file for a given key without having the slightest idea what file in the repository might point to it, and it would be very expensive to scan the whole repository to find out what that file is in order to lookup the filename of the log file.
The most likely change in git-annex that will make this better is in this todo item -- but it's unknown how to do it yet.
This is an entirely reasonable way to go about it.
However, doing it this way causes files in B to always "win" -- if the same filename is in both repositories with differing content, the version added in B will supersede the version from A. If A has a file that is not in B, a git commit -a in B will commit a deletion of that file.
I might do it your way and look at the changes in B before (or even after) committing them to see if files from A were deleted or changed.
Or, I might just instead keep B in a separate subdirectory in the repository, set up like so:
Or, a third way would be to commit A to a branch like branchA and B to a separate branchB, and not merge the branches at all.
I've committed the queue flush improvements, so it will buffer up to 10240 git actions, and then flush the queue.
There may be other memory leaks at scale (besides the two I mentioned earlier), but this seems promising. I'm well into running `git annex add` on a half million files; it's using 18 mb ram and has flushed the queue several times. This run will fail due to running out of inodes for the log files, not due to memory. :)

Right, --in goes by git-annex's location tracking information; actually checking if a remote still has the files would make --in too expensive in many cases.
So you need to give gpodder-on-usbdisk current information. You can do that by going to usb-ariaz and doing a `git annex fsck`. That will find the deleted files and update the location information. Then, back on gpodder-on-usbdisk, `git pull usb-ariaz`, and then you can proceed with the commands you showed.

While having remotes redistribute introduces some obvious security concerns, I might use it.
As remotes support a cost factor already, you can basically implement bandwidth through that.
`git annex sync` would be nice, although auto-commit does not suit every use case, so it would be better not to couple one to the other.

It's ok that `git pull` does not merge the git-annex branch. You can merge it with `git annex merge`, or it will be done automatically when you use other git-annex commands.

If you use `git pull` and `git push` without any options, the defaults will make git pull and push the git-annex branch automatically.

But if you're in the habit of doing `git push origin master`, that won't cause the git-annex branch to be pushed (use `git push origin git-annex` to manually push it then). Similarly, `git pull origin master` won't pull it. And also, the `remote.origin.fetch` setting in `.git/config` can be modified in ways that make `git pull` not automatically pull the git-annex branch. So those are the things to avoid after upgrade to v3, basically.
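For example, to be explicit about both branches (a minimal illustration of the advice above):

    git push origin master git-annex   # push both branches
    git fetch origin                   # fetch everything, including git-annex
    git annex merge                    # merge the fetched git-annex branch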
hmmmm - I'm still not sure I get this.

If I'm using a whole bunch of distributed annexes with no central repo, then I can not do a `git pull remote` without either specifying the branch to use or changing the default tracked remote via `git branch --set-upstream`. The former, like you note, doesn't pull the git-annex branch down; the latter only works one at a time.

The docs read to me as though I ought to be able to do a `git pull remote; git annex get .` using any one of my distributed annexes.

Am I doing something wrong? Or is the above correct?
`--from=...` or `--all`? (thus, among other things, one could determine if a remote has a complete checkout.)

With a lazy branch, I get "git-annex: no branch is checked out". Weird.. my best guess is that it's because this is running at the seek stage, which is unusual, and the value is not used until a later stage, and so perhaps the git command gets reaped by some cleanup code before its output is read.
(pipeRead is lazy because often it's used to read large quantities of data from git that are processed progressively.)
I did make it merge both branches, separately. It would be possible to do one single merge, but it's probably harder for the user to recover if there are conflicts in an octopus merge. The order of the merges does not seem to me to matter much, barring conflicts it will work either way. Dealing with conflicts during sync is probably a weakness of all this; after the first conflict the rest of the sync will continue failing.
I think:
For the second case, after the "spurious" SSH, it could still recognize that the repositories are the same by the duplicated annex uuid, which currently shows up in `map.dot` twice. I wonder what it would take to avoid the spurious SSH -- maybe some config that lists "alternate" URLs that should be considered the same as the current repository? Or actually list URLs in uuid.log? Fortunately, I think this only affects the map, so it's not a big problem.

I agree on the naming suggestions, and that it does not suit everybody. Maybe I’ll think some more about it. The point is: I’m trying to make life easy for those who do not want to manually create some complicated setup, so if it needs configuration, it is already off that track. But turning the current behavior into something people have to configure is also not well received by the users.
Given that "git annex sync" is a new command, maybe it is fine to have this as a default behavior, and offer an easy way out. The easy way out could be one of two flags that can be set for a repo (or a remote):
Maybe central is enough.
It is unfortunately not possible to do system-dependent hashing, so long as git-annex stores symlinks to the content in git.
It might be possible to start without hashing, and add hashing for new files after a cutoff point. It would add complexity.
I'm currently looking at a 2 character hash directory segment, based on an md5sum of the key, which splits it into 1024 buckets. git uses just 256 buckets for its object directory, but then its objects tend to get packed away. I sorta hope that one level is enough, but guess I could go to 2 levels (objects/ab/cd/key), which would provide 1048576 buckets, probably plenty, as if you are storing more than a million files, you are probably using a modern enough system to have a filesystem that doesn't need hashing.
No matter what you end up doing, I would appreciate a git-annex-announce@ list.
I really like the persistence of ikiwiki, but it's not ideal for quick communication. I would be fine with IRC and/or ML. The advantage of a ML over ikiwiki is that it doesn't seem to be as "wasteful" to mix normal chat with actual problem-solving. But maybe that's merely my own perception.
Speaking of RSS: I thought I had added a wishlist item to ikiwiki about providing per-subsite RSS feeds. For example there is no (obvious) way to subscribe to changes in http://git-annex.branchable.com/forum/git-annex_communication_channels/ .
FWIW, I resorted to tagging my local clone of git-annex to keep track of what I've read, already.
-- RichiH
Seems to have a scalability problem: what happens when such a repository becomes full?
Another way to accomplish I think the same thing is to pick the repositories that you would include in such a set, and make all other repositories untrusted. And set numcopies as desired. Then git-annex will never remove files from the set of non-untrusted repositories, and fsck will warn if a file is present on only an untrusted repository.
Well, it should only move files to `.git/annex/bad/` if their filesize is wrong, or their checksum is wrong.

You can try moving a file out of `.git/annex/bad/` and re-running fsck to see if it fails it again. (And if it does, paste in a log!)

To do that -- suppose you have a file `.git/annex/bad/SHA256-s33--5dc45521382f1c7974d9dbfcff1246370404b952` and you know that file `foobar` was supposed to have that content (you can check that `foobar` is a symlink to that SHA value). Then reinject it:

    git annex reinject .git/annex/bad/SHA256-s33--5dc45521382f1c7974d9dbfcff1246370404b952 foobar
Another nice thing would be a summary of what is wrong. I.e.
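(The example output was lost from this page; presumably something along these lines was meant:)

    fsck summary:
      3 files failed checksum (moved to .git/annex/bad/)
      1 file had wrong size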
And the same/similar for all other failure modes.
-- RichiH
Thanks for the quick reply :)
I wanted to look up the UUID of the current repo so that I can find out which repo is alive from the collection of repos with the same name. I could have looked for it in .git/config though, since it's pretty obvious. I just looked into the git-annex branch and didn't find it there. Thanks for the tip about using ".". By the way, could there be some kind of warning about using non-unique names for repos? That would make this scenario less likely. Or maybe that is a bad idea given the decentralized nature of git.
By the way, do the trust settings propagate to other repos? If I mark some UUID as untrusted on one computer does it become globally untrusted?
Thanks for the update, Joey. I think you forgot to change libghc-missingh-dev to libghc6-missingh-dev for the copy & paste instructions though.
Also, after having checked that I have everything installed I'm still getting this error:
`ANNEX_HASH_*` oversight. (It also affected removal, btw.)

For future reference, git can recover from a corrupted index file with `rm .git/index; git reset --mixed`. Of course, you lose any staged changes that were in the old index file, and may need to re-stage some files.
The .git-annex/ directory is what really needs hashing.
Consider that when git looks for changes in there, it has to scan every file in the directory. With hashing, it should be able to more quickly identify just the subdirectories that contained changed files, by the directory mtimes.
And the real kicker is that when committing there, git has to create a tree object containing every single file, even if only 1 file changed. That will be a lot of extra work; with hashed subdirs it will instead create just 2 or 3 small tree objects leading down to the changed file. (Probably these trees both pack down to similar size pack files, not sure.)
Ah - very good to know that recovery is easier than the method I used.
I wonder if it could be made a feature to automatically and safely recover/resume from an interrupted `git add`?

It would be clearer to call "git-annex-master" "synced/master" (or really "synced/$current_branch"). That does highlight that this method of syncing is not particularly specific to git-annex.
I think this would be annoying to those who do use a central bare repository, because of the unnecessary pushing and pulling to other repos, which could be expensive to do, especially if you have a lot of interconnected repos. So having a way to enable/disable it seems best.
Maybe you should work up a patch to Command/Sync.hs, since I know you know haskell :)
OMG, my first sizable haskell patch!
So trying this out..
In each repo I want to sync, I first `git branch synced/master`.

Then in each repo, I found I had to pull from each of its remotes, to get the tracking branches that `defaultSyncRemotes` looks for to know those remotes are syncable. This was the surprising thing for me; I had expected sync to somehow work out which remotes were syncable without my explicit pull. And it was not very obvious that sync was not doing its thing before I did that, since it still does a lot of "stuff".

Once set up properly, `git annex sync` fetches from each remote, merges, and then pushes to each remote that has a synced branch. Changes propagate around even when some links are one-directional. Cool!

So it works fine, but I think more needs to be done to make setting up syncing easier. Ideally, all a user would need to do is run "git annex sync" and it syncs from all remotes, without needing to manually set up the synced/master branch.
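A minimal sketch of the manual setup described above (the remote names are hypothetical):

    # in each repository to be synced:
    git branch synced/master
    # pull once from each remote so its tracking branches exist locally:
    git pull laptop
    git pull server
    # from then on, a single command fetches, merges, and pushes:
    git annex sync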
While this would lose the ability to control which remotes are synced, I think that being able to `git annex sync origin` and only sync from/to origin is sufficient for the centralized use case.

Code review:

Why did you make `branch` strict?

There is a bit of a bug in your use of Command.Merge.start. The git-annex branch merge code only runs once per git-annex run, and often this comes before sync fetches from the remotes, leading to a push conflict. I've fixed this in my "sync" branch, along with a few other minor things.

`mergeRemote` merges from `refs/remotes/foo/synced/master`. But that will only be up-to-date if `git annex sync` has recently been run there. Is there any reason it couldn't merge from `refs/remotes/foo/master`?

`git annex status` now includes a list of all known repositories.

Yes, trust settings propagate on git push/pull like any other git-annex information.
These are good examples; I think you've convinced me at least for upgrades going forward after v2. I'm not sure we have enough users and outdated git-annex installations to worry about it for v1.
(Hoping such upgrades are rare anyway.. Part of the point of changes made in v2 was to allow lots of changes to be made later w/o needing a v3.)
Update: Upgrades from v1 to v2 will no longer be handled automatically now.
What a good idea!
150 lines of haskell later, I have this:
Git can actually push into a non-bare repository, so long as the branch you change there is not a checked-out one. Pushing into `remotes/$foo/master` and `remotes/$foo/git-annex` would work; however, determining the value that the repository expects for `$foo` is something git cannot do on its own. And of course you'd still have to `git merge remotes/$foo/master` to get the changes.

Yes, you still keep the non-bare repos as remotes when adding a bare repository, so git-annex knows how to get to them.
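For example, a sketch of pushing from one repository into a checked-out one on a mobile device ("mobile" and "laptop" are hypothetical names):

    # push into tracking branches, never the checked-out branch:
    git push mobile master:refs/remotes/laptop/master
    git push mobile git-annex:refs/remotes/laptop/git-annex
    # then, on the mobile device, merge the changes in:
    git merge remotes/laptop/master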
I've made `git annex sync` run the simple script above. Perhaps it can later be improved to sync all repositories.
It just needs some of the output redirected to /dev/null.
(I updated this comment to fix a bug. --Joey)
It makes sense to have separate repositories when you have well-defined uses for them.
I have a separate repository just for music and podcasts, which I can put various places where I have no need of the overhead of a tree of other files.
If you're using it for whatever arbitrary large files you accumulate, I find it's useful to have them in one repository. This way I can rearrange things as makes sense. It might make sense to have "photos" and "isos" as categories today, but next year you might prefer to move those under 2011/{photos,isos}. It would certainly make sense to have different repositories for home, work, etc.
How to split repositories up for a home directory is a general problem that the vcs-home project has surely considered at one time or another.
`git annex sync` only syncs git metadata, not file contents, and metadata is not stored on S3, so it does nothing (much). `git annex move . --to s3` or `git annex copy . --to s3` is the right way to send the files to S3. I'm not sure why you say it's not working. I'd try it, but Amazon is not letting me sign up for S3 again right now. Can you show what goes wrong with copy?

I've made git-annex-shell run the git `hooks/annex-content` hook after content is received or dropped.

Note that the clients need to be running at least git-annex version 3.20120227, which runs git-annex-shell commit, which runs the hook.
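A hypothetical example of such a hook (only the hook's name and trigger come from the above; the body is an illustration):

    #!/bin/sh
    # .git/hooks/annex-content -- run by git-annex-shell after content
    # is received or dropped; this example just logs the event.
    echo "$(date): annexed content changed in $(pwd)" >> ~/annex-content.log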
thanks Joey,
is it possible to run some git annex command that tells me, for a specific directory, which files are available in another remote (and which remote, and which filenames)? I guess I could run that, do my own policy thingie, and run `git annex get` for the files I want.

For your podcast use case (and some of my use cases), don't you think git [annex] might actually be overkill? For example, for your podcasts use case, what value does git annex give over a simple rsync/rm script? Such a script wouldn't even need a data store to store its state, unlike git. It seems simpler and cleaner to me.
for the mpd thing, check http://alip.github.com/mpdcron/ (bad project name, it's a plugin based "event handler") you should be able to write a simple plugin for mpdcron that does what you want (or even interface with mpd yourself from perl/python/.. to use its idle mode to get events)
Dieter
The symlinks are in the git repository. So if the rsync damaged one, git would see the change. And nothing that happens to the symlinks can affect fsck.
git-annex does not use hard links at all.
fsck corrects mangled file permissions. It is possible to screw up the permissions so badly that it cannot see the files at all (ie, chmod 000 on a file under .git/annex/objects), but then fsck will complain and give up, not move the files to bad. So I don't see how a botched rsync could result in fsck moving a file with correct content to bad.
I'm not currently planning to support sharedRepository perms on special remotes. I suppose I could be convinced otherwise, it's perhaps doable for the ones you mention (rsync might be tricky). (bup special remote already supports it of course.)
thanks for the use case!
All that git annex fsck does is checksum the file and move it away if the checksum fails.
If bad data was somehow read from the disk that one time, what you describe could occur. I cannot think of any other way it could happen.
Thanks for the fast response!
Unfortunately, I had another problem:
    Building git-annex-3.20120419...
    Utility/libdiskfree.c: In function ‘diskfree’:
    Utility/libdiskfree.c:61:0: warning: ‘statfs64’ is deprecated (declared at /usr/include/sys/mount.h:379)
    [  6 of 157] Compiling Build.SysConfig  ( Build/SysConfig.hs, dist/build/git-annex/git-annex-tmp/Build/SysConfig.o )
    [ 15 of 157] Compiling Utility.Touch    ( dist/build/git-annex/git-annex-tmp/Utility/Touch.hs, dist/build/git-annex/git-annex-tmp/Utility/Touch.o )
    Utility/Touch.hsc:118:21: Not in scope: `noop'
    cabal: Error: some packages failed to install:
    git-annex-3.20120419 failed during the building phase. The exception was:
    ExitFailure 1
I also tried to look for information on the internet, and I did not find anything useful. Any idea of what happened?
Thanks again!
Before dropping unused items, sometimes I want to check the content of the files manually. But currently, from e.g. a sha1 key, I don't know how to find the corresponding file, except with `find .git/annex/objects -type f -name 'SHA1-s1678--70....'`, which is too slow (I'm in the case where `git log --stat -S'KEY'` won't work, either because it is too slow or it was never committed). By the way, is it documented somewhere how to determine the 2 (nested) sub-directories in which a given (by name) object is located?

So I would like 'git-annex unused' to be able to give me the list of paths to the unused items. Also, I would really appreciate a command like 'git annex unused --log NUMBER [NUMBER2...]' which would run the suggested command `git log --stat -S'KEY'` for me, where NUMBER is from the 'git annex unused' output. Thanks.
You say you started the repo with "git init --shared" .. but what that's really meant for is bare repositories, which can have several users pushing into it, not a non-bare repository.
The strange mode on the directories "dr-x--S---" and files "-r--r-----" must be due to your umask setting, though. My umask is 022, and the directories and files under `.git/annex/objects` are "drwxr-xr-x" and "-r--r--r--", which allows anyone to read them unless an upper directory blocks it -- and with this umask, none do unless I explicitly remove permissions from one to lock down a repository.

About mpd, the obvious fix is to run mpd not as a system user but as yourself. I put "@reboot mpd" in my crontab to do this.
You get a regular git merge conflict, which can be resolved in any of the regular ways, except that conflicting files are just symlinks.
Example:
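(The example itself was lost from this page; a hypothetical transcript of the kind of resolution meant:)

    $ git merge otherrepo/master
    CONFLICT (content): Merge conflict in bigfile
    $ git checkout --theirs bigfile   # pick one version's symlink
    $ git add bigfile                 # or: git rm bigfile
    $ git commit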
This message comes from ghc's runtime memory manager. Apparently your ghc defaults to limiting the stack to 80 mb. Mine seems to limit it slightly higher -- I have seen haskell programs successfully grow as large as 350 mb, although generally not intentionally. :)
Here's how to adjust the limit at runtime, obviously you'd want a larger number:
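(The example was lost from this page. Passing GHC runtime options to git-annex would look something like this; -K sets the maximum stack size, and the figure is arbitrary:)

    git annex +RTS -K200m -RTS add .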
I've tried to avoid git-annex using quantities of memory that scale with the number of files in the repo, and I think in general successfully -- I run it on 32 mb and 128 mb machines, FWIW. There are some tricky cases, and haskell makes it easy to accidentally write code that uses much more memory than would be expected.
One well known case is `git annex unused`, which has to build a structure of every annexed file. I have been considering using a bloom filter or something to avoid that.

Another possible case is when running a command like `git annex add` and passing it a lot of files/directories. Some code tries to preserve the order of your input after passing it through `git ls-files` (which destroys ordering), and to do so it needs to buffer both the input and the result in ram.

It's possible to build git-annex with memory profiling and generate some quite helpful profiling data. Edit the Makefile and add this to GHCFLAGS:
    -prof -auto-all -caf-all -fforce-recomp

then when running git-annex, add the parameters `+RTS -p -RTS`, and look for the git-annex.prof file.

Personally, I deal with this problem by having a directory, or directories, where I put files that I want to have on my partial checkout laptop, and run `git annex get`
in that directory.It's not a perfect solution, but I don't know that a perfect solution exists.
Nice! So if I understand correctly, 'git reset -- file' was there to discard staged (but not committed) changes made to 'file' before checking out, so that it is equivalent to directly 'git checkout HEAD -- file'? I'm curious about the "queueing machinery in git-annex": does it end up calling the one git command with multiple files as arguments? Does it correspond to the message "(Recording state in git...)"? Thanks!
The web special remote will happily download files when you `git annex get`, even if they don't have the same content that they did before. `git annex fsck` will detect such mismatched content to the best ability of the backend (so checking the SHA1, or verifying that the file size at least matches if you use WORM), and complain and move such mismatched content aside.

`git annex addurl --fast` deserves a special mention; it uses a backend that only records the URL, and so if it's used, fsck cannot later detect such changes. Which might be what you want..

For most users, this is one of the reasons `git annex untrust web` is a recommended configuration. Once you untrust the web, any content you download from the web will be kept around in one of your own git-annex repositories, rather than the untrustworthy web holding the only copy.

git's code base makes lots of assumptions hardcoding the size of the hash, etc. (grep its source for magic numbers 40 and 42...). I'd like to see git get parameterised hashes; SHA1 insecurity may eventually push it in that direction. However, when I asked the git developers about this at the GitTogether last year, there were several ideas floated that would avoid parameterisation, and a lot of good thoughts about problems parameterised hashes would cause.
Moving data into git proper would still leave the problems unique to large data of not being able to store it all on every clone. Which means a git-annex like thing is needed to track where the data resides and move it around.
(BTW, in markdown, you separate paragraphs with blank lines. Like in email.)
I have no experience using git-subtree, but as long as the home repository has the work one as a git remote, it will automatically merge work's git-annex branch with its own git-annex branch, and so will know what files are present at work, and will be able to get them.
Probably you won't want to make work have home as a remote, so work's git-annex will not know which files home has, nor will it be able to copy files to home (but home will be able to copy files to work).
`git annex unused` does in fact do what I want. When I tried it, it just didn't show the obsolete versions of the files I edited because I hadn't yet synchronized all repositories, so that was why the obsolete versions were still considered used.

Another option that would please the naive user without hindering the more advanced user: "git annex init", by default, creates a synced/master branch. "git annex sync" will pull from every synced/master branch it finds, and also push to any synced/master branch it finds, but will not create any. So by default (at least for new users), this provides simple one-step syncing.
Advanced users can disable this per-repo by just deleting the synced/master branch. Presumably the logic will be: Every repo that should not be pushed to, because it has access to some central repo, should not have a synced/master branch. Every other repo, including the (or one of the few) central repos, will have the branch.
This is not the most expressive solution, as it does not allow configuring syncing between arbitrary pairs of repos, but it feels like a good compromise between that and simplicity and transparency.
I think it's about time that I provide less talk and more code. I’ll see when I find the time :-)
Not-so-subtle sarcasm taken and acknowledged :)
Arguably, git-annex should know about any local limits and not have them implemented via mr from the outside. I guess my concern boils down to having git-annex do the right thing all by itself with minimal user interaction. And while I really do appreciate the flexibility of chaining commands, I am a firm believer in exposing the common use cases as easily as possible.
And yes, I am fully aware that not all annexes are created equal. Case in point: I would never use git annex pull on my laptop, but I would git annex push extensively.
I've just tried to use the ANNEX_HASH_ variables, example of my configuration
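(The configuration itself was not preserved on this page; a hypothetical hook special-remote setup using those variables might look like:)

    # store annexed objects in hashed subdirectories on the remote
    git config annex.tahoe-store-hook \
      'tahoe put "$ANNEX_FILE" tahoe:"$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY"'
    git config annex.tahoe-retrieve-hook \
      'tahoe get tahoe:"$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY" "$ANNEX_FILE"'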
It seems to work quite well for me now. I did run across this when I tried to drop a file locally, leaving the file on my remote.

I do know that the files exist in my library, as I have just inserted them. It seemed to work when I didn't have the hashing; it appears that checkpresent doesn't pass the ANNEX_HASH_* variables (from the limited debugging I did).
Whups, the comment above got stuck in moderation queue for 27 days. I will try to check that more frequently.
In the meantime, I've implemented "git annex whereis" -- enjoy!
I find keeping my podcasts in the annex useful because it allows me to download individual episodes or podcasts easily when low bandwidth is available (ie, dialup), or over sneakernet. And generally keeps everything organised.
From what you say, it seems that vlc is following the symlink to the movie content, and then looking for subtitles next to the file the symlink points to. It would have to explicitly realpath the symlink to have this behavior, and this sounds like a misfeature.. perhaps you could point out to the vlc people the mistake in doing so?
There's a simple use-case where this behavior is obviously wrong, without involving git-annex. Suppose I have a movie, and one version of subtitles for it, in directory `foo`. I want to modify the subtitles, so I make a new directory `bar`, symlink the large movie file from `foo` to save space, and copy over and edit the subtitles from `foo`. Now I run vlc in `bar` to test my new subtitles. If it ignores the locally present subtitles and goes off looking for the ones in `foo`, I say this is broken behavior.

After some thought, perhaps the default fsck output should be at least machine readable and copy-and-pasteable, i.e.
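(The example formatting was lost here; presumably something like one affected key and file per line:)

    fsck failed: SHA256-s1048576--8a2c0f  podcasts/episode1.mp3
    fsck failed: SHA256-s2097152--f00dab  photos/2011/beach.jpg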
so I can then copy the list of borked files and then just paste it into a for loop in my shell to recover the files. it's just an idea.
Specifying the UUID was supposed to work, I think I broke it a while ago. Fixed now in git.
I'm not sure why you need to look up the UUID of the current repository. You can always refer to the current repository as ".". Anyway, the UUID of the current repository is in `.git/config`, or use `git config annex.uuid`.

Thanks! `git annex addurl --fast` does exactly what I want it to do.

Wow. Yet another special backend for me to play with. :-)
Probably more like 150 lines of haskell. Maybe just 50 lines if the bup repository is required to be on the same computer as the git-annex repository.
Since I do have some repositories where I'd appreciate this level of assurance that data not be lost, it's mostly a matter of me finding a free day.
Hmm, so it seems there is almost a way to do this already.
I think the one thing that isn't currently possible is to have 'plain' ssh remotes.. basically something just like the directory remote, but able to take a ssh user@host/path url. something like sshfs could be used to fake this, but for things like fsck you would want to do the sha1 calculations on the remote host.
I'll comment on each of the points separately, well aware that even a single little leftover issue can show that my plan is faulty.

All together, it seems to be a bit more complicated than I imagined, although not completely impossible. A combination of hidden files, and maybe a simpler reduction of the number of requests, might achieve the important goals as well, though.
Besides the cost values, annex.diskreserve was recently added. (But is not available for special remotes.)
I have held off on adding high-level management stuff like this to git-annex, as it's hard to make it generic enough to cover use cases.
A low-level way to accomplish this would be to have a way for `git annex get` and/or `copy` to skip files when `numcopies` is already satisfied. Then cron jobs could be used.

As joey points out, the problem is B overwrites A, so that any files in A that aren't in B will be removed. But the suggestion to keep B in a separate subdirectory in the repository means I'll end up with duplicates of files in both A and B. What I want is to have the merged superset of all files from both A and B, with only one copy of identical files.
The problem is that unique symlinks in A/master are deleted when B/master is merged in. To add back the deleted files after the merge you can do this:
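(The commands themselves were lost from this page. A hypothetical reconstruction: after the merge, restore anything it deleted, taking the files from the pre-merge commit:)

    git diff --name-only --diff-filter=D ORIG_HEAD HEAD |
    while read -r f; do
        git checkout ORIG_HEAD -- "$f"
    done
    git commit -m "add back files deleted by merge"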
Once the first merge has been done after set up, you can continue to make changes to A and B and future merges won't require accounting for deleted files in this way.
I'll give it a try as soon as I get rid of this:
    fatal: index file smaller than expected
    fatal: index file smaller than expected
    % git status
    fatal: index file smaller than expected
    %
And no, I am not sure where that is coming from all of a sudden... (it might have to do with a hard lockup of the whole system due to a faulty hdd I tested, but I didn't do anything to it for ages before that lock-up. So meh. Also, this is prolly off topic in here)
Richard
Took me a minute to see this is not about descriptive commit messages in the git-annex branch, but about the "git-annex automatic sync" message that is used when committing any changes currently on the master branch before doing the rest of the sync.
So.. It would be pretty easy to `ls-files` the relevant files before the commit and make a message. Although this would roughly double the commit time in a large tree, since that would walk the whole tree again (git commit -a already does it once). Smarter approaches could be faster.. perhaps it could find unstaged files, stage them, generate the message, and then `git commit` the staged changes.

But, would this really be useful? It's already easy to get `git log` to show a summary of the changes made in such a commit. So it's often seen as bad form to unnecessarily mention which files a commit changes in the commit message.
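For concreteness, the stage-then-describe approach might look like this (a sketch, not git-annex's actual code):

    git add -A
    msg="git-annex automatic sync: $(git diff --cached --name-only | wc -l) files changed"
    git commit -m "$msg"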
git annex get --auto .
going to import all those files from the work remote into my home if themaster
branch of that remote isn't merged?My goal for
git-annex merge
is that users should not need to know about it, so it should not be doing expensive pulls.I hope that
git annex sync
will grow some useful features to support fully distributed git usage, as being discussed in pure git-annex only workflow. I still use centralized git to avoid these problems myself.No extra remotes (that I'm aware of); that output was only edited to change hostnames.
On all three hosts, "git push origin" and "git pull origin" say everything is up to date.
I'm using git-annex 3.20111011 on all hosts (although some were running 3.20110928 when I created the repositories).
Regarding the multiple links, I've put a copy of the dot file here. It shows psychosis in three separate subgraphs, that are just getting rendered together as one, if that helps clarify anything.
Wait, I just realized you said "the git-annex branch". My origin only has "master". Do you mean the one specifically named "git-annex"? I thought that was something that gets managed automatically, or is it something I need to manually check out and deal with?
Any other info I could provide?
Indeed, see add a git backend, where you and I have already discussed this idea. :)
With the new support for special remotes, which will be used by S3, it would be possible to make such a git repo, using bup, be a special remote. I think it would be pretty easy to implement now. Not a priority for me though.
I got a good laugh out of it :-)
Storing the key unencrypted would make things easier.. I think at least for my use-cases I don't require another layer of protection on top of the ssh keys that provide access to the encrypted remotes themselves.
`git annex unannex` the ones you've already annexed.)

Yes, it can read id3-tags and guess titles from movie filenames, but it sometimes gets confused by the filename metadata provided by the WORM backend.
I think I have a good enough solution to this problem. It's not efficient when it comes to renames, but it handles adding and deletion just fine.
The -L flag looks at symbolic links and copies the actual data they are pointing to. Of course "source" must have all data locally for this to work.
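(The command itself did not survive on this page; based on the description, it would be along these lines, with hypothetical paths:)

    # -a preserves attributes; -L dereferences symlinks, copying the
    # annexed content they point to instead of the links themselves
    rsync -aL source/ user@host:dest/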
I think what is happening with "git annex unannex" is that "git annex add" crashes before it can "git add" the symlinks. unannex only looks at files that "git ls-files" shows, and so files that are not added to git are not seen. So, this can be recovered from by looking at git status and manually adding the symlinks to git, and then unannex.
That also suggests that "git annex add ." has done something before crashing. That's consistent with you passing it < 2 parameters; it's not just running out of memory trying to expand and preserve order of its parameters (like it might if you ran "git annex add experiment-1/ experiment-2/")
I'm pretty sure I know where the space leak is now. git-annex builds up a queue of git commands, so that it can run git a minimum number of times. Currently, this queue is only flushed at the end. I had been meaning to work on having it flush the queue periodically to avoid it growing without bounds, and I will prioritize doing that.
(The only other thing that "git annex add" does is record location log information.)
In my case, the remotes are the same, but adding a new option could make sense.
And while I can tell mr what to do explicitly, I would prefer if it did the right thing all by itself. Having to change configs in two separate places is less than ideal.
I am not sure what you mean by git annex push, as that does not exist. Did you mean copy?

I have made a new autosync branch, where all that the user needs to do is run git annex sync, and it automatically sets up the synced/master branch. I find this very easy to use; what do you think?

Note that autosync is also pretty smart about not running commands like "git merge" and "git push" when they would not do anything. So you may find git annex sync not showing all the steps you'd expect. The only step a sync always performs now is pulling from the remotes.

How remote is REMOTE? If it's a directory on the same computer, then git-annex copy --to is actually quickly checking that each file is present on the remote, and when it is, skipping copying it again.
If the remote is ssh, git-annex copy talks to the remote to see if it has the file. This makes copy --to slow, as Rich complained before. :)
So, copy --to does not trust location tracking information (unless --fast is specified), which means that it should be doing exactly what you want it to do in your situation -- transferring every file that is really not present in the destination repository already.
Neither does copy --from, by the way. It always checks if each file is present in the current repository's annex before trying to download it.
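To illustrate the difference (the remote name here is made up): presence is re-verified per file unless you pass --fast:

    git annex copy --to usbdrive .         # asks the remote about each file first
    git annex copy --fast --to usbdrive .  # trusts the location log and skips the checks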
You handle conflicts in annexed files the same as you would handle them in other binary files checked into git.
For example, you might choose to git rm or git add the file to resolve the conflict.

Previous discussion
Doh! Total brain melt on my part. Thanks for the additional info. I wasn't taking my time and reading things properly -- I kept assuming that the full remote pull failed due to the warning:
Rookie mistake indeed.
The rsync or directory special remotes would work if the media player uses metadata in the files, rather than directory locations.
Beyond that there is the smudge idea, which is hoped to be supported sometime.
Well, the modes you show are wrong. Nothing in the annex should be writable. fsck needs to fix those. (It's true that it also always chmods even correct mode files/directories.. I've made a change avoiding that.)
I have not thought about or tried shared git-annex repos with multiple unix users writing to them. (Using gitolite with git-annex would be an alternative.) Seems to me that removing content from the annex would also be a problem, since the directory will need to be chmodded to allow deleting the content from it, and that will fail if it's owned by someone else. Perhaps git-annex needs to honor core.sharedRepository and avoid these nice safeguards on file modes then.
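For reference, that is the standard git setting (shown here with the "group" value as an example):

    git config core.sharedRepository group   # keep repository files group-accessible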
Heh, cool, I was thinking of throwing about 28 million files at git-annex. Let me know how it goes; I suspect you have just run into a default-limits OSX problem.

You probably just need to raise some system limits (you will need to read the error messages that first appear) and then do something like

There are other system limits, which you can check by running "ulimit -a". Once you make the above changes, you will need to reboot for them to take effect. I am unsure if the above will help, as it is an example of what I did on 10.6.6 a few months ago to fix some forking issues. From the error you got, you will probably need to increase the stack size to something bigger, or even make it unlimited if you feel lucky; the default stack size on OSX is 8192, so try making it, say, 10 times that size first and see what happens.
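The commands from the original comment are not preserved above; as a sketch of just the stack-size part (using the numbers mentioned, not a recommendation):

    ulimit -a             # inspect the current limits
    ulimit -s 81920       # raise the stack size to 10 times the 8192 KB default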
Thanks, joey, but I still do not know why the file that has been (and is) OK according to separate sha1 and sha256 checks has been marked 'bad' by fsck and moved to .git/annex/bad. What could be a reason for that? Could rsync have caused it? I know too little about the internal workings of git-annex to answer this question.

But one thing I know for certain: the false positives should not happen unless something is wrong with the file. Otherwise, if it is unreliable and I have to check twice, it is useless. I might as well just keep checksums of all the files and do all checks by hand...
My experience is that modern filesystems are not going to have many issues with tens to hundreds of thousands of items in the directory. However, if a transition does happen for FAT support I will consider adding hashing. Although getting a good balanced hash in general without, say, checksumming the filename and taking part of the checksum, is difficult.
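For illustration only, here is one shape such a checksum-derived layout could take (a sketch, not git-annex's actual scheme):

    key="WORM-s1024-1234567890-somefile.mp3"       # a made-up annex key
    h=$(printf '%s' "$key" | md5sum | cut -c1-4)   # first four hex digits of the key's checksum
    echo "${h:0:2}/${h:2:2}/$key"                  # two balanced directory levels, e.g. ab/cd/...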
I prefer to keep all the metadata in the filename, as this eases recovery if the files end up in lost+found. So while "SHA/" is a nice workaround for the FAT colon problem, I'll be doing something else. (What I'm not sure yet.)
There is no point in creating unused hash directories on initialization. If anything, with a bad filesystem that just guarantees worst performance from the beginning..
Well, lock could check for modifications and require --force to lose them. But the check could be expensive for large files.
But git annex lock is just a convenient way to run git checkout. And running git checkout or git reset --hard will lose your uncommitted file the same way, obviously.

Perhaps the best fix would be to get rid of lock entirely, and let the user use the underlying git commands, the same as they would to drop modifications to other files. It would then also make sense to remove unlock, leaving only edit.

After some experimentation, this seems to work better:
Maybe this approach can be enhanced to skip stuff gracefully if there is no git-annex-master branch, and then be added to what "git annex sync" does; this way those who want to use the feature can do so by running "git branch git-annex-master" once. Or, if you like this and want to make it default, just make git-annex-init create the git-annex-master branch :-)
Git-annex has really helped me with my media files. I have a big NAS drive where I keep all my music, tv, and movies files, each in their own git annex. I tend to keep the media that I want to watch or listen to on my laptop and then drop it when it is done. This way I don't have too much on my laptop at any one time, but I have a nice selection for when I'm traveling and don't have access to my NAS.
Additionally, I have a mp3 player that will format itself randomly every few months or so. I keep my podcasts on it in a git annex and in a git annex on my laptop. When I am done with a podcast, I can delete it from the mp3 player and then sync that information with my laptop. With this method, I have a backup of what should be on my mp3 player, so I don't need to worry about losing it all when the mp3 player decides it's had enough.
Thank you,
I imagined it was something like that. I'm just sorry I posted that on the forum and not in the bugs section (I hadn't discovered it at that time). But now, if people search for this error, they should find this.
Note for Fedora users: unfortunately GHC 7.4 will not be shipped with Fedora 17 (which is still not released). The feature page mentions it for Fedora 18. I feel like I am using debian ... outdated packages the day of the release.
And many thanks for this wonderful piece of software.
Mildred
I don't know how to approach this yet, but I support the idea -- it would be great if there was a tool that could punch files out of git history and put them in the annex. (Of course with typical git history rewriting caveats.)
Sounds like it might be enough to add a switch to git-annex that overrides where it considers the top of the git repository to be?
To get to a specific version of a file, you need to have a tag or a branch that includes that version of the file. Check out the branch and git annex get $file.

(Of course, even without a tag or branch, old file versions are retained, unless dropped with unused/dropunused. So you could even git checkout $COMMITID.)

--to and --from seem to have different semantics than --source and --destination. Subtle, but still different.
That being said, I am not sure --from and --to are needed at all. Calling the local repo . and all remotes by their name, they are arguably redundant and removing them would make the syntax a lot prettier; mv and cp don't need them, either.
I am not sure changing syntax at this point is considered good style, though; personally, I wouldn't mind adapting and would actually prefer it over using --to and --from.
-v and -q would be nice.
Richard
core.sharedRepository.

You're taking a very long and strange way to a place that you can reach as follows:
Which is just as shown in getting file content.
In particular, "git pull remote" first fetches all branches from the remote, including the git-annex branch. When you say "git pull remote master", you're preventing it from fetching the git-annex branch. If for some reason you want the slightly longer way around, it is:
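The command listing did not survive in this copy; as a sketch, that slightly longer way presumably amounts to pulling master and then fetching the rest explicitly (remote name assumed to be "remote"):

    git pull remote master    # merges their master, but fetches only that branch
    git fetch remote          # also fetch everything else, including git-annex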
Or, equivalently but with fewer network connections:
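Again the original snippet is lost; presumably something like a single fetch followed by a local merge:

    git fetch remote          # one connection fetches all branches
    git merge remote/master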
BTW, notice that this is all bog-standard git branch pulling stuff, not specific to git-annex in the least. Consult your extensive and friendly git documentation for details. :)
And something else I've done: I symlinked the video/ directory from the media annex to the normal RAID annex.
And it's working out great.
I really like this. Perhaps it is a good idea to store all log files in every repo, but maybe there is a possibility to pack multiple log files into one single file, where not only the time, the present bit, and the annex repository are stored, but also the file key. I don't know if this format would also be merged correctly by the union merge driver.
Here's another handy command-line which annexes all files in repo B which have already been annexed in repo A:
The 'T' output by git status for these files indicates a type change: it's a symlink to the annex in repo A, but a normal file in repo B.
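The command-line itself did not survive here; a hypothetical reconstruction matching that description, using git's type-change filter:

    # in repo B: annex every file whose type changed (normal file here, symlink in the tree)
    git diff --name-only --diff-filter=T -z | xargs -0 git annex add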
Yes, I think that an --all option is the right approach for this. Seems unlikely you'd have some files' hashes handy without having them checked out, but operating on all content makes sense.
That page discusses some problems implementing it for some commands, but it should not pose a problem for move. It would also be possible to support get and copy, except --auto couldn't be used with --all. Even fsck could support it.

There is a ghc7.0 branch in git that is being maintained to work with that version.

The encryption uses a symmetric cipher that is stored in the git repository already. It's just stored encrypted to the various gpg keys that have been configured to use it. It would certainly be possible to store the symmetric cipher unencrypted in the git repo.
I don't see your idea of gpg-options saving any work. It would still require you to do key distribution and run commands in each repo to set it up.
The bug with newlines is now fixed.
Thought I'd mention how to clean up from interrupting git annex add. When you do that, it doesn't get a chance to git add the files it's added (this is normally done at the end, or sometimes at points in the middle when you're adding a lot of files). Which is also why fsck, whereis, and unannex wouldn't operate on them, since they only deal with files in git.

So the first step is to manually git add any symlinks. Then, git commit as usual. At that point, git annex unannex would get you back to your starting state.

Maybe, otoh, part of the point of git-annex is that the data may be too large to pull down all of it.
I find mr useful as a policy layer over top of git-annex, so "mr update" can pull down appropriate quantities of data from appropriate locations.
git-annex is just amazing. I just started using it and for once, I have hope to be able to organize my files a little better than now.
Currently, I have a huge homedir. From time to time, I move files away to external hard drives, then forget about them. When I want to look at them again, I just can't, because I have forgotten where they are. I also have a ton of files on those drives that I can't access because they are not indexed. With git-annex I have hope of putting all of these files in a git repository. I will be able to see them everywhere, and find them when I need to.
I might stop losing files for once.
I might avoid having multiple copies of the same things over and over again without knowing it, and regain some disk space.
For the moment, I'm archiving my photographs. But there is one thing that might not go very well: directory hierarchies where everything is important (file owner, specific permissions, symlinks). I won't just be able to blindly annex all of these files. But for the moment I'll stick to archiving documents, and it should be amazing.
Mildred
Sorry for not replying earlier, but my non-mailinglist-communications-workflows are suboptimal :-)
Right. But "git fetch" ought to be enough.
Personally, I’d just pull and push everywhere, but you pointed out that it ought to be manageable. The existence of the synced/master branch is the flag that indicates this, so you need to propagate this once. Note that if the branch were already created by "git annex init", then this would not be a problem.
It is not required to use "git fetch" once; you can also call "git annex sync" once with the remote explicitly mentioned, which would involve a fetch.
I’d leave this decision to you. But I see that you took the decision already, as your code now creates the synced/master branch when it does not exist (e290f4a8).
Because it did not work otherwise :-). It uses pipeRead, which is lazy, and for some reason git and/or your utility functions did not like that the output of the command was not consumed before the next git command was called. I did not investigate further. For better code, I'd suggest adding a function like pipeRead that completely reads the git output before returning, thus avoiding any issues with lazy IO.
Hmm, good question. It is probably safe to merge from both, and push only to synced/master. But which one first? synced/master can be ahead if the repo was synced to from somewhere else; master can be ahead if there are local changes. Maybe git merge should be called on all remote heads simultaneously, thus generating only one commit for the merge. I don't know how well that works in practice.
Thanks for including my code, Joachim
Thank you for your comment! Indeed, setting the umask to, for example, 022 has the desired effect that annex/objects etc. are executable (and in this special case also writable); my previous umask setting was 077. The "strange" permissions on the git directories were probably due to --shared=all, and the mode of "440" on the files within the git-annex tree is correct (the original file was 640 and stripped of its write permission).
Using this umask setting and newgrp to switch the default group, I was successfully able to set up the repositories.
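A sketch of that setup, with a made-up group name and repository path:

    umask 022        # newly created directories keep their execute bits
    newgrp family    # switch the default group before touching the repo
    git clone /srv/annex/photos
    cd photos && git annex init "shared laptop"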
However, I would like to suggest adding the execute bit to the directories below .git/annex/objects/ by default, even if the umask of the current shell differs. As the correct rights are already preserved in the actual files (minus their write permission) together with the correct owner and group, the files are still protected the same way as previously, and because +x does not allow directory listings, no additional information can leak out either. Not having to set the umask to something "sensible" before operating git-annex would be a huge plus, too :)
The reason why I am not running MPD as my user is that I am a bit wary of running an application even exposed to the local network as my main user, and I see nothing wrong with running it as its own user.
Thank you again for your help and the time you put into this project!
On the plus side, the past me wanted exactly what I had in mind.
On the meh side, I really forgot about this conversation :/
When you say this todo is not a priority, does that mean there's no ETA at all and that it will most likely sleep for a long time? Or the almost usual "what the heck, I will just wizard it up in two lines of haskell"?
-- RichiH
I see the following problems with this scheme:
Disallows removal of files when disconnected. It's currently safe to force that, as long as git-annex tells you enough other repos are believed to have the file, and as long as you only force on one machine (say your laptop). With your scheme, if you drop a file while disconnected, any other host could see that the counter is still at N (because your laptop had the file last time it was online), decide to drop the file, and lose the last version.
Pushing a changed counter commit to other repos is tricky, because they're not bare, and the network topology to get the commit pulled into the other repo could vary.
Merging counter files has issues. If the counter file doesn't automerge, two repos dropping the same file will conflict. But if it does automerge, that breaks the counter conflict detection.
Needing to revert commits is going to be annoying. An actual git revert could probably not reliably be done. It'd need to construct a revert and commit it as a new commit, and then try to push that to remotes -- and what if that push conflicts?
I do like the pre-removal dropping somewhat as an alternative to trust checking. I think that can be done with current git-annex though, just remove the files from the location log, but keep them in-annex. Dropping a file only looks at repos that the location log says have a file; so other repos can have retained a copy of a file secretly like this, and can safely remove it at any time. I'd need to look into this a bit more to be 100% sure it's safe, but have started hidden files.
I don't see any reduced round trips. It still has to contact N other repos on drop. Now, rather than checking that they have a file, it needs to push a change to them.
I thought about this some more, and I think I have a pretty decent solution that avoids a central bare repository. Instead of pushing to master (which git does not like) or trying to guess the remote branch name on the other side, there is a well-known branch name, say git-annex-master. Then a sync command would do something like this (untested):
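Since the snippet itself was not preserved, here is a hypothetical reconstruction that matches the description (remote name and merge handling are assumptions):

    git fetch "$remote"
    git merge "$remote/git-annex-master" || true   # merge their published state into our master
    git push "$remote" master:git-annex-master     # publish ours, without touching their master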
The nice things are: One can push to any remote repository, and thus avoid the issue of pushing to a portable device; the merging happens on the master branch, so if it fails to merge automatically, regular git foo can resolve it, and all changes eventually reach every repository.
What do you think?
Sorry for all the followups, but I see now that if you unannex, then add the file to git normally, and commit, the hook does misbehave.
This seems to be a bug. git-annex's hook thinks that you have used git annex unlock (or "git annex edit") on the file and are now committing a changed version, and the right thing to do there is to add the new content to the annex and update the symlink accordingly. I'll track this bug over at unannex vs unlock hook confusion.
So, committing after unannex, and before checking the file into git in the usual way, is a workaround. But only if you do a "git commit" to commit staged changes.
Anyway, this confusing point is fixed in git now!
You should be able to fix the missing label by editing .git-annex/uuid.log and adding
Thanks, that's great. Will there be a way to have sharedRepository work for shared remotes (rsync, directory) too, or is that better taken care of by ACLs?
@not thought of shared repos: we have our family photo archive spread over our laptops, backed up on our home storage server and on an rsync+encryption off-site server, with everyone naturally having their own accounts on all systems -- just in case you need a use case.
I'd recommend using the SHA backend for this, the WORM backend would produce conflicts if the files' modification times changed.
syncing non-git trees with git-annex describes one way to do it.
Let's see..
-v is already an alias for --verbose
I don't find --source and --destination as easy to type or as clear as --from or --to.
-F is fast, so it cannot be used for --force. And I have no desire to make it easy to mistype a short option and enable --force; it can lose data.
@richard while it would be possible to support some syntax like "git annex copy . remote", what is it supposed to do if there are local files named foo and bar, and remotes named foo and bar? Does "git annex copy foo bar" copy file foo to remote bar, or file bar from remote foo? I chose to use --from/--to to specify remotes independently of files to avoid such ambiguity, which plain old cp doesn't have, since it operates entirely on filesystem objects, not both filesystem objects and abstract remotes.

Seems like nothing to do here. done --Joey
My current workflow looks like this (I'm still experimenting; a rough sketch follows the list):
Create backup clone for migration
Inject git annex initialization at repository base
Start migration with tree filter
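The commands for these steps were not preserved; a hypothetical sketch of the same shape (paths and patterns are made up; see also the caveats discussed in the comments below):

    git clone --no-hardlinks myrepo myrepo-backup    # 1. backup clone for the migration
    cd myrepo
    git annex init "migration"                       # 2. git-annex initialization at the base
    # 3. tree filter that annexes the large files while rewriting history
    git filter-branch --tree-filter 'git annex add big/*.iso || true' HEAD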
There are still some drawbacks:
OK, thanks. I was just wondering -- since there are links in git(-annex), and hard links too -- whether the issue might have been caused by rsync.

I will keep my eye on that and run checks with my own checksum and fsck from time to time, and see what happens. I will post my results here, but the whole run (fsck or checksum) takes almost 2 days, so I will not do it too often... ;)

Thinking about this more, I think minimally git-annex could support a gpg-options setting (or similar) for options to be passed to gpg. I'm not sure how automatically setting it to $ANNEX_ROOT/.gnupg/.. would work.
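As a purely hypothetical illustration of that proposal (no such option exists in the git-annex discussed here):

    git config annex.gpg-options "--homedir $ANNEX_ROOT/.gnupg"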
I need to read the encryption code to fully understand it, but I also wonder if there is not also a way to just bypass gpg entirely and store the remote-encryption keys locally in plain text.
Remote as in "another physical machine". I assumed that copy --to with --force would not have trusted the contents of the current directory (or of the remote being copied to) and would just go off and re-download/upload all the files, overwriting what is already there. I expected that the combination of --force and copy --to would not bother to check whether the files are there or not, and would just copy regardless.
@joey thanks for the update in the previous comment, I had forgotten about updating it.
@zooko it's working okay for me right now, since I'm only putting fairly big blobs of stuff onto it, and only things that I really care about. On the performance side, if it ran faster then it would be nicer :)
git-annex needs ghc 7.4; that's why it depends on that base version that comes with it. So you either need to upgrade your ghc, or you can build from the ghc7.0 branch in git, like this:

We could include the information about the current directory as well, if the command is not issued in the local git root directory. To avoid large numbers of similar lines, that could look like this:
with the percentages being replaced with "complete" if all files are really present (and not just enough of them for the value to be rounded to 100%).
I got bitten by this too. It seems that the user is expected to fetch remote git-annex branches themselves, but this is not documented anywhere.
The man page says of "git annex merge":
I am not a git newbie, but even so I had incorrectly assumed that git annex merge would take care of pulling the git-annex branch from the remote prior to merging, thereby ensuring all versions of the git-annex branch would be merged, and that the location tracking data would be synced across all peer repositories.
My master branches do not track any specific upstream branch, because I am operating in a decentralized fashion. Therefore the error message caused by git pull $remote succeeded in encouraging me to instead use git pull $remote master, and this excludes the git-annex branch from the fetch. Even worse, a git newbie might realise this and be tempted to do git pull $remote git-annex.

Therefore I think it needs to be explicitly documented that a plain fetch of the remote (git fetch $remote) is required when the local branch doesn't track an upstream branch. Or maybe a --fetch option could be added to git annex merge to perform the fetch from all remotes before running the merge(s).

Additional filter criteria could come from the git history:
git annex get --touched-in HEAD~5.. to fetch what has recently been worked on

git annex get --touched-by chrysn --touched-in version-1.0..HEAD to fetch what I've been working on recently (based on a regexp or substring match on the author; git experts could probably craft much more meaningful expressions)

These options could also apply to git annex find -- actually, looking at the normal file system tools for such tasks, that might even be sufficient (think git annex find --numcopies-gt 3 --present-on lanserver1 --drop, like find -iname '*foo*' -delete).

(I was about to open a new forum discussion for commit-based getting, but this is close enough to be usefully joined into one discussion.)
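For what it's worth, the commit-based part is already roughly expressible with plain git (a sketch; the options proposed above don't exist):

    # get everything touched in the last 5 commits
    git diff --name-only -z HEAD~5..HEAD | xargs -0 git annex get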
I don't mind changing the behavior of git-annex sync, certainly..
Looking through git's documentation, I found some existing configuration that could be reused following your idea. There are remote.name.skipDefaultUpdate and remote.name.skipFetchAll, though both have to do with fetches, not pushes. Another approach might be to use git's remote group stuff.
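For example, with a hypothetical remote named usbdrive, the existing knob looks like this:

    git config remote.usbdrive.skipDefaultUpdate true   # exclude it from "git remote update"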
If you can't segment the names retroactively, it's better to start with segmenting, imo.
As subdirectories are cheap, going with ab/cd/rest or even ab/cd/ef/rest by default wouldn't hurt.
Your point about git not needing to create as many tree objects is a kicker indeed. If I were you, I would default to segmentation.
Ah HA! Looks like I found the cause of this.
Spot the file name with a newline character in it! This causes the error message above. It seems that the files preceding this badly named file are symlinked but not registered.
Perhaps a bug?
@Jimmy mentioned anonymous git push -- that is now enabled for this wiki. Enjoy!
I may try to spend more time on #vcs-home -- or I can be summoned there from my other lurking places on irc, I guess.
And following on to my transcript, you can then add the file to git in the regular git way, and it works fine:
It should be sufficient to honor the GIT_DIR/GIT_WORK_TREE/GIT_INDEX_FILE environment variables. git filter-branch sets GIT_WORK_TREE to ., but this can be mitigated by starting the filter script with 'GIT_WORK_TREE=$(pwd $GIT_WORK_TREE)'. E.g. with GIT_DIR=/home/tyger/repo/.git and GIT_WORK_TREE=/home/tyger/repo/.git-rewrite/t, git annex should be able to compute the correct relative path, or maybe use absolute paths in symlinks.
Another problem I observed is that git annex add automatically commits the symlink; this behaviour doesn't work well with the tree filter. git annex commits the wrong path (.git-rewrite/t/LINK instead of LINK). Also, the tree filter doesn't expect the filter script to commit anything; new files in the temporary work tree will be committed by filter-branch on each iteration of the filter script (missing files will be removed).
This bug was fixed in git-annex 3.20120230. You have a few options to get the fix: git checkout ghc7.0 -- that branch will build with your old ghc and has the fix.

Use du -L for the disk space used locally. The other number is not currently available, but it would be nice to have. I also sometimes would like to have data on which backends are used how much, so making this git annex status --subdir is tempting. Unfortunately, its current implementation scans .git/annex/objects and not the disk tree (better for accurate numbers due to copies), so it would not be a very easy thing to add. Not massively hard, but not something I can pound out before I start work today..

Sure, you can simply:
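The command that followed was lost here; presumably it dereferences the annex symlink while copying the file out (file name is made up):

    cp -L myphoto.jpg /tmp/    # -L copies the real content the symlink points at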
Or just attach the file right from the git repository to an email, like any other file. Should work fine.
If you wanted to copy a whole directory to export, you'd need to use the -L flag to make cp follow the symlinks and copy the real contents:
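For instance (directory names are illustrative):

    cp -rL annexed-photos/ /media/usb/export/   # -L dereferences the annex symlinks while copying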
Joey, thanks for your quick help! I'll try the manual haskell-platform install once I have quicker internet again, i.e. tomorrow.
And sorry for the mess-up; I split the post into two. Hope it's clearer now.