MoonShadow/DistributedFileSystem


Primary goal: to reliably store small amounts of data (e.g. someone's home directory)

Implemented as a Linux FS that mounts transparently. The FS divides the available storage space into blocks, which are replicated to other people running the FS over an always-on internet connection, in exchange for keeping local copies of their data blocks. The FS tries to keep the number of available replicas on different hosts between a preset minimum and maximum. If the local copy of a block becomes unavailable or damaged for some reason, it can be reconstructed by querying the replicas.




Implementation braindump

Each block is identified by (IP address of the source at replication time, replication time, the FS's logical block number, MD5 hash). Blocks are encrypted at source, so it doesn't matter if people ask for and get other people's blocks. The block directory must itself be stored within the FS so it can be replicated. The directory is simply a list of records containing, for each replica made for the last few modifications, the replica's ID and the corresponding server's IP address; and, for each replica, the ID of the remote host's block that is being held locally in return and its location on disk.
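As a purely illustrative sketch of those structures, in Python - the field names, and the use of sha256 rather than MD5 (see the ToDo note at the bottom of the page), are assumptions rather than part of the design:

import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockId:
    source_ip: str        # IP address of the source at replication time
    replicated_at: float  # replication time, e.g. a Unix timestamp
    logical_block: int    # the FS's logical block number
    digest: bytes         # secure hash of the encrypted block contents

@dataclass
class ReplicaRecord:
    block_id: BlockId         # which replica this record describes
    server_ip: str            # the server holding that replica
    held_in_return: BlockId   # ID of the remote host's block held locally in exchange
    local_path: str           # where that reciprocal block lives on disk

def make_block_id(source_ip: str, replicated_at: float,
                  logical_block: int, encrypted_block: bytes) -> BlockId:
    digest = hashlib.sha256(encrypted_block).digest()
    return BlockId(source_ip, replicated_at, logical_block, digest)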

Replication contracts are enforced by periodically requesting each remote host to send an MD5 of a random subset of the data they are holding for you. If, for several such queries over a small period of time, the MD5 does not match the local data, or if a response is not available for some period of time, the contract is considered broken and the space used to hold local replicas for that host may be reclaimed as desired.
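A hedged sketch of such a contract check - the challenge format, the failure threshold and the function names are all assumptions:

import hashlib
import random

def make_challenge(block_len: int, sample_size: int = 4096):
    """Pick a random byte range of the stored block for the remote host to hash."""
    start = random.randrange(0, max(1, block_len - sample_size + 1))
    return start, sample_size

def expected_digest(local_block: bytes, challenge) -> bytes:
    start, length = challenge
    return hashlib.sha256(local_block[start:start + length]).digest()

class Contract:
    """Tracks whether a remote host is honouring its side of a replication contract."""
    def __init__(self, fail_limit: int = 3):
        self.failures = 0
        self.fail_limit = fail_limit

    def record_check(self, remote_digest, local_block: bytes, challenge) -> bool:
        """Return True while the contract is still considered intact."""
        if remote_digest is None or remote_digest != expected_digest(local_block, challenge):
            self.failures += 1   # no response, or the hash doesn't match our data
        else:
            self.failures = 0    # a good answer resets the count
        # several bad checks in a row => broken; space held for that host is reclaimable
        return self.failures < self.fail_limit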

The FS regularly checks the state of the replicas, updating those for which contracts have been broken and attempting to create contracts for blocks which have had the fewest replicas for the longest time.
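For example (a sketch only; the directory layout and field names are assumptions), a maintenance pass might prioritise the blocks that have had the fewest live replicas for the longest time:

import time

def maintenance_pass(directory, min_replicas, try_create_contract):
    """directory: {block_key: [replica dicts with 'contract_broken'/'broken_at' fields]}
    try_create_contract(block_key) -> new replica record, or None if no host found."""
    candidates = []
    for block_key, replicas in directory.items():
        live = [r for r in replicas if not r.get("contract_broken")]
        if len(live) < min_replicas:
            # how long this block has been below target (now, if no contract ever broke)
            below_since = min((r["broken_at"] for r in replicas
                               if r.get("contract_broken")), default=time.time())
            candidates.append((len(live), below_since, block_key))
    # fewest replicas first; among equals, the block that has been below target longest
    for _, _, block_key in sorted(candidates, key=lambda c: (c[0], c[1])):
        new_replica = try_create_contract(block_key)
        if new_replica is not None:
            directory[block_key].append(new_replica)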

A list of hosts running the FS and thus available as replication servers is maintained locally. New hosts are discovered by asking existing hosts to send a random host from their list. Initial hosts may be hardwired or supplied by the user in order to bootstrap the process. A score is kept, and hosts that are regularly unavailable or regularly break contracts are blacklisted (and so ultimately can't get their data replicated 'cos no-one wants to accept a contract from them).
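A sketch of the host list with scoring and blacklisting - the score values and the blacklist threshold are arbitrary assumptions:

import random

class HostList:
    def __init__(self, bootstrap_hosts, blacklist_below=-5):
        # bootstrap_hosts: hardwired or user-supplied, to bootstrap the process
        self.scores = {h: 0 for h in bootstrap_hosts}
        self.blacklist_below = blacklist_below

    def usable_hosts(self):
        return [h for h, s in self.scores.items() if s > self.blacklist_below]

    def random_known_host(self):
        """What we answer when another host asks us for a random host from our list."""
        usable = self.usable_hosts()
        return random.choice(usable) if usable else None

    def learn(self, host):
        self.scores.setdefault(host, 0)

    def record_good_check(self, host):
        self.scores[host] = self.scores.get(host, 0) + 1

    def record_unavailable_or_broken(self, host):
        self.scores[host] = self.scores.get(host, 0) - 2

    def discover(self, ask_for_random_host):
        """ask_for_random_host(host) -> a host from that host's list, or None."""
        for known in self.usable_hosts():
            candidate = ask_for_random_host(known)
            if candidate:
                self.learn(candidate)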

If one needs to retrieve a block, one simply queries its replica servers (locations pulled from its directory record), from newest to oldest, and reads the blocks returned until one gets back a block whose MD5 hash (part of the ID) matches the data contained. The whole thing hangs on keeping safe copies of two things. First, the encryption keys - the same problem as with any other keys; they don't change as often as the data encrypted under them, so they are easier to back up in some other way *handwave*. Second, the ID of the latest version of a root directory block, needed to bootstrap the whole thing after a massive local failure. Perhaps each remote replica could contain the version of this at the time it was made; after a big failure one could go through the server discovery process asking "what do you have stored for this IP?" until one hits a recent one. As old directory blocks are found, one recovers more and more info about the hosts one was actually in contact with before the failure, accelerating the recovery process. -
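The read path, as a sketch - fetch_block stands in for whatever network call is used, and the record fields follow the illustrative structures above:

import hashlib

def retrieve_block(replica_records, fetch_block):
    """replica_records: ReplicaRecord-like objects for one logical block, any order.
    fetch_block(server_ip, block_id) -> encrypted block bytes, or None on failure."""
    newest_first = sorted(replica_records,
                          key=lambda r: r.block_id.replicated_at, reverse=True)
    for record in newest_first:
        data = fetch_block(record.server_ip, record.block_id)
        # accept the first block whose hash (part of its ID) matches the data returned
        if data is not None and hashlib.sha256(data).digest() == record.block_id.digest:
            return data
    return None   # no replica could be verified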

As for the initial hosts: enter one or more manually, discovered in some out-of-band way; unless someone can think of something better. - MoonShadow




Since contracts require a 1:1 exchange of data, actual space used locally per block is the block size times the number of replicas.

PeterTaylor wonders what you do if someone writes a bunch of blocks for which you can't find another host willing to set up a contract.
Then you're stuffed, basically. You keep trying different hosts, asking existing known hosts for random subsets of their known-hosts lists, until you can acquire a contract. If everyone behaves well and there's enough people using the system, you will get some eventually since other servers will be looking for people to establish contracts with too. Hmm.. perhaps replicas stored under some existing contracts could be swapped over, so that while some replica counts might drop below target all blocks have more than one replica? - MoonShadow

Recovery must be initiated promptly after a failure, otherwise the remote hosts will start assuming you've broken your contracts with them and will begin to delete your replicas. OTOH, responding with a "what do you have stored for me?" to hosts contacting you with contract checks will speed recovery after a failure.




Anyone see a way to abuse this? For that matter, anyone else actually think this sort of thing could be useful? - MoonShadow

Trying to understand this.  In non-techy terms, is the general idea to use your 500 Meg to keep 100 Meg of your data, replicated 5 times to defend against error, but have those 5 copies of your 100 Meg spread over a number of hosts?  You still use 500 Meg locally because it's a 1:1 exchange rate for storing replicas, but your 100 Meg is safe if your whole computer bursts into flames, or you and one friend's machines both die simultaneously.  Right?  If that's the case, then I can see it potentially being useful, although quite expensive in storage space:data stored ratio terms, more so than a standard 2:1 RAID array.  --AlexChurchill
That's the general idea, although the sizes quoted are perhaps an order of magnitude too large :) The storage space:data stored ratio is adjustable depending on how paranoid you are (just as it is with RAID, for that matter); unlike RAID, however, storage is distributed geographically, making it harder to lose your data if your office burns down, say, or all your equipment gets stolen.  - MoonShadow



PeterTaylor observes that a write to the FS could potentially be quite slow because the block being written to may be half-way through a slow transfer to another machine. He presumes, though, that MoonShadow has already considered this.

Why not just abort the transfer of the old version? Generally, replication writes happen in the background, and remote reads only happen if the local copy is damaged. When all is well, FS access is fast but there is a delay before stuff gets backed up; replicated copies of old versions of the data are an occasionally helpful emergent property, not a requirement.  - MoonShadow
Abort how? Drop the connection, and you've got the overhead of setting up a new one. Write junk data so the MD5 hash doesn't fit and the server rejects the block? Probably best not to give it a dodgy block, in case you end up trying to recover using that block.
It all depends on how big we're thinking of making the blocks, really. If they're a few tens of kilobytes, they'll be buffered in memory during the send and we can finish it even though the copy on disk has changed. If they're a few megabytes, the overhead of setting up a new connection is probably smaller than that of finishing the transfer. Besides, we're going to have to set up a new contract *anyway* because we really want to replicate the new version and don't particularly care about the old one any more; and the quicker we abort the existing send, the quicker we can get the new version replicated. - MoonShadow
I dread to think what various filesystem implementations would think of trying to decode a mixture of old and new blocks. --Admiral
Hashes used to determine which block is the correct one for each location are pulled from the block directory during recovery. The block directory is itself updated every write (or perhaps every few writes). Therefore after recovery it should be the case that only the last few blocks written (the last few dozen, if we're updating the directory every few writes) are inconsistent. I'm not certain I quite see why the problem is significantly worse than the one of pulling the plug on a machine (possibly with hard drives containing quite large caches) in the middle of a write-to-disk - which modern journalling filesystems appear to cope adequately with.. - MoonShadow
At the very least, some sort of block versioning system should be used, so that you can recover with the latest blocks available. One good way to do this would be to flush out to the replicas a new MD5 for a block, then flush out the new block itself, then lastly a record saying that the old MD5 for the block is no longer valid. It makes the block directory a little more ugly, with (effectively) a journal. You don't want to throw away all the old blocks before you have flushed out the fact the new block is valid, and you don't want to throw away the old block MD5 until you are guaranteed that the new blocks are safe. Otherwise, you could end up not being able to retrieve either. --Admiral
There's a consistency/performance tradeoff here. I suspect it might be best to make this a per-installation option - ranging from full ACID commit mode to "gather a bunch of writes before updating block directory" mode, depending on requirements and on how well the overlying filesystem can cope with inconsistency. - MoonShadow
It'd kind of suck to be unable to make use of valid older blocks out there just because you were half-way through writing a new version of the block. It's more just a principle of "don't throw away your old backups until you have a new backup." --Admiral
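A sketch of the ordering Admiral describes - publish the new hash, then the new block, and only then retire the old hash, so that at every point at least one verifiable version can be recovered. The flush_* callbacks stand in for whatever replication/journalling mechanism is used and are assumptions:

import hashlib

def journalled_update(block_no, new_data, current_hashes,
                      flush_journal_entry, flush_block):
    """current_hashes: {block_no: digest of the version currently considered valid}."""
    new_digest = hashlib.sha256(new_data).digest()
    old_digest = current_hashes.get(block_no)

    # 1. announce the new hash for the block (the old version is still valid)
    flush_journal_entry(("new-hash", block_no, new_digest))
    # 2. push the new block itself out to the replicas
    flush_block(block_no, new_data)
    # 3. only once the new block is safe, declare the old hash invalid
    if old_digest is not None:
        flush_journal_entry(("retire-hash", block_no, old_digest))
    current_hashes[block_no] = new_digest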



DouglasReay thinks it would be more elegant if this was split into parts
* PART I - a file system that knows it should interact with a data-backer-upper system.  Basically something you can tell to monitor the directories listed in /etc/backerupper (which perhaps maps each directory path to the number of duplicates of that dir tree you want; a hypothetical format is sketched after this list)
* PART II - a backerupper that keeps track of contracts, splitting, encrypting and digesting the data. the interface between PART I and PART III
* PART III - an existing peer to peer file sharing application like kazaa or bittorrent or freenet.
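A hypothetical format for the /etc/backerupper file mentioned under PART I, with a trivial parser - both the format and the names are assumptions rather than anything specified above:

EXAMPLE = """\
# directory-to-monitor      desired-number-of-duplicates
/home/alice                 3
/etc                        5
"""

def parse_backerupper(text):
    """Return {directory_path: number_of_duplicates}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        path, count = line.split()
        table[path] = int(count)
    return table

print(parse_backerupper(EXAMPLE))   # {'/home/alice': 3, '/etc': 5}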


You could then write an alternative PART II (one that used ftp to dump complete images, for example).

Splitting the filesystem away from the distributed backup mechanism seems like a good idea. One could achieve this under Linux by, for instance, making the backup mechanism a virtual block device - effectively, it pretends to be a partition, you mount it in /etc/backups or whatever and use whatever existing filesystem you like, and when data is written to it it takes care of backing it up; this seems very sensible. However, can you really split "part 2" and "part 3" like that and still be able to keep track of where data is stored, whose contacts can be trusted etc? Most existing p2p applications I can think of focus on getting you the data quickly, hiding the information on where it comes from in the process; you mention bittorrent in particular, which seems utterly useless for the purpose.. - MoonShadow
I think something could be done.  At minimum, I think when developing it I'd develop it against two transport mechanisms (e.g. email to a Gmail account and a Frost Freenet client) at the same time, to avoid building in unnecessary conceptual limitations. --DR

ToDo: this page is out of date. Read "secure hash" for "MD5".
