
Synchronising and backing up securely

Dropbox for dropbox haters

Usefulness: 🔧 🔧 🔧
Novelty: 💡 💡
Uncertainty: 🤪
Incompleteness: 🚧 🚧

Related: backing up data.

Purely cloud-based network drives just aren’t as good for fast or distributed work. Realising this is why the Dropbox founders are now rich: they invented a thing which keeps the files locally and syncs them with your co-workers online. Well done them.

However, Dropbox’s solution, as groundbreaking as it was, is also unsatisfactory, being hamstrung by technical and legal shortcomings. Same goes for the Google and Microsoft options. How can I get something like Dropbox without all the security holes and creepy behaviour?

Peer-to-peer synchronising (i.e. no special server at all) is one robust option, and I do a lot of this. I then have no 3rd party helping me, which is a plus and a minus. Plus: I can feel safer. Minus: I do not synchronise if my computers are not simultaneously online.

Taking it further, how about making everything a sneakernet?

Or I can use some hacks to make Dropbox less awful, by e.g. encrypting my files to inhibit some Dropbox data mining, and by using alternative clients that are less suspect and so on.

Various options follow.

Syncthing

Choose this if… You have a collection of various folders that you need shared to various different machines, and you would like many of the different machines to be able to edit them. You don’t need a server and thus you are happy for syncing to happen if and when the peers are online. And you don’t care about iOS. e.g. I use this for synchronising my music production files across my studio machines, studio backup machines and gig machines.

Syncthing has an elegant decentralised peer-to-peer design. It is mostly simple and friendly to use, although I spent too long reading the manual and being intimidated before diving in and discovering that.

Granularity is per-folder-per-machine – each shared folder (and all its subfolders) is a separate share. It doesn’t support iOS. Nor does it support archiving stuff to USB keys or semi-offline stores, or multiple copies of the same folder on one machine.

Stated design criteria:

Editorial note: Note that, if any of your machines is compromised, your attacker still gets the data on that machine. It’s not magic. It’s simply that you don’t have to worry about a whole other copy of your data being on a machine owned by some faceless third party who may or may not have acceptable confidentiality practices.

You probably want to add the following patterns to your .stignore file if you don’t want it to synchronise a little too aggressively.

// From Windows
$RECYCLE.BIN
System Volume Information
$WINDOWS.~BT
pagefile.sys
desktop.ini

// From OS X
Icon?
.Spotlight-V100
.Trashes
(?d).DS_Store
.fseventsd
(?d)._*


// From Linux
lost+found
.gvfs
.local/share/trash
.Trash-*
.fuse_hidden*

There is a CLI syncthing manager for your remote cloud instances, the snazzily named syncthingmanager. It has a macOS client.

Syncthing also has file versioning and such, but cryptographic signing of versions and guaranteeing consistent snapshots and so on is not a front-and-centre feature.

Its major gotcha is that syncing between case-insensitive and case-sensitive file systems is broken and can delete data. That’s right, this app works beautifully, smoothly and easily, except that the moment you use it to sync between Linux and macOS, or Linux and Windows (which differ in case sensitivity by default, although it’s a long story), stuff goes horribly wrong. You know, classic demonic pact, read-the-small-print situation.

This has in fact only been an issue for me when syncing iTunes, which does case-changes when you import new music.

A workaround for now is vigilance, plus “trash can file versioning”, which allows you to recover the missing files with the following command:

cd .stversions
find . -type f -exec mv \{\} ../\{\} \;

A lesser problem is that it uses a non-trivial amount of disk space in a central cache folder (i.e. outside where the backups are being made). This is something like 1% of the size of the data being backed up, which if you are syncing 6TB really adds up.

Seafile

Choose this if… You have a collection of various folders that you need to synchronise between macOS, Windows, Linux, Android and iOS. You don’t mind installing or paying for a central server to coordinate all this.

Seafile is an open-source file sync service with a premium enterprise server available for a small fee. It has clients for browsers and desktop and mobile. Semi-host-blind encryption is available but optional, and doesn’t work from browsers or encrypt metadata. FWIW it looks simpler than Nextcloud to install manually, although in either case it’s easiest to run a pre-made docker image if you trust the creator. You would use this if you wanted a convenient iOS client, which is otherwise tricky to get. I would happily pay for a hosted version of this if there were one available, but there isn’t.

Also, NB, this is one of the protocols that rclone doesn’t talk.

dat

Choose this if… You have a large data set that you wish to share amongst many strangers, and there is a single source of truth. I would use this for sharing finished research or journalistic data sets if I had any, but I wouldn’t use it for collectively updating data across my compute cluster, because it doesn’t support multiple writers.

Dat is similar to syncthing, with a different emphasis – sharing data with strangers rather than friends, with a special focus on datasets. You could also use it for backups or other sharing. See scientific data sharing.

NB it’s one-writer-many-readers, much like bittorrent, so don’t get excited about multiple data sources, or inter-lab collaboration. For this price, though, you get data versioning and robust verifiability, which is a bargain; completely alien to my current workflow though.
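To make the one-writer-many-readers model concrete, here is roughly what publishing and fetching look like with the (now-legacy) dat CLI. The folder names are made up, and you should check the current state of the tooling before leaning on it.

# publisher's machine: share a finished dataset
npm install -g dat
cd my-results/
dat share                 # builds the archive and prints a dat:// key to hand out

# a reader's machine: fetch a read-only copy using that key
dat clone dat://<the-64-character-key> my-results/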

Some hacks exist for partial downloading. Otherwise, you can use Dat’s base layer, hyperdrive, directly from node.js. (However, no-one uses node.js for science at the moment, so if you find yourself working on this bit of plumbing, ask yourself if you are yak shaving, and whether your procrastination might be better spent going outside for some fresh air or something.)

People have built more collaborative tools using the Dat tools, such as beaker browser, which is a decentralised web browser.

OrbitDB

Choose this if… Not sure yet. For now see

Mega

Choose this if… You want to share files, chats, data and whatever else, with people who can’t or won’t install their own software and so must use a browser to download stuff, and if you don’t care that the company behind it is dicey. e.g. I used this for some temporary file sharing music collaboration projects, but now that they are over, I’ve deleted it.

Mega is easy to run. Public source, but not open source. (Long story.) A host-blind encryption business from New Zealand.

Anyway it’s relatively easy to use because it works in the browser, so it won’t terrify your non-geek friends. Ok, maybe a little. Much cheaper than Dropbox, as well as being probably less creepy. The UI is occasionally freaky but it’s reasonably functional, especially for its bargain-basement price. A… unique?… tradeoff of respectability, privacy and affordability.

Rclone

Choose this if… You want to infrequently clone some files somewhere, collaborate with people who use different sync solutions, or because you want a Swiss army knife fallback solution.

Rclone is a command line program to copy files and directories to and from various cloud storage services: Google Drive, Amazon S3, Memset Memstore, Dropbox, etc. For some of these services it is the only Linux client. It is not actually a sync client per se, in that it does not mirror the files to your drive, but accesses remote storage that happens to usually be used by sync clients.

I have this around because it lets me plug into pretty much anything, e.g. accessing my Microsoft OneDrive from the campus Linux cluster. Also because my colleagues don’t agree on which provider to use, and I don’t have enough money, time, or trust to run Dropbox, Google Drive, OneDrive and Mega.

Pro-tip: the encryption module turns any unencrypted file storage into an encrypted one for your personal use. Sharing the data from such a drive is tricky, but it is much safer. It is thus a convenient UI for encrypting things, even local files.
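As a sketch: assume you have already run the interactive rclone config to create a plain remote called gdrive, plus a crypt remote called gdrive-crypt that wraps the folder gdrive:encrypted (all of these names are made up here). Then:

# files are encrypted on your machine before they are uploaded
rclone sync ~/Private gdrive-crypt:
# through the crypt remote you see normal file names...
rclone ls gdrive-crypt:
# ...while the provider only ever stores scrambled names and contents
rclone ls gdrive:encrypted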

Pro-tip: It can also mount remote cloud storage on your local machine, which is handy, although not recommended for daily use as it’s ungainly and slow. You can set up caching, but oh my, this is getting complicated now, isn’t it?
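For the curious, a mount looks something like the following (the remote name is again made up, and you need FUSE installed):

mkdir -p ~/mnt/gdrive
rclone mount gdrive-crypt: ~/mnt/gdrive --vfs-cache-mode writes &
# work on the files as if they were local, then unmount:
fusermount -u ~/mnt/gdrive    # Linux; use plain umount on macOS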

Features:

Cons: manual synchronising is to be avoided because every extra thing to remember is another thing you will forget at the worst possible time, so you probably still want an actual sync client of some kind as well. Or, you could script it into your build tool, which is what I do.

You could read the manual, but everything seems to work great for me if I simply run

rclone config

See below for a concrete application of this.

Upspin

Choose this if… you prize elegance above comprehensibility, but intermittently descend from the heights of your research program in abstract category theory to quotidian normalcy to share files.

Upspin is hard to explain, specifically because I don’t understand it. It’s not really a sync service (I think) but it fills some of the same niches. Rclone on steroids, with a server process, something like that?

When did you last…

Upspin is an attempt to address problems like these, and many more.

Upspin is in its early days, but the plan is for you to manage all your data—even data you’ve stored in commercial web services—in a safe, secure, and sharable way that makes it easy to discover what you’ve got and who you’ve shared it with.

If you’d like to help us make that vision a reality, we’d love to have you try out Upspin.

Upspin is an open-source project that comprises two main design elements:

  1. a set of protocols enabling secure, federated sharing using a global naming system; and
  2. reference implementations of tools and services that demonstrate the capabilities.

Summary: It backloads encrypted, permissioned (?) storage onto arbitrary backends, including cloud providers.

Owncloud/Nextcloud

Choose this if… your campus runs a giant free Owncloud service so you may as well. Owncloud and Nextcloud are two forks of the same codebase. Nextcloud seems to be hipper. They both look very similar to me. Since I encountered Owncloud first I mention it here; the differences seem largely philosophical.

For clarity where I am referring to both I will describe the superposition of both as Nowtcloud.

Nowtcloud is dubiously secure; they have security advisories all the time. Also the server doesn’t store files encrypted, so you get an increased ease of sharing files (no extra password needed) but decreased confidentiality in case the server is compromised. Lawks! That’s barely better than Dropbox!

OTOH, it’s possible to run your own Nowtcloud server, e.g. using docker, so it’s useful for sharing something public such as open research etc. for only the cost of hosting, which is low. However, if you want to do this, Seafile seems to be better software for the file sharing use case and is no harder to set up, so why not try that?
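If you do insist on self-hosting, a minimal test instance with the official nextcloud docker image looks something like this (a sketch only; it uses the bundled SQLite database, fine for kicking the tyres but not for a real deployment):

docker run -d --name nextcloud \
  -p 8080:80 \
  -v nextcloud_data:/var/www/html \
  nextcloud
# then finish the setup wizard at http://localhost:8080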

The real reason would be that someone else has gifted you a server already set up. Australian academics get a generous serving of storage from AARNET, a terabyte I think, so we may as well.

However, there are various quirks to survive.

For one, native command-line usage is not obvious. How do you access your stored data files from your campus cluster?

First, you can access it as a WebDAV share, which is unwieldy but probably works. WebDAV is effectively making things available on a webserver, which sounds like it should be simple, but in practice you need a special client because there are lots of fiddly details, e.g. browsing the folders to find the file, handling authentication, etc. Boring.

Nowtcloud notionally has a command-line client, but the CLI documentation is hidden deeply, possibly because it’s not very good (the documentation or the CLI, take your pick). Tony Maro gives a walk-through. It’s sensitive to which version you run. Beware.

There are also version clashes between different versions of Nowtcloud, and occasional silent data loss when you sync a folder with some other service. I’m not a fan of this whole project.

I recommend that if you need to get data out of Nowtcloud from the command line you should extract it using rclone’s WebDAV mode and ignore Nowtcloud itself. It’s convenient that someone runs Nowtcloud code on some server somewhere, and I do in fact store a bunch of non-confidential data sets there. But the less of the Nowtcloud code I myself run, the better this has worked for me thus far.
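Concretely, that means defining a WebDAV remote in your rclone config (typically ~/.config/rclone/rclone.conf) pointing at the Nowtcloud server. A sketch, with placeholder URL and username; the exact WebDAV path varies between server versions:

[nowtcloud]
type = webdav
url = https://cloud.example.edu/remote.php/webdav/
vendor = nextcloud
user = myusername
pass = <obscured password written by rclone config>

Then, on the cluster, pulling a folder down is just e.g. rclone copy nowtcloud:Datasets ./Datasets.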

git-annex

Choose this if… you are a giant nerd with harrowing restrictions on your data transfer, and it’s worth your while to leverage this very sophisticated yet confusing bit of software to work around those challenges. E.g. you are integrating sneakernets and various online options. Which I am not.

git-annex supports explicit and customisable folder-tree synchronisation, merging and sneakernets, and as such I am well disposed toward it. You can choose to have things in various stores, and to copy files to and from servers or disks as they become available. It doesn’t support iOS. Windows support is experimental. Granularity is per-file. It has a weird symlink-based file access model which might be inconvenient for many uses. (I’m imagining this is trouble for Microsoft Word or whatever.)
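A hedged sketch of the basic flow, with made-up file and remote names (git-annex has many more modes than this):

git init ~/bigfiles && cd ~/bigfiles
git annex init "laptop"
git annex add huge-dataset.tar.gz     # content goes into the annex; a symlink is checked in
git commit -m "add dataset"
# assume a clone of this repository already lives on a USB drive
git remote add usbdrive /media/usbdrive/bigfiles
git annex sync                        # exchange metadata about which remote has which files
git annex copy huge-dataset.tar.gz --to usbdrive   # ship the actual bytes while the disk is attached
git annex get huge-dataset.tar.gz     # later, retrieve the content from whichever remote has it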

Also, do you want to invoke various disk-online-disk-offline-how-sync-when options from the command line, or do you want stuff to magically replicate itself across some machines without requiring you to remember the correct incantation on a regular basis?

The documentation is nerdy and unclear, but I think my needs are nerdy and unclear by modern standards. However, the combinatorial explosion of options and excessive hands-on-ness is a serious problem which I will not realistically get around to addressing due to my to-do list already being too long.

Sparkleshare

Sparkleshare is designed for file syncing for designers, by using git as a backend. Haven’t explored it yet.

Turtl

Faintly left-field, Turtl is a notes syncing app. It aims to be a competitor to Evernote, but without Evernote’s dubious privacy practices and faintly spammy attitude. No iOS support.

Promises host-proof encryption of “text, bookmark, password, image, and file/document”, but doesn’t synchronise arbitrary files like most of the other entrants here.

There is a supporting business offering hosting of their custom server in exchange for money, but it is open source. I will happily fork out money for this when they support iOS. Has nifty features such as sharing content, rendering mathematical markup and permissions for users. Doesn’t solve the big data sync problem, but does solve a useful subset of the syncing problem.

Joplin

Like Turtl, Joplin is about synchronised notes. However, it backloads encrypted synchronised notes into a pre-existing sync service, NextCloud or Dropbox or whatever. It even has an iOS client, and apparently full ability to import from Evernote.

Bonus trick: host-proof your sync with encryption

Convert your woefully insecure sync service into a somewhat less woeful service by encrypting the files on it. Options include Cryptomator and rclone, which encrypt everything inside a certain folder, such as your sync folder, in particular stopping your sync provider from reading it. Even a crappy spying provider such as Google, Microsoft or Dropbox can be made safer.

The drawbacks that immediately occur to me are

NB you could do this anyway by manually encrypting everything, but would you? No, because it’s slow and tedious. You want a nice GUI like this.

If you don’t mind whether the files are local or not, you could use rclone’s encryption mode, which talks directly to the remote file store and also encrypts the content. Rclone can do everything.

Dropbox

Choose this if… you don’t mind giving access to your data to dubious strangers with little regard for your security and some of your colleagues are totally hooked on it.

Bonus tip: using dropbox without the dropbox client.

Use rclone to get dropbox data on demand

The upshot of rclone is that I can pull changes from dropbox into my git repository thusly

rclone sync --exclude=".git/" --update dropbox:ProjectForGitHaters/ ./ProjectForGitHaters/

and push changes into dropbox from the git repo like so

rclone sync --exclude=".git/" --update ./ProjectForGitHaters/ dropbox:ProjectForGitHaters/

My colleagues need never know that I am using modern version control, change-tracking, merging, diffing and so on.

In practice, to exclude a lot of files at once, I recycle a standard exclude list from syncthing and replace --exclude=".git/" with --exclude-from=".stignore" in those commands. And to make sure I am not accidentally syncing git repo stuff I use the --dry-run option to verify that the expected files are getting copied/deleted/whatever.
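So the pull step, with the recycled ignore list and a safety check, ends up looking something like:

rclone sync --exclude-from=".stignore" --update --dry-run \
  dropbox:ProjectForGitHaters/ ./ProjectForGitHaters/
# drop --dry-run once the listed changes look sane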

Run dropbox proper on a spare computer and then sync using syncthing

This works pretty well. I run dropbox and syncthing on a spare computer I have lying around campus and synchronise the bits of dropbox I need automatically. One minus is that occasionally I get logged out of that machine when I am away, causing syncing to break.

Use a FUSE virtual mount

dbxfs or ff3d or rclone (above) allow you to mount the remote dropbox file system without installing Dropbox’s suspect client software. This seems slow and clunky; I think you would only do this if you needed to coordinate on some dropbox thing in realtime but mistrusted the client. For my offline collaboration style, rclone is better.

Run a sandbox

If I must use Dropbox, I could perhaps sandbox it so at least I don’t need to run their stupid software on a real machine that I actually use. Dropbox itself doesn’t seem to ship any of the sandbox systems natively (aside: why not? Does some part of their business model depend on intrusive access to everything you do?). I did try a containerised version, using docker. This is the only option AFAICT that can still sync the files to my local disk without intermediaries. However, in practice it was fragile and RAM-heavy, difficult to debug and overall not recommended. Possibly other sandboxes would be more appropriate, but meh.

Others

Syncing dotfiles

You might try mackup to sync settings for Linux and macOS machines alike to some folder somewhere. It’s essentially a database of which settings of various apps are actually syncable. On second thoughts, this is a fragile approach. And it freaks out if you have non-ASCII characters in your filenames. Do something different.

Revised recommendation:

Use a bare git repo:

git init --bare $HOME/.dotfiles
alias dotfiles='git --git-dir=$HOME/.dotfiles/ --work-tree=$HOME'
dotfiles config --local status.showUntrackedFiles no
echo "alias dotfiles='git --git-dir=$HOME/.dotfiles/ --work-tree=$HOME'" \
  >> $HOME/.bashrc

Yes, much less freaky.
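Day-to-day use then looks like ordinary git via the alias, e.g.

dotfiles status
dotfiles add .vimrc
dotfiles commit -m "track vimrc"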

Actually, do you know what is even easier? Just make a git repo in your home directory. No more overthinking. Re-revised recommendation:

git init $HOME
git config --local status.showUntrackedFiles no

Now! Go forth and steal other people’s dotfile tricks.

Encryption

There are tools to turn even your awful unencrypted untrustworthy system into an encrypted one. Cryptomator is one cuddly friendly option. So is the more austere rclone, as mentioned earlier. Both those options are free and simple.