The Living Thing / Notebooks : Databases, data storage, data interchange

tl;dr

I do a lot of data processing; most data stores are too inconvenient for me. But here are the most convenient ones for my needs.

Or maybe I can pipeline my python analysis using blaze?

OK, full notes now:

With a focus on slightly specialised data stores for use in my statistical jiggery-pokery. Which is to say: I care about analysing lots of data fast. This is probably inimical to running, e.g., your webapp from the same store, which has different requirements (massively concurrent writes, consistency guarantees, many small queries instead of a few large ones). Don’t ask me about that.

I prefer to avoid running a database server at all if I can, at least in the sense of a highly specialised multi-client server process. Those are not optimised for a typical scientific workflow. First stop is in-process, non-concurrent-write data storage, e.g. HDF5 or sqlite.
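For instance, a minimal sketch of that in-process style with h5py (the file and dataset names here are invented):

    import numpy as np
    import h5py

    # Write a largish array to a local HDF5 file; no server process involved.
    arr = np.random.standard_normal(1_000_000)
    with h5py.File("experiment.h5", "w") as f:
        f.create_dataset("samples", data=arr, compression="gzip")

    # Read back only a slice; h5py pulls just the requested chunk into memory.
    with h5py.File("experiment.h5", "r") as f:
        head = f["samples"][:10]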

However, if you want to mediate between lots of threads/processes/machines updating your data in parallel, a “real” database server can be justified.

OTOH if your data is big enough, perhaps you need a crazy giant distributed store of some kind? Requirements change vastly depending on your scale.

Filesystem stores

Unless your data is very very big, this is what you want.

Time series/Event crunching/Streaming

Databases at the intersection of storing data and processing streams, usually for time series.

Document stores

Want to handle floppy, ill-defined documents with ill-specified, possibly changing metadata? Already resigned to querying and processing this stuff being depressingly slow and/or storage-greedy?

You’re looking for document stores!

If you are looking at document stores as your primary workhorse, as opposed to something you want to get data out of for other storage, then you have either

  1. not very much data, so performance is no problem, or
  2. a problem.

Let’s assume number 1, which is common.

Relational databases

Long lists of numbers? Spreadsheet-like tables? Wish to do queries mostly of the sort supported by database engines, such as grouping, sorting and range queries? Use sqlite if it fits in memory. (No need to click on that link though; sqlite is already embedded in your tool of choice.) TODO: how to write safely to sqlite from multiple processes through write locks; one candidate approach is sketched below.
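A plausible sketch for that TODO, assuming sqlite’s WAL journal mode plus a busy timeout are enough for your write rates (file and table names invented, not battle-tested):

    import sqlite3

    # The timeout makes a writer wait (up to 30 s) for another process's
    # write lock instead of immediately raising "database is locked".
    conn = sqlite3.connect("results.db", timeout=30.0)

    # WAL mode lets readers keep reading while a single writer holds the lock.
    conn.execute("PRAGMA journal_mode=WAL")

    conn.execute("CREATE TABLE IF NOT EXISTS results (run_id TEXT, loss REAL)")
    with conn:  # the connection context manager wraps this in a transaction
        conn.execute("INSERT INTO results VALUES (?, ?)", ("run-001", 0.123))
    conn.close()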

If not, or if you need to handle concurrent writing by multiple processes, use MySQL or Postgres. Not because they are best for this job, but because they are common. Honestly, though, unless this is a live production service for many users, you should probably be using a simple file-based store instead.

Maybe you can make this easier using Blaze.

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems.
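A rough sketch of what that looks like; Blaze’s API has shifted between versions, and the table and column names here are invented:

    from blaze import data, by

    # Point Blaze at a table inside a sqlite file; nothing is loaded eagerly.
    d = data("sqlite:///measurements.db::measurements")

    # A Pandas-ish expression that Blaze compiles down to SQL behind the scenes.
    summary = by(d.station, mean_temp=d.temperature.mean())
    print(summary)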

Clickhouse, for example, is a columnar database: each column is stored contiguously, so the big scans and aggregations typical of analytics touch far less data than in a row-oriented database. I guess you could try that?
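If you do, the clickhouse-driver package is one way in from Python (a sketch only; it assumes a server already running on localhost and a made-up readings table):

    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Aggregating over a couple of columns is where columnar storage shines:
    # only the touched columns get read from disk.
    rows = client.execute(
        "SELECT station, avg(temperature) FROM readings GROUP BY station"
    )
    print(rows)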

crazy giant distributed stores

Ever since Google, every CS graduate wants to write one of these. There are dozens of options; you probably need none of them.

See parallel computing.

Caches

redis and memcached are the default generic choices here. redis is newer and more flexible. memcached is sometimes faster? Dunno. Perhaps see Why Redis beats Memcached for caching.
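For reference, the minimal cache-aside pattern with the redis-py client looks something like this (assumes a local redis server; the key and payload are made up):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def expensive_query(key):
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        result = {"answer": 42}                  # stand-in for the real work
        r.set(key, json.dumps(result), ex=300)   # expire after five minutes
        return result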

python caches

dogpile.cache:

dogpile.cache is a caching API which provides a generic interface to caching backends of any variety, and additionally provides API hooks which integrate these cache backends with the locking mechanism of dogpile.

It’s well done, but doesn’t integrate especially smoothly with futures and modern concurrency as seen in Tornado.
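For what it’s worth, the basic dogpile.cache pattern is roughly this (a sketch using the in-memory backend; swap in a redis or memcached backend via the configure string):

    from dogpile.cache import make_region

    region = make_region().configure(
        "dogpile.cache.memory",       # or e.g. "dogpile.cache.redis"
        expiration_time=600,
    )

    @region.cache_on_arguments()
    def load_dataset(name):
        # pretend this is slow I/O or a heavy query
        return {"name": name, "rows": 10_000}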

joblib cache looks convenient, but I can’t work out if it’s multi-write safe, or whether it’s supposed to be invoked only from some master process and thus not need locking.

Transparent and fast disk-caching of output value: a memoize or make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. Separate persistence and flow-execution logic from domain logic or algorithmic code by writing the operations as a set of steps with well-defined inputs and outputs: Python functions. Joblib can save their computation to disk and rerun it only if necessary.

It looks convenient, and is the easiest mmap-compatible solution I know, but it only supports function memoization, so if you want to access results some other way, or access partial results, it can get very convoluted unless you can naturally factor your code into function memoizations.
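The basic usage, for reference (a sketch; the cache directory and function are invented):

    import numpy as np
    from joblib import Memory

    memory = Memory("./joblib_cache", verbose=0)

    @memory.cache
    def simulate(seed, n):
        rng = np.random.default_rng(seed)
        return rng.standard_normal(n)

    x = simulate(0, 1_000_000)   # computed and written to disk
    y = simulate(0, 1_000_000)   # loaded back from the on-disk cache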

On that tip, klepto is focussed on scientific computation, but doesn’t look as active, nor does it seem to provide many compelling selling points yet.

cachetools extends the Python 3 lru_cache reference implementation with extra cache classes and decorators.
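E.g. a time-expiring cache, which the stdlib lru_cache doesn’t give you (a sketch; the function is a placeholder):

    from cachetools import TTLCache, cached

    # Like functools.lru_cache, but entries also expire after ten minutes.
    @cached(cache=TTLCache(maxsize=1024, ttl=600))
    def lookup(key):
        return {"key": key}   # stand-in for a slow lookup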

Graph stores

Graph-tuple-oriented processing…

graphengine:

GE is also a flexible computation engine powered by declarative message passing. GE is for you, if you are building a system that needs to perform fine-grained user-specified server-side computation.

From the perspective of graph computation, GE is not a graph system specifically optimized for a certain graph operation. Instead, with its built-in data and computation modeling capability, we can develop graph computation modules with ease. In other words, GE can easily morph into a system supporting a specific graph computation.

GUIs

sqlectron, sqlitebrowser, superset