Python caches

The fastest code is the code you don’t run

July 2, 2018 — November 30, 2023

computers are awful
python

I need to cache stuff in Python sometimes, and I would like it to be easy. For my (scientific computation) needs, I ideally want a cache that requires no server process, that nonetheless provides locking or otherwise safe access for multiprocess writes, that stores big binary blobs of data, and that has minimal installation requirements. If I can get all that at once, my life will be easy.


1 DiskCache

Current front-runner:

DiskCache: Disk Backed Cache

DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.

The cloud-based computing of 2023 puts a premium on memory. Gigabytes of empty space is left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis) which is used as a cache. Wouldn’t it be nice to leverage empty disk space for caching? […]

DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There’s no need for a C compiler or running another process. Performance is a feature and testing has 100% coverage with unit tests and hours of stress.

It promises thread-safety and multiprocess-safety. For scientific cluster computing, where persistent server processes are hard to arrange but caching is easy, this is typically what I want.
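
A minimal sketch of how this looks in use (the cache directory path is a placeholder; dict-style access and the memoize decorator are both part of the documented API):

```python
import diskcache

# One cache per directory; DiskCache documents this as safe for
# concurrent access from multiple threads and processes.
cache = diskcache.Cache("./cache_dir")  # hypothetical path

# Dict-style access; values are pickled, so large binary blobs work.
cache["answer"] = 42

# Or memoise a function; results persist across interpreter sessions.
@cache.memoize()
def slow_square(x):
    return x * x

print(cache["answer"], slow_square(7))
```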

2 LMDB

LMDB is not Python-specific, but it possibly does the job?

LMDB | symas

Symas LMDB is an extraordinarily fast, memory-efficient database we developed for the OpenLDAP Project. With memory-mapped files, LMDB has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.

Bottom line, with only 32KB of object code, LMDB may seem tiny. But it’s the right 32KB. Compact and efficient are two sides of a coin; that’s part of what makes LMDB so powerful.

Ordered-map interface: keys are always sorted; range lookups are supported
Fully transactional: full ACID semantics with MVCC
Reader/writer transactions: readers don’t block writers; writers don’t block readers
Fully serialized writers: writes are always deadlock-free
Extremely cheap read transactions: can be performed using no mallocs or any other blocking calls
Multi-thread and multi-process concurrency supported: environments may be opened by multiple processes on the same host
Multiple sub-databases may be created: transactions cover all sub-databases
Memory-mapped: allows for zero-copy lookup and iteration
Maintenance-free: no external process or background cleanup or compaction required
Crash-proof: no logs or crash recovery procedures required
No application-level caching: LMDB fully exploits the operating system’s buffer cache
32KB of object code and 6KLOC of C: fits in CPU L1 cache for maximum performance

It has a Python API, the lmdb package on PyPI.

It seems to be optimised for small records, of 1-2 kilobytes; bigger values are handled less efficiently.
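
A minimal sketch with the lmdb package (the path and map_size are placeholders; note LMDB stores raw bytes, so anything structured must be serialised first):

```python
import lmdb

# map_size sets the maximum size of the memory map, and thus the database.
env = lmdb.open("./cache.lmdb", map_size=2**30)  # hypothetical path, 1 GiB

# Writes go inside a write transaction; keys and values are raw bytes.
with env.begin(write=True) as txn:
    txn.put(b"x", b"42")

# Reads can proceed concurrently from other threads and processes.
with env.begin() as txn:
    print(txn.get(b"x"))
```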

3 Dogpile.cache

Previous front-runner.

dogpile.cache:

Dogpile consists of two subsystems, one building on top of the other.

dogpile provides the concept of a “dogpile lock”, a control structure which allows a single thread of execution to be selected as the “creator” of some resource, while allowing other threads of execution to refer to the previous version of this resource as the creation proceeds; if there is no previous version, then those threads block until the object is available.

dogpile.cache is a caching API which provides a generic interface to caching backends of any variety, and additionally provides API hooks which integrate these cache backends with the locking mechanism of dogpile. […]

Included backends feature three memcached backends (python-memcached, pylibmc, bmemcached), a Redis backend, a backend based on Python’s anydbm, and a plain dictionary backend.

Pro: it is definitely smart and modern about locking.

Con: plain file disk persistence is not supported by default. The dbm-type wrappers can probably get us something acceptable for data that serialises well in a dbm database; presumably dogpile makes them safe for concurrent writes. Not sure how it would go with heavy numerical work.
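
A sketch of that dbm route, following the configuration pattern in the dogpile.cache documentation (the filename is a placeholder):

```python
from dogpile.cache import make_region

# A region bundles a backend with dogpile's locking logic.
region = make_region().configure(
    "dogpile.cache.dbm",
    expiration_time=3600,                  # seconds
    arguments={"filename": "cache.dbm"},   # hypothetical path
)

# Memoise a function; the dogpile lock coordinates concurrent
# regeneration, so only one caller recomputes an expired value.
@region.cache_on_arguments()
def slow_square(x):
    return x * x

print(slow_square(7))
```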

There is also an advanced backend for higher-performance Redis use.

4 joblib.Memory

The joblib cache looks convenient:

Transparent and fast disk-caching of output value: a memoize or make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. Separate persistence and flow-execution logic from domain logic or algorithmic code by writing the operations as a set of steps with well-defined inputs and outputs: Python functions. Joblib can save their computation to disk and rerun it only if necessary.

I can’t work out if it’s multi-write safe, or supposed to be only invoked from some master process and thus not need locking. Surely, as part of joblib, it should be multi-process safe?

It only supports function memoisation, so accessing results some other way, or accessing partial results, gets convoluted unless your code factors naturally into memoised functions.
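
A minimal sketch (the cache directory is a placeholder; Memory and its cache decorator are the documented interface):

```python
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)  # hypothetical directory

@memory.cache
def slow_square(x):
    return x * x

print(slow_square(7))  # computed once, read back from disk thereafter
```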

5 Klepto

klepto (source) is also focussed on scientific computation (it is part of the pathos project):

klepto extends Python’s lru_cache to utilize different keymaps and alternate caching algorithms, such as lfu_cache and mru_cache. While caching is meant for fast access to saved results, klepto also has archiving capabilities, for longer-term storage. klepto uses a simple dictionary-style interface for all caches and archives, and all caches can be applied to any Python function as a decorator. Keymaps are algorithms for converting a function’s input signature to a unique dictionary, where the function’s results are the dictionary value. Thus for y = f(x), y will be stored in cache[x] (e.g. {x:y}).

klepto provides both standard and “safe” caching, where “safe” caches are slower but can recover from hashing errors. klepto is intended to be used for distributed and parallel computing, where several of the keymaps serialize the stored objects. Caches and archives are intended to be read/write accessible from different threads and processes. klepto enables a user to decorate a function, save the results to a file or database archive, close the interpreter, start a new session, and reload the function and its cache.

Given the use cases, one would assume this is safe for concurrent writes from multiple processes, but I cannot find this confirmed in the documentation.
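
A sketch along the lines of the klepto docs, assuming file_archive’s dict-style interface with explicit dump/load (the filename is a placeholder; check the docs for exact signatures):

```python
from klepto.archives import file_archive

# A dict-style archive backed by a single file on disk.
archive = file_archive("results.pkl")  # hypothetical filename
archive["x"] = 42
archive.dump()   # push cached entries out to the file

# Later, possibly in a fresh interpreter session:
archive.load()   # pull archived entries back into the dict
print(archive["x"])
```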

6 Incoming

cachetools extends the Python 3 lru_cache reference implementation.
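
A sketch of the cachetools style; unlike functools.lru_cache, the cache object is explicit and swappable:

```python
from cachetools import LRUCache, cached

# The cache is an ordinary object you construct and could share or inspect.
@cached(cache=LRUCache(maxsize=128))
def slow_square(x):
    return x * x

print(slow_square(7))
```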