Debugging and profiling and accelerating Julia code

Usefulness: 🔧 🔧 🔧
Novelty: 💡
Uncertainty: 🤪 🤪
Incompleteness: 🚧 🚧

Debugging and profiling and performance hacks in Julia.

Going fast isn’t magical

Don’t believe that wild-eyed evangelical gentleman at the Julia meet-up. (And so far there has been one at every meet-up I have attended.) He doesn’t write actual code; he’s too busy talking about it. You still need to optimize and hint to get the most out of your work.

Active Julia contributor Chris Rackauckas, in the comments of this blog, is at pains to point out that type-hinting is rarely required in function definitions, and that often even threaded vectorisation annotation is not needed, which is amazing. (The manual is more ambivalent about that latter point, but anything in this realm is a mild bit of magic.)

Not to minimize how excellent that is, but, bold coder, still don’t get lazy! This doesn’t eliminate the role that performance annotations and careful code structuring typically play, and I do not regard it as a bug that you occasionally annotate code to tell the compiler what you know can be done with it. Julia should not spend CPU cycles proving non-trivial correctness results that you already know; that would likely mean solving redundant, possibly super-polynomially hard problems in order to speed up polynomial ones, which is a bad bet. It is a design feature of Julia that it allows you to be clever; it would be a bug if it tried to be clever in a stupid way. You will still need to think about array views versus copies, and about how you structure your algorithms; see, for example, blog posts on structuring algorithms for fancy GPU computing.
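
For example, slicing an array copies by default, whereas Base’s @view macro gives you a non-copying window onto the same data:

x = rand(1000)
y = x[1:500]        # allocates a fresh copy
z = @view x[1:500]  # a view: no copy, and writes to z show up in x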

Profiling

Here is a convenient overview of timer functions to help you work out what to optimize.

The built-in timing tool is the @time macro. It can even time arbitrary begin/end blocks.
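
For example, timing a single call or a whole block (the trailing semicolon just suppresses printing the result):

julia> @time sum(rand(10^6));

julia> @time begin
           x = rand(10^6)
           sum(x)
       end;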

Going deeper, try Profile, the built-in statistical profiler:

julia> using Profile
julia> @profile myfunc()
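
Profile.print() then dumps the gathered samples as a text call tree:

julia> Profile.print()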

ProfileView is advisable for making sense of the output. It doesn’t work in JupyterLab, but from JupyterLab (or anywhere) you can invoke the Gtk interface by setting PROFILEVIEW_USEGTK = true in the Main module before using ProfileView.
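
Concretely, that workaround looks like this:

PROFILEVIEW_USEGTK = true  # set in Main before loading ProfileView
using ProfileView
ProfileView.view()         # flame graph of the most recent @profile run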

It’s kind of the opposite of profiling, but you can give the user feedback about what’s happening using a ProgressMeter.
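
A minimal sketch (expensive_step here is a stand-in for your own workload):

using ProgressMeter

@showprogress for i in 1:1000
    expensive_step(i)  # hypothetical expensive work
end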

Near-magical performance annotations

@simd
equivocally endorsed in the manual, as it may or may not be needed to squeeze the best possible performance out of an inner loop. Caution says try it out and see.
@inbounds
disables bounds checking. Bounds checking is expensive, so skipping it can buy real speed in hot loops, but be very careful with your indexing: an out-of-bounds access under @inbounds is undefined behaviour.
@fastmath
violates the strict IEEE math setup for performance, in a variety of ways, some of which sound benign and some not so much. (All three are sketched below.)
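
Here is a sketch of all three; mysum and fastnorm are made-up examples, not library functions:

function mysum(x::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(x)  # skip bounds checks, allow reordering
        s += x[i]
    end
    return s
end

fastnorm(a, b) = @fastmath sqrt(a^2 + b^2)  # IEEE-loose but faster arithmetic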

Shunting to GPUs or other fancy architecture

Coding on GPUs for Julia is, at least in principle, easy. Thus you can possibly glean some nearly-free acceleration by using, for example, CuArrays to push computation onto GPUs. sdanisch has written the simplest examples of this. AFAICT the hippest interface is the vendor-neutral GPUArrays, which (I think) will transparently use CuArrays and CUDAnative to target, e.g., an NVIDIA GPU. You can also invoke them in a vendor-specific way, which looks easy. There is even activist support for Google Cloud TPU compilation. How well this vectorises in practice is not clear to me yet. GPUifyLoops provides “Support for writing loop-based code that executes both on CPU and GPU”. Simon Danisch writes on GPU usage in Julia.
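
A minimal sketch of the CuArrays route, assuming a CUDA-capable GPU and a working driver:

using CuArrays

x = cu(rand(Float32, 1024))  # copy the array to GPU memory
y = 2f0 .* x .+ 1f0          # fused broadcast, executed on the GPU
z = collect(y)               # copy the result back to the host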

Debugging

There is a tracing debug ecosystem based on JuliaInterpreter. This provides such useful commands as @bp to inject breakpoints and @enter to execute a function in a step debugger. It doesn’t work from IJulia, but works fine from the console and also from Juno. (Juno is, for my tastes, still unpleasant enough that this is not worth it.)
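
A minimal sketch using the Debugger.jl front-end to that ecosystem:

using Debugger

function g(x)
    y = x^2
    @bp           # execution pauses here when run under the debugger
    return y + 1
end

@enter g(3)       # step through g interactively from the REPL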

Additionally, there is a somewhat maintained stack debugger, Gallium.jl.

If you want to know how you made some weird mistake in type design, try Cthulhu.
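
Cthulhu’s entry point is the @descend macro, which lets you interactively descend through the typed code of a call and its callees:

using Cthulhu

@descend sum(rand(10))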

There is a linter, Lint.jl, which can identify code style issues. (It also has an Atom linter plugin.)

Caching

Doing heavy calculations? Want to save time? Caching notionally helps, but of course is a quagmire of bikeshedding.

The Anamemsis creator(s) have opinions about this:

Julia is a scientific programming language which was designed from the ground up with computational efficiency as one of its goals. As such, even ad hoc and “naive” functions written in Julia tend to be extremely efficient. Furthermore, any memoization implementation is liable to have some performance pain points: in general they contain type unstable code (even if they have type stable output) and they must include a value cache which is accessible from the same scope as the function itself so that global objects are potentially involved. Because of this, memoization comes with a significant overhead, even with Julia’s highly efficient AbstractDict implementations.

My own opinion of caching is that I’m for it. Despite the popularity of the automatic memoization pattern, I’m against it as a rule. I usually end up having to think carefully about caching, and magically making functions cache never seems to help with the hard part of that. Caching logic should, for the projects I’ve worked on at least, be explicit and opt-in; that should be an acceptable design overhead, because I should only be doing it for a few expensive functions rather than creating magic cache functions everywhere.

LRUCache is the most commonly recommended cache. It provides a simple, explicit, opt-in, in-memory, typed cache structure, so it should be performant. It promises sensible, if not granular, locking for multi-threaded access: one thread at a time can access any key of the cache. AFAICT this meets 100% of my needs.

Invocations look like

using LRUCache

function foo(a::Float64, b::Float64)
    sleep(100)  # stand-in for an expensive computation
    a * b
end

const lru = LRU{Tuple{Float64, Float64}, Float64}(maxsize = 100)

function cached_foo(a::Float64, b::Float64)
    get!(lru, (a, b)) do
        foo(a, b)
    end
end
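
The first call to cached_foo(2.0, 3.0) pays the full cost of foo; subsequent calls with the same arguments are answered from the cache. (Note that recent versions of LRUCache want the maxsize keyword, which bounds how many entries are kept before old ones are evicted.)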

Memoize provides an automatic caching macro, but only primitive in-memory caching using plain old dictionaries. AFAICT it is possibly compatible with LRUCache, but weirdly I could not find discussion of that. However, I don’t like memoisation, so that’s the end of that story.

Anamemsis has advanced cache decorators and lots of levers to pull to alter caching behaviour.

Caching is especially full-featured, with macros, disk persistence, configurable evictions, etc. It wants type annotations to be used in the cached function definitions.

Cleverman Simon Byrne notes that you can roll your own fancy caching using wild Julia introspection stunts via Cassette.jl, but I don’t know how battle-tested this is. I am not clever enough for that much magic.