Getting your computer to tell you the gradient of a function, without resorting to finite difference approximation, or coding an analytic derivative by hand. We usually mean this in the sense of automatic forward- or reverse-mode differentiation, which is not, as such, a symbolic technique, but symbolic differentiation gets an incidental look-in, and these ideas do of course relate.
Infinitesimal/Taylor series formulations, the related dual number formulations, and even fancier hyperdual formulations. Reverse mode, a.k.a. backpropagation, versus forward mode etc. Computational complexity of all the above.
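To make the dual-number formulation concrete, here is a minimal forward-mode sketch in plain Python. The `Dual` class, `dsin` primitive and `deriv` helper are illustrative names of my own, not any particular library's API:

```python
import math

class Dual:
    """A value paired with its derivative: val + dot*eps, where eps^2 = 0."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        # sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dsin(x):
    # chain rule for a primitive: d/dx sin(x) = cos(x) * x'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def deriv(f, x):
    """Evaluate f and f' at x in a single forward pass, by seeding dot = 1."""
    out = f(Dual(x, 1.0))
    return out.val, out.dot

# d/dx [x * sin(x)] = sin(x) + x cos(x)
val, grad = deriv(lambda x: x * dsin(x), 1.0)
```

Each arithmetic operation propagates the derivative alongside the value, which is why forward mode costs only a small constant factor over the original function per input direction.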
There is a beautiful explanation of the basics of reverse mode by Sanjeev Arora and Tengyu Ma.
You might want to do this for ODE quadrature, or sensitivity analysis, or for optimisation, either batch or SGD, especially in neural networks, matrix factorisations, variational approximation etc. This is not news these days, but it took a stunningly long time to become common since its inception in the… 1970s? See, e.g. Justin Domke, who claimed automatic differentiation to be the most criminally underused tool in the machine learning toolbox. (That escalated quickly.) See also a timely update by Tim Vieira.
Related: symbolic mathematical calculators.
There are many ways you can do automatic differentiation, and I won't attempt to comprehensively introduce the various approaches here. This is a well-ploughed field, and there is much good material out there already, with fancy diagrams and the like. Symbolic, numeric, dual/forward, backwards mode… Notably, you don't have to choose between them: e.g. you can use forward differentiation to calculate an expedient step in the middle of backward differentiation.
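The backward (reverse) mode mentioned above can itself be sketched in a few lines of plain Python: the forward pass records the computation graph, and a single backward sweep in reverse topological order accumulates the gradient of one scalar output with respect to every input. `Var` and its methods are a hypothetical toy, not any library's API:

```python
class Var:
    """A node in the computation graph, recording its parents and local gradients."""
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents  # pairs of (parent_node, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        # d(u+v)/du = 1, d(u+v)/dv = 1
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(uv)/du = v, d(uv)/dv = u
        return Var(self.val * other.val,
                   [(self, other.val), (other, self.val)])

    def backward(self):
        # topologically order the graph so each node's gradient is complete
        # before it is pushed to its parents
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0  # seed the output
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

x = Var(3.0)
y = Var(2.0)
z = x * y + x   # dz/dx = y + 1 = 3, dz/dy = x = 3
z.backward()
```

One backward sweep yields gradients with respect to all inputs at once, which is why reverse mode wins for the many-inputs, one-scalar-output functions typical of machine learning losses.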
To do: investigate unorthodox methods such as Benoît Pasquier's F1 Method. (source)
This package implements the F1 algorithm described […] It allows for efficient quasi-auto-differentiation of an objective function defined implicitly by the solution of a steady-state problem.
See, e.g. Mike Innes' hands-on introduction, or his terse, opinionated introductory paper, Innes (2018). There is a well-established terminology for sensitivity analysis discussing adjoints; see e.g. Steven Johnson's class notes, and his references (Johnson 2012; Errico 1997; Cao et al. 2003).
Software

autograd can automatically differentiate native Python and Numpy code. It can handle a large subset of Python's features, including loops, ifs, recursion and closures, and it can even take derivatives of derivatives of derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it can efficiently take gradients of scalar-valued functions with respect to array-valued arguments. The main intended application is gradient-based optimization.
This is the most Pythonic of the choices here; not as fast as Tensorflow, but simple to use, and it can differentiate more general things than Tensorflow.
autograd-forward mingles forward-mode differentiation in, to calculate Jacobian-vector products and Hessian-vector products for scalar-valued loss functions, which is useful for classic optimization. AFAICT there are no guarantees about computational efficiency for these.
jax is a successor to autograd.
JAX is Autograd and XLA, brought together for high-performance machine learning research.
With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.
What's new is that JAX uses XLA to compile and run your NumPy programs on GPUs and TPUs. Compilation happens under the hood by default, with library calls getting just-in-time compiled and executed. But JAX also lets you just-in-time compile your own Python functions into XLA-optimized kernels using a one-function API, jit. Compilation and automatic differentiation can be composed arbitrarily, so you can express sophisticated algorithms and get maximal performance without leaving Python.
Dig a little deeper, and you'll see that JAX is really an extensible system for composable function transformations. Both grad and jit are instances of such transformations. Another is vmap for automatic vectorization, with more to come.
This is a research project, not an official Google product. Expect bugs and sharp edges. Please help by trying it out, reporting bugs, and letting us know what you think!
stan is famous for Monte Carlo, but it also does deterministic optimisation using automatic differentiation. This is a luxurious option, but it is computationally expensive and ugly to invoke purely for the gradients unless you are using their inference loop, so it does not count as a general-purpose autodiff library.
Theano (Python) supports autodiff as a basic feature and had a massive user base, although it is now discontinued in favour of the next two…
Tensorflow. FYI there is an interesting discussion of its workings in the Tensorflow Jacobians feature-request ticket.

Pytorch: another neural-net-style thing like Tensorflow, but with dynamic graph construction as in autograd.
Symbolic math packages such as Sympy, MAPLE and Mathematica can all do actual symbolic differentiation, which is different again, but sometimes leads to the same thing. I haven't tried Sympy or MAPLE, but Mathematica's support for matrix calculus is weak.
autodiff, which is usually referred to as audi for the sake of clarity, offers light automatic differentiation for MATLAB. I think MATLAB now has a whole deep learning toolkit built in, which surely supports something natively in this domain.
AlgoPy allows you to differentiate functions implemented as computer programs by using Algorithmic Differentiation (AD) techniques in the forward and reverse mode. The forward mode propagates univariate Taylor polynomials of arbitrary order. Hence it is also possible to use AlgoPy to evaluate higher-order derivative tensors.
A speciality of AlgoPy is the possibility to differentiate functions that contain matrix functions such as +, -, *, /, dot, solve, qr, eigh, cholesky.
Looks sophisticated, and indeed supports differentiation elegantly; but not so actively maintained, and the source code is hard to find.
CasADi (Python, C++, MATLAB)
a symbolic framework for numeric optimization implementing automatic differentiation in forward and reverse modes on sparse matrix-valued computational graphs. It supports self-contained C-code generation and interfaces state-of-the-art codes such as SUNDIALS, IPOPT etc. It can be used from C++, Python or Matlab
[…] CasADi is an open-source tool, written in self-contained C++ code, depending only on the C++ Standard Library. It is developed by Joel Andersson and Joris Gillis at the Optimization in Engineering Center, OPTEC of the K.U. Leuven under supervision of Moritz Diehl. CasADi is distributed under the LGPL license, meaning the code can be used royalty-free even in commercial applications.
Documentation is minimal; you should probably read the source or the published papers to understand how well this will fit your needs and, e.g., which arithmetic operations it supports.
It might be worth it for such features as graceful support for 100-fold nonlinear composition. But the price you pay is a weird DSL that you must learn in order to use it.
ADOL-C is a popular C++ differentiation library with a Python binding. Looks clunky from Python but tenable from C++.
ad, which is built on uncertainties (and is therefore Python), also does it.
ceres-solver (C++), the Google least-squares solver, seems to be pretty good at this, although it is mostly focussed on least-squares losses.
Julia
Julia has an embarrassment of different methods of autodiff (homoiconicity and introspection make this comparatively easy), and the comparative selling points of each are not always clear.
The juliadiff project produces ForwardDiff.jl and ReverseDiff.jl, which do what I would expect, namely autodiff in forward and reverse mode respectively. ForwardDiff claims to be very advanced. ReverseDiff works, but is abandoned.
ForwardDiff implements methods to take derivatives, gradients, Jacobians, Hessians, and higher-order derivatives of native Julia functions
In my casual tests it seems to be slow for my purposes, due to constantly needing to create a new single-argument closure and differentiate it. Or maybe I'm doing it wrong, and the compiler will deal with this if I set it up right? Or maybe most people are not solving my kind of problem, e.g. finding many different optima in similar sub-problems. I suspect this difficulty would vanish if you were solving one big expensive optimisation with many steps, as with neural networks. Update: I was doing it wrong. This gets faster if you avoid type ambiguity by, e.g., setting your problem up inside a function. I'm not sure if there is any remaining overhead in this closure-based system, but it's not so bad.
In forward mode (desirable when, e.g., I have few parameters with respect to which I must differentiate), when do I use DualNumbers.jl? Probably never; it seems to be deprecated in favour of a similar system in ForwardDiff.jl. But ForwardDiff is well supported. It seems to be fast for functions with low-dimensional arguments. It is not clearly documented how one would provide custom derivatives, but apparently you can still use method extensions for Dual types, of which there is an example in the issue tracker. The recommended way is extending DiffRules.jl, which is a little circuitous if you are building custom functions to interpolate. It does not seem to support Wirtinger derivatives yet.
Related to this forward differential formalism is Luis Benet and David P. Sanders' TaylorSeries.jl, which is satisfyingly explicit, and seems to generalise in several unusual directions.
TaylorSeries.jl is an implementation of high-order automatic differentiation, as presented in the book by W. Tucker (2011). The general idea is the following.
The Taylor series expansion of an analytical function \(f(t)\) with one independent variable \(t\) around \(t_0\) can be written as
\[ f(t) = f_0 + f_1 (t-t_0) + f_2 (t-t_0)^2 + \cdots + f_k (t-t_0)^k + \cdots, \] where \(f_0=f(t_0)\), and the Taylor coefficients \(f_k = f_k(t_0)\) are the \(k\)-th normalized derivatives at \(t_0\):
\[ f_k = \frac{1}{k!} \frac{{\rm d}^k f} {{\rm d} t^k}(t_0). \]
Thus, computing the high-order derivatives of \(f(t)\) is equivalent to computing its Taylor expansion. … Arithmetic operations involving Taylor series can be expressed as operations on the coefficients.
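The coefficient arithmetic is easy to sketch in plain Python (an illustrative toy, not TaylorSeries.jl's actual implementation): multiplication of truncated Taylor series is a Cauchy (convolution) product, and higher derivatives fall out of the coefficients.

```python
import math

def taylor_mul(a, b):
    # Cauchy product: coefficient k of a*b is sum_j a_j * b_{k-j},
    # truncated to the same order as the inputs
    n = len(a)
    return [sum(a[j] * b[k - j] for j in range(k + 1)) for k in range(n)]

def taylor_var(t0, order):
    # coefficients of the identity function t around t0: t0 + 1*(t - t0)
    return [t0, 1.0] + [0.0] * (order - 1)

# expand f(t) = t^3 around t0 = 2 up to order 4
t = taylor_var(2.0, 4)
f = taylor_mul(taylor_mul(t, t), t)

# f_k = f^(k)(t0)/k!, so the derivatives are recovered as k! * f_k
derivs = [math.factorial(k) * c for k, c in enumerate(f)]
```

Here `f` comes out as the coefficients of \((2 + (t-2))^3\), and `derivs` recovers \(f(2)=8\), \(f'(2)=12\), \(f''(2)=12\), \(f'''(2)=6\), matching \(f'=3t^2\), \(f''=6t\), \(f'''=6\).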
It has a number of functional-approximation analysis tricks. 🚧
HyperDualNumbers promises cheap second-order derivatives by generalizing dual numbers to hyperduals. (ForwardDiff claims to support Hessians by Dual-of-Duals, which are supposed to be the same as hyperduals.) I am curious which is the faster way of generating Hessians: ForwardDiff's Dual-of-Dual or HyperDualNumbers. HyperDualNumbers has some very nice tricks. Look at the HyperDualNumbers homepage example, where we evaluate the derivatives of f at x by evaluating it at hyper(x, 1.0, 1.0, 0.0).
> f(x) = ℯ^x / (sqrt(sin(x)^3 + cos(x)^3))
> t0 = Hyper(1.5, 1.0, 1.0, 0.0)
> y = f(t0)
4.497780053946162 + 4.053427893898621ϵ1 +
4.053427893898621ϵ2 + 9.463073681596601ϵ1ϵ2
The first term is the function value, the coefficients of both ϵ1 and ϵ2 (which correspond to the second and third arguments of hyper) are equal to the first derivative, and the coefficient of ϵ1ϵ2 is the second derivative.
Really nice. However, AFAICT this method does not actually get you a Hessian, except in a trivial sense, because it only seems to return the right answer for scalar functions of scalar arguments. This is great if you can reduce your function to scalar parameters, in the sense of having a diagonal Hessian, but that skips lots of interesting cases. One useful case it does not skip, if that is so, is diagonal preconditioning of tricky optimisations.
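To see exactly what the hyperdual trick propagates, it can be reproduced in a few lines of plain Python (an illustrative toy with hypothetical names, not the HyperDualNumbers package): a number \(a + b\epsilon_1 + c\epsilon_2 + d\epsilon_1\epsilon_2\) with \(\epsilon_1^2 = \epsilon_2^2 = 0\) carries the value, two copies of the first derivative, and the second derivative through every operation.

```python
import math

class Hyper:
    """a + b*e1 + c*e2 + d*e1*e2, with e1^2 = e2^2 = 0."""
    def __init__(self, a, b=0.0, c=0.0, d=0.0):
        self.a, self.b, self.c, self.d = a, b, c, d

    def __add__(self, other):
        return Hyper(self.a + other.a, self.b + other.b,
                     self.c + other.c, self.d + other.d)

    def __mul__(self, other):
        # expand the product and drop terms containing e1^2 or e2^2
        return Hyper(self.a * other.a,
                     self.a * other.b + self.b * other.a,
                     self.a * other.c + self.c * other.a,
                     self.a * other.d + self.b * other.c
                     + self.c * other.b + self.d * other.a)

def hexp(x):
    # lift a univariate primitive f = exp via
    # f(x) = f(a) + f'(a)(b e1 + c e2 + d e1e2) + f''(a) b c e1e2
    e = math.exp(x.a)
    return Hyper(e, e * x.b, e * x.c, e * x.d + e * x.b * x.c)

# f(x) = x * exp(x): f'(x) = (1 + x) e^x, f''(x) = (2 + x) e^x
x = Hyper(1.5, 1.0, 1.0, 0.0)   # seed both e1 and e2 with 1
y = x * hexp(x)
# y.b and y.c are each f'(1.5); y.d is f''(1.5), exactly, in one pass
```

The seed hyper(x, 1.0, 1.0, 0.0) makes the ϵ1 and ϵ2 slots two independent first-order perturbations, so their cross term ϵ1ϵ2 picks up the exact second derivative without the subtractive cancellation of finite differences.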
Pro tip: the actual manual is the walkthrough, which is not linked from the purported manual.
Another curiosity: Benoît Pasquier's F1 Method packages (Pasquier and Primeau, n.d.), DualMatrixTools and HyperDualMatrixTools, which extend this to certain implicit derivatives arising in steady-state problems.
How about Zygote.jl then? That's an alternative AD library from the creators of the aforementioned Flux. It usually operates in reverse mode and does some zany compilation tricks to get extra fast. It also has a forward mode. It has many fancy features, including compiling to Google Cloud TPUs. Hessian support is "somewhat". Flux itself does not yet default to Zygote, using its own specialised reverse-mode autodiff, Tracker, but it promises to switch transparently to Zygote in the future. In the interim Zygote is still attractive: it has many luxurious options, such as easily defining optimised custom derivatives, as well as weird quirks, such as occasionally bizarre error messages and failures to notice source-code updates.
One could roll one's own autodiff system using the basic derivative definitions in DiffRules. There is also the very fancy planned Capstan, which aims to use a tape system to inject forward- and reverse-mode differentiation into even very hostile code, and to do much more besides. However, it also doesn't work yet, and it depends upon Julia features that also don't work yet, so don't hold your breath. (Or: help them out!)
See also XGrad, which does symbolic differentiation. It prefers to have access to the source code as text rather than as an AST, so I think that makes it similar to Zygote, but with worse PR?
Refs
Baydin, Atilim Gunes, and Barak A. Pearlmutter. 2014. "Automatic Differentiation of Algorithms for Machine Learning," April. http://arxiv.org/abs/1404.7456.
Baydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2015. "Automatic Differentiation in Machine Learning: A Survey," February. http://arxiv.org/abs/1502.05767.
Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. "Tricks from Deep Learning," November. http://arxiv.org/abs/1611.03777.
Cao, Y., S. Li, L. Petzold, and R. Serban. 2003. "Adjoint Sensitivity Analysis for Differential-Algebraic Equations: The Adjoint DAE System and Its Numerical Solution." SIAM Journal on Scientific Computing 24 (3): 1076–89. https://doi.org/10.1137/S1064827501380630.
Carpenter, Bob, Matthew D. Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. 2015. "The Stan Math Library: Reverse-Mode Automatic Differentiation in C++." arXiv Preprint arXiv:1509.07164. http://arxiv.org/abs/1509.07164.
Errico, Ronald M. 1997. "What Is an Adjoint Model?" Bulletin of the American Meteorological Society 78 (11): 2577–92. https://doi.org/10.1175/1520-0477(1997)078<2577:WIAAM>2.0.CO;2.
Fike, Jeffrey, and Juan Alonso. 2011. "The Development of Hyper-Dual Numbers for Exact Second-Derivative Calculations." In 49th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition. Orlando, Florida: American Institute of Aeronautics and Astronautics. https://doi.org/10.2514/6.2011-886.
Fischer, Keno, and Elliot Saba. 2018. "Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs," October. http://arxiv.org/abs/1810.09868.
Giles, Mike B. 2008. "Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation." In Advances in Automatic Differentiation, edited by Christian H. Bischof, H. Martin Bücker, Paul Hovland, Uwe Naumann, and Jean Utke, 64:35–44. Berlin, Heidelberg: Springer Berlin Heidelberg. http://eprints.maths.ox.ac.uk/1079/.
Gower, R. M., and A. L. Gower. 2016. "Higher-Order Reverse Automatic Differentiation with Emphasis on the Third-Order." Mathematical Programming 155 (1-2): 81–103. https://doi.org/10.1007/s10107-014-0827-4.
Griewank, Andreas, and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Haro, A. 2008. â€śAutomatic Differentiation Methods in Computational Dynamical Systems: Invariant Manifolds and Normal Forms of Vector Fields at Fixed Points.â€ť IMA Note. http://www.maia.ub.es/~alex/admcds/admcds.pdf.
Innes, Michael. 2018. "Don't Unroll Adjoint: Differentiating SSA-Form Programs," October. http://arxiv.org/abs/1810.07951.
Johnson, Steven G. 2012. "Notes on Adjoint Methods for 18.335," 6.
Laue, Soeren, Matthias Mitterreiter, and Joachim Giesen. 2018. "Computing Higher Order Derivatives of Matrix and Tensor Expressions." In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 2750–9. Curran Associates, Inc. http://papers.nips.cc/paper/7540-computing-higher-order-derivatives-of-matrix-and-tensor-expressions.pdf.
Maclaurin, Dougal, David K. Duvenaud, and Ryan P. Adams. 2015. "Gradient-Based Hyperparameter Optimization Through Reversible Learning." In ICML, 2113–22. http://www.jmlr.org/proceedings/papers/v37/maclaurin15.pdf.
Neidinger, R. 2010. "Introduction to Automatic Differentiation and MATLAB Object-Oriented Programming." SIAM Review 52 (3): 545–63. https://doi.org/10.1137/080743627.
Neuenhofen, Martin. 2018. "Review of Theory and Implementation of Hyper-Dual Numbers for First and Second Order Automatic Differentiation," January. http://arxiv.org/abs/1801.03614.
Pasquier, B, and F Primeau. n.d. "The F1 Algorithm for Efficient Computation of the Hessian Matrix of an Objective Function Defined Implicitly by the Solution of a Steady-State Problem." SIAM Journal on Scientific Computing, 10. https://www.bpasquier.com/publication/pasquier_primeau_sisc_2019/.
Rall, Louis B. 1981. Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science 120. Berlin; New York: Springer-Verlag.
Revels, Jarrett, Miles Lubin, and Theodore Papamarkou. 2016. "Forward-Mode Automatic Differentiation in Julia," July. http://arxiv.org/abs/1607.07892.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. "Learning Representations by Back-Propagating Errors." Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.
Tucker, Warwick. 2011. Validated Numerics: A Short Introduction to Rigorous Computations. Princeton: Princeton University Press. http://public.eblib.com/choice/publicfullrecord.aspx?p=683309.