Introduction
A large number of numerical methods require derivatives to be calculated. Classical examples include the maximization of an objective function using gradient descent or Newton's method. In machine learning, gradients play a crucial role in the training of neural networks (Widrow and Lehr, 1990). Modelers have also applied derivatives to sample the high-dimensional posteriors of Bayesian models, using Hamiltonian Monte Carlo samplers (Neal, 2010; Betancourt, 2017), or to approximate these posteriors with variational inference (Kucukelbir et al., 2016). As probabilistic models and algorithms gain in complexity, computing derivatives becomes a formidable task. For many problems, we cannot calculate derivatives analytically, but only evaluate them numerically at certain points. Even when an analytical solution exists, working it out by hand can be mathematically challenging, time consuming, and error prone. There exist three alternatives that automate the calculation of derivatives: (1) finite differentiation, (2) symbolic differentiation, and (3) automatic differentiation (AD). The first two methods can perform very poorly when applied to complex functions, for a variety of reasons summarized in Table 1; AD, on the other hand, escapes many of the limitations posed by finite and symbolic differentiation, as discussed by Baydin et al. (2018). Several packages across different programming languages implement AD and allow differentiation to occur seamlessly, while users focus on other programming tasks. For an extensive and regularly updated list of AD tools, the reader may consult www.autodiff.org. In certain cases, AD libraries are implemented as black boxes which support statistical and machine learning software, such as the Python package PyTorch (Paszke et al., 2017) or the probabilistic programming language Stan (Carpenter et al., 2017).
Automatic differentiation does not, despite its name, fully automate differentiation and can yield inefficient code if naively implemented. For this reason, AD is sometimes referred to as algorithmic rather than automatic differentiation. We here review some important techniques to efficiently implement AD and optimize its performance. The article begins with a review of the fundamentals of AD. The next section presents various computational schemes, required for an efficient implementation and deployed in several recently developed packages. We then discuss strategies to optimize the differentiation of complex mathematical algorithms. The last section presents practical and open problems in the field.
Throughout the paper, we consider several performance metrics. Accuracy is, we argue, the most important one, but rarely distinguishes different implementation schemes of AD, most of which are exact up to arithmetic precision. On the other hand, differences arise when we compare run time and memory usage; run time, being easier to measure, is much more prominent in the literature. Another key consideration is applicability: does a method solve a broad or only a narrow class of problems? Finally, we discuss ease of implementation and readability, more subjective but nevertheless essential properties of a computer code. We do not consider compile time, mostly because the statistical applications of AD we have in mind compile a program once, before using it thousands, sometimes millions of times.
Table 1.
Technique | Advantage(s) | Drawback(s)
Hand-coded analytical derivative | Exact and often fastest method. | Time consuming to code, error prone, and not applicable to problems with implicit solutions. Not automated.
Finite differentiation | Easy to code. | Subject to floating point precision errors and slow, especially in high dimensions, as the method requires at least n evaluations, where n is the number of partial derivatives required.
Symbolic differentiation | Exact, generates symbolic expressions. | Memory intensive and slow. Cannot handle statements such as unbounded loops.
Automatic differentiation | Exact, speed is comparable to hand-coding derivatives, highly applicable. | Needs to be carefully implemented, although this is already done in several packages.
How automatic differentiation works
Given a target function f : R^n → R^m, the corresponding Jacobian matrix J has (i, j)th component

J_ij = ∂f_i / ∂x_j.

This matrix contains the partial derivatives of all the outputs with respect to all the inputs. If f has a one-dimensional output, as is the case when f is an objective function, the Jacobian matrix is simply the gradient. In practice, we may care only about the partial derivatives with respect to some of the inputs and calculate a reduced Jacobian matrix. The partial derivatives with respect to these inputs, respectively the corresponding columns of the Jacobian matrix, are called sensitivities.
Now suppose f is a composite function, f = g ∘ h, with h : R^n → R^k and g : R^k → R^m. Applying the chain rule and elementary matrix multiplication:

J = J_g · J_h,

with (i, j)th element:

J_ij = ∂f_i/∂x_j = Σ_{l=1}^{k} (∂g_i/∂h_l)(∂h_l/∂x_j).
More generally, if our target function f is the composite expression of L functions,

f = f^L ∘ f^{L−1} ∘ ⋯ ∘ f^1, (1)

the corresponding Jacobian matrix verifies:

J = J_L · J_{L−1} ⋯ J_1. (2)
To apply AD, we first need to express our target function f using a computer program. AD can then operate in one of two modes: forward or reverse. Let u ∈ R^n. One application, or sweep, of forward-mode AD numerically evaluates the action of the Jacobian matrix on u, that is J · u. As prescribed by Equation 2,

J · u = J_L (J_{L−1} ( ⋯ (J_1 u))), (3)

where the intermediate tangents u_k verify the recursion relationship

u_0 = u, u_k = J_k u_{k−1}, k = 1, …, L. (4)
Hence, given a complex function f, we can break down the action of the Jacobian matrix on a vector into simple components, which we evaluate sequentially. Even better, we can choose the splitting points in Equation 1 (correspondingly Equation 2) to generate efficient implementations of AD. We will take advantage of this fact when we consider checkpoints and super nodes in later sections of the paper. Consider now a vector in the output space, v ∈ R^m. A sweep of reverse-mode AD computes the action of the transpose of the Jacobian matrix on v, that is J^T · v. The reasoning we used for forward mode applies here too and allows us to break down this operation into a sequence of simple operations. Oftentimes, an AD program evaluates these simple operations analytically.
To further illustrate the mechanisms of AD, we consider a typical example from statistics, based on an example from Carpenter et al. (2015). We will compute the gradient of a log likelihood function for an observed variable y sampled from a normal distribution with mean μ and standard deviation σ. The likelihood function is:

f(y | μ, σ) = (1 / (σ √(2π))) exp(−(y − μ)² / (2σ²)),

with corresponding log likelihood function:

log f(y | μ, σ) = −(1/2) ((y − μ)/σ)² − log σ − (1/2) log(2π). (5)
Our goal is to compute the sensitivities of μ and σ, evaluated at the observed value y and at given numerical values of μ and σ. The above function can be broken up into a sequence of maps, yielding the composite structure of Equation 1:

(y, μ, σ) ↦ (y − μ, σ) ↦ ((y − μ)/σ, σ) ↦ (((y − μ)/σ)², log σ) ↦ −(1/2)((y − μ)/σ)² − log σ − (1/2) log(2π). (6)
Equation 4, which gives us the sequential actions of Jacobian matrices on vectors, then applies. Note that we have broken f into functions which only perform an elementary operation, such as / or log (division or logarithm). For example, the second map is (u, σ) ↦ (u/σ, σ), which constructs a two-dimensional vector by applying the division and the identity operators. By linearity of Equation 4, the identity operation is preserved. This means that, in our example, the second elements of u_1 and u_2 are identical. Hence, rather than explicitly computing every element in Equation 4, i.e. propagating derivatives through the maps outlined in Equation 6, it suffices to focus on individual operations, such as / and log, and moreover, on any operation other than the identity.
To do this, Equation 5 can be topologically sorted into an expression graph (Figure 1). At the top of the graph we have the final output, log f, and at the roots the input variables y, μ, and σ. The nodes in between represent intermediate variables, obtained through elementary operations. A link between variables on the graph indicates an explicit dependence. We then know how to analytically differentiate operators on the expression graph and can get partial derivatives of composite expressions with the chain rule. For example, with v = (y − μ)/σ:

∂(log σ)/∂σ = 1/σ, and ∂(−(1/2)v²)/∂μ = −v · ∂v/∂μ = (y − μ)/σ².
We now describe the algorithmic mechanisms of the forward and reverse modes of AD.
Forward mode
Consider the direction, or initial tangent, u ∈ R^n. One forward sweep computes J · u. This corresponds to the total derivative of f with respect to u, or, to be more precise, to the directional or directed partial derivative. The complexity of this operation is linear in the complexity of f: using fused multiply-adds, denoted OPS, as a metric for computational complexity, OPS(f(x), J · u) ≤ 2.5 × OPS(f(x)) (see chapter 4 of Griewank and Walther (2008)).
The right choice of u allows us to compute one column of the Jacobian matrix, that is, the partial derivatives of all the outputs with respect to one input. Specifically, let the j-th element of u be 1 and all its other elements 0. Then:

J · u = J_{·, j}, (7)

where J_{·, j} denotes the j-th column of the matrix J. We can hence compute a full m × n Jacobian matrix in n forward sweeps. Naturally, we do not compute Equation 7 by doing a matrix operation, as this would require us to already know J. Instead, we proceed as follows.
Forward-mode AD computes a directional derivative at the same time as it performs a forward evaluation trace. In the case where u is the above defined initial tangent, the software evaluates a variable and then calculates its partial directed derivative with respect to one root variable. Let v̇ denote the directed partial derivative of a variable v with respect to that one root variable. Consider an intermediate variable v = φ(v_1, v_2), which is connected to the variables v_1 and v_2. As the program sweeps through v, it evaluates (i) its value and (ii) its directional derivative,

v̇ = (∂φ/∂v_1) v̇_1 + (∂φ/∂v_2) v̇_2.

The latter is calculated by computing the partial derivatives of φ with respect to v_1 and v_2, and plugging in numerical values for v_1, v_2, and their directed partial derivatives v̇_1 and v̇_2, which have already been evaluated. The code applies this procedure from the roots to the final output (Table 2).
Table 2: forward evaluation trace and forward derivative trace for the normal log likelihood example.
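The forward traces of Table 2 can be sketched with a minimal dual-number type. The code below is a hypothetical illustration, not taken from any of the packages discussed: each overloaded operation propagates a value and a directed derivative together, and seeding the tangent of μ with 1 yields ∂(log f)/∂μ in a single sweep. The numerical inputs y = 10, μ = 5, σ = 2 are purely illustrative.

```cpp
#include <cassert>
#include <cmath>

// A dual number carries a value and its directed derivative (tangent)
// with respect to whichever input was seeded with tangent 1.
struct Dual {
    double val;
    double dot;
};

// Each overloaded operation evaluates the value and, at the same time,
// propagates the tangent with the corresponding differentiation rule.
Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.dot + b.dot}; }
Dual operator-(Dual a, Dual b) { return {a.val - b.val, a.dot - b.dot}; }
Dual operator*(Dual a, Dual b) {
    return {a.val * b.val, a.dot * b.val + a.val * b.dot};  // product rule
}
Dual operator/(Dual a, Dual b) {
    return {a.val / b.val,
            (a.dot * b.val - a.val * b.dot) / (b.val * b.val)};  // quotient rule
}
Dual log(Dual a) { return {std::log(a.val), a.dot / a.val}; }

// log f = -0.5 ((y - mu)/sigma)^2 - log sigma - 0.5 log(2 pi), as in Equation 5.
Dual normal_lpdf(Dual y, Dual mu, Dual sigma) {
    Dual z = (y - mu) / sigma;
    Dual half{0.5, 0.0};
    Dual c{0.5 * std::log(2.0 * 3.14159265358979323846), 0.0};
    Dual zero{0.0, 0.0};
    return zero - half * z * z - log(sigma) - c;
}

// One forward sweep with the tangent seeded on mu yields d(log f)/d(mu).
double dlogf_dmu(double y, double mu, double sigma) {
    return normal_lpdf({y, 0.0}, {mu, 1.0}, {sigma, 0.0}).dot;
}
```

Analytically, ∂(log f)/∂μ = (y − μ)/σ², so for y = 10, μ = 5, σ = 2 the sweep returns 1.25.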
Reverse mode
Now suppose that, instead of computing derivatives with respect to an input, we compute adjoints with respect to an output. The adjoint of a variable x with respect to another variable z is defined as:

x̄ = ∂z / ∂x.

Then, for an initial cotangent vector v ∈ R^m, one reverse-mode sweep computes J^T · v, with complexity OPS(f(x), J^T · v) ≤ 4 × OPS(f(x)) (see again chapter 4 of Griewank and Walther (2008)).
The right choice of v allows us to compute one row of the Jacobian matrix, i.e. the adjoints of all the inputs with respect to one output. If the function has a one-dimensional output, this row corresponds exactly to the gradient. If we pick v to be 1 for the i-th element and 0 for all other elements, one sweep computes the i-th row of J:

J^T · v = (J_{i, ·})^T, (8)

and we get the full Jacobian matrix in m reverse sweeps.
To calculate the right-hand side of Equation 8, the algorithm proceeds as follows. After executing a forward evaluation trace, as was done for forward mode, the program makes a reverse pass to calculate the adjoints. We start with the final output, setting its adjoint with respect to itself to 1, and compute successive adjoints until we reach the root variables of interest (Table 3).
This procedure means the program must go through the expression graph twice: one forward trace to get the value of the function and the intermediate variables; and one reverse trace to get the gradient. This creates performance overhead because the expression graph and the values of the intermediate variables need to be stored in memory. One way of doing this efficiently is to have the code make a lazy evaluation of the derivatives. A lazy evaluation does not evaluate a statement immediately when it appears in the code, but stores it in memory and evaluates it when (and only if) the code explicitly requires it. As a result, we only evaluate expressions required to compute the gradient, or more generally the object of interest. There are several other strategies to efficiently handle the memory overhead of reversemode AD, which we will discuss in the section on computational implementation.
Table 3: reverse adjoint trace for the normal log likelihood example.
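To make the reverse pass concrete, here is a minimal tape-based sketch (hypothetical code, not from any of the packages discussed): each elementary operation records its value and local partial derivatives during the forward trace, and a single backward loop over the tape then accumulates the adjoints. It is applied to the normal log likelihood of Equation 5, with illustrative values y = 10, μ = 5, σ = 2.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// One tape entry: a value, an adjoint, and the local partial derivatives
// of this node with respect to its (at most two) parent nodes.
struct Node {
    double val, adj;
    int parent[2];
    double partial[2];
};

static std::vector<Node> tape;

static int record(double val, int p0 = -1, double d0 = 0.0,
                  int p1 = -1, double d1 = 0.0) {
    tape.push_back(Node{val, 0.0, {p0, p1}, {d0, d1}});
    return static_cast<int>(tape.size()) - 1;
}

// Elementary operations: evaluate and record local partials (forward trace).
static int sub(int a, int b) {
    return record(tape[a].val - tape[b].val, a, 1.0, b, -1.0);
}
static int mul(int a, int b) {
    return record(tape[a].val * tape[b].val, a, tape[b].val, b, tape[a].val);
}
static int div_op(int a, int b) {
    double va = tape[a].val, vb = tape[b].val;
    return record(va / vb, a, 1.0 / vb, b, -va / (vb * vb));
}
static int log_op(int a) {
    return record(std::log(tape[a].val), a, 1.0 / tape[a].val);
}
static int scale(double c, int a) { return record(c * tape[a].val, a, c); }
static int shift(double c, int a) { return record(c + tape[a].val, a, 1.0); }

// Reverse pass: seed the output adjoint with 1 and propagate backwards;
// parents always precede children on the tape, so one backward loop suffices.
static void reverse_sweep(int out) {
    tape[out].adj = 1.0;
    for (int i = out; i >= 0; --i)
        for (int k = 0; k < 2; ++k)
            if (tape[i].parent[k] >= 0)
                tape[tape[i].parent[k]].adj += tape[i].adj * tape[i].partial[k];
}

// Gradient of log f = -0.5 ((y - mu)/sigma)^2 - log sigma - 0.5 log(2 pi);
// returns {d log f / d mu, d log f / d sigma}.
static std::pair<double, double> normal_grad(double yv, double muv, double sv) {
    tape.clear();
    int y = record(yv), mu = record(muv), sigma = record(sv);
    int z = div_op(sub(y, mu), sigma);
    int logf = shift(-0.5 * std::log(2.0 * 3.14159265358979323846),
                     sub(scale(-0.5, mul(z, z)), log_op(sigma)));
    reverse_sweep(logf);
    return {tape[mu].adj, tape[sigma].adj};
}
```

One forward trace followed by one reverse sweep yields both adjoints at once; analytically, ∂(log f)/∂μ = (y − μ)/σ² = 1.25 and ∂(log f)/∂σ = (y − μ)²/σ³ − 1/σ = 2.625 at the illustrative values.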
Forward or reverse mode?
For a function f : R^n → R^m, suppose we wish to compute all the elements of the Jacobian matrix: which mode of AD should we use? Ignoring the overhead of building the expression graph, reverse mode, which requires m sweeps, performs better when m < n. Once we account for this relatively small overhead, the performance of reverse-mode AD is superior when n ≫ m, that is, when we have many inputs and few outputs. This is the case when we map many model parameters to an objective function, which makes reverse mode highly applicable to high-dimensional modeling.
However, if n ≪ m, forward mode performs better. Even when we have a comparable number of outputs and inputs, as was the case in the normal log density example, forward mode can be more efficient, since there is less overhead associated with storing the expression graph in memory in forward than in reverse mode. Baydin et al. (2018) compare the run times of forward- and reverse-mode AD applied to a statistical mechanics problem. The differentiated function is a many-to-one map, that is, m = 1. For n = 1, forward-mode AD proves more efficient, but the result flips as n increases (Table 4).
Table 4.
Dimension of input, n | 1 | 8 | 29 | 50
Relative run time (reverse / forward) | 1.13 | 0.81 | 0.45 | 0.26
While a certain mode may work best overall for a function, we can also look at the intermediate functions that get called inside a target function. Such intermediate functions may be best differentiated using the other mode of AD. Phipps and Pawlowski (2012) calculate derivatives of intermediate expressions with reverse mode in an overall forward-mode AD framework. They consider two forward-mode implementations; both use expression templates, a computational technique we discuss in a later section, and one of them uses caching. Caching stores the values and partial derivatives of intermediate expressions and ensures they only get evaluated once. As a case study, Phipps and Pawlowski (2012) consider an application to fluid mechanics, and report a 30% reduction in run time with intermediate reverse mode, compared to standard forward-mode AD without caching. The performance gain when forward-mode AD already uses caching is, however, minor (less than 10%).
Another example where forward- and reverse-mode AD are both applied is the computation of higher-order derivatives, such as Hessians or Hessian-vector products (see, for example, Pearlmutter (1994)). We will dwell a little longer on this in our discussion of higher-order derivatives.
Computational implementation
A computationally naive implementation of AD can result in prohibitively slow code and excess use of memory. Careful considerations can mitigate these effects.
There exist several studies which measure the performance of different implementation schemes, but these studies present limitations. First, they are indirect tests, as they often compare packages which have several distinguishing features. Second, computer experiments usually focus on a limited class of problems; there exists no standard set of problems spanning the diversity of AD applications, and the development of a package is often motivated by a specific class of problems. Nevertheless, these test results work well as proofs of concept, and are aggregated in our study of computational techniques.
Our focus is mainly on C++ libraries, which have been compared to one another by Hogan (2014) and Carpenter et al. (2015). A common feature is the use of operator overloading rather than source transformation, two methods we discuss in the upcoming sections. Beyond that, these libraries exploit different techniques, which will be more or less suited to the problem at hand. A summary of computational techniques is provided at the end of this section (Table 6).
Source transformation
A classic approach to compute and execute AD is source transformation, as implemented, for example, in the Fortran package ADIFOR (Bischof et al., 1996) and the C++ packages TAC++ (Voßbeck et al., 2008) and Tapenade (Hascoet and Pascual, 2013). We start with the source code of a computer program that implements our target function. A preprocessor then applies differentiation rules to the code, as prescribed by elementary operators and the chain rule, and generates new source code, which calculates derivatives. The source code for the evaluation and the one for differentiation are then compiled and executed together.
A severe limitation of source transformation is that it can only use information available at compile time, which prevents it from handling more sophisticated programming statements, such as while loops, C++ templates, and other object-oriented features (Voßbeck et al., 2008; Hogan, 2014). There exist workarounds to make source transformation more applicable, but these end up being similar to what is done with operator overloading, albeit significantly less elegant (Bischof and Bücker, 2000).
Moreover, it is commonly accepted in the AD community that source transformation works well for Fortran and C. Hascoet and Pascual (2013) argue it is the approach of choice for reverse-mode AD, although they also admit that, from a developer's perspective, implementing a package that deploys source transformation demands a considerable amount of effort. Most importantly, source transformation cannot handle typical C++ features. For the latter, operator overloading is the appropriate technology.
Operator overloading
Operator overloading is implemented in many packages, including the C++ libraries Sacado (Gay, 2005), Adept (Hogan, 2014), Stan Math (Carpenter et al., 2015), and CoDiPack (Sagebaum et al., 2017). The key idea is to introduce a new class of objects, which contain the value of a variable on the expression graph and a differential component. Not all variables on the expression graph will belong to this class; but the root variables, which require sensitivities, and all the intermediate variables which depend (directly or indirectly) on these root variables, will. In a forward-mode framework, the differential component is the derivative with respect to one input; in a reverse-mode framework, it is the adjoint with respect to one output (provided we use, for forward and reverse mode respectively, the right initial tangent and cotangent vector). Operators and math functions are overloaded to handle these new types. We denote the standard and differentiable types real and var respectively.
Memory management
AD requires care when managing memory, particularly in our view when doing reversemode differentiation and implementing operator overloading.
In forward mode, one sweep through the expression graph computes both the numerical value of the variables and their differential component with respect to an initial tangent. The memory requirement for that sweep is then simply twice the requirement for a function evaluation. What is more, a var object, which corresponds to a node on the expression graph, needs only be stored in memory temporarily. Once all the nodes connected to that object have been evaluated, the var object can be discarded.
When doing a reverse mode sweep, matters are more complicated because the forward evaluation trace happens first, and only then does the reverse AD trace occur. To perform the reverse trace, we need to access (i) the expression graph and (ii) the numerical values of the intermediate variables on that expression graph, required to calculate derivatives. These need to be stored in a persistent memory arena or tape. Note further that neither (i) nor (ii) is known at compile time if the program we differentiate includes loops and conditional statements, and, in general, the memory requirement is dynamic.
Retaping
In certain cases, a tape can be applied to multiple AD sweeps. An intuitive example is the computation of a full Jacobian matrix, which requires several sweeps at the same point. Rather than reconstructing the expression graph and reevaluating the needed intermediate variables at each sweep, we can instead store the requisite information after the first sweep, and reuse it. Naturally, this strategy only works because the expression graph and the intermediate variables do not change from sweep to sweep.
If we evaluate derivatives at multiple points, the intermediate variables almost certainly need to be reevaluated and, potentially, stored in memory. On the other hand, the expression graph may not change. Hence a single tape for the expression graph can be employed to compute AD sweeps at multiple points. This scheme is termed retaping; implementations can be found in ADOL-C (Griewank et al., 1999; Walther and Griewank, 2012) and CppAD (Bell, 2012).
The conservation of the expression graph from point to point does not, however, hold in general. Our target function may contain conditional statements, which control the expression graph our program generates. In such cases, new conditions require new tapes. Retaping can still be used, provided we update the expression graph as soon as the control flow changes.
Checkpoints
Checkpointing reduces peak memory usage and, more generally, trades off run time and memory cost. Recall our target function f is a composite of, say, L simpler functions f^1, …, f^L. Combining Equations 2 and 8, the action of the transpose Jacobian matrix, J^T · v, computed by one reverse sweep of AD, verifies

J^T · v = J_1^T (J_2^T ( ⋯ (J_L^T v))), (9)

where v is the initial cotangent. We can split this product into a sequence of K components,

J^T · v = [J_1^T ⋯ J_{φ(1)}^T] [J_{φ(1)+1}^T ⋯ J_{φ(2)}^T] ⋯ [J_{φ(K−1)+1}^T ⋯ J_L^T] v, (10)

which can be sequentially evaluated by reverse AD sweeps, starting from the last bracket. The index function φ tells us where in the composite function the splits occur.
Correspondingly, the program that implements f can be split into sections, with a checkpoint between each section. We then have several options, the first one being the recompute-all approach. Under this scheme, the program begins with a forward evaluation sweep, but here is the catch: we do not record the expression graph and the intermediate values, as we would in previously discussed implementations of reverse-mode AD, until we reach the last checkpoint. That is, we only record from the last checkpoint onward. When we begin the reverse sweep, the size of the tape is relatively small, hence the reduction in peak memory cost. Of course, we can only perform the reverse sweep from the output to the last checkpoint. Once we reach the latter, we rerun the forward evaluation sweep from the beginning, and only record between the second-to-last and the last checkpoint, which gives us the requisite information for the next reverse sweep. And so on. We call this approach recompute-all because we rerun a partial forward evaluation trace, with partial recording, for each checkpoint (Figure 2). The reduction in peak memory thus comes at the cost of doing multiple forward evaluation traces.
It may be inefficient to rerun forward sweeps from the start every time. We can instead store the values of intermediate variables at every checkpoint, and then perform forward sweeps only between two checkpoints (Figure 3). Under this scheme, the forward evaluation trace is effectively run twice. This variant of checkpointing is termed store-all. Storing intermediate results at each checkpoint may, for certain problems, be too memory expensive. Fortunately, recompute-all and store-all constitute two ends of a spectrum: in-between methods can be used by selecting a subset of the checkpoints where the values are stored. Such checkpoints constitute snapshots, and give us more control over peak memory usage.
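The memory/run-time trade-off of recompute-all can be sketched on a toy problem (hypothetical code; a real implementation would checkpoint a full tape, not a scalar chain). The target is L repeated applications of sin, whose derivative a reverse sweep accumulates from the recorded intermediate values; the checkpointed version records only one segment at a time and reruns the forward sweep from the start for each segment.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy target: f(x) = sin applied L times; d f / d x is the product of
// cos(x_k) over the intermediate values x_k, accumulated in reverse.

// Plain reverse mode: record every intermediate value (peak memory ~ L).
double grad_full_tape(double x0, int L) {
    std::vector<double> xs(L);
    double x = x0;
    for (int k = 0; k < L; ++k) { xs[k] = x; x = std::sin(x); }
    double adj = 1.0;  // seed the output adjoint
    for (int k = L - 1; k >= 0; --k) adj *= std::cos(xs[k]);
    return adj;
}

// Recompute-all checkpointing: a checkpoint every C steps. For each
// segment, rerun the forward sweep from the very beginning, recording
// only that segment (peak memory ~ C, at the cost of extra forward sweeps).
double grad_recompute_all(double x0, int L, int C) {
    double adj = 1.0;
    for (int seg_end = L; seg_end > 0; seg_end -= C) {
        int seg_start = seg_end > C ? seg_end - C : 0;
        std::vector<double> segment;  // only this segment is taped
        double x = x0;
        for (int k = 0; k < seg_end; ++k) {
            if (k >= seg_start) segment.push_back(x);
            x = std::sin(x);
        }
        // Reverse sweep over the recorded segment.
        for (int k = static_cast<int>(segment.size()) - 1; k >= 0; --k)
            adj *= std::cos(segment[k]);
    }
    return adj;
}
```

Both routines perform the same multiplications in the same order, so they agree exactly; the checkpointed version simply never holds more than C intermediate values at once.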
A natural question is: where should we place the checkpoints? Ideally, we would like, given a memory constraint, to minimize run time. There is, in general, as argued by Hascoet and Pascual (2013), no optimal placement, and users should be given the freedom to place checkpoints by hand. In the case where we can split the program for f into sequential computational steps with comparable computational costs, placing the checkpoints according to a binomial partitioning scheme (Griewank, 1992) produces an optimal placement (Grimm et al., 1996).
Several packages implement checkpointing: for example, ADOL-C and CppAD couple the method with operator overloading, while Tapenade uses it with source transformation.
We conclude this section by drawing a connection between checkpointing and parallelizing AD. Suppose a section of f can be split into segments we can evaluate in parallel, each segment being evaluated on a separate computer core. Conceptually, we can place checkpoints around these segments, and create a separate tape for each of them. As a result, memory is kept local and does not need to be shared between cores. An example of such a procedure is the map-reduce functionality in Stan Math, described in the Stan book (Stan Development Team, 2018). Note this routine is not explicitly described as a checkpointing method.
Region-based memory
If we allocate and free memory one intermediate variable at a time, the overhead cost becomes quite severe. A better alternative is to do region-based memory management (see, for example, Gay and Aiken (2001)). Memory for the calculation of derivatives is allocated elementwise on a custom stack. Once the derivatives have been evaluated, the entire stack is destroyed and the associated memory freed. This spares the program the overhead of freeing memory one element at a time. Gay (2005) implements this technique in the C++ package RAD (Reverse AD, the reverse-mode version of Sacado), and benchmarks its performance against the C/C++ package ADOL-C. The results show RAD runs up to twice as fast when differentiating quality mesh functions with varying levels of complexity.
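The idea can be sketched with a toy arena allocator (hypothetical code; production arenas additionally handle alignment and oversized requests): nodes are allocated by bumping a pointer into large blocks, and the whole region is released in one shot after the reverse sweep, instead of node by node.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A toy region-based arena for tape nodes. Allocation is a pointer bump;
// deallocation frees every block at once after the sweep.
class Arena {
    std::vector<char*> blocks_;
    std::size_t block_size_, used_ = 0;
public:
    explicit Arena(std::size_t block_size = 1 << 16) : block_size_(block_size) {}
    // Assumes n <= block_size_; a real arena would also align the pointer.
    void* alloc(std::size_t n) {
        if (blocks_.empty() || used_ + n > block_size_) {
            blocks_.push_back(new char[block_size_]);  // grab a fresh block
            used_ = 0;
        }
        void* p = blocks_.back() + used_;
        used_ += n;
        return p;
    }
    void free_all() {  // one cheap deallocation for every node
        for (char* b : blocks_) delete[] b;
        blocks_.clear();
        used_ = 0;
    }
    std::size_t num_blocks() const { return blocks_.size(); }
    ~Arena() { free_all(); }
};
```

A reverse-mode library would place all its tape nodes in such an arena during the forward trace and call free_all() once the adjoints have been read off.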
Expression templates
Rather than overloading an operator such that it returns a custom object, we can code the operator to return an expression template (Veldhuizen, 1995). This technique, which builds on lazy evaluation, is widely used in packages for vector and matrix operations and was first applied to AD in the Fad package (Aubert et al., 2001). Phipps and Pawlowski (2012) recommend several techniques to improve expression templates and showcase their implementation in Sacado. Hogan (2014) and, more recently, Sagebaum et al. (2017) apply the method to do full reverse-mode AD, in Adept and CoDiPack respectively.
Under the expression template scheme, operators return a specialized object that characterizes an intermediate variable as a node in the expression graph.
As an example, consider the following programming statement:
var x;
real a;
var b = cos(x);
var c = b * a;
Take the multiply or * operator, which is a map (a, b) ↦ a × b. Rather than returning a var object, we overload * to return a template object of type multiply<T_a, T_b>, where T_a and T_b designate the types of a and b. T_a and T_b can be real, var, or expression templates. In the above example, x is a var and a is a real. Then c is assigned an object of type multiply<cos<var>, real>, rather than an evaluated var object (Table 5).
The nodes in the expression graph hence contain references. An optimizing compiler can then combine the nodes to generate more efficient code, which would be functionally equivalent to:

c.val_ = cos(x).val_ * a;
x.adj_ = c.adj_ * a * (-sin(x).val_);

where val_ and adj_ respectively denote the value and the adjoint stored in the var object. This operation is called partial evaluation. Notice that b does not appear in the above programming statement. Moreover, the use of expression templates removes temporary var objects, which saves memory and eliminates the overhead associated with constructing intermediate variables. Furthermore, the code can eliminate redundancies, such as the addition and later subtraction of one variable.
Table 5.
variable | type
x | var
a | real
b | cos<var>
c | multiply<cos<var>, real>

Algorithm: b = cos(x); c = b * a;
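The mechanics can be illustrated with a stripped-down, hypothetical expression-template sketch (value and tangent only; production libraries such as Adept or Sacado are far more elaborate). Operators return lightweight node types holding their operands, so the type of c below is MulExpr<CosExpr<Var>, Const>, echoing multiply<cos<var>, real>, and nothing is evaluated until val() or dot() is called:

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// A leaf variable: a value and a seeded tangent (forward-mode flavor).
struct Var {
    double v, t;
    double val() const { return v; }
    double dot() const { return t; }
};

// A constant (the "real" type of the text): zero tangent.
struct Const {
    double c;
    double val() const { return c; }
    double dot() const { return 0.0; }
};

// Expression nodes hold their operands and evaluate lazily on demand.
template <typename T>
struct CosExpr {
    T a;
    double val() const { return std::cos(a.val()); }
    double dot() const { return -std::sin(a.val()) * a.dot(); }
};

template <typename A, typename B>
struct MulExpr {
    A a; B b;
    double val() const { return a.val() * b.val(); }
    double dot() const { return a.dot() * b.val() + a.val() * b.dot(); }
};

template <typename T>
CosExpr<T> cos(T a) { return {a}; }

// Restricted to our node types so built-in arithmetic is not hijacked.
template <typename A, typename B,
          typename = typename std::enable_if<!std::is_arithmetic<A>::value &&
                                             !std::is_arithmetic<B>::value>::type>
MulExpr<A, B> operator*(A a, B b) { return {a, b}; }
```

Because the whole expression is encoded in the type, an optimizing compiler can inline val() and dot() into a handful of fused operations, with no temporary objects.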
Hogan (2014) evaluates the benefits of coupling expression templates and AD by benchmarking Adept against other AD packages that use operator overloading, namely ADOL-C, CppAD, and Sacado. His computer experiment focuses on an application in fluid dynamics, using a simple linear scheme and a more complex nonlinear method. In these specific instances, Adept outperforms the other packages, both for forward- and reverse-mode AD: it is 2.6 to 9 times faster and 1.3 to 7.7 times more memory efficient, depending on the problem and the benchmarked package at hand.
Carpenter et al. (2015) corroborate the advantage of using expression templates by testing two implementations of a function that returns the log sum of exponents. The first implementation looks as follows:

for (int i = 0; i < x.size(); i++)
  total = log(exp(total) + exp(x(i)));
return total;
This code is inefficient because the log function gets called multiple times, when it could be called only once, as below:

for (int i = 0; i < x.size(); i++)
  total += exp(x(i));
return log(total);
Stan Math requires 40% less time to evaluate and differentiate the function when we use the second implementation instead of the first. Adept, on the other hand, achieves optimal speed with both implementations; when applied to the second algorithm, both packages perform similarly. Hence, while careful writing of a function can match the results obtained with expression templates, this requires more coding effort from the user.
Table 6.
Method | Benefits | Limitations | Packages
Source transformation | Can be optimized by hand or with a compiler. | Can only differentiate functions which are fully defined at compile time. | ADIFOR, TAC++, Tapenade
Operator overloading | Can handle most programming statements. | Memory handling requires some care. | All the packages listed below.
Retaping | Fast when we differentiate through the same expression graph multiple times. | Applicable only if the expression graph is conserved from point to point. When the graph changes, a new tape must be created. | ADOL-C, CppAD
Checkpointing | Reduces peak memory usage. | Longer run time. An optimal placement is binomial partitioning, but this scheme does not always apply. | ADOL-C, CppAD (Tapenade for source transformation)
Region-based memory | Improves speed and memory usage. | Makes the implementation less transparent. Some care is required for nested AD. | Adept, Sacado, Stan Math
Expression templates | Removes certain redundant operations, whereby the code becomes faster and more memory efficient, and allows the user to focus on code clarity. | | Adept, CoDiPack, Sacado (with some work)
Mathematical implementation
We now study several mathematical considerations users can make to optimize the differentiation of a complex function. The techniques discussed here are generally agnostic to which AD package is used; they may, however, already be implemented as built-in features in certain libraries. We exemplify their use and test their performance using Stan Math and reverse-mode AD (similar results are expected with forward mode). The driving idea in this section is to gain efficiency by reducing the size of the expression graph generated by the code.
Reducing the number of operations in a function
A straightforward way of optimizing code is to minimize the number of operations in a function. This can be done by storing the result of calculations that occur multiple times inside intermediate variables. When evaluating a function, introducing intermediate variables reduces run time but increases memory usage. With AD, eliminating redundant operations also reduces the generated expression graph, which means intermediate variables improve both run time and memory usage.
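As a toy illustration (hypothetical code, with an operation counter standing in for the number of nodes an AD tape would record), consider f(x) = exp(x) sin(x) + exp(x) cos(x), where exp(x) appears twice:

```cpp
#include <cassert>
#include <cmath>

// Count elementary operations, as a stand-in for expression graph size.
static int ops = 0;
static double track(double v) { ++ops; return v; }

// Naive version: exp(x) is evaluated twice.
double f_naive(double x) {
    double a = track(std::exp(x));
    double b = track(std::sin(x));
    double c = track(std::cos(x));
    double d = track(std::exp(x));  // redundant recomputation
    return track(track(a * b) + track(d * c));
}

// Cached version: exp(x) is computed once and stored.
double f_cached(double x) {
    double e = track(std::exp(x));  // intermediate variable, reused below
    double b = track(std::sin(x));
    double c = track(std::cos(x));
    return track(track(e * b) + track(e * c));
}
```

Both versions return the same value, but the cached one performs one fewer elementary operation; under AD, that is one fewer node in the expression graph, saving both the forward evaluation and the corresponding derivative propagation.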
Consider, for example, the implementation of a function that returns the exponential of a 2 × 2 matrix. The matrix exponential extends the concept of a scalar exponential to matrices and can be used to solve linear systems of ordinary differential equations (ODEs). There exist several methods to approximate matrix exponentials, the most popular one being the Padé approximation, coupled with scaling and squaring; for a review, see Moler and Van Loan (2003). For a 2 × 2 matrix A with elements a, b, c, and d, the matrix exponential can be worked out analytically (see Rowland and Weisstein). Requiring Δ to be real, we have:

exp(A) = (e^{(a+d)/2} / Δ) ×
  [ Δ cosh(Δ/2) + (a − d) sinh(Δ/2)        2b sinh(Δ/2)
    2c sinh(Δ/2)        Δ cosh(Δ/2) − (a − d) sinh(Δ/2) ], (11)

where

Δ = √((a − d)² + 4bc).

The above expression only contains elementary operators, and can be differentiated by any standard AD library. A "direct" implementation in C++ might simply translate the above mathematical equations and look as follows:
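The original listing being unavailable, a sketch of such a direct translation of Equation 11 might look as follows (plain double arithmetic and std::array in place of Eigen's templated matrix type; the function name is hypothetical). Note how each entry re-evaluates Δ and the hyperbolic terms:

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Matrix2 = std::array<std::array<double, 2>, 2>;

// Direct translation of Equation 11: every entry recomputes
// delta = sqrt((a-d)^2 + 4bc) and the sinh/cosh terms from scratch.
Matrix2 matrix_exp_direct(const Matrix2& A) {
    double a = A[0][0], b = A[0][1], c = A[1][0], d = A[1][1];
    Matrix2 B;
    B[0][0] = std::exp(0.5 * (a + d))
        * (std::sqrt((a - d) * (a - d) + 4 * b * c)
               * std::cosh(0.5 * std::sqrt((a - d) * (a - d) + 4 * b * c))
           + (a - d) * std::sinh(0.5 * std::sqrt((a - d) * (a - d) + 4 * b * c)))
        / std::sqrt((a - d) * (a - d) + 4 * b * c);
    B[0][1] = 2 * b * std::exp(0.5 * (a + d))
        * std::sinh(0.5 * std::sqrt((a - d) * (a - d) + 4 * b * c))
        / std::sqrt((a - d) * (a - d) + 4 * b * c);
    B[1][0] = 2 * c * std::exp(0.5 * (a + d))
        * std::sinh(0.5 * std::sqrt((a - d) * (a - d) + 4 * b * c))
        / std::sqrt((a - d) * (a - d) + 4 * b * c);
    B[1][1] = std::exp(0.5 * (a + d))
        * (std::sqrt((a - d) * (a - d) + 4 * b * c)
               * std::cosh(0.5 * std::sqrt((a - d) * (a - d) + 4 * b * c))
           - (a - d) * std::sinh(0.5 * std::sqrt((a - d) * (a - d) + 4 * b * c)))
        / std::sqrt((a - d) * (a - d) + 4 * b * c);
    return B;
}
```

For A = [[0, 1], [1, 0]] this correctly yields exp(A) = [[cosh 1, sinh 1], [sinh 1, cosh 1]].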
Note we use the matrix library Eigen (Guennebaud et al., 2010); T represents a template type which, as needed, will either be a real or a var. An optimized implementation removes redundant operations and introduces variables to store intermediate results:
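An optimized variant might be sketched as follows (again with plain double arithmetic and std::array in place of Eigen's templated types; variable names are hypothetical): Δ, the exponential prefactor, and the hyperbolic terms are computed once and reused.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Matrix2 = std::array<std::array<double, 2>, 2>;

// Optimized 2x2 matrix exponential: the repeated subexpressions of
// Equation 11 are stored in intermediate variables, shrinking both the
// run time and the expression graph AD would generate.
Matrix2 matrix_exp_optimized(const Matrix2& A) {
    double a = A[0][0], b = A[0][1], c = A[1][0], d = A[1][1];
    double delta = std::sqrt((a - d) * (a - d) + 4 * b * c);
    double pref = std::exp(0.5 * (a + d)) / delta;   // e^{(a+d)/2} / delta
    double sinh_term = std::sinh(0.5 * delta);
    double cosh_term = delta * std::cosh(0.5 * delta);
    Matrix2 B;
    B[0][0] = pref * (cosh_term + (a - d) * sinh_term);
    B[0][1] = pref * 2 * b * sinh_term;
    B[1][0] = pref * 2 * c * sinh_term;
    B[1][1] = pref * (cosh_term - (a - d) * sinh_term);
    return B;
}
```

The mathematical output is unchanged; only the number of elementary operations, and hence the number of nodes AD must sweep through, is reduced.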
The two codes output the same mathematical expression, but the second implementation is about a third faster, as a result of producing an expression graph with fewer nodes (Table 7). This is measured by applying both implementations to 1000 randomly generated $2 \times 2$ matrices. We compute the run time by looking at the wall time required to evaluate the matrix exponential and differentiate it with respect to every element of the input matrix (thereby producing 16 partial derivatives). The experiment is repeated 100 times to quantify the variation in the performance results.
Unfortunately the optimized approach requires more coding effort, produces more lines of code, and ultimately hurts clarity, as exemplified by the introduction of a variable called ad_sinh_half_delta. Whether the trade-off is warranted or not depends on the user’s preferences. Note expression templates do not perform the optimization discussed here, as a compiler may fail to identify expressions which get computed multiple times inside a function.
Implementation | Number of nodes | Run time (ms) | Relative run time
Standard | 41 | | 1.0
Optimized | 25 | | 0.68
Optimizing with super nodes
There exist more sophisticated approaches to reduce the size of an expression graph. This is particularly useful, and for practical purposes even necessary, when differentiating iterative algorithms. Our target function may embed an implicit function, $u$, such as the solution to an equation we solve numerically. Numerical solvers often use iterative algorithms to find such a solution. The steps of these algorithms involve simple operations, hence it is possible to split $u$ into elementary functions, here indexed $1, \ldots, K$:

$$u = u_K \circ u_{K-1} \circ \cdots \circ u_1$$
Note $K$, the number of elementary functions we split $u$ into, depends on the number of steps the numerical solver takes, and hence varies from point to point. Using the above decomposition we can easily deploy our usual arsenal of tools to do forward- or reverse-mode AD. But doing so produces large expression graphs, which can lead to floating point precision errors, excessive memory usage, and slow computation. To mitigate these problems, we can elect not to split $u$, and use a custom method to compute $u$'s contribution to the action of the Jacobian matrix. A more algorithmic way to think about this is to say we do not treat $u$ as a function generating a graph, but as a standalone operator which generates a single node, or super node. A super node uses a specialized method to compute Jacobian matrices and often produces a relatively small expression graph.
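To make the contrast concrete, here is a hypothetical toy example, using a minimal dual-number type in place of a real AD library: a square root computed with Newton's iteration, differentiated either by splitting the iterations into elementary operations or by treating the solver as a super node with the analytic derivative attached:

```cpp
#include <cmath>

// Minimal forward-mode dual number (hypothetical stand-in for an AD type).
struct Dual { double val, dot; };
Dual operator+(Dual x, Dual y) { return {x.val + y.val, x.dot + y.dot}; }
Dual operator/(Dual x, Dual y) {
  return {x.val / y.val, (x.dot * y.val - x.val * y.dot) / (y.val * y.val)};
}

// Brute force: run Newton's iteration x <- (x + a/x) / 2 on the AD type,
// so every iteration contributes nodes to the expression graph.
Dual sqrt_split(Dual a) {
  Dual x = a;
  for (int i = 0; i < 60; ++i) x = (x + a / x) / Dual{2.0, 0.0};
  return x;
}

// Super node: iterate on plain doubles, then attach the analytic derivative
// d sqrt(a) / da = 1 / (2 sqrt(a)); the solver contributes a single node.
Dual sqrt_supernode(Dual a) {
  double x = a.val;
  for (int i = 0; i < 60; ++i) x = 0.5 * (x + a.val / x);
  return {x, a.dot / (2.0 * x)};
}
```

In the super node version the iterations run on plain doubles, so the solver's footprint in the graph is constant regardless of how many steps convergence takes.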
As a case study, we present several implementations of a numerical algebraic solver, and run performance tests to measure run time. Ease of implementation is also discussed. In particular, we examine a well-known strategy which consists of using the implicit function theorem. For a more formal discussion of the subject, we recommend Bell and Burke (2008) and chapter 15 of Griewank and Walther (2008).
Another example where super nodes yield a large gain in performance is the numerical solving of ODEs, where the iterative algorithm is used to simultaneously evaluate a solution and compute sensitivities. Section 13 of Carpenter et al. (2015) provides a nice discussion of this approach. Finally, Giles (2008) summarizes several results to efficiently compute matrix derivatives, including matrix inverses, determinants, matrix exponentials, and Cholesky decompositions.
Differentiating a numerical algebraic solver
We can use a numerical algebraic solver to find the root of a function and solve nonlinear algebraic equations. Formally, we wish to find $\theta$ such that:

$$f(\theta, \eta) = 0$$

where $\eta$ is a fixed vector of auxiliary parameters, or more generally a fixed vector of variables for which we require sensitivities. That is, we wish to compute the Jacobian matrix, $J$, with $(i, j)^{\text{th}}$ entry:

$$J_{ij} = \frac{\partial \theta_i}{\partial \eta_j}$$
Consider, for demonstration purposes, an implementation of Newton’s method. For convenience, we write $\partial f / \partial \theta$ for the Jacobian matrix of $f$ with respect to $\theta$.

1. Start with an initial guess $\theta^{(0)}$.

2. While a convergence criterion has not been met, take a Newton step:
$$\theta^{(k+1)} = \theta^{(k)} - \left[ \frac{\partial f}{\partial \theta}\left(\theta^{(k)}, \eta\right) \right]^{-1} f\left(\theta^{(k)}, \eta\right)$$

3. Return $\theta^{(K)}$, where $K$ is the number of steps required to reach convergence, or a warning message stating convergence could not be reached.
In the case where an analytical expression for $\partial f / \partial \theta$ is provided, we can readily apply regular AD to differentiate the algorithm. If we are working with overloaded operators, this simply means templating the solver so that it can handle var types. Unfortunately, if the number of steps required to reach convergence is high, the algorithm produces a large expression graph. The problem is exacerbated when $\partial f / \partial \theta$ is not computed analytically. Then AD must calculate the higher-order derivative:

$$\frac{\partial}{\partial (\theta, \eta)} \left[ \frac{\partial f}{\partial \theta} \right]$$
Coding a method that returns such a derivative requires implementing nested AD and carefully combining the reverse- and forward-mode chain rules. Tackling this problem with brute-force AD hence demands a great coding effort and is likely to be prohibitively slow.
An alternative approach is to use the implicit function theorem, which tells us that, under some regularity conditions in the neighborhood of the solution:

$$J = \frac{\partial \theta}{\partial \eta} = - \left[ \frac{\partial f}{\partial \theta}(\theta, \eta) \right]^{-1} \frac{\partial f}{\partial \eta}(\theta, \eta) \qquad (12)$$

where, consistent with our previous notation, $\partial f / \partial \eta$ is the Jacobian matrix of $f$ with respect to $\eta$. We have further made the dependence on $\theta$ and $\eta$ explicit. Note the right-hand side is evaluated at the solution.
This does not immediately solve the problem at hand, but it significantly simplifies it. Computing $\partial f / \partial \theta$ and $\partial f / \partial \eta$ is, in practice, straightforward. In simple cases, these partials can be worked out analytically, but more generally, we can use AD. Many algorithms, such as Newton’s method and Powell’s dogleg solver (Powell, 1970), already compute the matrix $\partial f / \partial \theta$ to find the root, a result that can then be reused. Note furthermore that with Equation 12, calculating $J$ reduces to a first-order differentiation problem.
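As a minimal illustration of Equation 12, consider a hypothetical scalar example with $f(\theta, \eta) = \theta^3 + \theta - \eta$:

```cpp
#include <cmath>

// Solve f(theta, eta) = 0 with Newton's method, then obtain the sensitivity
// d theta / d eta from the implicit function theorem instead of
// differentiating the iterations themselves.
double f(double theta, double eta) { return theta * theta * theta + theta - eta; }
double df_dtheta(double theta) { return 3 * theta * theta + 1; }
double df_deta() { return -1.0; }

double solve(double eta, double guess = 1.0) {
  double theta = guess;
  while (std::fabs(f(theta, eta)) > 1e-12)
    theta -= f(theta, eta) / df_dtheta(theta);  // Newton step
  return theta;
}

// Equation 12 in scalar form: d theta / d eta = -(df/dtheta)^{-1} df/deta,
// evaluated at the solution.
double sensitivity(double theta) { return -df_deta() / df_dtheta(theta); }
```

Here solve(10.0) converges to $\theta = 2$, and the sensitivity $1/13$ is obtained with a single first-order computation, independent of the number of Newton steps taken.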
Ideally, super nodes are already implemented in the package we plan to use. If this is not the case, the user can attempt to create a new function with a custom differentiation method. The amount of effort required to do so depends on the AD library at hand and the user’s coding expertise; the paper further discusses this point in its section on Extensibility. For now, our advice is to build new operators when brute force AD is prohibitively slow or when developing general purpose tools.
To better inform the use of super nodes, we compare the performance of two dogleg solvers. The first one uses a templated version of the algorithm and applies regular AD. The second one uses Equation 12. For simplicity, we provide $\partial f / \partial \theta$ in analytical form; for the second solver, $\partial f / \partial \eta$ is computed with AD. The solvers are applied to an archetypical problem in pharmacometrics, namely the modeling of steady states for patients undergoing a medical treatment. The details of the motivating problem are described in the appendix. The number of states in the algebraic equation varies from 2 to 28, and the number of parameters which require sensitivities respectively from 4 to 30. We use Stan Math’s built-in algebraic solver as a benchmark, a dogleg algorithm which also uses Equation 12 but computes $\partial f / \partial \theta$ with AD.
The experiment shows using the implicit function theorem is orders of magnitude faster than relying on standard AD (Figure 4 and Table 8). This also holds for the built-in algebraic solver, which automatically differentiates $f$ with respect to $\theta$. Access to $\partial f / \partial \theta$ in its analytical form yields a minor gain in performance, which becomes more obvious when we increase the number of states in the algebraic equation (Figure 5).
Table 8: run times of the standard solver, the super node solver with analytical $\partial f / \partial \theta$, and the built-in solver, for 4, 12, 20, and 28 states. The uncertainty is computed by running the experiment 100 times and computing the sample standard deviation.
Open and practical problems
Many AD packages are still being actively developed; hence we expect some of the issues we present to be addressed in the years to come.
General and specialized code
The first problem we note is that no single package comprehensively implements all the optimization techniques outlined in this paper, perhaps for several good reasons. First of all, two methods may not be compatible with one another. In most cases, however, there is simply a trade-off between development effort and impact. This is a tricky question for strategies that are not generalizable but very effective at solving a specific class of problems. For example, retaping is not helpful when the expression graph changes from point to point and we need to evaluate the gradient at several points, but it is well suited for computing a single Jacobian matrix.
A similar argument applies to specialized math functions. For some problems, such as solving ODEs and algebraic equations, we can take advantage of custom differentiation methods, which are agnostic to which solver gets used. This does not hold in general. Counterexamples arise when we solve partial differential equations or work out analytical solutions to fieldspecific problems. The coding burden then becomes more severe, and developers must work out which routines are worth implementing.
Extensibility
Making a package extensible aims to reduce the burden on the developers and give users the coding flexibility to create new features themselves. Open-source projects greatly facilitate this process. As an example, Torsten (a package still under development; see https://github.com/metrumresearchgroup/examplemodels) extends Stan Math for applications in pharmacometrics (Margossian and Gillespie, 2016). In our experience, however, extending a library requires a great deal of time, coding expertise, and detailed knowledge of a specific package. The problem is worse when the AD package works as a black box.
One solution is to allow users to specify custom differentiation methods when declaring a function. Naturally, users will have to be cautioned against spending excessive time coding derivatives for minor performance gains. PyTorch provides such a feature in the high-level language Python, namely for reverse-mode AD (see https://pytorch.org/docs/stable/notes/extending.html). CasADi (Andresson et al., 2018) also offers the option to write custom derivative functions in all its interfaces (including Python and Matlab), though this feature is described as “experimental” and still requires a technical understanding of the package and its underlying structure in C++ (see http://casadi.sourceforge.net/api/html/). Another solution, which is well underway, is to educate users as to the inner workings of AD packages. This is in large part achieved by papers on specific packages, review articles such as this one, and well documented code.
Higherorder differentiation
Algorithms that require higher-order differentiation are less common, but there are a few noteworthy examples. Riemannian Hamiltonian Monte Carlo (RHMC) is a Markov chain Monte Carlo sampler that requires the second- and third-order derivatives of the log posterior distribution, and can overcome severe geometrical pathologies first-order sampling methods cannot (Girolami et al., 2013; Betancourt, 2013). A more classical example is Newton’s method, which requires a Hessian vector product when used for optimization. Much work has been done to make the computation of higher-order derivatives efficient. In general, this task remains significantly more difficult than computing first-order derivatives and can be prohibitively expensive, which is why we label it a “practical problem”. Practitioners are indeed often reluctant to deploy computational tools that require higher-order derivatives. One manifestation of this issue is that RHMC is virtually never used, whereas its first-order counterpart has become a popular tool for statistical modeling.

As previously touched upon, most packages compute second-order derivatives by applying AD twice: first to the target function, thereby producing first-order derivatives; then by sorting the operations which produced these first-order derivatives into a new expression graph, and applying AD again. Oftentimes, as when computing a Newton step, we are not interested in the entire Hessian matrix, but rather in a Hessian vector product.
For simplicity, consider a scalar function $f: \mathbb{R}^n \to \mathbb{R}$. To compute a Hessian vector product, one can first apply a forward sweep followed by a reverse sweep (Pearlmutter, 1994; Griewank and Walther, 2008). For second-order derivatives, AD produces an object of the general form

$$\alpha \nabla f + \beta H v$$

where $\nabla f$ designates the gradient vector and $H$ the Hessian matrix, and where $\alpha$, $\beta$, and $v$ are initialization vectors (possibly of length 1). The Hessian vector product $Hv$ is then obtained by setting $\alpha = 0$, $\beta = 1$, and $v$ to the vector of interest, an efficient operation whose complexity is linear in the complexity of $f$. The full Hessian matrix is computed by repeating this operation $n$ times and using basis vectors for $v$. As when computing Jacobian matrices, the same expression graph can be stored in a tape and reused multiple times. It is also possible to exploit the structural properties of the Hessian, as is for example done by Gebremedhin et al. (2009) with the package ADOL-C.
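The forward-then-reverse scheme can be imitated with a hypothetical toy example: apply forward-mode dual numbers to gradient code (standing in here for the output of a reverse sweep) to obtain $Hv$ in a single pass. The function and types below are illustrative assumptions, not a real AD library:

```cpp
#include <array>
#include <cmath>

// Minimal dual number standing in for forward-mode AD.
struct Dual { double val, dot; };
Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.dot + b.dot}; }
Dual operator*(Dual a, Dual b) { return {a.val * b.val, a.dot * b.val + a.val * b.dot}; }
Dual operator*(double c, Dual a) { return {c * a.val, c * a.dot}; }

// Gradient of the toy function f(x, y) = x^2 y + y^3, written out by hand
// to stand in for the result of a reverse sweep.
std::array<Dual, 2> grad_f(Dual x, Dual y) {
  std::array<Dual, 2> g = {{2.0 * (x * y), x * x + 3.0 * (y * y)}};
  return g;
}

// Hessian-vector product H v = d/d eps [grad f(x + eps v)]: seed the dual
// directions with v, i.e. run a forward sweep through the gradient code.
std::array<double, 2> hvp(double x, double y, double vx, double vy) {
  std::array<Dual, 2> g = grad_f({x, vx}, {y, vy});
  return {{g[0].dot, g[1].dot}};
}
```

The cost of one such product is a small constant multiple of the cost of evaluating the gradient, which is the linear-complexity property mentioned above.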
Consider now the more general case, where we wish to compute higher-order derivatives of a function $f: \mathbb{R}^n \to \mathbb{R}^m$. We can extend the above procedure: apply AD, sort the AD operations into a new expression graph, and repeat until we get the desired differentiation order.
Whether or not the above approach is optimal is an open question, most recently posed by Betancourt (2017, 2018), who argues recursively applying AD leads to inefficient and sometimes numerically unstable code. There exist two alternatives to compute higher-order derivatives. We only briefly mention them here and invite the interested reader to consult the original references. The first alternative uses univariate Taylor series and begins with the observation that the $k$th coefficient of the series corresponds, up to a factorial constant, to the $k$th derivative of our target function; more is discussed in Bischof et al. (1993); Griewank et al. (2000). Particular attention is given to the computation of Hessian matrices and ways to exploit their structure.
For the second alternative, Betancourt (2018) proposes a theoretical framework for higher-order differential operators and derives these operators explicitly for the second- and third-order cases. In some sense, this approach imitates the scheme used for first-order derivatives: create a library of mathematical operations for which higher-order partial derivatives are analytically worked out, and compute sensitivities using “higher-order chain rules”. Doing so is a tricky but doable task that requires a good handle on differential geometry. The strategy described here is prototyped in the C++ package Nomad (Betancourt, 2017), which serves as a proof of concept. At this point, however, Nomad is not optimized to compute first-order derivatives and more generally needs to be further developed in order to be extensively used (personal communication with Michael Betancourt).
Conclusion
In recent years, AD has been successfully applied to several areas of computational statistics and machine learning. One notable example is Hamiltonian Monte Carlo sampling, in particular its adaptive variant the No-U-Turn sampler (NUTS) (Hoffman and Gelman, 2014). NUTS, supported by AD, indeed provides the algorithmic foundation for the probabilistic languages Stan (Carpenter et al., 2017) and PyMC3 (Salvatier et al., 2016). AD also plays a critical part in automatic differentiation variational inference (Kucukelbir et al., 2016), a gradient-based method to approximate Bayesian posteriors. Another major application of AD is neural networks, as done in TensorFlow (Abadi et al., 2016).
We expect that in most cases AD acts as a black box supporting a statistical or modeling software; nevertheless, a basic understanding of AD can help users optimize their code, notably by exploiting some of the mathematical techniques we discussed. Advanced users can edit the source code of a package, for example to incorporate an expression-template routine or write a custom derivative method for a function.
Sometimes, a user must pick one of the many tools available to perform AD. A first selection criterion may be the programming language in which a tool is available; this constraint is however somewhat relaxed by packages that link higher-level to lower-level languages, such as Rcpp, which allows coders to use C++ inside R (Eddelbuettel and Balamuta, 2017). Past this, our recommendation is to first look at the applications which motivated the development of a package and see if these align with the user’s goals. Stan Math, for instance, is developed to compute the gradient of a log-posterior distribution, specified by a probabilistic model. To do so, it offers an optimized and flexible implementation of reverse-mode AD, and an extensive library of mathematical functions to do linear algebra, numerical calculations, and compute probability densities. On the other hand, its forward mode is not as optimized and well tested as its reverse-mode AD (personal communication with Bob Carpenter). More generally, we can identify computational techniques compatible with an application of interest (see Table 6), and consider packages that implement these techniques. Another key criterion is the library of functions a package provides: using built-in functions saves time and effort, and often guarantees a certain degree of optimality. For more esoteric problems, extensibility may be of the essence. Other considerations, which we have not discussed here, include a package’s capacity to handle more advanced computational features, such as parallelization and GPUs.

Acknowledgments
I thank Michael Betancourt, Bob Carpenter, and Andrew Gelman for helpful comments and discussions.
Appendix: computer experiments
This appendix complements the section on Mathematical implementations and describes the presented performance tests in more detail. The two computer experiments are conducted using the following systems:

- Hardware: Macbook Pro computer (Retina, early 2015) with a 2.7 GHz Intel Core i5 and 8 GB of 1867 MHz DDR3 memory
- Compiler: clang++ version 4.2.1
- Libraries: Eigen 3.3.3, Boost 1.66.0, Stan Math 2.17 (develop; versioned branch test/autodiff_review)
The compiler and library specifications are those used when running unit tests in Stan Math. We measure runtime by looking at the wall clock before and after evaluating a target function and calculating all the sensitivities of interest. The wall time is computed using the C++ function std::chrono::system_clock::now(). The code is available on GitHub.
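A sketch of this measurement scheme (the loop below is a dummy workload standing in for evaluating a target function and its sensitivities):

```cpp
#include <chrono>
#include <cmath>

// Read the wall clock before and after the work, as described above,
// and return the elapsed time in milliseconds.
double time_workload_ms() {
  auto start = std::chrono::system_clock::now();
  double s = 0.0;
  for (int i = 0; i < 100000; ++i) s += std::sin(i);  // dummy workload
  auto end = std::chrono::system_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```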
Differentiating the Matrix Exponential
We obtain $2 \times 2$ matrices by generating four elements from a uniform distribution $U(1, 10)$, that is a uniform with lower bound 1 and upper bound 10. This conservative approach ensures the term $\Delta$ in Equation 11 is real (i.e. has no imaginary part). We generate the random numbers using std::uniform_real_distribution<T> unif(1, 10).

Differentiating a numerical algebraic solver
To test the performance of an algebraic solver, we consider a standard problem in pharmacometrics, namely the computation of steady states for a patient undergoing a medical treatment. We only provide an overview of the scientific problem, and invite the interested reader to consult Margossian (2018) or any standard textbook on pharmacokinetics.
In our example, a patient orally receives a drug which diffuses in their body and gets cleared over time. In particular, we consider the one compartment model with a first-order absorption from the gut. The following ODE system then describes the drug diffusion process:

$$\frac{\mathrm{d} m_{\mathrm{gut}}}{\mathrm{d} t} = -k_1 m_{\mathrm{gut}}, \qquad \frac{\mathrm{d} m_{\mathrm{cent}}}{\mathrm{d} t} = k_1 m_{\mathrm{gut}} - k_2 m_{\mathrm{cent}}$$

where

- $m_{\mathrm{gut}}$ is the drug mass in the gut,
- $m_{\mathrm{cent}}$ the drug mass in a central compartment (oftentimes organs and circulatory systems, such as the blood, into which the drug diffuses rapidly),
- $k_1$ and $k_2$ represent diffusion rates,
- and $t$ is time.
This system can be solved analytically. Given an initial condition $y_0$ and a time interval $\delta t$, we can then define an evolution operator $g(y_0, \delta t)$ which evolves the state of a patient, as prescribed by the above ODEs.
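A hypothetical sketch of such an evolution operator, using the analytic solution of the linear system (names and parameterization are illustrative, and the two rate constants are assumed to differ):

```cpp
#include <cmath>

// State of one patient: drug mass in the gut and in the central compartment.
struct State { double gut, cent; };

// Analytic evolution operator g(y0, dt): advances the linear ODE system
//   d gut/dt = -k1 * gut,   d cent/dt = k1 * gut - k2 * cent
// by a time interval dt (assumes k1 != k2).
State evolve(State y0, double dt, double k1, double k2) {
  double A = y0.gut * k1 / (k2 - k1);  // coefficient of the coupling term
  State y;
  y.gut = y0.gut * std::exp(-k1 * dt);
  y.cent = (y0.cent - A) * std::exp(-k2 * dt) + A * std::exp(-k1 * dt);
  return y;
}
```

At a steady state, evolving the post-dose state over one dosing interval returns the pre-dose state, which is precisely the fixed-point equation handed to the algebraic solver.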
Oftentimes, a medical treatment undergoes a cycle: for example, a patient receives a drug dose at a regular time interval, $\delta t$. We may then expect that, after several cycles, the patient reaches a state of equilibrium. Let $y_0$ be the drug mass at the beginning of a cycle and $m$ the drug mass (instantaneously) introduced in the gut by a drug intake. Then equilibrium is reached when:

$$g(y_0 + m, \delta t) = y_0$$

This algebraic equation can be solved analytically, but for demonstration purposes, we solve it numerically. We then compute sensitivities for the diffusion rates $k_1$ and $k_2$.
To increase the number of states, we solve the algebraic equations for $n$ patients simultaneously, yielding $2n$ states. The coefficients in the ODEs for the $i$th patient are $(k_1^i, k_2^i)$. These coefficients vary from patient to patient, according to $k_j^i = k_j^0 \sigma_i$, where $k_j^0$ is a population parameter and $\sigma_i$ a random scaling factor, uniformly sampled between 0.7 and 1.3. Hence, for $2n$ states, we have $2n + 2$ parameters that require sensitivities.
References
 Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Zheng, X. (2016). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR, abs/1603.04467. http://arxiv.org/abs/1603.04467
 Andresson, J. A. E., Gillis, J., Horn, G., Rawlings, J. B., & Diehl, M. (2018). CasADi – A software framework for nonlinear optimization and optimal control. Mathematical Programming Computation.
 Aubert, P., Di Cesare, N., & Pironneau, O. (2001). Automatic Differentiation in C++ Using Expression Templates and Application to a Flow Control Problem. Computing and Visualization in Science. https://doi.org/10.1007/s007910000048
 Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic differentiation in machine learning: a survey. CoRR, abs/1502.05767. http://arxiv.org/abs/1502.05767
 Bell, B. M. (2012). CppAD: a package for C++ algorithmic differentiation. Computational Infrastructure for Operations Research.
 Bell, B. M., & Burke, J. V. (2008). Algorithmic Differentiation of Implicit Functions and Optimal Values. https://doi.org/10.1007/978-3-540-68942-3_17
 Betancourt, M. (2013). A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. https://arxiv.org/pdf/1212.4693.pdf
 Betancourt, M. (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434v1.
 Betancourt, M. (2017). Nomad: A High-Performance Automatic Differentiation Package. https://github.com/stan-dev/nomad
 Betancourt, M. (2018). A Geometric Theory of Higher-Order Automatic Differentiation. https://arxiv.org/pdf/1812.11592.pdf
 Bischof, C., & Bücker, H. (2000). Computing Derivatives of Computer Programs. In J. Grotendorst (Ed.), Modern Methods and Algorithms of Quantum Chemistry: Proceedings, Second Edition (Vol. 3, pp. 315–327). Jülich: NIC Directors. https://juser.fz-juelich.de/record/44658/files/Band_3_Winterschule.pdf
 Bischof, C., Corliss, G., & Griewank, A. (1993). Structured second- and higher-order derivatives through univariate Taylor series. Optimization Methods and Software. https://doi.org/10.1080/10556789308805543
 Bischof, C., Khademi, P., Mauer, A., & Carle, A. (1996). ADIFOR 2.0: automatic differentiation of Fortran 77 programs. IEEE Computational Science & Engineering, 3(3), 18–32. 10.1109/99.537089
 Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., … Riddel, A. (2017). Stan: A Probabilistic Programming Language. Journal of Statistical Software. 10.18637/jss.v076.i01
 Carpenter, B., Hoffman, M. D., Brubaker, M. A., Lee, D., Li, P., & Betancourt, M. J. (2015). The Stan Math Library: Reverse-Mode Automatic Differentiation in C++. https://arxiv.org/abs/1509.07164
 Eddelbuettel, D., & Balamuta, J. J. (2017). Extending R with C++: A Brief Introduction to Rcpp. PeerJ Preprints, 5:e3188v1. https://doi.org/10.7287/peerj.preprints.3188v1
 Gay, D. (2005). Semiautomatic differentiation for efficient gradient computations. In H. M. Buecker, G. F. Corliss, P. Hovland, U. Naumann, & B. Norris (Eds.), Automatic Differentiation: Applications, Theory, and Implementations (Vol. 50, pp. 147–158). Springer, New York.
 Gay, D., & Aiken, A. (2001). Language Support for Regions. SIGPLAN Notices, 36(5), 70–80. 10.1145/381694.378815
 Gebremedhin, A. H., Tarafdar, A., Pothen, A., & Walther, A. (2009). Efficient Computation of Sparse Hessians Using Coloring and Automatic Differentiation. INFORMS Journal on Computing, 21, 209–223. https://doi.org/10.1287/ijoc.1080.0286
 Giles, M. (2008). An extended collection of matrix derivative results for forward and reverse mode algorithmic differentiation. The Mathematical Institute, University of Oxford, Eprints Archive.
 Girolami, M., Calderhead, B., & Chin, S. A. (2013). Riemannian Manifold Hamiltonian Monte Carlo. arXiv:0907.1100. https://arxiv.org/abs/0907.1100
 Griewank, A. (1992). Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation. Optimization Methods and Software, 1, 35–54. 10.1080/10556789208805505
 Griewank, A., Juedes, D., & Utke, J. (1999). ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software, 22. http://www3.math.tu-berlin.de/Vorlesungen/SS06/AlgoDiff/adolc110.pdf

 Griewank, A., Utke, J., & Walther, A. (2000). Evaluating Higher Derivative Tensors by forward propagation of univariate Taylor series. Mathematics of Computation, 69(231), 1117–1130. 10.1090/S0025-5718-00-01120-0
 Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). Society for Industrial and Applied Mathematics (SIAM). https://doi.org/10.1137/1.9780898717761
 Grimm, J., Pottier, L., & Rostaing-Schmidt, N. (1996). Optimal Time and Minimum Space-Time Product for Reversing a Certain Class of Programs. INRIA. inria-00073896
 Guennebaud, G., Jacob, B., et al. (2010). Eigen v3. http://eigen.tuxfamily.org
 Hascoet, L., & Pascual, V. (2013). The Tapenade Automatic Differentiation tool: principles, model, and specification. ACM Transactions on Mathematical Software. 10.1145/2450153.2450158
 Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research.
 Hogan, R. J. (2014). Fast Reverse-Mode Automatic Differentiation using Expression Templates in C++. ACM Transactions on Mathematical Software, 40(4). 10.1145/2560359
 Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2016). Automatic Differentiation Variational Inference. https://arxiv.org/abs/1603.00788
 Margossian, C. C. (2018). Computing Steady States with Stan’s Nonlinear Algebraic Solver. Stan Conference 2018 California.
 Margossian, C. C., & Gillespie, W. R. (2016). Stan Functions for Pharmacometrics Modeling. Journal of Pharmacokinetics and Pharmacodynamics, 43.
 Moler, C., & Van Loan, C. (2003). Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later. SIAM Review.
 Neal, R. M. (2010). MCMC using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC Press.
 Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., … Lerer, A. (2017). Automatic differentiation in PyTorch.
 Pearlmutter, B. A. (1994). Fast Exact Multiplication by the Hessian. Neural Computation. https://doi.org/10.1162/neco.1994.6.1.147
 Phipps, E., & Pawlowski, R. (2012). Efficient Expression Templates for Operator Overloading-based Automatic Differentiation. arXiv:1205.3506.
 Powell, M. J. D. (1970). A Hybrid Method for Nonlinear Equations. In P. Rabinowitz (Ed.), Numerical Methods for Nonlinear Algebraic Equations. Gordon and Breach.
 Rowland, T., & Weisstein, E. W. (n.d.). Matrix Exponential. MathWorld. http://mathworld.wolfram.com/MatrixExponential.html
 Sagebaum, M., Albring, T., & Gauger, N. R. (2017). High-Performance Derivative Computations using CoDiPack. https://arxiv.org/pdf/1709.07229.pdf
 Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science. https://doi.org/10.7717/peerj-cs.55
 Stan Development Team. (2018). Bayesian Statistics using Stan, version 2.18.1. https://mc-stan.org/docs/2_18/stan-users-guide/index.html
 Veldhuizen, T. (1995). Expression templates. C++ Report, 7.
 Voßbeck, M., Giering, R., & Kaminski, T. (2008). Development and First Applications of TAC++. https://doi.org/10.1007/978-3-540-68942-3_17
 Walther, A., & Griewank, A. (2012). Getting started with ADOL-C. In U. Naumann & O. Schenk (Eds.), Combinatorial Scientific Computing (pp. 181–202). Chapman-Hall CRC Computational Science.
 Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org

 Widrow, B., & Lehr, M. A. (1990). 30 years of adaptive neural networks: perceptron, Madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415–1442. 10.1109/5.58323