# QuantLib User Meeting 2015

I gave two talks at this year’s QuantLib User Meeting. They were on

• Negative rates in QuantLib.

You can read Luigi’s full report on the conference here.

# Adjoint Greeks VI – Source Code Transformation Pt. II

I will be talking about automatic differentiation beyond typedef and operator overloading at this year’s QuantLib User Conference. Registration is free, but seats are limited. Today’s blog post is a bit of a teaser, so don’t miss the chance to be in Düsseldorf on November 30th / December 1st.

To recall what has been done to make QuantLib AD-ready: we have two approaches. One is by CompatibL, which in essence changes the typedef for Real in the library from double to the active type of an operator overloading tool, like CppAD::AD&lt;double&gt;. Of course this doesn’t work just by changing the typedef alone, but in effect it describes well what is going on.

The other approach is to transform the library’s code and introduce template parameters whenever we might potentially be interested in calculating derivatives. You can read more about this by selecting the tag Automatic Differentiation in the sidebar that can be expanded by clicking on the three horizontal lines in the upper right corner when you scroll to the top of this page.

Backward compatibility in the template approach is achieved by typedef’ing the original class names to template instantiations with T = Real, so existing code compiles without any changes. Well, let’s say 99% of it; a few modifications have to be applied, but really not many (judging from my experience with the original examples and the test suite).

The transformation is not applied to the full library (yet), but the vanilla IRD stuff works: coupons, yield term structures, volatility term structures, the GSR model. You can get it (together with some working examples) from my adjoint branch at https://github.com/pcaspers/quantlib/tree/adjoint.

Both approaches have clear advantages and disadvantages. One potential problem with the typedef approach is that it may kill your performance in parts of the code where you do not even want to use AD. For example,

Matrix_t<T> A(1024, 1024);
Matrix_t<T> B(1024, 1024);
...
Matrix_t<T> C = A * B;


will take 764 ms on my machine with T=double, but 8960 ms with T=CppAD::AD&lt;double&gt;. This seems to be due to compiler optimizations (like using SIMD instructions) that are no longer applicable when switching from native double to the custom AD type. This is a penalty factor of 11.7x and you do not get anything in return. Note that I do not record any tape here; I just use the double functionality wrapped in the AD type.

Of course there might be other operator overloading tools that are more efficient. To get a lower bound for the penalty factor I replaced T by a minimal wrapper, just storing a double and an additional native pointer and forwarding the operators +, *, += needed here to the respective double operations. With this wrapper the penalty is 2.0x, which is much better. The perfect operator overloading tool will probably lie somewhere between 2.0x and 11.7x. But even 2.0x is not good if you can avoid it, is it?

In any case the template approach allows for selective activation of variables where AD is needed, while using native instances of the library classes elsewhere. You can also mix both: for example you could do a model calibration with the double instance of a class, then extract the calibrated parameters and plug them into the AD version of the same model, which then maybe generates Monte Carlo paths for some CVA calculation in which you want to compute an adjoint gradient.

Still there are use cases left where even the template approach will not help, namely when you want to apply AD to code sections like the one above, where a lot of dense double calculations take place and enable the compiler to apply optimizations that are not possible with custom types.

We saw an example in https://quantlib.wordpress.com/2015/04/14/adjoint-greeks-iv-exotics. The time equivalent of around 80 NPV calculations (holy cow!) is consumed by one adjoint gradient calculation of a Bermudan swaption in the Hull White model. From theory we would expect a cost factor of around 4x for the adjoint sweep; in practice we are probably happy with any factor up to 10x. But with 80 NPV calculations you can do a lot of things, and even if your gradient vector has 100 to 200 entries, applying AD would merely result in a speed-up of around 2x compared to a naive bump-and-revalue approach. This would not really justify the whole AD thing.

One possible way out of this is to write the adjoint code by hand.

Another approach is to use source code transformation. I started with a toy example in one of my last posts.

In the meantime I implemented a Bermudan swaption engine that produces an AD gradient w.r.t. the $H$ and $\zeta$ vectors describing an LGM model and the relevant discount factors of an interest rate curve. For this I extracted the computational core of the Gaussian1dSwaptionEngine (which relies on numerical integration) and rewrote it in Fortran. Then I applied OpenAD/F to the code to generate adjoint code and compiled everything into a library. You can have a look at the project here: https://github.com/pcaspers/quantlib/tree/master/QuantLibOAD/lgm.

On the QuantLib side I wrote a pricing engine acting as a wrapper for the Fortran core: it can be used just as any other pricing engine, but provides the adjoint gradient vector as an additional result. The gradient is computed w.r.t. the model parameters, but it is easy to transform this into the usual market variables’ sensitivities. This is done exactly the same way as in the post I mentioned earlier.

Testing the engine on a 10y Bermudan swaption with yearly calls (using 49 grid points per expiry for the numerical integration) yields the following results (you can find the example code here):

• single pricing on an Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz, single threaded, non-transformed code: 4.2 ms
• pricing + gradient (which is 105 dimensional in this example): 25.6 ms
• transformation of gradient w.r.t. model parameters: 6.2 ms

Looks good, no? In particular because no extra optimization of the input or transformed code was necessary to achieve the good multiple of 6.1; it just came out of OpenAD/F this way.

As a side note, when implementing it as described here you can use the power of AD even from the original / “classic” QuantLib, just through specialized pricing engines (and by linking against the Fortran support library, of course). Or you can mix operator overloading in the template QuantLib with source code transformation applied to critical sections of the code that slow down too much under operator overloading. Maybe this hybrid solution is the way to go.

More on this and other topics in Düsseldorf.

# Adjoint Greeks V – Source Code Transformation

In my last post on automatic differentiation we saw that the operator overloading approach in some situations (quite naturally) comes with a dramatic performance breakdown.

An alternative approach that might do better is called source code transformation. Here the code is differentiated at compile time. While there are quite a few operator overloading tools that work on existing code bases more or less out of the box, this does not seem to be the case for source code transformation tools.

I looked around for quite a while but did not really find a mature framework for C++. There is one under development (ADIC 2.0) that should be mentioned. There is also quite an old version of the same project for C (ADIC 1.x).

There is a Fortran framework though, called OpenAD/F. Quite interestingly this framework separates the abstract AD engine (“OpenAD”) from the language specific parts (“/F”), using an XML representation for the computational core to be differentiated. So in principle you can implement a front end for C++ replacing the Fortran front end and reuse the algorithms of OpenAD. This is actually what ADIC is doing.

Still I have this faint idea of using the intermediate representation (IR) of the LLVM compiler infrastructure to interface with OpenAD. There are already backends that can translate IR back to C, so the “only” thing left to do would roughly be the transformation of IR to XAIF and back, plus the implementation of a runtime support library. Very roughly. I certainly do not have enough experience with both AD and the IR to really assess this, and also not the time to delve into it. Anyway.

I played around with OpenAD/F some months ago. I solved a toy PDE for American option pricing with Fortran code written from scratch. The results were promising in terms of performance. So here is the plan for how to benefit from that in a production setting:

The idea is that a typical core pricing algorithm is actually not too big and can effectively be excavated from the rest of the code.

If this is the case we could reimplement this computational core in Fortran, apply OpenAD/F to differentiate it and use the differentiated code via a plain interface from QuantLib pricing engines to retrieve adjoint greeks.

This post aims to present a minimal example for that procedure to prove the technical feasibility. Later I will try to apply this to the Bermudan Swaption example I mentioned above.

The libraries are organized as the usual QuantLib shared object library plus a Fortran shared object library that contains the differentiated code for the numerical cores. The minimal example library is called simplelibad (simple lib ad). An application will then typically be a usual C++ program linking against QuantLib and the new ad library.

You can find the example code here. It comes with a make file that by default builds a differentiated version of the library functions. You can also run the target plain to build only the original functions without AD, which seems useful for testing the ported code against the original one before doing anything fancy.

Actually only one function is provided in our simplelib, which computes a discount factor from a given continuously compounded zero yield and a day count fraction. The code looks as follows (sorry, no syntax highlighting):

subroutine discount(zeroyield, dcf, df)
implicit none
double precision:: zeroyield, dcf, df
!$openad INDEPENDENT(zeroyield)
df = exp(-zeroyield * dcf)
!$openad DEPENDENT(df)
end subroutine discount


To keep the interface simple no return parameters are present, the result is written to the last input parameter instead.

There are two OpenAD-specific lines starting with !$openad which declare the active variables. The make file invokes a wrapper script openad that is shipped with OpenAD as a simple driver for the actual tools:

openad -c -m f simplelib.f90


This produces the differentiated version of the above code. The option -m f sets the mode to forward. The resulting code is then compiled into an object file:

gfortran -g -O3 -o simplelib.pre.xb.x2w.w2f.post.o -c simplelib.pre.xb.x2w.w2f.post.f90 -fpic


The same is done with some additional runtime support files provided by OpenAD

gfortran -g -O3 -o w2f__types.o -c w2f__types.f90 -fpic


Finally we need a driver that acts as the interface to QuantLib later on

subroutine discount_ad(zeroyield, dcf, df, ddf)
use OAD_active
implicit none
external discount
double precision:: zeroyield, dcf, df, ddf
type(active):: zeroyieldAD, res
zeroyieldAD%v = zeroyield
zeroyieldAD%d = 1.0d0
call discount(zeroyieldAD, dcf, res)
df = res%v
ddf = res%d
end subroutine discount_ad


This returns the derivative together with the original result of the computation and does nothing more than invoking the differentiated code. We compile the driver as well

gfortran -g -O3 -o driver_simplelib.o -c driver_simplelib.f90 -fpic


and then everything is linked into a shared object library simplelibad

gfortran -shared -g -O3 -o libsimplelibad.so w2f__types.o OAD_active.o driver_simplelib.o simplelib.pre.xb.x2w.w2f.post.o


In the minimal example here we actually do not use any QuantLib classes, but directly talk to the ad library from our example application, like this

#include <iostream>
#include <cmath>

extern "C" void discount_ad_(double *zeroyield, double *dcf, double *df,
double *ddf);

int main() {

    double zeroyield = 0.02;
    double dcf = 10.0;
    double df, ddf;

    // invoke the differentiated Fortran code via the driver
    discount_ad_(&zeroyield, &dcf, &df, &ddf);

    std::cout << "result1 = " << df
              << " should be: " << std::exp(-zeroyield * dcf) << std::endl;

    std::cout << "result2 = " << ddf
              << " should be: " << -dcf * std::exp(-zeroyield * dcf)
              << std::endl;
}


The application can be compiled like any other C++ code; it just needs to be linked against simplelibad. Running it yields the expected output:

result1 = 0.818731 should be: 0.818731
result2 = -8.18731 should be: -8.18731


Now of course the real work starts … hope to be back with a meaningful example soon.

# Backward Automatic Differentiation Explained

Today I will try to explain how the forward and backward modes in automatic differentiation work. I will only cover the principle, not actual algorithms and the optimizations they apply. While the so-called forward mode is quite intuitive, it is not so easy to wrap your head around the backward mode. I will try to go through all steps and not leave out anything seemingly trivial.

We consider the computation of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ with independent variables $x_1, \dots , x_n$ and dependent variables $y_1, \dots , y_m$. The ultimate goal is to compute the Jacobian

$J = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & & \dots & & \frac{\partial y_1}{\partial x_n} \\ & \ddots & & & \\ \vdots & & \frac{\partial y_i}{\partial x_j} & & \vdots \\ & & & \ddots & \\ \frac{\partial y_m}{\partial x_1} & & \dots & & \frac{\partial y_m}{\partial x_n} \end{pmatrix}$

We view the function $f$ as a composite of elementary operations

$u_k = \Phi_k( \{u_\kappa\}_{(\kappa,k) \in \Lambda})$

for $k > n$ where we set $u_k = x_k$ for $k=1,\dots,n$ (i.e. we reserve these indices for the start values of the computation) and $u_k = y_k$ for $k=K-m+1, \dots, K$ (i.e. these are the final results of the computation). The notation should suggest that $u_k$ depends on prior results $u_\kappa$ with $(\kappa,k)$ in some index set $\Lambda$. Note that if $(k,l)\in\Lambda$ this refers to a direct dependency of $u_l$ on $u_k$, i.e. if $u_k$ depends on $u_j$, but $u_j$ does not enter the calculation of $u_k$ directly then $(j,l) \notin \Lambda$.

As an example consider the function

$f(x_1, x_2) = x_1 + x_2^2$

for which we would have $u_1 = x_1$, $u_2 = x_2$, $u_3 = u_2^2$, $u_4 = y_1 = u_1+u_3$. The direct dependencies are $(1,4)$, $(2,3)$ and $(3,4)$, but not $(2,4)$, because $x_2=u_2$ does not enter the expression for $u_4$ directly.

We can view the computation chain as a directed graph with vertices $u_k$ and edges $(k,l)$ if $(k,l)\in\Lambda$. No cycles are allowed in this graph (it is an acyclic graph) and it consists of $K$ vertices.

We write $|i,j|$ for the length of the longest path from $u_i$ to $u_j$ and call that number the distance from $i$ to $j$. Note that this is not the usual definition of distance, which is normally the length of the shortest path.

If $u_j$ is not reachable from $u_i$ we set $|i,j| = \infty$. If $u_j$ is reachable from $u_i$ the distance is finite, since the graph is acyclic.

We can compute a partial derivative $\partial u_m / \partial u_k$ using the chain rule

$\frac{\partial u_m}{\partial u_k} = \sum_{l|(l,m)\in\Lambda} \frac{\partial u_m}{\partial u_l} \frac{\partial u_l}{\partial u_k}$

This suggests a forward propagation scheme: we start at the initial nodes $u_1, \dots , u_n$. For all nodes $u_l$ with maximum distance $1$ from these nodes we compute

$c_l = \sum_{i=1,\dots,n} \frac{\partial u_l}{\partial u_i} c_{i}$

where we can choose $c_i$ for $i=1,\dots,n$ freely at this stage. This assigns the dot product of the gradient of $u_l$ w.r.t. $x_1, \dots, x_n$ and $(c_1,\dots,c_n)$ to the node $u_l$.

If we choose $c_k=1$ for one specific $k\in\{1,\dots,n\}$ and zero otherwise, we get the partial derivative of $u_l$ by $u_k$, but we can compute any other directional derivatives using other vectors $(c_1,\dots,c_n)$. (Remember that the directional derivative is the gradient times the direction w.r.t. which the derivative shall be computed.)

Next we consider nodes with maximum distance $2$ from all nodes $u_1,\dots,u_n$. For such a node $u_l$

$c_l = \sum_{i=1,\dots,n} \frac{\partial u_l}{\partial u_i} c_i = \sum_{i=1,\dots,n} \sum_{k|(k,l)\in\Lambda} \frac{\partial u_l}{\partial u_k} \frac{\partial u_k}{\partial u_i} c_i = \sum_{k|(k,l)\in\Lambda} \frac{\partial u_l}{\partial u_k} c_k$

where we can assume that the $c_k$ were computed in the previous step, because their maximum distance to all initial nodes $u_1,\dots,u_n$ must be less than $2$, hence $1$.

Also note that if $k \in \{1,\dots,n\}$, which may be the case, $\partial u_k / \partial u_i = 1$ if $i=k$ and zero otherwise, so $\sum_{i=1,\dots,n} \partial u_k / \partial u_i \, c_i = c_k$ trivially. Or seemingly trivially.

The same argument can be iterated for nodes with maximum distance $3, 4, \dots$ until we reach the final nodes $u_{K-m+1},\dots,u_K$. This way we can work forward through the computational graph and compute the directional derivative we seek.

In the backward mode we do very similar things, but in a dual way: We start at the final nodes and compute for all nodes $u_l$ with maximum distance $1$ from all of these nodes

$\overline{c_l} = \sum_{i=K-m+1,\dots,K} \frac{\partial u_i}{\partial u_l} \overline{c_i}$

Note that we now compute a weighted sum in the dependent variables. By setting a specific $\overline{c_i}$ to $1$ and the rest to zero we can again compute the partial derivatives w.r.t. a single final variable. Again using the chain rule we can compute

$\overline{c_l} = \sum_{i=K-m+1,\dots,K} \frac{\partial u_i}{\partial u_l} \overline{c_i} = \sum_{i=K-m+1,\dots,K}\sum_{k|(l,k)\in\Lambda} \frac{\partial u_i}{\partial u_k}\frac{\partial u_k}{\partial u_l} \overline{c_i} = \sum_{k|(l,k)\in\Lambda} \frac{\partial u_k}{\partial u_l} \overline{c_k}$

for all nodes $u_l$ with maximum distance of $2$ from all the final nodes.

Note that the chain rule formally requires including all variables $u_k$ on which $u_i$ depends. However, if $u_k$ does not depend on $u_l$ the whole term is effectively zero, so we can drop these summands from the beginning. Also, we may include indices $k$ on which $u_i$ does not depend in the first place, which is not harmful for the same reason.

As above we can assume all $\overline{c_k}$ to be computed in the previous step, so that we can iterate backwards to the initial nodes to get all partial derivatives of the weighted sum of the final nodes w.r.t. the initial nodes.

# Adjoint Greeks IV – Exotics

Today I feature a sequel to the adjoint greeks drama (see my prior posts on this).

Before I start I would like to point you to a new and excellent blog authored by my colleague Matthias https://ipythonquant.wordpress.com/. You will want to follow his posts, I am certain about that.

I am still on my way to convert the essential parts of the library to a template version. Since this is boring work I created a small friend who helps me. No, I did not go mad. It is a semi-intelligent emacs-lisp script that does some regular-expression-based search-and-replaces, which speeds up the conversion a lot. Emacs is so cool.

Today I am going to calculate some derivatives of a bermudan swaption in the Gsr / Hull White model. This is the first post-vanilla application and admittedly I am glad that it works at last.

One point I have to make today is that AD is slow. Or to put it differently, doubles can be tremendously fast. Lookit here:

void multiply(T *a, T *b, T *s) {
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) {
            for (int j = 0; j < N; ++j) {
                s[i * N + j] += a[i * N + k] * b[k * N + j];
            }
        }
    }
}


This code multiplies two matrices a and b and stores the result in s. It is the same algorithm as implemented in QuantLib for the Matrix class.

When feeding two 1000 x 1000 double matrices, the code runs in 1.3s on my laptop if compiled with gcc 4.8.2 using -O1. With -O3 I get 950ms. With clang 3.7.0 (fresh from the trunk) I get 960ms (-O1) and 630ms (-O3). With T=CppAD::AD&lt;double&gt; on the other hand, the running time goes up to at least 10.9s (gcc, -O3) and 14.3s (clang, -O3). Code optimization seems to be a delicate business.

These timings refer to the situation where we use the AD type without taping, i.e. only as a wrapper for the double in it. This seems to indicate that at least for specific procedures it is not advisable to globally replace doubles by their active counterpart if one is not really interested in taping their operations and calculating the derivatives. Performance may break down dramatically.

What is the reason behind the huge difference? I am not really the right person to analyze this in great detail. The only thing I spotted when looking into the generated assembler code is that with double there are SIMD (single instruction, multiple data) instructions for adding and multiplying in the nested loops (like addpd and mulpd; it’s a long time since I programmed in assembler, and that was on a 6510, so I am not really able to read a modern x86 assembler file).

With CppAD::AD&lt;double&gt; there do not seem to be such instructions around. So part of the performance loss may be due to the inability to stream the AD calculations. For sure this is not the only point here.

The second point to make today is that AD has pitfalls that may in the end lead to wrong results, if one uses the AD framework blindly. Let’s come back to our specific example. The underlying source code can be found here if you are interested or want to run it by yourself.

It is a bermudan swaption, ten years with yearly exercise dates. The model for pricing will be the Gsr or Hull White model. We just want to compute the bucket vegas of the bermudan, i.e. the change in its NPV when the implied market volatility of the canonical european swaptions used for the model calibration is increased by one percent.

The rate level is set to $3\%$ and the strike is out of the money at $5\%$. The volatilities are produced by SABR parameters $\alpha=3\%$, $\beta=60\%$, $\nu=12\%$, $\rho=30\%$. All of this is arbitrary, unrealistic and only for explanatory purposes …

The naive AD way of doing this would be to declare the input implied volatilities as our quantities of interest

   CppAD::Independent(inputVolAD);


and then go through the whole model calibration

   gsrAD->calibrateVolatilitiesIterative(


and pricing

   yAD[0] = swaptionAD->NPV();


to get the derivative d bermudan / d impliedVol

    CppAD::ADFun<Real> f(inputVolAD, yAD);
vega = f.Reverse(1, w);


When I first wrote about AD I found it extremely attractive and magical that you could compute your sensitivities like this. That you can actually differentiate a zero search (like in the case of yield curve bootstrapping) or an optimization (like the Levenberg-Marquardt algorithm which is used here).

However there are dark sides to this simplicity, too. Performance is one thing. The calibration step and the pricing including the gradient calculation take

AD model calibration = 1.32s


We can also do it differently, namely calibrate the model in an ordinary way (using ordinary doubles), then compute the sensitivity of the bermudan to the model’s sigma and additionally compute the sensitivity of the calibration instruments to the model’s sigma. Putting everything together yields the bermudan’s bucketed vega again. I will demonstrate how below. First I report the computation time for this approach:

model calibration = 0.40s


This leaves us with a performance gain of around 15 percent (7.32s vs 8.43s). This is not really dramatic, still significant. And there is another good reason to separate the calibration from the greek calculation which I will come to below.

Note also that the pricing, which takes 5.95s with AD (including derivatives), is much faster without AD, where it only consumes 0.073s. This is a factor of 80, which is much worse than the theoretical factor of 4 to 5 mentioned in earlier posts (remember, we saw 4.5 for plain vanilla interest rate swap NPV and delta computation). This is again due to optimization issues, obviously.

The background here is that the swaption pricing engine uses cubic spline interpolation and closed-form integration of the resulting cubic polynomials against the normal density for the rollback. Again a lot of elementary calculations, not separated by OO constructs that would prevent the compiler from low-level optimizations. You would surely need quite a number of sensitivities to still get a performance gain.

But let’s follow the path to computing the bucket vegas further. First I print the bucket vegas coming from the naive AD approach above: direct differentiation by the input implied volatilities, with the operation sequence going through the whole model calibration and the pricing.

vega #0: 0.0153%
vega #1: 0.0238%
vega #2: 0.0263%
vega #3: 0.0207%
vega #4: 0.0209%
vega #5: 0.0185%
vega #6: 0.0140%
vega #7: 0.0124%
vega #8: 0.0087%


This looks plausible. We started the calibration from a constant sigma function at $1\%$. This is actually far away from the target value (which is around $0.40\%$), so the optimizer can walk around before it settles at the minimum. But we could have started with a sigma very close to the optimal solution. What would happen then? With the target value as an initial guess (which you are unlikely to have in reality, sure) we get

vega #0: 0.0000%
vega #1: 0.0238%
vega #2: 0.0263%
vega #3: 0.0207%
vega #4: 0.0209%
vega #5: 0.0448%
vega #6: 0.0000%
vega #7: 0.0000%
vega #8: 0.0000%


Some vegas are zero now, and vega #5 is even completely different from the value before. This is a no-go, because in a production application you wouldn’t notice if some deals contributed a zero sensitivity or a false value.

What is happening here is that the function we differentiate depends on more input variables than only the primary variables of interest (the implied vol), like the initial guess for the optimization. These might alter the derivative drastically even if the function value (which is the model’s sigma function on an intermediate level, or ultimately the bermudan’s npv) stays the same.

For example, if the initial guess is so good that the optimizer’s tolerance is already satisfied with it, the output will stay the same, no matter whether the input is perturbed by an infinitesimal shift $dx$. Here perturbed does not really mean bumped, and $dx$ really is infinitesimally small. It is only a concept to get a better intuition of what is going on during the process of automatic differentiation.

Another example is the bisection method for zero searching. This method will always yield zero AD derivatives, in the following sense: if $x$ is an input (e.g. a vector of swap quotes) and $y$ is the output (e.g. a vector of zero rates) linked by a relation

$f(x,y) = 0$

then if $x$ is perturbed by $dx$, the checks in the bisection algorithm will yield exactly the same results for $x+dx$ as for $x$. Therefore the computed $y$ will be exactly the same, and thus the derivative zero.

I don’t know if this explanation is good, but this is how I picture it for myself. The rule seems to be: if what you do with AD feels too magical, you’d better not do it.

What is the way out? We just avoid all optimizations and zero searches, if possible. And it is possible for interest rate deltas and vegas: just calibrate your curve in the usual way, then apply AD to compute sensitivities to zero rates (which no longer involves a zero search).

If you want market rate deltas, compute the Jacobian matrix of the market instruments w.r.t. the zero rates as well, invert it and multiply it with the zero delta vector. Actually you do not even have to invert it; you only have to solve one linear equation system. We make this procedure explicit with our example above.

The first step would be to compute d bermudan NPV / d model’s sigma. This is done on the already calibrated model. Technically we separate the calibration step (done with the usual doubles) and the AD step (done on a copy of the model fed with the model parameters from the first model, but not itself calibrated again). This is what we get for d bermudan / d sigma

vega #0: 1.6991
vega #1: 1.2209
vega #2: 0.8478
vega #3: 0.5636
vega #4: 0.4005
vega #5: 0.2645
vega #6: 0.1573
vega #7: 0.0885
vega #8: 0.0347


Next we compute the Jacobian d helperNpv / d sigma

297.4%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
179.7%	183.3%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
126.9%	129.4%	132.2%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
90.8%	92.6%	94.6%	96.7%	0.0%	0.0%	0.0%	0.0%	0.0%
64.9%	66.2%	67.6%	69.1%	71.2%	0.0%	0.0%	0.0%	0.0%
47.5%	48.4%	49.5%	50.6%	52.1%	53.3%	0.0%	0.0%	0.0%
31.9%	32.6%	33.3%	34.0%	35.0%	35.9%	36.3%	0.0%	0.0%
19.2%	19.6%	20.0%	20.4%	21.0%	21.5%	21.8%	22.3%	0.0%
8.7%	8.9%	9.0%	9.2%	9.5%	9.8%	9.9%	10.1%	10.5%


helperVega = f.Jacobian(sigmas);


This is a lower triangular matrix, because the $i$th calibration helper depends only on the sigma function up to its expiry time. The inverse of this matrix is also interesting, although we would not need it in its full beauty for our vega calculation. It represents d sigma / d helperNpv

33.6%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
-33.0%	54.6%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
0.0%	-53.4%	75.6%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
0.0%	0.0%	-74.0%	103.5%	0.0%	0.0%	0.0%	0.0%	0.0%
0.0%	0.0%	0.0%	-100.4%	140.4%	0.0%	0.0%	0.0%	0.0%
0.0%	0.0%	0.0%	0.0%	-137.2%	187.5%	0.0%	0.0%	0.0%
0.0%	0.0%	0.0%	0.0%	0.0%	-185.1%	275.3%	0.0%	0.0%
0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	-269.1%	448.1%	0.0%
0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	-431.6%	953.1%


This says how a sigma step goes up when the helper belonging to that step goes up (these are the positive diagonal elements) and how the next sigma step needs to go down when the same helper as before goes up but the next helper stays the same. The counter-movement is roughly of the same size.


And the last ingredient is the calibration helpers’ vegas, which can be computed analytically (this is the NPV change when shifting the input market volatility up by one percent): d helperNpv / d marketVol

vega #0: 0.0904%
vega #1: 0.1117%
vega #2: 0.1175%
vega #3: 0.1143%
vega #4: 0.1047%
vega #5: 0.0903%
vega #6: 0.0720%
vega #7: 0.0504%
vega #8: 0.0263%


Now multiplying everything together gives

d bermudan / d impliedVol = d bermudan / d sigma x d sigma / d helperNpv x d helperNpv / d impliedVol

which is

vega #0: 0.0153%
vega #1: 0.0238%
vega #2: 0.0263%
vega #3: 0.0207%
vega #4: 0.0209%
vega #5: 0.0185%
vega #6: 0.0140%
vega #7: 0.0124%
vega #8: 0.0087%


same as above. Partly because I just copy-pasted it here. But it actually comes out of the code also, just try it.

# Adjoint Greeks III – Vanilla Swaps

Time for a post, because next week I will be in Dublin over the weekend. Three days with my beloved wife, just the two of us. Without our four little ones, who will have a great time too, with their granny. But also without my laptop, which makes me a bit nervous, to be honest. Anyway. Good news from my adjoint greeks project: I reached the first bigger milestone, which is to compute the delta vector for a plain vanilla swap.

Or rather for a whole portfolio of vanilla swaps. The setting should be realistic and allow for first performance checks, so I look at portfolios of 1, 10, 100, 1000, 5000 and 10000 swaps with random maturities in a horizon of 10, 20, 30, 50 and 70 years, giving 30 test cases for all combinations in total.

The underlying curve (only one, but it would work with separate discounting and forwarding curves as well) consists of 10 deposits, 5 forward rate agreements and as many swaps as the horizon suggests with yearly spacing, i.e. we have 25, 35, 45, 65 and 85 points on the curve for each of the above horizon scenarios respectively.

The goal is to

• compute the NPV of the swap portfolio
• compute bucket deltas with respect to all instruments in the underlying curve

both in the traditional way (shifting the market quotes of the curve instruments and revalue the portfolio on the new curve) and with AD.

You can see the whole example code here https://github.com/pcaspers/quantlib/blob/adjoint/QuantLib/Examples/AdjointSwapDeltas/adjoint.cpp. Note that no special engine is used to generate the adjoint deltas. We just do a vanilla swap pricing (almost) as always, collecting the market variables of interest in a vector xAD and the result (the NPV) in yAD and then say

CppAD::ADFun<Real> f(xAD, yAD);


Let’s start looking at one single swap. The delta pillars are simply numbered from 1 to 45, meaning overnight, tomnext, spotnext, spotweek, 1m, … , 6m, 1m-7m fra, 2m-8m, … , 5m-11m, 1y swap, 2y, … , 30y. The random swap below apparently has a maturity between 23y and 24y from today. The notional of the swap is 100 million (in general the portfolio’s total notional is 100 million in all cases, regardless of the number of swaps in it).

The finite difference deltas in the column labeled double are computed with a step size of 1E-6 here. The AD values are in the next column and the rightmost column displays the difference between these two. The CppAD framework once more does not disappoint us and gives reliable results. Note that the deltas are expressed w.r.t. a one basis point (1E-4) shift.

results:                double     AD<double>     difference
NPV            -64037022.66   -64037022.66           0.00
Delta #    1          53.94          53.93           0.00
Delta #    2          17.98          17.98           0.00
Delta #    3           0.00          -0.00           0.00
Delta #    4           0.00           0.00          -0.00
Delta #    5           0.00           0.00          -0.00
Delta #    6           2.28           2.28          -0.00
Delta #    7       -2439.17       -2439.17           0.00
Delta #    8           0.00          -0.00           0.00
Delta #    9           0.00           0.00          -0.00
Delta #   10           0.00          -0.00           0.00
Delta #   11           0.00           0.00          -0.00
Delta #   12           7.13           7.13          -0.00
Delta #   13         182.54         182.54          -0.00
Delta #   14           0.00          -0.00           0.00
Delta #   15           0.00           0.00          -0.00
Delta #   16          36.26          36.26           0.00
Delta #   17         495.51         495.51          -0.00
Delta #   18         731.54         731.54          -0.00
Delta #   19         994.67         994.68          -0.00
Delta #   20        1238.63        1238.63          -0.00
Delta #   21        1488.41        1488.42          -0.00
Delta #   22        1745.32        1745.33          -0.01
Delta #   23        2013.84        2013.85          -0.01
Delta #   24        2234.96        2234.97          -0.01
Delta #   25        2535.77        2535.79          -0.01
Delta #   26        2769.90        2769.92          -0.02
Delta #   27        3040.39        3040.41          -0.02
Delta #   28        3321.43        3321.45          -0.02
Delta #   29        3543.09        3543.12          -0.03
Delta #   30        3861.57        3861.65          -0.08
Delta #   31        4109.57        4109.51           0.06
Delta #   32        4364.59        4364.68          -0.09
Delta #   33        4659.00        4659.04          -0.05
Delta #   34        4958.68        4958.73          -0.05
Delta #   35        5166.84        5166.90          -0.06
Delta #   36        5529.58        5529.64          -0.06
Delta #   37        5745.63        5745.69          -0.07
Delta #   38       60419.04       60419.23          -0.19
Delta #   39      167048.46      167051.01          -2.55
Delta #   40           0.00           0.00           0.00
Delta #   41           0.00           0.00           0.00
Delta #   42           0.00           0.00           0.00
Delta #   43           0.00           0.00           0.00
Delta #   44           0.00           0.00           0.00
Delta #   45           0.00           0.00           0.00


Now let’s look at timings, in a bigger example with a 70y horizon (85 curve pillars) and 10000 swaps in the portfolio. On my laptop,

maximum maturity           70 years
portfolio size          10000 swaps
delta vector size          85 pillars

timings (ms)            double     AD<double>         factor  eff#NPVs
pricing                1070           2690
deltas                99030           2700
total                100100           5390        18.5714   4.57692


This says that the pricing takes 1070ms with standard double computations and 2690ms with AD. Approximately, because you cannot really trust single timings, in particular not for small problems.

The delta computation under the naive bump-and-revalue approach takes another hefty 99 seconds, while CppAD only consumes 2.7 seconds on top of the 2.7 seconds from the pricing. Obviously more work is done in the AD pricing step than in a plain double computation (the operation sequence recorded there is later used in the reverse sweep), so ultimately we have to look at the sum of pricing and delta computation. For this we get a speed-up of 18.5x with AD.

The last number, eff#NPVs, is the effective number of NPV calculations during the AD run, i.e. the total time of the AD calculation divided by the average time for one pricing with classic doubles. From theory we know that this should be bounded by 5; here it is 4.6.

The magic of AD is that with bigger problems, the 4.6 stays constant, not the speed-up factor. If we look at 115 pillars instead of the 85, for example,

maximum maturity          100 years
portfolio size          10000 swaps
delta vector size         115 pillars

timings (ms)            double     AD<double>         factor  eff#NPVs
pricing                1610           4060
deltas               204630           4100
total                206240           8160        25.2745   4.55004


Note that swaps with longer terms are generated for this example, so a single pricing takes longer than before. Otherwise we stay at 4.6 times one NPV computation, now for 115 bucket deltas, and get a speed-up of over 25!

It does not matter how many greeks you want, you just get them for free. I thought I had already understood that, but seeing it in reality is quite cool. Let your problem grow as it wants, your complexity stays constant. In other circumstances you are glad to be linear. Sometimes you are lucky-logarithmic. Here complexity is constant. Crazy.

Running all the examples mentioned above and plotting the effective number of NPV calculations against the problem size (defined as the product of the delta vector size and the portfolio size) we get this picture.

The effective number of NPV calculations seems to stabilize slightly above 4.5. Cough, probably we should produce more points first … but with a sharp eye …

Next steps: Volatility term structures (which I already scratched involuntarily when converting the ibor coupon pricers), Black pricers, Hull White and Markov functional model. The latter both being slow and totally and utterly inaccessible for bump and revalue sensitivities. And of course filling in all the holes I left open so far. Still, help on the conversion process is highly appreciated (including not only pull requests you can send, but also expensive gifts, plain money (no bitcoins please), or a too-good-to-reject QuantLib job offer).

Welcome back. I hope you all had some nice Christmas days. I surely had. Still some time to quantlib a bit. In my last post (https://quantlib.wordpress.com/2014/12/20/adjoint-greeks/) I wrote about automatic differentiation (AD) and applied it to delta and vega computation in the Black76 model. Since the results were promising, I started to work more seriously on my adjoint branch (https://github.com/pcaspers/quantlib/tree/adjoint). It is still in a proof-of-concept state. The first milestone would be an engine that computes adjoint interest rate deltas for QuantLib::Swap instruments. And guess what, I am almost there. It is a hell of a lot of work though, and if someone is interested in helping, please contact me. You will get intimate with large parts of the QuantLib code and get your hands dirty at the same time, with a specific goal to reach.

The essential work is to replace all quantities w.r.t. which we potentially want sensitivities by a generic type T. Then we can plug in a double if we want to work in classic mode or the active type of an AD framework like the one we already know from the CppAD library

CppAD::AD<double>


in order to leverage automatic differentiation in QuantLib. By the way, CppAD’s author Brad Bell has replaced the error function with the one from the std:: namespace (it has become part of the C++11 standard). He also helped me with some cmake problems, very nice. At the moment I only work with CppAD, but the changes in the library are not specific to this framework, so other tools (at least those relying on operator overloading) could be used as well. An important point to keep in mind: don’t make yourself dependent on a single third party library. It is enough that we are all dependent on QuantLib, isn’t it? Anyway, all the CppAD specific things I added are currently collected in one small, single header file, ql/qlcppad.hpp.

To get a flavour of what it means to template’ize QuantLib, look at the example of a simple quote. The public part now looks as follows

template <class T = Real>
class SimpleQuote_t : public Quote_t<T> {
  public:
    SimpleQuote_t(T value = Null<Real>());
    T value() const;
    bool isValid() const;
    T setValue(T value = Null<Real>());
    void reset();
    ...


The templated version of the class is marked by an underscore t. Actually the old class does not exist any more, but is rather retrieved by a typedef

typedef SimpleQuote_t<Real> SimpleQuote;
...


to keep existing code compiling without any changes (well, not quite in all cases unfortunately, but see below for this). On the other hand you can now instantiate a simple quote with an AD type via another typedef

typedef SimpleQuote_t<CppAD::AD<double>> SimpleQuoteAD;
...


We can then put this T-simple quote into a T-rate helper and create a T-piecewise yield curve, which will bootstrap the curve using a T-Brent solver, so that you can finally retrieve zero rates as AD doubles (computed with a T-linear interpolation, not to forget) and at the same time the sensitivities of the zero rates to the input market quotes.

This is what I am trying to demonstrate today. It is not exactly the milestone example from above, but technically very close. The many T’s from above already suggest that along the way you have to adapt a lot of code. And this is not always fun. If you have an error in your templated code, the compiler will often just vomit a big flood of long, unreadable error messages over your compiler window. If you are lucky, at some secret point in that ugly thick soup there is a tender hint of what might actually be the true source of the problem. And once the code compiles with doubles, this does not at all mean that it does so with AD<double>s.

What is more, we have a lot of special things in QuantLib, like the Null type, which for doubles is implicitly converted to a certain value (namely std::numeric_limits<float>::max()) to mark invalid values. This has to be made consistent with an AD double null type. Taking this example, we have to introduce a Null<CppAD::AD<double>>, which starts not too difficult with

template <class Base> class Null<CppAD::AD<Base> > {
  public:
    Null() {}
};
...


but since inside CppAD conversion from any type T to CppAD::AD<Base> (replace Base by double if you want) is done like this

template <class Base>
template <class T>
{	return *this = Base(t); }


i.e. via the Base type, we need also a conversion from our new Null type above to Base, which can be done as follows

...
// this is needed, because in ad_assign.hpp line 124ff
// assignment from T to AD<Base> is done via conversion from T
// to Base and then to AD<Base>. So our Null type must convert
// to double so that the conversion to AD<double> from it
// works.
operator Base() const { return static_cast<Base>(Null<Base>()); }
...


and which does the trick in the end.

Some structural problems arise, because when templating you effectively turn parts of the library into a header-only version, that is, the implementation moves from the cpp to the hpp files. This is because templated code can only be instantiated when compiling actual client code. Alternatively you can explicitly instantiate a list of specific types in the library itself, but then you tie yourself to exactly these implementations (e.g. support only for CppAD in our context).

At least for iborcoupon.hpp, cmscoupon.hpp and couponpricer.hpp I arrived at circular include dependencies, which were not solvable by forward declarations. So I had to split each of these files into two parts: a base part, which is included internally in some places in the library instead of the whole file, and a second part with the original file name, which in turn includes the base part, to stay backward compatible.

Another source of many headaches was the already highly generic territory around the piecewise yield curve. I really admire this part of the library, it is beautiful and works incredibly robust. But it is challenging to adapt this piece of art to enable AD. In the end I still wanted to write something like


PiecewiseYieldCurve<ZeroYield, Linear, IterativeBootstrap, CppAD::AD<double>>
    curve(referenceDate, instruments, Actual365Fixed());
...


to initialize a curve. The only change is the last template parameter, which can be set optionally to an AD double type (if not given, it is just assumed double, so replicating the old behaviour).

One truly non-backward-compatible change I introduced here is that factory classes for interpolations are now templated. The expression Linear therefore has a different meaning than before, namely it takes a template type T that is used for the calculations. The pro is that in generic classes like the piecewise yield curve we can still just write Linear, and behind the scenes the type T (taken from the last template parameter of PiecewiseYieldCurve) is plugged into the factory template so that it can do what has to be done. The con is that if the factory is used explicitly to create an interpolation object, we now have to write

Linear<Real>


now instead of Linear alone. But assuming that the factory classes are mostly used as template parameters and not to actually instantiate interpolation objects in client code, this is hopefully not a huge change. Well, let’s see what Luigi has to say about this … 😉

Here is the new piecewise yield curve declaration, compared to the original version now an even nicer example of what you can do with templates in C++ …

template <template <class> class Traits, template <class> class Interpolator,
          template <class, class> class Bootstrap = IterativeBootstrap,
          class T = Real>
class PiecewiseYieldCurve
    : public Traits<T>::template curve<Interpolator>::type,
      public LazyObject {
...


Not only the Interpolator, but also the Traits (which are the Discount, Forward, ZeroYield kind of specifications) and the Bootstrap class (IterativeBootstrap or LocalBootstrap) now have a new template parameter.

But to summarize, from outside everything looks and is used as before. If you want to use an AD double type you just have to add a new template parameter to the PiecewiseYieldCurve construction, that’s all.

Another limitation I came across is that there are no templated typedefs in C++ (that is, in C++03; C++11 alias templates would do the job, but QuantLib does not require a C++11 compiler). So for example the RateHelper typedef changes to

template <class T> struct RateHelper_t {
    typedef BootstrapHelper<YieldTermStructure_t<T>, T> Type;
};


and the final rate helper types (classic and AD’ized versions respectively) are retrieved via

typedef RateHelper_t<Real>::Type RateHelper;
...


i.e. with an additional ::Type suffix.

Regarding templated functions, the type can generally be deduced from the function parameters, so that we can still write code like

Real z = 1.0;
if (close(z, 0.0)) {}


(an exceptionally dumb code example), although the declaration of close has changed to

template<class T> bool close(T x, T y);


However this is not totally backward compatible, because the expression close(z, 0) was legal code before, and now the compiler goes “hey, I need to deduce a type T here and you are giving me an int and a double, I am confused, I can not work like this, please focus a bit on what you are doing, will you”. This was silently resolved before by an implicit conversion of 0 to a double. There was actually one line in the test suite (or the examples, I don’t remember) which had exactly this problem. Easy to solve (change 0 to 0.), but code that compiled before now doesn’t, which is always a major thing. We could say that the code was sloppy ever since and merely happened to compile in the past.

A deeper and more interesting issue in the same direction is this one here:

template <class T = Real>
void setCouponPricer(
    const typename Leg_t<T>::Type &leg,
    const boost::shared_ptr<FloatingRateCouponPricer_t<T> > &pricer) {
...


A utility function which sets coupon pricers. You are already not scared anymore by the first argument’s typename and ::Type stuff, are you? Now look in the Gaussian1dModel.cpp example file, lines 549ff:

...
const Leg &leg0 = underlying4->leg(0);
const Leg &leg1 = underlying4->leg(1);
boost::shared_ptr<CmsCouponPricer> cmsPricer =
    boost::make_shared<LinearTsrPricer>(swaptionVol, reversionQuote);
boost::shared_ptr<IborCouponPricer> iborPricer =
    boost::make_shared<BlackIborCouponPricer>();

setCouponPricer<Real>(leg0, cmsPricer);
setCouponPricer<Real>(leg1, iborPricer);
...


Everything looks usual here, although Leg, CmsCouponPricer, IborCouponPricer, BlackIborCouponPricer are again all merely typedefs for the new template versions with T = Real as above. Maybe interesting to mention: LinearTsrPricer is not converted yet, so old code can be mixed with new code, even if both worlds are connected by an inheritance relationship. Which is good.

But why does setCouponPricer need the explicit specification of the double type? This is because the compiler does not recognize that a boost::shared_ptr<U> is a specialization of a boost::shared_ptr<V>, even if U (CmsCouponPricer) specializes V (FloatingRateCouponPricer). He (or she? … C++ compilers are more likely female creatures, aren’t they?) doesn’t know that boost::shared_ptr is some kind of a pointer; it is just some user defined class. You may wonder (at least I did) why this kind of code worked until now anyway in other, non-templated-function contexts. The reason is that implicit conversions are defined for these cases in the boost::shared_ptr class. But these are not even considered by the compiler when deducing template types. Bad luck, she is a bit picky here. Actually this problem can be worked around by using a generic boost::shared_ptr<G> second argument in setCouponPricer (which accepts everything that is a boost shared pointer) and some very fancy generic programming stuff to check at compile time that G actually is a FloatingRateCouponPricer_t. This is something for later thoughts and at least one blog post of its own.

Time for the full code example. I bootstrap a toy curve from 3 deposit quotes with maturities 1m, 2m, 3m and compute the sensitivity of the 3m zero rate to the underlying deposits’ market quotes. Depending on the self-documenting macro YES_I_WANT_TO_USE_AD, either AD or a simple finite difference approximation is used in the example.

#include <ql/quantlib.hpp>
#include <boost/assign/std/vector.hpp>

using namespace QuantLib;
using namespace boost::assign;

// comment or uncomment this macro
//#define YES_I_WANT_TO_USE_AD

// cppad utilities
#ifdef YES_I_WANT_TO_USE_AD
#include <ql/qlcppad.hpp>
#endif

int main() {

// define the double type to be used

#ifdef YES_I_WANT_TO_USE_AD
    std::cout << "Example with AD enabled" << std::endl;
    typedef CppAD::AD<double> dbltype;
#else
    std::cout << "Example with AD disabled, use finite differences"
              << std::endl;
    typedef double dbltype;
#endif

    // some typedefs to keep notation simple

    typedef RateHelper_t<dbltype>::Type RateHelperAD;
    typedef DepositRateHelper_t<dbltype> DepositRateHelperAD;
    typedef SimpleQuote_t<dbltype> SimpleQuoteAD;
    typedef Quote_t<dbltype> QuoteAD;

    // the reference date

    Date referenceDate(2, January, 2015);
    Settings::instance().evaluationDate() = referenceDate;

    // declare the independent variables (sample deposit quotes)
    std::vector<dbltype> x(3);
    x[0] = 0.0035;
    x[1] = 0.0090;
    x[2] = 0.0121;

#ifdef YES_I_WANT_TO_USE_AD
    // start taping the operation sequence
    CppAD::Independent(x);
#endif

    auto rate1m = boost::make_shared<SimpleQuoteAD>(x[0]);
    auto rate2m = boost::make_shared<SimpleQuoteAD>(x[1]);
    auto rate3m = boost::make_shared<SimpleQuoteAD>(x[2]);

#ifndef YES_I_WANT_TO_USE_AD
    // bumped quotes for the finite difference approximation
    Real h = 1e-4;
    auto rate1mp = boost::make_shared<SimpleQuoteAD>(x[0] + h);
    auto rate2mp = boost::make_shared<SimpleQuoteAD>(x[1] + h);
    auto rate3mp = boost::make_shared<SimpleQuoteAD>(x[2] + h);
#endif

    // build a piecewise curve

    RelinkableHandle<QuoteAD> quote1m(rate1m), quote2m(rate2m),
        quote3m(rate3m);

    auto dp1m = boost::make_shared<DepositRateHelperAD>(
        quote1m, 1 * Months, 2, TARGET(), ModifiedFollowing, false,
        Actual360());

    auto dp2m = boost::make_shared<DepositRateHelperAD>(
        quote2m, 2 * Months, 2, TARGET(), ModifiedFollowing, false,
        Actual360());

    auto dp3m = boost::make_shared<DepositRateHelperAD>(
        quote3m, 3 * Months, 2, TARGET(), ModifiedFollowing, false,
        Actual360());

    std::vector<boost::shared_ptr<RateHelperAD> > instruments;
    instruments += dp1m, dp2m, dp3m;

    PiecewiseYieldCurve<ZeroYield, Linear, IterativeBootstrap, dbltype> curve(
        referenceDate, instruments, Actual365Fixed());

    std::vector<dbltype> y(1);

    Real t3 = curve.timeFromReference(dp3m->latestDate());
    //Real t3 = 2.5 / 12.0;

    y[0] = curve.zeroRate(t3, Continuous).rate();

    std::cout << std::setprecision(16);
    std::cout << "zero rate = " << y[0] << std::endl;

#ifdef YES_I_WANT_TO_USE_AD
    // define the operation sequence and compute the gradient
    // with one reverse sweep
    CppAD::ADFun<Real> f(x, y);
    std::vector<Real> dw(3), w(1, 1.0);
    dw = f.Reverse(1, w);
    std::cout << "gradient = (" << dw[0] << "," << dw[1] << "," << dw[2] << ")"
              << std::endl;
#else
    // finite difference values: bump each quote in turn and reprice
    quote1m.linkTo(rate1mp);
    Real y0_1 = curve.zeroRate(t3, Continuous).rate();
    quote1m.linkTo(rate1m);
    quote2m.linkTo(rate2mp);
    Real y0_2 = curve.zeroRate(t3, Continuous).rate();
    quote2m.linkTo(rate2m);
    quote3m.linkTo(rate3mp);
    Real y0_3 = curve.zeroRate(t3, Continuous).rate();
    quote3m.linkTo(rate3m);
    std::cout << "gradient = (" << (y0_1 - y[0]) / h << "," << (y0_2 - y[0]) / h
              << "," << (y0_3 - y[0]) / h << ")" << std::endl;
#endif

    return 0;
}


Note that the bootstrap process itself (which is a zero search using the Brent solver) is differentiated when using AD!

The output of this code with and without AD enabled is


Example with AD disabled, use finite differences
zero rate = 0.01188296345959088

Example with AD enabled
zero rate = 0.01188296345959088



The interpretation is that the 3m (continuously compounded Actual/365Fixed) zero rate grows by 0.04, 0.0, 0.97 basis points if the 1m, 2m, 3m deposit market quotes (simply compounded, Actual/360) grow by 1 basis point, respectively. Note that the sensitivity to the 1m quote comes from the two fixing days involved in the deposits.

Let’s check whether the linear interpolation is correctly taken into account in the derivatives, too. This is also something that really requires AD: while you could still do it by hand for linear interpolations, you wouldn’t really want to differentiate the value of, say, a cubic spline at some arbitrary point by the spline’s values on its pillars, even if this is still realistic to write down. For our test, we replace t3 by two and a half months (see the commented line in the code). We get


Example with AD disabled, use finite differences
zero rate = 0.01003550346419732