Adjoint Greeks

In this second post I want to write about my very first steps in automatic differentiation (AD), applied of course to quantitative finance, where derivatives are called greeks, and with QuantLib of course. AD is a topic Ferdinando Ametrano mentioned at dinner during the workshop in Düsseldorf a few weeks ago, and indeed it sounded interesting: take a swap, which can have hundreds of deltas nowadays, and compute all of them in just one sweep, with the complexity of at most 4 (four, vier, quattro) present value calculations. Sounds like sorcery, but actually it is nothing more than an application of the chain rule of differentiation, which says

\frac{d}{dx} (f \circ g)(x) = \left(\frac{d}{dy} f\right)(g(x)) \cdot \frac{d}{dx} g(x)

which means that if you want to compute the derivative of the result of a chain of computations with respect to an input variable, you can do so by computing the derivatives of the single computations and combining the results appropriately. Note that the formula is more general than it might look at first sight: it also holds for functions of several variables, possibly with several outputs. The derivatives are then matrices (Jacobi matrices) and the product on the right hand side is a matrix multiplication.
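A tiny concrete instance may help: take g: \mathbb{R} \rightarrow \mathbb{R}^2, g(x) = (x^2, \sin x), and f: \mathbb{R}^2 \rightarrow \mathbb{R}, f(u,v) = uv. The two Jacobians are then a 1 \times 2 row and a 2 \times 1 column, and the chain rule reduces to the familiar product rule:

```latex
\frac{d}{dx}\, f(g(x))
  = \underbrace{\begin{pmatrix} v & u \end{pmatrix}\Big|_{(u,v) = g(x)}}_{1 \times 2}
    \cdot
    \underbrace{\begin{pmatrix} 2x \\ \cos x \end{pmatrix}}_{2 \times 1}
  = 2x \sin x + x^2 \cos x
```

which is indeed the derivative of x^2 \sin x.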

Actually my analysis professor at university introduced the derivative as a bounded linear operator between Banach spaces (which can have infinite dimensions, i.e. infinitely many input and output variables, which do not even need to be countable, boom, brain overflow in the second year …) approximating the function in question up to a remainder of order o(\lVert h \rVert). Pure fun, only outperformed by his colleague who started the first lesson in linear algebra by defining what a semi-group is. It is only now, after working in banks for more than 15 years, that I really appreciate this kind of stuff and wish I had stayed in academia. Well, my problem, not yours.

Anyway. There are a lot of AD frameworks for C++, and a good starting point is surely the website http://www.autodiff.org/. What they all do is take your (C++) program, look at each of the operations and how they are chained together, and then compute derivatives using exact formulas in each step.

This is exactly what we all learned at school, namely how to differentiate sums, products and quotients of functions, what the derivative of x^n is, or how to differentiate more complicated functions like e^x or \sin(x), and so on. And how to put everything together using the chain rule! In AD language this is more precisely called forward mode differentiation. There is also a backward (reverse) mode working from the outside to the inside of a chain of functions. This is a bit unusual, and it is useful to work through some examples to get the idea, but in the end it is also nothing more than applying the chain rule. Which mode should be used depends on the dimensions n and m of the function

f: \mathbb{R}^n \rightarrow \mathbb{R}^m

If m is big and n is small, you should use the forward mode to compute the derivatives of the m output values by the n input values in the most efficient way. If m is small and n is big, you should use the reverse mode. In our application above, computing hundreds of interest rate deltas for a swap, m is one and n is a few hundred, so this is a case for the reverse mode.

There are two main procedures by which the frameworks do automatic differentiation. One way is source code transformation (SCT). I did not look into this, but as far as I understand, the idea is that your source code is enriched in order to gather the necessary information for the derivatives computation. The other way is operator overloading (OO). This means that your standard numeric type, typically our beloved 53-bit

double

is replaced by a special type, say (notation stolen from the framework I will introduce below)

AD<double>

and each operation (like

+, -, *, / 

) is overloaded for this new type, so that during a computation with the original code the operation sequence can be taped and afterwards used for the derivatives computation. For the younger readers among you who do not know what "taping" means, this refers to this beautiful kind of device http://upload.wikimedia.org/wikipedia/commons/c/c7/TG1000.ogv (I could watch this movie for hours …). The red button at the lower right corner of the device is for recording, I guess. Usually you had to press two buttons, "record" and "play", at the same time to start recording. Pressing only the record button would not work; you had to press the record button a little bit earlier than the play button, because in play mode the record button was blocked.
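To make the operator overloading idea concrete, here is a minimal sketch of a forward mode AD type, a "dual number" that propagates derivatives immediately. This is not how CppAD works internally (CppAD records a tape instead of differentiating on the fly), and all names here are made up for illustration:

```cpp
#include <cassert>
#include <cmath>

// toy forward mode AD type: carries a value together with its
// derivative w.r.t. one chosen input variable
struct Dual {
    double val; // function value
    double der; // derivative w.r.t. the input variable
};

inline Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.der + b.der}; }
inline Dual operator*(Dual a, Dual b) {
    // product rule: (fg)' = f'g + fg'
    return {a.val * b.val, a.der * b.val + a.val * b.der};
}
inline Dual sin(Dual a) { return {std::sin(a.val), std::cos(a.val) * a.der}; }

// generic code written once, usable with double or Dual
template <class T> T squareTimesSin(T x) {
    using std::sin; // double uses std::sin, Dual its overload above
    return x * x * sin(x);
}

// derivative of x^2 sin(x) at x0, obtained by seeding dx/dx = 1
inline double derivativeAt(double x0) {
    Dual x{x0, 1.0};
    return squareTimesSin(x).der;
}
```

derivativeAt reproduces the analytic derivative 2x \sin x + x^2 \cos x to machine precision, without that formula ever being written down.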

Now the hard part of this post begins, I am trying to get some AD running for a relevant example. The framework of my choice is CppAD (http://www.coin-or.org/CppAD/). An exciting yet easy enough first example is probably the Black76 formula. This is implemented in blackformula.hpp and the interface looks as follows

/*! Black 1976 formula
    \warning instead of volatility it uses
             standard deviation,
             i.e. volatility*sqrt(timeToMaturity)
*/
Real blackFormula(Option::Type optionType, Real strike, 
                  Real forward,
                  Real stdDev, 
                  Real discount = 1.0,
                  Real displacement = 0.0);

The first step is to make room for our AD double type. A possible solution is to turn the original implementation into a templated one like this

/*! Black 1976 formula, templated */
template <class T = Real>
T blackFormula(Option::Type optionType, T strike, 
                T forward, T stdDev,
                T discount = 1.0, 
                T displacement = 0.0) {
/* ... */
}

That’s not all, unfortunately. In the function body we have a line

T d1 = log(forward / strike) / stdDev + 0.5 * stdDev;

In order to have the logarithm differentiated correctly we have to make sure that if T is the AD type, the log function is taken to be the special implementation in the CppAD library (and the std implementation otherwise). To do so I made both implementations visible by importing them into the current scope:

using std::log;
using CppAD::log;

Now depending on T being the standard double or the AD type, the appropriate implementation is used. First problem solved.
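The mechanism can be sketched with a toy type (hypothetical names, not CppAD's): once std::log is visible in the local scope, an unqualified call resolves to std::log for plain doubles and, via argument-dependent lookup, to the type's own overload for the custom type. In the QuantLib patch both implementations are imported explicitly, with the same effect:

```cpp
#include <cassert>
#include <cmath>

// toy stand-in for an AD type living in its own namespace,
// with its own log() overload (hypothetical names, not CppAD)
namespace toyad {
struct Value {
    double v;
};
inline Value log(Value x) { return {std::log(x.v)}; }
}

// templated code calls log unqualified: for double the imported
// std::log is used, for toyad::Value the overload toyad::log is
// found via argument-dependent lookup
template <class T> T mylog(T x) {
    using std::log;
    return log(x);
}
```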

There is another function used in the Black formula, the cumulative normal, with no implementation in CppAD. Well, no problem, there is at least the error function, so I can just express the cumulative normal through the error function. The first tests were disappointing: for a simple call I got slightly different premia (about 0.01 basis points difference), depending on whether the conventional double or the AD double was used. The reason became clear soon (after some hours of testing in the wrong direction): in the CppAD implementation, the error function uses an approximation which is fast, but inaccurate (relative error of 10^{-4}).
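The substitution rests on the identity \Phi(x) = \frac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2})). A quick sanity check with std::erf (which, unlike the approximation mentioned above, is accurate to machine precision):

```cpp
#include <cassert>
#include <cmath>

// cumulative standard normal expressed through the error function:
// Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
inline double cumulative_normal(double x) {
    return 0.5 * (1.0 + std::erf(x / std::sqrt(2.0)));
}
```

For example, cumulative_normal(0.0) gives 0.5, and cumulative_normal(1.96) is close to 0.975, the familiar quantile.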

Not that I very much care for super precise results in the context of finance, but the acceptance of the implementation would probably suffer a bit when Black prices are not matching reference results I guess …

Ok, I wrote an email to the author of CppAD and complained. He promised to do something about it. In the meantime I decided to help myself by converting the QuantLib implementation of the error function into a templated version as well and using this instead of the CppAD error function. The conversion process already felt a bit more routine now, so I guess the rest of the library will go smoothly. The code snippet in the Black formula then looks as follows:

ErrorFunction<T> erf;
T nd1 = 0.5 + 0.5 * erf(optionType * d1 * M_SQRT1_2);
T nd2 = 0.5 + 0.5 * erf(optionType * d2 * M_SQRT1_2);

Note that the error function also gets the template parameter T, which will be double or the AD double in the end. Now, before we start to write some client code using the new Black formula, there is a subtlety I want to mention. Functions are often not simple chains of operations but sometimes contain conditional expressions, like

z = x > y ? a : b;

The max function is another example of a (disguised) expression of this kind. Loops too, though they are actually an even more complex variant.

The thing about these conditional expressions is the following: you can just leave them as they are under automatic differentiation, no problem. But as I already mentioned, the process has two steps. The first is to tape the operation sequence; this is done during an evaluation of the function at certain fixed input parameters. The second step is the derivative computation; the result is the derivative evaluated at the original input parameters.

But you can evaluate the derivative at a different set of input parameters without retaping the operation sequence! This is possible only if the function evaluation is completely captured by the CppAD tape recorder, including all possible paths taken depending on conditional expressions. This is supported, but you need to adapt your code and replace the conditionals by special functions. In the Black function we have for example

if (stdDev==0.0)
      return std::max((forward-strike)*optionType,
                      Real(0.0))*discount;

This needs to be replaced by

T result0a = max<T>((forward - strike) * optionType,
                     0.0) * discount;

(in particular not just terminating with returning the value for the border case of zero standard deviation) and at the end of the function

T ret = CondExpEq(stdDev, T(0.0), result0a,
                  CondExpEq(T(strike), T(0.0), result0b,
                            result));
return ret;

Here another conditional is nested for the special case of zero strike. Finally, the max function above is also implemented using the CppAD conditional CondExpGt, because it is not natively present in CppAD:

namespace CppAD {

    template<class Base> CppAD::AD<Base> 
           max(CppAD::AD<Base> x,CppAD::AD<Base> y) {
                return CppAD::CondExpGt(x,y,x,y);
           }

}

In order to keep the code running with plain double types we have to add implementations of the conditionals for regular doubles, like this

// used for cppad-ized function implementations
inline double CondExpGt(double x, double y, 
       double a, double b) { return x >= y ? a : b; }
inline double CondExpEq(double x, double y, 
       double a, double b) { return x == y ? a : b; }
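A quick check (names as in the snippet above, with a hypothetical mymax helper) that the plain-double fallbacks reproduce the original branching logic:

```cpp
#include <cassert>

// plain-double fallbacks as in the snippet above
inline double CondExpGt(double x, double y, double a, double b) {
    return x >= y ? a : b;
}
inline double CondExpEq(double x, double y, double a, double b) {
    return x == y ? a : b;
}

// max expressed through the conditional, mirroring the CppAD
// version given earlier (mymax is a name made up for this check)
inline double mymax(double x, double y) { return CondExpGt(x, y, x, y); }
```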

The nice thing about all this is that later on we could tape the operation sequence once and store it for reuse in all subsequent computations of the same function.

I just started trying to make QuantLib cooperate with CppAD, so the things above are initial ideas to keep QuantLib backward compatible on the one hand and avoid code duplication on the other. And, to be realistic, it is a hell of a lot of work, I guess, with many more complications than can be seen in this little example so far.

But let’s try out what we have done so far. Here is a code example that works with my new adjoint branch https://github.com/pcaspers/quantlib/tree/adjoint

#include <iostream>
#include <vector>
#include <ql/quantlib.hpp>

#include <ql/qlcppad.hpp>

using namespace QuantLib;

int main(void) {

    // parameters

    Real tte = 10.0;
    Real sqrtt = std::sqrt(tte);
    Real strike = 0.03, forward = 0.03, volatility = 0.20;

    // declare the independent variables for which
    // we want derivatives

    std::vector<CppAD::AD<Real>> x(2);
    x[0] = forward;
    x[1] = volatility * sqrtt;
    CppAD::Independent(x);

    // tape the operation sequence

    std::vector<CppAD::AD<Real>> y(1);
    y[0] = blackFormula2<CppAD::AD<Real>>(Option::Call, 
                                        strike, x[0], x[1]);
    CppAD::ADFun<Real> f(x, y);

    // compute the partial derivatives using reverse mode

    std::vector<Real> dw(2), w(1, 1.0);
    dw = f.Reverse(1, w);

    // output the results

    std::cout << "price = " << y[0] << " delta = " << dw[0]
              << " vega = " 
              << dw[1] / sqrtt * std::sqrt(100.0) 
              << std::endl;

    // cross check the results against the classic 
    // implementation

    BlackCalculator c(Option::Call, strike, forward, 
                      volatility * sqrtt);
    std::cout << "price = " << c.value() << " delta = " 
              << c.deltaForward()
              << " vega = " << c.vega(tte) << std::endl;

    return 0;
}

I kept the original blackFormula function as is for the moment (for testing purposes) and implemented the new, templated version as blackFormula2. I declared the forward and the standard deviation inputs as independent variables w.r.t. which I want partial derivatives (i.e. forward delta and vega). The strike input parameter (as well as the discount and displacement, which are invisible here) is kept as a regular double variable. Note that in the Black formula implementation it is converted to the CppAD double type, but not included as an independent variable in the operation sequence. We could include it though, then getting the derivative of the call price by the strike as well (which is also useful, because it is the negative of the digital price).

The results are compared against the original implementation in BlackCalculator. The output of the test code is

price = 0.007445110978624521 delta = 0.6240851829770754 vega = 0.03600116845290408
price = 0.007445110978624521 delta = 0.6240851829770755 vega = 0.03600116845290408

Good, it works. And we did not need to implement the analytical formulas for forward delta and vega to get exactly the same results up to machine epsilon. Sorcery. Needless to say, CppAD can compute higher order derivatives as well, so gamma, vanna, volga, speed, charm and color are just a stone's throw away.

Have a nice Christmas everyone, see you back healthy in 2015!

Simulated Annealing Optimization

First of all, as a warning: if you want to read a good blog about QuantLib you should probably go to other places, like this one http://implementingquantlib.blogspot.de. However, I thought it could be fun to relax from coding all the time by also writing a bit about it. For this first post I take one of the "Erlkönige" from my presentation at the QuantLib User Meeting 2014 http://quantlib.org/qlws14.shtml (hey, why am I not in the picture?), namely a new optimizer class https://github.com/lballabio/quantlib/blob/master/QuantLib/ql/experimental/math/simulatedannealing.hpp. (If you don't know what an Erlkönig is, look at slide two of my presentation.) But let's start by defining a target function. This is done by deriving from CostFunction and implementing (at least) the value and values methods (which are quite similar in our one dimensional example here)

    class target : public CostFunction {
      public:
        target() {}
        Real value(const Array &a) const {
            Real x = a[0];
            return x * x / 100.0 + 0.3 * sin(x) * sin(x);
        }
        Disposable<Array> values(const Array &x) const {
            Array res(1);
            res[0] = value(x);
            return res;
        }
    } aTarget;

This function looks like this (you should click on the image for a better view)

simann_target

It has many local minima, e.g. some near -6, -3, +3, +6, and one unique global minimum at 0. Many classic optimizers wouldn't find this global minimum unless you give them a good start value lying near the solution. Our new experimental global optimizer in QuantLib follows chapter 10.9 in Numerical Recipes in C. It is based on the Nelder-Mead algorithm (aka Simplex in the library, not to be confused with the linear programming method) and adds exponentially distributed noise to the target function, which does the trick of finding the global minimum with probability one when applied carefully.

But let’s start with the deterministic version of the optimizer. You can create the optimizer with

SimulatedAnnealing<> sa(0.2,t0,decay,steps);

where I set t0=0.0, decay=0.0 and steps=1 (which just means no noise). The first parameter is the simplex lambda used to construct the vertices of the simplex, which in general are the start value of the optimization (see below), this value plus (1,0,0,…) times lambda, this value plus (0,1,0,…) times lambda, and so on. Since we have a one dimensional problem, our simplex consists of two vertices only.

I start the optimization at x=4.8, a bad start value as we will see soon.

    Array initial(1);
    initial[0] = 4.8;

I put no constraint on the problem

    NoConstraint nc;

allow for a lot of iterations (1000000, that's one million, yes!), require 500 stationary state iterations before assuming convergence, and set the epsilons for root, function value and gradient to 1E-8

    EndCriteria ec(1000000,500,1E-8,1E-8,1E-8);

As a side note, it is up to the actual optimizer which of these criteria (if any at all!) are used. The simulated annealing optimizer checks for max and stationary state iterations, comparing the simplex size with the root epsilon, and ignores the function and gradient epsilons. As another example, the Levenberg-Marquardt implementation takes no less than three tolerance parameters in its own constructor while only using max iterations and the function epsilon from the end criteria, ignoring stationary state iterations and the other two tolerances.

Now create a problem out of the target function, the constraint and the start value

    Problem p(aTarget,nc,initial);

and we are ready to run the optimization

    EndCriteria::Type ret = sa.minimize(p,ec);

You can retrieve the best found value for the independent variable by p.currentValue() and the function value by p.functionValue(). The variable ret contains the status of the optimization which can be one of the following enum values defined in EndCriteria

        enum Type {None,
                   MaxIterations,
                   StationaryPoint,
                   StationaryFunctionValue,
                   StationaryFunctionAccuracy,
                   ZeroGradientNorm,
                   Unknown};

Self-explanatory in a way, aren't they?

We visualize the optimization run by plotting one (the first) vertex of the simplex (plotting both wouldn’t be instructive) for each step in the optimization.

simann_det

I only plot the first 25 steps of the optimization because after that it gets boring (you see the pattern, don't you). As I said, the start value was bad, just a bit left of the top of the second hill to the right of the global minimum. Function values are colored here: the minimum (at 0) is black, the next best local minima are blue (near ±3), the next purple (near ±6). The algorithm (the green dots) walks over the top of the hill and then runs into the valley right of +6, converging nicely there, but only to a local minimum.

The green dots lie exactly on the surface of the target function, indicating that the function value without any add on is used for optimization.

Now the idea of simulated annealing comes into play. Instead of zero noise we start with a temperature of 0.05. This means "noise" is added to the target function value during the optimization. The noise is defined to be exponentially distributed with parameter 1/temperature, i.e. with expectation and standard deviation equal to the temperature.

This lifts our green dots from the surface in a random way and in particular helps them get out of valleys that might turn out to be local ones only. During the optimization the temperature is decreased, converging to zero, so that the deeper valleys become visible to the algorithm while the more shallow ones are still filled up by the noise. If we start with a temperature high enough and decrease it slowly enough, we can hope to find the global minimum this way. Of course the art here is to find a good scheme for the cooling and to balance this scheme against the number of necessary function evaluations.
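The scheme can be condensed into a minimal, self-contained sketch (this is not the QuantLib implementation; the move rule, names and parameters are simplified for illustration). Exponentially distributed noise with mean equal to the current temperature lifts the stored function value and lowers the trial value in the comparison, and the temperature decays geometrically:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// target function from the text: local minima near +-3, +-6
// and a unique global minimum at 0
inline double target(double x) {
    return x * x / 100.0 + 0.3 * std::sin(x) * std::sin(x);
}

// minimal simulated annealing sketch (not the QuantLib class):
// random walk proposals, noisy acceptance test, geometric cooling
// by a factor (1 - decay) per step; t0 must be positive
inline double anneal(double x0, double t0, double decay, int steps,
                     unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> move(-1.0, 1.0);
    double x = x0, fx = target(x0);
    double best = x0, fbest = fx;
    double t = t0;
    for (int i = 0; i < steps; ++i) {
        double y = x + move(rng);
        double fy = target(y);
        // exponential noise with expectation t lifts the stored value
        // and lowers the trial value, so uphill moves are sometimes
        // accepted while the temperature is still high
        std::exponential_distribution<double> noise(1.0 / std::max(t, 1e-12));
        if (fy - noise(rng) < fx + noise(rng)) {
            x = y;
            fx = fy;
        }
        if (fy < fbest) { // track the best point seen so far
            best = y;
            fbest = fy;
        }
        t *= 1.0 - decay; // cooling
    }
    return best;
}
```

With a start value of 4.8, a starting temperature of 0.25 and decay 0.0001, roughly as in the experiments below, this toy version typically ends up near the global minimum at 0, while with the temperature set very low it tends to get stuck in the valley it starts in.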

Back to our first try with starting temperature 0.05. We choose to decrease this temperature by a factor of 1 - decay, with decay = 0.0001, every step (steps = 1). We plot the first 2000 steps of the algorithm's run (wow, note how many steps are performed compared to the deterministic case).

simann_1

After around 500 steps, thanks to the noise, the dots jump over to the main valley and stay there, thus finding the global minimum! The light blue grid floating over the target function surface visualizes the expected noise added to the original target function.

I should mention here how you can explicitly specify the random number generator used to generate the noise (it has to be one with a next() method returning a U(0,1) random variable).

    SimulatedAnnealing<> sa(0.2,t0,decay,steps,MersenneTwisterUniformRng(42));

would choose the Mersenne twister rng (which is also the default if nothing is given), with seed 42. The default seed is 0, meaning that the system timer is used for initialization, which is bad if you want to reproduce results.

Let’s do the same again, but now cooling down the whole thing a bit faster (decay = 0.0003). The result is as follows

simann_2

Oops. The dots jump over again, but only to the next valley, again yielding only a local minimum. We cooled down too fast to find the global minimum in this case.

In practice one would probably start with a much higher temperature, like 0.25. With that and the same decay rate of 0.0001 per step we get

simann_3

We are converging to the global minimum again! However, we already have 30000 steps plotted here, and we need even more to actually converge (like 400000 with these parameters!). We should surely fine-tune our parameters before using them in a real life application.

Looking at the same picture more from a bird’s view we see the structure more clearly

simann_4

While all five valleys are covered in the first few hundred steps, after a while only the three inner ones are visited, and finally the algorithm stays in the innermost, best valley.

That’s all for today, it was fun to write this. Anyone read until here ?

Appendix. Here is the full code example, if you want to try it by yourself.

#include <ql/quantlib.hpp>

using namespace QuantLib;

int main(int, char *[]) {

    class target : public CostFunction {
      public:
        target() {}
        Real value(const Array &a) const {
            Real x = a[0];
            return x * x / 100.0 + 0.3 * sin(x) * sin(x);
        }
        Disposable<Array> values(const Array &x) const {
            Array res(1);
            res[0] = value(x);
            return res;
        }
    };

    target aTarget;

    double t0 = 0.25;
    double dec = 0.0001;
    int steps = 1;

    SimulatedAnnealing<> sa(0.2, t0, dec, steps, MersenneTwisterUniformRng(42));

    Array initial(1);
    initial[0] = 4.8;

    NoConstraint nc;
    EndCriteria ec(1000000, 500, 1E-8, 1E-8, 1E-8);

    Problem p(aTarget, nc, initial);

    EndCriteria::Type ret = sa.minimize(p, ec);

    std::cout << "found x=" << p.currentValue()[0]
              << " / y=" << p.functionValue() << " with ret code " << ret
              << std::endl;

    return 0;
}

The output running this should look like this

found x=-7.10543e-14 / y=1.5651e-27 with ret code StationaryPoint