The Pippi Langstrumpf Principle

I received some complaints that my blog lacks seriousness here and there (female compilers), contains misleading references (the Bourianochevsky and Jarulamparabalam formula), or presents non-quant-related topics (Brad Mehldau’s CDs). The title of the blog itself is offensive. Quant work is neither fun nor a game. I apologize for these incidents. It won’t happen again.

Pippi Langstrumpf has many things in common with us quants. She has superpowers. She is incredibly rich. And if she doesn’t like the real world as it is, she just changes it. That’s what we do, too. Pippi calls that “ich mach mir die Welt, widewide wie sie mir gefällt” (“I make the world the way I like it”).

For example, we do not like negative rates? No problem, we just pretend they are one percent (EUR swaptions as of April 2015) or three percent (now the shift for EUR caps!) higher than they actually are. That was the topic of one of my recent posts.

And we do other incredible things. We can change the probability of events. For example, we can change the probability of winning the lottery from way too low to, say, 90% or 95% (we don’t actually do this, because we are notoriously overpaid and have enough money already). That’s pretty close to Pippi’s “zwei mal drei macht vier” (“two times three makes four”).

The only thing we cannot do is turn impossible events into possible ones. All this is thanks to the Radon-Nikodym theorem, the Girsanov theorem and the Novikov condition. All these names sound Russian, but in fact Nikodym was Polish and Radon Austrian.
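To see the lottery example in a discrete toy version (my own notation, not taken from any of the theorems above): if winning has real-world probability $q > 0$, define a new measure $Q$ by the Radon-Nikodym density $\frac{dQ}{dP} = \frac{0.9}{q} 1_{\{\text{win}\}} + \frac{0.1}{1-q} 1_{\{\text{lose}\}}$, so that $Q(\text{win}) = E^P\left( \frac{dQ}{dP} 1_{\{\text{win}\}} \right) = 0.9$. An event with $P$-probability zero keeps probability zero under $Q$, because the density only ever reweights outcomes that can actually occur, and that is exactly why impossible events stay impossible.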

This changing of measures is completely standard in pricing. But there are some places where the result of a computation still depends on the choice of measure. One such place is the computation of potential future exposure. Another is the extraction of exercise probabilities for calls.

I would like to illustrate the latter point. Let’s fix a setting and write $\nu = P(0,T) E^T \left( \frac{\Pi(t)}{P(t,T)} \right)$

as a pricing formula under the so-called $T$-forward measure for a payoff $\Pi$ observed at $t \leq T$. What always happens in pricing is this: you express the value of the payoff in units of a numeraire, choose a measure that makes this expression a martingale, and take the expectation. That gives the price in units of the numeraire again, so you multiply by today’s numeraire value to get the price in the usual metric. The numeraire of the $T$-forward measure is the zero bond with maturity $T$.
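In generic numeraire terms the same recipe reads: for a numeraire $N$ and its associated measure, $\frac{\Pi(0)}{N(0)} = E^N\left( \frac{\Pi(t)}{N(t)} \right)$, so the price is $\nu = N(0)\, E^N\left( \frac{\Pi(t)}{N(t)} \right)$. Choosing $N(t) = P(t,T)$ recovers the formula above, since then $N(0) = P(0,T)$. (This is just the standard change-of-numeraire bookkeeping, written out because the generic $N$ will reappear below.)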

Now assume $\Pi$ represents a long option with expiry t. Sometimes people compute $p := P^T( \Pi(t) > 0)$

and label this exercise probability for the option. Does that make sense? Example:

We consider the valuation date 30-04-2013 and price a European at-the-money payer swaption with expiry in $5$ years on an underlying swap of $10$ years. The implied volatility is $20\%$ and the yield curve is flat at $2\%$ (continuously compounded rate). The mean reversion in a Hull White model is set to $1\%$ and the volatility is calibrated to $0.00422515$ in order to reprice the market swaption.

Now we compute $p$ for different choices of $T$. We get 48% for $T=16$, 36% for $T=60$ and 21% for $T=250$. Of course we use QuantLib for this purpose. I will not give code examples today…

Obviously the measures associated with each $T$ are very different.

One can try to fix this. We define a measure-invariant exercise probability as $\pi := \frac{N(0)}{P(0,t)} E^N \left( \frac{1_{\Pi(t)>0}}{N(t)} \right)$

which is a forward digital price for a digital triggered by $\Pi(t) > 0$. This formula holds for an arbitrary numeraire $N$. The naive exercise probability in this notation reads $p = p_N := E^N(1_{\Pi(t)>0})$

which, on the other hand, is a quantity dependent on $N$, as the subscript denotes. Actually both notions coincide for the special choice of the $t$-forward measure: $\pi = E^t ( 1_{\Pi(t)>0} )$

We compare $\pi$ and $p$ under different $T$-forward measures in the example above.

   T       \nu         p        \pi
  16    0.0290412  0.478684  0.481185
  30    0.0290385  0.435909  0.484888
  60    0.0290381  0.364131  0.480800
 100    0.0290412  0.300560  0.480948
 250    0.0290415  0.211921  0.479532


First of all, $\nu$ is the same under all pricing measures up to numerical differences, as expected. The naive exercise probability $p$, on the other hand, is obviously not well defined: the different measures do not just imply slight differences in the results, the results are completely different. The alternative definition $\pi$, in contrast, is measure independent.

The following picture shows the underlying npv as a function of the Hull White model’s state variable (expressed in normalized form, i.e. with zero expectation and unit variance). For bigger $T$ the payoff is shifted to the right, and the option is exercised less often (only for bigger $z$’s).

How is that made up for in pricing? This picture shows the factor that is multiplied with $\Pi$ before taking the expectation to get the final NPV. The correction is rather mild for $T=16$, i.e. the factor is around 1. For $T=250$, on the other hand, model states belonging to bigger $z$’s are weighted with much higher factors.

So can we take $\pi$ instead of $p$ as a definition of the exercise probability? Not really, because for a bermudan option the exercise probabilities for the different expiries together with the probability of no exercise at all will not add up to one. This is evident, because we use a different measure for each expiry, namely the respective $t$-forward measure (using the representation of $\pi$ as the $t$-forward $p$ above).

Sure, we could fix that as well by renormalizing the vector of $p$’s. But I will just stop here for the moment. Similarly, a potential future exposure defined as a quantile of an underlying npv distribution is not well defined under pricing measures.

Today I feature a sequel to the adjoint greeks drama (see my prior posts on this).

Before I start I would like to point you to a new and excellent blog authored by my colleague Matthias https://ipythonquant.wordpress.com/. You will want to follow his posts, I am certain about that.

I am still on my way to converting the essential parts of the library to a template version. Since this is boring work I created a small friend who helps me. No, I did not go mad. It is a semi-intelligent emacs-lisp script that does some regular-expression based search-and-replace, which speeds up the conversion a lot. Emacs is so cool.

Today I am going to calculate some derivatives of a bermudan swaption in the Gsr / Hull White model. This is the first post-vanilla application and admittedly I am glad that it works at last.

One point I have to make today is that AD is slow. Or to put it differently, doubles can be tremendously fast. Lookit here:

template <class T>
void multiply(T *a, T *b, T *s) {
    // s must be zero-initialized; N is the (global) matrix dimension
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) {
            for (int j = 0; j < N; ++j) {
                s[i * N + j] += a[i * N + k] * b[k * N + j];
            }
        }
    }
}


This code multiplies two matrices a and b and stores the result in s. It is the same algorithm as implemented in QuantLib for the Matrix class.

When feeding two 1000 x 1000 double matrices the code runs 1.3s on my laptop if compiled with gcc 4.8.2 using -O1. With -O3 I get 950ms. With clang 3.7.0 (fresh from the trunk) I get 960ms (-O1) and 630ms (-O3). With T=CppAD::AD&lt;double&gt; on the other hand the running time goes up to at least 10.9s (gcc, -O3) and 14.3s (clang, -O3). Code optimization seems to be a delicate business.

These timings refer to the situation where we use the AD type without taping, i.e. only as a wrapper for the double in it. This seems to indicate that at least for specific procedures it is not advisable to globally replace doubles by their active counterpart if one is not really interested in taping their operations and calculating the derivatives. Performance may break down dramatically.

What is the reason behind the huge difference? I am not really the right person to analyze this in great detail. The only thing I spotted when looking into the generated assembler code is that with double there are SIMD (single instruction multiple data) instructions for adding and multiplying in the nested loops (like addpd and mulpd; it’s a long time since I programmed in assembler, and that was on a 6510, so I am not really able to read a modern x86 assembler file).

With CppAD::AD&lt;double&gt; there don’t seem to be such instructions around. So part of the performance loss may be due to the inability to stream the AD calculations through the SIMD units. For sure this is not the only point here.

The second point to make today is that AD has pitfalls that may in the end lead to wrong results if one uses the AD framework blindly. Let’s come back to our specific example. The underlying source code can be found here if you are interested or want to run it yourself.

It is a bermudan swaption, ten years with yearly exercise dates. The model for pricing will be the Gsr or Hull White model. We just want to compute the bucket vegas of the bermudan, i.e. the change in its NPV when the implied market volatility of the canonical european swaptions used for the model calibration is increased by one percent.

The rate level is set to $3\%$ and the strike is out of the money at $5\%$. The volatilities are produced by SABR parameters $\alpha=3\%$, $\beta=60\%$, $\nu=12\%$, $\rho=30\%$. All of this is arbitrary, unrealistic and only for explanatory purposes …

The naive AD way of doing this would be to declare the input implied volatilities as our quantities of interest

   CppAD::Independent(inputVolAD);


and then go through the whole model calibration

   gsrAD->calibrateVolatilitiesIterative(


and pricing

   yAD = swaptionAD->NPV();


to get the derivative d bermudan / d impliedVol

    CppAD::ADFun<Real> f(inputVolAD, yAD);
    vega = f.Reverse(1, w);


When I first wrote about AD I found it extremely attractive and magical that you could compute your sensitivities like this: that you can actually differentiate through a zero search (like in the case of yield curve bootstrapping) or an optimization (like the Levenberg-Marquardt algorithm which is used here).

However, there are dark sides to this simplicity too. Performance is one thing. The calibration step and the pricing, including the gradient calculation, take

AD model calibration = 1.32s


We can also do it differently, namely calibrate the model in an ordinary way (using ordinary doubles), then compute the sensitivity of the bermudan to the model’s sigma and, additionally, the sensitivity of the calibration instruments to the model’s sigma. Putting everything together yields the bermudan’s bucketed vega again. I will demonstrate how below. First I report the computation time for this approach:

model calibration = 0.40s


This leaves us with a performance gain of around 15 percent (7.32s vs. 8.43s in total). Not really dramatic, but still significant. And there is another good reason to separate the calibration from the greek calculation, which I will come to below.

Note also that the pricing, which takes 5.95s with AD (including the derivatives), is much faster without AD, where it only consumes 0.073s. This is a factor of 80, which is much worse than the theoretical factor of 4 to 5 mentioned in earlier posts (remember, we saw 4.5 for a plain vanilla interest rate swap npv and delta computation). This is again due to optimization issues, obviously.

The background here is that the swaption pricing engine uses cubic spline interpolation and closed-form integration of the resulting cubic polynomials against the normal density for the roll back. Again a lot of elementary calculations, not separated by OO constructs that would keep the compiler from low-level optimizations. You would surely need quite a number of sensitivities to still get a performance gain.

But let’s follow the path to computing the bucket vegas further down. First I print the bucket vegas coming from the naive AD way above – direct differentiation by the input implied volatilities, with the operation sequence going through the whole model calibration and the pricing.

vega #0: 0.0153%
vega #1: 0.0238%
vega #2: 0.0263%
vega #3: 0.0207%
vega #4: 0.0209%
vega #5: 0.0185%
vega #6: 0.0140%
vega #7: 0.0124%
vega #8: 0.0087%


This looks plausible. We started the calibration from a constant sigma function at $1\%$. This is actually far away from the target value (which is around $0.40\%$), so the optimizer can walk around before it settles at the minimum. But we could have started with a sigma very close to the optimal solution. What would happen then? With the target value as initial guess (which you are unlikely to have in reality, sure) we get

vega #0: 0.0000%
vega #1: 0.0238%
vega #2: 0.0263%
vega #3: 0.0207%
vega #4: 0.0209%
vega #5: 0.0448%
vega #6: 0.0000%
vega #7: 0.0000%
vega #8: 0.0000%


Some vegas are zero now, and vega #5 is completely different from the value before. This is a no-go, because in a production application you wouldn’t notice that some deals are contributing a zero or a false sensitivity.

What is happening here is that the function we differentiate depends on more input variables than only the primary variables of interest (the implied vols), like the initial guess of the optimization. These can alter the derivative drastically even if the function value (which is the model’s sigma function on an intermediate level, or ultimately the bermudan’s npv) stays the same.

For example, if the initial guess is so good that the optimizer’s tolerance is already satisfied with it, the output will stay the same no matter whether the input is perturbed by an infinitesimal shift $dx$. Here perturbed does not mean bumped, and $dx$ really is infinitesimally small. It is only a concept to get a better intuition of what is going on during automatic differentiation.

Another example is the bisection method for zero searching. This method will always yield zero AD derivatives, in the following sense: let $x$ be an input (e.g. a vector of swap quotes) and $y$ the output (e.g. a vector of zero rates), linked by a relation $f(x,y) = 0$

If $x$ is perturbed by $dx$, then the comparisons in the bisection algorithm will yield exactly the same results for $x+dx$ as for $x$. Therefore the computed $y$ will be exactly the same, and thus the derivative is zero.

I don’t know if this explanation is good, but this is how I picture it for myself. The lesson seems to be: if what you are doing with AD feels too magical, you’d better not do it.

What is the way out? We just avoid all optimizations and zero searches, if possible. And it is possible for interest rate deltas and vegas: just calibrate your curve in the usual way, then apply AD to compute sensitivities with respect to the zero rates (which does not involve a zero search any more).

If you want market rate deltas, compute the Jacobian matrix of the market instruments in the curve by the zero rates as well, invert it and multiply it with the zero delta vector. Actually, you do not even have to invert it; you only have to solve one linear equation system. We make this procedure explicit with our example above.

The first step is to compute d bermudan NPV / d model sigma. This is done on the already calibrated model. Technically we separate the calibration step (done with ordinary doubles) and the AD step (done on a copy of the model fed with the model parameters from the first model, but not itself calibrated again). This is what we get for d bermudan / d sigma

vega #0: 1.6991
vega #1: 1.2209
vega #2: 0.8478
vega #3: 0.5636
vega #4: 0.4005
vega #5: 0.2645
vega #6: 0.1573
vega #7: 0.0885
vega #8: 0.0347


Next we compute the Jacobian d helperNpv / d sigma

297.4%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
179.7%	183.3%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
126.9%	129.4%	132.2%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
90.8%	92.6%	94.6%	96.7%	0.0%	0.0%	0.0%	0.0%	0.0%
64.9%	66.2%	67.6%	69.1%	71.2%	0.0%	0.0%	0.0%	0.0%
47.5%	48.4%	49.5%	50.6%	52.1%	53.3%	0.0%	0.0%	0.0%
31.9%	32.6%	33.3%	34.0%	35.0%	35.9%	36.3%	0.0%	0.0%
19.2%	19.6%	20.0%	20.4%	21.0%	21.5%	21.8%	22.3%	0.0%
8.7%	8.9%	9.0%	9.2%	9.5%	9.8%	9.9%	10.1%	10.5%


   helperVega = f.Jacobian(sigmas);


This is a lower triangular matrix, because the $i$th calibration helper depends only on the sigma function up to its expiry time. The inverse of this matrix is also interesting, although we would not need it in its full beauty for our vega calculation. It represents d sigma / d helperNpv

33.6%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
-33.0%	54.6%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
0.0%	-53.4%	75.6%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
0.0%	0.0%	-74.0%	103.5%	0.0%	0.0%	0.0%	0.0%	0.0%
0.0%	0.0%	0.0%	-100.4%	140.4%	0.0%	0.0%	0.0%	0.0%
0.0%	0.0%	0.0%	0.0%	-137.2%	187.5%	0.0%	0.0%	0.0%
0.0%	0.0%	0.0%	0.0%	0.0%	-185.1%	275.3%	0.0%	0.0%
0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	-269.1%	448.1%	0.0%
0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	-431.6%	953.1%


This says how much the sigma step belonging to a helper goes up when that helper’s npv goes up (these are the positive diagonal elements), and how much the next sigma step needs to go down when the same helper goes up but the next helper stays the same (the negative subdiagonal elements). The counter-movement is roughly of the same size.


And the last ingredient is the calibration helpers’ vegas, which can be computed analytically (this is the npv change when shifting the input market volatility up by one percent), d helperNpv / d marketVol

vega #0: 0.0904%
vega #1: 0.1117%
vega #2: 0.1175%
vega #3: 0.1143%
vega #4: 0.1047%
vega #5: 0.0903%
vega #6: 0.0720%
vega #7: 0.0504%
vega #8: 0.0263%


Now multiplying everything together gives

d bermudan / d impliedVol = d bermudan / d sigma x d sigma / d helperNpv x d helperNpv / d impliedVol

which is

vega #0: 0.0153%
vega #1: 0.0238%
vega #2: 0.0263%
vega #3: 0.0207%
vega #4: 0.0209%
vega #5: 0.0185%
vega #6: 0.0140%
vega #7: 0.0124%
vega #8: 0.0087%


same as above. Partly because I just copy-pasted it here. But it actually comes out of the code as well, just try it.