Week 12 Day 1: Requested topic: Fitting

Objectives:

Cover performance and syntax for general unbinned fitting.

Remember to pip install probfit (--user if you are not in a virtual environment). cupy is used at the end; you'll need conda install cupy in an environment. Times are from a 24 core Xeon with a Titan V.

We will put our preperation code in a file, and restart the kernel as we go, just to make sure we don't have any interaction between GPU libraries. likelyhood_answer is defined to be the numpy value - Note that the reduction can produce lots of round of error if not done properly. I know numpy does, not sure about probfit.

Results:

Package	Hardware	Time per call	Minimize time
Probfit	24 core Zeon	1.11 s	2.23 min
Numpy	24 core Zeon	21.1 ms	2.38 s
Numba parallel	24 core Zeon	11.5 ms	1.46 s
Numba parallel loop	24 core Zeon	3.68 ms	416 ms
GooFit	Titan V	2.30 ms	333 ms
CUPY	Titan V	2.9 ms	491 ms
CUPY kernel	Titan V	843 µs	118 ms
CUPY redu	Titan V	2.50 ms	343 ms
PyTorch	Titan V	3.68 ms	594 ms
TensorFlow static	Titan V	1.38 ms	225 ms

from read_pretty import read_pretty
read_pretty('common.py')

import numpy as np
from iminuit import Minuit
import math

np.random.seed(42)

dist = np.hstack([
    np.random.normal(loc=1, scale=2., size=500_000),
    np.random.normal(loc=1, scale=.5, size=500_000)
])

default_params = dict(
    f_0=.5,
    error_f_0=.01,
    limit_f_0=(0,1),

    mean=1.5,
    error_mean=.01,
    limit_mean=(-10,10),

    sigma=.4,
    limit_sigma=(0,1),
    error_sigma=.01,

    sigma2=3.,
    error_sigma2=.01,
    limit_sigma2=(1,3),
)


class Compare:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if other == self.value:
            return True

        frac = abs(other - self.value) / self.value
        print(f'Missed by: {frac:.3}')
        return frac < .00001

likelyhood_answer = Compare(4976157.922404283)

def run_and_print(minuit):
    minuit.print_level=0
    a,b = minuit.migrad()
    print("FCN: {nfcn}\nis_valid: {is_valid}\nfval: {fval}\nedm: {edm}\n".format(**a))

Make data

First, we make our data. It's a simple double gaussian with no background.

%run common.py

import matplotlib.pyplot as plt

plt.hist(dist, bins='auto');

Probfit

Now, let's try using probfit to fit the data. First, we have to rename the parameters - probfit merges similar named parameters. I already know that gaussian comes with x, mean, and sigma. We use AddPdfNorm to add the pdfs, and they are kept normalized. A new parameter is added, f_0.

import probfit

second_gaussian = probfit.rename(probfit.gaussian, ["x", "mean", "sigma2"])
pdf_function = probfit.AddPdfNorm(probfit.gaussian, second_gaussian)

Now, let's build an unbinned likelyhood function:

unbinned_lh = probfit.UnbinnedLH(pdf_function, dist)

assert unbinned_lh(.5,.5,.5,.5) == likelyhood_answer

Missed by: 1.7e-14

%%timeit
unbinned_lh(*np.random.rand(4))

1.11 s ± 78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We have to use iMinuit's name based parameter setting interface:

minuit = Minuit(unbinned_lh, **default_params)

And let's do the fit!

%%time
run_and_print(minuit)

FCN: 136
is_valid: True
fval: 1672956.6975332375
edm: 1.8438039152750956e-05

CPU times: user 2min 24s, sys: 807 ms, total: 2min 24s
Wall time: 2min 23s

unbinned_lh.draw(minuit);

exit()

GooFit

Let's try that in GooFit. We set up our model in what I view as a more readable but more verbose way:

%run common.py

import goofit

goofit.GOOFIT_DEVICE

'CUDA'

x = goofit.Observable('x', -10, 10)

ds = goofit.UnbinnedDataSet(x)
ds.from_matrix([dist], filter=True)

mean = goofit.Variable('mean', 1.5, -10, 10)
sigma = goofit.Variable('sigma', .4, 0, 1)
sigma2 = goofit.Variable('sigma2', 3, 1, 3)
f_0 = goofit.Variable('f_0', .5, 0, 1)

gauss1 = goofit.GaussianPdf("gauss1", x, mean, sigma)
gauss2 = goofit.GaussianPdf("gauss2", x, mean, sigma2)

pdf = goofit.AddPdf('pdf', [f_0], [gauss1, gauss2])

pdf.fitTo(ds);

MnSeedGenerator: for initial parameters FCN = 3874835.510745
MnSeedGenerator: Initial state:   - FCN =   3874835.510745 Edm =       744932 NCalls =     19
MnSeedGenerator: Negative G2 found - new state:   - FCN =   3735439.992451 Edm =       712480 NCalls =     44
VariableMetric: start iterating until Edm is < 0.0001
VariableMetric: Initial state   - FCN =   3735439.992451 Edm =       712480 NCalls =     44
VariableMetric: Iteration #   0 - FCN =   3735439.992451 Edm =       712480 NCalls =     44
VariableMetric: Iteration #   1 - FCN =   3365672.041955 Edm =      20560.4 NCalls =     56
VariableMetric: Iteration #   2 - FCN =   3351286.807413 Edm =      3712.53 NCalls =     66
VariableMetric: Iteration #   3 - FCN =   3346224.031665 Edm =      290.217 NCalls =     76
VariableMetric: Iteration #   4 - FCN =   3345976.094246 Edm =      22.4637 NCalls =     86
VariableMetric: Iteration #   5 - FCN =   3345892.605172 Edm =       3.8225 NCalls =     96
VariableMetric: Iteration #   6 - FCN =   3345887.959668 Edm =   0.00383132 NCalls =    106
VariableMetric: Iteration #   7 - FCN =   3345887.956199 Edm =   3.4033e-06 NCalls =    116
VariableMetric: After Hessian   - FCN =   3345887.956199 Edm =   6.3729e-06 NCalls =    145
VariableMetric: Iteration #   8 - FCN =   3345887.956199 Edm =   6.3729e-06 NCalls =    145

Minuit did successfully converge.
# of function calls: 145
minimum function Value: 3345887.956199
minimum edm: 6.372903302792e-06
minimum internal state vector: LAVector parameters:
  -0.001982099827992
     0.1000580006389
  -0.001697590764306
 -0.0007592706638429

minimum internal covariance matrix: LASymMatrix parameters:
  4.6329523e-06   -5.04376e-10  2.6759732e-06  3.0630683e-06
   -5.04376e-10   7.932709e-09 -6.3590735e-10 -3.1439813e-10
  2.6759732e-06 -6.3590735e-10  4.3480155e-06  2.0847795e-06
  3.0630683e-06 -3.1439813e-10  2.0847795e-06  6.5938869e-06


# ext. ||   Name    ||   type  ||     Value     ||  Error +/- 

   0   ||       f_0 || limited ||     0.4990089507349 ||0.0007609969210436
   1   ||      mean || limited ||     0.9989112735614 ||0.000626640028246
   2   ||     sigma || limited ||     0.4991512050255 ||0.0007372245084616
   3   ||    sigma2 || limited ||      1.999240729409 ||0.001815747212224




The minimization took: 323.45 ms
Average time per call: 2.2354 ms

Note: On a single thread CPU build, this is about 7-10x faster than probfit. This is on a Titan V GPU, so it's a lot more than that.

Plotting is a bit more verbose, but not too bad:

grid, points = pdf.evaluatePdf(x)
xs = grid.to_matrix().flatten()

fig, ax = plt.subplots(figsize=(8,3))
ax.plot(xs, points)
ax.hist(dist, bins='auto', density=True)
plt.show()

exit()

Numpy

Let's try this with Numpy and iMinuit. We could use probfit to make the NLL, but it's much faster to do it ourselves.

%run common.py

Note:

I originally used numba here, but since this is all done in array form, even parallel Numba is only a small fraction (10-30%) faster on my laptop. On a 24 core Xeon, however... You'll see.

def gaussian(x, μ, σ):
    return 1/np.sqrt(2*np.pi*σ**2) * np.exp(-(x-μ)**2/(2*σ**2))

def add(x, f_0, mean, sigma, sigma2):
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)

def nll(f_0, mean, sigma, sigma2):
    return -np.sum(np.log(add(dist, f_0, mean, sigma, sigma2)))

assert nll(.5,.5,.5,.5) == likelyhood_answer

Missed by: 9.36e-16

Note that Numpy uses a good definition of a sum, while Python's built in sum has rounding point errors. This would be more important if we used fewer bits. Let's compare:

# Check reduction sum:
assert -sum(np.log(add(dist, .5, .5, .5, .5))) == likelyhood_answer

Missed by: 1.7e-14

# Check floating point accurate sum:
assert -math.fsum(np.log(add(dist, .5, .5, .5, .5))) == likelyhood_answer

%%timeit
nll(*np.random.rand(4))

/opt/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: divide by zero encountered in log

21.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

minuit = Minuit(nll,
                errordef=1,
                **default_params)

%%time
run_and_print(minuit)

FCN: 137
is_valid: True
fval: 1672956.6975178656
edm: 3.092553415281103e-06

CPU times: user 51.5 s, sys: 5.73 s, total: 57.2 s
Wall time: 2.38 s

# Much (2-3x) slower if we use this in Minuit
# unbinned_lh = probfit.UnbinnedLH(add, dist)

Numba

Let's use Numba's parallel abilities:

import numba

@numba.njit(parallel=True)
def gaussian(x, μ, σ):
    return 1/np.sqrt(2*np.pi*σ**2) * np.exp(-(x-μ)**2/(2*σ**2))

@numba.njit(parallel=True)
def add(x, f_0, mean, sigma, sigma2):
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)

@numba.njit(parallel=True)
def nll(f_0, mean, sigma, sigma2):
    return -np.sum(np.log(add(dist, f_0, mean, sigma, sigma2)))

assert nll(.5,.5,.5,.5) == likelyhood_answer

Missed by: 5.61e-16

%%timeit
nll(*np.random.rand(4))

11.5 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

minuit = Minuit(nll,
                errordef=1,
                **default_params)

%%time
run_and_print(minuit)

FCN: 137
is_valid: True
fval: 1672956.6975178658
edm: 3.0927053268030814e-06

CPU times: user 55.7 s, sys: 12.5 s, total: 1min 8s
Wall time: 1.46 s

Numba Parallel loop

@numba.jit
def gaussian(x, μ, σ):
    return 1/np.sqrt(2*np.pi*σ**2) * np.exp(-(x-μ)**2/(2*σ**2))

@numba.jit
def add(x, f_0, mean, sigma, sigma2):
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)

@numba.njit(parallel=True)
def nllp(dist, f_0, mean, sigma, sigma2):
    s = 0.0
    for i in numba.prange(len(dist)):
        s +=  np.log(add(dist[i], f_0, mean, sigma, sigma2))
    return -s
    
def nll(f_0, mean, sigma, sigma2):
    return nllp(dist, f_0, mean, sigma, sigma2)

assert nll(.5,.5,.5,.5) == likelyhood_answer

Missed by: 5.61e-16

%%timeit
nll(*np.random.rand(4))

3.65 ms ± 903 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

minuit = Minuit(nll,
                errordef=1,
                **default_params)

%%time
run_and_print(minuit)

FCN: 137
is_valid: True
fval: 1672956.6975178658
edm: 3.0927053268030814e-06

CPU times: user 19.8 s, sys: 134 ms, total: 19.9 s
Wall time: 416 ms

exit()

CUPY GPU

Let's use CUPY on the GPU:

%run common.py

import cupy

%%time
d_dist = cupy.array(dist)

CPU times: user 226 ms, sys: 576 ms, total: 803 ms
Wall time: 754 ms

def gaussian(x, μ, σ):
    return 1/cupy.sqrt(2*np.pi*σ**2) * cupy.exp(-(x-μ)**2/(2*σ**2))

def add(x, f_0, mean, sigma, sigma2):
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)

def nll(f_0, mean, sigma, sigma2):
    return -cupy.sum(cupy.log(add(d_dist, f_0, mean, sigma, sigma2)))

assert float(nll(.5,.5,.5,.5)) == likelyhood_answer

%%timeit
float(nll(*np.random.rand(4)))

3.02 ms ± 36.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

minuit = Minuit(nll,
                errordef=1,
                **default_params)

%%time
run_and_print(minuit)

FCN: 137
is_valid: True
fval: 1672956.6975178663
edm: 3.092622654275362e-06

CPU times: user 404 ms, sys: 24 ms, total: 428 ms
Wall time: 427 ms

Cupy Custom Kernel

Let's go all out and write a CUDA kernel by hand, then call sum on it. We are almost ideal here; the only downside is that we make a single temporary 1M element array before summing.

We could use a reduction kernel, but I don't know if I trust that do not have round off error issues.

mykernel = cupy.ElementwiseKernel(
    'float64 x, float64 f_0, float64 mean, float64 sigma, float64 sigma2',
    'float64 z', '''
    
    double s12 = 2*sigma*sigma;
    double s22 = 2*sigma2*sigma2;
    
    double p = -(x-mean)*(x-mean);
    double g = rsqrt(M_PI*s12) * exp(p/s12);
    double g2 = rsqrt(M_PI*s22) * exp(p/s22);
    
    z = log(f_0 * g + (1 - f_0) * g2);
        
    ''', 'mykernel')

def nll(f_0, mean, sigma, sigma2):
    return -cupy.sum(mykernel(d_dist, f_0, mean, sigma, sigma2))

assert float(nll(.5,.5,.5,.5)) == likelyhood_answer

%%timeit
float(nll(*np.random.rand(4)))

843 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

minuit = Minuit(nll,
                errordef=1,
                **default_params)

%%time
run_and_print(minuit)

FCN: 137
is_valid: True
fval: 1672956.6975178663
edm: 3.092621710282783e-06

CPU times: user 100 ms, sys: 18 ms, total: 118 ms
Wall time: 118 ms

rku = cupy.ReductionKernel(
    'float64 x, float64 f_0, float64 mean, float64 sigma, float64 sigma2',
    'float64 r',
    '''
    log(     f_0  * rsqrt(2*M_PI*sigma*sigma)   * exp(-(x-mean)*(x-mean)/(2*sigma*sigma)) +
        (1 - f_0) * rsqrt(2*M_PI*sigma2*sigma2) * exp(-(x-mean)*(x-mean)/(2*sigma2*sigma2)))
    ''',
    'a + b',
    'r = -a',
    '0',
    'redu_kernel'
)

def nll(f_0, mean, sigma, sigma2):
    return rku(d_dist, f_0, mean, sigma, sigma2)

assert float(nll(.5,.5,.5,.5)) == likelyhood_answer

%%timeit
float(nll(*np.random.rand(4)))

2.5 ms ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

minuit = Minuit(nll,
                errordef=1,
                **default_params)

%%time
run_and_print(minuit)

FCN: 137
is_valid: True
fval: 1672956.6975178663
edm: 3.092623404318523e-06

CPU times: user 257 ms, sys: 86.9 ms, total: 344 ms
Wall time: 343 ms

exit()

PyTorch

%run common.py

import torch

device = torch.device('cuda:0')

%%time
d_dist = torch.tensor(dist, device=device)

CPU times: user 2.41 s, sys: 1.38 s, total: 3.8 s
Wall time: 3.83 s

def gaussian(x, μ, σ):
    return 1/np.sqrt(2*np.pi*σ**2) * torch.exp(-(x-μ)**2/(2*σ**2))

def add(x, f_0, mean, sigma, sigma2):
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)

def nll(f_0, mean, sigma, sigma2):
    return -torch.sum(torch.log(add(d_dist, f_0, mean, sigma, sigma2)))

v = nll(.5,.5,.5,.5)

assert v.item() == likelyhood_answer

Missed by: 1.87e-16

%%timeit
nll(*np.random.rand(4))

3.68 ms ± 57.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

minuit = Minuit(nll,
                errordef=1,
                **default_params)

%%time
run_and_print(minuit)

FCN: 137
is_valid: True
fval: 1672956.697517866
edm: 3.0926286412622218e-06

CPU times: user 13.6 s, sys: 551 ms, total: 14.2 s
Wall time: 594 ms

exit()

TensorFlow: static graphs

Eager execution is broken for me due to incompatibilty with LZ4. Also tensorflow doesn't play nice with memory, so restarting the kernel.

%run common.py

import tensorflow as tf

/opt/anaconda/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

def gaussian(x, μ, σ):
    return 1/tf.sqrt(2*np.pi*σ**2) * tf.math.exp(-(x-μ)**2/(2*σ**2))

def add(x, f_0, mean, sigma, sigma2):
    return f_0 * gaussian(x, mean, sigma) + (1 - f_0) * gaussian(x, mean, sigma2)

def make_nll(d_dist, f_0, mean, sigma, sigma2):
    return -tf.reduce_sum(tf.math.log(add(d_dist, f_0, mean, sigma, sigma2)))

x = tf.constant(dist)
tf_f_0 = tf.placeholder(dtype=tf.float64)
tf_mean = tf.placeholder(dtype=tf.float64)
tf_sigma = tf.placeholder(dtype=tf.float64)
tf_sigma2 = tf.placeholder(dtype=tf.float64)

graph = make_nll(x, tf_f_0, tf_mean, tf_sigma, tf_sigma2)

sess = tf.Session()

def nll(f_0, mean, sigma, sigma2): 
    loss_value = sess.run(graph,
                          feed_dict={tf_f_0:f_0,
                                     tf_mean:mean,
                                     tf_sigma:sigma,
                                     tf_sigma2:sigma2})
    return loss_value

assert nll(.5,.5,.5,.5) == likelyhood_answer

Using random numbers here just to make sure we don't get some special caching or somthing from TensorFlow.

%%timeit
nll(*np.random.rand(4))

1.38 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

minuit = Minuit(nll,
                errordef=1,
                **default_params)

%%time
run_and_print(minuit)

FCN: 137
is_valid: True
fval: 1672956.6975178663
edm: 3.0926072991921636e-06

CPU times: user 396 ms, sys: 58.3 ms, total: 455 ms
Wall time: 225 ms

sess.close()

When using tensorflow, be a good user and kill the kernel so GPU memory gets reclaimed.

exit()

Quick check of summation algorithms:

>>> import numpy as np
>>> import math
>>> np.random.seed(42)
>>> dist = np.hstack([
...     np.random.normal(loc=1, scale=2., size=500_000),
...     np.random.normal(loc=1, scale=.5, size=500_000)
... ])

# fsum vs. naive
>>> math.fsum(dist) - sum(dist)
5.820766091346741e-10

# fsum vs. np.sum
>>> math.fsum(dist) - np.sum(dist)
0.0

# Check with 32 bit, fsum vs. naive
>>> math.fsum(dist.astype(np.float32)) - sum(dist.astype(np.float32), np.float32(0))
-20.27040922746528

# Check with 32 bit, fsum vs. np.sum
>>> math.fsum(dist.astype(np.float32)) - np.sum(dist.astype(np.float32))
-0.1454092274652794

# Check with 32 bit fsum vs. 64 bit fsum
>>> math.fsum(dist) - math.fsum(dist.astype(np.float32))
-2.3222528398036957e-06