Histograms and Cuts

Week 7 Day 2: Cuts and histograms

Objectives

Learn about manipulating realistic data sets
Look about making cuts by looking at histograms
Learn to install package locally on OSC

Note that the answers are hidden in this notebook. Please try first, then you can reveal the answer by double-clicking the markdown cell above the input cell.

The datafiles we'll be looking at today are realistic files from LHCb. Please do not distribute to anyone outside our class. The data is high energy physics data, but most of the concepts generalize to other statistical fields.

Like most high energy physics data, our files sit in ROOT - a special custom C++ behemoth. Since we don't have the (many) hours required to install ROOT, and installing ROOT inside an Anaconda environment is a nightmare, we'll use the pure-python uproot package to read in the data.

If you don't have uproot, run:

!pip install --user uproot

And then restart your kernel. This will install to ~/.local. (Your user directory is called ~ or $HOME in Unix.)

import uproot

Feel free to change this from inline to notebook; just be careful to not keep plotting to the same output figure.

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

Note: normally, we should open a file like:

with open(...) as f:
    ... # open file
... # file automatically closed

However, in uproot, files are opened and closed on each read, so while you can structure your code like this, is isn't necessary.

f = uproot.open("/fs/project/PES0765/tutorial_data/DVntuple-March04A.root")
# f = uproot.open('/data/tutorial/DVntuple-March04A.root')

We have several "directories" in here, each with a single "NTuple" or "TTree" with Decay in the name - don't worry about the ROOT centric terminology, these are just tabular data! Let's grab one:

XiDecayTuple = f["myXiTuple"]["XiDecayTuple"]
myLcTuple = f["myLcTuple"]["DecayTree"]

print(f"XiDecayTuple number of events: {len(XiDecayTuple):,}")
print(f"myLcTuple number of events: {len(myLcTuple):,}")

We could continue to use the "ROOT"like uproot interface, but why? Let's read the whole thing in as a DataFrame! Note that at this point, the data will sit in memory (since Pandas is in-memory only).

The NTuple contains "JaggedArrays" or "AwkwardArrays", arrays that have a different number of columns per row. We'll simply filter those out manually for now. We could use an hierarchical index in Pandas instead, but this would waist memory and is buggy at the moment. We could also read the data in as python lists, but since we don't need it, we'll filter it instead.

# Uproot bug causes crash (and would be hierarchical index if it worked):
#     df = t.pandas.df(entrystop=100_000, flatten=True)
#
# My proposal: https://github.com/scikit-hep/uproot/pull/157
#     df = t.pandas.df(entrystop=100_000, flatten=None)
#
# A bit wasteful in memory:
#     df = t.pandas.df(entrystop=100_000, flatten=False)
#
# I haven't shown you lambda functions, but this is supported by uproot:
#     df = t.pandas.df(lambda x: x.interpretation if not isinstance(x.interpretation, uproot.asjagged) else None)

good_keys = [
    key
    for key in XiDecayTuple.keys()
    if not isinstance(XiDecayTuple[key].interpretation, uproot.asjagged)
]
xi = XiDecayTuple.pandas.df(good_keys, entrystop=100_000)

xi.info()

xi.head()

The dataset uses prefixes. Can you find the unique set of prefixes of column names?

How much memory would you need to load the whole file? (with our filter, at least) - just approximate it by hand.

This is particle physics data from LHCb. We are looking at the following decay:

$$ \Xi^- \rightarrow \Lambda^0 \pi^-, \quad \Lambda^0 \rightarrow p \pi^- $$

We will use the names:

Particle	Prefix	Type	PDG Mass (MeV)
$\Xi^-$	`Xi_`	Recon	1321.71
$\Lambda^-$ from $\Xi^-$	`Lambda_`	Recon	1115.683
$\pi^-$ from $\Xi^-$	`PropmtPi_`	Observed	139.571
$p$ from $\Lambda^-$	`DecayPr_`	Observed	938.272
$\pi^-$ from $\Lambda^-$	`DecayPi_`	Observed	139.571

Let's plot the $\Xi^-$ mass:

How about the $\Lambda^0$ mass?

In the original document, we were asked to look at the $\Lambda^0$ mass for good $\Xi^-$, and bad $\Xi^-$ candidates. Let's make some cuts on the $\Xi^-$ mass. Try good candidates from 1315 to 1327 MeV and bad candidates that are more than 5 MeV from that range.

good_cut = ...
bad_cut = ...

fix, axs = plt.subplots(1, 2, figsize=(12, 5))
xi.loc[good_cut, "Lambda_MM"].plot.hist(bins=100, ax=axs[0])
xi.loc[bad_cut, "Lambda_MM"].plot.hist(bins=100, ax=axs[1])
plt.show()

Let's try another common trick: we'll look at the plot of the $\Xi^-$ mass were we've replaced the original $\Lambda^0$ mass with the known mass (essentially collapsing the distribution we saw before into a delta function). Due to a first-order approximation, we can just subtract the measured mass and replace it with the known mass. (Here I'm using "known" to indicate that it is an exact value, not that we actually truly know that exact value.)

Part 2: Another decay

This is particle physics data from LHCb again. We are looking at the following decay:

$$ \Lambda_c^{+} \rightarrow \Xi^{-} K^{+} \pi^{+}, \quad \Xi^- \rightarrow \Lambda^0 \pi^-, \quad \Lambda^0 \rightarrow p \pi^- $$

We will use the names:

Particle	Prefix	Type	PDG Mass (MeV)
$\Lambda_C^+$	`LambdaC_`	Recon	2286.46
$K^+$ from $\Lambda_C^+$	`PromptK_`	Observed	493.677
$\pi^+$ from $\Lambda_C^+$	`PropmtPi_`	Observed	139.571
$\Xi^-$ from $\Lambda_C^+$	`Ximinus_`	Recon	1321.71
$\Lambda^-$ from $\Xi^-$	`Lambda_`	Recon	1115.683
$\pi^-$ from $\Xi^-$	`XiPi_`	Observed	139.571
$p$ from $\Lambda^-$	`LambdaPr_`	Observed	938.272
$\pi^-$ from $\Lambda^-$	`LambdaPi_`	Observed	139.571

lc = myLcTuple.pandas.df(
    lambda x: None
    if isinstance(x.interpretation, uproot.asjagged)
    else x.interpretation
)

lc.head()

Let's plot the $\Lambda_C^-$ mass:

Yikes! That is not as clean as what we had before. Let's see if we can make some cuts to fix this up!

We'll start by making a background subtracted distribution. We'll select "signal + background", and we'll select "background". Then we can subtract the two (making sure to account for differences in acceptance), and that will give us a way to decide where to cut.

Hopefully obviously, we want to reduce background without reducing too much signal.

signal_cut = ...
background_cut_lower = ...
background_cut_upper = ...
background_cut = ...

Let's make a histogram of $\Lambda_C^-$ mass with three colors, one for signal_cut, one for background_cut, and one for the remaining parts.

plt.hist(
    [
        lc.loc[signal_cut, "LambdaC_MM"],
        lc.loc[background_cut, "LambdaC_MM"],
        lc.loc[~(background_cut | signal_cut), "LambdaC_MM"],
    ],
    bins=np.arange(2235, 2335.5, 0.5),
    stacked=True,
    label=["Signal + background", "Background", "Other"],
)
plt.xlabel("$\Xi^{-} K^{+} \pi^{+}$ mass [MeV]")
plt.ylabel("Events per 1/2 MeV")
plt.legend()

Let's make a plot where we look at a distribution, such as 'PromptK_ProbNNk', with signal and background. You should make the histograms by hand (ie, use np.histogram to name hsb and hb), then plot hsb*2-hb.

Select and make a PromptK_ProbNNk_cut.

Now plot the histograms before and after cuts of the $\Lambda_C^-$ mass.

You should make functions out of the last two plotting codes, and then use them to run the same test on 'LambdaC_TAU', the lifetime (travel distance) of the $\Lambda_C^-$.

bg_sub_hist(lc, signal_cut, background_cut, "LambdaC_TAU", np.linspace(0, 0.01))

And the other one:

final_plot(lc[PromptK_ProbNNk_cut], lc["LambdaC_TAU"] > 0.0005)