Profiling
Quick extension of last week's lecture: Remember we said that you should optimize only the slow parts of your program. How do you tell? Profile!
There are two types of profilers:
- Deterministic profilers: These modify the runtime of the program by tracking every line.
- Sampling profilers: These "sample" every so often.
The main profiler for Python is the built in cProfile
and profile
modules (the first being a faster version of the second). Since Python is already interpreted and slow, most profilers for Python are deterministic profilers.
The following example uses the built in profiler directly, as well as line_profiler
, from either conda (conda install line_profiler
in environment) or pip (pip install --user line-profiler
). Line profiler is a powerful, better interface (and IPython extension) for the cProfile module.
import numpy as np
from scipy.stats import norm
def sampler(
data,
samples=4,
mu_init=0.5,
proposal_width=0.5,
plot=False,
mu_prior_mu=0,
mu_prior_sd=1.0,
):
mu_current = mu_init
posterior = [mu_current]
for i in range(samples):
# suggest new position
mu_proposal = norm(mu_current, proposal_width).rvs()
# Compute likelihood by multiplying probabilities of each data point
likelihood_current = norm(mu_current, 1).pdf(data).prod()
likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()
# Compute prior probability of current and proposed mu
prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
p_current = likelihood_current * prior_current
p_proposal = likelihood_proposal * prior_proposal
# Accept proposal?
p_accept = p_proposal / p_current
# Usually would include prior probability, which we neglect here for simplicity
accept = np.random.rand() < p_accept
if plot:
plot_proposal(
mu_current,
mu_proposal,
mu_prior_mu,
mu_prior_sd,
data,
accept,
posterior,
i,
)
if accept:
# Update position
mu_current = mu_proposal
posterior.append(mu_current)
return posterior
np.random.seed(123)
data = np.random.randn(20)
import cProfile
cProfile.run("posterior = sampler(data, samples=1000, mu_init=1.)")
Reading that last page is a bit tricky - we have the "insides" of all the Python functions being called. Let's try line_profiler
to just select the one function we care about, and see that line-by-line.
Like all external packages except matplotlib, we need to load the package as an IPython extension first:
%load_ext line_profiler
Then, we can use the lprun
magic along with the -f function
to select a function:
%lprun -f sampler posterior = sampler(data, samples=1000, mu_init=1.)
Much better! I can easily see what's wrong now. This allows you to target optimizations; often minor changes, like np.stack
instead of np.array
, may give you large speedups. Parts that are not performance critical can remain clear and easy to read/maintain.
Real life example:
- I had 128 DataFrames of around a million events each, and I wanted to make 128 2D histograms.
- Custom Histogram library Physt was taking 3 seconds per histogram.
- Native Numpy histogram (then feeding that into Physt) took .9 seconds per histogram.
- Custom Numba histogram function (then feeding that into Physt) took 0.1 seconds per histogram.
At this point, a total of 10-15 seconds for the procedure was fine. Physt histograms as output means I can use the nice OO design to merge and rebin histograms later, by the way.
with open("tmpfile.txt", "w") as f:
f.write("This is one line\n")
print("Second line", file=f)
# File gets closed here
with open("tmpfile.txt") as f:
for l in f:
print(l, end="")
You can iterate over a file to get lines, or use .readlines
to get a list of lines, or .read
to just read in (part) of the file. The mode flags can set text vs. binary 'b'
, read 'r'
, write 'w'
, or append 'a'
.
The f
object is a File Buffer. Quite a few functions and libraries in Python take a buffer - usually they take a file name or an open buffer. You can make a buffer that is attached to your memory instead of a file with io.StringIO
.
Serializing any objects
Python has a protocol for storing and recovering almost any object called "pickling". This is general purpose, but is often not ideal for large or long term storage, or for communicating data - there is some limitation on Python version, the class code should be similar or the same, and there is no compression. But, it can be handy in a pinch. You can use dump
and load
, or you can use dumps
and loads
to input/output directly to a string instead:
import pickle
pickle_str = pickle.dumps("some object")
print(pickle_str)
pickle.loads(pickle_str)
Use case: configuration
Let's look at better ways to do IO for specific cases. Let's look at a common one: storing some "configuration" like data (or any small set of data). As long as the data is small, it's nice to leave it human readable format, in a text file. Here are some popular formats:
- XML: Famous - HTML is a subset of XML. Very verbose and ugly, and often used with a custom schema.
- JSON: Popular. A little unfriendly for writing, but clean and simple. A subset of JavaScript makes it web-friendly. Great libraries available.
- INI: Not well defined, and very limited. Very easy for a human to author, though.
- YAML: Yet Another Markup Language, popular in some areas. Lots of weird corner cases to the syntax, though.
- TOML: Simpler version of the thing above. Gaining some use in Python as of late.
Of these, the best standard library support in Python goes to JSON, so we'll look at that one. It's the best in the list for (small amounts of) general data. For configuration, it's a tossup.
JSON has the same methods as pickle.
import json
structure_to_save = ["list", "items", {"dictionary": "Nested", "number": 2}]
print(json.dumps(structure_to_save, indent=True))
You can read more at Python 3 Module of the Week: json or the official documentation.
a = np.random.rand(1000)
The format is a simple binary one-array-per-file format; you can use savez
to save several files into one uncompressed zip file, or savez_compressed
to save to a compressed zip file:
np.savez_compressed("tmp.npz", a=a)
with np.load("tmp.npz") as f:
print(f["a"].shape)
While simple, this is still Numpy specific, and is not ideal for cross-language data storage.
HDF5
This is a standard format for storing binary structured data with optional compression. It has several nice features, like groups and attributes, that make it very powerful. You need an external library, but Anaconda comes with it by default. Technically, it comes with two. H5py and PyTables; I use HDF5 in C++ also, so I like the fact that H5py looks more like other HDF5 libraries, so I'll cover that. The other one is available with import tables
.
import h5py
with h5py.File("tmp.h5", "w") as f:
f["a"] = a
with h5py.File("tmp.h5") as f:
b = f["a"].value
assert np.all(a == b)
This is a common idiom in modern IO libraries; you treat the file like a dictionary. There are explicit methods if you need to do something fancier (like turn on compression for a dataset). HDF5 has tools to allow you to keep a dataset on file instead of reading it entirely to memory (but I've not had to use those).
Also, there are great non-Python tools for HDF5 available, like a graphical file viewer.
Note for HEP physicists: the ROOT file format has a few features that HDF5 does not, like incremental saving. However, other than small differences, the formats are similar. Except that HDF5 is available everywhere and even if it isn't, it takes something like 5 minutes to build instead of several hours...
</font>
import pandas as pd
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
df.to_hdf("tmp.h5", "df")
pd.read_hdf("tmp.h5", "df")
Bonus: Uproot
In HEP, we have a very commonly used and powerful format: ROOT. But getting ROOT to compile is a long (2+ hour) ordeal, and I have yet to get it to work with Anaconda Python on macOS.
An exciting new library, "uproot" lets you read ROOT files in pure Python without ROOT. The original name was to be micro-root, or $\mu$root, but it became uproot. You've already seen a bit of uproot in the previous lectures.
Writing files is planned, but still in progress.