Week 10 Day 1: (Pro)file

Objectives

  • Cover profiling (extension of last lecture)
  • Cover reading and writing of files
  • Start the basics of ODEs

Profiling

Quick extension of last week's lecture: Remember we said that you should optimize only the slow parts of your program. How do you tell? Profile!

There are two types of profilers:

  1. Deterministic profilers: These trace every function call (or every line), which adds overhead and slows the program down.
  2. Sampling profilers: These "sample" the program's state every so often, adding very little overhead.

The main profilers for Python are the built-in cProfile and profile modules (the first being a faster version of the second). Since Python is already interpreted and relatively slow, most profilers for Python are deterministic profilers.

The following example uses the built-in profiler directly, as well as line_profiler, available from either conda (conda install line_profiler in your environment) or pip (pip install --user line-profiler). line_profiler is a powerful line-by-line profiler with a nicer interface (and an IPython extension).

import numpy as np
from scipy.stats import norm
def sampler(data, samples=4, mu_init=.5, proposal_width=.5, plot=False, mu_prior_mu=0, mu_prior_sd=1.):
    mu_current = mu_init
    posterior = [mu_current]
    for i in range(samples):
        # suggest new position
        mu_proposal = norm(mu_current, proposal_width).rvs()

        # Compute likelihood by multiplying probabilities of each data point
        likelihood_current = norm(mu_current, 1).pdf(data).prod()
        likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()
        
        # Compute prior probability of current and proposed mu        
        prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
        prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
        
        p_current = likelihood_current * prior_current
        p_proposal = likelihood_proposal * prior_proposal
        
        # Accept proposal?
        p_accept = p_proposal / p_current
        
        # Accept with probability p_accept (Metropolis criterion)
        accept = np.random.rand() < p_accept
        
        if plot:
            plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accept, posterior, i)
        
        if accept:
            # Update position
            mu_current = mu_proposal
        
        posterior.append(mu_current)
        
    return posterior
np.random.seed(123)
data = np.random.randn(20)
import cProfile
cProfile.run('posterior = sampler(data, samples=1000, mu_init=1.)')

Reading that output is a bit tricky - it includes the "insides" of all the Python functions being called. Let's try line_profiler to select just the one function we care about, and see it line by line.

Unlike matplotlib, whose magic is built into IPython, external packages that provide magics need to be loaded as an extension first:

%load_ext line_profiler

Then, we can use the %lprun magic with the -f flag to select the function to profile:

%lprun -f sampler posterior = sampler(data, samples=1000, mu_init=1.)

Much better! I can easily see what's wrong now. This allows you to target optimizations; often minor changes, like np.stack instead of np.array, may give you large speedups. Parts that are not performance critical can remain clear and easy to read/maintain.

Real life example:

  • I had 128 DataFrames of around a million events each, and I wanted to make 128 2D histograms.
  • Custom Histogram library Physt was taking 3 seconds per histogram.
  • Native Numpy histogram (then feeding that into Physt) took 0.9 seconds per histogram.
  • Custom Numba histogram function (then feeding that into Physt) took 0.1 seconds per histogram.

At that point, a total of 10-15 seconds for the whole procedure was fine. As a bonus, producing Physt histograms as output means I can use its nice OO design to merge and rebin histograms later.

Files

Let's look at reading in and out data!

Basic input/output

As a reminder, this is input/output of text or data files in pure Python:

with open('tmpfile.txt', 'w') as f:
    f.write("This is one line\n")
    print("Second line", file=f)
# File gets closed here

with open('tmpfile.txt') as f:
    for l in f:
        print(l, end='')

You can iterate over a file to get lines, or use .readlines to get a list of lines, or .read to read in (part of) the file. The mode flags can set text vs. binary 'b', read 'r', write 'w', or append 'a'.
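
For example, here's a quick sketch of the other read styles and the append mode, reusing the tmpfile.txt written above:

# Whole file as one string
with open('tmpfile.txt') as f:
    whole_file = f.read()

# Whole file as a list of lines (newlines included)
with open('tmpfile.txt') as f:
    lines = f.readlines()

# Append a line with 'a' mode, then read back just the first 4 characters
with open('tmpfile.txt', 'a') as f:
    f.write("Third line\n")

with open('tmpfile.txt') as f:
    print(f.read(4))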

The f object is a file buffer. Quite a few functions and libraries in Python accept either a file name or an open buffer. You can make a buffer that is attached to memory instead of a file with io.StringIO.
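
For example, a minimal sketch of an in-memory buffer standing in for a file:

import io

buffer = io.StringIO()                       # behaves like an open text file
print("This never touches the disk", file=buffer)
buffer.seek(0)                               # rewind before reading
print(buffer.read())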

Serializing any objects

Python has a protocol for storing and recovering almost any object called "pickling". This is general purpose, but it is often not ideal for large or long-term storage, or for sharing data - pickles are tied to the Python (and pickle protocol) version, the class code must be available and largely unchanged when unpickling, and there is no compression. But it can be handy in a pinch. You can use dump and load to work with files, or dumps and loads to go to/from a string instead:

import pickle
pickle_str = pickle.dumps("some object")
print(pickle_str)
pickle.loads(pickle_str)
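
The file-based pair works the same way; here's a small sketch writing to tmp.pkl (the file name is just an example) - note that pickle files must be opened in binary mode:

with open('tmp.pkl', 'wb') as f:
    pickle.dump({"mu": 0.5, "samples": 1000}, f)

with open('tmp.pkl', 'rb') as f:
    restored = pickle.load(f)
print(restored)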

Use case: configuration

Let's look at better ways to do IO for specific cases. Let's look at a common one: storing some "configuration" like data (or any small set of data). As long as the data is small, it's nice to leave it in a human-readable format, in a text file. Here are some popular formats:

  1. XML: Famous - HTML is closely related (and XHTML is an XML dialect). Very verbose and ugly, and often used with a custom schema.
  2. JSON: Popular. A little unfriendly for writing, but clean and simple. A subset of JavaScript makes it web-friendly. Great libraries available.
  3. INI: Not well defined, and very limited. Very easy for a human to author, though.
  4. YAML: Originally "Yet Another Markup Language", now "YAML Ain't Markup Language"; popular in some areas. Lots of weird corner cases in the syntax, though.
  5. TOML: A simpler, more strictly specified alternative to YAML (closer to INI in spirit). Gaining some use in Python as of late.

Of these, the best standard library support in Python goes to JSON, so we'll look at that one. It's the best in the list for (small amounts of) general data. For configuration, it's a tossup.

The json module has the same methods as pickle: dump/load for files and dumps/loads for strings, but it produces text.

import json
structure_to_save = ["list",'items',{"dictionary":"Nested", "number":2}]

print(json.dumps(structure_to_save, indent=True))
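
And, just like pickle, there is a file-based pair; a quick sketch round-tripping the same structure through tmp.json (again, the file name is just an example):

with open('tmp.json', 'w') as f:
    json.dump(structure_to_save, f, indent=True)

with open('tmp.json') as f:
    restored = json.load(f)
print(restored)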

Larger data storage

Numpy

Let's assume we have data in Numpy. We can use one of several Numpy methods to store the data; we can even compress it.

a = np.random.rand(1000)

The format is a simple binary one-array-per-file format; you can use savez to save several arrays into one uncompressed zip file, or savez_compressed to save to a compressed zip file:

np.savez_compressed('tmp.npz', a=a)
with np.load('tmp.npz') as f:
    print(f['a'].shape)
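
For a single array, the one-array-per-file format can be written directly with save and read back with load; a minimal sketch (tmp.npy is just an example name):

np.save('tmp.npy', a)
b = np.load('tmp.npy')
assert np.all(a == b)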

While simple, this is still Numpy specific, and is not ideal for cross-language data storage.

HDF5

This is a standard format for storing binary structured data with optional compression. It has several nice features, like groups and attributes, that make it very powerful. You need an external library, but Anaconda comes with it by default. Technically, it comes with two: h5py and PyTables. I use HDF5 from C++ as well, and h5py looks more like other HDF5 libraries, so I'll cover that one. The other is available with import tables.

import h5py
with h5py.File('tmp.h5', 'w') as f:
    f['a'] = a

with h5py.File('tmp.h5', 'r') as f:
    b = f['a'][...]
assert np.all(a == b)

This is a common idiom in modern IO libraries; you treat the file like a dictionary. There are explicit methods if you need to do something fancier (like turn on compression for a dataset). HDF5 has tools to allow you to keep a dataset on file instead of reading it entirely to memory (but I've not had to use those).
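
For example, here's a small sketch of the more explicit interface - a group, a compressed dataset, and an attribute (all names here are made up for illustration):

with h5py.File('tmp2.h5', 'w') as f:
    grp = f.create_group('measurement')                         # groups act like folders
    dset = grp.create_dataset('a', data=a, compression='gzip')  # per-dataset compression
    dset.attrs['units'] = 'arbitrary'                           # attach metadata

with h5py.File('tmp2.h5', 'r') as f:
    print(f['measurement/a'].attrs['units'])
    print(f['measurement/a'][:10])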

Also, there are great non-Python tools for HDF5 available, like a graphical file viewer.

Note for HEP physicists: the ROOT file format has a few features that HDF5 does not, like incremental saving. However, other than small differences, the formats are similar - except that HDF5 is available everywhere, and even where it isn't, it takes something like 5 minutes to build instead of several hours...

Pandas

Pandas has very powerful connectors to lots of different formats, including CSV, JSON, Excel, and HDF5. If your data is already a table, just use Pandas!

import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
df.to_hdf('tmp.h5', key='df')
pd.read_hdf('tmp.h5', 'df')
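
The other connectors look just the same; for example, a quick sketch with CSV (tmp.csv is just an example name):

df.to_csv('tmp.csv')
df2 = pd.read_csv('tmp.csv', index_col=0, parse_dates=True)
df2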

Bonus: Uproot

In HEP, we have a very commonly used and powerful format: ROOT. But getting ROOT to compile is a long (2+ hour) ordeal, and I have yet to get it to work with Anaconda Python on macOS.

An exciting new library, uproot, lets you read ROOT files in pure Python without ROOT. The original name was to be micro-root, or $\mu$root, but it became uproot. You've already seen a bit of uproot in the previous lectures.

Writing files is planned, but still in progress.
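
To give the flavor, here is a minimal reading sketch, assuming a hypothetical file events.root containing a TTree named events with a branch pt (the exact return types depend on your uproot version):

import uproot

f = uproot.open('events.root')
tree = f['events']
print(tree.keys())              # list the branch names
arrays = tree.arrays(['pt'])    # read the pt branch into memory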