Week 7 Day 1: Worksheet

Load the following example dataset:

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = sns.load_dataset("titanic")
# Note: if you have problems, try:
# df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')

1: Look at the data

What does the dataset look like? (Most importantly, what are the column names?)
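One possible way to take a first look, sketched on a tiny made-up stand-in (the values below are invented for illustration and only a few of the column names are used); in the worksheet you would call the same methods on the df loaded above.

```python
import pandas as pd

# Made-up stand-in with a few Titanic-style columns (not real data).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 1],
    "age": [22.0, 38.0, 4.0, None],
    "who": ["man", "woman", "child", "man"],
    "alive": ["no", "yes", "yes", "no"],
})

print(df.head())            # first few rows
print(df.columns.tolist())  # column names as a plain list
print(df.dtypes)            # one dtype per column
```

df.info() and df.describe() are other common starting points.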

2: Cleanup

2.1

Modify the dataset to provide more reasonable dtypes.

survived can be bool
alive can be bool (a little trickier)
pclass, sex, embarked, who, and embark_town can be categorical

Remember, you are modifying the DataFrame in place, so you probably won't be able to rerun this cell without first rerunning the cell that loads the data.
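A minimal sketch of these conversions, again on a small made-up stand-in rather than the real data. It assumes alive holds the strings "yes" and "no"; check this against your own df before relying on it.

```python
import pandas as pd

# Made-up stand-in (not real data).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "S"],
    "who": ["man", "woman", "child", "man"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Southampton"],
    "alive": ["no", "yes", "yes", "no"],
})

# 0/1 integers map directly onto False/True.
df["survived"] = df["survived"].astype(bool)

# Strings need an explicit comparison: a plain bool conversion would not
# give the intended result, since non-empty strings such as "no" are truthy.
df["alive"] = df["alive"] == "yes"

# Columns with a small, fixed set of values become categoricals.
for col in ["pclass", "sex", "embarked", "who", "embark_town"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Running the alive line a second time would compare booleans with the string "yes" and produce all False, which is one reason to reload the data before rerunning the cell.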

2.2

What categories does the who column contain?
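As a hedged illustration, here are two ways to list the categories of a categorical column, using a small made-up Series in place of df["who"]:

```python
import pandas as pd

# Made-up stand-in for a categorical column (not real data).
who = pd.Series(["man", "woman", "child", "man"], dtype="category", name="who")

print(who.cat.categories)  # the distinct categories
print(who.value_counts())  # how often each category occurs
```

For a column that has not been converted to a categorical, who.unique() gives similar information.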

2.3

For our purposes, we want a boolean child column and a categorical sex column, and we can drop the adult_male and who columns. You can use the Python del statement to delete columns.
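A sketch of this step on a made-up stand-in. It assumes children are marked by the value "child" in the who column, which you should confirm from the previous question.

```python
import pandas as pd

# Made-up stand-in (not real data).
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "who": ["man", "woman", "child", "man"],
    "adult_male": [True, False, False, True],
})

# Derive the boolean column before deleting its source.
df["child"] = df["who"] == "child"
df["sex"] = df["sex"].astype("category")

# del removes a column in place.
del df["adult_male"]
del df["who"]

print(df.dtypes)
```

df.drop(columns=["adult_male", "who"]) is an alternative that returns a new DataFrame instead of modifying the existing one.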

3: Several questions

What fraction of passengers survived the Titanic (in the dataset we have)?

What fraction of children survived the Titanic?

What was the average fare for adult passengers in each class (pclass)?
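One way to approach these three questions, shown on invented numbers so that it does not give away the real answers. It assumes the cleanup above has produced boolean survived and child columns.

```python
import pandas as pd

# Made-up stand-in (not real data).
df = pd.DataFrame({
    "survived": [False, True, True, False, True],
    "child": [False, False, True, True, False],
    "pclass": [3, 1, 3, 2, 1],
    "fare": [8.0, 70.0, 16.0, 20.0, 50.0],
})

# The mean of a boolean column is the fraction of True values.
frac_survived = df["survived"].mean()

# Restrict to children first, then take the same mean.
frac_children = df.loc[df["child"], "survived"].mean()

# Keep only adults, then average the fare within each class.
adult_fare = df[~df["child"]].groupby("pclass")["fare"].mean()

print(frac_survived, frac_children)
print(adult_fare)
```

If pclass is a categorical column, passing observed=True to groupby limits the result to classes that actually occur in the filtered data.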

4: Plotting

A bit tricky: Plot a stacked histogram, binned over age (10 or 20 bins covering 0-100), of alive vs. not alive passengers. If you use matplotlib directly (which I think you need to), you will probably need to filter out the NaNs first.
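A minimal sketch of one approach with matplotlib, on a made-up stand-in and assuming alive has already been converted to bool. The variable names are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in (not real data); one age is missing on purpose.
df = pd.DataFrame({
    "age": [22.0, 38.0, 4.0, np.nan, 61.0, 30.0],
    "alive": [False, True, True, False, False, True],
})

# Drop rows with a missing age before handing the values to matplotlib.
known = df.dropna(subset=["age"])
ages_alive = known.loc[known["alive"], "age"].to_numpy()
ages_dead = known.loc[~known["alive"], "age"].to_numpy()

fig, ax = plt.subplots()
# A list of arrays plus stacked=True stacks the groups in each bin.
counts, edges, _ = ax.hist(
    [ages_alive, ages_dead],
    bins=10,
    range=(0, 100),
    stacked=True,
    label=["alive", "not alive"],
)
ax.set_xlabel("age")
ax.set_ylabel("passengers")
ax.legend()
```

With stacked=True the returned counts are cumulative across the groups, so the last row holds the total per bin. In a notebook the figure is normally displayed when the cell finishes.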