Pandas scatter_matrix – plot categorical variables

I am looking at the famous Titanic dataset from the Kaggle competition found here: http://www.kaggle.com/c/titanic-gettingStarted/data

I have loaded and processed the data using:

# import required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# load the data from the file
df = pd.read_csv('./data/train.csv')

# import the scatter_matrix functionality
# (pandas.tools.plotting was deprecated in pandas 0.20; pandas.plotting is the current location)
from pandas.plotting import scatter_matrix

# define colors list, to be used to plot survived either red (=0) or green (=1)
colors=['red','green']

# make a scatter plot
scatter_matrix(df,figsize=[20,20],marker='x',c=df.Survived.apply(lambda x:colors[x]))

df.info()

[Image: scatter_matrix output]

How can I add the categorical columns like Sex and Embarked to the plot?

Best answer

You need to transform the categorical variables into numbers to plot them.

Example (assuming the column 'Sex' holds the gender data, with 'M' for males and 'F' for females):

import numpy as np

# start with NaN everywhere, then fill in the known values
df['Sex_int'] = np.nan
df.loc[df['Sex'] == 'M', 'Sex_int'] = 0
df.loc[df['Sex'] == 'F', 'Sex_int'] = 1

Now all males are represented by 0 and females by 1. Rows with any other value (if there are any) keep NaN in Sex_int, so they are left out of the plot.
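The same encoding can probably be written more compactly with Series.map. The following is only a sketch on a made-up Series (the sample values, including the stray 'X', are hypothetical), still using the 'M'/'F' labels assumed above. Any value missing from the dictionary comes out as NaN.

```python
import pandas as pd

# Hypothetical sample; the 'M'/'F' labels follow the assumption made above.
sex = pd.Series(['M', 'F', None, 'F', 'X'], name='Sex')

# map() looks each value up in the dictionary; values with no entry become NaN
sex_int = sex.map({'M': 0, 'F': 1})
print(sex_int.tolist())
```

If your file uses other labels (I believe the Kaggle train.csv stores 'male' and 'female'), adjust the dictionary keys accordingly, otherwise the whole column ends up as NaN.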

The rest of your code should process the updated dataframe nicely.
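For a column with more than two values, such as Embarked, one option is to let pandas assign the integer codes. The sketch below is not the method from the answer above but a variation on it: the helper name encode_categoricals and the sample rows are made up for illustration, and the 'male'/'female' and 'S'/'C'/'Q' labels are my assumption about what train.csv contains.

```python
import pandas as pd

def encode_categoricals(frame, columns):
    """Return a copy of frame with an added '<col>_int' column per listed column."""
    out = frame.copy()
    for col in columns:
        # cat.codes numbers the sorted categories 0, 1, 2, ... and uses -1 for missing
        codes = out[col].astype('category').cat.codes
        # convert to float and turn the -1 marker into NaN so those rows are not plotted
        out[col + '_int'] = codes.astype(float).where(codes >= 0)
    return out

# Hypothetical stand-in for a few rows of train.csv
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', None, 'Q'],
})

encoded = encode_categoricals(sample, ['Sex', 'Embarked'])
print(encoded[['Sex_int', 'Embarked_int']])
```

The encoded frame can then be passed to scatter_matrix the same way as df in the question. Keep in mind that the codes follow alphabetical order, so the spacing between them on the axis has no meaning of its own.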