I am looking at the famous Titanic dataset from the Kaggle competition found here: http://www.kaggle.com/c/titanic-gettingStarted/data
I have loaded and processed the data using:
    # import required libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline

    # load the data from the file
    df = pd.read_csv('./data/train.csv')

    # import the scatter_matrix functionality
    # (older pandas versions: from pandas.tools.plotting import scatter_matrix)
    from pandas.plotting import scatter_matrix

    # define colors list, to be used to plot survived either red (=0) or green (=1)
    colors = ['red', 'green']

    # make a scatter plot
    scatter_matrix(df, figsize=[20, 20], marker='x',
                   c=df.Survived.apply(lambda x: colors[x]))

    df.info()
How can I add the categorical columns like Sex and Embarked to the plot?
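scatter_matrix only plots numeric columns, so you need to transform the categorical variables into numbers before they will show up.
Example (assuming that the column 'Sex' holds the gender data, with 'M' for males and 'F' for females; the Kaggle train.csv uses 'male' and 'female' instead, so adjust the comparisons to match your data):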
    import numpy as np

    # start with NaN everywhere, then fill in the known values
    df['Sex_int'] = np.nan
    df.loc[df['Sex'] == 'M', 'Sex_int'] = 0
    df.loc[df['Sex'] == 'F', 'Sex_int'] = 1
Now all males are represented by 0 and females by 1. Unknown genders (if there are any) stay NaN, so they are left out of the plot.
The rest of your code should process the updated dataframe nicely.
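The same idea extends to Embarked. Below is a minimal sketch, not the only way to do it: it uses a tiny made-up DataFrame named `sample` in place of train.csv (the rows are hypothetical), assumes the 'male'/'female' spelling, and the `Embarked_int` column name is my own choice. `Series.map` gives you control over which value gets which number, while category codes assign the numbers automatically in sorted order.

```python
import pandas as pd

# Tiny stand-in for train.csv (hypothetical rows, for illustration only)
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

# Explicit mapping: any value not listed in the dict would become NaN
sample["Sex_int"] = sample["Sex"].map({"male": 0, "female": 1})

# Automatic coding: categories are sorted (C=0, Q=1, S=2),
# and missing values get the code -1
embarked_codes = sample["Embarked"].astype("category").cat.codes

# Replace the -1 marker with NaN so missing ports are not plotted as a real value
sample["Embarked_int"] = embarked_codes.astype("float").where(embarked_codes >= 0)

print(sample[["Sex_int", "Embarked_int"]])
```

Because both new columns are numeric, passing the DataFrame to scatter_matrix as in the question should include them as extra rows and columns of the plot. Keep in mind that the numbers are only labels: the ordering of the ports (or of the sexes) carries no meaning.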