Friday, September 2, 2016

Titanic Data Analysis

This is my analysis of the famous Titanic passenger dataset. I will use the Python libraries NumPy, Pandas, and Matplotlib.

About the Titanic dataset:

The Titanic dataset contains demographics and passenger information for 891 of the 2,224 passengers and crew on board the Titanic. You can view a description of this dataset on the Kaggle website, where the data was obtained.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

titanic_data = pd.read_csv('titanic_data.csv')

#titanic_data.head()
print(titanic_data.describe())
We can see from the data that 342 of the 891 passengers survived, about 38%.
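
As a quick check on that figure, here is a minimal sketch of how the count and the rate fall out of a 0/1 column. The Series below is rebuilt from the two numbers quoted above (342 and 891), not read from the real CSV, so treat it as an illustration only.

```python
import pandas as pd

# Rebuild a 0/1 "Survived" column from the counts quoted in the text
# (342 survivors out of 891 rows); this is not the real CSV.
survived = pd.Series([1] * 342 + [0] * (891 - 342), name="Survived")

total = survived.count()        # number of passengers in the sample
survivors = survived.sum()      # each survivor contributes 1
rate = survived.mean() * 100    # mean of a 0/1 column = share of 1s

print(f"{survivors} of {total} survived ({rate:.1f}%)")
```

This is also why the mean reported by describe() for the Survived column can be read directly as the survival rate.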

Questions

Did males or females survive more?

How did age affect survival?

How did class affect survival?

Did the cabin letter (A, B, C, D, E, F, G) matter for survival?

First Look at the Data


From the pie chart we see that in this data sample 61.6% died while only 38.4% survived.
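
The cell that drew the pie chart is not shown in this post, so the following is only a sketch of how such a chart could be produced. The function name plot_survival_pie and the labels are my own assumptions; the counts come from the text (342 survivors out of 891).

```python
import matplotlib.pyplot as plt


def plot_survival_pie(survived_count, total_count):
    """Draw a died/survived pie chart and return the axes (hypothetical helper)."""
    died_count = total_count - survived_count
    fig, ax = plt.subplots()
    ax.pie(
        [died_count, survived_count],
        labels=["Died", "Survived"],
        autopct="%1.1f%%",  # print each share with one decimal place
    )
    ax.set_title("Survival in the Titanic sample")
    ax.axis("equal")  # keep the pie circular
    return ax


# Counts taken from the text: 342 survivors out of 891 passengers.
ax = plot_survival_pie(342, 891)
```

In a notebook the figure is displayed automatically; in a script, call plt.show() afterwards.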

Males vs Females


sex_survive = titanic_data.groupby('Sex')['Survived'].sum()  # select the column first so only 'Survived' is summed




It's clear that females had a better chance of surviving than males:
“Women and Children First”
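
To put a number on that, here is a small sketch that turns counts into survival rates by sex. The counts are not read from the CSV here: they are the per-class figures printed later in this post added up by sex (for example 91 + 70 + 72 = 233 surviving women), with the passenger totals implied by the percentage table.

```python
import pandas as pd

# Counts by sex, added up from the per-class tables later in this post.
survivors = pd.Series({"female": 233, "male": 109})
passengers = pd.Series({"female": 314, "male": 577})

# Element-wise division aligns on the index labels ("female", "male").
rate_by_sex = (survivors / passengers * 100).round(1)
print(rate_by_sex)
```

Under these counts roughly three in four women survived, against fewer than one in five men.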

Male / Female by Class

survival_by_class = titanic_data.groupby(['Sex', 'Pclass'])['Survived'].sum()
%matplotlib inline
print(survival_by_class)




Sex     Pclass
female  1         91
        2         70
        3         72
male    1         45
        2         17
        3         47
Name: Survived, dtype: int64

We can see that more first-class women survived than any other group. But what does this tell us? Did higher-class women have a higher chance of survival? We can’t really tell from these counts because we don’t know how many first-class women were aboard in total.
So, in general, we need to take into consideration the total number of passengers of each class who were aboard and the percentage that survived.
More third-class men survived (47) than first-class (45) or second-class (17) men.
class_count = titanic_data.groupby(['Sex', 'Pclass'])['Survived'].count()

percent_survived = (survival_by_class / class_count) * 100
x = np.arange(3)  # one bar position per class
# align='edge' puts each pair of bars on either side of the tick at x + 0.4
p1 = plt.bar(x + 0.4, percent_survived['male'], color='c', width=0.4, align='edge')
p2 = plt.bar(x, percent_survived['female'], color='pink', width=0.4, align='edge')
plt.ylabel('Percentage Survived per Class')
plt.title('Survival Percentage by Class')
plt.xticks([0.4, 1.4, 2.4], ('First Class', 'Second Class', 'Third Class'))
plt.legend((p1[0], p2[0]), ('Men', 'Women'))
plt.show()

print(percent_survived)


Sex     Pclass
female  1         96.808511
        2         92.105263
        3         50.000000
male    1         36.885246
        2         15.740741
        3         13.544669
Name: Survived, dtype: float64
It's clearer here that 96.8% and 92.1% of first- and second-class women, respectively, survived, compared to only 50% of third-class women, which clearly shows that first- and second-class women had a much higher chance of surviving than third-class women.
For men it's another story: the first-class rate (36.9%) is much higher than the second- and third-class rates (15.7% and 13.5%).
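
The sum, the count and the percentage can also be computed in one pass with agg. The sketch below does not load the CSV: it rebuilds one row per passenger from the two tables above (passengers = survivors / rate, e.g. 91 / 0.968 = 94 first-class women), so the variable names and the reconstruction step are my own assumptions.

```python
import pandas as pd

# (sex, class, passengers, survivors) rebuilt from the two tables above.
groups = [
    ("female", 1, 94, 91), ("female", 2, 76, 70), ("female", 3, 144, 72),
    ("male", 1, 122, 45), ("male", 2, 108, 17), ("male", 3, 347, 47),
]

# Expand the counts into one row per passenger, like the real CSV.
rows = []
for sex, pclass, total, survived in groups:
    rows += [(sex, pclass, 1)] * survived
    rows += [(sex, pclass, 0)] * (total - survived)
df = pd.DataFrame(rows, columns=["Sex", "Pclass", "Survived"])

# sum = survivors, count = passengers, mean = survival fraction
summary = df.groupby(["Sex", "Pclass"])["Survived"].agg(["sum", "count", "mean"])
summary["percent"] = summary["mean"] * 100
print(summary)
```

Keeping the count column next to the percentage makes it easy to see how many passengers each rate is based on.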

Survival by Age

The passengers' ages range from toddlers to 80-year-old men. Here we will explore the survival of different age groups.
Note that the age column has a lot of missing values, exactly 177, which will be ignored in this analysis.
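
Here is a small sketch of how missing values can be counted and dropped. The ages below are made up for illustration; on the real data the same calls would be applied to titanic_data['Age'].

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration only; NaN marks a missing value.
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 4.0], name="Age")

n_missing = ages.isnull().sum()  # how many entries are NaN
known_ages = ages.dropna()       # same rows as ages[ages.notnull()]

print(n_missing, len(known_ages))
```
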
age = titanic_data.Age[titanic_data.Age.notnull()]  # drop the NaN ages

plt.hist(age, 25, histtype='stepfilled')
plt.ylabel('Frequency')
plt.title('Age Distribution Histogram')
plt.xlabel('Age')







def cutDF(ages):
    # letter prefixes keep the groups sorted from youngest to oldest
    return pd.cut(ages, [0, 12, 19, 35, 65, 120], labels=['A_Child', 'B_Teen', 'C_Youth', 'D_Adult', 'E_Old'])

titanic_data['Category'] = cutDF(titanic_data['Age'])

survival_by_age_count = titanic_data.groupby(['Category'])['Survived'].count()
survival_by_age = titanic_data.groupby(['Category'])['Survived'].sum()


# align='edge' starts each bar at x, so the ticks at x + 0.4 sit under the bar centres
plt.bar(np.arange(len(survival_by_age)), survival_by_age, align='edge')

plt.ylabel('Number of People Survived')
plt.title('Number of People Survived By Age')
labels = ['Child', 'Teen', 'Youth', 'Adult', 'Old']
plt.xticks([0.4, 1.4, 2.4, 3.4, 4.4], labels)


print(survival_by_age)
plt.show()



Category
A_Child     40
B_Teen      39
C_Youth    128
D_Adult     82
E_Old        1
Name: Survived, dtype: int64

Most of the survivors were Youth, but to compare rates we must take the total number of passengers in each age group into account.
survival_by_age_percent = survival_by_age / survival_by_age_count * 100

# align='edge' starts each bar at x, so the ticks at x + 0.4 sit under the bar centres
plt.bar(np.arange(len(survival_by_age_percent)), survival_by_age_percent, align='edge')

plt.ylabel('Percentage Survived per Age')
plt.title('Survival Percentage by Age')
plt.xticks([0.4, 1.4, 2.4, 3.4, 4.4], labels)


print(survival_by_age_percent)
plt.show()


Category
A_Child    57.971014
B_Teen     41.052632
C_Youth    38.438438
D_Adult    39.234450
E_Old      12.500000
Name: Survived, dtype: float64

Here it shows that children had the highest chance of surviving.
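
Since the age groups depend on where the bin edges fall, here is a short sketch of how pd.cut assigns ages that sit on or next to an edge. The bins are the ones used in this post; the sample ages are hypothetical, and I dropped the letter prefixes from the labels for readability.

```python
import pandas as pd

bins = [0, 12, 19, 35, 65, 120]
labels = ["Child", "Teen", "Youth", "Adult", "Old"]

# Hypothetical ages chosen to sit on or next to the bin edges.
ages = pd.Series([0.42, 12, 12.5, 19, 35, 36, 65, 80])

# By default each bin includes its right edge: (0, 12], (12, 19], ...
bands = pd.cut(ages, bins, labels=labels)

print(pd.DataFrame({"age": ages, "band": bands}))
```

With these defaults a 12-year-old counts as a Child and a 19-year-old as a Teen, while an age of exactly 0 would fall outside the first bin and come back as NaN.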

Survival by Cabin

First-class passengers stayed in cabins whose numbers start with a letter from A to G. Did that have any effect on their survival?

survive_by_cabin_letter = pd.DataFrame({
'count':np.zeros(8,dtype=int),
'survivors':np.zeros(8,dtype=int)
}, index=['A', 'B', 'C', 'D', 'E','F','G','T'])


for _, row in titanic_data[['Survived', 'Cabin']].dropna().iterrows():  # dropna drops rows with missing values
    survived = row['Survived']
    letter = row['Cabin'][0]  # first letter of the cabin number

    # .loc updates the frame itself instead of a temporary copy
    survive_by_cabin_letter.loc[letter, 'count'] += 1
    survive_by_cabin_letter.loc[letter, 'survivors'] += survived

print("Percent of survival per cabin letter")

# align='edge' starts each bar at x, so the ticks at x + 0.4 sit under the bar centres
plt.bar(np.arange(8), (survive_by_cabin_letter['survivors'] / survive_by_cabin_letter['count']) * 100, align='edge')
plt.xticks(np.arange(8) + 0.4, ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'])
plt.ylabel('Percentage')
plt.title('Survival Based on Cabin')
plt.xlabel('Cabin Letter')


Percent of survival per cabin letter






From the last result I can see that people in cabins B, D and E survived at higher rates than people in cabins A, C and G. However, the difference does not look significant, so it does not indicate anything.
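
The per-letter tally can also be written without an explicit loop. This is only a sketch on a few made-up rows (the toy frame and its cabin values are hypothetical); on the real data, titanic_data would take the place of toy.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; the real data comes from the CSV.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0, 0],
    "Cabin": ["C85", "C123", "E46", np.nan, "B28", "E10"],
})

with_cabin = toy.dropna(subset=["Cabin"])  # keep rows with a known cabin
letter = with_cabin["Cabin"].str[0]        # first character = cabin letter

# count = passengers per letter, sum = survivors per letter
by_letter = with_cabin.groupby(letter)["Survived"].agg(["count", "sum"])
by_letter["percent"] = by_letter["sum"] / by_letter["count"] * 100
print(by_letter)
```

Printing the count next to the percentage helps show how few passengers some letters have, which is one reason the differences between letters are hard to interpret.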

Conclusion

- The data is filled with missing values; we chose to ignore them and do all calculations without them, so the results are vague without a statistical test.
- Women and children had a higher chance of surviving.
- Higher-class people had a higher chance of surviving.
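
As an example of the kind of statistical test mentioned above, here is a sketch of a Pearson chi-square test of independence between sex and survival, computed by hand with NumPy. This test was not part of the original analysis. The counts are the per-class figures from this post added up by sex, and the critical value 3.841 is the standard 5% threshold for one degree of freedom.

```python
import numpy as np

# Rows: female, male. Columns: survived, died.
# Counts are the per-class figures from this post added up by sex.
observed = np.array([[233, 314 - 233],
                     [109, 577 - 109]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Counts we would expect in each cell if sex had no effect on survival.
expected = row_totals * col_totals / observed.sum()

# Pearson statistic, without a continuity correction.
chi2 = ((observed - expected) ** 2 / expected).sum()
critical_value = 3.841  # chi-square, 1 degree of freedom, 5% level

print(f"chi2 = {chi2:.1f}, significant at 5%: {chi2 > critical_value}")
```

A statistic far above the critical value means the difference between men and women is very unlikely to be due to chance alone; the same approach could be applied to class or cabin letter with a larger table and the matching degrees of freedom.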