This is my analysis of the famous Titanic passengers dataset. I will use the Python libraries NumPy, Pandas, and Matplotlib.
About the Titanic dataset:
The Titanic dataset contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic. You can view a description of this dataset on the Kaggle website, where the data was obtained.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
titanic_data = pd.read_csv('titanic_data.csv')
# titanic_data.head()
print(titanic_data.describe())
We can see from the data that 342 of the 891 people survived, which is about 38%.
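The 38% figure can be checked directly, because the mean of a 0/1 column is the share of ones. Here is a minimal sketch; `toy` is a made-up stand-in for `titanic_data`, so the numbers it prints are not the real ones.

```python
import pandas as pd

# Toy stand-in for titanic_data (made-up rows, not the real dataset)
toy = pd.DataFrame({'Survived': [1, 0, 0, 1, 0]})

survivors = int(toy['Survived'].sum())  # number of rows with Survived == 1
total = len(toy)                        # number of passengers in the sample
rate = toy['Survived'].mean() * 100     # mean of a 0/1 column = share of survivors

print(f"{survivors} of {total} survived ({rate:.1f}%)")
```

Running the same three lines on `titanic_data` should reproduce the 342, 891 and 38% quoted above.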
Questions
Did males or females survive more often?
How did age affect survival?
How did class affect survival?
Did the cabin letter (A, B, C, D, E, F, G) matter for survival?
First Look at the Data
[Figure: pie chart of passengers who died vs. survived]
From the pie chart we see that in this data sample 61.6% died while only 38.4% survived.
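The cell that drew the pie chart is not shown here, so this is only a sketch of how such a chart could be produced, not the original code. The `toy` frame, the labels and the `Agg` backend line are my assumptions; in a notebook the backend line can be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for titanic_data (made-up rows, not the real dataset)
toy = pd.DataFrame({'Survived': [1, 0, 0, 1, 0]})

# Count passengers per outcome; sort_index puts 0 (died) before 1 (survived)
counts = toy['Survived'].value_counts().sort_index()

# autopct prints the percentage inside each wedge
wedges, texts, autotexts = plt.pie(counts.values,
                                   labels=['Died', 'Survived'],
                                   autopct='%1.1f%%')
plt.title('Died vs. Survived')
plt.show()
```

With `autopct` set, `plt.pie` returns three lists (wedges, label texts, percentage texts).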
Males vs. Females
sex_survive = titanic_data.groupby('Sex')['Survived'].sum()
It's clear that females had a better chance of surviving than males.
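A count of survivors only says something about the chance of surviving once it is divided by the group size. Here is a small sketch of that check; the `toy` rows are made up so that men have more survivors but a lower survival rate.

```python
import pandas as pd

# Toy stand-in for titanic_data: 2 women (both survive), 12 men (3 survive)
toy = pd.DataFrame({
    'Sex': ['female'] * 2 + ['male'] * 12,
    'Survived': [1, 1] + [1, 1, 1] + [0] * 9,
})

# Survivors, group size and survival share for each sex
by_sex = toy.groupby('Sex')['Survived'].agg(['sum', 'count', 'mean'])
by_sex['percent'] = by_sex['mean'] * 100

print(by_sex)
```

In this toy sample the men have the larger `sum` but the smaller `percent`, which is why the percentage is the number to compare.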
“Women and Children First”
Survival by Sex and Class
survival_by_class = titanic_data.groupby(['Sex', 'Pclass'])['Survived'].sum()
%matplotlib inline
print(survival_by_class)
Sex     Pclass
female  1         91
        2         70
        3         72
male    1         45
        2         17
        3         47
Name: Survived, dtype: int64
We can see that more first-class women survived than women of any other class, but what does this tell us? Did higher-class women have a higher chance of survival? We can't really tell from these counts because we don't know how many first-class women were aboard in total.
So, in general, we need to take into consideration the total number of passengers of each class aboard and the percentage that survived.
Likewise, more third-class men survived than first- or second-class men.
class_count = titanic_data.groupby(['Sex', 'Pclass'])['Survived'].count()
percent_survived = (survival_by_class / class_count) * 100
x = np.arange(3)  # one pair of bars per class
p1 = plt.bar(x + 0.4, percent_survived['male'], color='c', width=0.4)
p2 = plt.bar(x, percent_survived['female'], color='pink', width=0.4)
plt.ylabel('Percentage Survived per Class')
plt.title('Survival Percentage by Class')
plt.xticks(x + 0.2, ('First Class', 'Second Class', 'Third Class'))  # tick between each pair (bars are centered on x)
plt.legend((p1[0], p2[0]), ('Men', 'Women'))
plt.show()
print(percent_survived)
Sex     Pclass
female  1         96.808511
        2         92.105263
        3         50.000000
male    1         36.885246
        2         15.740741
        3         13.544669
Name: Survived, dtype: float64
It's clearer here that 96.8% and 92.1% of first- and second-class women, respectively, survived compared to only 50% of third-class women, which clearly shows that first- and second-class women had a much higher chance of surviving than third-class women.
For men it's another story: the first-class rate (36.9%) is much higher than the second- and third-class rates (15.7% and 13.5%).
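The same percentages can also be laid out as a sex-by-class table, which makes the comparison easier to read. Here is a sketch using `mean()` in place of the sum/count division; `toy` is made-up data and `rate_table` is just a name I picked.

```python
import pandas as pd

# Toy stand-in for titanic_data (made-up rows, not the real dataset)
toy = pd.DataFrame({
    'Sex':      ['female'] * 4 + ['male'] * 6,
    'Pclass':   [1, 1, 3, 3,      1, 1, 1, 1, 3, 3],
    'Survived': [1, 1, 1, 0,      1, 0, 0, 0, 0, 0],
})

# Mean of the 0/1 Survived column per (Sex, Pclass) is the survival share;
# unstack moves Pclass into the columns to give a Sex x Pclass table
rate_table = toy.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack('Pclass') * 100

print(rate_table)
```

Each row is a sex and each column a class, so reading across a row shows how class changes the rate within one sex.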
Survival by Age
The passengers' ages range from toddlers to 80-year-old men. Here we will explore the survival of different age groups.
Note that Age has a lot of missing values, exactly 177, which will be ignored in this analysis.
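The number of missing ages can be counted before dropping them. Here is a minimal sketch on a made-up `toy` frame; on the real data the same calls should report the 177 missing values mentioned above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_data (made-up rows, not the real dataset)
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, np.nan, 4.0],
                    'Survived': [0, 1, 1, 0, 1]})

n_missing = int(toy['Age'].isnull().sum())  # rows where Age is NaN
known_age = toy.dropna(subset=['Age'])      # keep only rows with a known age

print(f"{n_missing} missing, {len(known_age)} rows left")
```

Using `dropna(subset=['Age'])` keeps the other columns aligned with the remaining ages.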
age = titanic_data.Age[titanic_data.Age.notnull()]  # drop the NaN ages
plt.hist(age, 25, histtype='stepfilled')
plt.ylabel('Frequency')
plt.title('Age Distribution Histogram')
plt.xlabel('Age')
def cutDF(df):
    # Bin ages into labeled groups; the letter prefix keeps the groups sorted
    return pd.cut(df, [0, 12, 19, 35, 65, 120], labels=['A_Child', 'B_Teen', 'C_Youth', 'D_Adult', 'E_Old'])
titanic_data['Category'] = cutDF(titanic_data['Age'])
survival_by_age_count = titanic_data.groupby(['Category'])['Survived'].count()
survival_by_age = titanic_data.groupby(['Category'])['Survived'].sum()
plt.bar(np.arange(len(survival_by_age)), survival_by_age)
plt.ylabel('Number of People Survived')
plt.title('Number of People Survived by Age')
labels = ['Child', 'Teen', 'Youth', 'Adult', 'Old']
plt.xticks(np.arange(len(labels)), labels)
print(survival_by_age)
plt.show()
Category
A_Child     40
B_Teen      39
C_Youth    128
D_Adult     82
E_Old        1
Name: Survived, dtype: int64
Most of the survivors were Youth, but to compare rates we must take the total number of passengers in each age group into account.
survival_by_age_percent = survival_by_age / survival_by_age_count * 100
plt.bar(np.arange(len(survival_by_age_percent)), survival_by_age_percent)
plt.ylabel('Percentage Survived per Age')
plt.title('Survival Percentage by Age')
plt.xticks(np.arange(len(labels)), labels)
print(survival_by_age_percent)
plt.show()
Category
A_Child    57.971014
B_Teen     41.052632
C_Youth    38.438438
D_Adult    39.234450
E_Old      12.500000
Name: Survived, dtype: float64
Here it shows that children had the highest chance of surviving.
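The age groups in this section depend on how `pd.cut` treats ages that fall exactly on a bin edge. By default the bins are closed on the right, so with the edges used here a 12-year-old counts as a child and a 19-year-old as a teen. Here is a small sketch with a few hand-picked ages (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hand-picked ages, including values that sit exactly on a bin edge
ages = pd.Series([0.42, 12, 13, 19, 35, 36, 80, np.nan])

bins = [0, 12, 19, 35, 65, 120]
labels = ['A_Child', 'B_Teen', 'C_Youth', 'D_Adult', 'E_Old']

# Default right=True gives the intervals (0, 12], (12, 19], (19, 35], (35, 65], (65, 120]
groups = pd.cut(ages, bins, labels=labels)

print(pd.DataFrame({'Age': ages, 'Category': groups}))
```

A missing age stays missing after the cut, so those passengers drop out of the per-group counts automatically.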
Survival by Cabin
First-class passengers stayed in cabins whose numbers start with a letter from A to G. Did that have any effect on their survival?
survive_by_cabin_letter = pd.DataFrame({
    'count': np.zeros(8, dtype=int),
    'survivors': np.zeros(8, dtype=int)
}, index=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'])
for _, row in titanic_data[['Survived', 'Cabin']].dropna().iterrows():  # dropna drops rows with no cabin
    letter = row['Cabin'][0]  # first letter of the cabin
    survive_by_cabin_letter.loc[letter, 'count'] += 1
    survive_by_cabin_letter.loc[letter, 'survivors'] += row['Survived']
print("percent of survival per cabin letter")
plt.bar(np.arange(8), (survive_by_cabin_letter['survivors'] / survive_by_cabin_letter['count']) * 100)
plt.xticks(np.arange(8), ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'])
plt.ylabel('Percentage')
plt.title('Survival Based on Cabin')
plt.xlabel('Cabin Letter')
percent of survival per cabin letter
From the last result I can see that people in cabins B, D, and E survived more often than people in A, C, and G. Without a statistical test the difference cannot be called significant, so it does not indicate anything conclusive.
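The per-letter tally can also be done without an explicit loop, by grouping on the first character of the cabin string. This is an alternative sketch, not the code used above; the `toy` rows and cabin values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_data (made-up rows, not the real dataset)
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 1, 0, 0],
    'Cabin': ['C85', np.nan, 'C123', 'E46', 'B28', np.nan],
})

with_cabin = toy.dropna(subset=['Cabin'])  # ignore passengers with no cabin
letter = with_cabin['Cabin'].str[0]        # first letter of each cabin string

# Group size and number of survivors per cabin letter
by_letter = with_cabin.groupby(letter)['Survived'].agg(count='count', survivors='sum')
by_letter['percent'] = by_letter['survivors'] / by_letter['count'] * 100

print(by_letter)
```

Keeping `count` next to `percent` shows how many passengers stand behind each bar, which matters when a letter has only a handful of people.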
Conclusion
-The data is filled with missing values; we chose to ignore them and do all calculations without them, so the results are vague without a statistical test.
-Women and children had a higher chance of surviving.
-Higher-class people had a higher chance of surviving.
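As a pointer to the kind of statistical test mentioned in the first point, here is a sketch of a Pearson chi-square statistic (no continuity correction) for a 2x2 table of group vs. outcome. The counts in `observed` are made up, not taken from the dataset, and 3.841 is the standard 5% critical value for one degree of freedom.

```python
import numpy as np

# Made-up 2x2 table: rows = two groups, columns = (survived, died)
observed = np.array([[30, 10],
                     [20, 40]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # passengers per group
col_totals = observed.sum(axis=0, keepdims=True)  # passengers per outcome

# Counts expected if survival were independent of the group
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = ((observed - expected) ** 2 / expected).sum()

CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05
print(f"chi2 = {chi2:.2f}, significant at 5%: {chi2 > CRITICAL_5PCT_DF1}")
```

Replacing `observed` with the real counts for any two groups would show whether the gap between them is larger than chance alone would suggest.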