Thursday, October 27, 2016

Identify Fraud from Enron Email - Machine Learning

About Enron

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I play detective and build a person of interest (POI) identifier based on the financial and email data made public as a result of the Enron scandal.




About the Data

Udacity has combined this email data with a hand-generated list of persons of interest in the fraud case: individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for immunity from prosecution.



Data Characteristics

Total number of points: 145
POI vs Non-POI: 18 vs 127
Number of features: 22 (not including POI)
Feature with many NaN values: 'loan_advances' (removed from training)

Outlier Investigation

Plotting the data reveals one very obvious outlier: the 'TOTAL' entry, which is the summary row of the financial spreadsheet rather than a person.
 


After removing it with

data_dict.pop('TOTAL', 0)

the plot looks like this:




  • I can still see a few points that could be treated as outliers.
  • I decided to keep them because they are valid data points: they belong to two of the biggest Enron bosses, who are definitely persons of interest.


Selecting Features

To select features I used SelectKBest with different values of K, and tested the final results with different algorithms (below).
First, the feature scores are as follows:
Feature                      Score               Selected
salary                       15.8060900874       Yes
bonus                        30.6522823057       Yes
deferral_payments            0.00981944641905    No
deferred_income              8.49349703055       Yes
director_fees                1.64109792617       Yes
exercised_stock_options      9.95616758208       Yes
expenses                     4.31439557308       Yes
from_messages                0.434625706635      No
from_poi_to_this_person      4.93930363951       Yes
from_this_person_to_poi      0.105897968337      No
loan_advances                7.03793279819       No (Removed)
long_term_incentive          7.53452224003       Yes
other                        3.19668450433       Yes
restricted_stock             8.051101897         Yes
restricted_stock_deferred    0.679280338952      Yes
shared_receipt_with_poi      10.6697373596       Yes
to_messages                  2.60677186644       Yes
total_payments               8.962715501         Yes
total_stock_value            10.814634863        Yes
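
A minimal sketch of how per-feature scores like these can be obtained from SelectKBest. It assumes the default f_classif scoring function (the post does not say which one was used) and runs on small synthetic data with hypothetical feature names, not on the Enron data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (assumption): two informative columns and one noise column
rng = np.random.default_rng(0)
y = np.array([0] * 40 + [1] * 10)
X = np.column_stack([
    rng.normal(0, 1, 50) + 2 * y,   # shifted for the positive class
    rng.normal(0, 1, 50) + 3 * y,   # shifted even more
    rng.normal(0, 1, 50),           # pure noise
])
feature_names = ["informative_a", "informative_b", "noise"]  # hypothetical names

# Keep the K best-scoring features
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

for name, score, kept in zip(feature_names, selector.scores_, selector.get_support()):
    print(name, round(score, 3), "Yes" if kept else "No")
```

The scores_ attribute holds one score per input column and get_support() marks which columns were kept, which is enough to build a table like the one above.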

 


New Features

I decided to create two new features based on the number of messages to and from POIs.
Each one is a ratio between the messages a person exchanged with POIs and the messages that person sent or received overall.
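
One way such ratio features could be computed is sketched below. It assumes the data is a dict of per-person feature dicts in which missing values are stored as the string 'NaN'; the new feature names (fraction_from_poi, fraction_to_poi) and the sample people are hypothetical.

```python
def ratio(part, whole):
    """Return part / whole, or 0.0 if a value is missing ('NaN') or whole is 0."""
    if part == "NaN" or whole == "NaN" or whole == 0:
        return 0.0
    return part / whole


def add_poi_ratios(data_dict):
    """Add two ratio features to every person in data_dict (modified in place)."""
    for person in data_dict.values():
        # Share of received messages that came from a POI
        person["fraction_from_poi"] = ratio(
            person["from_poi_to_this_person"], person["to_messages"])
        # Share of sent messages that went to a POI
        person["fraction_to_poi"] = ratio(
            person["from_this_person_to_poi"], person["from_messages"])
    return data_dict


# Made-up sample entries, for illustration only
sample = {
    "PERSON A": {"from_poi_to_this_person": 20, "to_messages": 200,
                 "from_this_person_to_poi": 5, "from_messages": 50},
    "PERSON B": {"from_poi_to_this_person": "NaN", "to_messages": "NaN",
                 "from_this_person_to_poi": "NaN", "from_messages": "NaN"},
}
add_poi_ratios(sample)
print(sample["PERSON A"]["fraction_from_poi"], sample["PERSON A"]["fraction_to_poi"])
```

Returning 0.0 for missing values is a design choice in this sketch; it keeps people without email data in the dataset instead of dropping them.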

Score        With New Features    Without
Accuracy     0.85287              0.85260
Precision    0.42837              0.42699
Recall       0.30950              0.30850
F1           0.35936              0.35820

There is only a very slight improvement, so I also tried dividing by the total number of messages sent and received to see whether the result changed. That gave higher feature scores but worse final results.

Scaling

  • The main algorithm (Adaboost) is not affected by feature scaling, so no scaling is used in the final classifier.
  • When testing other algorithms, however, a StandardScaler() was inserted into the pipeline.



    Choosing The Algorithm

    Selection

    I tried different values of K for SelectKBest; the results are below.

    Decision Tree

    A Decision Tree classifier with default parameters, used to see which number of features works best:

    Score        K=3        K=6        K=9        K=12
    Accuracy     0.80869    0.84300    0.84127    0.83867
    Precision    0.10789    0.24679    0.23357    0.22222
    Recall       0.03350    0.08650    0.08350    0.08400
    F1           0.05113    0.12810    0.12302    0.12192




    Adaboost

    learning_rate = 1, n_estimators = 50

    Score        K=5        K=9        K=12       K=15
    Accuracy     0.81393    0.83027    0.84940    0.84847
    Precision    0.30597    0.31678    0.41172    0.41107
    Recall       0.23850    0.23600    0.30200    0.31550
    F1           0.26805    0.27049    0.34843    0.35700

    - It seems that, in general, the more features are used, the better the precision and recall.
    - K=15 gives the best F1 score so far, so that is what I will use for the final classifier.

    Trying Random Forest

    Score        n_estimators=20    n_estimators=50
    Accuracy     0.85573            0.83387
    Precision    0.37538            0.30288
    Recall       0.12350            0.18900
    F1           0.18585            0.23276

    Random Forest is definitely an improvement over the Decision Tree, but it is still not better than Adaboost.

    Conclusion

    I chose Adaboost as the classifier because it gave the best combination of recall and precision.


    Tuning The Algorithm

    Adaboost did very well out of the box, but to get the best possible results it must be tuned to the data.
    The objective of algorithm tuning is to find the parameter values and configuration that best fit the problem.
    I tried two approaches to get the best out of Adaboost:
    • 1) GridSearchCV (automatic search)
    • 2) Trying different values manually

    SelectKBest (k = 4)
    AdaBoostClassifier (learning_rate = 3, n_estimators = 20)
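
The automatic search could look roughly like the sketch below: a pipeline of SelectKBest followed by AdaBoostClassifier, tuned with GridSearchCV. The parameter grid, the f1 scoring choice, the 3-fold cross-validation and the synthetic data are all assumptions made for illustration; they are not the settings of the original project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the Enron features (assumption: not the real data)
X, y = make_classification(n_samples=145, n_features=18, n_informative=5,
                           weights=[0.875], random_state=0)

# Feature selection and classifier are tuned together as one estimator
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", AdaBoostClassifier(random_state=0)),
])

# Hypothetical grid; step names are prefixed to the parameter names
param_grid = {
    "select__k": [4, 9, 15],
    "clf__n_estimators": [20, 50],
    "clf__learning_rate": [0.5, 1.0],
}

# F1 balances precision and recall, which matters for a skewed class
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_)
```

Putting the selector inside the pipeline means K is chosen on the training folds only, so the feature selection does not see the data used for scoring.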




    Validation

    Learning the parameters of a prediction function and testing it on the same data is a mistake: the model can score perfectly on data it has already seen and still fail when the data changes. This situation is called overfitting. To detect it, it is common practice to hold out part of the available data as a test set and train the classifier on the rest.
    For validation I split the data into a 70% training set and a 30% test set.
    This way the reported scores come from data the model has not seen during training, so overfitting shows up as a drop in test performance.
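
A minimal sketch of a 70/30 hold-out split with scikit-learn's train_test_split. The data is synthetic, and the stratify argument is my addition (it keeps the POI share similar in both parts); the post does not say whether the original split was stratified.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the Enron data (assumption)
X, y = make_classification(n_samples=145, n_features=18,
                           weights=[0.875], random_state=0)

# Hold out 30% of the points for testing; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(len(X_train), len(X_test))
```

With so few positive examples a single split can be noisy, which is one reason the provided tester.py reports results over many predictions rather than one test set.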


    Evaluation

    Notes about evaluation metrics

    Accuracy:

    • is the ratio of correct predictions to all predictions.
    • Accuracy is not a good score in this case since the data is highly skewed (18 POI vs. 127 non-POI).
    • E.g. a classifier that always predicts "not POI" would still get about 87.6% accuracy (127 of 145).

    Precision :

    • Precision is the ability of the classifier not to label a negative sample as positive.
    • High precision means that when the classifier flags a sample as positive, it is very likely to really be positive.
    • Precision is tp/(tp+fp)

    Recall:

    • is the ability of the classifier to find all the positive samples.
    • Very high recall means that the classifier misses very few positives, possibly at the cost of also flagging some false positives.
    • Recall = tp/(tp+fn)
    I used the provided tester.py to evaluate my classifier, along with the classifier's .score() method.
    The best performance was obtained using Adaboost with these parameters:
    AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1, n_estimators=50, random_state=None)
        Accuracy: 0.84847       Precision: 0.41107      Recall: 0.31550 F1: 0.35700     F2: 0.33089
        Total predictions: 15000        True positives:  631    False positives:  904   False negatives: 1369   True negatives: 12096
    
    test time: 100.252 s
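
As a sanity check, the reported scores can be recomputed from the confusion counts printed above. This is a small sketch using only the formulas given in the evaluation notes plus the standard F-beta definitions; the function name is hypothetical.

```python
def summarize(tp, fp, fn, tn):
    """Compute the evaluation metrics from raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        # F1 weighs precision and recall equally
        "f1": 2 * precision * recall / (precision + recall),
        # F2 weighs recall more heavily than precision (beta = 2)
        "f2": 5 * precision * recall / (4 * precision + recall),
    }


# Counts taken from the tester.py output above
scores = summarize(tp=631, fp=904, fn=1369, tn=12096)
for name, value in scores.items():
    print(f"{name}: {value:.5f}")
```

The rounded values agree with the tester output, which confirms that the five scores are all derived from the same 15,000 predictions.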


    This is just a summary of the full project.

Saturday, October 22, 2016

Analyzing four years of Traffic Violations using R

Traffic Violations Exploratory Data Analysis by Nour Galaby

 

 

















This is a summary of the full project.
Full project here: https://github.com/NourGalaby/Traffic-Violations-Exportory-Data-Analysis-in-R

Traffic Violations EDA (Exploratory Data Analysis)

This dataset contains information on all electronic traffic violations issued in the County of Montgomery.
It covers violations from 2012 to 2016, with more than 800,000 entries.
In this project I use R to perform an Exploratory Data Analysis (EDA) on this dataset.
Sit tight, let's get started.



This dataset contains 24 variables:
  • “Date.Of.Violation” : Date when the violation occurred ex: 1/1/2012
  • “Time.Of.Violation” : Time when the violation happened ex: 00:01:00
  • “Violation.Description” : Description of the violation in text
  • “Violation.Location” : The location name in text
  • “Latitude” : Latitude location ex: 77.04796
  • “Longitude” : Longitude location ex: 39.05742
  • “Geolocation” : Both Latitude and Longitude ex: 77.1273633333333, 39.0908983333333
  • “Belts.Flag” : If the driver wore a belt at the time of the violation (Yes, No)
  • “Personal.Injury” : If any personal injury occurred as a result of the violation (Yes, No)
  • “Property.Damage” : If any property damage occurred as a result of the violation (Yes, No)
  • “Commercial.License” : If the driver has a commercial license (Yes, No)
  • “Commercial.Vehicle” : If the vehicle has a commercial license (Yes, No)
  • “Alcohol” : If the driver was DUI at the time of the violation (Yes, No)
  • “Work.Zone” : If the violation happened in a work zone (Yes, No)
  • “Violation.State” : The state where the violation happened ex: MD
  • “Vehicle.Type” : ex: Automobile, Truck, Motorbike
  • “Vehicle.Production.Year”: ex:1990
  • “Vehicle.Manfacturer” : ex:Toyota
  • “Vehicle.Model” : ex: CoROLLA
  • “Vehicle.Color” : ex: Black, White
  • “Caused.an.Accident” : if the violation caused an accident (Yes, No)
  • “Gender” : Gender of driver (M,F)
  • “Driver.City” : City of driver ex:BETHESDA
  • “Driver.State” : ex: MD






The main features of interest are the date and time of the violation and the damage caused. I would like to see how violations are distributed over each year, and whether there is a certain period in which many violations happen.



Let's start by plotting violations over time...


This is the data from 2012 to 2016. There may be some patterns, but the plot is too noisy to see anything clearly. Let's smooth it and try again.

Adding smoother

 


The default smoother doesn't help much because there are too many data points. Let's group the data by week, take the average over each week, and look again.
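
The full project does this step in R. Purely as an illustration of the idea, here is a short Python sketch that counts violations per day and then averages those daily counts within each ISO week; the function name and the sample dates are made up.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_average(violation_dates):
    """Average the daily violation counts within each (ISO year, ISO week)."""
    daily_counts = Counter(violation_dates)
    by_week = defaultdict(list)
    for day, count in daily_counts.items():
        iso_year, iso_week, _ = day.isocalendar()
        by_week[(iso_year, iso_week)].append(count)
    # Days without any violation are not in the data, so they are not averaged in
    return {week: sum(counts) / len(counts)
            for week, counts in sorted(by_week.items())}


# Made-up sample: one date per recorded violation
sample = [date(2012, 1, 2)] * 3 + [date(2012, 1, 3)] * 5 + [date(2012, 1, 9)] * 4
print(weekly_average(sample))
```

Averaging per week replaces hundreds of noisy daily points per year with about 52, which is what makes the seasonal pattern visible.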

Group by week


Much better. If you look closely there may be a pattern here,
but we will look into that shortly. Let's try grouping by month too.

Grouping over each month


Now the pattern is clear. To make it even clearer, let's group by year and plot
the years on top of each other.

Coloring Years


We can see that violations increase over the years, and there seems to be a certain time of year when violations peak.

Plotting years

 


Here we can see that May has the most violations of the year, followed by October. Could that be because more people travel
there in the summer? Or simply because summer starts and people go out more? I wonder...
In 2015 something was different, and the peak was no longer in May.


From this plot we can clearly see how much each week differs from year to year.


Plotting People that caused damage by date and gender




I notice something here: most violations are by males, but on days when males commit few violations, females commit many, and vice versa. We can see it in the spikes: a positive spike for males is often coupled with a negative spike for females. This should be looked at more closely.

Alcohol only Violations   


It seems that most alcohol violations, for both men and women, happen between 2 PM and 5 PM.

Summary


This graph shows the count of violations in each minute of the day, so it shows when violations generally happen.
Here are some things to notice about this graph:
  • the line is the weighted mean, calculated with a sliding window.
  • from 00:00 to 8:00 the variance in the number of violations is very low (all points are close together)
  • violations peak twice a day: at 7:00 and at 11:00 PM


  • From this plot we can see that the number of violations increased over the years until it peaked in 2014, then started to come down in 2015.
  • May and October have the most violations in all years. I wonder why?


This plot shows the locations of violations in one particular area, zoomed in. I chose it because the violations seem to draw the map of the streets.
You can tell the major streets just by looking at the violations, and the result looks oddly like blood veins.
This plot may not convey a lot of information, but I find it very interesting, which is why I chose to put it in the summary.

Reflection


The traffic violation data set contains information on more than 800,000 violations that occurred from 2012 to 2016. This analysis shows how violations increased through the years and when they occur most often; I learned that May and October see the most violations.
I also used this data to find the most popular car makes and models.
This data could be used to help reduce violations and understand their causes, for example by analyzing the locations where violations occur most often.
My main struggle with this dataset was that most of its variables are categorical rather than continuous. This made it hard to derive insights and make comparisons, so I relied heavily on the count of violations after grouping by each category. That still gave very interesting insights (for example in date/time and location).
One improvement, which could be future work, is to combine this data with labeled map data so that we can see clearly where the violations occur.
The violation descriptions could also be grouped into categories (ex: speeding, ignoring a traffic light, reckless driving) and studied further to help reduce violations and accidents, and make traffic better for everyone.



Source Code and full project:  https://github.com/NourGalaby/Traffic-Violations-Exportory-Data-Analysis-in-R

Friday, September 2, 2016

Titanic data Analysis

This is my analysis of the famous Titanic passengers dataset. I will use the Python libraries NumPy, Pandas, and Matplotlib.

About the Titanic dataset:

The Titanic dataset contains demographics and passenger information for 891 of the 2224 passengers and crew on board the Titanic. You can view a description of this dataset on the Kaggle website, where the data was obtained.


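import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the passenger data (CSV obtained from Kaggle)
titanic_data = pd.read_csv('titanic_data.csv')

# titanic_data.head()
print(titanic_data.describe())
We can see from the data that 342 of the 891 people survived, i.e. 38%.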

Questions

Did males or females survive more often?

How did age affect survival?

How did class affect survival?

Did the cabin letter (A, B, C, D, E, F, G) matter for survival?

First Look at the Data


From the pie chart we see that in this data sample 61.6% died while only 38.4% survived.

Male vs Females


# Number of survivors per sex (select the column first so only 'Survived' is summed)
sex_survive = titanic_data.groupby('Sex')['Survived'].sum()




It's clear that females had a better chance of surviving than males.
“Women and Children First”

Male / Female by Class

survival_by_class = titanic_data.groupby(['Sex', 'Pclass'])['Survived'].sum()
%matplotlib inline
print(survival_by_class)




Sex     Pclass
female  1         91
        2         70
        3         72
male    1         45
        2         17
        3         47
Name: Survived, dtype: int64

We can see that more first class women survived than women of any other class, but what does this tell us? Did higher class women have a higher chance of survival? We can't really tell from these counts because we don't know how many higher class women were aboard in total.
So, in general, we need to take into consideration the total number of people of each class who were aboard and the percentage that survived.
Note also that the number of third class men who survived is higher than for first and second class men.
class_count = titanic_data.groupby(['Sex', 'Pclass'])['Survived'].count()

# Survival rate per sex and class, as a percentage
percent_survived = (survival_by_class / class_count) * 100
x = np.arange(3)  # one slot per passenger class
p1 = plt.bar(x + 0.4, percent_survived['male'], color='c', width=0.4)
p2 = plt.bar(x, percent_survived['female'], color='pink', width=0.4)
plt.ylabel('Percentage Survived per Class')
plt.title('Survival Percentage by Class')
# Bars are centred on their x positions (matplotlib 2.0+), so label the middle of each pair
plt.xticks(x + 0.2, ('First Class', 'Second Class', 'Third Class'))
plt.legend((p1[0], p2[0]), ('Men', 'Women'))
plt.show()

print(percent_survived)


Sex     Pclass
female  1         96.808511
        2         92.105263
        3         50.000000
male    1         36.885246
        2         15.740741
        3         13.544669
Name: Survived, dtype: float64
It's clearer here that 96.8% and 92.1% of first and second class women, respectively, survived, compared to only 50% of third class women. This clearly shows that first and second class women had a much higher chance of surviving than third class women.
For men it's another story: the survival rate for class 1 is much higher than for classes 2 and 3.

Survival by Age

The passengers' ages range from toddlers to 80-year-old men. Here we will explore the survival of different age groups.
Note that age has a lot of missing values, exactly 177, which are ignored in this analysis.
age = titanic_data.Age[titanic_data.Age.notnull()]  # drop the NaN ages

plt.hist(age, 25, histtype='stepfilled')
plt.ylabel('Frequency')
plt.title('Age Distribution Histogram')
plt.xlabel('Age')







def cutDF(ages):
    # Bin ages into labelled groups; the letter prefix keeps the groups sorted
    return pd.cut(ages, [0, 12, 19, 35, 65, 120],
                  labels=['A_Child', 'B_Teen', 'C_Youth', 'D_Adult', 'E_Old'])

titanic_data['Category'] = cutDF(titanic_data['Age'])

# Passengers with a known age, and survivors, per age group
survival_by_age_count = titanic_data.groupby(['Category'])['Survived'].count()
survival_by_age = titanic_data.groupby(['Category'])['Survived'].sum()

plt.bar(np.arange(len(survival_by_age)), survival_by_age)

plt.ylabel('Number of People Survived')
plt.title('Number of People Survived By Age')
labels = ['Child', 'Teen', 'Youth', 'Adult', 'Old']
# Bars are centred on their x positions (matplotlib 2.0+)
plt.xticks(np.arange(len(labels)), labels)

print(survival_by_age)
plt.show()



Category
A_Child     40
B_Teen      39
C_Youth    128
D_Adult     82
E_Old        1
Name: Survived, dtype: int64

Most of the survivors were Youth, but to compare rates we must take the total number of passengers in each age group into account.
survival_by_age_percent = survival_by_age / survival_by_age_count * 100

plt.bar(np.arange(len(survival_by_age_percent)), survival_by_age_percent)

plt.ylabel('Percentage Survived per Age')
plt.title('Survival Percentage by Age')
# Bars are centred on their x positions (matplotlib 2.0+)
plt.xticks(np.arange(len(labels)), labels)


print(survival_by_age_percent)
plt.show()


Category
A_Child    57.971014
B_Teen     41.052632
C_Youth    38.438438
D_Adult    39.234450
E_Old      12.500000
Name: Survived, dtype: float64

This shows that children had the highest chance of surviving.

Survival by Cabin

First class passengers stayed in cabins labelled with a letter from A to G. Did that have any effect on their survival?

cabin_letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T']
survive_by_cabin_letter = pd.DataFrame({
    'count': np.zeros(8, dtype=int),
    'survivors': np.zeros(8, dtype=int)
}, index=cabin_letters)

# dropna() skips passengers without a recorded cabin
for _, row in titanic_data[['Survived', 'Cabin']].dropna().iterrows():
    letter = row['Cabin'][0]  # first letter of the cabin code
    # .loc avoids chained assignment, which may silently update a copy
    survive_by_cabin_letter.loc[letter, 'count'] += 1
    survive_by_cabin_letter.loc[letter, 'survivors'] += row['Survived']

print("Percent of survival per cabin letter")

percent = survive_by_cabin_letter['survivors'] / survive_by_cabin_letter['count'] * 100
plt.bar(np.arange(8), percent)
plt.xticks(np.arange(8), cabin_letters)  # bars are centred on their x positions
plt.ylabel('Percentage')
plt.title('Survival Based on Cabin')
plt.xlabel('Cabin Letter')


Percent of survival per cabin letter

From the last result I can see that people in cabins B, D and E survived more often than people in cabins A, C and G. However, the difference is not significant, so it does not indicate anything.

Conclusion

- The data is filled with missing values. We chose to ignore them and do all calculations without them, so the results remain tentative without a statistical test.
- Women and children had a higher chance of surviving.
- Higher class people had a higher chance of surviving.