Thursday, October 27, 2016

Identify Fraud from Enron Email - Machine Learning

About Enron

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I will play detective and put my skills to use by building a person-of-interest identifier based on financial and email data made public as a result of the Enron scandal.




About the Data

Udacity has combined this email data with a hand-generated list of persons of interest (POIs) in the fraud case, meaning individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.



Data Characteristics

Total number of data points: 145
POI vs. non-POI: 18 vs. 127
Number of features: 22 (not including the POI label)
Feature with many NaN values: 'loan_advances' (removed from training)

Outlier Investigation

By plotting the data I can see one very obvious outlier, which is the 'TOTAL' entry.

After removing it with

data_dict.pop('TOTAL', 0)

the remaining points fall into a much more sensible range.




• I can still see some points that could be treated as outliers.
• I decided to leave them in because they are valid data points, and they belong to two of the biggest Enron bosses, who are definitely persons of interest (a sketch of the plotting and removal step follows).
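As an illustration, here is a minimal sketch of how the outlier could be spotted and removed; it assumes the data_dict structure from the Udacity starter code, where missing values appear as the string 'NaN':

import matplotlib.pyplot as plt

# Assumes data_dict maps person names to feature dicts (Udacity starter code).
def to_num(value):
    # Treat the starter code's 'NaN' strings as zero for plotting.
    return 0.0 if value == 'NaN' else float(value)

# Scatter salary vs. bonus; the 'TOTAL' row dwarfs every real person.
salaries = [to_num(person['salary']) for person in data_dict.values()]
bonuses = [to_num(person['bonus']) for person in data_dict.values()]
plt.scatter(salaries, bonuses)
plt.xlabel('salary')
plt.ylabel('bonus')
plt.show()

# Drop the aggregate row, then re-plot to confirm the rest looks sane.
data_dict.pop('TOTAL', 0)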


Selecting Features

For feature selection I used SelectKBest, trying different values of K and testing the final results with different algorithms (below).
First, the feature scores were as follows (a sketch of how they can be computed appears after the table):
Feature                     Score             Selected
salary                      15.8060900874     Yes
bonus                       30.6522823057     Yes
deferral_payments           0.00981944641905  No
deferred_income             8.49349703055     Yes
director_fees               1.64109792617     Yes
exercised_stock_options     9.95616758208     Yes
expenses                    4.31439557308     Yes
from_messages               0.434625706635    No
from_poi_to_this_person     4.93930363951     Yes
from_this_person_to_poi     0.105897968337    No
loan_advances               7.03793279819     No (removed)
long_term_incentive         7.53452224003     Yes
other                       3.19668450433     Yes
restricted_stock            8.051101897       Yes
restricted_stock_deferred   0.679280338952    Yes
shared_receipt_with_poi     10.6697373596     Yes
to_messages                 2.60677186644     Yes
total_payments              8.962715501       Yes
total_stock_value           10.814634863      Yes
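A minimal sketch of how such a score table could be produced, assuming `features`, `labels`, and a matching `feature_names` list were already extracted from data_dict with the starter code's helpers:

from sklearn.feature_selection import SelectKBest, f_classif

# f_classif (ANOVA F-value) is SelectKBest's default scoring function;
# I am assuming that is what produced the scores above.
selector = SelectKBest(score_func=f_classif, k=15)
selector.fit(features, labels)

# Pair each feature name with its score, highest first.
for name, score in sorted(zip(feature_names, selector.scores_),
                          key=lambda pair: pair[1], reverse=True):
    print('%-28s %.4f' % (name, score))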

 


New Features

I decided to make two new features based on the number of messages to and from POIs:
they are the ratio of all messages a person sent or received to the messages they sent to or received from a POI (see the sketch below).
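A minimal sketch of how these two features could be added to data_dict; the feature names and the zero guard are my own illustrative choices:

# Assumes data_dict from the starter code, with 'NaN' marking missing values.
def safe_ratio(numerator, denominator):
    # Return 0 when either count is missing or zero to avoid dividing by zero.
    if numerator in (0, 'NaN') or denominator in (0, 'NaN'):
        return 0.0
    return float(numerator) / float(denominator)

for person in data_dict.values():
    # All received messages relative to those received from a POI.
    person['from_poi_ratio'] = safe_ratio(person['to_messages'],
                                          person['from_poi_to_this_person'])
    # All sent messages relative to those sent to a POI.
    person['to_poi_ratio'] = safe_ratio(person['from_messages'],
                                        person['from_this_person_to_poi'])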

Score       With new features   Without
Accuracy    0.85287             0.85260
Precision   0.42837             0.42699
Recall      0.30950             0.30850
F1          0.35936             0.35820

I can see a very slight improvement, so I also tried dividing by the total number of messages sent and received to see if I would get a different result; that version got higher feature scores but worse final results.

Scaling

• As the main algorithm (AdaBoost) is not affected by feature scaling, no scaling is used in the final pipeline.
• However, when testing other algorithms, a StandardScaler() was inserted into the pipeline (see the sketch below).
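A minimal sketch of such a pipeline; the k value and classifier here are placeholders for whichever algorithm was being tested:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier

# Scaling is harmless for tree-based models, so the scaler can stay in
# the pipeline while different classifiers are swapped in and out.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(k=9)),       # placeholder k
    ('clf', DecisionTreeClassifier()),  # placeholder classifier
])
pipeline.fit(features_train, labels_train)  # assumes a train/test split exists
print(pipeline.score(features_test, labels_test))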



Choosing the Algorithm

Selection

I tried different values of K for SelectKBest; the results for each algorithm are below.

Decision Tree

Using a Decision Tree classifier with default parameters to find the best number of features (a sketch of the sweep follows the table):

Score       K=3       K=6       K=9       K=12
Accuracy    0.80869   0.84300   0.84127   0.83867
Precision   0.10789   0.24679   0.23357   0.22222
Recall      0.03350   0.08650   0.08350   0.08400
F1          0.05113   0.12810   0.12302   0.12192
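A minimal sketch of the kind of sweep that could produce this table, assuming the Udacity-provided tester.py (whose test_classifier helper prints accuracy, precision, recall, and F1) and the my_dataset/features_list variables from the starter script:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier
from tester import test_classifier  # Udacity-provided evaluation script

# Try several feature-set sizes and let tester.py report the metrics.
for k in (3, 6, 9, 12):
    clf = Pipeline([
        ('select', SelectKBest(k=k)),
        ('tree', DecisionTreeClassifier()),
    ])
    print('K = %d' % k)
    test_classifier(clf, my_dataset, features_list)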




AdaBoost

With learning_rate=1 and n_estimators=50:

Score       K=5       K=9       K=12      K=15
Accuracy    0.81393   0.83027   0.84940   0.84847
Precision   0.30597   0.31678   0.41172   0.41107
Recall      0.23850   0.23600   0.30200   0.31550
F1          0.26805   0.27049   0.34843   0.35700

It seems that the more features I keep, the better the precision and recall scores. At K=15 I get the best performance so far, which is what I will use for the final classifier.

Trying Random Forest

Score       n_estimators=20   n_estimators=50
Accuracy    0.85573           0.83387
Precision   0.37538           0.30288
Recall      0.12350           0.18900
F1          0.18585           0.23276

Here Random Forest is definitely an improvement over the single Decision Tree, but still not better than AdaBoost.

Conclusion

I decided to choose AdaBoost as the classifier because it gave the best precision and recall results.


Tuning the Algorithm

The AdaBoost algorithm did very well out of the box, but to get the best possible results from it, we must tune it to our data.
The objective of algorithm tuning is to find the parameter values and configuration that best fit the problem.
I tried two approaches to get the best out of AdaBoost (see the sketch below):
• 1) automatic search with GridSearchCV
• 2) trying different values manually

The best configuration found by the grid search was SelectKBest (K=4) with an AdaBoost classifier (learning_rate=3, n_estimators=20); the manually tuned setup (K=15 with the default learning_rate=1 and n_estimators=50) scored better under tester.py, so that is what I kept.
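For reference, a minimal sketch of what the GridSearchCV approach could look like; the exact grid and scoring below are my own assumptions, with F1 over a stratified shuffle split standing in for tester.py's evaluation:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Assumes `features` and `labels` arrays built from data_dict earlier.
pipeline = Pipeline([
    ('select', SelectKBest()),
    ('boost', AdaBoostClassifier()),
])

# Hypothetical grid covering the values discussed above.
param_grid = {
    'select__k': [4, 9, 12, 15],
    'boost__learning_rate': [0.5, 1, 3],
    'boost__n_estimators': [20, 50, 100],
}

# Many small stratified splits, similar in spirit to tester.py.
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=cv)
search.fit(features, labels)
print(search.best_params_)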




Validation

Learning the parameters of a prediction function and testing it on the same data is a mistake: the model will simply repeat the labels it has already seen and fail when the data changes. This situation is called overfitting. To avoid it, it is common practice to hold out part of the available data as a test set (in my case 30%) and use the rest to train the classifier.
For validation I split the data into a 70% training set and a 30% test set (a sketch follows).
This way I can be more confident that my model will not overfit the training data.
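A minimal sketch of that split; stratifying on the labels is my own addition, which keeps the rare POI class represented in both sets:

from sklearn.model_selection import train_test_split

# Assumes `features` and `labels` arrays built earlier.
# stratify=labels keeps the 18 POIs proportionally represented in both
# splits (an assumption; the write-up does not specify this).
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42)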


Evaluation

Notes about the evaluation metrics

Accuracy:

• Accuracy is the ratio of correct predictions to all predictions.
• Accuracy is not a good metric in this case since the data is highly skewed (18 POIs out of 145 points).
• e.g. a classifier that always predicted non-POI would still get about 87.6% accuracy (127/145).

Precision:

• Precision is the ability of the classifier not to label as positive a sample that is negative.
• High precision means that when the classifier identifies someone as a positive, it is very likely to really be a positive.
• Precision = tp / (tp + fp)

Recall:

• Recall is the ability of the classifier to find all the positive samples.
• Very high recall means the classifier will miss almost no true positives, even if that comes at the cost of some extra false positives.
• Recall = tp / (tp + fn) (a worked example of both formulas follows)
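As a quick illustration of the two formulas, a small sketch with made-up predictions:

from sklearn.metrics import precision_score, recall_score

# Made-up labels for six samples (1 = POI, 0 = non-POI).
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# tp = 2, fp = 1, fn = 1
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.667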
I used the provided tester.py to evaluate my classifier, along with the classifier's .score() method.
The best performance was with AdaBoost using these parameters:

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1, n_estimators=50, random_state=None)

Accuracy: 0.84847    Precision: 0.41107    Recall: 0.31550    F1: 0.35700    F2: 0.33089
Total predictions: 15000    True positives: 631    False positives: 904    False negatives: 1369    True negatives: 12096

Test time: 100.252 s


This is just a summary of the full project.
