About Enron
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I will play detective, putting my skills to use by building a person-of-interest identifier based on financial and email data made public as a result of the Enron scandal.
About the Data
Udacity has combined this email data with a hand-generated list of persons of interest in the fraud case: individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.
Data Characteristics
- Total number of data points: 145
- POI vs. non-POI: 18 vs. 127
- Number of features: 22 (not including POI)
- Feature with many NaN values: 'loan_advances' (removed from training)
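As a sketch of how such NaN counts can be found, the snippet below tallies the `'NaN'` placeholder strings per feature in a dictionary shaped like the project's `data_dict` (the helper name and sample data are my own):

```python
# Count 'NaN' placeholders per feature in a data dict shaped like the
# Udacity Enron dataset: person name -> dict of feature values, where
# missing values are stored as the string 'NaN'.
from collections import Counter

def count_nans(data_dict):
    """Return a Counter of how many people have 'NaN' for each feature."""
    nan_counts = Counter()
    for person, features in data_dict.items():
        for feature, value in features.items():
            if value == 'NaN':
                nan_counts[feature] += 1
    return nan_counts

# Tiny hand-made example (not real Enron records):
sample = {
    'PERSON A': {'salary': 1000, 'loan_advances': 'NaN'},
    'PERSON B': {'salary': 'NaN', 'loan_advances': 'NaN'},
}
print(count_nans(sample))  # loan_advances seen twice, salary once
```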
Outlier Investigation
By plotting the data, I can see one very obvious outlier: the `TOTAL` row, a spreadsheet aggregate rather than a real person. After removing it with

```python
data_dict.pop('TOTAL', 0)
```

the result is now:
- I can still see some points that may be treated as outliers.
- I decided to leave them because they are valid data points: they belong to two of the biggest Enron bosses, so they are definitely persons of interest.
Selecting Features
To select features I used SelectKBest, trying different values of k, and I tested the final results with different algorithms (below).
First, the scores of the features are as follows:
Feature | Score | Selected |
---|---|---|
salary | 15.8060900874 | Yes |
bonus | 30.6522823057 | Yes |
deferral_payments | 0.00981944641905 | No |
deferred_income | 8.49349703055 | Yes |
director_fees | 1.64109792617 | Yes |
exercised_stock_options | 9.95616758208 | Yes |
expenses | 4.31439557308 | Yes |
from_messages | 0.434625706635 | No |
from_poi_to_this_person | 4.93930363951 | Yes |
from_this_person_to_poi | 0.105897968337 | No |
loan_advances | 7.03793279819 | No (Removed) |
long_term_incentive | 7.53452224003 | Yes |
other | 3.19668450433 | Yes |
restricted_stock | 8.051101897 | Yes |
restricted_stock_deferred | 0.679280338952 | Yes |
shared_receipt_with_poi | 10.6697373596 | Yes |
to_messages | 2.60677186644 | Yes |
total_payments | 8.962715501 | Yes |
total_stock_value | 10.814634863 | Yes |
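A minimal sketch of how such scores can be produced with SelectKBest, using stand-in data; in the project, `X` and `y` would come from the extracted Enron features and POI labels:

```python
# Score features with SelectKBest (ANOVA F-test) and mark which are kept.
# The data here is synthetic stand-in data, not the Enron dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=145, n_features=18, random_state=42)

selector = SelectKBest(score_func=f_classif, k=15)
selector.fit(X, y)

# selector.scores_ holds the F-score for each feature;
# selector.get_support() marks the k features that were selected.
for score, kept in zip(selector.scores_, selector.get_support()):
    print('%.3f  selected=%s' % (score, kept))
```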
New Features
I decided to make two new features based on the number of messages to and from POIs: the ratio of the messages a person sent to or received from POIs relative to the messages they sent or received overall.
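One possible construction of these two ratio features is sketched below; the field names follow the dataset, while the helper names and the thresholds for missing values are my own assumptions:

```python
# Add two ratio features to one person's feature dict: the fraction of
# received messages that came from a POI, and the fraction of sent
# messages that went to a POI. Missing values ('NaN') yield a ratio of 0.
def add_poi_ratios(features):
    """Mutate and return one person's feature dict with two ratio features."""
    def ratio(part, total):
        if part in ('NaN', None) or total in ('NaN', None) or total == 0:
            return 0.0
        return float(part) / float(total)

    features['fraction_from_poi'] = ratio(
        features.get('from_poi_to_this_person'), features.get('to_messages'))
    features['fraction_to_poi'] = ratio(
        features.get('from_this_person_to_poi'), features.get('from_messages'))
    return features

# Hand-made example record:
person = {'from_poi_to_this_person': 10, 'to_messages': 100,
          'from_this_person_to_poi': 5, 'from_messages': 50}
add_poi_ratios(person)
print(person['fraction_from_poi'], person['fraction_to_poi'])  # 0.1 0.1
```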
Metric | With New Features | Without |
---|---|---|
Accuracy | 0.85287 | 0.85260 |
Precision | 0.42837 | 0.42699 |
Recall | 0.30950 | 0.30850 |
F1 | 0.35936 | 0.35820 |
I can see a very slight improvement, so I tried dividing by the total number of messages sent and received to see if I would get a different result; I got higher feature scores but worse final results.
Scaling
- As the main algorithm (AdaBoost) is not affected by scaling, no scaling is used in the final algorithm.
- However, when testing other algorithms, a StandardScaler() was inserted into the pipeline.
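A minimal sketch of such a pipeline, with stand-in data and SVC as an example of a scale-sensitive classifier (the project would use the Enron features instead):

```python
# Pipeline with StandardScaler in front of a scale-sensitive classifier.
# Tree-based ensembles such as AdaBoost can skip the scaling step.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=145, n_features=15, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),  # zero mean, unit variance per feature
    ('clf', SVC()),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```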
Choosing The Algorithm
Selection
I tried different values of k for SelectKBest; the results are below.
Decision Tree
Using a Decision Tree classifier with default parameters to see what the best number of features is:
Score | K=3 | K=6 | K=9 | K=12 |
---|---|---|---|---|
Accuracy | 0.80869 | 0.84300 | 0.84127 | 0.83867 |
Precision | 0.10789 | 0.24679 | 0.23357 | 0.22222 |
Recall | 0.03350 | 0.08650 | 0.08350 | 0.08400 |
F1 | 0.05113 | 0.12810 | 0.12302 | 0.12192 |
Adaboost
Learning rate = 1, n_estimators = 50

Score | K=5 | K=9 | K=12 | K=15 |
---|---|---|---|---|
Accuracy | 0.81393 | 0.83027 | 0.84940 | 0.84847 |
Precision | 0.30597 | 0.31678 | 0.41172 | 0.41107 |
Recall | 0.23850 | 0.23600 | 0.30200 | 0.31550 |
F1 | 0.26805 | 0.27049 | 0.34843 | 0.35700 |

- It seems here that more features give better precision and recall scores.
- At K=15 I get the best performance so far, which is what I will use for the final classifier.
Trying Random Forest
Score | n_estimators=20 | n_estimators=50 |
---|---|---|
Accuracy | 0.85573 | 0.83387 |
Precision | 0.37538 | 0.30288 |
Recall | 0.12350 | 0.18900 |
F1 | 0.18585 | 0.23276 |

Here Random Forest is definitely a big improvement over the Decision Tree, but still not better than AdaBoost.
Conclusion
I decided to choose AdaBoost as the classifier because it gave the best results for recall and precision.
Tuning The Algorithm
The AdaBoost algorithm did very well out of the box, but to get the best possible results from it, we must tune it to our data. The objective of algorithm tuning is to find the parameter values and configuration that best fit the problem. I tried two approaches to get the best out of AdaBoost:
- 1) Trying GridSearchCV (automatic search)
- 2) Trying different values manually
AdaBoost Classifier (learning_rate = 3, n_estimators = 20)
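A minimal sketch of approach (1), searching AdaBoost parameters with GridSearchCV; the grid values and stand-in data here are illustrative, not the exact ones used in the project:

```python
# Search over AdaBoost hyperparameters with 3-fold cross-validation,
# scoring on F1 since the classes are imbalanced.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=145, n_features=15, random_state=42)

param_grid = {
    'n_estimators': [20, 50, 100],
    'learning_rate': [0.5, 1.0, 3.0],
}
search = GridSearchCV(AdaBoostClassifier(random_state=42),
                      param_grid, scoring='f1', cv=3)
search.fit(X, y)
print(search.best_params_)  # the combination with the best mean F1
```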
Validation
Learning the parameters of a prediction function and testing it on the same data is a mistake: the model will fail when the data changes. This situation is called overfitting. To avoid it, it is common practice to hold out part of the available data as a test set (in my case 30%) and use the rest to train the classifier.
For validation I split the data into 70% training and 30% test sets.
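A minimal sketch of that 70/30 hold-out split, using stand-in data of the same size as the dataset:

```python
# Hold out 30% of the data for testing; the rest trains the classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=145, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(len(X_train), len(X_test))  # 101 train, 44 test
```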
This way I am sure that my model will not overfit the training data.
Evaluation
Notes about evaluation metrics
Accuracy:
- Accuracy is the ratio of correct predictions to all predictions.
- Accuracy is not a good score in this case since the data is highly skewed (18 POIs vs. 127 non-POIs).
- e.g., if the classifier predicted everything as false, it would still get about 87.6% accuracy.
Precision :
- Precision is the ability of the classifier not to label as positive a sample that is negative.
- High precision means that when the classifier identifies a positive, it is very confident that it really is a positive.
- Precision = tp / (tp + fp)
Recall:
- Recall is the ability of the classifier to find all the positive samples.
- Very high recall means the classifier will miss few true positives (it will identify nearly all of them, possibly at the cost of some false positives).
- Recall = tp / (tp + fn)
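The three metrics above can be computed directly with scikit-learn; the snippet uses a small hand-made example rather than the project's predictions:

```python
# Compute accuracy, precision, and recall on a toy set of labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# Here tp=2, fn=1, fp=1, tn=4.
print(accuracy_score(y_true, y_pred))   # (tp+tn)/all = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # tp/(tp+fp) = 2/3
print(recall_score(y_true, y_pred))     # tp/(tp+fn) = 2/3
```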
The best performance was using AdaBoost with these parameters:

```python
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
                   learning_rate=1, n_estimators=50, random_state=None)
```
Test time: 100.252 s
- Accuracy: 0.84847
- Precision: 0.41107
- Recall: 0.31550
- F1: 0.35700
- F2: 0.33089
- Total predictions: 15000
- True positives: 631
- False positives: 904
- False negatives: 1369
- True negatives: 12096
This is just a summary of the full project.