Thursday, October 27, 2016

Identify Fraud from Enron Email - Machine Learning

About Enron

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I will play detective and put my skills to use by building a person-of-interest identifier based on financial and email data made public as a result of the Enron scandal.




About the Data

Udacity has combined this email data with a hand-generated list of persons of interest (POIs) in the fraud case, meaning individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.



Data Characteristics

Total number of data points: 145
POI vs. non-POI: 18 vs. 127
Number of features: 22 (not including the POI label)
Feature with many NaN values: 'loan_advances' (removed from training)

Outlier Investigation

By plotting the data I can see one very obvious outlier, which is the 'TOTAL' entry.

After removing it with

data_dict.pop('TOTAL', 0)

the remaining points fall into a much more sensible range.




• I can still see some points that could be treated as outliers.
• I decided to leave them in because they are valid data points, and they belong to two of the biggest Enron bosses, who are definitely persons of interest (a sketch of the plotting and removal step follows).
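As an illustration, here is a minimal sketch of how the outlier could be spotted and removed; it assumes the data_dict structure from the Udacity starter code, where missing values appear as the string 'NaN':

import matplotlib.pyplot as plt

# Assumes data_dict maps person names to feature dicts (Udacity starter code).
def to_num(value):
    # Treat the starter code's 'NaN' strings as zero for plotting.
    return 0.0 if value == 'NaN' else float(value)

# Scatter salary vs. bonus; the 'TOTAL' row dwarfs every real person.
salaries = [to_num(person['salary']) for person in data_dict.values()]
bonuses = [to_num(person['bonus']) for person in data_dict.values()]
plt.scatter(salaries, bonuses)
plt.xlabel('salary')
plt.ylabel('bonus')
plt.show()

# Drop the aggregate row, then re-plot to confirm the rest looks sane.
data_dict.pop('TOTAL', 0)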


Selecting Features

For feature selection I used SelectKBest, trying different values of K and testing the final results with different algorithms (below).
First, the feature scores were as follows (a sketch of how they can be computed appears after the table):
Feature                     Score             Selected
salary                      15.8060900874     Yes
bonus                       30.6522823057     Yes
deferral_payments           0.00981944641905  No
deferred_income             8.49349703055     Yes
director_fees               1.64109792617     Yes
exercised_stock_options     9.95616758208     Yes
expenses                    4.31439557308     Yes
from_messages               0.434625706635    No
from_poi_to_this_person     4.93930363951     Yes
from_this_person_to_poi     0.105897968337    No
loan_advances               7.03793279819     No (removed)
long_term_incentive         7.53452224003     Yes
other                       3.19668450433     Yes
restricted_stock            8.051101897       Yes
restricted_stock_deferred   0.679280338952    Yes
shared_receipt_with_poi     10.6697373596     Yes
to_messages                 2.60677186644     Yes
total_payments              8.962715501       Yes
total_stock_value           10.814634863      Yes
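A minimal sketch of how such a score table could be produced, assuming `features`, `labels`, and a matching `feature_names` list were already extracted from data_dict with the starter code's helpers:

from sklearn.feature_selection import SelectKBest, f_classif

# f_classif (ANOVA F-value) is SelectKBest's default scoring function;
# I am assuming that is what produced the scores above.
selector = SelectKBest(score_func=f_classif, k=15)
selector.fit(features, labels)

# Pair each feature name with its score, highest first.
for name, score in sorted(zip(feature_names, selector.scores_),
                          key=lambda pair: pair[1], reverse=True):
    print('%-28s %.4f' % (name, score))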

 


New Features

I decided to make two new features based on the number of messages to and from POIs:
they are the ratio of all messages a person sent or received to the messages they sent to or received from a POI (see the sketch below).
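A minimal sketch of how these two features could be added to data_dict; the feature names and the zero guard are my own illustrative choices:

# Assumes data_dict from the starter code, with 'NaN' marking missing values.
def safe_ratio(numerator, denominator):
    # Return 0 when either count is missing or zero to avoid dividing by zero.
    if numerator in (0, 'NaN') or denominator in (0, 'NaN'):
        return 0.0
    return float(numerator) / float(denominator)

for person in data_dict.values():
    # All received messages relative to those received from a POI.
    person['from_poi_ratio'] = safe_ratio(person['to_messages'],
                                          person['from_poi_to_this_person'])
    # All sent messages relative to those sent to a POI.
    person['to_poi_ratio'] = safe_ratio(person['from_messages'],
                                        person['from_this_person_to_poi'])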

Score       With new features   Without
Accuracy    0.85287             0.85260
Precision   0.42837             0.42699
Recall      0.30950             0.30850
F1          0.35936             0.35820

I can see a very slight improvement, so I also tried dividing by the total number of messages sent and received to see if I would get a different result; that version got higher feature scores but worse final results.

Scaling

• As the main algorithm (AdaBoost) is not affected by feature scaling, no scaling is used in the final pipeline.
• However, when testing other algorithms, a StandardScaler() was inserted into the pipeline (see the sketch below).
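A minimal sketch of such a pipeline; the k value and classifier here are placeholders for whichever algorithm was being tested:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier

# Scaling is harmless for tree-based models, so the scaler can stay in
# the pipeline while different classifiers are swapped in and out.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(k=9)),       # placeholder k
    ('clf', DecisionTreeClassifier()),  # placeholder classifier
])
pipeline.fit(features_train, labels_train)  # assumes a train/test split exists
print(pipeline.score(features_test, labels_test))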



Choosing the Algorithm

Selection

I tried different values of K for SelectKBest; the results for each algorithm are below.

Decision Tree

Using a Decision Tree classifier with default parameters to find the best number of features (a sketch of the sweep follows the table):

Score       K=3       K=6       K=9       K=12
Accuracy    0.80869   0.84300   0.84127   0.83867
Precision   0.10789   0.24679   0.23357   0.22222
Recall      0.03350   0.08650   0.08350   0.08400
F1          0.05113   0.12810   0.12302   0.12192
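A minimal sketch of the kind of sweep that could produce this table, assuming the Udacity-provided tester.py (whose test_classifier helper prints accuracy, precision, recall, and F1) and the my_dataset/features_list variables from the starter script:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier
from tester import test_classifier  # Udacity-provided evaluation script

# Try several feature-set sizes and let tester.py report the metrics.
for k in (3, 6, 9, 12):
    clf = Pipeline([
        ('select', SelectKBest(k=k)),
        ('tree', DecisionTreeClassifier()),
    ])
    print('K = %d' % k)
    test_classifier(clf, my_dataset, features_list)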




AdaBoost

With learning_rate=1 and n_estimators=50:

Score       K=5       K=9       K=12      K=15
Accuracy    0.81393   0.83027   0.84940   0.84847
Precision   0.30597   0.31678   0.41172   0.41107
Recall      0.23850   0.23600   0.30200   0.31550
F1          0.26805   0.27049   0.34843   0.35700

It seems that the more features I keep, the better the precision and recall scores. At K=15 I get the best performance so far, which is what I will use for the final classifier.

Trying Random Forest

Score       n_estimators=20   n_estimators=50
Accuracy    0.85573           0.83387
Precision   0.37538           0.30288
Recall      0.12350           0.18900
F1          0.18585           0.23276

Here Random Forest is definitely an improvement over the single Decision Tree, but still not better than AdaBoost.

Conclusion

I decided to choose AdaBoost as the classifier because it gave the best precision and recall results.


Tuning the Algorithm

The AdaBoost algorithm did very well out of the box, but to get the best possible results from it, we must tune it to our data.
The objective of algorithm tuning is to find the parameter values and configuration that best fit the problem.
I tried two approaches to get the best out of AdaBoost (see the sketch below):
• 1) automatic search with GridSearchCV
• 2) trying different values manually

The best configuration found by the grid search was SelectKBest (K=4) with an AdaBoost classifier (learning_rate=3, n_estimators=20); the manually tuned setup (K=15 with the default learning_rate=1 and n_estimators=50) scored better under tester.py, so that is what I kept.
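For reference, a minimal sketch of what the GridSearchCV approach could look like; the exact grid and scoring below are my own assumptions, with F1 over a stratified shuffle split standing in for tester.py's evaluation:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Assumes `features` and `labels` arrays built from data_dict earlier.
pipeline = Pipeline([
    ('select', SelectKBest()),
    ('boost', AdaBoostClassifier()),
])

# Hypothetical grid covering the values discussed above.
param_grid = {
    'select__k': [4, 9, 12, 15],
    'boost__learning_rate': [0.5, 1, 3],
    'boost__n_estimators': [20, 50, 100],
}

# Many small stratified splits, similar in spirit to tester.py.
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=cv)
search.fit(features, labels)
print(search.best_params_)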




Validation

Learning the parameters of a prediction function and testing it on the same data is a mistake: the model will simply repeat the labels it has already seen and fail when the data changes. This situation is called overfitting. To avoid it, it is common practice to hold out part of the available data as a test set (in my case 30%) and use the rest to train the classifier.
For validation I split the data into a 70% training set and a 30% test set (a sketch follows).
This way I can be more confident that my model will not overfit the training data.
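A minimal sketch of that split; stratifying on the labels is my own addition, which keeps the rare POI class represented in both sets:

from sklearn.model_selection import train_test_split

# Assumes `features` and `labels` arrays built earlier.
# stratify=labels keeps the 18 POIs proportionally represented in both
# splits (an assumption; the write-up does not specify this).
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42)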


Evaluation

Notes about the evaluation metrics

Accuracy:

• Accuracy is the ratio of correct predictions to all predictions.
• Accuracy is not a good metric in this case since the data is highly skewed (18 POIs out of 145 points).
• e.g. a classifier that always predicted non-POI would still get about 87.6% accuracy (127/145).

Precision:

• Precision is the ability of the classifier not to label as positive a sample that is negative.
• High precision means that when the classifier identifies someone as a positive, it is very likely to really be a positive.
• Precision = tp / (tp + fp)

Recall:

• Recall is the ability of the classifier to find all the positive samples.
• Very high recall means the classifier will miss almost no true positives, even if that comes at the cost of some extra false positives.
• Recall = tp / (tp + fn) (a worked example of both formulas follows)
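As a quick illustration of the two formulas, a small sketch with made-up predictions:

from sklearn.metrics import precision_score, recall_score

# Made-up labels for six samples (1 = POI, 0 = non-POI).
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# tp = 2, fp = 1, fn = 1
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.667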
I used the provided tester.py to evaluate my classifier, along with the classifier's .score() method.
The best performance was with AdaBoost using these parameters:

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1, n_estimators=50, random_state=None)

Accuracy: 0.84847    Precision: 0.41107    Recall: 0.31550    F1: 0.35700    F2: 0.33089
Total predictions: 15000    True positives: 631    False positives: 904    False negatives: 1369    True negatives: 12096

Test time: 100.252 s


This is just a summary of the full project.
