Find that Fraudster - Part 2

We now try up-sampling the fraud transactions by synthetically generating new records with SMOTE (Synthetic Minority Over-sampling Technique). The training set becomes much larger in this case: we keep all 213233 normal transactions and combine them with 213233 fraud transactions.

In [9]:
from imblearn.over_sampling import SMOTE

# named 'oversampler' rather than 'os' to avoid shadowing the os module
oversampler = SMOTE(random_state=999)
In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

print("Number transactions train dataset:", len(X_train))
print("Number transactions test dataset:", len(X_test))
print("Total number of transactions:", len(X_train) + len(X_test))

columns = X_train.columns
# fit_resample was called fit_sample in imblearn < 0.4
os_data_X, os_data_y = oversampler.fit_resample(X_train, y_train.values.ravel())
os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
os_data_y = pd.DataFrame(data=os_data_y, columns=["Class"])

print("Total number of records in oversampled data:", len(os_data_X))
print("Number of normal transactions in oversampled data:", len(os_data_y[os_data_y["Class"] == 0]))
print("Number of fraud transactions in oversampled data:", len(os_data_y[os_data_y["Class"] == 1]))
print("Proportion of normal data in oversampled data:", len(os_data_y[os_data_y["Class"] == 0]) * 1.0 / len(os_data_X))
print("Proportion of fraud data in oversampled data:", len(os_data_y[os_data_y["Class"] == 1]) * 1.0 / len(os_data_X))
Number transactions train dataset: 213605
Number transactions test dataset: 71202
Total number of transactions: 284807
Total number of records in oversampled data: 426466
Number of normal transactions in oversampled data: 213233
Number of fraud transactions in oversampled data: 213233
Proportion of normal data in oversampled data: 0.5
Proportion of fraud data in oversampled data: 0.5
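
As an aside, SMOTE does not simply duplicate fraud rows: each synthetic record is an interpolation between a real fraud record and one of its nearest minority-class neighbours. Below is a minimal illustrative sketch of that idea (the function smote_one_sample is hypothetical, not part of imblearn, and assumes X_minority is a NumPy array of the fraud rows):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one_sample(X_minority, k=5, seed=999):
    # Interpolate between a minority point and one of its k nearest
    # minority-class neighbours -- the geometric idea behind SMOTE.
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.randint(len(X_minority))                # pick a real fraud record
    _, neigh = nn.kneighbors(X_minority[i:i + 1])   # neighbours (first is itself)
    j = neigh[0][rng.randint(1, k + 1)]             # choose a random neighbour
    gap = rng.uniform(0, 1)                         # random point along the segment
    return X_minority[i] + gap * (X_minority[j] - X_minority[i])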

We build the first Random Forest model with 201 estimators, max_features=None, min_samples_split=8 and min_samples_leaf=1.

In [20]:
from sklearn.ensemble import RandomForestClassifier
from time import time

model_final_rf = RandomForestClassifier(n_estimators=201, oob_score=True, n_jobs=8,
                                        random_state=1, min_samples_split=8,
                                        min_samples_leaf=1, max_features=None)
start = time()
model_final_rf.fit(os_data_X, os_data_y.values.ravel())
print("Model train took %.2f seconds " % (time() - start))
Model train took 1665.01 seconds 
In [21]:
importances = model_final_rf.feature_importances_
# standard deviation of each feature's importance across the trees
std = np.std([tree.feature_importances_ for tree in model_final_rf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
Feature ranking:
1. feature 11 (0.737632)
2. feature 3 (0.075527)
3. feature 10 (0.028557)
4. feature 2 (0.020987)
5. feature 0 (0.017561)
6. feature 17 (0.015284)
7. feature 16 (0.015204)
8. feature 9 (0.011258)
9. feature 18 (0.010306)
10. feature 14 (0.009791)
11. feature 8 (0.009426)
12. feature 1 (0.008546)
13. feature 12 (0.007113)
14. feature 7 (0.006741)
15. feature 4 (0.006380)
16. feature 15 (0.005712)
17. feature 13 (0.004881)
18. feature 5 (0.004726)
19. feature 6 (0.004366)
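
The ranking above is by positional index. Since columns = X_train.columns was saved earlier, the indices can be mapped back to readable column names, for example for the top five features:

for rank, idx in enumerate(indices[:5], start=1):
    print("%d. %s (%f)" % (rank, columns[idx], importances[idx]))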
In [24]:
# Compute confusion matrix on the untouched, imbalanced test set
y_pred_score_rf = model_final_rf.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred_score_rf)
np.set_printoptions(precision=2)

recall = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
precision = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[0, 1] + cnf_matrix[1, 1])
f1 = 2 * recall * precision / (precision + recall)
print("Recall metric on the test dataset:", recall)
print("Precision metric on the test dataset:", precision)
print("F1 metric on the test dataset:", f1)

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Recall metric on the test dataset: 0.79166666666666663
Precision metric on the test dataset: 0.63758389261744963
F1 metric on the test dataset: 0.70631970260223043
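
As a cross-check, the same metrics can be computed directly with scikit-learn instead of deriving them from the confusion matrix by hand:

from sklearn.metrics import precision_score, recall_score, f1_score

print("Recall:    %.4f" % recall_score(y_test, y_pred_score_rf))
print("Precision: %.4f" % precision_score(y_test, y_pred_score_rf))
print("F1:        %.4f" % f1_score(y_test, y_pred_score_rf))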

Compared with the under-sampling approach from Part 1, recall has gone down but precision has improved. Let us now try another Random Forest, changing min_samples_leaf to 2.

In [25]:
model_final_rf = RandomForestClassifier(n_estimators=201, oob_score=True, n_jobs=8,
                                        random_state=1, min_samples_split=8,
                                        min_samples_leaf=2, max_features=None)
start = time()
model_final_rf.fit(os_data_X, os_data_y.values.ravel())
print("Model train took %.2f seconds " % (time() - start))
Model train took 1169.98 seconds 
In [26]:
# Compute confusion matrix
y_pred_score_rf = model_final_rf.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred_score_rf)
np.set_printoptions(precision=2)

recall = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
precision = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[0, 1] + cnf_matrix[1, 1])
f1 = 2 * recall * precision / (precision + recall)
print("Recall metric on the test dataset:", recall)
print("Precision metric on the test dataset:", precision)
print("F1 metric on the test dataset:", f1)

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Recall metric on the test dataset: 0.80000000000000004
Precision metric on the test dataset: 0.63576158940397354
F1 metric on the test dataset: 0.70848708487084877

Not much change from the previous Random Forest.

Let's change max_features to 'sqrt' and build another Random Forest.

In [27]:
model_final_rf = RandomForestClassifier(n_estimators=201, oob_score=True, n_jobs=8,
                                        random_state=1, min_samples_split=8,
                                        min_samples_leaf=1, max_features='sqrt')
start = time()
model_final_rf.fit(os_data_X, os_data_y.values.ravel())
print("Model train took %.2f seconds " % (time() - start))

# Compute confusion matrix
y_pred_score_rf = model_final_rf.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred_score_rf)
np.set_printoptions(precision=2)

recall = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
precision = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[0, 1] + cnf_matrix[1, 1])
f1 = 2 * recall * precision / (precision + recall)
print("Recall metric on the test dataset:", recall)
print("Precision metric on the test dataset:", precision)
print("F1 metric on the test dataset:", f1)

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Model train took 229.45 seconds 
Recall metric on the test dataset: 0.83333333333333337
Precision metric on the test dataset: 0.88495575221238942
F1 metric on the test dataset: 0.85836909871244638

The version using max_features='sqrt' is a big improvement over the Random Forest that used all features for tree building. Sampling a random subset of features at each split de-correlates the individual trees, which is what gives a Random Forest its edge over a bag of identical trees.

Let's remove the hyper-parameters min_samples_split and min_samples_leaf and let the model use the default values.

In [28]:
model_final_rf = RandomForestClassifier(n_estimators=201, oob_score=True, n_jobs=8,
                                        random_state=1, max_features='sqrt')
start = time()
model_final_rf.fit(os_data_X, os_data_y.values.ravel())
print("Model train took %.2f seconds " % (time() - start))

# Compute confusion matrix
y_pred_score_rf = model_final_rf.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred_score_rf)
np.set_printoptions(precision=2)

recall = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
precision = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[0, 1] + cnf_matrix[1, 1])
f1 = 2 * recall * precision / (precision + recall)
print("Recall metric on the test dataset:", recall)
print("Precision metric on the test dataset:", precision)
print("F1 metric on the test dataset:", f1)

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Model train took 265.40 seconds 
Recall metric on the test dataset: 0.84166666666666667
Precision metric on the test dataset: 0.88596491228070173
F1 metric on the test dataset: 0.86324786324786329
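
Since the forests are fit with oob_score=True, an out-of-bag estimate is also available as a quick sanity check. Keep in mind it is computed on the SMOTE-balanced training data, so it will look optimistic relative to the imbalanced test set:

# Out-of-bag accuracy, measured on the balanced training data
print("Out-of-bag score: %.4f" % model_final_rf.oob_score_)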

This is so far the best model in terms of overall F1 score!

Finally, let's increase the number of estimators from 201 to 501 and see whether that changes anything.

In [31]:
model_final_rf = RandomForestClassifier(n_estimators=501, oob_score=True, n_jobs=8,
                                        random_state=1, max_features='sqrt')
start = time()
model_final_rf.fit(os_data_X, os_data_y.values.ravel())
print("Model train took %.2f seconds " % (time() - start))

# Compute confusion matrix
y_pred_score_rf = model_final_rf.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred_score_rf)
np.set_printoptions(precision=2)

recall = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
precision = cnf_matrix[1, 1] * 1.0 / (cnf_matrix[0, 1] + cnf_matrix[1, 1])
f1 = 2 * recall * precision / (precision + recall)
print("Recall metric on the test dataset:", recall)
print("Precision metric on the test dataset:", precision)
print("F1 metric on the test dataset:", f1)

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Model train took 638.14 seconds 
Recall metric on the test dataset: 0.83333333333333337
Precision metric on the test dataset: 0.88495575221238942
F1 metric on the test dataset: 0.85836909871244638

We see that both precision and recall have gone down slightly in this case, so adding more estimators does not help model performance.

We could also try other techniques, such as varying the decision threshold to find the sweet spot on the precision-recall curve, or changing the cost function to penalize errors differently depending on their type (see the sketch below).
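
As a minimal sketch of the threshold idea (reusing model_final_rf and the test split from above), we can score the test set with fraud probabilities and sweep the decision threshold along the precision-recall curve:

import numpy as np
from sklearn.metrics import precision_recall_curve

probs = model_final_rf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

# F1 at each candidate threshold; the last precision/recall pair has no
# matching threshold, so it is dropped.
f1_scores = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1_scores[:-1])
print("Best threshold: %.3f (precision=%.3f, recall=%.3f, F1=%.3f)"
      % (thresholds[best], precision[best], recall[best], f1_scores[best]))

For the cost-sensitive angle, RandomForestClassifier also accepts a class_weight parameter (e.g. class_weight='balanced' or a custom dict) that penalizes mistakes on the fraud class more heavily during tree building.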

Please leave suggestions or comments if you liked the analysis.
