
Find that Fraudster

Credit Card Fraud Detection

Author: Rahul Choudhry

Description:

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Please cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015
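
Given the AUPRC recommendation above, here is a minimal sketch of how it can be computed with scikit-learn from true labels and predicted scores. This cell is illustrative only and is not part of the original notebook; the analysis below uses the prg package instead, and y_true / y_scores here are toy placeholder arrays.

import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Toy placeholder labels and scores, purely for illustration
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_scores = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.05, 0.4, 0.9])

precision, recall, _ = precision_recall_curve(y_true, y_scores)
print("AUPRC (trapezoidal rule): %.3f" % auc(recall, precision))
print("Average precision: %.3f" % average_precision_score(y_true, y_scores))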

In [1]:
import pandas as pd
import numpy as np 
#import tensorflow as tf
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.gridspec as gridspec
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from prg import prg
%matplotlib inline

Read the file and verify the shape, missingness, and data types of the columns.

In [82]:
df = pd.read_csv("creditcard.csv")
In [4]:
print df.shape
print df.describe()
print df.isnull().sum()
print df.info()
(284807, 31)
                Time            V1            V2            V3            V4  \
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean    94813.859575  3.919560e-15  5.688174e-16 -8.769071e-15  2.782312e-15   
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00   
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00   
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01   
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02   
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01   
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01   

                 V5            V6            V7            V8            V9  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean  -1.552563e-15  2.010663e-15 -1.694249e-15 -1.927028e-16 -3.137024e-15   
std    1.380247e+00  1.332271e+00  1.237094e+00  1.194353e+00  1.098632e+00   
min   -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01   
25%   -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01   
50%   -5.433583e-02 -2.741871e-01  4.010308e-02  2.235804e-02 -5.142873e-02   
75%    6.119264e-01  3.985649e-01  5.704361e-01  3.273459e-01  5.971390e-01   
max    3.480167e+01  7.330163e+01  1.205895e+02  2.000721e+01  1.559499e+01   

           ...                 V21           V22           V23           V24  \
count      ...        2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean       ...        1.537294e-16  7.959909e-16  5.367590e-16  4.458112e-15   
std        ...        7.345240e-01  7.257016e-01  6.244603e-01  6.056471e-01   
min        ...       -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00   
25%        ...       -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01   
50%        ...       -2.945017e-02  6.781943e-03 -1.119293e-02  4.097606e-02   
75%        ...        1.863772e-01  5.285536e-01  1.476421e-01  4.395266e-01   
max        ...        2.720284e+01  1.050309e+01  2.252841e+01  4.584549e+00   

                V25           V26           V27           V28         Amount  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  284807.000000   
mean   1.453003e-15  1.699104e-15 -3.660161e-16 -1.206049e-16      88.349619   
std    5.212781e-01  4.822270e-01  4.036325e-01  3.300833e-01     250.120109   
min   -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01       0.000000   
25%   -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02       5.600000   
50%    1.659350e-02 -5.213911e-02  1.342146e-03  1.124383e-02      22.000000   
75%    3.507156e-01  2.409522e-01  9.104512e-02  7.827995e-02      77.165000   
max    7.519589e+00  3.517346e+00  3.161220e+01  3.384781e+01   25691.160000   

               Class  
count  284807.000000  
mean        0.001727  
std         0.041527  
min         0.000000  
25%         0.000000  
50%         0.000000  
75%         0.000000  
max         1.000000  

[8 rows x 31 columns]
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
None

Looking at the Class field, we see that the dataset is extremely out of balance: there are roughly 578 normal transactions for every fraudulent one. This is important to note, as it will make the classification of positive events (frauds) difficult. We will deal with the imbalance later on.
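
As a quick check, and as a complement to the value_counts() output below, the imbalance ratio can be computed directly (a small sketch using the df loaded above):

counts = df['Class'].value_counts()
print("Normal-to-fraud ratio: %.0f : 1" % (counts[0] / float(counts[1])))  # roughly 578 : 1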

In [5]:
df['Class'].value_counts()
Out[5]:
0    284315
1       492
Name: Class, dtype: int64

Time:

The Time variable is the number of seconds elapsed between each transaction and the first transaction in the dataset. For now, we keep the field until we know whether it carries any value. Plotting histograms of Time for fraudulent and normal transactions, we see a couple of peaks in the fraudulent transactions. At the first peak (elapsed time ≈ 40K seconds) there is also a large number of normal transactions. The second peak occurs at about 90K seconds after the first transaction; during this window, normal transaction volume is very low.

We can also see that the normal transactions show a clear trend. The first uptrend begins at about 25K seconds and starts to decline at about 75K seconds. The delta between the two, roughly 50K seconds, is approximately 14 hours, which is intuitively consistent with transactions happening during daytime hours. The difference between the two troughs in the normal transactions is roughly 82K seconds, close to one day.
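
To make the unit conversion behind those estimates explicit (the deltas are read off the histograms below, not computed from the data):

# Approximate deltas eyeballed from the Time histograms
uptrend_to_decline = 50000 / 3600.0   # ~13.9 hours, consistent with daytime activity
trough_to_trough = 82000 / 3600.0     # ~22.8 hours, i.e. close to one day
print("Uptrend to decline: %.1f hours, trough to trough: %.1f hours" % (uptrend_to_decline, trough_to_trough))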

In [6]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))

bins = 50

ax1.hist(df.Time[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(df.Time[df.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Number of Transactions')
Out[6]:
[Figure: histograms of transaction Time for fraud (top) and normal (bottom) transactions]

Amount:

Next we look at the summary statistics of the transaction amount for fraudulent and normal transactions. The IQR of fraudulent amounts runs from about $1 to $106 and the median is about $9. The mean is $122; its large gap from the median is due to outliers on the right side of the distribution. The fraudulent transactions also have a large standard deviation of about $257.

For normal transactions, the IQR runs from about $6 to $77. The difference between the mean ($88) and the median ($22) is $66, tighter than the roughly $113 gap for fraudulent transactions.

The histograms below show the distributions of both the transaction types.
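
As an aside, the same per-class spread can be pulled in one shot with a groupby (a sketch equivalent to the two describe() calls below):

class_amount_stats = df.groupby('Class')['Amount'].quantile([0.25, 0.5, 0.75]).unstack()
class_amount_stats['mean'] = df.groupby('Class')['Amount'].mean()
print(class_amount_stats)  # row 0 = normal, row 1 = fraud; columns are the quartiles plus the mean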

In [5]:
print ("Fraud")
print (df.Amount[df.Class == 1].describe())
print ()
print ("Normal")
print (df.Amount[df.Class == 0].describe())
Fraud
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
()
Normal
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64
In [8]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))

bins = 30

ax1.hist(df.Amount[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(df.Amount[df.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')

The scatterplots of transaction amount against elapsed time are grouped by transaction type. We do see some extreme outliers among the fraud transactions occurring during periods of low volume for normal transactions.

In [9]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,6))

ax1.scatter(df.Time[df.Class == 1], df.Amount[df.Class == 1])
ax1.set_title('Fraud')

ax2.scatter(df.Time[df.Class == 0], df.Amount[df.Class == 0])
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Amount')
Out[9]:
[Figure: Amount vs. Time scatterplots for fraud (top) and normal (bottom) transactions]

PCA Transformed features:

As mentioned in the description above, this dataset has 28 numerical features (V1 through V28) obtained as a result of PCA. We do not have any business context for what these fields mean. In the series of plots below, we plot histograms overlaid with densities for each of these variables, color coded by transaction type (blue ~ fraud, green ~ normal). We will visually inspect these distributions and keep only the variables where we see a clear distinction between the two classes.

In [77]:
#Select only the anonymized features.
v_features = df.ix[:,1:29].columns
plt.figure(figsize=(12,28*4))
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(df[v_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(df[cn][df.Class == 1], bins=50)
    sns.distplot(df[cn][df.Class == 0], bins=100)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))

Dropping 'V28', 'V27', 'V26', 'V25', 'V24', 'V23', 'V22', 'V20', 'V15', 'V13', and 'V8', as they have very similar distributions for both types of transactions.

In [83]:
#Drop all of the features that have very similar distributions between the two types of transactions.
df = df.drop(['V28','V27','V26','V25','V24','V23','V22','V20','V15','V13','V8'], axis =1)

Scaling the Amount and Time fields as a necessary data transformation step before modeling.

In [84]:
from sklearn.preprocessing import StandardScaler

df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
df = df.drop(['Amount'],axis=1)
df['normTime'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1))
df = df.drop(['Time'],axis=1)
df.head()
Out[84]:
V1 V2 V3 V4 V5 V6 V7 V9 V10 V11 V12 V14 V16 V17 V18 V19 V21 Class normAmount normTime
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.363787 0.090794 -0.551600 -0.617801 -0.311169 -0.470401 0.207971 0.025791 0.403993 -0.018307 0 0.244964 -1.996583
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 -0.255425 -0.166974 1.612727 1.065235 -0.143772 0.463917 -0.114805 -0.183361 -0.145783 -0.225775 0 -0.342475 -1.996583
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 -1.514654 0.207643 0.624501 0.066084 -0.165946 -2.890083 1.109969 -0.121359 -2.261857 0.247998 0 1.160686 -1.996562
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 -1.387024 -0.054952 -0.226487 0.178228 -0.287924 -1.059647 -0.684093 1.965775 -1.232622 -0.108300 0 0.140534 -1.996562
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 0.817739 0.753074 -0.822843 0.538196 -1.119670 -0.451449 -0.237033 -0.038195 0.803487 -0.009431 0 -0.073403 -1.996541
In [85]:
X = df.ix[:, df.columns != 'Class']
y = df.ix[:, df.columns == 'Class']

Undersampling the normal transactions so that there are three normal transactions for every fraudulent one. This helps overcome the extreme imbalance between the two classes described above.

In [86]:
# Number of data points in the minority class
number_records_fraud = len(df[df.Class == 1])
fraud_indices = np.array(df[df.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = df[df.Class == 0].index

# Out of those indices, randomly select three times the number of fraud records
random_normal_indices = np.random.choice(normal_indices, number_records_fraud*3, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Concatenate the two index arrays
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Under sample dataset
under_sample_data = df.iloc[under_sample_indices,:]

X_undersample = under_sample_data.ix[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.ix[:, under_sample_data.columns == 'Class']

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])*1.0/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])*1.0/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))
('Percentage of normal transactions: ', 0.75)
('Percentage of fraud transactions: ', 0.25)
('Total number of transactions in resampled data: ', 1968)

Modeling

Split the data into train and test sets in a 75:25 ratio. The 25% test set is our holdout sample, which we do not use for training or cross-validation.

In [87]:
from sklearn.cross_validation import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample
                                                                                                   ,y_undersample
                                                                                                   ,test_size = 0.25
                                                                                                   ,random_state = 0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))
('Number transactions train dataset: ', 213605)
('Number transactions test dataset: ', 71202)
('Total number of transactions: ', 284807)

('Number transactions train dataset: ', 1476)
('Number transactions test dataset: ', 492)
('Total number of transactions: ', 1968)

Defining a helper function for calculating different accuracy metrics that we will use when evaluating model performance.

In [15]:
def ROC_curve_data(y_true, y_score):
    y_true  = np.asarray(y_true,  dtype=np.bool_)
    y_score = np.asarray(y_score, dtype=np.float_)
    assert(y_score.size == y_true.size)

    order = np.argsort(y_score) # Sort observations by ascending score
    y_true  = y_true[order]
    # The thresholds to consider are just the values of the score, plus 0 (accept everything)
    thresholds = np.insert(y_score[order],0,0)
    TP = [sum(y_true)]  # True Positives: at threshold 0 we accept everything, so TP[0] = # of positives in y_true
    FP = [sum(~y_true)] # False Positives: at threshold 0 we accept everything, so FP[0] = # of negatives in y_true
    TN = [0]            # True Negatives: at threshold 0 nothing is predicted negative
    FN = [0]            # False Negatives: at threshold 0 nothing is predicted negative

    for i in range(1, thresholds.size):
        # At this step the (i-1)-th observation flips from being predicted positive to negative.
        # If y_true[i-1] is True, we lose a true positive and gain a false negative.
        TP.append(TP[-1] - int(y_true[i-1]))
        FN.append(FN[-1] + int(y_true[i-1]))
        # If y_true[i-1] is False, we lose a false positive and gain a true negative.
        FP.append(FP[-1] - int(~y_true[i-1]))
        TN.append(TN[-1] + int(~y_true[i-1]))

    TP = np.asarray(TP, dtype=np.int_)
    FP = np.asarray(FP, dtype=np.int_)
    TN = np.asarray(TN, dtype=np.int_)
    FN = np.asarray(FN, dtype=np.int_)

    # Use float division: the count arrays are integer-valued
    accuracy    = (TP + TN) * 1.0 / (TP + FP + TN + FN)
    sensitivity = TP * 1.0 / (TP + FN)
    specificity = TN * 1.0 / (FP + TN)
    return((thresholds, TP, FP, TN, FN))

We are now ready to start the modeling. As a first cut, we will use a logistic regression model and train it on the X variables in the 75% of records from the dataset created after downsampling the majority class (normal transactions) and combining it with the fraudulent transactions.

Since the number of records in this 75% training sample is not that large (1,476 records), I will perform 5-fold CV to get the optimal value of the C parameter. The metric of interest is the area under the precision-recall curve (computed here with the prg package, which implements precision-recall-gain curves).

The function below, printing_Kfold_scores, performs the cross-validation and then chooses the best value of the C parameter.

In [88]:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 
In [89]:
def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(len(y_train_data),5,shuffle=False) 

    # Different C parameters
    c_param_range = [0.001,0.01,0.1,1,10,100]

    results_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter','Mean recall score','Mean_AUPRC'])
    results_table['C_parameter'] = c_param_range

    # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        auprc_accs = []
        for iteration, indices in enumerate(fold,start=1):

            # Call the logistic regression model with a certain C parameter
            lr = LogisticRegression(C = c_param, penalty = 'l1')

            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
            print type(y_pred_undersample)
            print type(y_train_data.iloc[indices[1],:].values)

            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            
            prg_curve = prg.create_prg_curve(y_train_data.iloc[indices[1],:].values, y_pred_undersample)
            auprc_acc = prg.calc_auprg(prg_curve)
            recall_accs.append(recall_acc)
            auprc_accs.append(auprc_acc)
            print('Iteration ', iteration,': recall score = ', recall_acc)
            print('Iteration ', iteration,': AUPRC score = ', auprc_acc)

        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.ix[j,'Mean recall score'] = np.mean(recall_accs)
        results_table.ix[j,'Mean_AUPRC'] = np.mean(auprc_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
        print('Mean AUPRC score ', np.mean(auprc_accs))
        print('')

    print('Best C param for recall score ', results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter'])
    print('')
    best_c = results_table.loc[results_table['Mean_AUPRC'].idxmax()]['C_parameter']
    print('Best C param for AUPRC ', best_c)
    print('')
    
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    
    return best_c
In [90]:
best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)
-------------------------------------------
('C parameter: ', 0.001)
-------------------------------------------

<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 1, ': recall score = ', 0.94444444444444442)
('Iteration ', 1, ': AUPRC score = ', 0.64406313996186693)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 2, ': recall score = ', 0.98571428571428577)
('Iteration ', 2, ': AUPRC score = ', 0.5748685435417914)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 3, ': recall score = ', 0.94805194805194803)
('Iteration ', 3, ': AUPRC score = ', 0.51594620317586437)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 4, ': recall score = ', 0.95652173913043481)
('Iteration ', 4, ': AUPRC score = ', 0.5520557115017064)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 5, ': recall score = ', 0.94594594594594594)
('Iteration ', 5, ': AUPRC score = ', 0.51192650537416495)

('Mean recall score ', 0.95613567265741162)

('Mean AUPRC score ', 0.5597720207110789)

-------------------------------------------
('C parameter: ', 0.01)
-------------------------------------------

<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 1, ': recall score = ', 0.81944444444444442)
('Iteration ', 1, ': AUPRC score = ', 0.96458837772397099)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 2, ': recall score = ', 0.84285714285714286)
('Iteration ', 2, ': AUPRC score = ', 0.97099811676082859)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 3, ': recall score = ', 0.87012987012987009)
('Iteration ', 3, ': AUPRC score = ', 0.97364096946460355)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 4, ': recall score = ', 0.85507246376811596)
('Iteration ', 4, ': AUPRC score = ', 0.96908544215807069)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 5, ': recall score = ', 0.90540540540540537)
('Iteration ', 5, ': AUPRC score = ', 0.97759805414935907)

('Mean recall score ', 0.85858186532099567)

('Mean AUPRC score ', 0.97118219205136658)

-------------------------------------------
('C parameter: ', 0.1)
-------------------------------------------

<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 1, ': recall score = ', 0.86111111111111116)
('Iteration ', 1, ': AUPRC score = ', 0.97407834101382496)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 2, ': recall score = ', 0.8571428571428571)
('Iteration ', 2, ': AUPRC score = ', 0.96902331961591215)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 3, ': recall score = ', 0.89610389610389607)
('Iteration ', 3, ': AUPRC score = ', 0.9745098159846397)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 4, ': recall score = ', 0.86956521739130432)
('Iteration ', 4, ': AUPRC score = ', 0.97710176991150444)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 5, ': recall score = ', 0.90540540540540537)
('Iteration ', 5, ': AUPRC score = ', 0.97759805414935907)

('Mean recall score ', 0.87786569743091492)

('Mean AUPRC score ', 0.974462260135048)

-------------------------------------------
('C parameter: ', 1)
-------------------------------------------

<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 1, ': recall score = ', 0.86111111111111116)
('Iteration ', 1, ': AUPRC score = ', 0.97407834101382496)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 2, ': recall score = ', 0.87142857142857144)
('Iteration ', 2, ': AUPRC score = ', 0.96708292275075391)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 3, ': recall score = ', 0.90909090909090906)
('Iteration ', 3, ': AUPRC score = ', 0.96251241477990079)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 4, ': recall score = ', 0.89855072463768115)
('Iteration ', 4, ': AUPRC score = ', 0.97308581653717308)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 5, ': recall score = ', 0.90540540540540537)
('Iteration ', 5, ': AUPRC score = ', 0.97268783518465041)

('Mean recall score ', 0.88911734433473577)

('Mean AUPRC score ', 0.96988946605326054)

-------------------------------------------
('C parameter: ', 10)
-------------------------------------------

<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 1, ': recall score = ', 0.86111111111111116)
('Iteration ', 1, ': AUPRC score = ', 0.97407834101382496)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 2, ': recall score = ', 0.87142857142857144)
('Iteration ', 2, ': AUPRC score = ', 0.97206605153931136)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 3, ': recall score = ', 0.90909090909090906)
('Iteration ', 3, ': AUPRC score = ', 0.96251241477990079)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 4, ': recall score = ', 0.91304347826086951)
('Iteration ', 4, ': AUPRC score = ', 0.98068571151539952)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 5, ': recall score = ', 0.90540540540540537)
('Iteration ', 5, ': AUPRC score = ', 0.97268783518465041)

('Mean recall score ', 0.8920158950593734)

('Mean AUPRC score ', 0.97240607080661723)

-------------------------------------------
('C parameter: ', 100)
-------------------------------------------

<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 1, ': recall score = ', 0.86111111111111116)
('Iteration ', 1, ': AUPRC score = ', 0.97407834101382496)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 2, ': recall score = ', 0.88571428571428568)
('Iteration ', 2, ': AUPRC score = ', 0.9651767063629707)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 3, ': recall score = ', 0.90909090909090906)
('Iteration ', 3, ': AUPRC score = ', 0.96251241477990079)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 4, ': recall score = ', 0.91304347826086951)
('Iteration ', 4, ': AUPRC score = ', 0.98068571151539952)
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
('Iteration ', 5, ': recall score = ', 0.90540540540540537)
('Iteration ', 5, ': AUPRC score = ', 0.97268783518465041)

('Mean recall score ', 0.89487303791651629)

('Mean AUPRC score ', 0.97102820177134919)

('Best C param for recall score ', 0.001)

('Best C param for AUPRC ', 0.10000000000000001)

*********************************************************************************
('Best model to choose from cross validation is with C parameter = ', 0.10000000000000001)
*********************************************************************************

We see that the best value of the C parameter, judged by AUPRC, is 0.1.

Next we write a handy function for drawing the confusion matrix.

In [91]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Now we train the model on the 75% undersampled training data, using the best value of the tuning parameter obtained from CV, and predict on the remaining 25% of the undersampled data.

In [92]:
# Use this C_parameter to build the final model with the undersampled training dataset  and predict the classes in the undersampled test
# dataset
lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
print("Recall metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the undersampled testing dataset: ", f1)
#print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)


# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
('Recall metric in the undersampled testing dataset: ', 0.85384615384615381)
('Precision metric in the undersampled testing dataset: ', 0.94871794871794868)
('F1 metric in the undersampled testing dataset: ', 0.89878542510121462)

Voila! Our first model correctly classifies 111 of the 130 fraud transactions and 356 of the 362 normal transactions. Combining precision and recall, we calculate another measure, the F1 score, which looks pretty good so far.
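
As a sanity check (not in the original cell), the same F1 value can be reproduced with scikit-learn's metric functions, since F1 is the harmonic mean of precision and recall:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true_us = y_test_undersample.values.ravel()
p = precision_score(y_true_us, y_pred_undersample)
r = recall_score(y_true_us, y_pred_undersample)
print("F1 = 2*p*r/(p+r) = %.4f" % (2 * p * r / (p + r)))
print("sklearn f1_score = %.4f" % f1_score(y_true_us, y_pred_undersample))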

Next we calculate the Area Under the Precision Recall curve and the plot below looks pretty good.

In [93]:
prg_curve = prg.create_prg_curve(y_test_undersample.values, y_pred_undersample)
auprc_acc = prg.calc_auprg(prg_curve)
print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)
prg.plot_prg(prg_curve)
('AUPRC metric in the undersampled testing dataset: ', 0.95044978898349386)
Out[93]:
[Figure: precision-recall-gain curve on the undersampled test set]

Next we make predictions with the same model on the overall test set, where normal transactions far outnumber fraud transactions. It will be interesting to see how we perform there.

In [94]:
# Use this C_parameter to build the final model with the whole training dataset and predict the classes in the test
# dataset
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
prg_curve = prg.create_prg_curve(y_test.values, y_pred)
auprc_acc = prg.calc_auprg(prg_curve)
#print("AUPRC metric in the testing dataset: ", auprc_acc)
#prg.plot_prg(prg_curve)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the testing dataset: ", f1)

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
('Recall metric in the testing dataset: ', 0.8666666666666667)
('Precision metric in the testing dataset: ', 0.13774834437086092)
('F1 metric in the testing dataset: ', 0.23771428571428568)

Our recall has improved slightly: we now correctly classify 104 of the 120 fraudulent transactions. Precision, however, has dropped sharply because we incorrectly flag 651 normal transactions as fraud; this large number of false positives drags precision down to about 0.14.
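
To see where that precision figure comes from, it can be recomputed directly from the counts quoted above (a quick arithmetic check using the confusion-matrix numbers, not new data):

tp, fp = 104, 651  # frauds correctly flagged vs. normal transactions wrongly flagged
print("Precision = %d / (%d + %d) = %.4f" % (tp, tp, fp, tp / float(tp + fp)))  # ~0.1377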

Next, let's plot ROC curves for both the undersampled test set and the full test set.

In [80]:
# ROC CURVE
lr = LogisticRegression(C = best_c, penalty = 'l1')
y_pred_undersample_score = lr.fit(X_train_undersample,y_train_undersample.values.ravel()).decision_function(X_test_undersample.values)

fpr, tpr, thresholds = roc_curve(y_test_undersample.values.ravel(),y_pred_undersample_score)
roc_auc = auc(fpr,tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
Out[80]:
[Figure: ROC curve on the undersampled test set; AUC shown in the legend]
In [81]:
# ROC CURVE
lr = LogisticRegression(C = best_c, penalty = 'l1')
y_pred_score = lr.fit(X_train_undersample,y_train_undersample.values.ravel()).decision_function(X_test.values)

fpr, tpr, thresholds = roc_curve(y_test.values.ravel(),y_pred_score)
roc_auc = auc(fpr,tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
Out[81]:
[Figure: ROC curve on the full test set; AUC shown in the legend]

This concludes the initial experimentation with linear models; we now move on to another class of models, Random Forests.

Random Forest

In [8]:
%timeit
from sklearn.ensemble import RandomForestClassifier
from collections import OrderedDict
%pylab inline
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['shuffle']
`%matplotlib` prevents importing * from pylab and numpy

After loading the required dependencies, we build our first forest model. We set n_jobs = 3 to leverage parallelism across multiple cores, and fix the number of estimators at 501.

In [95]:
model = RandomForestClassifier(n_estimators = 501, oob_score = True,n_jobs = 3, random_state =1)
model.fit(X_train_undersample,y_train_undersample.values.ravel())
Out[95]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=501, n_jobs=3, oob_score=True, random_state=1,
            verbose=0, warm_start=False)

Next we predict on the 25% undersampled test set and calculate the same performance metrics as above for logistic regression.

In [87]:
y_pred_score_rf_usample = model.predict(X_test_undersample)
In [89]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_score_rf_usample)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
print("Recall metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the undersampled testing dataset: ", f1)
#print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)


# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
('Recall metric in the undersampled testing dataset: ', 0.86153846153846159)
('Precision metric in the undersampled testing dataset: ', 0.97391304347826091)
('F1 metric in the undersampled testing dataset: ', 0.91428571428571426)

We see that the random forest beats logistic regression on recall, precision, and F1 score. We can now try different values of max_features, vary the number of trees (estimators), and inspect the OOB error.

In [9]:
RANDOM_STATE = 123
ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
        RandomForestClassifier(warm_start=True, oob_score=True,
                               max_features="sqrt",
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features='log2'",
        RandomForestClassifier(warm_start=True, max_features='log2',
                               oob_score=True,
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
        RandomForestClassifier(warm_start=True, max_features=None,
                               oob_score=True,
                               random_state=RANDOM_STATE))
]
In [10]:
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

# Range of `n_estimators` values to explore.
min_estimators = 51
max_estimators = 301

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X_train_undersample,y_train_undersample.values.ravel())

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))
In [18]:
# Generate the "OOB error rate" vs. "n_estimators" plot.
pylab.rcParams['figure.figsize'] = (14, 8)
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
Out[18]:
[Figure: OOB error rate vs. n_estimators for the three max_features settings]

As seen in the plot above, the red line corresponding to max_features = None has the lowest OOB error, and the error roughly bottoms out around 225 trees. We can also perform a randomized search over additional hyperparameters that control tree depth, while fixing the number of trees and the number of features considered at every split based on the results above.

In [21]:
# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
In [25]:
from scipy.stats import randint as sp_randint

Setting the hyperparameter choices and defining the search distributions.

In [60]:
param_dist = {"max_depth": [3, None],
              "max_features": [None],
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "criterion": ["gini"]}
In [61]:
param_dist
Out[61]:
{'criterion': ['gini'],
 'max_depth': [3, None],
 'max_features': [None],
 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen at 0x10bd91710>,
 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen at 0x10bd91d50>}
In [62]:
from sklearn.model_selection import RandomizedSearchCV
clf = RandomForestClassifier(n_estimators=201,oob_score=True)
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)
In [63]:
from time import time
start = time()
random_search.fit(X_train_undersample,y_train_undersample.values.ravel() )
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)
RandomizedSearchCV took 97.24 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.965 (std: 0.007)
Parameters: {'max_features': None, 'min_samples_split': 6, 'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1}

Model with rank: 2
Mean validation score: 0.964 (std: 0.010)
Parameters: {'max_features': None, 'min_samples_split': 3, 'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 3}

Model with rank: 2
Mean validation score: 0.964 (std: 0.010)
Parameters: {'max_features': None, 'min_samples_split': 6, 'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 2}

Choosing the best hyperparameters from model tuning and building the final random forest model.

In [64]:
model_final_rf = RandomForestClassifier(n_estimators = 201, oob_score = True,n_jobs = 3, random_state =1, min_samples_split = 6, min_samples_leaf = 1, max_features =None)
model_final_rf.fit(X_train_undersample,y_train_undersample.values.ravel())
Out[64]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=6, min_weight_fraction_leaf=0.0,
            n_estimators=201, n_jobs=3, oob_score=True, random_state=1,
            verbose=0, warm_start=False)
In [65]:
model_final_rf.feature_importances_
Out[65]:
array([ 0.00356512,  0.00439727,  0.00358704,  0.03303662,  0.0054486 ,
        0.00521412,  0.01250949,  0.00434431,  0.00767515,  0.00699696,
        0.00927106,  0.82735551,  0.00513986,  0.0265637 ,  0.00467576,
        0.01266649,  0.00746827,  0.01620954,  0.00387515])

Plotting the feature importances.

In [102]:
importances = model_final_rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in model_final_rf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
Feature ranking:
1. feature 11 (0.827356)
2. feature 3 (0.033037)
3. feature 13 (0.026564)
4. feature 17 (0.016210)
5. feature 15 (0.012666)
6. feature 6 (0.012509)
7. feature 10 (0.009271)
8. feature 8 (0.007675)
9. feature 16 (0.007468)
10. feature 9 (0.006997)
11. feature 4 (0.005449)
12. feature 5 (0.005214)
13. feature 12 (0.005140)
14. feature 14 (0.004676)
15. feature 1 (0.004397)
16. feature 7 (0.004344)
17. feature 18 (0.003875)
18. feature 2 (0.003587)
19. feature 0 (0.003565)
Out[102]:
[Figure: feature importances of the forest, with inter-tree standard deviations as error bars]

Calculate the different performance evaluation metrics.

In [70]:
# Compute confusion matrix
y_pred_score_rf_usample = model_final_rf.predict(X_test_undersample)
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_score_rf_usample)
np.set_printoptions(precision=2)

recall = cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1])
precision = cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1])
f1 = 2*recall*precision/(precision+recall)
print("Recall metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("Precision metric in the undersampled testing dataset: ", cnf_matrix[1,1]*1.0/(cnf_matrix[0,1]+cnf_matrix[1,1]))
print("F1 metric in the undersampled testing dataset: ", f1)
#print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)


# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
('Recall metric in the undersampled testing dataset: ', 0.87692307692307692)
('Precision metric in the undersampled testing dataset: ', 0.97435897435897434)
('F1 metric in the undersampled testing dataset: ', 0.92307692307692302)

We have achieved a slight improvement in recall and F1 score over the baseline random forest!

In [71]:
prg_curve = prg.create_prg_curve(y_test_undersample.values, y_pred_score_rf_usample)
auprc_acc = prg.calc_auprg(prg_curve)
print("AUPRC metric in the undersampled testing dataset: ", auprc_acc)
prg.plot_prg(prg_curve)
('AUPRC metric in the undersampled testing dataset: ', 0.96558661525878553)