Detecting Credit Card Fraud

PUBLISHED ON JULY 22, 2020
Photo by Bermix Studio on Unsplash

This is a project I worked on to investigate some different methods for detecting credit card fraud–a classic unbalanced data problem. Feel free to skip around to different sections or straight to the results. I’ll also include the code so that you can run it and/or follow along.

Core Ideas

Imbalanced datasets are common in many problems – spam email classification, customer churn, insurance fraud detection, or, in our case, credit card fraud detection. In all of these cases, the minority class is much smaller than the majority class, ranging in size from 25% down to 0.25% of the data. If we can correctly classify the minority class without misclassifying the majority class, this is a huge success for a company, as identifying the minority class often leads to huge savings. The goal of this project was to find the most effective classification method for a severely imbalanced dataset of credit card transactions. I explored four different techniques – sampling, anomaly detection, cost-sensitive training, and ensemble methods. It turned out that the baseline performance on this dataset was already pretty good, but it could be boosted a bit further with oversampling or with ensemble methods like AdaBoost.

Introduction

Imbalanced data sets are prevalent in many real-world supervised learning problems. They can be incredibly insightful, and incredibly difficult to work with. There is often a minority class, exhibiting very different behavior from the majority class, that we are acutely interested in. Many classification models assume an even distribution of classes, assume that misclassification costs are equal, and are accuracy driven, which often results in a bias toward the majority class. Thus, many problems are encountered when trying to apply standard classification techniques to imbalanced data. From email spam detection to tax evasion, models must be built using data that is skewed toward a majority class. This skew can be as small as 75%/25%, or as large as 99.99%/0.01%. If a model just predicted the majority class every time, its accuracy could be 99%, normally a great score, yet the minority class would be completely misclassified. Thus, alternative metrics are employed – for this application, the area under the Receiver Operating Characteristic (ROC) curve is used.

In this project, the goal is to determine which method for imbalanced data is best at correctly predicting credit card fraud. Credit card fraud is extremely expensive for credit card companies, as they are responsible for fraudulent transactions: if a fraudster spends $10,000 on someone else’s credit card, the company must pay for it. This cost is ultimately passed on to the customer through higher interest rates, lower rewards, and stricter standards. If the fraudulent charges can be detected early, the card can be shut down or the fraudster apprehended before they disappear. However, there is a trade-off. If the company declines too many transactions in an attempt to stem fraud, the customer will stop using a card that is always declined, and the company loses money. If, on the other hand, the company allows too many questionable transactions, it risks allowing fraudulent transactions, which also loses money.

The dataset (Dal Pozzolo et al. (2015)) is from Kaggle, and was collected and analyzed during a collaboration between the Machine Learning Group of the Université libre de Bruxelles and Worldline. The data contains 284,807 records of credit card transactions, of which 492 (0.172%) are fraudulent. The transactions take place over two days in September 2013 in Europe. The first 28 features of the data are numerical values obtained from Principal Component Analysis on the original dataset to anonymize it. In addition to these principal components, there is also the transaction amount and the time of the transaction, where the first transaction is time 0 and every following transaction is the time in seconds since the first transaction. Each record is labelled as fraudulent or valid. The huge disparity in class distribution is visible in Figure 1.

Figure 1: Distribution of the two classes. The fraudulent transactions aren’t even visible, there are so few of them!

There are many techniques for imbalanced data classification. This project explores the most common ones, although there are other, more niche techniques that are used less frequently. Of all the imbalanced data techniques, the most widely used is sampling, where the dataset is re-sampled to achieve a balanced set: 50% one class and 50% the other. Several different sampling techniques are explored in this project. The next technique used is anomaly detection. In this method, it is assumed that the data consists of only one class, and the minority class points are outliers. Cost-sensitive training is also employed, where misclassifications are heavily penalized. Lastly, we use various ensemble classification techniques to build robust classifiers. Before beginning the project, it is already clear that there is some structure in the data when we graph the first two principal components, V1 and V2, against each other. In Figure 2, there are regions where the two classes overlap, as well as regions that contain only fraudulent or only valid transactions.

Figure 2: First two principal components, graphed, with fraudulent transactions in green and valid transactions in blue

Related Work

There is a lot of past and present research on imbalanced data. Much of the current cutting-edge research focuses on novel algorithms for classification or sampling. Barua et al. (2014) propose a new synthetic oversampling method where minority-class samples are weighted depending on how far they are from majority-class samples, and synthetic samples are generated according to those weights. Sun et al. (2015) propose a new ensemble technique where multiple balanced datasets are sampled, and from these balanced datasets, various ensemble methods are employed. Maldonado et al. (2014) explore the combination of high dimensionality and imbalanced classes through a novel method of feature selection.

There has also been significant research in detecting credit card fraud. A multi-classifier metalearning technique was employed by Chan and Stolfo (1998) for credit card fraud detection. In Chan et al. (1999), the authors similarly use multiple classifiers, but use distributed computing to decrease the time and make the solution more scalable. The papers by Brause et al. (1999), Ghosh and Reilly (1994), and Aleskerov et al. (1997) propose using neural networks to classify fraudulent credit card transactions, although they are 20 to 25 years old. Finally, Mahmoudi and Duman (2015) look into using a modified Fisher Discriminant Analysis for credit card fraud.

The authors of the dataset have written a paper (Dal Pozzolo et al. (2014)) about their analysis of the data set. Some of the things they explore in this paper are training a model on a stream of data so as to not keep terabytes of learning data, choosing appropriate scoring metrics, and updating the model as new data comes in so that changing fraud patterns do not go undetected. I received a thorough background on mining imbalanced data from many survey papers and textbook chapters covering a wide variety of methods, including Aggarwal (2015), He and Garcia (2009), Chawla (2009), Provost (2000), and Jeni et al. (2013).

Methodology

In this project, four different techniques were used. The data was split into 60% training, 20% validation, and 20% testing. When tuning model hyperparameters, I used 5-fold cross-validation and looked at the mean score returned. Baseline measurements were taken without any modifications, so that I could determine whether my results were meaningful. Baseline measurements were taken for four different classifiers – Decision Trees, Logistic Regression, Linear SVM, and Naive Bayes. All of the models were programmed using Scikit-learn, a machine learning library for Python, and the data was resampled using Imbalanced-learn, a sampling library for Python. As mentioned earlier, pure accuracy will not give a meaningful measure of our classifiers: if a classifier always predicted valid, its accuracy would be 99.828%. Several different metrics could be used instead, including Cohen’s kappa, the F1 score, Krippendorff’s alpha, and the area under the receiver operating characteristic (ROC) curve or the precision-recall curve. In this project, the area under the ROC curve was used as the primary performance metric. In addition to the ROC curve, confusion matrices were also used to verify the results. An example confusion matrix can be seen in Figure 3.

To set up, we need to import the packages and load our data into a DataFrame.

#import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import sklearn.tree    #so sklearn.tree.DecisionTreeClassifier is available later
import itertools
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score as roc
from sklearn.metrics import roc_curve, auc, confusion_matrix, recall_score, precision_score
from sklearn import ensemble
from sklearn import neighbors
from sklearn import linear_model
from sklearn import svm
from sklearn import naive_bayes
from sklearn import metrics
from sklearn import model_selection
from sklearn import covariance
import imblearn
from imblearn import over_sampling
from imblearn import under_sampling
from imblearn import ensemble as imb_ensemble    #aliased so it does not shadow sklearn's ensemble module
from imblearn import combine

%matplotlib inline

#import data
df = pd.read_csv('creditcard.csv')

#drop time column
df=df.drop(['Time'], axis=1)
#normalize amount
df['Amount'] = (df['Amount'] - df['Amount'].mean()) / (df['Amount'].max() - df['Amount'].min())

#split into train/test data
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:29], df.iloc[:,29], test_size=0.20, random_state=42)

Now that the data is loaded, we can do some preliminary investigations. If you run

df.head()

to see the first five rows, you will see that there are 29 features, titled V1–V28 and “Amount”, as well as a label (Class). Above, we normalized the “Amount” column to try to reduce the effect of outliers, and dropped the Time column. The data set has had PCA performed on it, so V1–V28 are principal components 1 to 28.

# Set area under the ROC curve as our scorer
# needs_threshold=True scores the continuous decision_function/predict_proba output instead of hard 0/1 predictions
roc_scorer = metrics.make_scorer(score_func = metrics.roc_auc_score, needs_threshold = True)

#Create Validation set on 25% of training data (so it is 20% of all data) to not overfit
val_indices = np.random.choice(y_train.shape[0], y_train.shape[0] // 4, replace = False)
train_indices = np.where(~np.in1d(range(y_train.shape[0]), val_indices))[0]

X_val = X_train.iloc[val_indices,:] #randomly chosen validation rows
X_train2 = X_train.iloc[train_indices,:] #remaining training rows
y_val = y_train.iloc[val_indices]
y_train2 = y_train.iloc[train_indices]
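
As mentioned in the methodology, hyperparameters were tuned with 5-fold cross-validation. The exact tuning grids aren't shown in this post, but here is a minimal sketch of how the roc_scorer above could be plugged into cross-validation (the logistic regression is just a stand-in for whichever model is being tuned):

# Example: mean ROC AUC over 5 cross-validation folds for one candidate model
cv_scores = model_selection.cross_val_score(linear_model.LogisticRegression(),
                                            X_train2, y_train2,
                                            scoring = roc_scorer, cv = 5)
print('Mean ROC AUC across folds: %0.4f' % cv_scores.mean())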

Now, we can start making some of the graphs that were shown above!

#The three plots created comprise Figure 2
#First two principal components overlaid
plt.figure()    #start a new figure so the plots don't draw on top of each other
plt.scatter(df[df['Class'] ==1].V1,df[df['Class']==1].V2, c='darkgreen', zorder = 2)
plt.scatter(df[df['Class'] ==0].V1,df[df['Class']==0].V2, c='darkblue', zorder = 1)
plt.title('First two Principal Components')
plt.xlabel('V1')
plt.ylabel('V2')
plt.savefig('pca_first_two')

#Fraudulent transactions, V1 vs V2
plt.figure()
plt.scatter(df[df['Class'] ==1].V1,df[df['Class']==1].V2, c='darkgreen')
plt.title('First two Principal Components--Fraudulent')
plt.xlabel('V1')
plt.ylabel('V2')
plt.axis([-50,5, -60,25])
ax = plt.gca()
ax.set_autoscale_on(False)
plt.savefig('pca_first_two_fraud')

#Valid transactions, V1 vs V2
plt.figure()
plt.scatter(df[df['Class'] ==0].V1,df[df['Class']==0].V2, c= 'darkblue')
plt.title('First two Principal Components--Valid')
plt.xlabel('V1')
plt.ylabel('V2')
plt.axis([-50,5, -60,25])
ax = plt.gca()
ax.set_autoscale_on(False)
plt.savefig('pca_first_two_valid')

We can show the sheer magnitude of the disparity between the classes (Figure 1) with the following code:

plt.figure(figsize=(5,5))
fraud_no = df[df['Class']==1].shape[0]
valid_no = df[df['Class']==0].shape[0]
plt.bar([1, 2], [fraud_no,valid_no] )
plt.title('Fraudulent and Valid Transaction Amounts')
plt.ylabel('Amount')
ax = plt.gca()
ax.set_xticks([1,2])
ax.set_xticklabels(['Fraudulent', 'Valid'])
plt.gcf().subplots_adjust(left=0.25)
plt.savefig('vald_fraudulent_amount')

We can get our baseline results by fitting each classifier on the training data and computing its ROC curve on the test set.

# Dictionaries to hold the ROC data for each classifier
fpr = dict()
tpr = dict()
roc_auc = dict()

#Baseline, Decision Tree
tree = sklearn.tree.DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_score = tree.predict_proba(X_test)[:,1]
fpr[0], tpr[0], _ = roc_curve(y_test[:], y_score[:])
roc_auc[0] = auc(fpr[0], tpr[0])

#Baseline, Logistic Regression
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
y_score = logreg.decision_function(X_test)
fpr[1], tpr[1], _ = roc_curve(y_test[:], y_score[:])
roc_auc[1] = auc(fpr[1], tpr[1])

#Baseline, Naive Bayes
bayes = naive_bayes.BernoulliNB()
bayes.fit(X_train, y_train)
y_score = bayes.predict_proba(X_test)[:,1]
fpr[2], tpr[2], _ = roc_curve(y_test[:], y_score[:])
roc_auc[2] = auc(fpr[2], tpr[2])

#Linear SVM 
svc = svm.LinearSVC(random_state=42)
svc.fit(X_train, y_train)
y_score = svc.decision_function(X_test)
fpr[3], tpr[3], _ = roc_curve(y_test[:], y_score[:])
roc_auc[3] = auc(fpr[3], tpr[3])

plt.figure(figsize = (7,7))
lw = 2
plt.plot(fpr[0], tpr[0], color='darkorange',
         lw=lw, label='Decision Tree (area = %0.4f)' % roc_auc[0])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.plot(fpr[1], tpr[1], color='green',
         lw=lw, label='Logistic Regression (area = %0.4f)' % roc_auc[1])
plt.plot(fpr[2], tpr[2], color='blue',
         lw=lw, label='Naive Bayes (area = %0.4f)' % roc_auc[2])
plt.plot(fpr[3], tpr[3], color='red',
         lw=lw, label='SVM (area = %0.4f)' % roc_auc[3])
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Baseline')
plt.legend(loc="lower right")
plt.savefig('roc_baseline')

Next, if we want to create the confusion matrix from Figure 3, we can reuse the baseline decision tree fitted above. Note that plot_confusion_matrix is a small plotting helper (in the style of the scikit-learn documentation example); a sketch of one is given after the code.

#Confusion matrix for the baseline decision tree
class_names = ['Valid', 'Fraudulent']
y_pred = tree.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, Baseline Decision Tree')
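
The plot_confusion_matrix helper isn't defined in this post. Here is a minimal sketch of one, adapted from the style of the scikit-learn documentation example (an assumption on my part, not necessarily the exact helper used); define it before running the confusion-matrix code above.

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """Draw a labelled confusion matrix with the count in each cell."""
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], 'd'),
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()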

Sampling

The first technique employed is sampling. There are a variety of sampling techniques that allow one to change the makeup of the training data. Many sampling methods have implementations where only original data is resampled as well as implementations where synthetic data is generated. For each sampling technique, the same four classifiers that were used in the baseline were used–Decision Tree, Logistic Regression, Linear SVM, and Naive Bayes.

Figure 3: Baseline Decision Tree Confusion Matrix

The first sampling category used was oversampling. Oversampling adds more data from the minority class so that there is an even split between the classes. In this example, that meant that there were roughly as many fraudulent samples as valid ones after resampling. Two different oversampling techniques were used. The first technique was “simple” oversampling, where the minority class was simply repeated until it was about the same size as the majority class. The second technique used was the Synthetic Minority Oversampling Technique, or SMOTE (Chawla et al. (2002)). As the name suggests, this technique generates synthetic data from the minority class. The basic premise of SMOTE is that it picks a point in the minority class, finds its k nearest minority-class neighbors, and then creates new synthetic points along the line segments between the chosen point and those neighbors.
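
To make the interpolation idea concrete, here is a minimal sketch of SMOTE's core step (illustrative only, assuming a NumPy array of minority-class rows; this is not imbalanced-learn's actual implementation):

from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, random_state=42):
    """Generate n_new synthetic samples by interpolating between a chosen
    minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)    #column 0 is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_minority))           #pick a random minority point
        j = neighbor_idx[i][rng.randint(1, k + 1)] #pick one of its k neighbors
        gap = rng.rand()                           #random position along the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)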

The next sampling technique used was undersampling. Undersampling, in contrast with oversampling, discards a large portion of the majority class so that the training data consists of an even split of classes, albeit smaller than the original dataset and much smaller than an oversampled one. In this case, that meant keeping an equal (and small) number of valid and fraudulent transactions. The first undersampling technique was Random Undersampling, where transactions were selected from the majority class at random to create an undersampled set. The second undersampling technique was NearMiss-1, which selects the majority-class points with the smallest average distance to their n closest minority-class points. The final sampling approach used was a combined over- and undersampling method. This algorithm, called SMOTETomek (Batista et al. (2004)), first oversamples the minority class using SMOTE. However, SMOTE is prone to creating noise, or examples that are not particularly helpful. Tomek’s Links (Tomek (1976)) are an undersampling technique that removes the hardest points to classify: pairs of nearest neighbors that belong to opposite classes. Thus, after oversampling with SMOTE, the data is cleaned up by removing Tomek links.
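
For intuition, Tomek links can be found with a simple mutual nearest-neighbor check. This is a minimal sketch of the idea (again assuming NumPy arrays, and not imbalanced-learn's actual implementation):

from sklearn.neighbors import NearestNeighbors

def tomek_link_mask(X, y):
    """Return a boolean mask marking samples that participate in a Tomek link:
    pairs of mutual nearest neighbors that belong to opposite classes."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]    #column 0 is the point itself (assuming no duplicate rows)
    is_link = np.zeros(len(y), dtype=bool)
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i:
            is_link[i] = True
            is_link[j] = True
    return is_link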

I will show how to take the samples for each method, but I will only show the scoring for one method since it gets a bit repetitive.

# Simple Oversampling: repeat the fraud rows until the classes are roughly balanced
y_train_pos = y_train[y_train == 1]
y_train_over = y_train_pos
X_train_pos = X_train[y_train ==1]
X_train_over = X_train_pos
for i in range(y_train.shape[0] // y_train_pos.shape[0]):
    y_train_over = pd.concat([y_train_over, y_train_pos])
    X_train_over = pd.concat([X_train_over, X_train_pos])
X_train_res = pd.concat([X_train_over, X_train])
y_train_res = pd.concat([y_train_over, y_train])

#Simple Oversampling decision tree
tree = sklearn.tree.DecisionTreeClassifier()
tree.fit(X_train_res, y_train_res)
y_score = tree.predict_proba(X_test)[:,1]
fpr[0], tpr[0], _ = roc_curve(y_test[:], y_score[:])
roc_auc[0] = auc(fpr[0], tpr[0])
cnf_matrix = confusion_matrix(y_test, tree.predict(X_test))    #confusion matrix needs hard predictions
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, ROS Decision Tree')

#Simple Oversampling, Logistic Regression
logreg = linear_model.LogisticRegression()
logreg.fit(X_train_res, y_train_res)
y_score = logreg.decision_function(X_test)
fpr[1], tpr[1], _ = roc_curve(y_test[:], y_score[:])
roc_auc[1] = auc(fpr[1], tpr[1])

#Simple Oversampling, Naive Bayes
bayes = naive_bayes.BernoulliNB()
bayes.fit(X_train_res, y_train_res)
y_score = bayes.predict_proba(X_test)[:,1]
fpr[2], tpr[2], _ = roc_curve(y_test[:], y_score[:])
roc_auc[2] = auc(fpr[2], tpr[2])

#Linear SVM 
svc = svm.LinearSVC(random_state=42)
svc.fit(X_train_res, y_train_res)
y_score = svc.decision_function(X_test)
fpr[3], tpr[3], _ = roc_curve(y_test[:], y_score[:])
roc_auc[3] = auc(fpr[3], tpr[3])

plt.figure(figsize = (7,7))
lw = 2
plt.plot(fpr[0], tpr[0], color='darkorange',
         lw=lw, label='Decision Tree (area = %0.2f)' % roc_auc[0])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.plot(fpr[1], tpr[1], color='green',
         lw=lw, label='Logistic Regression (area = %0.2f)' % roc_auc[1])
plt.plot(fpr[2], tpr[2], color='blue',
         lw=lw, label='Naive Bayes (area = %0.2f)' % roc_auc[2])
plt.plot(fpr[3], tpr[3], color='red',
         lw=lw, label='SVM (area = %0.2f)' % roc_auc[3])
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Simple Oversampling')
plt.legend(loc="lower right")
plt.savefig('roc_simple_over')

This is the entire process for simple oversampling: resample the training data, fit the models to the resampled data, and score/plot them all using the ROC curve.

# SMOTE (note: in newer versions of imbalanced-learn, fit_sample is named fit_resample)
smote = imblearn.over_sampling.SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_sample(X_train, y_train)

# Random Under Sampling
rus = imblearn.under_sampling.RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(X_train, y_train)

# NearMiss-1 (deterministic, so no random_state is needed)
nm = imblearn.under_sampling.NearMiss(version=1)
X_train_res, y_train_res = nm.fit_sample(X_train, y_train)

# "Combined" method, SMOTETomek
smt = imblearn.combine.SMOTETomek(random_state=42)
X_train_res, y_train_res = smt.fit_sample(X_train, y_train)
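
Since the scoring code is identical for every sampler, one way to avoid the repetition is to wrap it in a small helper. This is just a sketch (it is not in the original post) that reuses the classifiers and the roc score function imported earlier:

def score_resampled(sampler, name):
    """Resample the training data with the given imbalanced-learn sampler,
    fit the four classifiers, and print/return their test-set ROC AUCs."""
    X_res, y_res = sampler.fit_sample(X_train, y_train)    #fit_resample in newer imbalanced-learn
    models = {
        'Decision Tree': sklearn.tree.DecisionTreeClassifier(),
        'Logistic Regression': linear_model.LogisticRegression(),
        'Naive Bayes': naive_bayes.BernoulliNB(),
        'Linear SVM': svm.LinearSVC(random_state=42),
    }
    aucs = {}
    for label, model in models.items():
        model.fit(X_res, y_res)
        if hasattr(model, 'decision_function'):
            scores = model.decision_function(X_test)
        else:
            scores = model.predict_proba(X_test)[:, 1]
        aucs[label] = roc(y_test, scores)
        print('%s, %s: %.3f' % (name, label, aucs[label]))
    return aucs

# Example usage
score_resampled(imblearn.over_sampling.SMOTE(random_state=42), 'SMOTE')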

Anomaly Detection

The next technique used is anomaly detection. In this method, it is assumed that the data consists of only one class, and the minority class points are outliers. If the classifier can correctly predict which data points are noise, then it can correctly predict which class a new data point belongs to. However, if one tries to flag too much noise, one risks lumping many majority-class points in with the minority class in order to have an outlier-free dataset.

The first anomaly detection method used was One-Class SVM (Schölkopf et al. (2001)). One-Class SVM is an unsupervised method that fits a decision function to the data. When new data arrives, it is classified as either similar to or different from the training set.

The second method used was Elliptic Envelope (Rousseeuw and Driessen (1999)). Elliptic Envelope assumes that the data is Gaussian distributed and tries to fit an ellipse to the central data points, ignoring points outside of the ellipse. The Mahalanobis distance is used to measure how much of an outlier a point is.

The third method used was Isolation Forest (Liu et al. (2008)). Isolation Forest is an ensemble anomaly detection method that isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. The number of splits required to isolate a sample equals the path length from the root node to the terminating node, and this path length is the measure of normality used for the decision function.

Here is how I used One-Class SVM

#Undersample the training data before fitting the One-Class SVM
rus = imblearn.under_sampling.RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(X_train, y_train)

fpr = dict()
tpr = dict()
roc_auc = dict()
# Set training data and map the test labels to inlier (1, valid) / outlier (-1, fraud) form
X_train_anomaly = X_train_res
y_train_anomaly = y_train_res
X_test_anomaly = X_test
y_test_anomaly = [1 if x==0 else -1 for x in y_test]
# Fit the model (One-Class SVM is unsupervised, so the labels are ignored during fitting)
svm_rbf = sklearn.svm.OneClassSVM(kernel = 'rbf')
svm_rbf.fit(X_train_anomaly, y_train_anomaly)
# Score the model
y_score = svm_rbf.decision_function(X_test_anomaly)
fpr[2], tpr[2], _ = roc_curve(y_test_anomaly[:], y_score[:])
roc_auc[2] = auc(fpr[2], tpr[2])

Here is the implementation of Elliptic Envelope

#Set up the data and map the test labels to inlier/outlier form
X_train_ee = X_train
y_train_ee = y_train
X_test_ee = X_test
#map y_test to inlier/outlier (1 for inlier (real class 0), -1 for outlier (real class 1))
y_test_ee = [1 if x==0 else -1 for x in y_test]
ee = sklearn.covariance.EllipticEnvelope(assume_centered = False, random_state = 42)
ee.fit(X_train_ee, y_train_ee)    #the labels are ignored; the envelope is fit to the data alone
#Measure auc roc
y_score = ee.decision_function(X_test_ee)
fpr[0], tpr[0], _ = roc_curve(y_test_ee[:], y_score[:])
roc_auc[0] = auc(fpr[0], tpr[0])

Finally, the implementation of Isolation Forest

# Isolation Forest is fit on the training data (the labels are not used)
X_train_anomaly = X_train
X_test_anomaly = X_test
#map the test labels to inlier (1, valid) / outlier (-1, fraud) form
y_test_anomaly = [1 if x == 0 else -1 for x in y_test]

iso = sklearn.ensemble.IsolationForest(random_state=42)
iso.fit(X_train_anomaly)

#predict returns 1 for inliers and -1 for outliers
#(iso.decision_function would give a continuous score and a smoother ROC curve)
y_score = iso.predict(X_test_anomaly)
fpr[1], tpr[1], _ = roc_curve(y_test_anomaly[:], y_score[:])
roc_auc[1] = auc(fpr[1], tpr[1])

We can plot the results of these three methods like before

plt.figure(figsize = (7,7))
lw = 2
plt.plot(fpr[0], tpr[0], color='pink',
         lw=lw, label='Elliptic Envelope (area = %0.2f)' % roc_auc[0])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.plot(fpr[1], tpr[1], color='teal',
         lw=lw, label='Isolation Forest (area = %0.2f)' % roc_auc[1])
plt.plot(fpr[2], tpr[2], color='purple',
         lw=lw, label='One-Class SVM (area = %0.2f)' % roc_auc[2])
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Anomaly Detection')
plt.legend(loc="lower right")
plt.savefig('roc_anomaly')

Cost-Sensitive Training

Cost-sensitive training is also employed, where misclassifications of the minority class are more heavily penalized than misclassifications of the majority class. Three of the four classifiers used in the baseline and sampling sections are used here – Logistic Regression, Linear SVM, and Decision Tree; Naive Bayes has no simple way of introducing a cost matrix. Since we have 284,315 valid transactions and 492 fraudulent transactions, the class ratio is roughly 580:1, so we used a cost of 1 for misclassifying valid transactions and 580 for misclassifying fraudulent transactions.

How did we implement it? As follows

# Weights proportional to imbalance
weights = {0: 1, 1: 580}
# Weighted decision tree
tree = sklearn.tree.DecisionTreeClassifier(class_weight = weights)
tree.fit(X_train, y_train)
y_score = tree.predict_proba(X_test)[:,1]
fpr[0], tpr[0], _ = roc_curve(y_test[:], y_score[:])
roc_auc[0] = auc(fpr[0], tpr[0])

# Weighted Logistic Regression
logreg = linear_model.LogisticRegression(class_weight = weights)
logreg.fit(X_train, y_train)
y_score = logreg.predict_proba(X_test)[:,1]
fpr[1], tpr[1], _ = roc_curve(y_test[:], y_score[:])
roc_auc[1] = auc(fpr[1], tpr[1])

#Weighted Linear SVM 
svc = svm.LinearSVC(random_state=42, class_weight = weights)
svc.fit(X_train, y_train)
y_score = svc.decision_function(X_test)
fpr[2], tpr[2], _ = roc_curve(y_test[:], y_score[:])
roc_auc[2] = auc(fpr[2], tpr[2])
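
As an aside (this is not in the original post), scikit-learn can also derive such weights automatically with class_weight='balanced', which weights each class inversely proportionally to its frequency; a quick sketch:

# Same idea using scikit-learn's built-in balanced class weighting
logreg_balanced = linear_model.LogisticRegression(class_weight = 'balanced')
logreg_balanced.fit(X_train, y_train)
print('Balanced-weight Logistic Regression ROC AUC: %.3f'
      % roc(y_test, logreg_balanced.predict_proba(X_test)[:, 1]))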

Ensemble Methods

Several ensemble methods are used to build robust classifiers. These classifiers are trained on the original data, and three different ensemble methods are used – Random Forest, AdaBoost, and Gradient Boosting. The element of randomness included in ensemble methods will hopefully improve the performance of the classifiers.

The ensemble methods were implemented as follows.

# Random Forest
rfc = sklearn.ensemble.RandomForestClassifier(criterion = 'entropy', max_features = 0.5)
rfc.fit(X_train, y_train)
y_score = rfc.predict_proba(X_test)[:,1]
fpr[0], tpr[0], _ = roc_curve(y_test[:], y_score[:])
roc_auc[0] = auc(fpr[0], tpr[0])

#AdaBoost
ada = sklearn.ensemble.AdaBoostClassifier(random_state = 42)
ada.fit(X_train, y_train)
y_score = ada.predict_proba(X_test)[:,1]
fpr[1], tpr[1], _ = roc_curve(y_test[:], y_score[:])
roc_auc[1] = auc(fpr[1], tpr[1])

# Gradient Boosting
grad = sklearn.ensemble.GradientBoostingClassifier(random_state = 42)
grad.fit(X_train, y_train)
y_score = grad.predict_proba(X_test)[:,1]
fpr[2], tpr[2], _ = roc_curve(y_test[:], y_score[:])
roc_auc[2] = auc(fpr[2], tpr[2])

Results

Method                 Decision Tree   Naive Bayes   Logistic Regression   Linear SVM
Baseline               0.892           0.966         0.976                 0.971
SMOTE                  0.882           0.967         0.979                 0.982
Simple Oversampling    0.872           0.966         0.979                 0.982
NearMiss-1             0.662           0.876         0.900                 0.826
Random Undersampling   0.891           0.964         0.980                 0.826
SMOTETomek             0.882           0.967         0.979                 0.982

Figure 4: All Sampling Methods, Area under ROC curve

The results shown here are the ROC curves for each classifier, showing the tradeoff between the true positive rate (TPR) and the false positive rate (FPR). Confusion matrices were also studied. Because a decision tree outputs essentially hard 0/1 predictions rather than a continuous score, its ROC “curve” collapses to straight line segments on the charts.

Baseline

The baseline measurements were taken for Decision Trees, Logistic Regression, Naive Bayes, and Linear SVM. Logistic Regression performed the best on the test set. You can see the scores in Figure 4 and the baseline ROC curves in Figure 5.

Sampling

The ROC curves and areas were computed for all the different sampling methods–Simple Oversampling, SMOTE, Random Undersampling, NearMiss-1, and SMOTETomek. The Simple Oversampling and SMOTE scores were nearly identical, implying that there is not much advantage to using an algorithm like SMOTE on this dataset. Another interesting observation is that Random Undersampling performed as well as or significantly better than NearMiss-1 with every classifier. SMOTETomek did not perform significantly better than SMOTE or Simple Oversampling.

Figure 5: Baseline

Figure 5: SMOTETomek

Figure 5: NearMiss-1

Figure 5: Random Undersampling

Figure 5: SMOTE

Figure 5: Simple Oversampling

Anomaly Detection

Figure 6: Anomaly Detection Methods ROC Curve

Classifier          Area Under ROC
One-Class SVM       0.872
Elliptic Envelope   0.949
Isolation Forest    0.908

Figure 7: Anomaly Detection Methods, area under ROC Curve

Figures 6 and 7 show the performance of the three anomaly detection methods. They performed well, but not as well as even the baseline Logistic Regression or Linear SVM. Of the three methods, Elliptic Envelope performed the best. However, these methods all take considerably longer to fit than the baseline classifiers, so they are inferior to the other classifiers on this dataset.

Cost-Sensitive Training

For the cost-sensitive methods, a cost matrix was given to the three baseline classifiers that can utilize cost matrices. This did not show a significant deviation from the baseline performance. Even with significantly modified cost matrices, there was no discernible change in the area under the ROC curve. These results can be seen in Figures 8 and 9.

Classifier            Area Under ROC
Decision Tree         0.857
Logistic Regression   0.980
Linear SVM            0.954

Figure 8: Cost-Sensitive Methods, area under ROC Curve

Figure 9: Cost-Sensitive Methods ROC Curve

Ensemble

The ensemble methods had a wide range of performances. Gradient Boosting was one of the worst methods measured, with an ROC area of 0.786, while AdaBoost was the best method measured, with an ROC area of 0.983. AdaBoost is a very powerful algorithm, so this is not a surprising result. Random Forest did a fine job, better than Gradient Boosting but worse than AdaBoost. One can see the results in the table and ROC curve of Figures 10 and 11.

Classifier          Area Under ROC
Random Forest       0.857
AdaBoost            0.983
Gradient Boosting   0.786

Figure 10: Ensemble Methods, Area under ROC Curve

Figure 11: Ensemble Methods ROC Curve

Discussion

These results are not surprising, although the performance of each method cannot safely be extrapolated beyond this credit card dataset, and perhaps other credit card datasets. There are many different classifiers, and while they have roughly similar ROC areas, they each misclassify differently at different thresholds. The optimal model for this specific scenario could be obtained by examining the models that were created and asking the client what is more important: misclassifying fraudulent charges or misclassifying valid charges. With that known, the model that is most useful to the client could be produced.
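
For example, once the client names a minimum recall on fraud, an operating threshold can be read off the ROC curve. A minimal sketch (the 0.90 recall target and the choice of the weighted logistic regression are purely illustrative, not from the original analysis):

# Pick the decision threshold that achieves at least 90% recall on fraud,
# then report the false positive rate paid for it.
target_tpr = 0.90    #illustrative client requirement
fpr_lr, tpr_lr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])
idx = np.argmax(tpr_lr >= target_tpr)    #first threshold meeting the recall target
print('Threshold: %.4f, TPR: %.3f, FPR: %.4f' % (thresholds[idx], tpr_lr[idx], fpr_lr[idx]))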

Some potential problems arise with this project. First, we need to question the accuracy of the dataset itself. Credit card fraud is very hard to detect, and it is possible that there are still undetected instances of fraud within our dataset. If that is the case, we will never be able to detect them in the future, because we have trained our classifier to interpret them as valid transactions. Next, we should think about some way to implement a changing model, as credit card fraud itself is constantly changing; if the model cannot update, it will soon be obsolete. In the future, I would also like to run all of these methods multiple times and take the average performance. After 20 or 30 runs, we could be more confident that our results are representative, especially for models that change with every run, like ensemble methods.

Imbalanced data is a popular research area, so there is much ongoing work on different sampling algorithms and other methods for building models. I have several ideas for further research on this data set; most of these are things that I would have implemented if I had the time. First, I would implement XGBoost as another ensemble method. XGBoost is a distributed gradient boosting method that is very powerful. Second, I would be interested in implementing a neural network or deep learning solution. Several neural network papers were written in the 1990s about imbalanced problems, but computing power has expanded greatly since then, and there have been recent advances, like densely connected networks or deep networks, that would be very interesting. Third, I would be interested in implementing holdout, where some of the data is purposely withheld from the model to make it more robust; this is very similar to the notion of dropout in neural networks. Fourth, I would like to implement a multi-classifier solution, where the results from several completely different classifiers are pooled and evaluated. Fifth, I would like to evaluate, or at least compare, other performance metrics like precision, recall, F1, or Cohen’s kappa. Lastly, I would like to try implementing a decision tree using the Hellinger distance, like in Cieslak et al. (2012).

Sources

Aggarwal, C. C. (2015). Data mining: the textbook. Springer.

Aleskerov, E., Freisleben, B., and Rao, B. (1997). Cardwatch: A neural network based database mining system for credit card fraud detection. In Computational Intelligence for Financial Engineering (CIFEr), 1997., Proceedings of the IEEE/IAFE 1997, pages 220–226. IEEE.

Barua, S., Islam, M. M., Yao, X., and Murase, K. (2014). Mwmote–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2):405–425.

Batista, G. E., Prati, R. C., and Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1):20–29.

Brause, R., Langsdorf, T., and Hepp, M. (1999). Neural data mining for credit card fraud detection. In Tools with Artificial Intelligence, 1999. Proceedings. 11th IEEE International Conference on, pages 103–106. IEEE.

Chan, P. K., Fan, W., Prodromidis, A. L., and Stolfo, S. J. (1999). Distributed data mining in credit card fraud detection. IEEE Intelligent Systems and Their Applications, 14(6):67–74.

Chan, P. K. and Stolfo, S. J. (1998). Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In KDD, volume 98, pages 164–168.

Chawla, N. V. (2009). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook, pages 875–886. Springer.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357.

Cieslak, D. A., Hoens, T. R., Chawla, N. V., and Kegelmeyer, W. P. (2012). Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery, 24(1):136–158.

Dal Pozzolo, A., Caelen, O., Johnson, R. A., and Bontempi, G. (2015). Calibrating probability with undersampling for unbalanced classification. In Computational Intelligence, 2015 IEEE Symposium Series on, pages 159–166. IEEE.

Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., and Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert systems with applications, 41(10):4915–4928.

Ghosh, S. and Reilly, D. L. (1994). Credit card fraud detection with a neural-network. In System Sciences, 1994. Proceedings of the Twenty-Seventh Hawaii International Conference on, volume 3, pages 621–630. IEEE.

He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284.

Jeni, L. A., Cohn, J. F., and De La Torre, F. (2013). Facing imbalanced data–recommendations for the use of performance metrics. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pages 245–251. IEEE.

Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). Isolation forest. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 413–422. IEEE.

Mahmoudi, N. and Duman, E. (2015). Detecting credit card fraud by modified fisher discriminant analysis. Expert Systems with Applications, 42(5):2510–2516.

Maldonado, S., Weber, R., and Famili, F. (2014). Feature selection for high-dimensional class-imbalanced data sets using support vector machines. Information Sciences, 286:228–246.

Provost, F. (2000). Machine learning from imbalanced data sets 101. In Proceedings of the AAAI2000 workshop on imbalanced data sets, pages 1–3.

Rousseeuw, P. J. and Driessen, K. V. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212–223.

Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471.

Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., and Zhou, Y. (2015). A novel ensemble method for classifying imbalanced data. Pattern Recognition, 48(5):1623–1637.

Tomek, I. (1976). An experiment with the edited nearest-neighbor rule. IEEE Transactions on systems, Man, and Cybernetics, (6):448–452.