Predicting eCommerce fraud

The Problem

Given anonymized eCommerce transaction data labelled “Fraud” or “Not Fraud”, the task was to build a binary classifier that predicts whether a given transaction in new data is “Fraud” or “Not Fraud”. This challenging large-scale dataset had a wide range of features (400+) and ~1.2 million transactions, about half of which were labelled training data.

Credit: Photo by rupixen on Unsplash

The dataset for this Kaggle competition was provided by Vesta Corporation and prepared in collaboration with IEEE-CIS. Vesta Corporation is a forerunner in guaranteed eCommerce payment solutions. The objective was to improve the efficacy of fraudulent transaction alerts for millions of people around the world and to reduce losses due to fraud.

Solution

The approach that put me in the top 50% of competitors involved building a tree-based model with careful preprocessing and feature engineering.

Preprocessing involved imputing missing values in numeric features, label-encoding categorical features and removing columns that were mostly null. Given the nature of the “Fraud” labelling logic, it was beneficial to identify unique customers/card holders in the data. However, as no personally identifiable data existed, proxies for unique identification were created using groups of loose identifiers. New features included proxy-ID-based aggregations, target likelihood encoding of interaction features and time-based features.
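The preprocessing and feature-engineering steps described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the column names (card, addr, email, amount, seconds, is_fraud), the choice of identifiers that make up the proxy ID, the median and -1 imputation values and the smoothing constant are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy transactions. Column names and values are illustrative assumptions,
# not the competition's actual schema.
df = pd.DataFrame({
    "card": ["a", "a", "b", "b", "b", "c"],
    "addr": [1.0, 1.0, 2.0, 2.0, np.nan, 3.0],
    "email": ["x.com", "x.com", "y.com", "y.com", "y.com", None],
    "amount": [10.0, 30.0, 5.0, np.nan, 15.0, 100.0],
    "seconds": [0, 3600, 90000, 100000, 180000, 200000],
    "is_fraud": [0, 0, 1, 1, 0, 0],
})

# Preprocessing: impute numeric gaps and label-encode a categorical column.
df["amount"] = df["amount"].fillna(df["amount"].median())
df["addr"] = df["addr"].fillna(-1)
df["email_code"] = pd.factorize(df["email"])[0]  # missing values become -1

# Proxy ID: concatenate loose identifiers into a single grouping key.
df["proxy_id"] = df["card"] + "_" + df["addr"].astype(str)

# Proxy-ID-based aggregations.
grouped = df.groupby("proxy_id")["amount"]
df["id_amount_mean"] = grouped.transform("mean")
df["id_txn_count"] = grouped.transform("count")

# Interaction feature built from two categorical columns.
df["card_email"] = df["card"] + "_" + df["email"].fillna("missing")


def target_encode(train, key, target, smoothing=1.0):
    """Smoothed per-category mean of the target (smoothing is hypothetical)."""
    prior = train[target].mean()
    stats = train.groupby(key)[target].agg(["sum", "count"])
    encoding = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
    return encoding, prior


# In practice the encoding should be fit on training folds only,
# otherwise the labels leak into the feature.
encoding, prior = target_encode(df, "card_email", "is_fraud")
df["card_email_te"] = df["card_email"].map(encoding).fillna(prior)

# Time-based features, assuming "seconds" is an offset from a reference time.
df["hour"] = (df["seconds"] // 3600) % 24
df["day"] = df["seconds"] // 86400

print(df[["proxy_id", "id_amount_mean", "id_txn_count", "card_email_te", "hour", "day"]])
```

The smoothing term pulls the encoding of rare categories towards the overall fraud rate, which keeps a category seen only once or twice from getting an extreme value.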

A RandomForest-based tree model was fit with class weighting to handle the class imbalance in the data (~96% “Not Fraud”). The model was tuned to balance bias against variance. A couple of different Cross Validation (CV) strategies were used, each of which tried to replicate the train/test split closely.
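As a rough sketch of this modeling step, the snippet below fits a class-weighted random forest with scikit-learn and scores it with a time-ordered CV split. Several things here are assumptions rather than details from the project: the synthetic data stands in for the real features, the hyperparameter values are placeholders, ROC AUC is used as the validation metric, and TimeSeriesSplit is just one plausible CV strategy that would suit a test set that follows the training data in time.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the real data, with ~96% of rows in the negative class.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.96, 0.04],
    random_state=0,
)

# class_weight="balanced" reweights classes inversely to their frequency.
# max_depth and min_samples_leaf are the kind of knobs that trade bias for
# variance; the values here are placeholders, not tuned settings.
model = RandomForestClassifier(
    n_estimators=50,
    max_depth=8,
    min_samples_leaf=5,
    class_weight="balanced",
    n_jobs=-1,
    random_state=0,
)

# Each fold trains on earlier rows and validates on the rows that follow,
# treating row order as a proxy for time.
scores = []
for train_idx, valid_idx in TimeSeriesSplit(n_splits=3).split(X):
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(roc_auc_score(y[valid_idx], proba))

print([round(s, 3) for s in scores])
```

With data this imbalanced, plain accuracy would look high even for a model that never predicts fraud, which is why a ranking metric is used for validation in this sketch.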

Figure: Data Processing and Modeling Pipeline

GitHub Project Repository

For details and code, please see the project notebook on GitHub.

Acknowledgement and Disclaimer

Kaggle has kindly permitted the use of their logo on this website. However, neither I nor this website is officially connected to Kaggle.