Detecting Fraudulent Procurement Activities with Machine Learning

Shokhrukh Yakubjanov
Mar 7, 2023



Procurement fraud can be a major concern for businesses of all sizes. According to a report by the Association of Certified Fraud Examiners, the median loss from procurement fraud cases is $100,000, making it one of the most costly types of fraud.

However, with the help of machine learning, it’s possible to detect fraudulent procurement activities before they cause significant damage. In this article, we’ll explore a Python code example that uses machine learning to identify potentially fraudulent procurement transactions.

The code uses a dataset of procurement transactions that includes information such as the vendor name, invoice amount, and order amount. It trains a random forest classifier on this data to predict whether a given transaction is fraudulent.

Importing the dataset

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load and clean the data
data = pd.read_csv("procurement_data.csv")
data = data.dropna()

First, we import the necessary libraries and load the dataset. The code also drops any rows with missing values.

We use data.head() to make sure we loaded the right dataset, and everything looks good (for now).
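That check is just the standard pandas preview; a minimal sketch:

# Preview the first few rows to confirm the expected columns are present
print(data.head())

# Check how many rows survived the dropna() step
print(data.shape)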

Adding column “Amount Difference”

# Create new features
data["amount_diff"] = data["invoice_amount"] - data["order_amount"]

# Create target variable
data["fraud"] = np.where(data["amount_diff"] > 0, 1, 0)

Here we create a new feature called amount_diff, which is the difference between the invoice amount and the order amount.

Next, the code creates the target variable fraud, which is set to 1 if the amount_diff is greater than 0, and 0 otherwise.

At this point we can inspect which transactions have been flagged as fraudulent.
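As a quick sketch, the flagged rows can be pulled out with a simple filter on the new column:

# Show a few transactions flagged as fraudulent
print(data[data["fraud"] == 1].head())

# Count flagged vs. non-flagged transactions
print(data["fraud"].value_counts())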

Random Forest classifier

The data is then split into training and test sets using the train_test_split function from the sklearn.model_selection module:

# Split data into training and test sets
X = data.drop(["fraud", "amount_diff"], axis=1)
y = data["fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The code then trains a random forest classifier on the training data using 100 estimators.

# Train the model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

But here, in the process of training the model, we hit a wall: the fit call raises an error.

The error indicates that there are string values in the training data that cannot be converted to float. We need to make sure that all categorical columns are properly encoded before fitting the model.

So first we check the data types of all the columns.
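In pandas this is a one-liner:

# Columns with dtype "object" hold string values
print(data.dtypes)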

Four columns contain string values.

To fix this, we need to encode the categorical variables in the dataset as numerical values, for example with label encoding, before fitting the model.

LabelEncoder is a class from the sklearn.preprocessing module that can be used to convert categorical data (i.e., non-numerical data such as strings) into numerical data that can be used in machine learning algorithms.

from sklearn.preprocessing import LabelEncoder

# Initialize the encoder
encoder = LabelEncoder()

# Fit and transform the Purchase Order ID column
data['Purchase Order ID Encoded'] = encoder.fit_transform(data['Purchase Order ID'])

# Fit and transform the Vendor ID column
data['Vendor ID Encoded'] = encoder.fit_transform(data['Vendor ID'])

# Fit and transform the Employee ID column
data['Employee ID Encoded'] = encoder.fit_transform(data['Employee ID'])

# Fit and transform the Items column
data['Items Encoded'] = encoder.fit_transform(data['Items'])

This will create four new columns in your dataframe, each with the encoded values. Note that the fit_transform() method is called on each column separately to ensure that each column is encoded independently.

Then we check the column data types again to confirm that the new encoded columns are numeric.

Everything looks good. We also drop the original categorical columns so that the model uses only the new encoded columns as input features.
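A minimal sketch of that clean-up, assuming the column names used in the encoding snippet above, followed by a re-split and retraining:

# Drop the original string columns now that encoded versions exist
data = data.drop(columns=["Purchase Order ID", "Vendor ID", "Employee ID", "Items"])

# Rebuild the features and target, re-split, and retrain the classifier
X = data.drop(["fraud", "amount_diff"], axis=1)
y = data["fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)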

After training the model, the code uses it to predict the fraud status of the transactions in the test set.
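The prediction call itself isn't shown above; a minimal sketch, reusing the clf and X_test variables from the earlier snippets:

# Predict a fraud label for each transaction in the test set
y_pred = clf.predict(X_test)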

Result

Finally, the code evaluates the model’s accuracy using the accuracy_score and classification_report functions from the sklearn.metrics module.
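Reconstructed as a sketch (it assumes the y_pred from the prediction step above), the evaluation looks like this:

# Compare the predictions with the true labels
print("Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))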

The output of the code shows that the model achieved perfect accuracy on the test set, correctly identifying all three transactions as not fraudulent:

Accuracy:  1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         3

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3


This result is promising, although with only three test transactions, all of them non-fraudulent, it is best treated as a sanity check rather than evidence that the model can reliably detect fraud.

Of course, it’s worth noting that this is just one example of how machine learning can be used to detect procurement fraud.

Written by Shokhrukh Yakubjanov

I’m a certified IBM data scientist, exploring new and more effective methods for data analysis, visualization, and data modeling.
