Data and machine learning in financial fraud prevention

Advanced data analytics and machine learning can help fraud prevention teams strengthen their risk decisioning

In a speech addressing the American Bar Association’s 39th National Institute on White Collar Crime, Deputy Attorney General Lisa Monaco warned that artificial intelligence (AI) “holds great promise to improve our lives — but great peril when criminals use it to supercharge their illegal activities, including corporate crime.” 

Nearly one in four respondents surveyed in Alloy’s 2024 State of Fraud Report agreed that AI-driven fraud was the most concerning fraud trend faced by their organization last year. But fraudsters aren’t the only ones leveraging AI; banks, fintechs, and credit unions are equipping themselves with machine learning (ML) and other types of AI to improve their fraud risk decisioning. 

[Figure omitted. Source: Alloy’s 2024 State of Fraud Benchmark Report]

With most respondents reporting an increase in fraud attempts across consumer and business accounts over the past year, financial institutions (FIs) are tasked with implementing smarter, more streamlined fraud prevention processes.

This blog post explores how data and ML can safeguard your bank, fintech, or credit union against fraud. We’ll explain how data and ML work together in fraud detection, including which data processes are involved and how to adopt more effective, transparent ML models.

How to leverage machine learning in fraud detection

The more adept banks, fintechs, and credit unions become at stopping fraudulent identities, the more expensive it becomes for fraudsters to strike. To keep fraudsters out, organizations must leverage predictive fraud models using fraud prevention tactics like ML, biometrics, behavioral analytics, strong authentication techniques, and — above all — datasets that span a wide range of applications and human behaviors. 

Apply a wide dataset

One of the most important steps you can take to stop fraud is building a wide dataset of meaningful fraud signals. The more data sources you have, the easier it is to identify fraudulent patterns. However, a wide dataset introduces another problem: the more data points a bank, fintech, or credit union has, the harder it is to determine how to use each one. ML models look for patterns in data, and with many data sources, you can generate more relevant model inputs to describe fraud behavior.

ML can ease decision-making by automatically extracting patterns from wide datasets. Data orchestrators can help you combine internal customer data with third-party identity data, offering a centralized view of risk that is informed by a wide variety of data sources. Traditional and alternative data vendors, like those offering payroll and rent payment information, can both play a critical role in expanding your datasets to better verify an applicant’s identity.

Optimize and augment fraud workflow rulesets 

Overly stringent fraud controls can frustrate legitimate customers and lead them to abandon their applications. At the same time, lax controls open the door to fraudulent activity, eroding trust and profitability. The key is to leverage AI and ML to intelligently assess risk in real-time from the moment a customer begins onboarding and throughout their lifecycle. This enables a seamless experience for genuine customers while proactively stopping fraudulent identities from engaging in malicious attacks.

Banks, fintechs, and credit unions can leverage fraud data and ML to supplement or augment rule-based fraud decisioning. Why is this so important? Optimizing rulesets helps maximize the number of good application approvals while minimizing manual reviews, fraudulent application approvals, and good application denials. As a result, optimizing fraud workflow rulesets also delivers enormous efficiency benefits.

Considerations for using ML fraud models

When optimizing rulesets with ML for fraud prevention, it helps to understand training data, ensure interpretability, and always remain skeptical of class imbalances. 

Understanding training data involves familiarizing yourself with different data formats, processes, and hierarchies. Ensuring interpretability is crucial for building confidence in your ML models. Finally, being aware of class imbalances can help you avoid misleading results that could undermine the effectiveness of your fraud detection efforts.

Let's take a closer look at each of these considerations:

1. Understand training data

To understand training data, you must first become familiar with data formats, processes, and hierarchies. These concepts help ensure data is properly prepared, the models are designed effectively, and the results can be interpreted correctly.

Data formats

Data formats refer to the structure and organization of the data used for training ML models. Common formats include structured data (tabular data with well-defined fields), unstructured data (including text, images, and audio), and semi-structured data (such as JSON and XML). Understanding the characteristics and limitations of each data type is crucial for selecting appropriate models and preprocessing techniques because different ML algorithms are suited for different data formats.

For example, traditional ML algorithms like logistic regression or decision trees often use tabular structured data. This data is organized into a set of features or inputs (the independent variables) and corresponding labels or targets (the dependent variable the model is trying to predict). In the case of fraud detection, the features could include transaction amount, location, time, customer demographics, etc. The label would indicate whether each transaction was fraudulent or legitimate.
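To make that concrete, here is a minimal sketch of the tabular setup, assuming a toy pandas DataFrame with made-up feature names and values (not a real schema) and scikit-learn's logistic regression:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative structured (tabular) data: each row is a transaction,
# each column a feature, and "is_fraud" is the label the model learns to predict.
transactions = pd.DataFrame({
    "amount":       [25.00, 980.00, 12.50, 4500.00, 60.00, 3200.00],
    "hour_of_day":  [14, 3, 10, 2, 18, 4],
    "new_device":   [0, 1, 0, 1, 0, 1],
    "customer_age": [34, 22, 51, 29, 45, 31],
    "is_fraud":     [0, 1, 0, 1, 0, 1],
})

X = transactions.drop(columns="is_fraud")   # features (independent variables)
y = transactions["is_fraud"]                # label (dependent variable)

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X)[:, 1])         # predicted fraud probability per row
```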

Unstructured data, such as images or text, may require deep learning approaches such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs). With unstructured data, the raw data itself (e.g., pixel values for an image or word embeddings for text) serves as the model input. Labels are still required to train the model, but feature engineering is handled automatically by the deep learning model.

In the context of fraud detection, transaction records, customer profiles, and historical fraud patterns are common types of structured data used to train models. However, incorporating unstructured data like customer interactions, social media activity, or device fingerprints can provide additional insights to detect more sophisticated fraud. The key is mapping the raw unstructured data to an appropriate numerical representation for model inputs.

Data processes

ML models for fraud prevention rely on well-defined data processes, including data preparation, integration, orchestration, splitting, and monitoring. This helps ensure high-quality, relevant data is used for training and testing, enabling the models to detect and adapt to evolving fraud patterns effectively. The majority of the machine learning process is commonly devoted to data cleaning and exploratory data analysis (EDA). This stage is essential to prepare data for modeling and to better understand the data being input to the model. An ML model can only be as good as the data provided, so as tempting as it can be, it is essential not to take shortcuts at this step. 
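As a minimal sketch of that preparation stage, assuming a toy pandas DataFrame with deliberately messy, made-up values (real pipelines are far more involved):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative raw data with the kinds of problems cleaning has to handle:
# duplicate rows, missing values, and a rare fraud label.
df = pd.DataFrame({
    "amount":   [25.0, 25.0, 980.0, np.nan, 4500.0, 60.0, 3200.0, 15.0],
    "hour":     [14, 14, 3, 10, 2, 18, 4, 11],
    "is_fraud": [0, 0, 1, 0, 1, 0, 1, 0],
})

# Basic cleaning: drop exact duplicates, fill missing numeric values with the median.
df = df.drop_duplicates()
df["amount"] = df["amount"].fillna(df["amount"].median())

# Quick exploratory checks before modeling.
print(df["is_fraud"].value_counts(normalize=True))  # how imbalanced are the classes?
print(df.describe())                                 # ranges, outliers, obvious data errors

# Hold out a test set; stratifying keeps fraud cases in both splits.
train_df, test_df = train_test_split(df, test_size=0.25, stratify=df["is_fraud"], random_state=42)
```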

Post-EDA, data orchestration helps FIs coordinate and manage data flow from various sources, ensuring data quality and availability. Data orchestration tools automate much of the manual data preparation work and format the data consistently for ML models. They also enable a non-linear approach to data, allowing fraud detection systems to reference the right data source at the right time. 

By implementing sophisticated data processes, fraud prevention teams can ensure that their ML models are trained on high-quality, relevant data, improving their ability to detect and prevent fraud at scale.

Hierarchy of labels and outcomes

In ML, labels are the target variable that the model is trained to predict based on the input features. In the context of fraud detection, labels are typically assigned to each data point (e.g., a transaction or an account) to indicate whether it is considered fraudulent or legitimate. These labels are used to train the ML model to recognize patterns and make predictions on new, unseen data.

The outcome of an investigation is used to label data for fraud models and may contain subcategories or hierarchical levels. For instance, if an investigation confirms that an entity is committing fraud, the outcome can be further classified into specific types of fraud, such as identity theft, account takeover, or money laundering. Additionally, the outcomes may include degrees of risk that indicate the severity and impact of the fraudulent activity — such as high, medium, or low.
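One lightweight way to capture that hierarchy is a simple mapping from investigation outcomes to a set of labels; the outcome names, categories, and helper function below are purely illustrative:

```python
# Illustrative mapping from detailed investigation outcomes to hierarchical labels
# a model might train on: a top-level fraud flag, a fraud subtype, and a risk tier.
OUTCOME_HIERARCHY = {
    "confirmed_identity_theft":   {"is_fraud": 1, "fraud_type": "identity_theft",   "risk": "high"},
    "confirmed_account_takeover": {"is_fraud": 1, "fraud_type": "account_takeover", "risk": "high"},
    "suspected_money_laundering": {"is_fraud": 1, "fraud_type": "money_laundering", "risk": "medium"},
    "cleared_after_review":       {"is_fraud": 0, "fraud_type": None,               "risk": "low"},
}

def labels_for(outcome: str) -> dict:
    """Return the hierarchical labels recorded for an investigation outcome."""
    return OUTCOME_HIERARCHY[outcome]

print(labels_for("confirmed_account_takeover"))
# {'is_fraud': 1, 'fraud_type': 'account_takeover', 'risk': 'high'}
```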

Deep learning models are particularly well-suited for capturing nuanced outcomes because they can automatically learn hierarchical representations from raw data. By training on a large dataset with detailed outcome labels, deep learning models can learn to recognize patterns and relationships at different levels of granularity. 

Understanding the relationships between labels and their hierarchical outcomes empowers fraud analysts to make more informed decisions and prioritize their investigations effectively. Leveraging outcomes and labels can help fraud prevention teams generate helpful model inputs, such as aggregated historical risk scores, and high-quality labels for supervised learning. When ML models can learn from historical patterns, they can continuously adapt to evolving fraud tactics. 

2. Ensure interpretability

Interpretability in ML refers to the ability to understand and explain how a model makes its predictions or decisions. It involves gaining insight into the model's internal workings, the importance of different features, and the reasoning behind its decisioning. 

ML models can highlight suspicious patterns and high-risk entities, but transparency is important so stakeholders can get comfortable with the model outputs. And while models can help focus attention and point fraud teams in the right direction, human judgment remains essential for interpreting the full context and making the final decisions about fraud outcomes.

While deep learning models can achieve high accuracy, they are more challenging to interpret, meaning it can be hard to understand how they arrived at their predictions. 

[Figure omitted. Source: Cambridge]

In fraud risk management, simpler ML approaches, like gradient-boosted classification trees or logistic regression, can outperform deep neural networks on certain tasks while being much easier to interpret. These models provide clear insights into which features are most important for making predictions and how changes in input values affect the output. Parameters can be directly tied to existing rulesets, leading to actionable recommendations for rule improvements.

For example, if a logistic regression model identifies certain transaction characteristics as solid indicators of fraud, these insights can be used to refine the rules in a fraud detection system. At Alloy, our fraud experts leverage these interpretable models to help craft likely fraud patterns for our clients and offer data-driven rule improvements that enhance the overall effectiveness of the system.
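Here is a minimal sketch of how those insights surface, assuming a toy dataset with made-up features and using scikit-learn's logistic regression; the coefficient magnitudes indicate which features drive the fraud prediction:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny illustrative dataset: transaction features and a fraud label (values are made up).
X = pd.DataFrame({
    "amount":      [25.0, 980.0, 12.5, 4500.0, 60.0, 3200.0, 40.0, 2800.0],
    "hour_of_day": [14, 3, 10, 2, 18, 4, 12, 1],
    "new_device":  [0, 1, 0, 1, 0, 1, 0, 1],
})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], name="is_fraud")

# Standardizing features first makes the coefficient magnitudes roughly comparable.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# Each coefficient shows how strongly a feature pushes the prediction toward fraud
# (positive) or away from it (negative) -- directly actionable for rule improvements.
importance = pd.Series(model.coef_[0], index=X.columns).sort_values(key=np.abs, ascending=False)
print(importance)
```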

3. Be skeptical of class imbalances

When an ML model provides promising metrics, it is tempting to immediately (and excitedly) share the results. Using accuracy metrics to measure the performance of fraud detection models can present misleading results because of a class imbalance problem.

Class imbalance occurs when one class contains significantly fewer instances than another. If you compare the number of fraudulent transactions an institution sees against the number of legitimate transactions, the fraudulent transactions will be vastly outnumbered. This can pose challenges when training ML models that need to learn boundaries or patterns that distinguish "fraud" from "not fraud."

Consider, for example, a dataset of 100,000 bank accounts where only 100 are fraudulent. A model that classifies all accounts as legitimate would achieve an accuracy of 99.9% without detecting any fraud. However, this model would be useless for fraud prevention because it fails to identify the minority class: the fraudulent accounts we are most interested in detecting.

[Figure omitted. Source: ResearchGate]

Fraudulent activities typically account for a small percentage of the overall transactions or applications, making class imbalances common in fraud analytics. 

Here are some of the ways you can address class imbalances in fraud prevention:

Apply resampling techniques

Oversampling the minority class involves creating synthetic examples of fraudulent transactions to balance the dataset. Resampling techniques like SMOTE (Synthetic Minority Over-sampling Technique) can generate new examples based on the existing minority class instances. Downsampling the majority class (in other words, randomly removing legitimate transactions) can also balance the dataset, but may lead to loss of information.
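Here is a minimal sketch using SMOTE from the imbalanced-learn library on a synthetic, heavily imbalanced dataset (the data is generated purely for illustration):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, heavily imbalanced dataset: roughly 1% of samples are "fraud".
X, y = make_classification(n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42)
print("Before resampling:", Counter(y))   # heavily skewed toward the legitimate class

# SMOTE creates synthetic minority-class examples by interpolating between
# existing fraud cases, rather than simply duplicating them.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("After resampling:", Counter(y_resampled))   # classes are now balanced
```

In practice, resampling is typically applied only to the training split so that evaluation still reflects the true class distribution.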

Adjust class weights

Much like resampling techniques, adjusting class weights can help ML models more accurately detect fraud patterns. Some ML algorithms (such as Random Forest and XGBoost) allow you to assign higher weights to the minority class during training. This tells the model to pay more attention to the minority class and penalizes misclassifications of that class more heavily.
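For example, here is a minimal sketch using scikit-learn's RandomForestClassifier with class_weight="balanced" on a synthetic imbalanced dataset; XGBoost exposes a similar idea through its scale_pos_weight parameter:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced dataset: roughly 1% of samples are "fraud".
X, y = make_classification(n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" weights each class inversely to its frequency, so
# misclassifying the rare fraud class costs the model more during training.
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
```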

Ensure appropriate evaluation metrics

Overall accuracy, and even AUC (the area under the ROC curve), may not be suitable metrics on their own for imbalanced datasets. It is often helpful to create a confusion matrix to understand the false positive rate (in the context of fraud, good accounts misclassified as fraud) and the false negative rate (fraud accounts misclassified as good accounts).

When approaching imbalanced datasets, use metrics that focus on the model's performance on the minority class, such as precision, recall, F1-score, and area under the precision-recall curve (AUPRC).
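Here is a minimal sketch of those checks with scikit-learn, reusing a synthetic imbalanced dataset and a class-weighted random forest purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset and a simple class-weighted model to evaluate.
X, y = make_classification(n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = RandomForestClassifier(class_weight="balanced", random_state=42).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

# Confusion matrix: rows are actual classes, columns are predicted classes.
# The off-diagonal cells are the false positives (good accounts flagged as fraud)
# and false negatives (fraud accounts passed as good).
print(confusion_matrix(y_test, y_pred))

# Metrics that focus on the minority (fraud) class.
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("AUPRC:    ", average_precision_score(y_test, y_score))
```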

Leverage ensemble methods

Combining multiple models trained on different subsets of the data or using different algorithms can help improve overall performance on imbalanced datasets. Ensemble techniques like bagging, boosting, and stacking can be effective in handling class imbalance.
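As one illustrative option, the imbalanced-learn library's BalancedBaggingClassifier combines bagging with undersampling of the majority class inside each bootstrap sample; the dataset below is synthetic:

```python
from collections import Counter

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced dataset: roughly 1% of samples are "fraud".
X, y = make_classification(n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Each bagged estimator trains on a resampled subset in which the majority class
# is undersampled, so every base model sees a balanced view of fraud vs. not fraud.
ensemble = BalancedBaggingClassifier(n_estimators=50, random_state=42)
ensemble.fit(X_train, y_train)
print(Counter(ensemble.predict(X_test)))   # how many test accounts get flagged
```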

When developing fraud detection models, it is critical to be aware of class imbalances and employ appropriate techniques to mitigate their impact. By addressing class imbalances as they arise, you can build models that effectively identify fraudulent activities while minimizing false positives.

5 key takeaways

  1. Advanced data analytics and machine learning can improve fraud prevention for FIs and fintechs. By using a wide dataset spanning various applications and human behaviors, organizations can train ML models to identify fraudulent patterns more effectively.
  2. Optimizing fraud workflow rulesets using ML can help maximize good application approvals while minimizing manual reviews, fraudulent approvals, and false positives. This leads to a more seamless experience for genuine customers and improved efficiency for the organization.
  3. When using ML for fraud prevention, it's crucial to understand the training data and be aware of potential class imbalances. Familiarity with data formats, processes, and hierarchies helps ensure the data is properly prepared, and the results can be interpreted correctly.
  4. Interpretability creates transparency and trust in ML fraud models. While deep learning models can achieve high accuracy, simpler approaches like gradient-boosted classification trees or logistic regression can be easier to interpret and improve.
  5. Fraudulent activities typically make up a small percentage of overall transactions or applications, so class imbalance is a common challenge in fraud detection. To address this, fraud prevention teams can use resampling techniques, adjust class weights, ensure appropriate evaluation metrics, and leverage ensemble methods.

Alloy uses data and machine learning for better fraud prevention

Alloy's Identity Risk Solution offers a comprehensive approach to fraud prevention by integrating with 200+ trusted third-party data sources. Alloy orchestrates internal and external identity, behavioral, and transaction data to help identify fraudulent patterns quickly and proactively. 

There are three ways to leverage Alloy’s ML solution for more holistic fraud prevention:

1. Use Alloy-trained models

Alloy's Entity Fraud Model is an AI-powered data model that is designed to predict the likelihood of fraud across an entity's lifecycle. It offers unique insight into a customer’s risk profile on an ongoing basis by marrying onboarding signals, monetary data (like transaction velocity), and non-monetary data (such as the number of emails associated with the entity). The complementary Fraud Attack Radar model can be used during onboarding to help predict the likelihood of an organization experiencing a fraud attack. Together, these two models provide holistic coverage for Alloy users: the Entity Fraud Model works at the individual entity level, whereas the Fraud Attack Radar is designed to spot patterns in the population of applicants.

2. Leverage models from third-party vendors

Alloy integrates with leading third-party vendors to provide a holistic and proactive approach to fraud prevention. By leveraging these vendor models alongside Alloy's own models, organizations can benefit from a diverse range of fraud detection capabilities.

3. Bring your own ML model

Alloy allows you to use custom fraud models and input your own decisioning logic. Data will be unified and standardized in the Alloy dashboard, making it easy to integrate your models into your fraud prevention workflow.

Alloy combines advanced data analytics, data orchestration, and rule-based policy definitions to analyze behavior, verify identity, and detect anomalies automatically. Our team can make personalized suggestions to help you optimize your bank, fintech, or credit union's fraud risk policies. By leveraging the power of AI and machine learning, Alloy enables organizations to stay ahead of evolving fraud threats and protect their customers more effectively.

Experience Alloy’s AI-powered fraud prevention firsthand
