Decentralized Credit Scoring: A Technical Introduction
Abstract
With Artificial Intelligence becoming a prominent driving force in Traditional Finance, its potential to disrupt emerging fields like Blockchain is hard to overstate. Since blockchain is a distributed, decentralized, immutable ledger storing encrypted data, its integration with Machine and Deep Learning can be used to build more robust data-sharing routes as well as to reduce the time taken to find the cryptographic nonce (Developers Corner, 2021).
Blockchain and Artificial Intelligence are two of the most promising and widely discussed technologies in the market today. Although they have very different building blocks and use cases, researchers have been studying how the two can be combined to benefit one another. PwC predicts that by 2030 AI will add up to $15.7 trillion to the world economy, raising global GDP by 14%. According to Gartner's prediction, the business value added by blockchain technology will grow to $3.1 trillion by the same year (OpenMind, 2019).
Project Objective
This project dives deep into the use case of Machine and Deep Learning in Decentralised Credit Scoring. Given the limitations of traditional risk-profiling mechanisms, this DeFi approach allows the unbanked to obtain credit scores without any traditional credit history and thereby gain access to lending services. The central idea is to derive a credit score from non-intrusive on-chain data instead. Further advancements in this field could unlock enhanced privacy and trust minimization by reducing reliance on attestations from a single authority.
Data Collection
Data Collection is one of the most integral parts of a machine learning project because it largely dictates how well a model is trained. Because blockchain stores data on a pseudonymous, distributed ledger, retrieving user-specific data is much harder than in centralized systems, where the data rests with a single provider. While general data can be extracted from the blockchain with little effort because it is open and public in nature, user-specific data is often very hard to find.
Due to the nascent and uncertain nature of blockchain applications at present, we decided to narrow our scope to a single aspect of DeFi, Lending/Borrowing, using the Aave protocol. While there are several DeFi lending protocols in the market, such as Compound, Aave is built on Ethereum, which is our key area of focus. Aave is an open-source, non-custodial liquidity protocol for earning interest on deposits and borrowing assets. To break down our data collection process, we first retrieve all wallet addresses that interacted with the Aave protocol through Etherscan's CSV export.
After extracting the list of addresses that interacted with the Aave protocol, we can analyze each address by querying the Lending Pool contract. This contract exposes numerous user-oriented functions that can be invoked using either Solidity or Web3 libraries. For instance, the function getUserAccountData(address user) returns the user's account data across all reserves, such as totalCollateralETH, availableBorrowsETH, and much more.
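As an illustration, the sketch below shows how such a call might look using the web3.py library; the RPC endpoint and LendingPool address are placeholders, and the ABI fragment assumes the Aave V2 getUserAccountData signature.

```python
# Minimal sketch, assuming web3.py and the Aave V2 getUserAccountData signature.
# The RPC endpoint and LendingPool address below are placeholders.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://mainnet.infura.io/v3/<YOUR_PROJECT_ID>"))

LENDING_POOL_ADDRESS = "0x..."  # Aave LendingPool contract address (placeholder)
LENDING_POOL_ABI = [{
    "name": "getUserAccountData",
    "type": "function",
    "stateMutability": "view",
    "inputs": [{"name": "user", "type": "address"}],
    "outputs": [
        {"name": "totalCollateralETH", "type": "uint256"},
        {"name": "totalDebtETH", "type": "uint256"},
        {"name": "availableBorrowsETH", "type": "uint256"},
        {"name": "currentLiquidationThreshold", "type": "uint256"},
        {"name": "ltv", "type": "uint256"},
        {"name": "healthFactor", "type": "uint256"},
    ],
}]

lending_pool = w3.eth.contract(address=LENDING_POOL_ADDRESS, abi=LENDING_POOL_ABI)

def fetch_account_data(user_address: str) -> dict:
    """Query the LendingPool for one wallet's aggregate account data."""
    # user_address is expected to be EIP-55 checksummed.
    (total_collateral, total_debt, available_borrows,
     liquidation_threshold, ltv, health_factor) = lending_pool.functions.getUserAccountData(
        user_address
    ).call()
    return {
        "totalCollateralETH": total_collateral,
        "totalDebtETH": total_debt,
        "availableBorrowsETH": available_borrows,
        "currentLiquidationThreshold": liquidation_threshold,
        "ltv": ltv,
        "healthFactor": health_factor,
    }
```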
Credit Score Formula
In order to calculate a Credit Score for every user who interacted with the Aave protocol, we decided on a deterministic formula that takes into account five key groups: Payment History (38.5%), Amount Owed (33.5%), Length of Credit History (16.5%), Credit Mix (11.5%), and lastly an Anomaly Score (20%), which is subtracted from the weighted sum of the first four.
Credit Score = P + A + L + C - X
If you would like to explore the formulation of the Credit Score formula, please visit https://www.notion.so/Investigate-DeFi-Credit-Score-Formula-e82c5061070b4d05812dd26c38902f70
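As a rough illustration, the snippet below combines the weighted components; it assumes each component score has already been scaled to a common 0-100 range, which is our assumption rather than something specified in the formula write-up.

```python
# Minimal sketch of the deterministic formula; assumes each component score
# has already been scaled to a common 0-100 range.
WEIGHTS = {
    "payment_history": 0.385,        # P
    "amount_owed": 0.335,            # A
    "credit_history_length": 0.165,  # L
    "credit_mix": 0.115,             # C
    "anomaly_score": 0.20,           # X (subtracted)
}

def credit_score(p, a, l, c, x):
    """Credit Score = P + A + L + C - X, with each term weighted as above."""
    positive = (WEIGHTS["payment_history"] * p
                + WEIGHTS["amount_owed"] * a
                + WEIGHTS["credit_history_length"] * l
                + WEIGHTS["credit_mix"] * c)
    return positive - WEIGHTS["anomaly_score"] * x

# Example: a user with a strong record but a moderate anomaly score.
print(credit_score(p=90, a=80, l=70, c=60, x=30))
```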
Use case of Machine Learning
While the ultimate goal of this project is to calculate a credit score for every user interacting with the Aave protocol, Machine Learning is used for anomaly detection, the subcomponent X in the formula above. Anomaly detection is an important tool for identifying cases that are unusual within seemingly comparable data. In this context, we can identify users who stand out because they differ significantly from standard behaviors or patterns, making their repayment behavior more uncertain, and assign them a higher anomaly score. This anomaly score is then subtracted from the weighted sum of the other four subcomponents to produce the final credit score.
Constructing the Data Frame
We will construct the DataFrame using the following features, which we collected by interacting with the Lending Pool contract (see fig 3) as well as by extracting data from the CSV files downloaded from Etherscan. A sketch of how these features might be assembled follows the list.
1. Account Activity
- Measures how active the account is by dividing the number of transactions made by how long the account has been open.
- To ensure consistency, we use a fixed start and end date so that account activity is analyzed over the same timeframe for every user.
2. Health Factor
- This is a numerical representation of the safety of deposited assets.
- It is calculated as the proportion of collateral deposited versus the amount borrowed.
- A Health Factor above 1 must be maintained to avoid liquidation.
3. Loan To Value (LTV)
- Loans with high LTV ratios are considered higher-risk loans.
- Each asset in Aave has a specific LTV
4. Current Liquidation Threshold
- The liquidation threshold is the percentage of collateral value beyond which a borrow position can be liquidated.
5. Total Collateral ETH
- Higher collateral generally means more safety when it comes to borrowing money from lenders.
6. Available Borrows ETH
- This refers to the borrowing power the user has left.
7. Credit Mix (Diversification)
- Credit Mix refers to how successfully a user manages different types of credit.
- A more diversified credit mix suggests the user is better able to navigate risk.
8. Repayment Rate
- The percentage of the borrowed amount that has been repaid.
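As a rough sketch, the snippet below assembles these features into a pandas DataFrame; the CSV column names are hypothetical (real Etherscan exports use different headers), fetch_account_data refers to the sketch from the Data Collection section, and the repayment rate is left as a placeholder to be derived from Borrow/Repay events.

```python
import pandas as pd

# Column names below are hypothetical; real Etherscan exports use different headers,
# and fetch_account_data is the sketch from the Data Collection section above.
tx_history = pd.read_csv("aave_transactions.csv", parse_dates=["DateTime"])

records = []
for address, txs in tx_history.groupby("From"):
    account = fetch_account_data(address)
    days_open = (txs["DateTime"].max() - txs["DateTime"].min()).days or 1
    records.append({
        "address": address,
        "account_activity": len(txs) / days_open,
        "health_factor": account["healthFactor"],
        "ltv": account["ltv"],
        "liquidation_threshold": account["currentLiquidationThreshold"],
        "total_collateral_eth": account["totalCollateralETH"],
        "available_borrows_eth": account["availableBorrowsETH"],
        "credit_mix": txs["TokenSymbol"].nunique(),  # proxy for credit diversification
        "repayment_rate": 0.0,  # placeholder: repaid / borrowed, from Repay vs Borrow events
    })

df = pd.DataFrame(records).set_index("address")
```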
Machine Learning Methodology
Due to the nature of the problem at hand, a supervised machine learning approach would hardly work for our case, since we have no labeled data. Hence, we make use of Unsupervised Learning, in which networks train without labels, find patterns, and split the data into clusters.
In unsupervised learning, anomalies can be detected with autoencoders. Autoencoders are data-compressing algorithms that translate the original data into a learned representation, which allows us to reconstruct the data and measure how far the reconstruction is from the original. Fraudulent or riskier data points have higher reconstruction errors, which allows us to identify anomalies.
How To Build AutoEncoders?
To build autoencoders, you need three fundamental functions:
- Encoding Function
- Decoding Function
- Distance Function
The Encoding and Decoding Functions are learned automatically from examples rather than being engineered by humans. They are chosen to be parametric functions (typically Neural Networks) and are optimized to minimize the reconstruction loss using Stochastic Gradient Descent. The Distance Function, on the other hand, can easily be constructed to measure the data reconstruction error.
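As a compact illustration (not the project's final architecture), the encoding and decoding functions can be written as small Keras models, with the distance function computing a per-sample reconstruction error; the layer widths below are arbitrary.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Encoding and decoding functions as small parametric (neural network) models.
encoder = tf.keras.Sequential([
    layers.Dense(4, activation="relu"),
    layers.Dense(2, activation="relu"),      # learned (latent) representation
])
decoder = tf.keras.Sequential([
    layers.Dense(4, activation="relu"),
    layers.Dense(8, activation="sigmoid"),   # must match the number of input features
])

def distance(original, reconstructed):
    """Distance function: mean absolute reconstruction error per sample."""
    return tf.reduce_mean(tf.abs(original - reconstructed), axis=1)
```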
How to detect anomalies using Autoencoders?
Since Autoencoders are trained to minimize reconstruction error, we train the autoencoder on normal user data and then reconstruct all the data. The hypothesis we test is that anomalous users will have a higher reconstruction error, i.e. an error greater than a fixed threshold.
Importing Libraries
We will build an autoencoder with TensorFlow to assist us with anomaly detection. For data analysis and visualization, we import the standard Pandas, NumPy, Matplotlib, and Seaborn libraries.
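A plausible set of imports for this pipeline (versions are not pinned here):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Model
```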
Exploratory Data Analysis
We will now inspect the dataset and analyze its key descriptive statistics using the Pandas library. This is done by calling the describe() function on the DataFrame, as in the snippet below.
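For example, assuming the features live in a DataFrame named df:

```python
# Preview the data and its summary statistics.
print(df.head())
print(df.describe())
```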
Normalizing The Datasets
In Machine Learning, we often normalize data so that features measured on very different scales contribute comparably to the model. In this example, we perform min-max normalization, scaling all feature values to the range 0 to 1.
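A minimal sketch, assuming the DataFrame df from above; the 80/20 train/test split and the use of scikit-learn are our assumptions.

```python
from sklearn.model_selection import train_test_split

# Split first, then scale with statistics computed on the training set only.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

min_val = train_df.min()
max_val = train_df.max()
scale = (max_val - min_val).replace(0, 1)  # guard against constant columns

train_data = ((train_df - min_val) / scale).values.astype("float32")
test_data = ((test_df - min_val) / scale).values.astype("float32")
```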
Building the Model
The encoder of the model consists of three layers that encode the data into a lower-dimensional representation. The decoder of the model consists of three layers that reconstruct the input data. The model is then compiled with Mean Squared Logarithmic Error (MSLE) loss and the Adam optimizer.
Our DataFrame comprises 140 data points (one per user). We train the encoder to compress each feature vector into the latent space, and the decoder learns to reconstruct the original data from that representation.
The model is then trained for 20 epochs with a batch size of 512.
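A sketch of such a model in Keras; the three-layer encoder/decoder, MSLE loss, and Adam optimizer follow the description above, while the exact layer widths are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Model

class AnomalyDetector(Model):
    """Autoencoder with a three-layer encoder and a three-layer decoder."""
    def __init__(self, n_features):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            layers.Dense(32, activation="relu"),
            layers.Dense(16, activation="relu"),
            layers.Dense(8, activation="relu"),    # latent space
        ])
        self.decoder = tf.keras.Sequential([
            layers.Dense(16, activation="relu"),
            layers.Dense(32, activation="relu"),
            layers.Dense(n_features, activation="sigmoid"),  # back to the input width
        ])

    def call(self, x):
        return self.decoder(self.encoder(x))

autoencoder = AnomalyDetector(n_features=train_data.shape[1])
autoencoder.compile(optimizer="adam",
                    loss=tf.keras.losses.MeanSquaredLogarithmicError())
```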
Fitting and Training the Model
After normalization, we will be using the respective train and test data sets to fit the model. This can be achieved through the following piece of code.
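A sketch of the fitting step, reusing the normalized train_data and test_data and the epoch count and batch size stated above:

```python
history = autoencoder.fit(
    train_data, train_data,          # the autoencoder reconstructs its own input
    epochs=20,
    batch_size=512,
    validation_data=(test_data, test_data),
    shuffle=True,
)
```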
Detecting Anomalies
This model allows us to flag a data point as anomalous if its reconstruction loss is greater than a fixed threshold. We begin by calculating the mean absolute error of the reconstructions on the training set, and we then classify future examples as anomalous if their reconstruction error is more than one standard deviation above that mean.
To fix the threshold value, we therefore look at one standard deviation above the mean. We can compute this on the training dataset with the following lines of code.
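One way this could look, reusing the trained autoencoder and the normalized train_data from the sketches above:

```python
# Per-sample reconstruction error on the training data, then a threshold
# one standard deviation above the mean.
reconstructions = autoencoder.predict(train_data)
train_loss = tf.keras.losses.mae(reconstructions, train_data)

threshold = np.mean(train_loss) + np.std(train_loss)
print("Anomaly threshold:", threshold)
```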
We can streamline the detection of anomalies by defining a predict function that takes in three parameters (model, data, and threshold) and classifies a data point as anomalous by comparing its reconstruction error to the threshold.
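A sketch of such a helper:

```python
def predict_anomalies(model, data, threshold):
    """Flag a record as anomalous when its reconstruction error exceeds the threshold."""
    reconstructions = model.predict(data)
    loss = tf.keras.losses.mae(reconstructions, data)
    return tf.math.greater(loss, threshold)  # True marks an anomalous record

anomalous_flags = predict_anomalies(autoencoder, test_data, threshold)
```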
Limitations of Research
Due to the nature of decentralised systems, it was difficult to unmask certain aspects of the data. Unlike Centralised Finance, where it is often easy to gather information about users, DeFi only offers a handful of features; data extraction was therefore challenging and consumed hours of research. The weights of the Credit Score formula were set unanimously by the team, which could be considered biased to a certain extent. With only 140 data points available, we believe overfitting is a major threat to the model's performance.
Moving Ahead
As can be seen above, the data extraction component is extremely manual, which limits scalability. Therefore, at this moment, our project is experimental in nature, exploring ways to integrate machine learning with blockchain/DeFi applications to achieve a better outcome. One way we can improve data extraction moving forward is to write web-scraping scripts that pull the data for us.
Acknowledgments
This technical research is a section of a much larger report which will be published by the Blockchain department of the NUS Fintech Society. This research is more technical in nature due to its key focus on Machine Learning and the pivotal role it plays in anomaly detection. We would like to thank the NUS Fintech Society for their unwavering support and guidance towards the completion of this research project.