Name: Can Huzmeli
Country: United Kingdom
Organization: ICAN Consultancy
ICAN Consultancy is your partner for management consultancy and software delivery. We are expert software delivery and project consultants from around the world, working on agile software delivery, management consultancy, management training, agile transformation, and cloud transformation.
In the cryptocurrency industry, fraud is rampant, and the pseudonymous, immutable nature of blockchain makes it very difficult to stop. While there are strong players in the ecosystem creating new products and services, few of them focus on the safety and security of using cryptocurrency wallets. The same ecosystem has also not produced a profitable business model for wallet safety: demand depends entirely on the appeal of cryptocurrency itself, which makes it almost impossible to generate high B2B revenue. Anything that's developed has to be low-cost.
The journey to success is explained in detail in this series of blog posts.
Here, I will explain how I added a basic ML classifier feature using Dataiku in just four days. In this case, I wanted to add a trust score for an Ethereum wallet ID — that is, a measure of how much you can trust that account.
First, I had to find data. At that point, I hadn't even decided whether I'd go for a supervised machine-learning algorithm or an unsupervised one; I wanted to make that decision based on what kind of data I could find. After a long day of researching data online, I found a list of around 700 Ethereum wallet IDs that had been reported as fraudulent. I randomly checked a few of them; it was mostly phishing fraud. This was a good start, but I needed detailed data on these accounts, so I wrote a script to fetch it from Etherscan.
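The post doesn't include the fetch script, but a minimal sketch of the idea might look like the following. It assumes Etherscan's public `txlist` endpoint and reduces each account's raw transaction list to a few simple features; the `API_KEY` value and the exact features chosen are placeholders, not the author's actual code.

```python
import json
import urllib.parse
import urllib.request

ETHERSCAN_API = "https://api.etherscan.io/api"
API_KEY = "YOUR_API_KEY"  # placeholder -- a real Etherscan API key goes here


def fetch_transactions(address: str) -> list:
    """Download the normal-transaction list for one address from Etherscan."""
    params = urllib.parse.urlencode({
        "module": "account",
        "action": "txlist",
        "address": address,
        "apikey": API_KEY,
    })
    with urllib.request.urlopen(f"{ETHERSCAN_API}?{params}") as resp:
        return json.load(resp).get("result", [])


def derive_features(address: str, txs: list) -> dict:
    """Reduce a raw transaction list to a few per-account features.

    The feature set here is illustrative only; the real model may use
    different or additional columns.
    """
    addr = address.lower()
    sent = [t for t in txs if t.get("from", "").lower() == addr]
    received = [t for t in txs if t.get("to", "").lower() == addr]
    return {
        "address": address,
        "tx_count": len(txs),
        "sent_count": len(sent),
        "received_count": len(received),
    }
```

Separating the network call from the feature derivation keeps the latter easy to test offline on sample transaction lists.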
By the end of Day 1, I had a CSV file ready to train the model. Or did I? Of course not: to train the model, I needed normal accounts as well as fraudulent ones. I couldn't train it on fraudulent account data alone.
To get normal accounts, I found an Ethereum contract used to buy NFTs and decided that anyone who bought an NFT through that contract would count as a normal account. Now, you might object that a fraudulent account could be used for phishing and for self-investment at the same time. In practice, though, cyber criminals very rarely use the same account for personal purchases (such as NFTs) and for their criminal activities, since doing so would link the two together.
With my data ready in a single CSV file, labeled by a "fraud" column (1 or 0), I was ready to train my model. I looked around AWS and GCP to see if I could achieve something quickly, but both proved too complicated for the basic model I was trying to create.
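Assembling that single labeled CSV from the two feature sets is straightforward with the standard library. This is a hypothetical sketch, not the author's script; it simply appends a `fraud` label of 1 or 0 to each feature record and writes everything out.

```python
import csv


def write_training_csv(path: str, fraud_features: list, normal_features: list) -> None:
    """Merge feature dicts into one CSV with a binary 'fraud' label column."""
    rows = [dict(f, fraud=1) for f in fraud_features] + \
           [dict(f, fraud=0) for f in normal_features]
    fieldnames = list(rows[0].keys())  # assumes all rows share the same features
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

The resulting file can be uploaded to Dataiku as-is, with `fraud` selected as the target column.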
AWS offers ready-made models, but they target very specific use cases. For example, if you want to check whether a credit card transaction is fraudulent, you can start using Amazon Fraud Detector with just a few clicks, and the same goes for detecting fake account creation. If you want to build something custom, however, you have to assemble an entire machine learning pipeline: store your data in S3, write custom model code in SageMaker (perhaps using XGBoost), create a Lambda function, expose that function through API Gateway, and solve every configuration problem along the way. That was too complicated, and I needed something quick and simple.
So, I decided to use Dataiku. I knew Dataiku from one of my consultancy engagements, where I was responsible for setting up Dataiku for a global mining company's Data Science function. I knew how easy it was to train a model and create an API service for scoring. It took me only a few hours to feed my CSV file and create the flow below.
Dataiku has a feature that trains several algorithms and compares their performance for you. In my case, all three supervised algorithms performed almost identically, with a high success rate, so I chose the simplest one: logistic regression. I found this feature very powerful. If I had done all of that from scratch, it would have taken me weeks, if not months; with Dataiku, it took only a few hours.
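For readers curious what that comparison does under the hood, here is a rough scikit-learn equivalent. The three model choices and the ROC-AUC metric are assumptions for illustration; Dataiku's visual ML lab handles the equivalent setup automatically.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_models(df: pd.DataFrame) -> dict:
    """Cross-validate a few candidate classifiers on the labeled dataset.

    Expects a 'fraud' target column plus an 'address' identifier column;
    everything else is treated as a numeric feature.
    """
    X = df.drop(columns=["fraud", "address"])
    y = df["fraud"]
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100),
        "gradient_boosting": GradientBoostingClassifier(),
    }
    return {
        name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        for name, model in models.items()
    }
```

When the scores come out nearly equal, as they did here, picking the simplest and most interpretable model (logistic regression) is a reasonable tie-breaker.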
The last step was to expose an API that classifies a given record as fraudulent or not. This proved complicated, as the documentation omitted the requirement to create an extension service before an API can be created. It took me half a day to find out what the problem was. Luckily, Dataiku's online forum was very responsive; if you are interested in seeing the problem I had and its solution, this is the post I created in the Dataiku Community.
I quickly integrated my Lambda function with the Dataiku API service. The service returns not only a "1" or "0" but also the probability behind that decision. I used the probability as well, which let me present a graded score rather than a black-and-white result.
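A minimal sketch of that integration is shown below. The endpoint URL is a placeholder, and the exact path and response shape depend on how the Dataiku API service is deployed; the mapping from fraud probability to a 0–100 trust score is my own illustrative assumption, not necessarily the formula the author used.

```python
import json
import urllib.request

# Placeholder -- the real URL comes from your Dataiku API node deployment.
DATAIKU_API_URL = "https://api.example.com/public/api/v1/fraud/score/predict"


def query_dataiku(features: dict) -> dict:
    """POST one feature record to the prediction endpoint and return the result.

    Assumes the endpoint accepts {"features": {...}} and returns a JSON body
    with a "result" object containing the prediction and class probabilities.
    """
    body = json.dumps({"features": features}).encode()
    req = urllib.request.Request(
        DATAIKU_API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]


def trust_score(proba_fraud: float) -> int:
    """Map the model's fraud probability to a 0-100 trust score.

    0.0 fraud probability -> score 100 (fully trusted);
    1.0 fraud probability -> score 0.
    """
    return round((1.0 - proba_fraud) * 100)
```

Keeping the score mapping in a tiny pure function makes it trivial to tweak (for example, to add thresholds or banding) without touching the API call.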
Now it was time to add the visualization to the page. It felt like I had made the right choice in using MUI, because MUI had the exact component I needed to show the result. I added a new React component and used the score from the Dataiku API service response to display it.
For this machine learning work, building everything from scratch was our Plan B, in case we couldn't achieve the outcome with Dataiku.
Business Area: Product & Service Development
Use Case Stage: Built & Functional
According to the 2022 Chainalysis Crypto Crime report, a massive spike in theft and scams led to a 79% global increase in crypto-related crime, with illicit addresses allegedly receiving $14 billion over that one-year period. The platform lets cryptocurrency users check the receiving party's trust score before approving any payment. This way, we can help stop illicit addresses from collecting fraudulent payments from honest users.
111 users received a trust score for an Ethereum wallet ID in the first 90 days after our launch. The service is currently available free of charge.
Value Brought by Dataiku:
With Dataiku, we could accelerate development of the product's entire machine learning capability, from data cleaning to training and, finally, to a production API. The reduced development cost and time saved enabled us to launch the product on a record timeline: we added a production-grade prediction scoring API in just four days, including data cleaning, comparison of different machine learning methods, and productionizing, all done through Dataiku.