A Vast New Data Set Could Supercharge the AI Hunt for Crypto Money Laundering

Blockchain analysis firm Elliptic, MIT, and IBM have released a new AI model—and the 200-million-transaction dataset it's trained on—that aims to spot the “shape” of bitcoin money laundering.

One task where AI tools have proven to be particularly superhuman is analyzing vast troves of data to find patterns that humans can't see, or automating and accelerating the discovery of those we can. That makes Bitcoin's blockchain, a public record of nearly a billion transactions between pseudonymous addresses, the perfect sort of puzzle for AI to solve. Now, a new study—along with a vast, newly released trove of crypto crime training data—may be about to trigger a leap forward in automated tools' ability to suss out illicit money flows across the Bitcoin economy.

On Wednesday, researchers from cryptocurrency tracing firm Elliptic, MIT, and IBM published a paper that lays out a new approach to finding money laundering on Bitcoin's blockchain. Rather than try to identify cryptocurrency wallets or clusters of addresses associated with criminal entities such as dark-web black markets, thieves, or scammers, the researchers collected patterns of bitcoin transactions that led from one of those known bad actors to a cryptocurrency exchange where dirty crypto might be cashed out. They then used those example patterns to train an AI model capable of spotting similar money movements—what they describe as a kind of detector capable of spotting the “shape” of suspected money laundering behavior on the blockchain.

Now, they're not only releasing an experimental version of that AI model for detecting bitcoin money laundering but also publishing the training data set behind it: a 200-million-transaction trove of Elliptic's tagged and classified blockchain data, which the researchers describe as roughly a thousand times bigger than any data set of its kind ever made public. “We're providing about a thousand times more data, and instead of labeling illicit wallets, we're labeling examples of money laundering which might be made up of chains of transactions,” says Tom Robinson, Elliptic's chief scientist and cofounder. “It's a paradigm shift in the way that blockchain analytics is used.”

Blockchain analysts have used machine learning for years to automate and sharpen their tracing of crypto funds and identification of criminal actors. In 2019, in fact, Elliptic partnered with MIT and IBM to create an AI model for detecting suspicious money movements and released a much smaller data set of around 200,000 transactions that they had used to train it.

For this new research, by contrast, the same team took a much more ambitious approach. Rather than try to classify single transactions as legitimate or illicit, Elliptic analyzed chains of up to six transactions connecting Bitcoin address clusters it had already identified as illicit actors to the exchanges where those shady entities sold their crypto, positing that the patterns of transactions between criminals and their cashout points could serve as examples of money laundering behavior.

Working from that hypothesis, Elliptic assembled 122,000 of these so-called subgraphs, or patterns of known money laundering, within a total data set of 200 million transactions. The research team then used that training data to create an AI model designed to recognize money laundering patterns across Bitcoin's entire blockchain.
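For a concrete picture of what training on “subgraphs” rather than single transactions means, here is a deliberately simplified sketch in Python. It is not the researchers' actual model, features, or data format, none of which are detailed here; every name and number in it is illustrative. It only shows the core idea: represent each short chain of transactions as its own small graph, turn it into features, and train a classifier on whole chains labeled as laundering or not.

```python
# Simplified illustration of the "labeled subgraph" idea -- not the
# researchers' actual model or data format. All names here are hypothetical.
import networkx as nx
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def subgraph_features(g: nx.DiGraph) -> np.ndarray:
    """Turn one chain of transactions into a fixed-length feature vector.

    A real system would use far richer features (amounts, timing, fan-in and
    fan-out at each hop) or a graph neural network; this keeps only coarse
    structure for illustration.
    """
    amounts = [d.get("amount", 0.0) for _, _, d in g.edges(data=True)]
    return np.array([
        g.number_of_nodes(),
        g.number_of_edges(),
        nx.dag_longest_path_length(g) if nx.is_directed_acyclic_graph(g) else 0,
        float(np.mean(amounts)) if amounts else 0.0,
        float(np.max(amounts)) if amounts else 0.0,
    ])

def make_chain(n_hops: int, amount: float) -> nx.DiGraph:
    """Build a toy chain of transactions ending at a deposit address."""
    g = nx.DiGraph()
    for i in range(n_hops):
        g.add_edge(f"addr_{i}", f"addr_{i + 1}", amount=amount)
    return g

# Toy training set: each subgraph is one chain of transactions, labeled
# 1 if it matches a suspected laundering pattern, 0 for ordinary activity.
subgraphs = [make_chain(2, 0.5), make_chain(6, 40.0), make_chain(3, 1.2), make_chain(5, 75.0)]
labels = [0, 1, 0, 1]

X = np.stack([subgraph_features(g) for g in subgraphs])
clf = GradientBoostingClassifier().fit(X, labels)
print(clf.predict(X))
```

In practice, a model like Elliptic's would have to score candidate chains drawn from the entire blockchain, which is why the scale of the labeled training data matters so much.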

As a test of their resulting AI tool, the researchers checked its outputs with one cryptocurrency exchange—which the paper doesn't name—identifying 52 suspicious chains of transactions that had all ultimately flowed into that exchange. The exchange, it turned out, had already flagged 14 of the accounts that had received those funds for suspected illicit activity, including eight it had marked as associated with money laundering or fraud, based in part on know-your-customer information it had requested from the account owners. Despite having no access to that know-your-customer data or any information about the origin of the funds, the researchers' AI model had matched the conclusions of the exchange's own investigators.

Correctly identifying 14 out of 52 of those customer accounts as suspicious may not sound like a high success rate, but the researchers point out that only 0.1 percent of the exchange's accounts are flagged as potential money laundering overall. Their automated tool, they argue, had essentially boosted the hit rate in the hunt for suspicious accounts from roughly one in a thousand to more than one in four. “Going from ‘one in a thousand things we look at are going to be illicit’ to 14 out of 52 is a crazy change,” says Mark Weber, one of the paper's coauthors and a fellow at MIT's Media Lab. “And now the investigators are actually going to look into the remainder of those to see, wait, did we miss something?”

Elliptic says it's already been privately using the AI model in its own work. As more evidence that the AI model is producing useful results, the researchers write that analyzing the source of funds for some suspicious transaction chains identified by the model helped them discover Bitcoin addresses controlled by a Russian dark-web market, a cryptocurrency “mixer” designed to obfuscate the trail of bitcoins on the blockchain, and a Panama-based Ponzi scheme. (Elliptic declined to identify any of those alleged criminals or services by name, telling WIRED it doesn't identify the targets of ongoing investigations.)

Perhaps more important than the practical use of the researchers' own AI model, however, is the potential of Elliptic's training data, which the researchers have published on the Google-owned machine learning and data science community site Kaggle. “Elliptic could have kept this for themselves,” says MIT's Weber. “Instead there was very much an open source ethos here of contributing something to the community that will allow everyone, even their competitors, to be better at anti-money-laundering.” Elliptic notes that the data it released is anonymized and doesn't contain any identifiers for the owners of Bitcoin addresses or even the addresses themselves, only the structural data of the “subgraphs” of transactions it tagged with its ratings of suspicion of money laundering.
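For researchers who want to experiment with data released in this form, the workflow might look roughly like the sketch below. The actual dataset identifier, file names, and columns aren't specified here, so all of them are hypothetical placeholders; the sketch only illustrates regrouping anonymized, structure-only records into per-subgraph transaction graphs.

```python
# Hypothetical sketch of loading an anonymized subgraph data set of this kind.
# The file name and columns below are placeholders, not Elliptic's actual schema.
import pandas as pd
import networkx as nx

# Suppose each row is one edge (transaction) belonging to a labeled subgraph:
# subgraph_id, source_node, target_node, label (e.g., 1 = suspected laundering).
edges = pd.read_csv("subgraph_edges.csv")

subgraphs = {}
for subgraph_id, group in edges.groupby("subgraph_id"):
    g = nx.from_pandas_edgelist(
        group, source="source_node", target="target_node", create_using=nx.DiGraph
    )
    subgraphs[subgraph_id] = (g, group["label"].iloc[0])

print(f"loaded {len(subgraphs)} labeled subgraphs")
```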

That enormous data trove will no doubt inspire and enable much more AI-focused research into bitcoin money laundering, says Stefan Savage, a computer science professor at the University of California San Diego who served as adviser to the lead author of a seminal bitcoin-tracing paper published in 2013. He argues, though, that in its current form the tool seems less likely to revolutionize anti-money-laundering efforts in crypto than to serve as a proof of concept. “An analyst, I think, is going to have a hard time with a tool that's kind of right sometimes,” Savage says. “I view this as an advance that says, ‘Hey, there's a thing here. More people should work on this.’”

Savage warns, though, that AI-based money-laundering investigation tools will likely raise new ethical and legal questions if they end up being used as actual criminal evidence—in part because AI tools often serve as a “black box” that provides a result without any explanation of how it was produced. “This is on the edge where people get uncomfortable in the same way they get uncomfortable about face recognition,” he says. “You can't quite explain how it works, and now you're depending on it for decisions that may have an impact on people's liberty.”

MIT's Weber counters that money laundering investigators have always used algorithms to flag potentially suspicious behavior. AI-based tools, he argues, just mean those algorithms will be more efficient and have fewer false positives that waste investigators' time and incriminate the wrong suspects. “This isn't about automation,” Weber says. “This is a needle-in-a-haystack problem, and we're saying let's use metal detectors instead of chopsticks.”

As for the research impact that Savage expects, Weber argues that, beyond blockchain analysis, Elliptic's training data is so voluminous and detailed that it may even help with other kinds of AI research into analogous problems like health care and recommendation systems. But he says the researchers also intend their work to have a practical effect, enabling a new and very real way to hunt for patterns that reveal financial crime.

“We're hopeful that this is much more than an academic exercise,” Weber says, “that people in this domain can actually take this and run with it.”