Collating Hacked Data Sets

Two Harvard undergraduates completed a project where they went out on the dark web and found a bunch of stolen datasets. Then they correlated all the information, and combined it with additional, publicly available, information. No surprise: the result was much more detailed and personal.

“What we were able to do is alarming because we can now find vulnerabilities in people’s online presence very quickly,” Metropolitansky said. “For instance, if I can aggregate all the leaked credentials associated with you in one place, then I can see the passwords and usernames that you use over and over again.”

Of the 96,000 passwords contained in the dataset the students used, only 26,000 were unique.

“We also showed that a cyber criminal doesn’t have to have a specific victim in mind. They can now search for victims who meet a certain set of criteria,” Metropolitansky said.

For example, in less than 10 seconds she produced a dataset with more than 1,000 people who have high net worth, are married, have children, and also have a username or password on a cheating website. Another query pulled up a list of senior-level politicians, revealing the credit scores, phone numbers, and addresses of three U.S. senators, three U.S. representatives, the mayor of Washington, D.C., and a Cabinet member.

“Hopefully, this serves as a wake-up call that leaks are much more dangerous than we think they are,” Metropolitansky said. “We’re two college students. If someone really wanted to do some damage, I’m sure they could use these same techniques to do something horrible.”

That’s about right.

And you can be sure that the world’s major intelligence organizations have already done all of this.

Posted on January 30, 2020 at 8:39 AM • 24 Comments

Comments

ATN January 30, 2020 10:37 AM

To collect Windows passwords from people at work, there is one trick I hate:
A user types for instance:
$ docker login
instead of:
$ docker login mycompany.com/docker
The user assumes the defaults are set up… if he goes on to fill in the “Username:” and “Password:” fields, then a computer somewhere has received his login credentials…

Logging all those failed username/password attempts can get you very far.

Clive Robinson January 30, 2020 11:08 AM

@ ALL,

I would like to say this,

    “For instance, if I can aggregate all the leaked credentials associated with you in one place, then I can see the passwords and usernames that you use over and over again.”

Would be obvious to any readers here unless relatively new.

I’ve mentioned it before, but I went further to indicate not only that what you had used before was known but,

As the human mind is, in the general case, hopeless at remembering random strings, many use a simple “method” to produce a randomish-looking password. Thus from seeing as few as two or three of your previous passwords, an attacker can fairly easily work out what that simple method is, or the master secret you use (think of it as simple cryptanalysis from more than a century ago, assisted by modern tools).

It’s why, at the end of the day, especially with length-limited passwords, our host @Bruce’s old recommendation still has merit: generate random passwords via a true random number generator (TRNG), or at least a cryptographically secure deterministic random bit generator (CS-DRBG), write them down, and put the paper in your wallet.
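As a minimal sketch of that recommendation, the following uses Python's standard `secrets` module, which draws from the operating system's cryptographically secure random source. The function name `random_password`, the default length of 20, and the choice of alphabet are assumptions made for illustration; a site with stricter character rules would need a different alphabet.

```python
import secrets
import string

# Hypothetical alphabet: letters, digits, and ASCII punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def random_password(length: int = 20) -> str:
    """Return a password whose characters are drawn independently and
    uniformly from ALPHABET using the OS-backed CSPRNG."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Output differs on every run by design.
    print(random_password())
```

Because each character is chosen independently, there is no "method" or master secret for an attacker to infer from earlier passwords.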

However, the days of “static passwords” are now long over, every bit as much as the days of storing them on a server in near or actual plain text.

We’ve kind of known about these weaknesses for over half a century.

Yes, these sorts of weaknesses were being publicly talked about back in the early 1960s, as was, for instance, what became Y2K. But unlike Y2K, we’ve yet to really do anything about it.

Thus the first question we should ask ourselves is,

    What does this say about us as alleged professionals and how others see us and our advice?

A somewhat sobering thought.

But a second thought should also be,

    Why is it seen in academia as something that should be talked about as though a new insight?

The correct answers to both should tell us about ICTsec as an industry, and most are really not going to like it.

Antistone January 30, 2020 3:01 PM

When I see a sentence like this:

Of the 96,000 passwords contained in the dataset the students used, only 26,000 were unique.

…I’m never quite sure whether they mean there are 26,000 distinct passwords (i.e. all of the 96,000 fields contain one of those 26,000 values) or whether they mean that 26,000 passwords appeared exactly once each, and then there were somewhere between 1 and 35,000 other values that appeared at least twice each among the remaining 70,000 fields.
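The two readings can be made concrete with a small sketch. The sample list and the function names `distinct_count` and `singleton_count` are invented for illustration and have nothing to do with the students' actual data.

```python
from collections import Counter


def distinct_count(passwords):
    """Reading 1: how many different values occur at all."""
    return len(set(passwords))


def singleton_count(passwords):
    """Reading 2: how many values occur exactly once."""
    counts = Counter(passwords)
    return sum(1 for n in counts.values() if n == 1)


# Hypothetical sample: six password fields.
sample = ["123456", "password", "123456", "hunter2", "password", "x9!Lq"]

print(distinct_count(sample))   # 4
print(singleton_count(sample))  # 2
```

The same six fields give 4 under the first reading and 2 under the second, which is why a bare "unique" is ambiguous.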

Northern Realist January 30, 2020 3:24 PM

The academics who did this study sure do have an acute grasp of the painfully obvious! Wonder if their next “research” project will be telling us how marketing departments could use this info….

SpaceLifeForm January 30, 2020 3:29 PM

@ Antistone

I parsed it as: in the set of 70000, there were duplicates.

Duplicate hashes showed up at least twice in the set of 70000 hashes.

The hash of ‘password’ and ‘123456’ probably showed up in the double digits. Maybe triple digits.

So, an interesting data point then.

Only approximately 25% of the population has a clue regarding strong passwords.

Me January 30, 2020 3:29 PM

@Northern Realist

Sometimes the painfully obvious needs to be studied in order to enter it into the public sphere.

ap January 30, 2020 3:45 PM

This is bad, but not as bad as it looks at first glance. People who reuse passwords often reuse short and/or weak passwords, which will collide with passwords used by other users. Finding that two accounts used the same password will often times (most of the time?) only be a weak and deniable association at best.

SpaceLifeForm January 30, 2020 4:13 PM

@ ap

No. It is the reuse of passwords over multiple accounts that is the main problem.

It makes it too easy to correlate.

In particular, if the hash algorithm is the same on different platforms.

Never, never, use the same password on two accounts.

Just don’t do that.

ap January 30, 2020 5:52 PM

@SpaceLifeForm
Of course, I would never recommend reusing passwords under any circumstances. What I was trying to express is that a password that is shared between two accounts within two leaked datasets is not necessarily owned by the same individual, which is problematic for the baddies that might seek to exploit leaked data using the techniques the article suggested.

For example, in less than 10 seconds she produced a dataset with more than 1,000 people who have high net worth, are married, have children, and also have a username or password on a cheating website.

The key phrase here is “username OR password”; the study is assuming that two accounts with the same password (e.g. P@ssw0rd123 on both a Facebook account and an Ashley Madison account) belong to the same person, when in fact thousands of people have used that password. Also, imagine “planting” someone’s personal information or passwords on a bunch of embarrassing websites, one of which will inevitably get hacked and leaked in the future, thus “framing” an innocent person and creating a false impression of their character.

Of course, if a second piece of information besides the password is also correlated, such as a username or email, it becomes much more difficult to deny.

Humdee January 30, 2020 9:03 PM

@bruce writes, “And you can be sure that the world’s major intelligence organizations have already done all of this.”

If they hadn’t, they wouldn’t have been very good at their jobs. (Which isn’t to say that I agree with what their job is.)

Matt January 30, 2020 9:42 PM

@Northern Realist:

These were… undergraduate college students, as it says in the article. They’re not “academics.” I’m unclear why you thought it was necessary to heap scorn upon a couple of college students doing a relatively simple data analysis project. Maybe you didn’t actually read the article.

Chris January 30, 2020 10:21 PM

Re-use of passwords is awful but it’s all but encouraged by most e-commerce sites. There are lots of merchants with whom I transact business who require a username and password for what’s only going to be one-time transaction. Unfortunately, these merchants often sell items I can’t get anywhere else on-line like concert tickets. They’re also the most likely to be breached without me ever knowing about it.

Major Intelligence January 30, 2020 11:21 PM

@bruce “And you can be sure that the world’s major intelligence organizations have already done all of this.”

Goes without saying.

Tatütata January 31, 2020 2:36 AM

Of the 96,000 passwords contained in the dataset the students used, only 26,000 were unique.

Yet another rediscovery/confirmation of the Zipf inverse power law, e.g. here… I just checked, and the result of 26000 unique entries out of a set of 96000 seems just about right, I came up with ~30k for an exponent of -0.8.
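For readers who want to try this kind of estimate themselves, here is a rough sketch. It assumes passwords are drawn independently from a finite pool whose rank-i probability is proportional to i^(-s); under that assumption the expected number of distinct values after n draws is the sum over i of 1 - (1 - p_i)^n. The pool size is not given in the comment, so `vocab_size` below is a hypothetical parameter, and the result depends strongly on it.

```python
def expected_distinct(n_draws: int, vocab_size: int, exponent: float) -> float:
    """Expected number of distinct values seen in n_draws independent
    draws from a Zipf-like distribution over vocab_size ranked items,
    where P(rank i) is proportional to i ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    # An item with probability p is missed by all draws with
    # probability (1 - p) ** n_draws.
    return sum(1.0 - (1.0 - w / total) ** n_draws for w in weights)


if __name__ == "__main__":
    # vocab_size=200_000 is an assumption, not a figure from the thread.
    print(expected_distinct(96_000, vocab_size=200_000, exponent=0.8))
```

Varying `vocab_size` and `exponent` shows how sensitive the estimate is to assumptions the article does not report.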

On a related topic, I recently bought an entry level surveillance camera of recent manufacture (from the first entries suggested by a top E-commerce site), I gather late 2018/early 2019 from the various chip markings. It isn’t for surveillance, I have a couple of non-security applications, i.e. detect certain events that I cannot easily record otherwise and log them into a file.

I was disappointed to find out that the camera’s userid/PW were admin/admin, and that the video interface is based on wretched Adobe Flash!!!

In addition to banning default passwords, California could perhaps legislate that service providers must vet new passwords for entropy and duplication, but in a way that prevents this check from being used as some sort of oracle. The oft-imposed requirement of a mixture of upper and lower case letters, digits, and non-alphanumeric characters is probably not nearly enough, as someone will eventually figure out ways to test “regular” patterns such as @Widget123…
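To illustrate why a character-class rule alone is weak, the sketch below checks the usual four-class requirement and then tests one guessable shape: an optional symbol, a capitalized word, a run of digits, and an optional trailing symbol. The regular expression and both function names are assumptions made for illustration; a real vetting step would need many more patterns and a breached-password check.

```python
import re


def meets_composition_rule(password: str) -> bool:
    """True if the password mixes lower case, upper case, digits,
    and at least one non-alphanumeric character."""
    return (
        any(c.islower() for c in password)
        and any(c.isupper() for c in password)
        and any(c.isdigit() for c in password)
        and any(not c.isalnum() for c in password)
    )


# Hypothetical pattern: [symbol] Capitalized-word digits [symbol]
PREDICTABLE = re.compile(r"[^A-Za-z0-9]?[A-Z][a-z]+[0-9]+[^A-Za-z0-9]?")


def looks_predictable(password: str) -> bool:
    """True if the whole password matches the guessable shape above."""
    return PREDICTABLE.fullmatch(password) is not None


print(meets_composition_rule("@Widget123"))  # True
print(looks_predictable("@Widget123"))       # True
```

The example password satisfies the composition rule and still falls into a shape an attacker can enumerate cheaply.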

James W January 31, 2020 5:08 AM

One thing most of these studies don’t take into account is how many of these were “important” accounts. When a random website asks you to create an account, the obvious tendency is to create something that is simple and throwaway.

It would add more teeth to the study if we could find out how many of these accounts had repeated logins over a longer period of time.

nycman January 31, 2020 11:03 AM

@Chris ” for what’s only going to be one-time transaction.”
If it’s a one-time transaction, why not set a long random password and forget about it? If you happen to return to that site, you can always do a password reset, which they usually do by sending a link to your e-mail. I’ve heard of people who reset their password each time they log in to sites, using it as a form of OTP.

me February 1, 2020 7:01 AM

One should also use something like pwgen with “longer and complicater” options for password recovery settings like “mother’s name”, “preferred pet” and so on, so that social graph data and personal attitude and other preferences don’t leak and the items are non-reusable from one site to another.

And why not use entropy like this for “name” and “address” etc. to begin with, unless it’s needed for real-world postal or verification purposes?

Dasha M. February 2, 2020 5:40 PM

Hi all. We’re the Harvard undergrads who worked on this project. To clarify a few of the points raised in the comments:

@Antistone, @SpaceLifeForm: you’re right about the lack of clarity here. Our program found, on average, 6.71 passwords per email address but only 1.82 unique (i.e. distinct) passwords per email address.

@ap: You write: “The key phrase here is “username OR password”; the study is assuming that two people with the same password (ex. P@ssw0rd123 on both a Facebook account and an Ashley Madison is the same person, when in fact there are thousands of people who have used the same password.”

You are correct that if two accounts on different websites have the same password, this does not necessarily mean that the same person owns both accounts. However, this is NOT what we assumed. We used emails, not passwords or usernames, to link people across datasets: if we saw the same email address was used for both Ashley Madison and Facebook, we would assume that the same person owned both accounts. We also assumed that any other leaked credentials (e.g. usernames or passwords) tied to that email belong to the same person. We understand that using an email address as a unique identifier is potentially imperfect (for example, multiple family members could use the same email), but it was adequate for our purposes.
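The linking rule described here can be sketched in a few lines. This is not the students' program: the record layout, the function names, the lower-casing of addresses, and the example.com records are all assumptions made for illustration.

```python
from collections import defaultdict


def link_by_email(*datasets):
    """Group records from several leaks under one key per email address.
    Lower-casing and trimming the address is an assumption here."""
    profiles = defaultdict(list)
    for records in datasets:
        for record in records:
            profiles[record["email"].strip().lower()].append(record)
    return dict(profiles)


def password_stats(profiles):
    """Return (mean passwords per email, mean distinct passwords per email)."""
    if not profiles:
        return 0.0, 0.0
    total = sum(len(recs) for recs in profiles.values())
    distinct = sum(len({r["password"] for r in recs}) for recs in profiles.values())
    return total / len(profiles), distinct / len(profiles)


# Hypothetical leaked records from two sites.
site_a = [
    {"site": "site-a", "email": "alice@example.com", "password": "tulip42"},
    {"site": "site-a", "email": "bob@example.com", "password": "P@ssw0rd123"},
]
site_b = [
    {"site": "site-b", "email": "Alice@Example.com", "password": "tulip42"},
    {"site": "site-b", "email": "carol@example.com", "password": "P@ssw0rd123"},
]

profiles = link_by_email(site_a, site_b)
print(len(profiles))                       # 3
print(len(profiles["alice@example.com"]))  # 2
```

In this toy data the two alice records are merged because the address matches, while bob and carol stay separate even though they share a password, which is the distinction being drawn in this reply.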

@James W: Agreed that it would be interesting to analyze the results based on important/frequently-used accounts vs. throwaway/one-time-use accounts. Unfortunately, we don’t have access to data about the frequency of account log-ins for a given website. However, even without this particular information, in many cases (e.g. Ashley Madison, porn sites, etc), the fact that we found out that a person has an account at all can be potentially dangerous.

@Matt: Thank you 🙂

Slag February 3, 2020 11:30 AM

@NYCMan I’ve tried doing that, but the password confirmation step messes me up if I use a random string. It’s a lot easier to use a leet “password” variant for all of the password-requiring websites where I don’t care who gets in, and a real password for email/bank/etc.

Thatguy February 4, 2020 3:51 PM

I literally did something like this 4-5 years ago as a college student. I’m sure intelligence orgs did this long before that. I wouldn’t worry about them having this information. I would instead worry about organized criminals, terrorist groups, corporations, data brokers, and data analytics groups (think Cambridge Analytica) having these profiles. Of course, we already know to a degree that this is happening. Wait until they start selling individual people’s complete profiles on the dark web.

What public knowledge is available so far to put together in a profile? Emails, usernames, websites registered at, addresses, birth location, relatives, photos, phone locations, phone numbers, apps used on the phone, type of cell phone (Android/iPhone), credit history/score, medical records, state voting information/political leanings, criminal or any other public court records, potentially credit card and SSN, etc. This information can be used to target celebrities, billionaires, CEOs, and public officials, or be used by stalkers, etc.

Eventually, if government doesn’t take strong action to force corporations to protect or ideally erase our data, this type of information will be for sale on the darknet to anyone, if it’s not already. And don’t allow corporations to lobby/advise lawmakers to put the blame on anyone who might acquire this information AFTER it gets leaked. That is counterproductive and will only increase the value and therefore create a larger demand.

It’s time to finally get tough on corporations. It’s not reasonable that they collect this amount of data for marketing. They might not have any intent to use it irresponsibly; however, they are not the only ones who eventually get access to it. What happens when the government loses its authority to corporations? Do we really want corporations/Facebook to have almost as good intelligence capabilities as the DoD? It sure would be easy to target lawmakers/judges/attorneys to bribe and/or blackmail. Maybe this is already happening.

That’s my public service for the day.

Wesley Parish February 6, 2020 12:48 AM

As always, I am not amazed that this can be done, but that people think it is in any way extraordinary. CJ Date in the 5th edition of his An Introduction to Database Systems gave an example of how one could datamine a perfectly innocuous-appearing database to extract very specific personal information for a very specific individual; the 5th edition was published in the 1990s. And he was talking about a well-secured database with only a standard data query interface.
