Troy Hunt: Fixing Data Breaches Part 2: Data Ownership & Minimisation

Yesterday, I wrote the first part of this 5-part series on fixing data breaches and I focused on education. It's the absolute best bang for your buck by a massive margin and it pays off over and over again across many years and many projects. Best of all, it's about prevention rather than cure.

The next few parts of this series all focus on cures - how do we fix data breaches once bad code has already been written or bad server configurations deployed? In part 2 of the series, I want to talk about data ownership and minimisation and this is all about reducing the impact on individuals and organisations alike when things do go wrong.

Who Owns Our Personal Data?

The question in the title sounds simple but it goes a lot deeper than many think. It's also geographically specific in that there are different legal definitions in different places and indeed different social expectations due to cultural differences too. I want to focus on what I believe the answer should be rather than what the law permits; after all, we're addressing a global problem here that transcends legal boundaries.

But firstly, let's put that question in context: you sign up to a cat forum because you want to discuss cats with other feline aficionados. The site asks you for some personal information when you create the account which it then stores in a database. Who now owns that data? The cat site? Or you? This is an important question because it drives the way organisations then treat that data. But it's also somewhat philosophical so let's translate it into practical terms.

I'm going to refer a lot to the upcoming European General Data Protection Regulation (GDPR) that will hit Europe in May 2018 because protecting personal data is a cornerstone of the legislation. If you're not already familiar with GDPR, it's definitely worth having at least a high-level understanding of it because it affects companies operating outside of the EU too. I wrote a free course for Varonis on GDPR earlier this year and I'm going to be referring to points from there quite a bit in this blog post.

I want to call out a slide from that course that defines personal data because it's important to understand what we're talking about here:

Defining Personal Data

Of course, personal data goes much further than this but that gives you an idea of what we're talking about protecting better here. (Incidentally, do note the inclusion of things like mobile device ID, IoT data and trade union memberships; many people wouldn't normally think of these as "personal data", but once you consider the potential for abuse, it makes a lot more sense.) Let me bullet some key points here in terms of how GDPR views personal data:

It's owned by the individual, not the organisation holding it
The owner can request the organisation provide them with a copy of their data ("right of access")
The owner can request the organisation delete their data ("right to erasure")
The owner must consent to the data being collected and that consent must be:
1. Freely given
2. Specific
3. Informed
4. Unambiguous

Edit: Quick thanks to Simon Fitzgerald for his comment and the reference to the ICO piece on consent which points out that "consent is one way to comply with the GDPR, but it's not the only way".

I won't go deep into all that here because I explain it in the aforementioned course, but let me sum it up in an easily consumable way:

When you give your data to a cat forum, you still own it. You can ask the cat forum to give you a copy of it or to permanently delete it (they need to do this without charge). The cat forum can also only use your data in precisely the way you expected them to when you gave it to them.

When you read this explained in this fashion, it's hard not to nod your head in agreement because it just sounds fundamentally obvious, doesn't it? But let's juxtapose that with what regularly happens when you provide your data to a website:

When you give your data to a cat forum, they now own it. They may have lengthy terms and conditions that give them this right; you didn't read them because nobody ever does, but you agreed to them. They may not respond to your emails requesting access to your data and they may not have a process to delete it. Because they believe they own it, they may also share it with a dog forum they've partnered with or sell it to advertisers thus spreading the footprint of your data.

This is precisely the attitude we need to address so let's move onto tackling that and changing the way we think of personal data.

Use of Personal Data Should be Transparent and Easy to Understand

Back in September, a number of people pointed me at Experian's "FREE Dark Web Email Scan" (capitalisation is theirs, not mine) because on the surface of it, it seemed similar to my Have I Been Pwned (HIBP) service. Here's what it looks like:

And it does look similar to HIBP - enter your email address and go! But where things differ is in the highlighted areas, that is, the 3 things you must read and understand before doing the search. I thought I'd check them out myself, with my original plan being to read them and better understand what they were doing with people's data, until I discovered this:

21,498 words

You need to absorb 21,498 words spread over 42 pages worth of Microsoft Word formatted document before using this service. That is absolutely ridiculous! Experian knows damn well that nobody is going to do that and it's merely a legal arse-covering exercise: "But Joe, you agreed to us selling your data before you did the search, why are you so upset that other companies now have your personal info?"

I don't know what's in those 42 pages. I don't know if they're selling your data, storing your data or demanding custody of your first born child (do check out the way F-Secure used a Herod clause to point out the futility of lengthy terms and conditions in that link).

There must be a reasonable expectation that people can read and understand what you intend to do with their personal data before they give it to you.

Seem fair? There are parts of GDPR that legislate this and that's a good thing. I'd like to see that thinking extend to other countries (and indeed the extraterritoriality provisions within GDPR help move us in that direction) and I'd also like to see companies doing this simply because it's the right thing to do!

Data Collection Should be Minimised, Not Maximised

Let's go back to the cat forum scenario again. In fact, let's take a look specifically at catforum.com and some of the information they request on the profile page:

Cat Forum DOB

That page also goes on to request information on where you live, your biography, interests, occupation and mobile phone number. Yes, they're all optional, but by virtue of placing those fields on the page, people will still fill them out. There is absolutely, positively no good reason why this information should be provided in order to talk about cats. I mean this is a forum with discussions such as if you can buy shoulder pads for your cat to ride on, building a condo for your cat and whether you'll get sick eating out of the same bowl as your cat.

Here's the problem and I'm going to quote directly from my written testimony sent in to Congress (hat tip to James for his suggestion on this):

Organisations view data on their customers as an asset, yet fail to recognise that it may also become a liability

I just checked HIBP and the following sites all collected DOB before having it exposed to unauthorised parties:

Acne.org, Adult Friend Finder, AhaShare.com, ai.type, Android Forums, Ashley Madison, Badoo, Beautiful People, Bitcoin Talk, Black Hat World, Boxee, Cannabis.com, ClixSense, COMELEC (Philippines Voters), Data Enrichment Records, diet.com, DLH.net, Dungeons & Dragons Online, eThekwini Municipality, Evermotion, Experian, Exposed VINs, Fling, Foxy Bingo, Funimation, gPotato, GTAGaming, hackforums.net, Health Now Networks, hemmelig.com, HongFire, InterPals, iPmart, JobStreet, Justdate.com, KM.RU, Linux Mint, Little Monsters, Lookbook, Lord of the Rings Online, Malwarebytes, Master Deeds, Mate1.com, Minefield, Modern Business Solutions, Money Bookers, MrExcel, Naughty America, Neopets, Neteller, Netshoes, Nival, Nulled, Paddy Power, PHP Freaks, Qatar National Bank, QuinStreet, SC Daily Phone Spam List, ServerPact, Sony, Soundwave, Special K Data Feed Spam List, StarNet, Ster-Kinekor, The Candid Board, Torrent Invites, Trillian, vBulletin, Victory Phones, VTech, WildStar, Wishbone, Спрашивай.ру

These sites include everything from a cannabis forum to a virtual keyboard to gaming sites. Just how essential was the subscriber's date of birth on these services? There are some valid use cases for knowing someone's approximate age (such as dating websites), but clearly, it's a totally pointless attribute for the vast majority of sites in this list. (And no, don't say "because COPPA needs it", you can establish whether someone is 13 or older without saving and storing their DOB in perpetuity.)
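To make that last point concrete, here's a minimal sketch of an age check that never persists the date of birth. The function names, the `age_verified` field and the shape of the stored record are all assumptions of mine for illustration, not how any particular site does it:

```python
from datetime import date
from typing import Optional

MINIMUM_AGE = 13  # the COPPA threshold mentioned above


def is_at_least(dob: date, years: int, today: Optional[date] = None) -> bool:
    """Return True if someone born on `dob` is at least `years` old."""
    today = today or date.today()
    # Subtract one year if this year's birthday hasn't happened yet.
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    age = today.year - dob.year - (0 if had_birthday else 1)
    return age >= years


def build_signup_record(email: str, dob: date, today: Optional[date] = None) -> dict:
    """Check the age once, then keep only the result (hypothetical record shape)."""
    if not is_at_least(dob, MINIMUM_AGE, today):
        raise ValueError("user does not meet the minimum age")
    # The DOB is used for the check and then discarded; it is never stored.
    return {"email": email, "age_verified": True}
```

The DOB exists only for the duration of the check. What lands in the database is a single boolean, which is a lot less interesting to someone who later breaches the site.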

Let's turn this around and start looking at positive behaviours in terms of the way other people's data is handled. Take the sign-up form on HIBP as an example:

HIBP Signup Form

And for good measure, I'll add the sign-up form for Report URI too:

Report URI Signup Form

These are spartan. They ask nothing of the user beyond what is required to do their job. HIBP only needs an email address because that's all I'm looking for when someone appears in a data breach. Report URI needs a password as well because you need to be able to login. That is all. We don't even collect a name on either of those services because what good would it do? We could personalise some emails a bit more? It's pointless data.

But many organisations don't think of it that way - pointless data - they think of it as additional data points they can collect. To my previous point, these organisations are looking at our data as an asset and the more they have of it, the more valuable it is. Now put that situation in the context of this sage advice:

You cannot lose what you do not have.

How I handle HIBP is a perfect example of this: this is a data breach aggregation service and within that source data is billions of passwords, dates of birth and almost every other conceivable piece of personal data. However, the only personal data the online system holds is email addresses. Despite the terabytes of data in the original breaches, if HIBP itself gets pwned (and that's always a possibility regardless of how much care I take), the only thing that would be exposed is those addresses. I don't want an incident to occur and I put an enormous amount of effort into ensuring it doesn't happen, but by practising data minimisation I ensure that if it does happen, the impact is significantly less than what it would otherwise be.
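As a rough illustration of the principle (and to be clear, this is a sketch and not the actual HIBP code), minimisation can be as simple as an allow-list applied before anything reaches the online system. The field names and the `minimise` helper here are assumptions:

```python
# Hypothetical allow-list: the only fields the online system is permitted to hold.
ALLOWED_FIELDS = frozenset({"email"})


def minimise(record: dict, allowed=ALLOWED_FIELDS) -> dict:
    """Return a copy of `record` containing only the allow-listed fields."""
    return {key: value for key, value in record.items() if key in allowed}


# Example source record with far more data than the service needs.
source_record = {
    "email": "someone@example.com",
    "password": "hunter2",
    "dob": "1980-01-01",
    "phone": "555-0100",
}

stored = minimise(source_record)
```

An allow-list is the safer default compared to a block-list: a new field appearing in the source data gets dropped unless someone makes a deliberate decision to keep it.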

Wherever Possible, Data Should Expire and be Deleted. Permanently.

I want to start by quoting myself again from that Congressional testimony:

Further compounding the data maximisation problem is the fact that the retention period of the data usually extends well beyond the period in which the service is used by the owners of the data. For example, signing up to an online forum merely to comment on a post means the subscriber's personal data will usually prevail for the life of the service. There are many precedents of data breaches occurring on sites where those who've had their personal data exposed haven't used the service for many years.

Take the cat forum above - once I've worked out that I shouldn't eat from the same bowl as my cat, does the site need to retain my date of birth and phone number in perpetuity? Of course, it sounds preposterous when I say it this way, but that's how these things work. It is easier to collect and retain than it is to purge. There are many reasons for this: storage is cheap, purging takes extra work and as already established, other people's data is valuable. Plus, of course, it's not always black and white when that purging should happen; if someone doesn't comment on cats again for a year should their data be removed? 3 years? 5?
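Whatever the answer to that question, the purge itself needn't be complicated. Here's a minimal sketch using SQLite, where the `users` table, the `last_active` column (stored as an ISO date string) and the 1-year retention period are all assumptions for the sake of the example:

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical policy: purge accounts that have been inactive for over a year.
RETENTION = timedelta(days=365)


def purge_inactive(conn: sqlite3.Connection, today: date,
                   retention: timedelta = RETENTION) -> int:
    """Delete users whose last activity is older than the retention period.

    Assumes a `users` table with `last_active` stored as YYYY-MM-DD text,
    which sorts correctly as a plain string.
    """
    cutoff = (today - retention).isoformat()
    cursor = conn.execute("DELETE FROM users WHERE last_active < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount  # number of accounts removed
```

Run on a schedule, something like this turns the retention period into a single configurable value rather than a project nobody ever gets around to. Note that deleting rows from the live database does nothing about copies sitting in backups, logs or exports; those need their own expiry.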

It's clearer cut when services have a more finite period of operation. A competition runs for a period of time then it's over. A survey collects data which is then collated and reported on. Retaining the data beyond this point offers no value to its owner and ideally, it should be purged. But, of course, it's difficult convincing organisations of this because to go back to my previous comment, they still see it as an asset of theirs rather than a liability to the rightful owners of it. As GDPR is doing, this is an area where more legislation is required simply because we can't trust corporations to act in our best interests when it's detrimental to their own.

Data Aggregators Need Stronger Regulation

I hate to say we need more legal processes to comply with in an era where we frankly feel overburdened by them, but this is an area which is totally out of control. Let's take the recent South Africa situation as an example where the entire country had their data leaked! In this case, the data was exposed by a real estate company, an organisation that 99.x% of people in that data breach had never interacted with and wouldn't even have heard of before this. Yet here they were with 66 million South African identities, all sitting there in a database backup facing the world possibly for as long as 2 and a half years. They bought that data from a "data enrichment" company, that is, an organisation that makes money out of collecting and selling other people's data. Without consent. And it's legal.

We saw a somewhat similar situation in the US earlier this year with the Dun and Bradstreet NetProspex data. They describe their (again, perfectly legal) service as follows:

Optimize your demand generation engine by targeting the prospects who truly matter

And as I said in that original post, just in case there was any doubt as to what they're doing with our data, here's how D&B describe their service:

We help marketers develop and manage their B2B data. Our multi-faceted data quality processes — backed by the world's largest commercial database and seamless integration into your marketing systems — enables you to identify the best opportunities, build stronger relationships and accelerate growth for your company.

You know that old saying about people being the product? This is the manifestation of that. This is a publicly listed company making billions of dollars yet their treatment of personal data is indistinguishable from garden variety spammers. I wrote about one of these last year and explained how our data is collected and commoditised via free online services. In the comments of that post, the individual responsible for the data wrote this:

I've been in biz for just over 10 years... Paying taxes, and servicing small, medium-sized, and large size companies. I don't feel I have to defend anything (other than other similar sized companies as mine just trying to float by amidst an economic recession, and all the problems that come with it.) The value of what my company offers, the savings passed along, and the years in business speak for themselves, and louder than every competitor business on here that HATE US (ME) cuz they AIN'T US.

When you boil it down to brass tacks, this is the same service that D&B is selling albeit provided by mister "They hate us cuz they ANUS!" (yes, his words verbatim). Other people's data collected without their knowledge and sold to other companies the data owners had never heard of.

And then there's Equifax; a legally operating data aggregator providing credit report services and the target of one of the most significant data breaches of all time. And again, it's data collected without consent albeit entirely legally. The thing is though, we need credit report services like these because like it or not, they do provide a very important service. I wouldn't go so far as to propose that we should no longer have them, but clearly, the risk they've now exposed 145.5 million US consumers to (and a bunch from other countries too) is very serious.

And this is where more regulation is required. Services offering a legitimate value to society (and despite Equifax's woes, this is what they do) need to be able to operate, albeit held to higher account than they presently are. But equally, services which serve no value other than to "optimise demand generation engines by targeting prospects" and do so by collecting data either without consent or by implying consent by burying it in lengthy agreements, frankly, need to be shut down.

Summary

This whole post is about giving control of data back to the rightful owners and minimising the impact on them when a breach occurs. This is equal parts a fundamentally simple objective to achieve and one that is enormously difficult. It's simple not to request that someone provides their date of birth to a cat forum; neither the site nor the user themselves lose anything by not collecting this data. Yet it remains a difficult objective because not only do so many services continue to view our data as an asset, they never expect to be the victim of a data breach which then turns that data into a liability.

Individual organisations can address this independently by being more responsible in terms of what data they request and indeed how long they keep it for. But we need more than that - we need legislation - and that's a much harder ask. Hopefully, GDPR will begin to move the needle in the right direction and gradually, organisations who are unwilling to do the right thing of their own accord will be forced to do it by the one thing they all understand - the impact on their bottom line.
