Facebook Has No Idea What Data It Has

This is from a court deposition:

Facebook’s stonewalling has been revealing on its own, providing variations on the same theme: It has amassed so much data on so many billions of people and organized it so confusingly that full transparency is impossible on a technical level. In the March 2022 hearing, Zarashaw and Steven Elia, a software engineering manager, described Facebook as a data-processing apparatus so complex that it defies understanding from within. The hearing amounted to two high-ranking engineers at one of the most powerful and resource-flush engineering outfits in history describing their product as an unknowable machine.

The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript. Zarashaw responded: “We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.” He quickly added, “For what it’s worth, this is terrifying to me when I first joined as well.”

[…]

Facebook’s inability to comprehend its own functioning took the hearing up to the edge of the metaphysical. At one point, the court-appointed special master noted that the “Download Your Information” file provided to the suit’s plaintiffs must not have included everything the company had stored on those individuals because it appears to have no idea what it truly stores on anyone. Can it be that Facebook’s designated tool for comprehensively downloading your information might not actually download all your information? This, again, is outside the boundaries of knowledge.

“The solution to this is unfortunately exactly the work that was done to create the DYI file itself,” noted Zarashaw. “And the thing I struggle with here is in order to find gaps in what may not be in DYI file, you would by definition need to do even more work than was done to generate the DYI files in the first place.”

The systemic fogginess of Facebook’s data storage made answering even the most basic question futile. At another point, the special master asked how one could find out which systems actually contain user data that was created through machine inference.

“I don’t know,” answered Zarashaw. “It’s a rather difficult conundrum.”

I’m not surprised. These systems are so complex that no humans understand them anymore. That allows us to do things we couldn’t do otherwise, but it’s also a problem.

EDITED TO ADD: Another article.

Posted on September 8, 2022 at 10:14 AM • 23 Comments

Comments

Gunter Königsmann September 8, 2022 11:23 AM

I actually don’t believe them: everything a firm never has done before is both impossible and causes infinite costs.
The leaders of the firm even might believe that. But the next step is normally that someone actually tries.

iAPX September 8, 2022 11:45 AM

It is very interesting on a legal level, notably concerning GDPR, local laws derived from GDPR in Europe, and also laws that are derivative of this work in North America.

These answers, under oath, might have deep implications for Meta/Facebook…

Jim Grisham September 8, 2022 11:53 AM

Complexity isn’t an excuse if the culture is to work so fast and loose that proper documentation (even just a data flow diagram?) is never made in the first place.

They have the money to hire internal technical writers and librarians, there’s just been little profit motive to do so.

Tech companies subject to audits on behalf of clients from financial industry and governments seem ~slightly~ better at this, at least.

The Artemis rockets are complicated, sure, but NASA engineers are expected to be able to trace out any system if required, and explain how it works (or failed). The same can be said for 1960s-era auto mechanics or 19-year-old electricians on a nuclear-powered aircraft carrier.

Tatütata September 8, 2022 12:10 PM

The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript. Zarashaw responded: “We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.”

Come on, this attitude is by no means limited to Fessbuck… I would say that it is typical of computing in general.

How many hours or weeks of my life have I wasted reading code to understand what it actually does, going ever deeper into the rabbit hole?

In an earlier job I was doing user technical support and was handed over the holy grail: a description of the validation rules for the core transaction types. It actually was a copy of a copy of a copy of scribbled notes describing about a dozen conditional checks. I suspect the ur-original to have been recorded on the proverbial napkin…

The systems were rebuilt to move away from IBM-mainframe back-ends to distributed client-server architecture. There was a whole set of project-management documents that were never followed; in the end the only thing that seemed to matter was to have ANYTHING to show to upper “management” that kindasortof worked…

The business was best described by a wall poster showing the general flow of data between a myriad subsystems, each independently evolved and maintained. I experienced something similar in my first job at a bank decades ago. The lines on the chart described in reality an unmanageable rat’s nest of cables in the machine room.

Tatütata September 8, 2022 12:19 PM

The Artemis rockets are complicated, sure, but NASA engineers are expected to be able to trace out any system if required, and explain how it works (or failed).

Before you gloat, just wait until one actually launches without blowing up, in case a sensor returns information in metric when its Shuttle forerunner used English units, or something silly of the sort…

I remember an issue of the IEEE Transactions back in the 80s/90s on the FAA’s redesign of its en-route ATC system.

IIRC, the original System/360 based systems had very little usable documentation, as the code had been patched out of recognition over the years.

What they had to work with was the (Fortran?) common block definitions.

The result was quite over time, over budget, and under expectations, as usual.

Pavodog September 8, 2022 12:30 PM

No HUMAN can understand it. Fortunately FB is owned/run by a robot named Zuckerberg. Surely after years of machine learning He knows or can figure it out.

RogerBW September 8, 2022 1:05 PM

So Facebook is, by its own people’s admission, unable to comply with the law?

Solution seems simple enough.

Bill Smith September 8, 2022 1:28 PM

OK… I loves me some good ol’ FB hate like the next guy, but I gotta be honest, it’s not like I don’t have some systems out there from a long time ago that are undocumented, uncommented, and only I can understand it after I stare at the code for a very long time.

But then again, Facebook, so…

Richard H Schwartz September 8, 2022 1:36 PM

Poor documentation is a long-standing tradition in software development.

Agile methodologies that “don’t need specs” and managers who depend on dashboards to show measurable productivity for every single engineer at all times to make their projects look better to higher-ups, and the higher-ups that enable this, have perfected this tradition.

solver September 8, 2022 1:49 PM

Well, “easy” solution in two steps:
– refactor the code, so they know in the future where the data comes from.
– drop every bit of data that has been already collected

Clive Robinson September 8, 2022 3:59 PM

@ ALL,

I’ve worked at a number of places in my time, and in two out of quite a few so far, I’ve seen this kind of behaviour.

Looking back they had certain defining characteristics, one of which was that they were led by either Venture Capital, or Marketing people who were personally very invested in “polishing the turd” to maximise share price.

I got out of both, fairly quickly, which is just as well because they both do not exist any more.

They both clearly got the “pump and dump” treatment, sold at a vastly over-inflated price, and unsurprisingly collapsed like an over-inflated Whoopee Cushion fairly quickly after they were purchased. They basically had lots of “apparent data” or “apparent potential” but in reality no “real substance”, and thus were little more than “hot air, blowing into the wind”.

For those that have tried doing “Third line Support” on complex systems that even “the code does not document” and developers won’t talk, and much of “the source is unavailable” you have my sympathies. Likewise if you have to team lead etc in such a toxic environment. My advice is do a “Pump and Dump” yourself out[1] and if you can “Grab the money and hit the ground running”[2] as well.

@ Tatütata, ALL,

“Come on, this attitude is by no means limited to Fessbuck… I would say that it is typical of computing in general.”

Not “general”, one or two, or just maybe I’ve been lucky about the places I’ve worked at…

Speaking of “luck”, the funny coincidence is earlier today I posted this link,

https://www.makeartwithpython.com/blog/is-engineering-management-bullshit/

That you might enjoy…

I’ll be honest and say I had no idea our host was going to post this thread today, and as you will see from the time stamp (6:26 AM) my comment precedes this thread (10:14 AM) by several hours,

https://www.schneier.com/blog/archives/2022/09/friday-squid-blogging-squid-images.html/#comment-409768

Strangely this is by no means the first time this type of coincidence has happened on this blog…

[1] That is, start a “major organisational defining project” as the project lead, build it up every which way you can including “bullish aspirational documentation” promising it will deliver “the secret to eternal life” or some such. Then about one third of the way in “jump ship” using it to get a new job at another company. The important thing to remember is projects don’t really deliver anything until over half way through so you leave “smelling of roses”[2].

[2] Remember if any major project you leave eventually succeeds claim it as “your success” as you “laid the foundations” etc. If it fails as it probably will[3], you then blame those left behind for failure to build on the good foundations you put in place etc etc. Either way you win, because by then it does not matter as you are “long gone”… If you study some people’s C.V.s you will see they specialise in “jumping up” this way until they too are in a position to get a major slice of share stock and sell it at a vastly over-inflated price.

[3] The sad fact is most large / major projects “fail miserably”. Some claim more than 9 in 10 fail, the statistics are hard to put together, but there does appear to be an inverse relationship between project success and size, importance, or cost. Part of the reason it’s hard to judge is what I call the “limp over the line factor”, that is, though the project has failed to meet initial promises/expectations even minimally, for various face-saving reasons it’s dressed up as a success in some way…

SpaceLifeForm September 8, 2022 5:27 PM

@ JonKnowsNothing, Ted, Clive, ALL

The panic is setting in.

https://www.wsj.com/articles/facebook-parent-meta-platforms-cuts-responsible-innovation-team-11662658423

Clive Robinson September 8, 2022 8:51 PM

@ lurker, ALL,

Re : Syscall wall chart.

“Is it really more complicated than Windows IIS?”

The image is both incomplete and lacks sufficient detail to say.

But I’ve seen systems with more nodes and edges, but one heck of a lot more hierarchical structure.

The thing that should concern people is the myriad of apparently circular dependencies, they are never ever a good idea.

Petre Peter September 9, 2022 6:51 AM

Watch out! It’s the big machine in the sky that no one knows how it works. The machine is not understood; the machine is not known; the machine is used.

Tatütata September 9, 2022 9:11 AM

Clive,

I won’t elaborate on my experience, but I bailed out from that job as droves of “management” “consultants” (note the double scare quotes) descended upon the place like locusts. Alas, one of my best decisions ever, (the pay and the hours used to be good) even though I didn’t quite know it at the time. Instinct?

Clive Robinson September 9, 2022 10:16 AM

@ Tatütata, ALL,

Re : Instinct

There are three basic types: the ones we call “Nature” (genetic) and “Nurture” (taught) are the two most commonly identified; the third is of much more interest and is in part “environment”, but not in the way most think.

I guess, although obvious, it needs stating that someone has to be first with any idea. But why any given individual?

Most will say,

“Nature or Nurture”

but that’s at best incomplete and demonstrably so (take two twins that have the same upbringing and they will both think differently, as numerous studies have shown).

We also say,

“Ideas come of age”

But again this is at best incomplete for similar reasons to the nature and nurture argument and almost bordering on trite.

So we also have the saying,

“A time and a place” or “Right time right place” etc.

Again this is just missing the point.

Our host @Bruce has in the past called it “Thinking Hinky” it’s a form of empathy where you can feel how an attacker might behave by,

“Putting yourself in their shoes”

But again it is insufficient as an argument, there is something qualitatively different, some form of instinctual pattern recognition on a very broad and hazy data set. As some have said,

“Gut feeling” or “It resonated with me”

Interestingly, it’s now kind of accepted even humans in effect have two brains in a semi “master slave” relationship. That is the one in your head is “outwards facing” by your five senses, the other is “inwards facing” and deals with the organs that support the body and is attached to your “guts”.

The two communicate in quite complex ways both neurologically and chemically and both are two way at the very least.

You might have heard the saying,

“I do my best thinking on a full stomach”

or less said but more common, the opposite of needing to eat etc.

It’s known that “hunger” can significantly heighten the senses, and it’s argued that this is a survival feature, focussing your abilities to catch or find your next meal, especially in an unforgiving or hostile environment.

Two things are sure,

1, We know next to nothing about how our brains work.
2, We know even less about how it functions with the rest of the body.

Thus having empathy for your prey and the way it behaves may well encourage subconscious pattern matching.

Thus original ideas may be down to a better developed,

“Hunter’s instinct”

Ollie Jones September 12, 2022 9:19 AM

The documents refer to Facebook’s “Data Lake”. A better name for it might be “The Great Dismal Data Swamp”.

CDog September 14, 2022 5:56 PM

Good article. It is alarming that Facebook doesn’t seem to understand the data machine that it has created. This sounds like a “Frankenstein” scenario.

foo November 25, 2022 1:24 PM

“And the thing I struggle with here is in order to find gaps in what may not be in DYI file, you would by definition need to do even more work than was done to generate the DYI files in the first place,” said “one of the most powerful and resource-flush engineering outfits in history.”

Bullshit. That’s some laziness there, and possibly some pretended laziness, rewarded by and for the benefit of profit-making.

