Metadata Left in Security Agency PDFs

Really interesting research:

“Exploitation and Sanitization of Hidden Data in PDF Files”

Abstract: Organizations publish and share more and more electronic documents like PDF files. Unfortunately, most organizations are unaware that these documents can compromise sensitive information like authors names, details on the information system and architecture. All these information can be exploited easily by attackers to footprint and later attack an organization. In this paper, we analyze hidden data found in the PDF files published by an organization. We gathered a corpus of 39664 PDF files published by 75 security agencies from 47 countries. We have been able to measure the quality and quantity of information exposed in these PDF files. It can be effectively used to find weak links in an organization: employees who are running outdated software. We have also measured the adoption of PDF files sanitization by security agencies. We identified only 7 security agencies which sanitize few of their PDF files before publishing. Unfortunately, we were still able to find sensitive information within 65% of these sanitized PDF files. Some agencies are using weak sanitization techniques: it requires to remove all the hidden sensitive information from the file and not just to remove the data at the surface. Security agencies need to change their sanitization methods.

Short summary: no one is doing great.

Tags: academic papers, metadata, security analysis

Posted on March 12, 2021 at 6:03 AM • 22 Comments

Comments

Clive Robinson • March 12, 2021 7:51 AM

@ Bruce, ALL,

Short summary: no one is doing great.

Well maybe,

“Paper, Paper, NEVER Data”

Should be the way to go. Print the doc out, manually redact it, then scan the result back in as an image file, and then encapsulate that in a PDF.

That way the only meta-data is from the final stage publishing process, which can be done by unclassified processes and staff in an external commercial staff agency.

a'anon • March 12, 2021 7:57 AM

I use ExifTool to remove metadata from PDF files (and pictures) before distributing them; i.e., posting to the ‘net, sending via e-mail, etc.

It produces a warning that data may be recovered.

$ exiftool -all= *.pdf
Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered! – filename.pdf
1 image files updated
$

I do not get this warning with .jpg and .png files.

Are there any recommendations for a better alternative?
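
On the warning above: ExifTool edits a PDF by appending an incremental update, so the earlier revision, with its original metadata, stays in the file. A rough way to spot such files is to count end-of-file markers. The sketch below is a heuristic only, and its function names are made up for illustration; linearized PDFs may also carry more than one marker, so a hit means "look closer", not "metadata found".

```python
def count_eof_markers(data):
    """Count %%EOF markers in raw PDF bytes (one per saved revision, roughly)."""
    return data.count(b"%%EOF")


def looks_incrementally_updated(data):
    """Heuristic: more than one %%EOF suggests an appended revision."""
    return count_eof_markers(data) > 1


# Tiny hand-made stand-ins for a PDF before and after an appended update.
original = b"%PDF-1.4\n1 0 obj\n<< /Author (Alice) >>\nendobj\n%%EOF\n"
edited = original + b"1 0 obj\n<< >>\nendobj\n%%EOF\n"

print(looks_incrementally_updated(original))
print(looks_incrementally_updated(edited))
print(b"/Author (Alice)" in edited)  # the old value is still in the bytes
```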

David Rudling • March 12, 2021 7:59 AM

This is somewhat surprising. It isn’t as if there are no tools publicly available to do the job, eg:-

ht tps://kbpdfstudio.qoppa.com/sanitize-pdf-tool-to-remove-sensitive-data/

although I don’t know how good a job it does.

Security Agencies, if no one else, could and should have access to tools certified to their Agency standards.

CdrJameson • March 12, 2021 10:06 AM

So, copy and paste it into notepad then make a new PDF from that?

Peter • March 12, 2021 11:06 AM

@Clive Robinson,CdrJameson

If you did that, the article would still rate you as failing. Re-scanning the pdf or generating the pdf on a non-secure computer would still leave meta-data.

They have no way of knowing whether the metadata is describing a secure box with access to important documents or some junk machine whose only purpose is to generate pdfs.

lurker • March 12, 2021 12:19 PM

@CdrJameson, All
How much metadata does Notepad remove from Word files? It’s so long ago I can’t remember when I first started reading the internals of Word files to get more information than the face value.

Since the PDF format is published, any app can create .pdf files by any method, and include as much or as little of its own metadata as it chooses, as well as adding pdf relevant metadata. From the paper:

The issue is that popular PDF producing tools are keeping metata by default with many other data while creating a PDF file. They provide no option for sanitization or it can only be achieved by following a complex procedure. Software producing PDF files need to enforce sanitization by default. The user should be able to add metadata only as an option.

and

We observed that if exiftool is used to sanitize a PDF file, it is still possible to recover all the metadata.

bold in original

The authors mention the use of grep to locate juicy bits, but it’s often useful to look at the context around each nugget of info.
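
A minimal sketch of that "grep with context" idea, in Python rather than grep. The key list and the function name are my own choices, not taken from the paper, and it only sees uncompressed parts of the file; anything inside a compressed stream will be missed.

```python
import re

# Common PDF Info dictionary keys; an illustrative list, not an exhaustive one.
METADATA_KEYS = (b"/Author", b"/Creator", b"/Producer",
                 b"/Title", b"/CreationDate", b"/ModDate")


def find_with_context(data, keys=METADATA_KEYS, context=40):
    """Return (offset, key, surrounding text) for each key found in raw bytes."""
    pattern = re.compile(b"|".join(re.escape(k) for k in keys))
    hits = []
    for match in pattern.finditer(data):
        start = max(0, match.start() - context)
        end = min(len(data), match.end() + context)
        # latin-1 maps every byte to a character, so decoding never fails.
        snippet = data[start:end].decode("latin-1")
        hits.append((match.start(), match.group().decode("ascii"), snippet))
    return hits


sample = b"1 0 obj\n<< /Author (Alice) /Producer (Word 2010) >>\nendobj\n"
for offset, key, snippet in find_with_context(sample, context=10):
    print(offset, key, repr(snippet))
```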

my2cents • March 12, 2021 1:09 PM

Why waste paper? Just take screenshots and put them in a new, securely created .pdf.

MK • March 12, 2021 1:30 PM

There are meta-steps in the redaction process. Otherwise one can analyze the blank or redacted areas to reconstruct text. Even printing, blacking and refrying to PDF is not sufficient:

ht tps://www.octantus.associates/redacting-documents-correctly-a-failure-case-study/

Kurt Seifried • March 12, 2021 2:15 PM

This is why I just take a screenshot of the page if I’m feeling paranoid, you’d get metadata but it’s now just time/my workstation, there’s no undo button/worrying about if I flattened all the levels, etc.

I’m always shocked that places don’t “virtual” scan (e.g. display and take an automated screenshot and repack them all into a PDF) in order to ensure what is displayed is all that gets released. It’s a scripted activity on any modern system.

Wand'anon • March 12, 2021 3:29 PM

> “Ana Above is the wicked witch from the north yet again under another handle…”

Who’s been messing up everything?
It’s been Ana all along!

Who’s been pulling every evil string?
It’s been Ana all along!

She’s insidious
(Ha-ha!)

So perfidious
That you haven’t even noticed

And the pity is
(The pity is, pity pity pity pity)

It’s too late to fix anything
Now that everything has gone wrong

Thanks to Ana
(Ha!)

Naughty Ana!
It’s been Ana all along!

SocraticGadfly • March 12, 2021 4:05 PM

Er, what about a screengrab as a PNG? Low resolution, true, but certainly avoids PDF issues.

Clive Robinson • March 12, 2021 4:12 PM

@ MK,

There are meta-steps in the redaction process.

I gave the “octantus” article you provided a quick scan.

I did not see it mention the “Hard hyphen” issue…

In many documents you have made in standard WordProc software the “word wrap” only works with words up to some character length before it “soft hyphenates” longer words.

The person typing a document can put in their own “hard hyphens” if they do not like how the document works.

The problem is “marmalade” stays “marmalade” with soft hyphens, so it gets picked up in word searches. However, with hard hyphens it can become two words such as “marm-” and “alade”, which is actually what the user typed in…

I found this out when looking for “Digital watermarks” in a document for someone… And lo and behold, a *nix “pipe” filter I had written some years ago found, after a cut and paste, short nonsense words ending in hyphens and other short nonsense words that looked like common word endings: a hard hyphenated “Skiing”. Whilst “Sk-” might be the result of somebody typing in a bit of a maths equation, “iing” did not make sense on its own.

Another annoyance is the multiplicity of apostrophes in some character sets. They might look the same on a printed page to a casual glance, but they are not the same character.

@ ALL

So if you are cut-n-pasting use an intermediate that only has “7-bit ASCII” as opposed to one, four or more bytes per character…

Some people use such things to not just watermark a document but put a unique serial number in a printed page… Something Whistleblowers need to be aware of.
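
For the "7-bit ASCII" intermediate, a small sketch of the idea: list every character outside the ASCII range so look-alike apostrophes and invisible hyphens become visible, then force the text down to ASCII. The function names are hypothetical, and replacing with "?" is just one choice; it loses information on purpose.

```python
import unicodedata


def non_ascii_report(text):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [(i, ch, unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(text) if ord(ch) > 127]


def to_ascii(text):
    """Replace every non-ASCII character with '?' so nothing hides in the text."""
    return text.encode("ascii", errors="replace").decode("ascii")


# A curly apostrophe and an invisible soft hyphen (U+00AD) in otherwise plain text.
pasted = "it\u2019s marm\u00adalade"
for index, ch, name in non_ascii_report(pasted):
    print(index, hex(ord(ch)), name)
print(to_ascii(pasted))
```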

MarkH • March 12, 2021 5:13 PM

@MK, Clive:

Nothing I saw in the octantus article suggests that blackout-print-scan is not a fully effective technique for the redaction of selected content.

What it does warn about is that if the process of selecting content to be redacted is lazy or sloppy, you’ll likely let sensitive content go through. That problem is not primarily a technical one.

Where security is critical, assurance levels must be kept very high.

Apokrif • March 12, 2021 9:06 PM

@my2cents: perhaps one is never sure that some user name, file path (which might include sensitive info), or information about the computer environment, cannot appear in the resulting file? Intellipedia pages (on theblackvault.com) released by the Department of Defense seemed to have been printed on paper.

my2cents • March 13, 2021 12:46 AM

@Apokrif: the same applies to scans. It always depends on how secure your environment (OS, software, hidden cameras in your living room (;-)), …) is.

David Leppik • March 13, 2021 12:11 PM

If you print and scan, you’re just replacing old metadata with new metadata from your scanning app, which might be just as bad. What you really want is a tool specifically designed to remove metadata.

lurker • March 13, 2021 1:06 PM

@David Leppik
What I want is tools that don’t embed metadata in the first place. Unfortunately once filesystems start requiring filename, creation/modification date and type, then you’re off down the slippery slope with extra pockets for fob watch, cigar cutter, &c.

JonKnowsNothing • March 13, 2021 3:01 PM

@lurker @David @All

re: tools that don’t embed metadata in the first place.

Don’t forget about Undo-Redo…

Long ago those audit trails were fingered as security issues.

Along with “Group Edit / Approval” features that generate even bigger audit breadcrumbs.

Clive Robinson • March 13, 2021 4:34 PM

@ David Leppik, ALL,

If you print and scan, you’re just replacing old metadata with new metadata from your scanning app

Not exactly. You are, if you do it right, replacing the old “confidential metadata” with the new “unclassified metadata”.

Part of this is to use a scanned image format from last century. Back then many image formats did not contain meta-data. A PDF file will always contain considerable meta-data; it is “built in” to the specification, which Adobe appear to augment as fast as possible, so you will not be able to keep up with “what to strip out or make neutral”.

Personally I would not use PDF but a simplified, hand-built Office Open XML document[1], which Microsoft Word since 2007 calls a .DOCX file.

Or even simpler to create, just create a folder with a main.html that has “./” links to the other files in the folder, use touch to correct the dates and other file meta-data, and build a zip archive.

If the recipient claims to be a technical numb-nuts just write a simple shell script to extract the file(s).

These ways you can control the meta-data going into the file. Building a shell script or simple C program to do the grunt work is not difficult. And if it is in a standard format like DOCX, it passes the “electronic discovery” nonsense, which is usually not about the document contents but getting their hands on meta-data, which you most definitely want to stop as hard and as brutally as you can.
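
A minimal sketch of the folder-plus-zip idea, in Python rather than the shell script or C program mentioned above. Instead of running touch on real files, it writes the archive from memory and sets every entry's timestamp and attributes explicitly, which is a swapped-in technique with the same goal. The function name, the fixed date and the permission bits are my assumptions.

```python
import io
import zipfile

# 1980-01-01 is the earliest timestamp the zip format can store.
FIXED_TIME = (1980, 1, 1, 0, 0, 0)


def build_clean_zip(files):
    """Build a zip from {name: bytes} with fixed per-entry metadata."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):  # stable order, independent of the caller
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 3            # report "Unix" on every platform
            info.external_attr = 0o644 << 16  # plain rw-r--r-- file permissions
            archive.writestr(info, files[name])
    return buffer.getvalue()


bundle = build_clean_zip({
    "main.html": b'<a href="./notes.txt">notes</a>',
    "notes.txt": b"hello",
})
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    for entry in archive.infolist():
        print(entry.filename, entry.date_time)
```

Because nothing is read from the filesystem, no local modification times, owner IDs or absolute paths can reach the archive; only what is passed in gets written.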

[1] OOXML is standardized by both the ISO and IEC (as ISO/IEC 29500 in later versions). Originally developed by Microsoft at the insistence of the US Government. Basically it’s a zipped XML-based file format. Where the zip is a standard zip archive and the XML a manifest and build of the container contents. You can download and read the whole four part specification but at just over 7000 pages for the ECMA version you might want to settle for just Part 2 or one of a number of online guides telling you how to build compatible files. A simplified one that will give you a feeling for what’s involved is,

https://docs.fileformat.com/word-processing/docx/

To experiment, use Word to create a .DOCX file with just a couple of gifs in it, then open it as a zip archive to view the contents. This will give you further info to fairly simply reverse engineer things to your own “controlled meta-data” requirements.
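
That experiment can also be scripted. The sketch below lists the archive members of a .DOCX and pulls out the core properties part, which is where fields such as the creator and last-modified-by names normally live. The helper name is hypothetical, and the sample file is a hand-made stand-in holding only that one part, not a document Word would open.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CORE_PART = "docProps/core.xml"  # core properties part of an OOXML package


def docx_core_properties(docx_bytes):
    """Return (member names, {property: value}) for a .docx given as bytes."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        names = archive.namelist()
        props = {}
        if CORE_PART in names:
            root = ET.fromstring(archive.read(CORE_PART))
            for child in root:
                # Strip the "{namespace}" prefix ElementTree adds to tag names.
                props[child.tag.split("}", 1)[-1]] = child.text
    return names, props


# Hand-made stand-in for a real .docx: only the core properties part.
core_xml = (
    b'<cp:coreProperties '
    b'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<dc:creator>Alice</dc:creator>'
    b'<cp:lastModifiedBy>Bob</cp:lastModifiedBy>'
    b'</cp:coreProperties>'
)
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, "w") as archive:
    archive.writestr(CORE_PART, core_xml)

names, props = docx_core_properties(fake_docx.getvalue())
print(names)
print(props)
```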

SpaceLifeForm • March 13, 2021 5:53 PM

@ Clive, David Leppik

Not exactly. You are, if you do it right, replacing the old “confidential metadata” with the new “unclassified metadata”.

I doubt the yellow dots are “unclassified”.

Ollie Jones • March 14, 2021 9:08 AM

This kind of information leak damages all kinds of users, not just security agencies.

Let’s hope the developers / publishers of PDF software and PDF print drivers take this article as a call to action.

How about this? A checkbox on print-to-PDF dialogs saying

[ ] include document metadata (this may be insecure)

along with a popup warning when a user checks it

People may be able to determine your identity or
other information from metadata in your PDF files.

Do you really want to include document metadata?

Don’t ask me again [ ] Yes No

This would probably handle the most common use cases. And, it would raise awareness of the metadata problem among overworked and underpaid knowledge workers.

Furthermore, Adobe should open-source and publish a freely available utility program to do what Acrobat does when one of their customers sanitizes a PDF. (Acrobat does a decent job of sanitizing when you remember to use that feature.)

And maybe the popular content management systems (Google docs, WordPress, etc) should gain code to check for the presence of metadata in uploaded documents.

JonKnowsNothing • March 14, 2021 10:25 AM

@Ollie Jones @All

re: How about this? A checkbox on print-to-PDF dialogs…

Check box is nice but how do you know what the publishers will really do with the excised MetaData?

Someone knowledgeable might be able to verify that the metadata didn’t leave your PC/System and that there are no hidden codes to tag your PDF as a “suspicious document” as it makes its way across the digital highway.

In the USA we have LEAs that have the authority to demand companies do their bidding, by way of National Security Letters (NSL). With an NSL demand, the publisher can flag the extracted metadata and send it via any of the many telemetry feeds to a “safe innocuous collector site” that in turn feeds into other sites. Once any point on the telemetry feed has been hacked the bad sites will get the data too.

Some LEAs define “relevant” to mean “all” and have the ability to store all the data they want, forever and ever and ever.

This is the polite method. Other countries do not have to go down the polite path.

Recently, Firefox decided to “update” their printing panel. The old panel worked fine. Afaik they added no new functionality, just rearranged the layout and updated the UI to have that Box Look that takes up so much space they needed Spandex-V drop down boxes resulting in about 10 extra clicks to access all the primary print options.

There are more print options than most people ever look at after they set things up. If the tick box is set way down in the depths of the 2nd, 3rd or 4th layers of Spandex-V options, will anyone find it, much less use it?

Schneier on Security
