Pdf-parser
Description
pdf-parser is a Python script written by Didier Stevens that parses a PDF document to identify the fundamental elements used in the analyzed file.
Installation
$ cd /data/src/
$ wget http://didierstevens.com/files/software/pdf-parser_V0_4_3.zip
$ unzip pdf-parser_V0_4_3.zip
$ chmod +x pdf-parser.py
Usage
Syntax
Usage: pdf-parser.py [options] pdf-file
Options
- --version
  show program's version number and exit
- -h, --help
  show this help message and exit
- -s SEARCH, --search=SEARCH
  string to search in indirect objects (except streams)
- -f, --filter
  pass stream object through filters (FlateDecode, ASCIIHexDecode, ASCII85Decode, LZWDecode and RunLengthDecode only)
- -o OBJECT, --object=OBJECT
  id of indirect object to select (version independent)
- -r REFERENCE, --reference=REFERENCE
  id of indirect object being referenced (version independent)
- -e ELEMENTS, --elements=ELEMENTS
  type of elements to select (cxtsi)
- -w, --raw
  raw output for data and filters
- -a, --stats
  display stats for pdf document
- -t TYPE, --type=TYPE
  type of indirect object to select
- -v, --verbose
  display malformed PDF elements
- -x EXTRACT, --extract=EXTRACT
  filename to extract to
- -H, --hash
  display hash of objects
- -n, --nocanonicalizedoutput
  do not canonicalize the output
- -d DUMP, --dump=DUMP
  filename to dump stream content to
- -D, --debug
  display debug info
- -c, --content
  display the content for objects without streams or with streams without filters
- --searchstream=SEARCHSTREAM
  string to search in streams
- --unfiltered
  search in unfiltered streams
- --casesensitive
  case sensitive search in streams
- --regex
  use regex to search in streams
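To give an idea of what the --filter option does in the FlateDecode case, here is a minimal Python sketch. It is not pdf-parser's own code: it only relies on the fact that a FlateDecode stream is zlib/deflate data, and the function name flate_decode and the sample bytes are assumptions made for illustration.

```python
import zlib


def flate_decode(stream_bytes: bytes) -> bytes:
    """Inflate the raw bytes of a /FlateDecode stream (zlib/deflate data)."""
    return zlib.decompress(stream_bytes)


# Hypothetical stand-in for the bytes found between "stream" and "endstream".
original = b"var b = this.creator;"
compressed = zlib.compress(original)

decoded = flate_decode(compressed)
print(decoded.decode("ascii"))
```

The other filters listed for --filter (ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode) would each need their own decoder; this sketch covers FlateDecode only.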
Example
Confirm presence of JavaScript
With pdfid, we were able to detect the presence of JavaScript in the PDF file.
Highlight links between objects
Using pdf-parser
Let's use pdf-parser to dig deeper into this PDF file.
$ ./pdf-parser.py --search=javascript jsunpack-n-read-only/samples/pdf-thisCreator.file
obj 3 0
 Type:
 Referencing: 5 0 R

  <<
    /JavaScript 5 0 R
  >>

obj 6 0
 Type:
 Referencing: 111611 0 R

  <<
    /JS 111611 0 R
    /S /JavaScript
  >>
The above command shows the links between objects 3 and 5 on one hand and 6 and 111611 on the other hand. Let's see whether object 5 is linked with other objects:
$ ./pdf-parser.py --object=5 jsunpack-n-read-only/samples/pdf-thisCreator.file
obj 5 0
 Type:
 Referencing: 6 0 R

  <<
    /Names [(A)6 0 R ]
  >>
Object 5 is linked to object 6, which completes the map: 3 -> 5 -> 6 -> 111611.
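The chain traced by hand above can also be rebuilt programmatically. The following Python sketch parses pdf-parser-style text and follows the "Referencing:" lines from one object to the next. The sample text is copied from the two commands above with whitespace simplified; the helper names parse_edges and follow_chain are hypothetical and are not part of pdf-parser.

```python
import re

# Output copied from the two commands above (dictionaries omitted).
SAMPLE = """\
obj 3 0
 Type:
 Referencing: 5 0 R
obj 6 0
 Type:
 Referencing: 111611 0 R
obj 5 0
 Type:
 Referencing: 6 0 R
"""


def parse_edges(text):
    """Return {object id: [referenced object ids]} from pdf-parser-style text."""
    edges = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"obj (\d+) (\d+)$", line)
        if header:
            current = int(header.group(1))
            edges.setdefault(current, [])
        elif line.startswith("Referencing:") and current is not None:
            # Each reference looks like "<id> <generation> R".
            refs = re.findall(r"(\d+) \d+ R", line)
            edges[current].extend(int(r) for r in refs)
    return edges


def follow_chain(edges, start):
    """Follow the first reference of each object until a leaf or a cycle."""
    chain = [start]
    seen = {start}
    while edges.get(chain[-1]):
        nxt = edges[chain[-1]][0]
        if nxt in seen:
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


edges = parse_edges(SAMPLE)
print(" -> ".join(str(n) for n in follow_chain(edges, 3)))
# prints: 3 -> 5 -> 6 -> 111611
```

Following only the first reference is enough for this sample because each object references a single target; a document with branching references would need a full graph traversal.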
Using pdfobjflow
Using pdfobjflow offers a quicker way to obtain the map:
$ ./pdf-parser.py /data/tools/jsunpack-n-read-only/samples/pdf-thisCreator.file | ./pdfobjflow.py
$ eog pdfobjflow.png
[Figure: object map generated by pdfobjflow (pdfobjflow.png)]
Decompress JavaScript
Now, let's decompress the JavaScript contained in object 111611 with the --filter and --raw options:
$ ./pdf-parser.py --object=111611 --filter --raw jsunpack-n-read-only/samples/pdf-thisCreator.file > out.js
$ cat out.js
obj 111611 0
 Type:
 Referencing:
 Contains stream

  <<
    /Filter /FlateDecode
    /Length 142
  >>

/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/var b/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/=/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/ this.creator;/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/var a/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/=/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/ unescape(/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/b/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/);/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/eval( /*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/unescape(/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/this.creator.replace(/z/igm,'%') /*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/)/*fjudfs4FSf4ZX <POFRNFSdfnjrfnc> SaKsonifbdh*/);
The above command reveals obfuscated JavaScript code: the real statements are padded with junk comments. Piping the output through a few commands helps decode it:
$ tail -n +11 out.js | js_beautify - | grep -v "^\/\*" | indent
var b = this.creator;
var a = unescape (b);
eval (unescape (this.creator.replace (/z / igm, '%')));
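The same clean-up can be sketched in Python, which also makes it possible to mimic what the script does with this.creator: replace every "z" with "%" and unescape the result. This is a rough illustration under stated assumptions. The function names are hypothetical, the comment-stripping regex is naive (it would also match "/*" inside a string literal), urllib.parse.unquote handles %XX sequences but not the %uXXXX form that JavaScript's unescape also accepts, and the Creator value below is made up rather than taken from the sample file.

```python
import re
from urllib.parse import unquote


def strip_block_comments(js: str) -> str:
    """Remove /* ... */ comments, the padding used in the script above."""
    return re.sub(r"/\*.*?\*/", "", js, flags=re.DOTALL)


def decode_creator(creator: str) -> str:
    """Mimic unescape(creator.replace(/z/igm, '%')) for %XX sequences only."""
    percent_encoded = re.sub(r"z", "%", creator, flags=re.IGNORECASE)
    return unquote(percent_encoded, encoding="latin-1")


# Shortened stand-in for the junk-padded script shown above.
obfuscated = "/*junk*/var b/*junk*/=/*junk*/ this.creator;"
print(strip_block_comments(obfuscated))
# prints: var b= this.creator;

# Hypothetical Creator value, NOT taken from the sample file.
print(decode_creator("z61z6cz65z72z74z28z31z29"))
# prints: alert(1)
```

To analyze the real sample, the actual /Creator string would first have to be extracted from the document's Info object and passed to decode_creator in place of the made-up value.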