The main motive was the author's curiosity about the number of available tools that generate
or modify PDF files.
The report makes it possible to answer, for example, the following questions:
With access to the database, various reports can be created based on the available metadata.
Stage I - Polish websites
Polish websites were scanned in order to find and analyze PDF files. This part served as a proof of concept (PoC).
See a detailed description.
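The exact tooling of this stage is not described here, but the core check can be illustrated with a minimal sketch: verify that a crawled URL really points to a PDF by reading only the first few bytes of the response. Python and the requests library are assumptions, not necessarily the author's actual setup.

```python
# Minimal sketch (assumed tooling): decide whether a crawled URL points to a PDF
# by fetching only the first few bytes and checking the PDF magic number.
import requests

def looks_like_pdf(url: str, timeout: float = 10.0) -> bool:
    """Return True if the resource at `url` starts with the %PDF- magic bytes."""
    try:
        with requests.get(url, stream=True, timeout=timeout) as resp:
            resp.raise_for_status()
            head = next(resp.iter_content(chunk_size=8), b"")
            return head.startswith(b"%PDF-")
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(looks_like_pdf("https://example.com/document.pdf"))
```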
Stage II - Analysis of the dataset of the Common Crawl project
It is obvious that, with a single server and a download link below 100 Mbps, it is not possible
to scan the entire Internet and download all files in a short period of time.
Therefore, the author decided to use the Common Crawl project, which crawls websites
and provides the data free of charge on Amazon's servers.
An attempt was made to download and analyze the CC 2014-10 dataset, which is
36.5 TB in size (warc.gz files only) and contains approximately 55,700 files of about 685 MB each.
It quickly turned out that each warc.gz file contains only about 80 PDF documents, so the results would not be impressive.
Nevertheless, the decision was made to proceed
with the analysis and, additionally, to extract URLs pointing to PDF files.
See a detailed description.
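To make the WARC processing concrete, the sketch below iterates over one warc.gz segment, reports responses that are PDF documents, and additionally collects links ending in .pdf from HTML pages. It assumes Python and the warcio library; the segment file name and the link regular expression are illustrative only, not the author's actual code.

```python
# Sketch of scanning a Common Crawl warc.gz segment for PDF documents and PDF links.
# Assumes the warcio library; the simplified href regex misses some link forms.
import re
from warcio.archiveiterator import ArchiveIterator

PDF_LINK = re.compile(rb'href=["\']([^"\']+\.pdf)["\']', re.IGNORECASE)

def scan_warc(path: str):
    """Yield (kind, url) pairs: 'pdf' for PDF responses, 'link' for URLs ending in .pdf."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            headers = record.http_headers
            ctype = (headers.get_header("Content-Type") or "") if headers else ""
            body = record.content_stream().read()
            if "application/pdf" in ctype or body.startswith(b"%PDF-"):
                yield "pdf", url
            elif "text/html" in ctype:
                # Extract URLs pointing to .pdf files (the extra goal of stage II).
                for match in PDF_LINK.finditer(body):
                    yield "link", match.group(1).decode("utf-8", "replace")

if __name__ == "__main__":
    # Hypothetical local segment file downloaded from the Common Crawl bucket.
    for kind, url in scan_warc("CC-MAIN-2014-10-segment-00000.warc.gz"):
        print(kind, url)
```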
Stage III - Downloading and analysis of collected PDF links
The number of unique URLs collected during Stage II exceeds 26 million. See a detailed description.
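The metadata extraction that this stage implies can be sketched as follows: open each downloaded file and read its document information dictionary (Title, Creator, Producer, creation date). The pypdf library is an assumption, and writing the values into the MySQL databases is omitted.

```python
# Sketch of reading PDF metadata for stage III; assumes the pypdf library.
# Storing the extracted fields in MySQL is left out of this example.
from pypdf import PdfReader

def pdf_metadata(path: str) -> dict:
    """Return selected fields from a PDF's document information dictionary."""
    reader = PdfReader(path)
    info = reader.metadata  # may be None when the file has no /Info dictionary
    return {
        "title": info.title if info else None,
        "author": info.author if info else None,
        "creator": info.creator if info else None,    # application that made the source document
        "producer": info.producer if info else None,  # tool that generated or modified the PDF
        "created": str(info.creation_date) if info and info.creation_date else None,
        "pages": len(reader.pages),
    }

if __name__ == "__main__":
    # Hypothetical local file downloaded from one of the collected URLs.
    print(pdf_metadata("downloaded/document.pdf"))
```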
The result of the project is three MySQL databases containing metadata of unique PDF documents.
Each database contains the following fields:
Detailed information can be found in the Statistics section.