About the project

1. In brief

The project was developed to gather metadata from PDF documents available on the Internet.

2. What is metadata?

Metadata summarizes basic information about a document's properties. For the purposes of the project the following fields were used (a sketch of how they can be extracted follows the list):
  • Author
  • Date of modification
  • Date of creation
  • Page format (A4, A3)
  • Optimization status (whether the file is linearized for fast web view)
  • Number of pages
  • Producer - the software that generated the PDF
  • Size of document
  • Creator - the application in which the original document was created
  • Title
  • Version of PDF
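
This overview does not name the extraction tool, but the pdfinfo utility from the poppler-utils package reports exactly these fields, so a minimal extraction sketch (assuming pdfinfo is installed and on the PATH) could look like this:

    import subprocess

    def pdf_metadata(path):
        """Run pdfinfo (poppler-utils) and parse its "Key: value" output."""
        out = subprocess.run(["pdfinfo", path], capture_output=True,
                             text=True, check=True).stdout
        meta = {}
        for line in out.splitlines():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
        return meta

    # e.g. pdf_metadata("report.pdf") -> {"Author": "...", "Pages": "12", ...}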

3. Motivation and purpose of the report

The main motivation was the author's curiosity about the number of available tools that generate or modify PDF files.
The report makes it possible to answer questions such as the following:

  • Which producer has the largest share in the generation of PDF files?
  • Which software is most often used to generate PDF files?
  • Which version of the PDF specification is the most common?
  • What is the most common number of pages in PDF files?

With access to the databases, various reports can be created from the available metadata; one illustration follows.
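
A hypothetical report query answering the first question above (the connection details and the table name documents are assumptions; the field names are those listed in the Summary; the mysql-connector-python package is used):

    import mysql.connector  # assumes the mysql-connector-python package

    # Hypothetical connection details and table name.
    conn = mysql.connector.connect(host="localhost", user="pdf",
                                   password="secret", database="pdfmeta")
    cur = conn.cursor()

    # Which producer has the largest share in the generation of PDF files?
    cur.execute("""
        SELECT Producer, COUNT(*) AS cnt
        FROM documents
        GROUP BY Producer
        ORDER BY cnt DESC
        LIMIT 10
    """)
    for producer, cnt in cur:
        print(producer, cnt)
    conn.close()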

4. Realization of the project

Stage I - Polish websites

Polish websites were scanned in order to find and analyze PDF files. This part served as a proof of concept.
See a detailed description.
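
The crawler itself is covered in the detailed description; a minimal sketch of one of its building blocks, collecting PDF links from a single page (assuming the requests and beautifulsoup4 packages), might look like:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def pdf_links(page_url):
        """Return the absolute URLs of all PDF files linked from one page."""
        html = requests.get(page_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return {urljoin(page_url, a["href"])
                for a in soup.find_all("a", href=True)
                if a["href"].lower().endswith(".pdf")}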

Stage II - Analysis of the dataset of the Common Crawl project

Obviously, with a single server and a downlink below 100 Mbps, it is not possible to scan the entire Internet and download all its files in a short period of time.
Therefore, the author decided to use the Common Crawl project, which crawls websites and provides the resulting data free of charge on Amazon's servers.

An attempt was made to download and analyze the CC 2014-10 dataset, which is 36.5 TB in size (warc.gz files only) and contains 55,700 files of approx. 685 MB each.
It quickly turned out that each warc.gz file contains only approx. 80 PDF documents, so the results would not be impressive. Nevertheless, the decision was made to proceed
with the analysis and, additionally, to extract URLs of PDF files.
See a detailed description.
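
The processing pipeline is covered in the detailed description; as a sketch of the kind of work involved, the following yields the PDF responses (and their URLs) contained in a single warc.gz file, assuming the warcio package:

    from warcio.archiveiterator import ArchiveIterator

    def pdf_records(warc_gz_path):
        """Yield (url, payload) for every PDF response in one warc.gz file."""
        with open(warc_gz_path, "rb") as fh:
            for record in ArchiveIterator(fh):
                if record.rec_type != "response":
                    continue
                ctype = record.http_headers.get_header("Content-Type") or ""
                if "application/pdf" in ctype:
                    url = record.rec_headers.get_header("WARC-Target-URI")
                    yield url, record.content_stream().read()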

Stage III - Downloading and analysis of collected PDF links

The number of unique URLs collected during Stage II exceeds 26 million. See a detailed description.
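
Fetching that many files one by one would take months, so some concurrency is unavoidable; a sketch of a threaded downloader (the requests package and the worker count are assumptions):

    import concurrent.futures
    import requests

    urls = ["https://example.com/a.pdf"]  # placeholder; Stage II produced >26M links

    def fetch(url):
        """Download one PDF; return (url, bytes) or None on failure."""
        try:
            r = requests.get(url, timeout=60)
            r.raise_for_status()
            return url, r.content
        except requests.RequestException:
            return None

    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        for result in pool.map(fetch, urls):
            if result:
                url, data = result
                # extract metadata here (cf. pdf_metadata above) and insert into MySQL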

5. Summary

The result of the project is three MySQL databases containing metadata of unique PDF documents:

  • The database of Polish websites - 787 MB
  • The database of files found in the Common Crawl dataset - 659 MB
  • The database built from the URLs extracted from the CC 2014-10 dataset - ~10 GB

Each database contains the following fields (a hypothetical table definition follows the list):

  • Author - author of the document
  • Creator - the application in which the original document was created
  • CreationDate - document creation date
  • Encrypted - information about file encryption
  • FileSize - document size in bytes
  • ModDate - modification date of the document
  • Pages - number of pages
  • PageSize - page format
  • PdfVersion - version of PDF
  • Producer - the software that produced the PDF
  • Optimized - information about optimization
  • Title - document title
  • Fhash - SHA-1 checksum of the file
  • Url - link to the file
  • Date - date of insertion into the database
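
For illustration, a hypothetical MySQL table definition matching these fields (the column types and the table name documents are assumptions; only the field names come from the project):

    import mysql.connector  # assumes the mysql-connector-python package

    DDL = """
    CREATE TABLE documents (
        Author       VARCHAR(255),
        Creator      VARCHAR(255),
        CreationDate DATETIME,
        Encrypted    TINYINT(1),
        FileSize     BIGINT,
        ModDate      DATETIME,
        Pages        INT,
        PageSize     VARCHAR(64),
        PdfVersion   VARCHAR(8),
        Producer     VARCHAR(255),
        Optimized    TINYINT(1),
        Title        VARCHAR(512),
        Fhash        CHAR(40) UNIQUE,  -- SHA-1 hex digest; enforces uniqueness
        Url          TEXT,
        Date         TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """

    conn = mysql.connector.connect(host="localhost", user="pdf",
                                   password="secret", database="pdfmeta")
    conn.cursor().execute(DDL)
    conn.close()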

Detailed information can be found in the Statistics section.