The main motive was the author's curiosity about the number of available tools that generate
or modify PDF files.
The report makes it possible to answer, for example, the following questions:
With access to the database, various reports can be created based on the available metadata.
Stage I - Polish websites
Polish websites were scanned in order to find and analyze PDF files. This part served as a proof of concept (PoC).
See a detailed description.
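The exact tooling of this stage is not described here, but the core check can be illustrated with a minimal sketch: verify that a crawled URL really points to a PDF by reading only the first few bytes of the response. Python and the requests library are assumptions, not necessarily the author's actual setup.

```python
# Minimal sketch (assumed tooling): decide whether a crawled URL points to a PDF
# by fetching only the first few bytes and checking the PDF magic number.
import requests

def looks_like_pdf(url: str, timeout: float = 10.0) -> bool:
    """Return True if the resource at `url` starts with the %PDF- magic bytes."""
    try:
        with requests.get(url, stream=True, timeout=timeout) as resp:
            resp.raise_for_status()
            head = next(resp.iter_content(chunk_size=8), b"")
            return head.startswith(b"%PDF-")
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(looks_like_pdf("https://example.com/document.pdf"))
```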
Stage II - Analysis of the dataset of the Common Crawl project
It is obvious that, with a single server and a download link below 100 Mbps, it is not possible
to scan the entire Internet and download all files in a short period of time.
Therefore, the author decided to use the Common Crawl project, which crawls websites
and provides the data free of charge on Amazon's servers.
An attempt was made to download and analyze the CC 2014-10 dataset, which is
36.5 TB in size (warc.gz files only) and contains approximately 55,700 files of about 685 MB each.
It quickly turned out that each warc.gz file contains only about 80 PDF documents, so the results would not be impressive.
Nevertheless, the decision was made to proceed
with the analysis and, additionally, to extract URLs pointing to PDF files.
See a detailed description.
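To make the WARC processing concrete, the sketch below iterates over one warc.gz segment, reports responses that are PDF documents, and additionally collects links ending in .pdf from HTML pages. It assumes Python and the warcio library; the segment file name and the link regular expression are illustrative only, not the author's actual code.

```python
# Sketch of scanning a Common Crawl warc.gz segment for PDF documents and PDF links.
# Assumes the warcio library; the simplified href regex misses some link forms.
import re
from warcio.archiveiterator import ArchiveIterator

PDF_LINK = re.compile(rb'href=["\']([^"\']+\.pdf)["\']', re.IGNORECASE)

def scan_warc(path: str):
    """Yield (kind, url) pairs: 'pdf' for PDF responses, 'link' for URLs ending in .pdf."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            headers = record.http_headers
            ctype = (headers.get_header("Content-Type") or "") if headers else ""
            body = record.content_stream().read()
            if "application/pdf" in ctype or body.startswith(b"%PDF-"):
                yield "pdf", url
            elif "text/html" in ctype:
                # Extract URLs pointing to .pdf files (the extra goal of stage II).
                for match in PDF_LINK.finditer(body):
                    yield "link", match.group(1).decode("utf-8", "replace")

if __name__ == "__main__":
    # Hypothetical local segment file downloaded from the Common Crawl bucket.
    for kind, url in scan_warc("CC-MAIN-2014-10-segment-00000.warc.gz"):
        print(kind, url)
```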
Stage III - Downloading and analysis of collected PDF links
The number of unique URLs collected during Stage II exceeds 26 million. See a detailed description.
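The metadata extraction that this stage implies can be sketched as follows: open each downloaded file and read its document information dictionary (Title, Creator, Producer, creation date). The pypdf library is an assumption, and writing the values into the MySQL databases is omitted.

```python
# Sketch of reading PDF metadata for stage III; assumes the pypdf library.
# Storing the extracted fields in MySQL is left out of this example.
from pypdf import PdfReader

def pdf_metadata(path: str) -> dict:
    """Return selected fields from a PDF's document information dictionary."""
    reader = PdfReader(path)
    info = reader.metadata  # may be None when the file has no /Info dictionary
    return {
        "title": info.title if info else None,
        "author": info.author if info else None,
        "creator": info.creator if info else None,    # application that made the source document
        "producer": info.producer if info else None,  # tool that generated or modified the PDF
        "created": str(info.creation_date) if info and info.creation_date else None,
        "pages": len(reader.pages),
    }

if __name__ == "__main__":
    # Hypothetical local file downloaded from one of the collected URLs.
    print(pdf_metadata("downloaded/document.pdf"))
```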
The result of the project is three MySQL databases containing metadata of unique PDF documents.
Each database contains the following fields:
Detailed information can be found in the Statistics section.