PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
Currently, secured documents are not supported.
This library is still under active development. As a result, users should expect backward-compatibility (BC) breaks when using the master branch.
This project is supported by Actualys.
Prerequisites
This library requires PHP 5.3 or later.
PdfParser is built on top of the TCPDF parser.
TCPDF is downloaded automatically when you install PdfParser with Composer.
Installation
Using Composer
Add PdfParser to your composer.json file:
{
    "require": {
        "smalot/pdfparser": "*"
    }
}
Now tell Composer to download the library by running:
$ composer update smalot/pdfparser
As standalone library
First, download the library from GitHub, choosing either a specific release or the master branch.
Then unzip it and run the following Composer command:
$ composer update
This command will download any dependencies (Atoum library) and create the 'autoload.php' file.
Now create a new file in the same folder with this content:
<?php
// Include 'Composer' autoloader.
include 'vendor/autoload.php';
// Your code
// ...
?>
Unit tests with Atoum
Run the Atoum unit tests (with code coverage, if Xdebug is installed):
$ vendor/bin/atoum -d vendor/smalot/pdfparser/src/Smalot/PdfParser/Tests/
Once this command has finished, the "coverage/" folder will contain HTML pages with a code coverage summary.
Usage
This sample parses the whole PDF file and extracts the text from every page.
<?php
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();
echo $text;
?>
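Because secured documents are not supported, a script that processes arbitrary files may want to guard the parse call. The following is only a sketch: it assumes that the parser signals a file it cannot handle by throwing an exception, and extractTextOrNull is a hypothetical helper, not part of PdfParser.

```php
<?php
/**
 * Hypothetical helper (not part of PdfParser): returns the document text,
 * or null when parsing fails.
 *
 * Assumption: the parser throws an exception for files it cannot handle,
 * such as secured documents.
 *
 * @param object $parser   Any object exposing parseFile(), e.g. \Smalot\PdfParser\Parser.
 * @param string $filename Path to the PDF file.
 *
 * @return string|null
 */
function extractTextOrNull($parser, $filename)
{
    try {
        $pdf = $parser->parseFile($filename);

        return $pdf->getText();
    } catch (\Exception $e) {
        // Parsing failed; let the caller decide how to report it.
        return null;
    }
}

// Usage, once the Composer autoloader is included:
// $text = extractTextOrNull(new \Smalot\PdfParser\Parser(), 'document.pdf');
// echo $text === null ? "Could not parse document.pdf\n" : $text;
```

Returning null keeps the calling code simple; a script that needs the failure reason could log $e->getMessage() inside the catch block instead.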
You can also extract text from each page individually, or from a specific page.
<?php
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
// Retrieve all pages from the pdf file.
$pages = $pdf->getPages();
// Loop over each page to extract text.
foreach ($pages as $page) {
    echo $page->getText();
}
?>
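The loop above visits every page. For a specific page, one option is to index into the array returned by getPages(). This sketch assumes the array lists pages in document order; getPageText is a hypothetical helper, not a library method.

```php
<?php
/**
 * Hypothetical helper (not part of PdfParser): returns the text of one page,
 * or null if the document has no such page.
 *
 * Assumption: $pages lists the pages in document order.
 *
 * @param array $pages      Page objects, e.g. the result of $pdf->getPages().
 * @param int   $pageNumber 1-based page number.
 *
 * @return string|null
 */
function getPageText(array $pages, $pageNumber)
{
    // Re-index so the lookup does not depend on the original array keys.
    $pages = array_values($pages);
    $index = $pageNumber - 1;

    if (!isset($pages[$index])) {
        return null;
    }

    return $pages[$index]->getText();
}

// Usage, with $pdf obtained from $parser->parseFile('document.pdf'):
// echo getPageText($pdf->getPages(), 2);
```

The helper takes a 1-based page number because that matches how readers usually refer to pages; callers that prefer zero-based indexing can read the array directly.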
Here is sample code that extracts metadata from the document (Author, Creator, CreationDate, ...).
<?php
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
// Retrieve all details from the pdf file.
$details = $pdf->getDetails();
// Loop over each property to extract values (string or array).
foreach ($details as $property => $value) {
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo $property . ' => ' . $value . "\n";
}
?>
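When only one property is needed, the array returned by getDetails() can be read directly. The sketch below assumes that properties are array keys such as 'Author', as the loop above suggests, and that a given document may omit any of them; getDetail is a hypothetical helper, not a library method.

```php
<?php
/**
 * Hypothetical helper (not part of PdfParser): reads one metadata property
 * and falls back to a default when the document does not define it.
 *
 * @param array  $details  Metadata, e.g. the result of $pdf->getDetails().
 * @param string $property Property name, such as 'Author'.
 * @param string $default  Value returned when the property is missing.
 *
 * @return string
 */
function getDetail(array $details, $property, $default = '')
{
    if (!array_key_exists($property, $details)) {
        return $default;
    }

    $value = $details[$property];

    // Values may be strings or arrays; flatten arrays into one string.
    return is_array($value) ? implode(', ', $value) : (string) $value;
}

// Usage, with $pdf obtained from $parser->parseFile('document.pdf'):
// echo getDetail($pdf->getDetails(), 'Author', 'unknown author');
```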