Dieselpoint Search is well suited to searching large collections of PDF documents. The system automatically parses each PDF, extracts its metadata and text, and adds them to the index.

The PDF parser works in conjunction with the crawler. PDFs that are part of a website or that sit in a directory structure can be found and processed automatically.

Metadata such as author, title, and date can be treated like regular text and searched. It can also be used to build sophisticated interfaces using Dieselpoint’s Search and Navigation technology. For example, a search request can return not only the top results, but also all of the authors, the number of documents associated with each author, the categories that the documents belong to, the number of documents in each category, and a variety of other information.
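
To make the idea of per-author and per-category counts concrete, here is a minimal sketch in Java. It is not Dieselpoint's API: the `FacetSketch` class, the `Doc` type, and the `countBy` method are hypothetical names invented for illustration, and the sketch assumes the matching documents are already available in memory.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration only; these types are not part of Dieselpoint Search.
final class FacetSketch {

    // A search hit carrying the metadata fields discussed above.
    static final class Doc {
        final String title;
        final String author;
        final String category;

        Doc(String title, String author, String category) {
            this.title = title;
            this.author = author;
            this.category = category;
        }
    }

    // Counts how many documents share each value of the chosen metadata field.
    // A TreeMap keeps the values in alphabetical order for display.
    static Map<String, Integer> countBy(List<Doc> results, Function<Doc, String> field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : results) {
            counts.merge(field.apply(doc), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `FacetSketch.countBy(results, d -> d.author)` on three hits written by Ann, Ann, and Bob would yield a map of Ann to 2 and Bob to 1; passing `d -> d.category` instead gives the category counts. A production search engine would normally compute such counts inside the index rather than by looping over results.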

All parsing and manipulation of PDFs is done in Java, eliminating the need for non-Java third-party tools.

Smart Titles
Quite often, authors of PDFs neglect to enter titles into the document’s metadata. This makes it difficult to display a good, descriptive title when a PDF appears on a search results page. Dieselpoint Search eliminates this problem by providing “Smart Titles”.

The system analyzes each PDF, looking for clues as to what the title might be, and employs advanced heuristics to select one. Studies show that Dieselpoint’s algorithm selects the same title that a human would have chosen more than 90% of the time.
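
Dieselpoint's actual heuristics are not described here, so the following is only a rough sketch of what a title-guessing heuristic can look like. Everything in it is an assumption made for illustration: the `TitleGuessSketch` class, the `Line` type, the length limits, and the rule "prefer the largest text on the first page".

```java
import java.util.List;

// Hypothetical illustration only; this is not Dieselpoint's Smart Titles algorithm.
final class TitleGuessSketch {

    // One line of text from the first page, with the font size it was set in.
    static final class Line {
        final String text;
        final double fontSize;

        Line(String text, double fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    static String guessTitle(String metadataTitle, List<Line> firstPageLines, String fileName) {
        // 1. Trust the metadata title when the author supplied one.
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        // 2. Otherwise take the plausible line set in the largest font.
        //    On a tie, the earliest line wins because the comparison is strict.
        Line best = null;
        for (Line line : firstPageLines) {
            String text = line.text.trim();
            boolean plausible = text.length() >= 3
                    && text.length() <= 150
                    && text.chars().anyMatch(Character::isLetter);
            if (plausible && (best == null || line.fontSize > best.fontSize)) {
                best = line;
            }
        }
        if (best != null) {
            return best.text.trim();
        }
        // 3. Fall back to the file name without its extension.
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }
}
```

A real implementation would likely weigh more signals, such as position on the page, capitalization, and repeated headers, which is why the sketch should be read as a starting point rather than a description of the product.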

Extensible Metadata Platform (XMP)
XMP is a metadata standard whose data is stored as an XML packet embedded within a PDF file. It lets PDFs carry a rich data source, which can include information about authors, digital rights, categories, and keywords, to name a few examples.

Dieselpoint Search automatically extracts and indexes XMP data, making it possible to search and navigate PDF document collections using this information.
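
As a rough illustration of what reading XMP involves, the sketch below pulls Dublin Core values such as `creator` or `subject` out of an XMP packet using only the XML parser that ships with Java. It is not Dieselpoint's implementation: the `XmpSketch` class and `values` method are hypothetical, the packet is assumed to have been extracted from the PDF already, and only properties stored as `rdf:li` list items are handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Dieselpoint Search.
final class XmpSketch {
    // Standard namespace URIs used inside XMP packets.
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns every rdf:li value found under the given Dublin Core property,
    // for example values(xmp, "creator") or values(xmp, "subject").
    static List<String> values(String xmpXml, String dcProperty) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmpXml)));

            List<String> out = new ArrayList<>();
            NodeList props = doc.getElementsByTagNameNS(DC, dcProperty);
            for (int i = 0; i < props.getLength(); i++) {
                NodeList items = ((Element) props.item(i)).getElementsByTagNameNS(RDF, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    out.add(items.item(j).getTextContent().trim());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }
}
```

The extracted values could then be indexed as ordinary fields, which is what makes author or keyword navigation over a PDF collection possible.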
