Dieselpoint Search is well suited to searching large collections of PDF documents. The system automatically parses each PDF, extracts its metadata and text, and adds them to the index.
The PDF parser works in conjunction with the crawler. If PDFs are part of a website, or if they are stored in a directory structure, they can be found and processed automatically.
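As a rough illustration of the directory case, the sketch below walks a folder tree and collects every PDF it finds. It uses only the standard Java library; the class name `PdfFinder` and the method `findPdfs` are hypothetical and are not part of the Dieselpoint API, and matching on the file extension is an assumption made for simplicity.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not a Dieselpoint class.
final class PdfFinder {

    /**
     * Returns every regular file under root whose name ends in ".pdf"
     * (case-insensitive), in sorted order.
     */
    static List<Path> findPdfs(Path root) throws IOException {
        // Files.walk opens directory handles, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString()
                              .toLowerCase(Locale.ROOT)
                              .endsWith(".pdf"))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

Each path returned could then be handed to a parser; a real crawler would typically also check file contents rather than trusting the extension alone.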
Metadata such as author, title, and date can be treated like regular text and searched. It can also be used to build sophisticated interfaces using Dieselpoint’s Search and Navigation technology. For example, a search request can show not only the top results, but also all of the authors, the number of documents associated with each author, the categories that the documents belong to, the number of documents in each category, and a variety of other information.
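The per-author and per-category counts described here can be pictured as a simple tally over the matching documents. The sketch below is only an illustration in plain Java: the `Doc` class and the `countBy` method are hypothetical stand-ins, not the Dieselpoint API, and a production engine would compute these counts from the index rather than by looping over results. Field values are assumed to be non-null.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration, not a Dieselpoint class.
final class FacetCounter {

    /** Stand-in for one search hit and its extracted metadata. */
    static final class Doc {
        final String title;
        final String author;
        final String category;

        Doc(String title, String author, String category) {
            this.title = title;
            this.author = author;
            this.category = category;
        }
    }

    /**
     * Counts how many hits share each value of the chosen field.
     * A TreeMap keeps the values in alphabetical order for display.
     */
    static Map<String, Integer> countBy(List<Doc> hits, Function<Doc, String> field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc d : hits) {
            counts.merge(field.apply(d), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `FacetCounter.countBy(hits, d -> d.author)` yields the author list with document counts, and `d -> d.category` does the same for categories.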
All parsing and manipulation of PDFs is done with Java, eliminating the need for non-Java third-party tools.
The system analyzes each PDF looking for clues as to what the title might be, and employs advanced heuristics to select one. Studies show that Dieselpoint’s algorithm selects the same title that a human would have chosen over 90% of the time.
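Dieselpoint's actual heuristics are not described here, so the following is only a toy sketch of what heuristic title selection can look like. It assumes three candidate sources (the metadata title, the first line of extracted text, and the file name) and rejects values that look like authoring-tool leftovers. The class `TitleChooser`, its methods, and every rule in it are hypothetical.

```java
import java.util.Locale;

// Toy illustration only; not Dieselpoint's algorithm.
final class TitleChooser {

    /** Picks the first plausible candidate, falling back to the file name. */
    static String chooseTitle(String metadataTitle, String firstTextLine, String fileName) {
        if (isPlausible(metadataTitle)) {
            return metadataTitle.trim();
        }
        if (isPlausible(firstTextLine)) {
            return firstTextLine.trim();
        }
        // Last resort: file name without the .pdf extension, underscores as spaces.
        String base = fileName.replaceFirst("\\.[Pp][Dd][Ff]$", "");
        return base.replace('_', ' ').trim();
    }

    /** Rejects empty, very short, very long, or tool-generated titles. */
    static boolean isPlausible(String s) {
        if (s == null) {
            return false;
        }
        String t = s.trim();
        if (t.length() < 3 || t.length() > 200) {
            return false;
        }
        String lower = t.toLowerCase(Locale.ROOT);
        if (lower.equals("untitled") || lower.startsWith("microsoft word -")) {
            return false;
        }
        // Titles that are really file names, e.g. "report.doc".
        return !lower.matches(".*\\.(doc|docx|ppt|pdf|indd)$");
    }
}
```

For instance, `chooseTitle("Microsoft Word - report.doc", "Annual Report 2004", "report.pdf")` skips the metadata value and falls through to the first text line.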
Dieselpoint Search automatically extracts and indexes XMP (Extensible Metadata Platform) data embedded in PDFs, making it possible to search and navigate PDF document collections using this information.
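To give a sense of what XMP data looks like inside a file, the sketch below scans raw PDF bytes for the `<?xpacket begin=` and `<?xpacket end=` markers that wrap an XMP packet and returns the XML between them. It is a minimal illustration and not how Dieselpoint extracts XMP: it assumes the metadata stream is stored uncompressed and unencrypted, that the packet is UTF-8, and that only the first packet is of interest. The class `XmpScanner` and the method `findXmpPacket` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Minimal illustration; a real parser would read the PDF object structure.
final class XmpScanner {

    /**
     * Returns the first XMP packet found in the bytes, including its
     * xpacket wrapper, or null if no complete packet is present.
     */
    static String findXmpPacket(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so string
        // indexes line up with positions in the raw byte array.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);

        int start = raw.indexOf("<?xpacket begin=");
        if (start < 0) {
            return null;
        }
        int endMarker = raw.indexOf("<?xpacket end=", start);
        if (endMarker < 0) {
            return null;
        }
        int end = raw.indexOf("?>", endMarker);
        if (end < 0) {
            return null;
        }
        // Decode only the packet itself, assuming UTF-8 content.
        byte[] packet = Arrays.copyOfRange(pdfBytes, start, end + 2);
        return new String(packet, StandardCharsets.UTF_8);
    }
}
```

The returned string is XML, so it could be passed to a standard XML parser to read fields such as `dc:title` or `dc:creator`.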