SharePoint has an excellent enterprise search engine, but the results often disappoint end-users. One significant reason that SharePoint search doesn’t return the expected results is that the files themselves contain no searchable text.
Scanned documents are usually stored as TIFF, JPG, BMP or PDF files. PDF documents (such as manuals, invoices, and scans) come in lots of flavours, and although they may look like text-based documents, many are actually just an image of each page with no text behind it.
SharePoint search crawls documents and uses a collection of iFilters to extract the text from documents. The text is then added to the search index to match queries from the users when they search.
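To illustrate why extracted text matters, here is a toy sketch of an index built from extracted text. It is not how SharePoint implements its index; the function names and sample documents are made up for illustration, and it indexes only body text (a real search engine would also index metadata such as file names).

```python
from collections import defaultdict


def build_index(extracted_text):
    """Map each lower-cased word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_name, text in extracted_text.items():
        for word in text.lower().split():
            index[word].add(doc_name)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical output of the text-extraction step: file name -> extracted text.
extracted_text = {
    "contract.docx": "Supply contract for Contoso 2023",
    "invoice-0042.pdf": "",  # image-only scan: nothing could be extracted
}

index = build_index(extracted_text)
contract_hits = search(index, "contoso contract")
invoice_hits = search(index, "invoice")
```

In this sketch the contract is found because its words reached the index, while the scanned invoice contributes no words at all, so no body-text query can ever match it.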
If the search crawler encounters scanned documents or image-only PDFs, it treats them like any other image and has no way of knowing that there is text within the document. For example, if a library contains invoices stored as PDFs without a text layer, those invoices simply won’t be found by a search for their contents.
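As a rough illustration of how such files might be spotted, the sketch below flags image formats by extension and flags PDFs whose raw bytes never mention a font resource. This is a heuristic under stated assumptions, not a reliable test: the function names are hypothetical, PDFs that store their objects in compressed streams can hide font entries (so text PDFs may be flagged wrongly), and a PDF that does contain fonts can still have image-only pages. A proper check would use a PDF parser.

```python
from pathlib import PurePath

# Image formats commonly produced by scanners; they carry no text layer.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".bmp"}


def pdf_mentions_fonts(pdf_bytes):
    """Heuristic: text in a PDF needs a font, so look for a /Font entry.

    Assumption: the entry is visible in the raw bytes, which is not
    guaranteed when the PDF uses compressed object streams.
    """
    return b"/Font" in pdf_bytes


def likely_unsearchable(filename, data):
    """Guess whether a file has no text for a crawler to extract."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return True
    if suffix == ".pdf":
        return not pdf_mentions_fonts(data)
    return False
```

For a file on disk, this could be called as `likely_unsearchable(path.name, path.read_bytes())` with a `pathlib.Path`, and any flagged file treated as a candidate for closer inspection rather than a confirmed problem.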
Optical Character Recognition (OCR) is a well-established technology that can add the missing text layer to documents. OCR can recognise different languages, fonts and even handwriting to extract and add text to documents. Recent innovations in machine learning make the OCR process more effective than ever.
The converted documents are saved as PDF/A, an archival PDF format, with a newly created text layer. An added bonus is that the converted documents are usually about half the size of the originals.
If you are unsure whether you are currently storing non-searchable files in your SharePoint sites (or file shares, or Azure Storage), you can download the free version of Ocrato and run unlimited audits across every location where you store documents.
With a single license, you can convert all your files to ensure they are discoverable by SharePoint enterprise search. Our industry-leading OCR engine will quickly convert your files to a compact and searchable format.