Index PDF documents within Sharepoint

Chad Gross, once again, is the Sharepoint man of the day for coming up with this method for indexing and searching through PDF documents within Sharepoint:


To get full-text search capability on SBS, the best method is to follow the instructions on the SBS Premium Technologies CD.  Getting full-text search of PDFs (including their content) requires a few extra steps:

1) Install the PDF iFilter from Adobe on your SBS:
http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611

2) Verify the Indexing service is started & set to start automatically.

3) Add the PDF icon to your docicon.xml file so your Document Libraries show the correct icons for your PDFs:
http://msmvps.com/cgross/archive/2004/10/26/16679.aspx
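For step 3, here is a minimal sketch of what the docicon.xml change amounts to: one extra Mapping entry under the ByExtension element. It works on the XML text only, and the icon file name pdf16.gif and the function name add_pdf_mapping are assumptions for illustration, so use whatever icon file you actually copied to the server. Back up the real file before touching it.

```python
import xml.etree.ElementTree as ET


def add_pdf_mapping(docicon_xml: str, icon_file: str = "pdf16.gif") -> str:
    """Return docicon.xml text with a pdf entry added under <ByExtension>.

    icon_file is a hypothetical name; it should match the image you
    placed in the SharePoint images folder.
    """
    root = ET.fromstring(docicon_xml)
    by_ext = root.find("ByExtension")
    if by_ext is None:
        raise ValueError("no <ByExtension> element found")

    # Leave the text untouched if a pdf mapping is already present.
    for mapping in by_ext.findall("Mapping"):
        if mapping.get("Key", "").lower() == "pdf":
            return docicon_xml

    ET.SubElement(by_ext, "Mapping", {"Key": "pdf", "Value": icon_file})
    return ET.tostring(root, encoding="unicode")


# A cut-down stand-in for the real file, just to show the shape.
sample = """<DocIcons>
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif"/>
  </ByExtension>
</DocIcons>"""

updated = add_pdf_mapping(sample)
print(updated)
```

Note that ElementTree drops XML comments and may change whitespace when it rewrites the file, so for a one-off change, adding the Mapping line by hand in a text editor is just as reasonable.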

Another thing to realize is that not all PDFs are created equal, and some require additional handling in order to be able to search their contents.  Basically, if you create a PDF locally from a Windows application (Word, Excel, QuickBooks, etc.) using Adobe’s Acrobat Distiller or PDFWriter, then the PDF will include text that the iFilter will detect when the PDF is uploaded to your Sharepoint site. 


As a result, you’ll be able to search the contents of these files from Sharepoint.  However, if you’re scanning documents straight to PDF, what you have is just an image dumped into a PDF, so the contents of these scanned documents can’t be searched without extra handling.


So what’s the extra handling?  If you’re scanning documents, you’ll want to make sure you run your scans through an OCR application and then create a dual-layer PDF (image + OCR text).  Then when you upload this dual-layer PDF to your Sharepoint site, the iFilter detects the OCR text allowing you to search the contents.
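If you are not sure which of your existing PDFs are image-only scans, a rough check can help before you upload them. The sketch below is only a heuristic and not how the iFilter itself works: it assumes that a PDF with no /Font entry anywhere in its raw bytes has no text layer. The function names are made up for illustration, and the check can misreport files that store their objects in compressed streams, so treat a "no text" result as a hint to open the file and try selecting some text.

```python
def looks_image_only(pdf_bytes: bytes) -> bool:
    """Crude guess: True if the raw PDF bytes show no font resources.

    Text in a PDF needs a font, so a file without any /Font entry is
    probably a bare scan. This is an assumption-based heuristic only.
    """
    return b"/Font" not in pdf_bytes


def check_file(path: str) -> bool:
    """Read a PDF from disk and apply the heuristic above."""
    with open(path, "rb") as handle:
        return looks_image_only(handle.read())


# Tiny hand-made byte strings standing in for real files.
scanned = b"%PDF-1.3 << /Type /XObject /Subtype /Image >>"
distilled = b"%PDF-1.3 << /Resources << /Font << /F1 5 0 R >> >> >>"

print(looks_image_only(scanned))
print(looks_image_only(distilled))
```

A dual-layer PDF produced by an OCR application carries its recognized text with a font, so it should pass this check just like a PDF created from Word or Excel.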

For low to medium volume jobs, I would recommend ScanSoft’s OmniPage Pro – it works with most scanners and is one of the most accurate OCR engines I’ve used.  It also allows for workflow creation to help automate the scan / OCR / save process, and natively supports creating dual-layer PDFs.  If you’re looking at medium to large volumes of scanning, I would recommend taking a look at leasing a new copier / printer / scanner / fax unit. Gestetner has several newer units that include the ability to scan to a PDF + OCR right at the unit.

3 thoughts on “Index PDF documents within Sharepoint”

  1. Are these all the steps involved? How soon before I can start seeing the content of the PDFs in the search results? I made the changes about 5 minutes ago, as you outlined them, the icons are showing up, the indexing service is started, but I’m still not seeing specific words and expressions from the PDFs when I search for them. Thanks!
