New PDF Optimization Capabilities

Google announced last week that it can now (mostly) read the text in scanned PDF files. What does this mean, you ask? Google already indexes PDF files! Yes, but until now they had to be PDF versions of files that were digital to begin with (like Word documents). Now any document that’s been scanned into a computer and uploaded to a website can be “read” by Google.

This opens up a vast number of previously print-only documents to ranking in Google’s search results. You could scan and upload all of last year’s press releases, for instance, and have the copy within those PDF files help your website rank better. Any old document that could benefit the users and search engines visiting your site now has a chance of being found and understood by Google. And the others (like Yahoo) likely won’t be far behind in adopting similar technology.

As with any OCR software, however, there are flaws. (OCR stands for optical character recognition, the technology that turns an image of text into text a machine can read.) No OCR software is perfect, so take some basic steps to ensure that your scanned documents are as clear and legible as possible.

As always, be aware of how new material on your website can affect the user experience. Don’t dump a whole bunch of content on your website if it’s going to hamper visitors’ ability to navigate it. Anything you add should be an enhancement for users, not just a way to gain more search engine rankings.

Here are some samples of this technology in action. In each case, look at the first listing in the search results: clicking its “View as HTML” link brings up a page with the scanned text and the search terms highlighted.

Example 1: Steady Success in a Volatile World

Example 2: repairing aluminum wiring

