OCR for Poor-Quality Documents

Can AI-powered OCR really read handwriting better than a human?

About 80 percent of document workflows can be handled by traditional OCR (optical character recognition). This type of OCR – which has been around for over 30 years – can identify almost every variation of machine-printed text, based on the fonts and symbols it has “studied.” So, what can’t it read?

Why Traditional OCR is No Longer Enough

Traditional OCR can’t read and extract data from handwriting or poor-quality machine-printed documents. While it can recognize all types of clearly printed characters, once the text becomes smudged or skewed, it no longer knows what it’s looking at. This means, while it can handle 80 percent of document workflows, humans must intervene to take care of the remaining 20 percent.

But this is a problem because human intervention inevitably leads to three major challenges:

  • Inaccuracy: mistyping and exception handling
  • Resources: it is difficult to find people willing and able to manually extract text from low-quality documents
  • Security: passing documents from machine to human and back to machine raises security concerns, especially in tightly regulated industries that handle sensitive information, such as financial services, government, and healthcare

The industry needs a better solution for the 20 percent of work that traditional OCR can’t handle. But before we get into what that solution is – let’s answer an important question.

What are Poor-Quality Documents & How are They Created?

Low-quality documents are machine-printed documents that can’t be read by traditional OCR technology. This is because somewhere along the way, the image was degraded. In most cases, the damage can be traced back to one of two sources.

Faxes

Fax machines were invented in the 1800s. Nearly two centuries later, people still use them. Fax use is especially high in the medical and legal fields. By nature, faxed images are significantly less clear than the originals, and they can also be skewed. Each time the same document is faxed again, the image gets even worse. And this happens all the time.

Scans

Scans are typically higher quality than faxes but still prove problematic in a number of ways. First, they can take up costly storage space. As a result, organizations will intentionally scan at a lower resolution – sometimes all the way down to 75 DPI. Also, certain documents, such as death certificates, are designed to be unreadable after scanning: to prevent fraud, they intentionally produce messy artifacts on the scanned image.
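To see why organizations are tempted to scan at low resolution, and what that costs in legibility, here is a rough back-of-the-envelope sketch in Python. It assumes a US Letter page, 8-bit grayscale, and no compression; the function names are illustrative, and real scanners compress their output, so actual file sizes will be smaller.

```python
def uncompressed_page_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=8):
    """Raw size in bytes of one scanned page, before any compression.

    Assumption: US Letter page, 8-bit grayscale.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def glyph_height_px(dpi, point_size=10.0):
    """Approximate pixel height of type at a given point size (1 pt = 1/72 inch)."""
    return point_size / 72.0 * dpi


for dpi in (300, 150, 75):
    size_mb = uncompressed_page_bytes(dpi) / 1_000_000
    height = glyph_height_px(dpi)
    print(f"{dpi} DPI: {size_mb:.2f} MB raw, 10 pt text is about {height:.1f} px tall")
```

Dropping from 300 to 75 DPI cuts the raw pixel count to roughly one-sixteenth, but 10-point text shrinks to about 10 pixels tall, which leaves very little detail for a recognizer to work with.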

So, if traditional OCR can’t read low-quality faxes and scans, what can?

OCR for Poor-Quality Documentation

For a while, traditional OCR was all we had, so organizations had to fall back on workarounds like manual data entry to make up for it. OCR for poor-quality documents uses AI and machine learning to read hard-to-read documents and extract their data. Here’s how it works:

AI, Machine Learning & Computer Vision Engines

OCR for poor-quality documents uses more advanced technology than traditional OCR. Instead of simple techniques to identify letter shapes, this type of OCR leverages a highly trained machine learning model and advanced computer vision engines to predict what is there.

  • Machine learning: a subset of artificial intelligence that provides systems with the ability to automatically learn and iterate from experience without explicit instructions, relying on patterns and inference instead
  • Computer vision: another subset of artificial intelligence that can automate tasks that the human visual system can do

The combination of highly trained machine learning models and computer vision engines unlocks OCR’s ability to replicate the way humans are able to read low-quality documents. In fact, if the model is good enough, it can extract text better than humans – but we’ll get to that.

Teaching An AI to Be Better-Than-Human

Machine learning models are only as good as the dataset they’re trained on.

Training requires a lot of specific data. If you add new forms and workflows, you need more training. Over time, the algorithm will improve. But the most important gains (getting to 90 percent accuracy and above) are incredibly resource-intensive to achieve.

Using AI-Powered OCR in the Real World

And then, of course, you need to put the model into practice. This requires a large dataset of what you want to digitize (usually the different types of forms that pass through your processing workflow), experts to help you build a model based on those forms, and ongoing support to help you improve it over time.

So, yes – OCR that can read low-DPI scans and blurry faxes exists. But who is using it – and who is making it?

Applications & Benefits of OCR for Poor-Quality Documents

Any business that has massive amounts of information arriving on paper is under constant pressure to “do more with less”. These types of businesses can benefit most from OCR for poor-quality documentation.

Paperwork processing – a necessary evil for many organizations – is one such example. Processing is common for insurance and healthcare organizations. It’s painful because it makes businesses spend precious time and resources on manual data entry. OCR for poor-quality documentation gives them a way to reallocate those resources. Here are a few more areas where it can help.

Examples of Applications:

  • Insurance Claims: beneficiary designation form processing
  • Insurance: flagging invalid data to maintain daily service level agreements
  • Insurance: death certificate processing
  • Healthcare: patient enrollment forms
  • Pharmaceuticals: prescription pre-authorization

OCR for poor-quality documents not only helps organizations fully automate these processing workflows, it also enables better data, analytics, and decision-making.

Examples of Benefits:

  • Greater straight-through processing
  • Reduced exception handling
  • 80% of the work accomplished with 20% of the staff
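One common way to think about straight-through processing and exception handling is to route each document by how confident the model is in what it extracted. The Python sketch below illustrates that idea only. The per-field confidence score, the 0.95 threshold, and all names here are assumptions made for illustration, not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # hypothetical model score in [0, 1]


def route_document(fields, threshold=0.95):
    """Send a document straight through only if every field clears the threshold."""
    if all(f.confidence >= threshold for f in fields):
        return "straight_through"
    return "exception"  # a person reviews the low-confidence fields


def straight_through_rate(documents, threshold=0.95):
    """Share of documents that need no human review (documents must be non-empty)."""
    routed = [route_document(doc, threshold) for doc in documents]
    return routed.count("straight_through") / len(routed)


# Made-up sample data for illustration.
documents = [
    [ExtractedField("name", "JOHN SMITH", 0.99), ExtractedField("dob", "1984-03-17", 0.97)],
    [ExtractedField("name", "MARY JONES", 0.98), ExtractedField("dob", "1990-11-02", 0.62)],
    [ExtractedField("name", "ALEX CHEN", 0.96), ExtractedField("dob", "1975-06-30", 0.99)],
    [ExtractedField("name", "PAT LEE", 0.99), ExtractedField("dob", "2001-01-15", 0.95)],
]

rate = straight_through_rate(documents)
print(f"Straight-through rate: {rate:.0%}")
```

Raising the threshold sends more documents to human review and lowers the straight-through rate; lowering it does the opposite, at the risk of letting more errors through.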

How to Find the Right OCR Solution

If your organization needs an OCR solution that can read and extract low-quality content, do your homework. Words like “AI” and “machine learning” are used too freely, and not every vendor will be able to back up how their technology works. When it comes to accuracy and performance, look for vendors with transparent numbers.

Technology

  • Is their solution AI-powered or is it just a well-marketed, human-data-entry and machine hybrid?
  • Can they explain the math behind their solution? What kind of machine learning models do they employ?

Accuracy

  • How accurate are they? Can they provide a number (95%, 99%, etc.)?
  • Can they provide accuracy numbers for every process they perform and every document they read and extract?
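When a vendor quotes a single accuracy number, it helps to know how it was measured. The Python sketch below compares two common definitions on the same made-up sample: character-level accuracy (based on edit distance) and field-level accuracy (the whole field must match exactly). The sample data and function names are illustrative assumptions; vendors may define their metrics differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(pairs):
    """1 minus (total edit errors / total ground-truth characters)."""
    total_chars = sum(len(truth) for truth, _ in pairs)
    errors = sum(edit_distance(truth, pred) for truth, pred in pairs)
    return 1.0 - errors / total_chars


def field_accuracy(pairs):
    """Share of fields where the extracted text matches the truth exactly."""
    exact = sum(1 for truth, pred in pairs if truth == pred)
    return exact / len(pairs)


# (ground truth, OCR output) pairs; made-up data for illustration.
sample = [
    ("JOHN SMITH", "JOHN SMITH"),
    ("1984-03-17", "1984-03-11"),      # one wrong digit
    ("POLICY 00421", "POLICY O0421"),  # zero read as the letter O
    ("CHICAGO", "CHICAGO"),
]

print(f"Character accuracy: {character_accuracy(sample):.1%}")
print(f"Field accuracy:     {field_accuracy(sample):.1%}")
```

In this sample, two wrong characters out of 39 give about 95 percent character accuracy, yet only half of the fields are fully correct. That gap is why it is worth asking which kind of accuracy a quoted number refers to.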

Experience

  • Are they a brand-new startup or have they been around a long time?
  • Why are they in the OCR game? How much do they really know about it? Vidado got started doing crowd data entry. This is what gave us the largest human-verified dataset (1 billion+ fields) in the industry.

Ease of Use

  • Do they offer a cloud-based SaaS solution, or must you host onsite?
  • How soon before you can start using the product? Many providers take about 6 months to a year to achieve high-level accuracy. (Vidado offers it on Day 1.)
  • How much training is required on the model? If it’s an AI-powered platform, will you need machine learning expertise on staff? Or will the provider handle everything (like we do)?
  • Are there any hardware requirements or IT burdens?

Business case

  • Do they have experience solving real business issues with their technology?
  • Have they imbued their technology with lessons learned from that experience? We have.

Try AI-Powered OCR for FREE

Get a FREE 30-day trial of Vidado – no credit card required – and start turning low-quality scans, faxes and even handwriting into digitized data. Create your free account to get started. 
