About 80 percent of document workflows can be handled by traditional OCR (optical character recognition). This type of OCR – which has been around for over 30 years – can identify almost every variation of machine-printed text, based on the fonts and symbols it has “studied.” So, which ones can’t it read?
Traditional OCR can’t read and extract data from handwriting or poor-quality machine-printed documents. While it can recognize all types of clearly printed characters, once the text becomes smudged or skewed, it no longer knows what it’s looking at. This means that while it can handle 80 percent of document workflows, humans must step in to take care of the remaining 20 percent.
But this is a problem because human intervention inevitably leads to three major challenges:
The industry needs a better solution for the 20 percent of work that traditional OCR can’t handle. But before we get into what that solution is – let’s answer an important question.
Low-quality documents are machine-printed documents that can’t be read by traditional OCR technology. This is because somewhere along the way, their data got compromised. And in most cases, it can be traced back to one of two sources.
Fax machines were invented in the 1800s. Nearly two hundred years later, people still use them. Fax use is especially high in the medical and legal fields. Faxed images are inherently less clear than the originals, and they can also be skewed. Each time the same document is faxed again, the image gets even worse. And this happens all the time.
Scans are typically higher quality than faxes but still prove problematic in a number of ways. First, they can take up costly storage space. As a result, organizations will intentionally scan at a lower resolution, sometimes all the way down to 75 DPI. Also, certain documents, such as death certificates, are designed to prevent fraud: scanning them intentionally produces messy artifacts that make the scanned image hard to read.
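To make the storage trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a US Letter page (8.5 × 11 inches), uncompressed 8-bit grayscale, and 10-point body text; none of these figures come from a specific scanner or workflow, and real files are usually compressed, so treat the numbers as illustrative only.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def uncompressed_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=8):
    """Raw (uncompressed) size in bytes of one scanned page.

    Page size and bit depth are assumptions chosen for illustration.
    """
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel / 8


def glyph_height_px(dpi, font_size_pt=10):
    """Approximate pixel height of a font's em box at a given scan resolution."""
    return font_size_pt / POINTS_PER_INCH * dpi


for dpi in (300, 150, 75):
    size_mb = uncompressed_bytes(dpi) / 1_000_000
    print(f"{dpi:>3} DPI: {size_mb:6.2f} MB raw, "
          f"~{glyph_height_px(dpi):.0f} px per 10 pt character")
```

Because pixel count grows with the square of the resolution, a 75 DPI page needs one sixteenth of the raw storage of a 300 DPI page. The price is that each character is drawn with roughly a quarter of the pixels in each direction, which leaves much less detail for telling similar letter shapes apart.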
So, if traditional OCR can’t read low-quality faxes and scans, what can?
For a while, traditional OCR was all we had. So, organizations had to take a few shortcuts to make up for it. OCR for poor-quality documents, by contrast, uses AI and machine learning to read hard-to-read documents and extract their data. Here’s how it works:
OCR for poor-quality documents uses more advanced technology than traditional OCR. Instead of relying on simple techniques that match letter shapes, this type of OCR leverages a highly trained machine learning model and advanced computer vision engines to predict what text is there.
The combination of highly trained machine learning models and computer vision engines unlocks OCR’s ability to replicate the way humans are able to read low-quality documents. In fact, if the model is good enough, it can extract text better than humans – but we’ll get to that.
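One way to picture the difference between matching shapes and predicting text is the toy sketch below. It is not any vendor’s actual method: the per-character confidences, the small word list, and the function names are all invented for illustration. It assumes a recognizer that outputs several candidate characters per position, then compares picking the best-looking character at each position with picking the most likely word from a known vocabulary.

```python
import math

# Hypothetical confidences for a smudged five-letter word (invented numbers).
CHAR_PROBS = [
    {"c": 0.55, "e": 0.45},
    {"1": 0.45, "l": 0.40, "i": 0.15},
    {"a": 0.60, "o": 0.40},
    {"i": 0.50, "l": 0.30, "1": 0.20},
    {"m": 0.55, "n": 0.45},
]

# Hypothetical vocabulary for the form field being read.
LEXICON = ["claim", "clean", "chain", "client"]


def greedy_decode(char_probs):
    """Pick the single most confident character at every position."""
    return "".join(max(probs, key=probs.get) for probs in char_probs)


def lexicon_decode(char_probs, lexicon, floor=1e-6):
    """Pick the vocabulary word with the highest total log-probability.

    Characters the recognizer never proposed get a small floor probability.
    """
    best_word, best_score = None, -math.inf
    for word in lexicon:
        if len(word) != len(char_probs):
            continue  # skip words that cannot align with the image
        score = sum(math.log(probs.get(ch, floor))
                    for probs, ch in zip(char_probs, word))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


print(greedy_decode(CHAR_PROBS))            # shape-by-shape reading
print(lexicon_decode(CHAR_PROBS, LEXICON))  # reading with word-level context
```

With these made-up numbers, the shape-by-shape reading yields "c1aim", while the context-aware reading recovers "claim". This is similar to how a person reading a blurry fax uses what they expect the word to be.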
Machine learning models are only as good as the dataset they’re trained on.
Training requires a lot of specific data. If you add new forms and workflows, you need more training. Over time, the algorithm will improve. But the most important gains (90 percent accuracy and above) are incredibly resource-intensive.
And then, of course, you need to put the model into practice. This requires a large dataset of what you want to digitize (usually the different types of forms that you normally see in your processing workflow), experts to help you build out a model based on those forms, and ongoing support to help you improve it over time.
So, yes: OCR that can read low-DPI scans and blurry faxes exists. But who is using it, and who is making it?
Any business that has massive amounts of information arriving on paper is under constant pressure to “do more with less.” These types of businesses can benefit most from OCR for poor-quality documents.
Paperwork processing, a necessary evil for many organizations, is one such example. Processing is common for insurance and healthcare organizations. It’s painful because it makes businesses spend precious time and resources on manual data entry. OCR for poor-quality documents gives them a way to reallocate that time and those resources. Here are a few more areas where it can help.
OCR for poor-quality documents not only helps these applications fully automate their processing workflow, it also enables better data, analytics, and decision-making.
If your organization needs an OCR solution that can read low-quality content and extract data from it, do your homework. Words like “AI” and “machine learning” are used too freely, and not every vendor will be able to back up how their technology works. Finally, when it comes to accuracy and performance, look for vendors with transparent numbers.
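When comparing accuracy numbers, it helps to know how such a number can be computed. One widely used measure is the character error rate (CER): the edit distance between the OCR output and a hand-verified transcription, divided by the length of that transcription. The sketch below is a minimal illustration with hypothetical function names, not a description of how any particular vendor measures accuracy.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1,      # delete from reference
                               current[j - 1] + 1,   # insert into reference
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the verified reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a five-character field.
cer = character_error_rate("claim", "c1aim")
print(f"CER: {cer:.0%}, character accuracy: {1 - cer:.0%}")
```

Running the same calculation on a sample of your own faxes and low-resolution scans, rather than on clean test pages, gives a figure you can compare directly against what a vendor reports.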
Get a FREE 30-day trial of Vidado – no credit card required – and start turning low-quality scans, faxes and even handwriting into digitized data. Create your free account to get started.