How to Find the Right OCR Strategy Based on Your Document Needs
OCR, which stands for optical character recognition, is a powerful tool that has drastically changed how many companies do business. Without OCR, text within a scanned document, an image-only PDF, or any other type of image file is unreadable to a computer. OCR analyzes the pixels in these images to recognize patterns and transform image-based letters into usable, computer-readable text.
This technology has done wonders in digitally transforming numerous businesses over the past few decades and remains integral to countless document management strategies. By scanning documents and lifting key fields, companies can index and digitally file these documents in an enterprise content management (ECM) system quickly and easily. OCR can also send critical data lifted from these documents to other systems. A system that uses OCR for these processes is known as a document capture solution. Although OCR is an extremely powerful tool when implemented correctly, each implementation strategy has its own set of benefits and challenges. Because of this, it’s important to choose a strategy that meets your business needs.
Factors That Determine Your OCR Strategy
Different business processes have different expectations from a document capture solution. Whether the process you are trying to transform requires 100% accuracy, fast capture times, or large volumes, the difference will almost certainly come down to your implementation strategy. Below, we’ve outlined 6 factors that can impact which strategy is right for you.
1. Document Formatting:
Unusual fonts, special characters, handwritten fields, and a lack of uniform layouts can all affect the accuracy of your capture solution. Documents that are skewed, stained, torn, or of low print quality can also decrease capture accuracy. It’s essential to determine who has control of these variables. Do these documents originate from inside your company, or do they come from a partner or customer? Likewise, will you be archiving older documents, or will most of the documents be new enough that implementing new formatting policies can ensure better accuracy?
2. Desired Accuracy:
Surprisingly, not all capture strategies aim for 100% accuracy. In fact, according to an OCR best practices guide by the University of Illinois, most OCR software is advertised to be between 97% and 99% accurate. Running quality assurance checks on high volumes of documents isn’t always a worthwhile investment of time either. There are, however, strategies to ensure your data is highly accurate, usually by minimizing your system’s reliance on OCR.
3. Additional Data Sources:
If your company has additional sources of data, such as a customer relationship management system (CRM), an enterprise resource planning system (ERP), or any other line-of-business application, your capture system can integrate with these solutions and pull data from them. This reduces your reliance on OCR and improves the overall accuracy of your information.
4. Time Sensitivity of Retrieval:
If documents need to be retrieved quickly, then the index fields must be accurate. Users will want to search for a document by the most readily available field and find it effortlessly. In contrast, if one index field can afford to be captured incorrectly 0.3% of the time, and users can simply search with another index field and then fix the misread field, then 100% accuracy becomes less of a concern, and 99.7% accuracy will suffice.
5. Frequency of Document Retrieval:
If documents will not be retrieved very often, 100% accuracy is also less of a concern. When archiving documents because of retention mandates, for example, employees typically have a day or more to find the documents they are looking for in the event of an audit. These documents will also likely have little use outside of these audits. In this case, 99.7% capture accuracy will suffice, as the occasional misread field can simply be fixed after the document is found with another search criterion.
6. Capture Volume:
The number of documents that need to be captured regularly also plays a prominent role in determining your OCR strategy. Larger volumes of documents must be captured more quickly to keep up with demand. This fast-paced capture leaves less time for manual quality assurance while potentially still demanding a high degree of accuracy, depending on your other needs.
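The accuracy figures used in factors 4 through 6 can be turned into rough numbers. The sketch below is illustrative only. It assumes that the 99.7% figure applies to each index field and that errors in different fields are independent, neither of which is guaranteed in practice. The function names and the sample volume are hypothetical.

```python
def prob_all_fields_wrong(field_accuracy: float, num_fields: int) -> float:
    """Chance that every index field on one document was misread.

    Assumes errors are independent across fields (an assumption).
    """
    return (1.0 - field_accuracy) ** num_fields


def expected_misread_fields(docs_per_day: int, fields_per_doc: int,
                            field_accuracy: float) -> float:
    """Expected number of misread index fields per day at a given volume."""
    return docs_per_day * fields_per_doc * (1.0 - field_accuracy)


if __name__ == "__main__":
    accuracy = 0.997  # per-field accuracy used in the examples above
    # Chance that both of two index fields are wrong on the same document.
    print(prob_all_fields_wrong(accuracy, 2))
    # Hypothetical volume: 10,000 documents a day with 4 index fields each.
    print(expected_misread_fields(10_000, 4, accuracy))
```

Under these assumptions, both of two fields are misread on roughly 9 documents in a million, which is why a second search field is usually enough to find a document. At the same time, 10,000 four-field documents a day would still produce about 120 misread fields to correct, which is why volume matters.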
Types of OCR Strategies
To recap, document formatting, desired accuracy, availability of other data, capture volume, retrieval frequency, and time sensitivity can all affect which capture strategy will suit your needs. We’ve broken these strategies into 3 categories, each with its own strengths and limitations.
A simple OCR strategy is intended to limit a process’s reliance on OCR to improve capture accuracy. This strategy typically captures a single data point that is capable of identifying the entire document. The rest of the index fields are then looked up in a line-of-business application containing highly accurate information. By capturing this single data point and having it call on the related information, you practically eliminate the risk of mistakes slipping through the process while drastically reducing the number of documents that need to be manually checked.
For example, if the OCR’d field is captured incorrectly, the system likely won’t be able to find any information associated with it, so an employee can easily identify the issue and resolve it. In addition, knowing which documents have errors reduces the number of documents that need to be checked from every document captured to only those that actually have errors. This strategy does require that your business has a highly accurate data source to pull information from, though. If your business does not meet this requirement, another strategy may work better for you.
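A minimal sketch of the simple strategy might look like the following. Everything here is hypothetical: the `ERP_RECORDS` dictionary stands in for a trusted line-of-business system, and the purchase order numbers and field names are made-up sample data rather than anything from a real product.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a trusted line-of-business system (ERP, CRM, ...).
ERP_RECORDS = {
    "PO-10482": {"vendor": "Acme Supply", "amount": "1250.00"},
    "PO-10483": {"vendor": "Northwind", "amount": "310.75"},
}


@dataclass
class IndexResult:
    key: str
    fields: Optional[dict]  # None when the lookup failed
    needs_review: bool


def index_document(ocr_key: str, records: dict = ERP_RECORDS) -> IndexResult:
    """Index a document from one OCR'd key by looking up the other fields."""
    key = ocr_key.strip().upper()
    match = records.get(key)
    if match is None:
        # A misread key rarely matches a real record, so the failed lookup
        # itself flags the document for manual review.
        return IndexResult(key=key, fields=None, needs_review=True)
    return IndexResult(key=key, fields=dict(match), needs_review=False)


if __name__ == "__main__":
    # The second key simulates OCR confusing the letter O and the digit 0.
    for captured in ["PO-10482", "P0-1O483"]:
        result = index_document(captured)
        status = "needs review" if result.needs_review else result.fields["vendor"]
        print(result.key, "->", status)
```

Only the documents whose lookup fails are routed to an employee, so the review queue scales with the number of errors rather than with the number of documents captured.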
Complex OCR strategies capture all index fields and run various checks to see if the data is accurate. Examples of these checks include matching the captured total on an invoice against the sum of its captured line amounts, or validating the 9th character of a vehicle identification number (VIN), a check digit designed to detect invalid VINs. In addition, by capturing all index fields, this strategy ensures that you can still find a document using at least one correctly captured field if another was captured incorrectly. Not every captured field can be checked for accuracy this way, though. Where such checks exist, this strategy is highly effective.
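The two checks mentioned above can be sketched as follows. The function names are hypothetical, and the VIN check assumes the 17-character format with the North American check-digit scheme (letters mapped to numbers, position weights, remainder modulo 11); VINs from regions that do not use the check digit would need different handling.

```python
from decimal import Decimal, InvalidOperation
from typing import Sequence

# Letter-to-number mapping used by the VIN check digit; I, O and Q never appear.
_VIN_VALUES = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7, "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
# Weight for each of the 17 positions; position 9 (the check digit) has weight 0.
_VIN_WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def vin_check_digit_ok(vin: str) -> bool:
    """Return True if the 9th character matches the computed check digit."""
    vin = vin.strip().upper()
    if len(vin) != 17 or any(ch not in _VIN_VALUES for ch in vin):
        return False  # wrong length or a character a VIN cannot contain
    total = sum(_VIN_VALUES[ch] * w for ch, w in zip(vin, _VIN_WEIGHTS))
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected


def invoice_total_ok(line_amounts: Sequence[str], captured_total: str) -> bool:
    """Return True if the captured line amounts sum to the captured total."""
    try:
        total = sum((Decimal(a) for a in line_amounts), Decimal("0"))
        return total == Decimal(captured_total)
    except InvalidOperation:
        return False  # OCR produced something that is not a number


if __name__ == "__main__":
    print(vin_check_digit_ok("1M8GDM9AXKP042788"))
    print(invoice_total_ok(["10.00", "5.25"], "15.25"))
```

Amounts are handled as `Decimal` rather than floats so that currency values compare exactly. A field that fails either check would be routed for manual review instead of being filed as-is.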
Situational OCR is typically used when documents need to be archived in bulk. These documents need to be captured quickly and accurately, and if there are index fields that apply to multiple documents, this combination of speed and accuracy is entirely possible. Using situational OCR, barcoded cover pages can be placed on the documents being captured, and each of these barcodes can contain index fields for a set of documents. For example, if an HR team is archiving employee documents, all documents related to a specific employee can be grouped with one barcode containing the employee’s name, ID number, and other identifying fields. This way, all of these documents will have at least two common, highly accurate index fields with which they can be looked up.
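The cover-page approach can be sketched roughly as below. This assumes the barcodes have already been decoded by the scanning software, so each page arrives as a pair of a page identifier and either the decoded index fields (for a cover page) or `None` (for an ordinary page). The function name, page identifiers, and employee data are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# (page id, decoded barcode fields for a cover page, or None for other pages)
Page = Tuple[str, Optional[dict]]


def apply_cover_sheets(pages: Iterable[Page]) -> List[dict]:
    """Stamp each page with the index fields of the cover page before it."""
    indexed: List[dict] = []
    current_fields: Optional[dict] = None
    for page_id, barcode_fields in pages:
        if barcode_fields is not None:
            # Cover page: start a new batch and do not archive the sheet itself.
            current_fields = dict(barcode_fields)
            continue
        if current_fields is None:
            raise ValueError(f"{page_id} appears before any cover page")
        indexed.append({"page": page_id, **current_fields})
    return indexed


if __name__ == "__main__":
    scan_stream = [
        ("cover-1", {"employee_name": "J. Rivera", "employee_id": "E-0417"}),
        ("scan-001", None),
        ("scan-002", None),
        ("cover-2", {"employee_name": "M. Chen", "employee_id": "E-0533"}),
        ("scan-003", None),
    ]
    for record in apply_cover_sheets(scan_stream):
        print(record)
```

Because the index fields come from the barcode rather than from recognized text, every page in a batch inherits the same values without any per-page OCR.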
By knowing which factors of a document capture process are important to your business, you can choose a strategy that fits your needs.
Square 9 is an end-to-end document-based digital transformation provider, offering web forms, ECM, workflow automation, and document capture solutions. For more information on how Square 9 can help with your document needs, you can request a demo or let us help.