What are OCR and NLP, and how does AI read documents? From scan to data in the system

Author: Michael Jan Rogocki (AI Engineer & Data Scientist) · Last updated:

In every company, someone is re-keying data from documents. From invoices, contracts, forms, reports. By hand, from the screen into the system, line by line.

It's work that has to be done — but that requires neither creativity nor experience. It requires time and focus, which is exactly what a team usually lacks for the tasks that genuinely need human involvement.

Two technologies from the field of AI — OCR and NLP — can take over this mechanical part. Together they form a path in which a document enters as an image and comes out as organized information in the system. Without manual re-keying and without the risk that the knowledge of what goes where stays only in individual employees' heads.

Below we explain what OCR is, what NLP is, how they work together — and we show, using a real example from the insurance market, how much time and how many errors can be eliminated.

1. What is OCR and how does it work?

⚡ In one sentence

OCR (Optical Character Recognition) is a technology that converts an image of text (a scan, a photo, an image-only PDF) into digital text that a computer can read and search.

💡 In plain terms

Imagine a stack of paper invoices on a desk. To a computer, each of them is simply a picture — dark and light pixels. The computer sees them the way you see an image in an unfamiliar alphabet: shapes without meaning.

OCR solves this problem. When a document reaches the system as an image (a scan, a phone photo, an image-only PDF), OCR does the following:

  • Character recognition — the algorithm analyzes the shapes in the image and matches them to known letters, digits and symbols. Modern OCR systems use neural networks that are trained on thousands of examples — which is why they cope with different fonts, print quality and even, partially, with handwriting.
  • The result — digital text that can be searched, copied, edited and processed further.

After OCR, the computer "sees" words rather than pixels. But note: it sees words — it doesn't yet "understand" them. OCR doesn't know that "1,250.00 PLN" is an amount on an invoice and that "Jan Kowalski" is a customer's name. OCR reads — it doesn't interpret.
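
To make "matching shapes to known characters" concrete, here is a deliberately tiny sketch. It is a toy under stated assumptions: the three 3x5 templates and the function names are invented for illustration, and real OCR engines use trained neural networks rather than pixel-by-pixel comparison.

```python
# Toy illustration only: real OCR engines use trained neural networks,
# not three hand-drawn templates. All names here are hypothetical.

# 3x5 pixel templates: "#" is a dark pixel, "." is a light one.
TEMPLATES = {
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
}


def pixel_distance(a: list[str], b: list[str]) -> int:
    """Count the pixels in which two equally sized bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def recognize(glyph: list[str]) -> str:
    """Return the known character whose template is closest to the glyph."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))


# A slightly damaged "7" (one pixel flipped) is still matched to "7".
noisy_seven = ["###", "..#", ".#.", "##.", ".#."]
print(recognize(noisy_seven))
```

Note what the function returns: a character, nothing more. It has no idea whether that "7" belongs to an amount, a date or a phone number.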

🔧 Deep dive

OCR has roots reaching back to the 1920s (the first experiments with character recognition), but commercial systems appeared in the 1960s and 1970s: first for sorting mail and reading checks, then as general-purpose tools (Ray Kurzweil's omni-font OCR, 1970s). The breakthrough in accuracy came with the use of neural networks. Systems such as Tesseract (open source, originally developed at HP, sponsored by Google from 2006 to 2018, now developed by the community) and commercial solutions like ABBYY FineReader are trained to recognize characters on the basis of hundreds of thousands of examples.

On clean, well-scanned printed text, the accuracy of modern OCR systems reaches 95–99%, depending on the system and the quality of the input. The best commercial solutions approach 100% under ideal conditions.

The most important limitation: OCR quality depends directly on the quality of the input. A blurry scan, low resolution (below 300 DPI), a skewed document, handwriting — all of these reduce accuracy. That's why, in implementation practice, the first step is standardizing the way documents reach the system (cf. What is process optimization? — the section on process mapping).
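
One input check that is easy to automate is resolution. The sketch below estimates the DPI of a scanned A4 page from its pixel width and flags scans below the 300 DPI mentioned above. The function names are hypothetical, and the calculation assumes the whole page width was scanned.

```python
# Hypothetical helpers: estimate scan resolution from the pixel width
# and the physical page width. Assumes a full-width A4 scan.

A4_WIDTH_INCHES = 8.27  # 210 mm


def effective_dpi(pixel_width: int, page_width_inches: float = A4_WIDTH_INCHES) -> float:
    """Dots per inch = pixels across the page / physical width in inches."""
    return pixel_width / page_width_inches


def needs_rescan(pixel_width: int, min_dpi: float = 300.0) -> bool:
    """Flag scans whose estimated resolution falls below the threshold."""
    return effective_dpi(pixel_width) < min_dpi


print(round(effective_dpi(1240)))  # a 1240-pixel-wide A4 scan is roughly 150 DPI
print(needs_rescan(1240))
```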

2. What is NLP and how does it differ from OCR?

⚡ In one sentence

NLP (Natural Language Processing) is a technology that lets a computer "understand" the meaning of text — not just read the words, but extract from them specific information, intentions and relationships.

💡 In plain terms

OCR gives us text. But text on its own isn't yet information.

Take an invoice as an example. After OCR, the computer sees a string of characters: "VAT invoice no. 2024/03/0147, issue date: 15.03.2024, gross amount: 4,920.00 PLN, payment term: 14 days". It sees this as plain text — the same way it sees the header, the footer and the sender's address. It doesn't know what is what.

NLP solves this problem. It analyzes the text and extracts structure from it:

  • It recognizes that "4,920.00 PLN" is an amount — not a phone number or a product code.
  • It identifies "15.03.2024" as the issue date and "14 days" as the payment term.
  • It assigns "2024/03/0147" as the document number.
  • It classifies the whole document as a "VAT invoice" — not an order, not a complaint.
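
The extraction described above can be sketched on the sample invoice text. This is a minimal stand-in, not a real NLP model: it uses hand-written regular expressions that only fit this one wording, whereas a trained NER model learns such fields from examples. The field names and `extract_fields` are assumptions made for the illustration.

```python
import re
from datetime import datetime
from decimal import Decimal

TEXT = ("VAT invoice no. 2024/03/0147, issue date: 15.03.2024, "
        "gross amount: 4,920.00 PLN, payment term: 14 days")

# Hand-written patterns for this one layout; a trained NER model would
# learn such fields from examples instead of relying on fixed wording.
PATTERNS = {
    "number": r"invoice no\.\s*([\d/]+)",
    "issue_date": r"issue date:\s*(\d{2}\.\d{2}\.\d{4})",
    "gross_amount": r"gross amount:\s*([\d,]+\.\d{2})\s*PLN",
    "payment_term_days": r"payment term:\s*(\d+)\s*days",
}


def extract_fields(text: str) -> dict:
    """Return typed fields; a field that is not found maps to None."""
    raw = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        raw[name] = match.group(1) if match else None
    date_str, amount_str, term_str = raw["issue_date"], raw["gross_amount"], raw["payment_term_days"]
    return {
        "number": raw["number"],
        "issue_date": datetime.strptime(date_str, "%d.%m.%Y").date() if date_str else None,
        "gross_amount": Decimal(amount_str.replace(",", "")) if amount_str else None,
        "payment_term_days": int(term_str) if term_str else None,
    }


print(extract_fields(TEXT))
```

The output is no longer a string of characters but a record with typed values: a date, a decimal amount, an integer number of days.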

The difference between OCR and NLP can be described in one sentence: OCR is the eyes, NLP is the brain. OCR sees the letters, NLP "understands" what they mean in context. It's worth adding: when AI analyzes an image not to read text but to recognize objects, defects or scenes — that's already the domain of Computer Vision, not OCR.

🔧 Deep dive

NLP is a branch of AI that covers several techniques. The most important ones for document processing are:

  • Tokenization — splitting text into units (words, sentences, fragments). It's the starting point for further analysis (cf. What is Artificial Intelligence? — the section on the mechanics of AI).
  • NER (Named Entity Recognition) — recognizing entities: dates, amounts, company names, addresses, document numbers. It's the foundation of automatically extracting data from documents.
  • Text classification — assigning a document to a category (invoice, complaint, order, claims correspondence). The system doesn't require a list of rules — it's trained on examples.
  • Sentiment and intent analysis — in the context of correspondence: is the customer asking, complaining, threatening, requesting information? This allows matters to be routed automatically to the right people.
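
Two of these techniques, tokenization and classification trained on examples, fit into a short sketch. It is a heavily simplified illustration: the four training sentences are invented, and the word-overlap scoring stands in for the statistical models used in real systems.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word and number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Invented training examples, for illustration only.
EXAMPLES = [
    ("invoice", "VAT invoice gross amount payment term"),
    ("invoice", "invoice number net amount due date"),
    ("complaint", "I want to complain about the damaged product"),
    ("complaint", "the delivery was late and the product is broken"),
]


def train(examples):
    """Count how often each token appears in each category."""
    counts = defaultdict(Counter)
    for label, text in examples:
        counts[label].update(tokenize(text))
    return counts


def classify(text: str, counts) -> str:
    """Pick the category whose training tokens overlap most with the text."""
    tokens = tokenize(text)
    return max(counts, key=lambda label: sum(counts[label][t] for t in tokens))


model = train(EXAMPLES)
print(tokenize("Gross amount: 4,920.00 PLN"))
print(classify("Invoice attached, amount due in 14 days", model))
print(classify("My product arrived broken", model))
```

No rule in this code says "if the text contains X, it is an invoice". The categories come entirely from the examples, which is also why the quality of those examples matters so much.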

NLP isn't a single technology — it's a set of tools. Which of them to use depends on the business problem. In a simple case (extracting data from structured invoices), NER is enough. In a complex one (classifying insurance correspondence in multiple languages), you need a combination of classification, intent analysis and NER. NLP is also the foundation of RAG technology, where text is not only analyzed but used to generate answers to questions (cf. What is RAG and an AI agent?).

An important caveat: when we write that NLP "understands" text, that's a simplification. NLP doesn't understand in the human sense — it processes statistical patterns and matches text to learned categories (cf. What is Artificial Intelligence? — the section on how AI "thinks"). The effect can come close to understanding, but the mechanism is fundamentally different.

3. How AI reads documents — from scan to data in the system

⚡ In one sentence

Processing a document with AI is a chain of steps: scan → OCR (reading the text) → NLP ("understanding" the meaning) → the data goes into the system without manual re-keying.

💡 In plain terms

Let's trace the path of a single invoice — from the moment it reaches the company to the moment its data is in the system.

  1. The document reaches the system. The invoice arrives by email as a PDF, or someone scans it from paper. An important distinction: if the PDF is "digital from birth" (generated by an accounting system or an editor), the text is already inside — OCR isn't needed, you can go straight to step 3. But if the PDF is a scan of a paper document or a photo — the computer sees an image, not text, and that's where OCR's role begins.
  2. OCR reads the text. The OCR system analyzes the image and converts it into digital text. After this step we have the full content of the invoice as text — but still as one string of characters, without structure.
  3. NLP extracts the data. The NLP system analyzes the text and extracts specific information: the invoice number, the date, the amount, the supplier's tax ID, the line items, the payment term. It assigns each piece of information to the appropriate field.
  4. The data goes into the system. The extracted data enters the accounting system, the ERP or a spreadsheet — automatically, without manual entry. A person receives a ready record to verify and approve.
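
The four steps can be sketched as one function. Everything here is a placeholder under stated assumptions: `run_ocr` and `extract_data` are hypothetical stubs that return canned values where a real OCR engine and NLP model would go, and the record layout is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncomingDocument:
    filename: str
    embedded_text: Optional[str]  # text layer of a digital-born PDF, else None
    image_bytes: Optional[bytes] = None


def run_ocr(image_bytes: bytes) -> str:
    """Placeholder for a real OCR engine; returns canned text here."""
    return "VAT invoice no. 2024/03/0147"


def extract_data(text: str) -> dict:
    """Placeholder for the NLP step; pulls out a single field here."""
    number = text.split("no. ")[-1] if "no. " in text else None
    return {"invoice_number": number}


def process(doc: IncomingDocument) -> dict:
    """Steps 1 to 4: OCR runs only when the file carries no text layer."""
    used_ocr = doc.embedded_text is None
    text = run_ocr(doc.image_bytes or b"") if used_ocr else doc.embedded_text
    record = extract_data(text)
    # The record is handed to a person for verification, not auto-approved.
    record.update(source=doc.filename, used_ocr=used_ocr, status="pending_review")
    return record


print(process(IncomingDocument("invoice.pdf", "VAT invoice no. 2024/03/0147")))
print(process(IncomingDocument("scan.pdf", None, b"raw image bytes")))
```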

Processing the document itself, from the moment the system "sees" it to the moment the data is extracted, takes seconds. The whole path (including downloading the attachment, preprocessing and saving to the system) takes minutes, not hours.

But an important caveat: this chain works well when the company knows what data it wants to extract and in what form it stores it. Without an organized process (which documents, where they come from, where the data goes, who verifies it), technology won't solve the problem — because it isn't clear what to do with the data (cf. What is automation? — the section on the stages of automation).
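
"Knowing what data you want and in what form" can be written down as a schema before any model is chosen. The sketch below is one possible shape, assuming an invoice use case. The class name, the fields and the validation rules are illustrative and not taken from a specific system.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceRecord:
    """Hypothetical target schema: the fields the company decided to extract."""
    number: str
    issue_date: date
    gross_amount: Decimal
    currency: str
    payment_term_days: int

    def problems(self) -> list[str]:
        """Return human-readable validation problems (empty list means OK)."""
        found = []
        if not self.number.strip():
            found.append("missing invoice number")
        if self.gross_amount <= 0:
            found.append("gross amount must be positive")
        if len(self.currency) != 3:
            found.append("currency must be a 3-letter code")
        if self.payment_term_days < 0:
            found.append("payment term cannot be negative")
        return found


record = InvoiceRecord("2024/03/0147", date(2024, 3, 15), Decimal("4920.00"), "PLN", 14)
print(record.problems())
```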

🔧 Deep dive

In implementation practice, the line between OCR and NLP is increasingly blurred. Modern multimodal models — based on the Transformer architecture (cf. What is Artificial Intelligence? — the section on Transformer and LLM) — can analyze a document simultaneously as an image and as text. They don't need a separate OCR step and then a separate NLP step. They look at the whole page: they see the layout, the tables, the headings — and extract the data directly.

This means the traditional "OCR first, then NLP" split applies to classic systems. In the newest solutions, both stages can happen at once. For a company deploying such a solution the effect is the same — documents converted into data — but the technology underneath is simpler and more flexible.

Regardless of the technology, however, the same principle applies: the quality of the results depends on the quality of the input data and the organization of the process. A multimodal model copes better with a poor scan than classic OCR — but it still needs a clearly defined goal: what data to extract, in what format, into which system.

"Document-processing technology has changed radically over the past few years. We used to build separate pipelines: OCR, then extraction rules, then a classifier. Today a multimodal model does it in one step. But one thing hasn't changed: before you launch the system, you have to know exactly what data you want to extract and what to do with it. Without that, even the best model produces data no one uses."

— Michael Jan Rogocki, AI Engineer & Data Scientist, cm-opti

4. Where it works in practice — a case study from the insurance market

⚡ In one sentence

An insurance company in the German market deployed an NLP agent to classify claims correspondence. The result: a 30% smaller backlog and 90% accuracy in recognizing the risk category.

💡 In plain terms

This is the daily reality of a mid-sized insurance company in the German market. Claims correspondence (letters, emails, attachments) arrives in large volumes every day; the annual volume is a high six-figure number. Each document requires assessment: Is it urgent? What is the risk category? Which department should handle it? The environment is highly regulated, and delays risk claims becoming time-barred and customer satisfaction dropping.

Before the deployment, employees manually read each document, classified it and routed it to the appropriate process. Processing time grew and backlogs piled up.

What the system did:

  • Reading documents — OCR read the content of incoming documents (scans, PDFs, emails).
  • "Understanding" the content — the NLP agent analyzed the text, assessed the significance of the document, recognized the risk category and the sender's intent.
  • Automatic routing — relevant documents went automatically to the appropriate processes. Irrelevant documents (duplicates, spam, informational letters requiring no action) were rejected — this alone reduced the volume by about 25%.
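
The routing step can be illustrated with a small sketch. It is not the project's actual implementation: the category names, the queue names and the duplicate check based on a text hash are all assumptions made for the example.

```python
import hashlib

# Hypothetical mapping from a predicted category to a handling queue.
ROUTES = {
    "new_claim": "claims_intake",
    "legal_deadline": "legal_team",
    "information_only": None,  # no action required, so nothing to route
}


def fingerprint(text: str) -> str:
    """Hash of case- and whitespace-normalized text, used to spot duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def route(text: str, category: str, seen: set[str]) -> str:
    """Return the target queue, 'rejected', or 'manual_review'."""
    fp = fingerprint(text)
    if fp in seen:
        return "rejected"  # exact duplicate of an earlier document
    seen.add(fp)
    if category not in ROUTES:
        return "manual_review"  # unknown category: let a person decide
    return ROUTES[category] or "rejected"


seen_documents: set[str] = set()
print(route("Claim for water damage", "new_claim", seen_documents))
print(route("claim  for WATER damage", "new_claim", seen_documents))
print(route("Office closed on Friday", "information_only", seen_documents))
```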

Results:

  • 30% reduction in the document-processing backlog.
  • About 25% less correspondence requiring manual handling (rejection of irrelevant documents).
  • 90% accuracy in automatically recognizing the risk category.

The employees didn't lose their jobs — the nature of their work changed. Instead of reading and sorting hundreds of letters a day, they verify the system's decisions and deal with matters that require judgment and experience.

🔧 Deep dive

A few technical decisions that determined the success of this project:

  • A private cloud solution. Insurance data is subject to strict regulations. The system runs in a private cloud — the data doesn't leave the controlled environment. That's not optional in a regulated sector.
  • An NLP agent, not a set of rules. Classic rule-based automation (if a document contains word X → route to Y) wasn't enough, because the correspondence is too varied — the same matters are described in dozens of different phrasings, in different languages. An NLP agent recognizes intent; it doesn't search for specific phrases.
  • A staged deployment. In line with the approach cm-opti applies in practice: first organizing the categories and classification rules (stage 1), then automating the routing (stage 2), and finally deploying the NLP agent (stage 3). Without stage 1, there would be nothing to compare the stage-3 results against (cf. What is automation? — the section on the stages of automation).
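
Why keyword rules fall short is easy to show. In the sketch below, one hypothetical rule handles the literal phrase but misses a paraphrase and a German letter describing the same kind of event. The rule, the queue name and the three letters are invented.

```python
from typing import Optional

# Hypothetical keyword rules of the "contains X, route to Y" kind.
KEYWORD_RULES = {"water damage": "property_claims"}


def rule_based_route(text: str) -> Optional[str]:
    """Return a queue if any keyword occurs literally in the text, else None."""
    lowered = text.lower()
    for keyword, queue in KEYWORD_RULES.items():
        if keyword in lowered:
            return queue
    return None


# Three invented letters describing the same kind of event.
letters = [
    "I am reporting water damage in my kitchen.",
    "A burst pipe flooded our basement last night.",
    "Wasserschaden in der Küche, bitte um Rückruf.",
]
for letter in letters:
    print(rule_based_route(letter), "<-", letter)
```

Only the first letter is routed. Covering the other two with rules means adding phrases indefinitely, which is the maintenance problem a model trained on intent avoids.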

"The first thing we did in this project had nothing to do with technology. We sat down with the team and wrote out the categories of documents, the paths for handling them and the criteria for urgency. Only once that was clear did we have a foundation for automation. A company that skips this step deploys a system that sorts documents by rules no one defined — and then no one trusts its decisions."

— Karol Jurewicz, Business Process Architect, cm-opti

5. What OCR and NLP won't do — and when a human is needed

⚡ In one sentence

OCR and NLP cope with repetitive, structured documents — but where the content is ambiguous, the context unclear or the stakes high, a human makes the decision.

💡 In plain terms

OCR and NLP are tools — they're neither all-knowing nor infallible. It's worth knowing where their capabilities end:

  • Input quality affects output quality. A blurry scan, a photocopy of a photocopy, handwriting, a document partly covered by a stamp — all of these reduce OCR accuracy. The "garbage in, garbage out" principle applies here without mercy.
  • The ambiguity of language. NLP interprets text on the basis of patterns. But language can be ambiguous: irony, unusual phrasings, industry jargon, grammatical errors — these are situations in which the system can get it wrong.
  • New types of documents. A system trained on invoices won't automatically recognize a construction-works acceptance protocol.
  • High-stakes decisions. Is a claim valid? Does a contract contain risky clauses? Here OCR and NLP prepare the data, but a human makes the decision.

🔧 Deep dive

In implementation practice, an important concept is the confidence score — the level of certainty with which the system classifies a document or extracts data. A well-designed system doesn't make decisions at a low confidence level — instead, it routes the document to a human for verification.

This "human-in-the-loop" approach combines the speed of automation with the safety of human judgment: documents with a high confidence level are processed automatically, while unusual or ambiguous ones wait for a person's decision. Over time, as the system is trained on new examples, the proportion of documents requiring manual verification drops.
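
In code, the core of this approach is a single threshold. The sketch below assumes a cut-off of 0.90 purely for illustration; in practice the value is tuned on validated data, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    document_id: str
    category: str
    confidence: float  # between 0.0 and 1.0


# Assumed cut-off for the example; in practice it is tuned on validated data.
AUTO_THRESHOLD = 0.90


def decide(pred: Prediction, threshold: float = AUTO_THRESHOLD) -> str:
    """Process confident predictions automatically, send the rest to a person."""
    return "auto" if pred.confidence >= threshold else "human_review"


def split_batch(preds: list[Prediction]) -> tuple[list[Prediction], list[Prediction]]:
    """Split a batch into (automatic, needs human review)."""
    auto = [p for p in preds if decide(p) == "auto"]
    review = [p for p in preds if decide(p) == "human_review"]
    return auto, review


batch = [
    Prediction("doc-1", "new_claim", 0.97),
    Prediction("doc-2", "legal_deadline", 0.55),
]
auto, review = split_batch(batch)
print([p.document_id for p in auto], [p.document_id for p in review])
```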

An important observation: the confidence score isn't a measure of truth — it's a measure of how closely the pattern of a new document resembles the patterns the system learned on. A high confidence score with poor training data gives a false sense of security. That's why the quality of the training data and regular validation of the results aren't a one-off task, but a continuous process (cf. What is process optimization? — the section on continuous improvement).
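
One way to run such validation is to audit a sample by hand and compare the stated confidence with the observed accuracy. The sketch below groups audited predictions into 10-point confidence buckets. The audit sample is invented and the function name is hypothetical.

```python
from collections import defaultdict


def accuracy_by_confidence(audited) -> dict:
    """Group manually audited predictions into 10-point confidence buckets.

    audited: iterable of (confidence, was_correct) pairs.
    Returns {bucket lower bound in percent: observed accuracy}.
    """
    tally = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for confidence, was_correct in audited:
        bucket = min(int(confidence * 10), 9) * 10  # 0.97 -> 90, 1.0 -> 90
        tally[bucket][0] += int(was_correct)
        tally[bucket][1] += 1
    return {bucket: correct / total for bucket, (correct, total) in tally.items()}


# Invented audit sample, for illustration only.
sample = [(0.95, True), (0.97, True), (0.92, False), (0.91, True),
          (0.65, True), (0.62, False)]
print(accuracy_by_confidence(sample))
```

If documents scored above 90% turn out to be right only three times out of four, as in this made-up sample, the score is overconfident and the automation threshold should not be trusted as it stands.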

6. Where to start with document processing in a company

⚡ In one sentence

Start with one type of document that is processed most often and consumes the most time — that's where OCR and NLP will deliver the fastest, measurable effect.

💡 In plain terms

You don't have to automate the whole document flow right away. The best implementations start with one specific area:

  • Identify the bottleneck. Which type of document consumes the most time? Invoices? Contracts? Reports? Correspondence? Just count: how many documents per day/week, how many minutes per document, how many errors. The Pareto principle (cf. What is automation?) will help you find the 20% of documents that generate 80% of the work.
  • Check the quality of the input. Do the documents arrive in a consistent format? Are the scans legible? Is there one channel (email, system), or five? The more standardized the input, the faster and cheaper the deployment.
  • Define what data you want to extract. Not "everything" — specifically: number, date, amount, counterparty name, category. The more precisely you define the goal, the better the system will work from day one. The extracted data can feed dashboards and BI reports that let you make decisions faster.
  • Start with a human in the loop. The first stage isn't full automation — it's a system that proposes, and a human approves. Over time, the proportion of automatic decisions grows.
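
The counting in the first step takes a few lines. The figures below are invented; replace them with your own counts of documents per week and minutes per document.

```python
# Invented figures for illustration; replace them with your own counts.
WEEKLY_VOLUME = {
    # document type: (documents per week, minutes per document)
    "invoices": (400, 6),
    "contracts": (20, 30),
    "reports": (15, 20),
    "correspondence": (100, 3),
}


def weekly_hours(volume: dict) -> dict:
    """Hours per week spent on each document type, largest first."""
    hours = {name: count * minutes / 60 for name, (count, minutes) in volume.items()}
    return dict(sorted(hours.items(), key=lambda item: item[1], reverse=True))


def share_of_top(hours: dict) -> float:
    """Share of total time consumed by the most expensive document type."""
    return max(hours.values()) / sum(hours.values())


hours = weekly_hours(WEEKLY_VOLUME)
print(hours)
print(f"{share_of_top(hours):.0%} of the time goes to {next(iter(hours))}")
```

With these made-up numbers, invoices take 40 of 60 weekly hours, so they would be the natural first candidate.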

A separate question is the method of deployment: a ready-made platform, a no-code tool or a solution built from scratch. That depends on the scale, the complexity of the documents and how non-standard the process is. We cover this topic in the article on systems integration.

Companies in Poland and Germany face the same problem: a growing volume of documents, shrinking teams, pressure for speed. OCR and NLP are proven tools that solve this problem — but only when the deployment starts with the process, not the technology. Our first step is always a diagnosis: what documents, where they come from, where the data goes, who makes the decisions. Only with that picture do we choose the tools — because technology should serve the process, not the other way around.

— The cm-opti perspective

Frequently asked questions (FAQ)

Does OCR handle handwriting?

Partially. Modern OCR systems based on neural networks recognize block-letter handwriting (e.g. hand-filled forms), but accuracy is lower than for printed text. Free-form handwriting (letters, notes, signatures) is still a challenge, and accuracy drops significantly.

How does OCR differ from NLP?

OCR converts an image into text — it sees letters. NLP analyzes text and extracts meaning from it — it "understands" what those letters mean in context. OCR is the eyes, NLP is the brain.

Does implementing OCR and NLP require a large budget?

Not necessarily. A simple OCR deployment on one document type (e.g. invoices) is a project that can be launched in weeks. Costs depend on scale, document complexity and data-security requirements. The first question is not "how much does the technology cost" but "how much does the absence of automation cost" — count the team hours spent on manual re-keying.

Can AI completely replace people in document processing?

No. AI takes over the mechanical part — reading, classifying, extracting data. But decisions that require judgment, experience and knowledge of the context stay with people. A well-deployed system changes the nature of the work: instead of re-keying, people verify and decide.

Concepts explained in this article → Glossary

OCR (Optical Character Recognition), NLP (Natural Language Processing), NER (Named Entity Recognition), tokenization, text classification, multimodal models, confidence score, human-in-the-loop, Transformer

Sources and references

  • Definition of OCR — based on IBM, TechTarget, Wikipedia and publicly available technical knowledge.
  • Definition of NLP — based on publicly available industry knowledge in the field of natural language processing.
  • Insurance case study — a cm-opti project in the German market. Data: 30% reduction in backlog, ~25% less correspondence requiring manual handling, 90% accuracy in recognizing the risk category.
  • OCR accuracy for printed documents (95–99%) — based on OCR benchmarks (AIMultiple 2026, Parsea 2026).