OCR scanned documents into searchable text

Northwind’s archive is full of scanned PDFs — signed contracts, old invoices, delivery notes someone photographed on a phone. They look fine, but they are just pictures of text. Search for a client name and Drive finds nothing, because there is no text to find. The document is invisible to everyone who did not file it.

Google Drive already runs optical character recognition when you convert a PDF into a Google Doc — you just have to ask for it. This script does exactly that: it takes an image-based PDF, converts it to a Doc with the OCR flag switched on, and drops the searchable result into an output folder. Run it over a whole folder and an unsearchable archive becomes a searchable one.

What you’ll need

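The Drive API enabled in Advanced Services. In the Apps Script editor, open Services, add Drive API, and keep the identifier as Drive. The script calls Drive.Files.create and sets name on the metadata, which are the v3 names, so choose v3 as the version.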
A folder of scanned PDFs to process, and a separate output folder for the OCR’d Docs. Keeping them apart stops the originals and the converted Docs from getting mixed up.
Nothing else. OCR on conversion is a free, built-in Drive feature.

The script

// Folder of scanned PDFs to read from.
const SOURCE_FOLDER_ID = '1abcSourceFolderId';

// Folder where the searchable Google Docs are written.
const OUTPUT_FOLDER_ID = '1abcOutputFolderId';

// Language hint for the OCR engine — improves accuracy on the
// expected language. Use a two-letter ISO 639-1 code.
const OCR_LANGUAGE = 'en';

/**
 * OCRs a single PDF: converts it to a Google Doc with text recognition
 * switched on and files the result in the output folder.
 *
 * @param {string} fileId The Drive ID of the source PDF.
 * @param {string} outputFolderId The folder to write the OCR'd Doc into.
 */
function ocrPdf(fileId, outputFolderId) {
  const source = DriveApp.getFileById(fileId);

  // 1. Describe the new Doc: a searchable name, the Docs MIME type,
  //    and the output folder as its parent.
  const meta = {
    name: source.getName().replace(/\.pdf$/i, '') + ' (OCR)',
    mimeType: MimeType.GOOGLE_DOCS,
    parents: [outputFolderId],
  };

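  // 2. Create the Doc from the PDF's bytes. Uploading a PDF with the
  //    Docs MIME type as the target is what makes Drive run text
  //    recognition during the conversion. Note: ocr is documented for
  //    Drive API v2; the v3 Files.create reference lists only
  //    ocrLanguage, so the ocr flag is kept here on the assumption
  //    that v3 accepts and ignores it.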
  Drive.Files.create(meta, source.getBlob(), {
    ocr: true,
    ocrLanguage: OCR_LANGUAGE,
  });
}

/**
 * OCRs every PDF in a folder, writing each searchable Doc to the
 * output folder.
 *
 * @param {string} folderId The folder of scanned PDFs.
 * @param {string} outputFolderId The folder to write OCR'd Docs into.
 */
function ocrFolder(folderId, outputFolderId) {
  const files = DriveApp.getFolderById(folderId)
    .getFilesByType('application/pdf');

  // Bail out early if the folder has no PDFs at all.
  if (!files.hasNext()) {
    Logger.log('No PDFs in the source folder — nothing to do.');
    return;
  }

  // Walk the folder one PDF at a time, OCR'ing each in turn.
  let count = 0;
  while (files.hasNext()) {
    ocrPdf(files.next().getId(), outputFolderId);
    count++;
  }
  Logger.log('OCR complete — processed ' + count + ' PDF(s).');
}

/**
 * Convenience entry point — OCRs the configured source folder into the
 * configured output folder.
 */
function ocrSourceFolder() {
  ocrFolder(SOURCE_FOLDER_ID, OUTPUT_FOLDER_ID);
}

How it works

ocrSourceFolder is the entry point — it passes the two configured folder IDs to ocrFolder so the rest of the code stays generic.
ocrFolder asks Drive for every PDF in the source folder. If there are none, it logs a message and stops before touching anything.
For each PDF it calls ocrPdf, which reads the original file by ID.
ocrPdf builds a metadata object: the new name with an (OCR) suffix, the Google Docs MIME type, and the output folder as the parent.
Drive.Files.create does the real work. Passing the PDF’s blob alongside ocr: true tells Drive to run text recognition while it converts the scanned PDF into a Doc. The text inside becomes selectable and searchable.
ocrFolder keeps a running count and logs the total when the folder is done.
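
The metadata step can be pulled out into a small pure function, which makes the naming rule easy to check on its own. This is a minimal sketch rather than part of the script: buildOcrMeta is a hypothetical helper name, and the MIME type is written out as a literal string (the value that MimeType.GOOGLE_DOCS stands for) so the snippet also runs outside the Apps Script editor.

```javascript
// Hypothetical helper (not in the script above): builds the metadata
// object that ocrPdf passes to Drive.Files.create.
const GOOGLE_DOCS_MIME = 'application/vnd.google-apps.document';

function buildOcrMeta(pdfName, outputFolderId) {
  return {
    // Strip a trailing ".pdf" in any letter case, then add the suffix.
    name: pdfName.replace(/\.pdf$/i, '') + ' (OCR)',
    mimeType: GOOGLE_DOCS_MIME,
    parents: [outputFolderId],
  };
}

// The extension is removed whatever its case, and a name without an
// extension simply gains the suffix.
const examples = ['Invoice 4471.pdf', 'SCAN_0001.PDF', 'Delivery note March'];
const names = examples.map((n) => buildOcrMeta(n, 'outputFolderId').name);
console.log(names.join('\n'));
```

Keeping the rule in one place also helps if you later want to look up the Doc that belongs to a given PDF by name.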

Example run

Say SOURCE_FOLDER_ID holds three scanned PDFs:

Source file	What it is
`Contract - Acme Ltd.pdf`	A signed scan, no embedded text
`Invoice 4471.pdf`	A photographed invoice
`Delivery note March.pdf`	A scanned delivery slip

After a run, the output folder holds three Google Docs:

Output Doc	Searchable?
`Contract - Acme Ltd (OCR)`	Yes: a search for “Acme” now finds it
`Invoice 4471 (OCR)`	Yes — line items are real text
`Delivery note March (OCR)`	Yes

The log reads OCR complete — processed 3 PDF(s). The original PDFs are left untouched, so nothing is lost if a conversion looks wrong.

Run it

This is usually a one-off cleanup or an occasional catch-up, not a daily job:

Set SOURCE_FOLDER_ID and OUTPUT_FOLDER_ID to your two folders.
In the Apps Script editor, select ocrSourceFolder and click Run.
Approve the authorisation prompt the first time.
Open the output folder and spot-check a Doc — search inside it to confirm the text came through.
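
Part of that spot-check can be scripted. The sketch below is one possible approach, not part of the original script: findSparseDocs is a hypothetical helper that lists output Docs whose recognised text is very short, and the 50-character threshold is an arbitrary assumption you would tune to your own scans. It reads each Doc's text with DocumentApp.openById(...).getBody().getText().

```javascript
// Assumption: a Doc with fewer characters than this probably came out
// of OCR empty or nearly empty.
const MIN_EXPECTED_CHARS = 50;

/**
 * Hypothetical helper: returns the names of Docs in the output folder
 * whose text is shorter than MIN_EXPECTED_CHARS, so you know which
 * ones to open and check by hand.
 */
function findSparseDocs(outputFolderId) {
  const docs = DriveApp.getFolderById(outputFolderId)
    .getFilesByType('application/vnd.google-apps.document');
  const sparse = [];
  while (docs.hasNext()) {
    const file = docs.next();
    const text = DocumentApp.openById(file.getId()).getBody().getText();
    if (text.trim().length < MIN_EXPECTED_CHARS) {
      sparse.push(file.getName());
    }
  }
  return sparse;
}
```

An empty result does not prove the text is accurate. It only rules out conversions that produced almost nothing.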

If new scans arrive regularly, point a time-driven trigger at ocrSourceFolder and move processed PDFs out of the source folder so each run only handles what is new.
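
A rough sketch of that setup follows. It assumes a third folder for processed PDFs (PROCESSED_FOLDER_ID is a placeholder) and two hypothetical functions, ocrAndArchive and installDailyTrigger, that are not in the script above. The conversion function is passed in as convert, so you can hand it ocrPdf.

```javascript
// Assumption: a third folder that holds PDFs once they have been OCR'd.
const PROCESSED_FOLDER_ID = '1abcProcessedFolderId';

/**
 * Hypothetical variant of ocrFolder: converts each PDF, then moves it
 * out of the source folder so the next run does not see it again.
 * convert is the per-file function, for example ocrPdf.
 */
function ocrAndArchive(sourceFolderId, outputFolderId, processedFolderId, convert) {
  const processed = DriveApp.getFolderById(processedFolderId);
  const iterator = DriveApp.getFolderById(sourceFolderId)
    .getFilesByType('application/pdf');

  // Collect the files first so that moving them cannot disturb the iterator.
  const files = [];
  while (iterator.hasNext()) {
    files.push(iterator.next());
  }

  files.forEach((file) => {
    convert(file.getId(), outputFolderId);
    file.moveTo(processed); // Only reached if the conversion did not throw.
  });
  return files.length;
}

/** Run once by hand to schedule ocrSourceFolder daily, at around 2 a.m. */
function installDailyTrigger() {
  ScriptApp.newTrigger('ocrSourceFolder')
    .timeBased()
    .everyDays(1)
    .atHour(2)
    .create();
}
```

For the archive step to take effect on scheduled runs, ocrSourceFolder would need to call ocrAndArchive(SOURCE_FOLDER_ID, OUTPUT_FOLDER_ID, PROCESSED_FOLDER_ID, ocrPdf) instead of ocrFolder. Run installDailyTrigger only once, because each call adds another trigger.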

Watch out for

OCR accuracy depends on the scan. Crisp, high-contrast pages convert well; faint, skewed, or handwritten pages will have errors. Treat the output as searchable, not as a perfect transcription.
The conversion produces a Google Doc, not a searchable PDF. If you need the PDF format back, export the Doc to PDF afterwards — but the Doc itself is what Drive search indexes.
Re-running ocrFolder on the same source folder reprocesses every PDF and creates duplicate Docs. Move or tag processed files, or run it once and archive the source.
ocrLanguage is a hint, not a filter. Set it to the language you expect; mixed-language pages will still convert, just less accurately on the unexpected stretches.
Large or numerous PDFs can push past the script runtime limit. If a folder is huge, split it across several runs rather than OCR’ing everything at once.
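
The duplicate and runtime problems can be handled together with a loop that is safe to re-run. The sketch below is one way to do it, under stated assumptions: ocrFolderResumable is a hypothetical function, the 4.5-minute budget assumes a 6-minute execution limit (check the quota for your account type), and "already converted" is decided purely by finding a Doc with the matching (OCR) name in the output folder.

```javascript
// Assumption: stop after about 4.5 minutes, leaving headroom under an
// assumed 6-minute execution limit.
const TIME_BUDGET_MS = 4.5 * 60 * 1000;

/**
 * Hypothetical resumable variant of ocrFolder. Skips PDFs whose "(OCR)"
 * Doc already exists in the output folder and stops early when the
 * time budget is used up, so a later run can pick up the rest.
 * convert is the per-file function, for example ocrPdf.
 */
function ocrFolderResumable(folderId, outputFolderId, convert) {
  const started = Date.now();
  const output = DriveApp.getFolderById(outputFolderId);
  const files = DriveApp.getFolderById(folderId)
    .getFilesByType('application/pdf');
  const result = { converted: 0, skipped: 0, finished: true };

  while (files.hasNext()) {
    if (Date.now() - started > TIME_BUDGET_MS) {
      result.finished = false; // Out of time; run again to continue.
      break;
    }
    const file = files.next();
    const docName = file.getName().replace(/\.pdf$/i, '') + ' (OCR)';
    if (output.getFilesByName(docName).hasNext()) {
      result.skipped++; // A Doc with this name exists from an earlier run.
      continue;
    }
    convert(file.getId(), outputFolderId);
    result.converted++;
  }
  return result;
}
```

Calling ocrFolderResumable(SOURCE_FOLDER_ID, OUTPUT_FOLDER_ID, ocrPdf) repeatedly until finished is true works through a large folder in slices. Matching by name is a simplification: two different PDFs that share a name would be treated as one, so the second would be skipped.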

Related

Build a recurring file-delivery system

Build a Drive search index in Sheets

Build a shared-folder onboarding kit

Route saved email attachments to project folders

Bundle a folder of images into one PDF