OCR scanned documents into searchable text
Extract text from image-based PDFs in Drive, using the free OCR that runs when Drive converts a file to Docs.
Published Oct 4, 2025
Northwind’s archive is full of scanned PDFs — signed contracts, old invoices, delivery notes someone photographed on a phone. They look fine, but they are just pictures of text. Search for a client name and Drive finds nothing, because there is no text to find. The document is invisible to everyone who did not file it.
Google Drive already runs optical character recognition when you convert a PDF into a Google Doc — you just have to ask for it. This script does exactly that: it takes an image-based PDF, converts it to a Doc with the OCR flag switched on, and drops the searchable result into an output folder. Run it over a whole folder and an unsearchable archive becomes a searchable one.
What you’ll need
- The Drive API enabled in Advanced Services. In the Apps Script editor, open Services, add Drive API, and keep the identifier as Drive.
- A folder of scanned PDFs to process, and a separate output folder for the OCR’d Docs. Keeping them apart stops originals and converted Docs from mixing, which makes it easier to see what has already been processed.
- Nothing else. OCR on conversion is a free, built-in Drive feature.
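Before the first run, it can help to confirm the setup. The sketch below is a hypothetical preflight check: describeSetupProblems, folderExists, and checkSetup are illustrative names, not part of the script on this page. It assumes the advanced service keeps the identifier Drive and that DriveApp.getFolderById throws when a folder is missing or inaccessible.

```javascript
/**
 * Hypothetical helper: turns the results of the setup checks into a list
 * of human-readable problems. Pure logic, so it can be tested anywhere.
 */
function describeSetupProblems(checks) {
  const problems = [];
  if (!checks.driveServiceEnabled) {
    problems.push('Drive API advanced service is not enabled (Services > Drive API).');
  }
  if (!checks.sourceFolderFound) {
    problems.push('Source folder not found or not accessible.');
  }
  if (!checks.outputFolderFound) {
    problems.push('Output folder not found or not accessible.');
  }
  if (checks.sourceFolderFound && checks.outputFolderFound && checks.sameFolder) {
    problems.push('Source and output folders are the same; keep them separate.');
  }
  return problems;
}

/** Returns true if DriveApp can open the folder, false otherwise. */
function folderExists(folderId) {
  try {
    // Assumption: getFolderById throws for a bad or inaccessible ID.
    DriveApp.getFolderById(folderId);
    return true;
  } catch (err) {
    return false;
  }
}

/** Hypothetical preflight check; run it once from the Apps Script editor. */
function checkSetup(sourceFolderId, outputFolderId) {
  const problems = describeSetupProblems({
    // `typeof` avoids a ReferenceError when the service has not been added.
    driveServiceEnabled: typeof Drive !== 'undefined',
    sourceFolderFound: folderExists(sourceFolderId),
    outputFolderFound: folderExists(outputFolderId),
    sameFolder: sourceFolderId === outputFolderId,
  });
  Logger.log(problems.length ? problems.join('\n') : 'Setup looks fine.');
  return problems;
}
```

Calling checkSetup with the two folder IDs from the script's configuration logs either a list of problems or a single all-clear line.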
The script
// Folder of scanned PDFs to read from.
const SOURCE_FOLDER_ID = '1abcSourceFolderId';
// Folder where the searchable Google Docs are written.
const OUTPUT_FOLDER_ID = '1abcOutputFolderId';
// Language hint for the OCR engine — improves accuracy on the
// expected language. Use a two-letter ISO 639-1 code.
const OCR_LANGUAGE = 'en';
/**
 * OCRs a single PDF: converts it to a Google Doc with text recognition
 * switched on and files the result in the output folder.
 *
 * @param {string} fileId The Drive ID of the source PDF.
 * @param {string} outputFolderId The folder to write the OCR'd Doc into.
 */
function ocrPdf(fileId, outputFolderId) {
  const source = DriveApp.getFileById(fileId);
  // 1. Describe the new Doc: a searchable name, the Docs MIME type,
  //    and the output folder as its parent.
  const meta = {
    name: source.getName().replace(/\.pdf$/i, '') + ' (OCR)',
    mimeType: MimeType.GOOGLE_DOCS,
    parents: [outputFolderId],
  };
  // 2. Create the Doc from the PDF's bytes. The ocr flag is what makes
  //    Drive run text recognition during the conversion.
  Drive.Files.create(meta, source.getBlob(), {
    ocr: true,
    ocrLanguage: OCR_LANGUAGE,
  });
}
/**
 * OCRs every PDF in a folder, writing each searchable Doc to the
 * output folder.
 *
 * @param {string} folderId The folder of scanned PDFs.
 * @param {string} outputFolderId The folder to write OCR'd Docs into.
 */
function ocrFolder(folderId, outputFolderId) {
  const files = DriveApp.getFolderById(folderId)
    .getFilesByType('application/pdf');
  // Bail out early if the folder has no PDFs at all.
  if (!files.hasNext()) {
    Logger.log('No PDFs in the source folder — nothing to do.');
    return;
  }
  // Walk the folder one PDF at a time, OCR'ing each in turn.
  let count = 0;
  while (files.hasNext()) {
    ocrPdf(files.next().getId(), outputFolderId);
    count++;
  }
  Logger.log('OCR complete — processed ' + count + ' PDF(s).');
}
/**
 * Convenience entry point: OCRs the configured source folder into the
 * configured output folder.
 */
function ocrSourceFolder() {
  ocrFolder(SOURCE_FOLDER_ID, OUTPUT_FOLDER_ID);
}
How it works
- ocrSourceFolder is the entry point. It passes the two configured folder IDs to ocrFolder so the rest of the code stays generic.
- ocrFolder asks Drive for every PDF in the source folder. If there are none, it logs a message and stops before touching anything.
- For each PDF it calls ocrPdf, which reads the original file by ID.
- ocrPdf builds a metadata object: the new name with an (OCR) suffix, the Google Docs MIME type, and the output folder as the parent.
- Drive.Files.create does the real work. Passing the PDF’s blob alongside ocr: true tells Drive to run text recognition while it converts the image into a Doc, so the text inside becomes selectable and searchable.
- ocrFolder keeps a running count and logs the total when the folder is done.
Example run
Say SOURCE_FOLDER_ID holds three scanned PDFs:
| Source file | What it is |
|---|---|
| Contract - Acme Ltd.pdf | A signed scan, no embedded text |
| Invoice 4471.pdf | A photographed invoice |
| Delivery note March.pdf | A scanned delivery slip |
After a run, the output folder holds three Google Docs:
| Output Doc | Searchable? |
|---|---|
| Contract - Acme Ltd (OCR) | Yes: searching “Acme” now finds it |
| Invoice 4471 (OCR) | Yes: line items are real text |
| Delivery note March (OCR) | Yes |
The log reads OCR complete — processed 3 PDF(s). The original PDFs are left
untouched, so nothing is lost if a conversion looks wrong.
Run it
This is usually a one-off cleanup or an occasional catch-up, not a daily job:
- Set SOURCE_FOLDER_ID and OUTPUT_FOLDER_ID to your two folders.
- In the Apps Script editor, select ocrSourceFolder and click Run.
- Approve the authorisation prompt the first time.
- Open the output folder and spot-check a Doc — search inside it to confirm the text came through.
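The spot-check in the last step can also be scripted. A minimal sketch, assuming the output folder holds only Docs produced by this script; summarizeOcrText and logOcrSamples are hypothetical names, not part of the script above. It logs a character count and a short preview for each Doc, so conversions that recognised no text stand out.

```javascript
/**
 * Hypothetical helper: summarizes OCR'd text as a character count plus a
 * short single-line preview.
 */
function summarizeOcrText(text, previewLength) {
  // Collapse line breaks and runs of spaces so the preview fits one log line.
  const collapsed = text.replace(/\s+/g, ' ').trim();
  return {
    characters: collapsed.length,
    preview: collapsed.slice(0, previewLength),
    looksEmpty: collapsed.length === 0,
  };
}

/** Logs a one-line summary for every Google Doc in the output folder. */
function logOcrSamples(outputFolderId) {
  const docs = DriveApp.getFolderById(outputFolderId)
    .getFilesByType(MimeType.GOOGLE_DOCS);
  while (docs.hasNext()) {
    const file = docs.next();
    const text = DocumentApp.openById(file.getId()).getBody().getText();
    const summary = summarizeOcrText(text, 80);
    Logger.log(
      file.getName() + ': ' + summary.characters + ' chars' +
      (summary.looksEmpty ? ' (no text recognised)' : ' | ' + summary.preview)
    );
  }
}
```

A Doc that reports zero characters is worth opening by hand: the scan may be too faint or skewed for the OCR engine to read.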
If new scans arrive regularly, point a time-driven trigger at
ocrSourceFolder and move processed PDFs out of the source folder so each
run only handles what is new.
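The recurring setup described above could be sketched as follows. Several things here are assumptions: PROCESSED_FOLDER_ID is a hypothetical third folder, the function names ocrAndArchive, ocrNewScans, and installDailyOcrTrigger are illustrative, and a daily run at around 2 a.m. in the script's time zone is only one possible schedule. Because the PDFs are moved inside the run, the trigger points at the ocrNewScans wrapper rather than at ocrSourceFolder directly.

```javascript
// Assumption: a third folder for PDFs that have already been OCR'd.
// This ID is a placeholder and is not part of the script above.
const PROCESSED_FOLDER_ID = '1abcProcessedFolderId';

/**
 * OCRs each PDF in the source folder, then moves it to the processed
 * folder so the next run skips it. Relies on ocrPdf from the script above.
 */
function ocrAndArchive(sourceFolderId, outputFolderId, processedFolderId) {
  const processed = DriveApp.getFolderById(processedFolderId);
  const files = DriveApp.getFolderById(sourceFolderId)
    .getFilesByType('application/pdf');
  let count = 0;
  while (files.hasNext()) {
    const file = files.next();
    ocrPdf(file.getId(), outputFolderId);
    // Move only after the conversion returned; if ocrPdf throws, the PDF
    // stays in the source folder for the next run.
    file.moveTo(processed);
    count++;
  }
  return count;
}

/** Trigger handler. Triggers pass no folder IDs, so this wrapper supplies them. */
function ocrNewScans() {
  // SOURCE_FOLDER_ID and OUTPUT_FOLDER_ID come from the script above.
  ocrAndArchive(SOURCE_FOLDER_ID, OUTPUT_FOLDER_ID, PROCESSED_FOLDER_ID);
}

/** Installs a daily time-driven trigger; run once by hand. */
function installDailyOcrTrigger() {
  // Remove any existing trigger for the same handler to avoid duplicates.
  ScriptApp.getProjectTriggers()
    .filter((t) => t.getHandlerFunction() === 'ocrNewScans')
    .forEach((t) => ScriptApp.deleteTrigger(t));
  ScriptApp.newTrigger('ocrNewScans')
    .timeBased()
    .everyDays(1)
    .atHour(2)
    .create();
}
```

Moving each file only after its conversion succeeds means a failed run can simply be repeated: anything still in the source folder has not been OCR'd yet.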
Watch out for
- OCR accuracy depends on the scan. Crisp, high-contrast pages convert well; faint, skewed, or handwritten pages will have errors. Treat the output as searchable, not as a perfect transcription.
- The conversion produces a Google Doc, not a searchable PDF. If you need the PDF format back, export the Doc to PDF afterwards — but the Doc itself is what Drive search indexes.
- Re-running ocrFolder on the same source folder reprocesses every PDF and creates duplicate Docs. Move or tag processed files, or run it once and archive the source.
- ocrLanguage is a hint, not a filter. Set it to the language you expect; mixed-language pages will still convert, just less accurately on the unexpected stretches.
- Large or numerous PDFs can push past the script runtime limit. If a folder is huge, split it across several runs rather than OCR’ing everything at once.
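The duplicate and runtime warnings above can be handled together. This is a sketch, not the article's method: ocrFolderResumable and ocrDocName are hypothetical names, the 4.5-minute budget assumes a per-execution limit of about six minutes, and "already processed" is judged purely by whether a Doc with the expected (OCR) name exists in the output folder.

```javascript
// Assumption: stop well before the execution limit (about 6 minutes per
// run is assumed here; check the quota that applies to your account).
const MAX_RUNTIME_MS = 4.5 * 60 * 1000;

/** Name the script above gives the Doc produced from a PDF. */
function ocrDocName(pdfName) {
  return pdfName.replace(/\.pdf$/i, '') + ' (OCR)';
}

/**
 * Hypothetical variant of ocrFolder: skips PDFs whose OCR'd Doc already
 * exists and stops when the time budget is spent, so it can be re-run
 * until nothing is left. Relies on ocrPdf from the script above.
 */
function ocrFolderResumable(folderId, outputFolderId) {
  const started = Date.now();

  // Collect the names of Docs already in the output folder.
  const done = new Set();
  const docs = DriveApp.getFolderById(outputFolderId)
    .getFilesByType(MimeType.GOOGLE_DOCS);
  while (docs.hasNext()) {
    done.add(docs.next().getName());
  }

  const pdfs = DriveApp.getFolderById(folderId)
    .getFilesByType('application/pdf');
  let converted = 0;
  while (pdfs.hasNext()) {
    if (Date.now() - started > MAX_RUNTIME_MS) {
      Logger.log('Time budget used up; run again to continue.');
      break;
    }
    const pdf = pdfs.next();
    if (done.has(ocrDocName(pdf.getName()))) {
      continue; // A Doc with the expected name exists, so skip this PDF.
    }
    ocrPdf(pdf.getId(), outputFolderId);
    converted++;
  }
  Logger.log('Converted ' + converted + ' new PDF(s) this run.');
  return converted;
}
```

Matching by name is the simplest check but not the most robust: two different PDFs with the same file name would be treated as one. Moving processed PDFs to a separate folder avoids that ambiguity.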