Build an AI invoice and receipt parser
Read Northwind documents into structured ledger rows — vendor, amount, line items.
Published Oct 2, 2025
Northwind’s bookkeeping bottleneck is not the maths — it is the typing. Receipts and supplier invoices arrive as PDFs, and someone has to open each one, find the vendor, the date and the total, and key it into the ledger. For a busy month that is an hour of dull, error-prone copying.
This script hands the reading to Claude. It walks a Drive folder of PDFs, pulls
the text out of each one, and asks Claude to return the key fields — vendor,
date, total, and the individual line items — as strict JSON. Each parsed
document becomes one row in a Ledger sheet, with a link back to the original
file so anything that looks odd is one click away from the source.
What you’ll need
- A Google Drive folder containing the receipt and invoice PDFs. Pass its folder ID to the function.
- A
Ledgersheet with a header row. The script appends rows in this order:date,vendor,total,items,fileUrl. - An Anthropic API key saved as
ANTHROPIC_API_KEYin Script Properties — see Store API keys and secrets securely. - PDFs that contain a real text layer. Scanned images with no embedded text produce no extractable text — see “Watch out for”.
The script
// The spreadsheet that holds the ledger.
const LEDGER_SHEET_ID = '1abcLedgerId';
// How many characters of each PDF's text to send to Claude. Keeps the
// prompt within a sensible token budget — most receipts fit easily.
const MAX_CHARS = 4000;
/**
* Parses every PDF in a Drive folder into a structured ledger row.
*
* @param {string} folderId The ID of the Drive folder of PDFs.
*/
function parseReceiptsInFolder(folderId) {
const files = DriveApp.getFolderById(folderId)
.getFilesByType('application/pdf');
const sheet = SpreadsheetApp.openById(LEDGER_SHEET_ID).getSheets()[0];
let parsed = 0;
// 1. Walk every PDF in the folder.
while (files.hasNext()) {
const file = files.next();
// 2. Pull the text out of the PDF and trim it to a safe length.
const text = file.getBlob().getDataAsString().slice(0, MAX_CHARS);
// 3. Skip files with no extractable text (e.g. scanned images).
if (!text.trim()) {
Logger.log('No text in ' + file.getName() + ' — skipped.');
continue;
}
// 4. Ask Claude for a fixed JSON shape. A strict schema is the
// difference between a parseable result and a parsing nightmare.
const prompt =
'Extract this receipt as JSON in exactly this shape: ' +
'{"vendor": string, "date": "YYYY-MM-DD", "total": number, ' +
'"items": [{"desc": string, "amount": number}]}. ' +
'Return ONLY the JSON — no prose, no markdown.\n\n' + text;
// 5. Sonnet handles the extraction; strip any code fence, then parse.
const reply = callClaude(prompt, 'claude-sonnet-4-6', 600);
const data = JSON.parse(stripFences(reply));
// 6. Append one ledger row, keeping the line items as JSON and a
// link back to the source file for spot checks.
sheet.appendRow([
data.date,
data.vendor,
data.total,
JSON.stringify(data.items),
file.getUrl(),
]);
parsed++;
}
Logger.log('Parsed ' + parsed + ' documents into the ledger.');
}
/**
* Claude occasionally wraps JSON in a ```json code fence. Strip it so
* JSON.parse never chokes on the markdown.
*/
function stripFences(text) {
return text.replace(/```(?:json)?/g, '').trim();
}
/**
* Minimal Anthropic API call. The key lives in Script Properties — it
* is never pasted into the code.
*/
function callClaude(prompt, model = 'claude-haiku-4-5-20251001', maxTokens = 400) {
const key = PropertiesService.getScriptProperties()
.getProperty('ANTHROPIC_API_KEY');
const res = UrlFetchApp.fetch('https://api.anthropic.com/v1/messages', {
method: 'post',
contentType: 'application/json',
headers: { 'x-api-key': key, 'anthropic-version': '2023-06-01' },
payload: JSON.stringify({
model,
max_tokens: maxTokens,
messages: [{ role: 'user', content: prompt }],
}),
muteHttpExceptions: true,
});
return JSON.parse(res.getContentText()).content[0].text.trim();
}
How it works
parseReceiptsInFolderopens the given Drive folder and gets an iterator over every PDF file inside it.- For each file it reads the blob as a string and trims it to
MAX_CHARS, keeping the prompt within a sensible token budget. - If the file yields no text at all — common for scanned images — it logs a note and moves on rather than sending an empty prompt.
- It builds a prompt that pins the output to a strict JSON schema: a vendor, an ISO date, a numeric total, and an array of line items.
- It calls Claude Sonnet, since accurate field extraction over a whole
document benefits from the stronger model.
stripFencesremoves any code fence Claude added, thenJSON.parseturns the reply into a real object. - It appends one row to the
Ledgersheet — date, vendor, total, the line items stored as a JSON string, and a link to the original file.
Example run
A folder holds three PDFs. After a run, the Ledger sheet has three new rows:
| date | vendor | total | items | fileUrl |
|---|---|---|---|---|
| 2025-09-14 | Paper & Co | 48.20 | [{“desc”:“A4 card”,“amount”:31.00},{“desc”:“Delivery”,“amount”:17.20}] | drive.google.com/… |
| 2025-09-18 | Bright Lighting | 210.00 | [{“desc”:“Studio lamp”,“amount”:210.00}] | drive.google.com/… |
| 2025-09-22 | Cafe Verde | 12.40 | [{“desc”:“Client lunch”,“amount”:12.40}] | drive.google.com/… |
The items column holds JSON so the line-item detail survives without needing
extra columns — expand it later with a formula or a second script. The
fileUrl link means any figure that looks wrong is one click from the
original receipt.
Run it
This runs whenever you have a batch of documents to process.
- Drop the PDFs into a Drive folder and copy the folder ID from its URL.
- In the Apps Script editor, call
parseReceiptsInFolder('your-folder-id')— the simplest way is a tiny wrapper function with the ID filled in. - Approve the authorisation prompt the first time.
- Open the
Ledgersheet to check the new rows.
To run it on a schedule — say, end of each week — add a time-driven trigger that calls a wrapper with the folder ID already set.
Watch out for
- The parser re-reads every PDF in the folder on each run, so running it twice
duplicates rows. Move processed files to a
Donesubfolder, or record processed file IDs and skip them. getDataAsStringextracts the raw text layer of a PDF. Scanned receipts that are really images contain no text — they need OCR (for example, importing the PDF into Google Docs) before this script can read them.- Totals and dates come from a model reading messy documents. Spot-check the first few rows of each batch against the linked source, especially currency symbols and date formats.
MAX_CHARScaps the text sent per document. A long multi-page invoice may be truncated — raise the cap andmax_tokenstogether if line items go missing.UrlFetchApphas a daily call quota, and each call costs API credit. A large folder means many calls — batch the work or watch your usage.
Related
Parse semi-structured listings into tables
Extract recipes, specs, or ads from Northwind content into clean spreadsheet rows.
Updated Mar 7, 2026
Extract follow-ups from call transcripts
Turn Northwind sales calls into actionable tasks — owner + task + due date per extracted item.
Updated Jan 10, 2026
Build an AI data-enrichment pipeline
Fill missing company and contact fields on Northwind's prospect list — Claude infers from the row.
Updated Nov 23, 2025
Extract entities and relationships from text
Build a structured graph from Northwind prose — people, companies, and how they connect.
Updated Sep 24, 2025
Build an AI data-cleaning assistant
Standardise messy Northwind names, addresses, and categories into clean canonical values.
Updated Sep 4, 2025