Build an AI data-cleaning assistant
Standardise messy Northwind names, addresses, and categories into clean canonical values.
Published Sep 4, 2025
Northwind’s data arrives from everywhere — web forms, imported CSVs, the odd copy-paste — and it shows. The same company turns up as “Larkfield Joinery”, “larkfield joinery ltd”, and “Larkfield”. The same county is spelt three ways. None of it is wrong enough to fix one cell at a time, and all of it is wrong enough to break a pivot table.
This script cleans a column without you writing a single normalisation rule.
It walks the messy column, asks Claude to map each value to a single
canonical form, and writes the result into a matching _clean column. The
raw data stays put as an audit trail; the clean column is what your reports
read from.
What you’ll need
- A Google Sheet where the messy values sit in a named column — say
company— and a sibling column namedcompany_cleanis ready for the output. - An Anthropic API key saved as
ANTHROPIC_API_KEYin Script Properties — see Store API keys and secrets securely.
The script
// The spreadsheet that holds the data to clean.
const DATA_SHEET_ID = '1abcDataId';
// Suffix appended to a column name to find its clean-output column.
const CLEAN_SUFFIX = '_clean';
/**
* Walks one column, normalises each value to a canonical form, and
* writes the result into the matching "<column>_clean" column.
* @param {string} columnName The header of the messy column to clean.
*/
function cleanColumn(columnName) {
const sheet = SpreadsheetApp.openById(DATA_SHEET_ID).getSheets()[0];
// 1. Read the whole sheet and split off the header row.
const [h, ...rows] = sheet.getDataRange().getValues();
// 2. Locate the messy column and its clean-output sibling.
const col = h.indexOf(columnName);
const cleanCol = h.indexOf(columnName + CLEAN_SUFFIX);
if (col === -1 || cleanCol === -1) {
Logger.log('Missing "' + columnName + '" or its ' + CLEAN_SUFFIX +
' column — check the headers.');
return;
}
// 3. Walk each row, skipping ones already cleaned.
rows.forEach((r, i) => {
if (r[cleanCol] || !r[col]) return;
// 4. Ask Claude for the canonical form, and nothing else.
const prompt =
'Normalise this ' + columnName + ' to a canonical form. ' +
'Return only the cleaned value, with no explanation.\n\n' + r[col];
sheet.getRange(i + 2, cleanCol + 1).setValue(callClaude(prompt));
});
Logger.log('Cleaning pass complete for "' + columnName + '".');
}
/**
* Minimal Anthropic API call. Haiku is plenty for single-value
* normalisation, and the key lives in Script Properties — never in code.
*/
function callClaude(prompt, model = 'claude-haiku-4-5-20251001', maxTokens = 400) {
const key = PropertiesService.getScriptProperties()
.getProperty('ANTHROPIC_API_KEY');
const res = UrlFetchApp.fetch('https://api.anthropic.com/v1/messages', {
method: 'post',
contentType: 'application/json',
headers: { 'x-api-key': key, 'anthropic-version': '2023-06-01' },
payload: JSON.stringify({
model,
max_tokens: maxTokens,
messages: [{ role: 'user', content: prompt }],
}),
muteHttpExceptions: true,
});
return JSON.parse(res.getContentText()).content[0].text.trim();
}
How it works
cleanColumnopens the data spreadsheet and reads the full range, splitting the header row from the data.- It looks up two column indexes — the messy column you named, and its
_cleansibling — and bails out with a clear log line if either is missing, so a header typo fails loudly instead of silently. - It walks each row, skipping any that are already cleaned or have no source value — so a re-run only processes new rows and never burns API calls twice.
- For each remaining row it builds a prompt asking for the canonical form and nothing else, which keeps the reply a single clean value.
- It calls Claude Haiku — fast and cheap, which matters one cell at a time —
and writes the cleaned value into the
_cleancolumn. - The original column is never touched, so you always have the raw value to compare against if a normalisation looks wrong.
Example run
Call cleanColumn('company') against a sheet like this:
| company | company_clean |
|---|---|
| larkfield joinery ltd | |
| LARKFIELD JOINERY | |
| Coastline Tours (Devon) | |
| coastline tours |
After a run, the _clean column collapses the variants:
| company | company_clean |
|---|---|
| larkfield joinery ltd | Larkfield Joinery Ltd |
| LARKFIELD JOINERY | Larkfield Joinery Ltd |
| Coastline Tours (Devon) | Coastline Tours |
| coastline tours | Coastline Tours |
Group by company_clean and the four rows become two — which is what a
report needs.
Run it
This is an on-demand job — run it after each data import:
- In the Apps Script editor, create a small wrapper that calls
cleanColumn('company')(or whichever column) and click Run. - Approve the authorisation prompt the first time.
- Spot-check the
_cleancolumn against the originals.
To let the team clean a column without the editor, add a custom menu:
function onOpen() {
SpreadsheetApp.getUi()
.createMenu('Data tools')
.addItem('Clean company column', 'cleanCompany')
.addToUi();
}
function cleanCompany() {
cleanColumn('company');
}
Watch out for
- Claude decides the canonical form. Two slightly different inputs should map to the same output — but spot-check, because it has no fixed master list unless you give it one in the prompt.
- For a closed set of valid values — a fixed county list, say — paste that list into the prompt so Claude maps onto it rather than inventing a form.
- Cleaned cells are skipped on re-runs. If you correct a source value and want
it re-cleaned, clear its
_cleancell first. - One API call per row adds up. A large import means a long run and a real cost — clean in batches and watch the six-minute execution limit.
Related
Parse semi-structured listings into tables
Extract recipes, specs, or ads from Northwind content into clean spreadsheet rows.
Updated Mar 7, 2026
Extract follow-ups from call transcripts
Turn Northwind sales calls into actionable tasks — owner + task + due date per extracted item.
Updated Jan 10, 2026
Build an AI data-enrichment pipeline
Fill missing company and contact fields on Northwind's prospect list — Claude infers from the row.
Updated Nov 23, 2025
Build an AI invoice and receipt parser
Read Northwind documents into structured ledger rows — vendor, amount, line items.
Updated Oct 2, 2025
Extract entities and relationships from text
Build a structured graph from Northwind prose — people, companies, and how they connect.
Updated Sep 24, 2025