appscript.dev
Automation Intermediate Sheets

Build an AI data-cleaning assistant

Standardise messy Northwind names, addresses, and categories into clean canonical values.

Published Sep 4, 2025

Northwind’s data arrives from everywhere — web forms, imported CSVs, the odd copy-paste — and it shows. The same company turns up as “Larkfield Joinery”, “larkfield joinery ltd”, and “Larkfield”. The same county is spelt three ways. None of it is wrong enough to fix one cell at a time, and all of it is wrong enough to break a pivot table.

This script cleans a column without you writing a single normalisation rule. It walks the messy column, asks Claude to map each value to a single canonical form, and writes the result into a matching _clean column. The raw data stays put as an audit trail; the clean column is what your reports read from.

What you’ll need

  • A Google Sheet where the messy values sit in a named column — say company — and a sibling column named company_clean is ready for the output.
  • An Anthropic API key saved as ANTHROPIC_API_KEY in Script Properties — see Store API keys and secrets securely.

The script

// The spreadsheet that holds the data to clean.
const DATA_SHEET_ID = '1abcDataId';

// Suffix appended to a column name to find its clean-output column.
const CLEAN_SUFFIX = '_clean';

/**
 * Walks one column, normalises each value to a canonical form, and
 * writes the result into the matching "<column>_clean" column.
 * @param {string} columnName The header of the messy column to clean.
 */
function cleanColumn(columnName) {
  const sheet = SpreadsheetApp.openById(DATA_SHEET_ID).getSheets()[0];

  // 1. Read the whole sheet and split off the header row.
  const [h, ...rows] = sheet.getDataRange().getValues();

  // 2. Locate the messy column and its clean-output sibling.
  const col = h.indexOf(columnName);
  const cleanCol = h.indexOf(columnName + CLEAN_SUFFIX);

  if (col === -1 || cleanCol === -1) {
    Logger.log('Missing "' + columnName + '" or its ' + CLEAN_SUFFIX +
      ' column — check the headers.');
    return;
  }

  // 3. Walk each row, skipping ones already cleaned.
  rows.forEach((r, i) => {
    if (r[cleanCol] || !r[col]) return;

    // 4. Ask Claude for the canonical form, and nothing else.
    const prompt =
      'Normalise this ' + columnName + ' to a canonical form. ' +
      'Return only the cleaned value, with no explanation.\n\n' + r[col];

    sheet.getRange(i + 2, cleanCol + 1).setValue(callClaude(prompt));
  });
  Logger.log('Cleaning pass complete for "' + columnName + '".');
}

/**
 * Minimal Anthropic API call. Haiku is plenty for single-value
 * normalisation, and the key lives in Script Properties — never in code.
 */
function callClaude(prompt, model = 'claude-haiku-4-5-20251001', maxTokens = 400) {
  const key = PropertiesService.getScriptProperties()
    .getProperty('ANTHROPIC_API_KEY');
  const res = UrlFetchApp.fetch('https://api.anthropic.com/v1/messages', {
    method: 'post',
    contentType: 'application/json',
    headers: { 'x-api-key': key, 'anthropic-version': '2023-06-01' },
    payload: JSON.stringify({
      model,
      max_tokens: maxTokens,
      messages: [{ role: 'user', content: prompt }],
    }),
    muteHttpExceptions: true,
  });
  return JSON.parse(res.getContentText()).content[0].text.trim();
}

How it works

  1. cleanColumn opens the data spreadsheet and reads the full range, splitting the header row from the data.
  2. It looks up two column indexes — the messy column you named, and its _clean sibling — and bails out with a clear log line if either is missing, so a header typo fails loudly instead of silently.
  3. It walks each row, skipping any that are already cleaned or have no source value — so a re-run only processes new rows and never burns API calls twice.
  4. For each remaining row it builds a prompt asking for the canonical form and nothing else, which keeps the reply a single clean value.
  5. It calls Claude Haiku — fast and cheap, which matters one cell at a time — and writes the cleaned value into the _clean column.
  6. The original column is never touched, so you always have the raw value to compare against if a normalisation looks wrong.

Example run

Call cleanColumn('company') against a sheet like this:

companycompany_clean
larkfield joinery ltd
LARKFIELD JOINERY
Coastline Tours (Devon)
coastline tours

After a run, the _clean column collapses the variants:

companycompany_clean
larkfield joinery ltdLarkfield Joinery Ltd
LARKFIELD JOINERYLarkfield Joinery Ltd
Coastline Tours (Devon)Coastline Tours
coastline toursCoastline Tours

Group by company_clean and the four rows become two — which is what a report needs.

Run it

This is an on-demand job — run it after each data import:

  1. In the Apps Script editor, create a small wrapper that calls cleanColumn('company') (or whichever column) and click Run.
  2. Approve the authorisation prompt the first time.
  3. Spot-check the _clean column against the originals.

To let the team clean a column without the editor, add a custom menu:

function onOpen() {
  SpreadsheetApp.getUi()
    .createMenu('Data tools')
    .addItem('Clean company column', 'cleanCompany')
    .addToUi();
}

function cleanCompany() {
  cleanColumn('company');
}

Watch out for

  • Claude decides the canonical form. Two slightly different inputs should map to the same output — but spot-check, because it has no fixed master list unless you give it one in the prompt.
  • For a closed set of valid values — a fixed county list, say — paste that list into the prompt so Claude maps onto it rather than inventing a form.
  • Cleaned cells are skipped on re-runs. If you correct a source value and want it re-cleaned, clear its _clean cell first.
  • One API call per row adds up. A large import means a long run and a real cost — clean in batches and watch the six-minute execution limit.

Related