Build a self-cleaning data import pipeline
Pull, dedupe, and normalise external data on every refresh — no manual cleanup required.
Published Sep 20, 2025
Northwind imports vendor invoices from a partner API, and the raw feed is never tidy. Vendor names arrive with stray double spaces, invoice numbers come in mixed case, amounts are wrapped in currency symbols, and the same invoice often appears twice because the partner re-sends a record after an edit. Left alone, the mess flows straight into every report built on top of it.
This script turns that messy feed into a clean tab nobody has to touch. On
every run it fetches the API, normalises each row into a consistent shape,
drops duplicates by vendor and invoice number, and rewrites a Clean invoices
sheet from scratch. The result is a dependable dataset that finance can trust
without a manual cleanup pass.
What you’ll need
- A partner API endpoint that returns invoice records as JSON. Northwind’s
returns an array of objects with
vendor,invoice_number,amount, andissued_atfields. - A Google Sheet to hold the cleaned output. The script writes to the first tab and manages the header itself.
- If the API needs an API key, store it in Script Properties rather than in the code — see Store API keys and secrets securely.
The script
// The partner API that returns raw invoice records as JSON.
const API_URL = 'https://api.partner.example/invoices';
// The spreadsheet that receives the cleaned, deduped output.
const TARGET_SHEET_ID = '1abcCleanInvoicesSheetId';
// The columns written to the clean sheet, in order.
const HEADERS = ['vendor', 'invoiceNumber', 'amount', 'issuedAt'];
/**
* Fetches the partner invoice feed, normalises and dedupes it, then
* rewrites the clean sheet from scratch.
*/
function refreshInvoices() {
// 1. Pull the raw feed and parse it into an array of objects.
const response = UrlFetchApp.fetch(API_URL, { muteHttpExceptions: true });
if (response.getResponseCode() !== 200) {
Logger.log('API returned ' + response.getResponseCode() + ' — keeping the existing sheet.');
return;
}
const raw = JSON.parse(response.getContentText());
// 2. Normalise every row and drop any that fail validation.
const cleaned = raw.map(normalise).filter(Boolean);
// 3. Drop duplicates, keyed on vendor + invoice number.
const deduped = dedupe(cleaned, (r) => `${r.vendor}|${r.invoiceNumber}`);
// 4. Rewrite the clean sheet: header first, then the rows.
const sheet = SpreadsheetApp.openById(TARGET_SHEET_ID).getSheets()[0];
sheet.clearContents();
sheet.getRange(1, 1, 1, HEADERS.length).setValues([HEADERS]);
if (deduped.length === 0) {
Logger.log('No valid rows after cleaning — wrote header only.');
return;
}
sheet.getRange(2, 1, deduped.length, HEADERS.length).setValues(
deduped.map((r) => [r.vendor, r.invoiceNumber, r.amount, r.issuedAt])
);
Logger.log('Wrote ' + deduped.length + ' clean invoices.');
}
/**
* Turns one raw API row into a consistent shape, or returns null if the
* row is missing the fields it cannot do without.
*/
function normalise(row) {
// An invoice with no number or no amount is not usable — skip it.
if (!row.invoice_number || !row.amount) return null;
return {
// Collapse runs of whitespace so "Acme Ltd" and "Acme Ltd" match.
vendor: String(row.vendor || '').trim().replace(/\s+/g, ' '),
// Upper-case invoice numbers so "inv-01" and "INV-01" are one row.
invoiceNumber: String(row.invoice_number).toUpperCase().trim(),
// Strip currency symbols and separators, then parse to a number.
amount: parseFloat(String(row.amount).replace(/[^\d.-]/g, '')),
issuedAt: new Date(row.issued_at),
};
}
/**
* Returns the input rows with duplicates removed, where two rows are
* duplicates if keyFn produces the same key.
*/
function dedupe(rows, keyFn) {
const seen = new Set();
return rows.filter((r) => {
const k = keyFn(r);
if (seen.has(k)) return false;
seen.add(k);
return true;
});
}
How it works
refreshInvoicesfetches the partner API. If the request returns anything other than HTTP 200, it logs the code and stops, leaving the existing sheet untouched rather than wiping it.- It parses the response and runs every row through
normalise, which returns a clean object ornull. Thefilter(Boolean)drops the nulls. dedupewalks the cleaned rows, building a key ofvendor|invoiceNumberfor each. The first time a key is seen it is kept; any later row with the same key is discarded.- The clean sheet is cleared, the header written, and the deduped rows written
in one
setValuescall. normalisedoes the heavy lifting: it collapses whitespace in vendor names, upper-cases invoice numbers so case differences do not split a row in two, and strips currency symbols from amounts before parsing them as numbers.
Example run
The API returns three records — two of them the same invoice, re-sent after an edit:
| vendor | invoice_number | amount | issued_at |
|---|---|---|---|
Acme Ltd | inv-01 | "$1,200.00" | 2025-09-01 |
Acme Ltd | INV-01 | 1200 | 2025-09-02 |
Globex | gx-88 | "€450" | 2025-09-03 |
After a run, the Clean invoices sheet holds two normalised, deduped rows:
| vendor | invoiceNumber | amount | issuedAt |
|---|---|---|---|
| Acme Ltd | INV-01 | 1200 | 01/09/2025 |
| Globex | GX-88 | 450 | 03/09/2025 |
The duplicate INV-01 is gone, and the survivor is fully cleaned.
Trigger it
Run the import on a schedule so the clean sheet is always current:
- In the Apps Script editor, open Triggers (the clock icon).
- Add a trigger for
refreshInvoices, time-driven, on an hourly timer. - Save. The clean sheet now refreshes every hour with no manual work.
Watch out for
clearContentswipes the sheet before the new rows land. The HTTP 200 guard means a failed API call leaves the old data in place, but a successful call that returns an empty array will still clear you down to just the header.- The dedupe keeps the first occurrence of a key, not the newest. If the
partner re-sends a corrected invoice, sort the cleaned rows by
issuedAtdescending before callingdedupeso the latest version wins. parseFloatreturnsNaNfor an amount it cannot read. If a row’s amount is malformed, it will pass validation (it is truthy) but writeNaN— consider adding aNumber.isFinitecheck insidenormalise.new Date(row.issued_at)depends on the API’s date format being one the engine understands. ISO 8601 strings are safe; anything ambiguous is not.UrlFetchAppis capped by daily quota and a per-call timeout. A very large feed may need pagination rather than a single fetch.- To be told when the feed itself breaks a rule, pair this with Alert the owner when imported data breaks a rule.
Related
Build an OKR tracker with progress rollups
Aggregate Northwind's key-result progress into objective-level scores automatically.
Updated Jan 14, 2026
Build a recurring-task generator
Spawn new Northwind task rows on a daily or weekly cadence from a Recurring sheet.
Updated Jan 10, 2026
Build a multi-sheet search-and-jump tool
Find a value across every tab of a workbook and click through to the cell that contains it.
Updated Jan 7, 2026
Auto-rebuild grouped summaries on edit
Refresh pivot-style rollups the instant data changes — no manual recompute.
Updated Jan 4, 2026
Build a data-quality scorecard
Grade any sheet on completeness, validity, and freshness — surface gaps as a single score.
Updated Dec 31, 2025