Compare answers across multiple AI models
A/B test Claude Haiku vs Sonnet on Northwind prompts — see which suits each job.
Published Dec 25, 2025
Northwind reaches for Claude in a dozen automations, but picking the right model for each one is mostly guesswork. Haiku is fast and cheap; Sonnet reasons better but costs more. Without seeing the two side by side on a real prompt, that trade-off is just a hunch — and hunches quietly overspend on jobs a small model would have handled fine.
This script runs the same prompt through every model in a list and logs each answer next to how long it took. After a few prompts you have a comparison sheet that turns the model choice into evidence: where Haiku is good enough, and where the extra reasoning is worth paying for.
What you’ll need
- A Google Sheet to collect the results. The first tab needs a header row: Run, Model, Prompt, Answer, Latency (ms).
- An Anthropic API key saved as ANTHROPIC_API_KEY in Script Properties (see Store API keys and secrets securely).
The script
// The spreadsheet that collects the comparison results.
const COMPARE_SHEET_ID = '1abcCompareId';
// The models to test, in the order they should run.
const MODELS = ['claude-haiku-4-5-20251001', 'claude-sonnet-4-6'];
// Token budget for each answer — keeps replies comparable in length.
const MAX_TOKENS = 500;
// How many characters of the prompt to store in the log column.
const PROMPT_PREVIEW_LENGTH = 100;
/**
 * Runs the same prompt through every model in MODELS and logs each
 * answer, with its latency, to the comparison sheet.
 *
 * @param {string} prompt The prompt to test across all models.
 */
function compareModels(prompt) {
  if (!prompt) {
    Logger.log('No prompt supplied; nothing to compare.');
    return;
  }
  const sheet = SpreadsheetApp.openById(COMPARE_SHEET_ID).getSheets()[0];
  // Run the prompt through each model and time the round trip.
  for (const model of MODELS) {
    const start = Date.now();
    const answer = callClaude(prompt, model);
    const latency = Date.now() - start;
    // Log one row per model: timestamp, model, prompt preview, answer, latency.
    sheet.appendRow([
      new Date(),
      model,
      prompt.slice(0, PROMPT_PREVIEW_LENGTH),
      answer,
      latency,
    ]);
  }
  Logger.log('Compared ' + MODELS.length + ' models.');
}
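/**
 * Minimal Anthropic API call. The key lives in Script Properties, so it
 * is never pasted into the code.
 */
function callClaude(prompt, model) {
  const key = PropertiesService.getScriptProperties()
    .getProperty('ANTHROPIC_API_KEY');
  const res = UrlFetchApp.fetch('https://api.anthropic.com/v1/messages', {
    method: 'post',
    contentType: 'application/json',
    headers: { 'x-api-key': key, 'anthropic-version': '2023-06-01' },
    payload: JSON.stringify({
      model,
      max_tokens: MAX_TOKENS,
      messages: [{ role: 'user', content: prompt }],
    }),
    muteHttpExceptions: true,
  });
  // muteHttpExceptions stops fetch from throwing on 4xx/5xx responses, so
  // check the status here and fail with a readable message rather than a
  // TypeError from a missing content array.
  const status = res.getResponseCode();
  const body = res.getContentText();
  if (status !== 200) {
    throw new Error('Anthropic API returned ' + status + ' for ' + model + ': ' + body);
  }
  return JSON.parse(body).content[0].text.trim();
}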
How it works
- compareModels takes a prompt and bails out early if it is empty, so an empty call never hits the API.
- It loops over MODELS (Haiku, then Sonnet) and for each one records the start time before the call.
- It calls Claude with the same prompt and the same MAX_TOKENS budget, so the answers are fair to compare on both length and quality.
- After each call it measures the round-trip latency by subtracting the start time from the current time.
- It appends one row per model: a timestamp, the model name, a short preview of the prompt, the full answer, and the latency in milliseconds.
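The steps above can be sketched as a plain function with its dependencies passed in, which makes the timing logic easy to try outside Apps Script. This is an illustration only, not the script itself: buildComparisonRows, callModel and now are hypothetical names, and the stand-in model call simply echoes the model name instead of contacting the API.

```javascript
// Sketch of the compare loop with the model call and the clock injected.
// callModel and now are hypothetical parameters, not part of the script above.
function buildComparisonRows(prompt, models, callModel, now = Date.now) {
  if (!prompt) {
    return []; // Empty prompt: no calls are made and no rows are produced.
  }
  const rows = [];
  for (const model of models) {
    const start = now();                     // Start time, taken before the call.
    const answer = callModel(prompt, model); // Same prompt for every model.
    const latency = now() - start;           // Round trip in milliseconds.
    // Same column order as the sheet: run, model, prompt preview, answer, latency.
    rows.push([new Date(start), model, prompt.slice(0, 100), answer, latency]);
  }
  return rows;
}

// Usage with a stand-in for callClaude that never touches the network.
const demoRows = buildComparisonRows(
  'Summarise this client brief in two sentences: ...',
  ['claude-haiku-4-5-20251001', 'claude-sonnet-4-6'],
  (prompt, model) => 'stub answer from ' + model
);
console.log(demoRows.length); // 2
```

Passing the clock in as a parameter is a design choice for the sketch: it lets you check the latency arithmetic with fixed timestamps instead of real delays.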
Example run
Call compareModels('Summarise this client brief in two sentences: ...') and
the comparison sheet gains two rows:
| Run | Model | Prompt | Answer | Latency (ms) |
|---|---|---|---|---|
| 2025-12-25 14:02 | claude-haiku-4-5-20251001 | Summarise this client brief… | The client wants a refreshed brand and a one-page site… | 1180 |
| 2025-12-25 14:02 | claude-sonnet-4-6 | Summarise this client brief… | The client is seeking a brand refresh paired with a single-page site… | 2640 |
Run a few prompts from different jobs, each more than once, and a pattern emerges: for plain summarising the Haiku answer is close enough at roughly half the latency, so that automation can safely use Haiku; for anything needing nuance, Sonnet earns its cost.
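Once the sheet holds several runs, a small helper can condense them into one latency figure per model. The sketch below is an assumption rather than part of the original script: medianLatencyByModel is a hypothetical name, it expects rows in the sheet's column order, and the sample latencies are made up for illustration.

```javascript
// Hypothetical helper: median latency per model from logged rows.
// Each row is assumed to follow the sheet layout:
// [run, model, prompt, answer, latencyMs].
function medianLatencyByModel(rows) {
  const byModel = {};
  for (const row of rows) {
    const model = row[1];
    if (!byModel[model]) byModel[model] = [];
    byModel[model].push(Number(row[4]));
  }
  const medians = {};
  for (const model of Object.keys(byModel)) {
    const sorted = byModel[model].slice().sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    // Odd count: take the middle value. Even count: average the two middle values.
    medians[model] = sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  }
  return medians;
}

// Made-up latencies for illustration only, not measured results.
const sampleRows = [
  ['', 'model-a', '', '', 1200],
  ['', 'model-a', '', '', 1000],
  ['', 'model-a', '', '', 1700],
  ['', 'model-b', '', '', 2400],
  ['', 'model-b', '', '', 2800],
];
console.log(medianLatencyByModel(sampleRows)); // { 'model-a': 1200, 'model-b': 2600 }
```

In Apps Script the rows could come from sheet.getDataRange().getValues().slice(1), which drops the header row. The median is used here instead of the mean so that one unusually slow call does not skew the figure.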
Run it
This is an evaluation job you run by hand while deciding which model an automation should use:
- In the Apps Script editor, write a one-line wrapper that calls compareModels with the prompt you want to test, then select it and click Run.
- Approve the authorisation prompt the first time.
- Open the comparison sheet and read the two rows side by side.
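A wrapper for the first step might look like the sketch below. The name runComparison is an assumption, and the function relies on compareModels from the script above being in the same Apps Script project; the prompt is the one from the example run, with its elided brief left as written.

```javascript
// Hypothetical wrapper; the name runComparison is an assumption.
// Select it in the editor's function list and click Run.
function runComparison() {
  // Replace this with the full prompt you want to test.
  compareModels('Summarise this client brief in two sentences: ...');
}
```

The Run button calls the selected function without arguments, which is why the prompt is written inside the wrapper instead of being passed in.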
Watch out for
- Latency is not fixed. Network conditions and API load both vary, so time the same prompt a few times rather than drawing conclusions from a single run.
- The log stores only the first 100 characters of the prompt (PROMPT_PREVIEW_LENGTH). That keeps the sheet readable, but means you cannot rerun a prompt straight from the log, so keep the full prompts elsewhere.
- The models run one after another, not in parallel. Comparing many prompts is slow; for a long batch, watch the six-minute Apps Script execution limit.
- Quality is a judgement call. The sheet shows you the answers and the latency, but deciding which answer is actually better still needs a human read.
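For a long batch, one way to stay under the execution limit is to stop starting new comparisons once a time budget is spent and carry the rest over to the next run. The following is a minimal sketch under assumptions: runBatchWithinBudget, compareFn, budgetMs and now are hypothetical names, with compareFn standing in for compareModels.

```javascript
// Hypothetical batch runner: stops starting new prompts once the time budget
// is spent and returns the prompts it did not get to.
function runBatchWithinBudget(prompts, compareFn, budgetMs, now = Date.now) {
  const startedAt = now();
  const skipped = [];
  for (const prompt of prompts) {
    if (now() - startedAt >= budgetMs) {
      skipped.push(prompt); // Out of time: leave this prompt for the next run.
      continue;
    }
    compareFn(prompt);
  }
  return skipped;
}

// Possible use inside Apps Script, with a five-minute budget:
// const leftOver = runBatchWithinBudget(fullPrompts, compareModels, 5 * 60 * 1000);
```

The budget is checked only between prompts, so a single slow comparison can still overrun it; setting the budget a minute or so below the limit leaves room for that. Keeping the full prompts in an array such as the hypothetical fullPrompts also covers the point that the log stores only a preview.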