appscript.dev

Compare answers across multiple AI models

A/B test Claude Haiku vs Sonnet on Northwind prompts — see which suits each job.

Published Dec 25, 2025

Northwind reaches for Claude in a dozen automations, but picking the right model for each one is mostly guesswork. Haiku is fast and cheap; Sonnet reasons better but costs more. Without seeing the two side by side on a real prompt, that trade-off is just a hunch — and hunches quietly overspend on jobs a small model would have handled fine.

This script runs the same prompt through every model in a list and logs each answer next to how long it took. After a few prompts you have a comparison sheet that turns the model choice into evidence: where Haiku is good enough, and where the extra reasoning is worth paying for.

What you’ll need

  • A Google Sheet to collect the results. The first tab needs a header row: Run, Model, Prompt, Answer, Latency (ms). The Run column holds the timestamp of each logged row.
  • An Anthropic API key saved as ANTHROPIC_API_KEY in Script Properties — see Store API keys and secrets securely.
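If the results tab is still empty, the header row can be written once from the editor instead of typed by hand. This is a minimal sketch, assuming the first tab is the results tab; `HEADERS` and `ensureHeaderRow` are hypothetical names that do not appear in the script below.

```javascript
// Column titles from the list above, in the order compareModels writes them.
const HEADERS = ['Run', 'Model', 'Prompt', 'Answer', 'Latency (ms)'];

/**
 * Hypothetical helper: writes the header row if the first tab is empty.
 *
 * @param {string} spreadsheetId ID of the comparison spreadsheet.
 * @return {boolean} True if a header row was written, false otherwise.
 */
function ensureHeaderRow(spreadsheetId) {
  const sheet = SpreadsheetApp.openById(spreadsheetId).getSheets()[0];
  // getLastRow() is 0 when the sheet has no content at all.
  if (sheet.getLastRow() > 0) {
    return false;
  }
  sheet.appendRow(HEADERS);
  return true;
}
```

Running it a second time leaves the sheet alone, because the tab is no longer empty.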

The script

// The spreadsheet that collects the comparison results.
const COMPARE_SHEET_ID = '1abcCompareId';

// The models to test, in the order they should run.
const MODELS = ['claude-haiku-4-5-20251001', 'claude-sonnet-4-6'];

// Token budget for each answer — keeps replies comparable in length.
const MAX_TOKENS = 500;

// How many characters of the prompt to store in the log column.
const PROMPT_PREVIEW_LENGTH = 100;

/**
 * Runs the same prompt through every model in MODELS and logs each
 * answer, with its latency, to the comparison sheet.
 *
 * @param {string} prompt The prompt to test across all models.
 */
function compareModels(prompt) {
  if (!prompt) {
    Logger.log('No prompt supplied — nothing to compare.');
    return;
  }

  const sheet = SpreadsheetApp.openById(COMPARE_SHEET_ID).getSheets()[0];

  // Run the prompt through each model and time the round trip.
  for (const model of MODELS) {
    const start = Date.now();
    const answer = callClaude(prompt, model);
    const latency = Date.now() - start;

    // Log one row per model: timestamp, model, prompt preview, answer, latency.
    sheet.appendRow([
      new Date(),
      model,
      prompt.slice(0, PROMPT_PREVIEW_LENGTH),
      answer,
      latency,
    ]);
  }
  Logger.log('Compared ' + MODELS.length + ' models.');
}

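/**
 * Minimal Anthropic API call. The key lives in Script Properties; it
 * is never pasted into the code.
 *
 * @param {string} prompt The prompt to send.
 * @param {string} model The model ID to call.
 * @return {string} The text of the model's reply.
 */
function callClaude(prompt, model) {
  const key = PropertiesService.getScriptProperties()
    .getProperty('ANTHROPIC_API_KEY');
  if (!key) {
    throw new Error('ANTHROPIC_API_KEY is not set in Script Properties.');
  }
  const res = UrlFetchApp.fetch('https://api.anthropic.com/v1/messages', {
    method: 'post',
    contentType: 'application/json',
    headers: { 'x-api-key': key, 'anthropic-version': '2023-06-01' },
    payload: JSON.stringify({
      model,
      max_tokens: MAX_TOKENS,
      messages: [{ role: 'user', content: prompt }],
    }),
    muteHttpExceptions: true,
  });
  // With muteHttpExceptions on, a failed call returns normally, so check
  // the status before reading the body as an answer.
  if (res.getResponseCode() !== 200) {
    throw new Error(model + ' returned HTTP ' + res.getResponseCode() +
      ': ' + res.getContentText());
  }
  return JSON.parse(res.getContentText()).content[0].text.trim();
}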

How it works

  1. compareModels takes a prompt and bails out early if it is empty, so an empty call never hits the API.
  2. It loops over MODELS — Haiku then Sonnet — and for each one records the start time before the call.
  3. It calls Claude with the same prompt and the same MAX_TOKENS cap, so neither model gets more room to answer than the other. The cap limits length rather than equalising it, and an answer that reaches the cap is cut off.
  4. After each call it measures the round-trip latency by subtracting the start time from the current time.
  5. It appends one row per model: a timestamp, the model name, a short preview of the prompt, the full answer, and the latency in milliseconds.
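Steps 2 and 4 amount to wrapping the call in two clock reads. Here is a sketch of that pattern on its own; `timeCall` is a hypothetical helper, not part of the script above. It measures the whole round trip (network, API processing, and Apps Script overhead), not model time alone.

```javascript
/**
 * Hypothetical helper: runs fn once and reports how long it took.
 *
 * @param {function(): *} fn The call to time.
 * @return {{result: *, latencyMs: number}} The return value and elapsed time.
 */
function timeCall(fn) {
  const start = Date.now();
  const result = fn();
  return { result: result, latencyMs: Date.now() - start };
}
```

Inside the loop this would read `const timed = timeCall(() => callClaude(prompt, model));`, with `timed.result` and `timed.latencyMs` going into the row.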

Example run

Call compareModels('Summarise this client brief in two sentences: ...') and the comparison sheet gains two rows:

Run | Model | Prompt | Answer | Latency (ms)
2025-12-25 14:02 | claude-haiku-4-5-20251001 | Summarise this client brief… | The client wants a refreshed brand and a one-page site… | 1180
2025-12-25 14:02 | claude-sonnet-4-6 | Summarise this client brief… | The client is seeking a brand refresh paired with a single-page site… | 2640

Run a few prompts from different jobs, each more than once, and a pattern may emerge. In this example, the Haiku answer to a plain summarising prompt is close enough at less than half the latency, so that automation can safely use Haiku; for anything needing nuance, Sonnet is more likely to earn its cost.
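Once the sheet holds more than a couple of rows, the logged latencies can be grouped per model rather than compared by eye. This is a sketch, assuming rows laid out in the sheet's column order (Run, Model, Prompt, Answer, Latency (ms)); `medianLatencyByModel` and `summariseSheet` are hypothetical names. It uses the median because one unusually slow call skews an average.

```javascript
/**
 * Hypothetical helper: median latency per model from logged rows.
 *
 * @param {Array<Array>} rows Data rows: [run, model, prompt, answer, latencyMs].
 * @return {Object<string, number>} Model name mapped to median latency in ms.
 */
function medianLatencyByModel(rows) {
  const byModel = {};
  for (const row of rows) {
    const model = row[1];
    const latency = Number(row[4]);
    // Skip blank or malformed rows.
    if (!model || row[4] === '' || isNaN(latency)) continue;
    (byModel[model] = byModel[model] || []).push(latency);
  }
  const medians = {};
  for (const model of Object.keys(byModel)) {
    const sorted = byModel[model].slice().sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    medians[model] = sorted.length % 2
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  }
  return medians;
}

/**
 * Reads every row below the header from the first tab and logs the medians.
 *
 * @param {string} spreadsheetId ID of the comparison spreadsheet.
 */
function summariseSheet(spreadsheetId) {
  const sheet = SpreadsheetApp.openById(spreadsheetId).getSheets()[0];
  const rows = sheet.getDataRange().getValues().slice(1); // Drop the header row.
  Logger.log(JSON.stringify(medianLatencyByModel(rows)));
}
```

The summary covers latency only; the answers themselves still need reading.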

Run it

This is an evaluation job you run by hand while deciding which model an automation should use:

  1. In the Apps Script editor, write a one-line wrapper that calls compareModels with the prompt you want to test, then select it and click Run.
  2. Approve the authorisation prompt the first time.
  3. Open the comparison sheet and read the two rows side by side.
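The wrapper in step 1 is needed because the editor's Run button calls the selected function with no arguments. A minimal sketch follows; `runComparison` is a hypothetical name, and the `...` in the prompt stands for the brief text, as in the example above.

```javascript
/**
 * Hypothetical wrapper: fixes the prompt in code so the function can be
 * selected and run from the Apps Script editor.
 */
function runComparison() {
  // Replace the trailing "..." with the real text to test.
  const prompt = 'Summarise this client brief in two sentences: ...';
  compareModels(prompt);
}
```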

Watch out for

  • Latency is not fixed. Network conditions and API load both vary, so time the same prompt a few times rather than drawing conclusions from a single run.
  • The log stores only the first 100 characters of the prompt (PROMPT_PREVIEW_LENGTH). That keeps the sheet readable, but means you cannot rerun a prompt straight from the log — keep the full prompts elsewhere.
  • The models run one after another, not in parallel. Comparing many prompts is slow; for a long batch, watch the six-minute Apps Script execution limit.
  • Quality is a judgement call. The sheet shows you the answers and the latency, but deciding which answer is actually better still needs a human read.
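For a longer batch, one way to stay inside the execution limit is to check the elapsed time before starting each prompt and stop with a margin to spare. This is a sketch under assumptions: `runBatch`, `TIME_BUDGET_MS`, and the injected `compare` callback are hypothetical, and the five-minute budget is an arbitrary margin below the six-minute limit. Holding the full prompts in an array also keeps them available for reruns, which the truncated log column does not.

```javascript
// Assumed safety margin: stop starting new prompts after five minutes.
const TIME_BUDGET_MS = 5 * 60 * 1000;

/**
 * Hypothetical batch runner: compares prompts in order until the time
 * budget is used up, then returns the prompts it did not get to.
 *
 * @param {string[]} prompts Full prompts to compare.
 * @param {function(string): void} compare Called once per prompt.
 * @param {number=} budgetMs Time budget in milliseconds.
 * @return {string[]} Prompts left unprocessed, for a later run.
 */
function runBatch(prompts, compare, budgetMs) {
  const budget = budgetMs === undefined ? TIME_BUDGET_MS : budgetMs;
  const start = Date.now();
  for (let i = 0; i < prompts.length; i++) {
    // Check the clock before each prompt, not during it.
    if (Date.now() - start >= budget) {
      return prompts.slice(i);
    }
    compare(prompts[i]);
  }
  return [];
}
```

A call such as `const remaining = runBatch(prompts, compareModels);` returns whatever was skipped. The check runs only between prompts, so the margin needs to be longer than the slowest single comparison.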
