appscript.dev
Automation Advanced Sheets

Scrape a web table into cells with one formula

Pull HTML tables into Sheets as a custom function — no IMPORTHTML quirks.

Published Jul 30, 2025

Northwind’s pricing team tracks a handful of competitor and supplier pages that publish their numbers in a plain HTML table. The built-in IMPORTHTML formula can read those tables, but it is fragile: it counts tables by position, breaks the moment a site adds a stray layout table, and gives you no control over how the cells are cleaned.

A custom function fixes that. FETCH_TABLE fetches the page, finds the table you ask for, strips the HTML out of every cell, and spills the result into the grid — the same shape IMPORTHTML returns, but as code you own and can adjust when a site changes.

What you’ll need

  • A Google Sheet where you want the scraped table to land.
  • The URL of a public page that contains the table — no login, no JavaScript rendering required (see “Watch out for”).
  • The index of the table on that page, counting from 0. Most pages have the table you want at index 0; if not, try 1, 2, and so on.

The script

// How long to keep a fetched page in cache, in seconds. Six hours keeps
// re-calculations off the network without the data going badly stale.
const FETCH_TABLE_CACHE_SECONDS = 6 * 60 * 60;

/**
 * Fetches an HTML table from a web page and spills it into the grid.
 * Returns the same row/column shape as IMPORTHTML, so it can be dropped
 * straight into a cell.
 *
 * @param {string} url The page to fetch.
 * @param {number} index Which table on the page to read, counting from 0.
 * @return {string[][]} The table cells, or an error marker.
 * @customfunction
 */
function FETCH_TABLE(url, index = 0) {
  // 1. Bail out early if no URL was passed.
  if (!url) return '';

  // 2. Reuse a recently fetched page if we have one. Custom functions are
  //    called once per cell, so caching avoids re-fetching on every edit.
  const cache = CacheService.getScriptCache();
  const cacheKey = 'table:' + url;
  let html = cache.get(cacheKey);

  // 3. Fetch the page only on a cache miss.
  if (!html) {
    html = UrlFetchApp
      .fetch(url, { muteHttpExceptions: true })
      .getContentText();
    cache.put(cacheKey, html, FETCH_TABLE_CACHE_SECONDS);
  }

  // 4. Pull out every <table> block on the page.
  const tables = html.match(/<table[\s\S]*?<\/table>/g) || [];
  if (!tables[index]) return '#NO_TABLE';

  // 5. Split the chosen table into rows, and each row into cleaned cells.
  const rows = (tables[index].match(/<tr[\s\S]*?<\/tr>/g) || []).map((tr) =>
    (tr.match(/<t[dh][^>]*>([\s\S]*?)<\/t[dh]>/g) || []).map((cell) =>
      cleanCell(cell)
    )
  );

  // 6. Drop fully empty rows (spacer rows some sites add for layout).
  return rows.filter((row) => row.some((cell) => cell !== ''));
}

/**
 * Strips the HTML out of a single table cell and tidies the whitespace.
 * Keeps the text content, throws away tags, and decodes the few entities
 * that show up most often in tables.
 *
 * @param {string} cellHtml The raw <td> or <th> HTML.
 * @return {string} The plain-text cell value.
 */
function cleanCell(cellHtml) {
  return cellHtml
    .replace(/<[^>]+>/g, '')        // remove every tag
    .replace(/&nbsp;/g, ' ')        // decode the common entities
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/\s+/g, ' ')           // collapse runs of whitespace
    .trim();
}

How it works

  1. FETCH_TABLE returns an empty string straight away if the cell passes no URL, so a half-typed formula never throws.
  2. It checks CacheService for a copy of the page. A custom function is re-evaluated on every recalculation, so without a cache a busy sheet would hammer the target site.
  3. On a cache miss it fetches the page with UrlFetchApp and stores the HTML for six hours (FETCH_TABLE_CACHE_SECONDS).
  4. A regular expression pulls out every <table>...</table> block. If the requested index does not exist, the function returns #NO_TABLE so you can see the problem in the cell.
  5. The chosen table is split into <tr> rows, and each row into <td>/<th> cells. cleanCell strips the tags, decodes common entities, and collapses whitespace.
  6. Fully empty rows — spacer rows some sites add for layout — are dropped, and the remaining grid is returned. Sheets spills it across the cells below.

Example run

Say https://example.com/prices contains this table:

<table>
  <tr><th>SKU</th><th>Product</th><th>Price</th></tr>
  <tr><td>NW-1001</td><td>Oak desk</td><td>&pound;420</td></tr>
  <tr><td>NW-1002</td><td>Steel shelf</td><td>&pound;180</td></tr>
</table>

Typing =FETCH_TABLE("https://example.com/prices", 0) into A1 spills:

SKUProductPrice
NW-1001Oak desk£420
NW-1002Steel shelf£180

Use it

Type the formula into the top-left cell where you want the table to appear:

=FETCH_TABLE("https://example.com/prices", 0)

The result spills down and to the right automatically, so leave those cells empty. Custom functions cache aggressively and Sheets will not re-fetch unless the formula text changes. To force an hourly refresh, fold the current hour into the URL so the formula changes once an hour:

=FETCH_TABLE("https://example.com/prices?v=" & TEXT(NOW(), "HH"), 0)

Watch out for

  • This reads the raw HTML the server sends. Pages that build their table with JavaScript in the browser will look empty to UrlFetchApp and return #NO_TABLE.
  • Regex parsing of HTML is good enough for clean, simple tables. Nested tables, cells that span rows or columns (rowspan/colspan), and unusual markup will not line up correctly — inspect the result before trusting it.
  • The six-hour cache means a freshly published change can take up to six hours to appear unless you use the hourly-refresh trick above.
  • Scraping a site without permission may breach its terms of service, and a layout change there can break this function without warning. Where a site offers a proper data feed, use it instead — see Sync invoices to QuickBooks or Xero for an example of working against a real API.

Related