appscript.dev

Build a Doc-to-clean-HTML exporter

Convert formatted Docs into publish-ready HTML — strip the Google CSS, keep the structure.

Published Jul 20, 2025

Northwind writes blog drafts in Google Docs because that is where the editing and comments happen. The trouble starts at publish time: Docs’ built-in File → Download → Web Page export buries the content under a long generated <style> block and class attributes that fight the CMS theme. Paste it in and the post looks broken.

This script exports a Doc to HTML and then scrubs it down to the structural tags a CMS actually wants — headings, paragraphs, lists, and links — with all the Google-generated styling removed. The writer gets a clean .html file they can paste straight into the editor.

What you’ll need

  • A Google Doc to export — Northwind’s blog drafts live in a shared Drive folder.
  • The Doc’s file ID, taken from its URL (/document/d/THIS_PART/edit).
  • A Drive folder to hold the exported .html file. This example drops it in the root of My Drive.
  • No API key. The export request is authorised with the script’s own OAuth token (ScriptApp.getOAuthToken()), which you approve the first time the script runs.
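
If writers tend to hand over a full Doc URL rather than a bare ID, a small helper can pull the ID out. This is a minimal sketch: extractDocId is a hypothetical name, not part of the script below, and it assumes the usual /document/d/THIS_PART/edit URL shape (optionally with a /u/0/ account segment).

```javascript
/**
 * Returns the file ID from a Google Docs URL, or the input unchanged
 * (trimmed) if it does not look like a Docs URL.
 * Hypothetical helper for illustration only.
 * @param {string} urlOrId A Docs URL or a bare file ID.
 * @return {string} The file ID.
 */
function extractDocId(urlOrId) {
  // Matches /document/d/<id> and /document/u/<n>/d/<id>.
  const match = /\/document\/(?:u\/\d+\/)?d\/([\w-]+)/.exec(urlOrId);
  return match ? match[1] : urlOrId.trim();
}
```

You could then set BLOG_DRAFT_ID from extractDocId('…pasted URL…') instead of copying the ID by hand.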

The script

// The Doc to export when running exportCurrentDraft directly.
const BLOG_DRAFT_ID = '1abcBlogDraftId';

// Filename written to Drive for the cleaned export.
const OUTPUT_FILENAME = 'draft.html';

/**
 * Exports a Google Doc to HTML and strips out the Google-generated
 * styling, returning publish-ready markup.
 * @param {string} docId The file ID of the Doc to export.
 * @return {string} Cleaned HTML — structural tags only.
 */
function docToCleanHtml(docId) {
  // 1. Ask the Docs export endpoint for the HTML version of the file.
  const url = `https://docs.google.com/feeds/download/documents/export/Export?id=${docId}&exportFormat=html`;
  const raw = UrlFetchApp.fetch(url, {
    headers: { Authorization: `Bearer ${ScriptApp.getOAuthToken()}` },
  }).getContentText();

  // 2. Strip the export down to clean, structural HTML.
  return raw
    // Remove the inline stylesheet Docs injects.
    .replace(/<style[\s\S]*?<\/style>/g, '')
    // Remove meta tags and the whole head.
    .replace(/<meta[^>]*>/g, '')
    .replace(/<head>[\s\S]*?<\/head>/, '')
    // Drop class and id attributes that hook into the dropped CSS.
    .replace(/ class="[^"]*"/g, '')
    .replace(/ id="[^"]*"/g, '')
    // Unwrap span tags — they only ever carried styling.
    .replace(/<span[^>]*>([\s\S]*?)<\/span>/g, '$1')
    // Collapse anchor tags down to a plain href.
    .replace(/<a[^>]*href="([^"]+)"[^>]*>/g, '<a href="$1">')
    // Delete empty paragraphs (Docs leaves a lot of these).
    .replace(/<p[^>]*>(\s|<br>)*<\/p>/g, '');
}

/**
 * Exports the configured blog draft and saves the cleaned HTML to Drive.
 */
function exportCurrentDraft() {
  const html = docToCleanHtml(BLOG_DRAFT_ID);
  const file = DriveApp.createFile(OUTPUT_FILENAME, html, 'text/html');
  Logger.log(`Saved cleaned HTML to ${file.getUrl()}`);
}

How it works

  1. docToCleanHtml builds the URL for the Docs export endpoint, asking for the html export format.
  2. It fetches that URL with UrlFetchApp, passing the script’s own OAuth token as a bearer header — that is what authorises the download without an API key.
  3. It then runs the raw export through a chain of regular-expression replacements. First it removes the injected <style> block, the <meta> tags, and the whole <head> section.
  4. Next it strips every class and id attribute, since those only existed to target the CSS that was just deleted.
  5. It unwraps <span> tags — keeping their text content — because Docs uses spans purely as styling hooks.
  6. It simplifies each <a> tag down to a bare href, dropping the styling attributes. The href value itself is kept exactly as exported, so any Google redirect wrapper around the link is still there afterwards.
  7. Finally it deletes empty paragraphs, which Docs scatters generously through the export.
  8. exportCurrentDraft calls the cleaner for the configured draft and writes the result to Drive as an .html file.
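
The anchor step keeps each href as exported. If the links in your output point through a google.com/url?q=… redirect, a further pass can unwrap them. This is a sketch under assumptions: unwrapGoogleRedirects is a hypothetical helper, and the redirect shape (target in the q parameter, followed by &amp;-separated tracking parameters) is what Docs exports commonly look like, not a documented format. Check it against your own output.

```javascript
/**
 * Replaces Google redirect hrefs with their target URL.
 * Hypothetical helper; assumes hrefs of the form
 * https://www.google.com/url?q=<target>&amp;sa=D&amp;...
 * @param {string} html Cleaned HTML containing <a href="..."> tags.
 * @return {string} HTML with redirect hrefs unwrapped.
 */
function unwrapGoogleRedirects(html) {
  return html.replace(
    /href="https:\/\/www\.google\.com\/url\?q=([^"&]+)[^"]*"/g,
    (match, target) => {
      try {
        // The target may be percent-encoded; decode it, then re-escape
        // the characters that are unsafe inside an HTML attribute.
        const decoded = decodeURIComponent(target)
          .replace(/&/g, '&amp;')
          .replace(/"/g, '&quot;');
        return `href="${decoded}"`;
      } catch (e) {
        // Malformed encoding: leave the original href untouched.
        return match;
      }
    }
  );
}
```

It would run as one more step on the string that docToCleanHtml returns, for example unwrapGoogleRedirects(docToCleanHtml(BLOG_DRAFT_ID)).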

Example run

A Doc title and paragraph export from Google as something like:

<p class="c3"><span class="c1">Northwind ships faster</span></p>
<p class="c2"><span class="c0">Our new release cuts setup time in half.</span></p>
<p class="c2"></p>

After docToCleanHtml runs, the same content comes out clean:

<p>Northwind ships faster</p>
<p>Our new release cuts setup time in half.</p>

The classes, spans, and the trailing empty paragraph are gone — what is left pastes cleanly into the CMS.
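
To try the cleaning rules on a sample like this without fetching a real Doc, the replacement chain can be lifted into a function that takes a string. The sketch below copies the same replacements from docToCleanHtml; cleanDocsHtml and checkCleaner are hypothetical names introduced here for illustration.

```javascript
/**
 * Applies the same replacements as docToCleanHtml to a raw HTML string.
 * Hypothetical helper so the rules can be checked without a network call.
 * @param {string} raw Raw HTML as exported by Docs.
 * @return {string} Cleaned HTML.
 */
function cleanDocsHtml(raw) {
  return raw
    .replace(/<style[\s\S]*?<\/style>/g, '')
    .replace(/<meta[^>]*>/g, '')
    .replace(/<head>[\s\S]*?<\/head>/, '')
    .replace(/ class="[^"]*"/g, '')
    .replace(/ id="[^"]*"/g, '')
    .replace(/<span[^>]*>([\s\S]*?)<\/span>/g, '$1')
    .replace(/<a[^>]*href="([^"]+)"[^>]*>/g, '<a href="$1">')
    .replace(/<p[^>]*>(\s|<br>)*<\/p>/g, '');
}

/** Runs the cleaner on the sample above and logs the result. */
function checkCleaner() {
  const sample = [
    '<p class="c3"><span class="c1">Northwind ships faster</span></p>',
    '<p class="c2"><span class="c0">Our new release cuts setup time in half.</span></p>',
    '<p class="c2"></p>',
  ].join('\n');
  console.log(cleanDocsHtml(sample).trim());
}
```

Running checkCleaner from the editor logs the two cleaned paragraphs. Line breaks between tags are left in place, hence the trim().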

Run it

This is an on-demand job — run it whenever a draft is ready to publish:

  1. Set BLOG_DRAFT_ID to the file ID of the Doc you want to export.
  2. In the Apps Script editor, select exportCurrentDraft and click Run.
  3. Approve the authorisation prompt the first time.
  4. Open the logged URL to download draft.html, then paste its contents into the CMS.

Watch out for

  • Regex cleaning is a pragmatic tool, not a full HTML parser. It handles the predictable patterns in a Docs export well, but unusual content — nested tables, deeply styled spans — may leave stray tags. Spot-check the output.
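  • The export keeps heading levels (<h1>, <h2>) and lists, but check inline formatting. Docs usually expresses bold and italic through the generated CSS classes on <span> tags rather than <b>/<i> tags, so removing the stylesheet and unwrapping spans can drop that emphasis. If your CMS needs <strong>/<em>, convert those spans before the cleanup strips them.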
  • Images in the Doc export as base64 data URIs or linked Google URLs, not as files. For image-heavy posts, upload images to the CMS separately.
  • The export endpoint is unofficial. It has been stable for years, but it is not a documented API — if it ever changes, fall back to the Drive API’s export method.
  • DriveApp.createFile drops the file in the root of My Drive. Pass a folder via DriveApp.getFolderById(...).createFile(...) if you want it filed away.
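
The last two points can be sketched together: fetching the HTML through the Drive API v3 files.export endpoint instead of the unofficial URL, and saving into a specific folder. This is a sketch, not a drop-in replacement. buildDriveExportUrl, fetchHtmlViaDriveApi and saveToFolder are hypothetical names, and the folder ID is a placeholder you would replace with your own.

```javascript
// Placeholder: the ID of the Drive folder that should hold the export.
const EXPORT_FOLDER_ID = '1abcExportFolderId';

/**
 * Builds the Drive API v3 export URL for a Doc, asking for HTML.
 * @param {string} docId The file ID of the Doc.
 * @return {string} The export URL.
 */
function buildDriveExportUrl(docId) {
  const id = encodeURIComponent(docId);
  const mimeType = encodeURIComponent('text/html');
  return `https://www.googleapis.com/drive/v3/files/${id}/export?mimeType=${mimeType}`;
}

/**
 * Fetches the raw HTML export through the Drive API, using the
 * script's own OAuth token as in docToCleanHtml.
 * @param {string} docId The file ID of the Doc.
 * @return {string} Raw, uncleaned HTML.
 */
function fetchHtmlViaDriveApi(docId) {
  const response = UrlFetchApp.fetch(buildDriveExportUrl(docId), {
    headers: { Authorization: `Bearer ${ScriptApp.getOAuthToken()}` },
  });
  return response.getContentText();
}

/**
 * Saves HTML into the configured folder instead of the root of My Drive.
 * @param {string} html The HTML to save.
 * @param {string} filename The name of the file to create.
 * @return {GoogleAppsScript.Drive.File} The created file.
 */
function saveToFolder(html, filename) {
  const folder = DriveApp.getFolderById(EXPORT_FOLDER_ID);
  return folder.createFile(filename, html, 'text/html');
}
```

The HTML that comes back would still need the same cleaning chain applied to it. Only the fetch and the save location change.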
