appscript.dev

Find and stage duplicate files for deletion

Detect identical files across Drive by hash — stage them for review, don't auto-delete.

Published Jul 4, 2025

Northwind’s Drive accumulates copies the way every shared Drive does: someone downloads a deck, re-uploads it, “makes a copy” to edit safely, or a sync tool duplicates a folder. Over a couple of years that adds up to gigabytes of identical bytes sitting in different folders, with no easy way to tell which copy is the one people actually use.

This script finds genuine duplicates — files with identical content, not just identical names — by hashing every file it walks and grouping the matches. It writes the groups to a sheet for review. It deliberately does not delete anything: a hash match tells you two files are byte-identical, but only a person knows which copy to keep, so the script stages the decision rather than making it.

What you’ll need

  • A root folder to scan. The script walks it and every subfolder.
  • A blank Google Sheet to receive the report — the script clears and rewrites its first tab on each run.
  • Both IDs: the folder ID from its URL after /folders/, and the spreadsheet ID from its URL after /d/.

The script

// The Drive folder to scan, recursively.
const ROOT_FOLDER_ID = '1abcRootFolderId';

// The spreadsheet that receives the duplicate report.
const REPORT_SHEET_ID = '1abcReportSheetId';

// Skip files above this size — hashing them reads the whole file into
// memory and is rarely worth it. See "Watch out for".
const MAX_HASH_BYTES = 25 * 1024 * 1024; // 25 MB

/**
 * Walks the root folder, hashes every file, and writes any group of
 * two or more byte-identical files to the report sheet for review.
 */
function findDuplicates() {
  // 1. Hash everything under the root folder, keyed by MD5 digest.
  const seen = new Map(); // md5 -> [File, File, ...]
  walkAndHash(DriveApp.getFolderById(ROOT_FOLDER_ID), seen);

  // 2. Keep only the hashes that map to more than one file.
  const rows = [];
  for (const [hash, files] of seen) {
    if (files.length < 2) continue;
    for (const f of files) {
      rows.push([hash, f.getName(), f.getUrl(), f.getSize()]);
    }
  }

  // 3. Clear the previous report. If nothing is duplicated, leave a note and stop.
  const sheet = SpreadsheetApp.openById(REPORT_SHEET_ID).getSheets()[0];
  sheet.clear();
  if (!rows.length) {
    sheet.getRange(1, 1).setValue('No duplicates found.');
    return;
  }

  // 4. Write the header, then one row per duplicate file.
  sheet.getRange(1, 1, 1, 4).setValues([['Hash', 'Name', 'Link', 'Size (bytes)']]);
  sheet.getRange(2, 1, rows.length, 4).setValues(rows);
}

/**
 * Recursively hashes every file under `folder` and groups them in
 * `map` by MD5 digest. Files above MAX_HASH_BYTES are skipped.
 */
function walkAndHash(folder, map) {
  // Hash each file in this folder.
  const files = folder.getFiles();
  while (files.hasNext()) {
    const f = files.next();
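    if (f.getSize() > MAX_HASH_BYTES) continue; // skip very large files

    // Reading the content can fail for items that cannot be downloaded.
    // Skip those (a defensive assumption) rather than abort the scan.
    let bytes;
    try {
      bytes = f.getBlob().getBytes();
    } catch (e) {
      console.warn('Skipped ' + f.getName() + ': ' + e.message);
      continue;
    }

    const hash = Utilities.base64Encode(
      Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, bytes));

    if (!map.has(hash)) map.set(hash, []);
    map.get(hash).push(f);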
  }

  // Then descend into every subfolder.
  const subs = folder.getFolders();
  while (subs.hasNext()) walkAndHash(subs.next(), map);
}
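
The Hash column holds a Base64 string. If a hexadecimal digest is easier to compare against other checksum tools, the conversion might look like the sketch below. It assumes the digest arrives as an array of signed bytes (-128 to 127), which is how Apps Script represents byte arrays; digestToHex is a hypothetical helper name, not part of the script above.

```javascript
/**
 * Converts a digest given as signed bytes (-128..127) into a lowercase
 * hex string. Hypothetical helper, not part of the original script.
 */
function digestToHex(bytes) {
  return bytes
    .map((b) => (b & 0xff).toString(16).padStart(2, '0')) // mask to 0..255
    .join('');
}
```

In walkAndHash, this would wrap the result of Utilities.computeDigest in place of Utilities.base64Encode. Either encoding works as a Map key; the choice only affects how the report reads.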

How it works

  1. findDuplicates starts a Map keyed by hash and calls walkAndHash on the root folder.
  2. walkAndHash skips any file larger than MAX_HASH_BYTES, reads the bytes of each remaining file, computes an MD5 digest, and Base64-encodes it into a string key. It pushes the file onto the list for that key. Two files with the same content produce the same digest, so they land in the same bucket. It then recurses into every subfolder.
  3. Back in findDuplicates, the script keeps only the buckets holding two or more files — those are the actual duplicates — and flattens them into rows of hash, name, link and size.
  4. If no bucket has a match, it writes a single “No duplicates found” cell so the sheet clearly shows the run happened.
  5. Otherwise it writes a header and one row per duplicate file. Rows sharing a hash are the same content — that grouping is what you sort and review by.
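
The bucketing in steps 2 and 3 can be tried outside Drive with plain objects. This is a minimal sketch, not the script itself: duplicateRows and the { hash, name, url, size } record shape are hypothetical stand-ins for the File objects the real script collects.

```javascript
/**
 * Groups records by hash and returns report rows for every hash shared
 * by two or more records. `records` is a hypothetical plain-object
 * stand-in for Drive files: { hash, name, url, size }.
 */
function duplicateRows(records) {
  const seen = new Map(); // hash -> [record, record, ...]
  for (const r of records) {
    if (!seen.has(r.hash)) seen.set(r.hash, []);
    seen.get(r.hash).push(r);
  }

  const rows = [];
  for (const [hash, group] of seen) {
    if (group.length < 2) continue; // a lone file is not a duplicate
    for (const r of group) rows.push([hash, r.name, r.url, r.size]);
  }
  return rows;
}

// Made-up sample data: two records share the hash 'aaa', one is unique.
const sampleRows = duplicateRows([
  { hash: 'aaa', name: 'deck.pptx', url: 'u1', size: 10 },
  { hash: 'bbb', name: 'notes.txt', url: 'u2', size: 3 },
  { hash: 'aaa', name: 'deck copy.pptx', url: 'u3', size: 10 },
]);
```

Because a Map iterates in insertion order, rows for the same hash come out next to each other, which is why the report is already grouped when it is written.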

Example run

After a run, the report sheet might look like this. The two rows sharing a hash are byte-identical files living in different folders:

Hash    | Name                  | Link                           | Size (bytes)
kZ9f…== | Brand deck v4.pptx    | https://drive.google.com/…/abc | 4 812 004
kZ9f…== | Brand deck FINAL.pptx | https://drive.google.com/…/xyz | 4 812 004
pM2t…== | Logo.png              | https://drive.google.com/…/def | 88 210
pM2t…== | Logo copy.png         | https://drive.google.com/…/ghi | 88 210

Brand deck v4 and Brand deck FINAL are the same file under two names — now you can open both, decide which to keep, and delete the other by hand.
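
To get a feel for how much space a clean-up could free, the report rows can be totalled. The sketch below is an illustration only: reclaimableBytes is a hypothetical helper, and it assumes you keep exactly one copy per hash group. The sizes are the ones from the example report; the hashes and links are left elided as shown there.

```javascript
/**
 * Estimates bytes freed if all but one file in each hash group is deleted.
 * `rows` uses the report's column order: [hash, name, link, sizeBytes].
 */
function reclaimableBytes(rows) {
  const firstSeen = new Set();
  let total = 0;
  for (const [hash, , , size] of rows) {
    if (firstSeen.has(hash)) total += size; // every copy after the first
    else firstSeen.add(hash);
  }
  return total;
}

const exampleTotal = reclaimableBytes([
  ['kZ9f…==', 'Brand deck v4.pptx', 'https://drive.google.com/…/abc', 4812004],
  ['kZ9f…==', 'Brand deck FINAL.pptx', 'https://drive.google.com/…/xyz', 4812004],
  ['pM2t…==', 'Logo.png', 'https://drive.google.com/…/def', 88210],
  ['pM2t…==', 'Logo copy.png', 'https://drive.google.com/…/ghi', 88210],
]);
// exampleTotal is 4900214 (4812004 + 88210)
```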

Run it

This is an occasional clean-up job, not something to schedule:

  1. In the Apps Script editor, select findDuplicates and click Run.
  2. Approve the authorisation prompt the first time.
  3. Open the report sheet, sort by the Hash column to group matches together, and review each group before deleting anything in Drive.

Watch out for

  • The script stages duplicates; it never deletes. A hash match proves the bytes are identical, not which copy is canonical — keep deletion a manual step.
  • Hashing reads every file’s full content into memory, which is why files above MAX_HASH_BYTES are skipped. Raising the cap risks hitting memory and runtime limits on a large Drive.
  • Native Google files — Docs, Sheets, Slides — have no stable downloadable byte stream, so getBlob() returns an export (often a PDF) rather than the real file. Treat hash matches as reliable for uploaded files (PDFs, images, Office documents) and unreliable for native Google files.
  • A big Drive can exceed the six-minute runtime limit, since every file is read in full. If you hit it, scan one subtree at a time by changing ROOT_FOLDER_ID.
  • Two different files can in theory share an MD5 digest. An accidental collision is vanishingly unlikely for ordinary office files, but MD5 collisions can be constructed deliberately, which is one more reason a person should eyeball each group before deleting.
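
Several of these limits can be eased by deciding what to hash before reading any bytes. The sketch below is one possible pre-filter, not part of the script above: selectHashCandidates and the { id, size, mimeType } record shape are hypothetical, and in Apps Script the values would come from getSize() and getMimeType(). It rests on two assumptions, both stated in the comments: native Google files carry a MIME type beginning with application/vnd.google-apps., and files of different sizes cannot be byte-identical, so a file with a unique size never needs hashing.

```javascript
// Assumption: native Google files (Docs, Sheets, Slides, ...) report a
// MIME type that starts with this prefix.
const NATIVE_PREFIX = 'application/vnd.google-apps.';

/**
 * Returns the records worth hashing: uploaded files, within the size
 * cap, whose size is shared by at least one other such file.
 * `files` is a hypothetical array of { id, size, mimeType }.
 */
function selectHashCandidates(files, maxBytes) {
  // Drop native Google files and anything above the size cap.
  const eligible = files.filter(
    (f) => !f.mimeType.startsWith(NATIVE_PREFIX) && f.size <= maxBytes);

  // Count how many eligible files have each size.
  const countBySize = new Map();
  for (const f of eligible) {
    countBySize.set(f.size, (countBySize.get(f.size) || 0) + 1);
  }

  // A file with a unique size cannot have a byte-identical twin.
  return eligible.filter((f) => countBySize.get(f.size) > 1);
}
```

Using this would mean walking the folder tree once to collect metadata and hashing only the survivors, which reduces the bytes read and so the pressure on the runtime limit. It does not remove the limit: a large Drive may still need to be scanned one subtree at a time.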
