Find and stage duplicate files for deletion
Detect identical files across Drive by hash — stage them for review, don't auto-delete.
Published Jul 4, 2025
Northwind’s Drive accumulates copies the way every shared Drive does: someone downloads a deck, re-uploads it, “makes a copy” to edit safely, or a sync tool duplicates a folder. Over a couple of years that is gigabytes of identical bytes sitting in different folders, and no easy way to tell which copy is the one people actually use.
This script finds genuine duplicates — files with identical content, not just identical names — by hashing every file it walks and grouping the matches. It writes the groups to a sheet for review. It deliberately does not delete anything: a hash match tells you two files are byte-identical, but only a person knows which copy to keep, so the script stages the decision rather than making it.
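To make the "content, not names" point concrete, here is a minimal sketch that runs under Node.js rather than Apps Script. It assumes Node's built-in `crypto` module as a stand-in for `Utilities.computeDigest`, and the helper name `md5Base64` and the sample contents are made up for illustration.

```javascript
// Node.js stand-in for the Apps Script digest call (assumption: run with Node).
const { createHash } = require('node:crypto');

// Base64-encoded MD5 of some bytes, the same digest format the script stores.
function md5Base64(bytes) {
  return createHash('md5').update(bytes).digest('base64');
}

// Hypothetical file contents: two identical copies and one edited version.
const deck = Buffer.from('slide 1, slide 2');
const copy = Buffer.from('slide 1, slide 2');
const edited = Buffer.from('slide 1, slide 2, slide 3');

console.log(md5Base64(deck) === md5Base64(copy));   // true: same bytes, same digest
console.log(md5Base64(deck) === md5Base64(edited)); // false: content differs
```

File names never enter the calculation, which is why a renamed copy still matches its original.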
What you’ll need
- A root folder to scan. The script walks it and every subfolder.
- A blank Google Sheet to receive the report — the script clears and rewrites its first tab on each run.
- Both IDs: the folder ID from its URL after `/folders/`, and the spreadsheet ID from its URL after `/d/`.
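If you would rather not pick the IDs out by eye, a small sketch like the one below can do it. The helper names are hypothetical, the example URLs are built around the placeholder IDs used in the script, and the pattern assumes IDs consist of letters, digits, hyphens and underscores.

```javascript
// Hypothetical helper: returns the folder ID that follows /folders/ in a Drive URL.
function folderIdFromUrl(url) {
  const match = /\/folders\/([A-Za-z0-9_-]+)/.exec(url);
  return match ? match[1] : null;
}

// Hypothetical helper: returns the spreadsheet ID that follows /d/ in a Sheets URL.
function spreadsheetIdFromUrl(url) {
  const match = /\/d\/([A-Za-z0-9_-]+)/.exec(url);
  return match ? match[1] : null;
}

// Example URLs built around the placeholder IDs from the script.
console.log(folderIdFromUrl('https://drive.google.com/drive/folders/1abcRootFolderId?usp=sharing'));
console.log(spreadsheetIdFromUrl('https://docs.google.com/spreadsheets/d/1abcReportSheetId/edit#gid=0'));
```

Both helpers return `null` when the URL does not contain the expected segment, so a mistyped link fails visibly instead of producing a wrong ID.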
The script
// The Drive folder to scan, recursively.
const ROOT_FOLDER_ID = '1abcRootFolderId';
// The spreadsheet that receives the duplicate report.
const REPORT_SHEET_ID = '1abcReportSheetId';
// Skip files above this size: hashing them reads the whole file into
// memory and is rarely worth it. See "Watch out for".
const MAX_HASH_BYTES = 25 * 1024 * 1024; // 25 MB

/**
 * Walks the root folder, hashes every file, and writes any group of
 * two or more byte-identical files to the report sheet for review.
 */
function findDuplicates() {
  // 1. Hash everything under the root folder, keyed by MD5 digest.
  const seen = new Map(); // md5 -> [File, File, ...]
  walkAndHash(DriveApp.getFolderById(ROOT_FOLDER_ID), seen);

  // 2. Keep only the hashes that map to more than one file.
  const rows = [];
  for (const [hash, files] of seen) {
    if (files.length < 2) continue;
    for (const f of files) {
      rows.push([hash, f.getName(), f.getUrl(), f.getSize()]);
    }
  }

  // 3. Clear the first tab. If nothing is duplicated, leave a clear note and stop.
  const sheet = SpreadsheetApp.openById(REPORT_SHEET_ID).getSheets()[0];
  sheet.clear();
  if (!rows.length) {
    sheet.getRange(1, 1).setValue('No duplicates found.');
    return;
  }

  // 4. Write the header, then one row per duplicate file.
  sheet.getRange(1, 1, 1, 4).setValues([['Hash', 'Name', 'Link', 'Size (bytes)']]);
  sheet.getRange(2, 1, rows.length, 4).setValues(rows);
}

/**
 * Recursively hashes every file under `folder` and groups them in
 * `map` by MD5 digest. Files above MAX_HASH_BYTES are skipped.
 */
function walkAndHash(folder, map) {
  // Hash each file in this folder.
  const files = folder.getFiles();
  while (files.hasNext()) {
    const f = files.next();
    if (f.getSize() > MAX_HASH_BYTES) continue; // skip very large files
    const hash = Utilities.base64Encode(
      Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, f.getBlob().getBytes()));
    if (!map.has(hash)) map.set(hash, []);
    map.get(hash).push(f);
  }

  // Then descend into every subfolder.
  const subs = folder.getFolders();
  while (subs.hasNext()) walkAndHash(subs.next(), map);
}
How it works
- `findDuplicates` starts a `Map` keyed by hash and calls `walkAndHash` on the root folder. `walkAndHash` reads each file’s bytes, computes an MD5 digest, and pushes the file onto the list for that digest. Two files with the same content produce the same digest, so they land in the same bucket. It then recurses into every subfolder.
- Back in `findDuplicates`, the script keeps only the buckets holding two or more files (those are the actual duplicates) and flattens them into rows of hash, name, link and size.
- If no bucket has a match, it writes a single “No duplicates found” cell so the sheet clearly shows the run happened.
- Otherwise it writes a header and one row per duplicate file. Rows sharing a hash are the same content, and that grouping is what you sort and review by.
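The bucketing and filtering steps can be tried on their own, without touching Drive. The sketch below uses plain objects as stand-ins for Drive files; the hash strings, file names, and the helper names `groupByHash` and `duplicateRows` are all made up for illustration.

```javascript
// Stand-in records: in the real script these are Drive File objects and the
// hash comes from Utilities.computeDigest. Values here are invented.
const sample = [
  { hash: 'aaa', name: 'Brand deck v4.pptx' },
  { hash: 'bbb', name: 'Notes.txt' },
  { hash: 'aaa', name: 'Brand deck FINAL.pptx' },
];

// Bucket records by hash, the way walkAndHash fills its Map.
function groupByHash(records) {
  const seen = new Map(); // hash -> [record, record, ...]
  for (const r of records) {
    if (!seen.has(r.hash)) seen.set(r.hash, []);
    seen.get(r.hash).push(r);
  }
  return seen;
}

// Keep only buckets with two or more entries and flatten them into rows.
function duplicateRows(seen) {
  const rows = [];
  for (const [hash, group] of seen) {
    if (group.length < 2) continue; // a lone file is not a duplicate
    for (const r of group) rows.push([hash, r.name]);
  }
  return rows;
}

console.log(duplicateRows(groupByHash(sample)));
```

Running it logs the two Brand deck rows and leaves out `Notes.txt`, whose bucket holds only one file.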
Example run
After a run, the report sheet might look like this. The two rows sharing a hash are byte-identical files living in different folders:
| Hash | Name | Link | Size (bytes) |
|---|---|---|---|
| kZ9f…== | Brand deck v4.pptx | https://drive.google.com/…/abc | 4 812 004 |
| kZ9f…== | Brand deck FINAL.pptx | https://drive.google.com/…/xyz | 4 812 004 |
| pM2t…== | Logo.png | https://drive.google.com/…/def | 88 210 |
| pM2t…== | Logo copy.png | https://drive.google.com/…/ghi | 88 210 |
Brand deck v4 and Brand deck FINAL are the same file under two names. Now you can open both, decide which to keep, and delete the other by hand.
Run it
This is an occasional clean-up job, not something to schedule:
- In the Apps Script editor, select `findDuplicates` and click Run.
- Approve the authorisation prompt the first time.
- Open the report sheet, sort by the Hash column to group matches together, and review each group before deleting anything in Drive.
Watch out for
- The script stages duplicates; it never deletes. A hash match proves the bytes are identical, not which copy is canonical — keep deletion a manual step.
- Hashing reads every file’s full content into memory, which is why files above `MAX_HASH_BYTES` are skipped. Raising the cap risks hitting memory and runtime limits on a large Drive.
- Native Google files (Docs, Sheets, Slides) have no stable downloadable byte stream, so `getBlob()` returns an export (often a PDF) rather than the real file. Treat hash matches as reliable for uploaded files (PDFs, images, Office documents) and unreliable for native Google files.
- A big Drive can exceed the six-minute runtime limit, since every file is read in full. If you hit it, scan one subtree at a time by changing `ROOT_FOLDER_ID`.
- Two different files can in theory share an MD5 digest. It is vanishingly unlikely for ordinary office files, but it is one more reason a person should eyeball each group before deleting.
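One way to act on the native-file and runtime caveats is to add two small guards to the walk. This is a sketch, not part of the script above: `shouldHash` and `makeDeadline` are hypothetical helpers, the MIME type would come from `f.getMimeType()` in Apps Script, and the five-minute budget is an assumed margin under the six-minute limit.

```javascript
// Native Google files (Docs, Sheets, Slides, folders, shortcuts) share this MIME prefix.
const NATIVE_PREFIX = 'application/vnd.google-apps.';

// Hypothetical guard: is this file worth hashing, given its MIME type and size?
function shouldHash(mimeType, sizeBytes, maxBytes) {
  if (mimeType.startsWith(NATIVE_PREFIX)) return false; // getBlob() would give an export
  return sizeBytes <= maxBytes;
}

// Hypothetical guard: returns a function that reports whether the time budget is spent.
// `now` is injectable so the check can be exercised without waiting.
function makeDeadline(budgetMs, now = Date.now) {
  const start = now();
  return () => now() - start >= budgetMs;
}

const cap = 25 * 1024 * 1024; // same 25 MB cap as the script
console.log(shouldHash('application/pdf', 1024, cap));                   // true
console.log(shouldHash('application/vnd.google-apps.document', 0, cap)); // false

// Assumed budget: stop after five minutes, leaving a margin before the limit.
const outOfTime = makeDeadline(5 * 60 * 1000);
console.log(outOfTime()); // false right after creation
```

Inside `walkAndHash`, you could skip a file when `shouldHash(...)` is false and stop descending once `outOfTime()` returns true, then write whatever has been grouped so far.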