Every Conflict Produces the Same Footage — Twice

This week, a video of explosions near a building went viral on social media, captioned as showing missile strikes on Tel Aviv during the Israel-Iran conflict. It wasn't. Fact-checkers at Full Fact traced it to a warehouse fire in China, recorded in 2015. But here's the thing: they'd already debunked the same clip twice before — once when it circulated with false claims during Iran's missile attack on Israel in October 2024, and before that during Russia's invasion of Ukraine in 2022.

One piece of footage. Three conflicts. Seven years.

This isn't an outlier. It's the pattern. During every major crisis — conflict, earthquake, flood, political upheaval — old footage resurfaces with new captions. The Beirut port explosion of 2020 was repackaged as Ukrainian war footage. A Kentucky gun range video was broadcast by ABC News as Turkish military strikes on Kurdish civilians. Footage of a 2025 plane crash in Philadelphia now resurfaces every time a conflict escalates, repeatedly misattributed as scenes from Israel.

The images are real. The context is fabricated. And fact-checkers at organisations like Bellingcat, AFP, and Full Fact are playing an endless game of whack-a-mole — manually tracing, debunking, and re-debunking the same recycled content every time a new crisis breaks.

I wanted to see if there was a better way.


The idea: fingerprint everything, track everything

MediaWatch is an open-source proof of concept I'm building to automate media provenance tracking across social media. The core concept is simple:

Every image, video, or audio file that enters the system gets a unique, content-derived fingerprint. That fingerprint is stored in a searchable database. When the same — or similar — content appears again, the system recognises it, regardless of whether it's been cropped, re-compressed, watermarked, or re-encoded.

The result is a living record of how media circulates. When did this image first appear? On which platform? Who posted it? When did it resurface? In what context?

For a newsroom, this means an analyst can drag-and-drop a suspicious image and instantly see its entire history. For an intelligence team monitoring information operations, it means automated detection of recycled content being used to construct false narratives.


The technology: ISCC and similarity-preserving hashes

The fingerprinting engine behind MediaWatch is the International Standard Content Code (ISCC), published as ISO 24138:2024. Unlike a traditional cryptographic hash (where changing a single pixel produces a completely different hash), ISCC generates similarity-preserving fingerprints. Similar content produces similar codes.

Think of it like this: a cryptographic hash is a passport number — change one digit and it's a completely different identity. An ISCC code is more like a facial recognition scan — you can put on sunglasses and a hat, and the system still recognises you.
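The passport-number behaviour of a cryptographic hash is easy to demonstrate with Python's standard library. This is a quick illustration, not part of MediaWatch itself; the byte strings are toy stand-ins for image data:

```python
import hashlib

# Two "images" that differ by exactly one byte.
original = b"toy image pixel data" + b"\x00"
modified = b"toy image pixel data" + b"\x01"

digest_a = hashlib.sha256(original).hexdigest()
digest_b = hashlib.sha256(modified).hexdigest()

# Despite the one-byte change, the digests share essentially nothing.
# This avalanche effect is exactly why cryptographic hashes cannot
# measure similarity, and why ISCC takes a different approach.
```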

ISCC works at multiple levels simultaneously. It generates four sub-codes for every piece of media:

  • A Meta-Code capturing title and caption similarity
  • A Content-Code capturing what the media actually looks and sounds like
  • A Data-Code capturing the binary structure of the file
  • An Instance-Code for exact byte-level matching

The Content-Code is the most interesting for our use case. It survives re-encoding, compression changes, and moderate cropping. The distance between two Content-Codes (measured in Hamming distance — essentially counting how many bits differ) tells you how similar two pieces of media are.

A distance of 0 means identical perceptual content. A distance of 5-15 typically indicates the same image with modifications. Above 30, you're looking at genuinely different content.
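In code, the distance computation and the tiering are only a few lines of pure Python. The thresholds below mirror the ranges described above; they are illustrative, not part of the ISCC specification:

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must be the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def classify_match(distance: int) -> str:
    """Map a Hamming distance onto match tiers (illustrative thresholds)."""
    if distance == 0:
        return "exact"           # identical perceptual content
    if distance <= 15:
        return "near-duplicate"  # same image with modifications
    if distance <= 30:
        return "related"
    return "different"
```

Re-saving an image at a lower JPEG quality typically flips only a handful of bits, which is why such copies land comfortably in the near-duplicate tier.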


The architecture: local, open-source, modular

MediaWatch runs entirely on your machine — no data leaves your infrastructure. This is non-negotiable for newsroom and intelligence use cases where the media under investigation may be sensitive.

The stack is built from open-source components orchestrated via Docker Compose:

  • A FastAPI service wrapping the ISCC Python SDK for fingerprint generation and metadata extraction
  • Milvus as the vector database — it supports native binary vectors with Hamming distance search, which is exactly what ISCC codes need
  • PostgreSQL for structured metadata, sighting histories, and cluster records
  • MinIO for storing the actual media files
  • A Next.js dashboard for analysts to search, explore timelines, and investigate clusters of related media

The pipeline flow is: media arrives → fingerprint is generated → the vector database is searched for matches → if it's new, store it; if it's a duplicate, record the sighting → the analyst sees it all in the dashboard.
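That flow reduces to a small decision procedure. The sketch below uses an in-memory list as a stand-in for the Milvus search and the PostgreSQL sighting table; names like `MediaRecord` and the threshold value are illustrative, not MediaWatch's actual schema:

```python
from dataclasses import dataclass, field

NEAR_DUPLICATE_THRESHOLD = 15  # illustrative; tuned per deployment

def hamming(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

@dataclass
class MediaRecord:
    fingerprint: bytes
    sightings: list = field(default_factory=list)  # (platform, account, caption)

def ingest(store: list, fingerprint: bytes, sighting: tuple) -> MediaRecord:
    """Core pipeline step: search known fingerprints, then either
    record a new media object or log a sighting of an existing one."""
    for record in store:
        if hamming(record.fingerprint, fingerprint) <= NEAR_DUPLICATE_THRESHOLD:
            record.sightings.append(sighting)  # duplicate: record the resurfacing
            return record
    record = MediaRecord(fingerprint, [sighting])  # genuinely new media
    store.append(record)
    return record
```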


What it actually looks like

The dashboard shows media objects flowing through the system — new ingestions, duplicates detected, active events being monitored.

The most powerful feature is upload search: drag an image onto the search page and the system generates its ISCC fingerprint on the fly, queries the database, and returns any matches ranked by similarity. Each result shows the Hamming distance — a visual indicator of how close the match is.

When you click through to a media object, you see its full timeline — every time this image (or a variant of it) was observed on social media, on which platform, by which account, with what caption. Gaps in the timeline show dormancy periods. Clusters of sightings within hours show viral surges.
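Dormancy periods and viral surges fall out of the sighting timestamps directly. A minimal sketch, with window sizes that are arbitrary choices rather than MediaWatch defaults:

```python
from datetime import datetime, timedelta

def dormancy_gaps(timestamps, min_gap=timedelta(days=90)):
    """Return (start, end) pairs where a media object went quiet."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a >= min_gap]

def viral_surges(timestamps, window=timedelta(hours=6), min_count=3):
    """Timestamps that begin a burst of at least min_count sightings
    within the window (the signature of a viral resurfacing)."""
    ts = sorted(timestamps)
    surges = []
    for i, start in enumerate(ts):
        burst = [t for t in ts[i:] if t - start <= window]
        if len(burst) >= min_count:
            surges.append(start)
    return surges
```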

The cluster explorer uses a force-directed graph to visualise families of related media — the original and all its variants (cropped, watermarked, re-encoded) connected by their similarity scores. An analyst can see at a glance how a single piece of footage has been modified and redistributed.
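The edges of that graph are just pairwise Hamming distances under a cut-off. A sketch of how the edge list feeding the visualisation might be built (the threshold is illustrative):

```python
from itertools import combinations

RELATED_THRESHOLD = 30  # illustrative cut-off for drawing an edge

def hamming(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def cluster_edges(fingerprints: dict) -> list:
    """Edges (id_a, id_b, distance) between media whose fingerprints
    fall within the related threshold; feeds the force-directed graph."""
    return [
        (a, b, d)
        for (a, fa), (b, fb) in combinations(fingerprints.items(), 2)
        if (d := hamming(fa, fb)) <= RELATED_THRESHOLD
    ]
```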

The event monitor flags the most interesting pattern of all: media from one event appearing in the feed of a different event. That's the recycled footage signal.
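Detecting that signal is a simple set operation over sighting records. A hypothetical sketch, assuming each sighting is tagged with the event it was posted under:

```python
def recycled_footage(sightings: list) -> dict:
    """Map media_id -> set of events, keeping only media whose sightings
    span more than one event (the recycled-footage signal)."""
    events_by_media = {}
    for media_id, event in sightings:
        events_by_media.setdefault(media_id, set()).add(event)
    return {m: e for m, e in events_by_media.items() if len(e) > 1}
```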


Early results and honest limitations

I've been running calibration tests with sets of images subjected to various edits. Some findings:

What works well: Re-compression and format conversion are nearly invisible to ISCC — an image re-saved at 40% JPEG quality still matches the original with a distance close to zero. Resizing, light watermarks, and text overlays produce low distances that fall comfortably in the "near-duplicate" range.

What's harder: Tight crops (isolating a small area of a larger image) produce higher distances because you're discarding most of the perceptual content. A crop keeping 80% of the image is reliably detected; a crop keeping 30% is borderline. This is a fundamental property of perceptual hashing, not a bug — and it's why the system classifies results into tiers (exact, near-duplicate, related) rather than making binary same/different decisions.

What I'd like to add: Semantic similarity search using CLIP or ImageBind embeddings alongside the ISCC codes. This would answer "find all images of tanks in urban settings" rather than just "find all copies of this specific image." Milvus supports multiple vector fields per record, so both binary ISCC vectors and dense CLIP vectors can coexist in the same collection.
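The two search modes would use different distance measures: Hamming over binary ISCC vectors, cosine over dense embeddings. A standard-library sketch of the latter, with toy four-dimensional vectors standing in for real CLIP embeddings (which have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up embeddings: semantically similar media point in similar directions.
tank_photo  = [0.9, 0.1, 0.4, 0.2]
tank_video  = [0.8, 0.2, 0.5, 0.1]
cat_picture = [0.1, 0.9, 0.0, 0.7]

assert cosine_similarity(tank_photo, tank_video) > cosine_similarity(tank_photo, cat_picture)
```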


Why open-source, why now

The tools for media forensics exist in fragments. Reverse image search. EXIF viewers. Manual databases of known misattributed content. What doesn't exist — at least not as open-source infrastructure — is a systematic, automated pipeline that continuously tracks media provenance across platforms and time.

MediaWatch is a proof of concept, not a finished product. But it demonstrates that the building blocks are mature enough to assemble into something useful. ISCC is an ISO standard with a well-maintained Python SDK. Milvus handles binary vector similarity search at scale.

I first came across ISCC whilst working on The Creative Passport (now https://auracles.io), when we were invited to collaborate on a white paper and proposal for an EU copyright infrastructure. That project introduced me to the standard early on.

On its own, this tool has limited reach. But if the generation of ISCC codes were pooled across organisations, and those fingerprints used to track how images flow into social media conversations, it could help counter a substantial amount of disinformation.

The code is on GitHub. If you work in journalism, OSINT, or media integrity and this resonates — I'd welcome collaborators, testers, and feedback.

You can email me at mark@geekyoto.com


Tech stack: Python, FastAPI, ISCC (ISO 24138), Milvus, PostgreSQL, MinIO, Next.js, Docker Compose

Tags: #OSINT #MediaForensics #Disinformation #OpenSource #ContentProvenance #ISCC #VectorSearch