How two bugs in LlamaIndex cancel each other
Two independent bugs in llama-index-core happen to mask
each other in the default configuration. A fix to either one in
isolation activates the cost impact of the other. Here is the
three-line patch, verified end to end against the real OpenAI API and
a real S3 bucket.
I found a hash-churn bug in LlamaIndex's ingestion pipeline. I reproduced it end to end against real OpenAI embedding calls. Then I tried to extend the same reproduction to S3, expecting things to be worse there, and was surprised when the bug did not fire at all. The reason turned out to be a second, unrelated bug in a different module. The two happen to cancel each other. Bug 1 needs volatile metadata in the hash to misbehave. Bug 2 accidentally strips volatile metadata from cloud-backed documents.
The pair sits one fix apart from compounding instead of cancelling. When LlamaIndex eventually patches the second bug, or when a user writes a metadata callable that bypasses it, the first bug activates for every fsspec-backed reader in the ecosystem at once.
This is a walk through both bugs, the experiments that demonstrate them, and the small change that decouples them. The reproducers live in a public repo: github.com/stirelli/llamaindex-embedding-churn.
TL;DR
- Node.hash, TextNode.hash, and the IngestionCache key all include metadata via MetadataMode.ALL, which ignores excluded_embed_metadata_keys. Any change to a metadata field that should have been filtered out flips the hash and forces a re-embed of otherwise-unchanged content.
- SimpleDirectoryReader over a local filesystem populates date-only timestamps. Any daily or weekly re-indexing cron over a corpus where files get modified between runs silently re-embeds the modified files on the next run, even when content is byte-identical. I verified this end to end against the real OpenAI API.
- SimpleDirectoryReader over fsspec cloud backends (s3fs, gcsfs, adlfs) does not populate those timestamps, because default_file_metadata_func queries POSIX stat keys that fsspec backends don't emit. The data is available in the backend. It just is not queried.
- A one-line change to default_file_metadata_func, using fs.modified(path) instead of stat.get("mtime"), would activate the hash-churn bug for every fsspec-backed reader at once. The bug is not absent from the cloud path. It is hiding behind another bug.
Bug 1: Node.hash includes all metadata
The first piece is in llama-index-core/llama_index/core/schema.py. The hash property on a node computes a SHA-256 over a string that concatenates the text hash with the full metadata string:
@property
def hash(self) -> str:
    doc_identities = []
    metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
    if metadata_str:
        doc_identities.append(metadata_str)
    # ... audio, image, video resources
    if self.text_resource is not None:
        doc_identities.append(self.text_resource.hash)
    doc_identity = "-".join(doc_identities)
    return str(sha256(doc_identity.encode("utf-8", "surrogatepass")).hexdigest())
The relevant detail is MetadataMode.ALL. Documents carry two exclusion lists, excluded_embed_metadata_keys and excluded_llm_metadata_keys, that control which fields are included in the text sent to the embedder and the LLM. MetadataMode.ALL ignores both lists. Every key goes into the hash, regardless of whether it was marked as volatile.
Under IngestionPipeline._handle_upserts, a hash mismatch triggers vector_store.delete() followed by a full re-embedding. Any field in the metadata dict that can change between ingestion runs will therefore trigger a re-embed on the next run, even when the text content is byte-identical.
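The mechanism can be sketched without llama-index at all. The following stand-in (the helper name is illustrative, not the library's) mimics the hash shape above: metadata string joined with the text hash, no exclusion-list filtering.

```python
from hashlib import sha256

def node_hash(text: str, metadata: dict) -> str:
    # Stand-in for Node.hash: concatenates the FULL metadata string with
    # the text hash, mirroring MetadataMode.ALL (exclusion lists ignored).
    metadata_str = "\n".join(f"{k}: {v}" for k, v in sorted(metadata.items()))
    identity = "-".join([metadata_str, sha256(text.encode()).hexdigest()])
    return sha256(identity.encode("utf-8", "surrogatepass")).hexdigest()

text = "Byte-identical chunk content."
run1 = node_hash(text, {"file_name": "a.md", "last_modified_date": "2024-05-01"})
run2 = node_hash(text, {"file_name": "a.md", "last_modified_date": "2024-05-02"})

print(run1 == run2)  # False: a volatile field flipped the hash, text unchanged
```

The text is identical across both runs; only the volatile field differs, and that alone is enough to change the identity the pipeline compares.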
To confirm the mechanism in isolation, the simplest experiment is a hash comparison under four distinct metadata mutations plus a control:
Scenario   Text same?   Hash same?   Churn
atime      True         False        4/4 (100%)
mtime      True         False        4/4 (100%)
size       True         False        4/4 (100%)
ctime      True         False        4/4 (100%)
CONTROL    True         True         0/4 (0%)
Every mutation type flips the hash for 100% of chunks. The control, identical metadata on both runs, produces identical hashes. The hash is deterministic, not noisy. Bug 1 is real.
End-to-end proof against the real OpenAI API
The hash comparison is a mechanical demonstration. The production-relevant question is whether this bug fires in real ingestion flows.
The smallest real flow is SimpleDirectoryReader over a local filesystem, the pattern every LlamaIndex quickstart uses. The reader populates file-stat-derived timestamps via default_file_metadata_func, formatting each via strftime("%Y-%m-%d"). That date-only granularity matters: modifications within the same calendar day do not change the metadata string. Only cross-day transitions flip it.
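The day-granular collapse is easy to see in isolation. A short sketch of the formatting step, using the date-only strftime format described above:

```python
from datetime import datetime

FMT = "%Y-%m-%d"  # the date-only format, as in default_file_metadata_func

morning = datetime(2024, 5, 1, 9, 0, 0)
evening = datetime(2024, 5, 1, 21, 30, 0)   # same calendar day, 12.5 h later
next_day = datetime(2024, 5, 2, 0, 0, 1)    # crossed a day boundary

print(morning.strftime(FMT) == evening.strftime(FMT))   # True: invisible to the hash
print(morning.strftime(FMT) == next_day.strftime(FMT))  # False: the hash flips
```

Any number of same-day edits collapse to one string; the first run after a day boundary is what pays.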
The production scenario is straightforward. A daily or weekly re-indexing job runs SimpleDirectoryReader over /data/docs. Between runs, a subset of files gets modified by a formatter, a sync tool, an editor save, a git checkout, anything that advances mtime. On the next run, each modified file's last_modified_date string changes because the date crossed a day boundary since the last recorded mtime. The hash flips, the vectors get deleted, and the embedding provider gets called again.
To verify this against real billed usage, I set up a minimal pipeline with three small .md files, mtime pinned to yesterday via os.utime, SimpleDirectoryReader feeding a SentenceSplitter feeding OpenAIEmbedding(model="text-embedding-3-small"). Four phases: initial ingest, re-ingest with no change, touch() to advance mtime to today, re-ingest.
The billed token counts matched a local tiktoken estimate exactly, zero delta. Half of the run's cost was the legitimate first ingest. The other half was a full re-embed of byte-identical content, triggered by a single touch that moved mtime from yesterday to today.
                    calls   tokens   cost (USD)
Legitimate (Ph 1)       3      465   $0.00000930
Wasted (Ph 4)           3      465   $0.00000930
TOTAL (actual)          6      930   $0.00001860

Overhead: 100%. The re-embed of identical content cost exactly as much as the legitimate first ingest, doubling the bill.
At this scale the cost is a rounding error. At 1M docs per week being touched across a RAG corpus, this is the kind of silent overhead that shows up as embedding bill drift in a CFO review months after it starts.
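The per-run figures follow directly from the text-embedding-3-small rate ($0.02 per million tokens at the time of writing); a quick arithmetic check:

```python
RATE_PER_MTOK = 0.02  # USD per 1M tokens, text-embedding-3-small (rate at time of writing)

legit_tokens = 465   # Phase 1: first ingest
wasted_tokens = 465  # Phase 4: re-embed of byte-identical content

legit_cost = legit_tokens * RATE_PER_MTOK / 1_000_000
wasted_cost = wasted_tokens * RATE_PER_MTOK / 1_000_000

print(f"${legit_cost:.8f}")                       # $0.00000930, matching the bill
print(f"overhead: {wasted_cost / legit_cost:.0%}")  # 100%: the bill doubles
```

The same arithmetic scales linearly: the overhead tracks the modification rate of the corpus, not its size.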
Trying to extend the finding to S3
At this point the claim was narrow but verified. SimpleDirectoryReader over local filesystem, cross-day modification, date-granular strftime producing a hash flip. The natural next question was whether this extends to the production cloud readers, where the same SimpleDirectoryReader machinery sits on top of fsspec filesystems like s3fs, gcsfs, and adlfs.
I expected the bug to be worse for those readers. S3's LastModified has second precision. If that field ended up in the hash, every S3 re-upload, even sub-second apart, even of byte-identical content, would flip the hash. That would be a much broader trigger surface than "cross-day modifications on local files."
I set up a real S3 bucket, uploaded three .md files, wired up S3Reader with the same counting-embedder pipeline, and ran the same experiment: first ingest, re-ingest without changes, re-upload with two seconds of delay, re-ingest again. The result was:
Phase A1 (first ingest): embed_calls = 3
Phase A2 (no change): embed_calls = 3 (expect unchanged)
Phase A4 (after re-upload): embed_calls = 3 (expect unchanged if bug does NOT fire)
Zero new embed calls after the re-upload. The bug did not fire. I rebuilt the test from scratch twice, suspecting the docstore strategy or the cache configuration. The result was consistent. Re-uploading the same content to S3 and re-running S3Reader.load_data() produces zero new embed calls.
The Document metadata explained why:
>>> docs[0].metadata.keys()
dict_keys(['file_path', 'file_name', 'file_type', 'file_size'])
No last_modified_date. No creation_date. None of the temporal fields that fire the bug on local filesystems are here.
Bug 2: default_file_metadata_func uses POSIX-only keys
The missing temporal fields are not missing because S3 does not have them. s3fs.stat() returns LastModified as a native datetime. Fsspec's standard API exposes it via s3fs.modified(path). The data is there. It just is not being queried.
Look at how default_file_metadata_func in llama-index-core/llama_index/core/readers/file/base.py extracts it:
creation_date = _format_file_timestamp(stat_result.get("created"))
last_modified_date = _format_file_timestamp(stat_result.get("mtime"))
last_accessed_date = _format_file_timestamp(stat_result.get("atime"))
It looks for "created", "mtime", and "atime", the lowercase POSIX-style keys that Python's os.stat() returns for local files. Fsspec backends don't emit those keys. s3fs returns LastModified. gcsfs returns updated. adlfs (Azure Blob) returns last_modified. Each matches its cloud provider's native API shape.
The result is stat_result.get("mtime") returning None for every fsspec backend, _format_file_timestamp(None) returning None, and the default_file_metadata_func postprocessor filtering out None values before returning. Temporal fields silently don't make it into the Document metadata. There's no warning, no log, no indication that the reader is working with a stripped-down metadata set.
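The mismatch is visible with nothing but the key shapes themselves. These dicts are simplified stand-ins for what each backend's stat/info call returns, keeping only the keys relevant here:

```python
# Simplified stand-ins for per-backend stat/info dicts (key names per backend)
posix_stat = {"mtime": 1714500000.0, "created": 1714400000.0, "atime": 1714500001.0}
s3_stat    = {"LastModified": "2024-04-30T18:00:00Z", "size": 1024}   # s3fs
gcs_stat   = {"updated": "2024-04-30T18:00:00Z", "size": 1024}        # gcsfs
azure_stat = {"last_modified": "2024-04-30T18:00:00Z", "size": 1024}  # adlfs

# The lookup default_file_metadata_func performs:
for name, stat in [("posix", posix_stat), ("s3", s3_stat),
                   ("gcs", gcs_stat), ("azure", azure_stat)]:
    print(name, stat.get("mtime"))  # non-None only for posix; None for all three clouds
```

The timestamp is present in every one of the four dicts; only the POSIX one is spelled the way the lookup expects.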
This is Bug 2. It is not the same as Bug 1. It is in a different module, with a different failure mode. Its effect is that cloud-backed Documents lose the volatile metadata that Bug 1 would otherwise churn on.
Proof that fixing Bug 2 would activate Bug 1
The third experiment in the reproducer repo demonstrates what happens when a caller bypasses Bug 2 by writing a file_metadata callable that queries fsspec correctly:
def s3_aware_metadata(file_path: str) -> dict:
    shared_s3fs.invalidate_cache()
    return {
        "file_path": file_path,
        "file_size": shared_s3fs.size(file_path),
        "last_modified_date":
            shared_s3fs.modified(file_path).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

reader_c = S3Reader(bucket=BUCKET, prefix=PREFIX,
                    file_metadata=s3_aware_metadata)
The only difference from Experiment A is the file_metadata argument. The rest of the pipeline is identical. With this in place, the same experiment fires the bug:
Experiment A (default S3Reader.load_data()):
embed_calls = 3 → bug fires: False
Experiment C (custom s3_aware_metadata callable):
embed_calls = 6 → bug fires: True
With correctly-queried datetime in metadata, a 2-second S3 re-upload flips last_modified_date, flips the hash, triggers the delete and re-embed. Three extra embed calls for three re-uploaded files, for zero content change. The mechanism is identical to Bug 1 in the first experiment. The only reason Experiment A did not fire is that the default reader path did not populate the field. Once that gap is closed, the churn bug is back. Now with second precision instead of day precision.
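The precision jump is worth making concrete. Two uploads two seconds apart are indistinguishable at the day granularity of the local-filesystem path, but distinct at the second granularity the cloud path would use:

```python
from datetime import datetime, timedelta

first_upload = datetime(2024, 5, 1, 12, 0, 0)
re_upload = first_upload + timedelta(seconds=2)  # the 2-second re-upload

day_fmt = "%Y-%m-%d"             # local-filesystem reader: day granularity
sec_fmt = "%Y-%m-%dT%H:%M:%SZ"   # second granularity, as in the custom callable

print(first_upload.strftime(day_fmt) == re_upload.strftime(day_fmt))  # True: no churn
print(first_upload.strftime(sec_fmt) == re_upload.strftime(sec_fmt))  # False: churn
```

At day granularity the trigger is "was the file modified since the last day boundary"; at second granularity it is "was the object written at all".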
What this means in practice
A few consequences worth stating plainly.
Current behavior, today. If you run SimpleDirectoryReader over a local filesystem with any scheduled re-indexing, any file modified between runs re-embeds on the next run that crosses a calendar day. This is verified end to end. For teams running daily re-indexing cron jobs over corpora that change often, this is silent overhead that scales with modification rate.
Current non-behavior, today. If you use a cloud reader built on top of SimpleDirectoryReader plus an fsspec backend, the bug does not fire, because Bug 2 strips the temporal metadata before Bug 1 can use it. This is verified end to end for S3. The code inspection for gcsfs and adlfs shows identical key-mismatch patterns.
Future behavior, if Bug 2 gets fixed. The natural fix to default_file_metadata_func is to use fsspec's fs.modified(path) API instead of reading POSIX-specific keys out of the stat dict. That fix is a one-liner. The moment it lands, Bug 1 activates for every fsspec-backed reader in the ecosystem, now at second or sub-second precision because that is what cloud APIs expose. Every S3 object re-upload, every GCS object overwrite, every Azure Blob update becomes a re-embed trigger, even for byte-identical content.
Future behavior, if a user writes a custom callable. Same activation, scoped to that user's project. Anyone who wants proper freshness tracking in their metadata naturally reaches for fs.modified(), and the bug comes with it.
Reproducing it
All five levels of reproducer are in a public repo: github.com/stirelli/llamaindex-embedding-churn.
git clone https://github.com/stirelli/llamaindex-embedding-churn
cd llamaindex-embedding-churn
uv venv
uv pip install -r requirements.txt
uv run python verify_embedding_churn_lvl1.py # hash comparison
uv run python verify_embedding_churn_lvl2.py # counting embedder
uv run python verify_embedding_churn_lvl4.py # reader format survey
Levels 1, 2, and 4 run in under ten seconds with no external dependencies. Level 3 needs an OpenAI API key (the cost per run is around two thousandths of a cent, and your dashboard will reflect the exact tokens). Level 5 needs AWS credentials and an S3 bucket (cost: fractions of a cent in S3 request fees).
Suggested fix
The minimal set of changes is three lines across two files. All three sites use MetadataMode.ALL today. Changing them to MetadataMode.EMBED makes the hash align with the text that is actually sent to the embedder, which is the semantically correct thing to hash.
- metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+ metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
This respects excluded_embed_metadata_keys, the mechanism that already exists for marking fields as not-content-relevant. SimpleDirectoryReader populates that list with the volatile file-stat fields by default. The fix closes the churn without touching the original behavior added to detect meaningful metadata changes.
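The semantics of the fix can be sketched with a stand-in for the metadata-string step (an illustrative helper, not the library's code): EMBED mode drops the excluded keys before anything reaches the hash.

```python
def metadata_str(metadata: dict, excluded: set) -> str:
    # Stand-in for get_metadata_str(mode=MetadataMode.EMBED):
    # drop keys marked as not embed-relevant before hashing.
    return "\n".join(f"{k}: {v}" for k, v in sorted(metadata.items())
                     if k not in excluded)

meta_run1 = {"file_name": "a.md", "last_modified_date": "2024-05-01"}
meta_run2 = {"file_name": "a.md", "last_modified_date": "2024-05-02"}
volatile = {"last_modified_date", "creation_date", "last_accessed_date"}

# With EMBED-style filtering, both runs hash over identical strings:
print(metadata_str(meta_run1, volatile) == metadata_str(meta_run2, volatile))  # True
```

A change to file_name, or to any key not in the exclusion set, still flips the string, so content-relevant metadata changes continue to trigger re-embeds.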
I filed the findings upstream as issue #21461 and submitted PR #21462 with the three-line fix plus a regression test that covers both directions: volatile metadata must not force a re-embed, content-relevant metadata still must. Both are public. Both are verifiable.
Closing
Two bugs cancelling each other is a stable equilibrium in an unstable sense. It survived thirteen months in the codebase because neither half produces a visible failure on its own. Bug 1 looks like normal re-embedding to the user who has never measured it. Bug 2 looks like the reader just not emitting temporal metadata, which is fine if nobody notices. Together they make a cloud-backed RAG pipeline look correct. Until someone closes the fsspec gap, or writes the right callable, and Bug 1 wakes up at sub-second precision across the ecosystem.
The narrow takeaway is: audit your ingestion pipeline's hash key against the input your embedder actually sees. If the two are not the same, you are paying for work that produces the same output. The broader takeaway is that this kind of interaction sits quietly in large codebases, and a well-intentioned cleanup PR is all it takes to release it.