Why OpenViking Memory Retrieval Kept Failing Until I Switched to an Older Model

2025-09-29

For three days, I was trying to figure out why Hermes Agent, wired up to OpenViking for long-term memory, never seemed to search its own knowledge base. Every conversation felt like a fresh start, as if nothing had ever been stored.

My first assumption was that I had misconfigured something. OpenViking is an open-source project from ByteDance, so a basic implementation mistake on their side seemed less likely than a mistake on mine. The version was v0.3.14. I initially ran the VLM locally through Ollama, then switched it to the MiniMax API. Embeddings were handled locally with bge-small-zh. Health checks returned 200 OK, the logs showed no obvious errors, and on the surface everything looked fine.

And yet viking_search always returned nothing.

First check: the vector database was empty

The most basic question was whether anything had actually been written into storage.

curl http://127.0.0.1:1933/api/v1/observer/vikingdb

The result showed Vector Count = 0. The vector database was completely empty. At the same time, the Semantic-Nodes queue showed 3335 items either being processed or already processed, which meant the pipeline was running but data had not landed in the database yet.

A few minutes later, I checked again. The count had gone from 0 to 856.

So that part turned out not to be broken at all. It was just an asynchronous queue backlog. Once processing caught up, search started working.
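
Since the backlog clears on its own, the manual recheck can be turned into a small wait loop. The following is a minimal sketch, not part of OpenViking: `fetch_count` is a hypothetical callable that you would implement yourself by querying the observer endpoint above and pulling the vector count out of the response. The response format is not shown here, so parsing is left to the caller, and the timeout and interval values are arbitrary.

```python
import time
from typing import Callable


def wait_for_vectors(
    fetch_count: Callable[[], int],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the vector count is non-zero or the timeout expires.

    fetch_count is a hypothetical hook: it should query the observer
    endpoint and return the current vector count as an int.
    Returns the last observed count (0 means the timeout was reached).
    """
    waited = 0.0
    count = fetch_count()
    while count == 0 and waited < timeout_s:
        sleep(interval_s)
        waited += interval_s
        count = fetch_count()
    return count
```

A return value of 0 after the timeout suggests the problem is not just a backlog and is worth investigating further.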

But that only explained resource indexing through viking_add_resource. Session memory written by viking_remember was a separate issue entirely.

Second check: the VLM extracted content, but no memory was saved

Calls to viking_remember returned stored, and the messages did enter the session queue. Session commit was triggered normally, the task status became completed, and the VLM had clearly been invoked:

llm_token_usage: {prompt_tokens: 30893, completion_tokens: 4050, total_tokens: 34943}

That is a prompt of over 30,000 tokens and a completion of over 4,000. The model was definitely doing work.

But memories_extracted came back as an empty object {}. All eight categories were zero.

That was the part that made no sense. The VLM had been called, it had produced an output, the task had completed successfully, and still no memory was stored.

The answer finally showed up in the systemd logs:

Direct model validation failed: 1 validation error for StructuredMemoryOperations
profile.name
Extra inputs are not permitted [type=extra_forbidden]
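
Errors like this are easy to miss because the task itself still reports completed. A minimal sketch of a log scanner that pulls out the model name, field path, and error type is shown below. It assumes the log text has the three-line shape shown above, with any journal prefix already stripped (for example by reading the journal with `journalctl -o cat`). The function name and the returned dictionary layout are made up for illustration.

```python
import re

# Matches "1 validation error for StructuredMemoryOperations" and plural forms.
HEADER = re.compile(r"(\d+) validation errors? for (\w+)")
# Matches the "[type=extra_forbidden" part of an error detail line.
ERROR_TYPE = re.compile(r"\[type=(\w+)")


def parse_validation_errors(log_text: str) -> list:
    """Collect validation errors from Pydantic-style log output.

    Assumes each error is a field-path line followed by a detail line
    that contains "[type=...]", as in the excerpt above.
    """
    findings = []
    model = None
    previous = ""
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        header = HEADER.search(line)
        if header:
            model = header.group(2)
        else:
            error_type = ERROR_TYPE.search(line)
            if error_type and model:
                # The line just before the detail line is the field path.
                findings.append(
                    {"model": model, "field": previous, "type": error_type.group(1)}
                )
        previous = line
    return findings
```

Run against the excerpt above, this would report the field `profile.name` with type `extra_forbidden`, which is the detail that points at schema drift rather than a failed model call.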

So the model had not failed to extract anything. It had actually extracted useful content. It correctly identified OpenViking's five tools, the version number, and the configuration details. But the JSON it generated looked like this:

{"entities": [...], "tools": [...], "profile": {"name": "default", "content": "..."}, "identity": {...}}

OpenViking, however, expected a Pydantic schema like this:

{"reasoning": "...", "write_uris": [...], "edit_uris": [...], "delete_uris": [...]}

In other words, M2.5, the MiniMax model I had configured as the VLM at that point, had invented its own structure. Instead of following the schema in the prompt, it reorganized memory into keys like entities, tools, and profile, and even added a name field under profile that the schema did not allow.

Because OpenViking uses Pydantic with extra='forbid', the entire output was rejected immediately. Roughly 35,000 tokens of work (30,893 in the prompt, 4,050 in the completion) were thrown away the moment validation ran.
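
To make the all-or-nothing rejection concrete, here is a stdlib-only stand-in for the effect of a forbid-extras policy. It is not OpenViking's code and does not use Pydantic. The allowed key set is copied from the expected shape above and is assumed to be the complete top level, and the names `parse_strict` and `ExtraKeysError` are hypothetical.

```python
import json

# Assumption: these four keys, copied from the expected shape above,
# are the only ones allowed at the top level.
ALLOWED_KEYS = frozenset({"reasoning", "write_uris", "edit_uris", "delete_uris"})


class ExtraKeysError(ValueError):
    """Raised when the payload contains keys the schema does not allow."""


def parse_strict(raw_json: str, allowed: frozenset = ALLOWED_KEYS) -> dict:
    """Parse model output and reject it outright if any unknown key appears."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ExtraKeysError("top-level JSON value must be an object")
    extra = sorted(set(data) - allowed)
    if extra:
        # All-or-nothing: one unexpected key discards the whole payload,
        # no matter how useful the rest of the content is.
        raise ExtraKeysError(f"extra keys not permitted: {', '.join(extra)}")
    return data
```

The point of the sketch is the failure mode: a payload with keys such as entities or profile never reaches storage, even when the content inside those keys is accurate.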

That behavior is not really OpenViking's fault. From an engineering perspective, extra='forbid' is a sensible guardrail. If arbitrary model-generated fields are accepted, malformed or dirty data ends up in storage. Models like GPT and Claude are usually strong enough at instruction following not to drift like this. What surprised me was how weak M2.5 turned out to be in this particular area. It was making up field names, wrapping strings inside dictionaries, and generally ignoring the schema defined in the prompt.

Third check: newer was not better

At that point I started wondering whether this was specific to M2.5, so I tested M2.1 and M2 with the same default template and the same session, changing only the model. I had also tried M2.7 earlier, but its API rate limiting was too aggressive to evaluate properly. Semantic summarization during resource indexing kept hitting 429 errors.
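
Retrying with exponential backoff is the usual first response to 429s, although with a tight quota it may only slow indexing down rather than rescue it. A minimal sketch, assuming the caller wraps its own API request in `call` and raises the hypothetical `RateLimited` exception when the response status is 429:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical exception the caller raises when the API answers 429."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() on RateLimited, doubling the delay each time, with jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the rate limit to the caller
            delay = base_delay_s * (2 ** attempt)
            # Jitter spreads retries out so parallel workers do not sync up.
            sleep(delay + random.uniform(0, delay / 2))
```

The attempt count and base delay are placeholders and would need tuning against the actual quota.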

The practical comparison ended up looking like this:

<table>
<thead>
<tr> <th>Model</th> <th>Generation</th> <th>Extraction with default template</th> <th>Logs</th> </tr>
</thead>
<tbody>
<tr> <td>M2.5</td> <td>newer</td> <td>0 memories</td> <td>extra_forbidden errors</td> </tr>
<tr> <td>M2.1</td> <td>middle</td> <td>1 memory</td> <td>no errors</td> </tr>
<tr> <td>M2</td> <td>oldest</td> <td>4 memories</td> <td>no errors</td> </tr>
</tbody>
</table>

The oldest model performed best. The newer M2.5 extracted nothing usable.

That result felt counterintuitive at first, but it makes sense if you think about how newer models often behave. They tend to be more opinionated and more willing to reorganize output in ways they believe are better. That can help with creative tasks. It can be a liability when the task is mechanical structured JSON generation.

Similar cases have been reported elsewhere: a model upgrade raises overall benchmark scores while format-following gets worse. This was just the first time I ran into it directly, while debugging a production-like setup.

What actually fixed it

I switched the VLM back to M2 and left everything else unchanged. After that, the whole pipeline behaved normally:

VLM: minimax/MiniMax-M2 (litellm)
memories_extracted: 6 write + 1 edit
Vector store: 3770 vectors
Retrieval: zero errors, results returned normally

I also tried patching the template along the way by adding explicit JSON field constraints to each category description. That worked immediately: M2.5 went from 0 extracted memories to 7.
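
For illustration only, the kind of patch described here can be sketched as a function that appends an explicit field constraint to each category description. The real template layout is not shown in this post, so everything below is an assumption: the descriptions are modeled as a plain dictionary of strings, the per-category field lists are supplied by the caller, and the constraint wording is invented.

```python
def constrain_description(description: str, allowed_fields: tuple) -> str:
    """Append an explicit JSON field constraint to one category description."""
    field_list = ", ".join(f'"{name}"' for name in allowed_fields)
    constraint = (
        f"Return only these JSON fields for this category: {field_list}. "
        "Do not add, rename, or nest any other fields."
    )
    return f"{description.rstrip()}\n{constraint}"


def patch_template(descriptions: dict, fields_by_category: dict) -> dict:
    """Return a copy of the descriptions with constraints added where known.

    Both arguments are hypothetical structures: category name -> description
    text, and category name -> tuple of allowed field names.
    """
    patched = {}
    for category, text in descriptions.items():
        fields = fields_by_category.get(category)
        # Categories without a known field list are left untouched.
        patched[category] = constrain_description(text, fields) if fields else text
    return patched
```

Returning a new dictionary rather than editing in place keeps the original descriptions available for comparison, which helps when checking whether the constraint alone changed the model's behavior.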

But that was not really the right fix. A pip install --upgrade would overwrite the template anyway, and once PR #1045, the memory v2 refactor, is released, the extractor is expected to become more tolerant. At that point, a local template patch would just be unnecessary maintenance overhead.

So the final decision was simple: use M2 for the VLM and leave the rest alone until the official update lands.

What I took away from the debugging process

If viking_search returns empty results, first check whether the vector database count is still 0. Async queue backlog is normal, and sometimes the right move is just to wait a few minutes.
OpenViking's VLM extraction pipeline and its retrieval pipeline are independent. Extraction depends on the configured VLM model in the vlm section of ov.conf, while retrieval depends on the embedding model in the embedding section. One can fail while the other still works.
M2.7 is limited by fairly tight API rate caps. During resource indexing, semantic summary generation can saturate the quota and trigger frequent 429 responses. That was another reason not to use M2.7 in practice.
Do not assume a newer model will be better at structured output. Higher benchmark numbers do not automatically mean better instruction following. In this case, the oldest model, M2, was the most reliable choice.
Avoid patching templates when switching models can solve the problem. Local template patches are temporary, and upgrades tend to wipe them out.
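
The first two takeaways amount to a short decision procedure, sketched below. The function and its return strings are hypothetical. The inputs are the three observations used throughout this post: the vector count from the observer endpoint, the completion token count from llm_token_usage, and the memories_extracted counts, assumed here to be a dictionary of integers.

```python
def triage(vector_count: int, completion_tokens: int, memories_extracted: dict) -> str:
    """Map the observations from this write-up to a suggested next step."""
    if vector_count == 0:
        # Resource indexing is asynchronous; an empty store may be a backlog.
        return "wait: indexing queue may still be draining, recheck in a few minutes"
    extracted = sum(memories_extracted.values())
    if extracted == 0 and completion_tokens > 0:
        # The VLM produced output, yet nothing was stored.
        return "check logs: VLM output was likely rejected by schema validation"
    if extracted == 0:
        return "check vlm config: the model does not appear to have been called"
    return "ok: vectors present and memories extracted"
```

With the numbers from the failing run (856 vectors, 4050 completion tokens, an empty memories_extracted), this lands on the log check, which is where the actual cause was found.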

What looked like a memory system failure turned out to be something more specific: the model was smart enough to produce a well-organized answer, but not disciplined enough to stay inside the schema. In this workflow, that distinction mattered more than raw capability.

GeekShared
