Why OpenViking Memory Retrieval Kept Failing Until I Switched to an Older Model
I spent three days trying to figure out why Hermes Agent, wired up to OpenViking for long-term memory, never seemed to search its own knowledge base. Every conversation felt like a fresh start, as if nothing had ever been stored.
My first assumption was that I had misconfigured something. OpenViking is an open-source project from ByteDance, so a basic implementation mistake on their side seemed less likely than a mistake on mine. The version was v0.3.14. I initially ran the VLM locally through Ollama, then switched it to the MiniMax API. Embeddings were handled locally with bge-small-zh. Health checks returned 200 OK, the logs showed no obvious errors, and on the surface everything looked fine.
And yet viking_search always returned nothing.
First check: the vector database was empty
The most basic question was whether anything had actually been written into storage.
curl http://127.0.0.1:1933/api/v1/observer/vikingdb
The result showed Vector Count = 0. The vector database was completely empty. At the same time, the Semantic-Nodes queue showed 3335 items either being processed or already processed, which meant the pipeline was running but data had not landed in the database yet.
A few minutes later, I checked again. The count had gone from 0 to 856.
So that part turned out not to be broken at all. It was just an asynchronous queue backlog. Once processing caught up, search started working.
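The "check, wait, check again" routine above can be scripted. What follows is a minimal sketch, not part of OpenViking: the endpoint is the one from the curl command, but I am assuming the response contains the label "Vector Count" followed by a number, and the helper names (fetch_observer_text, parse_vector_count, wait_for_vectors) are hypothetical.

```python
import re
import time
import urllib.request
from typing import Callable, Optional

# Endpoint taken from the curl command above.
OBSERVER_URL = "http://127.0.0.1:1933/api/v1/observer/vikingdb"


def fetch_observer_text(url: str = OBSERVER_URL, timeout: float = 5.0) -> str:
    """Return the raw response body from the observer endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_vector_count(text: str) -> Optional[int]:
    """Return the first integer after a 'Vector Count' label, or None.

    The label format is an assumption based on the output I saw;
    adjust the pattern if your build reports the count differently.
    """
    match = re.search(r"Vector[\s_]*Count\D*?(\d+)", text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def wait_for_vectors(
    fetch: Callable[[], str] = fetch_observer_text,
    attempts: int = 10,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the vector count is non-zero; return the last count seen."""
    count = 0
    for attempt in range(attempts):
        count = parse_vector_count(fetch()) or 0
        if count > 0:
            break
        if attempt < attempts - 1:
            sleep(delay_s)  # give the async queue time to drain
    return count
```

Calling wait_for_vectors() with the defaults polls for up to about five minutes. A return value of 0 after that is a better signal that something is actually broken than a single empty reading.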
But that only explained resource indexing through viking_add_resource. Session memory written by viking_remember was a separate issue entirely.
Second check: the VLM extracted content, but no memory was saved
Calls to viking_remember returned stored, and the messages did enter the session queue. Session commit was triggered normally, the task status became completed, and the VLM had clearly been invoked:
llm_token_usage: {prompt_tokens: 30893, completion_tokens: 4050, total_tokens: 34943}
That is a prompt of over 30,000 tokens and a completion of over 4,000. The model was definitely doing work.
But memories_extracted came back as an empty object {}. All eight categories were zero.
That was the part that made no sense. The VLM had been called, it had produced an output, the task had completed successfully, and still no memory was stored.
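This combination of "tokens were spent" and "nothing was extracted" is easy to test for mechanically. Here is a small sketch of such a check. The keys llm_token_usage, completion_tokens, and memories_extracted are the ones I saw in the task output. The surrounding dict shape, the assumption that memories_extracted maps category names to counts, and the function name extraction_looks_dropped are all mine.

```python
from typing import Any, Mapping


def extraction_looks_dropped(result: Mapping[str, Any]) -> bool:
    """Return True when the VLM produced output but no memory was counted.

    Assumes `result` is a dict-like task result where `memories_extracted`
    maps category names to integer counts (an empty dict means zero).
    """
    usage = result.get("llm_token_usage") or {}
    extracted = result.get("memories_extracted") or {}

    model_ran = usage.get("completion_tokens", 0) > 0
    total_memories = sum(v for v in extracted.values() if isinstance(v, int))

    return model_ran and total_memories == 0


# The failing commit from this section, reduced to the relevant fields.
failed_commit = {
    "llm_token_usage": {
        "prompt_tokens": 30893,
        "completion_tokens": 4050,
        "total_tokens": 34943,
    },
    "memories_extracted": {},
}
print(extraction_looks_dropped(failed_commit))
```

When this returns True, the task status alone cannot be trusted, and the service logs are the next place to look.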
The answer finally showed up in the systemd logs:
Direct model validation failed: 1 validation error for StructuredMemoryOperations
profile.name
Extra inputs are not permitted [type=extra_forbidden]
So the model had not failed to extract anything. It had actually extracted useful content. It correctly identified OpenViking's five tools, the version number, and the configuration details. But the JSON it generated looked like this:
{"entities": [...], "tools": [...], "profile": {"name": "default", "content": "..."}, "identity": {...}}
OpenViking, however, expected a Pydantic schema like this:
{"reasoning": "...", "write_uris": [...], "edit_uris": [...], "delete_uris": [...]}
In other words, MiniMax M2.5 had decided to invent its own structure. Instead of following the schema in the prompt, it reorganized memory into keys like entities, tools, and profile, and even added a name field under profile that the schema did not allow.
Because OpenViking uses Pydantic with extra='forbid', the entire output was rejected immediately. Roughly 35,000 tokens of work, including about 4,000 tokens of extracted memory, were effectively thrown away in one second.
That behavior is not really OpenViking's fault. From an engineering perspective, extra='forbid' is a sensible guardrail. If arbitrary model-generated fields are accepted, malformed or dirty data ends up in storage. Models like GPT and Claude are usually strong enough at instruction following not to drift like this. What surprised me was how weak M2.5 turned out to be in this particular area. It was making up field names, wrapping strings inside dictionaries, and generally ignoring the schema defined in the prompt.
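To make the rejection concrete, here is a plain-Python stand-in for the strict check. It is not OpenViking's code and it does not use Pydantic. It only mimics the part that matters here, which is that any key outside the expected set is an error. The four expected keys come from the schema shown earlier. Treating all four as required is my assumption, and check_strict is a hypothetical helper.

```python
import json
from typing import List

# Keys from the expected schema shown earlier in this post.
EXPECTED_KEYS = {"reasoning", "write_uris", "edit_uris", "delete_uris"}


def check_strict(raw_json: str) -> List[str]:
    """Return a list of problems; an empty list means the keys match exactly."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict):
        return ["top-level value is not a JSON object"]

    # Unknown keys are what a forbid-extras schema rejects.
    problems = [f"unexpected key: {k}" for k in sorted(set(payload) - EXPECTED_KEYS)]
    # Assumption: every expected key is required.
    problems += [f"missing key: {k}" for k in sorted(EXPECTED_KEYS - set(payload))]
    return problems


# Shaped like the drifted M2.5 output, with contents elided.
drifted = (
    '{"entities": [], "tools": [], '
    '"profile": {"name": "default", "content": ""}, "identity": {}}'
)
for problem in check_strict(drifted):
    print(problem)
```

Every top-level key in the drifted output is flagged, which matches what happened in practice: the content could be perfectly good and still fail validation as a whole.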
Third check: newer was not better
At that point I started wondering whether this was specific to M2.5, so I tested M2.1 and M2 with the same default template and the same session, changing only the model. I had also tried M2.7 earlier, but its API rate limiting was too aggressive to evaluate properly. Semantic summarization during resource indexing kept hitting 429 errors.
The practical comparison ended up looking like this:
<table>
<thead>
<tr>
<th>Model</th>
<th>Generation</th>
<th>Extraction with default template</th>
<th>Logs</th>
</tr>
</thead>
<tbody>
<tr>
<td>M2.5</td>
<td>newer</td>
<td>0 memories</td>
<td>extra_forbidden errors</td>
</tr>
<tr>
<td>M2.1</td>
<td>middle</td>
<td>1 memory</td>
<td>no errors</td>
</tr>
<tr>
<td>M2</td>
<td>oldest</td>
<td>4 memories</td>
<td>no errors</td>
</tr>
</tbody>
</table>
The oldest model performed best. The newer M2.5 extracted nothing usable.
That result felt counterintuitive at first, but it makes sense if you think about how newer models often behave. They tend to be more opinionated and more willing to reorganize output in ways they believe are better. That can help with creative tasks. It can be a liability when the task is mechanical structured JSON generation.
There have been similar cases elsewhere too: model upgrades raise overall benchmark scores while format-following performance gets worse. This was just the first time I ran into it directly while debugging a production-like setup.
What actually fixed it
I switched the VLM back to M2 and left everything else unchanged. After that, the whole pipeline behaved normally:
VLM: minimax/MiniMax-M2 (litellm)
memories_extracted: 6 write + 1 edit
Vector store: 3770 vectors
Retrieval: zero errors, results returned normally
I also tried patching the template along the way by adding explicit JSON field constraints to each category description. That worked immediately: M2.5 went from 0 extracted memories to 7.
But that was not really the right fix. A pip install --upgrade would overwrite the template anyway, and once PR #1045, the memory v2 refactor, is released, the extractor is expected to become more tolerant. At that point, a local template patch would just be unnecessary maintenance overhead.
So the final decision was simple: use M2 for the VLM and leave the rest alone until the official update lands.
What I took away from the debugging process
- If viking_search returns empty results, first check whether the vector database count is still 0. Async queue backlog is normal, and sometimes the right move is just to wait a few minutes.
- OpenViking's VLM extraction pipeline and its retrieval pipeline are independent. Extraction depends on the configured VLM model in the vlm section of ov.conf, while retrieval depends on the embedding model in the embedding section. One can fail while the other still works.
- M2.7 is limited by fairly tight API rate caps. During resource indexing, semantic summary generation can saturate the quota and trigger frequent 429 responses. That was another reason not to use M2.7 in practice.
- Do not assume a newer model will be better at structured output. Higher benchmark numbers do not automatically mean better instruction obedience. In this case, the oldest M2 was the most reliable choice.
- Avoid patching templates when switching models can solve the problem. Configuration hacks are temporary and upgrades tend to wipe them out.
What looked like a memory system failure turned out to be something more specific: the model was smart enough to produce a well-organized answer, but not disciplined enough to stay inside the schema. In this workflow, that distinction mattered more than raw capability.