- Implement HermesClient for interacting with the Hermes CLI. - Create judge module for grading QA outputs from Hermes memory. - Develop LoCoMo dataset parsing and formatting utilities. - Introduce run_eval script to facilitate memory evaluation using LoCoMo-style datasets.
140 lines
4.4 KiB
Markdown
140 lines
4.4 KiB
Markdown
# Hermes Memory Evaluation
|
|
|
|
This is a small LoCoMo-style memory evaluation runner for Hermes Agent.
|
|
It follows the same shape as `openclaw-eval`: ingest historical conversations, ask QA questions with the same user id, then use an LLM judge to score the answers.
|
|
|
|
## 1. Configure Hermes Memory
|
|
|
|
Install or copy the `memory_system` Hermes plugin, then put Memory System settings in `/home/tom/.hermes/memory_system.env`:
|
|
|
|
```dotenv
|
|
MEMORY_SYSTEM_ENDPOINT=http://127.0.0.1:1934
|
|
MEMORY_SYSTEM_USER_ID=default
|
|
MEMORY_SYSTEM_SEARCH_USE_LLM=false
|
|
MEMORY_SYSTEM_COMMIT_EVERY_TURNS=1
|
|
MEMORY_SYSTEM_COMMIT_INTERVAL_SECONDS=0
|
|
```
|
|
|
|
The eval runner overrides `MEMORY_SYSTEM_USER_ID` per LoCoMo sample, so one sample maps to one memory user.
|
|
|
|
## 2. Prepare Config
|
|
|
|
Copy and edit:
|
|
|
|
```bash
|
|
cp eval/hermes_memory_eval/config.example.yaml eval/hermes_memory_eval/config.yaml
|
|
```
|
|
|
|
For a stable eval, keep:
|
|
|
|
```yaml
|
|
memory:
|
|
commit_every_turns: 1
|
|
commit_interval_seconds: 0
|
|
```
|
|
|
|
## 3. Ingest Conversations
|
|
|
|
Before ingest, verify the eval Hermes home can see the plugin:
|
|
|
|
```bash
|
|
HERMES_HOME=/home/tom/memory-gateway/eval/hermes_memory_eval/hermes_home hermes memory status
|
|
```
|
|
|
|
The status must show `memory_system` as installed and active.
|
|
|
|
Run a small smoke test first:
|
|
|
|
```bash
|
|
python eval/hermes_memory_eval/run_eval.py ingest /path/to/locomo10_small.json \
|
|
--config eval/hermes_memory_eval/config.yaml \
|
|
--sample 0 \
|
|
--sessions 1-2 \
|
|
--output output/hermes_ingest.jsonl
|
|
```
|
|
|
|
This sends each selected session to:
|
|
|
|
```bash
|
|
hermes chat -Q --source memory-eval -q "<formatted session>"
|
|
```
|
|
|
|
## 4. Ask QA Questions
|
|
|
|
Use the same sample and user mapping:
|
|
|
|
```bash
|
|
python eval/hermes_memory_eval/run_eval.py qa /path/to/locomo10_small.json \
|
|
--config eval/hermes_memory_eval/config.yaml \
|
|
--sample 0 \
|
|
--count 10 \
|
|
--output output/hermes_qa.jsonl
|
|
```
|
|
|
|
Each QA runs in a fresh Hermes CLI call, so the answer should come from persistent memory rather than the prior short-term chat context.
|
|
The default QA prompt explicitly asks Hermes to call `memory_system_search` before answering.
|
|
If Memory System API does not log `POST /memory-system/search`, inspect the session JSON to confirm whether the model made a tool call.
|
|
|
|
## 5. Judge Answers
|
|
|
|
Use the `judge` section in `config.yaml`:
|
|
|
|
```yaml
|
|
judge:
|
|
base_url: "https://api.openai.com/v1"
|
|
api_key_env: "OPENAI_API_KEY"
|
|
model: "gpt-4o-mini"
|
|
parallel: 4
|
|
timeout_seconds: 120
|
|
```
|
|
|
|
Then run:
|
|
|
|
```bash
|
|
OPENAI_API_KEY=sk-... python eval/hermes_memory_eval/judge.py output/hermes_qa.jsonl \
|
|
--config eval/hermes_memory_eval/config.yaml \
|
|
--output output/hermes_grades.json
|
|
```
|
|
|
|
For Ark/Doubao-style endpoints:
|
|
|
|
```yaml
|
|
judge:
|
|
base_url: "https://ark.cn-beijing.volces.com/api/v3"
|
|
api_key_env: "ARK_API_KEY"
|
|
model: "doubao-seed-2-0-pro-260215"
|
|
```
|
|
|
|
```bash
|
|
ARK_API_KEY=... python eval/hermes_memory_eval/judge.py output/hermes_qa.jsonl \
|
|
--config eval/hermes_memory_eval/config.yaml \
|
|
--output output/hermes_grades.json
|
|
```
|
|
|
|
## Recommended Comparisons
|
|
|
|
Run the same dataset in these modes:
|
|
|
|
- no external memory
|
|
- `MEMORY_SYSTEM_SEARCH_USE_LLM=false`
|
|
- `MEMORY_SYSTEM_SEARCH_USE_LLM=true`
|
|
|
|
Compare final QA score and inspect failed examples. If search recall is high but QA accuracy is low, Hermes is not using retrieved memory well. If search recall is low, the issue is likely write/extract/search quality.
|
|
|
|
## Current Small Dataset Result
|
|
|
|
On `locomo10_small.json` sample `conv-26`, the current smoke test results are:
|
|
|
|
| Mode | Score | Category 1 | Category 2 | Category 3 | Category 4 |
|
|
| --- | ---: | ---: | ---: | ---: | ---: |
|
|
| Memory System enabled | 5/35 (14.29%) | 0/5 (0.00%) | 1/9 (11.11%) | 1/2 (50.00%) | 3/19 (15.79%) |
|
|
| No external memory | 0/35 (0.00%) | 0/5 (0.00%) | 0/9 (0.00%) | 0/2 (0.00%) | 0/19 (0.00%) |
|
|
|
|
This means the Memory System path is contributing signal over the no-memory baseline, but the absolute score is still low. The main follow-up is to inspect failed QA examples and separate retrieval failure from answer-use failure:
|
|
|
|
- If `POST /memory-system/search` does not appear during QA, Hermes did not call the memory tool.
|
|
- If search results do not contain the evidence/gold answer, the write/extract/search path needs improvement.
|
|
- If search results contain the evidence but the answer is wrong, Hermes is not using retrieved memory effectively.
|
|
|
|
For future runs, keep a fresh `user_prefix` per mode so OpenViking/EverOS memory from prior runs does not contaminate results.
|