
A major challenge in deploying autonomous agents is building systems that can adapt to changes in their environments without the need to retrain the underlying large language models (LLMs).
A new framework called Memento-Skills, developed by researchers at several universities, addresses this bottleneck by giving agents the ability to develop their skills on their own. "It adds continual learning capability to existing offerings in the current market, such as OpenClaw and Claude Code," Jun Wang, co-author of the paper, told VentureBeat.
Memento-Skills acts as an evolving external memory, allowing the system to progressively improve its capabilities without modifying the underlying model. The framework provides a set of skills that can be updated and expanded as the agent receives feedback from its environment.
For enterprise teams running agents in production, that matters. The alternatives (fine-tuning model weights or building skills by hand) come with significant operational overhead and data requirements. Memento-Skills avoids both.
The challenges of building self-evolving agents
Self-evolving agents are crucial because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain fixed, restricting it to the knowledge encoded during training and anything that fits into its immediate context window.
Giving the model an external memory structure allows it to improve without the costly and time-consuming process of retraining. However, current approaches to agent adaptation rely heavily on manually designed skills to handle new tasks. While there are some automated skill learning methods, they mostly produce text-only guides that amount to prompt optimization. Other approaches simply record single-task trajectories that do not transfer between different tasks.
Furthermore, when these agents attempt to retrieve knowledge relevant to a new task, they typically rely on semantic similarity routers, such as standard dense embeddings. But high semantic overlap does not guarantee behavioral utility. An agent relying on standard RAG could retrieve a "reset password" script to handle a "refund processing" query simply because the documents share business terminology.
"Most retrieval-augmented generation (RAG) systems are based on similarity-based retrieval. However, when skills are represented as executable artifacts, such as markdown documents or code snippets, similarity alone may not select the most effective skill," Wang said.
How Memento-Skills stores and upgrades skills
To address the limitations of current agent systems, the researchers built Memento-Skills. The paper describes the system as “a generalist, continuously learning LLM agent system that functions as an agent designer.” Instead of maintaining a passive record of past conversations, Memento-Skills creates a set of skills that act as a persistent and evolving external memory.
These skills are stored as structured markdown files and serve as the agent’s evolving knowledge base. Each reusable skill artifact is made up of three core elements. It contains declarative specifications that describe what the skill is and how it should be used. It includes specialized instructions and prompts that guide the reasoning of the language model. And it houses the executable code and auxiliary scripts that the agent runs to solve the task.
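One way to picture such an artifact is as a small record with three fields that serializes to markdown. The class, field names, and markdown layout below are assumptions made for illustration; the article does not specify the actual file schema.

```python
from dataclasses import dataclass
import textwrap

@dataclass
class SkillArtifact:
    """Hypothetical three-part skill record; field names are illustrative."""
    name: str
    specification: str  # what the skill is and how it should be used
    instructions: str   # prompt text that guides the model's reasoning
    code: str           # executable code the agent runs

    def to_markdown(self) -> str:
        # Assumed layout: one heading per element, code as an indented block.
        return "\n\n".join([
            f"# {self.name}",
            "## Specification\n" + self.specification,
            "## Instructions\n" + self.instructions,
            "## Code\n" + textwrap.indent(self.code, "    "),
        ])

skill = SkillArtifact(
    name="web_search",
    specification="Look up facts on the web when the task needs fresh data.",
    instructions="Form a short query, then cite the page you used.",
    code="def run(query):\n    return search(query)",  # stored as text, not executed here
)
markdown = skill.to_markdown()
print(markdown)
```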
Memento-Skills achieves continual learning through its "Read-Write Reflective Learning" mechanism, which frames memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a specialized skill router to retrieve the most behaviorally relevant skill (not just the most semantically similar) and executes it.
After the agent executes the skill and receives feedback, the system reflects on the result to close the learning loop. Instead of simply appending a record of what happened, the system actively mutates its memory. If the execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts. That means it directly updates the code or prompt to patch the specific failure mode. If necessary, it creates an entirely new skill.
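The read, execute, reflect, and write steps can be sketched as a short loop. This is a minimal toy under stated assumptions, not the authors' implementation: skills are plain Python callables, the environment feedback is a hard-coded check, and `reflect_and_rewrite` returns a fixed patch where the real system would have an LLM orchestrator rewrite the artifact.

```python
from typing import Callable, Dict, Tuple

Skill = Callable[[str], str]

# Seed library with a deliberately buggy skill: it should uppercase its input.
library: Dict[str, Skill] = {"shout": lambda text: text}

def execute(skill: Skill, task: str) -> Tuple[bool, str]:
    """Run the skill and return (success, output); the check stands in for environment feedback."""
    output = skill(task)
    return output == task.upper(), output

def reflect_and_rewrite(name: str, trace: str) -> Skill:
    """Placeholder for the orchestrator: a real system would inspect the trace with an LLM."""
    return lambda text: text.upper()

def run_task(name: str, task: str) -> bool:
    ok, output = execute(library[name], task)            # read and execute
    if not ok:
        library[name] = reflect_and_rewrite(name, output)  # reflect and write
        ok, output = execute(library[name], task)        # retry with the patched skill
    return ok

solved = run_task("shout", "hello")
print(solved, library["shout"]("again"))
```

The point of the sketch is that the failure changes the stored skill itself, so the next task that routes to it starts from the patched version.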
Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from execution feedback rather than just text overlap. "The true value of a skill lies in how it contributes to the overall agentic workflow and subsequent execution," Wang said. "Therefore, reinforcement learning provides a more suitable framework, allowing the agent to evaluate and select skills based on their long-term usefulness."
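To make the idea of routing on feedback concrete, here is a deliberately simplified sketch. It is a tabular, bandit-style value estimate fitted from a logged batch, which is an assumption standing in for the paper's router; the log entries, task types, and learning rate are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

LogEntry = Tuple[str, str, float]  # (task_type, skill_name, reward)

def fit_router(log: Iterable[LogEntry], lr: float = 0.5) -> Dict[Tuple[str, str], float]:
    """One pass over logged feedback: move each value estimate toward the observed reward."""
    q: Dict[Tuple[str, str], float] = defaultdict(float)
    for task_type, skill, reward in log:
        q[(task_type, skill)] += lr * (reward - q[(task_type, skill)])
    return q

def route(q: Dict[Tuple[str, str], float], task_type: str, skills: List[str]) -> str:
    """Pick the skill with the highest estimated usefulness for this task type."""
    return max(skills, key=lambda s: q[(task_type, s)])

# Hypothetical execution log: the refund skill succeeded, the password skill did not.
log = [
    ("refund", "reset_password", 0.0),
    ("refund", "process_refund", 1.0),
    ("refund", "process_refund", 1.0),
]
q = fit_router(log)
choice = route(q, "refund", ["reset_password", "process_refund"])
print(choice)
```

Unlike the similarity-only router, this one ignores shared vocabulary and ranks skills by how they performed when executed.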
To avoid regression in a production environment, automated skill mutations are protected by an automatic unit testing gate. The system generates a synthetic test case, runs it through the updated skill, and verifies the results before saving the changes to the global library.
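A gate of this kind can be sketched in a few lines. The function below is an illustrative assumption, not the project's API: the candidate skill is a plain callable, the synthetic test case is passed in as an input and expected output, and the library is a dictionary.

```python
from typing import Callable, Dict

Skill = Callable[[str], str]

def gated_update(library: Dict[str, Skill], name: str, candidate: Skill,
                 test_input: str, expected: str) -> bool:
    """Save the candidate skill only if it passes the synthetic test case."""
    try:
        passed = candidate(test_input) == expected
    except Exception:
        passed = False  # a crashing mutation never reaches the library
    if passed:
        library[name] = candidate
    return passed

library: Dict[str, Skill] = {"slugify": lambda s: s.lower()}

def broken(s: str) -> str:
    raise ValueError("bad rewrite")

rejected = gated_update(library, "slugify", broken, "Hello World", "hello-world")
accepted = gated_update(library, "slugify",
                        lambda s: s.lower().replace(" ", "-"),
                        "Hello World", "hello-world")
print(rejected, accepted, library["slugify"]("Unit Test"))
```

The rejected mutation leaves the previous version in place, which is what keeps a bad rewrite from regressing production behavior.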
By continually rewriting and refining its own executable tools, Memento-Skills allows a frozen language model to develop robust muscle memory and progressively expand its end-to-end capabilities.
Testing the self-evolving agent
The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is General AI Assistants (GAIA), which requires complex multi-step reasoning, multimodal handling, web navigation, and tool use. The second is Humanity's Last Exam, or HLE, an expert-level benchmark covering eight diverse academic subjects, including mathematics and biology. The entire system was powered by Gemini-3.1-Flash acting as the underlying frozen language model.
The system was compared to a Read-Write baseline that retrieves skills and collects feedback but has no self-evolving features. The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and Qwen3 embeddings.
The results showed that actively evolving memory vastly outperforms a static library of skills. On the high-diversity GAIA benchmark, Memento-Skills improved test set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% compared to 52.3%. In the HLE benchmark, where the domain structure allowed for massive reuse of skills across tasks, the system more than doubled baseline performance, jumping from 17.9% to 38.7%.
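The reported figures mix two kinds of comparison: an absolute gain in percentage points and a relative multiple of the baseline. The short check below recomputes both from the four accuracy numbers quoted above; the helper names are illustrative.

```python
# Accuracy figures as reported in the article (percent).
gaia_base, gaia_new = 52.3, 66.0
hle_base, hle_new = 17.9, 38.7

def point_gain(base: float, new: float) -> float:
    """Absolute improvement in percentage points."""
    return round(new - base, 1)

def ratio(base: float, new: float) -> float:
    """Relative improvement as a multiple of the baseline."""
    return round(new / base, 2)

gaia_points, gaia_ratio = point_gain(gaia_base, gaia_new), ratio(gaia_base, gaia_new)
hle_points, hle_ratio = point_gain(hle_base, hle_new), ratio(hle_base, hle_new)

print(f"GAIA: +{gaia_points} points, {gaia_ratio}x baseline")
print(f"HLE:  +{hle_points} points, {hle_ratio}x baseline")
```

The HLE ratio is above 2, which matches the "more than doubled" description, while the GAIA gain is larger in absolute terms than its relative multiple suggests.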
Additionally, Memento-Skills’ specialized skill router avoids the classic retrieval trap where an irrelevant skill is selected simply because of semantic similarity. Experiments show that Memento-Skills increases end-to-end task success rates to 80%, compared to just 50% for standard BM25 retrieval.
The researchers observed that Memento-Skills manages this performance through structured and highly organic skill growth. Both benchmark experiments started with just five atomic seed skills, such as basic web search and terminal operations. In the GAIA benchmark, the agent autonomously expanded this initial set into a compact library of 41 skills to handle the various tasks. In the expert-level HLE benchmark, the system dynamically expanded its library to 235 different skills.
Finding the enterprise sweet spot
The researchers have released the code for Memento-Skills on GitHub, and it is available for use.
For enterprise architects, the effectiveness of this system depends on domain alignment. Rather than simply looking at benchmark scores, the core trade-off for the business lies in whether its agents handle isolated tasks or structured workflows.
"Skill transfer depends on the degree of similarity between tasks," Wang said. "First, when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction." In such dispersed environments, transfer between tasks is limited. "Second, when tasks share a substantial structure, previously acquired skills can be directly reused. Here, learning becomes more efficient because knowledge is transferred between tasks, allowing the agent to perform well on new problems with little or no additional interaction."
Since the system requires recurring task patterns to consolidate knowledge, business leaders need to know exactly where to deploy it today and where to wait.
"Workflows are probably the most appropriate environment for this approach, as they provide a structured environment in which skills can be built, evaluated, and improved," Wang said.
However, he warned against over-deployment in areas that are not yet suitable for the framework. "Physical agents remain largely unexplored in this context and require further investigation. Additionally, tasks with longer horizons may require more advanced approaches, such as multi-agent LLM systems, to enable sustained coordination, planning, and execution across extended sequences of decisions."
As the industry moves toward agents autonomously rewriting their own production code, governance and security remain paramount. While Memento-Skills employs fundamental guardrails such as automated unit testing gates, a broader framework will likely be needed for enterprise adoption.
"To enable reliable self-improvement, we need a well-designed assessment or evaluation system that can evaluate performance and provide consistent guidance." Wang said. "Rather than allowing unrestricted self-modification, the process should be structured as a guided form of self-development, where feedback guides the agent toward better designs."





