
AI developers are watching agent engineering evolve in real time, with leading teams openly sharing what works. One principle keeps showing up from the front lines: build within the LLM’s constraints.
In practice, two constraints dominate: the context window is limited, and the model itself is stateless between calls.
So “just add more context” isn’t a reliable strategy: attention cost grows quadratically with context length, and reasoning quality degrades as the window fills up. The winning pattern is external memory + disciplined retrieval: store state outside the prompt (artifacts, decisions, tool outputs), then pull back only what matters for the current loop.
There’s also a useful upside: because models are trained on internet-era developer workflows, they’re unusually competent with developer-native interfaces — repos, folders, markdown, logs, and CLI-style interactions. That’s why filesystems keep showing up in modern agent stacks.
This is where the debate heats up: “files are all you need” for agent memory. Most arguments collapse because they treat interface, storage, and deployment as the same decision. They aren’t.
Filesystems are winning as an interface because models already know how to list directories, grep for patterns, read ranges, and write artifacts. Databases are winning as a substrate because once memory must be shared, audited, queried, and made reliable under concurrency, you either adopt database guarantees — or painfully reinvent them.

In this piece, we give a systematic comparison of filesystems and databases for agent memory: where each approach shines, where it breaks down, and a decision framework for choosing the right foundation as you move from prototype to production.
Our aim is to educate AI developers on various approaches to agent memory, backed by performance guidance and working code.
All code presented in this piece can be found here.
Let’s take the common use case of building a Research Assistant with Agentic capabilities.
You build a Research Assistant agent that performs brilliantly in a demo: in a single run it can search arXiv, summarize papers, and draft a clean answer. Then you come back the next morning, start from a clean run, and prompt the agent: “Continue from where we left off, and also compare Paper A to Paper B.” The agent responds as if it has never met you, because LLMs are inherently stateless: unless you send prior context back in, the model has no durable awareness of what happened in previous turns or previous sessions.

Once you move beyond single-turn Q&A into long-horizon tasks, deep research, multi-step workflows, and multi-agent coordination, you need a way to preserve continuity when the context window truncates, sessions restart, or multiple workers act on shared state. This takes us into the realm of leveraging systems of record for agents and introduces the concept of Agent Memory.

Agent memory is the set of system components and techniques that enable an AI agent to store, recall, and update information over time so it can adapt to new inputs and maintain continuity across long-horizon tasks.
Core components typically include the language and embedding model, information retrieval mechanisms, and a persistent storage layer such as a database.
In practical systems, agent memory is usually classified into two distinct forms: short-term memory, which lives inside the context window for the current loop, and long-term memory, which persists outside the model across turns and sessions.
Concepts and techniques associated with agent memory all come together within the agent loop and the agent harness, as demonstrated in this notebook and explained later in this article.
The agent loop is the iterative execution cycle in which an LLM receives instructions from the environment and decides whether to generate a response or make a tool call based on its internal reasoning about the input provided in the current loop. This process repeats until the LLM produces a final output or an exit criterion is met. At a high level, the following operations are present within the agent loop:
Anthropic’s guidance on long-running agents directly points to this: they describe harness practices that help agents quickly re-understand the state of work when starting with a fresh context window, including maintaining explicit progress artifacts.
The agent harness is the surrounding runtime and rules that make the loop reliable: how you wire tools, where you write artifacts, how you log/trace behavior, how you manage memory, and how you prevent the agent from drowning in context.
To complete the picture, the discipline of context engineering runs through both the agent loop and the agent harness itself. Context engineering is the systematic design and curation of the content placed in an LLM’s context window so that the model receives high-signal tokens and produces the intended, reliable output within a fixed budget.
In this piece, we implement context engineering as a set of repeatable techniques inside the agent harness:
The concepts and explanations above set us up for the rest of the comparison we introduce in this piece. Now that we have the “why” and the moving parts (stateless models, the agent loop, the agent harness, and memory), we can evaluate the two dominant substrates teams are using today to make memory real: the filesystem and the database.
A filesystem-based memory architecture is not “the agent remembers everything forever”. It means the agent can persist state and artifacts outside the context window and then pull them back selectively when needed. This addresses two of the earlier-mentioned LLM constraints: a limited context window and statelessness.
In our Research Assistant, the filesystem becomes the memory substrate. Rather than injecting a large number of tools and extensive documentation into the LLM’s context window (which would inflate the token count and trigger early summarization), we store them on disk and let the agent search and selectively read what it needs. This matches what the Applied AI team at Cursor calls “Dynamic Context Discovery”: write large outputs to files, then let the agent `tail` and read ranges as required.
Our FSAgent and demo use standard filesystem operations (such as tail and cat to read file contents), but this is a deliberately simplified approach: the number of operations is kept small for demonstration purposes, and the filesystem capabilities could be extended and optimized with additional commands and implementations.
On the other hand, it’s a great starting point for getting familiar with tool access and how filesystem memory is achieved.

Concretely, filesystem agent memory typically emerges as three buckets:
Before we jump into the code, here’s the minimal tool surface we provide to the agent in the table below. Notice the pattern: instead of inventing specialized “memory APIs,” we expose a small set of filesystem primitives and let the agent compose them (very Unix).

This design directly reflects what the AI ecosystem is converging on: a filesystem and a handful of core tools, rather than an explosion of bespoke tools.
The first memory principle we implement is simple: don’t load large files unless you must. Filesystems are excellent at sequential read/write and work naturally with tools like grep and log-style access. This makes them a strong fit for append-only transcript and artifact storage.
That’s why we implement three reading tools:
The tools below are implemented in Python and exposed to the LangChain agent as callable tools via the @tool decorator.
First is the read_file tool, the “load it all” option. It is useful when the file is small or you truly need the full artifact, but it’s intentionally not the default because it can blow up the context window.
from pathlib import Path

from langchain.tools import tool


@tool
def read_file(path: str) -> str:
    """Read an entire file and return its contents. Use sparingly: large files inflate the context window."""
    p = Path(path)
    if not p.exists():
        return f"File not found: {path}"
    return p.read_text(encoding="utf-8")
The tail_file tool is the first step for large files: it grabs the end of a log or transcript so the agent can quickly see the latest, most relevant portion before deciding whether to read more.
@tool
def tail_file(path: str, n_lines: int = 80) -> str:
    """Return the last n_lines of a file. A cheap first look at logs and transcripts."""
    p = Path(path)
    if not p.exists():
        return f"File not found: {path}"
    lines = p.read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[-max(1, n_lines):])
The read_file_range tool is the surgical option: once you’ve located the right region (often via grep or after a tail), it pulls in just the line span you need, so the agent stays token-efficient and grounded.
@tool
def read_file_range(path: str, start_line: int, end_line: int) -> str:
    """Return lines[start_line:end_line] of a file (0-based, end-exclusive)."""
    p = Path(path)
    if not p.exists():
        return f"File not found: {path}"
    lines = p.read_text(encoding="utf-8").splitlines()
    start = max(0, start_line)
    end = min(len(lines), end_line)
    if start >= end:
        return f"Empty range: {start_line}:{end_line} (file has {len(lines)} lines)"
    return "\n".join(lines[start:end])
Again, this is essentially dynamic context discovery in a microcosm: load a small view first, then expand only when needed.
A filesystem-based agent should quickly find relevant material and pull only the exact slices it needs. This is why grep is such a recurring theme in the agent tooling conversation: it gives the model a fast way to locate relevant regions before spending tokens to pull content.
Here’s a simple grep-like tool that returns line-numbered hits so the agent can immediately jump to read_file_range:
import re


@tool
def grep_files(
    pattern: str,
    root_dir: str = "semantic",
    file_glob: str = "**/*.md",
    max_matches: int = 200,
    ignore_case: bool = True,
) -> str:
    """Search files under root_dir for a regex pattern and return line-numbered matches."""
    root = Path(root_dir)
    if not root.exists():
        return f"Directory not found: {root_dir}"
    flags = re.IGNORECASE if ignore_case else 0
    try:
        rx = re.compile(pattern, flags)
    except re.error as e:
        return f"Invalid regex pattern: {e}"
    matches = []
    for fp in root.glob(file_glob):
        if not fp.is_file():
            continue
        try:
            # Stream the file line by line instead of loading it all into memory.
            with open(fp, "r", encoding="utf-8", errors="ignore") as f:
                for i, line in enumerate(f, start=1):
                    if rx.search(line):
                        matches.append(f"{fp.as_posix()}:{i}: {line.strip()}")
                        if len(matches) >= max_matches:
                            return "\n".join(matches) + "\n\n[TRUNCATED: max_matches reached]"
        except Exception:
            continue
    if not matches:
        return "No matches found."
    return "\n".join(matches)
One subtle but important detail in our grep_files implementation is how we read files. Rather than loading entire files into memory with read_text().splitlines(), we iterate lazily over the open file handle, which streams one line at a time and keeps memory usage constant regardless of file size.
This aligns with the “find first, read second” philosophy: locate what you need without loading everything upfront. For readers interested in maximum performance, the full notebook also includes a grep_files_os_based variant that shells out to ripgrep or grep, leveraging OS-level optimizations like memory-mapped I/O and SIMD instructions. In practice, this pattern (“search first, then read a range”) is one reason filesystem agents can feel surprisingly strong on focused corpora: the agent iteratively narrows the context instead of relying on a single-shot retrieval query.
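For reference, here is a minimal sketch of what such an OS-backed variant could look like. The exact implementation in the notebook may differ; the choice of flags and the ripgrep-then-grep fallback below are assumptions.
import shutil
import subprocess

from langchain.tools import tool


@tool
def grep_files_os_based(pattern: str, root_dir: str = "semantic") -> str:
    """Search files under root_dir using ripgrep if available, falling back to grep."""
    if shutil.which("rg"):
        cmd = ["rg", "--line-number", "--no-heading", "--ignore-case", pattern, root_dir]
    elif shutil.which("grep"):
        cmd = ["grep", "-rniI", pattern, root_dir]
    else:
        return "Neither ripgrep nor grep is available on this system."
    result = subprocess.run(cmd, capture_output=True, text=True)
    # rg and grep exit with code 1 when there are simply no matches; treat that as a normal outcome.
    if result.returncode not in (0, 1):
        return f"Search failed: {result.stderr.strip()}"
    return result.stdout.strip() or "No matches found."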
One of the fastest ways to blow up your context window is to return large JSON payloads from tools. Cursor’s approach is to write these results to files and let the agent inspect them on demand (often starting with tail).
That’s exactly why our folder structure includes a tool_outputs/<session_id>/ directory: it acts like an “evidence locker” for everything the agent did, without forcing those payloads into the current context.
{
  "ts_utc": "2026-01-27T12:41:12.135396+00:00",
  "tool": "arxiv_search_candidates",
  "input": "{'query': 'memgpt'}",
  "output": "content='[\\n {\\n \"arxiv_id\": \"2310.08560v2\",\\n \"entry_id\": \"http://arxiv.org/abs/2310.08560v2\",\\n \"title\": \"MemGPT: Towards LLMs as Operating Systems\",\\n \"authors\": \"Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez\",\\n \"published\": \"2024-02-12\",\\n \"abstract\": ...msPnaMxOl8Pa'"
}

Before we create the agent, we bundle the tools into a small, composable toolbox. This matches the broader trend: agents often perform better with a smaller tool surface, meaning less choice paralysis (aka context confusion), fewer overlapping tool schemas, and more reliance on proven filesystem workflows.
FS_TOOLS = [
    arxiv_search_candidates,         # search arXiv for relevant research papers
    fetch_and_save_paper,            # fetch paper text (PDF->text) and save to semantic/knowledge_base/<id>.md
    read_file,                       # read a file in full (use sparingly)
    tail_file,                       # read end of file first
    read_file_range,                 # read a specific line range
    conversation_to_file,            # append conversation entries to episodic memory
    summarise_conversation_to_file,  # save transcript + compact summary
    monitor_context_window,          # estimate token usage
    list_papers,                     # list saved papers
    grep_files,                      # grep-like search over files
]
Filesystem tools alone aren’t enough; you also need a reading policy that keeps the agent’s token usage efficient and grounded. This is the same reason CLAUDE.md, AGENTS.md, SKILLS.md, and “rules files” matter: they’re procedural memory that is applied consistently across sessions.
Key policies we encode below:
Below is the implementation of the agent using the LangChain framework.
fs_agent = create_agent(
    model=f"openai:{os.getenv('OPENAI_MODEL', 'gpt-4o-mini')}",
    tools=FS_TOOLS,
    system_prompt=(
        "You are a conversational research ingestion agent.\n\n"
        "Core behavior:\n"
        "- When asked to find a paper: use arxiv_search_candidates, pick the best arxiv_id, "
        "then call fetch_and_save_paper to store the full text in semantic/knowledge_base/.\n"
        "- Papers/knowledge base live in semantic/knowledge_base/.\n"
        "- Conversations (transcripts) live in episodic/conversations/ (one file per run).\n"
        "- Summaries live in episodic/summaries/.\n"
        "- Conversation may be summarised externally; respect summary + transcript references.\n"
    ),
)
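To make the harness concrete, here is a hypothetical invocation of the agent we just built. It assumes create_agent returns a LangGraph-style runnable that accepts a list of chat messages; the exact call shape may vary with your LangChain version.
result = fs_agent.invoke(
    {"messages": [{"role": "user", "content": "Find the MemGPT paper and save it to the knowledge base."}]}
)
# The final assistant message is the last entry in the returned message list.
print(result["messages"][-1].content)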
After running the agent, you end up with a directory layout that makes the agent’s “memory” tangible and inspectable. In our example, the agent produces:
That is exactly the point of filesystem-first memory: the model doesn’t “remember” by magically retaining state; it “remembers” because it can re-open, search, and selectively read its prior artifacts.
This is also why so many teams keep rediscovering the same pattern: files are a simple abstraction, and agents are surprisingly good at using them.
In the previous section, we showed what a filesystem‑first memory harness looks like in practice: the agent writes durable artifacts (papers, tool outputs, transcripts) to disk, then “remembers” by searching and selectively reading only the parts it needs.
This approach works because it directly addresses two core constraints of LLMs: limited context windows and inherent statelessness. Once those constraints are handled, it becomes clear why file systems so often become the default interface for early agent systems.
In practice, filesystem memory excels when the workload is artifact‑heavy (research notes, paper dumps, transcripts), when you want a clear audit trail, and when iteration speed matters more than sophisticated retrieval. It also encourages good agent hygiene: write outputs down, cite sources, and load only what you need.
Unfortunately, it doesn’t end there. The same strengths that make files attractive (simplicity, relatively low cost, and fast implementation) can quickly become bottlenecks once you promote these systems into production, where they are expected to behave like a shared, reliable memory platform.
As soon as an agent moves beyond single-user prototypes into real-world scenarios, where concurrent reads and writes are the norm and robustness under load is non-negotiable, filesystems start to show their limits.
The core pattern is that filesystem memory stays attractive until you need correctness under concurrency, semantic retrieval, or structured guarantees. At that point, you either accept the limitations (and keep the agent single-user/single-process) or you adopt a database.
By this point, most AI developers can see why filesystem-first agent implementations are having a moment. The interface is familiar, it is easy to prototype with, and our agents can “remember” by writing artifacts to disk and reloading them later via search plus selective reads. For a single developer on a laptop, that is often enough. But once we move beyond “it works on my laptop” and start supporting developers who ship to thousands or millions of users, memory stops being a folder of helpful files and becomes a shared system that has to behave predictably under load.
Databases were created for the exact moment when “a pile of files” stops being good enough because too many people and processes are touching the same data. One of the most-cited origin stories of the database dates to the Apollo era. IBM, alongside partners, built what became IMS to manage complex operational data for the program, and early versions were installed in 1968 at the Rockwell Space Division, supporting NASA. The point was not simply storage. It was coordination, correctness, and the ability to trust shared data while many activities were happening simultaneously.
That same production reality is what pushes agent memory toward databases today.
When agent memory must handle concurrent reads and writes, preserve an auditable history of what happened, support fast retrieval across many sessions, and enforce consistent updates, we want database guarantees rather than best-effort file conventions.
Oracle has been solving these exact problems since 1979, when we shipped the first commercial SQL database. The goal then was the same as now: make shared state reliable, portable, and trustworthy under load.
On that note, allow us to show how this can work in practice.
In the filesystem-first section, our Research Assistant “remembered” by writing artifacts to disk and reloading them later using cheap search plus selective reads. That is a great starting point. But when we want memory that is shared, queryable, and reliable under concurrent use, we need a different foundation.
In this iteration of our agent, we keep the same user experience and the same high-level job. Search arXiv, ingest papers, answer follow-up questions, and maintain continuity across sessions. The difference is that memory now lives in the Oracle AI Database, where we can make it durable, indexed, filterable, and safe for concurrent reads and writes. We also achieve a clean separation between two memory surfaces: structured history in SQL tables and semantic recall via vector search.
The result is what we call a MemAgent, an agent whose memory is not a folder of artifacts, but a queryable system. It is designed to support multi-threaded sessions, store full conversational history, store tool logs for debugging and auditing, and store a semantic knowledge base that can be searched by meaning rather than keywords.
Before we wire up the agent loop, we need to define the tool surface that MemAgent can use to reason, retrieve, and persist knowledge. The design goal here is similar to the filesystem-first approach: keep the toolset small and composable, but shift the memory substrate from files to the database. Instead of grepping folders and reading line ranges, MemAgent uses vector similarity search to retrieve semantically relevant context, and it persists what it learns in a way that is queryable and reliable across sessions.
In practice, that means two things.
The table below summarizes the minimal set of tools we expose to MemAgent and where each tool stores its outputs.

FSAgent and MemAgent can look similar from the outside because both can ingest papers, answer questions, and maintain continuity. The difference is what powers that continuity and how retrieval works when the system grows.
FSAgent relies on the operating system as its memory surface, which is great for iteration speed and human inspectability, but it typically relies on keyword-style discovery and file traversal. MemAgent treats memory as a database concern, which adds setup overhead, but unlocks indexed retrieval, stronger guarantees under concurrency, and richer ways to query and filter what the agent has learned.

Before we start defining tables and vector stores, it is worth being explicit about the stack we are using and why. In this implementation, we are not building a bespoke agent framework from scratch.
We use LangChain as the LLM framework to abstract the agent loop, tool calling, and message handling, then pair it with a model provider for reasoning and generation, and with Oracle AI Database as the unified memory core that stores both structured history and semantic embeddings.
This separation is important because it mirrors how production agent systems are typically built. The agent logic evolves quickly, the model can be swapped, and the memory layer must remain reliable and queryable.
Think of this as the agent stack. Each layer has a clear job, and together they create an agent that is both practical to build and robust enough to scale.
With that stack in place, the first step is simply to connect to the Oracle Database and initialize an embedding model. The database connection serves as the foundation for all memory operations, and the embedding model enables us to store and retrieve knowledge semantically through the vector store layer.
import oracledb
from langchain_huggingface import HuggingFaceEmbeddings  # assumes the langchain-huggingface package


def connect_oracle(user, password, dsn="127.0.0.1:1521/FREEPDB1", program="langchain_oracledb_demo"):
    return oracledb.connect(user=user, password=password, dsn=dsn, program=program)


database_connection = connect_oracle(
    user="VECTOR",
    password="VectorPwd_2025",
    dsn="127.0.0.1:1521/FREEPDB1",
    program="devrel.content.filesystem_vs_dbs",
)
print("Using user:", database_connection.username)

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-mpnet-base-v2"
)
Next, we define the database schema to store our agent’s memory and prepare a clean slate for the demo. We separate memory into distinct tables so each type can be managed, indexed, and queried appropriately.
Installing the Oracle Database integration in the LangChain ecosystem is straightforward. You can add it to your environment with a single pip command:
pip install -U langchain-oracledb
Conversational history and logs are naturally tabular, while semantic and summary memory are stored in vector-backed tables through OracleVS. For reproducibility, we drop any existing tables from previous runs, making the notebook deterministic and avoiding confusing results when you re-run the walkthrough.
from langchain_oracledb.vectorstores import OracleVS
from langchain_oracledb.vectorstores.oraclevs import create_index
from langchain_community.vectorstores.utils import DistanceStrategy

CONVERSATIONAL_TABLE = "CONVERSATIONAL_MEMORY"
KNOWLEDGE_BASE_TABLE = "SEMANTIC_MEMORY"
LOGS_TABLE = "LOGS_MEMORY"
SUMMARY_TABLE = "SUMMARY_MEMORY"

ALL_TABLES = [
    CONVERSATIONAL_TABLE,
    KNOWLEDGE_BASE_TABLE,
    LOGS_TABLE,
    SUMMARY_TABLE,
]

# Drop any tables left over from previous runs so the walkthrough is deterministic.
for table in ALL_TABLES:
    try:
        with database_connection.cursor() as cur:
            cur.execute(f"DROP TABLE {table} PURGE")
    except Exception as e:
        if "ORA-00942" in str(e):  # table or view does not exist
            print(f" - {table} (not exists)")
        else:
            print(f" [FAIL] {table}: {e}")
database_connection.commit()
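The exact DDL for the tabular tables isn’t reproduced in this excerpt. As a rough sketch, the conversational table could look like the following, with columns matching what the memory manager reads and writes later in this piece; the tool-log columns in particular are an assumption.
with database_connection.cursor() as cur:
    # Conversational history: one row per message, keyed by thread.
    cur.execute(f"""
        CREATE TABLE {CONVERSATIONAL_TABLE} (
            id         VARCHAR2(64) DEFAULT LOWER(RAWTOHEX(SYS_GUID())) PRIMARY KEY,
            thread_id  VARCHAR2(128) NOT NULL,
            role       VARCHAR2(32)  NOT NULL,
            content    CLOB,
            metadata   CLOB,
            summary_id VARCHAR2(64),
            timestamp  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    # Tool logs: an audit trail of every tool call the agent makes.
    cur.execute(f"""
        CREATE TABLE {LOGS_TABLE} (
            id          VARCHAR2(64) DEFAULT LOWER(RAWTOHEX(SYS_GUID())) PRIMARY KEY,
            tool_name   VARCHAR2(256),
            tool_input  CLOB,
            tool_output CLOB,
            timestamp   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
database_connection.commit()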
For this section, it is worth explaining what a “vector store” actually is in the context of agents. A vector store is a storage system that persists embeddings alongside metadata and supports similarity search, so the agent can retrieve items by meaning rather than keywords.
Instead of asking “which file contains this exact phrase”, the agent asks “which chunks are semantically closest to my question” and pulls back the best matches.
Under the hood, that usually means an approximate nearest neighbor index, because scanning every vector becomes prohibitively expensive as your knowledge base grows. HNSW is one of the most common indexing approaches for this style of retrieval.
The code below does two things. First, it creates two vector stores using OracleVS from the langchain_oracledb module, one for the knowledge base and one for summaries, both using cosine distance.
Second, it builds HNSW indexes so similarity search stays fast as memory grows, which is exactly what you want once your Research Assistant starts ingesting many papers and running over long-lived threads.
knowledge_base_vs = OracleVS(
    client=database_connection,
    embedding_function=embedding_model,
    table_name=KNOWLEDGE_BASE_TABLE,
    distance_strategy=DistanceStrategy.COSINE,
)

summary_vs = OracleVS(
    client=database_connection,
    embedding_function=embedding_model,
    table_name=SUMMARY_TABLE,
    distance_strategy=DistanceStrategy.COSINE,
)


def safe_create_index(conn, vs, idx_name):
    try:
        create_index(
            client=conn,
            vector_store=vs,
            params={"idx_name": idx_name, "idx_type": "HNSW"},
        )
        print(f" Created index: {idx_name}")
    except Exception as e:
        if "ORA-00955" in str(e):  # name is already used by an existing object
            print(f" [SKIP] Index already exists: {idx_name}")
        else:
            raise


print("Creating vector indexes...")
safe_create_index(database_connection, knowledge_base_vs, "kb_hnsw_cosine_idx")
safe_create_index(database_connection, summary_vs, "summary_hnsw_cosine_idx")
print("All indexes created!")
In the code below, we create a custom Memory manager. The Memory manager is the abstraction layer that turns raw database operations into “agent memory behaviours”. This is the part that makes reasoning about the database-first agent easy.
import json

from langchain.tools import tool
from typing import List, Dict


class MemoryManager:
    """
    A simplified memory manager for AI agents using Oracle AI Database.
    """

    def __init__(self, conn, conversation_table: str, knowledge_base_vs, summary_vs, tool_log_table):
        self.conn = conn
        self.conversation_table = conversation_table
        self.knowledge_base_vs = knowledge_base_vs
        self.summary_vs = summary_vs
        self.tool_log_table = tool_log_table

    def write_conversational_memory(self, content: str, role: str, thread_id: str) -> str:
        thread_id = str(thread_id)
        with self.conn.cursor() as cur:
            id_var = cur.var(str)
            cur.execute(f"""
                INSERT INTO {self.conversation_table} (thread_id, role, content, metadata, timestamp)
                VALUES (:thread_id, :role, :content, :metadata, CURRENT_TIMESTAMP)
                RETURNING id INTO :id
            """, {"thread_id": thread_id, "role": role, "content": content, "metadata": "{}", "id": id_var})
            record_id = id_var.getvalue()[0] if id_var.getvalue() else None
        self.conn.commit()
        return record_id

    def load_conversational_history(self, thread_id: str, limit: int = 50) -> List[Dict[str, str]]:
        thread_id = str(thread_id)
        with self.conn.cursor() as cur:
            cur.execute(f"""
                SELECT role, content FROM {self.conversation_table}
                WHERE thread_id = :thread_id AND summary_id IS NULL
                ORDER BY timestamp ASC
                FETCH FIRST :limit ROWS ONLY
            """, {"thread_id": thread_id, "limit": limit})
            results = cur.fetchall()
        return [
            {"role": str(role), "content": content.read() if hasattr(content, "read") else str(content)}
            for role, content in results
        ]

    def mark_as_summarized(self, thread_id: str, summary_id: str):
        thread_id = str(thread_id)
        with self.conn.cursor() as cur:
            cur.execute(f"""
                UPDATE {self.conversation_table}
                SET summary_id = :summary_id
                WHERE thread_id = :thread_id AND summary_id IS NULL
            """, {"summary_id": summary_id, "thread_id": thread_id})
        self.conn.commit()
        print(f" Marked messages as summarized (summary_id: {summary_id})")

    def write_knowledge_base(self, text: str, metadata_json: str = "{}"):
        metadata = json.loads(metadata_json)
        self.knowledge_base_vs.add_texts([text], [metadata])

    def read_knowledge_base(self, query: str, k: int = 5) -> str:
        results = self.knowledge_base_vs.similarity_search(query, k=k)
        content = "\n".join([doc.page_content for doc in results])
        return f"""## Knowledge Base Memory: These are general pieces of information relevant to the question
### How to use: Use the knowledge base as background information that can help answer the question
{content}"""

    def write_summary(self, summary_id: str, full_content: str, summary: str, description: str):
        self.summary_vs.add_texts(
            [f"{summary_id}: {description}"],
            [{"id": summary_id, "full_content": full_content, "summary": summary, "description": description}],
        )
        return summary_id

    def read_summary_memory(self, summary_id: str) -> str:
        results = self.summary_vs.similarity_search(
            summary_id,
            k=5,
            filter={"id": summary_id},
        )
        if not results:
            return f"Summary {summary_id} not found."
        doc = results[0]
        return doc.metadata.get("summary", "No summary content.")

    def read_summary_context(self, query: str = "", k: int = 5) -> str:
        results = self.summary_vs.similarity_search(query or "summary", k=k)
        if not results:
            return "## Summary Memory\nNo summaries available."
        lines = ["## Summary Memory", "Use expand_summary(id) to get full content:"]
        for doc in results:
            sid = doc.metadata.get("id", "?")
            desc = doc.metadata.get("description", "No description")
            lines.append(f" - [ID: {sid}] {desc}")
        return "\n".join(lines)
Then we instantiate it:
memory_manager = MemoryManager(
    conn=database_connection,
    conversation_table=CONVERSATIONAL_TABLE,
    knowledge_base_vs=knowledge_base_vs,
    tool_log_table=LOGS_TABLE,
    summary_vs=summary_vs,
)
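As a quick smoke test (with a hypothetical thread id), we can write a couple of turns and read them back; this exercises the same methods the agent’s tools will rely on.
thread_id = "demo-thread-001"
memory_manager.write_conversational_memory("Find the MemGPT paper.", role="user", thread_id=thread_id)
memory_manager.write_conversational_memory(
    "Saved MemGPT (2310.08560) to the knowledge base.", role="assistant", thread_id=thread_id
)

# Reload the running transcript for this thread (only messages not yet rolled into a summary).
for msg in memory_manager.load_conversational_history(thread_id):
    print(f"{msg['role']}: {msg['content'][:80]}")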
The database-first agent follows a simple, production-friendly pattern.
On top of that, the agent actively manages context: it tracks token usage and periodically rolls older dialogue and intermediate state into durable summaries (and/or “memory” tables), so the working prompt stays small while the full history remains available on demand.
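Here is a sketch of what that rollover could look like using the memory manager’s methods. The character-based token estimate, the threshold, and the use of a separate LLM call (llm is assumed to be a LangChain chat model) are assumptions, not the notebook’s exact logic.
import uuid


def maybe_summarize_thread(thread_id: str, llm, token_budget: int = 8000) -> None:
    """Roll un-summarized history into a durable summary once it grows past a budget."""
    history = memory_manager.load_conversational_history(thread_id, limit=200)
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    # Rough token estimate (~4 characters per token); a real implementation would use a tokenizer.
    if len(transcript) / 4 < token_budget:
        return
    summary_text = llm.invoke(
        "Summarize this conversation, keeping decisions and open questions:\n\n" + transcript
    ).content
    summary_id = f"summary-{uuid.uuid4().hex[:8]}"
    memory_manager.write_summary(
        summary_id=summary_id,
        full_content=transcript,
        summary=summary_text,
        description=f"Rolled-up history for thread {thread_id}",
    )
    # Older rows stay in the table for auditing but are excluded from future context loads.
    memory_manager.mark_as_summarized(thread_id, summary_id)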
Ingest papers into the knowledge base vector store
This is the database-first equivalent of “fetch and save paper”. Instead of writing markdown files, we do three steps: load the paper from arXiv, split the full text into overlapping chunks, and embed and store the chunks (with metadata) in the knowledge base vector store.
from datetime import datetime, timezone

from langchain_core.tools import tool
from langchain_community.document_loaders import ArxivLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


@tool
def fetch_and_save_paper_to_kb_db(
    arxiv_id: str,
    chunk_size: int = 1500,
    chunk_overlap: int = 200,
) -> str:
    """Fetch a paper from arXiv, chunk its full text, and store the chunks in the knowledge base vector store."""
    loader = ArxivLoader(
        query=arxiv_id,
        load_max_docs=1,
        doc_content_chars_max=None,
    )
    docs = loader.load()
    if not docs:
        return f"No documents found for arXiv id: {arxiv_id}"
    doc = docs[0]

    title = (
        doc.metadata.get("Title")
        or doc.metadata.get("title")
        or f"arXiv {arxiv_id}"
    )
    entry_id = doc.metadata.get("Entry ID") or doc.metadata.get("entry_id") or ""
    published = doc.metadata.get("Published") or doc.metadata.get("published") or ""
    authors = doc.metadata.get("Authors") or doc.metadata.get("authors") or ""

    full_text = doc.page_content or ""
    if not full_text.strip():
        return f"Loaded arXiv {arxiv_id} but extracted empty text (PDF parsing issue)."

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    chunks = splitter.split_text(full_text)

    ts_utc = datetime.now(timezone.utc).isoformat()
    metadatas = []
    for i in range(len(chunks)):
        metadatas.append(
            {
                "source": "arxiv",
                "arxiv_id": arxiv_id,
                "title": title,
                "entry_id": entry_id,
                "published": str(published),
                "authors": str(authors),
                "chunk_id": i,
                "num_chunks": len(chunks),
                "ingested_ts_utc": ts_utc,
            }
        )

    knowledge_base_vs.add_texts(chunks, metadatas)
    return (
        f"Saved arXiv {arxiv_id} to {KNOWLEDGE_BASE_TABLE}: "
        f"{len(chunks)} chunks (title: {title})."
    )
We create two more tools below:
import os

from langchain.tools import tool


@tool
def search_knowledge_base(query: str, k: int = 5) -> str:
    """Semantic search over the knowledge base; returns the k most relevant chunks."""
    return memory_manager.read_knowledge_base(query, k)


@tool
def store_to_knowledge_base(text: str, metadata_json: str = "{}") -> str:
    """Embed and store a piece of text (with optional JSON metadata) in the knowledge base."""
    memory_manager.write_knowledge_base(text, metadata_json)
    return "Successfully stored text to knowledge base."
Now we build the LangChain agent using the database-first tools.
from langchain.agents import create_agent

MEM_AGENT = create_agent(
    model=f"openai:{os.getenv('OPENAI_MODEL', 'gpt-4o-mini')}",
    tools=[
        search_knowledge_base,
        store_to_knowledge_base,
        arxiv_search_candidates,
        fetch_and_save_paper_to_kb_db,
    ],
)
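A hypothetical end-to-end turn then looks like this: ask the agent a question, and persist both sides of the exchange so the next session can pick up the thread. The invocation shape assumes the same LangGraph-style runnable as before.
question = "Ingest the MemGPT paper and summarize its main contribution."
result = MEM_AGENT.invoke({"messages": [{"role": "user", "content": question}]})
answer = result["messages"][-1].content

memory_manager.write_conversational_memory(question, role="user", thread_id="demo-thread-001")
memory_manager.write_conversational_memory(answer, role="assistant", thread_id="demo-thread-001")
print(answer)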
At this point, the difference between a filesystem agent and a database-backed agent should feel less like a philosophical debate and more like an engineering trade-off. Both approaches can “remember” in the sense that they can persist state, retrieve context, and answer follow-up questions. The real test is what happens when you leave the tidy laptop demo and hit production realities: larger corpora, fuzzier queries, and concurrent workloads.
To make that concrete, we ran an end-to-end benchmark and measured the full agent loop per query — retrieval, context assembly, tool calls, model invocations, and the final answer — across three scenarios:

From the results shown in the image above, two conclusions immediately stand out. The first concerns latency and answer quality.
In our run, MemAgent generally finished faster end-to-end than FSAgent. That might sound counterintuitive if you assume “database equals overhead,” and sometimes it does.
But the agent loop is not dominated by raw storage primitives. It is dominated by how quickly you can find the right information and how little unnecessary context you force into the model; in other words, by context engineering. Semantic retrieval tends to return fewer, more relevant chunks (subject to tuning of the retrieval pipeline), which means less scanning, less paging through files, and fewer tokens burned on irrelevant text.
In this particular run, both agents produced similar-quality answers. That is not surprising. When the questions are retrieval-friendly and the corpus is small enough, both approaches can find the right passages. FSAgent gets there through keyword search and careful reading. MemAgent gets there through similarity search over embedded chunks. Different roads, similar destination.
It’s worth zooming in on one nuance here. When the corpus is small and the queries are keyword-friendly, the retrieval quality of the two agents tends to converge.
At that scale, “search” is barely a problem, so the dominant factor becomes the model’s ability to read and synthesize, not the retrieval substrate. The gap only starts to widen when the corpus grows, the wording becomes fuzzier, and the system must retrieve reliably under real-world constraints such as noise, paraphrases, and concurrency, which it eventually must.
We also scored answers using an LLM-as-a-judge prompt. It is a pragmatic way to get directional feedback when you do not have labeled ground truth, but it is not a silver bullet. Judges can be sensitive to prompt phrasing, can over-reward fluency, and can miss subtle grounding failures.
If you are building this for production, treat LLM judging as a starting signal, not the finish line. The more reliable approach is a mix of:
Even with a lightweight judge, the directional story remains consistent. As retrieval becomes more difficult and the system becomes busier, database-backed memory tends to perform better.
The large-corpus test is designed to stress the exact weakness of keyword-first memory. We intentionally made the search problem harder by growing the corpus and making the queries less “exact match.”
FSAgent with a concatenated corpus
When you merge many papers into large markdown files, FSAgent becomes dependent on grep-style discovery followed by paging the right sections into the context window. It can work, but it gets brittle as the corpus grows:
MemAgent with chunked, embedded memory
Chunking plus embeddings makes retrieval more forgiving and more stable:
The narrative takeaway is simple. Filesystems feel great when the corpus is small and the queries are keyword-friendly. As the corpus grows and the questions get fuzzier, semantic retrieval becomes the differentiator, and database-backed memory becomes the more dependable default.
The quality gap widens with scale. On a handful of documents, grep can brute-force its way to a reasonable answer: the agent finds a keyword match, pulls surrounding context, and responds.
But scatter the same information across hundreds of files, and keyword search starts missing the forest for the trees. It returns too many shallow hits or none when the user’s phrasing doesn’t match the source text verbatim. Semantic search, by contrast, surfaces conceptually relevant chunks even when the vocabulary differs. The result isn’t just faster retrieval, it’s more coherent answers with fewer hallucinated gaps. This is evident in our LLM judge evaluation on the large corpus benchmark, where FSAgent achieved a score of 29.7% while MemAgent reached 87.1%.

We find that the real breaking point for filesystem memory is rarely retrieval. It is concurrency.
We ran three versions of the same workload under concurrent writes:
Then we measured two things:

What we observed maps to what many teams discover the hard way.
Naive filesystem writes can be fast and still be wrong.
Without locking, concurrent writes conflict with each other eventually. You might get good throughput and still lose memory entries. If your agent’s “memory” is used for downstream reasoning, silent loss is not a performance issue. It is a correctness failure.
Locking fixes integrity, but now correctness is your job.
With explicit locking, you can make filesystem writes safe. But you inherit the complexity. Lock scope, lock contention, platform differences, network filesystem behavior, and failure recovery all become part of your agent engineering work.
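To make the trade-off tangible, here is a minimal, POSIX-only sketch of what “explicit locking” means for append-style memory writes; on Windows or on networked filesystems the behavior differs, which is exactly the kind of complexity you inherit.
import fcntl


def append_with_lock(path: str, line: str) -> None:
    """Append a line to a shared file while holding an exclusive advisory lock."""
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the exclusive lock
        try:
            f.write(line + "\n")
            f.flush()                   # push data to the OS before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)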
Databases make correctness the default.
Transactions and isolation are exactly what databases were designed for. Yes, there is overhead. But the key difference is that you are not bolting correctness on after a production incident. You start with a system whose job is to protect the shared state.
And of course, you can take the file-locking approach, add atomic writes, build a write-ahead log, introduce retry and recovery logic, maintain indexes for fast lookups, and standardize metadata so you can query it reliably.
Eventually though, you will realize you have not “avoided” a database at all.
You have just rebuilt one, only with fewer guarantees and more edge cases to own.
This isn’t a religious war between “files” and “databases.” It’s a question of what you’re optimizing for — and which failure modes you’re willing to own.
If you’re building single-user or single-writer prototypes, filesystem memory is a great default. It’s simple, transparent, and fast to iterate on. You can open a folder and see exactly what the agent saved, diff it, version it, and replay it with nothing more than a text editor.
If you’re building multi-user agents, background workers, or anything you plan to ship at scale, a database-backed memory store is the safer foundation.
At that stage, concurrency, integrity, governance, access control, and auditability matter more than raw simplicity. A practical compromise is a hybrid design: keep file-like ergonomics for artifacts and developer workflows, but store durable memory in a database that can enforce correctness.
And if you insist on filesystem-only memory in production, treat locking, atomic writes, recovery, indexing, and metadata discipline as first-class engineering work. Because the moment you do that seriously, you’re no longer “just using files” — you’re rebuilding a database.
One last trap worth calling out: polyglot persistence.
Many AI stacks drift into an anti-pattern: a vector DB for embeddings, a NoSQL DB for JSON, a graph DB for relationships, and a relational DB for transactions. Each product is “best at its one thing,” until you realize you’re operating four databases, four security models, four backup strategies, four scaling profiles, and four cascading failure points.
Coordination becomes the tax. For AI developers, the application turns into an integration layer for multiple storage engines, each with different access patterns and operational semantics; you end up building glue code, sync pipelines, and reconciliation logic just to make the system feel unified to the agent. This is why converged approaches matter in agent systems: production memory isn’t only about storing vectors; it’s about storing operational history, artifacts, metadata, and semantics under one consistent set of guarantees.
Of course, production data is inherently heterogeneous. You will inevitably deal with structured, semi-structured, unstructured text, embeddings, JSON documents, and relationship-heavy data.
The point is not that “one model wins”.
The point is that when you understand the fundamentals of data management, reliability, indexing, governance, and queryability, you want a platform that can store and retrieve these forms without turning your AI infrastructure into a collection of loosely coordinated subsystems.
This is the philosophy behind Oracle’s converged database approach, which is designed to support multiple data types and workloads natively within a single engine. In the world of agents, that becomes a practical advantage because we can use Oracle as the unified memory core for both operational memory (SQL tables for history and logs) and semantic memory (vector search for retrieval).