
Google Open-Sources Always On Memory Agent: Ditching Vector Databases for LLM-Driven Persistence

A new open-source reference implementation challenges the assumption that agents need vector databases for persistent memory. Here is what changes for enterprise developers.

S5 Labs Team · March 8, 2026

Google senior AI product manager Shubham Saboo just released an open-source project that tackles one of the thorniest problems in agent design: persistent memory. The Always On Memory Agent, published on the official Google Cloud Platform GitHub under an MIT license, challenges a widely held assumption in the AI industry: that agents need vector databases to store and retrieve context across sessions.

The project matters because it offers a concrete alternative to the retrieval stacks that most agent deployments rely on today. For teams building long-running AI systems, this is a practical reference worth examining.

The Problem With Traditional Agent Memory

Most agent implementations treat memory as a retrieval problem. You chunk documents, generate embeddings, store them in a vector database, and query for similarity at runtime. This approach works, but it adds operational overhead: embedding pipelines, vector storage, indexing logic, and synchronization between the database and your application state.
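To make that overhead concrete, here is a toy sketch of the pipeline described above: chunk, embed, index, query by similarity. The bag-of-words "embedding" stands in for a real embedding model, and the in-memory list stands in for a vector database; the point is how many moving parts (chunker, embedder, index, sync logic) a team ends up operating.

```python
# Toy illustration of the chunk -> embed -> index -> query pipeline.
# The Counter-based "embedding" is a stand-in for a real model.
from collections import Counter
from math import sqrt

def chunk(text: str, size: int = 5) -> list[str]:
    """Split text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector store": a list of (chunk, embedding) pairs the application
# must keep in sync with its own state by hand.
index = [(c, embed(c)) for c in chunk(
    "the agent stores structured memories in sqlite and consolidates them"
)]

def query(q: str, top_k: int = 1) -> list[str]:
    """Return the top_k most similar chunks to the query."""
    scored = sorted(index, key=lambda p: cosine(embed(q), p[1]), reverse=True)
    return [c for c, _ in scored[:top_k]]

print(query("where are memories stored"))
```

Every one of these stages exists in production retrieval stacks at much larger scale, which is exactly the overhead the Always On Memory Agent tries to eliminate.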

For prototypes, that overhead is manageable. For production systems that need to run continuously, it becomes a significant infrastructure burden. The Always On Memory Agent asks a different question: what if the model itself handles memory organization?

How the Architecture Works

The agent runs continuously, ingests files or API input, stores structured memories in SQLite, and performs scheduled memory consolidation every 30 minutes by default. It was built with Google’s Agent Development Kit (ADK) and Gemini 3.1 Flash-Lite, the low-cost model Google released on March 3, 2026.
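A minimal sketch of that loop, assuming a plain SQLite table and a placeholder for the LLM call (this is an illustration of the described design, not the actual repository code):

```python
# Hypothetical sketch of ingest-then-consolidate over SQLite.
# consolidate_with_llm() stands in for a Gemini call that merges raw
# notes into one structured memory; in the real agent the cycle runs
# on a 30-minute schedule.
import sqlite3
import time

DB = sqlite3.connect(":memory:")
DB.execute("""CREATE TABLE IF NOT EXISTS memories (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL,
    consolidated INTEGER DEFAULT 0,
    created_at REAL
)""")

def ingest(content: str) -> None:
    """Store a raw, unconsolidated memory."""
    DB.execute("INSERT INTO memories (content, created_at) VALUES (?, ?)",
               (content, time.time()))
    DB.commit()

def consolidate_with_llm(notes: list[str]) -> str:
    """Placeholder for the model's read-think-write consolidation step."""
    return " | ".join(notes)

def consolidation_cycle() -> None:
    """Merge all unconsolidated notes into one structured memory."""
    notes = [r[0] for r in DB.execute(
        "SELECT content FROM memories WHERE consolidated = 0")]
    if notes:
        summary = consolidate_with_llm(notes)
        DB.execute("UPDATE memories SET consolidated = 1")
        DB.execute("INSERT INTO memories (content, consolidated, created_at) "
                   "VALUES (?, 1, ?)", (summary, time.time()))
        DB.commit()

ingest("user prefers concise answers")
ingest("user works in fintech")
consolidation_cycle()  # scheduled every 30 minutes in the real agent
```

The structural point: the "retrieval system" here is ordinary SQL plus a model call, with no embedding pipeline to keep in sync.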

The repository makes a deliberately provocative claim: “No vector database. No embeddings. Just an LLM that reads, thinks, and writes structured memory.”

The system uses a multi-agent internal architecture with specialist components handling ingestion, consolidation, and querying. A local HTTP API and Streamlit dashboard are included, and the agent supports text, image, audio, video, and PDF ingestion.

Why Flash-Lite Makes This Viable

The choice of Gemini 3.1 Flash-Lite is not incidental. Google positions the model as its fastest and most cost-efficient option, at $0.25 per million input tokens and $1.50 per million output tokens. The company claims it is 2.5 times faster than Gemini 2.5 Flash in time to first token and delivers a 45% increase in output speed while maintaining similar quality.

Pairing a low-cost, fast model with a memory layer that runs 24/7 makes economic sense. Every 30-minute consolidation cycle consumes tokens, and the cost adds up quickly with more expensive models. Flash-Lite keeps the operational cost predictable for high-frequency, long-running workloads.
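A back-of-envelope check makes the economics concrete. The prices are Flash-Lite's published rates; the per-cycle token counts are illustrative assumptions, not measurements from the project.

```python
# Rough cost model for an always-on 30-minute consolidation loop at
# Flash-Lite's listed prices. Token counts per cycle are assumptions.
INPUT_PRICE = 0.25 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 1.50 / 1_000_000  # dollars per output token

cycles_per_day = 24 * 60 // 30           # one consolidation every 30 minutes
in_tokens, out_tokens = 20_000, 2_000    # assumed usage per cycle

daily = cycles_per_day * (in_tokens * INPUT_PRICE + out_tokens * OUTPUT_PRICE)
print(f"{cycles_per_day} cycles/day -> ${daily:.2f}/day, ${daily * 30:.2f}/30 days")
```

Under these assumptions the loop costs on the order of dollars per month; run the same arithmetic with a frontier model priced ten to fifty times higher and the always-on design stops being a rounding error.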

What This Means for Enterprise Developers

The release is less a product launch and more a signal about where agent infrastructure is heading. The repository packages a vision for long-running autonomy that appeals to support systems, research assistants, internal copilots, and workflow automation.

But enterprise architects are already raising concerns. One reaction on X described the approach as “brilliant leaps for continuous agent autonomy” but warned that an agent “dreaming” and cross-pollinating memories in the background without deterministic boundaries becomes “a compliance nightmare.”

That critique points to the real challenge: governance. When memory stops being session-bound, teams need answers to difficult questions. Who can write memory? What gets merged? How is retention handled? When are memories deleted? How do you audit what the agent learned over time?
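One plausible mitigation, sketched here as an assumption rather than anything the released project ships, is an append-only audit log so every memory write, merge, or deletion is attributable and queryable:

```python
# Hypothetical audit trail for LLM-driven memory: every mutation is
# recorded with the acting component and a timestamp. The schema and
# actor names are illustrative, not part of the released project.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memory_audit (
    id INTEGER PRIMARY KEY,
    action TEXT CHECK (action IN ('write', 'merge', 'delete')),
    actor TEXT NOT NULL,   -- which agent component touched memory
    detail TEXT,
    ts REAL NOT NULL
)""")

def record(action: str, actor: str, detail: str) -> None:
    db.execute("INSERT INTO memory_audit (action, actor, detail, ts) "
               "VALUES (?, ?, ?, ?)", (action, actor, detail, time.time()))
    db.commit()

record("write", "ingestion-agent", "stored note from uploaded PDF")
record("merge", "consolidation-agent", "merged 4 notes into one summary")

# "What did the agent learn, and when?" becomes a query, not a mystery.
for action, actor, detail in db.execute(
        "SELECT action, actor, detail FROM memory_audit ORDER BY ts"):
    print(action, actor, detail)
```

A log like this does not answer the policy questions (who may write, what may merge), but it makes the answers auditable once a team sets them.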

Another important point: removing a vector database does not remove retrieval design. It changes where the complexity lives. The system still has to chunk, index, and retrieve structured memory. It may work well for bounded-memory agents but could break down at scale.

The Tradeoff Developers Need to Understand

This is not a case where one approach is universally better. The lighter stack with LLM-driven memory may be attractive for low-cost, bounded-memory agents where the overhead of a full retrieval pipeline does not make sense. Larger-scale deployments may still demand stricter retrieval controls, explicit indexing strategies, and stronger lifecycle tooling.

For developers, the value of this release is not the code itself but the reference architecture it provides. It demonstrates that persistent memory does not require a vector database as a prerequisite, and it shows how to implement consolidation with a cost-effective model.

If you are building agents that need to maintain context across sessions, this is worth studying. The assumptions your architecture makes about memory will shape what your system can and cannot do at scale.
