Infrastructure

pgvector at Scale: When Postgres Is Enough

Balys Kriksciunas · 7 min read
#ai #infrastructure #pgvector #postgres #vector-database #rag #embeddings

When teams ask me which vector database they should use for a new RAG project, my first question is: “Do you already run Postgres?” If the answer is yes and your corpus is under 50M vectors, we’re probably talking about pgvector.

This surprises people. “Postgres for vectors? Doesn’t that need a specialized database?” Once upon a time, yes. In 2025, pgvector is good enough to handle the workloads that 80% of production RAG systems actually have — and it’s enormously cheaper to operate than a second dedicated database.

This article covers what pgvector can and can’t do, how to tune it, and where the ceilings genuinely are.


What pgvector Is

pgvector is a Postgres extension that adds a vector column type and two index types: IVFFlat and HNSW. You install it with CREATE EXTENSION vector; and you have vectors.

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1536),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

That’s it. You can now do:

SELECT id, content, embedding <=> '[...]' AS distance
FROM documents
WHERE metadata->>'tenant_id' = '42'
ORDER BY embedding <=> '[...]'
LIMIT 10;

<=> is cosine distance, <-> is L2, <#> is negative inner product. HNSW handles the ANN search; Postgres handles the metadata filter.
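A quick sanity check of the three operators on toy vectors, following from the definitions (orthogonal unit vectors have cosine similarity 0, hence cosine distance 1):

```sql
-- Cosine distance: 1 - cosine similarity
SELECT '[1,0]'::vector <=> '[0,1]'::vector;  -- 1

-- Euclidean (L2) distance between the same pair
SELECT '[1,0]'::vector <-> '[0,1]'::vector;  -- sqrt(2) ≈ 1.4142

-- Negative inner product (negated so that smaller = more similar)
SELECT '[1,0]'::vector <#> '[0,1]'::vector;  -- 0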


Why Not Use A Dedicated Vector DB?

Dedicated vector DBs have real advantages at scale: purpose-built ANN engines, built-in sharding, vector-native APIs. But they come with real costs: a second system to deploy, monitor, back up, upgrade, and secure, plus a pipeline to keep it in sync with your source of truth.

Postgres gives you transactions, joins, backups, point-in-time recovery, and an operational playbook your team already knows, with your vectors living next to the rest of your data.

For workloads under ~50M vectors, pgvector’s performance is enough and the operational simplicity is a gift. You should graduate when performance becomes your bottleneck, not when you hit an arbitrary scale.


Performance Ceiling: Where pgvector Lives

A rough benchmark (our internal harness, 1536-dim vectors, HNSW index, cosine distance, 16-core c7i.8xlarge, NVMe):

| Corpus size | P50 latency | P95 latency | QPS (single conn) | RAM |
|---|---|---|---|---|
| 1M vectors | 3ms | 8ms | ~800 | ~6 GB |
| 10M vectors | 5ms | 18ms | ~500 | ~55 GB |
| 50M vectors | 12ms | 40ms | ~280 | ~250 GB |
| 100M vectors | 28ms | 95ms | ~150 | ~500 GB |

At 1M vectors it’s blisteringly fast. At 10M it’s solid. At 50M it still works, but RAM requirements and single-query latency start to bite. At 100M, you’re feeling it.

Compared to dedicated options (Qdrant, Milvus) at the same scale, pgvector is 2–3x slower on pure ANN throughput at the top end. At 1–10M vectors, the gap narrows to 20–40% and is often dominated by network latency anyway.


The pgvectorscale Upgrade

In 2024, Timescale open-sourced pgvectorscale, a companion extension that adds two things:

  1. StreamingDiskANN index — an ANN algorithm (based on Microsoft’s DiskANN) that handles larger-than-RAM corpora via smart disk IO.
  2. Statistical Binary Quantization (SBQ) — compressed vector representations that dramatically reduce memory.
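Assuming I have the names right (the extension is vectorscale and the index access method is diskann), enabling it looks roughly like this:

```sql
-- CASCADE also installs pgvector if it isn't there yet
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- StreamingDiskANN index instead of HNSW
CREATE INDEX ON documents
    USING diskann (embedding vector_cosine_ops);
```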

With pgvectorscale, the 100M-vector benchmark above drops to:

| Corpus size | P50 latency | P95 latency | QPS | RAM |
|---|---|---|---|---|
| 100M vectors (pgvectorscale SBQ) | 8ms | 22ms | ~450 | ~60 GB |

That’s roughly a 3–4x improvement in latency and throughput, and ~8x less RAM. For teams serious about pgvector at scale, pgvectorscale is essentially required: it pushes the ceiling from ~50M to ~500M vectors.


Tuning Knobs That Matter

HNSW parameters

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

At query time:

SET hnsw.ef_search = 40;  -- default 40; higher = better recall, slower

Typical production: m=16, ef_construction=200, ef_search=40–100. Tune ef_search per-query based on how much recall the application needs.
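A plain SET sticks for the whole session. To raise recall for a single expensive query, scope it with SET LOCAL inside a transaction:

```sql
BEGIN;
SET LOCAL hnsw.ef_search = 100;  -- applies to this transaction only
SELECT id, content
FROM documents
ORDER BY embedding <=> '[...]'
LIMIT 10;
COMMIT;
```

SET LOCAL also behaves correctly behind a transaction-pooling pgBouncer, where session-level SETs can leak across clients.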

IVFFlat vs HNSW

IVFFlat is older, simpler, faster to build, uses less memory. HNSW gives better recall at query time but takes much longer to build. For almost every production case, use HNSW. IVFFlat is a reasonable choice for very write-heavy workloads where index build time dominates.

maintenance_work_mem

HNSW index builds are memory-bound. The Postgres default is 64MB, which is laughable for this. Bump it to 8–16 GB for the index build session:

SET maintenance_work_mem = '16GB';
CREATE INDEX ...

Parallel index build

With pgvector 0.6.0 or later, HNSW builds are parallelizable:

SET max_parallel_maintenance_workers = 8;

On a 16-core machine, a 10M-vector HNSW build drops from ~2 hours to ~20 minutes.
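Putting the build-time knobs together, a typical index-build session looks like this (the index name is illustrative):

```sql
-- Session-level settings for one fast build
SET maintenance_work_mem = '16GB';
SET max_parallel_maintenance_workers = 8;

CREATE INDEX documents_embedding_idx
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
```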

work_mem

Vector queries can sort large intermediate result sets. For queries that are slow, bump work_mem to 256MB–1GB at the session level.

Filtering and partial indexes

A vector query with a metadata filter can’t use the HNSW index for the filter — it filters after retrieving candidates. If your filter is highly selective (only 1% of vectors match), this can decimate recall.

Fix 1: Over-retrieve. Query for top-200 candidates, then filter down to your top-10. Works for moderate selectivity.
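The over-retrieve pattern as a query: pull a generous candidate set from the ANN index first, then apply the selective filter on top (top-200 and top-10 are the numbers from above):

```sql
SELECT id, content
FROM (
    SELECT id, content, metadata,
           embedding <=> '[...]' AS distance
    FROM documents
    ORDER BY embedding <=> '[...]'
    LIMIT 200                        -- over-retrieve from the index
) candidates
WHERE metadata->>'tenant_id' = '42'  -- filter the candidates
ORDER BY distance
LIMIT 10;
```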

Fix 2: Partitioned indexes. If you filter frequently on a tenant_id or category, partition the table by that field. Each partition has its own HNSW index; queries only hit the relevant partition.
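A sketch of the partitioned layout, assuming list partitioning on tenant_id (table and partition names are illustrative; note the partition key must be part of the primary key):

```sql
CREATE TABLE documents (
    id BIGSERIAL,
    tenant_id BIGINT NOT NULL,
    content TEXT,
    embedding VECTOR(1536),
    PRIMARY KEY (id, tenant_id)
) PARTITION BY LIST (tenant_id);

CREATE TABLE documents_t42 PARTITION OF documents FOR VALUES IN (42);

-- Each partition gets its own HNSW index
CREATE INDEX ON documents_t42
    USING hnsw (embedding vector_cosine_ops);
```

Queries that include WHERE tenant_id = 42 prune down to that single partition and its index.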

Fix 3: Partial indexes. For a specific hot filter, create a partial index:

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WHERE status = 'published';

Queries that include WHERE status = 'published' can use this index; draft rows never enter it, so they can't dilute the candidate set.


Multi-Tenant Patterns

The most common production setup for B2B SaaS: many tenants, each with their own corpus, queries must be tenant-isolated.

Pattern A: Single table, tenant_id column. Simplest. Good up to ~5M vectors per tenant and modest tenant count. Over-retrieve and post-filter.

Pattern B: Partition by tenant. Postgres declarative partitioning, one partition per tenant or per tenant group. Best for 10s–100s of tenants with large per-tenant corpora.

Pattern C: Schema per tenant. Postgres schemas as tenants. Good isolation; ugly migrations. Use only if compliance requires hard isolation.

Pattern D: Database per large tenant. For your top 10–50 accounts on enterprise contracts. Lets you scale vertically and bill them separately.

Most teams end up with Pattern A for small tenants and Pattern D for whales.


Scaling Patterns

Read replicas

Postgres logical or physical replication gets you read-scaling. HNSW indexes replicate fine. Point your read-heavy RAG queries at replicas and let the primary handle writes.

Partitioning

Partition by a dimension that correlates with query filters (tenant, date range, category). Each partition has its own HNSW index. Queries that filter on the partition key only scan relevant partitions.

Horizontal sharding

Citus (acquired by Microsoft, still open source) shards Postgres across nodes. Vector queries scoped to a partition key work well. Cross-shard vector queries are possible but require coordination.
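With Citus, distributing the table by tenant looks roughly like this (create_distributed_table is Citus's API; the column choice is an assumption):

```sql
-- Shard the documents table across workers by tenant_id
SELECT create_distributed_table('documents', 'tenant_id');

-- Queries that pin the distribution key route to a single shard
SELECT id, content
FROM documents
WHERE tenant_id = 42
ORDER BY embedding <=> '[...]'
LIMIT 10;
```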

Connection pooling

pgBouncer in transaction-pooling mode. Long-running vector queries don’t play nicely with session-pooling. Budget ~5–10 connections per inference worker.


When To Graduate

Concrete signals that pgvector is no longer enough:

  1. Corpus growth is pushing past ~50M vectors (or ~500M with pgvectorscale) with no plateau in sight.
  2. P95 latency stays outside your budget even after tuning ef_search, memory, and partitioning.
  3. RAM for the index dominates your instance cost.
  4. HNSW rebuilds no longer fit inside your maintenance windows.

If you cross those thresholds, migrate. Qdrant or Pinecone are the usual next stops. Expect 2–4 weeks of migration work: dual-write, backfill, cutover.


Operational Tips From The Field

1. Don’t delete and re-insert; update. An UPDATE to a row with a vector column rewrites the vector anyway. A DELETE + INSERT does the same work and fragments the index on top.

2. VACUUM matters more than usual. HNSW indexes accumulate cruft on updates. VACUUM ANALYZE weekly; consider REINDEX monthly on write-heavy tables.

3. Watch for bloat. Compare the index's on-disk size (pg_relation_size) to the table's. If the ratio of index size to table size keeps climbing, it's time to REINDEX.
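A sketch of the bloat check (index name illustrative):

```sql
SELECT
    pg_size_pretty(pg_relation_size('documents_embedding_idx')) AS index_size,
    pg_size_pretty(pg_relation_size('documents'))               AS table_size;

-- If the ratio keeps climbing, rebuild without blocking reads
-- (REINDEX CONCURRENTLY is available from Postgres 12):
REINDEX INDEX CONCURRENTLY documents_embedding_idx;
```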

4. Materialize your embeddings async. Don’t block user-facing writes on embedding generation. Queue the embedding job; UPDATE the row when it completes.

5. Keep pgvector current. pgvector had major performance wins in 0.5.0, 0.6.0, and 0.7.0. Every upgrade is worth it.


A Real Comparison

Price out a 20M-vector corpus with a 1,000 QPS production load: a suitably sized Postgres instance running pgvector versus the equivalent managed vector-DB tier.

At 20M vectors, pgvector isn’t always cheapest, but it’s competitive and you get a full Postgres for the same spend. For most teams, that changes the calculus.


Tuning a pgvector deployment or deciding whether to migrate? We can help — we’ve shipped pgvector workloads at every size.
