Infrastructure

pgvector at Scale: When Postgres Is Enough

Balys Kriksciunas · 7 min read
#ai #infrastructure #pgvector #postgres #vector-database #rag #embeddings

When teams ask me which vector database they should use for a new RAG project, my first question is: “Do you already run Postgres?” If the answer is yes and your corpus is under 50M vectors, we’re probably talking about pgvector.

This surprises people. “Postgres for vectors? Doesn’t that need a specialized database?” Once upon a time, yes. In 2025, pgvector is good enough to handle the workloads that 80% of production RAG systems actually have — and it’s enormously cheaper to operate than a second dedicated database.

This article covers what pgvector can and can’t do, how to tune it, and where the ceilings genuinely are.


What pgvector Is

pgvector is a Postgres extension that adds a vector column type and two index types: IVFFlat and HNSW. You install it with CREATE EXTENSION vector; and you have vectors.

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1536),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

That’s it. You can now do:

SELECT id, content, embedding <=> '[...]' AS distance
FROM documents
WHERE metadata->>'tenant_id' = '42'
ORDER BY embedding <=> '[...]'
LIMIT 10;

<=> is cosine distance, <-> is L2, <#> is negative inner product. HNSW handles the ANN search; Postgres handles the metadata filter.
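A quick sanity check of the three operators on toy vectors, following from the definitions (orthogonal unit vectors have cosine similarity 0, hence cosine distance 1):

```sql
-- Cosine distance: 1 - cosine similarity
SELECT '[1,0]'::vector <=> '[0,1]'::vector;  -- 1

-- Euclidean (L2) distance between the same pair
SELECT '[1,0]'::vector <-> '[0,1]'::vector;  -- sqrt(2) ≈ 1.4142

-- Negative inner product (negated so that smaller = more similar)
SELECT '[1,0]'::vector <#> '[0,1]'::vector;  -- 0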


Why Not Use A Dedicated Vector DB?

Dedicated vector DBs have real advantages at scale: purpose-built ANN engines, built-in sharding, vector-native APIs. But they come with real costs: a second system to deploy, monitor, back up, upgrade, and secure, plus a pipeline to keep it in sync with your source of truth.

Postgres gives you transactions, joins, backups, point-in-time recovery, and an operational playbook your team already knows, with your vectors living next to the rest of your data.

For workloads under ~50M vectors, pgvector’s performance is enough and the operational simplicity is a gift. You should graduate when performance becomes your bottleneck, not when you hit an arbitrary scale.


Performance Ceiling: Where pgvector Lives

A rough benchmark (our internal harness, 1536-dim vectors, HNSW index, cosine distance, 16-core c7i.8xlarge, NVMe):

| Corpus size | P50 latency | P95 latency | QPS (single conn) | RAM |
|---|---|---|---|---|
| 1M vectors | 3ms | 8ms | ~800 | ~6 GB |
| 10M vectors | 5ms | 18ms | ~500 | ~55 GB |
| 50M vectors | 12ms | 40ms | ~280 | ~250 GB |
| 100M vectors | 28ms | 95ms | ~150 | ~500 GB |

At 1M vectors it’s blisteringly fast. At 10M it’s solid. At 50M it still works, but RAM requirements and single-query latency start to bite. At 100M, you’re feeling it.

Compared to dedicated options (Qdrant, Milvus) at the same scale, pgvector is 2–3x slower on pure ANN throughput at the top end. At 1–10M vectors, the gap narrows to 20–40% and is often dominated by network latency anyway.


The pgvectorscale Upgrade

In 2024, Timescale open-sourced pgvectorscale, a companion extension that adds two things:

  1. StreamingDiskANN index — an ANN algorithm (based on Microsoft’s DiskANN) that handles larger-than-RAM corpora via smart disk IO.
  2. Statistical Binary Quantization (SBQ) — compressed vector representations that dramatically reduce memory.
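Assuming I have the names right (the extension is vectorscale and the index access method is diskann), enabling it looks roughly like this:

```sql
-- CASCADE also installs pgvector if it isn't there yet
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- StreamingDiskANN index instead of HNSW
CREATE INDEX ON documents
    USING diskann (embedding vector_cosine_ops);
```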

With pgvectorscale, the 100M-vector benchmark above drops to:

| Corpus size | P50 latency | P95 latency | QPS | RAM |
|---|---|---|---|---|
| 100M vectors (pgvectorscale SBQ) | 8ms | 22ms | ~450 | ~60 GB |

That’s roughly a 3–4x improvement in latency and throughput, and ~8x less RAM. For teams serious about pgvector at scale, pgvectorscale is essentially required: it pushes the ceiling from ~50M to ~500M vectors.


Tuning Knobs That Matter

HNSW parameters

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

At query time:

SET hnsw.ef_search = 40;  -- default 40; higher = better recall, slower

Typical production: m=16, ef_construction=200, ef_search=40–100. Tune ef_search per-query based on how much recall the application needs.
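A plain SET sticks for the whole session. To raise recall for a single expensive query, scope it with SET LOCAL inside a transaction:

```sql
BEGIN;
SET LOCAL hnsw.ef_search = 100;  -- applies to this transaction only
SELECT id, content
FROM documents
ORDER BY embedding <=> '[...]'
LIMIT 10;
COMMIT;
```

SET LOCAL also behaves correctly behind a transaction-pooling pgBouncer, where session-level SETs can leak across clients.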

IVFFlat vs HNSW

IVFFlat is older, simpler, faster to build, uses less memory. HNSW gives better recall at query time but takes much longer to build. For almost every production case, use HNSW. IVFFlat is a reasonable choice for very write-heavy workloads where index build time dominates.

maintenance_work_mem

HNSW index builds are memory-bound. The Postgres default is 64MB, which is laughable for this. Bump it to 8–16 GB for the index build session:

SET maintenance_work_mem = '16GB';
CREATE INDEX ...

Parallel index build

With pgvector 0.6.0 or later, HNSW builds are parallelizable:

SET max_parallel_maintenance_workers = 8;

On a 16-core machine, a 10M-vector HNSW build drops from ~2 hours to ~20 minutes.
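Putting the build-time knobs together, a typical index-build session looks like this (the index name is illustrative):

```sql
-- Session-level settings for one fast build
SET maintenance_work_mem = '16GB';
SET max_parallel_maintenance_workers = 8;

CREATE INDEX documents_embedding_idx
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
```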

work_mem

Vector queries can sort large intermediate result sets. For queries that are slow, bump work_mem to 256MB–1GB at the session level.

Filtering and partial indexes

A vector query with a metadata filter can’t use the HNSW index for the filter — it filters after retrieving candidates. If your filter is highly selective (only 1% of vectors match), this can decimate recall.

Fix 1: Over-retrieve. Query for top-200 candidates, then filter down to your top-10. Works for moderate selectivity.
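The over-retrieve pattern as a query: pull a generous candidate set from the ANN index first, then apply the selective filter on top (top-200 and top-10 are the numbers from above):

```sql
SELECT id, content
FROM (
    SELECT id, content, metadata,
           embedding <=> '[...]' AS distance
    FROM documents
    ORDER BY embedding <=> '[...]'
    LIMIT 200                        -- over-retrieve from the index
) candidates
WHERE metadata->>'tenant_id' = '42'  -- filter the candidates
ORDER BY distance
LIMIT 10;
```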

Fix 2: Partitioned indexes. If you filter frequently on a tenant_id or category, partition the table by that field. Each partition has its own HNSW index; queries only hit the relevant partition.
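A sketch of the partitioned layout, assuming list partitioning on tenant_id (table and partition names are illustrative; note the partition key must be part of the primary key):

```sql
CREATE TABLE documents (
    id BIGSERIAL,
    tenant_id BIGINT NOT NULL,
    content TEXT,
    embedding VECTOR(1536),
    PRIMARY KEY (id, tenant_id)
) PARTITION BY LIST (tenant_id);

CREATE TABLE documents_t42 PARTITION OF documents FOR VALUES IN (42);

-- Each partition gets its own HNSW index
CREATE INDEX ON documents_t42
    USING hnsw (embedding vector_cosine_ops);
```

Queries that include WHERE tenant_id = 42 prune down to that single partition and its index.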

Fix 3: Partial indexes. For a specific hot filter, create a partial index:

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WHERE status = 'published';

Queries that include WHERE status = 'published' can use this index; draft rows never enter it, so they can't dilute the candidate set.


Multi-Tenant Patterns

The most common production setup for B2B SaaS: many tenants, each with their own corpus, queries must be tenant-isolated.

Pattern A: Single table, tenant_id column. Simplest. Good up to ~5M vectors per tenant and modest tenant count. Over-retrieve and post-filter.

Pattern B: Partition by tenant. Postgres declarative partitioning, one partition per tenant or per tenant group. Best for 10s–100s of tenants with large per-tenant corpora.

Pattern C: Schema per tenant. Postgres schemas as tenants. Good isolation; ugly migrations. Use only if compliance requires hard isolation.

Pattern D: Database per large tenant. For your top 10–50 accounts on enterprise contracts. Lets you scale vertically and bill them separately.

Most teams end up with Pattern A for small tenants and Pattern D for whales.


Scaling Patterns

Read replicas

Postgres logical or physical replication gets you read-scaling. HNSW indexes replicate fine. Point your read-heavy RAG queries at replicas and let the primary handle writes.

Partitioning

Partition by a dimension that correlates with query filters (tenant, date range, category). Each partition has its own HNSW index. Queries that filter on the partition key only scan relevant partitions.

Horizontal sharding

Citus (acquired by Microsoft, still open source) shards Postgres across nodes. Vector queries scoped to a partition key work well. Cross-shard vector queries are possible but require coordination.
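With Citus, distributing the table by tenant looks roughly like this (create_distributed_table is Citus's API; the column choice is an assumption):

```sql
-- Shard the documents table across workers by tenant_id
SELECT create_distributed_table('documents', 'tenant_id');

-- Queries that pin the distribution key route to a single shard
SELECT id, content
FROM documents
WHERE tenant_id = 42
ORDER BY embedding <=> '[...]'
LIMIT 10;
```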

Connection pooling

pgBouncer in transaction-pooling mode. Long-running vector queries don’t play nicely with session-pooling. Budget ~5–10 connections per inference worker.


When To Graduate

Concrete signals that pgvector is no longer enough:

  1. Corpus growth is pushing past ~50M vectors (or ~500M with pgvectorscale) with no plateau in sight.
  2. P95 latency stays outside your budget even after tuning ef_search, memory, and partitioning.
  3. RAM for the index dominates your instance cost.
  4. HNSW rebuilds no longer fit inside your maintenance windows.

If you cross those thresholds, migrate. Qdrant or Pinecone are the usual next stops. Expect 2–4 weeks of migration work: dual-write, backfill, cutover.


Operational Tips From The Field

1. Don’t delete and re-insert; update. An UPDATE to a row with a vector column rewrites the vector anyway. A DELETE + INSERT does the same work and fragments the index on top.

2. VACUUM matters more than usual. HNSW indexes accumulate cruft on updates. VACUUM ANALYZE weekly; consider REINDEX monthly on write-heavy tables.

3. Watch for bloat. Compare the index's on-disk size (pg_relation_size) to the table's. If the ratio of index size to table size keeps climbing, it's time to REINDEX.
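A sketch of the bloat check (index name illustrative):

```sql
SELECT
    pg_size_pretty(pg_relation_size('documents_embedding_idx')) AS index_size,
    pg_size_pretty(pg_relation_size('documents'))               AS table_size;

-- If the ratio keeps climbing, rebuild without blocking reads
-- (REINDEX CONCURRENTLY is available from Postgres 12):
REINDEX INDEX CONCURRENTLY documents_embedding_idx;
```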

4. Materialize your embeddings async. Don’t block user-facing writes on embedding generation. Queue the embedding job; UPDATE the row when it completes.

5. Keep pgvector current. pgvector had major performance wins in 0.5.0, 0.6.0, and 0.7.0. Every upgrade is worth it.


A Real Comparison

Price out a 20M-vector corpus with a 1,000 QPS production load: a suitably sized Postgres instance running pgvector versus the equivalent managed vector-DB tier.

At 20M vectors, pgvector isn’t always cheapest, but it’s competitive and you get a full Postgres for the same spend. For most teams, that changes the calculus.


Tuning a pgvector deployment or deciding whether to migrate? We can help — we’ve shipped pgvector workloads at every size.
