Choosing your Compute Add-on
Choosing the right Compute Add-on for your vector workload.
You have two options for scaling your vector workload:
- Increase the size of your database. This guide will help you choose the right size for your workload.
- Spread your workload across multiple databases. You can find more details about this approach in Engineering for Scale.
Dimensionality
The number of dimensions in your embeddings is the most important factor in choosing the right Compute Add-on. In general, the lower the dimensionality, the better the performance. We've provided guidance for some of the more common embedding dimensions below. For each benchmark, we used Vecs to create a collection, upload the embeddings to a single table, and create both IVFFlat and HNSW indexes using the inner-product distance measure on the embedding column. We then ran a series of queries to measure the performance of different compute add-ons:
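For reference, here is a minimal sketch of that setup using vecs. The connection string, collection name, and parameter values are placeholders, not recommendations:

```python
import vecs

# Placeholder connection string -- substitute your own project's credentials.
DB_CONNECTION = "postgresql://postgres:password@localhost:5432/postgres"

vx = vecs.create_client(DB_CONNECTION)

# A collection sized for 384-dimensional embeddings (e.g. gte-small).
docs = vx.get_or_create_collection(name="docs", dimension=384)

# Records are (id, vector, metadata) tuples.
docs.upsert(records=[
    ("vec0", [0.1] * 384, {"source": "example"}),
])

# Build an HNSW index for inner-product distance, mirroring the benchmarks.
docs.create_index(
    method=vecs.IndexMethod.hnsw,
    measure=vecs.IndexMeasure.max_inner_product,
    index_arguments=vecs.IndexArgsHNSW(m=16, ef_construction=64),
)
```

For the IVFFlat benchmarks, swap in `method=vecs.IndexMethod.ivfflat` with `index_arguments=vecs.IndexArgsIVFFlat(n_lists=...)`.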
HNSW
384 dimensions
This benchmark uses the dbpedia-entities-openai-1M dataset containing 1,000,000 text embeddings, regenerated as 384-dimensional embeddings using gte-small.
Compute Size | Vectors | m | ef_construction | ef_search | QPS | Latency Mean | Latency p95 | RAM Usage | Total RAM |
---|---|---|---|---|---|---|---|---|---|
Micro | 100,000 | 16 | 64 | 60 | 580 | 0.017 sec | 0.024 sec | 1.2 GB (Swap) | 1 GB |
Small | 250,000 | 24 | 64 | 60 | 440 | 0.022 sec | 0.033 sec | 2 GB | 2 GB |
Medium | 500,000 | 24 | 64 | 80 | 350 | 0.028 sec | 0.045 sec | 4 GB | 4 GB |
Large | 1,000,000 | 32 | 80 | 100 | 270 | 0.073 sec | 0.108 sec | 7 GB | 8 GB |
XL | 1,000,000 | 32 | 80 | 100 | 525 | 0.038 sec | 0.059 sec | 9 GB | 16 GB |
2XL | 1,000,000 | 32 | 80 | 100 | 790 | 0.025 sec | 0.037 sec | 9 GB | 32 GB |
4XL | 1,000,000 | 32 | 80 | 100 | 1650 | 0.015 sec | 0.018 sec | 11 GB | 64 GB |
8XL | 1,000,000 | 32 | 80 | 100 | 2690 | 0.015 sec | 0.016 sec | 13 GB | 128 GB |
12XL | 1,000,000 | 32 | 80 | 100 | 3900 | 0.014 sec | 0.016 sec | 13 GB | 192 GB |
16XL | 1,000,000 | 32 | 80 | 100 | 4200 | 0.014 sec | 0.016 sec | 20 GB | 256 GB |
Accuracy was 0.99 for all benchmarks.
960 dimensions
This benchmark uses the gist-960 dataset, which contains 1,000,000 embeddings of images. Each embedding is 960 dimensions.
Compute Size | Vectors | m | ef_construction | ef_search | QPS | Latency Mean | Latency p95 | RAM Usage | Total RAM |
---|---|---|---|---|---|---|---|---|---|
Micro | 30,000 | 16 | 64 | 65 | 430 | 0.024 sec | 0.034 sec | 1.2 GB (Swap) | 1 GB |
Small | 100,000 | 32 | 80 | 60 | 260 | 0.040 sec | 0.054 sec | 2.2 GB (Swap) | 2 GB |
Medium | 250,000 | 32 | 80 | 90 | 120 | 0.083 sec | 0.106 sec | 4 GB | 4 GB |
Large | 500,000 | 32 | 80 | 120 | 160 | 0.063 sec | 0.087 sec | 7 GB | 8 GB |
XL | 1,000,000 | 32 | 80 | 200 | 200 | 0.049 sec | 0.072 sec | 13 GB | 16 GB |
2XL | 1,000,000 | 32 | 80 | 200 | 340 | 0.025 sec | 0.029 sec | 17 GB | 32 GB |
4XL | 1,000,000 | 32 | 80 | 200 | 630 | 0.031 sec | 0.050 sec | 18 GB | 64 GB |
8XL | 1,000,000 | 32 | 80 | 200 | 1100 | 0.034 sec | 0.048 sec | 19 GB | 128 GB |
12XL | 1,000,000 | 32 | 80 | 200 | 1420 | 0.041 sec | 0.095 sec | 21 GB | 192 GB |
16XL | 1,000,000 | 32 | 80 | 200 | 1650 | 0.037 sec | 0.081 sec | 23 GB | 256 GB |
Accuracy was 0.99 for all benchmarks.
QPS can also be improved by increasing `m` and `ef_construction`. This allows you to use a smaller value for `ef_search`, which in turn increases QPS.
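A sketch of that trade-off (the parameter values are illustrative, and the query-time `ef_search` keyword assumes a recent vecs release; on older versions, run `SET hnsw.ef_search = 60;` in SQL instead):

```python
# Rebuild with a denser graph; replace=True drops the existing index first.
# Caveat: the index build takes longer with larger m / ef_construction.
docs.create_index(
    method=vecs.IndexMethod.hnsw,
    measure=vecs.IndexMeasure.max_inner_product,
    index_arguments=vecs.IndexArgsHNSW(m=32, ef_construction=80),
    replace=True,
)

# A smaller ef_search can now reach the same accuracy at higher QPS.
results = docs.query(
    data=[0.1] * 384,  # your query embedding (placeholder)
    limit=10,
    measure=vecs.IndexMeasure.max_inner_product,
    ef_search=60,      # assumption: your vecs version exposes this parameter
)
```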
1536 dimensions
This benchmark uses the dbpedia-entities-openai-1M dataset, which contains 1,000,000 embeddings of text; for compute add-ons Large and below, a 224,482-embedding subset built from Wikipedia articles was used. Each embedding has 1536 dimensions and was created with the OpenAI Embeddings API.
Compute Size | Vectors | m | ef_construction | ef_search | QPS | Latency Mean | Latency p95 | RAM Usage | Total RAM |
---|---|---|---|---|---|---|---|---|---|
Micro | 15,000 | 16 | 40 | 40 | 480 | 0.011 sec | 0.016 sec | 1.2 GB (Swap) | 1 GB |
Small | 50,000 | 32 | 64 | 100 | 175 | 0.031 sec | 0.051 sec | 2.2 GB (Swap) | 2 GB |
Medium | 100,000 | 32 | 64 | 100 | 240 | 0.083 sec | 0.126 sec | 4 GB | 4 GB |
Large | 224,482 | 32 | 64 | 100 | 280 | 0.017 sec | 0.028 sec | 8 GB | 8 GB |
XL | 500,000 | 24 | 56 | 100 | 360 | 0.055 sec | 0.135 sec | 13 GB | 16 GB |
2XL | 1,000,000 | 24 | 56 | 250 | 560 | 0.036 sec | 0.058 sec | 32 GB | 32 GB |
4XL | 1,000,000 | 24 | 56 | 250 | 950 | 0.021 sec | 0.033 sec | 39 GB | 64 GB |
8XL | 1,000,000 | 24 | 56 | 250 | 1650 | 0.016 sec | 0.023 sec | 40 GB | 128 GB |
12XL | 1,000,000 | 24 | 56 | 250 | 1900 | 0.015 sec | 0.021 sec | 38 GB | 192 GB |
16XL | 1,000,000 | 24 | 56 | 250 | 2200 | 0.015 sec | 0.020 sec | 40 GB | 256 GB |
Accuracy was 0.99 for all benchmarks.
QPS can also be improved by increasing `m` and `ef_construction`. This allows you to use a smaller value for `ef_search`, which in turn increases QPS. For example, increasing `m` to 32 and `ef_construction` to 80 on a 4XL increases QPS to 1280.
It is possible to store more vectors in a single table if memory allows it (for example, 4XL compute add-ons and above for OpenAI embeddings), but this will affect query performance: QPS will be lower and latency will be higher. Scaling should be almost linear, but we recommend benchmarking your workload to find the optimal number of vectors per table and per database instance.
IVFFlat
384 dimensions
This benchmark uses the dbpedia-entities-openai-1M dataset containing 1,000,000 text embeddings, regenerated as 384-dimensional embeddings using gte-small.
Compute Size | Vectors | Lists | Probes | QPS | Latency Mean | Latency p95 | RAM Usage | Total RAM |
---|---|---|---|---|---|---|---|---|
Micro | 100,000 | 500 | 50 | 205 | 0.048 sec | 0.066 sec | 1.2 GB (Swap) | 1 GB |
Small | 250,000 | 1000 | 60 | 160 | 0.062 sec | 0.079 sec | 2 GB | 2 GB |
Medium | 500,000 | 2000 | 80 | 120 | 0.082 sec | 0.104 sec | 3.2 GB | 4 GB |
Large | 1,000,000 | 5000 | 150 | 75 | 0.269 sec | 0.375 sec | 6.5 GB | 8 GB |
XL | 1,000,000 | 5000 | 150 | 150 | 0.131 sec | 0.178 sec | 9 GB | 16 GB |
2XL | 1,000,000 | 5000 | 150 | 300 | 0.066 sec | 0.099 sec | 10 GB | 32 GB |
4XL | 1,000,000 | 5000 | 150 | 570 | 0.035 sec | 0.046 sec | 10 GB | 64 GB |
8XL | 1,000,000 | 5000 | 150 | 1400 | 0.023 sec | 0.028 sec | 12 GB | 128 GB |
12XL | 1,000,000 | 5000 | 150 | 1550 | 0.030 sec | 0.039 sec | 12 GB | 192 GB |
16XL | 1,000,000 | 5000 | 150 | 1800 | 0.030 sec | 0.039 sec | 16 GB | 256 GB |
960 dimensions
This benchmark uses the gist-960 dataset, which contains 1,000,000 embeddings of images. Each embedding is 960 dimensions.
Compute Size | Vectors | Lists | QPS | Latency Mean | Latency p95 | RAM Usage | Total RAM |
---|---|---|---|---|---|---|---|
Micro | 30,000 | 30 | 75 | 0.065 sec | 0.088 sec | 1.1 GB (Swap) | 1 GB |
Small | 100,000 | 100 | 78 | 0.064 sec | 0.092 sec | 1.8 GB | 2 GB |
Medium | 250,000 | 250 | 58 | 0.085 sec | 0.129 sec | 3.2 GB | 4 GB |
Large | 500,000 | 500 | 55 | 0.088 sec | 0.140 sec | 5 GB | 8 GB |
XL | 1,000,000 | 1000 | 110 | 0.046 sec | 0.070 sec | 14 GB | 16 GB |
2XL | 1,000,000 | 1000 | 235 | 0.083 sec | 0.136 sec | 10 GB | 32 GB |
4XL | 1,000,000 | 1000 | 420 | 0.071 sec | 0.106 sec | 11 GB | 64 GB |
8XL | 1,000,000 | 1000 | 815 | 0.072 sec | 0.106 sec | 13 GB | 128 GB |
12XL | 1,000,000 | 1000 | 1150 | 0.052 sec | 0.078 sec | 15.5 GB | 192 GB |
16XL | 1,000,000 | 1000 | 1345 | 0.072 sec | 0.106 sec | 17.5 GB | 256 GB |
1536 dimensions
This benchmark uses the dbpedia-entities-openai-1M dataset, which contains 1,000,000 embeddings of text. Each embedding has 1536 dimensions and was created with the OpenAI Embeddings API.
Compute Size | Vectors | Lists | QPS | Latency Mean | Latency p95 | RAM Usage | Total RAM |
---|---|---|---|---|---|---|---|
Micro | 20,000 | 40 | 135 | 0.372 sec | 0.412 sec | 1.2 GB (Swap) | 1 GB |
Small | 50,000 | 100 | 140 | 0.357 sec | 0.398 sec | 1.8 GB | 2 GB |
Medium | 100,000 | 200 | 130 | 0.383 sec | 0.446 sec | 3.7 GB | 4 GB |
Large | 250,000 | 500 | 130 | 0.378 sec | 0.434 sec | 7 GB | 8 GB |
XL | 500,000 | 1000 | 235 | 0.213 sec | 0.271 sec | 13.5 GB | 16 GB |
2XL | 1,000,000 | 2000 | 380 | 0.133 sec | 0.236 sec | 30 GB | 32 GB |
4XL | 1,000,000 | 2000 | 720 | 0.068 sec | 0.120 sec | 35 GB | 64 GB |
8XL | 1,000,000 | 2000 | 1250 | 0.039 sec | 0.066 sec | 38 GB | 128 GB |
12XL | 1,000,000 | 2000 | 1600 | 0.030 sec | 0.052 sec | 41 GB | 192 GB |
16XL | 1,000,000 | 2000 | 1790 | 0.029 sec | 0.051 sec | 45 GB | 256 GB |
For 1,000,000 vectors, 10 probes results in an accuracy of 0.91; for 500,000 vectors and below, 10 probes results in an accuracy in the range of 0.95-0.99. To increase accuracy, increase the number of probes.
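For example, reusing the `docs` collection from earlier, trading QPS for accuracy looks roughly like this (the query-time `probes` keyword assumes a recent vecs release; on older versions, run `SET ivfflat.probes = 40;` in SQL instead):

```python
# More probes -> more lists scanned -> higher accuracy, but lower QPS.
results = docs.query(
    data=[0.1] * 384,  # your query embedding (placeholder)
    limit=10,
    measure=vecs.IndexMeasure.max_inner_product,
    probes=40,         # illustrative value; tune against your recall target
)
```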
It is possible to store more vectors in a single table if memory allows it (for example, 4XL compute add-ons and above for OpenAI embeddings), but this will affect query performance: QPS will be lower and latency will be higher. Scaling should be almost linear, but we recommend benchmarking your workload to find the optimal number of vectors per table and per database instance.
Performance tips
There are various ways to improve your pgvector performance. Here are some tips:
Pre-warming your database
It's useful to execute a few thousand “warm-up” queries before going into production. This helps with RAM utilization and can also help confirm that you've selected the right compute size for your workload.
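A minimal warm-up sketch, assuming the `docs` collection from earlier (the query count and dimensionality are illustrative; replaying real, representative queries warms the cache even better):

```python
import numpy as np

# Issue a few thousand throwaway queries so index pages get pulled into RAM.
rng = np.random.default_rng(seed=0)
for _ in range(2000):
    v = rng.standard_normal(384)
    v /= np.linalg.norm(v)  # unit-normalize, matching typical embeddings
    docs.query(
        data=v.tolist(),
        limit=10,
        measure=vecs.IndexMeasure.max_inner_product,
    )
```

Watching RAM usage during the warm-up is also a quick way to spot whether the index fits in memory or is spilling into swap.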
Finetune index parameters
You can increase the Requests per Second (QPS) by increasing `m` and `ef_construction` (HNSW) or `lists` (IVFFlat). This comes with an important caveat: building the index takes longer with higher values for these parameters.
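For `lists` and `probes`, pgvector's documentation suggests starting points of rows / 1000 for `lists` (up to ~1M rows, `sqrt(rows)` beyond that) and `sqrt(lists)` for `probes`; note that the benchmark tables above sometimes use different values where they performed better. A small helper to compute those starting points:

```python
import math

def ivfflat_starting_params(row_count: int) -> tuple[int, int]:
    """pgvector's suggested starting points for (lists, probes)."""
    # lists: rows / 1000 up to 1M rows, sqrt(rows) beyond that
    if row_count <= 1_000_000:
        lists = max(row_count // 1000, 1)
    else:
        lists = round(math.sqrt(row_count))
    # probes: sqrt(lists) is the suggested starting point
    probes = max(round(math.sqrt(lists)), 1)
    return lists, probes

print(ivfflat_starting_params(500_000))    # (500, 22)
print(ivfflat_starting_params(4_000_000))  # (2000, 45)
```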
Check out more tips and the complete step-by-step guide in Going to Production for AI applications.
Benchmark methodology
We follow techniques outlined in the ANN Benchmarks methodology. A Python test runner is responsible for uploading the data, creating the index, and running the queries. The pgvector engine is implemented using vecs, a Python client for pgvector.
Each test is run for a minimum of 30-40 minutes and includes a series of experiments executed at different concurrency levels to measure the engine's performance under different load types. The results are then averaged.
As a general recommendation, we suggest using a concurrency level of 5 or more for most workloads and 30 or more for high-load workloads.