tl;dr: Use HyperLogLog, it's a reasonable approach with great trade-offs and no large architectural liabilities. For a quick & dirty prototype, use hstore, which also performs the best with integer IDs.
The year is 2022. You're head DBA at the hot new social site, SupaBook... Your startup is seeing eye-boggling growth because everyone loves fitting their hot-takes in posts restricted to VARCHAR(256).
Why VARCHAR(256)? No particular reason, but you don't have time to get hung up on that or ask why -- you just found out that the priority this quarter is tracking content views across all posts in the app.
"It sounds pretty simple" a colleague at the meeting remarks -- "just an increment here and an increment there and we'll know which posts are seen the most on our platform". You start to explain why it will be non-trivial, but the meeting ends before you can finish.
Well, it's time to figure out how you're going to do it. There's been a complexity freeze at the company, so you're not allowed to bring in any new technology, but you don't mind that because for v1 you would have picked Postgres anyway. Postgres's open source pedigree, robust suite of features, stable internals, and awesome mascot Slonik make it a strong choice, and it's what you're already running.
(insert record scratch here)
Sure, this scenario isn't real, but it could be - that last part about Postgres definitely is. Let's see how you might solve this problem, as that imaginary DBA.
-- Create a email domain to represent and constraing email addresses
CREATE DOMAIN email
AS citext
CHECK ( LENGTH(VALUE) <= 255 AND value ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$' );
COMMENT ON DOMAIN email is 'lightly validated email address';
-- Create the users table
CREATE TABLE users (
id bigserial PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
uuid uuid NOT NULL DEFAULT uuid_nonmc_v1(),
email email NOT NULL,
name text,
about_html text,
created_at timestamptz NOT NULL DEFAULT NOW()
);
-- Create the posts table
CREATE TABLE posts (
id bigserial PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
uuid uuid NOT NULL DEFAULT uuid_nonmc_v1(),
title text,
content text,
main_image_src text,
main_link_src text,
created_by bigint REFERENCES users(id),
last_hidden_at timestamptz,
last_updated_at timestamptz,
created_at timestamptz NOT NULL DEFAULT NOW()
);
This basic setup has taken the (imaginary) company quite far -- even though the posts table has millions and millions of entries, Postgres chugs along and serves our queries with impressive speed and reliability. Scaling up is the new (and old) scaling out.
Well we can't pat ourselves for our miraculous and suspiciously simple DB architecture all day, let's move on to the task at hand.
Like any good tinkerer we'll start with the simplest solutions and work our way up in complexity to try and get to something outstanding, testing our numbers as we go.
Try #1: The naive way, a simple counter on every Post#
The easiest obvious way to do this is to maintain a counter on every tuple in the posts table. It's obvious, and it's almost guaranteed to work -- but maybe not work well.
The migration to make it happen isn't too difficult:
hideCopy
BEGIN;
ALTER TABLE posts ADD COLUMN seen_by_count;
COMMENT ON COLUMN posts.seen_by_count
IS 'simple count of users who have seen the post';
COMMIT;
There's one obvious glaring issue here -- what if someone sees the same post twice? Every page reload would cause inflated counts in the seen_by_count column, not to mention a lot of concurrent database updates (which isn't necessarily Postgres's forte to begin with).
Clearly there's a better way to do things but before that...
Writing a test suite before the CPUs get hot and heavy#
How will we know which approach is better without numbers?! Measuring complexity and feeling can only get us so far -- we need to get some numbers that tell us the performance of the solution at the stated tasks -- we need benchmarks.
Before we can declare any solution the best, in particular we need a baseline!. The simplest possible incorrect solution (simply incrementing a counter on the Post) is probably a reasonable thing to use as a benchmark, so let's take a moment to write our testing suite.
Let's do this the simplest one might imagine:
Generate a large amount of users
Lets model for 1000, 10k, 100K, 1MM, and 10MM users
Generate an even larger amount of fake posts attributed to those users
This is a bit harder -- we need to define a general distribution for our users that's somewhat informed by real life...
Generate a description of "events" that describe which post was seen by whom, which we can replay.
We want the equivalent of an effect system or monadic computation, which is easier than it sounds -- we want to generate an encoding (JSON, probably) of what to do, without actually doing it
We'll just do consistent "as fast as we can" execution (more complicated analysis would burst traffic to be ab it closer to real life)
Nothing too crazy in there -- we generate a bunch of JSON, and force it out to disk. It's best to avoid trying to keep it in memory so we can handle much larger volumes than we might be able to fit in memory.
Along with users, we need to generate posts that they can view. We'll keep it simple and take an amount of posts to make, generating from 0 to count of those.
It's very similar to the user generation code, with the caveat that we can take into account the 80/20 lurker/poster rule. here's what that looks like:
Nothing too crazy here, and some back of the envelope estimations on how often each operation would normally be called. These numbers could be tweaked more, but we should see a difference between approaches even if we messed up massively here.
Alright, finally we're ready. Let's see what we get on our naive solution. We expect this to be pretty fast, because not only is it wrong, but it's just about the simplest thing you could do.
On my local machine, here's our baseline (output from autocannon):
Now that we've got a basic baseline of our tests, let's continue trying out ideas:
Try #2: Storing the users who did the "see"ing, with hstore#
The next obvious thing (and probably a core requirement if we'd asked around), is knowing who viewed each post. Well if we need to know who, then we probably need to store some more information!
Postgres has native support for arrays and a data structure called a hstore, so let's try those. It's pretty obvious that having hundreds, thousands, or millions of entries in one of these data structures, inside a tuple isn't the greatest idea, but let's try it anyway and let the numbers speak for themselves.
Here's what the migration would look like:
hideCopy
BEGIN;
CREATE EXTENSION IF NOT EXISTS hstore;
ALTER TABLE posts ADD COLUMN seen_count_hstore hstore
NOT NULL DEFAULT ''::hstore;
COMMENT ON COLUMN posts.seen_count_hstore
IS 'count of users that have seen the post, with hstore';
COMMIT;
hstore provides support for both GIST and GIN indices, but after reading the documentation we can conclude that we don't necessarily need those for the current set of functionality.
Well as you might have imagined, this is obviously pretty bad and will eventually be hard to scale. If you expect only 0-50 entries in your column text[] is perfectly fine, but thousands or millions is another ballgame.
Thinking of how to scale this, a few ideas pop to mind:
Not too far off! While we didn't try the pathological case(s) of millions of people liking the same post to hit breaking point, a slightly more random distribution seems to have done decently -- we actually have lower 99.999th percentile latency versus the simple counter.
An average of 2.15ms versus 2.05ms with the simpler counter is a ~4% increase in the average latency (though of course, the p99.999 is lower!).
Try #3: An Association table for remembering who liked what#
A likely requirement from the original scenario that we've completely ignored is remembering which users liked a certain post to. The easiest solution here is an "associative" table like this one:
In SQL:
hideCopy
begin;
create table posts_seen_by_users (
post_id bigint references posts (id),
user_id bigint references users (id),
seen_count bigint not null default 0 check (seen_count > 0),
In production, you're going to want to do a few things to make this even remotely reasonable long term:
PARTITION the table (consider using partition-friendly pg_partman)
Move old partitions off to slower/colder storage and maintain snapshots
Summarize older content that might be seen lots
Consider a partitioning key up front -- post IDs are probably a reasonable thing to use if they're sufficiently randomly distributed
These are good initial stop-gaps, but a realistic setup will have many problems and many more solutions to be discovered.
(It will be a recurring theme but this is a spot where we probably don't necessarily want to use stock Postgres but instead want to use tools like Citus Columnar Storage, ZedStore, or an external choice like ClickHouse).
What's HyperLogLog you ask? Well it's just a probabilistic data structure! Don't worry if you've never heard of it before, it's a reasonably advanced concept.
You may have heard of Bloom Filters and they're somewhat related but they're not quite a great fit for the problem we're solving since we want to know how many people have seen a particular post. Knowing whether one user has seen a particular post is useful too -- but not quite what we're solving for here (and we'd have to double-check our false positives anyway if we wanted to be absolutely sure).
HyperLogLog provides a probabilistic data structure that is good at counting distinct entries, so that means that the count will not be exact, but be reasonably close (depending on how we tune). We won't have false positives (like with a bloom filter) -- we'll have a degree of error (i.e. the actual count may be 1000, but the HLL reports 1004).
We have to take this into account on the UI side but and maybe retrieve the full count if anyone ever really needs to know/view individual users that have seen the content, so we can fall back to our association table there.
NOTE that we still have access to the association table -- and while we still insert rows into it, we can drop the primary key index, and simply update our HLL (and leave ourselves a note on when we last updated it).
There's not much to add to this solution, as the heavy lifting is mostly done by postgresql-hll, but there's one big caveat:
This approach will need a custom Postgres image for this, since hll is not an official contrib module
There are also a few optimizations that are easy to imagine:
Batching inserts to the association table (storing them in some other medium in the meantime -- local disk, redis, etc)
Writing our association table entries in a completely different storage medium altogether (like object storage) and use Foreign Data Wrappers and pg_cron and delay or put off processing all together
Another somewhat nuanced degradation in performance -- while the 99.99%ile latency was nearly 2x higher, the average latency was actually lower than the assoc-table approach @ 2.28ms.
The average latency on the HLL approach is 11% worse than simple-counter, 6% worse than simple-hstore, and faster than assoc-table alone, which is an improvement.
One of the great things about Postgres is it's expansive ecosystem -- while Postgres may (and frankly should not) beat the perfect specialist tool for your use case, it often does an outstanding job in the general case.
Let's look into some more experiments that could be run -- maybe one day in the future we'll get some numbers behind these (community contributions are welcome!).
As is usually the case in academia and practice, we can make our problem drastically easier by simply changing the data structures we use to model our problem!
One such reconfiguration would be storing the information as a graph:
As you might imagine, finding the number of "seen-by" relations would simply be counting the number of edges out of one of the nodes!
Well, the Postgres ecosystem has us covered here too! AGE is an extension that allows you to perform graph related queries in Postgres.
We won't pursue it in this post but it would be a great way to model this problem as well -- thanks to the extensibility of Postgres, this data could live right next to our normal relational data as well.
OK, so what's the answer at the end of the day? What's the best way to get to that useful v1? Here are the numbers:
In tabular form:
Approach
Avg (ms)
99%ile (ms)
99.999%ile (ms)
simple-counter
2.03
6
23
simple-hstore
2.15
6
16
assoc-table
2.5
8
30
hll
2.16
7
27
If we go strictly with the data, the best way looks to be the hstore-powered solution, but I think the HLL is probably the right choice.
The HLL results were quite variable -- some runs were faster than others, so I've taken the best of 3 runs.
Even though the data says hstore, knowing that posts will be seen by more and more people over time, I might choose the HLL solution for an actual implementation. It's far less likely to pose a bloated row problem, and it has the absolute correctness (and later recall) of the assoc-table solution, while performing better over all (as you can imagine, no need to COUNT rows).
Another benefit of the HLL solution is that PostgreSQL tablespaces allow us to put the association table on a different, slower storage mechanism, and keep our posts table fast. Arguably in a real system we might have the HLL in something like redis but for a v1, it looks like Postgres does quite well!
I hope you enjoyed this look down the trunk hole, and you've got an idea of how to implement solutions to surprisingly complex problems like this one with Postgres.
As usual, Postgres has the tools to solve the problem reasonably well (if not completely) before you reach out for more complicated/standalone solutions.
See any problems with the code, solutions that haven't been tried? -- reach out, or open an issue!