Hassan Munir Posted on May 30 • Originally published at hassanmunir.me Three Ways to Set Up CDC from Postgres to ClickHouse # postgres # clickhouse # database # dataengineering You cannot run analytical queries on the same Postgres primary that serves your application without paying for it in CPU and connections. A read replica does not help: Postgres is row-oriented and built for OLTP, not for scanning tens of millions of rows for a GROUP BY . If you want sub-second dashboards over a real dataset you need a column store on the side. We picked ClickHouse. The interesting question is not how to query ClickHouse. The interesting question is how to keep it in sync with Postgres. That is what Change Data Capture (CDC) solves. CDC is mechanical: every insert, update, and delete in Postgres gets captured and forwarded somewhere else. Postgres already writes everything to the Write-Ahead Log (WAL) for crash recovery. CDC reads that log via a logical replication slot and turns it into an ordered stream you can replay into another system. Polling with WHERE updated_at > $last is the wrong tool: it cannot see deletes, it adds load, and your updated_at column will lie eventually. Read the log. There are three common ways to plug Postgres CDC into ClickHouse. We evaluated all three. We ran two of them in production. The trade-offs below come from running them. Postgres prerequisites (all three options) Logical replication needs postgresql.conf set up correctly. This applies to all three approaches. # postgresql.conf wal_level = logical max_wal_senders = 10 max_replication_slots = 10 max_slot_wal_keep_size = 102400 # MB; cap WAL retained by stalled slots Enter fullscreen mode Exit fullscreen mode Then a replication role and a publication on the source side: CREATE USER cdc_user WITH REPLICATION PASSWORD 'redacted' ; GRANT USAGE ON SCHEMA public TO cdc_user ; GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user ; ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT O
LIVE
