Postgres Café: Deploying distributed PostgreSQL at scale with Citus Data

It’s time for the fourth episode of Postgres Café, a podcast from our teams at Data Bene and Xata where we discuss PostgreSQL contribution and extension development. In this latest episode, Sarah Conway and Gülçin Yıldırım Jelinek meet with Stéphane Carton to cover Citus Data, a fully open-source extension from Microsoft for deploying distributed PostgreSQL at scale.
Episode 4: Citus Data
The Citus database has seen 127 releases since March 24, 2016, when it was first open-sourced for public use and contributions. It’s a powerful tool that works natively with PostgreSQL and integrates with the wider ecosystem of Postgres tools and extensions. Continue reading for a summary of what we covered in this podcast episode!
Addressing scalability, performance, and the management of large datasets
So why does Citus Data exist, and what problems does it solve? Let’s delve into this by category.
Development
Citus addresses the distributed data modeling problem by providing ways to map your workload onto a distributed model, such as sharding tables on a distribution key that is usually part of the primary key (especially useful for microservices and high-throughput workloads).
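As a minimal sketch of what this looks like in practice (assuming a hypothetical multi-tenant events table keyed by tenant_id), a regular table becomes a distributed, sharded table through Citus's create_distributed_table() function:

```sql
-- Hypothetical multi-tenant table; tenant_id acts as the distribution key.
CREATE TABLE events (
    tenant_id  bigint NOT NULL,
    event_id   bigint NOT NULL,
    payload    jsonb,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, event_id)
);

-- Shard the table on tenant_id so each tenant's rows land on the same shard,
-- keeping per-tenant queries and joins local to a single worker.
SELECT create_distributed_table('events', 'tenant_id');
```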
Scalability
Distributing data across multiple nodes enables horizontal scaling of PostgreSQL databases.
This lets developers combine the CPU, memory, storage, and I/O capacity of multiple machines to handle large datasets and high-traffic workloads. As your data volume grows, it’s simple to add more worker nodes to the cluster and rebalance the shards, as sketched below.
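A rough sketch of that workflow, using Citus's node-management functions (the worker hostname here is just a placeholder):

```sql
-- Register a newly provisioned worker node with the coordinator.
SELECT citus_add_node('worker-3.example.internal', 5432);

-- Spread existing shards evenly across all workers, including the new one.
SELECT rebalance_table_shards();
```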
Performance
The distributed query engine in Citus maximizes efficiency by parallelizing queries and batching execution across multiple worker nodes.
Even at thousands to millions of statements per second, data ingestion stays efficient: Citus finds the right shard placements, connects to the appropriate worker nodes, and performs operations in parallel. All of this ensures high throughput and low latency for real-time data ingestion.
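For illustration, using the hypothetical events table from above: an analytical query is split by the coordinator into one fragment per shard and executed on the workers in parallel, and bulk loads are routed to the right shard placements in the same way:

```sql
-- The coordinator rewrites this into per-shard fragment queries, runs them
-- on the worker nodes in parallel, and merges the partial results.
SELECT tenant_id, count(*) AS events_last_day
FROM events
WHERE created_at > now() - interval '1 day'
GROUP BY tenant_id;

-- Bulk ingestion: rows are routed to the correct shard placements and
-- written to the workers concurrently.
COPY events (tenant_id, event_id, payload) FROM STDIN WITH (FORMAT csv);
```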
High Availability & Redundancy
The distributed data model lets you keep redundant copies of tables and spread shard data across multiple nodes, so the database remains resilient and available even when individual nodes fail.
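As a sketch (again with hypothetical table names): small, shared tables can be replicated in full to every node as reference tables, and shards of a distributed table can be kept in multiple placements by raising the shard replication factor before distributing it:

```sql
-- Keep a full copy of a small, shared lookup table on every node,
-- so joins stay local and the table survives the loss of a single node.
SELECT create_reference_table('tenants');

-- Keep two placements of every shard for tables distributed from here on
-- (statement-based shard replication).
SET citus.shard_replication_factor = 2;
SELECT create_distributed_table('event_archive', 'tenant_id');
```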
Contributing to Citus
At Data Bene, our goal is to keep the upstream source code moving forward through ongoing development and code contributions. Cédric Villemain, among others on our team, continually assesses new features and other improvements that can make a difference for users.
Whether you’re part of a DevOps team looking to build out a distributed architecture for your PostgreSQL instances, or an end user such as a business analyst who needs efficient performance over vast amounts of data, Citus Data may be the perfect extension for your use case.
If you have specific feature requests or concerns, our team at Data Bene will help you contribute directly to Citus Data, or can contribute on your behalf, to ensure the longevity of the project and its relevance to your work. Learn more about contributing to Citus Data by referencing the official CONTRIBUTING.md file.
Watch the full episode
Thinking about watching the full discussion? Check it out on YouTube:
Stay tuned for more Postgres tools
More episodes of Postgres Café are on the way! Subscribe to the playlist for more interviews around open-source tools like StatsMgr for efficient statistics management in PostgreSQL, pgzx for building PostgreSQL extensions in Zig, and more. Get ideas from the experts for new extensions to try out and maximize your Postgres deployments.