This post looks at the technology under the hood of AWS Redshift, based on the official materials published by AWS.
In general, AWS Redshift is a distributed relational database that uses a columnar architecture for OLAP workloads.
- Amazon Redshift is based on PostgreSQL 8.0.2, but its implementation is customized for OLAP workloads.
- A cluster contains a leader node and compute nodes; data is stored on the compute nodes. The leader node manages communication with client programs and all communication with the compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation. Each compute node is partitioned into slices, and each slice is allocated a portion of the node’s memory and disk space.
- Columnar storage: each data block stores the values of a single column for multiple rows. As records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns. Columnar storage is more efficient for aggregate queries over very large data sets, because a query reads only the blocks of the columns it references.
- Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.
- Data compression: Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types.
- And many other features, such as the query optimizer, result caching, and compiled code.
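
To make the points in the list above more concrete, here is a toy sketch in Python of how slices, columnar storage, and leader-side aggregation fit together. It is only an illustration under stated assumptions, not how Redshift is actually implemented: the `(order_id, region, amount)` schema, the modulo distribution on `order_id`, and every function name are made up for this example, and run-length encoding stands in for just one of the possible compression encodings.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical schema for illustration: (order_id, region, amount).
Row = Tuple[int, str, float]


def distribute(rows: List[Row], num_slices: int) -> List[List[Row]]:
    """Assign each row to a slice using a simple modulo on order_id (an assumed key)."""
    slices: List[List[Row]] = [[] for _ in range(num_slices)]
    for row in rows:
        slices[row[0] % num_slices].append(row)
    return slices


def to_columns(rows: List[Row]) -> Dict[str, list]:
    """Turn row-oriented records into one list of values per column."""
    columns: Dict[str, list] = {"order_id": [], "region": [], "amount": []}
    for order_id, region, amount in rows:
        columns["order_id"].append(order_id)
        columns["region"].append(region)
        columns["amount"].append(amount)
    return columns


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Compress runs of repeated values into (value, count) pairs."""
    encoded: List[Tuple[object, int]] = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded


def slice_partial_sum(columns: Dict[str, list]) -> Dict[str, float]:
    """Per-slice work: SUM(amount) GROUP BY region, touching only two columns."""
    partial: Dict[str, float] = defaultdict(float)
    for region, amount in zip(columns["region"], columns["amount"]):
        partial[region] += amount
    return dict(partial)


def leader_merge(partials: List[Dict[str, float]]) -> Dict[str, float]:
    """Leader-side step: merge the intermediate results from all slices."""
    total: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for region, subtotal in partial.items():
            total[region] += subtotal
    return dict(total)


rows: List[Row] = [
    (1, "eu", 10.0), (2, "us", 20.0), (3, "eu", 5.0),
    (4, "us", 7.5), (5, "eu", 2.5), (6, "ap", 1.0),
]

# Each slice holds its share of the rows, stored column by column.
slices = [to_columns(part) for part in distribute(rows, num_slices=2)]

# Every slice runs the same aggregation on its own data; the leader merges.
result = leader_merge([slice_partial_sum(cols) for cols in slices])
print(result)

# A column with repeated values compresses well with run-length encoding.
print(run_length_encode(slices[1]["region"]))
```

Note that the aggregation never reads the `order_id` column, which is the main reason a columnar layout helps this kind of query, and that the leader only sees one small dictionary per slice rather than the raw rows.
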
So this post shows you what is under the hood of AWS Redshift's design and implementation. It is an interesting and very good example of distributed system design.
This post is part of the Under the hood series.