Redshift

At Nexmo we have used Redshift as our production DWH for around five years; I have been involved with it for over two of those.

Tips & Lessons Learned

  • Views are executed with the permissions of the user who created the view, although the user querying the view will also require USAGE on the schemas the view selects data from (see the first sketch after this list)
  • The COPY command does not deal well with large numbers of small files (see the second sketch after this list)
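
A minimal sketch of the grants involved for the first point, using hypothetical schema, view, and user names: the querying user gets SELECT on the view plus USAGE on both schemas, but no SELECT on the underlying table, since the view itself runs with the creator's permissions.

    -- Hypothetical names throughout (reporting, source_schema, reporting_user).
    -- The view executes with the creator's permissions, so reporting_user does
    -- not need SELECT on source_schema.events, but does need USAGE on its schema.
    CREATE VIEW reporting.daily_events AS
        SELECT event_date, count(*) AS events
        FROM source_schema.events
        GROUP BY event_date;

    GRANT USAGE ON SCHEMA reporting TO reporting_user;
    GRANT USAGE ON SCHEMA source_schema TO reporting_user;
    GRANT SELECT ON reporting.daily_events TO reporting_user;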
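For the second point, the usual mitigation is to combine the small objects into a handful of larger, compressed files before loading, rather than pointing COPY at thousands of tiny keys. A sketch with hypothetical table, bucket, and IAM role names:

    -- Hypothetical table, bucket, and role names.
    -- COPY loads every object under the prefix; a few large gzipped files
    -- (ideally a multiple of the number of slices) load far faster than
    -- thousands of small ones.
    COPY analytics.raw_events
    FROM 's3://my-bucket/events/2019/06/combined/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    GZIP
    FORMAT AS JSON 'auto';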

Issues Encountered

Redshift doesn’t deal well with tables that are large relative to the size of the cluster. Taking a backup of such a table means creating a deep copy, which requires roughly 2.5x the space of the table. Similarly, loading data back in from S3 (after an UNLOAD) also requires extra space for sorting. This comes up surprisingly often with semi-structured data, where you may need to alter a table’s column type because newly received values no longer fit the current definition. In the future the ALTER COLUMN command, which enables most VARCHAR lengths to be extended in place, will help with this.
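
As an illustration of the two routes, with hypothetical table and column names: the in-place VARCHAR extension, and the deep-copy approach it replaces, which is what needs the extra space described above.

    -- Hypothetical table/column names.
    -- In-place widening (no deep copy needed):
    ALTER TABLE events ALTER COLUMN payload TYPE VARCHAR(4096);

    -- The deep-copy route: recreate the table with the wider column,
    -- reload it (which also re-sorts), then swap the names.
    CREATE TABLE events_new (
        event_id BIGINT,
        payload  VARCHAR(4096)
    );
    INSERT INTO events_new SELECT event_id, payload FROM events;
    DROP TABLE events;
    ALTER TABLE events_new RENAME TO events;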

Some JOIN and aggregation operations distribute over UNION ALL, but Redshift does not appear to take advantage of this: it seems to materialise the union of the tables first, which can be massively inefficient.
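
The workaround is to push the aggregation into each branch of the UNION ALL by hand. A sketch with hypothetical per-year tables:

    -- Naive form: the whole UNION ALL appears to be materialised first.
    SELECT account_id, count(*) AS calls
    FROM (
        SELECT account_id FROM calls_2018
        UNION ALL
        SELECT account_id FROM calls_2019
    ) AS u
    GROUP BY account_id;

    -- Distributed form: aggregate each branch, then combine the much
    -- smaller partial results.
    SELECT account_id, sum(calls) AS calls
    FROM (
        SELECT account_id, count(*) AS calls FROM calls_2018 GROUP BY account_id
        UNION ALL
        SELECT account_id, count(*) AS calls FROM calls_2019 GROUP BY account_id
    ) AS per_year
    GROUP BY account_id;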

Resizing

As the size of our cluster grew, we noticed that the downtime (read-only mode) while resizing the cluster wasn’t constant: it went from a few hours, which was acceptable for us, to significantly longer. So if you are in a rapidly evolving environment where data sizes are not known in advance, this can be a pain.

This issue was somewhat alleviated by the “elastic resize” feature that was later introduced. However, it could only move to certain fixed proportions of the base cluster size set by the last “classic resize”, e.g. if your last classic resize was to 10 nodes, you could elastically resize to 5 or 20 nodes (of the same type).