
Analytics in Redshift: The Migration Script

Inspired by posts from the Bitly and Airbnb data teams on Redshift as a replacement for Hadoop, we decided to give it a try. And we loved it. We loved it so much that we wrote a Postgres-to-Redshift migration script and open-sourced it.

Why Redshift?

We needed speed everywhere: our partnership team does reporting, product and marketing track feature performance, data science builds models and queries our open datasets on school needs, and all of it hooks up to Looker, our in-database business intelligence tool. I'm shouting out Looker here because they have an amazing product that we use for both analytics and data science / discovery, and they too highly recommend Redshift.

Performance

Redshift brings massively parallel processing, in the MapReduce spirit, to a relational database. Thanks to Amazon, it's easily scalable and cost effective. We agree with the Bitly team that the official documentation is good, the cost (especially for the 160GB instance) is great for a startup with less data, and the performance is insane (queries going from 60 seconds to 2, for example). Our query times have dropped 10x to 100x. We run on two nodes, which is fast enough for us; we're an easygoing crowd.

Moving from Postgres to Redshift

We have most of our data in Postgres 9.x. Every night we run a Postgres-to-Redshift data refresh job (the code is open-sourced below). It takes about 2.5 hours to dump 40GB of data from Postgres, ship it over the internet, and restore it into Redshift. If your Postgres (or other) data already lives in AWS, you should be able to halve that time, and maybe take advantage of AWS Data Pipeline. Once the data is in S3, loading 40GB onto two nodes takes only 30 minutes.
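
To make the shape of that nightly job concrete, here is a minimal sketch of the dump / ship / load pattern, assuming psycopg2 and s3cmd; the table, bucket, hosts, and credentials are placeholders, and this is not our actual script:

    import subprocess
    import psycopg2

    # All names below (table, file, bucket, hosts, credentials) are placeholders.
    TABLE = "donations"
    LOCAL_FILE = "/tmp/donations.csv"
    S3_PATH = "s3://my-bucket/nightly/donations.csv"

    # 1. Dump the table out of Postgres as CSV.
    pg = psycopg2.connect("host=pg-host dbname=prod user=etl password=***")
    with pg.cursor() as cur, open(LOCAL_FILE, "w") as f:
        cur.copy_expert("COPY %s TO STDOUT WITH CSV" % TABLE, f)

    # 2. Ship the file to S3 (s3cmd must already be configured with your keys).
    subprocess.check_call(["s3cmd", "put", LOCAL_FILE, S3_PATH])

    # 3. Load it into Redshift, which COPYs straight from S3.
    rs = psycopg2.connect("host=my-cluster.redshift.amazonaws.com port=5439 "
                          "dbname=analytics user=etl password=***")
    cur = rs.cursor()
    cur.execute("""
        COPY %s FROM '%s'
        CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=***'
        CSV;
    """ % (TABLE, S3_PATH))
    rs.commit()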

Limitations and Workarounds

The most painful part is migrating your schema and your data. Chances are your data is not clean, your schema is not 100% Redshift-compatible, and you don't want to invest a bunch of time setting up sortkeys (Redshift's rough equivalent of indexes), but you should, because they can take you from 5x to 50x performance gains. The Redshift docs are loaded with best practices, and we've incorporated as many as we could in our script.
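
For reference, sortkeys (and distkeys) are declared right in the table DDL. A small illustrative example, with made-up table and column names:

    # Illustrative DDL only -- the table and columns are made up. Declaring a
    # distkey and a compound sortkey in CREATE TABLE is what lets Redshift
    # skip blocks during scans.
    CREATE_DONATIONS = """
    CREATE TABLE donations (
        donation_id  BIGINT,
        school_id    INTEGER,
        donated_at   TIMESTAMP,
        amount       DECIMAL(10, 2)
    )
    DISTKEY (school_id)               -- co-locate rows for per-school joins
    SORTKEY (donated_at, school_id);  -- most queries filter on date first
    """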

Now that we have fresh data in Redshift every morning, we can model our data in Looker. Redshift doesn't support most correlated subqueries and, as of December 2014, has no array support. We worked around that by rewriting our data models (SQL) in a way that avoids this functionality. We also had to redo essentially all of our regexes, since Redshift's REGEXP functions don't support the same regex flavor as Postgres. That said, we are doing some really advanced analytics with it, using window functions and all sorts of analytics-centered functions that the Redshift team has baked in. Overall, the feature set is “production-ready” for us, despite having to change some of our habits.
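
As an illustration of that kind of rewrite (on a hypothetical table, not one of our actual models), a correlated subquery such as “each donation next to its school's average donation” becomes a window function:

    # Illustrative rewrite -- the table is hypothetical. Instead of a
    # correlated subquery that recomputes a per-school average for every row,
    # use a window function, which Redshift handles natively.
    SCHOOL_AVG = """
    SELECT
        donation_id,
        school_id,
        amount,
        AVG(amount) OVER (PARTITION BY school_id) AS school_avg_amount
    FROM donations;
    """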

When we model data in Looker, it queries Redshift and spits out data in all sorts of formats. For example, a dashboard:

[Screenshot: a Looker dashboard]

The Code

Code and documentation are here.

In a nutshell:
– you need to fire up a Redshift instance and get your API keys
– clone the git repo with the script onto your server
– make sure you have Python and s3cmd installed
– put your access credentials and the list of tables you want to move over to Redshift into a settings file; it's the only place you really have to adjust anything to get started, and ideally you won't have to code at all (a purely illustrative sketch follows this list)
– run the script with a single command, and look for errors in the /tmp/p2r* logs to troubleshoot
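
The settings themselves boil down to connection details, S3 credentials, and the table list. Purely as an illustration of the kind of information involved (the names and format here are hypothetical; the repo documents the real ones):

    # Purely hypothetical sketch of the information the settings file needs;
    # the real option names and format are documented in the repo.
    PG_CONN   = "host=pg-host dbname=prod user=etl password=***"
    RS_CONN   = "host=my-cluster.redshift.amazonaws.com port=5439 dbname=analytics user=etl password=***"
    S3_BUCKET = "s3://my-bucket/redshift-staging/"
    AWS_ACCESS_KEY_ID     = "AKIA..."
    AWS_SECRET_ACCESS_KEY = "***"
    TABLES = ["donations", "schools", "projects"]   # tables to move over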

You need some basic shell knowledge, but no Python knowledge at all; the script works out of the box. Several companies charge $14k+ per year to do simple Postgres or MySQL to Redshift replication. Our script does not support incremental updates to tables (it is bulk load only), but it can easily be tweaked to add them (one common approach is sketched below). If people use this code, we hope it'll save hundreds of thousands of dollars in costs and help startups (or anybody else) get analytical with Redshift.
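
One common way to layer incremental updates on top of a bulk loader is the staging-table merge pattern. A hedged sketch, with assumed table and key names; this is not something the script does today:

    # Sketch of the standard Redshift staging-table merge; table and key
    # names are assumptions, and this is not built into the script.
    INCREMENTAL_MERGE = """
    BEGIN;

    -- Stage only the new or changed rows.
    CREATE TEMP TABLE donations_staging (LIKE donations);
    COPY donations_staging FROM 's3://my-bucket/incremental/donations.csv'
    CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=***'
    CSV;

    -- Replace rows that already exist, then append everything.
    DELETE FROM donations
    USING donations_staging
    WHERE donations.donation_id = donations_staging.donation_id;

    INSERT INTO donations SELECT * FROM donations_staging;

    END;
    """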

Additional stuff we learned while writing the script:

– Although the load itself takes 30 minutes, we attained 99.99% availability by restoring into a temporary schema and then flipping it into production (see the first sketch after this list)
– Have some heavy tables with lots of text you don't want to move around? No problem: the script supports custom tables and scrubbing.
– Redshift supports some constraints and data types but not others. Our regexes clean all of that up to conform your entire Postgres schema to Redshift's schema syntax.
– The part that excites me the most is how we infer sortkeys from every constraint mentioned in the schema. If your database has few constraints or indexes, you might need to do more work; luckily, the Python script that infers and auto-adds sortkeys also supports custom declarations. It's pretty cool.
– Through trial and error we arrived at a mix of import flags that should handle almost every dirty-data situation: escaped quotes, empty fields that should be NULLs, weird dates that don't parse as dates, and so on (see the second sketch after this list)
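
On the schema flip in the first bullet: the nightly load lands in a scratch schema, and only once it has fully succeeded does it get swapped into production, so queries never see a half-loaded database. One way to express that swap in Redshift SQL (schema names are placeholders, and the script's exact mechanics may differ):

    # Sketch of the swap itself -- schema names are placeholders, and the
    # script's exact mechanics may differ. The rename happens in one quick
    # transaction, so readers almost never notice.
    FLIP_SCHEMAS = """
    BEGIN;
    ALTER SCHEMA public RENAME TO public_old;      -- retire yesterday's data
    ALTER SCHEMA public_staging RENAME TO public;  -- promote the fresh load
    END;

    DROP SCHEMA public_old CASCADE;                -- clean up afterwards
    """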
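
And on the import flags in the last bullet, here is the flavor of COPY options involved; an illustrative combination, not necessarily the exact set the script uses:

    # An illustrative combination of COPY options for dirty data -- not
    # necessarily the exact set the script settles on.
    DIRTY_DATA_COPY = """
    COPY donations
    FROM 's3://my-bucket/nightly/donations.csv'
    CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=***'
    CSV                    -- handles quoted fields with embedded delimiters
    EMPTYASNULL            -- load empty strings as NULLs
    BLANKSASNULL           -- same for whitespace-only fields
    ACCEPTANYDATE          -- load unparseable dates as NULL instead of erroring
    TRUNCATECOLUMNS        -- clip over-long strings to fit the column
    MAXERROR 10;           -- tolerate a handful of bad rows
    """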

Hope it works for you. Let me know what you think and drop a comment if you end up using our code. We’d love to know if it’s helpful.
