30 May

Our Data Science Pipeline: Machine Learning Meets Traditional Business Intelligence
The role of a data science team varies from one organization to the next. Some organizations require a "traditional" business intelligence framework, in which historical data is turned into actionable insights. Others run impressively complex machine learning algorithms all day long, sometimes in real time, to transform the most up-to-date data into future predictions and personalized user experiences. Here at DonorsChoose.org, we're somewhere in the middle of that spectrum. In today's post, I'd like to walk you through our data science pipeline (with code attached!) to give a better sense of the kind of work we're doing, and to offer an example pipeline for organizations interested in the same kind of work.
We'll start with how the data is stored. Our analytics database is housed in the cloud on Amazon Redshift. Check out this previous post to hear more about the pros and cons of Redshift, and to see our open-sourced migration script from Postgres. In short, Redshift allows for scalability and fast performance by running many nodes in parallel, yet speaks traditional SQL, so you don't have to learn a new language.
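To make that last point concrete, here is a minimal sketch of pulling daily counts out of Redshift from Python. Because Redshift speaks the Postgres wire protocol, a standard Postgres driver like psycopg2 works unchanged; the table name, column names, and connection details below are hypothetical placeholders, not our actual schema.

```python
# Sketch: query Redshift with ordinary Postgres-style SQL.
# Table/column names and connection details are hypothetical.

DAILY_COUNTS_SQL = """
SELECT posted_date, COUNT(*) AS live_projects
FROM projects                -- hypothetical table name
WHERE status = 'live'        -- hypothetical status column
GROUP BY posted_date
ORDER BY posted_date;
"""

def fetch_daily_counts(connect=None):
    """Run the query; `connect` is a psycopg2-style connection factory,
    injected so the function can be exercised without a live cluster."""
    if connect is None:
        import psycopg2  # pip install psycopg2-binary
        def connect():
            return psycopg2.connect(
                host="example.redshift.amazonaws.com",  # placeholder endpoint
                port=5439, dbname="analytics",
                user="readonly", password="...",
            )
    with connect() as conn, conn.cursor() as cur:
        cur.execute(DAILY_COUNTS_SQL)
        return cur.fetchall()
```

Passing the connection factory in as an argument keeps the query logic testable without network access, which is handy when the same code runs on a schedule.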
The Business Intelligence Tool
We've found a great home for our data, but how do we go about pulling the data that we need, when we need it? Enter Looker. Gone are the days of hand-writing long custom SQL queries for every question. Looker is a business intelligence tool that connects the data to a user-friendly interface and automatically generates SQL based on user selections, while simultaneously creating beautiful interactive visualizations.
The Analytic Pipeline
So our data lives in the cloud, and we've written all the instructions for Looker to understand and pull our data. We are now able to generate reports, create dashboards, and analyze past trends… so where does machine learning come in? Suppose we wanted to know the probability that a user will return to our website, or a user's affinity for donating to a specific subject area. Machine learning can help with that. Here's another important business question that came up recently in our office: how many projects can we expect to be live on our website on a given future date? The answer could help us anticipate the volume of traffic we'll have on Teacher Appreciation Day, informing us of the staffing we would need to keep our operations running smoothly. It could also help foster partnerships by projecting how much it would cost to fully fund an entire city on Back to School Day.
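The user-level predictions mentioned above don't require anything exotic. As a toy illustration only (synthetic data and made-up features, not our production model), "will this user return?" can be framed as binary classification, for example with scikit-learn's logistic regression:

```python
# Toy sketch: predict the probability that a user returns.
# Data and features are entirely synthetic; a real model would be
# trained on historical user activity pulled from the warehouse.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical features: e.g. past donation count, recency of last visit
X = rng.normal(size=(500, 2))
# Synthetic label: higher feature 0 / lower feature 1 means more likely to return
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
p_return = model.predict_proba(X[:5])[:, 1]  # per-user return probability
```

The output of `predict_proba` is exactly the kind of score you'd write back to the database so that dashboards and email campaigns can use it.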
For this kind of task, we'd want to run a script that pulls the data, performs the prediction, and then loads the results back into the database. We'd also want this to run on a schedule (perhaps daily) so that we always have the most up-to-date information while staying completely hands-off. Check out this script on our GitHub repository, which uses both Python and R to predict live project counts a year into the future. Our data is provided so that you can try the forecasting yourself, but the script also includes steps for data downloads and uploads from an Amazon Redshift database in case you want to set up a similar architecture.
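The script linked above does the actual forecasting in R. As a deliberately simple stand-in for the "perform the prediction" step, here is a seasonal-naive baseline in Python: repeat the most recent year of daily counts, scaled by year-over-year growth. The function name and method are my own illustration, not what the repository implements.

```python
import numpy as np

def seasonal_naive_forecast(daily_counts, horizon=365, season=365):
    """Forecast `horizon` future values by repeating the most recent
    season, scaled by the ratio of the last two seasons' means."""
    counts = np.asarray(daily_counts, dtype=float)
    last = counts[-season:]                # most recent year of counts
    prev = counts[-2 * season:-season]     # the year before that
    # Year-over-year growth factor; fall back to 1.0 with < 2 years of data
    growth = last.mean() / prev.mean() if prev.size and prev.mean() else 1.0
    reps = -(-horizon // season)           # ceiling division
    return np.tile(last, reps)[:horizon] * growth
```

A real forecast would also model trend, holidays, and uncertainty intervals, but a baseline like this is useful for sanity-checking whatever fancier model the scheduled job runs.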
This is what we start with:
Each colored line is a fiscal year, running from July 1st (the first point on the left) to June 30th (the last point on the right). Each point shows the number of live projects on that day.
And this is what we get:
Same as above, except the remainder of this fiscal year and all days in the next fiscal year have been forecast.