Another #CloudGuruChallenge project complete!
"Automate an ETL processing pipeline for COVID-19 data using Python and cloud services": #CloudGuruChallenge – Event-Driven Python on AWS
I saw this challenge when it was first posted in September 2020, but my Python and AWS skills at the time were not nearly good enough to tackle it. Fast-forward ten months and I was finally ready to give it a shot.
The idea is simple: download some data, transform and merge it, load it into a database, and create some sort of visualization for it. In practice, of course, there were lots of choices to make and plenty of new things I needed to learn to be successful.
The data sources are .csv files, updated daily, from the New York Times and Johns Hopkins University, and both are published on GitHub. I started by downloading the raw files locally, extracting them into dataframes with Pandas, and creating a separate module that would do the work of transforming and merging the data. For my local script, I created a container class to act as a database, into which I could write each row of the resulting dataframe. This allowed me to figure out the necessary logic to determine if there was data in the 'database' or not, and therefore whether to write in the entire dataset or just load any new data that wasn't already there.
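The transform-and-merge step, plus the full-load-versus-incremental-load decision, might look something like the sketch below. The column names (`date`, `cases`, `deaths`, `recovered`) and the merge key are illustrative assumptions, not the project's actual schema:

```python
# A minimal sketch of the transform/merge module, assuming made-up column
# names -- the real dataset schemas may differ.
from typing import Optional

import pandas as pd


def transform_and_merge(nyt_df: pd.DataFrame, jhu_df: pd.DataFrame) -> pd.DataFrame:
    """Merge the NYT and JHU daily frames on date (illustrative columns)."""
    # Normalize the date columns so the join keys line up
    nyt_df = nyt_df.assign(date=pd.to_datetime(nyt_df["date"]))
    jhu_df = jhu_df.assign(date=pd.to_datetime(jhu_df["date"]))
    # Inner join keeps only the dates present in both sources
    return nyt_df.merge(jhu_df[["date", "recovered"]], on="date", how="inner")


def rows_to_load(merged: pd.DataFrame, last_loaded: Optional[pd.Timestamp]) -> pd.DataFrame:
    """Full load if the database is empty, otherwise only the new rows."""
    if last_loaded is None:
        return merged
    return merged[merged["date"] > last_loaded]
```

Keeping this logic in its own module, separate from any database code, is what makes it easy to test against a stand-in "database" locally.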
Along the way, I worked through my first major learning objective of this challenge: unit testing. Somewhat surprisingly, the online bootcamp I took during the winter didn't teach code testing at all, and I was intimidated by the idea. After some research, I chose to go with pytest for its simplicity and easy syntax relative to Python's built-in unittest. With a little experimentation, I was able to write some tests for many of the functions I had written, and even dabbled a bit with some test-first development.
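For anyone equally intimidated, a pytest test really is just a plain function whose name starts with `test_` and which uses bare `assert` statements. Here's a hypothetical example in that style (the helper under test is illustrative, not the actual project code):

```python
# test_transform.py -- a hypothetical pytest example; run with `pytest`.
import pandas as pd


def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with any missing values (a typical cleaning step)."""
    return df.dropna().reset_index(drop=True)


def test_drop_incomplete_rows():
    df = pd.DataFrame({"cases": [1, None, 3], "deaths": [0, 1, None]})
    cleaned = drop_incomplete_rows(df)
    # Only the first row has no missing values
    assert len(cleaned) == 1
    assert cleaned.loc[0, "cases"] == 1
```

Compared to `unittest`, there's no class to subclass and no `self.assertEqual` to remember, which is a big part of pytest's appeal.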
Once my Python function was working locally, I had to decide which step to take next, as there were a couple choices. After some thinking, and discussing my ideas with my mentor, I went with my second learning objective: Terraform. I've worked a little with Infrastructure as Code in the form of AWS CloudFormation and the AWS Serverless Application Model, but I'd been meaning to try the provider-agnostic Terraform for several months.
I started a separate PyCharm project, wrote a quick little Lambda function handler, and dove into the Terraform tutorials. Once I got the hang of the basics, I found a Terraform Lambda module and started plugging my own values into the template. A sticking point here was figuring out how to get Pandas to operate as a Lambda Layer - after failing to correctly build a layer myself (thank you, Windows), I found a prebuilt layer that worked perfectly and added it to my Terraform configuration as an S3 upload.

I proved that Terraform worked when deploying locally, and then turned my attention to setting up a GitHub Action for automatic deployment. I combined pytest and Terraform into one workflow, with Terraform being dependent upon all tests passing, so that I had a full CI/CD pipeline from my local computer to GitHub and on to AWS via Terraform Cloud.
With deployment just a 'git push' away, it was time to start utilizing other AWS resources. This brought me to my third big learning objective: boto3. I recall being a bit overwhelmed by boto3 and its documentation last fall when I was working on the Resume Challenge. Fortunately, lots of practice reading documentation in the intervening months paid off, as it wasn't nearly as scary as I'd feared once I actually got started. I added SNS functionality first, so that I would get an email any time the database was updated or an error occurred. With that working nicely, it was time for another decision: what database to use?
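The SNS piece boils down to a single `publish` call. A sketch of how that notification helper might look, with the topic ARN as a placeholder and the client made injectable so the function stays unit-testable without real AWS credentials:

```python
def notify(subject: str, message: str, topic_arn: str, sns_client=None) -> None:
    """Publish a status email to an SNS topic.

    A hypothetical helper, not the project's actual code. The topic ARN is a
    placeholder; `sns_client` is injectable so tests can pass in a stub.
    """
    if sns_client is None:
        import boto3  # only needed when no client is injected

        sns_client = boto3.client("sns")
    sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=message)
```

Called from the Lambda handler's success and error paths, this is enough to get an email for every database update or failure.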
I used DynamoDB for the Resume Challenge, but that was just one cell being atomically incremented. Much of my database experience since then has been with various RDS instances, so I wanted to gain some more experience with AWS's serverless NoSQL option. Back to the documentation I went, as well as to Google to figure out the best way to overcome the batch-writing limits. Before long, my Lambda function was behaving exactly how I wanted, with everything still being deployed by Terraform.
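One common way around DynamoDB's 25-item `BatchWriteItem` limit is boto3's `batch_writer`, which buffers puts into correctly sized batches and resends unprocessed items for you. A sketch under that assumption (the table and row shapes are illustrative):

```python
def load_rows(table, rows) -> int:
    """Write rows to a DynamoDB table, returning the number written.

    `table` is a boto3 Table resource; `rows` is an iterable of dicts keyed
    by the table's attributes (illustrative). batch_writer transparently
    handles the 25-item BatchWriteItem limit and retries unprocessed items.
    """
    count = 0
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=row)
            count += 1
    return count
```

One DynamoDB gotcha worth noting: item attributes can't be Python floats, so numeric columns generally need converting to `decimal.Decimal` (or int) before writing.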
At this point, I was cruising along and it was a simple matter to create an EventBridge scheduled event to trigger my Lambda function once a day. It took a few tries to get the permissions and attachments set up correctly in Terraform, and once that was completed, I had to figure out the data visualization solution. I could have gone with AWS QuickSight, but I explored a bit and settled on using a self-hosted instance of Redash. Since there was already an EC2 AMI with Redash installed, I was able to add that to my Terraform configuration (although I cheated a wee bit and created a security group and IAM role for the instance in the console, in the name of finally finishing this project).
With Redash up and running, and some simple visualizations scheduled to update daily, I reached the end of the project requirements earlier today. Huzzah!
I'm happy with how this project went. I invested nearly 50 hours to get it going, due to the number of topics I had to teach myself along the way - a hefty but worthwhile time commitment over the past two weeks. And there are a few things I think could get better still with more learning and practice.
Many long nights and many more rabbit holes later, I can finally present my finished product!