Editor’s note: This post was written by Jenny Lu, one of Zumper’s engineering interns for the summer of 2017.
You never quite know what to expect when you walk into a new internship on day one. You sort of hope you’re not going to be mundanely fixing bugs the entire summer or coding up a project that will never see the light of day. That’s why I was so thrilled to be able to work on and ship two big engineering projects this summer.
I was one of two engineering interns at Zumper through the KPCB fellowship. I worked alongside Sehmon, another KP fellow. We led two main projects: the data pipeline, an event tracking system; and the replay machine, a utility tool we built in our last month (and the main focus of this blog post). Building the replay machine was especially interesting because two of Zumper’s senior engineers, Dan and Rob, would spend several hours every week meeting with us to work on the project. It was really cool to see how experienced engineers approached problems, and we also learned a lot about design challenges and tradeoffs.
The replay machine is a command-line tool that replays requests from old log files. Its primary purpose is to take requests captured in production and replay them against new releases of the site or new servers. This ensures that new deployments of the website behave consistently with past versions, essentially acting as a check that new features or fixes didn’t accidentally introduce bugs. There are three key stages: downloading log files, parsing the logs, and replaying the requests using Scrapy.
To download the logs, we wrote a script that, given user-specified start and end dates, downloads the ELB log files stored in an S3 bucket in our production AWS account.
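In rough outline, that download step looks something like the sketch below. The bucket name, account ID, region, and key layout here are placeholders for illustration, not our production values:

```python
# Hypothetical sketch of the log download step (bucket name, account
# ID, and region are placeholders, not Zumper's actual values).
from datetime import date, timedelta
from pathlib import Path

import boto3

BUCKET = "example-prod-elb-logs"  # placeholder bucket name
# ELB access logs are keyed by date, roughly:
#   AWSLogs/<account-id>/elasticloadbalancing/<region>/YYYY/MM/DD/...
PREFIX_TEMPLATE = (
    "AWSLogs/123456789012/elasticloadbalancing/us-east-1/"
    "{d.year}/{d.month:02}/{d.day:02}/"
)

def download_logs(start: date, end: date, dest: Path) -> None:
    """Download every ELB log object between start and end (inclusive)."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    day = start
    while day <= end:
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX_TEMPLATE.format(d=day)):
            for obj in page.get("Contents", []):
                # Save each log object into dest (assumed to be an existing directory).
                s3.download_file(BUCKET, obj["Key"], str(dest / Path(obj["Key"]).name))
        day += timedelta(days=1)
```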
We then parse the logs and collapse the URLs, a process that involves four key steps, which we wrapped in a shell script. The key idea in this stage is to extract the log requests we actually care about and output a set of de-duplicated URLs, while keeping track of all the responses each URL received in production.
First, we filter out entries that don’t match our desired host and verb (host=padmapper.com, verb=GET). We also normalize each URL in this step, alphabetizing its query parameters while preserving the relative order of repeated parameters, since URLs with the same query parameters in a different order still direct you to the same page. A log entry contains many fields, most of which are not relevant to us, so we next write out only the relevant ones, which, in the current iteration of the project, are the URL and status code. We then pipe the output through the unix commands sort and uniq -c, which leaves us with the number of times each URL has been seen with each corresponding status code. In the final step of this stage, we collapse all the duplicate URLs and write them out to a CSV file.
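The normalization in that first step might look roughly like this; normalize_url is a name we’re using for illustration:

```python
# Hypothetical sketch of the URL normalization in the first step:
# alphabetize query parameters by key while preserving the relative
# order of repeated parameters.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # list.sort() is stable, so repeated keys keep their original relative order.
    params.sort(key=lambda kv: kv[0])
    return urlunsplit(parts._replace(query=urlencode(params)))

# Both orderings normalize to the same URL:
assert (normalize_url("https://www.padmapper.com/search?b=2&a=1&b=3")
        == "https://www.padmapper.com/search?a=1&b=2&b=3")
```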
For example, after the counting step we might have two lines that look like this:

1 https://www.padmapper.com 201
1 https://www.padmapper.com 301
In the collapse step, these become:
https://www.padmapper.com {201:1, 301:1}
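Here’s a condensed sketch of that collapse step, assuming the counted lines arrive as count url status; the function name and CSV layout are ours for illustration:

```python
# Hypothetical sketch of the collapse step: fold counted
# "count url status" lines into one CSV row per URL with a
# dict of status-code counts, as in the example above.
import csv
import sys
from collections import defaultdict

def collapse(lines, out_path):
    counts = defaultdict(dict)  # url -> {status_code: times_seen}
    for line in lines:
        count, url, status = line.split()
        bucket = counts[url]
        bucket[int(status)] = bucket.get(int(status), 0) + int(count)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for url, seen in counts.items():
            writer.writerow([url, seen])  # e.g. url, {201: 1, 301: 1}

if __name__ == "__main__":
    collapse(sys.stdin, "collapsed.csv")
```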
Finally, we’re ready to replay the requests. In the first iteration of the project, we used the Python requests library to send requests one at a time, which, given that a single day’s logs can contain more than half a million requests, was not optimal. We switched to Scrapy, which lets us make concurrent requests to the server. We run all the requests against a specified server and output a report consisting of a “score,” URL, received status code, user agent (optional), and the status codes previously seen in the log files. We currently calculate the “score” for each request by taking the received response code and computing what fraction of the total previously seen responses it accounts for, essentially a “percentage match.”
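A stripped-down sketch of what a replay spider along these lines could look like, reading the collapsed CSV from above; the class, argument, and field names are illustrative rather than the actual tool’s:

```python
# Hypothetical sketch of the replay stage: a Scrapy spider that
# re-issues each collapsed URL against a target host and scores the
# response against the status counts seen in production.
import ast
import csv
from urllib.parse import urlsplit, urlunsplit

import scrapy

class ReplaySpider(scrapy.Spider):
    name = "replay"
    custom_settings = {
        # Let non-2xx responses reach parse() instead of being dropped.
        "HTTPERROR_ALLOW_ALL": True,
        "CONCURRENT_REQUESTS": 32,
    }

    def __init__(self, csv_path="collapsed.csv",
                 target_host="staging.example.com", **kwargs):
        super().__init__(**kwargs)
        self.csv_path = csv_path
        self.target_host = target_host  # server under test (placeholder)

    def start_requests(self):
        with open(self.csv_path, newline="") as f:
            for url, seen in csv.reader(f):
                # Point the production URL at the server under test.
                parts = urlsplit(url)
                replay_url = urlunsplit(parts._replace(netloc=self.target_host))
                yield scrapy.Request(
                    replay_url, callback=self.parse, dont_filter=True,
                    meta={"seen": ast.literal_eval(seen)},  # e.g. {201: 1, 301: 1}
                )

    def parse(self, response):
        seen = response.meta["seen"]
        # "Percentage match": fraction of the production responses for
        # this URL that had the status code we just received.
        score = seen.get(response.status, 0) / sum(seen.values())
        yield {"score": score, "url": response.url,
               "status": response.status, "previously_seen": seen}
```

You would run something like scrapy runspider replay_spider.py -a target_host=<server under test>, letting Scrapy’s scheduler provide the concurrency that the one-request-at-a-time approach lacked.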
In future iterations, we plan to generate more detailed reports and adjust the “scoring” system – most notably, we will probably want to compare the received response code against a “successful” record rather than simply the majority record (given that one exists).
We did all this in the last month of our internship, and we managed to ship it and test it in production against a new deploy before we left! It was really cool to see the engineering team actually use a tool that we interns built.
See yourself as a future Zumper intern (engineering or otherwise)? Send your resume and cover letter to jobs@zumper.com.