As I reach the midway point of my Outreachy internship, it’s a good time to reflect on the original project plan and adjust course for the second half.
My original timeline for this exploratory data analysis project was fairly open-ended. I expected to spend from the beginning of the project through February 1st exploring the data using graphical and quantitative techniques to determine attributes of heavy users of the Firefox browser. Then I would spend a few weeks developing a method to identify heavy users and finish up by preparing and communicating a report of my findings. So far I have followed that plan closely.
First Half Progress
I spent several weeks learning the tools to explore the data. I already had some experience with SQL and Python, but I needed to learn the Spark Python API. I learned the syntax to slice the data in different ways, join it with other data, and graph it. I started by making some histograms and quantile plots from the data. I was very excited in the first several weeks when I wrote a user-defined function. Well, I actually modified one that a teammate had used in his notebook, but it still felt like an accomplishment.
Then it was time to do a deep dive into the data. The main table of data I am investigating has 607 columns and 3.9 million rows for 1 day. Determining what those 607 columns contained and which ones were interesting to me was the first task. After determining which columns were of the most interest, I needed to know what the data looked like in those columns. Using summary statistics – count, mean, standard deviation, minimum and maximum – gave me a good overall picture.
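As a rough illustration of that kind of summary, here is a minimal sketch in plain Python rather than Spark. The `uri_counts` list and the `summarize` helper are hypothetical, and the numbers are made up; they are not values from the real table.

```python
import statistics

# Hypothetical per-client values for one column (made-up numbers, not real data).
uri_counts = [0, 3, 12, 12, 40, 85, 130, 2500]


def summarize(values):
    """Return count, mean, standard deviation, minimum and maximum for one column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = summarize(uri_counts)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this toy column the standard deviation is much larger than the mean, which is the kind of hint that a few extreme values are worth a closer look.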
I started looking at outliers, which led to a whole notebook of odd usage patterns. For example, if a client has 0 URIs for the day and 0 active hours, it is likely the browser was open but not used that day. Should this count as a user in my determination of heavy users? It’s hard to think of them as a user if they didn’t actually use the product that day. How is it possible to be active for 24 hours every day but not access any URIs? Or how can a client have very high URI counts (above the 99th percentile) and no active hours?
These usage patterns were fascinating discoveries and may inform how I handle the definition of heavy users. But it was also easy to feel like I was going down a rabbit hole, and I had to figure out when I’d found enough useful information to come back to my main analysis.
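The odd patterns described above can be sketched as simple rules over one day of client records. This is only an illustration in plain Python: the field names (`uri_count`, `active_hours`), the `classify` helper, and all of the values are assumptions made up for the example, not the real table schema or data.

```python
import statistics

# Hypothetical client-day records; field names and values are made up.
clients = [
    {"client_id": "a", "uri_count": 0, "active_hours": 0.0},
    {"client_id": "b", "uri_count": 0, "active_hours": 24.0},
    {"client_id": "c", "uri_count": 9000, "active_hours": 0.0},
    {"client_id": "d", "uri_count": 120, "active_hours": 3.5},
    {"client_id": "e", "uri_count": 45, "active_hours": 1.0},
    {"client_id": "f", "uri_count": 300, "active_hours": 6.0},
]

# 99th percentile of URI counts: the last of the 99 cut points.
uri_p99 = statistics.quantiles(
    [c["uri_count"] for c in clients], n=100, method="inclusive"
)[-1]


def classify(client, uri_threshold):
    """Label a client-day with one of the odd usage patterns, or 'ordinary'."""
    uris, hours = client["uri_count"], client["active_hours"]
    if uris == 0 and hours == 0:
        return "open but unused"
    if uris == 0 and hours >= 24:
        return "active all day, no URIs"
    if uris > uri_threshold and hours == 0:
        return "very high URIs, no active hours"
    return "ordinary"


labels = {c["client_id"]: classify(c, uri_p99) for c in clients}
for client_id, label in labels.items():
    print(client_id, label)
```

Writing the rules down this way also makes the open question explicit: whether a client labeled "open but unused" should count as a user at all is a decision for the analysis, not something the code settles.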
What Took Longer Than Expected
Making visualizations of my data always takes me longer than I expect it to. After struggling in the first several weeks to get some basic graphs to look good, I thought I had a pretty good handle on it. I knew I’d need to invest some time in learning a package that makes pretty graphs, but I thought I had a reasonable grasp of creating a quick graph for analysis.
Then I tried to make a graph with multiple box plots so I could see how different slices of data (different days of the week, averaged over a whole week) would compare to each other. It was tricky to get the data from multiple dataframes into the right format for the Matplotlib API for box plots. I used some fancy code from the Matplotlib examples, but my graph didn’t look the way I expected. After paring back to a very basic graph and adjusting the y-axis, it was looking pretty good. But the line through the box was not at the mean value of my array of data. I looked at several possible causes and was making no progress, so I moved on to something else. After taking a break, I searched for more information about box plots, and it was now obvious that the line indicates the median, not the mean! The line was indeed at the correct location, and a bit of extra programming allowed me to put an indication of the mean onto my box plot. It’s a good lesson that sometimes walking away for a bit is the best plan, because the solution is often obvious with fresh eyes.
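For anyone who hits the same confusion, here is a minimal sketch of a two-box plot that marks the mean as well as the median. It is not the notebook code described above: the `slices` dictionary and its values are made up, and it uses Matplotlib's `showmeans` option as one way to add the mean marker.

```python
import statistics

import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical slices of data (made-up values), one list per box.
slices = {
    "Tuesday": [1, 2, 2, 3, 4, 5, 30],
    "Wednesday": [2, 3, 3, 4, 6, 8, 44],
}

fig, ax = plt.subplots()
# boxplot draws the line inside each box at the median;
# showmeans=True adds a separate marker at the mean.
boxes = ax.boxplot(list(slices.values()), showmeans=True)
ax.set_xticklabels(list(slices.keys()))
ax.set_ylabel("value")

# Read the plotted positions back to confirm which statistic is which.
median_line_y = boxes["medians"][0].get_ydata()[0]
mean_marker_y = boxes["means"][0].get_ydata()[0]
print("line in box:", median_line_y, "median:", statistics.median(slices["Tuesday"]))
print("marker:", mean_marker_y, "mean:", statistics.mean(slices["Tuesday"]))
```

With skewed data like this, the mean marker sits well above the median line, which is exactly the gap that looked like a bug. Calling `fig.savefig(...)` or `plt.show()` afterwards displays the result.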
What I Would Do Differently
One thing I would do differently if I were starting again now is add many more comments to my code and notes about my results. While I’m doing my analysis, it seems obvious that I’m looking at this slice of data for Tuesday or that slice of data for a particular client with very high URI counts. After looking at the resulting slice of data, I discover the next thing I want to look at – what does the data look like on Wednesday, or how many clients have very high URI counts on a given day. And then looking at those results leads me to a different slice of data. However, it is a bit of work to go back the next day – or even after lunch – and remember what slice of data I was looking at, why I was looking at it, and what conclusions I drew from it. I learned to make a comment at the top saying what data I’m looking at and why, and then another at the bottom saying what pattern I found in the data. This makes it much easier to refer back to the notebook later.
New Plan for Second Half
This week I am working on a proposal for my definition of heavy users. This is where the “art” of data science comes in, because there isn’t necessarily one right answer. After discussing my proposal with my mentors, we will come up with an outline for the report. I will spend the next week or so filling in the report with data and graphs. After the report is reviewed, if I still have time, I will look at the attributes of heavy users across different populations – mobile, OS, region, etc. With so much data, it seems that there are endless ways to explore it.