Modifying Expectations

As I reach the midway point of my Outreachy internship, it’s a good time to reflect on the original project plan and adjust course for the second half.

Original Timeline

Originally, my timeline was fairly open-ended for my exploratory data analysis project. I expected to spend from the beginning of the project through February 1st exploring the data using graphical and quantitative techniques to determine attributes of heavy users of the Firefox browser. Then I would spend a few weeks developing a method to identify heavy users and finish up by preparing and communicating a report of my findings. So far I have followed that plan closely.

First Half Progress

I spent several weeks learning the tools to explore the data. I already had some experience with SQL and Python, but I needed to learn the Spark Python API. I learned the syntax to slice the data in multiple ways, join it with other data, and graph it. I initially made some histograms and quantile plots from the data. I was very excited in the first several weeks when I wrote a user-defined function. Well, I actually modified one that a teammate had used in his notebook, but it still felt like an accomplishment.
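For anyone curious what a Spark user-defined function looks like, here is a minimal sketch of the general idea. The table name, column names and conversion are made up for illustration and are not the actual function from my notebook:

```python
# Minimal PySpark UDF sketch. The table name, column names, and conversion
# are hypothetical -- not the actual function from my notebook.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours, passing nulls through."""
    return None if seconds is None else seconds / 3600.0

seconds_to_hours_udf = udf(seconds_to_hours, DoubleType())

df = spark.table("usage_data")  # hypothetical table name
df = df.withColumn("active_hours", seconds_to_hours_udf(df["active_seconds"]))
df.select("client_id", "active_hours").show(5)
```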

Then it was time to do a deep dive into the data. The main table of data I am investigating has 607 columns and 3.9 million rows for a single day. Determining what those 607 columns contained and which ones were interesting to me was the first task. After determining which columns were of the most interest, I needed to know what the data looked like in those columns. Using summary statistics – count, mean, standard deviation, minimum and maximum – gave me a good overall picture.
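Getting that overall picture is essentially a one-liner in Spark once the columns are chosen. A sketch, with hypothetical table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("usage_data")  # hypothetical table name

# Hypothetical columns of interest out of the 607 available.
cols_of_interest = ["uri_count", "active_seconds", "tab_open_count"]

# describe() returns count, mean, stddev, min, and max for each column.
df.select(cols_of_interest).describe().show()
```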

I started looking at outliers, which led to a whole notebook of odd usage patterns. For example, if a client had 0 URIs and 0 active hours for the day, the browser was likely open but not used that day. Should this count as a user in my determination of heavy users? It’s hard to think of them as a user if they didn’t actually use the product that day. How is it possible to be active for 24 hours every day but not access any URIs? Or how can URI counts be very high (above the 99th percentile) with no active hours?
These usage patterns were fascinating discoveries and may inform my decision about how to define heavy users. But it was also easy to feel like I was going down a rabbit hole, and I had to figure out when I’d found enough useful information to come back to my main analysis.
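To give a flavor of the kind of slicing involved, here is a rough sketch of how two of those odd patterns could be isolated. The table name, column names and threshold are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("usage_data")  # hypothetical table and column names throughout

# Browser apparently open but unused: no URIs loaded and no active time.
idle_clients = df.filter((F.col("uri_count") == 0) & (F.col("active_seconds") == 0))

# Very high URI counts but no recorded active time. The threshold here is a
# placeholder; in practice it would come from a percentile calculation.
high_uri_threshold = 5000
busy_but_idle = df.filter(
    (F.col("uri_count") > high_uri_threshold) & (F.col("active_seconds") == 0)
)

print(idle_clients.count(), busy_but_idle.count())
```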

What Took Longer Than Expected

Making visualizations of my data always takes longer than I expect. After struggling in the first several weeks to get some basic graphs to look good, I thought I had a pretty good handle on it. I knew I’d need to invest some time in a graphing package to make polished charts, but I thought I had a reasonable grasp of creating a quick graph for analysis.

Then I tried to make a graph with multiple box plots so I could see how different slices of data (different days of the week, averaged over a whole week) would compare to each other. It was tricky to get the data from multiple DataFrames into the right format for the Matplotlib box plot API. I used some fancy code from the Matplotlib examples, but my graph didn’t look the way I expected. After paring back to a very basic graph and adjusting the y-axis, it was looking pretty good. But the line through the box was not at the mean value of my array of data. I investigated several possible causes and was making no progress, so I moved on to something else. After taking a break, I searched for more information about box plots, and it was suddenly obvious that the line indicates the median, not the mean! The line was indeed in the correct location, and a bit of extra programming allowed me to add an indication of the mean to my box plot. It’s a good lesson that sometimes walking away for a bit is the best plan, because the solution is often obvious with fresh eyes.
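For anyone hitting the same confusion: the line inside a Matplotlib box is the median, and Matplotlib can mark the mean for you. A minimal sketch using random stand-in data rather than my real slices:

```python
import numpy as np
import matplotlib.pyplot as plt

# Random data standing in for seven slices (e.g. days of the week).
rng = np.random.default_rng(0)
slices = [rng.lognormal(mean=3, sigma=1, size=1000) for _ in range(7)]

fig, ax = plt.subplots()
# The line inside each box is the median; showmeans adds a marker at the mean.
ax.boxplot(slices, showmeans=True)
ax.set_ylim(0, 150)  # trim extreme outliers so the boxes stay readable
ax.set_xlabel("Day of week")
ax.set_ylabel("Daily URI count")
plt.show()
```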

What I Would Do Differently

One thing I would do differently if I were starting again now is to put many more comments into my code and notes about my results. It seems obvious when I’m doing my analysis that I’m looking at this slice of data for Tuesday or that slice of data for a particular client with very high URI counts. After looking at the resulting slice of data, I discover the next thing I want to look at – what does the data look like on Wednesday, or how many clients have very high URI counts on a given day. And then looking at those results leads me to a different slice of data. However, it is a bit of work to go back the next day – or even after lunch – and remember what slice of data I was looking at, why I was looking at it, and what conclusions I drew from it. I learned to add a comment at the top saying what data I’m looking at and why, and one at the bottom saying what pattern I found in the data. This makes it much easier to refer back to the notebook later.

New Plan for Second Half

This week I am working on a proposal for my definition of heavy users. This is where the “art” of data science comes in, because there isn’t necessarily one right answer. After discussing my proposal with my mentors, we will come up with an outline for the report. I will spend the next week or so filling in the report with data and graphs. After the report is reviewed, if I still have time, I will look at the attributes of heavy users across different populations – mobile, OS, region, etc. With so much data, it seems that there are endless ways to explore it.

My Data Science Project

My project focuses on Exploratory Data Analysis (EDA). EDA is essentially analyzing the data you will be working with through visual and quantitative methods in order to understand it. That means creating various graphs of the data, since it’s often easier to see patterns and relationships in graphs than in tables of numbers. The quantitative methods generate summary statistics such as the minimum, maximum, mean and standard deviation. This analysis is both univariate, looking at one variable or data column at a time, and multivariate, exploring the relationships between variables.

There are many techniques for doing EDA, but a good starting place is considering the data through a question that you’d like to address. The question that I’m looking at is: who are Heavy Users of the browser?  This brings up a host of related questions, such as which usage variables are of interest, and what would be a cutoff value of those variables to determine heavy usage? Do heavy users use the product differently than other users? What features do they use?  As the investigation continues, the questions may change slightly or new questions may come up.

It’s important to note that the user data that I am investigating is “non-personal information”, meaning that there is nothing that identifies the user, such as their name, email address or IP address. The data instead contains information such as the OS, the version of the application and the number of tabs open.

To start this investigation, you need the data and the tools to analyze it. My mentors suggested the dataset that I should use. A sample of the data for one week contains 6,473,655 rows and 607 columns. My two possible tools were Redash and Databricks. Redash makes it easy to write a SQL query and visualize the data in a dashboard. Unfortunately, since the dataset that I’m using is massive, my queries would be slow and could impact performance for other users on the shared cluster. So I’m creating notebooks in Databricks. Databricks is powered by Apache Spark to work with Big Data. A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. I’m writing the code in Python.

To view, manipulate and analyze the data, a SQL query is used to put the data into a DataFrame. A DataFrame organizes the data in a tabular format, with rows of data in named columns like you’d see in a spreadsheet. There are built-in functions that you can use to select various rows or columns, group them, count them and graph them. You can also write user-defined functions. When I first learned about data science I was using Pandas DataFrames. The data in a Pandas DataFrame is in memory on a single server. Spark SQL DataFrames are distributed across your Spark cluster so they can handle large amounts of data. I quickly realized that the Spark DataFrames in Databricks are similar to Pandas DataFrames but have different APIs, so be sure to specify which kind of DataFrame you are using when searching the internet for how to do something.
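A rough sketch of that workflow in a notebook, with hypothetical table and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Pull one day of data into a Spark DataFrame with a SQL query
# (table, columns, and date are hypothetical).
df = spark.sql("""
    SELECT client_id, os, uri_count, active_seconds
    FROM usage_data
    WHERE submission_date = '2018-01-15'
""")

# Built-in DataFrame functions: select, filter, group, aggregate, count.
df.groupBy("os").agg(
    F.count(F.lit(1)).alias("clients"),
    F.avg("uri_count").alias("avg_uri_count"),
).show()

# Roughly the same idea in Pandas would be df.groupby("os")["uri_count"].mean()
# -- similar concept, different API, which is why search results can mislead.
```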

Finally, you want to be able to visualize your data. There are some easy graphing methods in the Databricks notebook using Matplotlib. Although these allow you to quickly view your data to further inform your investigation, they are better suited to analysis than to a polished presentation of conclusions. For example, they lack titles and clear labeling of the x and y axes. For an elegant presentation of graphs, I plan to look into ggplot.

Help, I’m stuck! Now what?

At some point in a new project, new endeavor or new learning experience, everyone is likely to get stuck. Since I’m just learning data science while working on an Outreachy Internship with Mozilla, I reached that point this week.

My task involves exploratory data analysis, so I’m mostly looking at and analyzing large amounts of data to see what information it can reveal. It’s quite open-ended, which can be good, but it also means there’s not a predefined direction that I need to go in. As part of the task for this week, my mentors told me to determine a cutoff to define heavy users. They suggested I do some visualizations and look at percentiles.

Okay, in theory I know how to make visualizations of data – I’d done just that in my contribution for the internship application. But there are so many kinds of visualizations – which ones should I use? What form should the underlying data be in to make the graphs render as I’d expect them to? And which package should I use for drawing the charts?

I started with histograms, because that was an area where I had some experience. Since I’m working in Databricks for the first time, I looked at the documentation for visualizations in Databricks. There are built-in visualizations, but the documentation was pretty sparse and high-level, so I started Googling. Between reading lots of suggestions on several web pages and blogs and trying various combinations of things, I got some histograms that looked reasonable.

But what was I supposed to do about percentiles? Back to Google, where I eventually figured out that Databricks has built-in visualizations for quantiles. This was starting to look pretty good, but I wasn’t sure if that was really the best way to represent percentiles.
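On the quantitative side, Spark can also compute approximate percentiles directly, which I could compare against the built-in plots. A sketch with a hypothetical table and column name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("usage_data")  # hypothetical table and column name

# Approximate percentiles of the daily URI count. The last argument is the
# allowed relative error; 0.0 would be exact but much slower on big data.
values = df.approxQuantile("uri_count", [0.5, 0.75, 0.9, 0.95, 0.99], 0.01)
for pct, value in zip([50, 75, 90, 95, 99], values):
    print(f"{pct}th percentile: {value}")
```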

When I could not make any more progress on my own, I decided to ask in a Slack channel for this project. It was a bit intimidating to ask because I felt like I should have been able to figure this out for myself. Because I am also new to using Slack, I didn’t realize that I should directly ping my mentors with my question. I just generally asked it in the channel. Although my mentors didn’t see it, another person in the channel told me they like using CDFs for that type of thing.

I was happy to have an answer that could move me forward on my project; however, I did not know how to use a CDF or even what it stood for. Back to Google, where I figured out it stands for Cumulative Distribution Function. More Googling taught me that it could be implemented as a cumulative step histogram in Matplotlib, with some very helpful sample code. Stack Overflow has been my friend in this project.
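The gist of that sample code is that a normalized, cumulative step histogram reads as an empirical CDF. A minimal sketch along those lines, using random stand-in data rather than the real usage metric:

```python
import numpy as np
import matplotlib.pyplot as plt

# Random data standing in for a usage metric such as daily URI count.
rng = np.random.default_rng(1)
uri_counts = rng.lognormal(mean=3, sigma=1, size=10_000)

fig, ax = plt.subplots()
# A normalized, cumulative step histogram is an empirical CDF: the y value
# at x is the fraction of clients whose metric is at most x.
ax.hist(uri_counts, bins=200, density=True, cumulative=True, histtype="step")
ax.set_xlabel("Daily URI count")
ax.set_ylabel("Cumulative fraction of clients")
ax.set_title("Empirical CDF of daily URI count")
plt.show()
```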

I learned that it is a good thing to reach out for help when needed because it did get me past my stuck point. I also learned how best to ask to get the attention of the people most likely to be able to help me. In reading other communication channels in the community, I’ve realized that people ask questions about all sorts of topics. Even people that have a lot of experience in some areas are learning new skills or working with new departments or new data, so they ask questions. And all the answers that I’ve seen have been respectful and helpful.

Applying for Outreachy

Who am I?

Once upon a time, I was a software engineer.  I had jobs in a variety of industries from paper machine controls to hotel reservation systems.  I started programming in proprietary languages, then moved to C, C++ and finally Java. 

When I had my second child, I chose to stay home to raise my children.  It was a great experience – challenging in new ways.  I spent lots of time volunteering at schools and learned a whole new set of skills.  As my children got older it was time for me to return to my own goals.  But how do you return to technology after a decade away?

I decided to refresh my Java skills with some online classes.  Through this process, I figured out that this whole new field of data science had been created.  It sounded like all the things I enjoy – math, programming, and making meaning out of data.  So I signed up for a series of online classes to learn data science.

Why Apply to Outreachy?

Outreachy offers three-month internships to work in Free and Open Source Software (FOSS).  This was a great opportunity to gain real-world experience on a project, refresh my technical skills, and have something recent to add to my resume.  With any luck, it would give me some actual experience in data science!

A friend had told me about the Outreachy internships a year ago.  I’d missed the application window for last winter.  I checked the dates for the spring/summer internship, but that wasn’t a good time for my schedule.  I signed up for the Outreachy email notifications so I could apply for the next round this winter.

How to Apply

Although the internship starts in December, the application process begins in September.  Since there are several steps and a project contribution required, I suggest starting as early as possible.

The first step was filling out an application with a couple of short-answer questions.  Although the questions were not hard, they did require some reflection, and I wanted to answer them “perfectly” because I really wanted an internship.  It was a bit scary to submit my answers, and then I had to wait for an email response to find out if I was approved.  Success, I was on to the next step!

Now I had access to all the project information.  You have to make a contribution to a project in order to be considered as an intern.  There were a lot of projects listed so it was a bit overwhelming to pick 1 or 2 that I thought would be a good fit.  To be honest, I wasn’t even sure who all of the open source communities were.  There were no data science projects listed, but there were several Python projects, so I chose one of those.

The process of getting that software downloaded, installed and running was an experience.  I needed to create a GitHub account, learn to chat on IRC and remember some Unix.  It was often frustrating and overwhelming, but it was a huge sense of accomplishment when it finally worked.

After submitting my contribution for this project, I was feeling very satisfied.  Even if I did not receive the internship I felt like I had learned a lot and it had been a very good experience.

Then, shortly before the application period closed, Mozilla created 2 data science internships.  Within a matter of hours, 14 people had chosen to work on the 10 projects that were offered.  I spent several days focusing on my contribution and really put my best effort into it. 

After recording my contribution, there was a final application with a few more short-answer questions.  This was another opportunity to reflect on myself and why this would be a good project for me.  It was a bit intimidating because I felt like I didn’t have enough experience in Python, data science or open source, but I’d come this far, so I submitted it.

Results

I was thrilled when I got the email saying I’d been chosen as a data science intern at Mozilla.  I’m very grateful to get the opportunity to get professional experience with supportive mentors.

This process has been a great experience even when it’s been scary – like writing this blog.  If you have an interest in technology or open source software I suggest that you apply for an Outreachy internship.  You may discover a new community, gain interesting experience and learn some things about yourself.