My Data Science Project

My project focuses on Exploratory Data Analysis (EDA). EDA is essentially analyzing the data you will be working with through visual and quantitative methods in order to understand it. That means creating various graphs of the data, since it’s often easier to see patterns and relationships in graphs than in tables of numbers. The quantitative methods generate summary statistics such as the minimum, maximum, mean and standard deviation. The analysis is both univariate, looking at one variable or data column at a time, and multivariate, exploring the relationships between variables.
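
To make that concrete, here is a minimal, self-contained sketch of univariate and multivariate summary statistics using a pandas DataFrame; the toy columns (tabs_open, hours_active) are made-up placeholders, not the real dataset.

```python
import pandas as pd

# Toy data standing in for real usage measurements.
df = pd.DataFrame({
    "tabs_open": [3, 12, 7, 55, 4, 9, 2, 31],
    "hours_active": [0.5, 2.0, 1.1, 6.3, 0.7, 1.5, 0.2, 4.0],
})

# Univariate: count, mean, standard deviation, min, quartiles and max
# for each column.
print(df.describe())

# Multivariate: pairwise correlations between the columns.
print(df.corr())
```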

There are many techniques for doing EDA, but a good starting place is considering the data through a question that you’d like to address. The question that I’m looking at is: who are Heavy Users of the browser? This brings up a host of related questions, such as which usage variables are of interest, and what cutoff value of those variables should determine heavy usage? Do heavy users use the product differently than other users? What features do they use? As the investigation continues, the questions may change slightly or new questions may come up.

It’s important to note that the user data that I am investigating is “non-personal information,” meaning that there is nothing that identifies the user, such as their name, email address or IP address. The data instead contains information such as the OS, the version of the application and the number of tabs open.

To start this investigation you need the data and the tools with which to analyze it. My mentors suggested the dataset that I should use. A sample of the data for one week contains 6,473,655 rows and 607 columns. My two possible tools were Redash and Databricks. Redash makes it easy to write a SQL query and visualize the data in a dashboard. Unfortunately, since the dataset I’m using is massive, my queries would be slow and could impact performance for other users on the shared cluster. So I’m creating notebooks in Databricks instead. Databricks is powered by Apache Spark, which is designed for working with Big Data. A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. I’m writing the code in Python.

To view, manipulate and analyze the data, I use a SQL query to put the data into a DataFrame. A DataFrame organizes the data in a tabular format, with rows of data in named columns like you’d see in a spreadsheet. There are built-in functions that you can use to select various rows or columns, group them, count them and graph them. You can also write user-defined functions. When I first learned about data science I was using Pandas DataFrames. The data in a Pandas DataFrame is in-memory on a single server. Spark SQL DataFrames are distributed across the Spark cluster, which is what lets them handle large amounts of data. I quickly realized that the Spark DataFrames in Databricks are similar to Pandas DataFrames but have different APIs, so be sure to specify which kind of DataFrame you are using when searching the internet for how to do a specific operation.
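
Here is a minimal sketch of that workflow, assuming a Databricks notebook where `spark` is already defined; the table name clients_sample and the column names (client_id, os, tab_open_count) are hypothetical placeholders, not the real schema.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Run a SQL query and get the result back as a Spark DataFrame.
df = spark.sql("""
    SELECT client_id, os, tab_open_count
    FROM clients_sample
    WHERE submission_date = '20181203'
""")

# Built-in functions: group the rows by a column and count them.
df.groupBy("os").count().show()

# A user-defined function (UDF) applied to one column.
bucket = udf(lambda n: "heavy" if n is not None and n >= 50 else "light",
             StringType())
df.withColumn("usage_bucket", bucket(df["tab_open_count"])).show(5)
```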

Finally, you want to be able to visualize your data. There are some easy graphing methods available in a Databricks notebook using Matplotlib. Although these let you quickly view your data to further inform your investigation, they are better suited to analysis than to a polished presentation of conclusions. For example, they lack titles and clear labeling of the x- and y-axes. For a more elegant presentation of graphs I plan to look into ggplot.
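
As a small example of the kind of polish I mean, here is a sketch of a Matplotlib histogram with a title and axis labels, again assuming the hypothetical tab_open_count column; plotting requires pulling a sample of the column down to the driver first.

```python
import matplotlib.pyplot as plt

# Collect a small sample of one numeric column to the driver for plotting.
tab_counts = [row["tab_open_count"]
              for row in df.select("tab_open_count").sample(False, 0.01).collect()]

fig, ax = plt.subplots()
ax.hist(tab_counts, bins=50)
ax.set_title("Distribution of open tab counts")
ax.set_xlabel("Number of open tabs")
ax.set_ylabel("Number of clients in sample")
display(fig)  # Databricks renders Matplotlib figures passed to display()
```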

Help, I’m stuck! Now what?

At some point in a new project, new endeavor or new learning experience, everyone is likely to get stuck. Since I’m just learning data science while working on an Outreachy Internship with Mozilla, I reached that point this week.

My task involves exploratory data analysis, so I’m mostly looking at and analyzing large amounts of data to see what information they can reveal. It’s quite open-ended, which can be good, but it also means there’s no predefined direction that I need to go in. As part of the task for this week, my mentors told me to determine a cutoff to define heavy users. They suggested I do some visualizations and look at percentiles.

Okay, in theory I know how to make visualizations of data – I’d done just that in my contribution for the internship application. But there are so many kinds of visualizations – which ones should I use? What form should the underlying data be in to make the graphs render as I’d expect them to? And which package should I use for drawing the charts?

I started with histograms, because that was an area where I had some experience. Since I’m working in Databricks for the first time, I looked at the documentation for visualizations in Databricks. There are built-in visualizations, but the documentation was pretty sparse and high-level, so I started Googling. Between reading lots of suggestions on several web pages and blogs and trying various combinations of things, I got some histograms that looked reasonable.

But what was I supposed to do about percentiles? Back to Google, where I eventually figured out that Databricks has built-in visualizations for quantiles. This was starting to look pretty good, but I wasn’t sure if that was really the best way to represent percentiles.
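
For the underlying numbers, Spark DataFrames can also compute approximate percentiles directly; here is a minimal sketch using approxQuantile, again with the hypothetical tab_open_count column.

```python
# Approximate percentiles of one column; the last argument is the
# allowed relative error (0.0 means exact, but slower).
probs = [0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
values = df.approxQuantile("tab_open_count", probs, 0.001)

for p, v in zip(probs, values):
    print("p{:.0f}: {}".format(p * 100, v))
```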

When I could not make any more progress on my own, I decided to ask in a Slack channel for this project. It was a bit intimidating to ask because I felt like I should have been able to figure this out for myself. Because I am also new to using Slack, I didn’t realize that I should directly ping my mentors with my question. I just generally asked it in the channel. Although my mentors didn’t see it, another person in the channel told me they like using CDFs for that type of thing.

I was happy to have an answer that could move me forward on my project; however, I did not know how to use a CDF or even what it stood for. Back to Google, where I figured out it was a Cumulative Distribution Function. More Googling taught me that it could be implemented as a cumulative step histogram in Matplotlib, with some very helpful sample code. Stack Overflow has been my friend in this project.
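
In case it’s useful, here is a minimal sketch of that idea: an empirical CDF drawn as a cumulative step histogram in Matplotlib, using a sampled list of the hypothetical tab_open_count values.

```python
import matplotlib.pyplot as plt

# Collect a sample of the column to the driver (the full dataset is too
# large to bring into memory at once).
tab_counts = [row["tab_open_count"]
              for row in df.select("tab_open_count").sample(False, 0.01).collect()]

fig, ax = plt.subplots()
# density=True normalizes the counts, cumulative=True accumulates them,
# and histtype="step" draws the unfilled step outline of the CDF.
ax.hist(tab_counts, bins=200, density=True, cumulative=True, histtype="step")
ax.set_title("CDF of open tab counts")
ax.set_xlabel("Number of open tabs")
ax.set_ylabel("Fraction of clients")
display(fig)
```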

I learned that it is a good thing to reach out for help when needed because it did get me past my stuck point. I also learned how best to ask to get the attention of the people most likely to be able to help me. In reading other communication channels in the community, I’ve realized that people ask questions about all sorts of topics. Even people that have a lot of experience in some areas are learning new skills or working with new departments or new data, so they ask questions. And all the answers that I’ve seen have been respectful and helpful.

Applying for Outreachy

Who am I?

Once upon a time, I was a software engineer.  I had jobs in a variety of industries from paper machine controls to hotel reservation systems.  I started programming in proprietary languages, then moved to C, C++ and finally Java. 

When I had my second child, I chose to stay home to raise my children.  It was a great experience – challenging in new ways.  I spent lots of time volunteering at schools and learned a whole new set of skills.  As my children got older it was time for me to return to my own goals.  But how do you return to technology after a decade away?

I decided to refresh my Java skills with some online classes.  Through this process, I discovered that a whole new field, data science, had emerged.  It sounded like all the things I enjoy – math, programming, and making meaning out of data.  So I signed up for a series of online classes to learn data science.

Why Apply to Outreachy?

Outreachy offers three-month internships to work in Free and Open Source Software (FOSS).  This was a great opportunity to gain real-world experience on a project, refresh my technical skills, and have something recent to add to my resume.  With any luck, it would give me some actual experience in data science!

A friend had told me about the Outreachy internships a year ago.  I’d missed the application window for last winter’s round.  I checked the dates for the spring/summer internship, but that wasn’t a good time for my schedule.  I signed up for the Outreachy email notifications so I could apply for the next round this winter.

How to Apply

Although the internship starts in December, the application process begins in September.  Since there are several steps and a project contribution required, I suggest starting as early as possible.

The first step was filling out an application with a couple of short-answer questions.  Although the questions were not hard, they did require some reflection, and I wanted to answer them “perfectly” because I really wanted an internship.  It was a bit scary to submit my answers, and then I had to wait for an email response to find out if I was approved.  Success, I was on to the next step!

Now I had access to all the project information.  You have to make a contribution to a project in order to be considered as an intern.  There were a lot of projects listed, so it was a bit overwhelming to pick one or two that I thought would be a good fit.  To be honest, I wasn’t even sure who all of the open source communities were.  There were no data science projects listed, but there were several Python projects, so I chose one of those.

The process of getting that software downloaded, installed and running was an experience.  I needed to create a GitHub account, learn to chat on IRC and remember some Unix.  It was often frustrating and overwhelming, but it gave me a huge sense of accomplishment when it finally worked.

After submitting my contribution for this project, I was feeling very satisfied.  Even if I did not receive the internship I felt like I had learned a lot and it had been a very good experience.

Then, shortly before the application period closed, Mozilla created 2 data science internships.  Within a matter of hours, 14 people had chosen to work on the 10 projects that were offered.  I spent several days focusing on my contribution and really put my best effort into it. 

After recording my contribution, I filled out a final application with a few more short-answer questions.  This was another opportunity to reflect on myself and on why this would be a good project for me.  It was a bit intimidating because I felt like I didn’t have enough experience in Python, data science or open source, but I’d come this far, so I submitted it.

Results

I was thrilled when I got the email saying I’d been chosen as a data science intern at Mozilla.  I’m very grateful for the opportunity to gain professional experience with supportive mentors.

This process has been a great experience even when it’s been scary – like writing this blog.  If you have an interest in technology or open source software, I suggest that you apply for an Outreachy internship.  You may discover a new community, gain interesting experience and learn some things about yourself.