At some point in a new project, new endeavor or new learning experience, everyone is likely to get stuck. Since I’m just learning data science while working on an Outreachy Internship with Mozilla, I reached that point this week.
My task involves exploratory data analysis, so I’m mostly analyzing large amounts of data to see what information it can reveal. It’s quite open-ended, which can be good, but it also means there’s no predefined direction for me to follow. As part of the task for this week, my mentors told me to determine a cutoff to define heavy users. They suggested I do some visualizations and look at percentiles.
Okay, in theory I know how to make visualizations of data – I’d done just that in my contribution for the internship application. But there are so many kinds of visualizations – which ones should I use? What form should the underlying data be in to make the graphs render as I’d expect them to? And which package should I use for drawing the charts?
I started with histograms, because that was an area where I had some experience. Since I’m working in Databricks for the first time, I looked at the documentation for visualizations in Databricks. There are built-in visualizations, but the documentation was pretty sparse and high-level, so I started Googling. Between reading lots of suggestions on several web pages and blogs and trying various combinations of things, I got some histograms that looked reasonable.
But what was I supposed to do about percentiles? After going back to Google, I eventually figured out that Databricks has built-in visualizations for quantiles. This was starting to look pretty good, but I wasn’t sure if that was really the best way to represent percentiles.
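For anyone who wants to try the percentile idea outside of Databricks, here is a minimal sketch using only Python's standard library. The data is made up (random "usage hours" for 1,000 imaginary users, not the real project data), and treating everyone above the 90th percentile as a heavy user is just an assumption for illustration, not the cutoff from my task.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: made-up usage hours for 1,000 imaginary users
usage_hours = [random.expovariate(1 / 2.0) for _ in range(1000)]

# With n=100, statistics.quantiles returns the 99 cut points between percentiles,
# so index 49 is the 50th percentile, index 89 the 90th, and index 98 the 99th
cut_points = statistics.quantiles(usage_hours, n=100)
p50, p90, p99 = cut_points[49], cut_points[89], cut_points[98]

# Assumed cutoff for illustration: users above the 90th percentile are "heavy"
heavy_users = [hours for hours in usage_hours if hours > p90]

print(f"median={p50:.2f}, p90={p90:.2f}, p99={p99:.2f}")
print(f"heavy users: {len(heavy_users)} of {len(usage_hours)}")
```

Comparing a few percentiles side by side like this can show how quickly usage climbs in the tail, which is one way to judge whether a given cutoff is reasonable.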
When I could not make any more progress on my own, I decided to ask in a Slack channel for this project. It was a bit intimidating to ask because I felt like I should have been able to figure this out for myself. Because I am also new to using Slack, I didn’t realize that I should directly ping my mentors with my question. I just generally asked it in the channel. Although my mentors didn’t see it, another person in the channel told me they like using CDFs for that type of thing.
I was happy to have an answer that could move me forward on my project; however, I did not know how to use a CDF or even what it stood for. Back on Google, I figured out that it stood for Cumulative Distribution Function. More Googling taught me that it could be implemented as a cumulative step histogram in Matplotlib, and I found some very helpful sample code. Stack Overflow has been my friend in this project.
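This is not the exact code I ended up with, but a cumulative step histogram in Matplotlib can be sketched roughly like this. The data is again made up (random "usage hours", not the real project data), and the 90th percentile line and the output file name are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: made-up usage hours for 1,000 imaginary users
rng = np.random.default_rng(0)
usage_hours = rng.exponential(scale=2.0, size=1000)

fig, ax = plt.subplots()

# density=True with cumulative=True makes the histogram climb from 0 to 1,
# and histtype="step" draws it as an outline, which reads like a CDF
cumulative, bin_edges, _ = ax.hist(
    usage_hours, bins=100, density=True, cumulative=True, histtype="step"
)

# Assumed cutoff for illustration: mark the 90th percentile on the chart
p90 = np.percentile(usage_hours, 90)
ax.axvline(p90, linestyle="--", color="gray", label=f"90th percentile ({p90:.1f} h)")

ax.set_xlabel("Usage hours")
ax.set_ylabel("Fraction of users at or below this value")
ax.legend(loc="lower right")
fig.savefig("usage_cdf.png")  # hypothetical file name
```

Reading the chart is straightforward: pick a fraction on the vertical axis, such as 0.9, and the matching value on the horizontal axis is the usage level that 90% of users fall at or below.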
I learned that it is a good thing to reach out for help when needed, because it did get me past my stuck point. I also learned how best to ask to get the attention of the people most likely to be able to help me. In reading other communication channels in the community, I’ve realized that people ask questions about all sorts of topics. Even people who have a lot of experience in some areas are learning new skills or working with new departments or new data, so they ask questions. And all the answers that I’ve seen have been respectful and helpful.