My project focuses on Exploratory Data Analysis (EDA). EDA is basically analyzing the data you will be working with through visual and quantitative methods in order to understand it. The visual methods mean creating various graphs of the data, since it's often easier to see patterns and relationships in graphs than in tables of numbers. The quantitative methods generate summary statistics such as the minimum, maximum, mean, and standard deviation. This analysis is both univariate, looking at one variable or data column at a time, and multivariate, exploring the relationships between variables.
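As a small sketch of what those two kinds of analysis look like in code, here is a univariate summary and a multivariate correlation using Pandas. The column names and values are made up for illustration; they are not from the real dataset.

```python
import pandas as pd

# Hypothetical usage data -- column names and values are illustrative only
df = pd.DataFrame({
    "tab_count": [3, 12, 7, 45, 2, 9],
    "active_hours": [0.5, 4.2, 1.1, 9.8, 0.3, 2.0],
})

# Univariate: summary statistics for one column at a time
stats = df["tab_count"].agg(["min", "max", "mean", "std"])
print(stats)

# Multivariate: how strongly do the two usage measures move together?
print(df["tab_count"].corr(df["active_hours"]))
```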
There are many techniques for doing EDA, but a good starting place is considering the data through a question that you'd like to address. The question that I'm looking at is: who are the heavy users of the browser? This brings up a host of related questions, such as which usage variables are of interest, and what cutoff value of those variables would determine heavy usage. Do heavy users use the product differently than other users? What features do they use? As the investigation continues, the questions may change slightly or new questions may come up.
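To make the idea of a cutoff concrete, here is one way it could be chosen: take a usage variable and call everyone above some high percentile a heavy user. Both the variable (weekly active hours) and the 90th-percentile rule are assumptions for illustration, not the project's actual definition.

```python
import statistics

# Hypothetical weekly active hours for ten users -- illustrative values only
active_hours = [0.2, 1.5, 3.0, 4.8, 6.1, 7.5, 9.0, 12.4, 20.7, 35.2]

# One candidate rule: users above the 90th percentile count as "heavy"
# quantiles(n=10) returns nine cut points; the last one is the 90th percentile
cutoff = statistics.quantiles(active_hours, n=10)[-1]
heavy_users = [h for h in active_hours if h > cutoff]
print(cutoff, heavy_users)
```

Part of the investigation is then checking whether a rule like this is stable from week to week, or whether a different variable separates heavy users more cleanly.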
It’s important to note that the user data that I am investigating is “non-personal information,” meaning that there is nothing that identifies the user, such as a name, email address, or IP address. The data instead contains information such as the OS, the version of the application, and the number of tabs open.
To start this investigation you need the data and the tools to analyze it. My mentors suggested the dataset that I should use. A sample of the data for one week contains 6,473,655 rows and 607 columns. My two possible tools were Redash and Databricks. Redash makes it easy to write a SQL query and visualize the data in a dashboard. Unfortunately, since the dataset that I’m using is massive, my queries would be slow and could impact performance for other users, because Redash runs on a shared cluster. So I’m creating notebooks in Databricks. Databricks is powered by Apache Spark, which is built to work with Big Data. A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. I’m writing the code in Python.
To view, manipulate, and analyze the data, I use a SQL query to load it into a DataFrame. A DataFrame organizes the data in a tabular format, with rows of data in named columns like you’d see in a spreadsheet. There are built-in functions that you can use to select rows or columns, group them, count them, and graph them. You can also write user-defined functions. When I first learned about data science I was using Pandas DataFrames. The data in a Pandas DataFrame is in-memory on a single server. Spark DataFrames, by contrast, are distributed across your Spark cluster so they can handle large amounts of data. I quickly realized that the Spark DataFrames in Databricks are similar to Pandas DataFrames but have different APIs, so be sure to specify which kind of DataFrame you are using when searching the internet for how to perform a specific operation.
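A quick sketch of that API difference, with the common operations side by side. The runnable part uses Pandas; the PySpark equivalents are shown as comments since they need a Spark session, and the table name `usage_data` is hypothetical.

```python
import pandas as pd

# Hypothetical sample of usage data -- values are illustrative only
df = pd.DataFrame({"os": ["Windows", "Darwin", "Windows", "Linux"],
                   "tab_count": [5, 8, 3, 12]})

# Pandas: count rows per OS
counts = df.groupby("os").size()

# The PySpark equivalent uses a different API for the same idea:
#   spark_df = spark.sql("SELECT os, tab_count FROM usage_data")
#   counts = spark_df.groupBy("os").count()

# Pandas: filter rows, then select columns
heavy = df[df["tab_count"] > 6][["os", "tab_count"]]

# PySpark equivalent:
#   heavy = spark_df.filter(spark_df.tab_count > 6).select("os", "tab_count")

print(counts)
print(heavy)
```

Same ideas, different spellings (`groupby` vs. `groupBy`, boolean indexing vs. `filter`), which is exactly why search results for one library often don't apply to the other.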
Finally, you want to be able to visualize your data. There are some easy graphing methods in the Databricks notebook using Matplotlib. Although these allow you to quickly view your data to further inform your investigation, they are better suited to analysis than to a polished presentation of conclusions. For example, they lack titles and clear labeling of the x- and y-axes. For a more elegant presentation of graphs I plan to look into ggplot.
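That said, Matplotlib itself can add the missing labeling; the quick notebook charts just don't do it for you. A minimal sketch, with hypothetical tab-count data (the `Agg` backend renders off-screen so it also runs outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, e.g. outside a notebook
import matplotlib.pyplot as plt

# Hypothetical distribution of open tabs -- values are illustrative only
tab_counts = [1, 2, 2, 3, 3, 3, 5, 8, 12, 40]

fig, ax = plt.subplots()
ax.hist(tab_counts, bins=5)

# The labeling the quick built-in charts leave out:
ax.set_title("Distribution of Open Tabs")
ax.set_xlabel("Number of tabs")
ax.set_ylabel("Number of users")

fig.savefig("tab_distribution.png")
```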