- The first is that it’s awkward to deal with higher dimensional data in a two-dimensional spreadsheet. Spreadsheets are great for representing two-dimensional data but they’re awkward for representing anything at three or higher dimensions. And while there’s many workarounds like pivot tables, this will only gets you so far.
- The second pain point revolves around doing the same calculation over multiple sheets or multiple workbooks. While it’s easy to iterate over rows or columns in Excel, it’s cumbersome and time consuming to iterate over hundreds of sheets or notebooks.
- Finally, data manipulation in Excel is actually very manual and hence very error prone. So in Excel, the convention is to copy data or formulas from cell to cell, but this makes it hard to keep our data up to date as new data arrives or as we update our computations as they become more complex. Errors aren’t always easy to catch before important business decisions are made.
In this video, we look at some data that we can get from the Bureau of Labor Statistics. While it comes in Excel, it comes in a very particular format. The rows iterate through years, the columns iterate through months, the sheets iterate through industries, and the workbooks iterate through wages, hours worked and overtime. So how would you use this to calculate salary, which we’re gonna define as wage times the quantity hours worked plus 1.5 times the overtime.
In Excel, you’d have to squash the two-dimensional years cross months into a single-dimensional time. You’d have to repeat this process across every sheet and you’d have to repeat this over every workbook. Plus you’d have to handle missing data on an ad hoc basis. For example, while overtime is only given for three industries, hours worked is give for all 19. And of course some sheets are missing data for certain dates but this is different for every sheet. On top of this, how would you repeat this analysis each month as the government updates its data? And how would you do similar calculations based on the same data?
Instead, let’s explore this data using Python. Python is an open-source platform that’s used by data scientists and analysts all over the world. It’s free to download and use and it’s 90,000 packages make manipulating data agrees. Specifically, it solves many of our pain points of Excel. It can natively work with high-dimensional data, can easily repeat calculations across sheets and workbooks, and we can automate common tasks like data manipulation and plotting so that processing new data is simple.
So let’s look at the data from the Bureau of Labor Statistics. We can use assert to check for errors. So here we’re able to check that the row above contains the header and the row below contains null as we pull out the data in the middle. We can easily squash this two-dimensional table into a single-dimensional time series and we can iterate this calculation across multiple sheets and multiple workbooks without much trouble. We can write the formulas once without copying and we can easily handle missing data using conventions like fill value. Finally, we can plot and format our plots programmatically.
We can easily produce another calculation looking at the fraction of workers who are women across different industries. We can reuse our code to extract the data, write simple formulas that we don’t have to copy from cell to cell, and then programmatically plot the fraction of women who are working in different industries in December 2015. This took about 15 minutes to write but you can now run this on updated data from the Bureau of Labor Statistics for free.
If you want to see the code that we used in today’s demo, you can view that on our GitHub page. And if you want to learn how to do this yourself, you can check out our online data science training courses.