Microsoft Excel is a spreadsheet software, containing data in tabular form. Entries of the data are located in cells, with numbered rows and letter labeled columns. Excel is widespread across industries and has been around for over thirty years. It is often people’s first introduction to data analysis.
Most users feel at home using a GUI to operate Excel and no programming is necessary for the most commonly used features. The data is presented right in front of the user and it is easy to scroll around through the spreadsheet. Making plots from the data only involves highlighting cells in the spreadsheet and clicking a few buttons.
There are various shortcomings with Excel. It is closed source and not free. There are free open-source alternatives like OpenOffice and LibreOffice suites, but there might be compatibility issues between file formats, especially for complex spreadsheets. Excel becomes unstable for files reaching 500 MB, being unresponsive and crashing for large files, hindering productivity.
Collaborations can become difficult because it is hard to inspect the spreadsheet and understand how certain values are calculated/populated. It is difficult to understand the user’s thought process and work flow for the analysis.
The functionality of the spreadsheet is sensitive to the layout and moving entries around can have disastrous effects. Tasks like data cleaning, munging, treating different data types, and handling missing data are often difficult and require manual manipulation.
Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as the tasks become more complicated.
The Python library pandas is a great alternative to Excel, providing much of the same functionality and more. Pandas is great for data manipulation, cleaning, analysis, and exploration.
Additionally, these tasks can be easily automated and reapplied to different datasets. It is built on top of Python’s NumPy library which offers fast and efficient numerical computations. Unlike Excel, it is open-source and free.
Some people may be intimidated to use Python and pandas because they will have to move away from the point-and-click GUI approach of Excel. Knowing pandas is a great introduction to more powerful and complex data analysis.
This article explores how do to some common Excel tasks in pandas, helping people get their feet wet with one of the most powerful Python libraries for data science. If you’re interested in using Python to supercharge your Excel spreadsheets, PyXLL is currently the only package that enables developers to write fully featured Excel addins in Python.
A Quick Introduction to Python pandas
The equivalent to an Excel spreadsheet in pandas is the DataFrame
class. It looks like a spreadsheet, with rows, columns, and indices.
Let us consider three spreadsheets, the first two containing each student’s grade on an exam and the third has information on which section the students belong. These DataFrames
are loaded into memory from CSV files using the read_csv
function.
# import pandas library
import pandas as pd
# create pandas DataFrame from CSV
exam_one = pd.read_csv('csv/exam_one.csv', delimiter=',')
exam_two = pd.read_csv('csv/exam_two.csv', delimiter=',')
sections = pd.read_csv('csv/sections.csv', delimiter=',')
# print the top five entries of the DataFrame
exam_one.head()
Student Name | Exam 1 Score | |
---|---|---|
0 | Edwin Duncan | 83 |
1 | Steve Hayes | 69 |
2 | Cecilia Richards | 91 |
3 | Dorothy Harris | 53 |
4 | Jordan Davis | 76 |
Columns can be indiviudally selected simply by referring to them by name.
exam_one['Student Name']
0 Edwin Duncan 1 Steve Hayes 2 Cecilia Richards 3 Dorothy Harris 4 Jordan Davis Name: Student Name, dtype: object
Python pandas Vlookup
Experienced Excel users rely on Vlookup, a built-in function that searches (looks up) a specified value in one column and returns the corresponding value of another column.
For our example of exam scores, we would like to take a student’s second exam score and include it into the table of first exam score. The column of student names may not be in the same order, e.g., the first name in one table may not correspond to the first name in another table.
A Vlookup can be done in pandas using the merge
method of a DataFrame. The merge
method takes the second DataFrame
and the on
keyword specifies the column name to perform that matching on. The result is a new DataFrame
with both exam scores.
# apply "Vlookup" in padas
exam_scores = exam_one.merge(exam_two, on='Student Name')
exam_scores.head()
Student Name | Exam 1 Score | Exam 2 Score | |
---|---|---|---|
0 | Edwin Duncan | 83 | 33 |
1 | Steve Hayes | 69 | 42 |
2 | Cecilia Richards | 91 | 73 |
3 | Dorothy Harris | 53 | 75 |
4 | Jordan Davis | 76 | 81 |
Python pandas Pivot Tables
Pivot tables are used to aggregate and filter data and are a useful tool for data analysis in Excel. We can group data by certain values in a given column and filter out rows. In pandas, we can easily filter out rows from our DataFrame
by using Boolean logic.
For this example, we would like to determine the student’s name that belong to section “A”. This is done in pandas by first creating an array of True/False values. This array corresponds to which rows met the condition. We then use the resulting Boolean array to only call rows that meet our condition.
IX = sections['Section'] == 'A' # Boolean (True/False) indices or rows that met our filtering criteria
sections[IX] # use the Boolean indices to access rows
Student Name | Section | |
---|---|---|
0 | Cecilia Richards | A |
3 | Steve Hayes | A |
pivot_table
or group_by
method. Using the pivot_table
method, the syntax is:# use merge to add student's section
student_data = exam_scores.merge(sections, on='Student Name')
# create pivot table using pivot_table method
student_data.pivot_table(index='Section', aggfunc='mean')
Exam 1 Score | Exam 2 Score | |
---|---|---|
Section | ||
A | 80.0 | 57.5 |
B | 64.5 | 78.0 |
C | 83.0 | 33.0 |
Section
column and the aggfunc
is the operation we want to perform. An alternate approach is to utilize the groupby
method, akin to the GROUP BY
statement in SQL.student_data.groupby('Section').agg({'Exam 1 Score': 'mean', 'Exam 2 Score': 'mean'})
Exam 2 Score | Exam 1 Score | |
---|---|---|
Section | ||
A | 57.5 | 80.0 |
B | 78.0 | 64.5 |
C | 33.0 | 83.0 |
groupby
, we then used the agg
method and passed a Python dictionary. The keys of the dictionary are the column names to apply the aggregation and the values are the actual aggregation function. If we want to apply different or more than one aggregation function to each column, we can pass a dictionary whose values consist of lists of aggregation functions:student_data.groupby('Section').agg({'Exam 1 Score':['max', 'mean'], 'Exam 2 Score': ['max', 'min']})
Exam 2 Score | Exam 1 Score | |||
---|---|---|---|---|
max | min | max | mean | |
Section | ||||
A | 73 | 42 | 91 | 80.0 |
B | 81 | 75 | 76 | 64.5 |
C | 33 | 33 | 83 | 83.0 |
Pandas vs. Excel: Concluding Remarks
Pandas provides the powerful DataFrame
class, of which Excel users will recognize as resembling and behaving similarly to an Excel spreadsheet. Through the DataFrame
class, one can perform operations equivalent to Excel’s Vlookup and pivot tables.
One of the barriers of learning a new tool is knowing how to perform operations one was accustomed to with the old tool. This article shows pandas’ syntax, illustrating how pandas works and easing one’s transition from Excel. Learning pandas is a good start in familiarizing oneself with the toolkit of a data scientist.