SQLite vs Pandas: Performance Benchmarks

This technical article was written for The Data Incubator by Paul Paczuski, a Fellow of our 2016 Spring cohort in New York City who landed a job with our hiring partner, Genentech as a Clinical Data Scientist.

As a data scientist, we all know that unglamorous data manipulation is 90% of the work. Two of the most common data manipulation tools are SQL and pandas. In this blog, we’ll compare the performance of pandas and SQLite, a simple form of SQL favored by Data Scientists.

Let’s find out the tasks at which each of these excel. Below, we compare Python’s pandas to sqlite for some common data analysis operations: sort, select, load, join, filter, and group by.

Note that the axis is logarithmic, so that raw differences are more pronounced.

Analysis details

For the analysis, we ran the six tasks 10 times each, for 5 different sample sizes, for each of 3 programs: pandassqlite, and memory-sqlite (where database is in memory instead of on disk). See below for the definitions of each task.

Our sample data was randomly generated. Here’s what it looks like:

sql_vs_pandas$ head -n 5 data/sample.100.csv


This consists of a random string of 8 characters, a random single character (for the filtering operation), a random integer simulating a year (1900-2000), and a uniform random float value between 10000 and 100000.


sqlite or memory-sqlite is faster for the following tasks:

  • select two columns from data (<.1 millisecond for any data size for sqlitepandas scales with the data, up to just under 0.5 seconds for 10 million records)
  • filter data (>10x-50x faster with sqlite. The difference is more pronounced as data grows in size)
  • sort by single column: pandas is always a bit slower, but this was the closest

pandas is faster for the following tasks:

  • groupby computation of a mean and sum (significantly better for large data, only 2x faster for <10k records)
  • load data from disk (5x faster for >10k records, even better for smaller data)
  • join data (2-5x faster, but slower for smallest dataset of 1000 rows)

Comparing memory-sqlite vs. sqlite, there was no meaningful difference, especially as data size increased.

There is no significant speedup from loading sqlite in its own shell vs. via pandas.

Overall, joining and loading data is the slowest whereas select and filter are generally the fastest. Further, pandas seems to be optimized for group-by operations, where it performs really well (group-by is pandas‘ second fastest operation for larger data).

Note that this analysis assumes you are equally proficient in writing code with both! But these results could encourage you to learn the tool that you are less familiar with, if the performance gains are significant.


All code is on our GitHub page.

Below are the definitions of our six tasks: sort, select, load, join, filter, and group by (see driver/sqlite_driver.py or driver/pandas_driver.py).

sqlite is first, followed by pandas:


def sort(self):
  self._cursor.execute('SELECT * FROM employee ORDER BY name ASC;')

def sort(self):


def select(self):
  self._cursor.execute('SELECT name, dept FROM employee;')

def select(self):
  self.df_employee[["name", "dept"]]


def load(self):
  self._cursor.execute('CREATE TABLE employee (name varchar(255), dept char(1), birth int, salary double);')
  df = pd.read_csv(self.employee_file)
  df.columns = employee_columns
  df.to_sql('employee', self._conn, if_exists='replace')

  self._cursor.execute('CREATE TABLE bonus (name varchar(255), bonus double);')
  df_bonus = pd.read_csv(self.bonus_file)
  df_bonus.columns = bonus_columns
  df_bonus.to_sql('bonus', self._conn, if_exists='replace')

def load(self):
  self.df_employee = pd.read_csv(self.employee_file)
  self.df_employee.columns = employee_columns

  self.df_bonus = pd.read_csv(self.bonus_file)
  self.df_bonus.columns = bonus_columns


def join(self):
  self._cursor.execute('SELECT employee.name, employee.salary + bonus.bonus '
                       'FROM employee INNER JOIN bonus ON employee.name = bonus.name')

def join(self):
  joined = self.df_employee.merge(self.df_bonus, on='name')
  joined['total'] = joined['bonus'] + joined['salary']


def filter(self):
  self._cursor.execute('SELECT * FROM employee WHERE dept = "a";')

def filter(self):
  self.df_employee[self.df_employee['dept'] == 'a']

group by

def groupby(self):
  self._cursor.execute('SELECT avg(birth), sum(salary) FROM employee GROUP BY dept;')

def groupby(self):
  self.df_employee.groupby("dept").agg({'birth': np.mean, 'salary': np.sum})


We used pandas version 0.19.1 and sqlite version 3.13.0. All tests were ran on Digital Ocean Ubuntu 14.04 with 16GB memory and 8 core CPU.


For resource on becoming a data scientist, check out our blog, particular this article on preparing for our free data science fellowship and this one on data manipulating data like a professional data scientist

To learn more how sqlite works, check out this awesome blog series.

Here is a syntax comparison between pandas and sql.


So What Are You Waiting For?

There’s never been a better time to start learning new skills. Emerging technologies are revolutionizing the way we work, play, and live. Learning these disciplines deepens your understanding of the world around you and provides a fountain of knowledge to explore new frontiers and technological breakthroughs. 

The Data Incubator offers intensive training bootcamps that provide the tools you need to succeed as a data scientist or data engineer. You will gain hands-on experience working on real projects and apply what you’ve learned in our curriculum to solve problems in your work or for clients. Our curriculum includes machine learning, natural language processing, predictive analytics, data visualization, and more. 

We also partner with leading organizations to place our highly trained graduates. Our hiring partners recognize the quality of our expert training and make us their go-to resource for providing quality, capable candidates throughout the industry.

Take a look at the programs we offer to help you achieve your dreams.

We’re always here to guide you through your data journey! Contact our admissions team if you have any questions about the application process.

Related Blog Posts


Career Enablement at TDI

TDI is more than your typical bootcamp. We provide robust career support to ensure exceptional outcomes for our students. Learn more about our career enablement options here!

Read More »
is it too late to become a data scientist

Is It Too Late for Me to Become a Data Scientist?

You might think data science is a young person’s game. After all, this is a relatively new discipline that might not have been around when you were in school. But research shows nearly half of all data scientists are 40 years and older. Whatever your age, it’s never too late to pursue your dreams of becoming a qualified data scientist. Learn how to succeed in this profession in this blog.

Read More »