The Shape Of Code, or What We Can Learn From Indentation
By Robert Schroll
TDI Data Scientist in Residence and Instructor for our Data Science Fellowship
As I have reviewed students’ Python code for our fellowship program, I’ve learned that I can judge code quality at a quick glance, just by looking at its indentation. I’ve come up with the following rule of thumb:
Good code should have frequent changes in indentation, but should not be deeply indented.
This isn’t because indentation is inherently good or bad. Instead, indentation is a clue to the structure of code. Let’s look at the implications of this rule of thumb.
Good code should not be deeply indented.
We’ll start with the second half of this credo, since it is universally true. In most languages, indentation represents some new context. In Python, the indentation determines the context. We, as humans, are limited in the amount of context that we can hold in our working memory (See any one of a million pop-sci articles about how we can only remember five, or seven, or eight, or ten, or fifteen things at a time). Deep indentation shows that we are asking the reader to keep too much context in mind at once.
As a concrete example, consider this (simplified) code from one of the mini projects we give our students:
def flatten_dicts(list_of_dicts):
    ret = []
    for d in list_of_dicts:
        new_dict = {}
        for k, v in d.items():
            if isinstance(v, str):
                new_dict[k + '_' + v] = 1
            elif isinstance(v, dict):
                for kk, vv in v.items():
                    if isinstance(vv, str):
                        new_dict[kk + '_' + vv] = 1  # ★
                    else:
                        new_dict[kk] = vv
            else:
                new_dict[k] = v
        ret.append(new_dict)
    return ret
To understand what’s happening at the line ★, you must keep track of six levels of context: a function call, a loop over the list of dictionaries, a loop over the dictionary items, a conditional on the type of items, another loop over another dictionary’s items, and a conditional on another item’s type. But I don’t have to read the code to see this is a problem. The six levels of indentation show me this problem at a glance.
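That glance can be approximated mechanically. The following is a minimal sketch, not part of the original mini project: the helper name max_indent_depth and the SAMPLE source string are made up for illustration. It relies on the standard tokenize module, which emits an INDENT token each time a block opens and a DEDENT token each time one closes.

```python
import io
import tokenize


def max_indent_depth(source):
    """Return the deepest block-nesting level found in a string of Python source."""
    depth = 0
    deepest = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            depth += 1
            deepest = max(deepest, depth)
        elif tok.type == tokenize.DEDENT:
            depth -= 1
    return deepest


# Hypothetical sample: a function body, a loop, and a conditional.
SAMPLE = (
    "def f(xs):\n"
    "    for x in xs:\n"
    "        if x:\n"
    "            print(x)\n"
)

print(max_indent_depth(SAMPLE))  # 3
```

Counting tokens rather than leading spaces means the result does not depend on whether the code uses two spaces, four spaces, or tabs per level.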
How do we fix this? Any time I see deeply nested code inside of a for loop, I recommend splitting the inside of that loop into a separate function:
def flatten_dict(d):
    new_dict = {}
    for k, v in d.items():
        if isinstance(v, str):
            new_dict[k + '_' + v] = 1
        elif isinstance(v, dict):
            for kk, vv in v.items():
                if isinstance(vv, str):
                    new_dict[kk + '_' + vv] = 1  # ★
                else:
                    new_dict[kk] = vv
        else:
            new_dict[k] = v
    return new_dict

def flatten_dicts(list_of_dicts):
    ret = []
    for d in list_of_dicts:
        ret.append(flatten_dict(d))
    return ret
This does a number of things for us:
- It has removed one level of context for line ★. It no longer depends on anything about the list of dictionaries that we are iterating through (if it did, it would probably be a sign that our code is so convoluted that we cannot reason correctly about it).
- It makes it possible to understand the flatten_dicts function at a glance. The details of the inner loop obscured the simple behavior of this function. Now that we see it, we can see…
- The flatten_dicts function can be simplified even further. Any time you write a loop in Python in which an item is appended to a list every cycle, you should ask if it can be written as a comprehension. This was impossible in the original code, but now we can use a list comprehension:
def flatten_dicts(list_of_dicts):
    return [flatten_dict(d) for d in list_of_dicts]
- It reveals duplicated code in flatten_dict. The case where v is a dict is basically the same as the outer loop. We can replace this with a recursive call:
def flatten_dict(d):
    new_dict = {}
    for k, v in d.items():
        if isinstance(v, str):
            new_dict[k + '_' + v] = 1
        elif isinstance(v, dict):
            new_dict.update(flatten_dict(v))
        else:
            new_dict[k] = v
    return new_dict
The resultant code has a maximum indentation of three, half of what we started with. We no longer have to reason through three nested for loops. This code is also more robust: it can handle arbitrarily nested dictionaries. These problems were not immediately apparent in the original code, but they appeared as soon as we worked to decrease the indentation.
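As a quick check of that robustness claim, here is the refactored code run on a made-up input. The record dictionary and its keys are hypothetical, chosen only because they include a dictionary nested two levels deep.

```python
def flatten_dict(d):
    new_dict = {}
    for k, v in d.items():
        if isinstance(v, str):
            new_dict[k + '_' + v] = 1
        elif isinstance(v, dict):
            # Recurse, so nesting of any depth is flattened the same way.
            new_dict.update(flatten_dict(v))
        else:
            new_dict[k] = v
    return new_dict


def flatten_dicts(list_of_dicts):
    return [flatten_dict(d) for d in list_of_dicts]


# Hypothetical sample record with two levels of nesting.
record = {
    "name": "Ada",
    "age": 36,
    "address": {"city": "London", "geo": {"lat": 51.5, "zone": "GMT"}},
}

print(flatten_dicts([record]))
# [{'name_Ada': 1, 'age': 36, 'city_London': 1, 'lat': 51.5, 'zone_GMT': 1}]
```

The original, non-recursive version would have stored the inner geo dictionary unflattened under the key 'geo', because it only looked one level down.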
Good code should have frequent changes in indentation.
This is less universal, but a large number of consecutive lines at the same level of indentation is often a code smell. Constant indentation is usually a sign that the code has no branches. It is just a series of instructions to be executed in order. We should ask if all of these instructions are necessary, or if they should be (or have already been) put into their own function.
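One rough way to spot such stretches is to measure the longest run of consecutive lines that share the same leading whitespace. This is only a sketch under simplifying assumptions: the helper name longest_flat_run and the sample string are invented for illustration, blank lines are skipped, and continuation lines, multi-line strings, and mixed tabs and spaces are not treated specially.

```python
def longest_flat_run(source):
    """Length of the longest run of consecutive non-blank lines with equal indentation."""
    longest = 0
    run = 0
    previous = None
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines neither extend nor break a run
        indent = len(line) - len(line.lstrip())
        if indent == previous:
            run += 1
        else:
            run = 1
            previous = indent
        longest = max(longest, run)
    return longest


# Hypothetical sample: four lines at the top level, then one indented line.
sample = "a = 1\nb = 2\nc = 3\nif a:\n    d = 4\n"

print(longest_flat_run(sample))  # 4
```

A large value is a prompt to look more closely, not a verdict; a block of constant definitions will score high and still be perfectly good code.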
Let’s suppose we are taking sentence, a string, and wish to produce a normalized version, entirely lowercased with all whitespace reduced to a single space character. Here we do it in three lines:
words = sentence.split()
lowered = [w.lower() for w in words]
normalized = ' '.join(lowered)
However, the lists words and lowered are unnecessary. We can accomplish the same result in a single line:
normalized = ' '.join(w.lower() for w in sentence.split())
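Not only is this fewer lines of code, it also avoids building the named intermediate list lowered by using a generator expression. The memory savings are modest in practice, though: sentence.split() still builds a list, and CPython’s str.join collects its argument into a sequence before joining.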
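To confirm that the two versions agree, here is a small check. The sample sentence is made up and deliberately mixes spaces, a tab, and a newline; the suffixed variable names are only there so both results can be compared side by side.

```python
# Hypothetical input with irregular whitespace and mixed case.
sentence = "  The  QUICK\tbrown\n fox  "

# Three-line version.
words = sentence.split()
lowered = [w.lower() for w in words]
normalized_long = ' '.join(lowered)

# One-line version.
normalized_short = ' '.join(w.lower() for w in sentence.split())

print(normalized_short)  # the quick brown fox
print(normalized_long == normalized_short)  # True
```

Calling split() with no arguments splits on any run of whitespace and drops leading and trailing whitespace, which is what collapses everything to single spaces once the words are joined.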
Some care is necessary; there are times where several lines could be combined into one, but doing so would ruin readability. In these cases, additional lines and additional intermediate variables are preferred.
Another common cause of unchanging indentation is repeated code. If that code is several lines long, consider packing it into a function and then calling that. Also consider if repetition can be made into a loop. Thanks to Python’s tuple unpacking, even data that’s not in a single structure can be operated on within a loop. For example:
a2 = transform(a1)
b2 = transform(b1)
c2 = transform(c1)
can be rewritten as:
a2, b2, c2 = [transform(x) for x in (a1, b1, c1)]
If we need to change transform to better_transform, now we need to change it in only one place. We’re less likely to miss one instance of the function call.
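Here is a runnable sketch of that swap. The bodies of transform and better_transform are placeholders invented for illustration, as are the input values; only the shape of the loop matters.

```python
def transform(x):
    # Placeholder behavior, assumed for illustration only.
    return x * 2


def better_transform(x):
    # Placeholder for the improved function.
    return x * 2 + 1


a1, b1, c1 = 1, 2, 3

# The function name appears exactly once, so swapping it is a one-word edit.
a2, b2, c2 = [transform(x) for x in (a1, b1, c1)]
print(a2, b2, c2)  # 2 4 6

a2, b2, c2 = [better_transform(x) for x in (a1, b1, c1)]
print(a2, b2, c2)  # 3 5 7
```

The unpacking on the left raises a ValueError if the number of names does not match the number of inputs, which gives a small extra safeguard against a forgotten variable.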
Conclusion…
These are only rules of thumb; there are some occasions where good code will not follow their advice. For instance, code defining constants or configuring a system will often consist of tens or hundreds of lines at the same level of indentation. You should not insert changes of indentation just for the variation. But when reviewing your code, focusing on areas with deep indentation or unchanging indentation will often lead you to code that can be refactored and improved.
More about the author
Robert Schroll
Robert studied squishy physics in Chicago, Amherst, and Santiago, Chile, before uniting his love of computers, teaching, and making pretty graphs at The Data Incubator. In his free time, he plays tuba and right field, usually not simultaneously.