Looping in Pandas DataFrames: A Common Mistake
Introduction
If you've just started in the data engineering field, you've probably used the pandas library, a powerful tool that lets you read structured data from various sources and formats, perform calculations, and export it to different formats. What more could you wish for? But when working on small datasets, we often don't pay attention to performance. We write a script that does what we need, test it on a small dataset, and everything works perfectly. Then we run it on a huge dataset (probably on a Friday) and head home expecting it to work just as smoothly. But then we come back on Monday and… surprise! The script is still running. 😬
From the logs (if you're lucky enough to have them), you find out it's stuck in a loop. Sound familiar? It happened to me, and it's happened to hundreds of other people too. So, I'm writing this article to save you time and show you how to loop through a DataFrame efficiently.
The Problem with Loops in Pandas
Loops, such as for and while, are commonly used to perform repetitive tasks. However, when applied to pandas DataFrames, especially on large datasets, loops can drastically slow down processing. Let's explore why.
Example: Looping Over Rows
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Looping through rows
for index, row in df.iterrows():
    print(row['Name'], row['Age'])
You might look at this and think, "Hey, that's fine, what's the problem?" But on larger DataFrames, this is basically turning your code into a turtle race. On every iteration, iterrows() builds a brand-new pandas Series object for the row, and that per-row overhead makes things sluggish real quick.
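To make the per-row conversion concrete, here is a small sketch that inspects what iterrows() hands back. The DataFrame is recreated so the snippet runs on its own, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# iterrows() yields (index, row) pairs; take the first one to inspect it
first_index, first_row = next(df.iterrows())

# Each row arrives as a pandas Series built just for this iteration
row_is_series = isinstance(first_row, pd.Series)

# A row that mixes text and integers ends up with a generic object dtype,
# so the Age column's integer dtype is not carried over to the row
row_dtype = first_row.dtype
column_dtype = df['Age'].dtype

print(row_is_series, row_dtype, column_dtype)
```

Besides the cost of building the Series, the change of dtype is worth knowing about: values you read from a row may not have the same type as the column they came from.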
Timing the Loop
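To measure the duration of a loop, you can use a simple time calculation with the time module. time.perf_counter() is a good fit here because it is a high-resolution clock meant for measuring intervals.
import time
start_time = time.perf_counter()
# Your loop here
print(f"Execution time: {time.perf_counter() - start_time:.3f} seconds")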
For more precise measurements, the timeit module is an excellent tool, though it is beyond the scope of this article.
You can find more information about the timeit module in the official documentation.
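As a rough sketch of how this timing pattern can be used to compare two approaches, the snippet below times an iterrows() loop against the equivalent built-in column operation. The 10,000-row random dataset is made up for illustration, and the absolute numbers will vary from machine to machine, so treat them as a relative comparison only.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical test data: 10,000 rows of random integer ages
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({'Age': rng.integers(18, 90, size=10_000)})

# Time a row-by-row loop with iterrows()
start_time = time.perf_counter()
loop_total = 0
for index, row in df.iterrows():
    loop_total += row['Age']
loop_seconds = time.perf_counter() - start_time

# Time the equivalent built-in column operation
start_time = time.perf_counter()
vectorized_total = df['Age'].sum()
vectorized_seconds = time.perf_counter() - start_time

print(f"iterrows:   {loop_seconds:.4f} seconds")
print(f"vectorized: {vectorized_seconds:.4f} seconds")
```

Both approaches produce the same total, so the only thing that differs is how long each one takes to get there.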
Better Alternatives to Loops (a.k.a. How to Be Fast and Efficient)
Here's where we get to the good stuff: pandas has some built-in superpowers that make looping totally unnecessary. Let's check out some ways to kick loops to the curb and give your code the speed boost it deserves.
1. Vectorization
The best alternative is to use pandas’ built-in vectorized operations. These are optimized for performance and avoid the overhead of Python loops.
Example: Using Vectorized Operations
# Add 5 to the Age column without a loop
df['Age'] = df['Age'] + 5
print(df)
Boom. Done. One line. No loops, no fuss. This is like getting your morning coffee delivered to your desk while everyone else is still waiting in line.
But what if you need to perform complex calculations and not just add or subtract values? Good question!
Say hello to numpy.vectorize.
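numpy.vectorize lets you apply a function element-wise to arrays, which is convenient for custom calculations on a pandas DataFrame. One caveat: the NumPy documentation describes it as a convenience rather than a performance tool, because its implementation is essentially a for loop. Treat it as a tidy way to write element-wise code, and prefer built-in pandas or NumPy operations whenever they can express the same calculation.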
Example: Using numpy.vectorize with Pandas DataFrame
Let's say we want to apply a custom function to each element in a DataFrame. Here's how you can do it using numpy.vectorize:
import numpy as np
import pandas as pd
# Sample DataFrame
data = {'Value': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
# Define a custom function
def complex_calculation(x):
    return x ** 2 + 3 * x + 2
# Vectorize the custom function
vectorized_func = np.vectorize(complex_calculation)
# Apply the vectorized function to the DataFrame column
df['Calculated'] = vectorized_func(df['Value'])
print(df)
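For a calculation like this one, which only uses arithmetic operators, the function can also be handed the whole column at once, because a pandas Series supports **, * and + directly. The sketch below runs both versions side by side; it reuses the sample data from above, and the name via_np_vectorize is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4, 5]})

def complex_calculation(x):
    return x ** 2 + 3 * x + 2

# np.vectorize feeds the values to the Python function element by element
via_np_vectorize = np.vectorize(complex_calculation)(df['Value'])

# Passing the whole column runs each arithmetic step on the full array at once
df['Calculated'] = complex_calculation(df['Value'])

print(df)
```

This shortcut only works when every operation inside the function is itself array-aware; a function with plain Python if/else branches on x would not accept a whole column this way.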
2. Apply Function
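If vectorization feels like overkill for what you're doing, then apply() is your next best friend. It's typically faster than an iterrows() loop and lets you apply a function to each element in a column, or to each row or column of a DataFrame. It still calls your Python function once per element, though, so a true vectorized operation will usually beat it.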
Example: Using apply()
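# Recreate the Name/Age DataFrame (df was redefined in the numpy.vectorize example)
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
# Apply a function to each value in the Age column instead of writing an explicit loop
df['Age'] = df['Age'].apply(lambda x: x + 5)
print(df)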
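apply() can also work row by row when a calculation needs more than one column. The following is a minimal sketch, assuming the Name/Age sample data; describe and the Label column are hypothetical names introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# Hypothetical helper: build a short label from two columns of one row
def describe(row):
    return f"{row['Name']} is {row['Age']}"

# axis=1 passes each row to the function as a Series
df['Label'] = df.apply(describe, axis=1)

print(df)
```

Row-wise apply() still calls the function once per row, so it is best kept for logic that cannot be expressed as column operations.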
3. List Comprehension
If you're into that whole Pythonic style, list comprehensions are your jam. They're compact, they usually beat iterrows() because they skip building a Series for every row, and you still get to feel like you're in control.
Example: List Comprehension Magic
# Creating a new list for the Age column (assumes df has an Age column, as in the first example)
new_ages = [age + 5 for age in df['Age']]
df['Age'] = new_ages
print(df)
Looks cleaner, runs faster, and makes you look like you really know your stuff.
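List comprehensions also work across several columns if you combine them with zip(). Here is a small sketch, again assuming the Name/Age sample data; the Label column is a made-up name for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# zip() walks both columns in step, without building a Series per row
df['Label'] = [f"{name} ({age})" for name, age in zip(df['Name'], df['Age'])]

print(df)
```

Because zip() iterates over the columns themselves, the loop body only ever sees plain values, never a per-row Series.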
4. pandas to_dict()
If you need to convert a pandas DataFrame to a dictionary for looping through rows, you can use the to_dict("records") method.
This converts each row of the DataFrame into a dictionary, with each dictionary representing a record.
Here's an example:
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Convert DataFrame to a list of dictionaries
records = df.to_dict("records")
# Loop through records
for record in records:
    print(record)
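Another built-in option for the cases where you really do need a row loop, not covered above, is itertuples(), which yields each row as a lightweight namedtuple instead of a Series or a dictionary. A minimal sketch, assuming the same sample data (names_and_ages is just an illustrative variable):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# index=False leaves the index out of each tuple
names_and_ages = []
for row in df.itertuples(index=False):
    # Columns are available as attributes on the namedtuple
    names_and_ages.append((row.Name, row.Age))

print(names_and_ages)
```

Attribute access like row.Name relies on the column names being valid Python identifiers; for other column names, access by position is the safer route.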
Benchmarking Loop Methods
Someone on Stack Overflow created a benchmark comparing different looping methods in pandas. You can see the benchmark image below for a visual representation of the performance differences. The key takeaway from the benchmark is:
never use iterrows.

This benchmark is like a reality check for anyone still using iterrows: it's slow, it's clunky, and it's definitely not invited to the performance party. So, take a peek and see why you should be ditching those old-school loops in favor of faster, more efficient methods. Your code (and your sanity) will thank you!
Further Reading
For more details and the full discussion, check out the Stack Overflow thread.