Faster Alternatives to the Pandas .apply() Function for Scalable Machine Learning Pipelines
We’ve all been guilty of sticking to the data science practices that we learned when we began our journey in the world of data.
There always comes a time when you realize that some methods, while familiar, may not be the most efficient, especially in a production environment where performance matters.
In this article I would like to share one of the best practices that I learned while deploying machine learning pipelines in production.
This article will guide you through three alternatives to the .apply() function, each progressively faster than the last in the timings reported below.
Definition of the .apply() function
In Pandas, the .apply() function is both versatile and powerful, offering a wide range of possibilities for data manipulation.
While many of us are already familiar with its capabilities, a quick refresher can be helpful. In simple terms, the .apply() function allows you to apply a function along a specific axis of a DataFrame or Series. It essentially loops over each row or column, applying the provided function to perform transformations, compute custom aggregates, or execute complex operations.
The function passed to .apply() can handle a variety of tasks, from simple transformations to intricate row-level or column-level calculations.
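To make the axis argument concrete, here is a minimal sketch on a tiny, made-up DataFrame. The name df_small and its values are illustrative assumptions and are not part of the benchmark that follows.

```python
import pandas as pd

# Small illustrative DataFrame (values chosen only for this example)
df_small = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
column_sums = df_small.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series
row_sums = df_small.apply(lambda row: row["A"] + row["B"], axis=1)

print(column_sums.tolist())  # [6, 60]
print(row_sums.tolist())     # [11, 22, 33]
```

The axis=1 form is the one benchmarked in this article, since it calls the function once per row.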
First, we will see how .apply() maps a simple addition function across two columns, and then we will jump to the alternatives.
To demonstrate, we will create a DataFrame with 10 million rows and two columns, and use the .apply() function to add the two columns together with an add() function.
import pandas as pd
import numpy as np
import time

# Create a sample DataFrame with 10 million rows
df = pd.DataFrame({
    'A': np.random.rand(10_000_000),
    'B': np.random.rand(10_000_000)
})

# Add function
def add(a, b):
    return a + b
Now we will run the add() function with .apply().
# Measure time for apply
start_time = time.time()
result_apply = df.apply(lambda row: add(row['A'], row['B']), axis=1)
apply_time = time.time() - start_time
print(f"Time taken using apply: {apply_time:.5f} seconds.")
Time taken using apply: 100.67693 seconds.
It took about 100 seconds (roughly 1.7 minutes) to apply the add() function across the two columns.
Alternative 1: .itertuples() function
Our first alternative to the .apply() function is .itertuples(). This function iterates over the rows of a DataFrame, returning each row as a named tuple.
Why is this better?
1. Avoids the overhead of building a Series object for each row, which is what .apply() does with axis=1. Since it yields rows as lightweight tuples, it minimizes the performance hit often seen with row-wise operations.
2. Allows the user to access columns using dot notation.
3. Generates rows on the fly.
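The points above can be seen on a small scale. The sketch below uses a made-up two-row DataFrame (df_small is an illustrative name) to show dot-notation access, plus the index=False and name=None options that .itertuples() accepts.

```python
import pandas as pd

# Tiny illustrative DataFrame (values chosen only for this example)
df_small = pd.DataFrame({"A": [0.5, 1.5], "B": [1.0, 2.0]})

# Default: each row is a namedtuple with Index, A and B fields
for row in df_small.itertuples():
    print(row.Index, row.A, row.B)

# index=False drops the index field; name=None yields plain tuples
sums = [a + b for a, b in df_small.itertuples(index=False, name=None)]
print(sums)  # [1.5, 3.5]
```

Plain tuples skip the namedtuple construction, so they may be worth trying when the field names are not needed.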
Implementing .itertuples() on the same example:
# Measure time for itertuples
start_time = time.time()
result_itertuples = []
for index, a, b in df.itertuples():
    result_itertuples.append(add(a, b))
itertuples_time = time.time() - start_time
print(f"Time taken using itertuples: {itertuples_time:.5f} seconds.")
Time taken using itertuples: 10.88897 seconds.
With our first alternative, we’ve already achieved a performance improvement, running about 9x faster than before!
Alternative 2: list comprehension
A list comprehension, a concise and Pythonic way to build lists from data, is another effective alternative to the .apply() function.
Why is this better?
1. Leverages Python’s built-in iteration and avoids the overhead of per-row method calls.
2. Readable!
3. Reads values directly from the columns, without building an object for each row.
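As a small-scale illustration of these points, the sketch below pairs column values with zip and writes the result back as a new column. The DataFrame df_small and the column names total and label are hypothetical, chosen only for this example.

```python
import pandas as pd

# Tiny illustrative DataFrame (values chosen only for this example)
df_small = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [0.5, 0.5, 0.5]})

# zip pairs up values from the two columns without building a row object
df_small["total"] = [a + b for a, b in zip(df_small["A"], df_small["B"])]

# A condition slots into the same expression
df_small["label"] = ["high" if t > 2 else "low" for t in df_small["total"]]

print(df_small["total"].tolist())  # [1.5, 2.5, 3.5]
print(df_small["label"].tolist())  # ['low', 'high', 'high']
```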
Implementing a list comprehension on the same example:
# Measure time for comprehension
start_time = time.time()
result_comprehension = [a + b for a, b in zip(df['A'], df['B'])]
comprehension_time = time.time() - start_time
print(f"Time taken using comprehension: {comprehension_time:.5f} seconds.")
Time taken using comprehension: 4.64803 seconds.
About 22x faster than the .apply() function.
Alternative 3: map
Yet another efficient alternative to the .apply() method is the map() function. This built-in function applies a function to every item of an iterable and returns an iterator.
Why is this better?
1. It is optimized for iterating over iterables.
2. It is simple, providing a straightforward way to apply a function to the elements of a Series or DataFrame.
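# Measure time for map
start_time = time.time()
# map() is lazy, so list() is needed to force the additions to run
result_map = list(map(add, df['A'], df['B']))
map_time = time.time() - start_time
print(f"Time taken using map: {map_time:.5f} seconds")
Time taken using map: 0.06153 seconds
Keep in mind that this figure comes from a run with map(add, df[['A', 'B']]), where the iterator was created but never consumed. It reflects the cost of building the map object rather than the 10 million additions, so it should not be compared directly with the timings above.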
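Because map() returns an iterator, the work only happens when that iterator is consumed. The sketch below shows this on a tiny, made-up DataFrame; df_small is an illustrative name and add() is redefined so the snippet runs on its own.

```python
import pandas as pd

def add(a, b):
    return a + b

# Tiny illustrative DataFrame (values chosen only for this example)
df_small = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# map() returns immediately with a lazy iterator; add() has not run yet
lazy = map(add, df_small["A"], df_small["B"])

# Consuming the iterator (here with list()) is what performs the work
result = list(lazy)
print(result)  # [11, 22, 33]

# A map object is exhausted after one pass
print(list(lazy))  # []
```

Passing the two columns as separate arguments lets map() hand one value from each column to add() per call.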
Conclusion
In this article, we explored three alternatives to the .apply() function, each offering enhanced efficiency for data manipulation tasks:
- .itertuples()
- list comprehension
- map()
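To compare these approaches on your own machine, a small harness along the following lines may help. It is only a sketch: the timed() helper, the df_bench name, the 50,000-row size, and the use of time.perf_counter() are choices made for this example, not part of the benchmark above. It also checks that all approaches return the same values.

```python
import time
import numpy as np
import pandas as pd

def add(a, b):
    return a + b

def timed(label, fn):
    """Run fn once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    out = fn()
    print(f"{label}: {time.perf_counter() - start:.5f} seconds")
    return out

# 50,000 rows keeps the run short; this size is an arbitrary choice
rng = np.random.default_rng(0)
n = 50_000
df_bench = pd.DataFrame({"A": rng.random(n), "B": rng.random(n)})

results = {
    "apply": timed(
        "apply",
        lambda: df_bench.apply(lambda r: add(r["A"], r["B"]), axis=1).tolist(),
    ),
    "itertuples": timed(
        "itertuples",
        lambda: [add(a, b) for a, b in df_bench.itertuples(index=False, name=None)],
    ),
    "comprehension": timed(
        "comprehension",
        lambda: [a + b for a, b in zip(df_bench["A"], df_bench["B"])],
    ),
    "map": timed(
        "map",
        lambda: list(map(add, df_bench["A"], df_bench["B"])),
    ),
}

# All four approaches should produce the same values
reference = results["apply"]
assert all(np.allclose(reference, other) for other in results.values())
```

Timings will vary with hardware and library versions, so treat the printed numbers as relative rather than absolute.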
By incorporating these alternatives into your machine learning pipeline, you can significantly reduce the chances of performance bottlenecks, ensuring more efficient data processing and smoother model deployment.