Faster Alternatives to Pandas `.apply()` Function for Scalable Machine Learning Pipelines

3 min read · Sep 23, 2024

We’ve all been guilty of sticking to the data science practices that we learned when we began our journey in the world of data.
There always comes a time when you realize that some methods, while familiar, may not be the most efficient — especially in a production environment where performance matters.

In this article, I would like to share one of the best practices I learned while deploying machine learning pipelines in production.

This article will guide you through three alternatives to the .apply() function, each of which avoids much of its row-wise overhead.

Definition of `.apply()` function

In Pandas, the .apply() function is both versatile and powerful, offering a wide range of possibilities for data manipulation.

While many of us are already familiar with its capabilities, a quick refresher can be helpful. In simple terms, the .apply() function allows you to apply a function along a specific axis of a DataFrame or Series. It essentially loops over each row or column, applying the provided function to perform transformations, compute custom aggregates, or execute complex operations.

The function passed to .apply() can handle a variety of tasks, from simple transformations to intricate row-level or column-level calculations.
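To make the two axes concrete, here is a minimal sketch on a tiny DataFrame. The name `df_small` and its values are illustrative assumptions, not part of the benchmark used later.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values)
df_small = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# axis=0 (the default): the function receives each column as a Series
col_ranges = df_small.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_sums = df_small.apply(lambda row: row["A"] + row["B"], axis=1)

print(col_ranges.to_dict())  # {'A': 2.0, 'B': 20.0}
print(row_sums.tolist())     # [11.0, 22.0, 33.0]
```

With axis=1, pandas builds a Series object for every row before calling the function, which is where most of the cost comes from on large frames.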

First, we will see how .apply() maps a simple addition function over the rows of a DataFrame, and then we will move on to the alternatives.

To demonstrate, we will create a DataFrame with 10 million rows and two columns, then use .apply() to add the two columns together with an add() function.

import time

import numpy as np
import pandas as pd

# Create a sample DataFrame with 10 million rows
n_rows = 10_000_000
df = pd.DataFrame({
    'A': np.random.rand(n_rows),
    'B': np.random.rand(n_rows)
})

# Function applied to each pair of values
def add(a, b):
    return a + b

Now we will run add() on every row with .apply().

# Measure time for apply
start_time = time.time()
result_apply = df.apply(lambda row: add(row['A'], row['B']), axis=1)
apply_time = time.time() - start_time

print(f"Time taken using apply: {apply_time:.5f} seconds.")

Time taken using apply: 100.67693 seconds.

It took around 1.7 minutes (roughly 100 seconds) to apply the add() function across the two columns.

Alternative 1: `.itertuples()` function

Our first alternative to the .apply() function is .itertuples(). This method iterates over the rows of a DataFrame and returns named tuples, where each tuple corresponds to a row.

Why is this better?
1. Avoids building a pandas Series for every row, which is what .apply() does with axis=1. Named tuples are much lighter objects, so the per-row overhead drops sharply.
2. Allows you to access columns by name using dot notation.
3. Generates rows on the fly as you iterate.
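The points above can be seen on a two-row frame. This is a minimal sketch; `df_small` and its values are illustrative assumptions.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values)
df_small = pd.DataFrame({"A": [1.0, 2.0], "B": [10.0, 20.0]})

rows = df_small.itertuples()   # an iterator: rows are produced as you consume it
first = next(rows)

print(first._fields)           # ('Index', 'A', 'B')
row_total = first.A + first.B  # columns are reachable with dot notation
print(row_total)

# index=False drops the index field; name=None yields plain tuples
plain = list(df_small.itertuples(index=False, name=None))
print(plain)
```

Plain tuples (name=None) skip the named-tuple construction, which can shave off a little more time when you only need positional access.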

Implementing .itertuples() on the same example:

# Measure time for itertuples
start_time = time.time()
result_itertuples = []
for index, a, b in df.itertuples():
    result_itertuples.append(add(a, b))
itertuples_time = time.time() - start_time
print(f"Time taken using itertuples: {itertuples_time:.5f} seconds.")

Time taken using itertuples: 10.88897 seconds.

With our first alternative, we have already achieved a clear performance improvement: roughly 9x faster than .apply()!

Alternative 2: comprehension

A list comprehension, the concise and Pythonic way to build a list from existing data, is another effective alternative to the .apply() function.

Why is this better?
1. Leverages Python's built-in constructs and avoids the overhead of extra method calls.
2. Readable!
3. Iterates directly over the column values with zip(), without building a row object for each row.

Implementing comprehension on the same example:

# Measure time for comprehension
start_time = time.time()
result_comprehension = [a + b for a, b in zip(df['A'], df['B'])]
comprehension_time = time.time() - start_time
print(f"Time taken using comprehension: {comprehension_time:.5f} seconds.")

Time taken using comprehension: 4.64803 seconds.

Roughly 22x faster than the .apply() function.

Alternative 3: map

Yet another alternative to the .apply() method is Python's built-in map() function. It applies a function to every item of one or more iterables and returns an iterator.

Why is this better?
1. The loop itself runs inside the built-in map() machinery, which can trim some of the interpreter overhead of a hand-written loop.
2. Simple: it provides a straightforward way to apply a function to the elements of one or more Series.
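One detail matters when timing map(): it returns an iterator, so no work happens until that iterator is consumed. A minimal sketch with plain lists (the values are illustrative assumptions):

```python
def add(a, b):
    return a + b

xs = [1, 2, 3]
ys = [10, 20, 30]

lazy = map(add, xs, ys)      # nothing has been computed yet
print(type(lazy).__name__)   # map

result = list(lazy)          # consuming the iterator calls add() on each pair
print(result)                # [11, 22, 33]

leftover = list(lazy)        # the iterator is now exhausted
print(leftover)              # []
```

Passing the two sequences as separate arguments is what lets map() hand one element from each to add() on every step.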

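# Measure time for map
start_time = time.time()
# map() is lazy, so list() is needed to actually run add() on every pair
result_map = list(map(add, df['A'], df['B']))
map_time = time.time() - start_time
print(f"Time taken using map: {map_time:.5f} seconds")

Note that map() returns a lazy iterator. Timing the bare map() call gives 0.06153 seconds, but that only measures creating the iterator, not the additions. With list() forcing the computation, the time is likely to land in the same range as the comprehension rather than showing a further large speedup.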
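To check the comparison on your own machine, a small harness like the one below can help. It is only a sketch under stated assumptions: the `timed` helper, the `df_check` name, and the 50,000-row size are hypothetical choices made so it finishes quickly, and the map() call is wrapped in list() so the additions actually run.

```python
import time

import numpy as np
import pandas as pd


def add(a, b):
    return a + b


# Much smaller than 10 million rows so the check runs in a few seconds
rng = np.random.default_rng(0)
n_check = 50_000
df_check = pd.DataFrame({"A": rng.random(n_check), "B": rng.random(n_check)})


def timed(label, fn):
    """Run fn once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    out = fn()
    print(f"{label:<14}{time.perf_counter() - start:.5f} s")
    return out


results = {
    "apply": timed("apply", lambda: df_check.apply(
        lambda row: add(row["A"], row["B"]), axis=1).tolist()),
    "itertuples": timed("itertuples", lambda: [
        add(a, b) for _, a, b in df_check.itertuples()]),
    "comprehension": timed("comprehension", lambda: [
        a + b for a, b in zip(df_check["A"], df_check["B"])]),
    "map": timed("map", lambda: list(map(add, df_check["A"], df_check["B"]))),
}

# Every approach should produce the same values
reference = results["apply"]
all_match = all(np.allclose(reference, r) for r in results.values())
print("All results match:", all_match)
```

Absolute timings depend on hardware and library versions, so treat the printed numbers as relative indicators only.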

Conclusion

In this article, we explored three alternatives to the .apply() function, each offering better efficiency for row-wise data manipulation tasks:
- .itertuples()
- comprehension
- map()

By incorporating these alternatives into your machine learning pipeline, you can significantly reduce the chances of performance bottlenecks, ensuring more efficient data processing and smoother model deployment.

Reference:

[1] Pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

Written by Mohit Sharma
