Image generated with ChatGPT
Pandas is one of the most popular data manipulation and analysis tools available, known for its ease of use and powerful capabilities. But did you know that you can also use it to create and execute data pipelines for processing and analyzing datasets?
In this tutorial, we will learn how to use Pandas’ `pipe` method to build end-to-end data science pipelines. The pipeline includes various steps like data ingestion, data cleaning, data analysis, and data visualization. To highlight the benefits of this approach, we will also compare pipeline-based code with non-pipeline alternatives, giving you a clear understanding of the differences and advantages.
What is a Pandas Pipe?
The Pandas `pipe` method is a powerful tool that allows users to chain multiple data processing functions in a clear and readable manner. This method can handle both positional and keyword arguments, making it flexible for various custom functions.
In short, Pandas `pipe` method:
Enhances Code Readability
Enables Function Chaining
Accommodates Custom Functions
Improves Code Organization
Efficient for Complex Transformations
Here is a code example of the `pipe` method. We apply the `clean` and `analysis` Python functions to a Pandas DataFrame. The `pipe` method first cleans the data, then performs the analysis, and returns the output.
(
    df.pipe(clean)
      .pipe(analysis)
)
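The `pipe` method also forwards extra positional and keyword arguments to the piped function. Here is a minimal, self-contained sketch of that behavior (the toy DataFrame and the `drop_missing` and `add_total` helpers are illustrative, not part of the tutorial's dataset):
import pandas as pd

# Toy DataFrame just for demonstration.
toy_df = pd.DataFrame({"price": [10.0, 12.5, None], "qty": [2, 3, 4]})

def drop_missing(data, subset=None):
    # `subset` is forwarded as a keyword argument by pipe.
    return data.dropna(subset=subset)

def add_total(data, price_col, qty_col):
    # `price_col` and `qty_col` are forwarded as positional arguments by pipe.
    return data.assign(total=data[price_col] * data[qty_col])

result = (
    toy_df.pipe(drop_missing, subset=["price"])
          .pipe(add_total, "price", "qty")
)
print(result)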
Pandas Code without Pipe
First, we will write a simple data analysis script without `pipe` so that we have a clear baseline to compare against the pipeline version.
For this tutorial, we will be using the Online Sales Dataset – Popular Marketplace Data from Kaggle that contains information about online sales transactions across different product categories.
We will load the CSV file and display the top three rows from the dataset.
import pandas as pd
df = pd.read_csv('/work/Online Sales Data.csv')
df.head(3)
Clean the dataset by dropping duplicates and missing values, and resetting the index.
Convert column types. We will convert "Product Category" and "Product Name" to string, and the "Date" column to datetime.
To perform the analysis, we will create a "month" column from the "Date" column. Then, we will calculate the mean units sold per month.
Visualize the average units sold per month as a bar chart.
# data cleaning
df = df.drop_duplicates()
df = df.dropna()
df = df.reset_index(drop=True)

# convert types
df['Product Category'] = df['Product Category'].astype('str')
df['Product Name'] = df['Product Name'].astype('str')
df['Date'] = pd.to_datetime(df['Date'])

# data analysis
df['month'] = df['Date'].dt.month
new_df = df.groupby('month')['Units Sold'].mean()

# data visualization
new_df.plot(kind='bar', figsize=(10, 5), title="Average Units Sold by Month");
This is quite simple, and if you are a data scientist or even a data science student, you will know how to perform most of these tasks.
Building Data Science Pipelines Using Pandas Pipe
To create an end-to-end data science pipeline, we first have to refactor the above code into Python functions.
We will create Python functions for:
Loading the data: It requires the path of the CSV file.
Cleaning the data: It requires the raw DataFrame and returns the cleaned DataFrame.
Converting column types: It requires the cleaned DataFrame and a dictionary of data types, and returns the DataFrame with the correct types.
Data analysis: It requires the converted DataFrame and returns a Series of average units sold per month.
Data visualization: It requires the aggregated data and a visualization type, and generates the plot.
def load_data(path):
    return pd.read_csv(path)

def data_cleaning(data):
    data = data.drop_duplicates()
    data = data.dropna()
    data = data.reset_index(drop=True)
    return data

def convert_dtypes(data, types_dict=None):
    data = data.astype(dtype=types_dict)
    # convert the 'Date' column to datetime
    data['Date'] = pd.to_datetime(data['Date'])
    return data

def data_analysis(data):
    data['month'] = data['Date'].dt.month
    new_df = data.groupby('month')['Units Sold'].mean()
    return new_df

def data_visualization(new_df, vis_type="bar"):
    new_df.plot(kind=vis_type, figsize=(10, 5), title="Average Units Sold by Month")
    return new_df
We will now use the `pipe` method to chain all of the above Python functions in series. As you can see, we provide the file path to the `load_data` function, the data types to the `convert_dtypes` function, and the visualization type to the `data_visualization` function. Instead of a bar chart, we will use a line chart this time.
Building the pipeline this way allows us to experiment with different scenarios without changing the overall code: it standardizes the workflow and makes it more readable.
path = "/work/Online Sales Data.csv"

df = (pd.DataFrame()
      .pipe(lambda x: load_data(path))
      .pipe(data_cleaning)
      .pipe(convert_dtypes, {'Product Category': 'str', 'Product Name': 'str'})
      .pipe(data_analysis)
      .pipe(data_visualization, 'line')
)
The end result looks awesome.
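As an optional refinement (a sketch, not part of the original code), you can wrap the chain in a function so the same pipeline can be rerun on a different file or with a different chart type by changing an argument rather than editing the chain:
def run_pipeline(path, types_dict, vis_type="bar"):
    # Reuses the functions defined above; only the inputs change.
    return (load_data(path)
            .pipe(data_cleaning)
            .pipe(convert_dtypes, types_dict)
            .pipe(data_analysis)
            .pipe(data_visualization, vis_type))

result = run_pipeline("/work/Online Sales Data.csv",
                      {"Product Category": "str", "Product Name": "str"},
                      vis_type="line")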
Conclusion
In this short tutorial, we learned about the Pandas `pipe` method and how to use it to build and execute end-to-end data science pipelines. The pipeline makes your code more readable, reproducible, and better organized. By integrating the `pipe` method into your workflow, you can streamline your data processing tasks and enhance the overall efficiency of your projects. Additionally, because `pipe` passes the entire DataFrame to a function in a single call, replacing row-wise `.apply()` logic with piped, vectorized functions can lead to significantly faster execution times.
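To illustrate that last point, here is a rough, self-contained comparison sketch (the data is synthetic and timings will vary by machine):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({"Units Sold": np.random.randint(1, 100, 1_000_000)})

# Row-wise .apply(): the Python lambda runs once per row.
start = time.perf_counter()
slow = df["Units Sold"].apply(lambda u: u * 1.2)
print(f".apply(): {time.perf_counter() - start:.3f}s")

# Piped vectorized function: one call over the whole column.
def add_revenue(data):
    return data.assign(revenue=data["Units Sold"] * 1.2)

start = time.perf_counter()
fast = df.pipe(add_revenue)
print(f".pipe():  {time.perf_counter() - start:.3f}s")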
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.