52_Pandas_For_Data_Manipulation

Category: AI & Data Science Tools
Type: AI/ML Tool or Library
Generated on: 2025-08-26 11:07:44
For: Data Science, Machine Learning & Technical Interviews


Pandas for Data Manipulation: Cheat Sheet for AI/ML Engineers

1. Tool/Library Overview

Pandas is a powerful and flexible Python library for data analysis and manipulation. It provides data structures like DataFrames (tabular data) and Series (one-dimensional arrays) that make it easy to clean, transform, analyze, and visualize data. In AI/ML, Pandas is essential for:

  • Data Loading & Exploration: Reading data from various formats (CSV, Excel, SQL, etc.) and understanding its structure.
  • Data Cleaning & Preprocessing: Handling missing values, removing duplicates, and transforming data types.
  • Feature Engineering: Creating new features from existing ones to improve model performance.
  • Data Aggregation & Summarization: Calculating statistics and grouping data for analysis.
  • Integration with ML Libraries: Feeding cleaned and prepared data into scikit-learn, TensorFlow, PyTorch, etc.
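The workflow above can be sketched end to end. This is a minimal illustration, not a canonical recipe: the CSV content, column names, and derived feature are all invented for the example (an in-memory string stands in for a file on disk).

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """age,income,city
25,50000,NY
30,,London
28,62000,Paris
"""

# Load
df = pd.read_csv(io.StringIO(csv_text))

# Clean: fill the missing income with the column mean
df["income"] = df["income"].fillna(df["income"].mean())

# Feature engineering: a simple derived column
df["income_per_year_of_age"] = df["income"] / df["age"]

# The cleaned frame can now be handed to scikit-learn, TensorFlow, etc.
print(df.shape)  # (3, 4)
```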

2. Installation & Setup

Terminal window
# Using pip (recommended)
pip install pandas
# Using conda
conda install pandas

Basic Import:

import pandas as pd
import numpy as np # Often used alongside Pandas

3. Core Features & API

  • pd.DataFrame: The primary data structure, representing a table of data with rows and columns.
  • pd.Series: A one-dimensional labeled array capable of holding any data type.
  • pd.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, usecols=None, dtype=None, na_values=None, parse_dates=False, nrows=None): Reads CSV files into a DataFrame.
    • sep: Delimiter (default: ',').
    • header: Row number(s) to use as the column names (default: 'infer'). None if no header row.
    • names: List of column names to use.
    • index_col: Column(s) to use as the row index.
    • usecols: List of columns to import.
    • dtype: Dictionary specifying data types for columns (e.g., {'col1': 'int', 'col2': 'float'}).
    • na_values: Values to recognize as NaN/missing.
    • parse_dates: List of columns to parse as dates.
    • nrows: Number of rows to read.
  • pd.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, dtype=None, na_values=None, parse_dates=False): Reads Excel files into a DataFrame. sheet_name specifies the sheet to read.
  • df.head(n=5)/df.tail(n=5): Returns the first/last n rows of the DataFrame.
  • df.info(): Provides a concise summary of the DataFrame, including data types, non-null counts, and memory usage.
  • df.describe(): Generates descriptive statistics (count, mean, std, min, max, quartiles) for numerical columns.
  • df.shape: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
  • df.columns: Returns an Index object containing the column names.
  • df.index: Returns an Index object representing the row labels.
  • df.dtypes: Returns a Series with the data type of each column.
  • df.isnull()/df.notnull(): Returns a DataFrame of boolean values indicating missing (NaN) values.
  • df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False): Removes rows or columns containing missing values.
    • axis: 0 (rows, default) or 1 (columns).
    • how: 'any' (drop if any NA values are present, default) or 'all' (drop if all values are NA).
    • thresh: Require that many non-NA values to survive.
    • subset: List of columns to consider when dropping rows.
    • inplace: Modify the DataFrame directly (default: False).
  • df.fillna(value=None, method=None, axis=None, inplace=False, limit=None): Fills missing values.
    • value: Value to use for filling.
    • method: 'ffill' (forward fill) or 'bfill' (backward fill). Deprecated since pandas 2.1; prefer df.ffill()/df.bfill().
    • limit: Maximum number of consecutive NaN values to fill.
  • df.astype(dtype, copy=True, errors='raise'): Converts the data type of columns. dtype can be a single type or a dictionary of column names to types.
  • df.rename(columns=None, index=None, inplace=False): Renames columns or rows.
  • df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise'): Drops rows or columns.
  • df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True): Groups rows based on one or more columns. Returns a DataFrameGroupBy object.
  • df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None): Sorts the DataFrame by one or more columns.
  • df.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs): Applies a function along an axis of the DataFrame.
  • df.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None): Joins DataFrames based on common columns.
    • how: 'inner' (default), 'outer', 'left', 'right'.
    • on: Column(s) to join on.
    • left_on, right_on: Columns to join on when column names differ.
  • pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True): Concatenates DataFrames along an axis. Note that this is a top-level function (pd.concat([df1, df2])), not a DataFrame method.
  • df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True): Creates a pivot table.
  • df.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', lineterminator=None, date_format=None, doublequote=True, escapechar=None, decimal='.', errors='strict', storage_options=None): Writes the DataFrame to a CSV file.
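To make the join-related entries concrete, here is a small sketch of merge and concat; the frames, ids, and scores are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [92, 85, 78]})

# Inner join keeps only ids present in both frames
inner = left.merge(right, how="inner", on="id")
print(inner.shape)    # (2, 3) -- rows for ids 2 and 3

# Outer join keeps everything, filling gaps with NaN
outer = left.merge(right, how="outer", on="id")
print(outer.shape)    # (4, 3)

# Stacking frames vertically uses the top-level pd.concat
stacked = pd.concat([left, left], ignore_index=True)
print(stacked.shape)  # (6, 2)
```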

4. Practical Examples

# Create a DataFrame from a dictionary
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 28, 22],
        'city': ['New York', 'London', 'Paris', 'Tokyo'],
        'score': [85, 92, np.nan, 78]}  # Include a NaN value
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Data Exploration
print("\nDataFrame Info:")
df.info()  # info() prints its summary directly and returns None, so don't wrap it in print()
print("\nDescriptive Statistics:\n", df.describe())
print("\nFirst 2 rows:\n", df.head(2))
# Handling Missing Values
print("\nMissing Values (isnull()):\n", df.isnull())
df_filled = df.fillna({'score': df['score'].mean()}) # Fill NaN with mean
print("\nDataFrame after filling NaN:\n", df_filled)
# Data Type Conversion
df_filled['age'] = df_filled['age'].astype(float)
print("\nData types after conversion:\n", df_filled.dtypes)
# Filtering Data
young_people = df_filled[df_filled['age'] < 30]
print("\nYoung People (age < 30):\n", young_people)
# Grouping and Aggregation
city_avg_age = df_filled.groupby('city')['age'].mean()
print("\nAverage age per city:\n", city_avg_age)
# Feature Engineering (creating a new column)
df_filled['age_squared'] = df_filled['age']**2
print("\nDataFrame with new feature:\n", df_filled)
# Saving to CSV
df_filled.to_csv('processed_data.csv', index=False) # Don't include index in CSV

Expected Output (examples):

Original DataFrame:
      name  age      city  score
0    Alice   25  New York   85.0
1      Bob   30    London   92.0
2  Charlie   28     Paris    NaN
3    David   22     Tokyo   78.0
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    4 non-null      object
 1   age     4 non-null      int64
 2   city    4 non-null      object
 3   score   3 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 256.0+ bytes
... (other outputs as described in the code comments)

5. Advanced Usage

# Multi-indexing
data = {'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'product': ['A', 'B', 'A', 'B'],
        'sales': [100, 150, 120, 180]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date']) # Convert to datetime
df_pivot = df.pivot_table(index='date', columns='product', values='sales')
print("\nPivot Table:\n", df_pivot)
# Using apply with lambda functions
df['sales_increase'] = df.groupby('product')['sales'].diff()  # per-group difference, aligned to the original index
print("\nSales increase:\n", df)
# Custom Aggregation Functions
def custom_agg(x):
    return x.max() - x.min()
grouped = df.groupby('product')['sales'].agg(['mean', 'median', custom_agg])
print("\nCustom aggregation:\n", grouped)
# Performance Tip: Vectorization
# Avoid looping; use Pandas' built-in vectorized operations for speed.
# Example: Instead of iterating through rows to apply a function, use:
# df['new_column'] = df['existing_column'] * 2 (This is much faster)
# Use `pd.Categorical` for memory efficiency when dealing with repeating string values
df['product'] = pd.Categorical(df['product'])
# Using `pd.cut` and `pd.qcut` for binning numerical data
ages = pd.Series([25, 30, 28, 22])
age_groups = pd.cut(ages, bins=[0, 25, 30, 100], labels=['Young', 'Mid', 'Senior'])
print("\nAge Groups:\n", age_groups)
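To make the vectorization tip concrete, here is a hedged comparison on synthetic data. Exact timings vary by machine, so none are claimed; the point is that the column-wide operation replaces an explicit Python loop while producing identical results.

```python
import time

import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": np.arange(100_000)})

# Slow: explicit Python loop over rows
start = time.perf_counter()
doubled_loop = [row.x * 2 for row in frame.itertuples()]
loop_time = time.perf_counter() - start

# Fast: one vectorized operation over the whole column
start = time.perf_counter()
doubled_vec = frame["x"] * 2
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```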

6. Tips & Tricks

  • Chaining Methods: Write concise code by chaining Pandas methods:

    df = (pd.read_csv('my_data.csv')
            .dropna()
            .rename(columns={'old_name': 'new_name'}))
  • inplace=True vs. Assignment: Using inplace=True modifies the DataFrame directly, but it can sometimes be less predictable. Assignment (df = df.dropna()) is often preferred for clarity.

  • Use loc and iloc for label-based and integer-based indexing, respectively. loc is inclusive of the end value; iloc is exclusive.

  • Memory Optimization: For large datasets:

    • Use appropriate data types (e.g., int8 instead of int64 if the range of values is small).
    • Read only the necessary columns using usecols in pd.read_csv.
    • Use chunksize in pd.read_csv to process data in smaller chunks.
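A quick sketch of the loc/iloc distinction and of dtype downcasting; the frame, labels, and value range here are invented for illustration.

```python
import numpy as np
import pandas as pd

labeled = pd.DataFrame({"val": range(10)}, index=list("abcdefghij"))

# loc slices by label and INCLUDES the end label
print(labeled.loc["b":"d"])   # rows b, c, d -> 3 rows

# iloc slices by position and EXCLUDES the end position
print(labeled.iloc[1:4])      # positions 1, 2, 3 -> the same 3 rows

# Downcasting saves memory when the value range allows it
big = pd.Series(np.arange(1000), dtype="int64")
small = big.astype("int32")
print(big.memory_usage(deep=True), small.memory_usage(deep=True))
```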

7. Integration

  • NumPy: Pandas builds on NumPy, so you can easily convert between DataFrames/Series and NumPy arrays:

    import numpy as np
    data = df['age'].to_numpy() # Series to NumPy array
    df_from_array = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['A', 'B', 'C'])
  • Scikit-learn: Scikit-learn estimators accept Pandas DataFrames directly (converting them to NumPy arrays internally), so a prepared DataFrame can be passed straight to fit/predict.

    from sklearn.model_selection import train_test_split
    X = df_filled[['age', 'score']] # Features
    y = df_filled['name'] # Target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • Matplotlib/Seaborn: Pandas provides convenient plotting methods that integrate well with Matplotlib and Seaborn.

    import matplotlib.pyplot as plt
    df['age'].hist()
    plt.show()
    import seaborn as sns
    sns.scatterplot(x='age', y='score', data=df)
    plt.show()

8. Further Resources