52_Pandas_For_Data_Manipulation

Category: AI & Data Science Tools
Type: AI/ML Tool or Library
Generated on: 2025-08-26 11:07:44
For: Data Science, Machine Learning & Technical Interviews


Pandas for Data Manipulation: Cheat Sheet for AI/ML Engineers

1. Tool/Library Overview

Pandas is a powerful and flexible Python library for data analysis and manipulation. It provides data structures like DataFrames (tabular data) and Series (one-dimensional arrays) that make it easy to clean, transform, analyze, and visualize data. In AI/ML, Pandas is essential for:

  • Data Loading & Exploration: Reading data from various formats (CSV, Excel, SQL, etc.) and understanding its structure.
  • Data Cleaning & Preprocessing: Handling missing values, removing duplicates, and transforming data types.
  • Feature Engineering: Creating new features from existing ones to improve model performance.
  • Data Aggregation & Summarization: Calculating statistics and grouping data for analysis.
  • Integration with ML Libraries: Feeding cleaned and prepared data into scikit-learn, TensorFlow, PyTorch, etc.
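The workflow above can be sketched end to end. This is a minimal illustration, not a canonical recipe: the CSV content, column names, and derived feature are all invented for the example (an in-memory string stands in for a file on disk).

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """age,income,city
25,50000,NY
30,,London
28,62000,Paris
"""

# Load
df = pd.read_csv(io.StringIO(csv_text))

# Clean: fill the missing income with the column mean
df["income"] = df["income"].fillna(df["income"].mean())

# Feature engineering: a simple derived column
df["income_per_year_of_age"] = df["income"] / df["age"]

# The cleaned frame can now be handed to scikit-learn, TensorFlow, etc.
print(df.shape)  # (3, 4)
```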

2. Installation & Setup

Terminal window
# Using pip (recommended)
pip install pandas
# Using conda
conda install pandas

Basic Import:

import pandas as pd
import numpy as np # Often used alongside Pandas

3. Core Features & API

  • pd.DataFrame: The primary data structure, representing a table of data with rows and columns.
  • pd.Series: A one-dimensional labeled array capable of holding any data type.
  • pd.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, usecols=None, dtype=None, na_values=None, parse_dates=False, nrows=None): Reads CSV files into a DataFrame.
    • sep: Delimiter (default: ',').
    • header: Row number(s) to use as the column names (default: 'infer'). None if no header row.
    • names: List of column names to use.
    • index_col: Column(s) to use as the row index.
    • usecols: List of columns to import.
    • dtype: Dictionary specifying data types for columns (e.g., {'col1': 'int', 'col2': 'float'}).
    • na_values: Values to recognize as NaN/missing.
    • parse_dates: List of columns to parse as dates.
    • nrows: Number of rows to read.
  • pd.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, dtype=None, na_values=None, parse_dates=False): Reads Excel files into a DataFrame. sheet_name specifies the sheet to read.
  • df.head(n=5)/df.tail(n=5): Returns the first/last n rows of the DataFrame.
  • df.info(): Provides a concise summary of the DataFrame, including data types, non-null counts, and memory usage.
  • df.describe(): Generates descriptive statistics (count, mean, std, min, max, quartiles) for numerical columns.
  • df.shape: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
  • df.columns: Returns an Index object containing the column names.
  • df.index: Returns an Index object representing the row labels.
  • df.dtypes: Returns a Series with the data type of each column.
  • df.isnull()/df.notnull(): Returns a DataFrame of boolean values indicating missing (NaN) values.
  • df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False): Removes rows or columns containing missing values.
    • axis: 0 (rows, default) or 1 (columns).
    • how: 'any' (drop if any NA values are present, default) or 'all' (drop if all values are NA).
    • thresh: Require that many non-NA values to survive.
    • subset: List of columns to consider when dropping rows.
    • inplace: Modify the DataFrame directly (default: False).
  • df.fillna(value=None, method=None, axis=None, inplace=False, limit=None): Fills missing values.
    • value: Value to use for filling.
    • method: 'ffill' (forward fill) or 'bfill' (backward fill). Deprecated since pandas 2.1; prefer df.ffill()/df.bfill().
    • limit: Maximum number of consecutive NaN values to fill.
  • df.astype(dtype, copy=True, errors='raise'): Converts the data type of columns. dtype can be a single type or a dictionary of column names to types.
  • df.rename(columns=None, index=None, inplace=False): Renames columns or rows.
  • df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise'): Drops rows or columns.
  • df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True): Groups rows based on one or more columns. Returns a DataFrameGroupBy object.
  • df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None): Sorts the DataFrame by one or more columns.
  • df.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs): Applies a function along an axis of the DataFrame.
  • df.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None): Joins DataFrames based on common columns.
    • how: 'inner' (default), 'outer', 'left', 'right'.
    • on: Column(s) to join on.
    • left_on, right_on: Columns to join on when column names differ.
  • pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True): Concatenates DataFrames along an axis. Note that this is a top-level function (pd.concat([df1, df2])), not a DataFrame method.
  • df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True): Creates a pivot table.
  • df.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', lineterminator=None, date_format=None, doublequote=True, escapechar=None, decimal='.', errors='strict', storage_options=None): Writes the DataFrame to a CSV file.
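To make the join-related entries concrete, here is a small sketch of merge and concat; the frames, ids, and scores are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [92, 85, 78]})

# Inner join keeps only ids present in both frames
inner = left.merge(right, how="inner", on="id")
print(inner.shape)    # (2, 3) -- rows for ids 2 and 3

# Outer join keeps everything, filling gaps with NaN
outer = left.merge(right, how="outer", on="id")
print(outer.shape)    # (4, 3)

# Stacking frames vertically uses the top-level pd.concat
stacked = pd.concat([left, left], ignore_index=True)
print(stacked.shape)  # (6, 2)
```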

4. Practical Examples

# Create a DataFrame from a dictionary
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 28, 22],
        'city': ['New York', 'London', 'Paris', 'Tokyo'],
        'score': [85, 92, np.nan, 78]}  # Include a NaN value
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Data Exploration
print("\nDataFrame Info:")
df.info()  # info() prints its summary directly and returns None, so don't wrap it in print()
print("\nDescriptive Statistics:\n", df.describe())
print("\nFirst 2 rows:\n", df.head(2))
# Handling Missing Values
print("\nMissing Values (isnull()):\n", df.isnull())
df_filled = df.fillna({'score': df['score'].mean()}) # Fill NaN with mean
print("\nDataFrame after filling NaN:\n", df_filled)
# Data Type Conversion
df_filled['age'] = df_filled['age'].astype(float)
print("\nData types after conversion:\n", df_filled.dtypes)
# Filtering Data
young_people = df_filled[df_filled['age'] < 30]
print("\nYoung People (age < 30):\n", young_people)
# Grouping and Aggregation
city_avg_age = df_filled.groupby('city')['age'].mean()
print("\nAverage age per city:\n", city_avg_age)
# Feature Engineering (creating a new column)
df_filled['age_squared'] = df_filled['age']**2
print("\nDataFrame with new feature:\n", df_filled)
# Saving to CSV
df_filled.to_csv('processed_data.csv', index=False) # Don't include index in CSV

Expected Output (examples):

Original DataFrame:
      name  age      city  score
0    Alice   25  New York   85.0
1      Bob   30    London   92.0
2  Charlie   28     Paris    NaN
3    David   22     Tokyo   78.0
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    4 non-null      object
 1   age     4 non-null      int64
 2   city    4 non-null      object
 3   score   3 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 256.0+ bytes
... (other outputs as described in the code comments)

5. Advanced Usage

# Multi-indexing
data = {'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'product': ['A', 'B', 'A', 'B'],
        'sales': [100, 150, 120, 180]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date']) # Convert to datetime
df_pivot = df.pivot_table(index='date', columns='product', values='sales')
print("\nPivot Table:\n", df_pivot)
# Using apply with lambda functions
df['sales_increase'] = df.groupby('product')['sales'].diff()  # per-group difference, aligned to the original index
print("\nSales increase:\n", df)
# Custom Aggregation Functions
def custom_agg(x):
    return x.max() - x.min()
grouped = df.groupby('product')['sales'].agg(['mean', 'median', custom_agg])
print("\nCustom aggregation:\n", grouped)
# Performance Tip: Vectorization
# Avoid looping; use Pandas' built-in vectorized operations for speed.
# Example: Instead of iterating through rows to apply a function, use:
# df['new_column'] = df['existing_column'] * 2 (This is much faster)
# Use `pd.Categorical` for memory efficiency when dealing with repeating string values
df['product'] = pd.Categorical(df['product'])
# Using `pd.cut` and `pd.qcut` for binning numerical data
ages = pd.Series([25, 30, 28, 22])
age_groups = pd.cut(ages, bins=[0, 25, 30, 100], labels=['Young', 'Mid', 'Senior'])
print("\nAge Groups:\n", age_groups)
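To make the vectorization tip concrete, here is a hedged comparison on synthetic data. Exact timings vary by machine, so none are claimed; the point is that the column-wide operation replaces an explicit Python loop while producing identical results.

```python
import time

import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": np.arange(100_000)})

# Slow: explicit Python loop over rows
start = time.perf_counter()
doubled_loop = [row.x * 2 for row in frame.itertuples()]
loop_time = time.perf_counter() - start

# Fast: one vectorized operation over the whole column
start = time.perf_counter()
doubled_vec = frame["x"] * 2
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```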

6. Tips & Tricks

  • Chaining Methods: Write concise code by chaining Pandas methods:

    df = (pd.read_csv('my_data.csv')
            .dropna()
            .rename(columns={'old_name': 'new_name'}))
  • inplace=True vs. Assignment: Using inplace=True modifies the DataFrame directly, but it can sometimes be less predictable. Assignment (df = df.dropna()) is often preferred for clarity.

  • Use loc and iloc for label-based and integer-based indexing, respectively. loc is inclusive of the end value; iloc is exclusive.

  • Memory Optimization: For large datasets:

    • Use appropriate data types (e.g., int8 instead of int64 if the range of values is small).
    • Read only the necessary columns using usecols in pd.read_csv.
    • Use chunksize in pd.read_csv to process data in smaller chunks.
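A quick sketch of the loc/iloc distinction and of dtype downcasting; the frame, labels, and value range here are invented for illustration.

```python
import numpy as np
import pandas as pd

labeled = pd.DataFrame({"val": range(10)}, index=list("abcdefghij"))

# loc slices by label and INCLUDES the end label
print(labeled.loc["b":"d"])   # rows b, c, d -> 3 rows

# iloc slices by position and EXCLUDES the end position
print(labeled.iloc[1:4])      # positions 1, 2, 3 -> the same 3 rows

# Downcasting saves memory when the value range allows it
big = pd.Series(np.arange(1000), dtype="int64")
small = big.astype("int32")
print(big.memory_usage(deep=True), small.memory_usage(deep=True))
```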

7. Integration

  • NumPy: Pandas builds on NumPy, so you can easily convert between DataFrames/Series and NumPy arrays:

    import numpy as np
    data = df['age'].to_numpy() # Series to NumPy array
    df_from_array = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['A', 'B', 'C'])
  • Scikit-learn: Scikit-learn estimators accept Pandas DataFrames directly (converting them to NumPy arrays internally), so a prepared DataFrame can be passed straight to fit/predict.

    from sklearn.model_selection import train_test_split
    X = df_filled[['age', 'score']] # Features
    y = df_filled['name'] # Target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • Matplotlib/Seaborn: Pandas provides convenient plotting methods that integrate well with Matplotlib and Seaborn.

    import matplotlib.pyplot as plt
    df['age'].hist()
    plt.show()
    import seaborn as sns
    sns.scatterplot(x='age', y='score', data=df)
    plt.show()

8. Further Resources