52_Pandas_For_Data_Manipulation
Category: AI & Data Science Tools
Type: AI/ML Tool or Library
Generated on: 2025-08-26 11:07:44
For: Data Science, Machine Learning & Technical Interviews
Pandas for Data Manipulation: Cheat Sheet for AI/ML Engineers
1. Tool/Library Overview
Pandas is a powerful and flexible Python library for data analysis and manipulation. It provides data structures like DataFrames (tabular data) and Series (one-dimensional arrays) that make it easy to clean, transform, analyze, and visualize data. In AI/ML, Pandas is essential for:
- Data Loading & Exploration: Reading data from various formats (CSV, Excel, SQL, etc.) and understanding its structure.
- Data Cleaning & Preprocessing: Handling missing values, removing duplicates, and transforming data types.
- Feature Engineering: Creating new features from existing ones to improve model performance.
- Data Aggregation & Summarization: Calculating statistics and grouping data for analysis.
- Integration with ML Libraries: Feeding cleaned and prepared data into scikit-learn, TensorFlow, PyTorch, etc.
2. Installation & Setup
```shell
# Using pip (recommended)
pip install pandas

# Using conda
conda install pandas
```

Basic Import:

```python
import pandas as pd
import numpy as np  # Often used alongside Pandas
```

3. Core Features & API
- `pd.DataFrame`: The primary data structure, representing a table of data with rows and columns.
- `pd.Series`: A one-dimensional labeled array capable of holding any data type.
- `pd.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, usecols=None, dtype=None, na_values=None, parse_dates=False, nrows=None)`: Reads CSV files into a DataFrame.
  - `sep`: Delimiter (default: `','`).
  - `header`: Row number(s) to use as the column names (default: `'infer'`); pass `None` if there is no header row.
  - `names`: List of column names to use.
  - `index_col`: Column(s) to use as the row index.
  - `usecols`: List of columns to import.
  - `dtype`: Dictionary specifying data types for columns (e.g., `{'col1': 'int', 'col2': 'float'}`).
  - `na_values`: Values to recognize as NaN/missing.
  - `parse_dates`: List of columns to parse as dates.
  - `nrows`: Number of rows to read.
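To see how several of these `read_csv` parameters combine, here is a minimal sketch using an in-memory CSV (the column names and data are made up for illustration; `io.StringIO` stands in for a real file path):

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a file on disk (made-up data)
csv_text = """id,signup,score,notes
1,2023-01-05,8.5,ok
2,2023-02-10,NA,missing score
3,2023-03-15,6.0,ok
"""

df = pd.read_csv(
    io.StringIO(csv_text),              # any path or file-like buffer works here
    usecols=['id', 'signup', 'score'],  # skip the 'notes' column entirely
    dtype={'id': 'int64'},              # force the id column's type
    na_values=['NA'],                   # treat the literal string 'NA' as missing
    parse_dates=['signup'],             # parse this column to datetime64
    index_col='id',                     # use 'id' as the row index
)
print(df.dtypes)
print(df['score'].isna().sum())
```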
- `pd.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, dtype=None, na_values=None, parse_dates=False)`: Reads Excel files into a DataFrame; `sheet_name` specifies the sheet to read.
- `df.head(n=5)` / `df.tail(n=5)`: Returns the first/last `n` rows of the DataFrame.
- `df.info()`: Provides a concise summary of the DataFrame, including data types, non-null counts, and memory usage.
- `df.describe()`: Generates descriptive statistics (count, mean, std, min, max, quartiles) for numerical columns.
- `df.shape`: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
- `df.columns`: Returns an Index object containing the column names.
- `df.index`: Returns an Index object representing the row labels.
- `df.dtypes`: Returns a Series with the data type of each column.
- `df.isnull()` / `df.notnull()`: Returns a DataFrame of boolean values indicating missing (NaN) values.
- `df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)`: Removes rows or columns containing missing values.
  - `axis`: 0 (rows, default) or 1 (columns).
  - `how`: `'any'` (drop if any NA values are present, default) or `'all'` (drop only if all values are NA).
  - `thresh`: Require that many non-NA values for a row/column to survive.
  - `subset`: List of columns to consider when dropping rows.
  - `inplace`: Modify the DataFrame directly (default: `False`).
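The interaction of `how`, `thresh`, and `subset` in `dropna` is easiest to see on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, np.nan, 30.0, 40.0],
    'c': ['x', 'y', 'z', 'w'],
})

print(df.dropna())                               # drops any row containing a NaN
print(df.dropna(how='all', subset=['a', 'b']))   # drops rows where BOTH a and b are NaN
print(df.dropna(thresh=3))                       # keeps rows with at least 3 non-NA values
```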
- `df.fillna(value=None, method=None, axis=None, inplace=False, limit=None)`: Fills missing values.
  - `value`: Value to use for filling (a scalar or a dict of column names to values).
  - `method`: `'ffill'` (forward fill) or `'bfill'` (backward fill).
  - `limit`: Maximum number of consecutive NaN values to fill.
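A quick sketch of the fill strategies on a made-up Series; note that recent pandas versions deprecate the `method=` argument in favor of the dedicated `ffill()`/`bfill()` methods used here:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

const_fill = s.fillna(0)   # constant fill
fwd = s.ffill()            # forward fill: propagate the last valid value
bwd = s.bfill(limit=1)     # backward fill, at most 1 consecutive NaN filled
print(const_fill.tolist())
print(fwd.tolist())
print(bwd.tolist())
```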
- `df.astype(dtype, copy=True, errors='raise')`: Converts the data type of columns; `dtype` can be a single type or a dictionary of column names to types.
- `df.rename(columns=None, index=None, inplace=False)`: Renames columns or rows.
- `df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')`: Drops rows or columns.
- `df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)`: Groups rows based on one or more columns. Returns a `DataFrameGroupBy` object.
- `df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)`: Sorts the DataFrame by one or more columns.
- `df.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)`: Applies a function along an axis of the DataFrame.
- `df.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)`: Joins DataFrames based on common columns.
  - `how`: `'inner'`, `'outer'`, `'left'`, or `'right'`.
  - `on`: Column(s) to join on.
  - `left_on`, `right_on`: Columns to join on when the column names differ.
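A minimal `merge` sketch with made-up employee data, contrasting `how='inner'` and `how='left'`:

```python
import pandas as pd

left = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
right = pd.DataFrame({'emp_id': [2, 3, 4], 'dept': ['Sales', 'IT', 'HR']})

inner = left.merge(right, on='emp_id', how='inner')     # keeps only ids present in both
left_join = left.merge(right, on='emp_id', how='left')  # keeps all rows of `left`
print(inner)
print(left_join)  # emp_id 1 gets NaN for dept
```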
- `pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)`: Concatenates DataFrames along an axis (note: `concat` is a top-level `pd` function, not a DataFrame method).
- `df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)`: Creates a pivot table.
- `df.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', lineterminator=None, date_format=None, doublequote=True, escapechar=None, decimal='.', errors='strict', storage_options=None)`: Writes the DataFrame to a CSV file.
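A short sketch combining `pd.concat` and `pivot_table` on made-up regional sales data:

```python
import pandas as pd

east = pd.DataFrame({'month': ['Jan', 'Feb'], 'region': ['East', 'East'], 'sales': [10, 20]})
west = pd.DataFrame({'month': ['Jan', 'Feb'], 'region': ['West', 'West'], 'sales': [30, 40]})

# pd.concat stacks the frames row-wise; ignore_index renumbers the rows
all_sales = pd.concat([east, west], ignore_index=True)

# Reshape the long table into a months x regions grid
pivot = all_sales.pivot_table(index='month', columns='region', values='sales', aggfunc='sum')
print(all_sales)
print(pivot)
```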
4. Practical Examples
```python
# Create a DataFrame from a dictionary
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 28, 22],
        'city': ['New York', 'London', 'Paris', 'Tokyo'],
        'score': [85, 92, np.nan, 78]}  # Include a NaN value
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Data Exploration
print("\nDataFrame Info:")
df.info()  # info() prints its summary directly and returns None
print("\nDescriptive Statistics:\n", df.describe())
print("\nFirst 2 rows:\n", df.head(2))

# Handling Missing Values
print("\nMissing Values (isnull()):\n", df.isnull())
df_filled = df.fillna({'score': df['score'].mean()})  # Fill NaN with the column mean
print("\nDataFrame after filling NaN:\n", df_filled)

# Data Type Conversion
df_filled['age'] = df_filled['age'].astype(float)
print("\nData types after conversion:\n", df_filled.dtypes)

# Filtering Data
young_people = df_filled[df_filled['age'] < 30]
print("\nYoung People (age < 30):\n", young_people)

# Grouping and Aggregation
city_avg_age = df_filled.groupby('city')['age'].mean()
print("\nAverage age per city:\n", city_avg_age)

# Feature Engineering (creating a new column)
df_filled['age_squared'] = df_filled['age'] ** 2
print("\nDataFrame with new feature:\n", df_filled)

# Saving to CSV
df_filled.to_csv('processed_data.csv', index=False)  # Don't include the index in the CSV
```

Expected Output (examples):
```
Original DataFrame:
       name  age      city  score
0     Alice   25  New York   85.0
1       Bob   30    London   92.0
2   Charlie   28     Paris    NaN
3     David   22     Tokyo   78.0

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 name     4 non-null object
 age      4 non-null int64
 city     4 non-null object
 score    3 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 256.0+ bytes
```
... (other outputs as described in the code comments)

5. Advanced Usage
```python
# Pivoting / multi-indexing
data = {'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'product': ['A', 'B', 'A', 'B'],
        'sales': [100, 150, 120, 180]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])  # Convert to datetime

df_pivot = df.pivot_table(index='date', columns='product', values='sales')
print("\nPivot Table:\n", df_pivot)

# Using transform with a lambda: the result stays aligned with the original rows
df['sales_increase'] = df.groupby('product')['sales'].transform(lambda x: x.diff())
print("\nSales increase:\n", df)
```
```python
# Custom Aggregation Functions
def custom_agg(x):
    return x.max() - x.min()

grouped = df.groupby('product')['sales'].agg(['mean', 'median', custom_agg])
print("\nCustom aggregation:\n", grouped)
```
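Beyond passing a list to `.agg`, named aggregation lets you control the output column names directly; a self-contained sketch on the same made-up sales data:

```python
import pandas as pd

df = pd.DataFrame({'product': ['A', 'B', 'A', 'B'],
                   'sales': [100, 150, 120, 180]})

# Each keyword argument becomes an output column: name=(column, aggregation)
summary = df.groupby('product').agg(
    mean_sales=('sales', 'mean'),
    sales_range=('sales', lambda x: x.max() - x.min()),
)
print(summary)
```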
```python
# Performance Tip: Vectorization
# Avoid looping; use Pandas' built-in vectorized operations for speed.
# Example: instead of iterating through rows to apply a function, use:
# df['new_column'] = df['existing_column'] * 2   # much faster than a Python loop
```
```python
# Use pd.Categorical for memory efficiency when dealing with repeating string values
df['product'] = pd.Categorical(df['product'])
```
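A self-contained sketch of the memory savings from categoricals (exact byte counts vary by platform and pandas version, but the direction of the comparison holds):

```python
import pandas as pd

# A column with many repeats of only two distinct string values (made-up data)
s_obj = pd.Series(['A', 'B'] * 50_000)   # object dtype: one Python string per cell
s_cat = s_obj.astype('category')         # small integer codes + 2 stored categories

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)  # the categorical column is far smaller
```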
```python
# Using pd.cut (fixed-edge bins) and pd.qcut (quantile bins) for binning numerical data
df_filled['age_group'] = pd.cut(df_filled['age'], bins=[0, 25, 30, 100],
                                labels=['Young', 'Mid', 'Senior'])
print("\nAge Groups:\n", df_filled['age_group'])
```

6. Tips & Tricks
- **Chaining Methods**: Write concise code by chaining Pandas methods:

  ```python
  df = (pd.read_csv('my_data.csv')
          .dropna()
          .rename(columns={'old_name': 'new_name'}))
  ```

- **`inplace=True` vs. Assignment**: Using `inplace=True` modifies the DataFrame directly, but it can sometimes be less predictable. Assignment (`df = df.dropna()`) is often preferred for clarity.

- **Indexing**: Use `loc` and `iloc` for label-based and integer-based indexing, respectively. When slicing, `loc` is inclusive of the end value; `iloc` is exclusive.
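The inclusive/exclusive difference between `loc` and `iloc` in one small sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(s.loc['a':'c'])   # label-based slice: INCLUDES 'c'
print(s.iloc[0:2])      # position-based slice: EXCLUDES position 2
```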
- **Memory Optimization**: For large datasets:
  - Use appropriate data types (e.g., `int8` instead of `int64` if the range of values is small).
  - Read only the necessary columns using `usecols` in `pd.read_csv`.
  - Use `chunksize` in `pd.read_csv` to process data in smaller chunks.
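A sketch of chunked reading; `io.StringIO` stands in for a large CSV on disk, and the made-up file just holds the numbers 0 through 9:

```python
import io

import pandas as pd

# In-memory CSV standing in for a file too large to load at once
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

# chunksize makes read_csv return an iterator of smaller DataFrames
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()   # aggregate per chunk, then combine
print(total)  # same answer as reading everything at once
```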
7. Integration
- **NumPy**: Pandas builds on NumPy, so you can easily convert between DataFrames/Series and NumPy arrays:

  ```python
  import numpy as np

  data = df['age'].to_numpy()  # Series to NumPy array
  df_from_array = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['A', 'B', 'C'])
  ```
- **Scikit-learn**: Pandas DataFrames are the standard input format for scikit-learn models.

  ```python
  from sklearn.model_selection import train_test_split

  X = df_filled[['age', 'score']]  # Features
  y = df_filled['name']            # Target
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  ```
- **Matplotlib/Seaborn**: Pandas provides convenient plotting methods that integrate well with Matplotlib and Seaborn.

  ```python
  import matplotlib.pyplot as plt
  import seaborn as sns

  df['age'].hist()
  plt.show()

  sns.scatterplot(x='age', y='score', data=df)
  plt.show()
  ```
8. Further Resources
- Official Pandas Documentation: https://pandas.pydata.org/docs/
- 10 Minutes to Pandas: https://pandas.pydata.org/docs/user_guide/10min.html
- Pandas Cookbook: https://pandas.pydata.org/docs/user_guide/cookbook.html
- Real Python Pandas Tutorials: https://realpython.com/pandas-tutorials/
- Kaggle Pandas Tutorials: https://www.kaggle.com/learn/pandas