54_Scikit Learn_For_Machine_Learning

Category: AI & Data Science Tools
Type: AI/ML Tool or Library
Generated on: 2025-08-26 11:08:32
For: Data Science, Machine Learning & Technical Interviews


Scikit-learn Cheat Sheet for Machine Learning

This cheat sheet provides a comprehensive overview of Scikit-learn (sklearn), a powerful Python library for machine learning. It covers installation, core functionalities, practical examples, advanced usage, and integration with other tools.

What it is: Scikit-learn (sklearn) is a free, open-source machine learning library for Python. It provides classification, regression, and clustering algorithms, along with tools for model selection, preprocessing, and evaluation.

Main Use Cases:

  • Classification: Predicting categorical outcomes (e.g., spam detection).
  • Regression: Predicting continuous outcomes (e.g., house price prediction).
  • Clustering: Grouping similar data points (e.g., customer segmentation).
  • Dimensionality Reduction: Reducing the number of features (e.g., PCA).
  • Model Selection: Finding the best model and hyperparameters (e.g., GridSearchCV).
  • Preprocessing: Cleaning and transforming data (e.g., scaling, encoding).

Installation:

pip install scikit-learn

Importing:

import sklearn
print(sklearn.__version__) # Check version

Key Modules:

  • sklearn.model_selection: Model selection and evaluation tools (e.g., train_test_split, GridSearchCV).
  • sklearn.preprocessing: Data preprocessing techniques (e.g., StandardScaler, OneHotEncoder).
  • sklearn.linear_model: Linear models for classification and regression (e.g., LogisticRegression, LinearRegression).
  • sklearn.tree: Decision tree models (e.g., DecisionTreeClassifier, DecisionTreeRegressor).
  • sklearn.ensemble: Ensemble methods (e.g., RandomForestClassifier, GradientBoostingClassifier).
  • sklearn.cluster: Clustering algorithms (e.g., KMeans).
  • sklearn.metrics: Performance metrics (e.g., accuracy_score, mean_squared_error).
  • sklearn.decomposition: Dimensionality reduction techniques (e.g., PCA).
  • sklearn.pipeline: Building pipelines for sequential data processing.
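Most of these modules expose estimators with the same fit/transform interface. As one example, a minimal sketch of dimensionality reduction with sklearn.decomposition.PCA on the built-in Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the 4-feature Iris data
X, _ = load_iris(return_X_y=True)

# Project the 4 original features down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)       # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

For Iris, the first two components retain well over 90% of the variance, which is why PCA is a common first step before visualization or clustering.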

Common API:

  • fit(X, y): Train the model.
  • predict(X): Predict outcomes for new data.
  • transform(X): Apply transformation to data.
  • fit_transform(X): Fit the transformer and transform the data.
  • score(X, y): Evaluate model performance.
  • get_params(): Get model parameters.
  • set_params(): Set model parameters.
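These methods can be seen together on a single estimator; a minimal sketch using LogisticRegression on the built-in Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)              # train on the full data
acc = model.score(X, y)      # mean accuracy on (X, y)
params = model.get_params()  # dict of hyperparameters, e.g. params['C']
model.set_params(C=0.5)      # update a hyperparameter in place

print(round(acc, 2), params["C"])
```

Because every estimator follows this interface, the same pattern works unchanged for trees, ensembles, and clustering models.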

Classification Example:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = LogisticRegression(random_state=42, solver='liblinear') # 'liblinear' uses one-vs-rest for multiclass; the 'multi_class' argument is deprecated in recent scikit-learn versions
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}") # Output: Accuracy: 0.9777777777777777

Regression Example:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}") # Output: Mean Squared Error: 0.07250609964744528

Clustering Example (K-Means):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, random_state=42)
# Train model
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10) # set n_init explicitly; its default changed to 'auto' in recent versions
kmeans.fit(X)
# Predict
y_kmeans = kmeans.predict(X)
# Visualize (requires matplotlib)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.title('K-Means Clustering')
plt.show()

Preprocessing Example (StandardScaler):

from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])
# Scale data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(f"Original Data:\n{data}")
print(f"Scaled Data:\n{scaled_data}")
#Expected Output (approximately):
#Original Data:
#[[1 2]
# [3 4]
# [5 6]]
#Scaled Data:
#[[-1.22474487 -1.22474487]
# [ 0.          0.        ]
# [ 1.22474487  1.22474487]]

Pipeline Example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42, solver='liblinear')) # 'multi_class' is deprecated in recent scikit-learn versions
])
# Train model
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate (from previous example)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}") # Output: Accuracy: 0.9777777777777777

Hyperparameter Tuning Example (GridSearchCV):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define parameter grid
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
# Create GridSearchCV object
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
# Fit the grid with training data
grid.fit(X_train, y_train)
# Print best parameters
print(f"Best Parameters: {grid.best_params_}")
# Use best estimator for prediction
y_pred = grid.predict(X_test)
# Evaluate (from previous example)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}") #Output will vary depending on the best parameters found.

Custom Transformer Example:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        self.feature_name = feature_name  # stored for reference; not used in transform below

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        # Append a new feature: the first column doubled
        new_feature = X[:, 0] * 2
        return np.concatenate((X, new_feature.reshape(-1, 1)), axis=1)
# Example Usage
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Sample Data
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([7, 8, 9])
# Create pipeline
pipeline = Pipeline([
    ('custom_transformer', CustomTransformer(feature_name='X1')),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
# Fit the pipeline
pipeline.fit(X, y)
# Predict on new data
new_data = np.array([[2, 3]])
predictions = pipeline.predict(new_data)
print(predictions) # Output will vary based on the training.

Tips & Best Practices:

  • Data Scaling: Use StandardScaler or MinMaxScaler for algorithms sensitive to feature scaling (e.g., SVM, k-NN).
  • Cross-Validation: Use cross_val_score or cross_validate for robust model evaluation.
  • Feature Selection: Use SelectKBest or RFE to select relevant features.
  • Imbalanced Data: Use SMOTE or ADASYN (from the separate imbalanced-learn package) to handle imbalanced datasets.
  • Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to find optimal hyperparameters.
  • Pipelines: Use pipelines to streamline preprocessing and modeling steps.
  • Random State: Set random_state for reproducibility.
  • n_init for KMeans: Set the n_init parameter explicitly (10 is a common choice) when using KMeans. It controls how many times the k-means algorithm runs with different centroid seeds, and its default changed to 'auto' in recent versions.
  • Solver selection: Consider the dataset size and characteristics when choosing a solver for LogisticRegression (e.g., liblinear for small datasets, lbfgs for larger datasets).
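The cross-validation tip above is a one-liner with cross_val_score; a minimal sketch on the built-in Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for an untuned model
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```

This gives a more robust performance estimate than a single train/test split, since every sample is used for both training and validation across the folds.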

Integration with Other Tools:

  • Pandas: Use Pandas DataFrames as input for fit and predict.
  • NumPy: Use NumPy arrays for numerical computations.
  • Matplotlib/Seaborn: Use Matplotlib/Seaborn for data visualization and model evaluation.
  • Joblib: Use joblib for model persistence (saving and loading).

Pandas & Joblib Example:

import pandas as pd
from sklearn.linear_model import LogisticRegression
import joblib
# Create a Pandas DataFrame
data = {'feature1': [1, 2, 3, 4, 5], 'feature2': [6, 7, 8, 9, 10], 'target': [0, 0, 1, 1, 0]}
df = pd.DataFrame(data)
# Prepare data for sklearn
X = df[['feature1', 'feature2']].values
y = df['target'].values
# Train a model
model = LogisticRegression()
model.fit(X, y)
# Save the model
joblib.dump(model, 'my_model.pkl')
# Load the model
loaded_model = joblib.load('my_model.pkl')
# Predict using the loaded model
new_data = [[2, 7], [4, 9]]
predictions = loaded_model.predict(new_data)
print(predictions) # Example output: [0 1]