54_Scikit Learn_For_Machine_Learning

Category: AI & Data Science Tools
Type: AI/ML Tool or Library
Generated on: 2025-08-26 11:08:32
For: Data Science, Machine Learning & Technical Interviews


Scikit-learn Cheat Sheet for Machine Learning

This cheat sheet provides a comprehensive overview of Scikit-learn (sklearn), a powerful Python library for machine learning. It covers installation, core functionalities, practical examples, advanced usage, and integration with other tools.

What it is: Scikit-learn (sklearn) is a free, open-source machine learning library for Python. It provides classification, regression, and clustering algorithms, along with tools for model selection, preprocessing, and evaluation.

Main Use Cases:

  • Classification: Predicting categorical outcomes (e.g., spam detection).
  • Regression: Predicting continuous outcomes (e.g., house price prediction).
  • Clustering: Grouping similar data points (e.g., customer segmentation).
  • Dimensionality Reduction: Reducing the number of features (e.g., PCA).
  • Model Selection: Finding the best model and hyperparameters (e.g., GridSearchCV).
  • Preprocessing: Cleaning and transforming data (e.g., scaling, encoding).

Installation:

pip install scikit-learn

Importing:

import sklearn
print(sklearn.__version__) # Check version

Key Modules:

  • sklearn.model_selection: Model selection and evaluation tools (e.g., train_test_split, GridSearchCV).
  • sklearn.preprocessing: Data preprocessing techniques (e.g., StandardScaler, OneHotEncoder).
  • sklearn.linear_model: Linear models for classification and regression (e.g., LogisticRegression, LinearRegression).
  • sklearn.tree: Decision tree models (e.g., DecisionTreeClassifier, DecisionTreeRegressor).
  • sklearn.ensemble: Ensemble methods (e.g., RandomForestClassifier, GradientBoostingClassifier).
  • sklearn.cluster: Clustering algorithms (e.g., KMeans).
  • sklearn.metrics: Performance metrics (e.g., accuracy_score, mean_squared_error).
  • sklearn.decomposition: Dimensionality reduction techniques (e.g., PCA).
  • sklearn.pipeline: Building pipelines for sequential data processing.
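Most of these modules expose estimators with the same fit/transform interface. As one example, a minimal sketch of dimensionality reduction with sklearn.decomposition.PCA on the built-in Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the 4-feature Iris data
X, _ = load_iris(return_X_y=True)

# Project the 4 original features down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)       # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

For Iris, the first two components retain well over 90% of the variance, which is why PCA is a common first step before visualization or clustering.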

Common API:

  • fit(X, y): Train the model.
  • predict(X): Predict outcomes for new data.
  • transform(X): Apply transformation to data.
  • fit_transform(X): Fit the transformer and transform the data.
  • score(X, y): Evaluate model performance.
  • get_params(): Get model parameters.
  • set_params(): Set model parameters.
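These methods can be seen together on a single estimator; a minimal sketch using LogisticRegression on the built-in Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)              # train on the full data
acc = model.score(X, y)      # mean accuracy on (X, y)
params = model.get_params()  # dict of hyperparameters, e.g. params['C']
model.set_params(C=0.5)      # update a hyperparameter in place

print(round(acc, 2), params["C"])
```

Because every estimator follows this interface, the same pattern works unchanged for trees, ensembles, and clustering models.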

Classification Example:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = LogisticRegression(random_state=42, solver='liblinear') # 'liblinear' uses one-vs-rest for multiclass; the 'multi_class' argument is deprecated in recent scikit-learn versions
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}") # Output: Accuracy: 0.9777777777777777

Regression Example:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}") # Output: Mean Squared Error: 0.07250609964744528

Clustering Example (K-Means):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, random_state=42)
# Train model
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10) # set n_init explicitly; its default changed to 'auto' in recent versions
kmeans.fit(X)
# Predict
y_kmeans = kmeans.predict(X)
# Visualize (requires matplotlib)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.title('K-Means Clustering')
plt.show()

Preprocessing Example (StandardScaler):

from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])
# Scale data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(f"Original Data:\n{data}")
print(f"Scaled Data:\n{scaled_data}")
#Expected Output (approximately):
#Original Data:
#[[1 2]
# [3 4]
# [5 6]]
#Scaled Data:
#[[-1.22474487 -1.22474487]
# [ 0.          0.        ]
# [ 1.22474487  1.22474487]]

Pipeline Example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42, solver='liblinear')) # 'multi_class' is deprecated in recent scikit-learn versions
])
# Train model
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate (from previous example)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}") # Output: Accuracy: 0.9777777777777777

Hyperparameter Tuning Example (GridSearchCV):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define parameter grid
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
# Create GridSearchCV object
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
# Fit the grid with training data
grid.fit(X_train, y_train)
# Print best parameters
print(f"Best Parameters: {grid.best_params_}")
# Use best estimator for prediction
y_pred = grid.predict(X_test)
# Evaluate (from previous example)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}") #Output will vary depending on the best parameters found.

Custom Transformer Example:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        self.feature_name = feature_name  # stored for reference; not used in transform below

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        # Append a new feature: the first column doubled
        new_feature = X[:, 0] * 2
        return np.concatenate((X, new_feature.reshape(-1, 1)), axis=1)
# Example Usage
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Sample Data
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([7, 8, 9])
# Create pipeline
pipeline = Pipeline([
    ('custom_transformer', CustomTransformer(feature_name='X1')),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
# Fit the pipeline
pipeline.fit(X, y)
# Predict on new data
new_data = np.array([[2, 3]])
predictions = pipeline.predict(new_data)
print(predictions) # Output will vary based on the training.

Tips & Best Practices:

  • Data Scaling: Use StandardScaler or MinMaxScaler for algorithms sensitive to feature scaling (e.g., SVM, k-NN).
  • Cross-Validation: Use cross_val_score or cross_validate for robust model evaluation.
  • Feature Selection: Use SelectKBest or RFE to select relevant features.
  • Imbalanced Data: Use SMOTE or ADASYN (from the separate imbalanced-learn package) to handle imbalanced datasets.
  • Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to find optimal hyperparameters.
  • Pipelines: Use pipelines to streamline preprocessing and modeling steps.
  • Random State: Set random_state for reproducibility.
  • n_init for KMeans: Set the n_init parameter explicitly (10 is a common choice) when using KMeans. It controls how many times the k-means algorithm runs with different centroid seeds, and its default changed to 'auto' in recent versions.
  • Solver selection: Consider the dataset size and characteristics when choosing a solver for LogisticRegression (e.g., liblinear for small datasets, lbfgs for larger datasets).
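The cross-validation tip above is a one-liner with cross_val_score; a minimal sketch on the built-in Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for an untuned model
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```

This gives a more robust performance estimate than a single train/test split, since every sample is used for both training and validation across the folds.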

Integration with Other Tools:

  • Pandas: Use Pandas DataFrames as input for fit and predict.
  • NumPy: Use NumPy arrays for numerical computations.
  • Matplotlib/Seaborn: Use Matplotlib/Seaborn for data visualization and model evaluation.
  • Joblib: Use joblib for model persistence (saving and loading).

Pandas & Joblib Example:

import pandas as pd
from sklearn.linear_model import LogisticRegression
import joblib
# Create a Pandas DataFrame
data = {'feature1': [1, 2, 3, 4, 5], 'feature2': [6, 7, 8, 9, 10], 'target': [0, 0, 1, 1, 0]}
df = pd.DataFrame(data)
# Prepare data for sklearn
X = df[['feature1', 'feature2']].values
y = df['target'].values
# Train a model
model = LogisticRegression()
model.fit(X, y)
# Save the model
joblib.dump(model, 'my_model.pkl')
# Load the model
loaded_model = joblib.load('my_model.pkl')
# Predict using the loaded model
new_data = [[2, 7], [4, 9]]
predictions = loaded_model.predict(new_data)
print(predictions) # Example output: [0 1]