Scikit-learn for Machine Learning
Category: AI & Data Science Tools
Type: AI/ML Tool or Library
Generated on: 2025-08-26 11:08:32
For: Data Science, Machine Learning & Technical Interviews
Scikit-learn Cheat Sheet for Machine Learning
This cheat sheet provides a comprehensive overview of Scikit-learn (sklearn), a powerful Python library for machine learning. It covers installation, core functionalities, practical examples, advanced usage, and integration with other tools.
1. Tool/Library Overview
What it is: Scikit-learn (sklearn) is a free software machine learning library for Python. It features various classification, regression, and clustering algorithms, plus tools for model selection, preprocessing, and evaluation.
Main Use Cases:
- Classification: Predicting categorical outcomes (e.g., spam detection).
- Regression: Predicting continuous outcomes (e.g., house price prediction).
- Clustering: Grouping similar data points (e.g., customer segmentation).
- Dimensionality Reduction: Reducing the number of features (e.g., PCA).
- Model Selection: Finding the best model and hyperparameters (e.g., GridSearchCV).
- Preprocessing: Cleaning and transforming data (e.g., scaling, encoding).
2. Installation & Setup
Installation:

```bash
pip install scikit-learn
```

Importing:

```python
import sklearn
print(sklearn.__version__)  # Check version
```

3. Core Features & API
Key Modules:
- `sklearn.model_selection`: Model selection and evaluation tools (e.g., `train_test_split`, `GridSearchCV`).
- `sklearn.preprocessing`: Data preprocessing techniques (e.g., `StandardScaler`, `OneHotEncoder`).
- `sklearn.linear_model`: Linear models for classification and regression (e.g., `LogisticRegression`, `LinearRegression`).
- `sklearn.tree`: Decision tree models (e.g., `DecisionTreeClassifier`, `DecisionTreeRegressor`).
- `sklearn.ensemble`: Ensemble methods (e.g., `RandomForestClassifier`, `GradientBoostingClassifier`).
- `sklearn.cluster`: Clustering algorithms (e.g., `KMeans`).
- `sklearn.metrics`: Performance metrics (e.g., `accuracy_score`, `mean_squared_error`).
- `sklearn.decomposition`: Dimensionality reduction techniques (e.g., `PCA`).
- `sklearn.pipeline`: Building pipelines for sequential data processing.
Common API:
- `fit(X, y)`: Train the model.
- `predict(X)`: Predict outcomes for new data.
- `transform(X)`: Apply a transformation to data.
- `fit_transform(X)`: Fit the transformer and transform the data in one step.
- `score(X, y)`: Evaluate model performance.
- `get_params()`: Get model parameters.
- `set_params()`: Set model parameters.
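A minimal sketch of this shared API in action, using `PCA` as the transformer and `LogisticRegression` as the predictor (both from the modules listed above; `max_iter=1000` is an arbitrary choice here to ensure convergence):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Transformer: fit_transform combines fit and transform
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (150, 2)

# Predictor: fit, then predict and score
clf = LogisticRegression(max_iter=1000)
clf.fit(X_2d, y)
preds = clf.predict(X_2d[:5])
print(clf.score(X_2d, y))  # mean accuracy on the given data

# get_params works on any estimator
print(clf.get_params()['max_iter'])  # 1000
```

Because every estimator follows this interface, any of them can be swapped into the same training loop or pipeline.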
4. Practical Examples
4.1 Classification (Logistic Regression)
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model (liblinear handles multiclass one-vs-rest by default;
# the explicit multi_class parameter is deprecated in recent scikit-learn)
model = LogisticRegression(random_state=42, solver='liblinear')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")  # Output: Accuracy: 0.9777777777777777
```

4.2 Regression (Linear Regression)
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")  # Output: Mean Squared Error: 0.07250609964744528
```

4.3 Clustering (K-Means)
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Train model (n_init is important to set explicitly in recent versions)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans.fit(X)

# Predict
y_kmeans = kmeans.predict(X)

# Visualize (requires matplotlib)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.title('K-Means Clustering')
plt.show()
```

4.4 Preprocessing (StandardScaler)
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])

# Scale data to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(f"Original Data:\n{data}")
print(f"Scaled Data:\n{scaled_data}")
# Expected Output (approximately):
# Original Data:
# [[1 2]
#  [3 4]
#  [5 6]]
# Scaled Data:
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]
```

5. Advanced Usage
5.1 Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipeline: the scaler is fit only on the training data
# during pipeline.fit, which avoids data leakage
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42, solver='liblinear'))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")  # Output: Accuracy: 0.9777777777777777
```

5.2 GridSearchCV
```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

# Create GridSearchCV object (refit=True retrains on the full
# training set with the best parameters found)
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)

# Fit the grid with training data
grid.fit(X_train, y_train)

# Print best parameters
print(f"Best Parameters: {grid.best_params_}")

# Use best estimator for prediction
y_pred = grid.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")  # Output varies with the best parameters found
```

5.3 Custom Transformers
```python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        # feature_name is stored for illustration only; the transform
        # below operates on the first column regardless
        self.feature_name = feature_name

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer is chainable
        return self

    def transform(self, X):
        # Create a new feature: double the first column
        new_feature = X[:, 0] * 2
        return np.concatenate((X, new_feature.reshape(-1, 1)), axis=1)

# Example usage
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([7, 8, 9])

# Create pipeline
pipeline = Pipeline([
    ('custom_transformer', CustomTransformer(feature_name='X1')),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipeline.fit(X, y)

# Predict on new data
new_data = np.array([[2, 3]])
predictions = pipeline.predict(new_data)
print(predictions)  # Output varies with the training data
```

6. Tips & Tricks
- Data Scaling: Use `StandardScaler` or `MinMaxScaler` for algorithms sensitive to feature scaling (e.g., SVM, k-NN).
- Cross-Validation: Use `cross_val_score` or `cross_validate` for robust model evaluation.
- Feature Selection: Use `SelectKBest` or `RFE` to select relevant features.
- Imbalanced Data: Use `SMOTE` or `ADASYN` (from the imbalanced-learn package) to handle imbalanced datasets.
- Hyperparameter Tuning: Use `GridSearchCV` or `RandomizedSearchCV` to find optimal hyperparameters.
- Pipelines: Use pipelines to streamline preprocessing and modeling steps.
- Random State: Set `random_state` for reproducibility.
- `n_init` for KMeans: Set the `n_init` parameter explicitly (10 is a common choice) when using `KMeans`; it controls how many times the algorithm runs with different centroid seeds.
- Solver Selection: Consider dataset size and characteristics when choosing a solver for `LogisticRegression` (e.g., `liblinear` for small datasets, `lbfgs` for larger ones).
7. Integration
- Pandas: Use Pandas DataFrames as input for `fit` and `predict`.
- NumPy: Use NumPy arrays for numerical computations.
- Matplotlib/Seaborn: Use Matplotlib/Seaborn for data visualization and model evaluation.
- Joblib: Use `joblib` for model persistence (saving and loading).

The example below ties these together: a Pandas DataFrame feeds a LogisticRegression model, and joblib saves and reloads it.
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
import joblib

# Create a Pandas DataFrame
data = {'feature1': [1, 2, 3, 4, 5],
        'feature2': [6, 7, 8, 9, 10],
        'target': [0, 0, 1, 1, 0]}
df = pd.DataFrame(data)

# Prepare data for sklearn
X = df[['feature1', 'feature2']].values
y = df['target'].values

# Train a model
model = LogisticRegression()
model.fit(X, y)

# Save the model
joblib.dump(model, 'my_model.pkl')

# Load the model
loaded_model = joblib.load('my_model.pkl')

# Predict using the loaded model
new_data = [[2, 7], [4, 9]]
predictions = loaded_model.predict(new_data)
print(predictions)  # Example output: [0 1]
```

8. Further Resources
- Official Documentation: https://scikit-learn.org/stable/
- Tutorials: https://scikit-learn.org/stable/tutorial/index.html
- Examples: https://scikit-learn.org/stable/auto_examples/index.html
- User Guide: https://scikit-learn.org/stable/user_guide.html
- Stack Overflow: Search for specific problems and solutions.
- Kaggle: Explore datasets and kernels for practical examples.