1. Atharv Sathe | 22BCE10388
2. Ayush Pandey | 22BCE10375
3. Harsh Markandey | 22BCE10861
4. Meet Bhikani | 22BCE11618
5. Manvendra Pratap Singh | 22BCE11382
Fundamentals of Data Science
July 2025
Density-Based Spatial Clustering of Applications with Noise
Core points: points with at least min_samples neighbors within distance ε
Border points: non-core points within ε distance of a core point
Noise points: points that are neither core nor border points
Understanding ε and min_samples
ε (eps): defines the neighborhood radius. Example value: 1.0
min_samples: minimum number of points required in a neighborhood. Example value: 3
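The three point types can be inspected directly with scikit-learn, which exposes core points via `core_sample_indices_` after fitting. A minimal sketch using the slide's example values (eps = 1.0, min_samples = 3) on hypothetical toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: a dense run of points plus one isolated point
X = np.array([[0.0], [0.5], [1.0], [1.5], [2.3], [10.0]])

db = DBSCAN(eps=1.0, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # >= 3 points (incl. itself) within eps
noise_mask = db.labels_ == -1              # not core, not reachable from a core point
border_mask = ~core_mask & ~noise_mask     # in a cluster, but not core

print("core:", np.flatnonzero(core_mask))      # [0 1 2 3]
print("border:", np.flatnonzero(border_mask))  # [4]
print("noise:", np.flatnonzero(noise_mask))    # [5]
```

Note that scikit-learn counts the point itself toward min_samples, so point 2.3 (only one other neighbor within ε) is border rather than core.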
Darker regions indicate better clustering quality (higher silhouette scores)
OPTICS and HDBSCAN
from hdbscan import HDBSCAN
clusterer = HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
DBSCAN vs OPTICS vs HDBSCAN
Clustering quality comparison across different datasets
| Algorithm | Strengths | Weaknesses |
|---|---|---|
| DBSCAN | Simple, fast, handles noise well | Sensitive to parameters, struggles with varying densities |
| OPTICS | Handles varying densities, produces reachability plot | Requires additional clustering step, more complex |
| HDBSCAN | Robust to parameters, hierarchical structure | More computationally expensive, complex interpretation |
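To illustrate the OPTICS row of the table, here is a hedged sketch using scikit-learn's `OPTICS` on assumed toy blobs of different densities; the `min_samples` and `xi` values are illustrative, not tuned:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumed toy data: two well-separated blobs with different densities
tight, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.3,
                      random_state=0)
loose, _ = make_blobs(n_samples=200, centers=[[6, 6]], cluster_std=1.0,
                      random_state=1)
X = np.vstack([tight, loose])

# OPTICS orders points by reachability distance instead of fixing a single eps,
# which is what lets it separate clusters of different densities
optics = OPTICS(min_samples=10, xi=0.05).fit(X)
labels = optics.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
# optics.reachability_[optics.ordering_] gives the values for a reachability plot
```

A single eps would struggle here: one small enough to resolve the tight blob can fragment the loose one, which is exactly the weakness listed for DBSCAN above.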
Determining if clusters exist in data
from sklearn.neighbors import NearestNeighbors
import numpy as np

def hopkins_statistic(X, sample_size=None):
    if sample_size is None:
        sample_size = int(0.1 * len(X))
    # u_i: distance from each uniformly random point to its nearest real point
    random_points = np.random.uniform(X.min(axis=0), X.max(axis=0),
                                      (sample_size, X.shape[1]))
    nbrs = NearestNeighbors(n_neighbors=2).fit(X)
    u_distances = nbrs.kneighbors(random_points, n_neighbors=1)[0]
    # w_i: distance from each sampled real point to its nearest *other* real point
    sample_idx = np.random.choice(len(X), sample_size, replace=False)
    w_distances = nbrs.kneighbors(X[sample_idx], n_neighbors=2)[0][:, 1]
    # H near 0.5 -> roughly uniform data; H near 1 -> strong clustering tendency
    return np.sum(u_distances) / (np.sum(u_distances) + np.sum(w_distances))
Evaluating clustering performance
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Where a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance from i to the points in the nearest neighboring cluster
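The per-point formula does not need to be hand-coded: scikit-learn's `silhouette_samples` computes s(i) for every point, and `silhouette_score` is simply their mean. A quick sketch on assumed synthetic blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Assumed synthetic data with known cluster labels
X, labels = make_blobs(n_samples=300, centers=3, cluster_std=0.6,
                       random_state=42)

s = silhouette_samples(X, labels)  # s(i) for every point, each in [-1, 1]
print("min s(i):", round(float(s.min()), 3))
print("mean s(i) equals silhouette_score:",
      np.isclose(s.mean(), silhouette_score(X, labels)))  # True
```

When scoring DBSCAN output, drop the noise points (label -1) first; otherwise they are treated as one extra cluster and distort the score.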
Multi-dimensional view of DBSCAN performance across various metrics
This presentation explores the practical applications of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. We will delve into real-world case studies showcasing its effectiveness in various domains and provide hands-on code examples with visualizations to solidify your understanding.
DBSCAN's ability to find arbitrarily shaped clusters and identify noise makes it a valuable tool in many fields:
Scenario: Identifying unusual data points in scientific datasets.
Application: Wine quality analysis, temperature anomaly detection.
Scenario: Understanding customer base for targeted marketing.
Application: Organic customer segments without predefined groups.
Scenario: Identifying fraudulent financial activities.
Application: Isolating fraudulent transactions as noise points.
DBSCAN can be easily implemented using Python's scikit-learn library. Here are examples with visualizations to illustrate the core concepts.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Generate sample data
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)
# Visualize the results
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title("DBSCAN Clustering of Moons Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend(handles=scatter.legend_elements()[0],
           labels=[f'Cluster {c}' if c != -1 else 'Noise'
                   for c in sorted(set(clusters))])
plt.savefig('dbscan_moons_clustering.png')
plt.show()
Result: DBSCAN successfully identifies two non-linear clusters in the moons dataset, demonstrating its ability to find arbitrarily shaped clusters.
This example shows how DBSCAN can be used to identify outliers in a dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
# Generate sample data with outliers
centers = [[1, 1], [-1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers,
cluster_std=0.4, random_state=0)
outliers = np.random.uniform(low=-3, high=3, size=(50, 2))
X = np.vstack([X, outliers])
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.3, min_samples=10)
clusters = dbscan.fit_predict(X)
# Visualize the results
plt.figure(figsize=(10, 6))
# Plot non-outlier points
plt.scatter(X[clusters != -1, 0], X[clusters != -1, 1],
c=clusters[clusters != -1], cmap='viridis', label='Clusters')
# Plot outlier points
plt.scatter(X[clusters == -1, 0], X[clusters == -1, 1],
c='red', marker='x', label='Outliers')
plt.title("DBSCAN Anomaly Detection")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.savefig('dbscan_anomaly_detection.png')
plt.show()
# Print results
n_clusters_ = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise_ = list(clusters).count(-1)
print(f'Estimated number of clusters: {n_clusters_}')
print(f'Estimated number of noise points: {n_noise_}')
Result: Dense clusters represent normal data patterns while outliers (red X's) are flagged as noise points, useful for fraud detection and quality control.
Wine quality analysis, temperature anomaly detection
Customer segmentation, market analysis
Fraud detection, transaction monitoring
Outlier identification, data validation
DBSCAN in the Clustering Landscape
Discovers Hidden Patterns: Identifies arbitrarily shaped clusters without predefined cluster count
Robust to Noise: Automatically identifies and isolates outliers as noise points
Computational Efficiency: average-case O(n log n) with a spatial index (O(n²) worst case), which keeps it scalable in practice
Parameter Tuning: Epsilon (ε) and min_samples require careful selection using domain knowledge
Data Preparation: Feature scaling and dimensionality considerations significantly impact performance
Validation Strategy: Multiple metrics (Silhouette, Davies-Bouldin) provide comprehensive evaluation
Best for uniform density, interpretable parameters, real-time applications
Superior for varying densities, automatic parameter selection, complex datasets
Optimal for exploratory analysis, hierarchical structures, density visualization
Silhouette Score (0.3-0.7 indicates good clustering), Davies-Bouldin Index (<1.0 preferred)
Hopkins Statistic (H > 0.75 suggests strong clustering tendency)
Essential for parameter tuning and result interpretation
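Both summary metrics are available in `sklearn.metrics`; the one caveat for DBSCAN is that the noise label (-1) must be removed before scoring, or it is evaluated as if it were a cluster of its own. A hedged sketch on assumed toy data (eps and min_samples are illustrative, not tuned):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Assumed toy data with well-separated centers
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.5, random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

mask = labels != -1  # exclude noise points before computing either metric
print("Silhouette:", round(silhouette_score(X[mask], labels[mask]), 3))
print("Davies-Bouldin:", round(davies_bouldin_score(X[mask], labels[mask]), 3))
```

On cleanly separated blobs like these, expect a silhouette well above 0.5 and a Davies-Bouldin index well below 1.0, matching the thresholds quoted above.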
A: Use the k-distance graph method:
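A sketch of that method (data and k assumed for illustration): sort every point's distance to its k-th nearest neighbor, plot the sorted curve, and read eps off the "elbow" where the curve bends sharply upward.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Assumed toy data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

k = 5  # set k equal to your intended min_samples
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nbrs.kneighbors(X)

# Column 0 is the point itself (distance 0), so the last column is the
# distance to the k-th neighbor counting the point itself, matching
# scikit-learn's min_samples convention
k_distances = np.sort(distances[:, -1])

plt.plot(k_distances)
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.title("k-distance graph for choosing eps")
plt.savefig('k_distance_graph.png')
```

Points inside clusters produce the flat left portion of the curve; the sharp rise on the right corresponds to outliers, and the distance at the bend is a reasonable eps candidate.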
A: In DBSCAN context, they're synonymous:
A: Consider HDBSCAN when:
A: High dimensions pose challenges:
✓ Assess clustering tendency
✓ Preprocess and scale data
✓ Understand domain context
✓ Use systematic parameter selection
✓ Apply multiple validation metrics
✓ Visualize intermediate results
✓ Validate with domain experts
✓ Test stability across parameters
✓ Document assumptions and limitations
DBSCAN is not just a clustering algorithm—it's a powerful tool for understanding data structure, identifying patterns, and discovering anomalies. Its success depends on thoughtful parameter selection, proper data preparation, and comprehensive validation.
When applied correctly, DBSCAN reveals hidden insights that traditional clustering methods might miss.
Questions & Discussion
Group Members: Atharv Sathe • Ayush Pandey • Harsh Markandey • Meet Bhikani • Manvendra Pratap Singh