oraclesai.clustering.DBScanClustering
- class DBScanClustering(eps=None, min_samples=2, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None, spatial_weights_definition=None, use_spatial_weights_distances=True)
DBSCAN is a density-based clustering technique capable of finding clusters of different shapes and sizes in large datasets. The algorithm doesn’t require the number of clusters as a parameter. It starts at an arbitrary point; if at least min_samples points are within a radius of eps, all the points in that neighborhood are considered part of the same cluster. The process is then repeated for every point in the neighborhood. There are three types of points:
- Core Point
It has at least min_samples points in its neighborhood within the radius eps.
- Border Point
It is reachable from a core point, but there are fewer than min_samples points within its neighborhood.
- Noise Point
It is neither a core point nor a border point; it is not reachable from any core point.
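The three point types can be illustrated with a small sketch. It uses scikit-learn's DBSCAN as a stand-in rather than the DBScanClustering class itself, so it shows the plain algorithm without spatial weights. The helper name classify_points and the toy coordinates are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def classify_points(X, eps, min_samples):
    """Return cluster labels and a 'core' / 'border' / 'noise' tag per row of X."""
    model = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    # Start by assuming every point is a border point, then overwrite.
    kinds = np.full(len(X), "border", dtype=object)
    # scikit-learn marks noise points with the label -1.
    kinds[model.labels_ == -1] = "noise"
    # Core points are listed explicitly by the fitted model.
    kinds[model.core_sample_indices_] = "core"
    return model.labels_, kinds


# Toy data (hypothetical): a dense group, one point on its edge, one isolated point.
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],  # dense group
    [0.35, 0.1],                                     # close to the group's edge
    [5.0, 5.0],                                      # far from everything
])

# Note: scikit-learn counts the point itself toward min_samples.
labels, kinds = classify_points(X, eps=0.3, min_samples=4)
for point, label, kind in zip(X, labels, kinds):
    print(point, label, kind)
```

The edge point has only two other points within eps, so it cannot be a core point, but it lies within eps of a core point and therefore joins that cluster as a border point. The isolated point is not reachable from any core point and is labeled noise.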
Regionalization is used to provide a spatial context to the DBSCAN algorithm. This way, observations of the same cluster are similar not only in their attributes, but also in their spatial location.
- Parameters:
eps – float, default=None. The maximum distance between two samples for one to be considered as in the neighborhood of the other. If it is None, the K-Distance method is used to estimate the best value for eps.
min_samples – int, default=None. The number of samples in a neighborhood for a point to be considered as a core point. If it is None, it is estimated using the number of features in the data.
metric – str or callable, default=’euclidean’. The metric used to calculate the distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances. If metric is ‘precomputed’, X is assumed to be the distance matrix and must be square.
metric_params – dict, default=None. Additional arguments for the metric function.
algorithm – {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’. The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors.
leaf_size – int, default=30. Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p – float, default=None. The power of the Minkowski metric to be used to calculate distance between points. If None, then p=2 (equivalent to the Euclidean distance).
n_jobs – int, default=None. The number of parallel jobs to run.
spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification. Defines the criteria used to identify neighbors, for example, KNNWeightsDefinition, DistanceBandWeightsDefinition, etc.
use_spatial_weights_distances – bool, default=True. If True, the spatial weights matrix is used as the distance. If False, the distance to all neighbors is set to zero.
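When eps is None, the class estimates it with the K-Distance method. The exact procedure used internally is not documented here, so the following is only a minimal sketch of one common form of that idea: sort every point's distance to its k-th nearest neighbor and pick the value at the "knee" of the resulting curve. The function names k_distance_curve and estimate_eps, the knee heuristic, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distance_curve(X, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    # Ask for k + 1 neighbors: each query point is returned as its own
    # nearest neighbor at distance 0 when it is part of the fitted data.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


def estimate_eps(X, k):
    """Pick eps at the knee of the k-distance curve (simple heuristic)."""
    curve = k_distance_curve(X, k)
    span = curve[-1] - curve[0]
    if span == 0:
        return float(curve[0])  # flat curve: every k-distance is the same
    # Rescale both axes to [0, 1] and take the point that lies furthest
    # below the diagonal joining the first and last points of the curve.
    xs = np.linspace(0.0, 1.0, len(curve))
    ys = (curve - curve[0]) / span
    return float(curve[np.argmax(xs - ys)])


# Synthetic data (hypothetical): two tight blobs plus a few scattered points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.1, size=(50, 2)),
    rng.uniform(low=-2.0, high=5.0, size=(5, 2)),
])

curve = k_distance_curve(X, k=4)
eps = estimate_eps(X, k=4)
print(f"estimated eps: {eps:.3f}")
```

Points inside dense regions have small k-distances, while scattered points have large ones, so the knee separates the two regimes. The estimated value can then be passed as eps, or compared against the value the class selects on its own.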
Methods
__init__([eps, min_samples, metric, ...])
fit(X[, y, geometries, spatial_weights, crs]) – Fits a DBSCAN model with the given data and parameters. If spatial weights were defined, Regionalization is executed, causing elements of the same cluster to be geographically connected.
fit_predict(X[, y, geometries, ...]) – Trains the clustering model and returns the labels assigned to each observation.
get_params([deep]) – Get parameters for this estimator.
set_params(**params) – Set the parameters of this estimator.
Attributes
METRIC_PRECOMPUTED
NON_NEIGHBOR_DISTANCE
Maximum distance between two samples for one to be considered a neighbor of the other.
The isoperimetric quotient (IPQ) for the resulting clusters.
Array indicating the cluster associated with each sample.
The number of samples in a neighborhood for a point to be considered as a core point.
The Silhouette score for the resulting clusters.