About Feature Extraction
Feature extraction supports transforming original attributes into linear combinations to reduce dimensionality and enhance model quality.
Feature extraction is a dimensionality reduction technique. Unlike feature selection, which selects and retains the most significant attributes, feature extraction actually transforms the attributes. In Oracle Machine Learning, the transformed attributes, or features, are linear combinations of the original attributes.
The feature extraction technique results in a much smaller and richer set of attributes. The maximum number of features can be user-specified or determined by the algorithm. By default, the algorithm determines it.
Models built on extracted features can be of higher quality, because the features concentrate the signal that is spread across many weaker attributes into a smaller number of attributes that describe the data. However, interpreting the resulting features and models becomes more challenging.
Each attribute in a data set can be thought of as a dimension, so feature extraction projects a data set of higher dimensionality onto a smaller number of dimensions. This makes it useful for data visualization, since a complex data set can be effectively visualized when it is reduced to two or three dimensions.
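As an illustrative sketch of that projection (plain numpy on made-up data, not an OML API), a truncated SVD reduces a ten-attribute data set to two extracted features, each a linear combination of the original attributes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 rows, 10 original attributes

# Center the data, then take a truncated SVD: each extracted feature
# is a linear combination of the original 10 attributes.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                   # keep two features for visualization
features = Xc @ Vt[:k].T                # project onto the top-2 feature space

print(features.shape)                   # (100, 2)
```

The 100 rows can now be plotted on two axes while retaining as much variance as any two-dimensional linear projection can.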
Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of machine learning algorithms.
Feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of key words and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.
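This theme-extraction idea can be sketched with a toy latent semantic analysis (hand-built keyword frequencies; the data and variable names are illustrative only, not OML output):

```python
import numpy as np

terms = ["goal", "match", "league", "stock", "market", "shares"]
# Rows = documents, columns = keyword frequencies (toy data).
docs = np.array([
    [3, 2, 1, 0, 0, 0],   # sports article
    [2, 3, 2, 0, 0, 0],   # sports article
    [0, 0, 0, 4, 2, 3],   # finance article
    [0, 1, 0, 3, 3, 2],   # finance article
], dtype=float)

# Truncated SVD: each theme (feature) is a weighted combination of keywords,
# and each document is expressed in terms of the discovered themes.
U, s, Vt = np.linalg.svd(docs, full_matrices=False)
k = 2
doc_themes = U[:, :k] * s[:k]          # documents in theme space

# The dominant theme separates the sports from the finance documents.
labels = np.argmax(np.abs(doc_themes), axis=1)
print(labels)
```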
Feature Extraction and Scoring
Transform input data into features without a target, using feature extraction for improved data representation.
Feature Extraction algorithms in OML transform input data into a set of features or dimensions, improving the data's representation. As an unsupervised machine learning technique, feature extraction does not involve a target; this allows models to extract meaningful attributes from the input, optimizing the data for subsequent analysis. Scoring a feature extraction model supports two approaches:
- Producing individual features or attributes as columns that can be used as input to other algorithms.
- Creating a vector output consisting of multiple dimensions, where each dimension corresponds to a feature or attribute.

Both approaches improve data representation and enable downstream analytic tasks; however, the vector output offers additional advantages for handling large, dense data.
- FEATURE_ID and FEATURE_VALUE operators extract individual features.
- FEATURE_SET returns feature ID and value pair sets.
- VECTOR_EMBEDDING operator enables VECTOR data type output for OML feature extraction models, facilitating a unified approach for vectorization.
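Conceptually, the scoring these operators perform can be sketched in numpy (a hypothetical model; this is not the SQL operator API): each extracted feature has a weight vector, a feature's value is the projection of a row onto that vector, and the full set of (feature ID, value) pairs corresponds to what FEATURE_SET returns:

```python
import numpy as np

# Hypothetical model: 3 extracted features over 5 original attributes.
# Each row of `loadings` holds one feature's linear-combination weights.
loadings = np.array([
    [0.5,  0.5,  0.5,  0.5,  0.0],
    [0.7, -0.7,  0.0,  0.0,  0.0],
    [0.0,  0.0,  0.6, -0.6,  0.5],
])

row = np.array([1.0, 2.0, 0.5, 0.0, 1.0])       # one input row to score

values = loadings @ row                          # one value per feature
feature_set = list(enumerate(values, start=1))   # (feature_id, value) pairs
top_id = 1 + int(np.argmax(np.abs(values)))      # strongest feature for the row

print(feature_set)
print(top_id)
```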
The FEATURE_SET operator retrieves the results of the full projection, that is, the transformation of high-dimensional data into a lower-dimensional space while preserving as much of the data's structure and information as possible. You can use this operator to query feature values across all feature IDs. However, this output representation is not ideal for large, dense data, and it requires extra processing before it can be used in vector-based operations such as similarity search. OML therefore supports the VECTOR_EMBEDDING operator, which enables the VECTOR data type as an output representation for feature extraction models such as SVD, PCA, NMF, and ESA (with random projections), enhancing the usability of those algorithms. The benefits of vector output include:
- Vector output optimizes data representation and provides a more compact format, reducing computational requirements for subsequent analytic tasks.
- Vector output enables vector-based operations, including similarity search on relational data.
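For example, once each row is represented as a vector, similarity search reduces to a nearest-neighbor lookup over those vectors; a minimal numpy sketch with made-up embeddings:

```python
import numpy as np

# Made-up 4-dimensional embeddings for five rows of a relational table.
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.8, 0.2, 0.1, 0.0],
    [0.0, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.0],
    [0.5, 0.5, 0.5, 0.5],
])
query = np.array([1.0, 0.0, 0.0, 0.0])

# Cosine similarity between the query vector and every stored vector.
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
sims = (embeddings @ query) / norms
nearest = int(np.argmax(sims))

print(nearest)                 # index of the most similar row
```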
The following cases describe how the vector output is determined:
- Determining the dimension of the output vector: The data dictionary views USER/ALL/DBA_MINING_MODEL_ATTRIBUTES and USER/ALL/DBA_MINING_MODEL_XFORMS for feature extraction models have a new attribute, ORA$VECTOR, of the DTYVEC data type. Its dimension and storage type are detailed in the VECTOR_INFO column. The output vector dimension corresponds to the number of features you specify. If you do not specify this, the algorithms determine an optimal dimension based on the data, looking for natural cut-off points such as significant drops in explained variance.
- Handling partitioned models with FLEX dimension vector: For partitioned models, each partition of data might have different characteristics or levels of complexity, which can result in projections with different dimensions for each partition. For example, one partition might have fewer meaningful features, leading to a lower-dimensional projection, while another might have more complexity, resulting in a higher-dimensional projection. In these cases, the system uses a FLEX dimension vector to accommodate the varying dimensionality. OML partitioned models use each partition's vector in isolation, leveraging the specific data characteristics of that partition. The FLEX dimension type is stored in VECTOR_INFO using the vector format. See ALL_MINING_MODEL_ATTRIBUTES.
- Special case: Zero features: When a model has zero features, the system outputs an empty entry, maintaining consistency with the current behavior of the FEATURE_VALUE operator.
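The cut-off idea described above for the case where no dimension is specified can be sketched with one simple heuristic (illustrative only; the exact rule the algorithms use is not documented here): keep components up to the largest relative drop in explained variance.

```python
import numpy as np

# Illustrative explained-variance fractions per candidate feature
# (made-up numbers): three strong components, then a sharp drop.
explained = np.array([0.55, 0.25, 0.18, 0.008, 0.006, 0.004, 0.002])

# Cut at the largest relative drop between consecutive components.
drops = explained[:-1] / explained[1:]
k = int(np.argmax(drops)) + 1

print(k)   # → 3: keep three features
```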
For a step-by-step example of how you can use Feature Extraction in conjunction with the VECTOR_EMBEDDING operator, see Vectorize Relational Tables Using OML Feature Extraction Algorithms.
Note:
The VECTOR data type and VECTOR_EMBEDDING operator apply only to newly built models. Older models lack the necessary vector output metadata; if the VECTOR_EMBEDDING operator is used with them, the system raises a 40290 error to indicate that the operator is incompatible with the model.
Related Topics