About Feature Extraction

Feature extraction transforms the original attributes into linear combinations of those attributes, reducing dimensionality and enhancing model quality.

Feature extraction is a dimensionality reduction technique. Unlike feature selection, which selects and retains the most significant attributes, feature extraction actually transforms the attributes. In Oracle Machine Learning, the transformed attributes, or features, are linear combinations of the original attributes.

The feature extraction technique results in a much smaller and richer set of attributes. The maximum number of features can be user-specified or determined by the algorithm. By default, the algorithm determines it.
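As a minimal PL/SQL sketch, the maximum number of features could be specified through the FEAT_NUM_FEATURES setting when the model is created (the model name, input table, and case identifier column below are hypothetical):

  DECLARE
    v_settings DBMS_DATA_MINING.SETTING_LIST;
  BEGIN
    -- Select the algorithm and cap the number of extracted features at 5.
    v_settings('ALGO_NAME')         := 'ALGO_NONNEGATIVE_MATRIX_FACTOR';
    v_settings('FEAT_NUM_FEATURES') := '5';

    DBMS_DATA_MINING.CREATE_MODEL2(
      model_name          => 'nmf_feature_model',            -- hypothetical model name
      mining_function     => 'FEATURE_EXTRACTION',
      data_query          => 'SELECT * FROM customer_data',  -- hypothetical input table
      set_list            => v_settings,
      case_id_column_name => 'cust_id');                     -- hypothetical case identifier
  END;
  /

If FEAT_NUM_FEATURES is omitted, the algorithm determines the number of features from the data.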

Models built on extracted features can be of higher quality, because a smaller number of richer attributes concentrates the signal that is spread across many weaker attributes. However, interpreting the resulting features and models becomes more challenging.

You can think of each attribute as one dimension of a data set. Feature extraction projects a data set with higher dimensionality onto a smaller number of dimensions. As such, it is useful for data visualization, because a complex data set can be effectively visualized when it is reduced to two or three dimensions.

Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of machine learning algorithms.

Feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of key words and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.

Feature Extraction and Scoring

Use feature extraction to transform input data into features, without a target, for improved data representation.

Oracle Machine Learning (OML) provides two primary methods for using in-database feature extraction algorithms:
  • Producing individual features or attributes as columns that can be used as input to other algorithms.

  • Creating a vector output consisting of multiple dimensions, where each dimension corresponds to a feature or attribute.

Both approaches improve data representation and enable downstream analytic tasks; however, the vector output offers additional advantages for handling large, dense data.

Feature Extraction algorithms in OML transform input data into a set of features or dimensions improving the data's representation. As an unsupervised machine learning technique, feature extraction does not involve a target. This allows models to extract meaningful attributes from the input, optimizing the data for subsequent analysis.

OML supports scoring operations for feature extraction using the following operators:
  • FEATURE_ID and FEATURE_VALUE operators extract individual features.

  • FEATURE_SET returns feature ID and value pair sets.

  • VECTOR_EMBEDDING operator enables VECTOR data type output for OML feature extraction models, facilitating a unified approach for vectorization.
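As an illustrative sketch, the queries below score a hypothetical nmf_feature_model over a hypothetical customer_data table with a cust_id column. The first query returns scalar columns; the second unnests the feature ID and value pairs returned by FEATURE_SET (the FEATURE_ID and VALUE field names are assumed to match the operator's collection return type):

  -- Highest-valued feature and its value for each row,
  -- plus the value of one specific feature (feature 3).
  SELECT cust_id,
         FEATURE_ID(nmf_feature_model USING *)       AS top_feature_id,
         FEATURE_VALUE(nmf_feature_model USING *)    AS top_feature_value,
         FEATURE_VALUE(nmf_feature_model, 3 USING *) AS feature_3_value
  FROM   customer_data;

  -- All feature ID and value pairs for a single row, unnested with TABLE().
  SELECT s.feature_id, s.value
  FROM   (SELECT FEATURE_SET(nmf_feature_model USING *) AS fset
          FROM   customer_data
          WHERE  cust_id = 100001) t,           -- hypothetical case identifier value
         TABLE(t.fset) s
  ORDER  BY s.value DESC;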

The FEATURE_SET operator retrieves the results of the full projection, that is, the transformation of high-dimensional data into a lower-dimensional space that preserves as much of the data's structure and information as possible. You can use this operator to query feature values across all feature IDs. However, its output representation is not ideal for large, dense data, and it requires extra processing before it can be used in vector-based operations such as similarity search.

The VECTOR_EMBEDDING operator enables the VECTOR data type as an output representation for feature extraction models such as SVD, PCA, NMF, and ESA (with random projections), which enhances the usability of those algorithms. The benefits of vector output include:
  • Vector output optimizes data representation and provides a more compact format, reducing computational requirements for subsequent analytic tasks.

  • Vector output enables vector-based operations, including similarity search on relational data.
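For example, a hedged sketch of this workflow might materialize one embedding per row and then run a similarity search over the resulting VECTOR column (the model, table, and identifier names are hypothetical):

  -- Materialize one VECTOR per row using the feature extraction model.
  CREATE TABLE customer_embeddings AS
  SELECT cust_id,
         VECTOR_EMBEDDING(nmf_feature_model USING *) AS embedding
  FROM   customer_data;

  -- Similarity search: the five rows closest to case 100001
  -- by cosine distance between embeddings.
  SELECT e.cust_id
  FROM   customer_embeddings e,
         (SELECT embedding FROM customer_embeddings WHERE cust_id = 100001) q
  WHERE  e.cust_id <> 100001
  ORDER  BY VECTOR_DISTANCE(e.embedding, q.embedding, COSINE)
  FETCH FIRST 5 ROWS ONLY;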

The following cases describe how the vector output is determined:

  • Determining the dimension of the output vector: For feature extraction models, the data dictionary views USER/ALL/DBA_MINING_MODEL_ATTRIBUTES and USER/ALL/DBA_MINING_MODEL_XFORMS include a new attribute, ORA$VECTOR, of the DTYVEC data type. Its dimension and storage type are detailed in the VECTOR_INFO column (see the query sketch after this list).

    The output vector dimension corresponds to the number of features you require and specify. If you do not specify this, algorithms determine an optimal dimension based on the data, looking for natural cut-off points such as significant drops in explained variance.

  • Handling partitioned models with FLEX dimension vector: For partitioned models, each partition of data might have different characteristics or levels of complexity, which can result in projections with different dimensions for each partition. For example, one partition might have fewer meaningful features, leading to a lower-dimensional projection, while another might have more complexity, resulting in a higher-dimensional projection. In these cases, the system uses a FLEX dimension vector to accommodate the varying dimensionality. OML partitioned models use each partition's vector in isolation, leveraging the specific data characteristics of that partition. The FLEX dimension type is stored in VECTOR_INFO using the vector format. See ALL_MINING_MODEL_ATTRIBUTES.

  • Special case: Zero features: When a model has zero features, the system outputs an empty entry, maintaining consistency with the current behavior of the FEATURE_VALUE operator.
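For example, a query such as the following (with a hypothetical model name) shows the ORA$VECTOR attribute and its VECTOR_INFO details for a feature extraction model:

  SELECT attribute_name, attribute_type, vector_info
  FROM   user_mining_model_attributes
  WHERE  model_name     = 'NMF_FEATURE_MODEL'   -- hypothetical model name
  AND    attribute_name = 'ORA$VECTOR';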

For a step-by-step example on how you can use Feature Extraction in conjunction with the VECTOR_EMBEDDING operator, see Vectorize Relational Tables Using OML Feature Extraction Algorithms.

Note:

The VECTOR data type and the VECTOR_EMBEDDING operator apply only to newly built models. Older models lack the necessary vector output metadata; if the VECTOR_EMBEDDING operator is used with them, the system raises error 40290 to indicate that the operator is incompatible with the model.

Algorithms for Feature Extraction

Understand the algorithms used for feature extraction.

OML4SQL supports these feature extraction algorithms:

  • Explicit Semantic Analysis (ESA).

  • Non-Negative Matrix Factorization (NMF).

  • Singular Value Decomposition (SVD) and Principal Component Analysis (PCA).

Note:

OML4SQL uses NMF as the default feature extraction algorithm.
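If a different algorithm is needed, the ALGO_NAME setting overrides the default. The following PL/SQL sketch requests PCA by running SVD with PCA scoring mode (the setting names are assumed to follow the DBMS_DATA_MINING package; the model and table names are hypothetical):

  DECLARE
    v_settings DBMS_DATA_MINING.SETTING_LIST;
  BEGIN
    -- PCA is obtained by selecting SVD with PCA scoring mode.
    v_settings('ALGO_NAME')         := 'ALGO_SINGULAR_VALUE_DECOMP';
    v_settings('SVDS_SCORING_MODE') := 'SVDS_SCORING_PCA';
    -- For ESA, ALGO_NAME would instead be 'ALGO_EXPLICIT_SEMANTIC_ANALYS'.

    DBMS_DATA_MINING.CREATE_MODEL2(
      model_name          => 'pca_feature_model',            -- hypothetical
      mining_function     => 'FEATURE_EXTRACTION',
      data_query          => 'SELECT * FROM customer_data',  -- hypothetical
      set_list            => v_settings,
      case_id_column_name => 'cust_id');
  END;
  /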