9.11 Explicit Semantic Analysis
The oml.esa class extracts text-based features from a corpus of documents and performs document similarity comparisons.
Explicit Semantic Analysis (ESA) is an unsupervised algorithm for feature extraction. ESA does not discover latent features but instead uses explicit features based on an existing knowledge base.
Explicit knowledge often exists in text form. Multiple knowledge bases are available as collections of text documents. These knowledge bases can be generic, such as Wikipedia, or domain-specific. Data preparation transforms the text into vectors that capture attribute-concept associations.
ESA uses the concepts of an existing knowledge base as features, rather than latent features derived by latent semantic analysis methods such as Singular Value Decomposition and Latent Dirichlet Allocation. Each row in the training data, for example a document, maps to a feature, that is, a concept. ESA has multiple applications in text processing, most notably semantic relatedness (similarity) and explicit topic modeling. Text similarity use cases include resume matching and searching for similar blog postings.
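To make the idea of explicit features concrete, the following is a minimal sketch of the ESA approach in plain Python. It does not show how oml.esa is implemented: the three-concept knowledge base, the helper names (tokenize, build_concept_index, esa_vector, cosine), and the TF-IDF weighting are all assumptions chosen for illustration. Each knowledge-base document is one concept, a text is mapped to a vector of concept weights, and two texts are compared by the cosine of their concept vectors.

```python
import math
from collections import Counter

# Hypothetical knowledge base: each entry is one concept (one feature).
KNOWLEDGE_BASE = {
    "Mars": "mars rover crater water planet nasa mission",
    "HIV/AIDS": "aids hiv drug treatment epidemic africa health",
    "Roads": "road traffic highway bridge blocks transport",
}

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()

def build_concept_index(kb):
    """Return {term: {concept: weight}} using a simple TF-IDF weighting."""
    n_concepts = len(kb)
    term_counts = {name: Counter(tokenize(text)) for name, text in kb.items()}
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    index = {}
    for name, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(1 + n_concepts / doc_freq[term])
            index.setdefault(term, {})[name] = tf * idf
    return index

def esa_vector(text, index):
    """Map a text to concept weights by summing its term-concept weights."""
    vec = Counter()
    for term in tokenize(text):
        for concept, weight in index.get(term, {}).items():
            vec[concept] += weight
    return dict(vec)

def cosine(a, b):
    """Cosine similarity of two sparse concept vectors."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

index = build_concept_index(KNOWLEDGE_BASE)
doc_a = esa_vector("NASA announces major Mars rover finding", index)
doc_b = esa_vector("NASA Mars Odyssey THEMIS image: typical crater", index)
doc_c = esa_vector("Road blocks for Aids", index)

print(max(doc_a, key=doc_a.get))   # strongest concept for doc_a
print(cosine(doc_a, doc_b))        # both texts map to the same concept
print(cosine(doc_a, doc_c))        # no shared concept
```

In this toy setup the two Mars headlines land on the same concept and score as highly similar, while the third headline shares no concept with them. A real knowledge base would have far more concepts and overlapping vocabulary, so the similarities would be graded rather than all-or-nothing.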
For information on the oml.esa class attributes and methods, invoke help(oml.esa) or see Oracle Machine Learning for Python API Reference.
Settings for an Explicit Semantic Analysis Model
The following table lists settings for ESA models.
Table 9-9 Explicit Semantic Analysis Settings
Setting Name | Setting Value | Description |
---|---|---|
ESAS_MIN_ITEMS | A non-negative number | Determines the minimum number of non-zero entries required in an input row. The default value is 100 for text input and 0 for non-text input. |
ESAS_TOPN_FEATURES | A positive integer | Controls the maximum number of features per attribute. The default value is 1000. |
ESAS_VALUE_THRESHOLD | A non-negative number | Sets a small-value threshold for attribute weights in the transformed build data. The default value is 1e-8. |
FEAT_NUM_FEATURES | | The number of features to extract. The default value is estimated by the algorithm. If the matrix rank is smaller than this number, then fewer features are returned. |
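As a small illustration of how these settings are typically assembled before being passed to the model, the sketch below pairs the setting names used in the example in this section with the value constraints from Table 9-9. The check_esa_settings helper and the ESA_CONSTRAINTS mapping are hypothetical and are not part of OML4Py; they only show a client-side sanity check on a settings dictionary.

```python
# Value constraints transcribed from Table 9-9. The pairing of names to
# constraints and the helper below are assumptions made for illustration.
ESA_CONSTRAINTS = {
    "ESAS_MIN_ITEMS": lambda v: v >= 0,                               # non-negative number
    "ESAS_TOPN_FEATURES": lambda v: v > 0 and float(v).is_integer(),  # positive integer
    "ESAS_VALUE_THRESHOLD": lambda v: v >= 0,                         # non-negative number
}

def check_esa_settings(settings):
    """Return the names of settings whose values violate the constraints."""
    invalid = []
    for name, value in settings.items():
        # Accept both plain and double-quoted setting names.
        rule = ESA_CONSTRAINTS.get(name.strip('"').upper())
        if rule is not None and not rule(float(value)):
            invalid.append(name)
    return invalid

settings = {"ESAS_MIN_ITEMS": 1,
            "ESAS_TOPN_FEATURES": 2,
            "ESAS_VALUE_THRESHOLD": "0.01"}

print(check_esa_settings(settings))                    # no violations expected
print(check_esa_settings({"ESAS_TOPN_FEATURES": 0}))   # flags the setting
```

With a database connection in place, a checked dictionary such as this would be passed to the constructor as keyword arguments, in the same way the example in this section calls oml.esa(**odm_settings).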
Example 9-11 Using the oml.esa Class
This example creates an ESA model and uses some of the methods of the oml.esa class.
import oml
from oml import cursor
import pandas as pd
# Create training data and test data.
dat = oml.push(pd.DataFrame(
    {'COMMENTS':['Aids in Africa: Planning for a long war',
                 'Mars rover maneuvers for rim shot',
                 'Mars express confirms presence of water at Mars south pole',
                 'NASA announces major Mars rover finding',
                 'Drug access, Asia threat in focus at AIDS summit',
                 'NASA Mars Odyssey THEMIS image: typical crater',
                 'Road blocks for Aids'],
     'YEAR':['2017', '2018', '2017', '2017', '2018', '2018', '2018'],
     'ID':[1,2,3,4,5,6,7]})).split(ratio=(0.7,0.3), seed = 1234)
train_dat = dat[0]
test_dat = dat[1]
# Specify settings.
cur = cursor()
cur.execute("Begin ctx_ddl.create_policy('DMDEMO_ESA_POLICY'); End;")
cur.close()
odm_settings = {'odms_text_policy_name': 'DMDEMO_ESA_POLICY',
                '"ODMS_TEXT_MIN_DOCUMENTS"': 1,
                '"ESAS_MIN_ITEMS"': 1}
ctx_settings = {'COMMENTS':
                'TEXT(POLICY_NAME:DMDEMO_ESA_POLICY)(TOKEN_TYPE:STEM)'}
# Create an oml ESA model object.
esa_mod = oml.esa(**odm_settings)
# Fit the ESA model according to the training data and parameter settings.
esa_mod = esa_mod.fit(train_dat, case_id = 'ID',
                      ctx_settings = ctx_settings)
# Show model details.
esa_mod
# Use the model to make predictions on test data.
esa_mod.predict(test_dat,
                supplemental_cols = test_dat[:, ['ID', 'COMMENTS']])
esa_mod.transform(test_dat,
                  supplemental_cols = test_dat[:, ['ID', 'COMMENTS']],
                  topN = 2).sort_values(by = ['ID'])
esa_mod.feature_compare(test_dat,
                        compare_cols = 'COMMENTS',
                        supplemental_cols = ['ID'])
esa_mod.feature_compare(test_dat,
                        compare_cols = ['COMMENTS', 'YEAR'],
                        supplemental_cols = ['ID'])
# Change the setting parameter and refit the model.
new_setting = {'ESAS_VALUE_THRESHOLD': '0.01',
               'ODMS_TEXT_MAX_FEATURES': '2',
               'ESAS_TOPN_FEATURES': '2'}
esa_mod.set_params(**new_setting).fit(train_dat, case_id = 'ID',
                                      ctx_settings = ctx_settings)
cur = cursor()
cur.execute("Begin ctx_ddl.drop_policy('DMDEMO_ESA_POLICY'); End;")
cur.close()
Listing for This Example
>>> import oml
>>> from oml import cursor
>>> import pandas as pd
>>>
>>> # Create training data and test data.
... dat = oml.push(pd.DataFrame(
... {'COMMENTS':['Aids in Africa: Planning for a long war',
... 'Mars rover maneuvers for rim shot',
... 'Mars express confirms presence of water at Mars south pole',
... 'NASA announces major Mars rover finding',
... 'Drug access, Asia threat in focus at AIDS summit',
... 'NASA Mars Odyssey THEMIS image: typical crater',
... 'Road blocks for Aids'],
... 'YEAR':['2017', '2018', '2017', '2017', '2018', '2018', '2018'],
... 'ID':[1,2,3,4,5,6,7]})).split(ratio=(0.7,0.3), seed = 1234)
>>> train_dat = dat[0]
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... cur = cursor()
>>> cur.execute("Begin ctx_ddl.create_policy('DMDEMO_ESA_POLICY'); End;")
>>> cur.close()
>>>
>>> odm_settings = {'odms_text_policy_name': 'DMDEMO_ESA_POLICY',
... '"ODMS_TEXT_MIN_DOCUMENTS"': 1,
... '"ESAS_MIN_ITEMS"': 1}
>>>
>>> ctx_settings = {'COMMENTS':
... 'TEXT(POLICY_NAME:DMDEMO_ESA_POLICY)(TOKEN_TYPE:STEM)'}
>>>
>>> # Create an oml ESA model object.
... esa_mod = oml.esa(**odm_settings)
>>>
>>> # Fit the ESA model according to the training data and parameter settings.
... esa_mod = esa_mod.fit(train_dat, case_id = 'ID',
... ctx_settings = ctx_settings)
>>>
>>> # Show model details.
... esa_mod
Algorithm Name: Explicit Semantic Analysis
Mining Function: FEATURE_EXTRACTION
Settings:
setting name setting value
0 ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS
1 ESAS_MIN_ITEMS 1
2 ESAS_TOPN_FEATURES 1000
3 ESAS_VALUE_THRESHOLD .00000001
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 ODMS_TEXT_MAX_FEATURES 300000
8 ODMS_TEXT_MIN_DOCUMENTS 1
9 ODMS_TEXT_POLICY_NAME DMDEMO_ESA_POLICY
10 PREP_AUTO ON
Global Statistics:
attribute name attribute value
0 NUM_ROWS 4
Attributes:
COMMENTS
YEAR
Partition: NO
Features:
FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
0 1 COMMENTS.AFRICA None 0.342997
1 1 COMMENTS.AIDS None 0.171499
2 1 COMMENTS.LONG None 0.342997
3 1 COMMENTS.PLANNING None 0.342997
... ... ... ... ...
24 6 COMMENTS.ODYSSEY None 0.282843
25 6 COMMENTS.THEMIS None 0.282843
26 6 COMMENTS.TYPICAL None 0.282843
27 6 YEAR 2018 0.707107
>>> # Use the model to make predictions on test data.
... esa_mod.predict(test_dat,
... supplemental_cols = test_dat[:, ['ID', 'COMMENTS']])
ID COMMENTS FEATURE_ID
0 4 NASA announces major Mars rover finding 3
1 6 NASA Mars Odyssey THEMIS image: typical crater 2
2 7 Road blocks for Aids 5
>>>
>>> esa_mod.transform(test_dat,
... supplemental_cols = test_dat[:, ['ID', 'COMMENTS']],
... topN = 2).sort_values(by = ['ID'])
ID COMMENTS TOP_1 TOP_1_VAL \
0 4 NASA announces major Mars rover finding 3 0.647065
1 6 NASA Mars Odyssey THEMIS image: typical crater 2 0.766237
2 7 Road blocks for Aids 5 0.759125
TOP_2 TOP_2_VAL
0 1 0.590565
1 2 0.616672
2 2 0.632604
>>>
>>> esa_mod.feature_compare(test_dat,
... compare_cols = 'COMMENTS',
... supplemental_cols = ['ID'])
ID_A ID_B SIMILARITY
0 4 6 0.946469
1 4 7 0.871994
2 6 7 0.954565
>>> esa_mod.feature_compare(test_dat,
... compare_cols = ['COMMENTS', 'YEAR'],
... supplemental_cols = ['ID'])
ID_A ID_B SIMILARITY
0 4 6 0.467644
1 4 7 0.377144
2 6 7 0.952857
>>> # Change the setting parameter and refit the model.
... new_setting = {'ESAS_VALUE_THRESHOLD': '0.01',
... 'ODMS_TEXT_MAX_FEATURES': '2',
... 'ESAS_TOPN_FEATURES': '2'}
>>> esa_mod.set_params(**new_setting).fit(train_dat, case_id = 'ID',
... ctx_settings = ctx_settings)
Algorithm Name: Explicit Semantic Analysis
Mining Function: FEATURE_EXTRACTION
Settings:
setting name setting value
0 ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS
1 ESAS_MIN_ITEMS 1
2 ESAS_TOPN_FEATURES 2
3 ESAS_VALUE_THRESHOLD 0.01
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 ODMS_TEXT_MAX_FEATURES 2
8 ODMS_TEXT_MIN_DOCUMENTS 1
9 ODMS_TEXT_POLICY_NAME DMDEMO_ESA_POLICY
10 PREP_AUTO ON
Global Statistics:
attribute name attribute value
0 NUM_ROWS 4
Attributes:
COMMENTS
YEAR
Partition: NO
Features:
FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
0 1 COMMENTS.AIDS None 0.707107
1 1 YEAR 2017 0.707107
2 2 COMMENTS.MARS None 0.707107
3 2 YEAR 2018 0.707107
4 3 COMMENTS.MARS None 0.707107
5 3 YEAR 2017 0.707107
6 5 COMMENTS.AIDS None 0.707107
7 5 YEAR 2018 0.707107
>>>
>>> cur = cursor()
>>> cur.execute("Begin ctx_ddl.drop_policy('DMDEMO_ESA_POLICY'); End;")
>>> cur.close()
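One way to read the Features output of the refit model: with only two coefficients kept per feature, each printed at 0.707107 (approximately 1/sqrt(2)), the coefficient vector of a feature has a Euclidean length of about 1. The quick check below uses the values for feature 1 copied from the listing. It is an observation about the printed numbers, not a documented guarantee of the algorithm.

```python
import math

# Coefficients of FEATURE_ID 1, copied from the refit listing above.
feature_1 = {"COMMENTS.AIDS": 0.707107, "YEAR=2017": 0.707107}

# Euclidean (L2) norm of the coefficient vector.
norm = math.sqrt(sum(c * c for c in feature_1.values()))
print(round(norm, 4))  # 1.0
```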