3.8 About Entity Extraction and CTX_ENTITY
Entity extraction is the identification and extraction of named entities within text. Entities are mainly nouns and noun phrases, such as names, places, times, coded strings (such as phone numbers and zip codes), percentages, and monetary amounts. The CTX_ENTITY
package implements entity extraction by means of a built-in dictionary and a set of rules for English text. You can extend the capabilities for English and other languages with user-provided add-on dictionaries and rule sets.
See Also:
-
CTX_ENTITY
Package in Oracle Text Reference -
Entity Extraction User Dictionary Loader (ctxload) in Oracle Text Reference
This section contains the following examples:
3.8.1 Basic Example of Using Entity Extraction
The example in this section provides a very basic example of entity extraction. The example assumes that a CLOB contains the following text:
New York, United States of America The Dow Jones Industrial Average climbed by 5% yesterday on news of a new software release from database giant Oracle Corporation.
The example uses CTX_ENTITY.EXTRACT
to find the entities in CLOB value. (For now, do not worry about how the text got into the CLOB or how we provide the output CLOB.)
Entity extraction requires a new type of policy, an "extract policy," which enables you to specify options. For now, create a default policy:
ctx_entity.create_extract_policy( 'mypolicy' );
Now you can call extract
to do the work. It needs four arguments: the policy name, the document to process, the language, and the output CLOB (which you should have initialized, for example, by calling dbms_lob.createtemporary
).
ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob )
In the previous example, outclob
contains the XML that identifies extracted entities. When you display the contents (preferably by selecting it as XMLTYPE so that it is formatted nicely), here is what you see:
<entities> <entity id="0" offset="0" length="8" source="SuppliedDictionary"> <text>New York</text> <type>city</type> </entity> <entity id="1" offset="150" length="18" source="SuppliedRule"> <text>Oracle Corporation</text> <type>company</type> </entity> <entity id="2" offset="10" length="24" source="SuppliedDictionary"> <text>United States of America</text> <type>country</type> </entity> <entity id="3" offset="83" length="2" source="SuppliedRule"> <text>5%</text> <type>percent</type> </entity> <entity id="4" offset="113" length="8" source="SuppliedDictionary"> <text>software</text> <type>product</type> </entity> <entity id="5" offset="0" length="8" source="SuppliedDictionary"> <text>New York</text> <type>state</type> </entity> </entities>
This display is fine if you process it with an XML-aware program. However, if you want it in a more "SQL friendly" view, use Oracle XML Database (XML DB) functions to convert it as follows:
select xtab.offset, xtab.text, xtab.type, xtab.source from xmltable( '/entities/entity' PASSING xmltype(outclob) COLUMNS offset number PATH '@offset', lngth number PATH '@length', text varchar2(50) PATH 'text/text()', type varchar2(50) PATH 'type/text()', source varchar2(50) PATH '@source' ) as xtab order by offset;
Here is the output:
OFFSET TEXT TYPE SOURCE ---------- ------------------------- -------------------- -------------------- 0 New York city SuppliedDictionary 0 New York state SuppliedDictionary 10 United States of America country SuppliedDictionary 83 5% percent SuppliedRule 113 software product SuppliedDictionary 150 Oracle Corporation company SuppliedRule
If you do not want to fetch all entity types, you can select the types by adding a fourth argument to the "extract" procedure, with a comma-separated list of entity types. For example:
ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob, 'city, country' ) That would give us the XML <entities> <entity id="0" offset="0" length="8" source="SuppliedDictionary"> <text>New York</text> <type>city</type> </entity> <entity id="2" offset="10" length="24" source="SuppliedDictionary"> <text>United States of America</text> <type>country</type> </entity> </entities>
3.8.2 Example of Creating a New Entity Type by Using a User-Defined Rule
The example in this section shows how to create a new entity type with a user-defined rule. You define rules with a regular-expression-based syntax and add the rules to an extraction policy. The policy is applied whenever it is used.
The following rule identifies increases in a stock index by matching any of the following expressions:
climbed by 5% increased by over 30 percent jumped 5.5%
Therefore, you must create a new type of entity as well as a regular expression that matches any of the expressions:
exec ctx_entity.add_extract_rule( 'mypolicy', 1, '<rule>' || '<expression>' || '((climbed|gained|jumped|increasing|increased|rallied)' || '( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))' || '</expression>' || '<type>Positive Gain</type>' || '</rule>');
In this case, you must compile the policy with CTX_ENTITY.COMPILE:
ctx_entity.compile('mypolicy');
Then you can use it as before:
ctx_entity.extract('mypolicy', mydoc, null, myresults)
Here is the (abbreviated) output:
<entities> ... <entity id="6" offset="72" length="18" source="UserRule"> <text>climbed by over 5%</text> <type>Positive Gain</type> </entity> </entities>
Finally, you add another user-defined entity, but this time it uses a dictionary. You want to recognize "Dow Jones Industrial Average" as an entity of type Index.
You also add "S&P 500". To do that, create an XML file containing the following:
<dictionary> <entities> <entity> <value>dow jones industrial average</value> <type>Index</type> </entity> <entity> <value>S&P 500</value> <type>Index</type> </entity> </entities> </dictionary>
Case is not significant in this file, but notice how the "&" in "S&P" must be specified as the XML entity &
. Otherwise, the XML is not valid.
This XML file is loaded into the system with the CTXLOAD
utility. If the file were called dict.load
, you would use the following command:
ctxload -user username/password -extract -name mypolicy -file dict.load
You must compile the policy with CTX_ENTITY.COMPILE.