Content Categorizer and Link Manager are optional components that are automatically installed with Oracle WebCenter Content Server. When enabled, Content Categorizer suggests metadata values for new documents checked in to Content Server and for existing documents, whether or not they already have metadata values. When enabled, Link Manager evaluates, filters, and parses the URL links of indexed content items before extracting them for storage in a database table (ManagedLinks).
For Content Categorizer to recognize structural properties, the content must go through XML Conversion (eXtensible Markup Language). The conversion method is defined in the sccXMLConversion configuration variable. Content Categorizer uses Search Rules to suggest metadata values for content.
The Batch Categorizer that is included with the component can search a large number of files and create a Batch Loader control file containing appropriate metadata field values. The Batch Categorizer can also be used to recategorize content checked in to the repository.
Important:
There is a problem with the XSLT transformation used to post-process PDF content converted using the Flexiondoc schema. When the Flexiondoc schema is used, single words are assigned to individual XML elements, making the final XML unusable. It is necessary to use SearchML for categorizing PDF content.
Regardless of which XML converter is specified, the XML intermediate files are used only by Content Categorizer, so they are discarded after use, and documents are checked in to Content Server in their original source form. The only exception is content that is in XML format, which is not subjected to the translation process.
With each converter, the OutsideIn XML Export technology is used in combination with a custom XSLT style sheet (flexiondoc_to_scc.xsl) to produce XML in a two-stage process. In the first stage, the native document is converted to either Flexiondoc-formatted XML or SearchML-formatted XML.
In the second stage, the style sheet is used to further refine the XML so that it is searchable by Content Categorizer. Native document properties and text segments are isolated in XML elements, which are named after the corresponding document property, paragraph style, or character style (note that character styles are not supported by SearchML).
For a list of file formats supported by OutsideIn XML Export, see Chapter 40, "Input File Formats."
Content Categorizer executes search rules depending on the type of rule defined:
Pattern Matching and Abstract Rules: Content Categorizer scans a content document looking for "landmarks." A landmark can be specific text, or it can be based on structural properties of the source document, such as styles, fonts, and formatting.
Option List Rule: Content Categorizer searches for keywords whose cumulative score determines which option of a list is selected. It does not look for either landmarks or specific XML tags.
Categorization Engine Rule: Content Categorizer invokes a third-party categorizer engine and taxonomy to categorize a content item.
Filetype Rule: Content Categorizer looks for the document file type (the file name extension).
Normally, a user-entered value on the Content Check In Form prevents Content Categorizer from applying the search rules for that field. This is also true for list fields that have a default value, such as the Type field.
Important:
Instruct contributors to leave blank any fields that they want search rules to fill.
For more information about search rules, see Section 9.1.5.
The following tasks must be done to run Content Categorizer:
Define the XML Conversion method. For more information, see Section 9.1.4.1.
Define search rules. For more information, see Section 9.1.5.6.
Optional: Define field properties, including default values for metadata fields. For more information, see Section 9.1.4.2.
Important:
To use the CATEGORY search rule, install, set up and register a categorizer engine before defining the CATEGORY rule for any metadata fields.
Content Categorizer can operate in either Interactive mode or Batch mode. Both modes require conversion of the source documents into XML intermediate form. However, the process flows of the two modes are distinctly different.
Batch mode is used when recategorizing large numbers of documents in the repository. The system administrator uses a standalone utility to run Content Categorizer, then either performs a live update of content metadata or uses the output file from Batch Categorizer as input to the Batch Loader. For more information about the steps used during this process, see Section 9.1.3.1.1.
Interactive mode integrates Content Categorizer with the Content Check In Form and Info Update Form. Users click Categorize on the form to run Content Categorizer on a single content item. Any value that is returned by Content Categorizer is a suggested value, because the contributor can edit or replace the returned value. For more information about the steps taken during this process, see Section 9.1.3.1.2.
The MaxQueryRows configuration variable is used to specify the maximum number of documents that can be included in a single batch load process. As such, it affects how many documents a user sees in Batch Categorizer. The default setting for this configuration variable is 200 but can be decreased or increased as necessary. For more information about the variable, see Oracle Fusion Middleware Configuration Reference for Oracle WebCenter Content.
The system administrator performs the following steps during the batch mode process:
Run the Batch Categorizer application. For more information about running applications on UNIX systems, see the Oracle Fusion Middleware Administering Oracle WebCenter Content.
If necessary, on the Batch Categorizer page, define filters and release date information to display a list of content to be categorized. Click Categorize.
On the Categorize Existing page, select Live Update or Batch Loader.
The Live Update option updates the data in the repository immediately.
The Batch Loader option is used to create a control file, which is the output of the Content Categorizer process. The file contains an entry for each source document, and contains the values for each metadata field based on the search rules defined in Content Categorizer. You can edit this file before submitting it to the Batch Loader.
To run the Batch Loader utility automatically after the Content Categorizer process is complete, select the Run Batch Loader check box.
Enter the location and file name for the log file that contains error information about the Content Categorizer process.
Choose Categorize All to work with all content items or Categorize Selected to use only the highlighted items in the content list.
Choose to categorize the Latest Revision, which works with only the most recent revision of an item, or All Revisions.
Choose to continue or discontinue the categorization process when Batch Categorizer encounters an error.
Click OK. The Progress bar shows the progress as the batch process moves through its steps:
Content Categorizer locates the source content.
If the content is in XML format, no translation occurs, and the process continues at step d.
If the content is not in XML format, conversion into XML occurs using the selected XML conversion method: Flexiondoc or SearchML.
Content Categorizer applies the search rules to the XML and obtains values for the specified metadata fields.
If Live Update was specified, database records are updated immediately. If Batch Loader was specified, an output control file is created, and the Batch Loader utility is run, if the option to do so after processing was specified.
When the batch process is complete, review the error logs. Errors encountered by Batch Categorizer are displayed on the console and also recorded in the Batch Categorizer log (if specified). Errors encountered by Batch Loader are displayed on the console and also recorded in the system log.
If the optional AddCCToArchiveCheckin component is installed and enabled, all content loaded using the Batch Loader utility is categorized automatically, based on predefined rule sets. For more information about defining rule sets, see Section 9.1.5.6.
The following steps occur during the check-in process:
A contributor opens the Content Check In Form or the Info Update Form, selects a primary file (only on Content Check In Form), and clicks Categorize.
The Content Check In Form copies the primary file to the host and calls the Content Categorizer service.
Content Categorizer locates the source content.
If the content is in XML format, no translation occurs, and the process continues at step 6.
If the content is not in XML format, the specified conversion method is used.
Content Categorizer applies the search rules to the XML and obtains suggested values for the specified metadata fields.
Content Categorizer inserts the suggested metadata values into the Content Check In Form or Info Update Form, and returns the form to the contributor.
The contributor can check in or submit the document with the suggested values, revise the metadata values, or cancel the check in or update.
If the optional AddCCToNewCheckin component is installed and enabled, when you click Check In on the Content Check In Form, it performs steps 2 through 6 and completes the check in process, provided the properties for dDocTitle are set to Override Contents.
If the properties of dDocTitle are not set to Override Contents, then an alert is displayed requesting that the required field is completed. Field properties are set using the CC Admin Applet. For more information, see Section 9.1.4.2.
Before using Content Categorizer, install and configure the necessary software. This section discusses those tasks.
To set the XML conversion method in Content Categorizer:
Choose Administration then Content Categorizer Administration from the Main menu.
On the Content Categorizer Administration page, click Configuration.
On the Configuration tab, select the sccXMLConversion property and click Edit or double-click the property.
From the list on the Property Config page, select either Flexiondoc or SearchML as the XML conversion method.
Click OK.
Click Apply to save the changes.
When any rule for a field succeeds, the found value is used (in either Batch Loader operations or Live Update operations). However, if Override is set to false for the field, the found value does not replace an existing value.
When all rules for a field fail, no value is assigned to the field unless a default value is defined for the field and Use Default is set to true.
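The interaction of the Override and Use Default settings can be sketched as follows. This is only an illustrative Python sketch, not Content Categorizer code: the function name resolve_field_value and its parameters are hypothetical, and the sketch assumes that a default value never replaces an existing value.

```python
def resolve_field_value(existing, found, default=None,
                        override=False, use_default=False):
    """Return the value a field ends up with (hypothetical helper).

    existing    -- value already present in the field ("" if none)
    found       -- value returned by the first successful rule, or None
    default     -- default value defined for the field, or None
    override    -- the field's Override setting
    use_default -- the field's Use Default setting
    """
    if found is not None:
        # A rule succeeded. The found value replaces an existing value
        # only when Override is true.
        if existing and not override:
            return existing
        return found
    # All rules failed (or none are defined).
    if use_default and default is not None:
        # Assumption: the default fills an empty field only.
        return existing or default
    return existing


print(resolve_field_value("Memo", "Invoice"))                 # Memo
print(resolve_field_value("Memo", "Invoice", override=True))  # Invoice
```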
To define field properties for the metadata fields:
Choose Administration then Content Categorizer Administration from the Main menu.
On the Content Categorizer Administration page, click the Field Properties tab.
Select a metadata field to be edited and click Edit, or double-click the field.
On the Field Properties page, enter a default value for the field.
The default value for a list field must match a value available for that field.
Select the Override check box for the value returned by the categorization process to override an existing value for the field.
Select the Use Default check box for the field's default value to be used if all rules fail (or are not defined) when the categorization process runs.
Click OK.
Repeat these steps for each field to be edited.
Click Save Settings to save the changes.
Search rules define how Content Categorizer determines metadata values to return to the Content Check In Form or Info Update Form (for Interactive mode) or the batch file (for Batch mode).
This section discusses the following information regarding search rules:
Every search rule is defined by:
A rule type, which determines the method that Content Categorizer uses to search the XML document.
A key, which defines the XML element, phrase, or keyword that Content Categorizer looks for in the document, or the categorization engine/taxonomy that Content Categorizer uses to classify the document.
A count, which is used to refine the search criteria.
Consider the following guidelines when creating search rules:
You can apply search rules to any custom metadata field.
You can apply search rules to the Title, Comments, and Type standard metadata fields. You cannot define search rules for any other standard metadata fields (such as Author, Security Group, and Account).
You can define multiple search rules for a metadata field. (For a single metadata field, however, multiple CATEGORY rules referring to different taxonomies are not supported.)
Multiple search rules are run in the order specified, so that if a search rule does not result in a suggested value, the next rule is run. Arrange the list from most to least specific.
You can mix search rule types within a metadata field. For example, you can define an Option List rule, a Pattern Matching rule, and an Abstract rule for the same metadata field.
If none of the search rules specified for a metadata field can be satisfied, the field is left blank.
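The ordering guidelines above amount to a first-match-wins chain. The following Python sketch illustrates that behavior under stated assumptions: each rule is modeled as a hypothetical callable that returns a suggested value or nothing, and the document is modeled as a plain dictionary. Neither reflects the actual Content Categorizer interfaces.

```python
def first_successful_rule(rules, document):
    """Run rules in order and return the first non-empty suggestion."""
    for rule in rules:
        value = rule(document)
        if value:
            # The first rule that yields a value wins; later rules are skipped.
            return value
    # No rule was satisfied, so the field is left blank.
    return ""


# Hypothetical rules for one metadata field, most specific first.
rules = [
    lambda doc: doc.get("Invoice"),
    lambda doc: doc.get("Title"),
]

print(first_successful_rule(rules, {"Title": "Q3 Report"}))  # Q3 Report
```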
Pattern Matching search rules look for specific text or a specific XML element and return an associated value. For example, the Invoice # metadata field can contain the value that follows an Invoice: or Invoice Number: label in the source document, or it can contain the value that is within the <Invoice> tag in the XML document.
There are two general types of Pattern Matching rules: Tag Search and Text Search. Within each type are several sub-types.
Tag Search searches for the full name of an XML element that matches the key. If such an element is found, the text contained in the element is returned as the result. Tag searches are case sensitive. Sub-types include the following:
TAG_TEXT
TAG_ALLTEXT
Text Search searches for text that matches the key. If such text is found, the text near or following the key is returned as the result. Text searches are not case sensitive. Sub-types include the following:
TEXT_REMAINDER
TEXT_FULL
TEXT_ALLREMAINDER
TEXT_ALLFULL
TEXT_NEXT
TEXT_ALLNEXT
The key for a Pattern Matching search rule is either an XML element (for a Tag Search) or a text phrase (for a Text Search).
The count for a Pattern Matching search rule defines the number of tags or text phrases that must be matched before the rule returns results. For example, a count of 4 looks for the fourth occurrence of the key. If only three occurrences of the key are found in the document, the rule fails. The default count of 1 returns the first occurrence of the key.
The following examples illustrate the use of the Pattern Matching search rules.
This rule searches for the full name of an XML element that matches the key (including case). If such an element is found, all text that belongs to the element is concatenated and returned as the result.
Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>
<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TAG_TEXT
Key: TAG_A
Returns: Title: The Big Wolf
This rule searches for the full name of an XML element that matches the key (including case). If such an element is found, all text that belongs to the element, and to all children of the element, is concatenated and returned as the result.
Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>
<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TAG_ALLTEXT
Key: TAG_A
Returns: Title: The Big Bad Wolf
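The difference between TAG_TEXT and TAG_ALLTEXT can be reproduced with a short Python sketch that uses the standard xml.etree.ElementTree module. This is an approximation for illustration, not the product's implementation: the tag_search function is hypothetical, the <doc> wrapper element is added only so that the sample parses as a single document, and whitespace is collapsed to match the results shown above.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<doc>"
    "<TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>"
    "<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>"
    "</doc>"
)


def _squash(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def tag_search(root, key, count=1, all_text=False):
    """Return text of the count-th element named key, or None on failure."""
    matches = list(root.iter(key))      # element names are case sensitive
    if len(matches) < count:
        return None                     # fewer occurrences than the count
    elem = matches[count - 1]
    if all_text:
        # TAG_ALLTEXT: text of the element and of all its children.
        return _squash("".join(elem.itertext()))
    # TAG_TEXT: only text that belongs directly to the element.
    own = [elem.text or ""] + [child.tail or "" for child in elem]
    return _squash("".join(own))


root = ET.fromstring(SAMPLE)
print(tag_search(root, "TAG_A"))                 # Title: The Big Wolf
print(tag_search(root, "TAG_A", all_text=True))  # Title: The Big Bad Wolf
```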
This rule searches for text that matches the key (except for case). If such text is found, any text following the key that belongs to the same XML element is returned as the result.
Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>
<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_REMAINDER
Key: Title:
Returns: The Big Wolf
This rule searches for text that matches the key (except for case). If such text is found, any text following the key that belongs to the same XML element, and to all children of the element, is returned as the result.
Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>
<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_ALLREMAINDER
Key: Title:
Returns: The Big Bad Wolf
This rule searches for text that matches the key (except for case). If such text is found, any text that belongs to the same XML element, including the key text, is returned as the result.
Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>
<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_FULL
Key: Title:
Returns: Title: The Big Wolf
This rule searches for text that matches the key (except for case). If such text is found, any text that belongs to the same XML element, including the key text and any text belonging to children of the element, is returned as the result.
Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>
<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_ALLFULL
Key: Title:
Returns: Title: The Big Bad Wolf
This rule searches for text that matches the key (except for case). If such text is found, any text that belongs to the next non-blank XML element is returned as the result. Blank elements and elements composed of non-printing characters are not selected as the return value.
Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>
<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_NEXT
Key: Title:
Returns: Subtitle: A Play
This rule searches for text that matches the key (except for case). If such text is found, any text that belongs to the next non-blank XML element, and to all children of the element, is returned as the result. Blank elements and elements composed of non-printing characters are not selected as the return value.
Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>
<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_ALLNEXT
Key: Title:
Returns: Subtitle: A Morality Play
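The six Text Search sub-types can be approximated in Python as shown below. This is a sketch under several assumptions rather than the product's algorithm: the text_search function is hypothetical, the <doc> wrapper exists only so that the sample parses, the key is matched against text that belongs directly to an element, and the "next" element is taken to be the next non-blank element that is not a child of the matching element (which is what the TEXT_NEXT example above implies).

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<doc>"
    "<TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>"
    "<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>"
    "</doc>"
)


def _squash(text):
    return " ".join(text.split())


def _own(elem):
    """Text belonging directly to elem (children excluded)."""
    return _squash((elem.text or "") + "".join(c.tail or "" for c in elem))


def _all(elem):
    """Text of elem and all of its children."""
    return _squash("".join(elem.itertext()))


def text_search(root, rule, key):
    """Apply a TEXT_* rule; return the result or None if the rule fails."""
    kind = rule[len("TEXT_"):]                  # e.g. "ALLREMAINDER"
    include_children = kind.startswith("ALL")
    mode = kind[3:] if include_children else kind
    text_of = _all if include_children else _own
    elements = list(root.iter())                # document order
    for index, elem in enumerate(elements):
        if key.lower() not in _own(elem).lower():   # not case sensitive
            continue
        if mode == "FULL":
            return text_of(elem)
        if mode == "REMAINDER":
            text = text_of(elem)
            pos = text.lower().find(key.lower())
            if pos < 0:
                continue
            return text[pos + len(key):].strip()
        if mode == "NEXT":
            inside = set(elem.iter())           # elem and its descendants
            for later in elements[index + 1:]:
                if later not in inside and text_of(later):
                    return text_of(later)       # first non-blank element
            return None
    return None


root = ET.fromstring(SAMPLE)
for rule in ("TEXT_REMAINDER", "TEXT_ALLFULL", "TEXT_NEXT"):
    print(rule, "->", text_search(root, rule, "Title:"))
```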
Abstract search rules look for an XML element and return a descriptive sentence or paragraph from that element. For example, the Summary metadata field could be filled by a returned value of "Germany is a large country in size, culture, and worldwide economics. One of Germany's largest industries includes the manufacturing of world class automobiles like BMW, Mercedes, and Audi."
The Abstract rule type is useful where there is no readily identifiable or explicitly tagged block of text in the content item. Typically, these rules are used to suggest summary or topic information about the document.
There are two types of abstract search rules: First Paragraph and First Sentence.
First Paragraph searches for the full name of an XML element that matches the key. The entire paragraph of the first such element that meets the size criteria (specified by the count) is returned as the result.
First Sentence searches for the full name of an XML element that matches the key. If such an element is found, the first sentence of the element is returned as the result.
The key for an Abstract search rule is an XML element.
The count is interpreted differently for the First Paragraph and First Sentence search rules.
For a First Paragraph search rule, the count is a size threshold measured in percent:
The rule searches the document for all paragraphs that match the key.
The rule calculates the average size (based on character count) of the paragraphs that match the key.
The rule multiplies the average size by the count percentage (0 = 0%, 100 = 100%).
The rule looks for the first paragraph larger than the resulting number.
For example, if the count is set to 75 and the average paragraph size is 100 characters, the rule returns the first paragraph larger than 75 characters that matches the key.
If the count is set to the default of 1, the rule is likely to return the first paragraph that matches the key.
For a First Sentence search rule, the count is the number of elements that have their first sentences returned.
For example, if the count is set to 3, the rule returns the first sentence from each of the first three elements that match the key.
The following examples illustrate the use of the Abstract search rules.
This example returns the first <Text> element that exceeds one-half the average <Text> element paragraph size. Note that the <Title> element does not match the key value, so it is ignored for both the search and for the average length calculation.
Content: <Title>Poem</Title>
<Text>Mary had</Text>
<Text>a little Lamb</Text>
<Text>The fleece was white as snow</Text>
<Text>And everywhere that Mary went the lamb was sure to go</Text>
Rule: FIRST_PARAGRAPH
Key: Text
Count: 50
Returns: The fleece was white as snow.
This example returns the first sentence of the first two <Text> elements. Note that the <Title> element does not match the key value, so it is excluded from the search.
Content: <Title>Barefoot in the Park</Title>
<Text>See Dick run. See Jane run. See Dick and Jane.</Text>
<Text>See Spot run. See Puff chase Spot.</Text>
<Text>See Dick chase Spot and Puff.</Text>
Rule: FIRST_SENTENCE
Key: Text
Count: 2
Returns: See Dick run. See Spot run.
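The way the count is interpreted by the two Abstract rules can be sketched in Python as follows. The functions first_paragraph and first_sentence are hypothetical, they take the text of the elements that already match the key, and a sentence is assumed to end at the first period. The paragraph data is invented so that the average size is 100 characters, matching the 75 percent arithmetic described earlier.

```python
def first_paragraph(paragraphs, count=1):
    """Return the first paragraph larger than count percent of the average."""
    if not paragraphs:
        return None
    average = sum(len(p) for p in paragraphs) / len(paragraphs)
    threshold = average * count / 100       # count is a percentage
    for paragraph in paragraphs:
        if len(paragraph) > threshold:
            return paragraph
    return None


def first_sentence(elements, count=1):
    """Return the first sentence of each of the first count elements."""
    sentences = []
    for text in elements[:count]:
        end = text.find(".")                # assumption: "." ends a sentence
        sentences.append(text if end < 0 else text[:end + 1])
    return " ".join(sentences)


# Made-up paragraphs of 60, 80, and 160 characters (average 100).
paragraphs = ["a" * 60, "b" * 80, "c" * 160]
print(len(first_paragraph(paragraphs, count=75)))   # 80

texts = [
    "See Dick run. See Jane run. See Dick and Jane.",
    "See Spot run. See Puff chase Spot.",
    "See Dick chase Spot and Puff.",
]
print(first_sentence(texts, count=2))   # See Dick run. See Spot run.
```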
The Option List search rule looks for keywords within the source document, applies a score for each keyword found, and returns the value that has the highest keyword score.
For example, if the keywords margin, SEC filing, or invoice were found in a document, the suggested value for the Department field would be Accounting, while the keywords tolerance, assembly, or inventory would return Manufacturing as the suggested value.
The Option List search rule usually applies to metadata fields that have a list defined in the Configuration Manager.
Option list names and values (called categories in Content Categorizer) appear in Content Categorizer as specified in the Configuration Manager. If a custom list field is created or changed while the CC Admin Applet is open, close and reopen the applet to see the changes.
The current version of Content Server automatically inserts a blank value as the default value in a custom list field. In this case, the first value (by default, a blank value) is not considered a user-entered value, and the Option List search rule is applied. To prevent the Option List search rule from overriding the first value in a custom list field, provide a default value for that list on the Configuration Manager Applet.
There is one type of Option List search rule, which searches for keywords (single words or phrases) that match the keywords defined in the key.
Keywords can be single words (for example, dog) or multiple-word phrases (for example, black dog).
Keywords can use the following defined set of operators to further refine a search:
$$AND$$
$$OR$$
$$AND_NOT$$
$$NEAR$$
Keywords are pre-assigned to each category (value) in the list, and each keyword has a weight assigned to it.
The number of occurrences of each keyword found in the document is multiplied by its weight, resulting in a keyword score.
The keyword scores for each category are added, resulting in a category score.
The category with the highest score is returned as the suggested value.
If there is a tie between categories, the category earliest in the list is returned as the suggested value.
Use the weights Always and Never to override the scores and count threshold.
An occurrence of a keyword with the Always weight forces the category to be returned as the suggested value, regardless of score.
An occurrence of a keyword with the Never weight disqualifies the category from being returned as the suggested value, regardless of score.
If two categories have keywords assigned the Always weight, and both keywords occur in the document, the keyword first found in the document takes precedence.
Important:
Option List searches are case sensitive and must match exactly. For example, Invoice, Invoices, invoice, and invoices must be defined to retrieve all instances of this keyword.
The key for an Option List search rule is the Option List name, as shown on the Option Lists tab of the Admin Applet.
The count for an Option List search rule sets a minimum threshold score for the rule to return results. For example, if the count is set to 50, and the highest accumulated keyword score is 45, the rule fails.
The following examples illustrate the use of the Option List search rule.
In this example, the score for Dick and Spot is 30 (3 occurrences x 10), and the score for Jane and Puff is 20 (2 occurrences x 10). Dick is returned as the suggested value because it is earlier in the list than Spot:
Content: <Title>Barefoot in the Park</Title>
<Text>See Dick run. See Jane run. See Dick and Jane.</Text>
<Text>See Spot run. See Puff chase Spot.</Text>
<Text>See Dick chase Spot and Puff.</Text>
Rule: OPTION_LIST
Key: MainCharacterList
Count: 10
Option List Categories, Keywords, and Weight: Dick: Dick=10, boy=5, Richard=2
Jane: Jane=10, girl=5, Janie=2
Spot: Spot=10, dog=5
Puff: Puff=10, cat=5
Returns: Dick
In this example, Spot is returned as the suggested value because its score of 60 (3 occurrences x 20) is higher than the other categories:
Content: <Title>Barefoot in the Park</Title>
<Text>See Dick run. See Jane run. See Dick and Jane.</Text>
<Text>See Spot run. See Puff chase Spot.</Text>
<Text>See Dick chase Spot and Puff.</Text>
Rule: OPTION_LIST
Key: MainCharacterList
Count: 10
Option List Categories, Keywords, and Weight: Dick: Dick=10, boy=5, Richard=2
Jane: Jane=10, girl=5, Janie=2
Spot: Spot=20, dog=10
Puff: Puff=10, cat=5
Returns: Spot
In this example, the rule fails because none of the scores is above the Count threshold of 50:
Content: <Title>Barefoot in the Park</Title>
<Text>See Dick run. See Jane run. See Dick and Jane.</Text>
<Text>See Spot run. See Puff chase Spot.</Text>
<Text>See Dick chase Spot and Puff.</Text>
Rule: OPTION_LIST
Key: MainCharacterList
Count: 50
Option List Categories, Keywords, and Weight: Dick: Dick=10, boy=5, Richard=2
Jane: Jane=10, girl=5, Janie=2
Spot: Spot=10, dog=5
Puff: Puff=10, cat=5
Returns: Fail
In this example, Puff is returned as the suggested value because the keyword "Puff" has a weight of Always:
Content: <Title>Barefoot in the Park</Title>
<Text>See Dick run. See Jane run. See Dick and Jane.</Text>
<Text>See Spot run. See Puff chase Spot.</Text>
<Text>See Dick chase Spot and Puff.</Text>
Rule: OPTION_LIST
Key: MainCharacterList
Count: 10
Option List Categories, Keywords, and Weight: Dick: Dick=10, boy=5, Richard=2
Jane: Jane=10, girl=5, Janie=2
Spot: Spot=10, dog=5
Puff: Puff=Always, cat=5
Returns: Puff
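The scoring in the four examples above can be checked with a short Python sketch. It is an illustration under assumptions, not the product's algorithm: the option_list_rule function is hypothetical, keywords are matched as whole words, a score equal to the count is treated as meeting the threshold, and a Never keyword is assumed to take precedence over an Always keyword in the same category.

```python
import re


def option_list_rule(text, categories, count=1):
    """Return the suggested category, or None if the rule fails."""
    best = None      # (score, category) with the highest score so far
    always = None    # (position, category) of the earliest Always keyword
    for name, keywords in categories.items():
        score, never, first_always = 0, False, None
        for keyword, weight in keywords.items():
            # Case-sensitive, exact whole-word matches.
            pattern = r"\b" + re.escape(keyword) + r"\b"
            hits = [m.start() for m in re.finditer(pattern, text)]
            if not hits:
                continue
            if weight == "Never":
                never = True
            elif weight == "Always":
                if first_always is None or hits[0] < first_always:
                    first_always = hits[0]
            else:
                score += len(hits) * weight     # occurrences x weight
        if never:
            continue                            # category is disqualified
        if first_always is not None and (always is None or first_always < always[0]):
            always = (first_always, name)
        # Strict ">" keeps the category earliest in the list on a tie.
        if best is None or score > best[0]:
            best = (score, name)
    if always:
        return always[1]                        # Always bypasses the threshold
    if best and best[0] >= count:
        return best[1]
    return None


TEXT = ("Barefoot in the Park "
        "See Dick run. See Jane run. See Dick and Jane. "
        "See Spot run. See Puff chase Spot. "
        "See Dick chase Spot and Puff.")
BASE = {
    "Dick": {"Dick": 10, "boy": 5, "Richard": 2},
    "Jane": {"Jane": 10, "girl": 5, "Janie": 2},
    "Spot": {"Spot": 10, "dog": 5},
    "Puff": {"Puff": 10, "cat": 5},
}
print(option_list_rule(TEXT, BASE, count=10))   # Dick
print(option_list_rule(TEXT, BASE, count=50))   # None
```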
The Categorization Engine search rule uses a third-party categorizer engine and defined taxonomy to determine and return a value that represents a category within the specified taxonomy, for example, News/Technology/Computers.
There is one type of Categorization Engine search rule, which uses the categorizer engine and taxonomy specified in the Key to return a value for the field.
The key for a Categorization Engine search rule is the name of the categorizer engine followed by the name of the taxonomy. For example, EngineName/TaxonomyName. If no engine name is defined in the Key field, Content Categorizer defaults to the first engine displayed in the Categorizer Engines list. If only one engine is defined, just enter the taxonomy name in the Key field.
The count for a Categorization Engine search rule sets a minimum confidence level threshold for the returned results.
When a categorization engine returns a category (or set of categories) for a given query, a confidence level is also returned, which is often expressed as a percentage for each category. The Category rule always accepts the highest-confidence category, unless the confidence level is below the count value specified for the rule, in which case the rule fails. For example, if the count is set to 50, and the highest-confidence category has a confidence level of 45, the rule fails.
The default count of 1 would always accept the highest-confidence category returned by the categorizer engine. The actual range for the Count value depends on the categorizer engine that is being used.
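The confidence threshold can be illustrated with a minimal Python sketch. The category_rule function and the (category, confidence) pairs are hypothetical stand-ins for whatever a real categorizer engine returns, and the second category name is made up for the example.

```python
def category_rule(results, count=1):
    """Pick the highest-confidence category, or None if it is below count.

    results -- list of (category, confidence) pairs from a hypothetical engine
    """
    if not results:
        return None
    category, confidence = max(results, key=lambda pair: pair[1])
    # The rule fails only when the best confidence is below the count.
    return category if confidence >= count else None


results = [("News/Technology/Computers", 45), ("News/Sports", 20)]
print(category_rule(results, count=50))   # None
print(category_rule(results))             # News/Technology/Computers
```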
The Filetype search rule looks at the file name extension of a document and returns a term, usually a file type description associated with the file name extension.
There is one type of Filetype search rule, which uses the file name extension of the primary (native) file to return a value for the field.
When the Filetype search rule is defined for a metadata field, the file name extension of the content item is matched against all values in the DocFormatsWizard table. This table is found in the file doc_config.htm, which is located in the IntradocDir/shared/config/resources/ directory.
If a match is found, the associated value in the Description column is extracted and translated. The resulting string is returned as the suggested metadata value for the field. If the primary file path has no extension, or if the extension does not match any of the "extensions" values in the DocFormatsWizard table, the rule fails and the next rule in the list for the metadata field is executed.
The key for a FILETYPE search rule is not used when determining a metadata value. Leave the Key field blank.
The count for a FILETYPE search rule is not used when determining a metadata value. Leave the Count field blank.
If a FILETYPE rule is created with non-blank Key or Count fields, a warning message is displayed indicating that these fields are not supported by the rule.
The following examples illustrate the use of the Filetype Search rule.
Primary File: policies.doc
Rule: FILETYPE
Key: blank
Count: blank
Returns: Microsoft Word Document
Primary File: procedures.wpd
Rule: FILETYPE
Key: blank
Count: blank
Returns: Corel WordPerfect Document
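The lookup performed by the Filetype rule can be sketched in Python as shown below. The DOC_FORMATS dictionary is a hypothetical two-entry stand-in for the DocFormatsWizard table, built only from the two examples above, and the case-insensitive comparison of extensions is an assumption.

```python
import os

# Hypothetical excerpt standing in for the DocFormatsWizard table.
DOC_FORMATS = {
    "doc": "Microsoft Word Document",
    "wpd": "Corel WordPerfect Document",
}


def filetype_rule(primary_file, formats=DOC_FORMATS):
    """Return the description for the file's extension, or None on failure."""
    extension = os.path.splitext(primary_file)[1].lstrip(".").lower()
    if not extension:
        return None                  # no extension: the rule fails
    return formats.get(extension)    # None when the extension is unknown


print(filetype_rule("policies.doc"))     # Microsoft Word Document
print(filetype_rule("procedures.wpd"))   # Corel WordPerfect Document
```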
During startup, Content Categorizer takes a snapshot of the current metadata field configuration including field names and lengths. If the metadata field configuration changes, restart Content Server before running the Content Categorizer Admin Applet to add or modify any search rules.
Important:
Content Categorizer requires a non-empty rule set for any file type (.doc, .txt, .xml, and so on) it is called to examine. If no rules exist for a given file type, Content Categorizer throws an exception. The easiest way to protect against this is to add at least one rule to the Default rule set. The Default rule set is used for all file types that do not have a custom rule set assigned.
To define search rules for any metadata field:
Choose Administration then Content Categorizer Administration.
On the Content Categorizer Administration page, click the Rule Sets tab.
In the Ruleset pane, select the ruleset from the list, or click Add to add and name a new ruleset. A ruleset contains multiple rules that apply to specific documents or a particular document type. If a specific ruleset is not defined for a given document or document type, the default ruleset is used.
Select a metadata field from the Field list.
Click Add.
On the Add/Edit Rule for field_name page, select the rule type from the Rule list.
Enter the search rule key in the Key field.
If CATEGORY is used, enter the categorization engine name (if there are multiple items in the list of Categorizer Engines), followed by a slash (/), followed by the taxonomy name. For example: EngineName/TaxonomyName
For an OPTION_LIST search rule, keywords for the list must be defined on the Option List tab.
Enter the count in the Count field. For TAG and TEXT types, this is the number of tags or text phrases that must be matched before the rule returns results. For example, a count of 4 looks for the fourth occurrence of the key.
If only three occurrences of the key are found in the document, the rule fails. The default count of 1 returns the first occurrence of the key.
For FIRST_PARAGRAPH, this is the size threshold measured in percent. The first paragraph matching the key that is larger than the count percentage multiplied by the average paragraph size is returned. For example, if the count is set to 75 and the average paragraph size is 100 characters, the rule returns the first paragraph larger than 75 characters that matches the key. If the count is set to the default of 1, the rule is likely to return the first paragraph that matches the key.
For FIRST_SENTENCE, this is the number of elements that have their first sentences returned. For example, if the count is set to 3, the rule returns the first sentence from each of the first three elements that match the key.
For CATEGORY, this is the minimum confidence level threshold for the rule to return results. For example, if the count is set to 50, and the highest-confidence category has a confidence level of 45, the rule fails.
Click OK when done.
Add search rules to each metadata field as necessary.
To delete a rule, select the rule in the Rules List and click Delete.
To edit a rule, select the rule in the Rules List and click Edit.
To adjust the order of the rules, select the rule in the Rules List and click Move Up or Move Down. Rules are applied in the order listed. If the first rule succeeds, no other rules are applied. If the first rule fails, then the next rule is applied, and so forth.
Important:
If a CATEGORY rule is added, edited, or deleted, a dialog prompts you to apply the changes and build, rebuild, or check for orphaned query trees for this rule on the Query Trees tab.
Click Apply to save the changes, or click OK to save the changes and close the Content Categorizer Administration page.
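The count and ordering semantics described in this procedure can be sketched in a few lines of Python. This is an illustrative model only, not Content Categorizer's implementation: the rule functions, the plain-text input, the blank-line paragraph split, and the value returned by the occurrence rule are all assumptions made for the example.

```python
from typing import Callable, List, Optional

# Hypothetical model: a rule takes document text and returns a suggested
# value, or None when the rule fails.
Rule = Callable[[str], Optional[str]]


def nth_occurrence_rule(key: str, count: int = 1) -> Rule:
    """Succeed only if `key` occurs at least `count` times in the text."""
    def rule(text: str) -> Optional[str]:
        position = -1
        for _ in range(count):
            position = text.find(key, position + 1)
            if position == -1:
                return None  # fewer than `count` occurrences: the rule fails
        # Illustrative result: the remainder of the line after the key.
        rest = text[position + len(key):]
        return rest.splitlines()[0].strip() if rest else ""
    return rule


def first_paragraph_rule(key: str, count: int = 1) -> Rule:
    """Return the first paragraph containing `key` that is longer than
    `count` percent of the average paragraph size."""
    def rule(text: str) -> Optional[str]:
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        if not paragraphs:
            return None
        average = sum(len(p) for p in paragraphs) / len(paragraphs)
        threshold = average * count / 100
        for paragraph in paragraphs:
            if key in paragraph and len(paragraph) > threshold:
                return paragraph
        return None
    return rule


def apply_rules(rules: List[Rule], text: str) -> Optional[str]:
    """Apply rules in listed order; the first rule that succeeds wins."""
    for rule in rules:
        result = rule(text)
        if result is not None:
            return result
    return None
```

With a count of 4 and only three occurrences of the key, the first rule in a list fails and `apply_rules` falls through to the next rule, which mirrors the Move Up and Move Down ordering behavior described above.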
To define the keywords and weights for a list:
Choose Administration then Content Categorizer Administration from the Main menu.
On the Content Categorizer Administration page, click the Option Lists tab.
Select a list from the Option List. The list includes the Type ($DocType) list, plus lists of all custom metadata fields that have a list defined in the Configuration Manager.
Caution:
When a list metadata field is deleted from the Configuration Manager, the field is removed from the Rule Sets tab, but it still appears in the Option List list on the Option Lists tab. Be careful not to select an obsolete list.
Select a value from the Category list. Only the pre-defined values for the list are included.
Enter a keyword or phrase in the Keyword field. Option List searches are case sensitive and must match exactly.
Keywords can be single words or multiple-word phrases.
Keywords can include Boolean-type expressions that use the following binary operators: $$AND$$, $$OR$$, $$AND_NOT$$, $$NEAR$$
Select a weight for the keyword.
Always: If the keyword is found, the selected category is returned as the suggested value, regardless of the score.
Weight: This number multiplied by the number of occurrences of the keyword is the category's score. The category with the highest score is returned as the suggested value for the list field.
Never: If the keyword is found, the selected category is not returned as the suggested value, regardless of the score.
Click Add.
Enter keywords for each category in the selected list.
To delete a keyword, select the keyword in the Keywords list and click Delete.
To edit a keyword, select the keyword in the Keywords list, click Edit, edit the keyword, the weight or both, and click Update.
Click Apply to save the changes, or click OK to save the changes and close the page.
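The Always, Weight, and Never settings can be illustrated with a small scoring sketch. The `Keyword` class and `suggest_category` function are hypothetical names, Boolean operators such as $$AND$$ are not modeled, and the precedence applied when the same category has both an Always and a Never match is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Union


@dataclass
class Keyword:
    category: str
    phrase: str
    weight: Union[int, str]  # a number, or the string "Always" or "Never"


def suggest_category(keywords: List[Keyword], text: str) -> Optional[str]:
    """Suggest a category from keyword matches in `text`."""
    scores: Dict[str, int] = {}
    never: Set[str] = set()
    always: Optional[str] = None
    for keyword in keywords:
        # str.count is case sensitive and matches the phrase exactly.
        occurrences = text.count(keyword.phrase)
        if occurrences == 0:
            continue
        if keyword.weight == "Never":
            never.add(keyword.category)
        elif keyword.weight == "Always":
            if always is None:
                always = keyword.category
        else:
            # Score = weight multiplied by the number of occurrences.
            score = keyword.weight * occurrences
            scores[keyword.category] = scores.get(keyword.category, 0) + score
    # Assumption: a Never match overrides an Always match for the same category.
    if always is not None and always not in never:
        return always
    candidates = {c: s for c, s in scores.items() if c not in never}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

For instance, a keyword with a weight of 3 that appears three times gives its category a score of 9, which outranks a category whose only keyword has a weight of 5 and appears once.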
You can edit the Content Server configuration file so that Content Categorizer ignores the default value of the Type field and applies search rules to it instead.
This procedure applies only to the Type (dDocType) field. You cannot apply search rules to the other standard list fields (Security Group, Author, and Account).
To apply search rules to the Type field:
Open the config.cfg file located in the IntradocDir/config/ directory in a text-only editor such as WordPad.
Add the following line to the file:
ForceDocTypeChoice=true
Save and close the file.
Stop and restart Content Server.
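If the edit needs to be scripted rather than made by hand, the following sketch shows one way to add or update a `name=value` line in configuration text. The `ensure_setting` helper is hypothetical and operates on a string; reading and writing the actual config.cfg file, and restarting Content Server afterward, are left to the administrator.

```python
def ensure_setting(config_text: str, key: str, value: str) -> str:
    """Return config text in which `key` is set to `value`.

    An existing `key=...` line is replaced in place; otherwise a new
    line is appended. All other lines are left untouched.
    """
    lines = config_text.splitlines()
    for index, line in enumerate(lines):
        name, separator, _ = line.partition("=")
        if separator and name.strip() == key:
            lines[index] = f"{key}={value}"
            break
    else:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

Calling `ensure_setting(text, "ForceDocTypeChoice", "true")` twice produces the same result as calling it once, so the helper is safe to rerun.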
The following is a sample doc_config.htm page.
<@table DocFormatsWizard@>
| dFormat | Extensions | dConversion | dDescription | 
|---|---|---|---|
| application/corel-wordperfect, application/wordperfect | wpd | WordPerfect | apWordPerfectDesc | 
| application/vnd.framemaker | fm | FrameMaker | apFramemakerDesc | 
| application/vnd.framebook | bk, book | FrameMaker | apFrameMakerDesc | 
| application/vnd.mif | mif | FrameMaker | apFrameMakerInterchangeDesc | 
| application/lotus-1-2-3 | 123, wk3, wk4 | 123 | apLotus123Desc | 
| application/lotus-freelance | prz | Freelance | apLotusFreelanceDesc | 
| application/lotus-wordpro | lwp | WordPro | apLotusWordProDesc | 
| application/msword, application/ms-word | doc, dot | Word | apMicrosoftWordDesc | 
| application/vnd.ms-excel, application/ms-excel | xls | Excel | apMicrosoftExcelDesc | 
| application/vnd.ms-powerpoint, application/ms-powerpoint | ppt | PowerPoint | apMicrosoftPowerPointDesc | 
| application/vnd.ms-project, application/ms-project | mpp | MSProject | apMicrosoftProjectDesc | 
| application/ms-publisher | pub | MSPub | apMicrosoftPublisherDesc | 
| application/write | wri | Word | apMicrosoftWriteDesc | 
| application/rtf | rtf | Word | apRtfDesc | 
| application/vnd.visio | vsd | Visio | apVisioDesc | 
| application/vnd.illustrator | ai | Illustrator | apIllustratorDesc | 
| application/vnd.photoshop | psd | PhotoShop | apPhotoshopDesc | 
| application/vnd.pagemaker | p65 | PageMaker | apPageMakerDesc | 
| image/gif | drw, igx, flo, abc, igt | iGrafx | apiGrafxDesc | 
| text/postscript | ps | Distiller | apDistillerDesc | 
| application/hangul | hwp | Hangul97 | apHangul97Desc | 
| application/ichitaro | jtd, jtt | Ichitaro | apIchitaroDesc | 
| image/graphic | gif, jpeg, jpg, png, bmp, tiff, tif | ImageThumbnail | apThumbnailsDesc | 
| image/application | txt, eml, msg | NativeThumbnail | apNativeThumbnailsDesc | 
<@end@>
<@table PdfConversions@>
| dFormat | Extensions | dConversion | dDescription | 
|---|---|---|---|
| application/pdf |  | PDFOptimization | apPdfOptimization | 
| application/pdf |  | ImageThumbnail | apPdfThumbnailsDesc | 
<@end@>
Content Server uses a two-step process for categorizing content. The first step translates content into an XML format, and the second step transforms that XML file into another XML file useful to Content Categorizer. The process is transparent in that the original content is not modified, and both the translated and transformed XML files are discarded after use.
The translation step uses the OutsideIn XML Export filters to output the XML in either SearchML or Flexiondoc XML format, depending on the type of content being translated and whether the format is available for the platform being used. This translation process enables Categorizer to support a large number of different source document formats.
The transformation step uses eXtensible Style Sheet Language Transformations (XSLT) to transform the initial XML output into an XML equivalent that Content Categorizer can easily search and analyze based on search rules defined by the user.
An overview of the transformation process can be useful to anyone interested in the categorization process, and it can serve as a starting point for users who would like to define their own XSLT style sheets to accommodate their specific document processing needs.
Translation Using OutsideIn XML Export Filters
A run-time version of the OutsideIn XML Export product is integrated and installed with Content Server, and it filters content checked in for categorization. The Export filters convert content to XML for transformation using Categorizer's XSLT style sheets. The transformation is necessary because the Export XML schemas, Flexiondoc and SearchML, are not in a form easily searched by Content Categorizer rules.
For a list of file formats supported by OutsideIn XML Export, see Chapter 40, "Input File Formats."
Two style sheets are included with Content Categorizer and applied based on the initial translation format provided by the OutsideIn XML Export filter. The style sheets are located in the following directory:
/IntradocDir/data/contentcategorizer/stylesheets/
For content items output in SearchML, searchml_to_scc.xsl is applied. For content items output in Flexiondoc, flexiondoc_to_scc.xsl is applied. SearchML and Flexiondoc both reproduce style designations found in the source content, but they do so differently, in ways not detectable by Content Categorizer rules. The appropriate style sheet can recognize the necessary style information in each format and use that information as the basis for transforming the final output tags into an XML document useful to Content Categorizer.
The similarity between SearchML and Flexiondoc output depends on the degree to which internal styles or metadata are used in the content. When working with content that uses named styles, such as Microsoft Word documents, the resulting output is similar. When working with content in formats such as PDF or text, results come out with more generic tagging.
Important:
There is a problem with the XSLT transformation used to post-process PDF content that is output in Flexiondoc format. When Flexiondoc is used, single words are assigned to individual XML elements, making the final XML unsuitable for most Categorizer search rules. It is recommended to use SearchML for categorizing PDF content.
When the OutsideIn XML Export filter translates content into SearchML XML format, it identifies the properties of the content item, such as title, subject, and author, and tags them as a <doc_property> element. It distinguishes the properties by a type attribute. It also identifies document text and tags it as a <p> element. It distinguishes styles within text by an s attribute.
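To make the SearchML tagging concrete, the sketch below reads properties and styled paragraphs from a small XML sample. The sample document is hypothetical and heavily simplified: only the `<doc_property>` and `<p>` element names and the `type` and `s` attributes come from the description above, while the root element, the attribute values, and the absence of namespaces are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real SearchML output is more elaborate.
SEARCHML_SAMPLE = """<document>
  <doc_property type="title">Quarterly Report</doc_property>
  <doc_property type="author">J. Smith</doc_property>
  <p s="Heading 1">Summary</p>
  <p s="Body Text">Revenue increased.</p>
</document>"""


def read_searchml_like(xml_text):
    """Return (properties, paragraphs) from SearchML-style XML.

    properties maps each type attribute to the property text;
    paragraphs is a list of (style, text) pairs in document order.
    """
    root = ET.fromstring(xml_text)
    properties = {
        element.get("type"): (element.text or "").strip()
        for element in root.iter("doc_property")
    }
    paragraphs = [
        (element.get("s"), (element.text or "").strip())
        for element in root.iter("p")
    ]
    return properties, paragraphs
```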
When the OutsideIn XML Export filter translates content into Flexiondoc XML format, it identifies the properties of the content item, such as title, subject, and author, and tags them as a <doc_property> element, just like SearchML. However, it distinguishes the properties by a name attribute, instead of type.
Where Flexiondoc differs from SearchML is in how it identifies styles. Paragraph styles are tagged with <tx.p> tags, and character styles are tagged with <tx.r> tags, but each has an attribute based on a unique style id, in addition to a name attribute.
All styles are defined in child elements of the <style_tables> element of the Flexiondoc XML file, and given an id attribute, which is called when referencing the style, and which the template file uses to define a style key with a name attribute.
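The indirection through style ids can be sketched as a lookup table built from `<style_tables>`. The sample below is hypothetical: the `<style>` child element name and the `style_id` reference attribute are placeholders invented for the example, because the text above does not name them. Only `<style_tables>`, `<tx.p>`, and the `id` and `name` attributes are taken from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <style> and style_id are illustrative placeholders.
FLEXIONDOC_SAMPLE = """<doc>
  <style_tables>
    <style id="s1" name="Heading 1"/>
    <style id="s2" name="Body Text"/>
  </style_tables>
  <tx.p style_id="s1">Summary</tx.p>
  <tx.p style_id="s2">Revenue increased.</tx.p>
</doc>"""


def paragraphs_by_style_name(xml_text):
    """Resolve each paragraph's style id to the style name it refers to."""
    root = ET.fromstring(xml_text)
    tables = root.find("style_tables")
    # Build an id -> name lookup from the children of <style_tables>.
    names = {}
    if tables is not None:
        names = {style.get("id"): style.get("name") for style in tables}
    return [
        (names.get(paragraph.get("style_id")), (paragraph.text or "").strip())
        for paragraph in root.iter("tx.p")
    ]
```

This is the kind of resolution a style sheet has to perform before paragraph text can be placed in elements named after the style.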
Link Manager is an optional component bundled with and automatically installed with Content Server. When the component is enabled, it evaluates, filters, and parses the URL links of indexed content items before extracting them for storage in a database table (ManagedLinks). After the ManagedLinks table is populated with the extracted URL links, the Link Manager component references this table to generate link search results, lists of link references for the Content Information page, and the resource information for the Link Info page.
The Link Manager component enables users to:
View lists of links using specific search criteria
View detailed information about a specific link
Recompute and refresh links to reevaluate and validate them
View the links to other content in a specific content item
View the links back to a specific content item
The search results, link references lists, and Link Info pages are useful to determine what content items are affected by content additions, changes, or revision deletions. For example, before deleting a content item, you can verify that any URL references contained in it are insignificant. Another use might be to monitor how content items are being used.
The Link Manager component extracts the URL links during the indexing cycle, so only the URL links of released content items are extracted. For content items with multiple revisions, only the most current released revision has entries in the database table. If the Link Manager component is installed after content items are checked in, perform a rebuild to ensure that all links are included in the ManagedLinks table.
Link Manager does all of its work during the indexing cycle, so it increases the amount of time required to index content items and to rebuild collections.
The amount of time required depends on the type and size of the content items involved. Files that must be converted before links can be extracted take longer to process than text-based (HTML) files.
For information about disabling Link Manager during the rebuild cycle, see the LkDisableOnRebuild and LkReExtractOnRebuild variables in Oracle Fusion Middleware Configuration Reference for Oracle WebCenter Content.
Caution:
The Link Manager component uses HtmlExport 8 for file conversion. A link extractor template file is included with the Link Manager component. HtmlExport 8 requires this template. Do not edit this file.
The Link Manager consists of an extraction engine and a pattern engine. The extraction engine includes a conversion engine (HtmlExport). The conversion engine is used to convert files that the extraction engine cannot natively parse to a text-based file format (HTML).
Link Manager does not use HtmlExport to convert files that contain any of the following strings in the file format: hcs, htm, image, text, xml, jsp, and asp. These text-based files are handled by Link Manager without need for conversion.
During the indexing cycle, the Link Manager component searches the checked-in content items to find URL Links as follows:
The extraction engine converts the file using the conversion engine (if necessary).
The extraction engine then uses the pattern engine to access the link evaluation rules defined in the Link Manager Patterns table.
The evaluation rules tell the extraction engine how to sort, filter, evaluate, and parse the accepted URL links in the content items.
The accepted URL links are inserted or updated in the ManagedLinks table.

Important:
To execute successfully, HtmlExport requires either a virtual or physical video interface adaptor (VIA). Most Windows environments have graphics capabilities that provide HtmlExport access to a frame buffer. UNIX systems, however, may not have graphics cards and do not have a running X-Windows Server for use by HtmlExport. For systems without graphics cards, you can install and use a virtual frame buffer (VFB).
Various file formats (such as Word) must be converted by the conversion engine (HtmlExport) before links can be extracted. Because Link Manager can extract links in text-based files (HTML) without requiring conversion, Link Manager does not use HtmlExport to convert files that contain any of the following strings in the file format: hcs, htm, image, text, xml, jsp, and asp.
Link Manager also handles all the variations of these file formats. For example, the hcs string matches the dynamic server page strings of hcst, hcsp, and hcsf. The image string matches all comparable variants such as image/gif, image/jpeg, image/rgb, image/tiff, and so on. To prevent other types of files from being converted, use the LkDisallowConversionFormats configuration variable. For more information, see Oracle Fusion Middleware Configuration Reference for Oracle WebCenter Content.
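The conversion decision described above amounts to a substring test on the file format. The following sketch is an approximation, not Link Manager's code: the function name is hypothetical, the case-insensitive comparison is an assumption, and treating extra disallowed formats as substrings is only a guess at how LkDisallowConversionFormats values are matched.

```python
# Strings that mark a format as text-based, taken from the list above.
TEXT_BASED_MARKERS = ("hcs", "htm", "image", "text", "xml", "jsp", "asp")


def needs_conversion(file_format, disallowed=()):
    """Return True if a file of this format would be sent for conversion.

    `disallowed` stands in for additional formats excluded from
    conversion (assumption: matched as substrings, like the defaults).
    """
    fmt = file_format.lower()
    markers = (*TEXT_BASED_MARKERS, *disallowed)
    return not any(marker in fmt for marker in markers)
```

Because the test is a substring match, `hcs` covers hcst, hcsp, and hcsf, and `image` covers image/gif, image/jpeg, and the other image variants.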
Link Manager recognizes links in the following file formats:
Text-based formats (txt, html, xml, jsp, asp, csv, hcst, hcsf, and hcsp)
E-mail (msg and eml)
Microsoft Word
Microsoft Excel
OpenOffice Writer
OpenOffice Calc
All new and existing links are managed during the indexing cycle. When content items are checked in, the accepted links in the content items are added to or updated in the Managed Links table. Existing links are evaluated for changes resulting from content items being checked in or deleted. As links are added or monitored, they are marked as valid or invalid.
When one content item in the system references another content item in the system, the resulting link is marked as valid. When an existing link references a deleted content item, the link is reevaluated and the status changes from valid to invalid. Statuses are recorded as Y (valid) or N (invalid) in the dLkState column of the Managed Links table and displayed for the user in the State column of the Link Info page as Valid or Invalid.
You can specify the following Link Manager configuration variables in the IntradocDir/config/config.cfg file:
AllowForceDelete
HasSiteStudio
LkRefreshBatchSize
LkRefreshErrorsAllowed
LkRefreshErrorPercent
LkRefreshErrorThreshold
LkDisableOnRebuild
LkDisallowConversionFormats
LkReExtractOnRebuild
LkIsSecureSearch
For information about using these configuration variables, see Oracle Fusion Middleware Configuration Reference for Oracle WebCenter Content.
The Link Manager component uses an extraction engine that references the link patterns defined in a resource table. These link patterns are rules that tell the extraction engine how to sort the different links, which links to filter out, which links to accept, and how to parse the links for more information.
To customize the DomainHome/ucm/LinkManager/resources/linkmanager_resource.htm resource table, you can add new rules or edit the existing default rules. Customize the table using standard component architecture. The table includes the following columns.
| Column Name | Description | 
|---|---|
| lkpName | The name of the pattern and the primary key of the table. Used mainly in error handling and to allow other components to directly target the override of a specified rule. | 
| lkpDescription | An explanation of the purpose of the pattern. | 
| lkpType | The initial screening of the URL: The extraction engine is a two-step engine. The 'prefix' and 'contains' types are used on the path part of the URL, while the 'service' type is used on the query string part of the URL. | 
| lkpParameters | A comma-delimited list of patterns or parameters used by the type. The parameters are Idoc Script capable and are initially evaluated for Idoc Script. The engine uses the following rules for extracting the patterns from the parameters: One rule looks for a URL that begins with the resolved value for  A later rule can look for a URL that literally begins with  | 
| lkpAccept | Determines if the URL is accepted if the pattern is matched: | 
| lkpContinue | Determines if the pattern processing engine continues to parse the URL. If true, the processing continues. If false, processing stops. | 
| lkpLinkType | Specifies the URL type determined for this link. | 
| lkpAction | A function defined in the LinkHandler class referring to a method in the LinkImplementor class used to further parse and evaluate the URL. LinkImplementor can be class aliased or extended. | 
| lkpOrder | The order in which the patterns are to be evaluated. | 
| lkpEnabled | Determines if this rule is evaluated. It is calculated and evaluated during start up when the patterns are loaded. | 
You can add new rules or edit the existing default rules using standard component architecture.
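The interaction of the columns in this table can be sketched as a small rule loop. This is a simplified model under stated assumptions, not the LinkHandler implementation: the `Pattern` class and its field names are hypothetical stand-ins for the lkp columns, lkpAccept is modeled as a boolean, lkpAction is omitted, and a later matching rule is assumed to override an earlier decision when processing continues.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import parse_qs, urlsplit


@dataclass
class Pattern:
    name: str               # lkpName
    type: str               # lkpType: 'prefix', 'contains', or 'service'
    parameters: List[str]   # lkpParameters
    accept: bool            # lkpAccept (modeled as a boolean)
    keep_going: bool        # lkpContinue
    link_type: str          # lkpLinkType
    order: int              # lkpOrder
    enabled: bool = True    # lkpEnabled


def matches(pattern: Pattern, url: str) -> bool:
    parts = urlsplit(url)
    if pattern.type == "prefix":      # tested against the path
        return any(parts.path.startswith(p) for p in pattern.parameters)
    if pattern.type == "contains":    # tested against the path
        return any(p in parts.path for p in pattern.parameters)
    if pattern.type == "service":     # tested against the query string
        services = parse_qs(parts.query).get("IdcService", [])
        return any(s in pattern.parameters for s in services)
    return False


def evaluate(patterns: List[Pattern], url: str) -> Optional[str]:
    """Return the link type if the URL is accepted, otherwise None."""
    result = None
    active = sorted((p for p in patterns if p.enabled), key=lambda p: p.order)
    for pattern in active:
        if not matches(pattern, url):
            continue
        result = pattern.link_type if pattern.accept else None
        if not pattern.keep_going:
            break  # lkpContinue is false: stop processing this URL
    return result
```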
Two database tables are maintained with Link Manager:
Managed Links Table: A link is stored in the Managed Links table if the pattern engine successfully processes it and determines that the link is acceptable. Each link in the table is assigned a unique class id (dLkClassId) and each row in the table has a unique GUID (dLkGUID). A single link can consist of multiple rows in the table if multiple resources define the link and each resource can independently break the link.
For example, in Site Studio, you can define a single link by both a node and a content item. If the node is missing, the link breaks. If the content item is missing, the link breaks. In this case, there are two resources that do not depend on each other and each can break the link. Consequently, each resource is managed separately in the ManagedLinks table.
To improve query execution performance, standard indexes are added to the dDocName and dLkResource columns in the Managed Links table. System administrators are responsible for adjusting these indexes to accommodate specific database tuning requirements in various system environments.
Link Reference Count Table: This table maps the content items to the number of times each is referenced in the ManagedLinks table. A content item in this table might not be a content item that is currently managed by Content Server. If there is an entry for a content item in this table, it only indicates that a link in the ManagedLinks table, as parsed by the pattern engine, has referenced the content item as a 'doc' resource.
When a content item is checked in and a link references it, the link is marked as valid. When a link references a deleted content item, the link is marked as invalid. Notice that the dLkState column indicates the link's status as Y (valid) or N (invalid).
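The relationship between the two tables can be sketched with plain Python data. The row layout below is hypothetical: `dDocName` and the Y/N state values come from the text, while the `resource_type` key and the function names are assumptions made for the example.

```python
from collections import Counter


def link_state(link, checked_in_docs):
    """Return 'Y' (valid) if the referenced content item exists, else 'N'."""
    return "Y" if link["dDocName"] in checked_in_docs else "N"


def reference_counts(managed_links):
    """Count how many managed links reference each content item.

    Only rows that reference a 'doc' resource are counted, so an entry
    can exist for an item that is not currently checked in.
    """
    return Counter(
        link["dDocName"]
        for link in managed_links
        if link["resource_type"] == "doc"
    )
```

Recomputing the counts from the link rows in this way corresponds to flushing the Link Reference Count table and rebuilding it from the ManagedLinks table.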
The Link Manager component provides filters for parts of the pattern engine that allow customization of some very specific behavior. In general, the rules of the pattern engine are usually the ones to be modified. In certain circumstances Link Manager explicitly creates and uses filters to augment its standard behavior.
extractLinks Filter: Used during the extraction process when the extraction engine parses the accepted URL links. As links are extracted, Link Manager looks for specific HTML tags. However, other HTML tags might also contain relevant links. If so, use this filter to extract the additional links.
The tag is passed to the filter as a cached object with the key HtmlTag. The value (or link) is passed back to the parser with the key HtmlValue. If the filter extracts extra information, be aware that the passed-in binder is flushed before being passed to the pattern engine. Use the service.setCachedObject and service.getCachedObject methods to pass and retrieve the extra information, respectively.
By default, Link Manager looks for the following HTML tags: <a>, <link>, <iframe>, <img>, <script>, and <frame>.
linkParseService Filter: Used during the extraction process when the pattern engine evaluates links that use the IdcService parameter. After evaluation, the link binder and service are provided for the linkParseService filter.
The service contains the binder for the parsed URL and information map. Customize the values in the parsed URL binder by adjusting certain parameters or customize the information map (which tells the parseService method what parameters to extract from the URL binder and how to map the data to resource types).
sortAndDecodeLinks Filter: Available only from the 'refresh' option, and called only when users refresh the links. The service contains the 'LinkSetMap', which includes a sorted list of links contained in the ManagedLinks table. The refresh validates the Site Studio links and the existence of all links referring to 'doc' resources. You can create a component that augments the standard validation.
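The default tag list for the extractLinks filter, and the idea of extending it, can be illustrated with Python's standard HTML parser. This sketch is independent of Link Manager and its filter API: the class and function names are hypothetical, and the attribute read from each tag (`href` or `src`) is an assumption, since the text above lists only the tag names.

```python
from html.parser import HTMLParser

# Default tags from the list above; the attribute per tag is an assumption.
DEFAULT_TAGS = {
    "a": "href", "link": "href", "iframe": "src",
    "img": "src", "script": "src", "frame": "src",
}


class LinkExtractor(HTMLParser):
    """Collect link values from a configurable set of tags."""

    def __init__(self, tag_attributes=None):
        super().__init__()
        chosen = DEFAULT_TAGS if tag_attributes is None else tag_attributes
        self.tag_attributes = dict(chosen)
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.tag_attributes.get(tag)
        if wanted is None:
            return  # not a tag we extract links from
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_links(html, tag_attributes=None):
    parser = LinkExtractor(tag_attributes)
    parser.feed(html)
    parser.close()
    return parser.links
```

Passing a mapping such as `{**DEFAULT_TAGS, "form": "action"}` plays the role that a custom extractLinks filter plays in the component: it picks up links from tags outside the default set.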
Important:
When using Site Studio, set the HasSiteStudio configuration variable value to true. This variable enables the Site Studio-specific patterns for parsing 'friendly' URLs for the pattern engine. For more information about the HasSiteStudio variable, see Oracle Fusion Middleware Configuration Reference for Oracle WebCenter Content.
When configured to work with Site Studio, Link Manager obtains links from Site Studio by directly requesting a parsing of the links that Site Studio has identified. In return, Site Studio provides information about the links pertaining to its operation and components. In particular, Site Studio provides information about the node/section, if a content item is used, the state of the content item, the type of link (friendly, page, or node), and if the link is valid.
Site Studio does not load its project information when the standalone applications are launched. Therefore, the Site Studio links are not properly evaluated if a rebuild or index update cycle is started and completed by a standalone application.
When a user changes links using the Site Studio designer, Link Manager checks filter events. If a node is deleted, Link Manager marks all links using the deleted node as invalid, thus managing links that directly reference the node ID. Additionally, with information provided by Site Studio, Link Manager can accurately determine the state of the link.
Friendly URLs (links that do not reference the node ID or dDocName) are more difficult to manage and validate. When a node property changes, Link Manager marks all friendly links (both relative and absolute) that use the node as invalid and broken. Link Manager cannot retrace the parent chain to determine what part of the link was changed, how to fix it, or determine if it is actually broken.
Site Studio uses two types of managed links:
Completely Managed Links: These are any links using the SS_GET_PAGE IdcService or links to nodes that include the following:
javascript:nodelink(Node,Site)
javascript:nodelink(Node)
ssNODELINK/Site/Node
ssNODELINK/Node
Also links to pages that include the following:
ssLINK/Doc
ssLINK/Node/Doc
ssLINK/Site/Node/Doc
ssLink(Doc)
ssLink(Doc,Node)
ssLink(Doc,Node,Site)
javascript:link(Doc)
javascript:link(Doc,Node,Site)
Provisionally Managed Links: The following Site Studio links are managed up to Site Studio node changes. Use the 'refresh' option from the Managed Links Administration page to determine the state of the links. If the majority of links are of this form and nodes have changed dramatically, you should refresh or recompute the links.
Absolute (or full URLs): http://site/node/doc.htm
Friendly links to nodes
<!--$ssServerRelativeSiteRoot-->dir/dir/index.htm
[!--$ssServerRelativeSiteRoot--]dir/dir/index.htm
<%=ssServerRelativeSiteRoot%>dir/dir/index.htm
Friendly links to pages
<!--$ssServerRelativeSiteRoot-->dir/dir/doc.htm
[!--$ssServerRelativeSiteRoot--]dir/dir/doc.htm
<%=ssServerRelativeSiteRoot%>dir/dir/doc.htm
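The two groups of link forms listed above can be told apart with simple pattern checks. The classifier below is a rough sketch for illustration only: the regular expressions are assumptions derived from the listed examples, and they are not the patterns that Link Manager or Site Studio actually use.

```python
import re

# Forms listed under Completely Managed Links.
COMPLETELY_MANAGED = re.compile(
    r"IdcService=SS_GET_PAGE"
    r"|javascript:(?:node)?link\("
    r"|ssNODELINK/|ssLINK/|ssLink\("
)

# Forms listed under Provisionally Managed Links: absolute URLs and
# friendly links built on ssServerRelativeSiteRoot.
PROVISIONALLY_MANAGED = re.compile(r"ssServerRelativeSiteRoot|^https?://")


def classify_site_studio_link(link):
    """Return 'completely managed', 'provisionally managed', or 'unknown'."""
    # Check the completely managed forms first, because an absolute URL
    # can still carry the SS_GET_PAGE service.
    if COMPLETELY_MANAGED.search(link):
        return "completely managed"
    if PROVISIONALLY_MANAGED.search(link):
        return "provisionally managed"
    return "unknown"
```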
In addition to the refresh activities available on the Managed Links Administration page, you can use alternative methods to update the Managed Links and Link Reference Count tables:
Using the Repository Manager, perform a collection rebuild. This process rebuilds the entire search index, and the old index collection is replaced with a new index collection when the rebuild successfully completes.
If Repository Manager is opened as a standalone application, the alternate refresh method can only be used when the HasSiteStudio configuration variable is disabled. When information is requested from Site Studio and the Repository Manager is in standalone mode, Site Studio is not initialized completely and does not return accurate information. This issue does not occur if the Repository Manager applet is used.
If custom fields have been added while content is in the system, use the Configuration Manager Rebuild Search Index to rebuild the search index.
To reevaluate the links in the ManagedLinks table:
Choose Administration then Managed Links Administration from the Main menu.
On the Managed Links Administration page, use an option to manage links:
To recompute links: Click Go next to the Recompute links option. This refresh activity resubmits each link in the ManagedLinks table to the patterns engine. The link is evaluated according to the pattern rules and updated in the table. A link can be reclassified as another type of link depending on which patterns have been enabled or disabled. Use this option if the pattern rules have changed.
To refresh links: Click Go next to the Refresh links option. This activity checks each link in the ManagedLinks table and attempts to determine if the link is valid. For Site Studio links, the links are sent to the Site Studio decode method to determine what nodes and content items are used by the link. It also determines if the link is valid and is indeed a Site Studio link.
Use this option after many changes to Site Studio node/section properties. Link Manager cannot completely track the changes to 'friendly' Site Studio links. By refreshing or forcing a validation on the links, Link Manager can more accurately determine which links are broken and which are valid.
To refresh the reference counts: Click Go next to the Refresh option. This activity flushes the LinkReferenceCount table and queries the ManagedLinks table for the content item references. Both the 'recompute' and 'refresh' table activities try to maintain the LinkReferenceCount table. However, on occasion, this table can become out-of-sync and this option, when used on a quiet system, rebuilds this table.
To cancel a refresh activity: Click Go next to the Abort activity option. Only one refresh activity can be active at any one time.
The Status area indicates how many links have been processed and how many errors have been encountered.
Only one refresh activity can be active at any one time. Wait until the refresh activity completes and the 'Ready' status is displayed before attempting another refresh activity.