Custom Model Datasets

Review the supported datasets for custom models, and how to convert datasets into a supported format.

Label Studio Integration

Oracle’s Data Labeling Service is being deprecated. As an option, we recommend migrating your labeled datasets to Label Studio, an open source and marketplace-supported labeling tool.

Follow these steps to convert Data Labeling snapshot exports to Label Studio import and Label Studio raw JSON export formats. Use these formats for further annotation in Label Studio or direct model training.

Allowed Datasets for Custom Text Classification

You can provide labeled data for custom text classification models in two ways:

  • Data Labeling projects
  • Comma-Separated Value (.csv) files

CSV File Requirements

  • The first line must be a header containing the following two column names:
    • text: captures the text to be classified.
    • labels: captures one or more assigned classes. For multi-label classification datasets, specify several class names by joining them with the | symbol.
  • All lines after the header line contain training records.
  • If the file has more than two columns, only the text and labels columns are used to train the model.
  • For the CSV file encoding, use UTF-8. When using Excel, save the file as CSV UTF-8 (Comma-delimited) (.csv).
  • For the delimiter, use a comma (,).
  • For the escape character, use a double quote ("), Unicode character U+0022. A double quote inside a field is escaped by doubling it, and the whole field is enclosed in double quotes.

    For example, in Excel, if you type the following text:

    This is a "double quote" sentence

    The preceding sentence is stored in the CSV as follows:

    "This is a ""double quote"" sentence"
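
The encoding, delimiter, and escaping rules above can be sketched with Python's standard csv module. This is a minimal illustration rather than an official tool: the sample records, the to_csv and save_csv helper names, and the output path are all hypothetical.

```python
import csv
import io

# Hypothetical training records: (text, list of class names).
records = [
    ('This is a "double quote" sentence', ["Label1"]),
    ("Pull print queue not working",
     ["Application Component Disconnect", "Network Printer Failure"]),
]

def to_csv(rows):
    """Serialize rows as comma-delimited CSV with a text,labels header."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        doublequote=True,      # escape " inside a field by doubling it
        lineterminator="\n",
    )
    writer.writerow(["text", "labels"])
    for text, labels in rows:
        # Multi-label rows join class names with the | symbol.
        writer.writerow([text, "|".join(labels)])
    return buffer.getvalue()

def save_csv(path, rows):
    """Write the CSV to disk using UTF-8 encoding."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(to_csv(rows))

csv_text = to_csv(records)
print(csv_text)
```

Only fields that contain a double quote, a comma, or a line break are wrapped in quotes, so ordinary rows stay unquoted.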

Example CSV file for single-label text classification:

text,labels
Windows OS -unable to print,Network Printer Failure
Citrix Account frequently locking,Account (Password reset)
Pull print queue not working ,Application Component Disconnect
wifi disable and lan is disconnected at the desktop,Hardware Device Failure

Example CSV file for multi-label text classification:

text,labels
Windows OS -unable to print,Network Printer Failure
Pull print queue not working ,Application Component Disconnect|Network Printer Failure
wifi disable and lan is disconnected at the desktop,Hardware Device Failure|Network Connection Issue
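
As a rough pre-upload check, a file in this layout can be read back and each labels value split on the | symbol. The sketch below uses Python's standard csv module; the load_records helper and the inline sample are hypothetical, and the check covers only the rules listed in this section.

```python
import csv
import io

# Hypothetical sample in the layout described above.
SAMPLE = """text,labels
Windows OS -unable to print,Network Printer Failure
Pull print queue not working ,Application Component Disconnect|Network Printer Failure
"""

def load_records(stream):
    """Return (text, [labels]) pairs; columns other than text and labels are ignored."""
    reader = csv.DictReader(stream)
    missing = {"text", "labels"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"header is missing columns: {sorted(missing)}")
    records = []
    for row in reader:
        raw = row["labels"] or ""
        # Several class names on one row are separated by the | symbol.
        labels = [name.strip() for name in raw.split("|") if name.strip()]
        records.append((row["text"], labels))
    return records

records = load_records(io.StringIO(SAMPLE))
for text, labels in records:
    print(repr(text), labels)
```

To read a real file, open it with encoding="utf-8" and newline="" and pass the file object to load_records.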

Allowed Dataset Formats for Custom NER

You can provide labeled data for custom NER models in two ways:

  • Label Studio projects
  • JSON Lines format (.jsonl)

JSON File Requirements

The JSON file doesn't include the training data. Instead, the JSON file is a manifest file that contains labels and pointers (relative paths) to files with unlabeled data.

The manifest uses the JSON Lines (JSONL) format, where each line is a single JSON object:

  • The first line of the file describes the set of labels or classes and the type of annotation file.
  • Each later line describes one training record.
  • Save all the text files in the same directory as the manifest file (.jsonl), and reference each file by name in its training record.

Schema Definition

  1. The first line is a header line. It contains a JSON object that describes the file type.
  2. Each later line contains a JSON object that represents a labeled record.

Header Line Format

Field | Type | Description
labelsSet | Array of objects | Each object has a string member, "name", that names one entity supported for annotation. List all entities here.
annotationFormat | String | Use "ENTITY_EXTRACTION" for NER datasets.
datasetFormatDetails | Object | Object with a string member, "formatType", that indicates the type of data being annotated. Set the value of formatType to "TEXT" for Language.

Example header line (shown on several lines for readability; in the .jsonl file, the object occupies a single line):
{
    "labelsSet": [
      {
        "name": "Label1"
      },
      {
        "name": "Label2"
      },
      {
        "name": "Label3"
      },
      {
        "name": "Label4"
      }
    ],
    "annotationFormat": "ENTITY_EXTRACTION",
    "datasetFormatDetails": {
      "formatType": "TEXT"
    }
}
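
Because every JSONL entry must fit on one line, the header is easier to generate than to type by hand. The following Python sketch builds the header object from a list of entity names and serializes it as a single line; the build_header function name is hypothetical, and the field values are the ones described in this section.

```python
import json

def build_header(entity_names):
    """Build the manifest header object for an NER dataset."""
    return {
        "labelsSet": [{"name": name} for name in entity_names],
        "annotationFormat": "ENTITY_EXTRACTION",
        "datasetFormatDetails": {"formatType": "TEXT"},
    }

# json.dumps emits no line breaks unless an indent is requested,
# so the result can be written directly as the first line of the manifest.
header_line = json.dumps(build_header(["Label1", "Label2", "Label3", "Label4"]))
print(header_line)
```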

Labeled Record Format

Field | Type | Description
sourceDetails | Object | Object with a string member, "path", that points to the file being annotated. The file path is relative to the location of the manifest (.jsonl) file.
annotations | Object | Complex object that describes the annotations.
entities | Array (Objects) | A list of the entities identified in the record.
entityType | String | The type of entity annotation. Use "TEXTSELECTION" for NER.
labels | Array (Objects) | Each object in the array has a member, "label_name", that names the type of entity identified.
textSpan | Object | An object that represents the text span. Contains two required numeric members: "offset" and "length".

Example labeled record (shown on several lines for readability; in the .jsonl file, each record occupies a single line):
{
    "sourceDetails": {
      "path": "Complaint3.txt"
    },
    "annotations": [
      {
        "entities": [
          {
            "entityType": "TEXTSELECTION",
            "labels": [
              {
                "label_name": "Label1"
              },
              {
                "label_name": "Label2"
              }
            ],
            "textSpan": {
              "offset": 0,
              "length": 28
            }
          },
          {
            "entityType": "TEXTSELECTION",
            "labels": [
              {
                "label_name": "Label1"
              }
            ],
            "textSpan": {
              "offset": 196,
              "length": 11
            }
          }
        ]
      }
    ]
}
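
Hand-counted offsets are easy to get wrong, so a record like the one above can be derived from the source text. The Python sketch below rests on several assumptions: "offset" and "length" are treated as character counts (this section does not state the unit), each entity is located at its first occurrence in the text, and the make_record helper and sample sentence are hypothetical.

```python
import json

def make_record(path, text, entities):
    """Build one labeled record.

    entities is a list of (phrase, [label names]) pairs; each phrase is
    located at its first occurrence in text.
    """
    items = []
    for phrase, label_names in entities:
        offset = text.find(phrase)
        if offset < 0:
            raise ValueError(f"{phrase!r} not found in {path}")
        items.append({
            "entityType": "TEXTSELECTION",
            "labels": [{"label_name": name} for name in label_names],
            # Assumption: offset and length are measured in characters.
            "textSpan": {"offset": offset, "length": len(phrase)},
        })
    return {
        "sourceDetails": {"path": path},
        "annotations": [{"entities": items}],
    }

# Hypothetical contents of Complaint3.txt.
text = "Printer on floor 3 is offline"
record = make_record(
    "Complaint3.txt",
    text,
    [("Printer", ["Label1"]), ("floor 3", ["Label2"])],
)
record_line = json.dumps(record)  # one manifest line per record
print(record_line)
```

A complete manifest would contain the header line followed by one such line per text file, all saved in the same directory as the text files.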