Custom Model Datasets

Review the supported datasets for custom models, and how to convert datasets into a supported format.

Label Studio Integration

Oracle’s Data Labeling Service is being deprecated. As an option, we recommend migrating your labeled datasets to Label Studio, an open source and marketplace-supported labeling tool.

Follow these steps to convert Data Labeling snapshot exports to Label Studio import and Label Studio raw JSON export formats. Use these formats for further annotation in Label Studio or direct model training.

Allowed Datasets for Custom Text Classification

You can provide labeled data for custom text classification models in two ways:

Data Labeling projects
Comma-Separated Value (.csv) files

CSV File Requirements

The first line must be a header containing the following two column names:
- text: captures the text to be classified.
- labels: captures one or more assigned classes. For multilabel classification datasets, specify several class names by joining them with the | symbol.
All lines after the header line contain training records.
If the file has more than two columns, only the text and labels columns are used to train the model.
For the CSV file encoding, use UTF-8. When using Excel, save the file as CSV UTF-8 (Comma-delimited) (.csv).
For the delimiter, use a comma (,).
For the escape character, use a double quote ("), which is Unicode character U+0022.
For example, in Excel, if you type the following text:
```
This is a "double quote" sentence
```
The preceding sentence is stored in the CSV as follows:
```
"This is a ""double quote"" sentence"
```
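The requirements above can be sketched with Python's standard `csv` module. This is a minimal illustration, not an official tool: the function name `write_training_csv` and the label `Label1` are hypothetical, and the sketch assumes that labels arrive as a list of class names.

```python
import csv
import io

def write_training_csv(rows, stream):
    """Write (text, labels) rows as comma-delimited CSV with the required header."""
    writer = csv.writer(
        stream,
        delimiter=",",          # comma delimiter
        quotechar='"',          # double quote (U+0022)
        doublequote=True,       # an embedded " is written as ""
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerow(["text", "labels"])  # header line
    for text, labels in rows:
        # Several class names are joined with the | symbol.
        writer.writerow([text, "|".join(labels)])

# "Label1" is a made-up class name used only for illustration.
buffer = io.StringIO()
write_training_csv([('This is a "double quote" sentence', ["Label1"])], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, open the file with `open(path, "w", encoding="utf-8", newline="")` so that the output is UTF-8 encoded.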

Example CSV file for single-label text classification:

text,labels
Windows OS -unable to print,Network Printer Failure
Citrix Account frequently locking,Account (Password reset)
Pull print queue not working ,Application Component Disconnect
wifi disable and lan is disconnected at the desktop,Hardware Device Failure

Example CSV file for multilabel text classification:

text,labels
Windows OS -unable to print,Network Printer Failure
Pull print queue not working ,Application Component Disconnect|Network Printer Failure
wifi disable and lan is disconnected at the desktop,Hardware Device Failure|Network Connection Issue
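As a rough sketch of how such a file can be read back for a local sanity check, the following splits the labels column on the | symbol. The function name `read_records` is hypothetical, and the sample rows are copied from the examples above; columns other than text and labels would simply be ignored.

```python
import csv
import io

SAMPLE = """text,labels
Windows OS -unable to print,Network Printer Failure
Pull print queue not working ,Application Component Disconnect|Network Printer Failure
"""

def read_records(stream):
    """Return a list of (text, [label, ...]) tuples from a training CSV."""
    reader = csv.DictReader(stream)  # uses the header line for column names
    records = []
    for row in reader:
        # A single-label row yields a one-element list.
        records.append((row["text"], row["labels"].split("|")))
    return records

records = read_records(io.StringIO(SAMPLE))
for text, labels in records:
    print(repr(text), labels)
```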

Allowed Dataset Formats for Custom NER

You can provide labeled data for custom NER models in two ways:

Label Studio projects
JSON Lines (.jsonl) files

JSON File Requirements

The JSON file doesn't include the training data. Instead, the JSON file is a manifest file that contains labels and pointers (relative paths) to files with unlabeled data.

The manifest uses the JSON Lines (JSONL) format, where each line is a single JSON object:

The first line in the file describes the set of labels or classes and the type of annotation file.
Each later line describes a training record.
Save all the text files in the same directory as the manifest file (.jsonl), and reference the files by name in the training records.
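The layout described above (one header object followed by one object per training record) can be illustrated with a small reader. This is a sketch only: `split_manifest` is a hypothetical helper, the two sample lines reuse field names and values from the examples in this document, and skipping blank lines is a convenience of the sketch rather than a documented rule.

```python
import json

def split_manifest(lines):
    """Return (header, records) from the lines of a JSON Lines manifest."""
    # Each non-empty line is parsed as one complete JSON object.
    objects = [json.loads(line) for line in lines if line.strip()]
    if not objects:
        raise ValueError("manifest is empty")
    return objects[0], objects[1:]

manifest_lines = [
    '{"labelsSet": [{"name": "Label1"}], "annotationFormat": "ENTITY_EXTRACTION", '
    '"datasetFormatDetails": {"formatType": "TEXT"}}',
    '{"sourceDetails": {"path": "Complaint3.txt"}, "annotations": [{"entities": ['
    '{"entityType": "TEXTSELECTION", "labels": [{"label_name": "Label1"}], '
    '"textSpan": {"offset": 0, "length": 28}}]}]}',
]

header, records = split_manifest(manifest_lines)
# Paths are relative to the manifest, so the text files sit next to it.
referenced_files = [record["sourceDetails"]["path"] for record in records]
print(header["annotationFormat"], referenced_files)
```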

Schema Definition

The first line is a header line. It contains a JSON object that describes the file type.
Any later line contains a JSON object that represents a labeled record.

Header Line Format


| Field | Type | Description |
| --- | --- | --- |
| `labelsSet` | Array (Objects) | Each object has a string member, `"name"`, that names one entity supported for annotation. List all entities here. |
| `annotationFormat` | String | Use `"ENTITY_EXTRACTION"` for NER datasets. |
| `datasetFormatDetails` | Object | Object with a string member, `"formatType"`, that indicates the type of data being annotated. Set the value of `formatType` to `"TEXT"` for Language. |

Example header line (shown across several lines for readability; in the manifest, it occupies a single line):

{
  "labelsSet": [
    {
      "name": "Label1"
    },
    {
      "name": "Label2"
    },
    {
      "name": "Label3"
    },
    {
      "name": "Label4"
    }
  ],
  "annotationFormat": "ENTITY_EXTRACTION",
  "datasetFormatDetails": {
    "formatType": "TEXT"
  }
}
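Because every line of a JSON Lines file must be a complete JSON object, the header has to be serialized onto one line when the manifest is written. A minimal sketch, assuming a hypothetical helper named `build_header_line` and the placeholder label names from the example:

```python
import json

def build_header_line(label_names):
    """Serialize the manifest header as a single-line JSON object."""
    header = {
        "labelsSet": [{"name": name} for name in label_names],
        "annotationFormat": "ENTITY_EXTRACTION",
        "datasetFormatDetails": {"formatType": "TEXT"},
    }
    # json.dumps without an indent argument emits no line breaks.
    return json.dumps(header, ensure_ascii=False)

header_line = build_header_line(["Label1", "Label2", "Label3", "Label4"])
print(header_line)
```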

Labeled Record Format


| Field | Type | Description |
| --- | --- | --- |
| `sourceDetails` | Object | Object with a string member, `path`, that points to the file being annotated. The file path is relative to the location of the manifest (`.jsonl`) file. |
| `annotations` | Array (Objects) | A list of objects that describe the annotations. Each object contains an `entities` member. |
| `entities` | Array (Objects) | A list of the entities identified in the record. |
| `entityType` | String | The type of entity annotation. For the value, use `"TEXTSELECTION"` for NER. |
| `labels` | Array (Objects) | Each object in the array has the member `"label_name"`, which represents the type of entity identified. |
| `textSpan` | Object | An object that represents the text span. Contains two required numeric members: `"offset"` and `"length"`. |

Example labeled record (shown across several lines for readability; in the manifest, each record occupies a single line):

{
  "sourceDetails": {
    "path": "Complaint3.txt"
  },
  "annotations": [
    {
      "entities": [
        {
          "entityType": "TEXTSELECTION",
          "labels": [
            {
              "label_name": "Label1"
            },
            {
              "label_name": "Label2"
            }
          ],
          "textSpan": {
            "offset": 0,
            "length": 28
          }
        },
        {
          "entityType": "TEXTSELECTION",
          "labels": [
            {
              "label_name": "Label1"
            }
          ],
          "textSpan": {
            "offset": 196,
            "length": 11
          }
        }
      ]
    }
  ]
}
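A labeled record like the one above can be assembled programmatically. The sketch below is illustrative only: `build_record_line` and the sample sentence are hypothetical, and it assumes that `offset` and `length` count characters from the start of the text file, which this document does not state explicitly. Verify the unit against your own exported data before relying on it.

```python
import json

def build_record_line(path, text, spans):
    """Build one manifest record line.

    spans is a list of (substring, [label names]) pairs; the first
    occurrence of each substring in text is annotated.
    """
    entities = []
    for substring, label_names in spans:
        offset = text.find(substring)
        if offset < 0:
            raise ValueError(f"{substring!r} not found in {path}")
        entities.append({
            "entityType": "TEXTSELECTION",
            "labels": [{"label_name": name} for name in label_names],
            # ASSUMPTION: offset and length are measured in characters.
            "textSpan": {"offset": offset, "length": len(substring)},
        })
    record = {
        "sourceDetails": {"path": path},
        "annotations": [{"entities": entities}],
    }
    return json.dumps(record, ensure_ascii=False)  # single line, no indent

# Made-up file content used only to demonstrate the span arithmetic.
sample_text = "Printer on floor 3 is offline"
record_line = build_record_line(
    "Complaint3.txt", sample_text, [("floor 3", ["Label1"])]
)
print(record_line)
```

Slicing the source text with `text[offset:offset + length]` should return the annotated substring, which makes a convenient check before uploading a manifest.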

Oracle Cloud Infrastructure Documentation