Custom Model Datasets
Review the supported datasets for custom models, and how to convert datasets into a supported format.
Label Studio Integration
Oracle’s Data Labeling Service is being deprecated. As an option, we recommend migrating your labeled datasets to Label Studio, an open source and marketplace-supported labeling tool.
Follow these steps to convert Data Labeling snapshot exports to Label Studio import and Label Studio raw JSON export formats. Use these formats for further annotation in Label Studio or direct model training.
Allowed Datasets for Custom Text Classification
You can provide labeled data for custom text classification models in two ways:
- Data Labeling projects
- Comma-separated values (.csv) files
CSV File Requirements
- The first line must be a header containing the following two column names:
  - text: captures the text to be classified.
  - labels: captures one or more assigned classes. For multilabel classification datasets, specify several class names by joining them with the | symbol.
- All lines after the header line contain training records.
- If the file has more than two columns, only the text and labels columns are used to train the model.
- For the CSV file encoding, use UTF-8. When using Excel, save the file as CSV UTF-8 (Comma-delimited) (.csv).
- For the delimiter, use a comma (,).
- For the escape character, use a double quote ("), which is the Unicode character U+0022.
For example, in Excel, if you type the following text:
    This is a "double quote" sentence
The preceding sentence is stored in the CSV as follows:
    "This is a ""double quote"" sentence"
Example CSV file for single label Text Classification:
    text,labels
    Windows OS -unable to print,Network Printer Failure
    Citrix Account frequently locking,Account (Password reset)
    Pull print queue not working ,Application Component Disconnect
    wifi disable and lan is disconnected at the desktop,Hardware Device Failure
Example CSV file for multi label Text Classification:
    text,labels
    Windows OS -unable to print,Network Printer Failure
    Pull print queue not working ,Application Component Disconnect|Network Printer Failure
    wifi disable and lan is disconnected at the desktop,Hardware Device Failure|Network Connection Issue
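The CSV rules above (UTF-8, comma delimiter, double-quote escaping, labels joined with |) line up with the defaults of Python's standard csv module. The following is a minimal sketch of writing and reading such a file, not an official tool. The function names and the Label1 class name are hypothetical and used only for illustration.

```python
import csv
import io

def write_rows(rows, stream):
    """Write (text, [label, ...]) pairs as CSV with a text,labels header."""
    # Default dialect: comma delimiter, '"' quote character, quotes doubled when escaped.
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["text", "labels"])
    for text, labels in rows:
        # Multilabel records join their class names with the | symbol.
        writer.writerow([text, "|".join(labels)])

def read_rows(stream):
    """Read the text and labels columns back; any other columns are ignored."""
    reader = csv.DictReader(stream)
    missing = {"text", "labels"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    return [(row["text"], row["labels"].split("|")) for row in reader]

rows = [
    ('This is a "double quote" sentence', ["Label1"]),  # Label1 is a made-up class
    ("wifi disable and lan is disconnected at the desktop",
     ["Hardware Device Failure", "Network Connection Issue"]),
]
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
round_trip = read_rows(io.StringIO(csv_text))
print(csv_text)
```

When writing to disk instead of an in-memory buffer, opening the file with open(path, "w", encoding="utf-8", newline="") keeps the encoding at UTF-8 and lets the csv module control line endings.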
Allowed Dataset Formats for Custom NER
You can provide labeled data for custom NER models in two ways:
- Label Studio projects
- JSON Lines format (.jsonl)
JSON File Requirements
The JSON file doesn't include the training data. Instead, the JSON file is a manifest file that contains labels and pointers (relative paths) to files with unlabeled data.
The manifest uses the JSON Lines (JSONL) format, where each line is a single JSON object:
- The first line in the file describes the set of labels or classes and the type of annotation file.
- Each later line describes a training record.
Save all the text files in the same directory as the manifest file (.jsonl), and reference each text file by name in its training record.
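Because the manifest only points at the text files, a quick pre-upload check that every referenced file sits next to the manifest can catch broken paths early. The sketch below is one way to do that locally. The function name missing_text_files and the file name manifest.jsonl are hypothetical, and skipping blank lines is an assumption of this sketch rather than documented behavior.

```python
import json
import tempfile
from pathlib import Path

def missing_text_files(manifest_path):
    """Return the record paths that do not exist relative to the manifest's directory."""
    manifest_path = Path(manifest_path)
    base_dir = manifest_path.parent
    with manifest_path.open(encoding="utf-8") as handle:
        lines = [line for line in handle if line.strip()]
    missing = []
    for line in lines[1:]:  # the first line is the header, not a training record
        relative = json.loads(line)["sourceDetails"]["path"]
        if not (base_dir / relative).is_file():
            missing.append(relative)
    return missing

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "Complaint3.txt").write_text("sample text", encoding="utf-8")
    manifest = folder / "manifest.jsonl"
    manifest.write_text(
        '{"annotationFormat": "ENTITY_EXTRACTION"}\n'
        '{"sourceDetails": {"path": "Complaint3.txt"}}\n'
        '{"sourceDetails": {"path": "Complaint4.txt"}}\n',
        encoding="utf-8",
    )
    missing = missing_text_files(manifest)
    print(missing)
```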
Schema Definition
- The first line is a header line. It contains a JSON object that describes the file type.
- Each later line contains a JSON object that represents a labeled record.
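Since every line of the manifest must be one complete JSON object, a pretty-printed object that spans several lines is not valid JSON Lines. The following sketch separates the header from the labeled records and rejects lines that do not parse on their own. The function name parse_manifest is hypothetical, and tolerating blank lines is an assumption made here for convenience.

```python
import json

def parse_manifest(text):
    """Split JSON Lines text into (header, records), checking each line is one object."""
    objects = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerating blank lines is an assumption of this sketch
        try:
            value = json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"line {number} is not a complete JSON value: {error}") from error
        if not isinstance(value, dict):
            raise ValueError(f"line {number} is not a JSON object")
        objects.append(value)
    if not objects:
        raise ValueError("manifest has no header line")
    return objects[0], objects[1:]

good = (
    '{"annotationFormat": "ENTITY_EXTRACTION"}\n'
    '{"sourceDetails": {"path": "Complaint3.txt"}}\n'
)
header, records = parse_manifest(good)

# A pretty-printed object spans several lines, so it fails the one-object-per-line check.
pretty = '{\n  "annotationFormat": "ENTITY_EXTRACTION"\n}\n'
try:
    parse_manifest(pretty)
    pretty_rejected = False
except ValueError:
    pretty_rejected = True
```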
Header Line Format
| Field | Type | Description |
|---|---|---|
| labelsSet | Array of objects | Each object has a string member, "name", that names an entity supported for annotation. List all entities here. |
| annotationFormat | String | Use "ENTITY_EXTRACTION" for NER datasets. |
| datasetFormatDetails | Object | Object with a string member, "formatType", that indicates the type of data being annotated. Set the value of formatType to "TEXT" for Language. |
Example JSON Schema (shown on several lines for readability; in the manifest, the object occupies a single line):
    {
      "labelsSet": [
        { "name": "Label1" },
        { "name": "Label2" },
        { "name": "Label3" },
        { "name": "Label4" }
      ],
      "annotationFormat": "ENTITY_EXTRACTION",
      "datasetFormatDetails": { "formatType": "TEXT" }
    }
Labeled Record Format
| Field | Type | Description |
|---|---|---|
| sourceDetails | Object | Object with a string member, "path", that points to the file being annotated. The file path is relative to the location of the JSON manifest file. |
| annotations | Object | Complex object that describes the annotations. |
| entities | Array (Objects) | A list of the entities identified in the record. |
| entityType | String | The type of entity annotation. For the value, use "TEXTSELECTION" for NER. |
| labels | Array (Objects) | Each object in the array has the member, "label_name", that represents the type of entity identified. |
| textSpan | Object | An object that represents the text span. Contains two required numeric members: "offset" and "length". |
JSON Schema for Labeled Record Format Example (shown on several lines for readability; in the manifest, the object occupies a single line):
    {
      "sourceDetails": { "path": "Complaint3.txt" },
      "annotations": [
        {
          "entities": [
            {
              "entityType": "TEXTSELECTION",
              "labels": [
                { "label_name": "Label1" },
                { "label_name": "Label2" }
              ],
              "textSpan": { "offset": 0, "length": 28 }
            },
            {
              "entityType": "TEXTSELECTION",
              "labels": [
                { "label_name": "Label1" }
              ],
              "textSpan": { "offset": 196, "length": 11 }
            }
          ]
        }
      ]
    }
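To show how the header and record fields fit together, here is a minimal sketch that builds one labeled record from a text string and writes the header and record as single-line JSON objects. The helper make_record and the sample sentence are hypothetical. Two assumptions are worth flagging: the sketch counts offset and length in characters (Python string indices), which the description above does not specify, and it treats any label missing from labelsSet as an error.

```python
import json

def make_record(path, text, spans, allowed_labels):
    """Build a labeled record; spans is a list of (substring, [label names])."""
    entities = []
    for substring, label_names in spans:
        unknown = [name for name in label_names if name not in allowed_labels]
        if unknown:
            raise ValueError(f"labels not declared in labelsSet: {unknown}")
        offset = text.find(substring)  # first occurrence only, for simplicity
        if offset < 0:
            raise ValueError(f"{substring!r} not found in {path}")
        entities.append({
            "entityType": "TEXTSELECTION",
            "labels": [{"label_name": name} for name in label_names],
            # Assumption: offset and length are counted in characters.
            "textSpan": {"offset": offset, "length": len(substring)},
        })
    return {"sourceDetails": {"path": path}, "annotations": [{"entities": entities}]}

header = {
    "labelsSet": [{"name": "Label1"}, {"name": "Label2"}],
    "annotationFormat": "ENTITY_EXTRACTION",
    "datasetFormatDetails": {"formatType": "TEXT"},
}
allowed = {item["name"] for item in header["labelsSet"]}

text = "Printer on floor 3 is offline."  # hypothetical contents of Complaint3.txt
record = make_record(
    "Complaint3.txt", text,
    [("Printer", ["Label1"]), ("floor 3", ["Label2"])],
    allowed,
)

# JSON Lines: one compact object per line, header first.
manifest = "\n".join(json.dumps(obj) for obj in (header, record)) + "\n"
print(manifest)
```

json.dumps emits each object on one line by default, which is what the JSON Lines manifest needs; passing indent would break that.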