Getting Started with the N/documentCapture Module

Note: The content in this help topic pertains to SuiteScript 2.1.

The following sections help you get started with the N/documentCapture module:

Extracting Text from a PDF File
Extracting Feature Content from a Document
Using OCI Credentials to Obtain Additional Usage

Extracting Text from a PDF File

To extract text from a PDF file of any length, use documentCapture.documentToText(options). For a sample, see Extract Text from a PDF File.

Provide the following parameters:

options.file – The PDF file to extract text from. This file must be located in the NetSuite File Cabinet, and you can specify the file using its internal ID or file path.
options.timeout (optional) – The timeout period, in milliseconds, to wait for the service to return results. The default value is 30,000 milliseconds (30 seconds), which is also the minimum. You can specify a longer timeout period. If you specify a shorter one, the default 30,000-millisecond timeout is used instead.
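The timeout rule can be modeled with a small helper. This is a minimal sketch based only on the parameter description above, not the service's actual implementation; the name effectiveTimeout and the constant MIN_TIMEOUT_MS are hypothetical.

```javascript
// Minimum (and default) timeout described above, in milliseconds.
const MIN_TIMEOUT_MS = 30000;

// Hypothetical helper: returns the timeout that would take effect for a
// requested value, based only on the documented behavior.
function effectiveTimeout(requestedMs) {
    // Missing or non-numeric values fall back to the default.
    if (typeof requestedMs !== 'number' || Number.isNaN(requestedMs)) {
        return MIN_TIMEOUT_MS;
    }
    // Values shorter than the minimum are replaced by the default.
    return Math.max(requestedMs, MIN_TIMEOUT_MS);
}

console.log(effectiveTimeout(undefined)); // 30000
console.log(effectiveTimeout(5000));      // 30000
console.log(effectiveTimeout(120000));    // 120000
```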

The documentCapture.documentToText(options) method returns a string with the text of the PDF file. If you want to analyze the text further, you can provide the extracted text to the llm.generateText(options) method in the N/llm module, as the following example shows:

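// Load the modules that this example uses
require(['N/file', 'N/documentCapture', 'N/llm'], (file, documentCapture, llm) => {
    // "14" is the internal ID of a PDF stored in the NetSuite File Cabinet
    const fileObj = file.load({
        id: "14"
    });

    // Extract the content of the PDF as plain text
    const extractedData = documentCapture.documentToText({
        file: fileObj
    });

    // Provide the extracted text to the LLM as a source document
    const response = llm.generateText({
        prompt: "What is this invoice for?",
        documents: [{
            id: '14',
            data: extractedData
        }]
    });
});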

Keep the following considerations in mind:

The documentCapture.documentToText(options) method supports PDF files only. If you want to extract content from JPG, PNG, or TIFF files, use documentCapture.documentToStructure(options) instead. See Extracting Feature Content from a Document.
This method extracts content as plain text only. If you want to extract other elements in a structured form, such as tables or key-value pairs (fields), use documentCapture.documentToStructure(options) instead.
This method supports PDF files of any size. However, as a best practice, files in the NetSuite File Cabinet should be less than 100 MB if possible. For large files, consider separating them into smaller files before using this method, which can reduce the time it takes for the text to be extracted and prevent timeout errors. For more information, see Best Practices for Preparing Files for Upload to the File Cabinet.
This method doesn't consume usage from the monthly usage pool of free requests provided by NetSuite (unlike documentCapture.documentToStructure(options), which does consume usage).
Encrypted files are not supported.
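The first two considerations amount to a choice between the two methods. The following sketch shows one way to make that choice from a file name. The function chooseExtractionMethod is hypothetical, and treating "jpeg" and "tif" as aliases of JPG and TIFF is an assumption that this topic does not confirm.

```javascript
// Hypothetical helper that applies the considerations above to a file name.
// Returns the name of the N/documentCapture method to call, or null when
// the file type is not one of the formats named in this topic.
function chooseExtractionMethod(fileName, needsStructuredContent) {
    const dot = fileName.lastIndexOf('.');
    const extension = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();

    if (extension === 'pdf') {
        // PDFs work with both methods; tables and fields need documentToStructure.
        return needsStructuredContent ? 'documentToStructure' : 'documentToText';
    }
    // Image formats are supported by documentToStructure only.
    // Assumption: "jpeg" and "tif" are aliases of JPG and TIFF.
    if (['jpg', 'jpeg', 'png', 'tif', 'tiff'].includes(extension)) {
        return 'documentToStructure';
    }
    return null;
}

console.log(chooseExtractionMethod('invoice.pdf', false)); // documentToText
console.log(chooseExtractionMethod('receipt.PNG', false)); // documentToStructure
```

Because documentToText doesn't consume usage from the monthly pool, preferring it for plain-text PDF extraction keeps pool usage for the requests that need structured output.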

Extracting Feature Content from a Document

To extract specific feature content (such as tables and fields) from a file in PDF, JPG, PNG, or TIFF format, use documentCapture.documentToStructure(options). For a sample, see Extract Feature Content from a Document Synchronously.

Provide the following parameters:

options.file – The document file to extract content from. This file must be located in the NetSuite File Cabinet, and you can specify the file using its internal ID or file path.
options.documentType (optional) – The document type. By specifying the type of document, the service can apply pretrained models that are optimized for that type, which can provide more accurate extraction results. Use values from the documentCapture.DocumentType enum to set this parameter. If you don't specify a value for this parameter, the DocumentType.OTHERS type is used by default.
options.features (optional) – The features to extract from the specified document. Use values from the documentCapture.Feature enum to set this parameter. If you don't specify a value for this parameter, the Feature.TEXT_EXTRACTION and Feature.TABLE_EXTRACTION features are used by default.
options.language (optional) – The language of the specified document. Use values from the documentCapture.Language enum to set this parameter. If you don't specify a value for this parameter, ENG (English) is used by default.
options.timeout (optional) – The timeout period, in milliseconds, to wait for the service to return results. The default value is 30,000 milliseconds (30 seconds), which is also the minimum. You can specify a longer timeout period. If you specify a shorter one, the default 30,000-millisecond timeout is used instead.
options.ociConfig (optional) – Oracle Cloud Infrastructure (OCI) credentials to obtain unlimited usage. For more information about providing these credentials, see Using OCI Credentials to Obtain Additional Usage. If you don't specify these credentials, successful calls to documentCapture.documentToStructure(options) consume usage from the free monthly usage pool that NetSuite provides.
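As an illustration of how these parameters fit together, the sketch below builds an options object that states the documented defaults explicitly and lets the caller override any of them. The helper buildStructureOptions is hypothetical, the module is passed in as an argument so the sketch doesn't depend on how your script loads modules, and passing options.features as an array is an assumption.

```javascript
// Hypothetical helper: builds the options object for documentToStructure.
// `documentCapture` is the loaded N/documentCapture module.
function buildStructureOptions(documentCapture, fileObj, overrides = {}) {
    const options = {
        file: fileObj,
        // The next three values restate the documented defaults.
        documentType: documentCapture.DocumentType.OTHERS,
        // Assumption: features are passed as an array of Feature values.
        features: [
            documentCapture.Feature.TEXT_EXTRACTION,
            documentCapture.Feature.TABLE_EXTRACTION
        ],
        language: documentCapture.Language.ENG,
        timeout: 30000
    };
    // Caller-supplied values (for example, features or ociConfig) win.
    return Object.assign(options, overrides);
}
```

In a script, the result could be passed straight to the method, for example documentCapture.documentToStructure(buildStructureOptions(documentCapture, fileObj, { features: [documentCapture.Feature.FIELD_EXTRACTION] })).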

The documentCapture.documentToStructure(options) method returns a documentCapture.Document object with the following structure:

{
    mimeType: string,
    pages: {
        fields: Field[],
        lines: Line[],
        tables: Table[],
        words: Word[]
    }
}

The data that's available in this object depends on the features you specify when you call documentCapture.documentToStructure(options). For example, this object includes fields (as documentCapture.Field objects) only when you specify the Feature.FIELD_EXTRACTION feature.
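Because the available data depends on the requested features, code that reads the result should tolerate missing collections. The following sketch counts the extracted elements in a plain object shaped like the structure above. The helper summarizeDocument is hypothetical, and it assumes that pages may be either a single object (as shown) or an array of such objects.

```javascript
// Hypothetical helper: counts the extracted elements in a Document-like
// object. A collection is treated as empty when it is absent, which is
// assumed to happen when its feature was not requested.
function summarizeDocument(doc) {
    const pages = Array.isArray(doc.pages) ? doc.pages : [doc.pages];
    const summary = { fields: 0, lines: 0, tables: 0, words: 0 };
    for (const page of pages) {
        if (!page) continue; // tolerate a missing pages value
        for (const key of Object.keys(summary)) {
            summary[key] += Array.isArray(page[key]) ? page[key].length : 0;
        }
    }
    return summary;
}

// Stand-in data for a result where field extraction was not requested.
const sampleDocument = {
    mimeType: 'application/pdf',
    pages: { lines: [{}, {}], tables: [{}], words: [{}, {}, {}] }
};
console.log(summarizeDocument(sampleDocument));
// { fields: 0, lines: 2, tables: 1, words: 3 }
```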

Keep the following considerations in mind:

The documentCapture.documentToStructure(options) method extracts content synchronously and supports documents up to five pages in length. If you want to extract content from longer documents, you must submit an asynchronous task using the N/task module. For an example, see Extract Content from a Document Asynchronously.
Encrypted files are not supported.
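The five-page limit can be turned into a simple routing decision. This sketch assumes that your script already knows the document's page count; how you obtain it is outside the scope of this topic. The names chooseExecutionMode and MAX_SYNC_PAGES are hypothetical.

```javascript
// Page limit for synchronous extraction, as stated above.
const MAX_SYNC_PAGES = 5;

// Hypothetical helper: returns 'sync' when documentToStructure can handle
// the document directly, or 'async' when an N/task submission is needed.
function chooseExecutionMode(pageCount) {
    if (!Number.isInteger(pageCount) || pageCount < 1) {
        throw new RangeError('pageCount must be a positive integer');
    }
    return pageCount <= MAX_SYNC_PAGES ? 'sync' : 'async';
}

console.log(chooseExecutionMode(5)); // sync
console.log(chooseExecutionMode(6)); // async
```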

Using OCI Credentials to Obtain Additional Usage

NetSuite provides a free monthly usage pool of requests for the N/documentCapture module. Successful calls to documentCapture.documentToStructure(options) consume usage from this pool, and the pool is refreshed each month. You can track your current monthly usage on the AI Preferences page in NetSuite. For more information, see View SuiteScript AI Usage Limit and Usage. Calls to documentCapture.documentToText(options) don't consume usage from this pool.

Each SuiteApp installed in your account gets its own separate monthly usage pool for N/documentCapture methods, and these SuiteApp pools are independent from the usage pool for your regular (non-SuiteApp) scripts. For example, if you install two SuiteApps, each with scripts that use N/documentCapture methods, each SuiteApp draws from its own unique usage pool. This approach means you get twice the total SuiteApp usage (one pool per SuiteApp). Any other scripts outside of SuiteApps use a separate usage pool, and SuiteApp usage doesn't count against it. This setup ensures that SuiteApps can't use up all your monthly allocation and block your own scripts from calling N/documentCapture methods.
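The independence of these pools can be pictured with a toy model. This is an illustration only, not how NetSuite tracks usage internally. The class UsagePools, the SuiteApp identifiers, and the limit of two requests are all hypothetical.

```javascript
// Illustrative model: one counter per SuiteApp plus one for regular scripts.
class UsagePools {
    constructor(monthlyLimit) {
        this.monthlyLimit = monthlyLimit; // hypothetical per-pool limit
        this.used = new Map();            // pool key -> requests consumed
    }

    // suiteAppId is a hypothetical identifier; omit it for regular scripts.
    // Returns true if the request fits in the pool, false if it is exhausted.
    consume(suiteAppId) {
        const key = suiteAppId || 'account';
        const used = this.used.get(key) || 0;
        if (used >= this.monthlyLimit) {
            return false;
        }
        this.used.set(key, used + 1);
        return true;
    }
}

const examplePools = new UsagePools(2);
examplePools.consume('suiteapp.a');
examplePools.consume('suiteapp.a');
console.log(examplePools.consume('suiteapp.a')); // false: this pool is exhausted
console.log(examplePools.consume('suiteapp.b')); // true: separate SuiteApp pool
console.log(examplePools.consume());             // true: regular scripts are unaffected
```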

If you want more monthly usage, you can provide the Oracle Cloud Infrastructure (OCI) credentials for an Oracle Cloud account that includes the OCI Document Understanding service. When you provide these credentials, usage is consumed from the provided OCI account instead of the free usage pool provided by NetSuite. You can provide OCI configuration parameters in two ways:

Using the AI Preferences page in NetSuite. For more information, see Manage AI Preferences.
Using the options.ociConfig parameter of documentCapture.documentToStructure(options). When using this approach, the credentials you provide override any OCI credentials that are configured on the AI Preferences page. For more information, see Configure OCI Credentials for AI.
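The precedence between these two sources can be sketched as follows. The helper resolveOciConfig is hypothetical and assumes that credentials passed in options.ociConfig replace the AI Preferences credentials as a whole rather than being merged with them.

```javascript
// Hypothetical helper: reports which credentials would apply to a call.
// callConfig        - value passed as options.ociConfig (may be undefined)
// preferencesConfig - credentials set on the AI Preferences page (may be undefined)
function resolveOciConfig(callConfig, preferencesConfig) {
    if (callConfig) {
        // Credentials passed with the call take precedence.
        return { source: 'options.ociConfig', config: callConfig };
    }
    if (preferencesConfig) {
        return { source: 'AI Preferences', config: preferencesConfig };
    }
    // No credentials anywhere: usage comes from the free monthly pool.
    return { source: 'free usage pool', config: null };
}

console.log(resolveOciConfig(undefined, { userId: 'prefs-user' }).source); // AI Preferences
```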

For a list of required OCI configuration parameters for synchronous requests, see documentCapture.documentToStructure(options). For asynchronous requests (those that use a document capture task in the N/task module), you must provide three additional OCI configuration parameters: objectStorageNamespace, outputBucketName, and inputBucketName. You can provide these additional parameters in two ways:

Using the fields in the Optional configuration for Asynchronous Document Capture section on the Settings subtab of the AI Preferences page.

Using the ociConfig property of the document capture task. When using this approach, the credentials you provide override any OCI credentials that are configured on the AI Preferences page. Here is the full list of parameters required in the object you provide for the ociConfig property:

docTask.ociConfig = {
    userId: 'user-ocid',
    tenancyId: 'user-tenancy',
    compartmentId: 'user-compartment',
    fingerprint: 'custsecret_secret_fingerprint_id',
    privateKey: 'custsecret_secret_privatekey_id',
    objectStorageNamespace: 'oraclenetsuite',
    outputBucketName: 'out-bucket-name',
    inputBucketName: 'in-bucket-name'
};
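Before submitting a request, it can help to check that the object is complete. The following sketch reports any missing keys. The helper missingOciKeys is hypothetical, and the list of five keys for synchronous requests is inferred from the sample above minus the three asynchronous parameters, so verify it against the documentCapture.documentToStructure(options) reference.

```javascript
// Keys inferred for synchronous requests (assumption, see the note above).
const SYNC_KEYS = ['userId', 'tenancyId', 'compartmentId', 'fingerprint', 'privateKey'];
// Additional keys that this topic lists for asynchronous requests.
const ASYNC_ONLY_KEYS = ['objectStorageNamespace', 'outputBucketName', 'inputBucketName'];

// Hypothetical helper: returns the required keys that are absent or blank.
function missingOciKeys(config, isAsync) {
    const required = isAsync ? SYNC_KEYS.concat(ASYNC_ONLY_KEYS) : SYNC_KEYS;
    return required.filter((key) => {
        const value = config ? config[key] : undefined;
        return typeof value !== 'string' || value.trim() === '';
    });
}

const partialConfig = {
    userId: 'user-ocid',
    tenancyId: 'user-tenancy',
    compartmentId: 'user-compartment',
    fingerprint: 'custsecret_secret_fingerprint_id',
    privateKey: 'custsecret_secret_privatekey_id'
};
console.log(missingOciKeys(partialConfig, false)); // []
console.log(missingOciKeys(partialConfig, true));
// [ 'objectStorageNamespace', 'outputBucketName', 'inputBucketName' ]
```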
