Extracting Text from a PDF File

To extract text from a PDF file of any length, use documentCapture.documentToText(options). For a sample, see Extract Text from a PDF File.

Provide the following parameters:

options.file - The PDF file to extract text from. This file must be located in the NetSuite File Cabinet, and you can specify the file using its internal ID or file path.
options.timeout (optional) - The timeout period, in milliseconds, to wait for the service to return results. The default value is 30,000 milliseconds (30 seconds). You can specify a longer timeout period, but not a shorter one. If you specify a value less than 30,000 milliseconds, the default 30,000 millisecond timeout is used instead.
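
The timeout rule described above can be sketched as a small helper. This is an illustrative sketch only: resolveTimeout and MIN_TIMEOUT_MS are hypothetical names that are not part of the N/documentCapture module, and the function simply mirrors the documented behavior so you can see which value takes effect.

```javascript
// Hypothetical helper (not part of N/documentCapture). It mirrors the
// documented rule: values below 30,000 ms fall back to the 30,000 ms default.
const MIN_TIMEOUT_MS = 30000;

function resolveTimeout(requestedMs) {
    // No timeout provided: the default applies.
    if (requestedMs === undefined || requestedMs === null) {
        return MIN_TIMEOUT_MS;
    }
    // Shorter than the minimum: the default is used instead.
    return requestedMs < MIN_TIMEOUT_MS ? MIN_TIMEOUT_MS : requestedMs;
}
```

For example, a requested timeout of 10,000 milliseconds resolves to 30,000, while a requested timeout of 60,000 milliseconds is kept as is. In a script, you pass the value directly as options.timeout alongside options.file.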

The documentCapture.documentToText(options) method returns a string with the text of the PDF file. If you want to analyze the text further, you can provide the extracted text to the llm.generateText(options) method in the N/llm module, as the following example shows:

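// Assumes the N/file, N/documentCapture, and N/llm modules are loaded as file, documentCapture, and llm
// "14" is the unique ID of a PDF stored in the NetSuite File Cabinet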
const fileObj = file.load({
    id: "14"
});
const extractedData = documentCapture.documentToText({
    file: fileObj
});

const response = llm.generateText({
    prompt: "What is this invoice for?",
    documents: [{
        id: '14',
        data: extractedData
    }]
});

Keep the following considerations in mind:

The documentCapture.documentToText(options) method supports PDF files only. If you want to extract content from JPG, PNG, or TIFF files, use documentCapture.documentToStructure(options) instead. See Extracting Feature Content from a Document.
This method extracts content as plain text only. If you want to extract other elements in a structured form, such as tables or key-value pairs (fields), use documentCapture.documentToStructure(options) instead.
This method supports PDF files of any size. However, as a best practice, files in the NetSuite File Cabinet should be less than 100 MB if possible. Consider splitting large files into smaller files before using this method. Smaller files take less time to extract and are less likely to cause timeout errors. For more information, see Best Practices for Preparing Files for Upload to the File Cabinet.
This method doesn't consume usage from the monthly usage pool of free requests provided by NetSuite (unlike documentCapture.documentToStructure(options), which does consume usage).
Encrypted files are not supported.
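
The file type consideration above can be sketched as a simple routing check. This is a minimal sketch under stated assumptions: chooseExtractionMethod is a hypothetical helper, and deciding by file name extension (including the .jpeg and .tif variants) is an assumption made for illustration. It returns the name of the method to call rather than calling it.

```javascript
// Hypothetical helper: picks a documentCapture method name from a file name.
// Assumption: the file name extension reliably reflects the file type.
function chooseExtractionMethod(fileName) {
    const dot = fileName.lastIndexOf('.');
    const ext = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();

    // documentToText supports PDF files only.
    if (ext === 'pdf') {
        return 'documentToText';
    }
    // Image formats must go through documentToStructure instead.
    if (['jpg', 'jpeg', 'png', 'tif', 'tiff'].includes(ext)) {
        return 'documentToStructure';
    }
    // Anything else is outside what this topic describes.
    return null;
}
```

Note that a PDF file can still be sent to documentCapture.documentToStructure(options) when you need tables or key-value pairs rather than plain text. The sketch only captures the default choice for each file type.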
