Custom Document Loaders

For specialized use cases, you can define custom document loaders tailored to your needs.

To implement a custom document loader, create a CDI bean that implements the `com.oracle.coherence.rag.DocumentLoader` interface, and specify the name of the URI protocol (scheme) it handles via the `@Named` class annotation.

The `load` method should fetch the document identified by the specified `uri`, parse it using the injected `DocumentParser`, and return a LangChain4J `dev.langchain4j.data.document.Document` containing the extracted text and metadata.

As an example, here is the actual implementation of the `OciObjectStorageDocumentLoader` that does all of the above:

```java
@Named("oci.os")
@ApplicationScoped
public class OciObjectStorageDocumentLoader
        implements DocumentLoader
    {
    @Inject
    DocumentParser documentParser;                                              // <1>

    @Inject
    ObjectStorageClient client;                                                 // <2>

    public Document load(URI uri)
        {
        String ns     = uri.getHost();
        Path   path   = Path.of(uri.getPath());
        String bucket = path.getName(0).toString();
        String object = path.subpath(1, path.getNameCount()).toString();

        var request = GetObjectRequest.builder()                                // <3>
                .namespaceName(ns)
                .bucketName(bucket)
                .objectName(object)
                .build();

        var response = client.getObject(request);                               // <4>

        var source = new DocumentSource()                                       // <5>
            {
            public InputStream inputStream() throws IOException
                {
                return response.getInputStream();                               // <6>
                }

            public Metadata metadata()
                {
                var metadata = Metadata.metadata("url", uri.toString());        // <7>
                metadata.put("ns", ns);
                metadata.put("bucket", bucket);
                metadata.put("object", object);
                metadata.put("content_type", response.getContentType());
                metadata.put("content_md5", response.getContentMd5());
                metadata.put("content_length", response.getContentLength());
                return metadata;
                }
            };

        return dev.langchain4j.data.document.DocumentLoader.load(source, documentParser);  // <8>
        }
    }
```

1. The `DocumentParser` to use is injected based on the store configuration.
2. The `ObjectStorageClient` is created and injected based on configuration.
3. Create a `GetObjectRequest` based on information from the specified `uri`.
4. Execute the request and get a response from the service.
5. Create a LangChain4J `DocumentSource` that will be used to access the document content and metadata.
6. Return the document content from the response.
7. Create the document metadata based on the response.
8. Create a `Document` from the `DocumentSource` created in step 5, using the injected `DocumentParser` to extract text from the document content.
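
The first four lines of `load` imply a URI layout in which the host is the Object Storage namespace, the first path segment is the bucket, and the remaining segments form the object name. The following standalone sketch isolates that decomposition using the same `URI` and `Path` calls as the loader above. The class name `OciUriParts`, the `Parts` record, the added validation, and the example URI are hypothetical and used only for illustration; they are not part of the Coherence RAG API.

```java
import java.net.URI;
import java.nio.file.Path;

// Hypothetical helper, for illustration only (requires Java 16+ for records).
public class OciUriParts
    {
    /** Namespace, bucket and object name extracted from a loader URI. */
    public record Parts(String namespace, String bucket, String object) {}

    public static Parts parse(URI uri)
        {
        String ns      = uri.getHost();   // host component -> namespace
        String rawPath = uri.getPath();   // e.g. /mybucket/docs/report.pdf
        if (ns == null || rawPath == null)
            {
            throw new IllegalArgumentException("Missing namespace or path: " + uri);
            }

        Path path = Path.of(rawPath);
        if (path.getNameCount() < 2)
            {
            // Assumed validation: both a bucket and an object name are required
            throw new IllegalArgumentException("Missing bucket or object: " + uri);
            }

        String bucket = path.getName(0).toString();                         // first segment
        String object = path.subpath(1, path.getNameCount()).toString();    // remaining segments
        return new Parts(ns, bucket, object);
        }

    public static void main(String[] args)
        {
        // Example URI is an assumption based on the @Named("oci.os") scheme above
        Parts parts = parse(URI.create("oci.os://mynamespace/mybucket/docs/report.pdf"));
        System.out.println(parts);
        }
    }
```

On a system with `/` as the file separator, this yields the namespace `mynamespace`, the bucket `mybucket`, and the object name `docs/report.pdf`. Because `Path.toString()` uses the platform's separator, a loader that must also run on Windows may prefer to split the path string on `/` directly.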