Custom Document Loaders
For specialized use cases, users can define custom document loaders tailored to their specific needs.
To implement a custom document loader, create a CDI bean that implements the `com.oracle.coherence.rag.DocumentLoader` interface, and specify the URI protocol (scheme) it is intended to handle via the class-level `@Named` annotation.
The `load` method should fetch the document based on the contents of the specified `uri`, parse it using the injected `DocumentParser`, and return an instance of a LangChain4J `dev.langchain4j.data.document.Document` containing extracted text and metadata.
As an example, here is the actual implementation of the `OciObjectStorageDocumentLoader` that does all of the above:
```java
@Named("oci.os")
@ApplicationScoped
public class OciObjectStorageDocumentLoader
        implements DocumentLoader
    {
    @Inject
    DocumentParser documentParser;                                        // <1>

    @Inject
    ObjectStorageClient client;                                           // <2>

    public Document load(URI uri)
        {
        String ns     = uri.getHost();
        Path   path   = Path.of(uri.getPath());
        String bucket = path.getName(0).toString();
        String object = path.subpath(1, path.getNameCount()).toString();

        var request = GetObjectRequest.builder()                          // <3>
                .namespaceName(ns)
                .bucketName(bucket)
                .objectName(object)
                .build();

        var response = client.getObject(request);                         // <4>

        var source = new DocumentSource()                                 // <5>
            {
            public InputStream inputStream() throws IOException
                {
                return response.getInputStream();                         // <6>
                }

            public Metadata metadata()
                {
                var metadata = Metadata.metadata("url", uri.toString());  // <7>
                metadata.put("ns", ns);
                metadata.put("bucket", bucket);
                metadata.put("object", object);
                metadata.put("content_type", response.getContentType());
                metadata.put("content_md5", response.getContentMd5());
                metadata.put("content_length", response.getContentLength());
                return metadata;
                }
            };

        return dev.langchain4j.data.document.DocumentLoader.load(source, documentParser); // <8>
        }
    }
```
1. The `DocumentParser` to use is injected based on the store configuration.
2. The `ObjectStorageClient` is created and injected based on configuration.
3. Create a `GetObjectRequest` from the information in the specified `uri`.
4. Execute the request and get a response from the service.
5. Create a LangChain4J `DocumentSource` that will be used to access the document content and metadata.
6. Return the document content from the response.
7. Create the document metadata based on the response.
8. Create a `Document` from the `DocumentSource` created in step 5, using the injected `DocumentParser` to extract text from the document content.
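
To make the URI handling at the start of `load` easier to follow, here is a minimal, standalone sketch of how a URI is split into namespace, bucket, and object name. It reuses the same `java.net.URI` and `java.nio.file.Path` calls as the loader above. The `ObjectLocationExample` class, the `ObjectLocation` record, and the sample URI `oci.os://mynamespace/mybucket/docs/report.pdf` are hypothetical and exist only for illustration; they are not part of the library.

```java
import java.net.URI;
import java.nio.file.Path;

public class ObjectLocationExample
    {
    /** Hypothetical holder for the three values the loader extracts from a URI. */
    record ObjectLocation(String namespace, String bucket, String object) {}

    /**
     * Split a URI of the assumed form {@code oci.os://<namespace>/<bucket>/<object path>}
     * using the same calls as OciObjectStorageDocumentLoader.load.
     */
    static ObjectLocation parse(URI uri)
        {
        // the authority (host) part of the URI carries the namespace
        String ns = uri.getHost();

        // the first path element is the bucket, the remainder is the object name
        Path   path   = Path.of(uri.getPath());
        String bucket = path.getName(0).toString();
        String object = path.subpath(1, path.getNameCount()).toString();

        return new ObjectLocation(ns, bucket, object);
        }

    public static void main(String[] args)
        {
        // sample URI made up for this sketch
        URI uri = URI.create("oci.os://mynamespace/mybucket/docs/report.pdf");

        // the scheme is the value the loader declares via @Named
        System.out.println("scheme: " + uri.getScheme());
        System.out.println(parse(uri));
        }
    }
```

Note that this parsing expects at least one path element after the bucket: `Path.subpath` throws an `IllegalArgumentException` when the begin and end indexes are equal, so a URI that names only a bucket would be rejected.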