Document Loaders
Coherence RAG supports ingestion from a wide range of document sources, offering built-in support for file system and HTTP protocols, along with integrations for OCI Object Storage, AWS S3, Azure Blob Storage, and Google Cloud Storage.
Additionally, it provides a flexible framework that allows customers to extend support to proprietary document repositories such as OpenText, Documentum, SharePoint, Alfresco, and other enterprise content management systems.
Built-in Document Loaders
Coherence RAG includes built-in implementations for `file` and `http`/`https` protocols, which allow it to load any document from a local file system or from the web.
The `file` loader is mainly useful during development, while `http(s)` support allows ingestion of any web page or document that is available on the Internet.
Cloud Document Loaders
Coherence RAG also supports additional document loaders for cloud storage solutions.
Unlike the built-in document loaders, cloud document loaders are packaged into separate modules, because they introduce additional dependencies on the cloud provider's SDK and typically have additional configuration requirements.
OCI Object Storage
The Oracle Cloud Infrastructure (OCI) Object Storage loader is implemented by the `oci-object-storage-loader` module. To use it, add the following dependency to your project:
```xml
<dependency>
  <groupId>${coherence.groupId}</groupId>
  <artifactId>coherence-rag-oci-object-storage-loader</artifactId>
  <version>${coherence.version}</version>
</dependency>
```
This configures a document loader for the custom `oci.os` URI scheme, and allows documents to be loaded from OCI Object Storage using the following URI syntax:

```
oci.os://<namespace>/<bucket>/<object>
```

For example, to load the document `coherence/coherence-rag.pdf` from the `docs` bucket in the `axaxnpcrorw5` namespace, you would reference it as:

```
oci.os://axaxnpcrorw5/docs/coherence/coherence-rag.pdf
```
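
To make the three parts of the URI concrete, here is a minimal sketch that splits an `oci.os` URI into namespace, bucket, and object using only `java.net.URI` from the JDK. The `OciObjectUri` class and its `Location` record are hypothetical names used for illustration; they are not part of Coherence RAG, and the loader's own parsing may differ. The sketch assumes Java 17 or later.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not a Coherence RAG class.
final class OciObjectUri {
    /** Components named after the placeholders in the URI syntax. */
    public record Location(String namespace, String bucket, String object) {}

    static Location parse(String text) {
        URI uri = URI.create(text);
        if (!"oci.os".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Not an oci.os URI: " + text);
        }
        // The authority component carries the namespace.
        String namespace = uri.getAuthority();
        // The path has the form "/<bucket>/<object>"; the object name may contain slashes.
        String path = uri.getPath();
        int slash = path == null ? -1 : path.indexOf('/', 1);
        if (namespace == null || slash < 2 || slash == path.length() - 1) {
            throw new IllegalArgumentException(
                    "Expected oci.os://<namespace>/<bucket>/<object>: " + text);
        }
        return new Location(namespace, path.substring(1, slash), path.substring(slash + 1));
    }

    public static void main(String[] args) {
        System.out.println(parse("oci.os://axaxnpcrorw5/docs/coherence/coherence-rag.pdf"));
    }
}
```

Only the first path segment is treated as the bucket, so an object name such as `coherence/coherence-rag.pdf` keeps its internal slash.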
For more information, please refer to [OCI Object Storage documentation](https://docs.oracle.com/en-us/iaas/Content/Object/home.htm).
Configuration
To load documents from a private OCI Object Storage bucket, you will need to provide the details necessary to authenticate, unless you configure your cluster members for OCI Instance Authentication.
This requires the following configuration options:
| Property | Env Variable | Description | Required |
|----------------------|----------------------|------------------------------------------|-----------|
| oci.region | OCI_REGION | The OCI region to connect to | Yes |
| oci.tenant.id | OCI_TENANT_ID | The OCI tenant identifier | Yes |
| oci.user.id | OCI_USER_ID | The OCI user identifier | Yes |
| oci.auth.key | OCI_AUTH_KEY | The path to a private authentication key | Yes |
| oci.auth.fingerprint | OCI_AUTH_FINGERPRINT | The authentication fingerprint | Yes |
For more details on OCI Java SDK authentication, refer to the [documentation](https://docs.oracle.com/en-us/iaas/Content/API/Concepts/devtoolslanding.htm).
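
Because all five options are required, it can help to check them before starting ingestion. The following is a minimal sketch of such a check, not the way Coherence RAG itself reads its configuration. It assumes the properties are visible as JVM system properties and that a system property is consulted before its environment variable; the source does not state the precedence. `OciConfigCheck` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical pre-flight check for illustration only; not a Coherence RAG class.
final class OciConfigCheck {
    // Property name -> environment variable, copied from the table above.
    static final Map<String, String> REQUIRED = new LinkedHashMap<>();
    static {
        REQUIRED.put("oci.region", "OCI_REGION");
        REQUIRED.put("oci.tenant.id", "OCI_TENANT_ID");
        REQUIRED.put("oci.user.id", "OCI_USER_ID");
        REQUIRED.put("oci.auth.key", "OCI_AUTH_KEY");
        REQUIRED.put("oci.auth.fingerprint", "OCI_AUTH_FINGERPRINT");
    }

    /** Returns the property names that have no value in either source. */
    static List<String> missing(Function<String, String> properties,
                                Function<String, String> environment) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : REQUIRED.entrySet()) {
            // Assumption: the property is checked first, then the environment variable.
            String value = properties.apply(entry.getKey());
            if (value == null || value.isBlank()) {
                value = environment.apply(entry.getValue());
            }
            if (value == null || value.isBlank()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(System::getProperty, System::getenv);
        System.out.println(missing.isEmpty()
                ? "All OCI settings are present"
                : "Missing OCI settings: " + missing);
    }
}
```

Passing the two lookups as functions keeps the check easy to exercise with plain maps instead of the real process environment.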
AWS S3
The Amazon Web Services (AWS) S3 loader is implemented by the `aws-s3-loader` module. To use it, add the following dependency to your project:
```xml
<dependency>
  <groupId>${coherence.groupId}</groupId>
  <artifactId>coherence-rag-aws-s3-loader</artifactId>
  <version>${coherence.version}</version>
</dependency>
```

This configures a document loader for the custom `s3` URI scheme, and allows documents to be loaded from AWS S3 using the following URI syntax:

```
s3://<bucket>/<object>
```

For example, to load the document `coherence/coherence-rag.pdf` from the `coherence-rag-docs` bucket, you would reference it as:

```
s3://coherence-rag-docs/coherence/coherence-rag.pdf
```
For more information, please refer to [Amazon S3 documentation](https://docs.aws.amazon.com/s3).
Configuration
In order to load documents from a private Amazon S3 bucket, you will need to provide the details necessary to authenticate. This requires the following configuration options:
| Property | Env Variable | Description | Required |
|----------------------|------------------------|------------------------------|-----------|
| aws.region | AWS_REGION | The AWS region to connect to | Yes |
| aws.accessKeyId | AWS_ACCESS_KEY_ID | The access key ID | Yes |
| aws.secretAccessKey | AWS_SECRET_ACCESS_KEY | The secret access key | Yes |
For more details on Amazon S3 authentication, refer to the [documentation](https://docs.aws.amazon.com/AmazonS3/latest/API/MakingRequests.html).
Azure Blob Storage
The Azure Blob Storage loader is implemented by the `azure-blob-storage-loader` module. To use it, add the following dependency to your project:
```xml
<dependency>
  <groupId>${coherence.groupId}</groupId>
  <artifactId>coherence-rag-azure-blob-storage-loader</artifactId>
  <version>${coherence.version}</version>
</dependency>
```

This configures a document loader for the custom `azure.blob` URI scheme, and allows documents to be loaded from Azure Blob Storage using the following URI syntax:

```
azure.blob://<container>/<blob>
```

For example, to load the document `coherence/coherence-rag.pdf` from the `coherence-rag-docs` container, you would reference it as:

```
azure.blob://coherence-rag-docs/coherence/coherence-rag.pdf
```
For more information, please refer to [Azure Blob Storage documentation](https://learn.microsoft.com/en-us/azure/storage/blobs/).
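
The URI schemes described so far can be summarized as a lookup from scheme to the Maven artifact that provides the loader. The following is a minimal sketch of that lookup, useful for checking which dependency a given document URI needs. `LoaderModules` and `requiredArtifact` are hypothetical names, not part of Coherence RAG, and the sketch covers only the schemes detailed above.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical lookup for illustration only; not a Coherence RAG class.
final class LoaderModules {
    // Schemes handled by the built-in loaders; no extra dependency is needed.
    private static final Set<String> BUILT_IN = Set.of("file", "http", "https");

    // Cloud schemes and the artifactId of the module that handles each one.
    private static final Map<String, String> CLOUD_ARTIFACTS = Map.of(
            "oci.os", "coherence-rag-oci-object-storage-loader",
            "s3", "coherence-rag-aws-s3-loader",
            "azure.blob", "coherence-rag-azure-blob-storage-loader");

    /** Returns the extra artifact a URI needs, or empty for built-in schemes. */
    static Optional<String> requiredArtifact(String documentUri) {
        String scheme = URI.create(documentUri).getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + documentUri);
        }
        // URI schemes are case-insensitive, so normalize before the lookup.
        scheme = scheme.toLowerCase(Locale.ROOT);
        if (BUILT_IN.contains(scheme)) {
            return Optional.empty();
        }
        String artifact = CLOUD_ARTIFACTS.get(scheme);
        if (artifact == null) {
            throw new IllegalArgumentException("Scheme not covered by this sketch: " + scheme);
        }
        return Optional.of(artifact);
    }

    public static void main(String[] args) {
        System.out.println(requiredArtifact("s3://coherence-rag-docs/coherence/coherence-rag.pdf"));
        System.out.println(requiredArtifact("https://example.com/guide.pdf"));
    }
}
```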