Cohere Embed 4

Review performance benchmarks for the cohere.embed-v4.0 (Cohere Embed 4) model hosted on one Embed Cohere unit of a dedicated AI cluster in OCI Generative AI.

See details for the model and review the following sections:
- Available regions for this model.
- Dedicated AI clusters for hosting this model.
Review the metrics.

Text Embeddings

This scenario applies only to the embedding models with text input. This scenario mimics embedding generation as part of the data ingestion pipeline of a vector database. In each scenario, all requests are the same size, which is 96 documents, each one with the same number of tokens. For example, for the scenario of 512 tokens mimics a collection of large PDF files, each file with 30,000+ words that a user would ingest into a vector DB.

64 Tokens

The following tables show the benchmarks for a scenario of 96 documents, 64 tokens per document.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.09	0.09	11.15	668.45
2	0.09	0.09	10.79	1,293.27
4	0.10	0.10	9.88	2,370.14
8	0.11	0.11	8.55	4,105.40
24	0.19	0.19	5.10	7,360.01
48	0.31	0.31	3.10	8,933.99
96	0.54	0.54	1.78	10,282.68

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.10	0.10	9.50	570.59
2	0.11	0.11	9.23	1,107.06
4	0.11	0.11	8.92	2,141.09
8	0.12	0.12	8.08	3,865.74
24	0.18	0.18	5.43	7,801.83
48	0.28	0.28	3.49	10,077.82
96	0.47	0.47	2.07	11,961.63

128 Tokens

The following tables show the benchmarks for a scenario of 96 documents, 128 tokens per document.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.09	0.09	11.27	1,381.70
2	0.09	0.09	10.67	2,617.09
4	0.10	0.10	9.67	4,750.20
8	0.12	0.12	8.14	7,990.79
24	0.22	0.22	4.29	12,624.79
48	0.35	0.35	2.76	16,251.43
96	0.64	0.64	1.51	17,735.38

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.10	0.10	9.69	1,189.24
2	0.10	0.10	9.38	2,301.32
4	0.11	0.11	8.89	4,357.61
8	0.12	0.12	8.00	7,854.35
24	0.19	0.19	5.01	14,749.07
48	0.29	0.29	3.34	19,707.08
96	0.50	0.50	1.92	22,589.75

512 Tokens

The following tables show the benchmarks for a scenario of 96 documents, 512 tokens per document.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.09	0.09	10.83	5,410.49
2	0.10	0.10	9.65	9,642.11
4	0.12	0.12	7.52	15,025.97
8	0.16	0.16	5.90	23,556.71
24	0.35	0.35	2.71	32,451.55
48	0.68	0.68	1.39	33,273.59
96	1.25	1.25	0.75	36,072.10

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.10	0.10	9.44	4,715.27
2	0.11	0.11	9.06	9,051.76
4	0.11	0.11	8.42	16,813.69
8	0.14	0.14	6.86	27,394.77
24	0.24	0.24	3.88	46,487.91
48	0.42	0.42	2.17	51,986.90
96	0.77	0.77	1.18	56,778.17

1,024 Tokens

The following tables show the benchmarks for a scenario of 96 documents, 1,024 tokens per document.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.09	0.09	9.55	9,559.38
2	0.12	0.12	1.30	2,601.06
4	0.15	0.15	6.06	24,284.74
8	0.23	0.23	4.05	32,432.49
24	0.60	0.60	1.56	37,501.74
48	1.09	1.09	0.85	40,893.60
96	2.11	2.11	0.31	29,835.31

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.10	0.10	9.14	9,158.45
2	0.11	0.11	8.64	17,307.93
4	0.13	0.13	7.25	29,048.00
8	0.16	0.16	5.51	44,150.34
24	0.38	0.38	2.38	57,261.32
48	0.64	0.64	1.39	66,942.72
96	1.20	1.20	0.74	70,865.77

2,048 Tokens

The following tables show the benchmarks for a scenario of 96 documents, 2,048 tokens per document.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.11	0.11	7.58	15,203.74
2	0.14	0.14	6.09	24,431.99
4	0.22	0.22	4.00	32,065.33
8	0.37	0.37	2.48	39,802.12
24	1.02	1.02	0.90	43,230.02
48	2.00	2.00	0.46	44,251.96

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.11	0.11	8.35	16,740.19
2	0.12	0.12	7.14	28,651.67
4	0.16	0.16	5.54	44,470.30
8	0.23	0.23	3.70	59,426.49
24	0.59	0.59	1.46	70,295.49
48	1.11	1.11	0.78	75,560.01
96	2.08	2.08	0.42	80,426.61

8,096 Tokens

The following tables show the benchmarks for a scenario of 96 documents, 8,096 tokens per document.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.25	0.25	3.31	26,290.24
2	0.42	0.42	2.05	32,530.08
4	0.82	0.82	1.09	34,646.38
8	1.59	1.59	0.57	36,389.86
24	4.47	4.47	0.20	39,049.48
48	8.75	8.75	0.11	40,180.09
96	17.30	17.30	0.05	39,843.97

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.17	0.17	4.57	36,262.71
2	0.26	0.26	3.14	49,882.53
4	0.50	0.50	1.69	53,606.93
8	0.90	0.90	0.96	60,838.78
24	2.38	2.38	0.36	69,450.50
48	4.52	4.52	0.19	73,294.47
96	8.72	8.72	0.10	76,456.16

32,000 Tokens

The following tables show the benchmarks for a scenario of 96 documents, 32,000 tokens per document.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.92	0.92	0.89	27,968.24
2	1.74	1.74	0.50	31,141.92
4	2.92	2.92	0.30	37,838.06
8	5.73	5.73	0.16	39,090.65
24	16.86	16.86	0.05	40,623.28

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Time to First Token (TTFT)(second)	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)	Total Throughput (tokens/second)
1	0.53	0.53	1.41	44,178.97
2	0.88	0.88	0.90	56,692.99
4	1.58	1.58	0.52	65,690.47
8	2.99	2.99	0.28	70,962.43
24	8.47	8.47	0.10	75,910.53
48	16.60	16.60	0.05	77,493.42

Image Embeddings

This scenario applies only to the embedding models with image input. In each scenario, I(M,N): Image with height Npx and width Mpx represents an image with the height of M and the width of N pixels. For example, I(1024,512) is an image with the height of 1,024 pixels and the width of 512 pixels.

I(512,512)

The following tables show the benchmarks for a scenario of an image with the height and width of 512 pixels.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)
1	0.18	4.76
2	0.19	8.89
4	0.27	13.17
8	0.49	14.84
16	0.94	16.14
32	1.84	16.45
64	3.66	16.38
128	7.27	16.06
256	13.57	16.00

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)
1	0.15	4.98
2	0.16	10.30
4	0.17	19.51
8	0.21	32.83
16	0.33	43.06
32	0.65	44.02
64	1.32	43.77
128	2.71	41.90
256	5.29	40.35

I(1024,512)

The following tables show the benchmarks for a scenario of an image with the height of 1,024 pixels and the width of 512 pixels.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)
1	0.25	3.42
2	0.25	6.72
4	0.38	9.17
8	0.78	9.52
16	1.52	10.04
32	2.93	10.50
64	5.75	10.48
128	11.23	10.52
256	19.97	10.13

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)
1	0.19	3.91
2	0.19	8.29
4	0.22	15.05
8	0.36	19.68
16	0.67	22.08
32	1.35	22.21
64	2.71	22.00
128	5.44	21.09
256	10.20	21.29

I(2048,2048)

The following tables show the benchmarks for a scenario of an image with the height and width of 2,048 pixels.

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for all regions except for the Saudi Arabia Central (Riyadh) region.


Concurrency	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)
1	0.86	1.04
2	0.98	1.73
4	1.84	2.04
8	3.02	1.42
16	7.71	2.03
32	14.93	2.10
64	25.73	1.98
128	26.92	1.86
256	27.29	1.91

The cohere.embed-v4.0 model hosted on one Embed Cohere unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region.


Concurrency	Request-level Latency (second)	Request-level Throughput (Request per second) (RPS)
1	0.66	1.25
2	0.69	2.49
4	1.07	3.40
8	2.24	3.41
16	4.57	3.40
32	9.22	3.37
64	18.53	3.30
128	24.61	2.77
256	25.78	2.71

Oracle Cloud Infrastructure Documentation

Cohere Embed 4

Text Embeddings

64 Tokens

128 Tokens

512 Tokens

1,024 Tokens

2,048 Tokens

8,096 Tokens

32,000 Tokens

Image Embeddings

I(512,512)

I(1024,512)

I(2048,2048)