Meta Llama 3.1 (405B)

Review performance benchmarks for the meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster in OCI Generative AI.

  • See details for the model and review the following sections:
    • Available regions for this model.
    • Dedicated AI clusters for hosting this model.
  • Review the metrics.
Important

You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.

The following tables provide benchmarks for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is hosted on the predecessor Large Generic 4 cluster type, compare the tables to decide whether to move the model to the new unit type; one way to frame that comparison is sketched below.
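As a minimal sketch of one such comparison (our illustration, not an official procedure), the snippet below takes the request-level throughput columns from the two Random Length tables in the next section and prints the new-to-old ratio at each concurrency level. The variable names are ours.

```python
# Request-level throughput (RPM) copied from the Random Length tables below.
concurrency = [1, 2, 4, 8, 16, 32, 64, 128, 256]
rpm_large_generic_2 = [5.10, 10.39, 19.89, 39.48, 63.76, 114.21, 180.76, 241.73, 282.78]
rpm_large_generic_4 = [5.56, 11.68, 21.52, 34.09, 60.35, 91.42, 152.47, 222.67, 289.08]

for c, new, old in zip(concurrency, rpm_large_generic_2, rpm_large_generic_4):
    print(f"concurrency {c:>3}: Large Generic 2 / Large Generic 4 RPM ratio = {new / old:.2f}")
```

On this data, the new unit trails slightly at low concurrency but pulls ahead in the mid range; whether that trade is worth it also depends on the hardware and cost differences noted above.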

Random Length

This scenario mimics text generation use cases where the sizes of the prompt and the response are unknown ahead of time. To account for the unknown lengths, we used a stochastic approach in which both lengths follow a normal distribution: the prompt length has a mean of 480 tokens and a standard deviation of 240 tokens, and the response length has a mean of 300 tokens and a standard deviation of 150 tokens.
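To make the sampling procedure concrete, here is a minimal sketch of how such a workload could be drawn with NumPy. The rounding and the clipping to a minimum of one token are our assumptions; the source doesn't state how out-of-range draws are handled.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility (our choice)

def sample_lengths(n_requests: int):
    """Sample token lengths for the Random Length scenario.

    Prompt length   ~ Normal(mean=480, sd=240)
    Response length ~ Normal(mean=300, sd=150)
    """
    prompts = rng.normal(loc=480, scale=240, size=n_requests)
    responses = rng.normal(loc=300, scale=150, size=n_requests)
    # Round to whole tokens and clip to at least 1 token (assumption).
    prompts = np.clip(np.rint(prompts), 1, None).astype(int)
    responses = np.clip(np.rint(responses), 1, None).astype(int)
    return prompts, responses

prompts, responses = sample_lengths(1_000)
print(prompts.mean(), responses.mean())  # ≈ 480 and ≈ 300 on average
```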

The meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
| --- | --- | --- | --- | --- |
| 1 | 27.44 | 26.84 | 11.66 | 5.10 |
| 2 | 26.56 | 51.93 | 11.44 | 10.39 |
| 4 | 25.66 | 100.31 | 11.97 | 19.89 |
| 8 | 24.98 | 193.34 | 11.96 | 39.48 |
| 16 | 20.73 | 322.99 | 14.86 | 63.76 |
| 32 | 18.39 | 562.55 | 16.50 | 114.21 |
| 64 | 15.05 | 877.61 | 20.42 | 180.76 |
| 128 | 10.79 | 1,210.61 | 29.53 | 241.73 |
| 256 | 8.67 | 1,301.65 | 47.22 | 282.78 |

The meta.llama-3.1-405b-instruct model hosted on one predecessor Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
| --- | --- | --- | --- | --- |
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |
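As a rough consistency check on how these columns relate (our reading of the metrics, not part of the published benchmark), request-level throughput at a given concurrency is approximately concurrency × 60 / request-level latency, and token-level throughput is approximately request-level throughput × mean response length / 60. The sketch below applies both approximations to one row of the Large Generic 2 table above.

```python
def approx_rpm(concurrency: int, latency_s: float) -> float:
    """Approximate requests/minute from concurrency and mean request-level
    latency, assuming the cluster stays saturated (a simplifying assumption)."""
    return concurrency * 60.0 / latency_s

def approx_token_throughput(rpm: float, mean_response_tokens: float) -> float:
    """Approximate tokens/second from RPM and the scenario's mean response length."""
    return rpm * mean_response_tokens / 60.0

# Row: Large Generic 2, Random Length, concurrency 4 (latency 11.97 s, RPM 19.89)
print(approx_rpm(4, 11.97))                 # ≈ 20.05, vs. 19.89 in the table
print(approx_token_throughput(19.89, 300))  # ≈ 99.45, vs. 100.31 in the table
```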

Chat

This scenario covers chat and dialog use cases where the prompts and responses are short. The prompt and response lengths are each fixed at 100 tokens.

The meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
| --- | --- | --- | --- | --- |
| 1 | 27.38 | 26.65 | 3.74 | 15.99 |
| 2 | 26.43 | 51.30 | 3.88 | 30.78 |
| 4 | 25.92 | 100.61 | 3.96 | 60.36 |
| 8 | 25.52 | 196.72 | 4.06 | 118.03 |
| 16 | 21.24 | 328.32 | 4.84 | 196.99 |
| 32 | 19.32 | 588.59 | 5.36 | 353.15 |
| 64 | 16.73 | 1,003.22 | 6.29 | 601.93 |
| 128 | 12.56 | 1,433.27 | 8.59 | 859.96 |
| 256 | 8.60 | 1,586.86 | 8.59 | 952.11 |

The meta.llama-3.1-405b-instruct model hosted on one predecessor Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
| --- | --- | --- | --- | --- |
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |

Generation Heavy

This scenario is for generation and model-response-heavy use cases, such as generating a long job description from a short bullet list of items. In this case, the prompt length is fixed at 100 tokens and the response length is fixed at 1,000 tokens.

The meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
| --- | --- | --- | --- | --- |
| 1 | 27.35 | 26.65 | 36.65 | 1.60 |
| 2 | 26.72 | 49.97 | 37.53 | 3.00 |
| 4 | 26.21 | 99.94 | 38.27 | 6.00 |
| 8 | 26.42 | 199.89 | 38.00 | 11.99 |
| 16 | 22.60 | 346.45 | 44.45 | 20.79 |
| 32 | 21.97 | 692.91 | 45.77 | 41.57 |
| 64 | 20.10 | 1,177.63 | 50.14 | 70.66 |
| 128 | 17.06 | 2,086.85 | 60.70 | 125.21 |
| 256 | 11.05 | 2,024.72 | 109.59 | 121.48 |

The meta.llama-3.1-405b-instruct model hosted on one predecessor Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

RAG

The retrieval-augmented generation (RAG) scenario has a very long prompt and a short response, as in summarization use cases. The prompt length is fixed at 2,000 tokens and the response length is fixed at 200 tokens.

The meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
| --- | --- | --- | --- | --- |
| 1 | 27.30 | 25.65 | 7.74 | 7.69 |
| 2 | 25.70 | 48.30 | 8.21 | 14.49 |
| 4 | 23.48 | 88.27 | 8.96 | 26.48 |
| 8 | 20.09 | 150.57 | 10.52 | 45.17 |
| 16 | 14.89 | 223.85 | 14.10 | 67.15 |
| 32 | 10.97 | 330.10 | 19.10 | 99.03 |
| 64 | 8.80 | 386.54 | 32.06 | 115.96 |
| 128 | 8.82 | 386.74 | 62.04 | 116.02 |
| 256 | 8.82 | 375.21 | 119.99 | 112.56 |

The meta.llama-3.1-405b-instruct model hosted on one predecessor Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
| --- | --- | --- | --- | --- |
| 1 | 32.94 | 25.28 | 7.91 | 7.58 |
| 2 | 31.31 | 49.05 | 8.15 | 14.71 |
| 4 | 28.85 | 87.28 | 8.85 | 26.18 |
| 8 | 24.24 | 141.04 | 10.42 | 42.31 |
| 16 | 20.31 | 219.48 | 12.52 | 65.85 |
| 32 | 15.99 | 366.75 | 16.70 | 110.03 |
| 64 | 11.03 | 485.78 | 24.63 | 145.74 |
| 128 | 8.27 | 560.24 | 41.22 | 168.07 |
| 256 | 8.01 | 583.97 | 74.21 | 175.19 |