Meta Llama 3.1 (405B)
Review performance benchmarks for the meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster in OCI Generative AI.
- See details for the model and review the following sections:
  - Available regions for this model.
  - Dedicated AI clusters for hosting this model.
- Review the metrics.
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on the predecessor Large Generic 4 cluster type, compare the tables to decide whether to move it to the new unit type.
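Each table reports four metrics at every concurrency level. This section doesn't spell out how those metrics are computed, so the following Python sketch is only a plausible reading: it assumes each benchmark run logs, per request, the number of output tokens and the end-to-end latency, plus the wall-clock duration of the run at that concurrency level.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestRecord:
    """One completed benchmark request (hypothetical log format)."""
    output_tokens: int  # tokens generated in the response
    latency_s: float    # end-to-end request time in seconds

def summarize(records: list[RequestRecord], wall_clock_s: float) -> dict:
    """Aggregate the four metrics reported in the tables below.

    `wall_clock_s` is assumed to be the total duration of the benchmark
    run at a given concurrency level.
    """
    return {
        # Per-request decoding speed, averaged across requests (assumed
        # to be measured over the full request time, including startup).
        "token_level_inference_speed_tps": mean(
            r.output_tokens / r.latency_s for r in records
        ),
        # Aggregate output tokens per second across all concurrent requests.
        "token_level_throughput_tps":
            sum(r.output_tokens for r in records) / wall_clock_s,
        # Mean end-to-end latency per request.
        "request_level_latency_s": mean(r.latency_s for r in records),
        # Completed requests per minute (RPM).
        "request_level_throughput_rpm": len(records) / wall_clock_s * 60,
    }
```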
Random Length
This scenario mimics text generation use cases where the sizes of the prompt and the response are unknown ahead of time. Because the lengths are unknown, we use a stochastic approach in which both the prompt and the response length follow a normal distribution: the prompt length has a mean of 480 tokens and a standard deviation of 240 tokens, and the response length has a mean of 300 tokens and a standard deviation of 150 tokens.
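Here is a minimal NumPy sketch of the sampling just described. Rounding to whole tokens and clipping at a minimum of one token are assumptions; the benchmark's exact rounding and truncation rules aren't given here.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_lengths(n_requests: int) -> tuple[np.ndarray, np.ndarray]:
    """Draw prompt/response token lengths for the Random Length scenario.

    Prompt ~ N(480, 240), response ~ N(300, 150), as described above.
    """
    prompts = rng.normal(loc=480, scale=240, size=n_requests)
    responses = rng.normal(loc=300, scale=150, size=n_requests)
    # Round to whole tokens and clip to at least 1 (assumed behavior).
    return (
        np.clip(np.rint(prompts), 1, None).astype(int),
        np.clip(np.rint(responses), 1, None).astype(int),
    )

prompt_lens, response_lens = sample_lengths(100)
```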
- The meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit of a dedicated AI cluster:

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
  |---|---|---|---|---|
  | 1 | 27.44 | 26.84 | 11.66 | 5.10 |
  | 2 | 26.56 | 51.93 | 11.44 | 10.39 |
  | 4 | 25.66 | 100.31 | 11.97 | 19.89 |
  | 8 | 24.98 | 193.34 | 11.96 | 39.48 |
  | 16 | 20.73 | 322.99 | 14.86 | 63.76 |
  | 32 | 18.39 | 562.55 | 16.50 | 114.21 |
  | 64 | 15.05 | 877.61 | 20.42 | 180.76 |
  | 128 | 10.79 | 1,210.61 | 29.53 | 241.73 |
  | 256 | 8.67 | 1,301.65 | 47.22 | 282.78 |

- The meta.llama-3.1-405b-instruct model hosted on one predecessor Large Generic 4 unit of a dedicated AI cluster:

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
  |---|---|---|---|---|
  | 1 | 32.66 | 25.79 | 10.78 | 5.56 |
  | 2 | 31.36 | 50.81 | 10.06 | 11.68 |
  | 4 | 29.86 | 96.01 | 10.87 | 21.52 |
  | 8 | 27.89 | 170.45 | 10.87 | 34.09 |
  | 16 | 24.74 | 282.52 | 13.51 | 60.35 |
  | 32 | 21.51 | 457.24 | 16.73 | 91.42 |
  | 64 | 17.68 | 676.90 | 18.29 | 152.47 |
  | 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
  | 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |
Chat
This scenario covers chat and dialog use cases where prompts and responses are short. Both the prompt length and the response length are fixed at 100 tokens.
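To reproduce a fixed-length scenario like this one at a given concurrency level, a load generator drives that many parallel request loops and counts completed requests. The following asyncio sketch shows the shape of such a harness; `send_chat_request` is a hypothetical stand-in for a real inference client (for example, the OCI Generative AI SDK), and the sleep merely simulates generation time.

```python
import asyncio
import time

async def send_chat_request(prompt: str, max_tokens: int) -> int:
    """Hypothetical stand-in for a real inference call; returns output tokens."""
    await asyncio.sleep(0.1)  # simulate network and generation time
    return max_tokens

async def run_at_concurrency(concurrency: int, duration_s: float) -> float:
    """Drive `concurrency` parallel request loops; return requests per minute."""
    completed = 0
    deadline = time.monotonic() + duration_s

    async def worker() -> None:
        nonlocal completed
        while time.monotonic() < deadline:
            # The benchmark fixes both prompt and response at 100 tokens;
            # the string below only stands in for a ~100-token prompt.
            await send_chat_request("hello " * 100, max_tokens=100)
            completed += 1

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return completed / duration_s * 60  # requests per minute (RPM)

if __name__ == "__main__":
    rpm = asyncio.run(run_at_concurrency(concurrency=8, duration_s=5.0))
    print(f"measured request-level throughput: {rpm:.1f} RPM")
```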
- The meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit of a dedicated AI cluster:

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
  |---|---|---|---|---|
  | 1 | 27.38 | 26.65 | 3.74 | 15.99 |
  | 2 | 26.43 | 51.30 | 3.88 | 30.78 |
  | 4 | 25.92 | 100.61 | 3.96 | 60.36 |
  | 8 | 25.52 | 196.72 | 4.06 | 118.03 |
  | 16 | 21.24 | 328.32 | 4.84 | 196.99 |
  | 32 | 19.32 | 588.59 | 5.36 | 353.15 |
  | 64 | 16.73 | 1,003.22 | 6.29 | 601.93 |
  | 128 | 12.56 | 1,433.27 | 8.59 | 859.96 |
  | 256 | 8.60 | 1,586.86 | 8.59 | 952.11 |

- The meta.llama-3.1-405b-instruct model hosted on one predecessor Large Generic 4 unit of a dedicated AI cluster:

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
  |---|---|---|---|---|
  | 1 | 28.93 | 21.65 | 4.60 | 13.01 |
  | 2 | 31.72 | 50.89 | 3.90 | 30.54 |
  | 4 | 30.86 | 91.23 | 4.17 | 54.74 |
  | 8 | 29.61 | 163.06 | 4.33 | 97.84 |
  | 16 | 27.66 | 277.48 | 4.49 | 166.49 |
  | 32 | 26.01 | 615.83 | 4.77 | 369.50 |
  | 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
  | 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
  | 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Generation Heavy
This scenario is for generation-heavy use cases where the model produces a long response, for example, a long job description generated from a short bullet list of items. The prompt length is fixed at 100 tokens and the response length is fixed at 1,000 tokens.
- The meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit of a dedicated AI cluster:

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
  |---|---|---|---|---|
  | 1 | 27.35 | 26.65 | 36.65 | 1.60 |
  | 2 | 26.72 | 49.97 | 37.53 | 3.00 |
  | 4 | 26.21 | 99.94 | 38.27 | 6.00 |
  | 8 | 26.42 | 199.89 | 38.00 | 11.99 |
  | 16 | 22.60 | 346.45 | 44.45 | 20.79 |
  | 32 | 21.97 | 692.91 | 45.77 | 41.57 |
  | 64 | 20.10 | 1,177.63 | 50.14 | 70.66 |
  | 128 | 17.06 | 2,086.85 | 60.70 | 125.21 |
  | 256 | 11.05 | 2,024.72 | 109.59 | 121.48 |

- The meta.llama-3.1-405b-instruct model hosted on one predecessor Large Generic 4 unit of a dedicated AI cluster:

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
  |---|---|---|---|---|
  | 1 | 31.28 | 26.55 | 18.50 | 3.24 |
  | 2 | 30.79 | 50.88 | 16.14 | 7.12 |
  | 4 | 29.46 | 93.36 | 18.15 | 12.09 |
  | 8 | 28.20 | 170.20 | 19.40 | 21.40 |
  | 16 | 26.37 | 271.80 | 17.73 | 40.56 |
  | 32 | 25.24 | 419.13 | 21.06 | 55.06 |
  | 64 | 22.19 | 755.43 | 24.38 | 98.29 |
  | 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
  | 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |
RAG
The retrieval-augmented generation (RAG) scenario has a very long prompt and a short response, as in summarization use cases. The prompt length is fixed at 2,000 tokens and the response length is fixed at 200 tokens.
- The meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit of a dedicated AI cluster:

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
  |---|---|---|---|---|
  | 1 | 27.30 | 25.65 | 7.74 | 7.69 |
  | 2 | 25.70 | 48.30 | 8.21 | 14.49 |
  | 4 | 23.48 | 88.27 | 8.96 | 26.48 |
  | 8 | 20.09 | 150.57 | 10.52 | 45.17 |
  | 16 | 14.89 | 223.85 | 14.10 | 67.15 |
  | 32 | 10.97 | 330.10 | 19.10 | 99.03 |
  | 64 | 8.80 | 386.54 | 32.06 | 115.96 |
  | 128 | 8.82 | 386.74 | 62.04 | 116.02 |
  | 256 | 8.82 | 375.21 | 119.99 | 112.56 |

- The meta.llama-3.1-405b-instruct model hosted on one predecessor Large Generic 4 unit of a dedicated AI cluster:

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
  |---|---|---|---|---|
  | 1 | 32.94 | 25.28 | 7.91 | 7.58 |
  | 2 | 31.31 | 49.05 | 8.15 | 14.71 |
  | 4 | 28.85 | 87.28 | 8.85 | 26.18 |
  | 8 | 24.24 | 141.04 | 10.42 | 42.31 |
  | 16 | 20.31 | 219.48 | 12.52 | 65.85 |
  | 32 | 15.99 | 366.75 | 16.70 | 110.03 |
  | 64 | 11.03 | 485.78 | 24.63 | 145.74 |
  | 128 | 8.27 | 560.24 | 41.22 | 168.07 |
  | 256 | 8.01 | 583.97 | 74.21 | 175.19 |
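Because the purpose of these tables is to help you choose between the two cluster types, a short script can make the comparison concrete. The figures below are copied from the Chat tables above; comparing on token-level throughput is one reasonable choice, and latency or RPM could be substituted the same way.

```python
# Token-level throughput (tokens/second) from the Chat tables above.
lg2 = {1: 26.65, 2: 51.30, 4: 100.61, 8: 196.72, 16: 328.32,
       32: 588.59, 64: 1003.22, 128: 1433.27, 256: 1586.86}
lg4 = {1: 21.65, 2: 50.89, 4: 91.23, 8: 163.06, 16: 277.48,
       32: 615.83, 64: 1027.87, 128: 1527.06, 256: 1882.65}

for concurrency in lg2:
    ratio = lg2[concurrency] / lg4[concurrency]
    print(f"concurrency {concurrency:>3}: Large Generic 2 delivers "
          f"{ratio:.2f}x the throughput of Large Generic 4")
```

On this data, Large Generic 2 leads at concurrency levels up to 16, while Large Generic 4 pulls ahead from 32 concurrent requests onward; weigh that against the hardware and cost difference noted at the start of this section.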