Meta Llama 3.3 (70B)
- meta.llama-3.3-70b-instruct: hosted on one Large Generic unit of a dedicated AI cluster for all available regions except UAE East (Dubai)
- meta.llama-3.3-70b-instruct-fp8-dynamic: hosted on one LARGE_GENERIC_V1 unit of a dedicated AI cluster for the UAE East (Dubai) region only
Random Length
This scenario mimics text generation use cases where the sizes of the prompt and response are unknown ahead of time. Because the lengths are unknown, we use a stochastic approach in which both follow a normal distribution: the prompt length has a mean of 480 tokens and a standard deviation of 240 tokens, and the response length has a mean of 300 tokens and a standard deviation of 150 tokens.
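For readers who want to reproduce this load shape, here is a minimal sketch of how such lengths could be drawn. The sampling details (rounding, clipping to a positive minimum, the seed) are assumptions, since the benchmark harness itself is not published here.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # assumed seed, for reproducibility

def sample_lengths(n_requests: int):
    """Draw (prompt_len, response_len) pairs for the Random Length scenario.

    Prompt:   normal(mean=480, sd=240) tokens
    Response: normal(mean=300, sd=150) tokens
    Clipping to a minimum of 1 token is an assumption; the source does not
    say how non-positive samples are handled.
    """
    prompts = rng.normal(loc=480, scale=240, size=n_requests)
    responses = rng.normal(loc=300, scale=150, size=n_requests)
    prompts = np.clip(np.round(prompts), 1, None).astype(int)
    responses = np.clip(np.round(responses), 1, None).astype(int)
    return list(zip(prompts, responses))

print(sample_lengths(3))  # three (prompt_len, response_len) pairs
```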
- The meta.llama-3.3-70b-instruct model hosted on one Large Generic unit of a dedicated AI cluster for all regions except the Saudi Arabia Central (Riyadh) and UAE East (Dubai) regions:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.05 | 58.64 | 58.02 | 5.19 | 0.19 | 143.72 |
| 2 | 0.06 | 57.98 | 114.84 | 5.34 | 0.37 | 286.52 |
| 4 | 0.06 | 56.74 | 224.06 | 5.29 | 0.75 | 574.60 |
| 8 | 0.07 | 54.74 | 425.30 | 5.44 | 1.44 | 1,086.78 |
| 16 | 0.09 | 50.89 | 775.13 | 5.94 | 2.59 | 1,999.12 |
| 32 | 0.16 | 44.32 | 1,296.53 | 6.59 | 4.53 | 3,456.77 |
| 64 | 0.40 | 35.74 | 1,914.20 | 8.52 | 6.58 | 5,132.42 |
| 128 | 1.29 | 25.60 | 2,314.73 | 11.93 | 8.49 | 6,334.64 |
| 256 | 4.09 | 15.27 | 1,976.65 | 20.16 | 8.09 | 5,691.50 |
- The meta.llama-3.3-70b-instruct model hosted on one Large Generic unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.05 | 71.89 | 70.69 | 4.24 | 0.23 | 181.67 |
| 2 | 0.05 | 71.98 | 141.95 | 4.15 | 0.48 | 365.92 |
| 4 | 0.05 | 69.95 | 275.70 | 4.36 | 0.91 | 707.80 |
| 8 | 0.05 | 67.52 | 531.75 | 4.57 | 1.74 | 1,327.51 |
| 16 | 0.06 | 62.77 | 982.23 | 4.99 | 3.17 | 2,475.30 |
| 32 | 0.09 | 52.94 | 1,639.05 | 5.74 | 5.47 | 4,294.03 |
| 64 | 0.16 | 42.07 | 2,522.18 | 7.24 | 8.49 | 6,564.64 |
| 128 | 0.47 | 28.89 | 3,274.75 | 10.69 | 11.11 | 8,678.22 |
| 256 | 1.42 | 16.84 | 3,407.77 | 18.21 | 12.07 | 9,006.65 |
- The meta.llama-3.3-70b-instruct-fp8-dynamic model hosted on one LARGE_GENERIC_V1 unit of a dedicated AI cluster for the UAE East (Dubai) region:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.16 | 48.04 | 46.17 | 6.70 | 8.79 | 111.27 |
| 2 | 0.17 | 47.60 | 92.31 | 6.35 | 18.78 | 234.74 |
| 4 | 0.19 | 44.98 | 173.37 | 7.10 | 33.47 | 455.10 |
| 8 | 0.19 | 41.03 | 316.43 | 7.62 | 62.35 | 795.71 |
| 16 | 0.22 | 33.54 | 514.93 | 8.85 | 107.34 | 1,365.97 |
| 32 | 0.29 | 24.98 | 759.52 | 12.40 | 151.90 | 1,939.62 |
| 64 | 0.64 | 16.78 | 984.11 | 18.71 | 197.12 | 2,554.59 |
| 128 | 1.70 | 9.84 | 1,099.59 | 31.40 | 226.32 | 2,846.33 |
| 256 | 17.22 | 6.88 | 1,094.51 | 59.29 | 226.27 | 2,874.42 |
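A quick sanity check on how these columns relate: total throughput appears to count prompt tokens as well as generated tokens, so at low concurrency it should be roughly concurrency × (mean prompt + response tokens) / request latency. This relation is my reading of the columns, not a documented formula; a rough check against the concurrency-1 row of the first table:

```python
# Concurrency-1 row of the first Random Length table (all-regions cluster).
concurrency = 1
mean_tokens_per_request = 480 + 300   # mean prompt + mean response tokens
request_latency_s = 5.19

estimate = concurrency * mean_tokens_per_request / request_latency_s
print(f"{estimate:.1f} tokens/s")     # ~150.3, close to the reported 143.72
```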
Chat
This scenario covers chat and dialog use cases where the prompts and responses are short. The prompt and response lengths are each fixed at 100 tokens.
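The Concurrency column in the tables below corresponds to the number of requests kept in flight at once. A minimal sketch of such a load driver follows; send_chat_request is a hypothetical stand-in for whatever client the harness uses, and the 100-token shaping is only indicated by its arguments.

```python
import asyncio
import time

async def send_chat_request(prompt_tokens: int, max_tokens: int) -> None:
    """Hypothetical placeholder for one chat completion call."""
    await asyncio.sleep(0)  # replace with a real API call

async def worker(deadline: float) -> int:
    """Issue back-to-back requests until the deadline, keeping one in flight."""
    done = 0
    while time.monotonic() < deadline:
        await send_chat_request(prompt_tokens=100, max_tokens=100)
        done += 1
    return done

async def run(concurrency: int, duration_s: float = 60.0) -> None:
    deadline = time.monotonic() + duration_s
    counts = await asyncio.gather(*(worker(deadline) for _ in range(concurrency)))
    print(f"concurrency={concurrency}: {sum(counts)} requests completed")

asyncio.run(run(concurrency=8))
```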
- The meta.llama-3.3-70b-instruct model hosted on one Large Generic unit of a dedicated AI cluster for all regions except the Saudi Arabia Central (Riyadh) and UAE East (Dubai) regions:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.03 | 58.84 | 58.33 | 1.71 | 0.58 | 113.87 |
| 2 | 0.04 | 58.51 | 115.31 | 1.73 | 1.15 | 225.20 |
| 4 | 0.05 | 57.70 | 225.43 | 1.77 | 2.25 | 440.20 |
| 8 | 0.08 | 56.45 | 429.30 | 1.83 | 4.29 | 839.09 |
| 16 | 0.09 | 53.98 | 820.89 | 1.92 | 8.21 | 1,602.31 |
| 32 | 0.17 | 49.80 | 1,453.58 | 2.16 | 14.54 | 2,839.35 |
| 64 | 0.31 | 44.96 | 2,457.59 | 2.51 | 24.58 | 4,800.51 |
| 128 | 0.63 | 36.70 | 3,484.65 | 3.34 | 34.85 | 6,797.06 |
| 256 | 1.33 | 24.95 | 3,137.39 | 5.34 | 31.37 | 6,131.39 |
- The meta.llama-3.3-70b-instruct model hosted on one Large Generic unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.02 | 70.87 | 70.46 | 1.42 | 0.70 | 137.38 |
| 2 | 0.03 | 71.03 | 139.91 | 1.42 | 1.40 | 272.93 |
| 4 | 0.03 | 69.90 | 275.32 | 1.45 | 2.75 | 537.34 |
| 8 | 0.05 | 68.57 | 532.09 | 1.49 | 5.32 | 1,039.21 |
| 16 | 0.06 | 65.47 | 1,000.33 | 1.58 | 10.00 | 1,952.54 |
| 32 | 0.13 | 59.57 | 1,762.88 | 1.79 | 17.63 | 3,442.56 |
| 64 | 0.21 | 52.50 | 2,933.83 | 2.10 | 29.34 | 5,729.27 |
| 128 | 0.52 | 43.10 | 4,243.57 | 2.84 | 42.44 | 8,285.42 |
| 256 | 1.06 | 27.89 | 5,129.28 | 4.65 | 51.29 | 10,008.78 |
- The meta.llama-3.3-70b-instruct-fp8-dynamic model hosted on one LARGE_GENERIC_V1 unit of a dedicated AI cluster for the UAE East (Dubai) region:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.08 | 48.06 | 46.46 | 2.14 | 27.88 | 90.59 |
| 2 | 0.08 | 47.76 | 92.60 | 2.15 | 55.56 | 180.97 |
| 4 | 0.11 | 46.29 | 177.34 | 2.25 | 106.40 | 346.25 |
| 8 | 0.10 | 41.94 | 323.36 | 2.46 | 194.02 | 630.83 |
| 16 | 0.23 | 37.87 | 556.47 | 2.85 | 333.88 | 1,086.10 |
| 32 | 0.35 | 29.60 | 852.79 | 3.70 | 511.68 | 1,664.38 |
| 64 | 0.48 | 20.76 | 1,191.76 | 5.25 | 715.06 | 2,325.16 |
| 128 | 0.79 | 12.25 | 1,378.27 | 8.87 | 826.96 | 2,691.00 |
| 256 | 3.23 | 7.21 | 1,342.09 | 16.97 | 805.25 | 2,620.44 |
Generation Heavy
This scenario is for generation-heavy use cases where the model response is long, for example, a long job description generated from a short bullet list of items. For this case, the prompt length is fixed at 100 tokens and the response length at 1,000 tokens.
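To hold the response at exactly 1,000 tokens, a harness typically caps generation with a max-token limit and, where the server supports it, disables early stopping at end-of-sequence. The payload below is a generic illustration; the field names are placeholders, not a specific provider's schema.

```python
import json

# Generic request body for a fixed-shape Generation Heavy run.
# Field names are illustrative placeholders; adapt them to the target API.
payload = {
    "prompt": "<text padded or truncated to exactly 100 tokens>",
    "max_tokens": 1000,   # hard cap on generated tokens
    "ignore_eos": True,   # keep generating past end-of-sequence, if supported
}
print(json.dumps(payload, indent=2))
```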
- The meta.llama-3.3-70b-instruct model hosted on one Large Generic unit of a dedicated AI cluster for all regions except the Saudi Arabia Central (Riyadh) and UAE East (Dubai) regions:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.03 | 58.67 | 58.30 | 17.06 | 0.06 | 63.82 |
| 2 | 0.04 | 58.32 | 113.26 | 17.17 | 0.11 | 124.07 |
| 4 | 0.05 | 57.67 | 226.53 | 17.38 | 0.23 | 248.17 |
| 8 | 0.08 | 56.64 | 439.73 | 17.72 | 0.44 | 481.54 |
| 16 | 0.14 | 54.48 | 863.09 | 18.48 | 0.86 | 945.33 |
| 32 | 0.15 | 50.83 | 1,529.11 | 19.80 | 1.53 | 1,674.84 |
| 64 | 0.26 | 47.10 | 2,960.77 | 21.47 | 2.96 | 3,242.25 |
| 128 | 0.59 | 39.95 | 4,332.27 | 25.60 | 4.33 | 4,743.64 |
| 256 | 1.37 | 28.47 | 4,197.95 | 36.47 | 4.20 | 4,597.71 |
- The meta.llama-3.3-70b-instruct model hosted on one Large Generic unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.02 | 71.70 | 71.62 | 13.95 | 0.07 | 78.45 |
| 2 | 0.03 | 71.85 | 143.25 | 13.93 | 0.14 | 156.92 |
| 4 | 0.04 | 70.78 | 279.83 | 14.15 | 0.28 | 306.47 |
| 8 | 0.06 | 69.80 | 546.34 | 14.37 | 0.55 | 598.40 |
| 16 | 0.08 | 67.47 | 1,066.03 | 14.88 | 1.07 | 1,167.35 |
| 32 | 0.13 | 62.06 | 1,931.09 | 16.23 | 1.93 | 2,115.00 |
| 64 | 0.28 | 56.97 | 3,575.74 | 17.82 | 3.58 | 3,915.91 |
| 128 | 0.49 | 47.49 | 5,876.91 | 21.53 | 5.88 | 6,436.45 |
| 256 | 1.10 | 31.50 | 7,660.84 | 32.82 | 7.66 | 8,389.08 |
- The meta.llama-3.3-70b-instruct-fp8-dynamic model hosted on one LARGE_GENERIC_V1 unit of a dedicated AI cluster for the UAE East (Dubai) region:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.08 | 48.04 | 46.63 | 20.87 | 2.80 | 51.07 |
| 2 | 0.09 | 48.43 | 93.27 | 20.72 | 5.60 | 102.15 |
| 4 | 0.13 | 47.35 | 186.54 | 21.22 | 11.19 | 204.30 |
| 8 | 0.17 | 45.78 | 359.64 | 21.99 | 21.58 | 393.90 |
| 16 | 0.31 | 42.00 | 639.44 | 24.10 | 38.37 | 700.29 |
| 32 | 0.38 | 35.04 | 1,065.59 | 28.89 | 63.94 | 1,167.20 |
| 64 | 0.48 | 27.70 | 1,719.72 | 36.55 | 103.18 | 1,883.30 |
| 128 | 0.84 | 18.49 | 2,279.01 | 54.86 | 136.74 | 2,496.10 |
| 256 | 12.49 | 10.14 | 1,923.79 | 112.88 | 115.43 | 2,106.78 |
RAG
The retrieval-augmented generation (RAG) scenario has a very long prompt and a short response, as in summarization use cases. The prompt length is fixed at 2,000 tokens and the response length at 200 tokens.
- The meta.llama-3.3-70b-instruct model hosted on one Large Generic unit of a dedicated AI cluster for all regions except the Saudi Arabia Central (Riyadh) and UAE East (Dubai) regions:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.15 | 58.36 | 55.63 | 3.56 | 0.28 | 600.44 |
| 2 | 0.21 | 57.01 | 107.46 | 3.70 | 0.54 | 1,160.14 |
| 4 | 0.43 | 55.58 | 197.86 | 4.02 | 0.99 | 2,135.93 |
| 8 | 0.76 | 51.24 | 339.08 | 4.67 | 1.70 | 3,659.93 |
| 16 | 1.17 | 41.90 | 528.08 | 5.97 | 2.64 | 5,701.12 |
| 32 | 1.77 | 29.93 | 740.37 | 8.52 | 3.70 | 7,992.66 |
| 64 | 2.39 | 17.06 | 831.99 | 14.07 | 4.16 | 8,980.85 |
| 128 | 5.24 | 9.28 | 793.96 | 26.69 | 3.97 | 8,570.79 |
| 256 | 18.88 | 5.36 | 668.72 | 56.04 | 3.34 | 7,219.15 |
- The meta.llama-3.3-70b-instruct model hosted on one Large Generic unit of a dedicated AI cluster for the Saudi Arabia Central (Riyadh) region:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.14 | 72.03 | 68.29 | 2.90 | 0.34 | 737.19 |
| 2 | 0.21 | 70.65 | 131.24 | 3.03 | 0.66 | 1,416.72 |
| 4 | 0.42 | 68.48 | 238.49 | 3.34 | 1.19 | 2,574.37 |
| 8 | 0.74 | 62.70 | 402.85 | 3.94 | 2.01 | 4,348.39 |
| 16 | 1.19 | 50.86 | 615.70 | 5.15 | 3.08 | 6,646.93 |
| 32 | 1.50 | 32.62 | 821.95 | 7.64 | 4.11 | 8,873.44 |
| 64 | 1.79 | 18.54 | 989.99 | 12.53 | 4.95 | 10,686.14 |
| 128 | 2.70 | 9.82 | 1,054.49 | 22.96 | 5.27 | 11,384.10 |
| 256 | 5.92 | 4.91 | 995.45 | 46.42 | 4.98 | 10,745.88 |
- The meta.llama-3.3-70b-instruct-fp8-dynamic model hosted on one LARGE_GENERIC_V1 unit of a dedicated AI cluster for the UAE East (Dubai) region:
| Concurrency | Time to First Token (TTFT) (seconds) | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) | Total Throughput (tokens/second) |
|---|---|---|---|---|---|---|
| 1 | 0.61 | 47.82 | 41.63 | 4.77 | 12.49 | 449.51 |
| 2 | 0.71 | 44.86 | 76.59 | 5.15 | 22.98 | 826.74 |
| 4 | 0.81 | 37.37 | 129.16 | 6.14 | 38.75 | 1,394.37 |
| 8 | 0.88 | 27.43 | 194.45 | 8.13 | 58.33 | 2,099.01 |
| 16 | 1.02 | 17.67 | 256.65 | 12.28 | 77.00 | 2,770.52 |
| 32 | 1.24 | 10.19 | 302.47 | 20.76 | 90.74 | 3,265.01 |
| 64 | 10.99 | 7.16 | 318.93 | 38.77 | 95.68 | 3,443.02 |
| 128 | 47.31 | 7.16 | 318.49 | 75.10 | 95.55 | 3,438.12 |
| 256 | 117.96 | 7.16 | 305.59 | 145.75 | 91.68 | 3,299.34 |
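One way to read the RAG tables: with a 2,000-token prompt and a short response, TTFT at concurrency 1 is dominated by prompt processing (prefill), so prompt_tokens / TTFT gives a rough lower bound on prefill speed. This is a back-of-the-envelope reading on my part; TTFT also includes network and queueing overhead, so the true prefill rate is at least this high.

```python
# Concurrency-1 rows of the three RAG tables (2,000-token prompt).
prompt_tokens = 2000
ttft_s = {
    "all-regions Large Generic": 0.15,
    "Riyadh Large Generic": 0.14,
    "Dubai LARGE_GENERIC_V1 (fp8)": 0.61,
}
for cluster, t in ttft_s.items():
    # Lower bound: assumes TTFT is entirely prefill time.
    print(f"{cluster}: ~{prompt_tokens / t:,.0f} prompt tokens/s prefill")
```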