Dedicated AI Cluster Performance Benchmarks in Generative AI
Review the inference speed, latency, and throughput in several scenarios when one or more concurrent users call large language models hosted on dedicated AI clusters in OCI Generative AI.
The benchmarks are provided for models in the following families:
The following metrics are used for the benchmarks. For metric definitions, see About the Metrics.
| Metric | Unit |
|---|---|
| Token-level inference speed | tokens per second (TPS) |
| Token-level throughput | tokens per second (TPS) |
| Request-level latency | seconds |
| Request-level throughput | requests per minute (RPM) |
About the Metrics
Review the definitions for the following benchmark metrics.
- Metric 1: Token-level inference speed

  This metric is defined as the number of output tokens generated per unit of end-to-end latency (see the computation sketch after this list).

  For applications that need to match the average human reading speed, focus on scenarios where the inference speed is 5 tokens/second or more, which is the average human reading speed.

  Other scenarios, such as dialog and chat, require faster, near-real-time token generation, for example 15 tokens/second. In these scenarios, the number of concurrent users that can be served is lower, and the overall throughput is lower.
- Metric 2: Token-level throughput

  This metric quantifies the average total number of tokens generated by the server across all simultaneous user requests. It provides an aggregate measure of server capacity and efficiency in serving requests across users.

  When inference speed is less critical, such as in offline batch processing tasks, focus on where throughput peaks and therefore server cost efficiency is highest. This indicates the LLM's capacity to handle a high number of concurrent requests, which is ideal for batch processing or background tasks where an immediate response is not essential.

  Note: The token-level throughput benchmark was done using the LLMPerf tool. The throughput computation has an issue in that it includes the time required to encode the generated text for token counting.
- Metric 3: Request-level latency

  This metric represents the average time elapsed between request submission and completion of the request, that is, after the last token of the request is generated.
- Metric 4: Request-level throughput

  The number of requests served per unit of time, in this case per minute.
- Concurrency

  The number of users that make requests at the same time.
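To make the relationship between these metrics concrete, here is a minimal Python sketch that computes them from a hypothetical list of per-request benchmark records. The record fields (`start`, `end`, `output_tokens`) and the function names are illustrative assumptions for this sketch; they are not part of the OCI Generative AI API or the LLMPerf tool.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RequestRecord:
    """One completed request from a hypothetical benchmark run (illustrative fields)."""
    start: float        # request submission time, in seconds
    end: float          # time the last output token was generated, in seconds
    output_tokens: int  # number of tokens generated for this request

def token_level_inference_speed(r: RequestRecord) -> float:
    """Output tokens generated per second of end-to-end latency, for one request."""
    return r.output_tokens / (r.end - r.start)

def request_level_latency(records: List[RequestRecord]) -> float:
    """Average seconds between request submission and generation of the last token."""
    return sum(r.end - r.start for r in records) / len(records)

def token_level_throughput(records: List[RequestRecord]) -> float:
    """Total output tokens across all concurrent requests per second of wall-clock time."""
    wall_clock = max(r.end for r in records) - min(r.start for r in records)
    return sum(r.output_tokens for r in records) / wall_clock

def request_level_throughput(records: List[RequestRecord]) -> float:
    """Requests served per minute of wall-clock time."""
    wall_clock = max(r.end for r in records) - min(r.start for r in records)
    return len(records) / wall_clock * 60

# Two concurrent requests: 150 tokens in 10 s (15 TPS) and 180 tokens in 12 s (15 TPS)
runs = [RequestRecord(start=0.0, end=10.0, output_tokens=150),
        RequestRecord(start=0.5, end=12.5, output_tokens=180)]
print(request_level_latency(runs), token_level_throughput(runs))
```

In this example, each request is generated at 15 tokens/second, which meets the near-real-time threshold described for dialog and chat scenarios and is well above the 5 tokens/second average human reading speed.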
The performance (inference speed, throughput, and latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it hosts. Traffic scenarios depend on the following factors; a sketch of how a scenario might be parameterized follows this list:
- The number of concurrent requests.
- The number of tokens in the prompt.
- The number of tokens in the response.
- The variance in the number of prompt and response tokens across requests.
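As a rough illustration, the hypothetical structure below captures these four factors when describing a benchmark traffic scenario. The class and field names are assumptions made for this sketch and do not correspond to any OCI Generative AI or LLMPerf interface.

```python
from dataclasses import dataclass
import random

@dataclass
class TrafficScenario:
    """Hypothetical description of a traffic scenario for benchmarking (illustrative only)."""
    concurrent_requests: int   # number of users making requests at the same time
    mean_prompt_tokens: int    # average number of tokens in the prompt
    mean_response_tokens: int  # average number of tokens in the response
    token_std_dev: float       # spread of prompt/response lengths across requests

    def sample_request(self):
        """Draw one (prompt_tokens, response_tokens) pair for a simulated request."""
        prompt = max(1, round(random.gauss(self.mean_prompt_tokens, self.token_std_dev)))
        response = max(1, round(random.gauss(self.mean_response_tokens, self.token_std_dev)))
        return prompt, response

# Example: a chat-like scenario with short prompts and responses and 8 concurrent users
chat_like = TrafficScenario(concurrent_requests=8,
                            mean_prompt_tokens=100,
                            mean_response_tokens=100,
                            token_std_dev=20)
print(chat_like.sample_request())
```

Two scenarios with the same average prompt and response lengths but different spreads can load the cluster differently, which is why the variance across requests is called out as its own factor.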