Using Aggregate Functions
You can use built in aggregate functions to find information such as a count, a sum, an average, a minimum, or a maximum.
The following functions are called SQL aggregate functions, because their semantics are similar to those in standard SQL: they work in conjunction with grouping and they aggregate values across the rows of a group. The aggregate functions can be used only in the SELECT or ORDER BY clauses, and they cannot be nested.
If you want to follow along with the examples, create tables and insert the data as described in the Tables used in the Examples topic.
The following aggregate functions are supported:
- long count(*)
- long count(any*)
- number sum(any*)
- number avg(any*)
- any_atomic min(any*)
- any_atomic max(any*)
- ARRAY(any) array_collect(DISTINCT any*)
- ARRAY(any) array_collect(any*)
- long count(DISTINCT any*)
Note:
All SQL aggregate function names are case sensitive.count(*) function
The count star
function returns the number of rows in a group.
long count(*)
The count star
function calculates the number of records fetched by the query.
return type: long
Example 6-14 Find the total number of passengers who have contact details in their records
SELECT count(*) AS COUNT_PASSENGER
FROM BaggageInfo bag
WHERE length(contactPhone) > 0
Explanation: In an airline baggage tracking application, you can calculate the total count of passengers who have furnished their contact details. The contactPhone
field in the BaggageInfo
table includes the contact details of the passengers. You use the count star
function to find the number of passenger records with the contactPhone
entry. There is a possibility that the contact details include NULL values and empty strings. You can use the length function in the WHERE clause to exclude such rows from being counted. If a contactPhone
has an empty or a NULL value, the length function returns a NULL value. You apply a value comparison to select only the rows that yield a value greater than 0, and then determine the total count of the resultant rows using the count star
function.
{"COUNT_PASSENGER":21}
Example 6-15 Find the number of checked bags that arrived on the 1st of Feb 2019
SELECT count(*) AS COUNT_BAGS
FROM BaggageInfo bag,
EXTRACT(DAY FROM CAST (bag.bagInfo[0].bagArrivalDate AS Timestamp(0))) $t1,
EXTRACT(MONTH FROM CAST (bag.bagInfo[0].bagArrivalDate AS Timestamp(0))) $t2,
EXTRACT(YEAR FROM CAST (bag.bagInfo[0].bagArrivalDate AS Timestamp(0))) $t3
WHERE $t3=2019 AND $t2=02 AND $t1=01
Explanation: In the airline baggage tracking application, you can get the total count of checked bags on a particular date. The bagArrivalDate
field in the BaggageInfo
table contains the arrival date of the passenger's checked bags. In the above query, you count the number of rows that have the bagArrivalDate
as 1st of Feb 2019 to fetch the number of checked bags on the given date. You use several filter conditions here to extract only the date part of the timestamp. You first use the CAST operator to convert the bagArrivalDate
to a timestamp and then extract the date, month, and year details from the timestamp using the EXTRACT expression. You use the value comparison to determine if the day, month, and year correspond to the required date value, that is, 01,02, and 2019 respectively. You then use the logical operator AND to select only the rows that match all three conditions and count the resultant rows using the count function.
{"COUNT_BAGS":1}
count function
The count
function computes its input expression on each row in a group and counts all the non-NULL values returned by these evaluations of the input expression.
long count(any*)
any: The count
function accepts any parameter type as the input argument.
return type: long
The count star
function can be used when you want to count the rows of a resultant query. It does not accept any other input argument. Whereas, the count
function can be used when you want to enumerate the outcome of an input expression.
Example 6-16 Find the total number of checked bags that are estimated to arrive at the LAX airport at a particular time
SELECT $estdate as ARRIVALDATE,
count($flight) AS COUNT
FROM BaggageInfo $bag,
$bag.bagInfo.flightLegs.estimatedArrival $estdate,
$bag.bagInfo.flightLegs.flightNo $flight,
$bag.bagInfo.flightLegs.fltRouteDest $flt_dest
WHERE $estdate =any "2019-02-01T11:00:00Z" AND $flt_dest =any "LAX"
GROUP BY $estdate
Explanation: In an airline baggage tracking application, you can get the total count of checked bags that are estimated to arrive at a particular airport and time. For each flight leg, the estimatedArrival
field in the flightLegs
array of the BaggageInfo
table contains the arrival time of the checked bags and the fltRouteDest
field contains the destination airport code. In the above query, to determine the total number of checked bags arriving at the LAX airport at a given time, you first group the data with the estimated arrival time value using the GROUP BY clause. From the group, you select only the rows that have the destination airport as LAX. You then determine the bag count for the resultant rows using the count function.
Here, you can compare the string-formatted dates in ISO-8601 format due to the natural sorting order of strings without having to cast them into timestamp data types. The $bag.bagInfo.flightLegs.estimatedArrival
and $bag.bagInfo.flightLegs.fltRouteDest
are sequences. Since the comparison operator '=' cannot operate on sequences of more than one item, the sequence comparison operator '=any' is used instead to compare the estimatedArrival
and fltRouteDest
fields.
{"ARRIVALDATE":"2019-02-01T11:00:00Z","COUNT":2}
Example 6-17 Find the number of watchers for a particular show in a TV streaming application
SELECT count($a.contentStreamed[$element.showName = "Bienvenu"]) AS WATCHERS
FROM stream_acct $s, $s.acct_data $a
Explanation: In a TV streaming application, the list of shows watched by a subscriber is stored in the contentStreamed
array. Each element of the array corresponds to a single show. The showName
field includes the name of each show. To fetch the number of watchers for a particular show, you use the array-filter step expression in the count
function. You check whether or not the showName
field matches the given show (in this example, Bienvenu
) for each subscriber and count the total number of such subscribers using the count
function.
{"WATCHERS":4}
sum function
sum
function has type long, double, or number, depending on the type of the input items:
- If there is at least one input item of type number, the result will be a number.
- If there is at least one item of type double or float, the result will be double, else the result will be of type long.
- If the input items are a mix of long, doubles, and numbers, the result will be of type double.
- If numeric values are not returned by the sum function’s input, the result is NULL.
long sum(any*)
any: The sum
function accepts any parameter type as the input argument.
return type: long
Example 6-18 Find the total screen time spent on a show by a subscriber in the TV streaming application
SELECT sum($content.seriesInfo[].episodes[].minWatched) AS MINSWATCHED
FROM stream_acct $s, $s.acct_data.contentStreamed[] $content
WHERE acct_id = 1 AND $content.showName = "At the Ranch"
Explanation: In the TV streaming application, you can calculate the total screen time spent on a show by each subscriber. Each subscriber is associated with a unique account ID, stored in the acct_id
field of the stream_acct
table. The information of the shows is stored in the contentStreamed
array. The showName
field holds the name of the show and the minWatched
field stores the time lapsed for each episode in each season of the show. In this query, you use the sum
function to add the values of the minWatched
fields of all the episodes in all the seasons to calculate the total screen time for the subscriber with account ID 1 and show named At the Ranch
.
{"MINSWATCHED":225}
avg function
- If there is at least one input item of type number, the result will be a number, else the result will be double.
- If numeric values are not returned by avg function's input, the result is NULL.
number avg(any*)
any: The avg
function accepts any parameter type as the input argument.
return type: number
Example 6-19 Find the average screen time spent on a show by a subscriber
SELECT avg($content.seriesInfo[].episodes[].minWatched) AS AVERAGETIME
FROM stream_acct $s, $s.acct_data.contentStreamed[] $content
WHERE acct_id = 1 AND $content.showName = "At the Ranch"
Explanation: In the TV streaming application, you can calculate the average screen time spent on a show by each subscriber. Each subscriber is associated with a unique account ID, stored in the acct_id
field of the stream_acct
table. The information of the shows is stored in the contentStreamed
array. The showName
field holds the name of the series and the minWatched
field stores the time lapsed for each episode in each season of the show. In this query, you use the avg
function on the minWatched
field to calculate the average screen time spent on the show named At the Ranch
by a subscriber with account id 1. The avg
function first calculates the total screen time by adding the values in the minWatched
fields for the given show. It then divides the sum by the number of minWatched
fields to calculate the average value.
{"AVERAGETIME":56.25}
min function
min
function returns the minimum value among all the values returned by the evaluations of the input expression on each row in a group. The input expression is evaluated as follows:
- An error is displayed if it can be determined during the compile time that the values returned by the input expression belong to a type for which an order comparison is not defined (for example, RECORD, MAP, BINARY, or FIXED_BINARY). Otherwise, the min value for each group is initialized to NULL.
- Let M be the current minimum value and N be the next input value. The M and N are
compared using Value Comparison Operators. If M is NULL, M is set to N. Else, if N is less than M, that means N
can be the minimum value and hence M is set to N, and N is set to the next input value.
This is continued until all the values in the input expression are compared and a
minimum value is ascertained. When the values are not comparable, the following order is
used:
numeric values < timestamps < strings and enums < booleans
- If N is a record, map, array, binary, or fixed binary value, NULL, or JSON null, it is skipped and the next input value is considered.
any_atomic min(any*)
any: The min
function accepts any parameter type as the input argument.
return type: atomic data type
max function
The max
function returns the maximum value in all the sequences returned by the evaluations of the input expression on each row in a group. The specific rules are the same as for the max
function, except that the current max value M will be replaced by the next input value N if N is not skipped and is greater than M.
any_atomic max(any*)
any: The max
function accepts any parameter type as the input argument.
return type: atomic data type
Example 6-20 For a given show, find the minimum and maximum screen time
SELECT min($content.seriesInfo[].episodes[].lengthMin) AS MINTIME
max($content.seriesInfo[].episodes[].lengthMin) AS MAXTIME
FROM stream_acct $s, $s.acct_data.contentStreamed[] $content
WHERE $content.showName = "At the Ranch"
lengthMin
field in the stream_acct
table stores the length of each episode for a show. In this query, you use:
- The
min
function on thelengthMin
field to fetch the duration of the episode from the showAt the Ranch
that has the least screen time. - The
max
function on thelengthMin
field to fetch the duration of the episode from the showAt the Ranch
that has the most screen time.
{"MINTIME":45,"MAXTIME":85}
Example 6-21 Aggregate Function - Fetch the age and average income of users from the User data table
CREATE INDEX idx11 ON users (age)
SELECT
age, count(*) AS count, avg(income) AS income
FROM users
GROUP BY age
Explanation: Consider an application that maintains the user data. See User data application table in the Tables used in the Examples section. The users
table includes multiple records of users with the same age group. You use the aggregate functions to retrieve the count of such subscribers and their average income. The above query groups users by their age, and for each age, returns the number of users with the same age group and their average income.
{"age":22,"count":2,"income":50000.0}
{"age":45,"count":1,"income":75000.0}
array_collect(DISTINCT any*) function
The array_collect(DISTINCT any*)
function computes the input expression on
each row of a group and collects all the resulting distinct non NULL values into an
array.
ARRAY(any) array_collect(DISTINCT any*)
The array_collect
function
computes the input expression on each row of a group. The input expression may be any kind
of expression, except a SELECT expression. It collects all the resulting distinct values
(except NULL values) into an array. The function returns the constructed array.
Note:
DISTINCT causes values to be compared for equality using the semantics of the value comparison operator, with the following exceptions: strings are comparable with strings only (not enums and timestamps), enums are comparable with enums only (not strings), and timestamps are comparable with timestamps only (not strings).Example 6-22 In the TV streaming application, return the set of distinct show ids (as an array) that have been viewed in every country.
SELECT acct.acct_data.country,
array_collect(distinct acct.acct_data.contentStreamed.showId) AS shows
FROM stream_acct acct group by acct.acct_data.country
Explanation: The TV streaming application streams various shows that are watched by customers across the globe. Here you want a list of distinct show ids for every country. You group the data based on the country and list the show ids for every country as an array. You eliminate duplicate show ids with the distinct operator.
{"country":"USA","shows":[16,15]}
{"country":"France","shows":[15]}
{"country":"Germany","shows":[26,15]}
Example 6-23 In the TV streaming application, return the set of distinct genres (as an array) that have been viewed by customers in every country.
SELECT acct.acct_data.country,
array_collect(distinct acct.acct_data.contentStreamed.genres[]) AS genres
FROM stream_acct acct group by acct.acct_data.country
acct.acct_data.contentStreamed.genres[]
.You eliminate duplicate genres
with the distinct operator.
Note:
In the above query, you useacct.acct_data.contentStreamed.genres[]
as you want
the details of all the genres to be fetched in a single array. If you omit
[]
in acct.acct_data.contentStreamed.genres
, details
of the genres will be fetched as independent arrays for each row and distinct will be
applied only on independent arrays and not in the entire set as shown
below.SELECT acct.acct_data.country,
array_collect(distinct acct.acct_data.contentStreamed.genres) AS genres
FROM stream_acct acct group by acct.acct_data.country
Output:
{"country":"USA","genres":[["comedy","drama","danish"],["comedy","french"]]}
{"country":"France","genres":[["comedy","french"]]}
{"country":"Germany","genres":[["action","crime","spanish"],["comedy","french"]]}
In the above example, you can see that the data is grouped based on the country and for each country, the genres watched by customers are fetched as a row-level array. For example, for the country USA, you see 2 arrays of genres. The DISTINCT is applied at the array level. Only if the arrays are identical, then DISTINCT fetches only one array. Else all the arrays are fetched back as shown in the result above.
{"country":"USA","genres":["drama","danish","comedy","french"]}
{"country":"France","genres":["comedy","french"]}
{"country":"Germany","genres":["spanish","comedy","action","crime","french"]}
array_collect(any*) function
The array_collect(any*)
function computes the input expression on each row
of a group and collects all the resulting non NULL values into an array.
ARRAY(any) array_collect(any*)
The array_collect
function computes the input
expression on each row of a group. The input expression may be any kind of expression,
except a SELECT expression. The function collects all the resulting values (except NULL
values) into an array and returns the populated array. The array_collect
function permits duplicate values to be inserted into the array.
Example 6-24 In the TV streaming application, return the first name and last name of customers (as an array) in USA and Germany.
SELECT acct.acct_data.country,
array_collect(
{"firstName":acct.acct_data.firstName,
"lastName":acct.acct_data.lastName}) as user_info
FROM stream_acct acct WHERE
acct.acct_data.country IN ('USA','Germany')
group by acct.acct_data.country
Explanation: The TV streaming application streams various shows that are watched by customers across the globe. Here you want to fetch the user information (first name and last name) of customers in USA and Germany. You group the data based on the country and filter the data only for two countries USA and Germany. For each of these two countries , you fetch the first name and last name of all the customers and populate that in a single array.
{"country":"USA","user_info":[{"firstName":"John","lastName":"Lewis"}]}
{"country":"Germany","user_info":[{"firstName":"Angela","lastName":"Mercel"}]}
long count(DISTINCT any*) function
The count
function computes the input expression on each row of a group
and counts all the distinct non-NULL values returned by the input expression.
long count(DISTINCT any*)
count
function computes the input
expression on each row of a group. The input expression may be any kind of expression
other than a subquery. The function counts all the distinct non-NULL values returned by
the input expression. The return type for the count function is long.
Note:
DISTINCT causes values to be compared for equality using the semantics of the value comparison operator with the following exceptions: strings are comparable with strings only (not enums and timestamps), enums are comparable with enums only (not strings), and timestamps are comparable with timestamps only (not strings).Example 6-25 In the TV streaming application, return the count of distinct show ids for every genre viewed by customers.
SELECT $genre, count(distinct $content.showId) AS show_count
FROM stream_acct acct, acct.acct_data.contentStreamed[] as $content,
$content.genres[] as $genre, $content.showId as $showid group by $genre
acct.acct_data.contentStreamed[]
. You group the data based on the
genres and for every genre , you fetch the list of showid as an array. You eliminate
duplicate show ids with the distinct operator. Then the count of distinct show ids for every
genre is returned.
Note:
In the above query, if you omit[]
in acct.acct_data.contentStreamed
,
details of the content streamed will be fetched as independent arrays for each row and
count(distinct )will be applied only on independent arrays and not in the entire
set.
{"genres":"crime","show_count":1}
{"genres":"action","show_count":1}
{"genres":"comedy","show_count":2}
{"genres":"spanish","show_count":1}
{"genres":"french","show_count":1}
{"genres":"drama","show_count":1}
{"genres":"danish","show_count":1}