10.5.4 rqGroupEval Function
The rqGroupEval function is a user-defined function that identifies a grouping column.
The user defines an rqGroupEval function in PL/SQL using the SQL object rqGroupEvalImpl, which is a generic implementation of the group apply functionality in SQL. The implementation supports data-parallel execution, in which one or more R engines perform the same R function, or task, on different partitions of data. The data is partitioned according to the values of the grouping column.
Only one grouping column is supported. If you have multiple columns, then combine the columns into one column and use the new column as the grouping column.
The rqGroupEval function executes the R function in the script specified by the EXP_NAM parameter. You pass data to the R function with the INP_CUR parameter. You can pass arguments to the R function with the PAR_CUR parameter.
The R function returns an R data.frame object, which appears as a SQL table in the database. You define the form of the returned value with the OUT_QRY parameter.
To create an rqGroupEval function, you create the following two PL/SQL objects:
-
A PL/SQL package that specifies the types of the result to return.
-
A function that takes the return value of the package and uses the return value with
PIPELINED_PARALLEL_ENABLEset to indicate the column on which to partition data.
Syntax
rqGroupEval (
INP_CUR REF CURSOR IN
PAR_CUR REF CURSOR IN
OUT_QRY VARCHAR2 IN
GRP_COL VARCHAR2 IN
EXP_NAM VARCHAR2 IN)Parameters
| Parameter | Description |
|---|---|
|
|
A cursor that specifies the data to pass to the R function specified by the |
|
|
A cursor that contains argument values to pass to the R function. |
|
|
One of the following:
|
|
|
The name of the grouping column by which to partition the data. |
|
|
The name of a script in the OML4R script repository. |
Return Value
The user-defined rqGroupEval function returns a table that has the structure specified by the OUT_QRY parameter value.
Examples
This example has a PL/SQL block that drops the script myC5.0Function to ensure that the script does not exist in the OML4R script repository. It then creates a function and stores it as the script myC5.0Function in the script repository.
The R function accepts two arguments: the data on which to operate and a prefix to use in creating datastores. The function uses the C50 package to build C5.0 models on the churn data set from C50. The function builds one churn model on the data for each state.
The myC5.0Function function loads the C50 package so that the function body has access to it when the function executes in an R engine on the database server. The function then creates a datastore name using the datastore prefix and the name of a state. To exclude the state name from the model, the function deletes the column from the data.frame. Because factors in the data.frame are converted to character vectors when they are loaded in the user-defined embedded R function, the myC5.0Function function explicitly converts the character vectors back to R factors.
The myC5.0Function function gets the data for the state from the specified columns and then creates a model for the state and saves the model in a datastore. The R function returns TRUE to have a simple value that can appear as the result of the function execution.
The example next creates a PL/SQL package, churnPkg, and a user-defined function, churnGroupEval. In defining an rqGroupEval function implementation, the PARALLEL_ENABLE clause is optional but the CLUSTER BY clause is required.
Finally, the example executes a SELECT statement that invokes the churnGroupEval function. In the INP_CUR argument of the churnGroupEval function, the SELECT statement specifies the PARALLEL hint to use parallel execution of the R function and the data set to pass to the R function. The INP_CUR argument of the churnGroupEval function specifies connecting to OML4R and the datastore prefix to pass to the R function. The OUT_QRY argument specifies returning the value in XML format, the GRP_NAM argument specifies using the state column of the data set as the grouping column, and the EXP_NAM argument specifies the myC5.0Function script in the script repository as the R function to invoke.
For each of 50 states plus Washington, D.C., the SELECT statement returns from the churnGroupEval table function the name of the state and an XML string that contains the value TRUE.
Example 10-22 Using an rqGroupEval Function
BEGIN
sys.rqScriptDrop('myC5.0Function');
sys.rqScriptCreate('myC5.0Function',
'function(dat, datastorePrefix) {
library(C50)
datastoreName <- paste(datastorePrefix, dat[1, "state"], sep = "_")
dat$state <- NULL
dat$churn <- as.factor(dat$churn)
dat$area_code <- as.factor(dat$area_code)
dat$international_plan <- as.factor(dat$international_plan)
dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
ore.save(mod, name = datastoreName)
TRUE
}');
END;
/
CREATE OR REPLACE PACKAGE churnPkg AS
TYPE cur IS REF CURSOR RETURN CHURN_TRAIN%ROWTYPE;
END churnPkg;
/
CREATE OR REPLACE FUNCTION churnGroupEval(
inp_cur churnPkg.cur,
par_cur SYS_REFCURSOR,
out_qry VARCHAR2,
grp_col VARCHAR2,
exp_txt CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH ("state"))
CLUSTER inp_cur BY ("state")
USING rqGroupEvalImpl;
/
SELECT *
FROM table(churnGroupEval(
cursor(SELECT * /*+ parallel(t,4) */ FROM CHURN_TRAIN t),
cursor(SELECT 1 AS "ore.connect",
'myC5.0model' AS "datastorePrefix" FROM dual),
'XML', 'state', 'myC5.0Function'));