6.2.6 Use the ore.rowApply Function
The ore.rowApply
function invokes an R script with an ore.frame
as the input data.
The ore.rowApply
function passes the ore.frame
to the user-defined input function as the first argument to that function. The rows
argument to the ore.rowApply
function specifies the number of rows to pass to each invocation of the user-defined R function. The last chunk or rows may have fewer rows than the number specified. The ore.rowApply
function can use data-parallel execution, in which one or more R engines perform the same R function, or task, on different partitions of data.
The syntax of the ore.rowApply
function is the following:
ore.rowApply(X, FUN, ..., FUN.VALUE = NULL, FUN.NAME = NULL, rows = 1, FUN.OWNER = NULL, parallel = getOption("ore.parallel", NULL))
The ore.rowApply
function returns an ore.list
object or an ore.frame
object.
See Also:
-
Arguments for Functions that Run Scripts for descriptions of the arguments to function
ore.rowApply
Example 6-14 Using the ore.rowApply Function
This example uses the e1071
package, previously downloaded from CRAN. The example does the following:
-
Loads the package
e1071
. -
Pushes the
iris
data set to the database as theIRIS
temporary table andore.frame
object. -
Creates the Naive Bayes model
nbmod
. -
Creates a copy of
IRIS
asIRIS_PRED
and adds the PRED column toIRIS_PRED
to contain the predictions. -
Invokes the
ore.rowApply
function, passing theIRIS
ore.frame
as the data source for user-defined R function and the user-defined R function itself. The user-defined function does the following:-
Loads the package
e1071
so that it is available to the R engine or engines that run in the database. -
Converts the Species column to a factor because, although the
ore.frame
defined factors, when they are loaded to the user-defined function, factors appear as character vectors. -
Invokes the
predict
method and returns theres
object, which contains the predictions in the column added to the data set.
-
-
Pulls the model to the client R session.
-
Passes
IRIS_PRED
as the argumentFUN.VALUE
, which specifies the structure of the object that theore.rowApply
function returns. -
Specifies the number of rows to pass to each invocation of the user-defined function.
-
Displays the class of
res
, and invokes thetable
function to display the Species column and the PRED column of theres
object.
library(e1071) IRIS <- ore.push(iris) nbmod <- ore.tableApply( ore.push(iris), function(dat) { library(e1071) dat$Species <- as.factor(dat$Species) naiveBayes(Species ~ ., dat) }) IRIS_PRED <- IRIS IRIS_PRED$PRED <- "A" res <- ore.rowApply( IRIS, function(dat, nbmod) { library(e1071) dat$Species <- as.factor(dat$Species) dat$PRED <- predict(nbmod, newdata = dat) dat }, nbmod = ore.pull(nbmod), FUN.VALUE = IRIS_PRED, rows = 10) class(res) table(res$Species, res$PRED)
Listing for This Example
R> library(e1071)
R> IRIS <- ore.push(iris)
R> nbmod <- ore.tableApply(
+ ore.push(iris),
+ function(dat) {
+ library(e1071)
+ dat$Species <- as.factor(dat$Species)
+ naiveBayes(Species ~ ., dat)
+ })
R> IRIS_PRED <- IRIS
R> IRIS_PRED$PRED <- "A"
R> res <- ore.rowApply(
+ IRIS ,
+ function(dat, nbmod) {
+ library(e1071)
+ dat$Species <- as.factor(dat$Species)
+ dat$PRED <- predict(nbmod, newdata = dat)
+ dat
+ },
+ nbmod = ore.pull(nbmod),
+ FUN.VALUE = IRIS_PRED,
+ rows = 10)
R> class(res)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> table(res$Species, res$PRED)
setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
This example uses the C50
package to score churn data (that is, to predict which customers are likely to churn) using C5.0 models. The example partitions the data by a number of rows. It scores the customers from the specified state in parallel. It uses datastores and saves functions to the OML4R script repository, which allows the functions to be used by the OML4R SQL API functions.
The example first loads C50
package and the data sets. It deletes the datastores with names containing myC5.0modelFL
, if they exist. It invokes ore.drop
to delete the CHURN_TEST table, if it exists, and then invokes ore.create
to create the CHURN_TEST table from the churnTest
data set.
The example next invokes ore.getLevels
, which returns a list
of the levels for each factor column. The invocation excludes the first column, which is state, because the levels for that column are not needed. Getting the levels first can ensure that all possible levels are provided during model building, even if some rows do not have values for some of the levels. The ore.delete
invocation ensures that no datastore with the specified name exists and the ore.save
invocation saves the xlevels
object in the datastore named myXLevels
.
The example creates a user-defined function, myC5.0FunctionForLevels
, that generates a C5.0 model. The function uses the list of levels returned by function ore.getXlevels
instead of computing the levels using the as.factor
function. It uses the levels to convert the column type from character vector to factor. The function myC5.0FunctionForLevels
returns the value TRUE.
The example saves the function in the script repository.
The example next gets a list of datastores that have names that include the specified string and deletes those datastores if they exist.
The example then invokes ore.groupApply
, which invokes function myC5.0FunctionForLevels
on each state in the CHURN_TEST
data. To each myC5.0FunctionForLevels
invocation, ore.groupApply
passes the datastore that contains the xlevels
object and a prefix to use in naming the datastore generated by myC5.0FunctionForLevels
. It also passes the ore.connect
control argument to connect to the database in the embedded R function, which enables the use of objects stored in a datastore. The ore.groupApply
invocation returns a list that contains the results of all of the invocations of myC5.0FunctionForLevels
.
The example pulls the result over to the local R session and verifies that myC5.0FunctionForLevels
returned TRUE
for each state in the data source.
The example next creates another user-defined another function, myScoringFunction
, and stores it in the script repository. The function scores a C5.0 model for the levels of a state and returns the results in a data.frame
.
The example then invokes function ore.rowApply
. It filters the input data to use only data for the state of Massachusetts. It specifies myScoringFunction
as the function to invoke and passes that user-defined function the name of the datastore that contains the xlevels
object and a prefix to use in loading the datastore that contains the C5.0 model for the state. The ore.rowApply
invocation specifies invoking myScoringFunction
on 200 rows of the data set in each parallel R engine. It uses the FUN.VALUE
argument so that ore.rowApply
returns an ore.frame
that contains the results of all of the myScoringFunction
invocations. The variable scores
gets the results of the ore.rowApply
invocation.
Finally, the example prints the scores
object and then uses the table function to display the confusion matrix for the scoring.
See Also:
Example A-8 for an invocation of the SQL rqRowEval
function that produces the same result as the ore.rowApply
function in this example
Example 6-15 Using the ore.rowApply Function with Datastores and Scripts
library(C50) data(churn) ore.drop("CHURN_TEST" ore.create(churnTest, "CHURN_TEST") xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1]) ore.delete("myXLevels") ore.save(xlevels, name = "myXLevels") ore.scriptDrop("myC5.0FunctionForLevels") ore.scriptCreate("myC5.0FunctionForLevels", function(dat, xlevelsDatastore, datastorePrefix) { library(C50) state <- dat[1,"state"] datastoreName <- paste(datastorePrefix, dat[1, "state"], sep = "_") dat$state <- NULL ore.load(name = xlevelsDatastore) for (j in names(xlevels)) dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]]) c5mod <- C5.0(churn ~ ., data = dat, rules = TRUE) ore.save(c5mod, name = datastoreName) TRUE }) ds.v <- ore.datastore(pattern= "myC5.0modelFL")$datastore.name for (ds in ds.v) ore.delete(name = ds) res <- ore.groupApply(CHURN_TEST, INDEX=CHURN_TEST$state, FUN.NAME = "myC5.0FunctionForLevels", xlevelsDatastore = "myXLevels", datastorePrefix = "myC5.0modelFL", ore.connect = TRUE) res <- ore.pull(res) all(as.logical(res) == TRUE) ore.scriptDrop("myScoringFunction") ore.scriptCreate("myScoringFunction", function(dat, xlevelsDatastore, datastorePrefix) { library(C50) state <- dat[1,"state"] datastoreName <- paste(datastorePrefix,state,sep="_") dat$state <- NULL ore.load(name = xlevelsDatastore) for (j in names(xlevels)) dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]]) ore.load(name = datastoreName) res <- data.frame(pred = predict(c5mod, dat, type = "class"), actual = dat$churn, state = state) res } ) scores <- ore.rowApply( CHURN_TEST[CHURN_TEST$state == "MA",], FUN.NAME = "myScoringFunction", xlevelsDatastore = "myXLevels", datastorePrefix = "myC5.0modelFL", ore.connect = TRUE, parallel = TRUE, FUN.VALUE = data.frame(pred = character(0), actual = character(0), state = character(0)), rows=200) scores table(scores$actual, scores$pred)
Listing for This Example
R> library(C50)
R> data(churn)
R>
R> ore.drop("CHURN_TEST"
R> ore.create(churnTest, "CHURN_TEST")
R>
R> xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
R> ore.delete("myXLevels")
[1] "myXLevels
R> ore.save(xlevels, name = "myXLevels")
R>
R> ore.scriptDrop("myC5.0FunctionForLevels")
R> ore.scriptCreate("myC5.0FunctionForLevels",
+ function(dat, xlevelsDatastore, datastorePrefix) {
+ library(C50)
+ state <- dat[1,"state"]
+ datastoreName <- paste(datastorePrefix, dat[1, "state"], sep = "_")
+ dat$state <- NULL
+ ore.load(name = xlevelsDatastore)
+ for (j in names(xlevels))
+ dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
+ c5mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
+ ore.save(c5mod, name = datastoreName)
+ TRUE
+ })
R>
R> ds.v <- ore.datastore(pattern="myC5.0modelFL")$datastore.name
R> for (ds in ds.v) ore.delete(name=ds)
R>
R> res <- ore.groupApply(CHURN_TEST,
+ INDEX=CHURN_TEST$state,
+ FUN.NAME="myC5.0FunctionForLevels",
+ xlevelsDatastore = "myXLevels",
+ datastorePrefix = "myC5.0modelFL",
+ ore.connect = TRUE)
R> res <- ore.pull(res)
R> all(as.logical(res) == TRUE)
[1] TRUE
R>
R> ore.scriptDrop("myScoringFunction")
R> ore.scriptCreate("myScoringFunction",
+ function(dat, xlevelsDatastore, datastorePrefix) {
+ library(C50)
+ state <- dat[1,"state"]
+ datastoreName <- paste(datastorePrefix,state,sep="_")
+ dat$state <- NULL
+ ore.load(name = xlevelsDatastore)
+ for (j in names(xlevels))
+ dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
+ ore.load(name = datastoreName)
+ res <- data.frame(pred = predict(c5mod, dat, type="class"),
+ actual = dat$churn,
+ state = state)
+ res
+ }
+ )
R>
R> scores <- ore.rowApply(
+ CHURN_TEST[CHURN_TEST$state =="MA",],
+ FUN.NAME = "myScoringFunction",
+ xlevelsDatastore = "myXLevels",
+ datastorePrefix = "myC5.0modelFL",
+ ore.connect = TRUE, parallel = TRUE,
+ FUN.VALUE = data.frame(pred=character(0),
+ actual=character(0),
+ state=character(0)),
+ rows=200
R>
R> scores
pred actual state
1 no no MA
2 no no MA
3 no no MA
4 no no MA
5 no no MA
6 no yes MA
7 yes yes MA
8 yes yes MA
9 no no MA
10 no no MA
11 no no MA
12 no no MA
13 no no MA
14 no no MA
15 yes yes MA
16 no no MA
17 no no MA
18 no no MA
19 no no MA
20 no no MA
21 no no MA
22 no no MA
23 no no MA
24 no no MA
25 no no MA
26 no no MA
27 no no MA
28 no no MA
29 no yes MA
30 no no MA
31 no no MA
32 no no MA
33 yes yes MA
34 no no MA
35 no no MA
36 no no MA
37 no no MA
38 no no MA
Warning message:
ORE object has no unique key - using random order
R> table(scores$actual, scores$pred)
no yes
no 32 0
yes 2 4
Parent topic: R Interface for Embedded R Execution