3.1.7 Sample Data
Sampling is an important capability for statistical analytics.
Typically, you sample data to reduce its size and to perform meaningful work on it. In R you usually must load data into memory to sample it. However, if the data is too large, this isn't possible.
In OML4R, instead of pulling the data from the database and then sampling, you can sample directly in the database and then pull only those records that are part of the sample. By sampling in the database, you minimize data movement and you can work with larger data sets. This capability is enabled by the integer row indexing that the ordering framework provides in the transparency layer.
The examples in this section illustrate several sampling techniques.
Example 3-10 Simple Random Sampling
This example demonstrates a simple selection of rows at random. The example creates a small data.frame object and pushes it to the database to create an ore.frame object, MYDATA. Out of 20 rows, the example samples 5. It uses the R sample function to produce a random set of indices that it uses to get the sample from MYDATA. The sample, simpleRandomSample, is an ore.frame object.
set.seed(1)
N <- 20
myData <- data.frame(a=1:N,b=letters[1:N])
MYDATA <- ore.push(myData)
head(MYDATA)
sampleSize <- 5
simpleRandomSample <- MYDATA[sample(nrow(MYDATA), sampleSize), , drop=FALSE]
class(simpleRandomSample)
simpleRandomSample
Listing for This Example
R> set.seed(1)
R> N <- 20
R> myData <- data.frame(a=1:N,b=letters[1:N])
R> MYDATA <- ore.push(myData)
R> head(MYDATA)
a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
R> sampleSize <- 5
R> simpleRandomSample <- MYDATA[sample(nrow(MYDATA), sampleSize), , drop=FALSE]
R> class(simpleRandomSample)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> simpleRandomSample
a b
2 2 b
7 7 g
10 10 j
12 12 l
19 19 s
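The same row-indexing pattern can be tried locally before any data is pushed to the database. The following is a minimal sketch in base R, assuming a local data.frame named localData (a hypothetical stand-in for MYDATA that needs no database connection). It checks that sample draws without replacement by default, so no row is selected twice.

```r
# Local base R sketch; localData is a hypothetical stand-in for MYDATA.
set.seed(1)
N <- 20
localData <- data.frame(a = 1:N, b = letters[1:N])

sampleSize <- 5
idx <- sample(nrow(localData), sampleSize)     # random row indices
localSample <- localData[idx, , drop = FALSE]  # keep the data.frame shape

# sample() draws without replacement by default, so the indices are distinct.
stopifnot(nrow(localSample) == sampleSize, !anyDuplicated(idx))
```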
Example 3-11 Split Data Sampling
This example demonstrates randomly partitioning data into training and testing sets. Such a split is commonly used in classification and regression to assess how well a model performs on new data. The example uses the MYDATA object created in the previous example.
The example produces a random set of indices to use as the test data set. It then creates the indicator vector group, which is 1 if the index is in the sample and 0 otherwise. Because R treats 1 as equal to TRUE and 0 as equal to FALSE in comparisons, the example uses row indexing to produce the training set where group==FALSE and the test set where group==TRUE. Notice that the number of rows in the training set is 15 and the number of rows in the test set is 5, as specified in the invocation of the sample function.
set.seed(1)
sampleSize <- 5
ind <- sample(1:nrow(MYDATA), sampleSize)
group <- as.integer(1:nrow(MYDATA) %in% ind)
MYDATA.train <- MYDATA[group==FALSE,]
dim(MYDATA.train)
MYDATA.test <- MYDATA[group==TRUE,]
dim(MYDATA.test)
Listing for This Example
R> set.seed(1)
R> sampleSize <- 5
R> ind <- sample(1:nrow(MYDATA), sampleSize)
R> group <- as.integer(1:nrow(MYDATA) %in% ind)
R> MYDATA.train <- MYDATA[group==FALSE,]
R> dim(MYDATA.train)
[1] 15 2
R> MYDATA.test <- MYDATA[group==TRUE,]
R> dim(MYDATA.test)
[1] 5 2
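A useful sanity check on a split is that the two sets are disjoint and together cover every row. The following is a minimal sketch in base R on a local data.frame, assuming the names localData, isTest, train, and test (all hypothetical, not part of the example above). It uses a logical membership vector rather than the integer indicator, which is an alternative formulation and not the method shown in the listing.

```r
# Local base R sketch; localData is a hypothetical stand-in for MYDATA.
set.seed(1)
N <- 20
localData <- data.frame(a = 1:N, b = letters[1:N])

sampleSize <- 5
ind <- sample(nrow(localData), sampleSize)      # row indices for the test set
isTest <- seq_len(nrow(localData)) %in% ind     # logical membership vector

train <- localData[!isTest, , drop = FALSE]
test  <- localData[isTest, , drop = FALSE]

# The two sets are disjoint and together cover every row.
stopifnot(nrow(train) == N - sampleSize,
          nrow(test) == sampleSize,
          length(intersect(train$a, test$a)) == 0,
          setequal(c(train$a, test$a), localData$a))
```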
Example 3-12 Systematic Sampling
This example demonstrates systematic sampling, in which rows are selected at regular intervals. The example uses the seq function to create a sequence of values that starts at 2 and increases in increments of 3. The sequence stops at or before the number of rows in MYDATA, so with 20 rows it selects rows 2, 5, 8, 11, 14, 17, and 20. The example re-creates the MYDATA object used in the first example.
set.seed(1)
N <- 20
myData <- data.frame(a=1:20,b=letters[1:N])
MYDATA <- ore.push(myData)
head(MYDATA)
start <- 2
by <- 3
systematicSample <- MYDATA[seq(start, nrow(MYDATA), by = by), , drop = FALSE]
systematicSample
Listing for This Example
R> set.seed(1)
R> N <- 20
R> myData <- data.frame(a=1:20,b=letters[1:N])
R> MYDATA <- ore.push(myData)
R> head(MYDATA)
a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
R> start <- 2
R> by <- 3
R> systematicSample <- MYDATA[seq(start, nrow(MYDATA), by = by), , drop = FALSE]
R> systematicSample
a b
2 2 b
5 5 e
8 8 h
11 11 k
14 14 n
17 17 q
20 20 t
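The index arithmetic behind the listing can be checked without a database. The following is a minimal sketch in base R; the names fixedIdx and randomStartIdx are hypothetical, and the random-start variant is an illustrative assumption rather than part of the example above.

```r
set.seed(1)
N <- 20    # number of rows, as in the example
by <- 3    # sampling interval

# Indices used in the example: from 2 up to N in steps of 3.
fixedIdx <- seq(2, N, by = by)
length(fixedIdx)   # 7 indices, not N

# Variant (assumption): pick the starting row at random from 1..by,
# so that every row has the same chance of being selected.
start <- sample(by, 1)
randomStartIdx <- seq(start, N, by = by)

stopifnot(all(diff(randomStartIdx) == by), max(randomStartIdx) <= N)
```

With a random start the sample contains either 6 or 7 rows here, depending on where the sequence begins.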
Example 3-13 Stratified Sampling
This example demonstrates stratified sampling, in which rows are selected within each group, where the group is determined by the values of a particular column. The example creates a data set that has each row assigned to a group. The function rnorm produces random normal numbers; in the call rnorm(N,4), the argument 4 is the desired mean of the distribution. The example splits the data according to group and then samples proportionately from each partition. Finally, it row binds the list of subset ore.frame objects into a single ore.frame object and then displays the values of the result, stratifiedSample.
set.seed(1)
N <- 200
myData <- data.frame(a=1:N,b=round(rnorm(N),2),
                     group=round(rnorm(N,4),0))
MYDATA <- ore.push(myData)
head(MYDATA)
sampleSize <- 10
stratifiedSample <- do.call(rbind,
                            lapply(split(MYDATA, MYDATA$group),
                                   function(y) {
                                     ny <- nrow(y)
                                     y[sample(ny, sampleSize*ny/N), , drop = FALSE]
                                   }))
stratifiedSample
Listing for This Example
R> set.seed(1)
R> N <- 200
R> myData <- data.frame(a=1:N,b=round(rnorm(N),2),
+ group=round(rnorm(N,4),0))
R> MYDATA <- ore.push(myData)
R> head(MYDATA)
a b group
1 1 -0.63 4
2 2 0.18 6
3 3 -0.84 6
4 4 1.60 4
5 5 0.33 2
6 6 -0.82 6
R> sampleSize <- 10
R> stratifiedSample <- do.call(rbind,
+ lapply(split(MYDATA, MYDATA$group),
+ function(y) {
+ ny <- nrow(y)
+ y[sample(ny, sampleSize*ny/N), , drop = FALSE]
+ }))
R> stratifiedSample
a b group
173|173 173 0.46 3
9|9 9 0.58 4
53|53 53 0.34 4
139|139 139 -0.65 4
188|188 188 -0.77 4
78|78 78 0.00 5
137|137 137 -0.30 5
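The listing returns 7 rows although sampleSize is 10. A plausible reading is that the per-group size sampleSize*ny/N is fractional and is cut down to a whole number, so small groups contribute no rows. The following is a minimal sketch in base R on a local data.frame that makes this allocation explicit with floor; the names localData, strataSizes, allocation, and localStratified are hypothetical, and the explicit floor is an assumption about the intent, not the code used in the example.

```r
# Local base R sketch; localData is a hypothetical stand-in for MYDATA.
set.seed(1)
N <- 200
localData <- data.frame(a = 1:N, b = round(rnorm(N), 2),
                        group = round(rnorm(N, 4), 0))
sampleSize <- 10

# Rows per group, and the whole number of rows each group contributes.
strataSizes <- vapply(split(localData, localData$group), nrow, integer(1))
allocation <- floor(sampleSize * strataSizes / N)

# Sample within each group using the explicit allocation, then row bind.
localStratified <- do.call(rbind,
  lapply(split(localData, localData$group), function(y) {
    ny <- nrow(y)
    k <- floor(sampleSize * ny / N)   # proportional share, rounded down
    y[sample(ny, k), , drop = FALSE]
  }))

# Rounding down means the total can fall short of sampleSize.
stopifnot(nrow(localStratified) == sum(allocation),
          sum(allocation) <= sampleSize)
```

If exactly sampleSize rows are required, the shortfall has to be distributed across groups by some additional rule, which this sketch does not attempt.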
Example 3-15 Quota Sampling
This example demonstrates quota sampling, in which a number of consecutive records are selected as the sample. The example uses the head function to select the sample. The tail function could also have been used.
set.seed(1)
N <- 200
myData <- data.frame(a=1:N,b=round(runif(N),2))
MYDATA <- ore.push(myData)
sampleSize <- 10
quotaSample1 <- head(MYDATA, sampleSize)
quotaSample1
Listing for This Example
R> set.seed(1)
R> N <- 200
R> myData <- data.frame(a=1:N,b=round(runif(N),2))
R> MYDATA <- ore.push(myData)
R> sampleSize <- 10
R> quotaSample1 <- head(MYDATA, sampleSize)
R> quotaSample1
a b
1 1 0.15
2 2 0.75
3 3 0.98
4 4 0.97
5 5 0.35
6 6 0.39
7 7 0.95
8 8 0.11
9 9 0.93
10 10 0.35
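The tail alternative mentioned above selects the last consecutive records instead of the first. The following is a minimal sketch in base R on a local data.frame; the names localData and quotaSample2 are hypothetical, and the sketch assumes that tail applied to MYDATA would behave the same way as it does on a local data.frame.

```r
# Local base R sketch; localData is a hypothetical stand-in for MYDATA.
set.seed(1)
N <- 200
localData <- data.frame(a = 1:N, b = round(runif(N), 2))

sampleSize <- 10
quotaSample2 <- tail(localData, sampleSize)   # last 10 consecutive rows

# The sample holds rows 191 through 200.
stopifnot(identical(quotaSample2$a, 191:200))
```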