13.5.5 Run a User-Defined Python Function on Sets of Rows
Use the oml.row_apply function to chunk data into sets of rows and then run a user-defined Python function on each chunk.
The oml.row_apply function passes the data of the oml.DataFrame specified by the data argument, one chunk at a time, as the first argument to the user-defined Python function specified by func. The rows argument specifies the maximum number of rows of the oml.DataFrame to assign to each chunk. The last chunk may have fewer rows than the number specified.
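The chunking rule described above can be pictured with a small local sketch. It does not use OML4Py: chunk_rows is a hypothetical helper, and a plain Python list stands in for the rows of an oml.DataFrame. It only illustrates how a rows value caps the chunk size and why the last chunk can be shorter.

```python
def chunk_rows(records, rows):
    """Yield consecutive slices holding at most `rows` records each.

    Hypothetical helper for illustration only; OML4Py does the
    chunking inside the database.
    """
    if rows < 1:
        raise ValueError("rows must be a positive integer")
    for start in range(0, len(records), rows):
        yield records[start:start + rows]


# Ten stand-in "rows" split into chunks of at most four rows.
records = list(range(10))
chunks = list(chunk_rows(records, rows=4))
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

With ten rows and rows=4, the sketch produces two full chunks and a final chunk holding the two remaining rows.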
The oml.row_apply function runs the Python function in a database-spawned Python engine. The function can use data-parallel execution, in which one or more Python engines perform the same Python function on different chunks of the data.
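As a rough local analogy for data-parallel execution, the following sketch runs the same function on different chunks using worker threads from the Python standard library. This is an assumption-laden stand-in: OML4Py uses database-spawned Python engines rather than threads, and local_row_apply, chunk_rows, and double_all are hypothetical names that are not part of the OML4Py API.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_rows(records, rows):
    """Split records into lists of at most `rows` items (hypothetical helper)."""
    return [records[i:i + rows] for i in range(0, len(records), rows)]


def double_all(chunk):
    """The function applied to every chunk: double each value."""
    return [2 * value for value in chunk]


def local_row_apply(records, func, rows=1, workers=1):
    """Apply func to each chunk with `workers` threads and combine the results.

    Local illustration only; threads stand in for Python engines.
    """
    chunks = chunk_rows(records, rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the input chunks.
        results = list(pool.map(func, chunks))
    # Flatten the per-chunk results into one combined list.
    return [item for part in results for item in part]


combined = local_row_apply(list(range(10)), double_all, rows=4, workers=2)
print(combined)
```

Each worker runs the same function; only the chunk it receives differs. The sketch keeps the chunks in input order when it combines them, which is a property of Executor.map and not something this section states about oml.row_apply.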
The syntax of the function is the following.
oml.row_apply(data, func, func_owner=None, rows=1, parallel=None, graphics=False, **kwargs)
The data argument is an oml.DataFrame that contains the data that the func function operates on.
The func argument is the function to run. It may be one of the following:
- A Python function
- A string that is the name of a user-defined Python function in the OML4Py script repository
- A string that defines a Python function
- An oml.script.script.Callable object returned by the oml.script.load function
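To make the third form concrete, the sketch below shows what a string that defines a Python function can look like. The function name add_ratio, its scale parameter, and the list-of-dictionaries input are made up for illustration. The use of exec here is only a local way to show that the string is a complete function definition; this section does not say how OML4Py evaluates such a string.

```python
# A complete function definition held in a string (hypothetical example).
func_source = """def add_ratio(dat, scale=1.0):
    # dat stands in for one chunk of rows (here, a list of dicts).
    return [dict(row, ratio=scale * row['a'] / row['b']) for row in dat]
"""

# Local illustration only: turn the string into a callable.
namespace = {}
exec(func_source, namespace)
add_ratio = namespace["add_ratio"]

chunk = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
result = add_ratio(chunk, scale=2.0)
print(result)
```

The first parameter receives the chunk, and any further parameters, such as scale here, correspond to additional arguments supplied by the caller.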
The optional func_owner argument is a string or None (the default) that specifies the owner of the registered user-defined Python function when argument func is a registered user-defined Python function name.
The rows argument is an int that specifies the maximum number of rows to include in each chunk.
The parallel argument is a boolean, an int, or None (the default) that specifies the preferred degree of parallelism to use in the Embedded Python Execution job. The value may be one of the following:
- A positive integer greater than or equal to 1 for a specific degree of parallelism
- False, None, or 0 for no parallelism
- True for the default data parallelism
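The accepted values can be restated as a small lookup function. This is a hypothetical helper written only to mirror the list above, not code from OML4Py. One Python detail matters when writing such a check: bool is a subclass of int, so True and False have to be tested before the integer cases.

```python
def describe_parallel(parallel=None):
    """Return a description of a parallel value (hypothetical helper)."""
    # Test None and the booleans first, because isinstance(True, int) is True.
    if parallel is None or parallel is False:
        return "no parallelism"
    if parallel is True:
        return "default data parallelism"
    if isinstance(parallel, int):
        if parallel == 0:
            return "no parallelism"
        if parallel >= 1:
            return f"degree of parallelism {parallel}"
    raise ValueError(f"unsupported parallel value: {parallel!r}")


for value in (None, False, 0, True, 2):
    print(repr(value), "->", describe_parallel(value))
```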
The graphics argument is a boolean that specifies whether to look for images. The default value is False.
With the **kwargs parameter, you can pass additional arguments to the func function. Special control arguments, which start with oml_, are not passed to the function specified by func, but instead control what happens before or after the running of the function.
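The separation between control arguments and ordinary arguments can be sketched as follows. The helper split_kwargs and the key oml_demo_option are hypothetical and exist only for this illustration; the sketch shows the prefix rule described above, not how OML4Py implements it.

```python
def split_kwargs(**kwargs):
    """Separate oml_-prefixed control arguments from the rest (hypothetical helper)."""
    control = {k: v for k, v in kwargs.items() if k.startswith("oml_")}
    passed = {k: v for k, v in kwargs.items() if not k.startswith("oml_")}
    return control, passed


# oml_demo_option is a made-up control argument name.
control, passed = split_kwargs(regr="model", threshold=0.5, oml_demo_option=True)
print("control arguments:", control)
print("passed to func:", passed)
```

Under this rule, only regr and threshold would reach the user-defined function as keyword arguments.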
The oml.row_apply function returns a pandas.DataFrame or a list of oml.embed.data_image._DataImage objects. If no image is rendered in the user-defined Python function, oml.row_apply returns a pandas.DataFrame. Otherwise, it returns a list of oml.embed.data_image._DataImage objects.
Example 13-9 Using the oml.row_apply Function
This example creates the x and y variables using the iris data set. It then creates the persistent database table IRIS and the oml.DataFrame object oml_iris as a proxy for the table.
The example builds a regression model based on iris data. It defines a function that predicts the Petal_Width values based on the Sepal_Length, Sepal_Width, and Petal_Length columns of the input data. It then concatenates the Species column, the Petal_Width column, and the predicted Petal_Width as the object to return. Finally, the example calls the oml.row_apply function to apply the make_pred() function on each 4-row chunk of the input data.
import oml
import pandas as pd
from sklearn import datasets
from sklearn import linear_model

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                           {0: 'setosa', 1: 'versicolor',
                            2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

# Drop the IRIS database table if it exists.
try:
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Build a regression model to predict Petal_Width using in-memory
# data.
iris = oml_iris.pull()
regr = linear_model.LinearRegression()
regr.fit(iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length']],
         iris[['Petal_Width']])
regr.coef_

# Define a Python function.
def make_pred(dat, regr):
    import pandas as pd
    import numpy as np
    pred = regr.predict(dat[['Sepal_Length',
                             'Sepal_Width',
                             'Petal_Length']])
    return pd.concat([dat[['Species', 'Petal_Width']],
                      pd.DataFrame(pred,
                                   columns=['Pred_Petal_Width'])],
                     axis=1)

input_data = oml_iris.split(ratio=(0.9, 0.1), strata_cols='Species')[1]
input_data.crosstab(index = 'Species').sort_values('Species')

res = oml.row_apply(input_data, rows=4, func=make_pred,
                    regr=regr, parallel=2)
type(res)
res
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn import linear_model
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
...                            {0: 'setosa', 1: 'versicolor',
...                             2:'virginica'}[x], iris.target)),
...                  columns = ['Species'])
>>>
>>> # Drop the IRIS database table if it exists.
... try:
...     oml.drop('IRIS')
... except:
...     pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
>>> oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Build a regression model to predict Petal_Width using in-memory
... # data.
... iris = oml_iris.pull()
>>> regr = linear_model.LinearRegression()
>>> regr.fit(iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length']],
...          iris[['Petal_Width']])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
>>> regr.coef_
array([[-0.20726607, 0.22282854, 0.52408311]])
>>>
>>> # Define a Python function.
... def make_pred(dat, regr):
...     import pandas as pd
...     import numpy as np
...     pred = regr.predict(dat[['Sepal_Length',
...                              'Sepal_Width',
...                              'Petal_Length']])
...     return pd.concat([dat[['Species', 'Petal_Width']],
...                       pd.DataFrame(pred,
...                                    columns=['Pred_Petal_Width'])],
...                      axis=1)
>>>
>>> input_data = oml_iris.split(ratio=(0.9, 0.1), strata_cols='Species')[1]
>>> input_data.crosstab(index = 'Species').sort_values('Species')
      SPECIES  count
0      setosa      7
1  versicolor      8
2   virginica      4
>>> res = oml.row_apply(input_data, rows=4, func=make_pred,
...                     regr=regr, parallel=2)
>>> type(res)
<class 'pandas.core.frame.DataFrame'>
>>> res
       Species  Petal_Width  Pred_Petal_Width
0       setosa          0.4          0.344846
1       setosa          0.3          0.335509
2       setosa          0.2          0.294117
3       setosa          0.2          0.220982
4       setosa          0.2          0.080937
5   versicolor          1.5          1.504615
6   versicolor          1.3          1.560570
7   versicolor          1.0          1.008352
8   versicolor          1.3          1.131905
9   versicolor          1.3          1.215622
10  versicolor          1.3          1.272388
11   virginica          1.8          1.623561
12   virginica          1.8          1.878132