13.5.5 Run a User-Defined Python Function on Sets of Rows

Use the oml.row_apply function to chunk data into sets of rows and then run a user-defined Python function on each chunk.

The oml.row_apply function passes the oml.DataFrame specified by the data argument as the first argument to the user-defined func Python function. The rows argument specifies the maximum number of rows of the oml.DataFrame to assign to each chunk. The last chunk of rows may have fewer rows than the number specified.
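
The chunking rule can be pictured locally, without a database connection. The following is a minimal sketch in plain Python that assumes the rows are held in a list. The chunk_rows helper is hypothetical and is not part of OML4Py. It only illustrates that each chunk holds at most rows rows and that the last chunk may be smaller. It makes no claim about the order in which the database assigns rows to chunks.

```python
def chunk_rows(records, rows=1):
    """Split a list of records into consecutive chunks of at most `rows` items.

    Hypothetical helper for illustration only; oml.row_apply performs the
    chunking inside the database.
    """
    if rows < 1:
        raise ValueError("rows must be a positive integer")
    return [records[i:i + rows] for i in range(0, len(records), rows)]


# 19 records split with rows=4: four full chunks and one final chunk of 3.
records = list(range(19))
chunks = chunk_rows(records, rows=4)
chunk_sizes = [len(c) for c in chunks]
print(chunk_sizes)
```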

The oml.row_apply function runs the Python function in a database-spawned Python engine. The function can use data-parallel execution, in which one or more Python engines run the same Python function on different chunks of the data.
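
As a local analogy for data-parallel execution, the sketch below runs one function on several chunks with a small pool of worker threads. It assumes nothing about how the database schedules its Python engines. The summarize function and the chunk contents are hypothetical, and the threads merely stand in for engines.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(chunk):
    """Hypothetical per-chunk function: return the chunk size and its sum."""
    return len(chunk), sum(chunk)


# Pre-built chunks of at most 4 values each.
chunks = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]

# Two local worker threads stand in for two Python engines; each worker
# runs the same function on a different chunk. map() keeps input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(summarize, chunks))

print(results)
```

Because every chunk is processed independently, a function used this way should not rely on state shared between chunks.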

The syntax of the function is the following.

oml.row_apply(data, func, func_owner=None, rows=1, parallel=None, graphics=False, **kwargs)

The data argument is an oml.DataFrame that contains the data that the func function operates on.

The func argument is the function to run. It may be one of the following:

A Python function
A string that is the name of a user-defined Python function in the OML4Py script repository
A string that defines a Python function
An oml.script.script.Callable object returned by the oml.script.load function
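
When func is a string that defines a Python function, a syntax error in the string may not surface until the function is run. One possible precaution, sketched below, is to compile the string locally first. The source text, the count_rows function, and the load_function helper are all hypothetical. The sketch assumes the string contains exactly one def statement and says nothing about how OML4Py itself processes the string.

```python
import ast

# Hypothetical function source, written the way it might be passed as func.
func_source = """def count_rows(dat):
    return len(dat)
"""


def load_function(source):
    """Compile `source` locally and return the single function it defines."""
    tree = ast.parse(source)  # raises SyntaxError on malformed source
    defs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    if len(defs) != 1:
        raise ValueError("expected exactly one function definition")
    namespace = {}
    exec(compile(tree, "<func_source>", "exec"), namespace)
    return namespace[defs[0].name]


# Try the function on a small local stand-in for one chunk of rows.
count_rows = load_function(func_source)
local_result = count_rows([("a", 1), ("b", 2), ("c", 3)])
print(local_result)
```

After the local check passes, the same string could be supplied as the func argument.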

The optional func_owner argument is a string or None (the default). When func is the name of a registered user-defined Python function, func_owner specifies the owner of that function.

The rows argument is an int that specifies the maximum number of rows to include in each chunk. The default value is 1.

The parallel argument is a boolean, an int, or None (the default) that specifies the preferred degree of parallelism to use in the Embedded Python Execution job. The value may be one of the following:

An integer greater than or equal to 1 for a specific degree of parallelism
False, None, or 0 for no parallelism
True for the default data parallelism
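
The accepted values can be summarized in a small lookup. The describe_parallel helper below is hypothetical and is not part of OML4Py. It only restates the rules listed above. Treating any other value, such as a negative integer, as an error is an assumption of the sketch.

```python
def describe_parallel(parallel):
    """Map a `parallel` value to the behavior described above.

    Hypothetical helper for illustration; not part of OML4Py.
    """
    # Booleans are checked by identity first because bool is a subclass
    # of int, so True == 1 and False == 0 under ordinary comparison.
    if parallel is None or parallel is False:
        return "no parallelism"
    if parallel is True:
        return "default data parallelism"
    if isinstance(parallel, int):
        if parallel == 0:
            return "no parallelism"
        if parallel >= 1:
            return f"degree of parallelism {parallel}"
    raise ValueError(f"unsupported parallel value: {parallel!r}")


for value in (None, False, 0, True, 1, 2):
    print(repr(value), "->", describe_parallel(value))
```

Note that True and 1 are distinct here: True requests the default data parallelism, while 1 requests a degree of parallelism of exactly 1.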

The graphics argument is a boolean that specifies whether to look for images. The default value is False.

With the **kwargs parameter, you can pass additional arguments to the func function. Special control arguments, which start with oml_, are not passed to the function specified by func, but instead control what happens before or after the running of the function.
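
The split between the two kinds of keyword arguments can be sketched as follows. The split_kwargs helper, the scale argument, and the oml_example_flag control argument are all hypothetical names chosen for illustration. The sketch shows only the naming rule (the oml_ prefix), not what any particular control argument does.

```python
def split_kwargs(**kwargs):
    """Separate control arguments (names starting with 'oml_') from the rest.

    Hypothetical helper for illustration; not part of OML4Py.
    """
    control = {k: v for k, v in kwargs.items() if k.startswith("oml_")}
    passed = {k: v for k, v in kwargs.items() if not k.startswith("oml_")}
    return control, passed


# 'oml_example_flag' is a made-up control argument name; 'scale' is a
# made-up argument intended for the user-defined function.
control_args, func_args = split_kwargs(scale=2.0, oml_example_flag=True)
print(control_args)
print(func_args)
```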

The oml.row_apply function returns a pandas.DataFrame or a list of oml.embed.data_image._DataImage objects. If no image is rendered in the user-defined Python function, oml.row_apply returns a pandas.DataFrame. Otherwise, it returns a list of oml.embed.data_image._DataImage objects.

Example 13-9 Using the oml.row_apply Function

This example creates the x and y variables using the iris data set. It then creates the persistent database table IRIS and the oml.DataFrame object oml_iris as a proxy for the table.

The example builds a regression model based on the iris data. It defines a function, make_pred, that predicts the Petal_Width values based on the Sepal_Length, Sepal_Width, and Petal_Length columns of the input data. The function concatenates the Species column, the Petal_Width column, and the predicted Petal_Width values into the object that it returns. Finally, the example calls the oml.row_apply function to apply the make_pred function to each chunk of up to four rows of the input data.

import oml
import pandas as pd
from sklearn import datasets
from sklearn import linear_model

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, 
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x: 
                           {0: 'setosa', 1: 'versicolor', 
                            2:'virginica'}[x], iris.target)), 
                 columns = ['Species'])

# Drop the IRIS database table if it exists.
try:
    oml.drop('IRIS')
except Exception:
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Build a regression model to predict Petal_Width using in-memory 
# data.
iris = oml_iris.pull()
regr = linear_model.LinearRegression()
regr.fit(iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length']],
         iris[['Petal_Width']])
regr.coef_

# Define a Python function.
def make_pred(dat, regr):
    import pandas as pd
    import numpy as np
    pred = regr.predict(dat[['Sepal_Length', 
                             'Sepal_Width',
                             'Petal_Length']])
    return pd.concat([dat[['Species', 'Petal_Width']], 
                     pd.DataFrame(pred, 
                                  columns=['Pred_Petal_Width'])], 
                                  axis=1)

input_data = oml_iris.split(ratio=(0.9, 0.1), strata_cols='Species')[1]
input_data.crosstab(index = 'Species').sort_values('Species')

res = oml.row_apply(input_data, rows=4, func=make_pred, 
                    regr=regr, parallel=2)
type(res)
res

Listing for This Example

>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn import linear_model
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data, 
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: 
...                            {0: 'setosa', 1: 'versicolor', 
...                             2:'virginica'}[x], iris.target)), 
...                  columns = ['Species'])
>>>
>>> # Drop the IRIS database table if it exists.
... try:
...     oml.drop('IRIS')
... except Exception:
...     pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
>>> oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Build a regression model to predict Petal_Width using in-memory
... # data.
... iris = oml_iris.pull()
>>> regr = linear_model.LinearRegression()
>>> regr.fit(iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length']],
...          iris[['Petal_Width']])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
>>> regr.coef_
array([[-0.20726607,  0.22282854,  0.52408311]])
>>> 
>>> # Define a Python function.
... def make_pred(dat, regr):
...     import pandas as pd
...     import numpy as np
...     pred = regr.predict(dat[['Sepal_Length', 
...                              'Sepal_Width',
...                              'Petal_Length']])
...     return pd.concat([dat[['Species', 'Petal_Width']], 
...                      pd.DataFrame(pred, 
...                                   columns=['Pred_Petal_Width'])], 
...                                   axis=1)
>>>
>>> input_data = oml_iris.split(ratio=(0.9, 0.1), strata_cols='Species')[1]
>>> input_data.crosstab(index = 'Species').sort_values('Species')
      SPECIES  count
0      setosa      7
1  versicolor      8
2   virginica      4
>>> res = oml.row_apply(input_data, rows=4, func=make_pred,
...                     regr=regr, parallel=2)
>>> type(res)
<class 'pandas.core.frame.DataFrame'>
>>> res
       Species  Petal_Width  Pred_Petal_Width
0       setosa          0.4          0.344846
1       setosa          0.3          0.335509
2       setosa          0.2          0.294117
3       setosa          0.2          0.220982
4       setosa          0.2          0.080937
5   versicolor          1.5          1.504615
6   versicolor          1.3          1.560570
7   versicolor          1.0          1.008352
8   versicolor          1.3          1.131905
9   versicolor          1.3          1.215622
10  versicolor          1.3          1.272388
11   virginica          1.8          1.623561
12   virginica          1.8          1.878132
