8.1.2 Select Data
A typical step in preparing data for analysis is selecting or filtering values of interest from a larger data set.
The examples in this section demonstrate selecting data from an oml.DataFrame
object by rows, by columns, and by value.
The examples use the oml_iris
object created by the following code, which imports the sklearn.datasets
package and loads the iris
data set. It creates the x
and y
variables, and then creates the persistent database table IRIS and the oml.DataFrame
object oml.iris
as a proxy for the table.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x: {0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])
# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
The examples are in the following topics:
Select the First or Last Number of Rows
The head
and tail
methods return the first or last number of elements.
The default number of rows selected is 5.
Example 8-1 Selecting the First and Last Number of Rows
This example selects rows from the oml.DataFrame
object oml_iris
. It displays the first five rows and ten rows of oml_iris
and then the last five and ten rows.
# Display the first 5 rows.
oml_iris.head()
# Display the first 10 rows.
oml_iris.head(10)
# Display the last 5 rows.
oml_iris.tail()
# Display the last 10 rows.
oml_iris.tail(10)
Listing for This Example
>>> # Display the first 5 rows.
... oml_iris.head()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
>>>
>>> # Display the first 10 rows.
... oml_iris.head(10)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
>>>
>>> # Display the last 5 rows.
... oml_iris.tail()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 6.7 3.0 5.2 2.3 virginica
1 6.3 2.5 5.0 1.9 virginica
2 6.5 3.0 5.2 2.0 virginica
3 6.2 3.4 5.4 2.3 virginica
4 5.9 3.0 5.1 1.8 virginica
>>>
>>> # Display the last 10 rows.
... oml_iris.tail(10)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 6.7 3.1 5.6 2.4 virginica
1 6.9 3.1 5.1 2.3 virginica
2 5.8 2.7 5.1 1.9 virginica
3 6.8 3.2 5.9 2.3 virginica
4 6.7 3.3 5.7 2.5 virginica
5 6.7 3.0 5.2 2.3 virginica
6 6.3 2.5 5.0 1.9 virginica
7 6.5 3.0 5.2 2.0 virginica
8 6.2 3.4 5.4 2.3 virginica
9 5.9 3.0 5.1 1.8 virginica
Select Data by Column
Example 8-2 Selecting Data by Columns
The example selects two columns from oml_iris
and creates the oml.DataFrame
object iris_projected1
with them. It then displays the first three rows of iris_projected1
. The example also selects a range of columns from oml_iris
, creates iris_projected2
, and displays its first three rows. Finally, the example selects columns from oml_iris
by data types, creates iris_projected3
, and displays its first three rows.
# Select all rows with the specified column names.
iris_projected1 = oml_iris[:, ["Sepal_Length", "Petal_Length"]]
iris_projected1.head(3)
# Select all rows with columns whose indices are in the range [1, 4).
iris_projected2 = oml_iris[:, 1:4]
iris_projected2.head(3)
# Select all rows with columns of oml.String data type.
iris_projected3 = oml_iris.select_types(include=[oml.String])
iris_projected3.head(3)
Listing for This Example
>>> # Select all rows with specified column names.
... iris_projected1 = oml_iris[:, ["Sepal_Length", "Petal_Length"]]
>>> iris_projected1.head(3)
Sepal_Length Petal_Length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
>>>
>>> # Select all rows with columns whose indices are in range [1, 4).
... iris_projected2 = oml_iris[:, 1:4]
>>> iris_projected2.head(3)
Sepal_Width Petal_Length Petal_Width
0 3.5 1.4 0.2
1 3.0 1.4 0.2
2 3.2 1.3 0.2
>>>
>>> # Select all rows with columns of oml.String data type.
... iris_projected3 = oml_iris.select_types(include=[oml.String])
>>> iris_projected3.head(3)
Species
0 setosa
1 setosa
2 setosa
Select Data by Value
Example 8-3 Selecting Data by Value
This example filters oml_iris
to produce iris_of_filtered1
, which contains the values from the rows of oml_iris
that have a petal length of less than 1.5 and that are in the Sepal_Length and Petal_Length columns. The example also filters the data using conditions, so that oml_iris_filtered2
contains the values from oml_iris
that have a petal length of less than 1.5 or a sepal length equal to 5.0 and oml_iris_filtered3
contains the values from oml_iris
that have a petal length of less than 1.5 and a sepal length larger than 5.0.
# Select sepal length and petal length where petal length
# is less than 1.5.
oml_iris_filtered1 = oml_iris[oml_iris["Petal_Length"] < 1.5,
["Sepal_Length", "Petal_Length"]]
len(oml_iris_filtered1)
oml_iris_filtered1.head(3)
### Using the AND and OR conditions in filtering.
# Select all rows in which petal length is less than 1.5 or sepal length
# sepal length is 5.0.
oml_iris_filtered2 = oml_iris[(oml_iris["Petal_Length"] < 1.5) |
(oml_iris["Sepal_Length"] == 5.0), :]
len(oml_iris_filtered2)
oml_iris_filtered2.head(3)
# Select all rows in which petal length is less than 1.5 and
# sepal length is larger than 5.0.
oml_iris_filtered3 = oml_iris[(oml_iris["Petal_Length"] < 1.5) &
(oml_iris["Sepal_Length"] > 5.0), :]
len(oml_iris_filtered3)
oml_iris_filtered3.head()
Listing for This Example
>>> # Select sepal length and petal length where petal length
... # is less than 1.5.
... oml_iris_filtered1 = oml_iris[oml_iris["Petal_Length"] < 1.5,
... ["Sepal_Length", "Petal_Length"]]
>>> len(oml_iris_filtered1)
24
>>> oml_iris_filtered1.head(3)
Sepal_Length Petal_Length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
>>>
>>> ### Using the AND and OR conditions in filtering.
... # Select all rows in which petal length is less than 1.5 or
... # sepal length is 5.0.
... oml_iris_filtered2 = oml_iris[(oml_iris["Petal_Length"] < 1.5) |
... (oml_iris["Sepal_Length"] == 5.0), :]
>>> len(oml_iris_filtered2)
30
>>> oml_iris_filtered2.head(3)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
>>>
>>> # Select all rows in which petal length is less than 1.5
... # and sepal length is larger than 5.0.
... oml_iris_filtered3 = oml_iris[(oml_iris["Petal_Length"] < 1.5) &
... (oml_iris["Sepal_Length"] > 5.0), :]
>>> len(oml_iris_filtered3)
7
>>> oml_iris_filtered3.head()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.8 4.0 1.2 0.2 setosa
2 5.4 3.9 1.3 0.4 setosa
3 5.1 3.5 1.4 0.3 setosa
4 5.2 3.4 1.4 0.2 setosa
Parent topic: Prepare Data