2 Using Big Data Spatial and Graph with Spatial Data
This chapter provides conceptual and usage information about loading, storing, accessing, and working with spatial data in a Big Data environment.
- About Big Data Spatial and Graph Support for Spatial Data
Oracle Big Data Spatial and Graph features enable spatial data to be stored, accessed, and analyzed quickly and efficiently for location-based decision making.
- Oracle Big Data Vector and Raster Data Processing
Oracle Big Data Spatial and Graph supports the storage and processing of both vector and raster spatial data.
- Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing
Oracle Spatial Hadoop Image Processing Framework allows the creation of new combined images resulting from a series of processing phases in parallel.
- Loading an Image to Hadoop Using the Image Loader
The first step in processing images using the Oracle Spatial and Graph Hadoop Image Processing Framework is to have the images in HDFS, separated into smart tiles.
- Processing an Image Using the Oracle Spatial Hadoop Image Processor
Once the images are loaded into HDFS, they can be processed in parallel using the Oracle Spatial Hadoop Image Processing Framework.
- Loading and Processing an Image Using the Oracle Spatial Hadoop Raster Processing API
The framework provides a raster processing API that lets you load and process rasters from a Java application, without creating XML. The application can be executed inside the cluster or on a remote node.
- Using the Oracle Spatial Hadoop Raster Simulator Framework to Test Raster Processing
When you create custom processing classes, you can use the Oracle Spatial Hadoop Raster Simulator Framework to test them by "pretending" to plug them into the Oracle Raster Processing Framework.
- Oracle Big Data Spatial Raster Processing for Spark
Oracle Big Data Spatial Raster Processing for Apache Spark is a spatial raster processing API for Java.
- Spatial Raster Processing Support in Big Data Cloud Service
Oracle Big Data Spatial Raster Processing is supported in Big Data Cloud Service (BDCS) by making use of the Oracle Object Storage platform.
- Oracle Big Data Spatial Vector Analysis
Oracle Big Data Spatial Vector Analysis is a Spatial Vector Analysis API, which runs as a Hadoop job and provides MapReduce components for spatial processing of data stored in HDFS.
- Oracle Big Data Spatial Vector Analysis for Spark
Oracle Big Data Spatial Vector Analysis for Apache Spark is a spatial vector analysis API for Java and Scala that provides spatially-enabled RDDs (Resilient Distributed Datasets) that support spatial transformations and actions, spatial partitioning, and indexing.
- Oracle Big Data Spatial Vector Hive Analysis
Oracle Big Data Spatial Vector Hive Analysis provides spatial functions to analyze the data using Hive.
- Using the Oracle Big Data SpatialViewer Web Application
You can use the Oracle Big Data SpatialViewer Web Application (SpatialViewer) to perform a variety of tasks.
2.1 About Big Data Spatial and Graph Support for Spatial Data
Oracle Big Data Spatial and Graph features enable spatial data to be stored, accessed, and analyzed quickly and efficiently for location-based decision making.
Spatial data represents the location characteristics of real or conceptual objects in relation to the real or conceptual space on a Geographic Information System (GIS) or other location-based application.
The spatial features are used to geotag, enrich, visualize, transform, load, and process location-specific two- and three-dimensional geographical images, and to manipulate geometrical shapes for GIS functions.
- What is Big Data Spatial and Graph on Apache Hadoop?
- Advantages of Oracle Big Data Spatial and Graph
- Oracle Big Data Spatial Features and Functions
- Oracle Big Data Spatial Files, Formats, and Software Requirements
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.1.1 What is Big Data Spatial and Graph on Apache Hadoop?
Oracle Big Data Spatial and Graph on Apache Hadoop is a framework that uses MapReduce programs and analytic capabilities in a Hadoop cluster to store, access, and analyze spatial data. The spatial features provide a schema and functions that facilitate the storage, retrieval, update, and query of collections of spatial data. Big Data Spatial and Graph on Hadoop supports storing and processing spatial images, which can be geometric shapes, raster, or vector images, stored in one of several hundred supported formats.
Note:
See Oracle Spatial and Graph Developer's Guide for an introduction to spatial concepts, data, and operations.
2.1.2 Advantages of Oracle Big Data Spatial and Graph
The advantages of using Oracle Big Data Spatial and Graph include the following:
-
Unlike some of the GIS-centric spatial processing systems and engines, Oracle Big Data Spatial and Graph is capable of processing both structured and unstructured spatial information.
-
Customers are not forced or restricted to store only one particular form of data in their environment. They can store their data as spatial or nonspatial business data and still use Oracle Big Data to do their spatial processing.
-
This is a framework, and therefore customers can use the available APIs to custom-build their applications or operations.
-
Oracle Big Data Spatial can process both vector and raster types of information and images.
2.1.3 Oracle Big Data Spatial Features and Functions
The spatial data is loaded for query and analysis by the Spatial Server and the images are stored and processed by an Image Processing Framework. You can use the Oracle Big Data Spatial and Graph server on Hadoop for:
-
Cataloguing the geospatial information, such as geographical map-based footprints, availability of resources in a geography, and so on.
-
Topological processing to calculate distance operations, such as nearest neighbor in a map location.
-
Categorization to build hierarchical maps of geographies and enrich the map by creating demographic associations within the map elements.
The following functions are built into Oracle Big Data Spatial and Graph:
-
Indexing function for faster retrieval of the spatial data.
-
Map function to display map-based footprints.
-
Zoom function to zoom in and out of specific geographical regions.
-
Mosaic and Group function to group a set of image files for processing in order to create a mosaic, or to perform subset operations.
-
Cartesian and geodetic coordinate functions to represent the spatial data in one of these coordinate systems.
-
Hierarchical function that builds and relates geometric hierarchy, such as country, state, city, postal code, and so on. This function can process the input data in the form of documents or latitude/longitude coordinates.
2.1.4 Oracle Big Data Spatial Files, Formats, and Software Requirements
The stored spatial data or images can be in one of these supported formats:
-
GeoJSON files
-
Shapefiles
-
Both Geodetic and Cartesian data
-
Other GDAL supported formats
You must have the following software to store and process the spatial data:
-
Java runtime
-
GCC Compiler - Only when the GDAL-supported formats are used
2.2 Oracle Big Data Vector and Raster Data Processing
Oracle Big Data Spatial and Graph supports the storage and processing of both vector and raster spatial data.
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.2.1 Oracle Big Data Spatial Raster Data Processing
For processing raster data, the GDAL loader loads the raster spatial data or images into an HDFS environment. The following basic operations can be performed on raster spatial data:
-
Mosaic: Combine multiple raster images to create a single mosaic image.
-
Subset: Perform subset operations on individual images.
-
Raster algebra operations: Perform algebra operations on every pixel in the rasters (for example, add, divide, multiply, log, pow, sine, sinh, and acos).
-
User-specified processing: Raster processing is based on the classes that the user sets to be executed in the mapping and reducing phases.
This feature supports a MapReduce framework for raster analysis operations. Users can custom-build their own raster operations, such as performing an algebraic function on raster data. For example, you can calculate the slope at each base of a digital elevation model or a 3D representation of a spatial surface, such as a terrain. For details, see Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing.
Parent topic: Oracle Big Data Vector and Raster Data Processing
2.2.2 Oracle Big Data Spatial Vector Data Processing
This feature supports the processing of spatial vector data:
-
Loaded and stored on to a Hadoop HDFS environment
-
Stored either as Cartesian or geodetic data
The stored spatial vector data can be used for performing the following query operations and more:
-
Point-in-polygon
-
Distance calculation
-
Anyinteract
-
Buffer creation
Several data service operations are supported for the spatial vector data:
-
Data enrichment
-
Data categorization
-
Spatial join
In addition, there is limited Map Visualization API support, for the HTML5 format only. You can access these APIs to create custom operations. For details, see "Oracle Big Data Spatial Vector Analysis."
Parent topic: Oracle Big Data Vector and Raster Data Processing
2.3 Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing
Oracle Spatial Hadoop Image Processing Framework allows the creation of new combined images resulting from a series of processing phases in parallel.
It includes the following features:
-
HDFS image storage, where every block-size split is stored as a separate tile, ready for future independent processing
-
Subset, user-defined, and map algebra operations processed in parallel using the MapReduce framework
-
Ability to add custom processing classes to be executed in the mapping or reducing phases in parallel in a transparent way
-
Fast processing of georeferenced images
-
Support for GDAL formats, multiple-band images, DEMs (digital elevation models), multiple pixel depths, and SRIDs
-
Java API providing access to framework operations; useful for web services or standalone Java applications
-
Framework for testing and debugging user processing classes in the local environment
The Oracle Spatial Hadoop Image Processing Framework consists of two modules, a Loader and Processor, each one represented by a Hadoop job running on different stages in a Hadoop cluster, as represented in the following diagram. Also, you can load and process the images using the SpatialViewer web application, and you can use the Java API to expose the framework’s capabilities.
For installation and configuration information, see:
2.3.1 Image Loader
The Image Loader is a Hadoop job that loads a specific image or a group of images into HDFS.
-
While importing, the image is tiled and stored as an HDFS block.
-
GDAL is used to tile the image.
-
Each tile is loaded by a different mapper, so reading is parallel and faster.
-
Each tile includes a certain number of overlapping bytes (user input), so that each tile covers some area of its adjacent tiles.
-
A MapReduce job uses a mapper to load the information for each tile. The number of mappers depends on the number of tiles, image resolution, and block size.
-
A single reduce phase per image puts together all the information loaded by the mappers and stores the images into a special .ohif format, which contains the resolution, bands, offsets, and image data. This way, the file offset containing each tile and the node location is known.
-
Each tile contains information for every band. This is helpful when there is a need to process only a few tiles; then, only the corresponding blocks are loaded.
The following diagram represents an Image Loader process:
2.3.2 Image Processor
The Image Processor is a Hadoop job that filters tiles to be processed based on the user input and performs processing in parallel to create a new image.
-
Processes specific tiles of the image identified by the user. You can identify one, zero, or multiple processing classes. These classes are executed in the mapping or reducing phase, depending on your configuration. For the mapping phase, after the execution of the processing classes, a mosaic operation is performed to adapt the pixels to the final output format requested by the user. If no mosaic operation was requested, the input raster is sent to the reduce phase as is. For the reducer phase, all the tiles are put together into a GDAL data set that is the input for the user reduce processing class, where the final output may be changed or analyzed according to user needs.
-
A mapper loads the data corresponding to one tile, conserving data locality.
-
Once the data is loaded, the mapper filters the bands requested by the user.
-
Filtered information is processed and sent to each mapper in the reduce phase, where bytes are put together and a final processed image is stored into HDFS or a regular file system, depending on the user request.
The following diagram represents an Image Processor job:
2.4 Loading an Image to Hadoop Using the Image Loader
The first step in processing images using the Oracle Spatial and Graph Hadoop Image Processing Framework is to have the images in HDFS, separated into smart tiles.
This allows the processing job to work on each tile independently. The Image Loader lets you import a single image or a collection of them into HDFS in parallel, which decreases the load time.
The Image Loader imports images from a file system into HDFS, where each block contains data for all the bands of the image, so that if further processing is required on specific positions, the information can be processed on a single node.
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.4.1 Image Loading Job
The image loading job has its custom input format that splits the image into related image splits. The splits are calculated based on an algorithm that reads square blocks of the image covering a defined area, which is determined by
area = ((blockSize - metadata bytes) / number of bands) / bytes per pixel.
For those pieces that do not use the complete block size, the remaining bytes are refilled with zeros.
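As a hedged illustration of this formula (the block size, metadata size, band count, and pixel depth below are assumed values, not framework defaults): with a 64 MB block, about 1 MB of metadata, 3 bands, and 1 byte per pixel,
area = ((67108864 - 1048576) / 3) / 1 = 22020096 pixels per split, which corresponds roughly to a 4692 x 4692 pixel tile.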
Splits are assigned to different mappers where every assigned tile is read using GDAL based on the ImageSplit information. As a result, an ImageDataWritable instance is created and saved in the context.
The metadata set in the ImageDataWritable instance is used by the processing classes to set up the tiled image in order to manipulate and process it. Since the source images are read from multiple mappers, the load is performed in parallel and faster.
After the mappers finish reading, the reducer picks up the tiles from the context and puts them together to save the file into HDFS. A special reading process is required to read the image back.
Parent topic: Loading an Image to Hadoop Using the Image Loader
2.4.2 Input Parameters
The following input parameters are supplied to the Hadoop command:
hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageloader.jar -files <SOURCE_IMGS_PATH> -out <HDFS_OUTPUT_FOLDER> -gdal <GDAL_LIB_PATH> -gdalData <GDAL_DATA_PATH> [-overlap <OVERLAPPING_PIXELS>] [-thumbnail <THUMBNAIL_PATH>] [-expand <false|true>] [-extractLogs <false|true>] [-logFilter <LINES_TO_INCLUDE_IN_LOG>] [-pyramid <OUTPUT_DIRECTORY, LEVEL, [RESAMPLING]>]
Where:
- SOURCE_IMGS_PATH is a path to the source image(s) or folder(s). For multiple inputs use a comma separator. This path must be accessible via NFS to all nodes in the cluster.
- HDFS_OUTPUT_FOLDER is the HDFS output folder where the loaded images are stored.
- OVERLAPPING_PIXELS is an optional number of overlapping pixels on the borders of each tile. If this parameter is not specified, a default of two overlapping pixels is used.
- GDAL_LIB_PATH is the path where the GDAL libraries are located.
- GDAL_DATA_PATH is the path where the GDAL data folder is located. This path must be accessible through NFS to all nodes in the cluster.
- THUMBNAIL_PATH is an optional path to store a thumbnail of the loaded image(s). This path must be accessible through NFS to all nodes in the cluster and must have write access permission for yarn users.
- -expand controls whether the HDFS path of the loaded raster expands the source path, including all directories. If you set this to false, the .ohif file is stored directly in the output directory (specified using the -o option) without including that directory's path in the raster.
- -extractLogs controls whether the logs of the executed application should be extracted to the system temporary directory. By default, it is not enabled. The extraction does not include logs that are not part of Oracle Framework classes.
- -logFilter <LINES_TO_INCLUDE_IN_LOG> is a comma-separated String that lists all the patterns to include in the extracted logs, for example, to include custom processing classes packages.
- -pyramid <OUTPUT_DIRECTORY, LEVEL, [RESAMPLING]> allows the creation of pyramids during the initial raster load. An OUTPUT_DIRECTORY must be provided to store the local pyramids before uploading to HDFS; pyramids are loaded into the same HDFS directory requested for the load. A pyramid LEVEL must be provided to indicate how many pyramid levels are required for each raster. A RESAMPLING algorithm is optional to specify the method used to execute the resampling; if none is set, then BILINEAR is used.
For example, the following command loads all the georeferenced images under the images folder and adds an overlap of 10 pixels on every border possible. The HDFS output folder is ohiftest, and a thumbnail of the loaded image is stored in the processtest folder.
hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageloader.jar -files /opt/shareddir/spatial/demo/imageserver/images/hawaii.tif -out ohiftest -overlap 10 -thumbnail /opt/shareddir/spatial/processtest -gdal /opt/oracle/oracle-spatial-graph/spatial/raster/gdal/lib -gdalData /opt/shareddir/data
By default, the Mappers and Reducers are configured to get 2 GB of JVM, but users can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder location from which the command is executed. This properties file may list all the configuration properties that you want to override. For example:
mapreduce.map.memory.mb=2560
mapreduce.reduce.memory.mb=2560
mapreduce.reduce.java.opts=-Xmx2684354560
mapreduce.map.java.opts=-Xmx2684354560
Java heap memory (java.opts properties) must be equal to or less than the total memory assigned to mappers and reducers (mapreduce.map.memory and mapreduce.reduce.memory). Thus, if you increase Java heap memory, you might also need to increase the memory for mappers and reducers.
For GDAL to work properly, the libraries must be available using $LD_LIBRARY_PATH. Make sure that the shared libraries path is set properly in your shell window before executing a job. For example:
export LD_LIBRARY_PATH=$ALLACCESSDIR/gdal/native
Parent topic: Loading an Image to Hadoop Using the Image Loader
2.4.3 Output Parameters
The reducer generates two output files per input image. The first one is the .ohif file that concentrates all the tiles for the source image; each tile may be processed as a separate instance by a processing mapper. Internally, each tile is stored as an HDFS block; blocks are located in several nodes, and one node may contain one or more blocks of a specific .ohif file. The .ohif file is stored in the folder specified by the user with the -out flag, under /user/<USER_EXECUTING_JOB>/OUT_FOLDER/<PARENT_DIRECTORIES_OF_SOURCE_RASTER>, if the flag -expand was not used. Otherwise, the .ohif file will be located at /user/<USER_EXECUTING_JOB>/OUT_FOLDER/, and the file can be identified as original_filename.ohif.
The second output is a related metadata file that lists all the pieces of the image and the coordinates that each one covers. The file is located in HDFS under the metadata location, and its name is a hash generated using the name of the .ohif file. This file is for Oracle internal use only, and it lists important metadata of the source raster. Some example lines from a metadata file:
srid:26904
datatype:1
resolution:27.90809458890406,-27.90809458890406
file:/user/hdfs/ohiftest/opt/shareddir/spatial/data/rasters/hawaii.tif.ohif
bands:3
mbr:532488.7648166901,4303164.583549625,582723.3350767174,4269619.053853762
0,532488.7648166901,4303164.583549625,582723.3350767174,4269619.053853762
thumbnailpath:/opt/shareddir/spatial/thumb/
If the -thumbnail flag was specified, a thumbnail of the source image is stored in the related folder. This is a way to visualize a translation of the .ohif file. Job execution logs can be accessed using the command yarn logs -applicationId <applicationId>.
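For example, after the sample load shown earlier, you might list the generated output and retrieve the job logs as follows (the output path and application ID here are placeholders; substitute the values from your own job):
hadoop fs -ls /user/hdfs/ohiftest
yarn logs -applicationId application_1234567890123_0001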
Parent topic: Loading an Image to Hadoop Using the Image Loader
2.5 Processing an Image Using the Oracle Spatial Hadoop Image Processor
Once the images are loaded into HDFS, they can be processed in parallel using Oracle Spatial Hadoop Image Processing Framework.
You specify an output, and the framework filters the tiles to fit into that output, processes them, and puts them all together to store them into a single file. Map algebra operations are also available and, if set, will be the first part of the processing phase. You can specify additional processing classes to be executed before the final output is created by the framework.
The image processor loads specific blocks of data, based on the input (mosaic description or a single raster), and selects only the bands and pixels that fit into the final output. All the specified processing classes are executed and the final output is stored into HDFS or the file system depending on the user request.
- Image Processing Job
- Input Parameters
- Job Execution
- Processing Classes and ImageBandWritable
- Map Algebra Operations
- Multiple Raster Algebra Operations
- Pyramids
- Output
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.5.1 Image Processing Job
The image processing job has different flows depending on the type of processing requested by the user.
-
Default Image Processing Job Flow: executed for processing that includes a mosaic operation, single raster operation, or basic multiple raster operation.
-
Multiple Raster Image Processing Job Flow: executed for processing that includes complex multiple raster algebra operations.
2.5.1.1 Default Image Processing Job Flow
The default image processing job flow is executed when any of the following processing is requested:
-
Mosaic operation
-
Single raster operation
-
Basic multiple raster algebra operation
The flow has its own custom FilterInputFormat
, which determines the tiles to be processed, based on the SRID and coordinates. Only images with same data type (pixel depth) as the mosaic input data type (pixel depth) are considered. Only the tiles that intersect with coordinates specified by the user for the mosaic output are included. For processing of a single raster or basic multiple raster algebra operation (excluding mosaic), the filter includes all the tiles of the input rasters, because the processing will be executed on the complete images. Once the tiles are selected, a custom ImageProcessSplit
is created for each image.
When a mapper receives the ImageProcessSplit
, it reads the information based on what the ImageSplit
specifies, performs a filter to select only the bands indicated by the user, and executes the list of map operations and of processing classes defined in the request, if any.
Each mapper process runs in the node where the data is located. After the map algebra operations and processing classes are executed, a validation verifies if the user is requesting mosaic operation or if analysis includes the complete image; and if a mosaic operation is requested, the final process executes the operation. The mosaic operation selects from every tile only the pixels that fit into the output and makes the necessary resolution changes to add them in the mosaic output. The single process operation just copies the previous raster tile bytes as they are. The resulting bytes are stored in NFS to be recovered by the reducer.
A single reducer picks the tiles and puts them together. If you specified any basic multiple raster algebra operation, then it is executed at the same time the tiles are merged into the final output. This operation affects only the intersecting pixels in the mosaic output, or in every pixel if no mosaic operation was requested. If you specified a reducer processing class, the GDAL data set with the output raster is sent to this class for analysis and processing. If you selected HDFS output, the ImageLoader
is called to store the result into HDFS. Otherwise, by default the image is prepared using GDAL and is stored in the file system (NFS).
Parent topic: Image Processing Job
2.5.1.2 Multiple Raster Image Processing Job Flow
The multiple raster image processing job flow is executed when a complex multiple raster algebra operation is requested. It applies to rasters that have the same MBR, pixel type, pixel size, and SRID, since these operations are applied pixel by pixel in the corresponding cell, where every pixel represents the same coordinates.
The flow has its own custom MultipleRasterInputFormat
, which determines the tiles to be processed, based on the SRID and coordinates. Only images with same MBR, pixel type, pixel size and SRID are considered. Only the rasters that match with coordinates specified by the first raster in the catalog are included. All the tiles of the input rasters are considered, because the processing will be executed on the complete images.
Once the tiles are selected, a custom MultipleRasterSplit
is created. This split contains a small area of every original tile, depending on the block size, because now all the rasters must be included in a split, even if it is only a small area. Each of these is called an IndividualRasterSplit
, and they are contained in a parent MultipleRasterSplit
.
When a mapper receives the MultipleRasterSplit
, it reads the information of all the rasters' tiles that are included in the parent split, performs a filter to select only the bands indicated by the user and only the small corresponding area to process in this specific mapper, and then executes the complex multiple raster algebra operation.
Data locality may be lost in this part of the process, because multiple rasters are included for a single mapper that may not be in the same node. The resulting bytes for every pixel are put in the context to be recovered by the reducer.
A single reducer picks pixel values and puts them together. If you specified a reducer processing class, the GDAL data set with the output raster is sent to this class for analysis and processing. The list of tiles that this class receives is null for this scenario, and the class can only work with the output data set. If you selected HDFS output, the ImageLoader
is called to store the result into HDFS. Otherwise, by default the image is prepared using GDAL and is stored in the file system (NFS).
Parent topic: Image Processing Job
2.5.2 Input Parameters
The following input parameters can be supplied to the hadoop command:
hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageprocessor.jar -config <MOSAIC_CONFIG_PATH> -gdal <GDAL_LIBRARIES_PATH> -gdalData <GDAL_DATA_PATH> [-catalog <IMAGE_CATALOG_PATH>] [-usrlib <USER_PROCESS_JAR_PATH>] [-thumbnail <THUMBNAIL_PATH>] [-nativepath <USER_NATIVE_LIBRARIES_PATH>] [-params <USER_PARAMETERS>] [-file <SINGLE_RASTER_PATH>]
Where:
- MOSAIC_CONFIG_PATH is the path to the mosaic configuration XML that defines the features of the output.
- GDAL_LIBRARIES_PATH is the path where the GDAL libraries are located.
- GDAL_DATA_PATH is the path where the GDAL data folder is located. This path must be accessible via NFS to all nodes in the cluster.
- IMAGE_CATALOG_PATH is the path to the catalog XML that lists the HDFS image(s) to be processed. This is optional because you can also specify a single raster to process using the -file flag.
- USER_PROCESS_JAR_PATH is an optional user-defined jar file or comma-separated list of jar files, each of which contains additional processing classes to be applied to the source images.
- THUMBNAIL_PATH is an optional flag to activate the thumbnail creation of the loaded image(s). This path must be accessible via NFS to all nodes in the cluster and is valid only for an HDFS output.
- USER_NATIVE_LIBRARIES_PATH is an optional comma-separated list of additional native libraries to use in the analysis. It can also be a directory containing all the native libraries to load in the application.
- USER_PARAMETERS is an optional key/value list used to define input data for user processing classes. Use a semicolon to separate parameters. For example: azimuth=315;altitude=45
- SINGLE_RASTER_PATH is an optional path to the .ohif file that will be processed by the job. If this is set, you do not need to set a catalog.
For example, the following command will process all the files listed in the catalog file input.xml using the mosaic output definition set in the testFS.xml file.
hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageprocessor.jar -catalog /opt/shareddir/spatial/demo/imageserver/images/input.xml -config /opt/shareddir/spatial/demo/imageserver/images/testFS.xml -thumbnail /opt/shareddir/spatial/processtest -gdal /opt/oracle/oracle-spatial-graph/spatial/raster/gdal/lib -gdalData /opt/shareddir/data
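The processor can also be run against a single raster instead of a catalog by using the -file flag. The following command is only a sketch; the .ohif path and the configuration file shown are illustrative assumptions:
hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageprocessor.jar -file /user/hdfs/ohiftest/opt/shareddir/spatial/data/rasters/hawaii.tif.ohif -config /opt/shareddir/spatial/demo/imageserver/images/testFS.xml -gdal /opt/oracle/oracle-spatial-graph/spatial/raster/gdal/lib -gdalData /opt/shareddir/data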
By default, the Mappers and Reducers are configured to get 2 GB of JVM, but users can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder location from which the command is executed.
For GDAL to work properly, the libraries must be available using $LD_LIBRARY_PATH. Make sure that the shared libraries path is set properly in your shell window before executing a job. For example:
export LD_LIBRARY_PATH=$ALLACCESSDIR/gdal/native
2.5.2.1 Catalog XML Structure
The following is an example of input catalog XML used to list every source image considered for mosaic operation generated by the image processing job.
<catalog>
  <image>
    <raster>/user/hdfs/ohiftest/opt/shareddir/spatial/data/rasters/maui.tif.ohif</raster>
    <bands datatype='1' config='1,2,3'>3</bands>
  </image>
</catalog>
A <catalog> element contains the list of <image> elements to process.
Each <image> element defines a source image or a source folder within the <raster> element. All the images within the folder are processed.
The <bands> element specifies the number of bands of the image. The datatype attribute has the raster data type, and the config attribute specifies which band should appear in the mosaic output band order. For example, 3,1,2 specifies that mosaic output band number 1 will have band number 3 of this raster, mosaic band number 2 will have source band 1, and mosaic band number 3 will have source band 2. This order may change from raster to raster.
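For example, a catalog entry that applies the 3,1,2 reordering described above would look like the following (the raster path is illustrative only):
<image>
  <raster>/user/hdfs/ohiftest/opt/shareddir/spatial/data/rasters/maui.tif.ohif</raster>
  <bands datatype='1' config='3,1,2'>3</bands>
</image>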
Parent topic: Input Parameters
2.5.2.2 Mosaic Definition XML Structure
The following is an example of a mosaic configuration XML used to define the features of the output generated by the image processing job.
<mosaic exec="false">
  <output>
    <SRID>26904</SRID>
    <directory type="FS">/opt/shareddir/spatial/processOutput</directory>
    <!--directory type="HDFS">newData</directory-->
    <tempFSFolder>/opt/shareddir/spatial/tempOutput</tempFSFolder>
    <filename>littlemap</filename>
    <format>GTIFF</format>
    <width>1600</width>
    <height>986</height>
    <algorithm order="0">2</algorithm>
    <bands layers="3" config="3,1,2"/>
    <nodata>#000000</nodata>
    <pixelType>1</pixelType>
  </output>
  <crop>
    <transform>
      356958.985610072,280.38843650364862,0,2458324.0825054757,0,-280.38843650364862
    </transform>
  </crop>
  <process>
    <classMapper params="threshold=454,2954">oracle.spatial.hadoop.twc.FarmTransformer</classMapper>
    <classReducer params="plot_size=100400">oracle.spatial.hadoop.twc.FarmAlignment</classReducer>
  </process>
  <operations>
    <localif operator="<" operand="3" newvalue="6"/>
    <localadd arg="5"/>
    <localsqrt/>
    <localround/>
  </operations>
</mosaic>
The <mosaic>
element defines the specifications of the processing output. The exec
attribute specifies if the processing will include mosaic operation or not. If set to “false”
, a mosaic operation is not executed and a single raster is processed; if set to “true”
or not set, a mosaic operation is performed. Some of the following elements are required only for mosaic operations and ignored for single raster processing.
The <output>
element defines the features, such as <SRID>, considered for the output. All the images in a different SRID are converted to the mosaic SRID in order to decide whether any of their tiles fit into the mosaic. This element is not required for single raster processing, because the output raster has the same SRID as the input.
The <directory>
element defines where the output is located. It can be in an HDFS or in regular FileSystem (FS), which is specified in the tag type.
The <tempFSFolder> element sets the path to store the mosaic output temporarily. The attribute delete="false" can be specified to keep the output of the process even if the loader was executed to store it in HDFS.
The <filename>
and <format>
elements specify the output filename. <filename>
is not required for single raster process; and if it is not specified, the name of the input file (determined by the -file
attribute during the job call) is used for the output file. <format>
is not required for single raster processing, because the output raster has the same format as the input.
The <width>
and <height>
elements set the mosaic output resolution. They are not required for single raster processing, because the output raster has the same resolution as the input.
The <algorithm>
element sets the ordering algorithm for the images. A value of 1 means ordering by source last modified date, and a value of 2 means ordering by image size. The order attribute represents ascending or descending mode. (These properties are for mosaic operations where multiple rasters may overlap.)
The <bands>
element specifies the number of bands in the output mosaic. Images with fewer bands than this number are discarded. The config
attribute can be used for single raster processing to set the band configuration for output, because there is no catalog.
The <nodata>
element specifies the color in the first three bands for all the pixels in the mosaic output that have no value.
The <pixelType>
element sets the pixel type of the mosaic output. Source images that do not have the same pixel size are discarded for processing. This element is not required for single raster processing: if not specified, the pixel type will be the same as for the input.
The <crop>
element defines the coordinates included in the mosaic output in the following order: startcoordinateX
, pixelXWidth
, RotationX
, startcoordinateY
, RotationY
, and pixelheightY
. This element is not required for single raster processing: if not specified, the complete image is considered for analysis.
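Relating this order to the <transform> values in the sample configuration above, the six comma-separated numbers map to the crop parameters as follows:
startcoordinateX = 356958.985610072
pixelXWidth = 280.38843650364862
RotationX = 0
startcoordinateY = 2458324.0825054757
RotationY = 0
pixelheightY = -280.38843650364862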
The <process>
element lists all the classes to execute before the mosaic operation.
The <classMapper>
element is used for classes that will be executed during mapping phase, and the <classReducer>
element is used for classes that will be executed during reduce phase. Both elements have the params
attribute, where you can send input parameters to processing classes according to your needs.
The <operations>
element lists all the map algebra operations that will be processed for this request. This element can also include a request for pyramid operations; for example:
<operations>
<pyramid resampling="NEAREST_NEIGHBOR" redLevel="6"/>
</operations>
Parent topic: Input Parameters
2.5.3 Job Execution
The first step of the job is to filter the tiles that would fit into the output. As a start, the location files that hold tile metadata are sent to the InputFormat.
By extracting the pixelType, the filter decides whether the related source image is valid for processing. Based on the user definition made in the catalog XML, the following happens:
-
If the image is valid for processing, then the SRID is evaluated next.
-
If it is different from the user definition, then the MBR coordinates of every tile are converted into the user SRID and evaluated.
This way, every tile is evaluated for intersection with the output definition.
-
For a mosaic processing request, only the intersecting tiles are selected, and a split is created for each one of them.
-
For a single raster processing request, all the tiles are selected, and a split is created for each one of them.
-
For a complex multiple raster algebra processing request, all the tiles are selected if the MBR and pixel size are the same. Depending on the number of rasters selected and the block size, a specific area of every tile's raster (which does not always include the complete original raster tile) is included in a single parent split.
A mapper processes each split in the node where it is stored. (For complex multiple raster algebra operations, data locality may be lost, because a split contains data from several rasters.) The mapper executes the sequence of map algebra operations and processing classes defined by the user, and then the mosaic process is executed if requested. A single reducer puts together the result of the mappers and, for user-specified reducing processing classes, sets the output data set to these classes for analysis or process. Finally, the process stores the image into FS or HDFS upon user request. If the user requested to store the output into HDFS, then the ImageLoader
job is invoked to store the image as an .ohif
file.
By default, the mappers and reducers are configured to get 1 GB of JVM, but you can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder location from which the command is executed.
2.5.4 Processing Classes and ImageBandWritable
The processing classes specified in the catalog XML must follow a set of rules to be correctly processed by the job. All the processing classes in the mapping phase must implement the ImageProcessorInterface
interface. For the reducer phase, they must implement the ImageProcessorReduceInterface
interface.
When implementing a processing class, you may manipulate the raster using its object representation, ImageBandWritable. An example of a processing class is provided with the framework to calculate the slope on DEMs. You can create mapping operations, for example, to transform the pixel values to another value by a function.
The ImageBandWritable instance defines the content of a tile, such as resolution, size, and pixels. These values must be reflected in the properties that create the definition of the tile. The integrity of the output depends on the correct manipulation of these properties.
Table 2-1 ImageBandWritable Properties
Type - Property | Description
---|---
IntWritable dstWidthSize | Width size of the tile
IntWritable dstHeightSize | Height size of the tile
IntWritable bands | Number of bands in the tile
IntWritable dType | Data type of the tile
IntWritable offX | Starting X pixel, in relation to the source image
IntWritable offY | Starting Y pixel, in relation to the source image
IntWritable totalWidth | Width size of the source image
IntWritable totalHeight | Height size of the source image
IntWritable bytesNumber | Number of bytes containing the pixels of the tile and stored into baseArray
BytesWritable[] baseArray | Array containing the bytes representing the tile pixels; each cell represents a band
IntWritable[][] basePaletteArray | Array containing the int values representing the tile palette; each array represents a band. Each integer represents an entry for each color in the color table; there are four entries per color
IntWritable[] baseColorArray | Array containing the int values representing the color interpretation; each cell represents a band
DoubleWritable[] noDataArray | Array containing the NODATA values for the image; each cell contains the value for the related band
ByteWritable isProjection | Specifies if the tile has projection information with Byte.MAX_VALUE
ByteWritable isTransform | Specifies if the tile has the geo transform array information with Byte.MAX_VALUE
ByteWritable isMetadata | Specifies if the tile has metadata information with Byte.MAX_VALUE
IntWritable projectionLength | Specifies the projection information length
BytesWritable projectionRef | Specifies the projection information in bytes
DoubleWritable[] geoTransform | Contains the geo transform array
IntWritable metadataSize | Number of metadata values in the tile
IntWritable[] metadataLength | Array specifying the length of each metadataValue
BytesWritable[] metadata | Array of metadata of the tile
GeneralInfoWritable mosaicInfo | The user-defined information in the mosaic XML. Do not modify the mosaic output features; instead, copy the original XML file under a new name and run the process using the new XML
MapWritable extraFields | Map that lists key/value pairs of parameters specific to every tile to be passed to the reducer phase for analysis
Processing Classes and Methods
When modifying the pixels of the tile, first get the band information into an array using the following method:
byte[] bandData1 = (byte[]) img.getBand(0);
The bytes representing the tile pixels of band 1 are now in the bandData1 array. The base index is zero.
The getBand(int bandId)
method will get the band of the raster in the specified bandId
position. You can cast the object retrieved to the type of array of the raster; it could be byte, short (unsigned int 16 bits, int 16 bits), int (unsigned int 32 bits, int 32 bits), float (float 32 bits), or double (float 64 bits).
With the array of pixels available, it is possible now to transform them upon a user request.
After processing the pixels, if the same instance of ImageBandWritable must be used, then execute the following method:
img.removeBands();
This removes the content of previous bands, and you can start adding the new bands. To add a new band use the following method:
img.addBand(Object band);
Otherwise, you may want to replace a specific band by using the following method:
img.replaceBand(Object band, int bandId)
In the preceding methods, band is an array containing the pixel information, and bandId is the identifier of the band to be replaced. Do not forget to update the instance size, data type, bytesNumber, and any other property that might be affected as a result of the processing operation. Setters are available for each property.
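The following is a minimal sketch of a mapping-phase processing class that inverts the pixel values of a single-band byte raster. The class name, the processImage method name, and its signature are illustrative assumptions (the real ImageProcessorInterface contract is defined by the framework javadoc); only getBand, replaceBand, and the property setters described above are taken from this documentation.
// Minimal sketch only. The method name and signature below are assumptions;
// consult the ImageProcessorInterface javadoc for the actual contract.
// Imports for ImageProcessorInterface and ImageBandWritable are omitted because
// their packages depend on the installed framework jars.
public class InvertPixelsProcessor implements ImageProcessorInterface {

    // Assumed entry point that receives the tile representation to manipulate.
    public void processImage(ImageBandWritable img) {
        // Get band 0 as a byte array (a byte raster is assumed; cast according to the pixel type).
        byte[] bandData = (byte[]) img.getBand(0);

        // Invert every pixel value of the band.
        for (int i = 0; i < bandData.length; i++) {
            bandData[i] = (byte) (255 - (bandData[i] & 0xFF));
        }

        // Replace band 0 with the modified pixels. Size, data type, and bytesNumber
        // are unchanged by this operation, so no setters are needed here.
        img.replaceBand(bandData, 0);
    }
}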
2.5.4.1 Location of the Classes and Jar Files
All the processing classes must be contained in a single jar file if you are using the Oracle SpatialViewer web application. The processing classes might be placed in different jar files if you are using the command line option.
When new classes are visible in the classpath, they must be added to the mosaic XML in the <process><classMapper>
or <process><classReducer>
section. Every <class>
element added is executed in order of appearance: for mappers, just before the final mosaic operation is performed; and for reducers, just after all the processed tiles are put together in a single data set.
Parent topic: Processing Classes and ImageBandWritable
2.5.5 Map Algebra Operations
You can process local map algebra operations on the input rasters, where pixels are altered depending on the operation. The order of operations in the configuration XML determines the order in which the operations are processed. After all the map algebra operations are processed, the processing classes are run, and finally the mosaic operation is performed.
The following map algebra operations can be added in the <operations> element in the mosaic configuration XML, with the operation name serving as an element name. (The data types for which each operation is supported are listed in parentheses.) An example combining several operations appears after this list.
- localnot: Gets the negation of every pixel; inverts the bit pattern. If the result is a negative value and the data type is unsigned, then the NODATA value is set. If the raster does not have a specified NODATA value, then the original pixel is set. (Byte, Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits)
- locallog: Returns the natural logarithm (base e) of a pixel. If the result is NaN, then the original pixel value is set; if the result is Infinite, then the NODATA value is set. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- locallog10: Returns the base 10 logarithm of a pixel. If the result is NaN, then the original pixel value is set; if the result is Infinite, then the NODATA value is set. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localadd: Adds the value specified as argument to the pixel. Example: <localadd arg="5"/>. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localdivide: Divides the value of each pixel by the value specified as argument. Example: <localdivide arg="5"/>. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localif: Modifies the value of each pixel based on the condition and value specified as argument. Valid operators: =, <, >, >=, <=, !=. Example: <localif operator="<" operand="3" newvalue="6"/>, which modifies all the pixels whose value is less than 3, setting the new value to 6. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localmultiply: Multiplies the value of each pixel by the value specified as argument. Example: <localmultiply arg="5"/>. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localpow: Raises the value of each pixel to the power of the value specified as argument. Example: <localpow arg="5"/>. If the result is Infinite, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localsqrt: Returns the correctly rounded positive square root of every pixel. If the result is Infinite or NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localsubstract: Subtracts the value specified as argument from every pixel value. Example: <localsubstract arg="5"/>. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localacos: Calculates the arc cosine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localasin: Calculates the arc sine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localatan: Calculates the arc tangent of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localcos: Calculates the cosine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localcosh: Calculates the hyperbolic cosine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localsin: Calculates the sine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localtan: Calculates the tangent of a pixel. The pixel is not modified if the cosine of this pixel is 0. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localsinh: Calculates the hyperbolic sine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localtanh: Calculates the hyperbolic tangent of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localdefined: Maps an integer typed pixel to 1 if the cell value is not NODATA; otherwise, 0. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits, Float 32 bits)
- localundefined: Maps an integer typed raster to 0 if the cell value is not NODATA; otherwise, 1. (Unsigned int 16 bits, Unsigned int 32 bits, Int 16 bits, Int 32 bits)
- localabs: Returns the absolute value of a signed pixel. If the result is Infinite, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localnegate: Multiplies the value of each pixel by -1. (Int 16 bits, Int 32 bits, Float 32 bits, Float 64 bits)
- localceil: Returns the smallest value that is greater than or equal to the pixel value and is equal to a mathematical integer. If the result is Infinite, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Float 32 bits, Float 64 bits)
- localfloor: Returns the largest value that is less than or equal to the pixel value and is equal to a mathematical integer. If the result is Infinite, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set. (Float 32 bits, Float 64 bits)
- localround: Returns the closest integer value to every pixel. (Float 32 bits, Float 64 bits)
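For example, the following <operations> fragment (an illustrative combination, not one shipped with the product) first clamps negative values with localif, then adds a constant and rounds the result:
<operations>
  <localif operator="<" operand="0" newvalue="0"/>
  <localadd arg="10"/>
  <localround/>
</operations>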
2.5.6 Multiple Raster Algebra Operations
You can process raster algebra operations that involve more than one raster, where pixels are altered depending on the operation, taking into consideration the pixels from all the involved rasters in the same cell.
Only one operation can be processed at a time and it is defined in the configuration XML using the <multipleops>
element. Its value is the operation to process.
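For example, to add the rasters of the catalog cell by cell, the configuration would contain an element like the following (its exact placement inside the mosaic configuration XML is an assumption here; follow the structure described in Mosaic Definition XML Structure):
<multipleops>add</multipleops>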
There are two types of operations:
-
Basic Multiple Raster Algebra Operations are executed in the reduce phase right before the Reduce User Processing classes.
-
Complex Multiple Raster Algebra Operations are processed in the mapping phase.
2.5.6.1 Basic Multiple Raster Algebra Operations
Basic multiple raster algebra operations are executed in the reducing phase of the job.
They can be requested along with a mosaic operation or just a process request. If requested along with a mosaic operation, the input rasters must have the same MBR, pixel size, SRID and data type.
When a mosaic operation is performed, only the intersecting pixels (pixels that are identical in both rasters) are affected.
The operation is processed at the time that the mapping tiles are put together in the output dataset; the pixel values that intersect (if a mosaic operation was requested) or all the pixels (when mosaic is not requested) are altered according to the requested operation.
The order in which rasters are added to the data set is the mosaic operation order if it was requested; otherwise, it is the order of appearance in the catalog.
The following basic multiple raster algebra operations are available:
- add: Adds every pixel in the same cell for the raster sequence.
- substract: Subtracts every pixel in the same cell for the raster sequence.
- divide: Divides every pixel in the same cell for the raster sequence.
- multiply: Multiplies every pixel in the same cell for the raster sequence.
- min: Assigns the minimum value of the pixels in the same cell for the raster sequence.
- max: Assigns the maximum value of the pixels in the same cell for the raster sequence.
- mean: Calculates the mean value for every pixel in the same cell for the raster sequence.
- and: Processes the binary "and" operation on every pixel in the same cell for the raster sequence; the "and" operation copies a bit to the result if it exists in both operands.
- or: Processes the binary "or" operation on every pixel in the same cell for the raster sequence; the "or" operation copies a bit if it exists in either operand.
- xor: Processes the binary "xor" operation on every pixel in the same cell for the raster sequence; the "xor" operation copies the bit if it is set in one operand but not both.
Parent topic: Multiple Raster Algebra Operations
2.5.6.2 Complex Multiple Raster Algebra Operations
Complex multiple raster algebra operations are executed in the mapping phase of the job, and a job can only process this operation; any request for resizing, changing the SRID, or custom mapping must have been previously executed. The input for this job is a series of rasters with the same MBR, SRID, data type, and pixel size.
The tiles for this job include a piece of all the rasters in the catalog. Thus, every mapper has access to an area of cells in all the rasters, and the operation is processed there. The resulting pixel for every cell is written in the context, so that reducer can put results in the output data set before processing the reducer processing classes.
The order in which rasters are considered to evaluate the operation is the order of appearance in the catalog.
The following complex multiple raster algebra operations are available:
- combine: Assigns a unique output value to each unique combination of input values in the same cell for the raster sequence.
- majority: Assigns the value within the same cells of the raster sequence that is the most numerous. If there is a tie in values, the one on the right is selected.
- minority: Assigns the value within the same cells of the raster sequence that is the least numerous. If there is a tie in values, the one on the right is selected.
- variety: Assigns the count of unique values at each same cell in the sequence of rasters.
- mask: Generates a raster with the values from the first raster, but includes only pixels in which the corresponding pixel in the rest of the rasters of the sequence is set to the specified mask values. Otherwise, 0 is set.
- inversemask: Generates a raster with the values from the first raster, but includes only pixels in which the corresponding pixel in the rest of the rasters of the sequence is not set to the specified mask values. Otherwise, 0 is set.
- equals: Creates a raster with data type byte, where cell values equal 1 if the corresponding cells for all input rasters have the same value. Otherwise, 0 is set.
- unequal: Creates a raster with data type byte, where cell values equal 1 if the corresponding cells for all input rasters have a different value. Otherwise, 0 is set.
- greater: Creates a raster with data type byte, where cell values equal 1 if the cell value in the first raster is greater than the corresponding cells of the rest of the input rasters. Otherwise, 0 is set.
- greaterorequal: Creates a raster with data type byte, where cell values equal 1 if the cell value in the first raster is greater than or equal to the corresponding cells of the rest of the input rasters. Otherwise, 0 is set.
- less: Creates a raster with data type byte, where cell values equal 1 if the cell value in the first raster is less than the corresponding cells of the rest of the input rasters. Otherwise, 0 is set.
- lessorequal: Creates a raster with data type byte, where cell values equal 1 if the cell value in the first raster is less than or equal to the corresponding cells of the rest of the input rasters. Otherwise, 0 is set.
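Again as a conceptual illustration only (not the framework API), the following sketch shows the mask semantics described above, using in-memory arrays and assumed helper names.
//Conceptual sketch: "mask" keeps a pixel from the first raster only when the
//corresponding pixel in every other raster of the sequence is set to one of
//the specified mask values; otherwise the output pixel is 0.
static double[][] maskRasters(double[][] first, java.util.List<double[][]> others,
      java.util.Set<Double> maskValues) {
   int rows = first.length;
   int cols = first[0].length;
   double[][] out = new double[rows][cols];
   for (int r = 0; r < rows; r++) {
      for (int c = 0; c < cols; c++) {
         boolean keep = true;
         for (double[][] other : others) {
            if (!maskValues.contains(other[r][c])) {
               keep = false;
               break;
            }
         }
         out[r][c] = keep ? first[r][c] : 0;
      }
   }
   return out;
}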
Parent topic: Multiple Raster Algebra Operations
2.5.7 Pyramids
Pyramids are subobjects of a raster object that represent the raster image or raster data at differing sizes and degrees of resolution.
The size is usually related to the amount of time that an application needs to retrieve and display an image, particularly over the web. That is, the smaller the image size, the faster it can be displayed; and as long as detailed resolution is not needed (for example, if the user has "zoomed out" considerably), the display quality for the smaller image is adequate.
Pyramid levels represent reduced or increased resolution images that require less or more storage space, respectively. (Big Data Spatial and Graph supports only reduced resolution pyramids.) A pyramid level of 0 indicates the original raster data; that is, there is no reduction in the image resolution and no change in the storage space required. Values greater than 0 (zero) indicate increasingly reduced levels of image resolution and reduced storage space requirements.
A single raster is processed for each pyramid request, and the following parameters apply:
- Pyramid level (required): the maximum reduction level; that is, the number of pyramid levels to create at a reduced size compared with the original object. For example, redLevel="6" causes pyramid levels to be created for levels 0 through 5. The dimension sizes at each lower level are r(n) = r(n - 1)/2 and c(n) = c(n - 1)/2, where r(n) and c(n) are the row and column sizes for a pyramid at level n. The smaller of the row and column dimension sizes of the top-level overview is between 64 and 128, so the maximum reduced-resolution level is (int)(log2(a/64)), where a is the smaller of the original row or column dimension size. If an rLevel value greater than the maximum reduced-resolution level is specified, the rLevel value is set to the maximum reduced-resolution level.
- Resampling algorithm: the resampling method to use. Must be one of the following: NEAREST_NEIGHBOR, BILINEAR, AVERAGE4, AVERAGE16. (BILINEAR and AVERAGE4 have the same effect.) If no resampling algorithm is specified, BILINEAR is used by default.
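The following short sketch (not part of the framework API) simply applies the formulas above, computing the maximum reduced-resolution level and the dimensions of each pyramid level for a given raster size. For example, for a 4096 x 2048 raster, a = 2048 and the maximum reduced-resolution level is (int)(log2(2048/64)) = 5.
//Applies the documented formulas: r(n) = r(n-1)/2, c(n) = c(n-1)/2,
//and maximum reduced-resolution level = (int)(log2(a/64)),
//where a is the smaller of the original row or column dimension size.
static void printPyramidLevels(int rows, int cols) {
   int a = Math.min(rows, cols);
   int maxLevel = (int) (Math.log(a / 64.0) / Math.log(2));
   System.out.println("Maximum reduced-resolution level: " + maxLevel);
   int r = rows;
   int c = cols;
   for (int n = 1; n <= maxLevel; n++) {
      r = r / 2;
      c = c / 2;
      System.out.println("Level " + n + ": " + r + " x " + c);
   }
}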
Pyramids can be created while loading multiple rasters or processing a single raster:
- While loading the rasters in HDFS, by adding the -pyramid parameter to the loader command line call or by using the API loader.addPyramid()
- For processing a single raster, by adding the operation in the user request XML or by using the API processor.addPyramid()
2.5.8 Output
When you specify an HDFS directory in the configuration XML, the output generated is an .ohif file, as in the case of an ImageLoader job. When you specify a file system (FS) directory in the configuration XML, the output generated is an image with the file name and type specified, and it is stored in the regular file system.
In both scenarios, the output must comply with the specifications set in the configuration XML. The job execution logs can be accessed using the command yarn logs -applicationId <applicationId>.
2.6 Loading and Processing an Image Using the Oracle Spatial Hadoop Raster Processing API
The framework provides a raster processing API that lets you load and process rasters without creating XML but instead using a Java application. The application can be executed inside the cluster or on a remote node.
The API provides access to the framework operations, and is useful for web service or standalone Java applications.
To execute any of the jobs, a HadoopConfiguration
object must be created. This object is used to set the necessary configuration information (such as the jar file name and the GDAL paths) to create the job, manipulate rasters, and execute the job. The basic logic is as follows:
//Creates Hadoop Configuration
HadoopConfiguration hadoopConf = new HadoopConfiguration();
//Assigns GDAL_DATA location based on specified SHAREDDIR, this data folder is required by gdal to look for data tables that allow SRID conversions
String gdalData = sharedDir + ProcessConstants.DIRECTORY_SEPARATOR + "data";
hadoopConf.setGdalDataPath(gdalData);
//Sets jar name for processor
hadoopConf.setMapreduceJobJar("hadoop-imageprocessor.jar");
//Creates the job
RasterProcessorJob processor = (RasterProcessorJob) hadoopConf.createRasterProcessorJob();
If the API is used on a remote node, you can set properties in the Hadoop Configuration object to connect to the cluster. For example:
//Following config settings are required for standalone execution. (REMOTE ACCESS)
hadoopConf.setUser("hdfs");
hadoopConf.setHdfsPathPrefix("hdfs://sys3.example.com:8020");
hadoopConf.setResourceManagerScheduler("sys3.example.com:8030");
hadoopConf.setResourceManagerAddress("sys3.example.com:8032");
hadoopConf.setYarnApplicationClasspath("/etc/hadoop/conf/,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*," +
   "/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*," +
   "/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/* ");
After the job is created, the properties for its execution must be set depending on the job type. There are two job classes: RasterLoaderJob
to load the rasters into HDFS, and RasterProcessorJob
to process them.
The following example loads a Hawaii raster into the APICALL HDFS directory. It creates a thumbnail in a shared folder, and specifies 10 pixels of overlap on each edge of the tiles.
private static void executeLoader(HadoopConfiguration hadoopConf){
   hadoopConf.setMapreduceJobJar("hadoop-imageloader.jar");
   RasterLoaderJob loader = (RasterLoaderJob) hadoopConf.createRasterLoaderJob();
   loader.setFilesToLoad("/net/den00btb/scratch/zherena/hawaii/hawaii.tif");
   loader.setTileOverlap("10");
   loader.setOutputFolder("APICALL");
   loader.setRasterThumbnailFolder("/net/den00btb/scratch/zherena/processOutput");
   try{
      loader.setGdalPath("/net/den00btb/scratch/zherena/gdal/lib");
      boolean loaderSuccess = loader.execute();
      if(loaderSuccess){
         System.out.println("Successfully executed loader job");
      }
      else{
         System.out.println("Failed to execute loader job");
      }
   }catch(Exception e ){
      System.out.println("Problem when trying to execute raster loader " + e.getMessage());
   }
}
The following example processes the loaded raster.
private static void executeProcessor(HadoopConfiguration hadoopConf){
   hadoopConf.setMapreduceJobJar("hadoop-imageprocessor.jar");
   RasterProcessorJob processor = (RasterProcessorJob) hadoopConf.createRasterProcessorJob();
   try{
      processor.setGdalPath("/net/den00btb/scratch/zherena/gdal/lib");
      MosaicConfiguration mosaic = new MosaicConfiguration();
      mosaic.setBands(3);
      mosaic.setDirectory("/net/den00btb/scratch/zherena/processOutput");
      mosaic.setFileName("APIMosaic");
      mosaic.setFileSystem(RasterProcessorJob.FS);
      mosaic.setFormat("GTIFF");
      mosaic.setHeight(3192);
      mosaic.setNoData("#FFFFFF");
      mosaic.setOrderAlgorithm(ProcessConstants.ALGORITMH_FILE_LENGTH);
      mosaic.setOrder("1");
      mosaic.setPixelType("1");
      mosaic.setPixelXWidth(67.457513);
      mosaic.setPixelYWidth(-67.457513);
      mosaic.setSrid("26904");
      mosaic.setUpperLeftX(830763.281336);
      mosaic.setUpperLeftY(2259894.481403);
      mosaic.setWidth(1300);
      processor.setMosaicConfigurationObject(mosaic.getCompactMosaic());
      RasterCatalog catalog = new RasterCatalog();
      Raster raster = new Raster();
      raster.setBands(3);
      raster.setBandsOrder("1,2,3");
      raster.setDataType(1);
      raster.setRasterLocation("/user/hdfs/APICALL/net/den00btb/scratch/zherena/hawaii/hawaii.tif.ohif");
      catalog.addRasterToCatalog(raster);
      processor.setCatalogObject(catalog.getCompactCatalog());
      boolean processorSuccess = processor.execute();
      if(processorSuccess){
         System.out.println("Successfully executed processor job");
      }
      else{
         System.out.println("Failed to execute processor job");
      }
   }catch(Exception e ){
      System.out.println("Problem when trying to execute raster processor " + e.getMessage());
   }
}
In the preceding example, the thumbnail is optional if the mosaic results will be stored in HDFS. If additional user processing classes are specified, the location of the jar file containing these classes must also be specified. The other parameters are required for the mosaic to be generated successfully.
Several examples of using the processing API are provided in /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/src. Review the Java classes to understand their purpose. You can execute them using the scripts provided for each example, located under /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/cmd.
After you have executed the scripts and validated the results, you can modify the Java source files to experiment with them and compile them using the provided script /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/build.xml. Ensure that you have write access on the /opt/oracle/oracle-spatial-graph/spatial/raster/jlib directory.
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.7 Using the Oracle Spatial Hadoop Raster Simulator Framework to Test Raster Processing
When you create custom processing classes, you can use the Oracle Spatial Hadoop Raster Simulator Framework to do the following by "pretending" to plug them into the Oracle Raster Processing Framework.
-
Develop user processing classes on a local computer
-
Avoid the need to deploy user processing classes in a cluster or in Big Data Lite to verify their correct functioning
-
Debug user processing classes
-
Use small local data sets
-
Create local debug outputs
-
Automate unit tests
The Simulator framework emulates the loading and processing in your local environment, as if they were being executed in a cluster. You only need to create a JUnit test case that loads one or more rasters and processes them according to your specification in XML or a configuration object.
Tiles are generated according to the specified block size, so you must set a block size. The number of mappers and reducers to execute depends on the number of tiles, just as in regular cluster execution. OHIF files generated during the loading process are stored in a local directory, because no HDFS is required.
- Simulator ("Mock") Objects
- User Local Environment Requirements
- Sample Test Cases to Load and Process Rasters
Simulator ("Mock") Objects
To load rasters and convert them into .OHIF files that can be processed, a RasterLoaderJobMock must be executed. This class constructor receives the HadoopConfiguration, which must include the block size, the directory of rasters to load, the output directory to store the OHIF files, and the gdal directory. The parameters representing the input files and the user configuration vary in terms of how you specify them:
- Location strings for the catalog and user configuration XML file
- Catalog object (CatalogMock)
- Configuration objects (MosaicProcessConfigurationMock or SingleProcessConfigurationMock)
- Location for a single raster processing and a user configuration (MosaicProcessConfigurationMock or SingleProcessConfigurationMock)
User Local Environment Requirements
Before you create test cases, you need to configure your local environment.
1. Ensure that a directory has the native gdal libraries, gdal-data, and libproj.
   For Linux:
   - Follow the steps in Getting and Compiling the Cartographic Projections Library to obtain libproj.so.
   - Get the gdal distribution from the Spatial installation on your cluster or BigDataLite VM at /opt/oracle/oracle-spatial-graph/spatial/raster/gdal.
   - Move libproj.so to your local gdal directory under gdal/lib with the rest of the native gdal libraries.
   For Windows:
   - Get the gdal distribution from your Spatial installation on your cluster or BigDataLite VM at /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/mock/lib/gdal_windows.x64.zip.
   - Be sure that Visual Studio is installed. When you install it, make sure you select the Common Tools for Visual C++.
   - Download the PROJ 4 source code, version branch 4.9, from https://trac.osgeo.org/proj4j.
   - Open the Visual Studio Development Command Prompt and type:
     cd PROJ4/src_dir
     nmake /f makefile.vc
   - Move proj.dll to your local gdal directory under gdal/bin with the rest of the native gdal libraries.
2. Add the GDAL native libraries to the system path.
   For Linux: Export LD_LIBRARY_PATH with the corresponding native gdal libraries directory.
   For Windows: Add the native gdal libraries directory to the Path environment variable.
3. Ensure that the Java project has JUnit libraries.
4. Ensure that the Java project has the following Hadoop and Oracle Image Processing Framework jar files in the classpath. You can get them from the Oracle BigDataLite VM or from your cluster; these are all jars included in the Hadoop distribution, and the framework-specific jars are in /opt/oracle/oracle-spatial-graph/spatial/raster/jlib.
   (In the following list, VERSION_INCLUDED refers to the version number from the Hadoop installation containing the files; it can be a BDA cluster or a BigDataLite VM.)
   commons-collections-VERSION_INCLUDED.jar
   commons-configuration-VERSION_INCLUDED.jar
   commons-lang-VERSION_INCLUDED.jar
   commons-logging-VERSION_INCLUDED.jar
   commons-math3-VERSION_INCLUDED.jar
   gdal.jar
   guava-VERSION_INCLUDED.jar
   hadoop-auth-VERSION_INCLUDED-cdhVERSION_INCLUDED.jar
   hadoop-common-VERSION_INCLUDED-cdhVERSION_INCLUDED.jar
   hadoop-imageloader.jar
   hadoop-imagemocking-fwk.jar
   hadoop-imageprocessor.jar
   hadoop-mapreduce-client-core-VERSION_INCLUDED-cdhVERSION_INCLUDED.jar
   hadoop-raster-fwk-api.jar
   jackson-core-asl-VERSION_INCLUDED.jar
   jackson-mapper-asl-VERSION_INCLUDED.jar
   log4j-VERSION_INCLUDED.jar
   slf4j-api-VERSION_INCLUDED.jar
   slf4j-log4j12-VERSION_INCLUDED.jar
Sample Test Cases to Load and Process Rasters
After your Java project is prepared for your test cases, you can test the loading and processing of rasters.
The following example creates a class with a setUp method to configure the directories for gdal, the rasters to load, your configuration XML files, the output thumbnails, OHIF files, and process results. It also configures the block size (8 MB); a small block size is recommended for single computers.
/**
* Set the basic directories before starting the test execution
*/
@Before
public void setUp(){
   String sharedDir = "C:\\Users\\zherena\\Oracle Stuff\\Hadoop\\Release 4\\MockTest";
   String allAccessDir = sharedDir + "/out/";
   gdalDir = sharedDir + "/gdal";
   directoryToLoad = allAccessDir + "rasters";
   xmlDir = sharedDir + "/xmls/";
   outputDir = allAccessDir;
   blockSize = 8;
}
The following example creates a RasterLoaderJobMock object, and sets the rasters to load and the output path for OHIF files:
/**
* Loads a directory of rasters, and generate ohif files and thumbnails
* for all of them
* @throws Exception if there is a problem during load process
*/
@Test
public void basicLoad() throws Exception {
   System.out.println("***LOAD OF DIRECTORY WITHOUT EXPANSION***");
   HadoopConfiguration conf = new HadoopConfiguration();
   conf.setBlockSize(blockSize);
   System.out.println("Set block size of: " + conf.getProperty("dfs.blocksize"));
   RasterLoaderJobMock loader = new RasterLoaderJobMock(conf, outputDir, directoryToLoad, gdalDir);
   //Puts the ohif file directly in the specified output directory
   loader.dontExpandOutputDir();
   System.out.println("Starting execution");
   System.out.println("------------------------------------------------------------------------------------------------------------");
   loader.waitForCompletion();
   System.out.println("Finished loader");
   System.out.println("LOAD OF DIRECTORY WITHOUT EXPANSION ENDED");
   System.out.println();
   System.out.println();
}
The following example specifies catalog and user configuration XML files to the RasterProcessorJobMock object. Make sure your catalog XML points to the correct location of your local OHIF files.
/**
* Creates a mosaic raster by using configuration and catalog xmls.
* Only two bands are selected per raster.
* @throws Exception if there is a problem during mosaic process.
*/
@Test
public void mosaicUsingXmls() throws Exception {
   System.out.println("***MOSAIC PROCESS USING XMLS***");
   HadoopConfiguration conf = new HadoopConfiguration();
   conf.setBlockSize(blockSize);
   System.out.println("Set block size of: " + conf.getProperty("dfs.blocksize"));
   String catalogXml = xmlDir + "catalog.xml";
   String configXml = xmlDir + "config.xml";
   RasterProcessorJobMock processor = new RasterProcessorJobMock(conf, configXml, catalogXml, gdalDir);
   System.out.println("Starting execution");
   System.out.println("------------------------------------------------------------------------------------------------------------");
   processor.waitForCompletion();
   System.out.println("Finished processor");
   System.out.println("***********************************************MOSAIC PROCESS USING XMLS ENDED***********************************************");
   System.out.println();
   System.out.println();
}
Additional examples using the different supported configurations for RasterProcessorJobMock are provided in /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/mock/src. They include an example using an external processing class, which is also included and can be debugged.
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.8 Oracle Big Data Spatial Raster Processing for Spark
Oracle Big Data Spatial Raster Processing for Apache Spark is a spatial raster processing API for Java.
This API allows the creation of new combined images resulting from a series of user-defined processing phases, with the following features:
- HDFS image storage, where every block-size split is stored as a separate tile, ready for future independent processing
- Subset, mosaic, and raster algebra operations processed in parallel using Spark to divide the processing
- Support for GDAL formats, multiple band images, DEMs (digital elevation models), multiple pixel depths, and SRIDs
Currently the API supports Spark 1.6 and Spark 2.2. The only visible change in the API is the substitution of Dataframe
with Dataset<Row>
in the results of Spark 2.2 SQL queries.
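For reference, a minimal sketch of that difference is shown below; it assumes sqlContext is an existing SQLContext (Spark 1.6) and spark is an existing SparkSession (Spark 2.2), and uses the same query shown later in this chapter.
//Spark 1.6: SQL query results are typed as DataFrame
DataFrame tiles16 = sqlContext.sql("SELECT tileInfo, userRequest FROM tiles");
//Spark 2.2: the equivalent results are typed as Dataset<Row>
Dataset<Row> tiles22 = spark.sql("SELECT tileInfo, userRequest FROM tiles");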
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.8.1 Spark Raster Loader
The first step in using the raster processing for Spark Java API is to have the images in HDFS, followed by having the images separated into smart tiles. This allows the processor to work on each tile independently. The Spark raster loader lets you import a single image or a collection of them into HDFS in parallel, which decreases the load time. Each block contains data for all the raster bands, so that if further processing is required on specific pixels, the information can be processed on a single node.
The basic workflow for the Spark raster loader is as follows.
- GDAL is used to import the rasters, tiling them according to the block size and then storing each tile as an HDFS block.
- The set of rasters to be loaded is read into a SpatialRasterJavaRDD, which is an extension of JavaRDD. This RDD is a collection of ImagePieceWritable objects that represent the information of the tiles to create per raster, based on the number of bands, pixel size, HDFS block size, and raster resolution. This is accomplished by using the custom input format used in the spatial Hadoop loader.
- The raster information for each tile is loaded. This load is performed by an executor for each tile, so reading is performed in parallel. Each tile includes a certain number of overlapping bytes (user input), so that the tiles cover area from the adjacent tiles. There are "n" Spark executors, depending on the number of tiles, image resolution, and block size.
- The RDD is grouped by key, so that all the tiles that correspond to the same raster are part of the same record. This RDD is saved as OHIF using the OhifOutputFormat, which puts together all the information loaded by the executors and stores the images in a special .ohif format that contains the resolution, bands, offsets, and image data. In this way, the file offset containing each tile and the node location is known. A special reading process is required to read the image back, and it is included in the Spark SQL raster processor.
Each tile contains information for every band. This is helpful when there is a need to process only a few tiles; then, only the corresponding blocks are loaded.
The loader can be configured by setting parameters on the command line or by using the Spark API.
Parent topic: Oracle Big Data Spatial Raster Processing for Spark
2.8.1.1 Input Parameters to the Spark Raster Loader
The following example shows input parameters supplied using the spark-submit command:
spark-submit
--class <DRIVER_CLASS>
--driver-memory <DRIVER_JVM>
--driver-class-path <DRIVER_CLASSPATH>
--jars <EXECUTORS_JARS>
<DRIVER_JAR>
-files <SOURCE_IMGS_PATH>
-out <HDFS_OUTPUT_FOLDER>
-gdal <GDAL_LIB_PATH>
-gdalData <GDAL_DATA_PATH>
[-overlap <OVERLAPPING_PIXELS>]
[-thumbnail <THUMBNAIL_PATH>]
[-expand <false|true>]
[-rasterSource objectStore]
[-containerService containername.service]
[-workDir <OS_WORK_DIR>]
[-credential spatialraster]
Where:
- DRIVER_CLASS is the class that has the driver code and that Spark will execute.
- DRIVER_JVM is the memory to assign to the driver's JVM.
- DRIVER_CLASSPATH is the classpath for the driver class; jars are separated by colons.
- EXECUTORS_JARS is the classpath to be distributed to the executors; jars are separated by commas.
- DRIVER_JAR is the jar that contains the <DRIVER_CLASS> to be executed by Spark.
- SOURCE_IMGS_PATH is a path to the source raster(s) or folder(s). For multiple inputs, use a comma separator. This path must be accessible via NFS to all nodes in the cluster.
- HDFS_OUTPUT_FOLDER is the HDFS output folder where the loaded images are stored.
- OVERLAPPING_PIXELS is an optional number of overlapping pixels on the borders of each tile; if this parameter is not specified, a default of two overlapping pixels is used.
- GDAL_LIB_PATH is the path where the GDAL libraries are located.
- GDAL_DATA_PATH is the path where the GDAL data folder is located. This path must be accessible through NFS to all nodes in the cluster.
- THUMBNAIL_PATH is an optional path to store a thumbnail of the loaded image(s). This path must be accessible through NFS to all nodes in the cluster and must have write access permission for yarn users.
- -expand controls whether the HDFS path of the loaded raster expands the source path, including all directories. If you set this to false, the .ohif file is stored directly in the output directory (specified using the -out option) without including that directory's path in the raster.
- -rasterSource should be set to objectStore if the rasters are in Oracle Object Storage; if not set, the default flow is followed.
- -containerService specifies the Object Storage container and service where the rasters are stored. This is a required field.
- -workDir specifies the local temporary directory to download Object Storage rasters. This is required for Object Storage loads.
- -credential specifies the ID of the credential stored in the Credential Store to secure the Object Storage password.
Parent topic: Spark Raster Loader
2.8.1.2 Expected Output of the Spark Raster Loader
For each input image to the Spark raster loader, there are two output files.
- The .ohif file that concentrates all the tiles for the source image. Each tile (stored as an HDFS block) may be processed as a separate instance by a processing executor. The .ohif file is stored in the folder specified with the -out flag, under /user/<USER_EXECUTING_JOB>/OUT_FOLDER/<PARENT_DIRECTORIES_OF_SOURCE_RASTER> if the -expand flag was not used. Otherwise, the .ohif file is located at /user/<USER_EXECUTING_JOB>/OUT_FOLDER/, and the file can be identified as original_filename.ohif.
- A related metadata file that lists all the pieces of the image and the coordinates that each one covers. This file is located in HDFS under the spatial_raster/metadata location, and its name is hash-generated using the name of the .ohif file. This file is for Oracle internal use only, and lists important metadata of the source raster. Some example lines from a metadata file:
size:3200,2112
srid:26904
datatype:1
resolution:27.90809458890406,-27.90809458890406
file:/user/hdfs/ohiftest/opt/shareddir/spatial/data/rasters/hawaii.tif.ohif
bands:3
mbr:532488.7648166901,4303164.583549625,582723.3350767174,4269619.053853762
0,532488.7648166901,4303164.583549625,582723.3350767174,4269619.053853762
thumbnailpath:/opt/shareddir/spatial/thumb/
If the -thumbnail
flag was specified, a thumbnail of the source image is stored in the related folder. This is a way to visualize a translation of the .ohif
file. Execution logs can be accessed using the command yarn logs -applicationId <applicationId>
.
Parent topic: Spark Raster Loader
2.8.2 Spark SQL Raster Processor
Once the images are loaded into HDFS, they can be processed using the Spark SQL Raster Processor. You specify the expected raster output features using the Mosaic Definition XML Structure or the Spark API, and the mosaic UDF filters the tiles to fit into that output and processes them. Raster algebra operations are also available as a UDF.
A custom InputFormat
, which is also used in the Hadoop raster processing framework, loads specific blocks of data, based on the input (mosaic description or a single raster) using raster SRID and coordinates, and selects only the bands and pixels that fit into the final output before accepting processing operations:
- For a mosaic processing request, only the intersecting tiles are selected, and a split is created for each one of them.
- For a single raster processing request, all the tiles are selected, and a split is created for each one of them.
The Spark SQL Raster Processor allows you to filter the OHIF tiles, based on an input catalog or a single raster, into a DataFrame in which every row represents a tile, and to use spatial Spark UDF functions to process them.
A simplified pseudocode representation of Spark SQL raster processing is:
sqlContext.udf().register("localop", new LocalOperationsFunction(),DataTypes.createStructType(SpatialRasterJavaRDD.createSimpleTileStructField(dataTypeOfTileToProcess)));
tileRows.registerTempTable("tiles");
String query = "SELECT localop(tileInfo, \"localnot\"), userRequest FROM tiles";
DataFrame processedTiles = sqlContext.sql(query);
The basic workflow of the Spark SQL raster processor is as follows.
- The rasters to process are first loaded as tile metadata into an RDD. These tiles may be filtered if the user set a configuration for a mosaic operation. The RDD is later converted to a Spark DataFrame (or Dataset<Row> for Spark 2.2) with two complex columns: the first is tileInfo, which has all the metadata for the tiles, and the second is userRequest, which has the user input configuration listing the expected features of the raster output.
- Once the DataFrame or Dataset<Row> is created, the driver must register the "localop" UDF, and also register the DataFrame or Dataset<Row> as a table before executing a query to process. The mosaic UDF can only be executed if the user configured all the required parameters correctly. If no XML is used and the configuration is set using the API, then by default a mosaic operation configuration is expected unless the setExecuteMosaic(false) method is called.
- The mosaic operation selects from every tile only the pixels that fit into the output, and makes the necessary resolution changes to add them to the mosaic output.
- Once the query is executed, an executor loads the data for the corresponding tile, conserving data locality, and the specified local raster algebra operation is executed.
- The row in the DataFrame or Dataset<Row> is updated with the new pixel data and returned to the driver for further processing, if required.
- Once the processing is done, the DataFrame or Dataset<Row> is converted to a list of ImageBandWritable objects, which are the MapReduce representation of processed tiles. These are input to the ProcessedRasterCreator, where the resulting bytes of local raster algebra and/or mosaic operations are put together, and a final raster is stored into HDFS or the regular file system depending on the user request.
Only images with same data type (pixel depth) as the user configuration input data type (pixel depth) are considered. Only the tiles that intersect with coordinates specified by the user for the mosaic output are included. For processing of a single raster, the filter includes all the tiles of the input rasters, because the processing will be executed on the complete images.
- Input Parameters to the Spark SQL Raster Processor
- Expected Output of the Spark SQL Raster Processor
Parent topic: Oracle Big Data Spatial Raster Processing for Spark
2.8.2.1 Input Parameters to the Spark SQL Raster Processor
The following example shows input parameters supplied using the spark-submit command:
spark-submit
--class <DRIVER_CLASS>
--driver-memory <DRIVER_JVM>
--driver-class-path <DRIVER_CLASSPATH>
--jars <EXECUTORS_JARS>
<DRIVER_JAR>
-config <MOSAIC_CONFIG_PATH>
-gdal <GDAL_LIBRARIES_PATH>
-gdalData <GDAL_DATA_PATH>
[-catalog <IMAGE_CATALOG_PATH>]
[-file <SINGLE_RASTER_PATH>]
Where:
- DRIVER_CLASS is the class that has the driver code and that Spark will execute.
- DRIVER_JVM is the memory to assign to the driver's JVM.
- DRIVER_CLASSPATH is the classpath for the driver class; jars are separated by colons.
- EXECUTORS_JARS is the classpath to be distributed to the executors; jars are separated by commas.
- DRIVER_JAR is the jar that contains the <DRIVER_CLASS> to be executed by Spark.
- MOSAIC_CONFIG_PATH is the path to the mosaic configuration XML, which defines the features of the output.
- GDAL_LIBRARIES_PATH is the path where the GDAL libraries are located.
- GDAL_DATA_PATH is the path where the GDAL data folder is located. This path must be accessible via NFS to all nodes in the cluster.
- IMAGE_CATALOG_PATH is the path to the catalog XML that lists the HDFS image(s) to be processed. This is optional because you can also specify a single raster to process using the -file flag.
- SINGLE_RASTER_PATH is an optional path to the .ohif file that will be processed by the job. If this is set, you do not need to set a catalog.
The following example command will process all the files listed in the catalog file inputSPARK.xml
using the mosaic output definition set in the testFS.xml
file.
spark-submit --class oracle.spatial.spark.raster.test.SpatialRasterTest --driver-memory 2048m --driver-class-path /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-raster-fwk-api.jar:/opt/oracle/oracle-spatial-graph/spatial/raster/jlib/gdal.jar:/opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageloader.jar:/opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageprocessor.jar --jars /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageloader.jar,/opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageprocessor.jar,/opt/oracle/oracle-spatial-graph/spatial/raster/jlib/gdal.jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/spark-raster-fwk-api.jar -taskType algebra -catalog /opt/shareddir/spatial/data/xmls/inputSPARK.xml -config /opt/shareddir/spatial/data/xmls/testFS.xml -gdal /opt/oracle/oracle-spatial-graph/spatial/raster/gdal/lib -gdalData /opt/shareddir/data
Note:
For Spark 2.2, use the spark2-submit
command if there is a Spark 1.6 installation on the same node.
Parent topic: Spark SQL Raster Processor
2.8.2.2 Expected Output of the Spark SQL Raster Processor
For Spark processing, only file system output is supported, which means that the output generated is an image with the file name and type specified and is stored in a regular FileSystem.
The job execution logs can be accessed using the command yarn logs -applicationId <applicationId>
.
Parent topic: Spark SQL Raster Processor
2.8.3 Using the Spark Raster Processing API
You can use the Spark raster API to load and process rasters by creating the driver class.
Some example classes are provided under /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/src
. The /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/cmd
directory also contains scripts to execute these examples from command line.
After executing the scripts and validating the results, you can modify the Java source files to experiment with them and compile them using the provided script /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/build.xml. Ensure that there is write access on the /opt/oracle/oracle-spatial-graph/spatial/raster/jlib directory.
For GDAL to work properly, the libraries must be available using $LD_LIBRARY_PATH. Make sure that the shared libraries path is set properly in your shell window before executing a job. For example:
export LD_LIBRARY_PATH=$ALLACCESSDIR/gdal/native
- Using the Spark Raster Loader API
- Configuring for Using the Spark SQL Processor API
- Creating the DataFrame or Dataset<Row>
- Using the Spark SQL UDF for Raster Algebra Operations
Parent topic: Oracle Big Data Spatial Raster Processing for Spark
2.8.3.1 Using the Spark Raster Loader API
To perform image loading, you must create a SpatialRasterLoader
object. This object is used to set the necessary configuration information for the execution. There are two ways of creating an instance:
- Send as a parameter the array of arguments received from the command line. For example:
//args is the String[] received from command line
SpatialRasterLoader core = new SpatialRasterLoaderCore(args);
- Configure directly in the driver class using the API, which is the subject of this topic.
Using the Loader API, set the GDAL library path, since it will internally initialize the SparkContext
and its corresponding Hadoop configuration. For example:
SpatialRasterLoader core = new SpatialRasterLoader();
core.setGdalLibrary("/opt/sharedddir/spatial/gdal");
core.setFilesToLoad("/opt/shareddir/spatial/rasters");
core.setHDFSOutputDirectory("ohifsparktest");
core.setGdalData("/opt/shareddir/data");
core.setOverlap("20");
core.setThumbnailDirectory("/opt/shareddir/spatial/");
You can optionally change the block size, depending on the most common size of the rasters involved. For example, if the cluster HDFS block size is by default too big (such as 256 MB) and the average size of the user rasters is 64 MB, you should avoid using HDFS space that contains no real data, because every tile occupies a block in HDFS even if the pixels do not fill it. In this scenario, you can change the block size to 64 MB, as in this example:
JavaSparkContext sc = core.getRasterSparkContext();
core.getHadoopConfiguration().set("dfs.blocksize", "67108864");
To execute the loader, use the loadRasters method, which returns true if the rasters were loaded successfully and false otherwise. For example:
if (core.loadRasters(sc, StorageLevel.DISK_ONLY())) {
LOG.info("Successfully loaded raster files");
}
If the processing finished successfully, the OHIF files are in HDFS and the corresponding thumbnails are in the specified directory for user validation.
Parent topic: Using the Spark Raster Processing API
2.8.3.2 Configuring for Using the Spark SQL Processor API
To execute a processor, you must create a SpatialRasterProcessor
object to set the necessary configuration information for the execution. There are two ways to create an instance:
- Send as a parameter the array of arguments received from the command line. For example:
//args is the String[] received from command line
SpatialRasterProcessor processor = new SpatialRasterProcessor(args);
- Configure directly in the driver class using the API, which is the subject of this topic.
As with the loader API, set the GDAL library path, because it will internally initialize the SparkContext and its corresponding Hadoop configuration. For example:
SpatialRasterProcessor processor = new SpatialRasterProcessor();
processor.setGdalLibrary("/opt/sharedddir/spatial/gdal");
processor.setGdalData("/opt/sharedddir/spatial/data");
Specify the rasters that will be processed.
- To add a catalog of rasters to process, especially if a mosaic operation will be performed, consider the following example:
String ohifPath = "ohifsparktest/opt/shareddir/spatial/data/rasters/";
//Creates a catalog to list the rasters to process
RasterCatalog catalog = new RasterCatalog();
//Creates a raster object for the catalog
Raster raster = new Raster();
//raster of 3 bands
raster.setBands(3);
//the three bands will appear in order 1,2,3. You may list fewer bands here.
raster.setBandsOrder("1,2,3");
//raster data type is byte
raster.setDataType(1);
raster.setRasterLocation(ohifPath + "hawaii.tif.ohif");
//Add raster to catalog
catalog.addRasterToCatalog(raster);
Raster rasterKahoolawe = new Raster();
rasterKahoolawe.setBands(3);
rasterKahoolawe.setBandsOrder("1,2,3");
rasterKahoolawe.setDataType(1);
rasterKahoolawe.setRasterLocation(ohifPath + "kahoolawe.tif.ohif");
catalog.addRasterToCatalog(rasterKahoolawe);
//Sets the catalog to the job
processor.setCatalogObject(catalog.getCompactCatalog());
- To process a single raster, consider the following example:
String ohifPath = "ohifsparktest/opt/shareddir/spatial/data/rasters/";
//Set the file to process to the job
processor.setFileToProcess(ohifPath + "NapaDEM.tif.ohif");
Set the user configuration request, which defines details for the output raster.
- If a mosaic operation will be performed, then all the features of the expected output must be set in a MosaicConfiguration object, including the coordinates. The following example creates a raster that includes both Hawaii rasters added to the catalog previously:
MosaicConfiguration mosaic = new MosaicConfiguration();
mosaic.setFormat("GTIFF");
mosaic.setBands(3);
mosaic.setFileSystem(RasterProcessorJob.FS);
mosaic.setDirectory("/opt/shareddir/spatial/processtest");
mosaic.setFileName("HawaiiIslands");
mosaic.setHeight(986);
//value for pixels where there is no data, starts with #, followed by
//two characters per band
mosaic.setNoData("#FFFFFF");
//byte datatype
mosaic.setPixelType("1");
//width for pixels in X and Y
mosaic.setPixelXWidth(280.388143);
mosaic.setPixelYWidth(-280.388143);
mosaic.setSrid("26904");
//upper left coordinates
mosaic.setUpperLeftX(556958.985610);
mosaic.setUpperLeftY(2350324.082505);
mosaic.setWidth(1600);
mosaic.setOrderAlgorithm(ProcessConstants.ALGORITHM_FILE_LENGTH);
mosaic.setOrder(RasterProcessorJob.DESC);
//mosaic configuration must be set to the job
processor.setUserRequestConfigurationObject(mosaic.getCompactMosaic());
- If a mosaic operation will not be performed, then a much simpler configuration is required. For example:
MosaicConfiguration mosaic = new MosaicConfiguration();
mosaic.setExecuteMosaic(false);
mosaic.setBands(1);
mosaic.setLayers("1");
mosaic.setDirectory("/opt/shareddir/spatial/processtest");
mosaic.setFileSystem(RasterProcessorJob.FS);
mosaic.setNoData("#00");
At this point, all required configuration is done. You can now start processing.
Parent topic: Using the Spark Raster Processing API
2.8.3.3 Creating the DataFrame or Dataset<Row>
Before running queries against the rasters, you must load them into a distributed collection of structured data organized into named columns, where every row represents a split. This structure is known as a DataFrame (or Dataset<Row> for Spark 2.2).
The splits are created into a SpatialJavaRDD of tiles, which is then converted to a DataFrame or Dataset<Row>. Depending on your available JVM runtime memory, it is recommended that you cache the DataFrame in memory or on disk. For disk caching, your Spark installation must have Kryo.
The DataFrame or Dataset<Row> consists of two complex columns: tileInfo and userRequest.
- tileInfo: Data for every tile, including not only pixel information but also metadata details.
Table 2-2 tileInfo Column Data
Column | DataType | Nullable | Description
dstWidthSize | Integer | False | Width
dstHeightSize | Integer | False | Height
bands | Integer | False | Number of bands
dType | Integer | False | Data type
piece | Integer | False | Piece number of total pieces in source raster
offX | Integer | False | Offset in X
offY | Integer | False | Offset in Y
sourceWidth | Integer | False | Source raster width
sourceHeight | Integer | False | Source raster height
bytesNumber | Integer | False | Number of bytes
baseArray | [[Pixel DataType]] | False | Array of pixels, one per band
basePaletteArray | [[Integer]] | True | Array of palette interpretation, if the raster has it, one per band
baseColorArray | [Integer] | False | Array of colors, one per band
noDataArray | [Double] | False | Array of NODATA value, one per band
Overlap | Integer | False | Number of overlapping pixels
leftOv | Byte | False | Flag to indicate if there are any overlapping pixels on the left
rightOv | Byte | False | Flag to indicate if there are any overlapping pixels on the right
upOv | Byte | False | Flag to indicate if there are any overlapping pixels on the top
downOv | Byte | False | Flag to indicate if there are any overlapping pixels on the bottom
projectionRef | String | False | Projection reference
geoTransform | [Double] | False | Geo transformation array
Metadata | [String] | False | Location metadata
lastModified | Long | False | Source raster last modification date
imageLength | Double | False | Source raster length
dataLength | Integer | True | Number of bytes after mosaic
xCropInit | Integer | True | Pixel start in X after mosaic
yCropInit | Integer | True | Pixel start in Y after mosaic
xCropLast | Integer | True | Pixel end in X after mosaic
yCropLast | Integer | True | Pixel end in Y after mosaic
catalogOrder | Integer | False | Order in the catalog
baseMountPoint | String | False | Source raster path
sourceResolution | String | False | Source raster resolution
extraFields | [String] | True | Extra fields map, NA
- userRequest: User request configuration, where expected output raster features are defined.
Table 2-3 userRequest Column Data
Column | DataType | Nullable | Description
offset | Long | False | Offset
piece | Integer | False | Piece number
splitSize | Long | False | Split size
bandsToAdd | String | False | Bands to include in output, i.e. "1,2,3"
upperLeftX | Double | True | Coordinate of output in X upper left, used when mosaic is requested
upperLeftY | Double | True | Coordinate of output in Y upper left, used when mosaic is requested
lowerRightX | Double | True | Coordinate of output in X lower right, used when mosaic is requested
lowerRightY | Double | True | Coordinate of output in Y lower right, used when mosaic is requested
width | Integer | True | Output width, used when mosaic is requested
height | Integer | True | Output height, used when mosaic is requested
srid | String | True | Output SRID, used when mosaic is requested
order | String | True | Output order, Ascendant or Descendant, used when mosaic is requested
format | String | True | Output GDAL format, used when mosaic is requested
noData | String | False | Output NODATA value, a # followed by two digits per band, i.e. for 3 band output "#000000"
pixelType | String | True | Output GDAL data type, used when mosaic is requested
Directory | String | False | Output directory
pixelXWidth | Double | True | Output pixel width, used when mosaic is requested
pixelYWidth | Double | True | Output pixel height, used when mosaic is requested
wkt | String | False | Source projection reference
mosaicWkt | String | True | Output projection reference, used when mosaic is requested
processingClasses | String | True | User processing classes to execute, still not supported in Spark
reducingClasses | String | True | User reducing classes to execute, still not supported in Spark
tempOut | String | True | Temporary output folder when HDFS output is requested, still not supported in Spark
filename | String | False | Output filename
contextId | String | False | Execution context Id
sourceResolution | String | False | Source raster resolution
catalogOrder | Integer | False | Source raster order in catalog
executeMosaic | Boolean | False | Flag to indicate if mosaic operation is requested or not
osContainer | [String] | True | Object Storage Container and Service, used to connect using swift
swiftConfig | [String] | True | List of swift configuration properties, including url, user and tenant
hdfsConfig | [String] | False | HDFS URL for BDCS
The following example creates a DataFrame and displays information about it:
JavaSparkContext sc = processor.getRasterSparkContext();
SpatialRasterJavaRDD<GeneralInfoWritable> spatialRDD = processor.getProcessSplits();
HiveContext sqlContext = new HiveContext(sc.sc());
DataFrame tileRows = RDDUtils.createDataFrameFromRDD(sqlContext, spatialRDD.createSpatialTileRows(StorageLevel.DISK_ONLY()));
Row[] rows = tileRows.collect();
System.out.println("First Tile info: ");
System.out.println("Width " + rows[0].getStruct(0).getInt(0));
System.out.println("Height " + rows[0].getStruct(0).getInt(1));
System.out.println("Total width " + rows[0].getStruct(0).getInt(7));
System.out.println("Total height " + rows[0].getStruct(0).getInt(8));
System.out.println("File " + rows[0].getStruct(0).getString(30));
System.out.println("First Tile User request data: ");
System.out.println("Bands to add " + rows[0].getStruct(1).getString(3));
Parent topic: Using the Spark Raster Processing API
2.8.3.4 Using the Spark SQL UDF for Raster Algebra Operations
A Spark UDF, localop, allows the execution of the raster algebra operations described in Map Algebra Operations for processing images using the Hadoop image processor. The operation names and supported data types for the Spark SQL UDF are the same as for Hadoop.
Before any query is executed, the driver class must register the UDF and must register the tiles' DataFrame or Dataset<Row> as a temporary table. For example:
sqlContext.udf().register("localop", new LocalOperationsFunction(), DataTypes.createStructType(RDDUtils.createSimpleTileStructField(dataType)));
Now that the localop UDF is registered, it is ready to be used. This function accepts two parameters:
- A tileInfo row
- A string with the raster algebra operations to execute. Multiple operations may be executed in the same query; they must be separated by semicolons. For operations that receive parameters, the parameters must be separated by commas.
The function returns the tileInfo that was sent to the query, but with the pixel data updated based on the executed operations.
Following are some examples for the execution of different operations.
String query = "SELECT localop(tileInfo, \"localnot\"), userRequest FROM tiles";

String query = "SELECT localop(tileInfo,\"localadd,456;localdivide,2;"
    + "localif,>,0,12;localmultiply,20;localpow,2;localsubstract,4;"
    + "localsqrt;localacos\"), userRequest FROM tiles";

String query = "SELECT localop(tileInfo,\"localnot;localatan;localcos;"
    + "localasin;localtan;localcosh;localtanh\"), userRequest FROM tiles";
To execute the query (Spark 1.6), enter the following:
DataFrame cachedTiles = RDDUtils.queryAndCache(query, sqlContext);
This new DataFrame has the updated pixels. You can optionally save the content of a specific tile as a TIF file, in which case it is stored in the configured output directory. For example:
Row[] pRows = cachedTiles.collect();
processor.debugTileBySavingTif(pRows[0],
processor.getHadoopConfiguration());
To execute the mosaic operation (Spark 1.6), first perform any raster algebra processing, and then perform the mosaic operation. A Spark UDF is used for the mosaic operation; it receives the tileInfo and userRequest columns, and returns the updated tileInfo that fits in the mosaic. For example:
sqlContext.udf().register("mosaic", new MosaicFunction(), DataTypes.createStructType(RDDUtils.createSimpleTileStructField(dataType)));
cachedTiles.registerTempTable("processedTiles");
String queryMosaic = "SELECT mosaic(tileInfo, userRequest), userRequest FROM processedTiles";
DataFrame mosaicTiles = RDDUtils.queryAndCache(queryMosaic, sqlContext);
After the processing is done, you can put together the tiles into the output raster by using ProcessedRasterCreator
, which receives a temporary HDFS directory for internal work, the DataFrame to merge, and the Spark Context from the Hadoop configuration. This will create the expected output raster in the specified output directory. For example:
try {
ProcessedRasterCreator creator = new ProcessedRasterCreator();
creator.create(new Text("createOutput"), mosaicTiles,
sc.hadoopConfiguration());
LOG.info("Finished");
} catch (Exception e) {
LOG.error("Failed processor job due to " + e.getMessage());
throw e;
}
To execute the query in Spark 2.2, enter the following:
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL raster processor")
.getOrCreate();
Dataset<Row> tileRows = DatasetSupport.createDatasetFromRDD(spark.sqlContext(), spatialRDD.createSpatialTileRows(StorageLevel.DISK_ONLY()));
spark.sqlContext().udf().register("localop", new LocalOperationsFunction(), DataTypes.createStructType(RDDUtils.createSimpleTileStructField(dataType)));
tileRows.createTempView("tiles");
String query = "SELECT localop(tileInfo, \"localnot\"), userRequest FROM tiles";
Dataset<Row> cachedTiles = DatasetSupport.queryAndCache(query, spark.sqlContext());
This new Dataset has the updated pixels. You can optionally save the content of a specific tile as a TIF file, which will be stored in the configured output directory. For example:
List<Row> pRows = cachedTiles.collectAsList();
processor.debugTileBySavingTif(pRows.get(0), processor.getHadoopConfiguration());
To execute the mosaic operation (Spark 2.2), first perform any raster algebra processing, and then perform the mosaic operation. A new Spark UDF is used for the mosaic operation; it receives the tileInfo and userRequest columns, and returns the updated tileInfo that fits in the mosaic. For example:
spark.sqlContext().udf().register("mosaic", new MosaicFunction(), DataTypes.createStructType(RDDUtils.createSimpleTileStructField(dataType)));
cachedTiles.createTempView("processedTiles");
String queryMosaic = "SELECT mosaic(tileInfo, userRequest), userRequest FROM processedTiles";
Dataset<Row> mosaicTiles = DatasetSupport.queryAndCache(queryMosaic, spark.sqlContext());
After the processing is done, you can put together the tiles into the output raster by using ProcessedRasterCreator
, which receives a temporary HDFS directory for internal work, the DataFrame to merge, and the Spark Context from the Hadoop configuration. This will create the expected output raster in the specified output directory. For example:
try {
   ProcessedRasterCreator creator = new ProcessedRasterCreator();
   List<Row> mRows = mosaicTiles.collectAsList();
   creator.create(new Text("createOutput"), mRows.toArray(new Row[0]), sc.hadoopConfiguration());
   LOG.info("Finished");
} catch (Exception e) {
   LOG.error("Failed processor job");
   throw e;
}
Parent topic: Using the Spark Raster Processing API
2.9 Spatial Raster Processing Support in Big Data Cloud Service
Oracle Big Data Spatial Raster Processing is supported in Big Data Cloud Service (BDCS) by making use of the Oracle Object Storage platform.
The NFS shared directories SHARED_DIR and ALL_ACCESS_DIR are not required in this configuration; instead, a connection and a valid account for Oracle Object Storage are required.
The general flow of raster processing in Big Data Cloud Service is shown in the following figure.
Figure 2-1 Spatial Raster Processing in Big Data Cloud Service

As shown in the preceding figure:
- Rasters (.tif) are loaded into Oracle Object Storage.
- The Big Data Spatial Raster Loader imports them and converts them to OHIF (.ohif) files in HDFS.
- The Big Data Spatial Raster Processor stores the output in Oracle Object Storage.
- Oracle Object Storage Container Configuration
- BDCS Loader Configuration
- Raster Processing in BDCS
- Oracle Object Storage Password Configuration
Oracle Object Storage Container Configuration
You must have an Oracle Object Storage connection and a valid user. You must create a container and store your input rasters there. For information about how to create the container and add rasters to Object Storage, see REST API for Standard Storage in Oracle Cloud Infrastructure Object Storage Classic.
You must configure the Object Storage endpoint and login details for your jobs by using the following properties.
- fs.swift.service.YOURSERVICEID.auth.url: Endpoint authentication URL.
- fs.swift.service.YOURSERVICEID.public: Indicates if all URLs are public.
- fs.swift.service.YOURSERVICEID.tenant: Tenant to connect (required).
- fs.swift.service.YOURSERVICEID.username: Username to authenticate.
These properties must be set using Hadoop Configuration.
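For example, a minimal sketch of setting these properties (the endpoint, tenant, and username values are placeholders, and conf stands for the Hadoop Configuration used by your loader or processor job):
//Requires org.apache.hadoop.conf.Configuration; values shown are placeholders
Configuration conf = new Configuration();
conf.set("fs.swift.service.YOURSERVICEID.auth.url", "https://<storage-endpoint>/auth/v1.0");
conf.set("fs.swift.service.YOURSERVICEID.public", "true");
conf.set("fs.swift.service.YOURSERVICEID.tenant", "<tenant>");
conf.set("fs.swift.service.YOURSERVICEID.username", "<username>");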
BDCS Loader Configuration
When loading rasters from Object Storage, you must provide a directory with enough space to download from Object Storage all the rasters that will be processed using the Spatial Raster Processor. This directory must be located in the node where the job is being called.
Use a command such as in the following example:
hadoop jar hadoop-imageloader.jar
-files <SOURCE_IMGS_PATH>
-out <HDFS_OUTPUT_FOLDER>
-gdal <GDAL_LIB_PATH>
-gdalData <GDAL_DATA_PATH>
[-overlap <OVERLAPPING_PIXELS>]
[-thumbnail <THUMBNAIL_PATH>]
[-rasterSource objectStore]
[-containerService containername.service]
[-workDir /system123/scratch/user3]
[-credential spatialraster]
Where:
- rasterSource: Set to objectStore if the rasters are in Oracle Object Storage; if not set, the default flow is followed.
- containerService: Specifies the Object Storage container and service where the rasters are stored. This is a required field.
- workDir: Local temporary directory to download Object Storage rasters. This is required for Object Storage loads.
- credential: Specifies the ID of the credential stored in the Credential Store to secure the Object Storage password.
The same properties must be set when using Spatial Raster Processing API. For example:
loader.setObjectRasterSource();
loader.setObjectStorageContainer("spatial.melli");
loader.setObjectStorageTemporaryDir("/opt/test");
loader.setObjectStorageCredential("spatialraster");
Raster Processing in BDCS
To request a raster process in BDCS, you must specify the following elements in the User Request Configuration XML:
- directory type: Set to OS for Oracle Object Storage processing.
- directory container: Specifies the container and service where the output will be stored, <containername.service>. This is required for Object Storage processing.
- directory credential: The ID of the credential stored in the Credential Store to secure the Object Storage password.
- directory: Output directory in Object Storage. This is required for Object Storage processing.
- tempFsFolder: HDFS temporary directory to store mapper processing results. This is required for Object Storage processing.
The XML should look like this:
<directory type="OS" container="spatial.user3" credential="spatialraster">/user/oracle/output</directory>
<tempFsFolder>tempOS</tempFsFolder>
The same properties must be set when using Spatial Raster Processing API. For example:
mosaic.setFileSystem(RasterProcessorJob.OBJECT_STORAGE);
mosaic.setOSContainer("spatial.user3");
//Output directory in Object Storage
mosaic.setDirectory("/user/oracle/RasterAPIOutput");
//HDFS directory for temporary local processing results
mosaic.setTemporaryFolder("processtmpOS");
//Credential ID in Credential Store to connect to Object Storage
mosaic.setCredential("spatialraster");
Oracle Object Storage Password Configuration
To connect to Oracle Object Storage to download the input rasters or to store the processing results, you must provide a password. To secure this password, you must store it in Cluster Credential Store so that it is not passed in clear text in command line parameters or job code.
To store credentials in the credential store for a cluster:
- Open the cluster console for the cluster.
- Click Settings, then click the Credentials tab.
- In the User Credentials section, click New Credential to create a new credential.
- In the Key field, enter the desired name or identifier for the credential. For example: database_password.
- In the Value field, enter the value for the credential, then click Save.
The identifier used to store the credential (key field) will be provided to the Raster Loader and Processor as the credential. The framework will retrieve the password using this identifier and use it to connect to Object Storage.
Use the oracle user to execute the jobs so that the password is retrieved correctly, because the Credential Store currently grants access to this password only to the oracle user.
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.10 Oracle Big Data Spatial Vector Analysis
Oracle Big Data Spatial Vector Analysis is a Spatial Vector Analysis API, which runs as a Hadoop job and provides MapReduce components for spatial processing of data stored in HDFS.
These components make use of the Spatial Java API to perform spatial analysis tasks. There is a web console provided along with the API.
- Multiple Hadoop API Support
- Spatial Indexing
- Using MVSuggest
- Spatial Filtering
- Classifying Data Hierarchically
- Generating Buffers
- Spatial Binning
- Spatial Clustering
- Spatial Join
- Spatial Partitioning
- RecordInfoProvider
- HierarchyInfo
- Using JGeometry in MapReduce Jobs
- Support for Different Data Sources
- Job Registry
- Tuning Performance Data of Job Running Times Using the Vector Analysis API
See Also:
See the preceding topics for the implementation details.
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.10.1 Multiple Hadoop API Support
Oracle Big Data Spatial Vector Analysis provides classes for both the old and new (context objects) Hadoop APIs. In general, classes in the mapred
package are used with the old API, while classes in the mapreduce
package are used with the new API.
The examples in this guide use the old Hadoop API; however, all the old Hadoop Vector API classes have equivalent classes in the new API. For example, the old class oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing
has the equivalent new class named oracle.spatial.hadoop.vector.mapreduce.job.SpatialIndexing
. In general, and unless stated otherwise, only the change from mapred
to mapreduce
is needed to use the new Hadoop API Vector classes.
Classes such as oracle.spatial.hadoop.vector.RecordInfo
, which are not in the mapred or mapreduce package, are compatible with both Hadoop APIs.
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.2 Spatial Indexing
A spatial index is in the form of a key/value pair and generated as a Hadoop MapFile. Each MapFile entry contains a spatial index for one split of the original data. The key and value pair contains the following information:
-
Key: a split identifier in the form: path + start offset + length.
-
Value: a spatial index structure containing the actual indexed records.
The following figure depicts a spatial index in relation to the user data. The records are represented as r1, r2, and so on. The records are grouped into splits (Split 1, Split 2, Split 3, Split n). Each split has a Key-Value pair where the key identifies the split and the value identifies an Rtree index on the records in that split.
- Spatial Indexing Class Structure
- Configuration for Creating a Spatial Index
- Spatial Index Metadata
- Input Formats for a Spatial Index
- Support for GeoJSON and Shapefile Formats
- Removing a Spatial Index
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.2.1 Spatial Indexing Class Structure
Records in a spatial index are represented using the class oracle.spatial.hadoop.vector.RecordInfo
. A RecordInfo
typically contains a subset of the original record data and a way to locate the record in the file where it is stored. The specific RecordInfo
data depends on two things:
- The InputFormat used to read the data
- The RecordInfoProvider implementation, which provides the record's data
A RecordInfo contains the following fields:
- Id: Text field with the record Id.
- Geometry: JGeometry field with the record geometry.
- Extra fields: Additional optional fields of the record can be added as name-value pairs. The values are always represented as text.
- Start offset: The position of the record in a file as a byte offset. This value depends on the InputFormat used to read the original data.
- Length: The original record length in bytes.
- Path: The file path can be added optionally. This is optional because the file path can be known using the spatial index entry key. However, to add the path to the RecordInfo instances when a spatial index is created, set the configuration property oracle.spatial.recordInfo.includePathField to true, as shown in the sketch after this list.
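A minimal sketch of enabling that property, assuming it can be set as a boolean flag directly on the job configuration before the indexing job is configured and launched:

//Sketch only: include the file path in each RecordInfo of the spatial index.
//Assumption: the property is read as a boolean from the job configuration.
conf.setBoolean("oracle.spatial.recordInfo.includePathField", true);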
Parent topic: Spatial Indexing
2.10.2.2 Configuration for Creating a Spatial Index
A spatial index is created using a combination of FileSplitInputFormat
, SpatialIndexingMapper
, InputFormat
, and RecordInfoProvider
, where the last two are provided by the user. The following code example shows part of the configuration needed to run a job that creates a spatial index for the data located in the HDFS folder /user/data
.
//input
conf.setInputFormat(FileSplitInputFormat.class);
FileSplitInputFormat.setInputPaths(conf, new Path("/user/data"));
FileSplitInputFormat.setInternalInputFormatClass(conf, GeoJsonInputFormat.class);
FileSplitInputFormat.setRecordInfoProviderClass(conf, GeoJsonRecordInfoProvider.class);
//output
conf.setOutputFormat(MapFileOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path("/user/data_spatial_index"));
//mapper
conf.setMapperClass(SpatialIndexingMapper.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(RTreeWritable.class);
In this example:
- The FileSplitInputFormat is set as the job InputFormat. FileSplitInputFormat is a subclass of CompositeInputFormat (WrapperInputFormat in the new Hadoop API version), an abstract class that uses another InputFormat implementation (internalInputFormat) to read the data. The internal InputFormat and the RecordInfoProvider implementations are specified by the user, and they are set to GeoJsonInputFormat and GeoJsonRecordInfoProvider, respectively.
- The MapFileOutputFormat is set as the OutputFormat in order to generate a MapFile.
- The mapper is set to SpatialIndexingMapper. The mapper output key and value types are Text (split identifiers) and RTreeWritable (the actual spatial indexes).
- No reducer class is specified, so the default reducer is used. The reduce phase is needed to sort the output MapFile keys.
Alternatively, this configuration can be set more easily by using the oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing
class. SpatialIndexing
is a job driver that creates a spatial index. In the following example, a SpatialIndexing
instance is created, set up, and used to add the settings to the job configuration by calling the configure()
method. Once the configuration has been set, the job is launched.
SpatialIndexing<LongWritable, Text> spatialIndexing = new SpatialIndexing<LongWritable, Text>();
//path to input data
spatialIndexing.setInput("/user/data");
//path of the spatial index to be generated
spatialIndexing.setOutput("/user/data_spatial_index");
//input format used to read the data
spatialIndexing.setInputFormatClass(TextInputFormat.class);
//record info provider used to extract records information
spatialIndexing.setRecordInfoProviderClass(TwitterLogRecordInfoProvider.class);
//add the spatial indexing configuration to the job configuration
spatialIndexing.configure(jobConf);
//run the job
JobClient.runJob(jobConf);
Parent topic: Spatial Indexing
2.10.2.3 Spatial Index Metadata
A metadata file is generated for every spatial index that is created. The spatial index metadata can be used to quickly find information related to a spatial index, such as the number of indexed records, the minimum bounding rectangle (MBR) of the indexed data, and the paths of both the spatial index and the indexed source data. The spatial index metadata can be retrieved using the spatial index name.
A spatial index metadata file contains the following information:
-
Spatial index name
-
Path to the spatial index
-
Number of indexed records
-
Number of local indexes
-
Extra fields contained in the indexed records
-
Geometry layer information such as the SRID, dimensions, tolerance, dimension boundaries, and whether the geometries are geodetic or not
-
The following information for each of the local spatial index files: path to the indexed data, path to the local index, and MBR of the indexed data
The following metadata properties can be set when creating a spatial index using the SpatialIndexing
class:
- indexName: Name of the spatial index. If not set, the output folder name is used.
- metadataDir: Path to the directory where the metadata file will be stored. By default, it will be stored in the following path relative to the user directory: oracle_spatial/index_metadata. If the user is hdfs, it will be /user/hdfs/oracle_spatial/index_metadata.
- overwriteMetadata: If set to true, then when a spatial index metadata file already exists for a spatial index with the same indexName in the current metadataDir, the spatial index metadata will be overwritten. If set to false and a spatial index metadata file already exists for a spatial index with the same indexName in the current metadataDir, then an error is raised.
The following example sets the metadata directory and spatial index name, and specifies to overwrite any existing metadata if the index already exists:
spatialIndexing.setMetadataDir("/user/hdfs/myIndexMetadataDir");
spatialIndexing.setIndexName("testIndex");
spatialIndexing.setOverwriteMetadata(true);
An existing spatial index can be passed to other jobs by specifying only the indexName
and optionally the indexMetadataDir
where the index metadata can be found. When the index name is provided, there is no need to specify the spatial index path and the input format.
The following job drivers accept the indexName
as a parameter:
- oracle.spatial.hadoop.vector.mapred.job.Categorization
- oracle.spatial.hadoop.vector.mapred.job.SpatialFilter
- oracle.spatial.hadoop.vector.mapred.job.Binning
- Any driver that accepts oracle.spatial.hadoop.vector.InputDataSet, such as SpatialJoin and Partitioning
If the index name is not found in the indexMetadataDir
path, an error is thrown indicating that the spatial index could not be found.
The following example shows a spatial index being set as the input data set for a binning job:
Binning binning = new Binning();
binning.setIndexName("indexExample");
binning.setIndexMetadataDir("indexMetadataDir");
Parent topic: Spatial Indexing
2.10.2.4 Input Formats for a Spatial Index
An InputFormat
must meet the following requirements to be supported:
- It must be a subclass of FileInputFormat.
- The getSplits() method must return either FileSplit or CombineFileSplit split types.
- For the old Hadoop API, the RecordReader's getPos() method must return the current position, in order to track a record in the spatial index back to its original record in the user file. If the current position is not returned, then the original record cannot be found using the spatial index. However, the spatial index can still be created and used in operations that do not require the original record to be read. For example, additional fields can be added as extra fields to avoid having to read the whole original record.
Note:
The spatial indexes are created for each split as returned by the getSplits() method. When the spatial index is used for filtering (see Spatial Filtering), it is recommended to use the same InputFormat implementation as the one used to create the spatial index, to ensure that the split indexes can be found.
The getPos()
method has been removed from the new Hadoop API; however, org.apache.hadoop.mapreduce.lib.input.TextInputFormat
and CombineTextInputFormat
are supported, and it is still possible to get the record start offsets.
Other input formats from the new API are supported, but the record start offsets will not be contained in the spatial index. Therefore, it is not possible to find the original records. The requirements for a new API input format are the same as for the old API. However, they must be translated to the new API's FileInputFormat
, FileSplit
, and CombineFileSplit
.
Parent topic: Spatial Indexing
2.10.2.5 Support for GeoJSON and Shapefile Formats
The Vector API comes with InputFormat
and RecordInfoProvider
implementations for GeoJSON and Shapefile file formats.
The following InputFormat/RecordInfoProvider
pairs can be used to read and interpret GeoJSON and ShapeFiles, respectively:
oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat / oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider
oracle.spatial.hadoop.vector.shapefile.mapred.ShapeFileInputFormat / oracle.spatial.hadoop.vector.shapefile.ShapeFileRecordInfoProvider
More information about the usage and properties is available in the Javadoc.
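As an illustration only (not taken from the original guide), a shapefile data set could be indexed with the SpatialIndexing job driver by swapping in the shapefile pair. The input and output paths and the key/value type parameters below are assumptions:

//Hypothetical sketch: index ESRI shapefile data using the shipped shapefile classes.
//The generic key/value types and the paths are illustrative assumptions.
JobConf jobConf = new JobConf(getConf());
SpatialIndexing<LongWritable, Text> shpIndexing = new SpatialIndexing<LongWritable, Text>();
shpIndexing.setInput("/user/data/shapefiles");
shpIndexing.setOutput("/user/data_shapefile_index");
shpIndexing.setInputFormatClass(ShapeFileInputFormat.class);
shpIndexing.setRecordInfoProviderClass(ShapeFileRecordInfoProvider.class);
shpIndexing.configure(jobConf);
JobClient.runJob(jobConf);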
Parent topic: Spatial Indexing
2.10.2.6 Removing a Spatial Index
A previously generated spatial index can be removed by executing the following.
oracle.spatial.hadoop.vector.util.Tools removeSpatialIndex indexName=<INDEX_NAME> [indexMetadataDir=<PATH>] [removeIndexFiles=<true|false*>]
Where:
- indexName: Name of a previously generated index.
- indexMetadataDir (optional): Path to the index metadata directory. If not specified, the following path relative to the user directory will be used: oracle_spatial/index_metadata
- removeIndexFiles (optional): true if the generated index map files need to be removed in addition to the index metadata file. By default, it is false.
Parent topic: Spatial Indexing
2.10.3 Using MVSuggest
MVSuggest
can be used at the time of spatial indexing to get an approximate location for records that do not have geometry but have some text field. This text field can be used to determine the record location. The geometry returned by MVSuggest
is used to include the record in the spatial index.
Because it is important to know the field containing the search text for every record, the RecordInfoProvider
implementation must also implement LocalizableRecordInfoProvider
. Alternatively, the configuration parameter oracle.spatial.recordInfo.locationField
can be set with the name of the field containing the search text. For more information, see the Javadoc for LocalizableRecordInfoProvider
.
A standalone version of MVSuggest
is shipped with the Vector API and it can be used in some jobs that accept the MVSConfig
as an input parameter.
The following job drivers can work with MVSuggest
and all of them have the setMVSConfig()
method which accepts an instance of MVSConfig
:
- oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing: has the option of using MVSuggest to get an approximate spatial location for records that do not contain a geometry.
- oracle.spatial.hadoop.vector.mapred.job.Categorization: MVSuggest can be used to assign a record to a specific feature in a layer, for example, the feature California in the USA states layer.
- oracle.spatial.hadoop.vector.mapred.job.SuggestService: A simple job that generates a file containing a search text and its match per input record.
The MVSuggest configuration is passed to a job using the MVSConfig or the LocalMVSConfig classes. The basic MVSuggest properties are:
- serviceLocation: The minimum property required in order to use MVSuggest. It contains the path or URL where the MVSuggest directory is located or, in the case of a URL, where the MVSuggest service is deployed.
- serviceInterfaceType: The type of MVSuggest implementation used. It can be LOCAL (default) for a standalone version or WEB for the web service version.
- matchLayers: An array of layer names used to perform the searches.
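For illustration only, the basic properties might be set as in the following sketch on an existing SpatialIndexing job instance. The setter names setServiceInterfaceType and setMatchLayers, the "LOCAL" value, and the layer names are assumptions based on the property descriptions above, not confirmed API details:

//Hypothetical sketch of a basic MVSuggest configuration.
//Assumption: setter names mirror the property names listed above.
MVSConfig mvsConf = new MVSConfig();
mvsConf.setServiceLocation("file:///home/user/mvs_dir/");
mvsConf.setServiceInterfaceType("LOCAL"); //assumption: LOCAL is the default standalone mode
mvsConf.setMatchLayers(new String[]{"world_continents", "world_countries"});
spatialIndexingJob.setMvsConfig(mvsConf);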
When using the standalone version of MVSuggest
, you must specify an MVSuggest
directory or repository as the serviceLocation
. An MVSuggest
directory must have the following structure:
mvsuggest_config.json
repository folder
    one or more layer template files in .json format
    optionally, a _config_ directory
    optionally, a _geonames_ directory
The examples
folder comes with many layer template files and a _config_
directory with the configuration for each template.
It is possible to set the repository folder (the one that contains the templates) as the service location instead of the whole MVSuggest
directory. In order to do that, the class LocalMVSConfig
can be used instead of MVSConfig
and the repositoryLocation
property must be set to true
as shown in the following example:
LocalMVSConfig lmvsConf = new LocalMVSConfig();
lmvsConf.setServiceLocation("file:///home/user/mvs_dir/repository/");
lmvsConf.setRepositoryLocation(true);
lmvsConf.setPersistentServiceLocation("/user/hdfs/hdfs_mvs_dir");
spatialIndexingJob.setMvsConfig(lmvsConf);
The preceding example sets a repository folder as the MVS service location. setRepositoryLocation
is set to true
to indicate that the service location is a repository instead of the whole MVSuggest
directory. When the job runs, a whole MVSuggest
directory will be created using the given repository location; the repository will be indexed and will be placed in a temporary folder while the job finishes. The previously indexed MVSuggest
directory can be persisted so it can be used later. The preceding example saves the generated MVSuggest
directory in the HDFS path /user/hdfs/hdfs_mvs_dir
. Use the MVSDirectory if the MVSuggest
directory already exists.
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.4 Spatial Filtering
Once the spatial index has been generated, it can be used to spatially filter the data. The filtering is performed before the data reaches the mapper and while it is being read. The following sample code example demonstrates how the SpatialFilterInputFormat
is used to spatially filter the data.
//set input path and format
FileInputFormat.setInputPaths(conf, new Path("/user/data/"));
conf.setInputFormat(SpatialFilterInputFormat.class);
//set internal input format
SpatialFilterInputFormat.setInternalInputFormatClass(conf, TextInputFormat.class);
if( spatialIndexPath != null )
{
     //set the path to the spatial index and put it in the distributed cache
     boolean useDistributedCache = true;
     SpatialFilterInputFormat.setSpatialIndexPath(conf, spatialIndexPath, useDistributedCache);
}
else
{
     //as no spatial index is used a RecordInfoProvider is needed
     SpatialFilterInputFormat.setRecordInfoProviderClass(conf, TwitterLogRecordInfoProvider.class);
}
//set spatial operation used to filter the records
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setJsonQueryWindow("{\"type\":\"Polygon\", \"coordinates\":[[-106.64595, 25.83997, -106.64595, 36.50061, -93.51001, 36.50061, -93.51001, 25.83997 , -106.64595, 25.83997]]}");
spatialOpConf.setSrid(8307);
spatialOpConf.setTolerance(0.5);
spatialOpConf.setGeodetic(true);
SpatialFilterInputFormat
has to be set as the job's InputFormat
. The InputFormat
that actually reads the data must be set as the internal InputFormat
. In this example, the internal InputFormat
is TextInputFormat
.
If a spatial index is specified, it is used for filtering. Otherwise, a RecordInfoProvider
must be specified in order to get the records geometries, in which case the filtering is performed record by record.
As a final step, the spatial operation and query window to perform the spatial filter are set. It is recommended to use the same internal InputFormat
implementation used when the spatial index was created or, at least, an implementation that uses the same criteria to generate the splits. For details see "Input Formats for a Spatial Index."
If a simple spatial filtering needs to be performed (that is, only retrieving records that interact with a query window), the built-in job driver oracle.spatial.hadoop.vector.mapred.job.SpatialFilter
can be used instead. This job driver accepts indexed or non-indexed input and a SpatialOperationConfig
to perform the filtering.
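The following is a minimal sketch of using this driver, assuming it exposes the same style of setters as the other job drivers shown in this chapter (setIndexName, setOutput, setSpatialOperationConfig, and configure); the index name, output path, and query window are placeholders:

//Hedged sketch: filter a previously indexed data set with the SpatialFilter job driver.
//Assumption: setter names follow the pattern of the other drivers in this chapter.
JobConf conf = new JobConf(getConf());
SpatialFilter spatialFilter = new SpatialFilter();
spatialFilter.setIndexName("indexExample");
spatialFilter.setOutput("/user/filtered_records");
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();
spatialOpConf.setOperation(SpatialOperation.AnyInteract);
spatialOpConf.setQueryWindow(JGeometry.createLinearPolygon(new double[]{47.70, -124.28, 47.70, -95.12, 35.45, -95.12, 35.45, -124.28, 47.70, -124.28}, 2, 8307));
spatialOpConf.setSrid(8307);
spatialOpConf.setTolerance(0.5);
spatialOpConf.setGeodetic(true);
spatialFilter.setSpatialOperationConfig(spatialOpConf);
spatialFilter.configure(conf);
JobClient.runJob(conf);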
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.4.1 Filtering Records
The following steps are executed when records are filtered using the SpatialFilterInputFormat
and a spatial index.
- The SpatialFilterInputFormat getRecordReader() method is called when the mapper requests a RecordReader for the current split.
- The spatial index for the current split is retrieved.
- A spatial query is performed over the records contained in it using the spatial index.
As a result, the ranges in the split that contain records meeting the spatial filter are known. For example, if a split goes from the file position 1000 to 2000, upon executing the spatial filter it can be determined that records that fulfill the spatial condition are in the ranges 1100-1200, 1500-1600 and 1800-1950. So the result of performing the spatial filtering at this stage is a set of smaller splits within the original split.
- A RecordReader from the internal InputFormat is requested for every small split in the resulting split subset.
- A RecordReader is returned to the calling mapper. The returned RecordReader is actually a wrapper RecordReader around one or more RecordReaders returned by the internal InputFormat.
- Every time the mapper calls the RecordReader, the call to the next method to read a record is delegated to the internal RecordReader.
These steps are shown in the following spatial filter interaction diagram.
Parent topic: Spatial Filtering
2.10.4.2 Filtering Using the Input Format
A previously generated Spatial Index can be read using the input format implementation oracle.spatial.hadoop.vector.mapred.input.SpatialIndexInputFormat
(or its new Hadoop API equivalent with the mapreduce
package instead of mapred
). SpatialIndexInputFormat
is used just like any other FileInputFormat
subclass in that it takes an input path and it is set as the job’s input format. The key and values returned are the id (Text)
and record information (RecordInfo)
of the records stored in the spatial index.
Additionally, a spatial filter operation can be performed by specifying a spatial operation configuration to the input format, so that only the records matching some spatial interaction will be returned to a mapper. The following example shows how to configure a job to read a spatial index to retrieve all the records that are inside a specific area.
JobConf conf = new JobConf();
conf.setMapperClass(MyMapper.class);
conf.setInputFormat(SpatialIndexInputFormat.class);
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setQueryWindow(JGeometry.createLinearPolygon(new double[]{47.70, -124.28, 47.70, -95.12, 35.45, -95.12, 35.45, -124.28, 47.70, -124.28}, 2, 8307));
SpatialIndexInputFormat.setFilterSpatialOperationConfig(spatialOpConf, conf);
The mapper in the preceding example can add a nonspatial filter by using the RecordInfo
extra fields, as shown in the following example.
public class MyMapper extends MapReduceBase implements Mapper<Text, RecordInfo, Text, RecordInfo>{
@Override
public void map(Text key, RecordInfo value, OutputCollector<Text, RecordInfo> output, Reporter reporter)
throws IOException {
if( Integer.valueOf(value.getField("followers_count")) > 0){
output.collect(key, value);
}
}
}
Parent topic: Spatial Filtering
2.10.5 Classifying Data Hierarchically
The Vector Analysis API provides a way to classify the data into hierarchical entities. For example, in a given set of catalogs with a defined level of administrative boundaries such as continents, countries and states, it is possible to join a record of the user data to a record of each level of the hierarchy data set. The following example generates a summary count for each hierarchy level, containing the number of user records per continent, country, and state or province:
Categorization catJob = new Categorization(); //set a spatial index as the input catJob.setIndexName("indexExample"); //set the job's output catJob.setOutput("hierarchy_count"); //set HierarchyInfo implementation which describes the world administrative boundaries hierarchy catJob.setHierarchyInfoClass( WorldDynaAdminHierarchyInfo.class ); //specify the paths of the hierarchy data Path[] hierarchyDataPaths = { new Path("file:///home/user/catalogs/world_continents.json"), new Path("file:///home/user/catalogs/world_countries.json"), new Path("file:///home/user/catalogs/world_states_provinces.json")}; catJob.setHierarchyDataPaths(hierarchyDataPaths); //set the path where the index for the previous hierarchy data will be generated catJob.setHierarchyIndexPath(new Path("/user/hierarchy_data_index/")); //setup the spatial operation which will be used to join records from the two datasets (spatial index and hierarchy data). SpatialOperationConfig spatialOpConf = new SpatialOperationConfig(); spatialOpConf.setOperation(SpatialOperation.IsInside); spatialOpConf.setSrid(8307); spatialOpConf.setTolerance(0.5); spatialOpConf.setGeodetic(true); catJob.setSpatialOperationConfig(spatialOpConf); //add the previous setup to the job configuration catJob.configure(conf); //run the job RunningJob rj = JobClient.runJob(conf);
The preceding example uses the Categorization
job driver. The configuration can be divided into the following categories:
-
Input data: A previously generated spatial index (received as the job input).
-
Output data: A folder that contains the summary counts for each hierarchy level.
-
Hierarchy data configuration: This contains the following:
- HierarchyInfo class: This is an implementation of the HierarchyInfo class in charge of describing the current hierarchy data. It provides the number of hierarchy levels, level names, and the data contained at each level.
Hierarchy data paths: This is the path to each one of the hierarchy catalogs. These catalogs are read by the
HierarchyInfo
class. -
Hierarchy index path: This is the path where the hierarchy data index is stored. Hierarchy data needs to be preprocessed to know the parent-child relationships between hierarchy levels. This information is processed once and saved at the hierarchy index, so it can be used later by the current job or even by any other jobs.
-
-
Spatial operation configuration: This is the spatial operation to be performed between records of the user data and the hierarchy data in order to join both datasets. The parameters to set here are the Spatial Operation type (IsInside), SRID (8307), Tolerance (0.5 meters), and whether the geometries are Geodetic (true).
Internally, the Categorization.configure()
method sets the mapper and reducer to be SpatialHierarchicalCountMapper
and SpatialHierarchicalCountReducer
, respectively. SpatialHierarchicalCountMapper
's output key is a hierarchy entry identifier in the form hierarchy_level
+ hierarchy_entry_id
. The mapper output value is a single count for each output key. The reducer sums up all the counts for each key.
Note:
The entire hierarchy data may be read into memory and hence the total size of all the catalogs is expected to be significantly less than the user data. The hierarchy data size should not be larger than a couple of gigabytes.
If you want another type of output instead of counts (for example, a list of user records grouped by hierarchy entry), the SpatialHierarchicalJoinMapper can be used. The SpatialHierarchicalJoinMapper output value is a RecordInfo instance, which can be gathered in a user-defined reducer to produce a different output. The following user-defined reducer generates a MapFile for each hierarchy level using the MultipleOutputs class. Each MapFile has the hierarchy entry ids as keys and ArrayWritable instances containing the matching records for each hierarchy entry as values. The following is a user-defined reducer that returns a list of records by hierarchy entry:
public class HierarchyJoinReducer extends MapReduceBase implements Reducer<Text, RecordInfo, Text, ArrayWritable> { private MultipleOutputs mos = null; private Text outKey = new Text(); private ArrayWritable outValue = new ArrayWritable( RecordInfo.class ); @Override public void configure(JobConf conf) { super.configure(conf); //use MultipleOutputs to generate different outputs for each hierarchy level mos = new MultipleOutputs(conf); } @Override public void reduce(Text key, Iterator<RecordInfo> values, OutputCollector<Text, RecordInfoArrayWritable> output, Reporter reporter) throws IOException { //Get the hierarchy level name and the hierarchy entry id from the key String[] keyComponents = HierarchyHelper.getMapRedOutputKeyComponents(key.toString()); String hierarchyLevelName = keyComponents[0]; String entryId = keyComponents[1]; List<Writable> records = new LinkedList<Writable>(); //load the values to memory to fill output ArrayWritable while(values.hasNext()) { RecordInfo recordInfo = new RecordInfo( values.next() ); records.add( recordInfo ); } if(!records.isEmpty()) { //set the hierarchy entry id as key outKey.set(entryId); //list of records matching the hierarchy entry id outValue.set( records.toArray(new Writable[]{} ) ); //get the named output for the given hierarchy level hierarchyLevelName = FileUtils.toValidMONamedOutput(hierarchyLevelName); OutputCollector<Text, ArrayWritable> mout = mos.getCollector(hierarchyLevelName, reporter); //Emit key and value mout.collect(outKey, outValue); } } @Override public void close() throws IOException { mos.close(); } }
The same reducer can be used in a job with the following configuration to generate a list of records according to the hierarchy levels:
JobConf conf = new JobConf(getConf()); //input path FileInputFormat.setInputPaths(conf, new Path("/user/data_spatial_index/") ); //output path FileOutputFormat.setOutputPath(conf, new Path("/user/records_per_hier_level/") ); //input format used to read the spatial index conf.setInputFormat( SequenceFileInputFormat.class); //output format: the real output format will be configured for each multiple output later conf.setOutputFormat(NullOutputFormat.class); //mapper conf.setMapperClass( SpatialHierarchicalJoinMapper.class ); conf.setMapOutputKeyClass(Text.class); conf.setMapOutputValueClass(RecordInfo.class); //reducer conf.setReducerClass( HierarchyJoinReducer.class ); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(ArrayWritable.class); //////////////////////////////////////////////////////////////////// //hierarchy data setup //set HierarchyInfo class implementation conf.setClass(ConfigParams.HIERARCHY_INFO_CLASS, WorldAdminHierarchyInfo.class, HierarchyInfo.class); //paths to hierarchical catalogs Path[] hierarchyDataPaths = { new Path("file:///home/user/catalogs/world_continents.json"), new Path("file:///home/user/catalogs/world_countries.json"), new Path("file:///home/user/catalogs/world_states_provinces.json")}; //path to hierarchy index Path hierarchyDataIndexPath = new Path("/user/hierarchy_data_index/"); //instantiate the HierarchyInfo class to index the data if needed. HierarchyInfo hierarchyInfo = new WorldAdminHierarchyInfo(); hierarchyInfo.initialize(conf); //Create the hierarchy index if needed. If it already exists, it will only load the hierarchy index to the distributed cache HierarchyHelper.setupHierarchyDataIndex(hierarchyDataPaths, hierarchyDataIndexPath, hierarchyInfo, conf); /////////////////////////////////////////////////////////////////// //setup the multiple named outputs: int levels = hierarchyInfo.getNumberOfLevels(); for(int i=1; i<=levels; i++) { String levelName = hierarchyInfo.getLevelName(i); //the hierarchy level name is used as the named output String namedOutput = FileUtils.toValidMONamedOutput(levelName); MultipleOutputs.addNamedOutput(conf, namedOutput, MapFileOutputFormat.class, Text.class, ArrayWritable.class); } //finally, setup the spatial operation SpatialOperationConfig spatialOpConf = new SpatialOperationConfig(); spatialOpConf.setOperation(SpatialOperation.IsInside); spatialOpConf.setSrid(8307); spatialOpConf.setTolerance(0.5); spatialOpConf.setGeodetic(true); spatialOpConf.store(conf); //run job JobClient.runJob(conf);
If the output value should be an array of record ids instead of an array of RecordInfo instances, only a couple of changes are needed in the previously defined reducer.
The line where outValue is declared, in the previous example, changes to:
private ArrayWritable outValue = new ArrayWritable(Text.class);
The loop where the input values are retrieved, in the previous example, is changed so that only the record ids are gathered instead of the whole records:
while(values.hasNext())
{
     records.add( new Text(values.next().getId()) );
}
Although only the record id is needed, the mapper emits the whole RecordInfo instance. A better approach is to change the mapper's output value, which can be done by extending AbstractSpatialJoinMapper. In the following example, the mapper emits only the record ids instead of the whole RecordInfo instance every time a record matches some of the hierarchy entries:
public class IdSpatialHierarchicalMapper extends AbstractSpatialHierarchicalMapper<Text> {

     Text outValue = new Text();

     @Override
     protected Text getOutValue(RecordInfo matchingRecordInfo) {
           //the out value is the record's id
           outValue.set(matchingRecordInfo.getId());
           return outValue;
     }
}
- Changing the Hierarchy Level Range
- Controlling the Search Hierarchy
- Using MVSuggest to Classify the Data
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.5.1 Changing the Hierarchy Level Range
By default, all the hierarchy levels defined in the HierarchyInfo
implementation are loaded when performing the hierarchy search. The range of hierarchy levels loaded is from level 1 (parent level) to the level returned by HierarchyInfo.getNumberOfLevels()
method. The following example shows how to set up a job to load only levels 2 and 3.
conf.setInt(ConfigParams.HIERARCHY_LOAD_MIN_LEVEL, 2);
conf.setInt(ConfigParams.HIERARCHY_LOAD_MAX_LEVEL, 3);
Note:
These parameters are useful when only a subset of the hierarchy levels is required and when you do not want to modify the HierarchyInfo
implementation.
Parent topic: Classifying Data Hierarchically
2.10.5.2 Controlling the Search Hierarchy
The search is always performed only at the bottom hierarchy level (the highest level number). If a user record matches some hierarchy entry at this level, then the match is propagated to the parent entry in upper levels. For example, if a user record matches Los Angeles, then it also matches California, USA, and North America. If there are no matches for a user record at the bottom level, then the search does not continue into the upper levels.
This behavior can be modified by setting the configuration parameter ConfigParams.HIERARCHY_SEARCH_MULTIPLE_LEVELS to true, as shown in the sketch after this paragraph. In that case, if a search at the bottom hierarchy level leaves some user records unmatched, the search continues into the upper levels until the top hierarchy level is reached or there are no more user records to join. This behavior can be used when the geometries of parent levels do not perfectly enclose the geometries of their child entries.
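A one-line sketch of enabling this, assuming the parameter is read as a boolean from the job configuration like the other ConfigParams settings shown earlier:

//Assumption: the multiple-levels search flag is a boolean job configuration property.
conf.setBoolean(ConfigParams.HIERARCHY_SEARCH_MULTIPLE_LEVELS, true);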
Parent topic: Classifying Data Hierarchically
2.10.5.3 Using MVSuggest to Classify the Data
MVSuggest
can be used instead of the spatial index to classify data. In this case, an implementation of LocalizableRecordInfoProvider must be provided so that the text to be sent to MVSuggest for the search can be extracted from each record. See the information about LocalizableRecordInfoProvider.
In the following example, the program option is changed from spatial to MVS. The input is the path to the user data instead of the spatial index. The InputFormat
used to read the user record and an implementation of LocalizableRecordInfoProvider
are specified. The MVSuggest
service configuration is set. Notice that there is no spatial operation configuration needed in this case.
Categorization<LongWritable, Text> hierCount = new Categorization<LongWritable, Text>(); // the input path is the user's data hierCount.setInput("/user/data/"); // set the job's output hierCount.setOutput("/user/mvs_hierarchy_count"); // set HierarchyInfo implementation which describes the world // administrative boundaries hierarchy hierCount.setHierarchyInfoClass(WorldDynaAdminHierarchyInfo.class); // specify the paths of the hierarchy data Path[] hierarchyDataPaths = { new Path("file:///home/user/catalogs/world_continents.json"), new Path("file:///home/user/catalogs/world_countries.json"), new Path("file:///home/user/catalogs/world_states_provinces.json") }; hierCount.setHierarchyDataPaths(hierarchyDataPaths); // set the path where the index for the previous hierarchy data will be // generated hierCount.setHierarchyIndexPath(new Path("/user/hierarchy_data_index/")); // No spatial operation configuration is needed, Instead, specify the // InputFormat used to read the user's data and the // LocalizableRecordInfoProvider class. hierCount.setInputFormatClass(TextInputFormat.class); hierCount.setRecordInfoProviderClass(MyLocalizableRecordInfoProvider.class); // finally, set the MVSuggest configuration LocalMVSConfig lmvsConf = new LocalMVSConfig(); lmvsConf.setServiceLocation("file:///home/user/mvs_dir/oraclemaps_pub"); lmvsConf.setRepositoryLocation(true); hierCount.setMvsConfig(lmvsConf); // add the previous setup to the job configuration hierCount.configure(conf); // run the job JobClient.runJob(conf);
Note:
When using MVSuggest
, the hierarchy data files must be the same as the layer template files used by MVSuggest
. The hierarchy level names returned by the HierarchyInfo.getLevelNames()
method are used as the matching layers by MVSuggest
.
Parent topic: Classifying Data Hierarchically
2.10.6 Generating Buffers
The API provides a mapper to generate a buffer around each record's geometry. The following code sample shows how to run a job to generate a buffer for each record geometry by using the BufferMapper
class.
//configure input
conf.setInputFormat(FileSplitInputFormat.class);
FileSplitInputFormat.setInputPaths(conf, "/user/waterlines/");
FileSplitInputFormat.setRecordInfoProviderClass(conf, GeoJsonRecordInfoProvider.class);
//configure output
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(conf, new Path("/user/data_buffer/"));
//set the BufferMapper as the job mapper
conf.setMapperClass(BufferMapper.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(RecordInfo.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(RecordInfo.class);
//set the width of the buffers to be generated
conf.setDouble(ConfigParams.BUFFER_WIDTH, 0.2);
//run the job
JobClient.runJob(conf);
BufferMapper
generates a buffer for each input record containing a geometry. The output key and values are the record id and a RecordInfo
instance containing the generated buffer. The resulting file is a Hadoop MapFile
containing the mapper output key and values. If necessary, the output format can be modified by implementing a reducer that takes the mapper’s output keys and values, and outputs keys and values of a different type.
BufferMapper
accepts the following parameters:
Parameter | ConfigParam constant | Type | Description |
---|---|---|---|
oracle.spatial.buffer.width | BUFFER_WIDTH | double | The buffer width |
oracle.spatial.buffer.sma | BUFFER_SMA | double | The semi major axis for the datum used in the coordinate system of the input |
oracle.spatial.buffer.iFlat | BUFFER_IFLAT | double | The flattening value |
oracle.spatial.buffer.arcT | BUFFER_ARCT | double | The arc tolerance used for geodetic densification |
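For example, a geodetic buffer job might set the datum-related parameters in addition to the width. The values below are illustrative WGS84 figures and assume that the flattening parameter expects the inverse flattening; they are not taken from the guide, so verify them against your datum:

//Sketch only: datum parameters for a geodetic buffer (values are assumptions).
conf.setDouble(ConfigParams.BUFFER_SMA, 6378137.0);        //semi-major axis in meters (WGS84)
conf.setDouble(ConfigParams.BUFFER_IFLAT, 298.257223563);  //assumption: inverse flattening
conf.setDouble(ConfigParams.BUFFER_ARCT, 0.05);            //arc tolerance for geodetic densification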
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.7 Spatial Binning
The Vector API provides the class oracle.spatial.hadoop.vector.mapred.job.Binning
to perform spatial binning over a spatial data set. The Binning
class is a MapReduce job driver that takes an input data set (which can be spatially indexed or not), assigns each record to a bin, and generates a file containing all the bins (which contain one or more records and optionally aggregated values).
A binning job can be configured as follows:
-
Specify the data set to be binned and the way it will be read and interpreted (
InputFormat
andRecordInfoProvider
), or, specify the name of an existing spatial index. -
Set the output path.
-
Set the grid MBR, that is, the rectangular area to be binned.
-
Set the shape of the bins:
RECTANGLE
orHEXAGON
. -
Specify the bin (cell) size. For rectangles, specify the width and height. For hexagon-shaped cells, specify the hexagon width. Each hexagon is always drawn with only one of its vertices as the base.
-
Optionally, pass a list of numeric field names to be aggregated per bin.
The resulting output is a text file where each record is a bin (cell) in JSON format and contains the following information:
-
id: the bin id
-
geom: the bin geometry; always a polygon that is a rectangle or a hexagon
-
count: the number of points contained in the bin
-
aggregated fields: zero or more aggregated fields
The following example configures and runs a binning job:
//create job driver
Binning<LongWritable, Text> binJob = new Binning<LongWritable, Text>();
//setup input
binJob.setInput("/user/hdfs/input/part*");
binJob.setInputFormatClass(GeoJsonInputFormat.class);
binJob.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
//set binning output
binJob.setOutput("/user/hdfs/output/binning");
//create a binning configuration to produce rectangular cells
BinningConfig binConf = new BinningConfig();
binConf.setShape(BinShape.RECTANGLE);
//set the bin size
binConf.setCellHeight(0.2);
binConf.setCellWidth(0.2);
//specify the area to be binned
binConf.setGridMbr(new double[]{-50,10,50,40});
binJob.setBinConf(binConf);
//save configuration
binJob.configure(conf);
//run job
JobClient.runJob(conf);
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.8 Spatial Clustering
The job driver class oracle.spatial.hadoop.mapred.KMeansClustering
can be used to find spatial clusters in a data set. This class uses a distributed version of the K-means algorithm.
Required parameters:
-
Path to the input data set, the
InputFormat
class used to read the input data set and theRecordInfoProvider
used to extract the spatial information from records. -
Path where the results will be stored.
-
Number of clusters to be found.
Optional parameters:
-
Maximum number of iterations before the algorithm finishes.
-
Criterion function used to determine when the clusters converge. It is given as an implementation of
oracle.spatial.hadoop.vector.cluster.kmeans.CriterionFunction
. The Vector API contains the following criterion function implementations:SquaredErrorCriterionFunction
andEuclideanDistanceCriterionFunction
. -
An implementation of
oracle.spatial.hadoop.vector.cluster.kmeans.ClusterShapeGenerator
, which is used to generate a geometry for each cluster. The default implementation isConvexHullClusterShapeGenerator
and generates a convex hull for each cluster. If no cluster geometry is needed, theDummyClusterShapeGenerator
class can be used. -
The initial k cluster points as a sequence of x,y ordinates. For example: x1,y1,x2,y2,…xk,yk
The result is a file named clusters.json
, which contains an array of clusters called features. Each cluster contains the following information:
-
id: Cluster id
-
memberCount: Number of elements in the cluster
-
geom: Cluster geometry
The following example runs the KMeansClustering
algorithm to find 5 clusters. By default, the SquaredErrorCriterionFunction and ConvexHullClusterShapeGenerator are used, so you do not need to set these classes explicitly. Also note that runIterations() is called to run the algorithm; internally, it launches one MapReduce job per iteration. In this example, the number 20 is passed to runIterations() as the maximum number of iterations allowed.
//create the cluster job driver
KMeansClustering<LongWritable, Text> clusterJob = new KMeansClustering<LongWritable, Text>();
//set input properties:
//input dataset path
clusterJob.setInput("/user/hdfs/input/part*");
//InputFormat class
clusterJob.setInputFormatClass(GeoJsonInputFormat.class);
//RecordInfoProvider implementation
clusterJob.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
//specify where the results will be saved
clusterJob.setOutput("/user/hdfs/output/clusters");
//5 clusters will be found
clusterJob.setK(5);
//run the algorithm
success = clusterJob.runIterations(20, conf);
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.9 Spatial Join
The spatial join feature allows detecting spatial interactions between records of two different large data sets.
The driver class oracle.spatial.hadoop.vector.mapred.job.SpatialJoin
can be used to execute or configure a job to perform a spatial join between two data sets. The job driver takes the following inputs:
-
Input data sets: Two input data sets are expected. Each input data set is represented using the class
oracle.spatial.hadoop.vector.InputDataSet
, which holds information about where to find and how to read a data set, such as path(s), spatial index, input format, and record info provider used to interpret records from the data set. It also accepts a spatial configuration for the data set. -
Spatial operation configuration: The spatial operation configuration defines the spatial interaction used to determine if two records are related to each other. It also defines the area to cover (MBR), that is, only records within or intersecting the MBR will be considered in the search.
-
Partitioning result file path: An optional parameter that points to a previously generated partitioning result for both data sets. Data need to be partitioned in order to distribute the work; if this parameter is not provided, a partitioning process will be executed over the input data sets. (See Spatial Partitioning for more information.)
-
Output path: The path where the result file will be written.
The spatial join result is a text file where each line is a pair of records that meet the spatial interaction defined in the spatial operation configuration.
The following table shows the currently supported spatial interactions for the spatial join.
Spatial Operation | Extra Parameters | Type |
---|---|---|
AnyInteract | None | (N/A) |
IsInside | None | (N/A) |
WithinDistance | oracle.spatial.hadoop.vector.util.SpatialOperationConfig.PARAM_WD_DISTANCE | double |
For a WithinDistance
operation, the distance parameter can be specified in the SpatialOperationConfig
, as shown in the following example:
spatialOpConf.setOperation(SpatialOperation.WithinDistance);
spatialOpConf.addParam(SpatialOperationConfig.PARAM_WD_DISTANCE, 5.0);
The following example runs a Spatial Join job for two input data sets. The first data set, postal boundaries, is specified providing the name of its spatial index. For the second data set, tweets, the path to the file, input format, and record info provider are specified. The spatial interaction to detect is IsInside
, so only tweets (points) that are inside a postal boundary (polygon) will appear in the result along with their containing postal boundary.
SpatialJoin spatialJoin = new SpatialJoin();
List<InputDataSet> inputDataSets = new ArrayList<InputDataSet>(2);
// set the spatial index of the 3-digit postal boundaries of the USA as the first input data set
InputDataSet pbInputDataSet = new InputDataSet();
pbInputDataSet.setIndexName("usa_pcb3_index");
//no input format or record info provider are required here as a spatial index is provided
inputDataSets.add(pbInputDataSet);
// set the tweets data set in GeoJSON format as the second data set
InputDataSet tweetsDataSet = new InputDataSet();
tweetsDataSet.setPaths(new Path[]{new Path("/user/example/tweets.json")});
tweetsDataSet.setInputFormatClass(GeoJsonInputFormat.class);
tweetsDataSet.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
inputDataSets.add(tweetsDataSet);
//set input data sets
spatialJoin.setInputDataSets(inputDataSets);
//spatial operation configuration
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setBoundaries(new double[]{47.70, -124.28, 35.45, -95.12});
spatialOpConf.setSrid(8307);
spatialOpConf.setTolerance(0.5);
spatialOpConf.setGeodetic(true);
spatialJoin.setSpatialOperationConfig(spatialOpConf);
//set output path
spatialJoin.setOutput("/user/example/spatialjoin");
// prepare job
JobConf jobConf = new JobConf(getConf());
//preprocess will partition both data sets as no partitioning result file was specified
spatialJoin.preprocess(jobConf);
spatialJoin.configure(jobConf);
JobClient.runJob(jobConf);
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.10 Spatial Partitioning
The partitioning feature is used to spatially partition one or more data sets.
Spatial partitioning consists of dividing the space into multiple rectangles, where each rectangle is intended to contain approximately the same number of points. Eventually these partitions can be used to distribute the work among reducers in other jobs, such as Spatial Join.
The spatial partitioning process is run or configured using the oracle.spatial.hadoop.mapred.job.Partitioning
driver class, which accepts the following input parameters:
-
Input data sets: One or more input data sets can be specified. Each input data set is represented using the class
oracle.spatial.hadoop.vector.InputDataSet
, which holds information about where to find and how to read a data set, such as path(s), spatial index, input format, and record info provider used to interpret records from the data set. It also accepts a spatial configuration for the data set. -
Sampling ratio: Only a fraction of the entire data set or sets is used to perform the partitioning. The sample ratio is the ratio of the sample size to the whole input data set size. If it is not specified, 10 percent (0.1) of the input data set size is used.
-
Spatial configuration: Defines the spatial properties of the input data sets, such as the SRID. You must specify at least the dimensional boundaries.
-
Output path: The path where the result file will be written.
The generated partitioning result file is in GeoJSON format and contains information for each generated partition, including the partition’s geometry and the number of points contained (from the sample).
The following example partitions a tweets data set. Because the sampling ratio is not provided, 0.1 is used by default.
Partitioning partitioning = new Partitioning();
List<InputDataSet> inputDataSets = new ArrayList<InputDataSet>(1);
//define the input data set
InputDataSet dataSet = new InputDataSet();
dataSet.setPaths(new Path[]{new Path("/user/example/tweets.json")});
dataSet.setInputFormatClass(GeoJsonInputFormat.class);
dataSet.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
inputDataSets.add(dataSet);
partitioning.setInputDataSets(inputDataSets);
//spatial configuration
SpatialConfig spatialConf = new SpatialConfig();
spatialConf.setSrid(8307);
spatialConf.setBoundaries(new double[]{-180,-90,180,90});
partitioning.setSpatialConfig(spatialConf);
//set output
partitioning.setOutput("/user/example/tweets_partitions.json");
//run the partitioning process
partitioning.runFullPartitioningProcess(new JobConf());
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.11 RecordInfoProvider
A record read by a MapReduce job from HDFS is represented in memory as a key-value pair using a Java type (typically a Writable subclass), such as LongWritable, Text, ArrayWritable, or some user-defined type. For example, records read using TextInputFormat are represented in memory as LongWritable, Text key-value pairs.
RecordInfoProvider
is the component that interprets these memory record representations and returns the data needed by the Vector Analysis API. Thus, the API is not tied to any specific format or memory representation.
The RecordInfoProvider
interface has the following methods:
-
void setCurrentRecord(K key, V value)
-
String getId()
-
JGeometry getGeometry()
-
boolean getExtraFields(Map<String, String> extraFields)
There is always a RecordInfoProvider
instance per InputFormat
. The method setCurrentRecord()
is called passing the current key-value pair retrieved from the RecordReader
. The RecordInfoProvider
is then used to get the current record id, geometry, and extra fields. None of these fields is required. Only records with a geometry participate in the spatial operations. The Id is useful for differentiating records in operations such as categorization. The extra fields can be used to store any record information that can be represented as text and that needs to be accessed quickly without reading the original record, or for operations where MVSuggest
is used.
Typically, the information returned by RecordInfoProvider is used to populate RecordInfo instances. A RecordInfo can be thought of as a lightweight version of a record; it contains the information returned by the RecordInfoProvider plus the information needed to locate the original record in a file.
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.11.1 Sample RecordInfoProvider Implementation
This sample implementation, called JsonRecordInfoProvider
, takes text records in JSON format, which are read using TextInputFormat
. A sample record is shown here:
{ "_id":"ABCD1234", "location":" 119.31669, -31.21615", "locationText":"Boston, Ma", "date":"03-18-2015", "time":"18:05", "device-type":"cellphone", "device-name":"iPhone"}
When a JsonRecordInfoProvider is instantiated, a JSON ObjectMapper is created. The ObjectMapper is used to parse record values later when setCurrentRecord() is called. The record key is ignored. The record id, geometry, and one extra field are retrieved from the _id, location, and locationText JSON properties. The geometry is represented as a latitude-longitude pair and is used to create a point geometry using the JGeometry.createPoint() method. The extra field (locationText) is added to the extraFields map, which serves as an out parameter, and true is returned to indicate that an extra field was added.
public class JsonRecordInfoProvider implements RecordInfoProvider<LongWritable, Text> { private Text value = null; private ObjectMapper jsonMapper = null; private JsonNode recordNode = null; public JsonRecordInfoProvider(){ //json mapper used to parse all the records jsonMapper = new ObjectMapper(); } @Override public void setCurrentRecord(LongWritable key, Text value) throws Exception { try{ //parse the current value recordNode = jsonMapper.readTree(value.toString()); }catch(Exception ex){ recordNode = null; throw ex; } } @Override public String getId() { String id = null; if(recordNode != null ){ id = recordNode.get("_id").getTextValue(); } return id; } @Override public JGeometry getGeometry() { JGeometry geom = null; if(recordNode!= null){ //location is represented as a lat,lon pair String location = recordNode.get("location").getTextValue(); String[] locTokens = location.split(","); double lat = Double.parseDouble(locTokens[0]); double lon = Double.parseDouble(locTokens[1]); geom = JGeometry.createPoint( new double[]{lon, lat}, 2, 8307); } return geom; } @Override public boolean getExtraFields(Map<String, String> extraFields) { boolean extraFieldsExist = false; if(recordNode != null) { extraFields.put("locationText", recordNode.get("locationText").getTextValue() ); extraFieldsExist = true; } return extraFieldsExist; } }
Parent topic: RecordInfoProvider
2.10.11.2 LocalizableRecordInfoProvider
This interface extends RecordInfoProvider
and is used to know the extra fields that can be used as the search text, when MVSuggest
is used.
The only method added by this interface is getLocationServiceField()
, which returns the name of the extra field that will be sent to MVSuggest.
In addition, the following is an implementation based on "Sample RecordInfoProvider Implementation." The name returned in this example is locationText
, which is the name of the extra field included in the parent class.
public class LocalizableJsonRecordInfoProvider extends JsonRecordInfoProvider implements LocalizableRecordInfoProvider<LongWritable, Text> {

     @Override
     public String getLocationServiceField() {
           return "locationText";
     }
}
An alternative to LocalizableRecordInfoProvider is to set the configuration property oracle.spatial.recordInfo.locationField with the name of the search field, whose value should be sent to MVSuggest. Example: configuration.set(LocalizableRecordInfoProvider.CONF_RECORD_INFO_LOCATION_FIELD, "locationField")
Parent topic: RecordInfoProvider
2.10.12 HierarchyInfo
The HierarchyInfo
interface is used to describe a hierarchical dataset. An implementation of HierarchyInfo is expected to provide the number, names, and entries of the hierarchy levels of the hierarchy it describes.
The root hierarchy level is always hierarchy level 1. The entries in this level do not have parent entries, and this level is referred to as the top hierarchy level. Child hierarchy levels have higher level values. For example, the levels for the hierarchy composed of continents, countries, and states are 1, 2, and 3 respectively. Entries in the continent layer do not have a parent, but have child entries in the countries layer. Entries at the bottom level, the states layer, do not have children.
A HierarchyInfo implementation is provided out of the box with the Vector Analysis API. The DynaAdminHierarchyInfo
implementation can be used to read and describe the known hierarchy layers in GeoJSON format. A DynaAdminHierarchyInfo can be instantiated and configured or can be subclassed. The hierarchy layers to be contained are specified by calling the addLevel()
method, which takes the following parameters:
- The hierarchy level number
- The hierarchy level name, which must match the file name (without extension) of the GeoJSON file that contains the data. For example, the hierarchy level name for the file world_continents.json must be world_continents, for world_countries.json it is world_countries, and so on.
- Children join field: This is a JSON property that is used to join entries of the current level with child entries in the lower level. If null is passed, the entry id is used.
- Parent join field: This is a JSON property used to join entries of the current level with parent entries in the upper level. This value is not used for the topmost level, which has no upper level to join. If the value is set to null for any other level greater than 1, an IsInside spatial operation is performed to join parent and child entries. In this scenario, it is assumed that an upper-level geometry entry can contain lower-level entries.
For example, let us assume a hierarchy containing the following levels from the specified layers: 1- world_continents, 2 - world_countries and 3 - world_states_provinces. A sample entry from each layer would look like the following:
world_continents: {"type":"Feature","_id":"NA","geometry": {"type":"MultiPolygon", "coordinates":[ x,y,x,y,x,y] }"properties":{"NAME":"NORTH AMERICA", "CONTINENT_LONG_LABEL":"North America"},"label_box":[-118.07998,32.21006,-86.58515,44.71352]} world_countries: {"type":"Feature","_id":"iso_CAN","geometry":{"type":"MultiPolygon","coordinates":[x,y,x,y,x,y]},"properties":{"NAME":"CANADA","CONTINENT":"NA","ALT_REGION":"NA","COUNTRY CODE":"CAN"},"label_box":[-124.28092,49.90408,-94.44878,66.89287]} world_states_provinces: {"type":"Feature","_id":"6093943","geometry": {"type":"Polygon", "coordinates":[ x,y,x,y,x,y]},"properties":{"COUNTRY":"Canada", "ISO":"CAN", "STATE_NAME":"Ontario"},"label_box":[-91.84903,49.39557,-82.32462,54.98426]}
A DynaAdminHierarchyInfo can be configured to create a hierarchy with the above layers in the following way:
DynaAdminHierarchyInfo dahi = new DynaAdminHierarchyInfo();

dahi.addLevel(1, "world_continents",
   null /*_id is used by default to join with child entries*/,
   null /*not needed as there are no upper hierarchy levels*/);

dahi.addLevel(2, "world_countries",
   "properties.COUNTRY CODE" /*field used to join with child entries*/,
   "properties.CONTINENT" /*the value "NA" will be used to find Canada's parent, which is North America and whose _id field value is also "NA"*/);

dahi.addLevel(3, "world_states_provinces",
   null /*not needed as no child entries are expected*/,
   "properties.ISO" /*field used to join with parent entries. For Ontario, it is the same value as the field properties.COUNTRY CODE specified for Canada*/);

//save the previous configuration to the job configuration
dahi.initialize(conf);
A similar configuration can be used to create hierarchies from different layers, such as countries, states and counties, or any other layers with a similar JSON format.
Alternatively, to avoid configuring a hierarchy every time a job is executed, the hierarchy configuration can be enclosed in a DynaAdminHierarchyInfo subclass as in the following example:
public class WorldDynaAdminHierarchyInfo extends DynaAdminHierarchyInfo {

   public WorldDynaAdminHierarchyInfo() {
      super();
      addLevel(1, "world_continents", null, null);
      addLevel(2, "world_countries", "properties.COUNTRY CODE", "properties.CONTINENT");
      addLevel(3, "world_states_provinces", null, "properties.ISO");
   }
}
2.10.12.1 Sample HierarchyInfo Implementation
The HierarchyInfo interface contains the following methods, which must be implemented to describe a hierarchy. The methods can be divided into the following three categories:
- Methods to describe the hierarchy
- Methods to load data
- Methods to supply data
Additionally, there is an initialize() method, which can be used to perform any initialization and to save and read data both to and from the job configuration.
void initialize(JobConf conf);

//methods to describe the hierarchy
String getLevelName(int level);
int getLevelNumber(String levelName);
int getNumberOfLevels();

//methods to load data
void load(Path[] hierDataPaths, int fromLevel, JobConf conf) throws Exception;
void loadFromIndex(HierarchyDataIndexReader[] readers, int fromLevel, JobConf conf) throws Exception;

//methods to supply data
Collection<String> getEntriesIds(int level);
JGeometry getEntryGeometry(int level, String entryId);
String getParentId(int childLevel, String childId);
The following is a sample HierarchyInfo implementation, which takes the previously mentioned world layers as the hierarchy levels. The first section contains the initialize method and the methods used to describe the hierarchy. In this case, the initialize method does nothing. The methods in the following example use the hierarchyLevelNames array to provide the hierarchy description. The instance variables entriesGeoms and entriesParents are arrays of java.util.Map, which contain the entry geometries and entry parents, respectively. The entry ids are used as keys in both cases. Since the array indices are zero-based and the hierarchy levels are one-based, the array indices correlate to the hierarchy levels as array index + 1 = hierarchy level.
public class WorldHierarchyInfo implements HierarchyInfo {

   private String[] hierarchyLevelNames = {"world_continents", "world_countries", "world_states_provinces"};

   private Map<String, JGeometry>[] entriesGeoms = new Map[3];
   private Map<String, String>[] entriesParents = new Map[3];

   @Override
   public void initialize(JobConf conf) {
      //do nothing for this implementation
   }

   @Override
   public int getNumberOfLevels() {
      return hierarchyLevelNames.length;
   }

   @Override
   public String getLevelName(int level) {
      String levelName = null;
      if(level >= 1 && level <= hierarchyLevelNames.length) {
         levelName = hierarchyLevelNames[ level - 1 ];
      }
      return levelName;
   }

   @Override
   public int getLevelNumber(String levelName) {
      for(int i=0; i< hierarchyLevelNames.length; i++ ) {
         if(hierarchyLevelNames[i].equals( levelName) ) return i+1;
      }
      return -1;
   }
The following example contains the methods that load the data for the different hierarchy levels. The load() method reads the data from the source files world_continents.json, world_countries.json, and world_states_provinces.json. For the sake of simplicity, the internally called loadLevel() method is not shown, but it is assumed to parse and read the JSON files.
The loadFromIndex() method only takes the information provided by the HierarchyIndexReader instances passed as parameters. The load() method is expected to be executed only once per job, and only when a hierarchy index has not been created. Once the data is loaded, it is automatically indexed, and the loadFromIndex() method is called every time the hierarchy data is loaded into memory.
@Override
public void load(Path[] hierDataPaths, int fromLevel, JobConf conf) throws Exception {
   int toLevel = fromLevel + hierDataPaths.length - 1;
   int levels = getNumberOfLevels();

   for(int i=0, level=fromLevel; i<hierDataPaths.length && level<=levels; i++, level++) {
      //load current level from the current path
      loadLevel(level, hierDataPaths[i]);
   }
}

@Override
public void loadFromIndex(HierarchyDataIndexReader[] readers, int fromLevel, JobConf conf) throws Exception {
   Text parentId = new Text();
   RecordInfoArrayWritable records = new RecordInfoArrayWritable();
   int levels = getNumberOfLevels();

   //iterate through each reader to load each level's entries
   for(int i=0, level=fromLevel; i<readers.length && level<=levels; i++, level++) {
      entriesGeoms[ level - 1 ] = new Hashtable<String, JGeometry>();
      entriesParents[ level - 1 ] = new Hashtable<String, String>();

      //each entry is a parent record id (key) and a list of entries as RecordInfo (value)
      while(readers[i].nextParentRecords(parentId, records)) {
         String pId = null;

         //entries with no parent will have the parent id UNDEFINED_PARENT_ID. Such is the case of the first level entries
         if( ! UNDEFINED_PARENT_ID.equals( parentId.toString() ) ) {
            pId = parentId.toString();
         }

         //add the current level's entries
         for(Object obj : records.get()) {
            RecordInfo entry = (RecordInfo) obj;
            entriesGeoms[ level - 1 ].put(entry.getId(), entry.getGeometry());
            if(pId != null) {
               entriesParents[ level - 1 ].put(entry.getId(), pId);
            }
         }//finish loading current parent entries
      }//finish reading single hierarchy level index
   }//finish iterating index readers
}
Finally, the following code listing contains the methods used to provide information about individual entries in each hierarchy level. The information provided is the ids of all the entries contained in a hierarchy level, the geometry of each entry, and the parent of each entry.
@Override public Collection<String> getEntriesIds(int level) { Collection<String> ids = null; if(level >= 1 && level <= getNumberOfLevels() && entriesGeoms[ level - 1 ] != null) { //returns the ids of all the entries from the given level ids = entriesGeoms[ level - 1 ].keySet(); } return ids; } @Override public JGeometry getEntryGeometry(int level, String entryId) { JGeometry geom = null; if(level >= 1 && level <= getNumberOfLevels() && entriesGeoms[ level - 1 ] != null) { //returns the geometry of the entry with the given id and level geom = entriesGeoms[ level - 1 ].get(entryId); } return geom; } @Override public String getParentId(int childLevel, String childId) { String parentId = null; if(childLevel >= 1 && childLevel <= getNumberOfLevels() && entriesGeoms[ childLevel - 1 ] != null) { //returns the parent id of the entry with the given id and level parentId = entriesParents[ childLevel - 1 ].get(childId); } return parentId; } }//end of class
Parent topic: HierarchyInfo
2.10.13 Using JGeometry in MapReduce Jobs
The Spatial Hadoop Vector Analysis only contains a small subset of the functionality provided by the Spatial Java API, which can also be used in the MapReduce jobs. This section provides some simple examples of how JGeometry can be used in Hadoop for spatial processing. The following example contains a simple mapper that performs the IsInside
test between a dataset and a query geometry using the JGeometry class.
In this example, the query geometry ordinates, srid, geodetic value and tolerance used in the spatial operation are retrieved from the job configuration in the configure method. The query geometry, which is a polygon, is preprocessed to quickly perform the IsInside
operation.
The map method is where the spatial operation is executed. Each input record value is tested against the query geometry, and the record id is emitted when the test succeeds.
public class IsInsideMapper extends MapReduceBase implements Mapper<LongWritable, Text, NullWritable, Text> { private JGeometry queryGeom = null; private int srid = 0; private double tolerance = 0.0; private boolean geodetic = false; private Text outputValue = new Text(); private double[] locationPoint = new double[2]; @Override public void configure(JobConf conf) { super.configure(conf); srid = conf.getInt("srid", 8307); tolerance = conf.getDouble("tolerance", 0.0); geodetic = conf.getBoolean("geodetic", true); //The ordinates are represented as a string of comma separated double values String[] ordsStr = conf.get("ordinates").split(","); double[] ordinates = new double[ordsStr.length]; for(int i=0; i<ordsStr.length; i++) { ordinates[i] = Double.parseDouble(ordsStr[i]); } //create the query geometry as two-dimensional polygon and the given srid queryGeom = JGeometry.createLinearPolygon(ordinates, 2, srid); //preprocess the query geometry to make the IsInside operation run faster try { queryGeom.preprocess(tolerance, geodetic, EnumSet.of(FastOp.ISINSIDE)); } catch (Exception e) { e.printStackTrace(); } } @Override public void map(LongWritable key, Text value, OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException { //the input value is a comma separated values text with the following columns: id, x-ordinate, y-ordinate String[] tokens = value.toString().split(","); //create a geometry representation of the record's location locationPoint[0] = Double.parseDouble(tokens[1]);//x ordinate locationPoint[1] = Double.parseDouble(tokens[2]);//y ordinate JGeometry location = JGeometry.createPoint(locationPoint, 2, srid); //perform spatial test try { if( location.isInside(queryGeom, tolerance, geodetic)){ //emit the record's id outputValue.set( tokens[0] ); output.collect(NullWritable.get(), outputValue); } } catch (Exception e) { e.printStackTrace(); } } }
A similar approach can be used to perform a spatial operation on the geometry itself, for example, by creating a buffer. The following example uses the same text value format and creates a buffer around each record location. The mapper output key and value are the record id and the generated buffer, which is represented as a JGeometryWritable. The JGeometryWritable is a Writable implementation contained in the Vector Analysis API that holds a JGeometry instance.
public class BufferMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, JGeometryWritable> { private int srid = 0; private double bufferWidth = 0.0; private Text outputKey = new Text(); private JGeometryWritable outputValue = new JGeometryWritable(); private double[] locationPoint = new double[2]; @Override public void configure(JobConf conf) { super.configure(conf); srid = conf.getInt("srid", 8307); //get the buffer width bufferWidth = conf.getDouble("bufferWidth", 0.0); } @Override public void map(LongWritable key, Text value, OutputCollector<Text, JGeometryWritable> output, Reporter reporter) throws IOException { //the input value is a comma separated record with the following columns: id, longitude, latitude String[] tokens = value.toString().split(","); //create a geometry representation of the record's location locationPoint[0] = Double.parseDouble(tokens[1]); locationPoint[1] = Double.parseDouble(tokens[2]); JGeometry location = JGeometry.createPoint(locationPoint, 2, srid); try { //create the location's buffer JGeometry buffer = location.buffer(bufferWidth); //emit the record's id and the generated buffer outputKey.set( tokens[0] ); outputValue.setGeometry( buffer ); output.collect(outputKey, outputValue); } catch (Exception e) { e.printStackTrace(); } } }
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.14 Support for Different Data Sources
In addition to file-based data sources (that is, a file or a set of files from a local or a distributed file system), other types of data sources can be used as the input data for a Vector API job.
Data sources are referenced as input data sets in the Vector API. All the input data sets implement the interface oracle.spatial.hadoop.vector.data.AbstractInputDataSet
. Input data set properties can be set directly for a Vector job using the methods setInputFormatClass()
, setRecordInfoProviderClass()
, and setSpatialConfig()
. More information can be set, depending on the type of input data set. For example, setInput() can specify the input string for a file data source, or setIndexName() can be used for a spatial index. The job determines the input data source type based on the properties that are set.
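For example, a file-based input could be configured directly on a Binning job as in the following minimal sketch (the GeoJSON input format and record info provider classes are the same ones used in later examples in this section, and the input string is only a placeholder):
//set the input data set properties directly on the job
Binning binningJob = new Binning();
binningJob.setInputFormatClass(GeoJsonInputFormat.class);
binningJob.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
binningJob.setInput("/user/myUser/geojson/*.json");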
Input data set information can also be set directly for a Vector API job using the job’s method setInputDataSet()
. With this method, the input data source information is encapsulated, you have more control, and it is easier to identify the type of data source that is being used.
The Vector API provides the following implementations of AbstractInputDataSet:
- SimpleInputDataSet: Contains the minimum information required by the Vector API for an input data set. Typically, this type of input data set should be used for non-file-based input data sets, such as Apache HBase, an Oracle database, or any other non-file-based data source.
- FileInputDataSet: Encapsulates file-based input data sets from local or distributed file systems. It provides properties for setting the input path as an array of Path instances or as a string that can be a regular expression for selecting paths.
- SpatialIndexInputDataSet: A subclass of FileInputDataSet optimized for working with spatial indexes generated by the Vector API. It is sufficient to specify the index name for this type of input data set.
- NoSQLInputDataSet: Specifies Oracle NoSQL data sources. It should be used in conjunction with the Vector NoSQL API. If the NoSQLKVInputFormat or TableInputFormat classes need to be used, use SimpleInputFormat instead.
- MultiInputDataSet: Input data set that encapsulates two or more input data sets.
Multiple Input Data Sets
Most of the Hadoop jobs provided by the Vector API (except Categorization) are able to manage more than one input data set by using the class oracle.spatial.hadoop.vector.data.MultiInputDataSet
.
To add more than one input data set to a job, follow these steps.
1. Create and configure two or more instances of AbstractInputDataSet subclasses.
2. Create an instance of oracle.spatial.hadoop.vector.data.MultiInputDataSet.
3. Add the input data sets created in step 1 to the MultiInputDataSet instance.
4. Set the MultiInputDataSet instance as the job's input data set.
The following code snippet shows how to set multiple input data sets for a Vector API job.
//file input data set
FileInputDataSet fileDataSet = new FileInputDataSet();
fileDataSet.setInputFormatClass(GeoJsonInputFormat.class);
fileDataSet.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
fileDataSet.setInputString("/user/myUser/geojson/*.json");
//spatial index input data set
SpatialIndexInputDataSet indexDataSet = new SpatialIndexInputDataSet();
indexDataSet.setIndexName("myIndex");
//create multi input data set
MultiInputDataSet multiDataSet = new MultiInputDataSet();
//add the previously defined input data sets
multiDataSet.addInputDataSet(fileDataSet);
multiDataSet.addInputDataSet(indexDataSet);
Binning binningJob = new Binning();
//set multiple input data sets to the job
binningJob.setInputDataSet(multiDataSet);
NoSQL Input Data Set
The Vector API provides classes to read data from Oracle NoSQL Database. The Vector NoSQL components let you group multiple key-value pairs into single records, which are passed to Hadoop mappers as RecordInfo
instances. They also let you map NoSQL entries (key and value) to Hadoop records fields (RecordInfo
’s id
, geometry
, and extra fields).
The NoSQL parameters are passed to a Vector job using the NoSQLInputDataSet
class. You only need to fill and set a NoSQLConfiguration
instance that contains the KV store, hosts, parent key, and additional information for the NoSQL data source. InputFormat
and RecordInfoProvider
classes do not need to be set because the default ones are used.
The following example shows how to configure a job to use NoSQL as a data source, using the Vector NoSQL classes.
//create NoSQL configuration
NoSQLConfiguration nsqlConf = new NoSQLConfiguration();
// set connection data
nsqlConf.setKvStoreName("mystore");
nsqlConf.setKvStoreHosts(new String[] { "myserver:5000" });
nsqlConf.setParentKey(Key.createKey("tweets"));
// set NoSQL entries to be included in the Hadoop records
// the entries with the following minor keys will be set as the
// RecordInfo's extra fields
nsqlConf.addTargetEntries(new String[] { "friendsCount", "followersCount" });
// add an entry processor to map the spatial entry to a RecordInfo's
// geometry
nsqlConf.addTargetEntry("geometry", NoSQLJGeometryEntryProcessor.class);
//create and set the NoSQL input data set
NoSQLInputDataSet nsqlDataSet = new NoSQLInputDataSet();
//set noSQL configuration
nsqlDataSet.setNoSQLConfig(nsqlConf);
//set spatial configuration
SpatialConfig spatialConf = new SpatialConfig();
spatialConf.setSrid(8307);
nsqlDataSet.setSpatialConfig(spatialConf);
Target entries refer to the NoSQL entries that will be part of the Hadoop records and are specified by the NoSQL minor keys. In the preceding example, the entries with the minor keys friendsCount
and followersCount
will be part of a Hadoop record. These NoSQL entries will be parsed as text values and assigned to the Hadoop RecordInfo
as the extra fields called friendsCount
and followersCount
. By default, the major key is used as the record id. The entries that contain “geometry” as the minor key are used to set the RecordInfo
’s geometry
field.
In the preceding example, the value type of the geometry NoSQL entries is JGeometry
, so it is necessary to specify a class to parse the value and assign it to the RecordInfo
’s geometry
field. This requires setting an implementation of the NoSQLEntryProcessor
interface. In this case, the NoSQLJGeometryEntryProcessor
class is used, and it reads the value from the NoSQL entry and sets that value to the current RecordInfo
’s geometry
field. You can provide your own implementation of NoSQLEntryProcessor
for parsing specific entry formats.
By default, NoSQL entries sharing the same major key are grouped into the same Hadoop record. This behavior can be changed by implementing the interface oracle.spatial.hadoop.nosql.NoSQLGrouper
and setting the NoSQLConfiguration
property entryGrouperClass
with the new grouper class.
The Oracle NoSQL library kvstore.jar
is required when running Vector API jobs that use NoSQL as the input data source.
Other Non-File-Based Data Sources
Other non-file-based data sources can be used with the Vector API, such as NoSQL (using the Oracle NoSQL classes) and Apache HBase. Although the Vector API does not provide specific classes to manage every type of data source, you can associate the specific data source with the job configuration and specify the following information to the Vector job:
- InputFormat: The InputFormat implementation used to read data from the data source.
- RecordInfoProvider: An implementation of RecordInfoProvider to extract required information such as id, spatial information, and extra fields from the key-value pairs returned by the current InputFormat.
- Spatial configuration: Describes the spatial properties of the input data, such as the SRID and the dimension boundaries.
The following example shows how to use Apache HBase data in a Vector job.
//create job
Job job = Job.getInstance(getConf());
job.setJobName(getClass().getName());
job.setJarByClass(getClass());
//Setup hbase parameters
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
scan.addColumn(Bytes.toBytes("location_data"), Bytes.toBytes("geometry"));
scan.addColumn(Bytes.toBytes("other_data"), Bytes.toBytes("followers_count"));
scan.addColumn(Bytes.toBytes("other_data"), Bytes.toBytes("user_id"));
//initialize job configuration with hbase parameters
TableMapReduceUtil.initTableMapperJob(
"tweets_table",
scan,
null,
null,
null,
job);
//create binning job
Binning<ImmutableBytesWritable, Result> binningJob = new Binning<ImmutableBytesWritable, Result>();
//setup the input data set
SimpleInputDataSet inputDataSet = new SimpleInputDataSet();
//use HBase's TableInputFormat
inputDataSet.setInputFormatClass(TableInputFormat.class);
//Set a RecordInfoProvider which can extract information from HBase TableInputFormat's returned key and values
inputDataSet.setRecordInfoProviderClass(HBaseRecordInfoProvider.class);
//set spatial configuration
SpatialConfig spatialConf = new SpatialConfig();
spatialConf.setSrid(8307);
inputDataSet.setSpatialConfig(spatialConf);
binningJob.setInputDataSet(inputDataSet);
//job output
binningJob.setOutput("hbase_example_output");
//binning configuration
BinningConfig binConf = new BinningConfig();
binConf.setGridMbr(new double[]{-180, -90, 180, 90});
binConf.setCellHeight(5);
binConf.setCellWidth(5);
binningJob.setBinConf(binConf);
//configure the job
binningJob.configure(job);
//run
boolean success = job.waitForCompletion(true);
The RecordInfoProvider
class set in the preceding example is a custom implementation called HBaseRecordInfoProvider
, the definition of which is as follows.
public class HBaseRecordInfoProvider implements RecordInfoProvider<ImmutableBytesWritable, Result>, Configurable{ private Result value = null; private Configuration conf = null; private int srid = 0; @Override public void setCurrentRecord(ImmutableBytesWritable key, Result value) throws Exception { this.value = value; } @Override public String getId() { byte[] idb = value.getValue(Bytes.toBytes("other_data"), Bytes.toBytes("user_id")); String id = idb != null ? Bytes.toString(idb) : null; return id; } @Override public JGeometry getGeometry() { byte[] geomb = value.getValue(Bytes.toBytes("location_data"), Bytes.toBytes("geometry")); String geomStr = geomb!=null ? Bytes.toString(geomb) : null; JGeometry geom = null; if(geomStr != null){ String[] pointsStr = geomStr.split(","); geom = JGeometry.createPoint(new double[]{Double.valueOf(pointsStr[0]), Double.valueOf(pointsStr[1])}, 2, srid); } return geom; } @Override public boolean getExtraFields(Map<String, String> extraFields) { byte[] fcb = value.getValue(Bytes.toBytes("other_data"), Bytes.toBytes("followers_count")); if(fcb!=null){ extraFields.put("followers_count", Bytes.toString(fcb)); } return fcb!=null; } @Override public Configuration getConf() { return conf; } @Override public void setConf(Configuration conf) { srid = conf.getInt(ConfigParams.SRID, 0); } }
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.15 Job Registry
Every time a Vector API job is launched using the command line interface or the web console, a registry file is created for that job. A job registry file contains the following information about the job:
- Job name
- Job ID
- User that executed the job
- Start and finish time
- Parameters used to run the job
- Jobs launched by the first job (called child jobs). Child jobs contain the same fields as the parent job.
A job registry file preserves the parameters used to run the job, which can be used as an aid for running an identical job even when it was not initially run using the command line interface.
By default, job registry files are created under the HDFS path relative to the user folder oracle_spatial/job_registry
(for example, /user/hdfs/oracle_spatial/job_registry
for the hdfs user).
Job registry files can be removed directly using HDFS commands or using the following utility methods from class oracle.spatial.hadoop.commons.logging.registry.RegistryManager
:
- public static int removeJobRegistry(long beforeDate, Configuration conf): Removes all the job registry files that were created before the specified time stamp from the default job registry folder.
- public static int removeJobRegistry(Path jobRegDirPath, long beforeDate, Configuration conf): Removes all the job registry files that were created before the specified time stamp from a specified job registry folder.
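For example, the following sketch removes registry files created more than 30 days ago from the default job registry folder (assuming the time stamp is expressed in milliseconds, as returned by System.currentTimeMillis()):
//remove job registry files created more than 30 days ago from the default folder
long thirtyDaysAgo = System.currentTimeMillis() - (30L * 24 * 60 * 60 * 1000);
int removedFiles = RegistryManager.removeJobRegistry(thirtyDaysAgo, new Configuration());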
Parent topic: Oracle Big Data Spatial Vector Analysis
2.10.16 Tuning Performance Data of Job Running Times Using the Vector Analysis API
The following table lists some running times for jobs built using the Vector Analysis API. The jobs were executed using a 4-node cluster. The times may vary depending on the characteristics of the cluster. The test dataset contains over one billion records, and its size is above 1 terabyte.
Table 2-4 Performance time for running jobs using Vector Analysis API
Job Type | Time taken (approximate value)
---|---
Spatial Indexing | 2 hours
Spatial Filter with Spatial Index | 1 hour
Spatial Filter without Spatial Index | 3 hours
Hierarchy count with Spatial Index | 5 minutes
Hierarchy count without Spatial Index | 3 hours
The time taken for the jobs can be decreased by increasing the maximum split size using any of the following configuration parameters.
mapred.max.split.size
mapreduce.input.fileinputformat.split.maxsize
This results in more splits being processed by each mapper and improves the execution time. This is done by using the SpatialFilterInputFormat (spatial indexing) or FileSplitInputFormat (spatial hierarchical join, buffer). Also, the same results can be achieved by using an implementation of CombineFileInputFormat as the internal InputFormat.
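For example, assuming an org.apache.hadoop.mapreduce.Job instance named job, a larger maximum split size can be set before submitting the job (the 512 MB value is only illustrative):
//increase the maximum split size so that each mapper processes more data
Configuration conf = job.getConfiguration();
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 512L * 1024 * 1024);
conf.setLong("mapred.max.split.size", 512L * 1024 * 1024);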
Parent topic: Oracle Big Data Spatial Vector Analysis
2.11 Oracle Big Data Spatial Vector Analysis for Spark
Oracle Big Data Spatial Vector Analysis for Apache Spark is a spatial vector analysis API for Java and Scala that provides spatially-enabled RDDs (Resilient Distributed Datasets) that support spatial transformations and actions, spatial partitioning, and indexing.
These components make use of the Spatial Java API to perform spatial analysis tasks. The supported features include the following.
- Spatial RDD (Resilient Distributed Dataset)
- Spatial Transformations
- Spatial Actions (MBR and NearestNeighbors)
- Spatially Indexing a Spatial RDD
- Spatial DStream Transformations
- Support for Common Spatial Formats
- Spatial Spark SQL API
- Rendering Spatial Indexes on Maps
- JDBC Data Sources for Spatial RDDs
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.11.1 Spatial RDD (Resilient Distributed Dataset)
A spatial RDD (Resilient Distributed Dataset) is a Spark RDD that allows you to perform spatial transformations and actions.
The current spatial RDD implementation is the class oracle.spatial.spark.vector.rdd.SpatialJavaRDD
for Java and oracle.spatial.spark.vector.scala.rdd.SpatialRDD
for Scala. A spatial RDD implementation can be created from an existing instance of RDD or JavaRDD, as shown in the following examples:
Java:
//create a regular RDD
JavaRDD<String> rdd = sc.textFile("someFile.txt");
//create a SparkRecordInfoProvider to extract spatial information from the source RDD’s records
SparkRecordInfoProvider<String> recordInfoProvider = new MySparkRecordInfoProvider();
//create a spatial RDD
SpatialJavaRDD<String> spatialRDD = SpatialJavaRDD.fromJavaRDD(rdd, recordInfoProvider, String.class);
Scala:
//create a regular RDD
val rdd: RDD[String] = sc.textFile("someFile.txt")
//create a SparkRecordInfoProvider to extract spatial information from the source RDD’s records
val recordInfoProvider: SparkRecordInfoProvider[String] = new MySparkRecordInfoProvider()
//create a spatial RDD
val spatialRDD: SpatialRDD[String] = SpatialRDD(rdd, recordInfoProvider)
A spatial RDD takes an implementation of the interface oracle.spatial.spark.vector.SparkRecordInfoProvider
, which is used for extracting spatial information from each RDD element.
A regular RDD can be transformed into a spatial RDD of the same generic type; that is, if the source RDD contains records of type String, the spatial RDD will also contain String records.
You can also create a Spatial RDD with records of type oracle.spatial.spark.vector.SparkRecordInfo
. A SparkRecordInfo
is an abstraction of a record from the source RDD; it holds the source record’s spatial information and may contain a subset of the source record’s data.
The following examples show how to create an RDD of SparkRecordInfo
records.
Java:
//create a regular RDD
JavaRDD<String> rdd = sc.textFile("someFile.txt");
//create a SparkRecordInfoProvider to extract spatial information from the source RDD’s records
SparkRecordInfoProvider<String> recordInfoProvider = new MySparkRecordInfoProvider();
//create a spatial RDD
SpatialJavaRDD<SparkRecordInfo> spatialRDD = SpatialJavaRDD.fromJavaRDD(rdd, recordInfoProvider);
Scala:
//create a regular RDD
val rdd: RDD[String] = sc.textFile("someFile.txt")
//create a SparkRecordInfoProvider to extract spatial information from the source RDD’s records
val recordInfoProvider: SparkRecordInfoProvider[String] = new MySparkRecordInfoProvider()
//create a spatial RDD
val spatialRDD: SpatialRDD[SparkRecordInfo] = SpatialRDD.fromRDD(rdd, recordInfoProvider)
A spatial RDD of SparkRecordInfo
records has the advantage that spatial information does not need to be extracted from each record every time it is needed for a spatial operation.
You can accelerate spatial searches by spatially indexing a spatial RDD. Spatial indexing is described in section Spatially Indexing a Spatial RDD.
The spatial RDD provides the following spatial transformations and actions, which are described in the sections Spatial Transformations and Spatial Actions (MBR and NearestNeighbors).
Spatial transformations:
- filter
- flatMap
- join (available when creating a spatial index)
Spatial actions:
- MBR
- nearestNeighbors
Spatial Pair RDD
A pair version of the Java class SpatialJavaRDD is provided and is implemented as the class oracle.spatial.spark.vector.rdd.SpatialJavaPairRDD. A spatial pair RDD is created from an existing pair RDD and contains the same spatial transformations and actions as the single spatial RDD. A SparkRecordInfoProvider
used for a spatial pair RDD should receive records of type scala.Tuple2<K,V>
, where K
and V
correspond to the pair RDD key and value types, respectively.
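The following is a minimal sketch of such a provider for a pair RDD whose keys are record ids and whose values are comma-separated x,y ordinates; the class name and record layout are illustrative, and the sketch simply mirrors the getRecordInfo(), setSrid(), and getSrid() methods of the CSV example below.
public class PairRecordInfoProvider implements SparkRecordInfoProvider<Tuple2<String, String>> {
   private int srid = 8307;

   //receives a pair RDD record (key and value) and fills the given recordInfo
   public boolean getRecordInfo(Tuple2<String, String> record, SparkRecordInfo recordInfo) {
      try {
         //the pair key is used as the record id
         recordInfo.addField("id", record._1());
         //the pair value is expected to contain the x and y ordinates
         String[] xy = record._2().split(",");
         recordInfo.setGeometry(JGeometry.createPoint(
            new double[]{ Double.parseDouble(xy[0]), Double.parseDouble(xy[1]) }, 2, srid));
      } catch (Exception ex) {
         //return false so malformed records are ignored
         return false;
      }
      return true;
   }

   public void setSrid(int srid) { this.srid = srid; }
   public int getSrid() { return srid; }
}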
Example 2-1 SparkRecordInfoProvider to Read Information from a CSV File
The following example shows how to implement a simple SparkRecordInfoProvider
to read information from a CSV file.
public class CSVRecordInfoProvider implements SparkRecordInfoProvider<String>{
private int srid = 8307;
//receives an RDD record and fills the given recordInfo
public boolean getRecordInfo(String record, SparkRecordInfo recordInfo) {
try {
String[] tokens = record.split(",");
//expected records have the format: id,name,last_name,x,y where x and y are optional
//output recordInfo will contain the fields id, last name and geometry
recordInfo.addField("id", tokens[0]);
recordInfo.addField("last_name", tokens[2]);
if (tokens.length == 5) {
recordInfo.setGeometry(JGeometry.createPoint(new double[]{Double.parseDouble(tokens[3]), Double.parseDouble(tokens[4])}, 2, srid));
}
} catch (Exception ex) {
//return false when there is an error extracting data from the input value
return false;
}
return true;
}
public void setSrid(int srid) {this.srid = srid;}
public int getSrid() {return srid;}
}
In this example, the record’s ID and last-name fields are extracted along with the spatial information and set to the SparkRecordInfo instance used as an out parameter. Extracting additional information is only needed when the goal is to create a spatial RDD containing SparkRecordInfo elements and it is necessary to preserve a subset of the original record’s information. Otherwise, it is only necessary to extract the spatial information.
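When only the spatial information is needed, the provider can set just the geometry, as in the following minimal sketch (reusing the CSV layout of the preceding example; the class name is illustrative):
public class GeometryOnlyRecordInfoProvider implements SparkRecordInfoProvider<String> {
   private int srid = 8307;

   public boolean getRecordInfo(String record, SparkRecordInfo recordInfo) {
      try {
         //expected format: id,name,last_name,x,y (only the ordinates are used)
         String[] tokens = record.split(",");
         recordInfo.setGeometry(JGeometry.createPoint(
            new double[]{ Double.parseDouble(tokens[3]), Double.parseDouble(tokens[4]) }, 2, srid));
      } catch (Exception ex) {
         //records without valid spatial information are ignored
         return false;
      }
      return true;
   }

   public void setSrid(int srid) { this.srid = srid; }
   public int getSrid() { return srid; }
}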
The call to SparkRecordInfoProvider.getRecordInfo()
should return true
whenever the record should be included in a transformation or considered in a search. If SparkRecordInfoProvider.getRecordInfo()
returns false
, the record is ignored.
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.11.2 Spatial Transformations
The transformations described in the following subtopics are available for spatial RDD, spatial pair RDD, and the distributed spatial index unless stated otherwise (for example, a join transformation is only available for a distributed spatial index).
- Filter Transformation
- FlatMap Transformation
- Join Transformation
- Controlling Spatial Evaluation
- Spatially Enabled Transformations
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.11.2.1 Filter Transformation
A filter transformation is a spatial version of the regular RDD’s filter() transformation. In addition to a user-provided filtering function, it takes an instance of oracle.spatial.hadoop.vector.util.SpatialOperationConfig
, which is used to describe the spatial operation used to filter spatial records. A SpatialOperationConfig
contains a query window, which is the geometry used as reference, and a spatial operation. The spatial operation is executed in the form: (RDD record’s geometry) (spatial operation) (query window)
. For example: (RDD record) IsInside (queryWindow)
Spatial operations available are AnyInteract
, IsInside
, Contains
, and WithinDistance
.
The following examples return an RDD containing only records that are inside the given query window and that have a non-null ID.
Java:
SpatialOperationConfig soc = new SpatialOperationConfig();
soc.setOperation(SpatialOperation.IsInside);
soc.setQueryWindow(JGeometry.createLinearPolygon(new double[] { 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 }, 2, srid));
SpatialJavaRDD<SparkRecordInfo> filteredSpatialRDD = spatialRDD.filter(
(record) -> {
return record.getField("id") != null;
}, soc);
Scala:
val soc = new SpatialOperationConfig()
soc.setOperation(SpatialOperation.IsInside)
soc.setQueryWindow(JGeometry.createLinearPolygon(Array(2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 ), 2, srid))
val filteredSpatialRDD: SpatialRDD[SparkRecordInfo] = spatialRDD.filter(
record => { record.getField("id") != null }, soc)
Parent topic: Spatial Transformations
2.11.2.2 FlatMap Transformation
A FlatMap transformation is a spatial version of the regular RDD’s flatMap()
transformation. In addition to the user-provided function, it takes a SpatialOperationConfig to perform a spatial filtering. It works like the Filter Transformation, except that spatially filtered results are passed to the map function and flattened.
The following examples create an RDD that contains only the elements that interact with the given query window, with their geometries buffered.
Java:
SpatialOperationConfig soc = new SpatialOperationConfig();
soc.setOperation(SpatialOperation.AnyInteract);
soc.setQueryWindow(JGeometry.createLinearPolygon(new double[] { 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 }, 2, srid));
JavaRDD<SparkRecordInfo> mappedRDD = spatialRDD.flatMap(
(record) -> {
JGeometry buffer = record.getGeometry().buffer(2.5);
record.setGeometry(buffer);
return Collections.singletonList(record);
}, soc);
Scala:
val soc = new SpatialOperationConfig()
soc.setOperation(SpatialOperation.AnyInteract)
soc.setQueryWindow(JGeometry.createLinearPolygon(Array( 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 ), 2, srid))
val mappedRDD: RDD[SparkRecordInfo] = spatialRDD.flatMap(
record => {
val buffer: JGeometry = record.getGeometry().buffer(2.5)
record.setGeometry(buffer)
record
}, soc)
Note:
As of Spark 2, the Java class org.apache.spark.api.java.function.FlatMapFunction
received by the flatMap
transformation returns an instance of java.util.Iterator
instead of Iterable
, so the return line of the preceding flatMap
transformation Java example changes for Spark 2 to: return Collections.singletonList(record).iterator();
Parent topic: Spatial Transformations
2.11.2.3 Join Transformation
A join transformation joins two spatial RDDs based on a spatial relationship between their records. In order to perform this transformation, one of the two RDDs must be spatially indexed. (See Spatially Indexing a Spatial RDD for more information about indexing a spatial RDD.)
The result type of a spatial join transformation is defined by a user-provided lambda function that is called for each pair of joined records.
The following examples join all the records from both data sets that interact in any way.
Java:
DistributedSpatialIndex index = DistributedSpatialIndex.createIndex(sparkContext, spatialRDD1, new QuadTreeConfiguration());
SpatialJavaRDD<SparkRecordInfo> spatialRDD2 = SpatialJavaRDD.fromJavaRDD(rdd2, new RegionsRecordInfoProvider(srid));
SpatialOperationConfig soc = new SpatialOperationConfig();
soc.setOperation(SpatialOperation.AnyInteract);
JavaRDD<Tuple2<SparkRecordInfo, SparkRecordInfo>> joinedRDD = index.spatialJoin( spatialRDD2,
(recordRDD1, recordRDD2) -> {
return Collections.singletonList( new Tuple2<>(recordRDD1, recordRDD2)).iterator();
}, soc);
Scala:
val index: DistributedSpatialIndex[SparkRecordInfo] = DistributedSpatialIndex.createIndex(spatialRDD1, new QuadTreeConfiguration())
val spatialRDD2: SpatialRDD[SparkRecordInfo] = SpatialRDD.fromRDD(rdd2, new RegionsRecordInfoProvider(srid))
val soc = new SpatialOperationConfig()
soc.setOperation(SpatialOperation.AnyInteract)
val joinedRDD: RDD[(SparkRecordInfo, SparkRecordInfo)] = index.join( spatialRDD2,
(recordRDD1, recordRDD2) => {Seq((recordRDD1, recordRDD2))}, soc)
Parent topic: Spatial Transformations
2.11.2.4 Controlling Spatial Evaluation
When executing a filtering transformation or nearest neighbors action, by default the spatial operation is executed before calling the user-defined filtering function; however, you can change this behavior. Executing a user-defined filtering function before the spatial operation can improve performance in scenarios where the spatial operation is costly in comparison to the user-defined filtering function.
To set the user-defined function to be executed before the spatial operation, set the following parameter on the SpatialOperationConfig passed to either a filter transformation or a nearest neighbors action.
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig(SpatialOperation.AnyInteract, qryWindow, 0.05);
//set the spatial operation to be executed after the user-defined filtering function
spatialOpConf.addParam(SpatialOperationConfig.PARAM_SPATIAL_EVAL_STAGE, SpatialOperationConfig.VAL_SPATIAL_EVAL_STAGE_POST);
spatialRDD.filter((r)->{ return r.getFollowersCount()>1000;}, spatialOpConf);
The preceding example applies to both spatial RDDs and a distributed spatial index.
Parent topic: Spatial Transformations
2.11.2.5 Spatially Enabled Transformations
Spatial operations can be performed in regular transformations by creating a SpatialTransformationContext
before executing any transformation.
Once the SpatialTransformationContext instance is available in the transformation function, it can be used to get the record's geometry and apply spatial operations, as shown in the following example, which transforms an RDD of String records into a pair RDD where the key and value correspond to the source record ID and a buffered geometry.
Java:
SpatialJavaRDD<String> spatialRDD = SpatialJavaRDD.fromJavaRDD(rdd, new CSVRecordInfoProvider(srid), String.class);
SpatialTransformationContext stCtx = spatialRDD.createSpatialTransformationContext();
JavaPairRDD<String, JGeometry> bufferedRDD = spatialRDD.mapToPair(
(record) -> {
SparkRecordInfo recordInfo = stCtx.getRecordInfo(record);
String id = (String) recordInfo.getField("id");
JGeometry geom = recordInfo.getGeometry(record);
JGeometry buffer = geom.buffer(0.5);
return new Tuple2(id, buffer);
});
Scala:
val spatialRDD: SpatialRDD[String]= SpatialRDD.fromRDD(rdd, new CSVRecordInfoProvider(srid))
val stCtx: SpatialTransformationContext[String] = spatialRDD.createSpatialTransformationContext()
val bufferedRDD: RDD[(String, JGeometry)] = spatialRDD.map(
record => {
val recordInfo: SparkRecordInfo = stCtx.getRecordInfo(record)
val id: String = recordInfo.getField("id").asInstanceOf[String]
val geom: JGeometry = recordInfo.getGeometry(record)
val buffer: JGeometry = geom.buffer(0.5)
(id, buffer)
})
When working on a per-partition basis, you should use a stateful version of SpatialTransformationContext
, which avoids creating multiple instances of SparkRecordInfo
. The following pattern can be followed when working on a per-partition basis:
val stCtx: SpatialTransformationContext[String] = spatialRDD.createSpatialTransformationContext()
val bufferedRDD: RDD[(String, JGeometry)] = spatialRDD.mapPartitions(
(records) => {
val sSTCtx = new StatefulSpatialTransformationContext(stCtx)
records.map(record=>{
val recordInfo: SparkRecordInfo = sSTCtx.getRecordInfo(record)
val id: String = recordInfo.getField("id").asInstanceOf[String]
val geom: JGeometry = recordInfo.getGeometry(record)
val buffer: JGeometry = geom.buffer(0.5)
(id, buffer)
})
}, true)
Parent topic: Spatial Transformations
2.11.3 Spatial Actions (MBR and NearestNeighbors)
Spatial RDDs, spatial pair RDDs, and the distributed spatial index provide the following spatial actions.
-
MBR: Calculates the RDD’s minimum bounding rectangle (MBR). The MBR is only calculated once and cached, so the second time it is called, it will not be recalculated. The following examples show how to get the MBR from a spatial RDD. (This action is not available for DistributedSpatialIndex.)
Java:
double[] mbr = spatialRDD.getMBR();
Scala:
val mbr: Array[Double] = spatialRDD.getMBR()
-
NearestNeighbors: Returns a list containing the K nearest elements from an RDD or distributed spatial index to a given geometry. Additionally, a user-defined filter lambda function can be passed, so that only the records that pass the filter will be candidates to be part of the K nearest neighbors list. The following examples show how to get the 5 records closest to the given point.
Java:
JGeometry qryWindow = JGeometry.createPoint(new double[] { 2.0, 1.0 }, 2, srid);
SpatialOperationConfig soc = new SpatialOperationConfig(SpatialOperation.None, qryWindow, 0.05);
List<SparkRecordInfo> nearestNeighbors = spatialRDD.nearestNeighbors(
(record)->{
return ((Integer)record.getField("followers_count"))>1000;
}, 5, soc);
Scala:
val qryWindow: JGeometry = JGeometry.createPoint(Array(2.0, 1.0), 2, srid)
val soc: SpatialOperationConfig = new SpatialOperationConfig(SpatialOperation.None, qryWindow, 0.05)
val nearestNeighbors: Seq[SparkRecordInfo] = spatialRDD.nearestNeighbors(
record=>{ record.getField("followers_count").asInstanceOf[Int]>1000 }, 5, soc)
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.11.4 Spatially Indexing a Spatial RDD
A spatial RDD can be spatially indexed to speed up spatial searches when performing spatial transformations.
A spatial index repartitions the spatial RDD so that each partition only contains records in some specific area. This allows partitions that do not contain results in a spatial search to be quickly discarded, making the search faster.
A spatial index is created through the Java abstract class oracle.spatial.spark.vector.index.DistributedSpatialIndex
or its Scala equivalent oracle.spatial.spark.vector.scala.index.DistributedSpatialIndex
, both of which use a specific implementation to create the actual spatial index. The following examples show how to create a spatial index using a QuadTree-based spatial index implementation.
Java:
DistributedSpatialIndex<String> index = DistributedSpatialIndex.createIndex(sparkContext, spatialRDD1, new QuadTreeConfiguration());
Scala:
val index: DistributedSpatialIndex[String] = DistributedSpatialIndex.createIndex(spatialRDD1, new QuadTreeConfiguration())(sparkContext)
The type of spatial index implementation is determined by the last parameter, which is a subtype of oracle.spatial.spark.vector.index.SpatialPartitioningConfiguration
. Depending on the index implementation, the configuration parameter may accept different settings for performing partitioning and indexing. Currently, the only implementation of a spatial index is the class oracle.spatial.spark.vector.index.quadtree.QuadTreeDistIndex
, and it receives a configuration of type oracle.spatial.spark.vector.index.quadtree.QuadTreeConfiguration
.
The DistributedSpatialIndex
class currently supports the filter, flatMap, join, and nearestNeighbors transformations, which are described in Spatial Transformations.
A spatial index can be persisted using the method DistributedSpatialIndex.save()
, which takes an existing SparkContext
and a path where the index will be stored. The path may be in a local or a distributed (HDFS) file system. Similarly, a persisted spatial index can be loaded by calling the method DistributedSpatialIndex.load()
, which also takes an existing SparkContext
and the path where the index is stored.
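The following sketch illustrates the save and load calls described above; the exact method signatures, the parameter types, and the generic type of the loaded index are assumptions, and the HDFS path is only a placeholder.
//persist the index to a distributed file system path (the path is a placeholder)
index.save(sparkContext, "hdfs:///user/myUser/indexes/tweetsIndex");

//load the persisted index later, possibly from another application
DistributedSpatialIndex<String> loadedIndex =
   DistributedSpatialIndex.load(sparkContext, "hdfs:///user/myUser/indexes/tweetsIndex");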
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.11.4.1 Spatial Partitioning of a Spatial RDD
A spatial RDD can be partitioned through an implementation of the class oracle.spatial.spark.vector.index.SpatialPartitioning
. The SpatialPartitioning
class represents a spatial partitioning algorithm that transforms a spatial RDD into a spatially partitioned spatial pair RDD whose keys point to a spatial partition.
A SpatialPartitioning algorithm is used internally by a spatial index, or it can be used directly by creating a concrete class. Currently, there is a QuadTree-based implementation called oracle.spatial.spark.vector.index.quadtree.QuadTreePartitioning. The following example shows how to spatially partition a spatial RDD.
QuadTreePartitioning<T> partitioning = new QuadTreePartitioning<>(sparkContext, spatialRDD, new QuadTreeConfiguration());
SpatialJavaPairRDD<PartitionKey, T> partRDD = partitioning.getPartitionedRDD();
Parent topic: Spatially Indexing a Spatial RDD
2.11.4.2 Local Spatial Indexing of a Spatial RDD
A local spatial index can be created for each partition of a spatial RDD. Locally indexing the content of each partition helps to improve spatial searches when working on a partition basis.
A local index can be created for each partition by setting the parameter useLocalIndex
to true
when creating a distributed spatial index. A spatially partitioned RDD can also be transformed so each partition is locally indexed by calling the utility method oracle.spatial.spark.vector.index.local.LocalIndex.createLocallyIndexedRDD(SpatialJavaPairRDD<PartitionKey, T> rdd)
.
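As a minimal sketch, the locally indexed RDD can be obtained from a spatially partitioned RDD as follows (reusing the QuadTreePartitioning example from the previous topic; the return type of createLocallyIndexedRDD() is assumed here to be another spatial pair RDD):
//spatially partition the RDD (as in the previous topic)
QuadTreePartitioning<String> partitioning =
   new QuadTreePartitioning<>(sparkContext, spatialRDD, new QuadTreeConfiguration());
SpatialJavaPairRDD<PartitionKey, String> partRDD = partitioning.getPartitionedRDD();

//transform the partitioned RDD so that each partition is locally indexed
SpatialJavaPairRDD<PartitionKey, String> locallyIndexedRDD =
   LocalIndex.createLocallyIndexedRDD(partRDD);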
Parent topic: Spatially Indexing a Spatial RDD
2.11.5 Spatial DStream Transformations
A Spatial DStream is a Spark DStream that allows spatial transformations to be performed.
The current Spatial DStream implementations are the class oracle.spatial.spark.vector.streaming.dstream.SpatialJavaDStream
and oracle.spatial.spark.vector.streaming.dstream.SpatialJavaPairDStream
for Java, and oracle.spatial.spark.vector.scala.streaming.dstream.SpatialDStream
for Scala. A spatial DStream can be created from an existing instance of DStream or JavaDStream, as shown in the following examples.
Java:
//create a regular DStream
JavaDStream<String> stream = ssc.socketTextStream(host, port);
//create a SparkRecordInfoProvider to extract spatial information from the stream
SparkRecordInfoProvider<String> recordInfoProvider = new TextRecordInfoProvider();
//create a Spatial DStream
SpatialJavaDStream<SparkRecordInfo> spatialStream = SpatialJavaDStream.fromJavaDStream(stream, recordInfoProvider);
Scala:
//create a regular DStream
val stream: DStream[String] = ssc.socketTextStream(host, port)
//create a SparkRecordInfoProvider to extract spatial information from the stream
val recordInfoProvider: SparkRecordInfoProvider[String] = new TextRecordInfoProvider()
//create a Spatial DStream
val spatialStream: SpatialDStream[SparkRecordInfo] = SpatialDStream.fromDStream(stream, recordInfoProvider)
A Spatial DStream takes an implementation of the interface oracle.spatial.spark.vector.SparkRecordInfoProvider
, which is used for extracting spatial information from each element contained in the stream.
A regular DStream can be transformed into a Spatial DStream of the same generic type; that is, if the source DStream contains records of type String, the Spatial DStream will also contain String records. You can also create a Spatial DStream with records of type oracle.spatial.spark.vector.SparkRecordInfo
. A SparkRecordInfo is an abstraction of a record from the source DStream; it holds the source record’s spatial information and may contain a subset of the source record’s data. The following examples show how to create a Spatial DStream of SparkRecordInfo records.
Java:
//create a regular DStream
JavaDStream<String> stream = ssc.socketTextStream(host, port);
//create a SparkRecordInfoProvider to extract spatial information from the stream
SparkRecordInfoProvider<String> recordInfoProvider = new TextRecordInfoProvider();
//create a Spatial DStream
SpatialJavaDStream<SparkRecordInfo> spatialStream = SpatialJavaDStream.fromJavaDStream(stream, recordInfoProvider);
Scala:
//create a regular DStream
val stream: DStream[String] = ssc.socketTextStream(host, port)
//create a SparkRecordInfoProvider to extract spatial information from the stream
val recordInfoProvider: SparkRecordInfoProvider[String] = new TextRecordInfoProvider()
//create a Spatial DStream
val spatialStream: SpatialDStream[SparkRecordInfo] = SpatialDStream.fromDStream(stream, recordInfoProvider)
A Spatial DStream of SparkRecordInfo records has the advantage that spatial information does not need to be extracted from each record every time it is needed for a spatial operation.
The Spatial DStream provides the following spatial transformations, which are available for both the Java and Scala Spatial DStream implementations.
- Filter Transformation (Spatial DStream)
- FlatMap Transformation (Spatial DStream)
- NearestNeighbors Transformation (Spatial DStream)
- Enrich Transformation (Spatial DStream)
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.11.5.1 Filter Transformation (Spatial DStream)
A filter transformation is a spatial version of the regular DStream’s filter() transformation. In addition to a user-provided filtering function, it takes an instance of oracle.spatial.hadoop.vector.util.SpatialOperationConfig
, which is used to describe the spatial operation used to filter spatial records. A SpatialOperationConfig
contains a query window, which is the geometry used as reference, and a spatial operation. The spatial operation is executed in the form: (DStream record’s geometry) (spatial operation) (query window)
. For example: (DStream record) IsInside (queryWindow)
Spatial operations available are AnyInteract
, IsInside
, Contains
, and WithinDistance
.
The following examples return a Spatial DStream containing only records that are inside the given query window and that have a non-null ID.
Java:
SpatialOperationConfig soc = new SpatialOperationConfig();
soc.setOperation(SpatialOperation.IsInside);
soc.setQueryWindow(JGeometry.createLinearPolygon(new double[] { 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 }, 2, srid));
SpatialJavaDStream<SparkRecordInfo> filteredSpatialStream = spatialStream.filter(
(record) -> {
return record.getField("id") != null;
}, soc);
Scala:
val soc = new SpatialOperationConfig()
soc.setOperation(SpatialOperation.IsInside)
soc.setQueryWindow(JGeometry.createLinearPolygon(Array(2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 ), 2, srid))
val filteredSpatialStream: SpatialDStream[SparkRecordInfo] = spatialStream.filter(
record => { record.getField("id") != null }, soc)
Parent topic: Spatial DStream Transformations
2.11.5.2 FlatMap Transformation (Spatial DStream)
A FlatMap transformation is a spatial version of the regular RDD’s flatMap()
transformation. In addition to a user-provided function, it takes a SpatialOperationConfig to perform a spatial filtering. It works like the Filter Transformation (Spatial DStream), except that spatially filtered results are passed to the map function and flattened.
The following examples create a DStream that contains only the elements that interact with the given query window, with their geometries buffered.
Java:
SpatialOperationConfig soc = new SpatialOperationConfig();
soc.setOperation(SpatialOperation.AnyInteract);
soc.setQueryWindow(JGeometry.createLinearPolygon(new double[] { 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 }, 2, srid));
JavaDStream<SparkRecordInfo> mappedStream = spatialStream.flatMap(
(record) -> {
JGeometry buffer = record.getGeometry().buffer(2.5);
record.setGeometry(buffer);
return Collections.singletonList(record);
}, soc);
Scala:
val soc = new SpatialOperationConfig()
soc.setOperation(SpatialOperation.AnyInteract)
soc.setQueryWindow(JGeometry.createLinearPolygon(Array( 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 ), 2, srid))
val mappedStream: DStream[SparkRecordInfo] = spatialStream.flatMap(
record => {
val buffer: JGeometry = record.getGeometry().buffer(2.5)
record.setGeometry(buffer)
Seq(record)
}, soc)
Parent topic: Spatial DStream Transformations
2.11.5.3 NearestNeighbors Transformation (Spatial DStream)
A NearestNeighbors transformation returns a stream containing a single list of the K nearest elements from a Spatial DStream to a given geometry. The elements in the list are tuples of the form: (distance, Spatial DStream’s record)
, and are sorted by distance in ascending order. Additionally, a user-defined filter lambda function can be passed. Only the records that pass the filter will be candidates to be part of the K nearest neighbors list.
The following example gets the five closest records to the given point that have a followers_count
value greater than 1000.
Java:
JGeometry qryWindow = JGeometry.createPoint(new double[] { 2.0, 1.0 }, 2, srid);
SpatialOperationConfig soc = new SpatialOperationConfig(SpatialOperation.None, qryWindow, 0.05);
DStream<List<SparkRecordInfo>> nearestNeighborsStream = spatialStream.nearestNeighbors(
(record)->{
return ((Integer)record.getField("followers_count"))>1000;
}, 5, soc);
Scala:
val qryWindow: JGeometry = JGeometry.createPoint(Array(2.0, 1.0), 2, srid)
val soc: SpatialOperationConfig = new SpatialOperationConfig(SpatialOperation.None, qryWindow, 0.05)
val nearestNeighborsStream: DStream[Seq[SparkRecordInfo]] = spatialStream.nearestNeighbors(
record=>{ record.getField("followers_count").asInstanceOf[Int]>1000 }, 5, soc)
Parent topic: Spatial DStream Transformations
2.11.5.4 Enrich Transformation (Spatial DStream)
An Enrich transformation uses a GeoEnricher Component to associate features from different data layers to spatial records from a Spatial DStream. The spatial records and the layer features are matched by their spatial relationship.
The transformation has the following input and output.
-
Input:
-
Lambda function: A user-provided function that is called for each record from the stream (whether or not it has associated features). The lambda function takes two parameters: a record from the stream and an iterator of SpatialFeature instances associated with the stream’s record. The return type is an Iterator (Java) or a TraversableOnce (Scala) of a type specified by the user.
-
enricher: An instance of
GeoEnricher
used to perform the matching between geometries from the stream’s records and features from data layers.
-
-
Output: A DStream whose size and elements’ type is defined by the return type of the user-provided lambda function.
The following example executes the enrich transformation on an existing spatial stream. The transformation associates each stream record with a list of features from data layers describing the world political boundaries for continents, countries, states/provinces, and cities. The order of the returned features is from the most specific layer (in this case, cities) to the most general layer (continents).
In this example, the resulting stream will contain only those records from the spatial stream that were associated with a feature from the world political boundaries; records with no associated features are discarded.
Java:
// Create a GeoEnricher
GeoEnricher enricher = new GeoJSONGeoEnricher(
//path to the folder containing the world political boundaries
spatialDataLayersDir,
/*a predefined configuration which includes the world political boundaries for: continents, countries, states/provinces, and cities.*/
GeoJSONGeoEnricher.WORLD_POLITICAL_BOUNDARIES_CONF,
//a Hadoop configuration
hadoopConfiguration);
/*Perform the enrich transformation to create a stream of pairs where each pair is a record with its associated features*/
JavaPairDStream<SparkRecordInfo, List<SpatialFeature>> enrichedStream = spatialStream.enrich(
(record, features) -> {
//return only records with features
List<SpatialFeature> featureList = null;
while (features.hasNext()) {
if(featureList == null){
featureList = new LinkedList<>();
}
featureList.add(features.next());
}
if(featureList != null){
return Collections.singletonList(new Tuple2<>(record, featureList)).iterator();
}else{
return Collections.emptyIterator();
}
},
enricher
).mapToPair(tuple->{return tuple;});
Scala:
// Create a GeoEnricher
val enricher: GeoEnricher = new GeoJSONGeoEnricher(
/*path to the folder containing files with the world political boundaries*/
spatialDataLayersDir,
/*a predefined configuration which includes the world political boundaries for: continents, countries, states/provinces, and cities.*/
GeoJSONGeoEnricher.WORLD_POLITICAL_BOUNDARIES_CONF,
// a Hadoop configuration
hadoopConfiguration)
/*Perform the enrich transformation to create a stream of pairs where each pair is a record and its associated features*/
val enrichedStream: DStream[(SparkRecordInfo, Seq[SpatialFeature])] = spatialStream.enrich(
(record, features) => {
//return only records with features
Seq((record, features.toSeq)).filter(!_._2.isEmpty)
},
enricher
)
Parent topic: Spatial DStream Transformations
2.11.5.4.1 GeoEnricher Component
The interface oracle.spatial.spark.vector.geoenrichment.GeoEnricher
is used to perform enrichment of geometries. It provides a method that takes a geometry and returns an iterator of SpatialFeature instances that spatially interact with the given geometry.
The current implementation of GeoEnricher is the class oracle.spatial.spark.vector.geoenrichment.GeoJSONGeoEnricher
, which uses a hierarchy of data layers defined as GeoJSON files. The SpatialFeature instances returned by the enrich method are features from each level in the hierarchy where the first element will be the feature at the last hierarchy level, followed by its parent, and so on.
A GeoJSONGeoEnricher can be created with the following parameters:
-
Path to the folder containing the GeoJSON files
-
An instance of GeoJSONSpatialLayerHierarchyConfiguration. This class defines a set of spatial layers to be loaded as a hierarchy.
-
Optionally, a Hadoop configuration instance if the spatial data layers are stored in HDFS
A GeoJSONSpatialLayerHierarchyConfiguration receives an array of SpatialLayerDescriptor instances. Each SpatialLayerDescriptor describes a spatial layer defined in a GeoJSON file. The hierarchy level of each spatial layer corresponds to the index in the array, so the top parent layer will be the one with index 0, the layer with index 1 will be the child layer of the layer with index 0, and so on.
A SpatialLayerDescriptor has the following information:
-
filename
: The name of the GeoJSON file where the current layer is defined, for example: world_countries.json -
name
: The name of the spatial layer. This is text provided by the user to identify the spatial layer. If no name is provided, the name defined at the collectionName
field from the GeoJSON file is used. -
parentRefField
: The name of a field contained in every feature of the current layer, used to associate a feature from the current layer to a feature from the parent layer. If this field is null, the parents will be associated by finding the features from the parent layer that spatially contain or interact with features from the child layers. Any property from the properties list of a GeoJSON feature can be used as a parentRefField
. -
parentRefFieldMapping
: The name of a field contained in every feature of the parent layer, which is expected to have the same value as parentRefField in the child layer. That is, when looking for a parent feature, it is expected that childFeature.parentRefField = parentFeature.parentRefFieldMapping
if parentFeature
is the parent of childFeature
. If this value is set to null, a field with the name defined by parentRefField
will be looked for in the parent layer.
The following example defines a GeoJSONSpatialLayerHierarchyConfiguration for the world political boundaries from continents to cities. The files can be found at the installation folder at spatial/vector/examples/templates
.
GeoJSONSpatialLayerDescriptor[] descriptors = new GeoJSONSpatialLayerDescriptor[4];//4 hierarchy levels
//First level: continents. Top parent layer
descriptors[0]=new GeoJSONSpatialLayerDescriptor("world_continents.json", "continents");
/*Second level: countries. Features from this layer have a property called Continent which can be used to find a parent at the continents layer. The Continent property from the current layer points to the _id field at the parent layer*/
descriptors[1]=new GeoJSONSpatialLayerDescriptor("world_countries.json", "countries", "Continent", "_id");
/*Third level: states/provinces. The current layer’s ISO property points to the parent layer’s Country Code property.*/
descriptors[2]=new GeoJSONSpatialLayerDescriptor("world_states_provinces.json", "states/provinces", "ISO", "Country Code");
/*Last level: cities. Parents will be associated by finding states/provinces containing cities from this layer as no fields are provided.*/
descriptors[3]=new GeoJSONSpatialLayerDescriptor("world_cities.json", "cities");
//Finally a configuration is created.
GeoJSONSpatialLayerHierarchyConfiguration worldPoliticalBoundsConf =
new GeoJSONSpatialLayerHierarchyConfiguration(
descriptors,
8307, //SRID
0.05 //tolerance
);
Parent topic: Enrich Transformation (Spatial DStream)
2.11.6 Support for Common Spatial Formats
The Spark Vector API provides utilities to easily read data from common spatial formats such as GeoJSON and ESRI ShapeFile.
The Java class oracle.spatial.spark.vector.io.SpatialSources
and the Scala class oracle.spatial.spark.vector.scala.io.SpatialSources
contain static methods to read data from GeoJSON and ShapeFile formats by specifying the data path, the data Spatial Reference System ID (SRID), and the list of non-spatial fields to be loaded.
The following examples show how to load data from a GeoJSON file. The records are automatically transformed to instances of SparkRecordInfo
, which contain the spatial information plus the _id
and followers_count
fields. If all the fields need to be retrieved, null can be passed instead of the whole list of fields. Both GeoJSON and Shapefile read methods contain an overload that returns the original records as String and MapWritable representations, respectively.
Java:
//list of GeoJSON field names to be loaded for each feature
List<String> fieldNames = new ArrayList<String>();
fieldNames.add("_id");
fieldNames.add("followers_count");
//create a spatial RDD from a GeoJSON file
SpatialJavaRDD<SparkRecordInfo> spatialRDD = SpatialSources.readGeoJSONRecordInfo(geoJSONInputPath, 8307, fieldNames, sparkContext);
Scala:
//create a spatial RDD from a GeoJSON file
val spatialRDD = SpatialSources.readGeoJSONRecordInfo(geoJSONInputPath, 8307, Seq("_id","followers_count"))(sparkContext)
Or, using implicit classes:
//create a spatial RDD from a GeoJSON file
import oracle.spatial.spark.vector.scala.io.SpatialSources.ImplicitSpatialSources
val spatialRDD = sparkContext.readGeoJSONRecordInfo(geoJSONInputPath, 8307, Seq("_id","followers_count"))
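As noted above, null can be passed instead of the field list so that all the fields are retrieved. The following minimal Java sketch illustrates that variant, reusing the geoJSONInputPath and sparkContext variables from the preceding examples:
//create a spatial RDD that retrieves every field by passing null
//instead of an explicit list of field names
SpatialJavaRDD<SparkRecordInfo> allFieldsRDD = SpatialSources.readGeoJSONRecordInfo(geoJSONInputPath, 8307, null, sparkContext);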
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.11.7 Spatial Spark SQL API
The Spatial Spark SQL API supports Spark SQL DataFrame objects containing spatial information in any format.
Oracle Big Data Spatial Vector Hive Analysis can be used with Spark SQL.
Example 2-2 Creating a Spatial DataFrame for Querying Tweets
The following example uses the Spark 1.x API to create a spatial DataFrame for querying tweets. First, the data is loaded using a spatial RDD; then a DataFrame can be created using the function SpatialJavaRDD.createSpatialDataFrame
.
//create HiveContext
HiveContext sqlContext = new HiveContext(sparkContext.sc());
//get the spatial DataFrame from the SpatialRDD
//the geometries are in GeoJSON format
DataFrame spatialDataFrame = spatialRDD.createSpatialDataFrame(sqlContext, properties);
// Register the DataFrame as a table.
spatialDataFrame.registerTempTable("tweets");
//register UDFs
sqlContext.sql("create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon'");
sqlContext.sql("create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point'");
sqlContext.sql("create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains'");
// SQL can be run over RDDs that have been registered as tables.
StringBuffer query = new StringBuffer();
query.append("SELECT geometry, friends_count, location, followers_count FROM tweets ");
query.append("WHERE ST_Contains( ");
query.append(" ST_Polygon('{\"type\": \"Polygon\",\"coordinates\": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307) ");
query.append(" , ST_Point(geometry, 8307) ");
query.append(" , 0.05)");
query.append(" and followers_count > 50");
DataFrame results = sqlContext.sql(query.toString());
//Filter the tweets in a query window (somewhere in the north of Mexico)
//and with more than 50 followers.
//Note that since the geometries are in GeoJSON format it is possible to create the ST_Point like
//ST_Point(geometry, 8307)
//instead of
//ST_Point(geometry, 'oracle.spatial.hadoop.vector.hive.json.GeoJsonHiveRecordInfoProvider')
List<String> filteredTweets = results.javaRDD().map(new Function<Row, String>() {
public String call(Row row) {
StringBuffer sb = new StringBuffer();
sb.append("Geometry: ");
sb.append(row.getString(0));
sb.append("\nFriends count: ");
sb.append(row.getString(1));
sb.append("\nLocation: ");
sb.append(row.getString(2));
sb.append("\nFollowers count: ");
sb.append(row.getString(3));
return sb.toString();
}
}).collect();
//print the filtered tweets
filteredTweets.forEach(tweet -> System.out.println("Tweet: "+tweet));
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.11.7.1 Spark 2 API Enhancements
New Spark SQL capabilities have been added to the Spark 2 Vector API.
Spatial DataSet/DataFrame
Spatial RDDs can be transformed to DataSets/DataFrames using the functions provided by the class oracle.spatial.spark.vector.sql.SpatialJavaRDDConversions
(Java) and oracle.spatial.spark.vector.scala.sql.SpatialRDDConversions
(Scala). The latter provides an implicit class in order to make it possible to call the transformation from the Spatial RDD instance. The following examples show how to transform a Spatial RDD to a DataFrame.
Java:
List<String> fields = Arrays.asList(new String[]{"friends_count", "location", "followers_count"});
Dataset<Row> spatialDataFrame = SpatialJavaRDDConversions.toDataFrame(spatialRDD, fields, sparkSession);
Scala:
//using implicit classes
import oracle.spatial.spark.vector.scala.sql.SpatialRDDConversions.ImplicitSpatialRDDConversions
val spatialDataFrame = spatialRDD.toDataFrame(Seq("friends_count","location", "followers_count"))(sparkSession)
A spatial DataFrame can also be created from a GeoJSON file. The following examples show how a GeoJSON file can be loaded into a spatial DataFrame. The sample GeoJSON content is also shown.
GeoJSON:
{ "type":"FeatureCollection",
"attr_names":["id","category"],
"features":[
{"type":"Feature","_id":"1","geometry":{"type":"Point","coordinates":[-122.40849,37.7972]},"properties":{"category":"6"}}
{"type":"Feature","_id":"2","geometry":{"type":"Point","coordinates":[-122.40816,37.79769]},"properties":{"category":"4"}}
}
Java:
String[] nonSpatialCols = { "_id", "category" };
sparkSession.read().format(GeoJSONRelation.Format()).option("srid", 8307).schema(SchemaUtils.createStringFieldsSchema(Arrays.asList(nonSpatialCols))).load("/somepath/geojson.json").createOrReplaceTempView("spatialTable");
Scala:
val nonSpatialCols = Seq("_id", "category")
sparkSession.read.format(GeoJSONRelation.Format).option("srid", 8307).schema(SchemaUtils.createStringFieldsSchema(nonSpatialCols)).load("/somepath/geojson.json").createOrReplaceTempView("spatialTable")
Spatial UDFs
The same set of Hive UDFs is available as Spark UDFs for the Spark 2 Vector API. For details, see Spatial Analysis Spark SQL UDFs.
The following line registers the spatial UDFs and should be executed before using any spatial UDF.
SpatialEnvironment.setup(sparkSession)
Spatial Index
An existing Spark Vector API’s spatial index can be used from Spark 2 SQL to perform faster spatial queries.
The following examples show how to transform an instance of a spatial index to a DataFrame:
Java:
// Create a spatial RDD from a GeoJSON file
List<String> fieldNames = Arrays.asList(new String[] {"id", "followers_count"});
SpatialJavaRDD<SparkRecordInfo> spatialRDD = SpatialSources.readGeoJSONRecordInfo(path, srid, fieldNames, sparkContext);
//Create a spatial index
DistributedSpatialIndex<SparkRecordInfo> index = DistributedSpatialIndex.createIndex(sparkContext, spatialRDD, new QuadTreeConfiguration());
//Specify the columns as StructFields. The geometry column is always included by default
StructField[] fields = SchemaUtils.toStringStructFields(fieldNames);
//options can be null if there are no options to be passed
Map<String, Object> options = new HashMap<>();
//include the CRS in all the geometries to avoid using SDO_<TYPE> wrappers in spatial UDFs
options.put(QuadTreeIndexRelation.OptIncludeCRS(), true);
//transform the existing spatial index to a DataFrame and register it as a temporary table
QuadTreeIndexRelation.toDataFrame(index, SparkRecordInfo.class, fields, options, sparkSession).createOrReplaceTempView("tweets_index");
Scala:
import oracle.spatial.spark.vector.scala.io.SpatialSources.ImplicitSpatialSources
import oracle.spatial.spark.vector.scala.sql.index.quadtree.QuadTreeIndexRelation._
import oracle.spatial.spark.vector.scala.sql.SpatialRDDConversions.ImplicitSpatialRDDConversions
//List of field names to be loaded from the GeoJSON file
val fieldNames = Seq("id", "followers_count")
//create a spatial RDD
val spatialRDD = sparkContext.readGeoJSON(path, srid, fieldNames)
//spatially index the spatial RDD
val index = DistributedSpatialIndex.createIndex(spatialRDD, new QuadTreeConfiguration())(implicitly, sparkContext)
//transform the existing spatial index to a DataFrame and register it as a temporary table
//fieldNames is automatically transformed to an array of string StructFields thanks to the import of QuadTreeIndexRelation._
//toDataFrame can be called from the index thanks to the import of ImplicitSpatialRDDConversions
index.toDataFrame(fieldNames, Map(QuadTreeIndexRelation.OptIncludeCRS->true))(sparkSession).createOrReplaceTempView("tweets_index")
It is also possible to load a persisted spatial index directly into a DataFrame, as the following examples show.
Java:
// list of GeoJSON field names to be loaded for each feature
List<String> fieldNames = Arrays.asList(new String[] { "id", "followers_count"});
// Create the required schema for the index. In this case, the schema
// contains only fields of type StringType. A schema with other data
// types can be passed if needed.
StructType schema = SchemaUtils.createStringFieldsSchema(fieldNames);
// read an existing spatial index and register it as table
sparkSession.read().format(QuadTreeIndexRelation.Format()).schema(schema).load(indexPath).createOrReplaceTempView("tweets_index");
Scala:
//List of field names from the spatial index to be included as columns.
val fieldNames = Seq("id", "followers_count")
//Create the required schema for the index.
//In this case, the schema contains only fields of type StringType.
//A schema with other data types can be passed if needed.
val schema = SchemaUtils.createStringFieldsSchema(fieldNames)
//read an existing spatial index and register it as a table
sparkSession.read.format(QuadTreeIndexRelation.Format).schema(schema).load(indexPath).createOrReplaceTempView("tweets_index")
After a spatial index is transformed to a DataFrame, it can be used as any other spatial DataFrame.
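For example, after the persisted index has been registered as the temporary view tweets_index (and the spatial UDFs have been registered with SpatialEnvironment.setup), a query can be issued against it in the same way as against any other spatial DataFrame. The following Java sketch assumes the id and followers_count fields loaded in the earlier examples:
//query window used only for illustration
String polygonJSON = "{\"type\": \"Polygon\", \"coordinates\": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}";
//retrieve the indexed records that interact with the query window
String query = "SELECT id, followers_count FROM tweets_index WHERE ST_ANYINTERACT(ST_POLYGON('" + polygonJSON + "', 8307), ST_POINT(geometry, 8307), 0.05)";
sparkSession.sql(query).show();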
A spatial index can be also created from an existing spatial DataFrame, as shown in the following examples.
Java:
//load a spatial DataFrame from a GeoJSON file
Dataset<Row> spatialDataFrame = sparkSession.read().format(GeoJSONRelation.Format()).option("srid", srid).schema(mySchema).load("/user/x/customers.json");
//index the spatial DataFrame and register as table index
QuadTreeConfiguration qtConf = new QuadTreeConfiguration();
Map<String, Object> options = new HashMap<>();
options.put(QuadTreeIndexRelation.OptIncludeCRS(), true);
QuadTreeIndexRelation.indexSpatialDataFrame(spatialDataFrame, srid, qtConf, options, sparkSession).createOrReplaceTempView("index");
Scala:
//load a spatial DataFrame from a GeoJSON file
val spatialDataFrame = sparkSession.read.format(GeoJSONRelation.Format).option("srid", srid).schema(mySchema).load("/user/x/customers.json")
//index the spatial DataFrame and register as table index
val qtConf = new QuadTreeConfiguration()
val options = Map(QuadTreeIndexRelation.OptIncludeCRS->true)
QuadTreeIndexRelation.indexSpatialDataFrame(spatialDataFrame, srid, qtConf, options, sparkSession).createOrReplaceTempView("index")
A spatially indexed DataFrame can be persisted. The following examples show how to save a spatial index as a persistent table.
Java:
Dataset<Row> index = QuadTreeIndexRelation.indexSpatialDataFrame(spatialDataFrame, srid, qtConf, options, sparkSession);
index.write().format(QuadTreeIndexRelation.Format()).saveAsTable("index");
Scala:
val index = QuadTreeIndexRelation.indexSpatialDataFrame(spatialDataFrame, srid, qtConf, options, sparkSession)
index.write.format(QuadTreeIndexRelation.Format).saveAsTable("index")
Spatial Join
Two spatial DataFrames can be joined using a spatial condition specified by a two-operand UDF, such as st_anyinteract
or st_contains
. Records from both sides that meet the specified spatial condition will be joined.
A spatial index can be used to perform an optimized spatial join, which generally should perform faster.
The following examples show how to perform spatial join between two DataFrames where one is a spatial index.
Java:
//load a spatial index
sparkSession.read().format(QuadTreeIndexRelation.Format()).schema(custSchema).load(indexPath).createOrReplaceTempView("customers");
//load a spatial dataframe
sparkSession.read().format(GeoJSONRelation.Format()).option("srid", 8307).schema(storesSchema).load(storesPath).createOrReplaceTempView("stores");
//retrieve all the customers within 2 kilometers from each store
String query = "select * from stores s, customers c where st_withindistance(st_point(s.geometry, 8307), st_point(c.geometry, 8307), 2000.0, 0.05)";
sparkSession.sql(query).show();
Scala:
//load a spatial index
sparkSession.read.format(QuadTreeIndexRelation.Format).schema(custSchema).load(indexPath).createOrReplaceTempView("customers")
//load a spatial dataframe
sparkSession.read.format(GeoJSONRelation.Format).option("srid", 8307).schema(storesSchema).load(storesPath).createOrReplaceTempView("stores")
//retrieve all the customers within 2 kilometers from each store
val query = "select * from stores s, customers c where st_withindistance(st_point(s.geometry, 8307), st_point(c.geometry, 8307), 2000.0, 0.05)"
sparkSession.sql(query).show()
Nearest Neighbors
The UDF ST_NN can be used to find the nearest neighbors for a specified location. This location can be a single geometry or the locations from an existing table.
The following examples show how to get the nearest neighbors for a specified geometry. Note that the DataFrame must be spatially indexed.
Java:
//read an existing spatial index
spark.read().format(QuadTreeIndexRelation.Format()).schema(schema).load(indexPath).createOrReplaceTempView("index");
//define a polygon geometry
String polygonJSON = "{\"type\": \"Polygon\", \"coordinates\": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}";
//look for the 5 nearest neighbors for the previously defined polygon
String query = "SELECT location FROM index WHERE ST_NN( ST_POLYGON(geometry, 8307), ST_POLYGON('"+polygonJSON+"', 8307), 5, 0.05)";
sparkSession.sql(query).show();
Scala:
//read an existing spatial index
spark.read.format(QuadTreeIndexRelation.Format).schema(schema).load(indexPath).createOrReplaceTempView("index")
//define a polygon geometry
val polygonJSON = """{"type": "Polygon", "coordinates": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}"""
//look for the 5 nearest neighbors for the previously defined polygon
val query = s"SELECT location FROM index WHERE ST_NN( ST_POLYGON(geometry, 8307), ST_POLYGON('$polygonJSON', 8307), 5, 0.05)"
sparkSession.sql(query).show()
The following examples show how to get the five nearest neighbors for all the rows from the table named STORES. When this version of the nearest neighbors
operation is executed, one of the two DataFrames must be spatially indexed.
Java:
//load a spatial index
sparkSession.read().format(QuadTreeIndexRelation.Format()).schema(custSchema).load(indexPath).createOrReplaceTempView("customers");
//load a spatial dataframe
sparkSession.read().format(GeoJSONRelation.Format()).option("srid", 8307).schema(storesSchema).load(storesPath).createOrReplaceTempView("stores");
//retrieve the five nearest customers for each store
String query = "select * from stores s, customers c where st_nn(st_point(c.geometry, 8307), st_point(s.geometry, 8307), 5, 0.05)";
sparkSession.sql(query).show();
Scala:
//load a spatial index
sparkSession.read.format(QuadTreeIndexRelation.Format).schema(custSchema).load(indexPath).createOrReplaceTempView("customers")
//load a spatial dataframe
sparkSession.read.format(GeoJSONRelation.Format).option("srid", 8307).schema(storesSchema).load(storesPath).createOrReplaceTempView("stores")
//retrieve the five nearest customers for each store
String query = "select * from stores s, customers c where st_nn(st_point(c.geometry, 8307), st_point(s.geometry, 8307), 5, 0.05)";
sparkSession.sql(query).show()
Performance Considerations with a Spatial Index Over Spark 2 SQL
A spatial index performs faster when the WHERE clause contains only a spatial filter, or a spatial filter combined with AND conditions. The following queries take full advantage of a spatial index because the spatial data is pre-filtered before the SQL query is executed:
SELECT * FROM tweets_index WHERE ST_ANYINTERACT( ST_POLYGON('$polygonJSON',8307), ST_POINT(geometry,8307), 0.05 )
SELECT * FROM tweets_index WHERE ST_CONTAINS( ST_POLYGON('$polygonJSON',8307), ST_POINT(geometry,8307), 0.05 ) AND followers_count > 50
SELECT * FROM tweets_index WHERE ST_INSIDE( ST_POINT(geometry,8307), ST_POLYGON('$polygonJSON',8307), 0.05 ) AND followers_count > 50 AND id != null
Using OR conditions prevents the spatial data from being pre-filtered; however, some spatial index optimizations are still applied. The following query is an example of this case:
SELECT * FROM tweets_index WHERE ST_CONTAINS( ST_POLYGON('$polygonJSON',8307), ST_POINT(geometry,8307), 0.05 ) OR followers_count > 50
When using more than one spatial filter in a WHERE clause, no spatial index optimizations are used and the query is performed as if there were no spatial index. For example:
SELECT * FROM tweets_index
WHERE
ST_ANYINTERACT( ST_POLYGON('$polygonJSON1',8307), ST_POINT(geometry,8307), 0.05 )
AND
ST_CONTAINS( ST_POLYGON('$polygonJSON2',8307), ST_POINT(geometry,8307), 0.05 )
Parent topic: Spatial Spark SQL API
2.11.7.2 Spatial Analysis Spark SQL UDFs
Spatial analysis functions are available as Spark 2 SQL UDFs (user-defined functions).
The same set of Hive UDFs is available as Spark UDFs for the Spark 2 Vector API. In order to start using the Spatial UDFs, the following method from class oracle.spatial.spark.vector.scala.sql.SpatialEnvironment
needs to be executed before calling any query containing a spatial UDF:
SpatialEnvironment.setup(sparkSession)
The input spatial data can be in GeoJSON, WKT, or WKB format. You can also use a spatial index for faster processing.
In the queries, spatial geometry type constructors, such as ST_Polygon or ST_Point, can be used to create a GeoJSON representation of the input geometry and to add an SRID (coordinate system) to the geometry. These constructors must be used whenever a geometry is specified in the query, even if the geometry is already in GeoJSON format. The exception is when you use the spatial index option to set the SRID in the geometry, in which case a spatial geometry type constructor is not needed; for example:
spark.read().format(QuadTreeIndexRelation.Format()).schema(schema)
.option(QuadTreeIndexRelation.OptIncludeCRS(), true) //avoid using Type Functions
.load(indexPath).createOrReplaceTempView("tweets_index");
Prerequisite Libraries for Spatial Analysis Spark SQL UDFs
The required libraries for Spatial Analysis Spark SQL UDFs are:
-
sdohadoop-vector.jar
-
sdospark2-vector.jar
-
sdoutl.jar
-
sdoapi.jar
-
ojdbc8.jar
Using Spark SQL UDFs
Spatial analysis Spark SQL UDFs are a series of Spark SQL user-defined functions used to create geometries and perform spatial operations using one or two geometries in creating a Spark SQL query.
Hive and Spark Spatial SQL Functions provides reference information for the available spatial functions.
The following example returns the tweet records within a specific geographical polygon and where there are more than 50 followers. The general steps for the example are:
-
Set up the spatial SQL environment.
-
Create a spatial RDD from geographical input.
-
Create a DataSet from the SpatialRDD. A spatial DataSet contains a column called geometry whose values are in GeoJSON format.
-
Register the DataSet so it can be used within SQL statements as a table.
-
Create the query to filter the records.
-
Execute the filter.
Java Example:
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import oracle.spatial.spark.vector.SparkRecordInfo;
import oracle.spatial.spark.vector.io.SpatialSources;
import oracle.spatial.spark.vector.rdd.SpatialJavaRDD;
import oracle.spatial.spark.vector.scala.sql.SpatialEnvironment;
import oracle.spatial.spark.vector.sql.SpatialJavaRDDConversions;
public class SpatialQueryExample {
public static void main(String[] args) {
SparkSession spark = SparkSession.builder().appName("SpatialEx").getOrCreate();
//Setup spatial SQL environment
SpatialEnvironment.setup(spark);
String geoJSONInput = args[0];
//The coordinate system the spatial data is expected to be
int srid = 8307;
// list of GeoJSON field names to be loaded for each feature
List<String> fieldNames = Arrays.asList(new String[] {
"id", "followers_count", "friends_count", "location"} );
// Create a spatial RDD from a GeoJSON file
SpatialJavaRDD<SparkRecordInfo> spatialRDD =
SpatialSources.readGeoJSONRecordInfo(geoJSONInput, srid, fieldNames,
JavaSparkContext.fromSparkContext(spark.sparkContext()));
// Create a DataSet from the SpatialRDD.
Dataset<Row> spatialDF = SpatialJavaRDDConversions.toDataFrame(
spatialRDD, fieldNames, spark);
// Register the dataset so it can be used within SQL statements
spatialDF.createOrReplaceTempView("sample_tweets");
//polygon used to spatially filter data
String qryWindow = "{\"type\": \"Polygon\",\"coordinates\": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}";
// Filter the tweets within the query window (somewhere in the north of Mexico)
StringBuilder query =new StringBuilder()
.append(" SELECT geometry, friends_count, location, followers_count")
.append(" FROM sample_tweets ")
.append(" WHERE ")
.append(" ST_CONTAINS(ST_POLYGON('").append(qryWindow).append("', 8307),
ST_POINT(geometry, 8307), 0.05)")
.append(" AND followers_count > 50 ");
//Execute the query
spark.sql(query.toString()).show();
}
}
Scala Example:
import org.apache.spark.sql.SparkSession
import oracle.spatial.spark.vector.sql.udf.function.FunctionExecutor
import oracle.spatial.spark.vector.scala.io.SpatialSources.ImplicitSpatialSources
import oracle.spatial.spark.vector.scala.sql.SpatialRDDConversions.ImplicitSpatialRDDConversions
import scala.collection.mutable.StringBuilder
import oracle.spatial.spark.vector.scala.sql.SpatialEnvironment
object SpatialQueryExample {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("SpatialQueryExample").getOrCreate()
//Setup spatial SQL environment
SpatialEnvironment.setup(spark)
val geoJSONInput = args(0)
//The coordinate system the spatial data is expected to be
val srid = 8307
// list of GeoJSON field names to be loaded for each feature
val fieldNames = Seq("id", "followers_count", "friends_count", "location")
// Create a spatial RDD from a GeoJSON file
val spatialRDD = spark.sparkContext.readGeoJSONRecordInfo(geoJSONInput, srid,
fieldNames)
// Create a DataSet from the SpatialRDD.
val spatialDF = spatialRDD.toDataFrame(fieldNames)(spark)
// Register the dataset so it can be used within SQL statements
spatialDF.createOrReplaceTempView("sample_tweets")
//polygon used to spatially filter data
val qryWindow = """{"type": "Polygon","coordinates":
[[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}"""
// Filter the tweets within the query window (somewhere in the north of Mexico)
val query =s""" SELECT geometry, friends_count, location, followers_count
| FROM sample_tweets
| WHERE
| ST_CONTAINS(ST_POLYGON('$qryWindow', $srid),
ST_POINT(geometry, $srid), 0.05)
| AND followers_count > 50 """.stripMargin
//Execute the query
val results = spark.sql(query)
results.show()
}
}
Using Spatial Indexes with Spark UDFs
Spatial Spark SQL UDFs can process indexed data sets. You can create an index on the fly or you can use a persisted spatial index. For more information, see Spatially Indexing a Spatial RDD.
The following example filters the tweet records that spatially interact with a specified polygon or that have a followers_count value of 2, and it uses the spatial index option to include the SRID in the geometry column. In this scenario there is no need to wrap the geometry in a Type function.
The general steps are:
-
Set up the spatial SQL environment.
-
Read a persisted index into a DataSet and register it as a table.
-
Create the query to filter the records.
-
Execute the filter.
Java Example:
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import oracle.spatial.spark.vector.scala.sql.SpatialEnvironment;
import oracle.spatial.spark.vector.scala.sql.index.quadtree.QuadTreeIndexRelation;
import oracle.spatial.spark.vector.serialization.SpatialVectorKryoRegistrator;
public class IndexOptionsAndSchemaTypesExample {
public static void main(String[] args) {
SparkConf conf = new SparkConf();
// the index is expected to have its partitions indexed with an R-Tree
// so the following line is required if Kryo is used
SpatialVectorKryoRegistrator.register(conf);
SparkSession spark=SparkSession.builder().config(conf).appName("I").getOrCreate();
//Setup spatial SQL environment
SpatialEnvironment.setup(spark);
String indexPath = args[0];
//Create the required schema for the index.
StructType schema = new StructType(new StructField[]{
new StructField("followers_count", DataTypes.IntegerType, true, Metadata.empty()),
new StructField("friends_count", DataTypes.IntegerType, true, Metadata.empty()),
new StructField("location", DataTypes.StringType, true, Metadata.empty())
});
//read an existing spatial index and register it as table called "tweets_index"
spark.read().format(QuadTreeIndexRelation.Format()).schema(schema)
.option(QuadTreeIndexRelation.OptIncludeCRS(), true)//avoid using Type Functions
.load(indexPath).createOrReplaceTempView("tweets_index");
//polygon used to spatially filter data
String qryWindow = "{\"type\": \"Polygon\",\"coordinates\": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}";
// Retrieve all the tweets which spatially interact with the given polygon
// Note that geometry column is not surrounded by the ST_POINT function
StringBuilder query =new StringBuilder()
.append(" SELECT geometry, friends_count, location, followers_count")
.append(" FROM tweets_index ")
.append(" WHERE ")
.append(" ST_ANYINTERACT(
ST_POLYGON('").append(qryWindow).append("', 8307),
geometry, 0.05)")
.append(" OR followers_count = 2 ");
System.out.println(query);
spark.sql(query.toString()).show();
}
}
Scala Example:
import org.apache.spark.sql.SparkSession
import oracle.spatial.spark.vector.sql.udf.function.FunctionExecutor
import oracle.spatial.spark.vector.scala.io.SpatialSources.ImplicitSpatialSources
import oracle.spatial.spark.vector.scala.sql.SpatialRDDConversions.ImplicitSpatialRDDConversions
import scala.collection.mutable.StringBuilder
import org.apache.spark.SparkConf
import oracle.spatial.spark.vector.serialization.SpatialVectorKryoRegistrator
import oracle.spatial.spark.vector.scala.sql.SpatialEnvironment
import oracle.spatial.spark.vector.scala.sql.index.quadtree.QuadTreeIndexRelation
import oracle.spatial.spark.vector.scala.sql.util.SchemaUtils
import org.apache.spark.sql.types.StructField
import oracle.spatial.spark.vector.scala.sql.util.SchemaUtils
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.Metadata
import org.apache.spark.sql.types.StringType
object IndexOptionsAndSchemaTypesExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf
//the index is expected to have its partitions indexed with an R-Tree
//so the following line is required if Kryo is used
SpatialVectorKryoRegistrator.register(conf)
val spark = SparkSession.builder().config(conf).appName("IndexEx").getOrCreate()
//Setup spatial SQL environment
SpatialEnvironment.setup(spark)
val indexPath = args(0)
//Create the required schema for the index
val schema = StructType(Array(
StructField("followers_count",IntegerType, true, Metadata.empty),
StructField("friends_count",IntegerType, true, Metadata.empty),
StructField("location",StringType, true, Metadata.empty)))
//read an existing spatial index and register it as table called "tweets_index"
spark.read.format(QuadTreeIndexRelation.Format).schema(schema)
.option(QuadTreeIndexRelation.OptIncludeCRS, true)//set to avoid using Type Functs
.load(indexPath).createOrReplaceTempView("tweets_index")
//polygon used to spatially filter the data
val polygonJSON = """{"type": "Polygon", "coordinates": [[[-106, 25], [-106, 30],
[-104, 30], [-104, 25], [-106, 25]]]}"""
//Spatial reference system ID of the data
val srid = 8307
//Retrieve tweets which spatially interact with the given polygon
//Note that geometry column is not surrounded by the ST_POINT function
val query = s"""SELECT geometry, location, friends_count, followers_count
| FROM tweets_index
| WHERE
| ST_ANYINTERACT( ST_POLYGON('$polygonJSON',$srid), geometry, 0.05 )
| OR followers_count = 2 """.stripMargin
println(s"Executing: \n$query")
val results = spark.sql(query)
results.show()
}
}
Parent topic: Spatial Spark SQL API
2.11.8 Rendering Spatial Indexes on Maps
The Java API provides two ways to generate results based on a spatial index that can be rendered on maps using the Oracle Map API.
- Render all the records or filtered records as an image. Some examples of the possibilities are:
- Render an image using a numerical attribute and setting the minimum and maximum values.
- Render an image using a numerical attribute and setting limits with defined colors.
- Render an image using a textual attribute and setting possible values with defined colors.
- Render the records as a categorization result (such as by countries or regions). Some examples of the possibilities are:
- Categorize by country using a numerical attribute and setting minimum and maximum values, using default colors and the average or sum as categorization operation.
- Categorize using the count operation and setting limits and colors.
- Categorize using the count operation, the default colors, and the automatic limits detection.
- Categorize using textual attribute and setting possible values and colors.
Detailed code examples can be found under the folder /opt/oracle/oracle-spatial-graph/spatial/vector/examples/
.
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.11.9 JDBC Data Sources for Spatial RDDs
Oracle Database data can be used as the data source of a Spatial RDD by using the Spark Vector Analysis API.
The class oracle.spatial.spark.vector.util.JDBCUtils
(or oracle.spatial.spark.vector.scala.util.JDBCUtils
for Scala) provides convenience methods for creating a Spatial RDD from an Oracle database table or from a SQL query to an Oracle database. The table or SQL query should contain one column of type SDO_GEOMETRY in order to create a Spatial RDD.
Both the from-table and from-query method versions require a connection to the Oracle database, which is supplied by a lambda function defined by the template oracle.spatial.spark.vector.util.ConnectionSupplier
(or oracle.spatial.spark.vector.scala.util.ConnectionSupler
for Scala).
The resulting Spatial RDD type parameter will always be SparkRecordInfo
, that is, the resulting RDD will contain records of the type SparkRecordInfo
, which will contain the fields specified when querying the table or the columns in the SELECT section of the SQL query. By default, the name and type of the columns retrieved are inferred using the ResultSet
metadata; however, you can control the naming and type of the retrieved fields by supplying an implementation of SparkRecordInfoProvider.
The following examples show how to create a Spatial RDD from a table and from a SQL query respectively.
Example 2-3 Creating a Spatial RDD from a Database Table
SpatialJavaRDD<SparkRecordInfo> jdbcSpatialRDD = JDBCUtils.createSpatialRDDFromTable(
sparkContext, //spark context
()->{
Class.forName("oracle.jdbc.driver.OracleDriver");
return DriverManager.getConnection(connURL, usr, pwd);
}, //DB connection supplier lambda
"VEHICLES", //DB table
Arrays.asList(new String[]{"ID","DESC","LOCATION"}), //list of fields to retrieve
null //SparkRecordInfoProvider<ResultSet, SparkRecordInfo> (optional)
);
Example 2-4 Creating a Spatial RDD from a SQL Query to the Database
SpatialJavaRDD<SparkRecordInfo> jdbcSpatialRDD = JDBCUtils.createSpatialRDDFromQuery(
sparkContext, //spark context
()->{
Class.forName("oracle.jdbc.driver.OracleDriver");
return DriverManager.getConnection(connURL, usr, pwd);
}, //DB connection supplier lambda
"SELECT * FROM VEHICLES WHERE category > 5", //SQL query
null //SparkRecordInfoProvider<ResultSet, SparkRecordInfo> (optional)
);
In the preceding examples, data from the Oracle database is queried and partitioned to create a Spark RDD. The number and size of the partitions is determined automatically by the Spark Vector Analysis API.
You can also specify the desired number of database rows to be contained in a Spark partition by calling a method overload that takes this number as a parameter. Manually specifying the number of rows per partition can improve the performance of the Spatial RDD creation.
Parent topic: Oracle Big Data Spatial Vector Analysis for Spark
2.12 Oracle Big Data Spatial Vector Hive Analysis
Oracle Big Data Spatial Vector Hive Analysis provides spatial functions to analyze the data using Hive.
The spatial data can be in any Hive supported format. You can also use a spatial index created with the Java analysis API (see Spatial Indexing) for fast processing.
The supported features include:
- Using the Hive Spatial API
- Using Spatial Indexes in Hive
See also HiveRecordInfoProvider for details about the implementation of these features.
Hive and Spark Spatial SQL Functions provides reference information about the available functions.
Prerequisite Libraries
The following libraries are required by the Spatial Vector Hive Analysis API.
-
sdohadoop-vector-hive.jar
-
sdohadoop-vector.jar
-
sdoutil.jar
-
sdoapi.jar
-
ojdbc.jar
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.12.1 HiveRecordInfoProvider
A record in a Hive table may contain a geometry field in any format, such as JSON, WKT, or a user-specified format. Geometry constructors like ST_Geometry can create a geometry from the GeoJSON, WKT, or WKB representation of the geometry. If the geometry is stored in another format, a HiveRecordInfoProvider can be used.
HiveRecordInfoProvider
is a component that interprets the geometry field representation and returns the geometry in a GeoJSON format.
The returned geometry must contain the geometry SRID, as in the following example format:
{"type":<geometry-type", "crs": {"type": "name", "properties": {"name": "EPSG:4326"}}"coordinates":[c1,c2,....cn]}
The HiveRecordInfoProvider
interface has the following methods:
-
void setCurrentRecord(Object record)
-
String getGeometry()
The method setCurrentRecord()
is called by passing the current geometry field provided when creating a geometry in Hive. The HiveRecordInfoProvider
is then used to get the geometry, or to return null if the record has no spatial information.
The information returned by the HiveRecordInfoProvider
is used by the Hive Spatial functions to create geometries (see Hive and Spark Spatial SQL Functions).
Sample HiveRecordInfoProvider Implementation
SimpleHiveRecordInfoProvider
takes text records in JSON format. The following is a sample input record: {"longitude":-71.46, "latitude":42.35}
When SimpleHiveRecordInfoProvider
is instantiated, a JSON ObjectMapper
is created. The ObjectMapper
is used to parse record values later when setCurrentRecord()
is called. The geometry is represented as a latitude-longitude pair, and is used to create a point geometry using the
method. Then the GeoJSON format to be returned is created using GeoJsonGen.asGeometry()
, and the SRID is added to the GeoJSON using JsonUtils.addSRIDToGeoJSON().
public class SimpleHiveRecordInfoProvider implements HiveRecordInfoProvider{
private static final Log LOG =
LogFactory.getLog(SimpleHiveRecordInfoProvider.class.getName());
private JsonNode recordNode = null;
private ObjectMapper jsonMapper = null;
public SimpleHiveRecordInfoProvider(){
jsonMapper = new ObjectMapper();
}
@Override
public void setCurrentRecord(Object record) throws Exception {
try{
if(record != null){
//parse the current value
recordNode = jsonMapper.readTree(record.toString());
}
}catch(Exception ex){
recordNode = null;
LOG.warn("Problem reading JSON record
value:"+record.toString(), ex);
}
}
@Override
public String getGeometry() {
if(recordNode == null){
return null;
}
JGeometry geom = null;
try{
geom = JsonUtils.readGeometry(recordNode,
2, //dimensions
8307 //SRID
);
}catch(Exception ex){
LOG.warn("Problem reading JSON record geometry:"+recordNode.toString(), ex);
recordNode = null;
}
if(geom != null){
StringBuilder res = new StringBuilder();
//Get a GeoJSON representation of the JGeometry
GeoJsonGen.asGeometry(geom, res);
String result = res.toString();
//add SRID to GeoJSON and return the result
return JsonUtils.addSRIDToGeoJSON(result, 8307);
}
return null;
}
}
Parent topic: Oracle Big Data Spatial Vector Hive Analysis
2.12.2 Using the Hive Spatial API
The Hive Spatial API consists of Oracle-supplied Hive User Defined Functions that can be used to create geometries and perform operations using one or two geometries.
The functions can be grouped into logical categories: types, single-geometry, and two-geometries. (Hive and Spark Spatial SQL Functions lists the functions in each category and provides reference information about each function.)
Example 2-5 Hive Script
The following example script returns information about Twitter users in a data set who are within a specified geographical polygon and who have more than 50 followers. It does the following:
-
Adds the necessary jar files:
add jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar;
-
Creates the Hive user-defined functions that will be used:
create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point'; create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon'; create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains';
-
Creates a Hive table based on the files under the HDFS directory
/user/oracle/twitter
. TheInputFormat
used in this case isoracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat
and the Hive SerDe is a user-provided SerDeoracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe
.CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING) ROW FORMAT SERDE 'oracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe' STORED AS INPUTFORMAT 'oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/user/oracle/twitter';
-
Runs a spatial query receiving an ST_Polygon query area and the ST_Point tweets geometry, and using 0.5 as the tolerance value for the spatial operation. The output will be information about Twitter users in the query area who have more than 50 followers.
SELECT id, followers_count, friends_count, location FROM sample_tweets WHERE ST_Contains( ST_Polygon( '{"type": "Polygon", "coordinates": [[[-106, 25],[-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307 ), ST_Point(geometry, 8307), 0.5 ) and followers_count > 50;
The complete script is as follows:
add jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar;
create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point';
create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon';
create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains';
CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING)
ROW FORMAT SERDE 'oracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe'
STORED AS INPUTFORMAT 'oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/oracle/twitter';
SELECT id, followers_count, friends_count, location FROM sample_tweets
WHERE
ST_Contains(
ST_Polygon(
'{"type": "Polygon",
"coordinates":
[[[-106, 25],[-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}',
8307
),
ST_Point(geometry, 8307),
0.5
)
and followers_count > 50;
Parent topic: Oracle Big Data Spatial Vector Hive Analysis
2.12.3 Using Spatial Indexes in Hive
Hive spatial queries can use a previously created spatial index, which you can create using the Java API (see Spatial Indexing).
If you do not need to use the index in API functions that will access the original data, you can specify isMapFileIndex=false
when you call oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing
, or you can use the function setMapFileIndex(false)
. In these cases, the index will have the following structure:
HDFSIndexDirectory/part-xxxxx
And in these cases, when creating a Hive table, just provide the folder where you created the index.
If you need to access the original data and you do not set the parameter isMapFileIndex=false
, the index structure is as follows:
part-xxxxx
data
index
In such cases, to create a Hive table, the data files of the index are needed. Copy the data
files into a new HDFS folder, with each data file having a different name, such as data1, data2, and so on. The new folder will be used to create the Hive table.
The index contains the geometry records and extra fields. That data can be used when creating the Hive table.
(Note that Spatial Indexing Class Structure describes the index structure, and RecordInfoProvider provides an example of a RecordInfoProvider
adding extra fields.)
InputFormat oracle.spatial.hadoop.vector.mapred.input.SpatialIndexTextInputFormat
will be used to read the index. The output of this InputFormat
is GeoJSON.
Before running any query, you can specify a minimum bounding rectangle (MBR) that will perform a first data filtering using SpatialIndexTextInputFormat
.
Example 2-6 Hive Script Using a Spatial Index
The following example script returns information about Twitter users in a data set who are within a specified geographical polygon and who have more than 50 followers. It does the following:
-
Adds the necessary jar files:
add jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar;
-
Creates the Hive user-defined functions that will be used:
create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point'; create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon'; create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains';
-
Sets the data maximum and minimum boundaries (dim1Min,dim2Min,dim1Max,dim2Max):
set oracle.spatial.boundaries=-180,-90,180,90;
-
Sets the extra fields contained in the spatial index that will be included in the table creation:
set oracle.spatial.index.includedExtraFields=followers_count,friends_count,location;
-
Creates a Hive table based on the files under the HDFS directory /user/oracle/twitter. The
InputFormat
used in this case isoracle.spatial.hadoop.vector.mapred.input.SpatialIndexTextInputFormat
and the Hive SerDe is a user-provided SerDeoracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe
. (The code fororacle.spatial.hadoop.vector.hive.json.GeoJsonSerDe
is included with the Hive examples.) The geometry of the tweets will be saved in the geometry column with the format {"longitude":n, "latitude":n} :CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets_index (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING) ROW FORMAT SERDE 'oracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe' STORED AS INPUTFORMAT 'oracle.spatial.hadoop.vector.mapred.input.SpatialIndexTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/user/oracle/twitter/index';
-
Defines the minimum bounding rectangle (MBR) to filter in the
SpatialIndexTextInputFormat
. Any spatial query will only have access to the data in this MBR. If no MBR is specified, then the data boundaries will be used. This setting is recommended to improve the performance.set oracle.spatial.spatialQueryWindow={"type": "Polygon","coordinates": [[[-107, 24], [-107, 31], [-103, 31], [-103, 24], [-107, 24]]]};
-
Runs a spatial query receiving an ST_Polygon query area and the ST_Point tweets geometry, and using 0.5 as the tolerance value for the spatial operation. The tweet geometries are in GeoJSON format, and the ST_Point function is used, specifying the SRID as 8307. The output will be information about Twitter users in the query area who have more than 50 followers.
SELECT id, followers_count, friends_count, location FROM sample_tweets WHERE ST_Contains( ST_Polygon('{"type": "Polygon","coordinates": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307) , ST_Point(geometry, 8307) , 0.5) and followers_count > 50;
The complete script is as follows. (Differences between this script and the one in Using the Hive Spatial API are marked in bold; however, all of the steps are described in the preceding list.)
add jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar
/opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar;
create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon';
create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point';
create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains';
set oracle.spatial.boundaries=-180,-90,180,90;
set oracle.spatial.index.includedExtraFields=followers_count,friends_count,location;
CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets_index (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING)
ROW FORMAT SERDE 'oracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe'
STORED AS INPUTFORMAT 'oracle.spatial.hadoop.vector.mapred.input.SpatialIndexTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/oracle/twitter/index';
set oracle.spatial.spatialQueryWindow={"type": "Polygon","coordinates": [[[-107, 24], [-107, 31], [-103, 31], [-103, 24], [-107, 24]]]};
SELECT id, followers_count, friends_count, location FROM sample_tweets
WHERE ST_Contains(
ST_Polygon('{"type": "Polygon","coordinates": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307)
, ST_Point(geometry, 8307)
, 0.5)
and followers_count > 50;
Parent topic: Oracle Big Data Spatial Vector Hive Analysis
2.13 Using the Oracle Big Data SpatialViewer Web Application
You can use the Oracle Big Data SpatialViewer Web Application (SpatialViewer) to perform a variety of tasks.
These include tasks related to spatial indexing, creating and showing thematic maps, loading rasters into HDFS, visualizing uploaded rasters in the globe, selecting individual or multiple footprints, performing raster algebra operations, dealing with gaps and overlaps, combining selected footprints, generating a new image with the specified file format from the selected footprints, and applying user-specific processing.
- Creating a Hadoop Spatial Index Using SpatialViewer
- Exploring the Hadoop Indexed Spatial Data
- Creating a Spark Spatial Index Using SpatialViewer
- Performing Spatial Analysis
- Running a Categorization Job Using SpatialViewer
- Viewing the Categorization Results
- Saving Categorization Results to a File
- Creating and Deleting Templates
- Configuring Templates
- Running a Clustering Job Using SpatialViewer
- Viewing the Clustering Results
- Saving Clustering Results to a File
- Running a Binning Job Using SpatialViewer
- Viewing the Binning Results
- Saving Binning Results to a File
- Running a Job to Create an Index Using the Command Line
- Running a Job to Create a Categorization Result
- Running a Job to Create a Clustering Result
- Running a Job to Create a Binning Result
- Running a Job to Perform Spatial Filtering
- Running a Job to Get Location Suggestions
- Running a Job to Perform a Spatial Join
- Running a Job to Perform Partitioning
- Using Multiple Inputs
- Loading Images from the Local Server to the HDFS Hadoop Cluster
- Visualizing Rasters in the Globe
- Processing a Raster or Multiple Rasters with the Same MBR
- Creating a Mosaic Directly from the Globe
- Adding Operations for Raster Processing
- Creating a Slope Image from the Globe
- Changing the Image File Format from the Globe
Parent topic: Using Big Data Spatial and Graph with Spatial Data
2.13.1 Creating a Hadoop Spatial Index Using SpatialViewer
To create a Hadoop spatial index using SpatialViewer, follow these steps.
-
Open the console:
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
-
Click Spatial Index.
-
Specify all the required details:
-
Index name.
-
Path of the file or files to index in HDFS. For example,
/user/oracle/bdsg/tweets.json
. -
SRID of the geometries to be indexed. Example: 8307
-
Tolerance of the geometries to be indexed. Example: 0.05
-
Input Format class: The input format class. For example:
oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat
-
Record Info Provider class: The class that provides the spatial information. For example: oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider.
Note:
If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a jar with the user-defined classes must be provided. To use this jar, you must add it to the /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/WEB-INF/lib directory and restart the server (see the deployment sketch after these steps). -
Whether the enrichment service (MVSuggest) must be used or not. If the geometry has to be found from a location string, then use the MVSuggest service. In this case, the provided RecordInfoProvider must implement the interface oracle.spatial.hadoop.vector.LocalizableRecordInfoProvider. -
MVSuggest Templates (Optional): When using the MVSuggest service, you can define the templates used to create the index.
-
-
Click Create.
A URL will be displayed to track the job.
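For example, if the custom InputFormat or RecordInfoProvider classes mentioned in the note are packaged in a jar (the name my-record-info-provider.jar is hypothetical), the deployment could look like the following sketch; the exact restart command depends on how your SpatialViewer web server is managed:
cp my-record-info-provider.jar /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/WEB-INF/lib/
# then restart the SpatialViewer web server so the new classes are loaded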
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.2 Exploring the Hadoop Indexed Spatial Data
To explore Hadoop indexed spatial data, follow these steps.
-
Open the console:
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
-
Click Explore Data.
For example, you can:
-
Select the desired indexed data and use the rectangle tool to display the data in the desired area.
-
Change the background map style.
-
Show data using a heat map.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.3 Creating a Spark Spatial Index Using SpatialViewer
To create a Spark spatial index using SpatialViewer, follow these steps.
-
Open the console:
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vectorspark
-
Click Spatial Index.
-
Specify all the required details:
-
Index name.
-
Path of the file or files to index in HDFS. For example,
/user/oracle/bdsg/tweets.json
. -
SRID of the geometries to be indexed. Example: 8307
-
Input Format class (optional): The input format class. For example:
oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat
-
Key class (required if an input format class is defined): Class of the input format keys. For example:
org.apache.hadoop.io.LongWritable
-
Value class (required if an input format class is defined): Class of the input format values. For example:
org.apache.hadoop.io.Text
-
Record Info Provider class: The class that provides the spatial information. For example:
oracle.spatial.spark.vector.recordinfoprovider.GeoJsonRecordInfoProvider
Note:
If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a jar with the user-defined classes must be provided. To use this jar, you must add it to the /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/WEB-INF/lib directory and restart the server.
-
-
Click Create.
A URL will be displayed to track the job.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.4 Performing Spatial Analysis
To start spatial analysis:
-
Open the console:
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vectorspark
-
Click Spatial Analysis.
Then perform spatial analysis actions as desired. For example, you can:
-
Select the desired indexed data and use the rectangle tool to display the data in the desired area.
-
Change the background map style.
- Perform the following spatial analysis operations on indexed data or on previous analysis results:
- Buffer: Expand a shape by a specified distance.
- Convex hull: Envelop a shape disallowing inward curves.
- Distance: Compute the shortest distance between shapes.
- Envelope: Return the smallest rectangle fitting the shape.
- Simplify: Reduce vertices while maintaining the approximate shape, using the Douglas-Peucker algorithm.
- SimplifyVW: Reduce vertices while maintaining the approximate shape, using the Visvalingam-Whyatt algorithm.
- Any interaction: Filter shapes interacting in any way.
- Contains: Filter shapes containing a specified geometry.
- Inside: Filter shapes that are inside a specified geometry.
- Within distance: Filter shapes that are within a specified distance of other shapes.
- Nearest neighbors: Find the specified number of shapes closest to a shape.
Note:
You can also set non-spatial conditions on the indexed data to have them considered in the analysis.
When two geometries are specified, the second one can be indexed data, previous analysis results, or a specified geometry picked from a list of predefined geometries (for example a city, region or country), picked in a map, or defined using GeoJSON. (For predefined geometries, see Creating and Deleting Templates to add or delete templates for selecting.)
- Create spatial views on indexed data or on analysis results. For example, create views to:
- Render the records as an image. Possibilities include:
- Render an image using a numerical attribute and setting the minimum and maximum values.
- Render an image using a numerical attribute and setting limits with defined colors.
- Render an image using a textual attribute and setting possible values with defined colors.
- Render the records as a categorization result (by countries or regions for example). Possibilities include:
- Categorize by country using a numerical attribute and setting minimum and maximum values. The average or sum can be used as the categorization operation.
- Categorize using the count operation and setting limits and colors.
- Categorize using the count operation, default colors, and automatic limits detection.
- Categorize using a textual attribute and setting possible values and colors.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.5 Running a Categorization Job Using SpatialViewer
You can run a categorization job with or without the spatial index. Follow these steps.
-
Open
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
. -
Click Categorization, then Categorization Job.
-
Select either With Index or Without Index and provide the following details, as required:
-
With Index
-
Index name
-
-
Without Index
-
Path of the data: Provide the HDFS data path. For example,
/user/oracle/bdsg/tweets.json
. -
JAR with user classes (Optional): If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a jar with the user-defined classes must be provided. To use this jar, you must add it to the /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/WEB-INF/lib directory and restart the server. -
Input Format class: The input format class. For example:
oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat
-
Record Info Provider class: The class that will provide the spatial information. For example:
oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider
. -
Whether the enrichment service MVSuggest must be used or not. If the geometry must be found from a location string, then use the MVSuggest service. In this case, the provided RecordInfoProvider must implement the interface oracle.spatial.hadoop.vector.LocalizableRecordInfoProvider. -
Templates: The templates to create the thematic maps.
Note:
If a template refers to point geometries (for example, cities) and MVSuggest is not used, the result returned is empty for that template, because the spatial operations return results only for polygons.
Tip:
When using the MVSuggest service, the results are more accurate if all the templates that could match the results are provided. For example, if the data can refer to any city, state, country, or continent in the world, then the better choices of templates are World Continents, World Countries, World State Provinces, and World Cities. On the other hand, if the data is only from USA states and counties, then the suitable templates are USA States and USA Counties. If an index that was created using the MVSuggest service is selected, then select the top hierarchy for an optimal result. For example, if it was created using World Countries, World State Provinces, and World Cities, then use World Countries as the template. -
Result name: The result name. If a result exists for a template with the same name, it is overwritten. For example,
Tweets test
.
-
-
Click Create. A URL will be displayed to track the job.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.6 Viewing the Categorization Results
To view the categorization results, follow these steps.
- Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector.
- Click Categorization, then Results.
- Click any one of the Results displayed.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.7 Saving Categorization Results to a File
You can save categorization results to a file (for example, the result file created with a job executed from the command line) on the local system for possible future uploading and use. The templates are located in the folder /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/templates. The templates are GeoJSON files with features, and all the features have ids. For example, the first feature in the template USA States starts with: {"type":"Feature","_id":"WYOMING",...
The results must be JSON files with the following format: {"id":"JSONFeatureId","result":result}.
For example, if the template USA States is selected, then a valid result is a file containing:
{"id":"WYOMING","result":3232}
{"id":"SOUTH DAKOTA","result":74968}
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.8 Creating and Deleting Templates
To create new templates, do the following:
- Add the template JSON file in the folder /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/templates/.
- Add the template configuration file in the folder /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/templates/_config_.
To delete the template, delete the JSON and configuration files added in steps 1 and 2.
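For example, for a hypothetical template named my_regions, the two files would be copied as follows (the file names are illustrative; the target folders are those listed above):
cp my_regions.json /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/templates/
cp my_regions.config.json /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/templates/_config_/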
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.9 Configuring Templates
Each template has a configuration file. The template configuration files are located in the folder /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/templates/_config_. The name of the configuration file is the same as the template file, suffixed with config.json instead of .json. For example, the configuration file name of the template file usa_states.json is usa_states.config.json. The configuration parameters are:
-
name: Name of the template to be shown on the console. For example,
name: USA States
. -
display_attribute: When displaying a categorization result, moving the cursor over a feature displays this property and the result for the feature. For example,
display_attribute: STATE NAME
. -
point_geometry: True if the template contains point geometries, false if it contains polygons. For example,
point_geometry: false
. -
child_templates (optional): A template can have several possible child templates, specified as a comma-separated list. For example,
child_templates: ["world_states_provinces, usa_states(properties.COUNTRY CODE:properties.PARENT_REGION)"].
If a child template does not specify a linked field, then all the features inside the parent features are considered child features. In this case, world_states_provinces does not specify any fields. If the link between parent and child is specified, then the spatial relationship does not apply and the linked feature properties are checked instead. In the above example, the relationship with usa_states is found with the property COUNTRY CODE in the current template and the property PARENT_REGION in the template file usa_states.json. -
srid: The SRID of the template's geometries. For example,
srid: 8307
. -
back_polygon_template_file_name (optional): A template with polygon geometries to set as background when showing the defined template. For example,
back_polygon_template_file_name: usa_states
. -
vectorLayers: Configuration specific to the MVSuggest service. For example:
{ "vectorLayers": [ { "gnidColumns":["_GNID"], "boostValues":[2.0,1.0,1.0,2.0] } ] }
Where:
-
gnidColumns is the name of the column(s) within the JSON file that represents the Geoname ID. This value is used to support multiple languages with MVSuggest. (See references to this value in the file templates/_geonames_/alternateNames.json.) There is no default value for this property. -
boostValues is an array of float numbers that represents how important a column is within the "properties" values for a given row. The higher the number, the more important that field is. A value of zero means the field will be ignored. When boostValues is not present, all fields receive a default value of 1.0, meaning they all are equally important properties. The MVSuggest service may return different results depending on those values. For a JSON file with the following properties, the boost values might be as follows:
"properties":{"Name":"New York City","State":"NY","Country":"United States","Country Code":"US","Population":8491079,"Time Zone":"UTC-5"}
"boostValues":[3.0,2.0,1.0,1.0,0.0,0.0]
-
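Putting these parameters together, a configuration file such as usa_states.config.json might look like the following sketch. It assumes the configuration file is a single JSON object whose keys are the parameters listed above; the values shown simply reuse the examples from this list:
{
  "name": "USA States",
  "display_attribute": "STATE NAME",
  "point_geometry": false,
  "srid": 8307,
  "vectorLayers": [ { "gnidColumns": ["_GNID"], "boostValues": [2.0, 1.0, 1.0, 2.0] } ]
}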
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.10 Running a Clustering Job Using SpatialViewer
To run a clustering job using SpatialViewer, follow these steps.
-
Open:
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
-
Click Clustering, then Clustering Job.
-
Provide the following details, as required:
-
Path of the data: Provide the HDFS data path. For example,
/user/oracle/bdsg/tweets.json
. -
The SRID of the geometries. For example: 8307
-
The tolerance of the geometries. For example: 0.05
-
JAR with user classes (Optional): If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a jar with the user-defined classes must be provided. To use this jar, you must add it to the /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/WEB-INF/lib directory and restart the server. -
Input Format class: The input format class. For example:
oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat
-
Record Info Provider class: The class that will provide the spatial information. For example:
oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider
. -
Number of clusters: The number of clusters to be found.
-
Result name: The result name. If a result with the same name already exists, it is overwritten. For example, Tweets test.
-
-
Click Create.
A URL will be displayed to track the job.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.11 Viewing the Clustering Results
To view the clustering results, follow these steps.
- Open
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
. - Click Clustering, then Results.
- Click any one of the Results displayed.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.12 Saving Clustering Results to a File
You can save clustering results to a file on your local system, for later uploading and use. To save the clustering results to a file, follow these steps.
- Open
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
. - Click Clustering, then Results.
- Click the icon for saving the results.
- Specify a name.
- Specify the SRID of the geometries. For example: 8307
- Click Choose File and select the file location.
- Click Save.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.13 Running a Binning Job Using SpatialViewer
You can run a binning job with or without the spatial index. Follow these steps.
-
Open
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
. -
Click Binning, then Binning Job.
-
Select either With Index or Without Index and provide the following details, as required:
-
With Index
-
Index name
-
-
Without Index
-
Path of the data: Provide the HDFS data path. For example,
/user/oracle/bdsg/tweets.json
-
The SRID of the geometries. For example: 8307
-
The tolerance of the geometries. For example: 0.05
-
JAR with user classes (Optional): If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a jar with the user-defined classes must be provided. To use this jar, you must add it to the /opt/oracle/oracle-spatial-graph/spatial/web-server/spatialviewer/WEB-INF/lib directory and restart the server. -
Input Format class: The input format class. For example:
oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat
-
Record Info Provider class: The class that will provide the spatial information. For example:
oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider
.
-
-
-
Binning grid minimum bounding rectangle (MBR). You can click the icon to see the MBR on the map.
-
Binning shape: hexagon (specify the hexagon width) or rectangle (specify the width and height).
-
Thematic attribute: If the job uses an index, double-click to see the possible values, which are those returned by the function getExtraFields of the RecordInfoProvider used when creating the index. If the job does not use an index, then the field can be one of the fields returned by the function getExtraFields of the specified RecordInfoProvider class. In any case, the count attribute is always available and specifies the number of records in the bin.
Result name: The result name. If a result with the same name already exists, it is overwritten. For example, Tweets test.
Click Create. A URL will be displayed to track the job.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.14 Viewing the Binning Results
To view the binning results, follow these steps.
- Open
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
. - Click Binning, then Results.
- Click any of the Results displayed.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.15 Saving Binning Results to a File
You can save binning results to a file on your local system, for later uploading and use. To save the binning results to a file, follow these steps.
- Open
http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/?root=vector
. - Click Binning, then View Results.
- Click the icon for saving the results.
- Specify the SRID of the geometries. For example: 8307
- Specify the thematic attribute, which must be a property of the features in the result. For example, the count attribute can be used to create results depending on the number of results per bin.
- Click Choose File and select the file location.
- Click Save.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.16 Running a Job to Create an Index Using the Command Line
To create a spatial index, use a command in the following format:
hadoop jar <HADOOP_LIB_PATH>/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing [generic options] input=<path|comma_separated_paths|path_pattern> output=<path> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<minX,minY,maxX,maxY>] [indexName=<index_name>] [indexMetadataDir=<path>] [overwriteIndexMetadata=<true|false>] [ mvsLocation=<path|URL> [mvsMatchLayers=<comma_separated_layers>][mvsMatchCountry=<country_name>][mvsSpatialResponse=<[NONE, FEATURE_GEOMETRY, FEATURE_CENTROID]>][mvsInterfaceType=<LOCAL, WEB>][mvsIsRepository=<true|false>][rebuildMVSIndex=<true|false>][mvsPersistentLocation=<hdfs_path>][mvsOverwritePersistentLocation=<true|false>] ]
To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing
with oracle.spatial.hadoop.vector.mapreduce.job.SpatialIndexing
.
Input/output arguments:
-
input
: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. -
inputFormat
: theinputFormat
class implementation used to read the input data. -
recordInfoProvider
: therecordInfoProvider
implementation used to extract information from the records read by theInputFormat
class. -
output
: the path where the spatial index will be stored
Spatial arguments:
-
srid
(optional, default=0): the spatial reference system (coordinate system) ID of the spatial data. -
geodetic
(optional, default depends on the srid): boolean value that indicates whether the geometries are geodetic or not. -
tolerance
(optional, default=0.0): double value that represents the tolerance used when performing spatial operations. -
boundaries
(optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY
Spatial index metadata arguments:
-
indexName
(optional, default=output folder name):The name of the index to be generated. -
indexMetadataDir
(optional, default=hdfs://server:port/user/<current_user>/oracle_spatial/index_metadata/): the directory where the spatial index metadata will be stored. -
overwriteIndexMetadata
(optional, default=false) boolean argument that indicates whether the index metadata can be overwritten if an index with the same name already exists.
MVSuggest
arguments:
-
mvsLocation
: The path to the MVSuggest directory or repository for local standalone instances of MVSuggest or the service URL when working with a remote instance. This argument is required when working with MVSuggest. -
mvsMatchLayers
(optional, default=all): comma separated list of layers. When provided, MVSuggest will only use these layers to perform the search. -
mvsMatchCountry
(optional, default=none): a country name which MVSuggest will give higher priority when performing matches. -
mvsSpatialResponse
(optional, default=CENTROID): the type of the spatial results contained in each returned match. It can be one of the following values: NONE, FEATURE_GEOMETRY, FEATURE_CENTROID. -
mvsInterfaceType
(optional: default=LOCAL): the type of MVSuggest service used, it can be LOCAL or WEB. -
mvsIsRepository
(optional: default=false) (LOCAL only): boolean value which specifies whether mvsLocation points to a whole MVS directory(false) or only to a repository(true). An MVS repository contains only JSON templates; it may or not contain a _config_ and _geonames_ folder. -
mvsRebuildIndex
(optional, default=false)(LOCAL only):boolean value specifying whether the repository index should be rebuilt or not. -
mvsPersistentLocation
(optional, default=none)(LOCAL only): an HDFS path where the MVSuggest directory will be saved. -
mvsIsOverwritePersistentLocation
(optional, default=false): boolean argument that indicates whether an existing mvsPersistentLocation must be overwritten in case it already exists.
Example: Create a spatial index called indexExample
. The index metadata will be stored in the HDFS directory spatialMetadata
.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing input="/user/hdfs/demo_vector/tweets/part*" output=/user/hdfs/demo_vector/tweets/spatial_index inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider srid=8307 geodetic=true tolerance=0.5 indexName=indexExample indexMetadataDir=indexMetadataDir overwriteIndexMetadata=true
Example: Create a spatial index using MVSuggest
to assign a spatial location to records that do not contain geometries.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing input="/user/hdfs/demo_vector/tweets/part*" output=/user/hdfs/demo_vector/tweets/spatial_index inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=mypackage.SimpleLocationRecordInfoProvider srid=8307 geodetic=true tolerance=0.5 indexName=indexExample indexMetadataDir=indexMetadataDir overwriteIndexMetadata=true mvsLocation=file:///local_folder/mvs_dir/oraclemaps_pub/ mvsRepository=true
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.17 Running a Job to Create a Categorization Result
To create a categorization result, use a command in one of the following formats.
With a Spatial Index
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization [generic options] ( indexName=<indexName> [indexMetadataDir=<path>] ) | ( input=<path|comma_separated_paths|path_pattern> isInputIndex=true [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>] ) output=<path> hierarchyIndex=<hdfs_hierarchy_index_path> hierarchyInfo=<HierarchyInfo_subclass> [hierarchyDataPaths=<level1_path,level2_path,,levelN_path>] spatialOperation=<[None, IsInside, AnyInteract]>
Without a Spatial Index
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization [generic options] input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>] output=<path> hierarchyIndex=<hdfs_hierarchy_index_path> hierarchyInfo=<HierarchyInfo_subclass> hierarchyDataPaths=<level1_path,level2_path,,levelN_path>] spatialOperation=<[None, IsInside, AnyInteract]>
Using MVSuggest
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization [generic options] (indexName=<indexName> [indexMetadataDir=<path>]) | ( (input=<path|comma_separated_paths|path_pattern> isInputIndex=true) | (input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<LocalizableRecordInfoProvider_subclass>) [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>] ) output=<path> mvsLocation=<path|URL> [mvsMatchLayers=<comma_separated_layers>] [mvsMatchCountry=<country_name>] [mvsSpatialResponse=<[NONE, FEATURE_GEOMETRY, FEATURE_CENTROID]>] [mvsInterfaceType=<[UNDEFINED, LOCAL, WEB]>] [mvsIsRepository=<true|false>] [mvsRebuildIndex=<true|false>] [mvsPersistentLocation=<hdfs_path>] [mvsOverwritePersistentLocation=<true|false>] [mvsMaxRequestRecords=<integer_number>] hierarchyIndex=<hdfs_hierarchy_index_path> hierarchyInfo=<HierarchyInfo_subclass>
To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.Categorization
with oracle.spatial.hadoop.vector.mapreduce.job.Categorization
.
Input/output arguments:
-
indexName
: the name of an existing spatial index. The index information will be looked up at the path given by indexMetadataDir. When used, the argument input
is ignored. -
indexMetadataDir
(optional, default=hdfs://server:port/user/<current_user>/oracle_spatial/index_metadata/): the directory where the spatial index metadata is located -
input
: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. (Ignored ifindexName
is specified.) -
inputFormat
: theinputFormat
class implementation used to read the input data. (Ignored ifindexName
is specified.) -
recordInfoProvider
: therecordInfoProvider
implementation used to extract information from the records read by theInputFormat
class. (Ignored ifindexName
is specified.) -
output
: the path where the results will be stored
Spatial arguments:
-
srid
(optional, default=0): the spatial reference system (coordinate system) ID of the spatial data. -
geodetic
(optional, default depends on the srid): boolean value that indicates whether the geometries are geodetic or not. -
tolerance
(optional, default=0.0): double value that represents the tolerance used when performing spatial operations. -
boundaries
(optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY -
spatialOperation
: the spatial operation to perform between the input data set and the hierarchical data set. Allowed values areIsInside
andAnyInteract
.
Hierarchical data set arguments:
-
hierarchyIndex
: the HDFS path of an existing hierarchical index or where it can be stored if it needs to be generated. -
hierarchyInfo
: the fully qualified name of aHierarchyInfo
subclass which is used to describe the hierarchical data. -
hierarchyDataPaths
(optional, default=none): a comma separated list of paths of the hierarchy data. The paths should be sorted in ascending order by hierarchy level. If a hierarchy index path does not exist for the given hierarchy data, this argument is required.
MVSuggest
arguments:
-
mvsLocation
: The path to the MVSuggest directory or repository for local standalone instances of MVSuggest or the service URL when working with a remote instance. This argument is required when working with MVSuggest. -
mvsMatchLayers
(optional, default=all): comma separated list of layers. When provided, MVSuggest will only use these layers to perform the search. -
mvsMatchCountry
(optional, default=none): a country name which MVSuggest will give higher priority when performing matches. -
mvsSpatialResponse
(optional, default=CENTROID): the type of the spatial results contained in each returned match. It can be one of the following values: NONE, FEATURE_GEOMETRY, FEATURE_CENTROID. -
mvsInterfaceType
(optional: default=LOCAL): the type of MVSuggest service used, it can be LOCAL or WEB. -
mvsIsRepository
(optional: default=false) (LOCAL only): Boolean value that specifies whethermvsLocation
points to a whole MVS directory(false) or only to a repository(true). An MVS repository contains only JSON templates; it may or not contain a_config_
and_geonames_
folder. -
mvsRebuildIndex
(optional, default=false)(LOCAL only):boolean value specifying whether the repository index should be rebuilt or not. -
mvsPersistentLocation
(optional, default=none)(LOCAL only): an HDFS path where the MVSuggest directory will be saved. -
mvsIsOverwritePersistentLocation
(optional, default=false): boolean argument that indicates whether an existingmvsPersistentLocation
must be overwritten in case it already exists.
Example: Run a Categorization job to create a summary containing the records counts by continent, country, and state/provinces. The input is an existing spatial index called indexExample
. The hierarchical data will be indexed and stored in HDFS at the path hierarchyIndex
.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization indexName=indexExample output=/user/hdfs/demo_vector/tweets/hier_count_spatial hierarchyInfo=vectoranalysis.categorization.WorldAdminHierarchyInfo hierarchyIndex=hierarchyIndex hierarchyDataPaths=file:///templates/world_continents.json,file:///templates/world_countries.json,file:///templates/world_states_provinces.json spatialOperation=IsInside
Example: Run a Categorization job to create a summary of tweet counts per continent, country, states/provinces, and cities using MVSuggest
.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization input="/user/hdfs/demo_vector/tweets/part*" inputFormat=<InputFormat_subclass> recordInfoProvider=<LocalizableRecordInfoProvider_subclass> output=/user/hdfs/demo_vector/tweets/hier_count_mvs hierarchyInfo=vectoranalysis.categorization.WorldAdminHierarchyInfo hierarchyIndex=hierarchyIndex mvsLocation=file:///mvs_dir mvsMatchLayers=world_continents,world_countries,world_states_provinces spatialOperation=IsInside
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.18 Running a Job to Create a Clustering Result
To create a clustering result, use a command in the following format:
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.KMeansClustering [generic options] input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> output=<path> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>] k=<number_of_clusters> [clustersPoints=<comma_separated_points_ordinates>] [deleteClusterFiles=<true|false>] [maxIterations=<integer_value>] [critFunClass=<CriterionFunction_subclass>] [shapeGenClass=<ClusterShapeGenerator_subclass>] [maxMemberDistance=<double_value>]
To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.KMeansClustering
with oracle.spatial.hadoop.vector.mapreduce.job.KMeansClustering
.
Input/output arguments:
-
input
: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. -
inputFormat
: theinputFormat
class implementation used to read the input data. -
recordInfoProvider
: therecordInfoProvider
implementation used to extract information from the records read by theInputFormat
class. -
output
: the path where the results will be stored
Spatial arguments:
-
srid
(optional, default=0): the spatial reference system (coordinate system) ID of the spatial data. -
geodetic
(optional, default depends on the srid): Boolean value that indicates whether the geometries are geodetic or not. -
tolerance
(optional, default=0.0): double value that represents the tolerance used when performing spatial operations. -
boundaries
(optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY -
spatialOperation
: the spatial operation to perform between the input data set and the hierarchical data set. Allowed values areIsInside
andAnyInteract
.
Clustering arguments:
-
k
: the number of clusters to be found. -
clusterPoints
(optional, default=none): the initial cluster centers as a comma-separated list of point ordinates in the form: p1_x,p1_y,p2_x,p2_y,…,pk_x,pk_y -
deleteClusterFiles
(optional, default=true): Boolean argument that specifies whether the intermediate cluster files generated between iterations should be deleted or not -
maxIterations
(optional, default=calculated based on the number k): the maximum number of iterations allowed before the job completes. -
critFunClass
(optional, default=oracle.spatial.hadoop.vector.cluster.kmeans. SquaredErrorCriterionFunction) a fully qualified name of aCriterionFunction
subclass. -
shapeGenClass
(optional, default= oracle.spatial.hadoop.vector.cluster.kmeans. ConvexHullClusterShapeGenerator) a fully qualified name of aClusterShapeGenerator
subclass used to generate the geometry of the clusters. -
maxMemberDistance
(optional, default=undefined): a double value that specifies the maximum distance between a cluster center and a cluster member.
Example: Run a Clustering job to generate 5 clusters. The generated cluster geometries will be the convex hull of all the cluster members.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.KMeansClustering input="/user/hdfs/demo_vector/tweets/part*" output=/user/hdfs/demo_vector/tweets/result inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider srid=8307 geodetic=true tolerance=0.5 k=5 shapeGenClass=oracle.spatial.hadoop.vector.cluster.kmeans.ConvexHullClusterShapeGenerator
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.19 Running a Job to Create a Binning Result
To create a binning result, use a command in the following format:
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Binning [generic options] (indexName=<INDEX_NAME> [indexMetadataDir=<INDEX_METADATA_DIRECTORY>]) | (input=<DATA_PATH> inputFormat=<INPUT_FORMAT_CLASS> recordInfoProvider=<RECORD_INFO_PROVIDER_CLASS> [srid=<SRID>] [geodetic=<GEODETIC>] [tolerance=<TOLERANCE>]) output=<RESULT_PATH> cellSize=<CELL_SIZE> gridMbr=<GRID_MBR> [cellShape=<CELL_SHAPE>] [aggrFields=<EXTRA_FIELDS>]
To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.Binning
with oracle.spatial.hadoop.vector.mapreduce.job.Binning
.
Input/output arguments:
-
indexName
: the name of an existing spatial index. The index information will be looked up at the path given by indexMetadataDir. When used, the argument input
is ignored. -
indexMetadataDir
(optional, default=hdfs://server:port/user/<current_user>/oracle_spatial/index_metadata/): the directory where the spatial index metadata is located -
input
: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. -
inputFormat
: theinputFormat
class implementation used to read the input data. -
recordInfoProvider
: therecordInfoProvider
implementation used to extract information from the records read by theInputFormat
class. -
output
: the path where the results will be stored
Spatial arguments:
-
srid
(optional, default=0): the spatial reference system (coordinate system) ID of the spatial data. -
geodetic
(optional, default depends on the srid): Boolean value that indicates whether the geometries are geodetic or not. -
tolerance
(optional, default=0.0): double value that represents the tolerance used when performing spatial operations.
Binning arguments:
-
cellSize
: the size of the cells in the format: width,height -
gridMbr
: the minimum and maximum dimension values for the grid in the form: minX,minY,maxX,maxY -
cellShape
(optional, default=RECTANGLE): the shape of the cells. It can be RECTANGLE or HEXAGON -
aggrFields
(optional, default=none): a comma-separated list of field names that will be aggregated.
Example: Run a spatial binning job to generate a grid of hexagonal cells and aggregate the value of the field SALES.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Binning indexName=indexExample indexMetadataDir=indexMetadataDir output=/user/hdfs/demo_vector/result cellShape=HEXAGON cellSize=5 gridMbr=-175,-85,175,85 aggrFields=SALES
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.20 Running a Job to Perform Spatial Filtering
To perform spatial filtering, use a command in the following format:
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialFilter [generic options] ( indexName=<indexName> [indexMetadataDir=<path>] ) | ( (input=<path|comma_separated_paths|path_pattern> isInputIndex=true) | (input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass>) [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>] ) output=<path> spatialOperation=<[IsInside, AnyInteract]> queryWindow=<json-geometry>
To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.SpatialFilter
with oracle.spatial.hadoop.vector.mapreduce.job.SpatialFilter
.
Input/output arguments:
-
indexName
: the name of an existing spatial index. The index information will be looked up at the path given by indexMetadataDir. When used, the argument input
is ignored. -
indexMetadataDir
(optional, default=hdfs://server:port/user/<current_user>/oracle_spatial/index_metadata/): the directory where the spatial index metadata is located -
input
: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. -
inputFormat
: theinputFormat
class implementation used to read the input data. -
recordInfoProvider
: therecordInfoProvider
implementation used to extract information from the records read by theInputFormat
class. -
output
: the path where the results will be stored
Spatial arguments:
-
srid
(optional, default=0): the spatial reference system (coordinate system) ID of the spatial data. -
geodetic
(optional, default depends on the srid): Boolean value that indicates whether the geometries are geodetic or not. -
tolerance
(optional, default=0.0): double value that represents the tolerance used when performing spatial operations.
Filtering arguments:
-
boundaries
(optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY -
spatialOperation
: the operation to be applied between the queryWindow and the geometries from the input data set -
queryWindow
: the geometry used to filter the input dataset.
Example: Perform a spatial filtering operation.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialFilter indexName=indexExample indexMetadataDir=indexMetadataDir output=/user/hdfs/demo_vector/result spatialOperation=IsInside queryWindow='{"type":"Polygon", "coordinates":[[-106, 25, -106, 30, -104, 30, -104, 25, -106, 25]]}'
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.21 Running a Job to Get Location Suggestions
To create a job to get location suggestions, use a command in the following format.
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SuggestService [generic options] input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> output=<path> mvsLocation=<path|URL> [mvsMatchLayers=<comma_separated_layers>] [mvsMatchCountry=<country_name>] [mvsSpatialResponse=<[NONE, FEATURE_GEOMETRY, FEATURE_CENTROID]>] [mvsInterfaceType=<[UNDEFINED, LOCAL, WEB]>] [mvsIsRepository=<true|false>] [mvsRebuildIndex=<true|false>] [mvsPersistentLocation=<hdfs_path>] [mvsOverwritePersistentLocation=<true|false>] [mvsMaxRequestRecords=<integer_number>]
To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.SuggestService
with oracle.spatial.hadoop.vector.mapreduce.job.SuggestService
.
Input/output arguments:
-
input
: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. (Ignored ifindexName
is specified.) -
inputFormat
: theinputFormat
class implementation used to read the input data. (Ignored ifindexName
is specified.) -
recordInfoProvider
: therecordInfoProvider
implementation used to extract information from the records read by theInputFormat
class. (Ignored ifindexName
is specified.) -
output
: the path where the results will be stored
MVSuggest
arguments:
-
mvsLocation
: The path to the MVSuggest directory or repository for local standalone instances of MVSuggest or the service URL when working with a remote instance. This argument is required when working with MVSuggest. -
mvsMatchLayers
(optional, default=all): comma separated list of layers. When provided, MVSuggest will only use these layers to perform the search. -
mvsMatchCountry
(optional, default=none): a country name which MVSuggest will give higher priority when performing matches. -
mvsSpatialResponse
(optional, default=CENTROID): the type of the spatial results contained in each returned match. It can be one of the following values: NONE, FEATURE_GEOMETRY, FEATURE_CENTROID. -
mvsInterfaceType
(optional: default=LOCAL): the type of MVSuggest service used, it can be LOCAL or WEB. -
mvsIsRepository
(optional: default=false) (LOCAL only): Boolean value that specifies whethermvsLocation
points to a whole MVS directory(false) or only to a repository(true). An MVS repository contains only JSON templates; it may or not contain a_config_
and_geonames_
folder. -
mvsRebuildIndex
(optional, default=false)(LOCAL only):boolean value specifying whether the repository index should be rebuilt or not. -
mvsPersistentLocation
(optional, default=none)(LOCAL only): an HDFS path where the MVSuggest directory will be saved. -
mvsIsOverwritePersistentLocation
(optional, default=false): boolean argument that indicates whether an existingmvsPersistentLocation
must be overwritten in case it already exists.
Example: Get suggestions based on location texts from the input data set.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SuggestService input="/user/hdfs/demo_vector/tweets/part*" inputFormat=<InputFormat_subclass> recordInfoProvider=<LocalizableRecordInfoProvider_subclass> output=/user/hdfs/demo_vector/tweets/suggest_res mvsLocation=file:///mvs_dir mvsMatchLayers=world_continents,world_countries,world_states_provinces
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.22 Running a Job to Perform a Spatial Join
To perform a spatial join operation on two data sets, use a command in the following format.
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialJoin [generic options] inputList={ { ( indexName=<dataset1_spatial_index_name> indexMetadataDir=<dataset1_spatial_index_metadata_dir_path> ) | ( input=<dataset1_path|comma_separated_paths|path_pattern> inputFormat=<dataset1_InputFormat_subclass> recordInfoProvider=<dataset1_RecordInfoProvider_subclass> ) [boundaries=<min_x,min_y,max_x,max_y>] } { (indexName=<dataset2_spatial_index_name> indexMetadataDir=<dataset2_spatial_index_metadata_dir_path> ) | ( input=<dataset2_path|comma_separated_paths|path_pattern> inputFormat=<dataset2_InputFormat_subclass> recordInfoProvider=<dataset2_RecordInfoProvider_subclass> ) [boundaries=<min_x,min_y,max_x,max_y>] } } output=<path> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] boundaries=<min_x,min_y,max_x,max_y> spatialOperation=<AnyInteract|IsInside|WithinDistance> [distance=<double_value>] [samplingRatio=<decimal_value_between_0_and_1> | partitioningResult=<path>]
To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.SpatialJoin
with oracle.spatial.hadoop.vector.mapreduce.job.SpatialJoin
.
InputList
: A list of two input data sets. The list is enclosed by curly braces ({}). Each list element is an input data set, which is enclosed by curly braces. An input data set can contain the following information, depending on whether the data set is specified as a spatial index.
If specified as a spatial index:
-
indexName
: the name of an existing spatial index. -
indexMetadataDir
: the directory where the spatial index metadata is located
If not specified as a spatial index:
-
input
: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. (Ignored ifindexName
is specified.) -
inputFormat
: theinputFormat
class implementation used to read the input data. (Ignored ifindexName
is specified.) -
recordInfoProvider
: therecordInfoProvider
implementation used to extract information from the records read by theInputFormat
class. (Ignored ifindexName
is specified.)
output
: the path where the results will be stored
Spatial arguments:
-
srid
(optional, default=0): the spatial reference system (coordinate system) ID of the spatial data. -
geodetic
(optional, default depends on the srid): boolean value that indicates whether the geometries are geodetic or not. -
tolerance
(optional, default=0.0): double value that represents the tolerance used when performing spatial operations. -
boundaries
(optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY -
spatialOperation
: the spatial operation to perform between the input data set and the hierarchical data set. Allowed values areIsInside
andAnyInteract
. -
distance
: distance used forWithinDistance
operations.
Partitioning arguments:
-
samplingRatio
(optional, default=0.1): ratio used to sample the data sets when partitioning needs to be performed -
partitioningResult
(optional, default=none): Path to a previously generated partitioning result file
Example: Perform a spatial join on two data sets.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialJoin inputList="{{input=/user/hdfs/demo_vector/world_countries.json inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider} {input=/user/hdfs/demo_vector/tweets/part* inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider}}" output=/user/hdfs/demo_vector/spatial_join srid=8307 spatialOperation=AnyInteract boundaries=-180,-90,180,90
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.23 Running a Job to Perform Partitioning
To perform a spatial partitioning, use a command in the following format.
hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Partitioning [generic options] inputList={ { ( indexName=<dataset1_spatial_index_name> indexMetadataDir=<dataset1_spatial_index_metadata_dir_path> ) | ( input=<dataset1_path|comma_separated_paths|path_pattern> inputFormat=<dataset1_InputFormat_subclass> recordInfoProvider=<dataset1_RecordInfoProvider_subclass> ) [boundaries=<min_x,min_y,max_x,max_y>] } [ { (indexName=<dataset2_spatial_index_name> indexMetadataDir=<dataset2_spatial_index_metadata_dir_path> ) | ( input=<dataset2_path|comma_separated_paths|path_pattern> inputFormat=<dataset2_InputFormat_subclass> recordInfoProvider=<dataset2_RecordInfoProvider_subclass> ) [boundaries=<min_x,min_y,max_x,max_y>] } …… { (indexName=<datasetN_spatial_index_name> indexMetadataDir=<datasetN_spatial_index_metadata_dir_path> ) | ( input=<datasetN_path|comma_separated_paths|path_pattern> inputFormat=<datasetN_InputFormat_subclass> recordInfoProvider=<datasetN_RecordInfoProvider_subclass> ) [boundaries=<min_x,min_y,max_x,max_y>] } } ] output=<path> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] boundaries=<min_x,min_y,max_x,max_y> [samplingRatio=<decimal_value_between_0_and_1>]
To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.Partitioning
with oracle.spatial.hadoop.vector.mapreduce.job.Partitioning
.
InputList
: A list of one or more input data sets. The list is enclosed by curly braces ({}). Each list element is an input data set, which is enclosed by curly braces. An input data set can contain the following information, depending on whether the data set is specified as a spatial index.
If specified as a spatial index:
-
indexName
: the name of an existing spatial index. -
indexMetadataDir
: the directory where the spatial index metadata is located
If not specified as a spatial index:
-
input
: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. (Ignored ifindexName
is specified.) -
inputFormat
: theinputFormat
class implementation used to read the input data. (Ignored ifindexName
is specified.) -
recordInfoProvider
: therecordInfoProvider
implementation used to extract information from the records read by theInputFormat
class. (Ignored ifindexName
is specified.)
output
: the path where the results will be stored
Spatial arguments:
-
srid
(optional, default=0): the spatial reference system (coordinate system) ID of the spatial data. -
geodetic
(optional, default depends on the srid): boolean value that indicates whether the geometries are geodetic or not. -
tolerance
(optional, default=0.0): double value that represents the tolerance used when performing spatial operations. -
boundaries
(optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY
Partitioning arguments:
-
samplingRatio
(optional, default=0.1): ratio used to sample the data sets when partitioning needs to be performed
Example: Partition two data sets.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Partitioning inputList="{{input=/user/hdfs/demo_vector/world_countries.json inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider} {input=/user/hdfs/demo_vector/tweets/part* inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider}}" output=/user/hdfs/demo_vector/partitioning srid=8307 boundaries=-180,-90,180,90
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.24 Using Multiple Inputs
Multiple input data sets can be specified to a Vector job through the command line interface using the inputList
parameter. The inputList
parameter value is a group of input data sets. The inputList
parameter format is as follows:
inputList={ {input_data_set_1_params} {input_data_set_2_params} … {input_data_set_N_params} }
Each individual input data set can be one of the following input data sets:
-
Non-file input data set:
inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]
-
File input data set:
input=<path|comma_separated_paths|path_pattern> inputFormat=<FileInputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]
-
Spatial index input data set:
( ( indexName=<<indexName>> [indexMetadataDir=<<path>>]) | ( isInputIndex=<true> input=<path|comma_separated_paths|path_pattern> ) ) [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]
-
NoSQL input data set:
kvStore=<kv store name> kvStoreHosts=<comma separated list of hosts> kvParentKey=<parent key> [kvConsistency=<Absolute|NoneRequired|NoneRequiredNoMaster>] [kvBatchSize=<integer value>] [kvDepth=<CHILDREN_ONLY|DESCENDANTS_ONLY|PARENT_AND_CHILDREN|PARENT_AND_DESCENDANTS>] [kvFormatterClass=<fully qualified class name>] [kvSecurity=<properties file path>] [kvTimeOut=<long value>] [kvDefaultEntryProcessor=<fully qualified class name>] [kvEntryGrouper=<fully qualified class name>] [ kvResultEntries={ { minor key 1: a minor key name relative to the major key [fully qualified class name: a subclass of NoSQLEntryProcessor class used to process the entry with the given key] } * } ] [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]
Notes:
- A Categorization job does not support multiple input data sets.
- A SpatialJoin job only supports two input data sets.
- A SpatialIndexing job does not accept input data sets of type spatial index.
- NoSQL input data sets can only be used when kvstore.jar is present in the classpath.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.25 Loading Images from the Local Server to the HDFS Hadoop Cluster
Note:
If you cannot find the raster files, copy them to the shared directory created during the installation: check the Admin tab for the directory location, and then copy the raster files into that directory.
If you receive an error, check the Raster Configuration details. If the GDAL native library is not set up correctly, much of the raster functionality of the web application will not work.
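A minimal sketch of the copy step described in this note, assuming the Admin tab reports /opt/shareddir as the shared directory (your location will differ):
# copy local raster files into the shared directory shown in the Admin tab
cp /local/rasters/*.tif /opt/shareddir/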
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.26 Visualizing Rasters in the Globe
Before you can visualize the rasters in the globe, you must upload the raster files to HDFS, as explained in Loading Images from the Local Server to the HDFS Hadoop Cluster.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.27 Processing a Raster or Multiple Rasters with the Same MBR
Before processing rasters with the same MBR (minimum bounding rectangle), you must upload the raster files to HDFS, as explained in Loading Images from the Local Server to the HDFS Hadoop Cluster, and visualize the rasters, as explained in Visualizing Rasters in the Globe.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.28 Creating a Mosaic Directly from the Globe
Before you can create the mosaic image, you must upload the raster files to HDFS, as explained in Loading Images from the Local Server to the HDFS Hadoop Cluster.
Note:
Spark raster processing does not yet support all the options provided for Hadoop raster processing. For Spark raster processing, you must specify additional configuration parameters in the Spark Configuration section of the Admin tab (see the example after this list):
- spark.driver.extraClassPath, spark.executor.extraClassPath: Specify your Hive library installation using these keys. Example: /usr/lib/hive/lib/*
- spark.kryoserializer.buffer.max: Enter a value large enough to support Kryo serialization. Example: 160m
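For reference, expressed as standard Spark properties these entries might look like the following; the Hive library path and buffer size are only the example values given above, so adjust them to your installation:
spark.driver.extraClassPath=/usr/lib/hive/lib/*
spark.executor.extraClassPath=/usr/lib/hive/lib/*
spark.kryoserializer.buffer.max=160m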
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.29 Adding Operations for Raster Processing
Before you add algebra operations for raster processing or image mosaic creation, follow the instructions in Processing a Raster or Multiple Rasters with the Same MBR until the raster processing dialog is displayed. Then, before clicking Create Mosaic, perform these steps:
Note:
For some raster processing operations that use Spark, you must supply memory settings for the Spark driver and executors; the appropriate values depend on the size and characteristics of the rasters being processed (a sketch of such settings follows this list). For Spark raster processing, you must specify additional configuration parameters in the Spark Configuration section of the Admin tab:
- spark.driver.extraClassPath, spark.executor.extraClassPath: Specify your Hive library installation using these keys. Example: /usr/lib/hive/lib/*
- spark.kryoserializer.buffer.max: Enter a value large enough to support Kryo serialization. Example: 160m
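As a hedged sketch of the memory settings mentioned in this note, the standard Spark properties spark.driver.memory and spark.executor.memory can carry those values; the sizes below are illustrative only and must be tuned to the rasters being processed:
spark.driver.memory=4g
spark.executor.memory=8g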
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.30 Creating a Slope Image from the Globe
Before you can create the slope image, you must upload the raster files to HDFS, as explained in Loading Images from the Local Server to the HDFS Hadoop Cluster.
Note:
Spark raster processing does not yet support custom processing classes.
Parent topic: Using the Oracle Big Data SpatialViewer Web Application
2.13.31 Changing the Image File Format from the Globe
Before you can change the image file format, follow the instructions in Processing a Raster or Multiple Rasters with the Same MBR until the raster processing dialog is displayed. Then, before clicking Create Mosaic, perform these steps:
Parent topic: Using the Oracle Big Data SpatialViewer Web Application