Data Masking

4 Data Masking

This chapter provides conceptual information about the features that comprise Oracle Data Masking, and procedural information about performing the task sequence including Masking Formats and Masking Definitions. Data masking presupposes that you have created an Application Data Model (ADM) with sensitive columns defined.

Introduction

Enterprises run the risk of breaching sensitive information when copying production data into non-production environments for the purposes of application development, testing, data sharing or analysis. Oracle Data Masking helps reduce this risk by irreversibly replacing the original sensitive data with fictitious data so that production data can be shared safely with non-production users. Accessible through Oracle Enterprise Manager, Data Masking provides end-to-end secure automation for provisioning test databases from production in compliance with regulations.

Data masking (also known as data scrambling and data anonymization) is the process of replacing sensitive information copied from production databases to non-production databases with realistic, but scrubbed, data based on masking rules. Data masking is ideal for virtually any situation when confidential or regulated data needs to be shared with non-production users. These users may include internal users such as application developers, or external business partners such as offshore testing companies, suppliers and customers. These non-production users need to access some of the original data, but do not need to see every column of every table, especially when the information is protected by government regulations.

Data masking enables organizations to generate realistic and fully functional data with similar characteristics as the original data to replace sensitive or confidential information. This contrasts with encryption or Virtual Private Database, which simply hides data, and the original data can be retrieved with the appropriate access or key. With data masking, the original sensitive data cannot be retrieved or accessed.

Names, addresses, phone numbers, and credit card details are some of the examples of data that require protection of the information content from inappropriate visibility. Live production database environments contain valuable and confidential data access to this information is tightly controlled. However, each production system usually has replicated development copies, and the controls on such test environments are less stringent. This greatly increases the risks that the data might be used inappropriately. Data masking can modify sensitive database records so that they remain usable, but do not contain confidential or personally identifiable information. Yet, the masked test data resembles the original in appearance to ensure the integrity of the application.

Note:

The Oracle Real Application Testing option of Oracle Database help you to securely assure the integrity of database changes and to manage test data. For more information, see: Oracle Database Testing Guide

Related Oracle Security Offerings

Table 4-1 Related Offerings

Oracle Solutions	Description
Oracle Data Masking and Subsetting	Software for creating masked and subsetted copies of production data for use in non-production environments such as testing and development databases.
Oracle Data Safe (Data Discovery/ Data Masking)	Discover and mask sensitive data with a cloud service that supports Oracle Databases everywhere: in the Oracle Cloud, on-premises, and third-party clouds.
Oracle Label Security	Implements Multi-Level Security (MLS), enabling rows with differing sensitivity to reside in the same table. Explicitly labels rows with group, compartment, and sensitivity levels then matches them with user labels.
Oracle Data Redaction	Redacts sensitive data from query results before display through client applications. Enforces redaction at runtime, with low overhead, and according to conditions set in policies.

Data Masking Components

Data Masking consists of two main components:

Masking Formats

A masking format represents the definition on how to mask some given data. For example, an IP Address masking format could be a 3 random digits from 0 to 255 followed by a dot (.) fixed string repeated 4 times (without a trailing dot). A masking format can either be one that you create, or one from the list of Oracle-supplied default masking formats.

As a matter of best practice, organizations should create masking formats for all commonly regulated information so that the formats can be applied to the sensitive data regardless of which database the sensitive data resides in. This ensures that all sensitive data is consistently masked across the entire organization.
Masking Definitions

A masking definition defines a data masking operation to be implemented on one or more tables in a database. Masking definitions associate table columns with formats to use for masking the data. They also maintain the relationship between columns that are not formally declared in the database using related columns.

You can create a new masking definition or use an existing definition for a masking operation. To create a masking definition, you specify the column of the table and how data should be masked. If the columns being masked are involved in unique, primary key, or foreign key constraints, data masking generates the values so that the constraints are not violated. Masking ensures uniqueness per character using decimal arithmetic. For example, a numeric string of length 5 generates a maximum of only 99999 unique values. Similarly, a numeric string of length 1 generates a maximum of only 9 unique values.

You can export masking definitions to files and import them on other systems. This is important when the test and production sites reside on different Oracle Management Systems or on entirely different sites.

Recommended Data Masking Workflow

The following figure shows that the production database is cloned to a staging region and then masked there. During the masking process, the staging and test areas are tightly controlled like a production site.

Figure 4-1 Recommended Data Masking Workflow

Description of "Figure 4-1 Recommended Data Masking Workflow"

The workflow diagram illustrates the process of data masking and subsetting within Oracle environments, with the following steps:

Sensitive Data Discovery: Identifies sensitive data that requires protection.
Masking Format: Involves creating masking formats, either through predefined Oracle-supplied templates or custom masking rules.
Masking Definition: Consolidates sensitive columns and their associated masking formats, which define how to mask the sensitive data in these columns.
Clone Production to Staging: A copy of production data is cloned to a staging environment where masking operations are performed.
Execute Masking Script: Applies the defined masking rules to the cloned data, ensuring sensitive data is protected.
Subset Definition: A definition to subset data, limiting data exposure.
Add Masking Definition: Combines the masking rules with the subsetting definition.
Execute Masking/Subsetting Script: Executes both masking and subsetting on the data set.
Sensitive Data Sharing: The masked and subsetted data can then be safely shared without revealing sensitive information.

Data masking is an iterative and evolving process handled by the security administrator and implemented by the database administrator. When you first configure data masking, try out the masking definition on a test system, then add a greater number of columns to the masking definition and test it to make sure it functions correctly and does not break any application constraints. During this process, you should exercise care when removing all embedded references to the real data while maintaining referential integrity.

After data masking is configured to your satisfaction, you can use the existing definition to repeatedly mask after cloning. The masking definition, however, would need to evolve as new schema changes require new data and columns to be masked.

After the masking process is complete, you can distribute the database for wide availability. If you need to ship the database to another third-party site, you are required to use the Data Pump Export utility, and then ship the dump file to the remote site.

You can also perform inline, or at the source, data masking while creating a subset definition.

Masking Formats

A masking definition requires one or more masking formats for any columns included in the masking definition. When adding columns to a masking definition, you can either create masking formats manually or import them from the format library. It is often more efficient to work with masking formats from the format library.

A masking format is made up of one or more format entries, where a format entry is a predefined type of supported masking method such as Fixed String, Random String, Shuffle etc. In essence, it describes how to transform an input value to an output value. The masking format is a combination of one or more of these transformations which chain together to take some input value and output a final masked value.

This section covers the following topics:

Creating a New Masking Format

This section describes how to create new masking formats using Enterprise Manager. To create a masking format:

From the Targets menu, select Databases, then click the Security menu and navigate to Data Masking and Subsetting then Data Masking. You can also access the menu through the database home page by selecting Security then Data Masking and Subsetting then Data Masking.
Go to Masking Formats under Data Masking. The Masking Format Library page appears with predefined formats that Oracle Enterprise Manager provides.
Click Create.
The Create Masking Format page appears, where you can define a masking format.
Provide a required name for the new masking format, add a description, select Sensitive Type, and then select a Custom Format Entry type from the drop-down list.
A section appears that enables you to provide input for the format entry you have selected. For instance, if you select Array List, the subsequent box enables you to enter a list of comma-separated values, such as New York, New Jersey, and New Hampshire.
Continue adding additional format entries as needed.
When done, provide an optional user-defined or post-processing function, then click Create to include this customized masking format in the library.
The Masking Format page reappears with your newly created format displayed in the Format Library table. You can use this format later to mask a column of the same sensitive type.

Creating a Masking Format from an Existing Format

After you have created at least one format, you can use this format as a template, where you can implement most of the masking format using a different name and changing the entries as needed, rather than needing to create a new format from scratch.

Navigate to the Masking Formats page and select the format you want to use as a template.
Click Actions and choose Create Like.
Enter a name for the new format and select a Sensitive Type.
Customize the format entries as needed and click Create.

Oracle-supplied Predefined Masking Formats

Enterprise Manager provides several out-of-box predefined formats. All predefined formats and built-in formats are random.

Masking Format Categories and Characteristics

Data masking formats have characteristics. Some common characteristics include combinable, uniqueness, reversible, and deterministic:

Combinable: A masking format is considered combinable when it can be combined with other basic masking formats or predefined masking formats though the use of conditions.
Uniqueness: A masking format is characterized as having uniqueness if it ensures uniqueness of the generated masked data. These types of masking formats are useful for masking columns with uniqueness constraints.
Reversible: A masking format that is characterized as reversible can retrieve original column data from masked data. Data masking usually means permanently replacing the data and ensuring that no one can retrieve the original data. But, sometimes you might want to see the original data.
Deterministic: A deterministic masking format generates consistent output for a given input across databases and data masking jobs. Deterministic masking helps to maintain data integrity across multiple applications and preserve system integrity in a single sign-on environment.

MASKING FORMAT	DESCRIPTION	SAMPLE MASKED DATA	COMBINABLE	UNIQUENESS
Visa Credit Card Number	~10 billion unique Visa credit card numbers	4929677281270400 4929026969239255	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
USA Phone Number Formatted	~2.7 billion unique USA phone numbers	305-557-9243 540-140-3020	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
USA Phone Number	~2.7 billion unique USA phone numbers	6823682394 3208603360	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
UPC Number Formatted	~100 billion UPC numbers	1-12345-12345-1 2-38193-12983-9	Yes	Yes. The number of distinct values in the specified range must be greater than or equal to the number of values in the column
UPC Number	~100 billion UPC numbers	277909630632 492259845110	Yes	Yes. The number of distinct values in the specified range must be greater than or equal to the number of values in the column
Social Security Number Formatted	~718 million unique US Social Security Numbers	058-20-6521 337-71-7394	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Social Security Number	~718 million unique US Social Security Numbers	087719720 177470222	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Social Insurance Number Formatted	~1 billion unique Canadian Social Insurance Numbers	861-115-129 186-915-534	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Social Insurance Number	~1 billion unique Canadian Social Insurance Numbers	300580727 780632832	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
National Insurance Number Formatted	Generates unique UK National Insurance Numbers	BH 85 53 00 D RX 81 61 33 B	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
MasterCard Credit Card Number	~10 billion unique MasterCard credit card number	5353780646712850 5259381925706856	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
ISBN (Thirteen Digit) Formatted	~2 billion unique ISBN numbers	978-3-315533-40-1 979-3-343005-63-4	Yes	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
ISBN (Thirteen Digit)	~2 billion unique ISBN numbers	9792746294911 9789075358902	Yes	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
ISBN (Ten Digit) Formatted	~1 billion unique ISBN numbers	8-13-132012-X 7-69-626483-4	Yes	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
ISBN (Ten Digit)	~1 billion unique ISBN numbers	9890454203 7851091314	Yes	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Generic Credit Card Number Formatted	~10 billion unique generic credit card numbers	3647-1229-3400-9783 3088-2317-5970-6344	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Generic Credit Card Number	~10 billion unique generic credit card numbers	5595470359501553 3826796343439260	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Finnish Social Security Numbers	~2.4 billion unique Finnish Social Security Numbers	080309A635C 290558-119J	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Discover Card Credit Card Number	~10 billion unique Discover Card credit card numbers	6011793517281170 6011061023216399	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Auto Mask Format	Masking by scrambling the characters and numbers. This format preserves the input length, position of the characters and numbers, case of the character (upper or lower), and special characters in the input.	Input: ABCD_343_ddg Output: FHDT_657_tte	Yes	Not applicable
Mexico CURP Code	A unique 18-character alphanumeric identifier assigned to every Mexican citizen and resident.	GARC800101HDFLNS08 MORA920715MNLVRS03	No	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Mexico Company RFC ID	A 12-character alphanumeric company code used for tax purposes.	ABC930101T12 XYZ850715M45	No	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Mexico Individual RFC ID	A 12-character alphanumeric used to identify individuals for tax purposes.	LOPE8001011H0 HERN7505128J3	No	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
Netherlands Citizen Service Number (BSN)	A unique nine-digit number used for all interactions with government agencies and some private institutions.	123456782 345678901	No	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column
American Express Credit Card Number	~10 billion unique American Express credit card numbers	3791330644732875 3481870017424316	Yes, if the generated masked values passes the post processing function validation	Yes, if the number of distinct values that can be generated by the LMF is greater than the number of distinct values in the column

Note:

The Deterministic and Reversible format characteristic values for all Masking Formats in the table above are "No".

For more information on Masking Format Characteristics, see: Characteristics of Masking Formats

Credit Card Numbers

Out of the box, the format library provides many different formats for credit cards. The credit card numbers generated by these formats pass the standard credit card validation tests by the applications, thereby making them appear like valid credit card numbers.

Some of the masking formats provided include coverage for credit card types such as:

Mastercard
Visa
American Express
Discover
Generic credit card numbers (can cover all card types)

You may want to use different styles for storing credit card numbers, such as:

Pure numbers
'Space' for every four digits
'Hyphen' ( - ) for every four digits, and so forth

To implement the masked values in a certain format style, you can set the DM_CC_FORMAT variable of the DM_FMTLIB package.

United States Social Security Numbers

Out of the box, you can generate valid U.S. Social Security (SSN) numbers. These SSNs pass the normal application tests of a valid SSN.

You can affect the format style by setting the DM_SSN_FORMAT variable of the DM_FMTLIB package. For example, if you set this variable to '-', the typical social security number would appear as '123-45-6789'.

ISBN Numbers

Using the format library, you can generate either 10-digit or 13-digit ISBN numbers. These numbers adhere to standard ISBN number validation tests. All of these ISBN numbers are random in nature. Similar to other format definitions, you can affect the "style" of the ISBN format by setting values to DM_ISBN_FORMAT.

UPC Numbers

Using the format library, you can generate valid UPC numbers. They adhere to standard tests for valid UPC numbers. You can affect the formatting style by setting the DM_UPC_FORMAT value of the DM_FMTLIB package.

Canadian Social Insurance Numbers

Using the format library, you can generate valid Canadian Social Insurance Numbers (SINs). These numbers adhere to standard tests of Canadian SINs. You can affect the formatting style by setting the DM_CN_SIN_FORMAT value of the DM_FMTLIB package.

North American Phone Numbers

Out of the box, the format library provides various possible U.S. and Canadian phone numbers. These are valid, realistic looking numbers that can pass standard phone number validation tests employed by applications. You can generate the following types of numbers:

Any North American phone numbers
Any Canadian phone number
Any U.S.A. phone number

UK National Insurance Numbers

Using the format library, you can generate valid unique random UK National Insurance Numbers (NINs). These numbers adhere to standard tests of UK NINs. A typical national insurance number would appear as 'GR 12 56 34 RS'.

Auto Mask

This format scrambles characters and numbers into masked characters and numbers and while retaining the format and length of the data, including special characters; for example, 'ABCD_343-ddg' masked as 'FHDT_657-tte'.

Format Entry Options to Customize Masking Format

Format entry options are as follows:

Array List
Accepts a list of values as input and maps each value in the list to a value in the input column. The number of values in the list should be greater than or equal to the number of distinct values in the masked column. The values in the user-provided list are ordered randomly before mapping them to the original column values. For example, if the original column contains values [10,20,30,40,50] and the Array List specified by the user is [99,100,101,102,103], the first masking run could produce the mapping [10,101], [20,103], [30,100], [40,99], [50,102] and a different masking run can produce [10,100], [20,99], [30,101], [40,102], [50,103].

A mapping table is created. The CTAS that creates the mapping table queries from:
- The original table to fetch the column values being masked and a row number for each column value. The row number is derived from the Oracle-supplied ROW_NUMBER function.
- The user-passed list of values — values in the user-passed array list are converted into a table-like record set using the SQL TABLE function. A row number is also retrieved corresponding to each value in the record set. The row number is derived from the ROWNUM pseudo column. The values in the record set are randomly ordered using DBMS_RANDOM.VALUE function.
- The mapping table CTAS then joins the row numbers in both sub-queries to map the original value (from sub-query in step 1 above) and a value from the user-list (sub-query in step 2 above). Multiple executions of the CTAS will create a mapping table with different original-masked value mappings because of the random ordering of the user list in step 2.
  Note:
  If masking column is either Date or Timestamp, user can mask the column using Array List format. The Date/Timestamp formats supported are
```
yyyy-MM-dd
```
  and
```
yyyy-MM-DDTHH:mm:ssTZD
```
  For example: "2015-01-18" and "2015-01-18T03:25:46Z".
Delete
Deletes a row based on a condition. If the condition matches, then the row is deleted on the target. A mapping table is created. The “DELETE_VAL” column in the mapping table is set to 1 for rows that are candidates to be deleted. For example, we are masking the SALARY column and the masking definition has conditions on the EMPID column and formats defined as:
```
EMPID < 100 
      DELETE 
EMPID < 200
	    RANDOM NUMBERS [Start Value:1 End Value:100]
DEFAULT
	    PRESERVE ORIGINAL DATA
```
The mapping table will have the DELETE_VAL column set to 1 for SALARY rows with EMPID < 100. DELETE_VAL for all other rows is set to 0. The final masking CTAS SQL which joins the original table and the mapping table to create the masked table filters out rows with DELETE_VAL set to 1. Therefore, the rows in the original table that match the join condition are effectively “deleted”.
Encrypt
The Encrypt masking format encrypts column data using Triple DES (3DES). The format of the column data after encryption is similar to that of the original values. For example, if you mask nine-digit numbers, the encrypted values also have nine digits. Encrypt is a deterministic and the only reversible masking format. It is helpful when businesses need to mask and send their data to a third party for analysis, reporting, or any other business processing purpose. After the processed data is received from the third party, the original data can be recovered (decrypted) using the same seed value that was used to encrypt the data.

You provide a regular expression to mask character or numeric type column. The specified regular expression must match all the original values in the column. If a value does not match the regular expression exactly, the masking format may no longer produce one-to-one mapping. Therefore, to ensure uniqueness, all the values must match the regular expression. The encrypted values also match the specified regular expression. Encrypt supports encryption of strings of fixed widths. It supports a subset of the regular expression language and does not support * or + syntax in regular expressions.

Note:
The maximum value of a user provided regular expression cannot exceed 64 bits.

You also provide a seed value that is used to generate a key for encryption and decryption. The seed value has to be provided at the time of submitting a data masking job. It can be any string containing alphanumeric characters.

If your masking definition has a sensitive column using Encrypt, you are shown the decrypt option while submitting a data masking job. Choosing this option, you can decrypt the encrypted column values by providing the same seed used to encrypt the sensitive column.
Fixed Number
This format does not use a lookup or a mapping table. It assigns a fixed number value to a string/number column.

The type of column applicable to this entry is a NUMBER column or a STRING column. For example, if you mask a column that has a social security number, one of the entries can be Fixed Number '900'. This format is combinable.
Fixed String
This format does not use a lookup or a mapping table. It assigns a fixed string value to a string column. For example, if you mask a column that has a License Plate Number, one of the entries can be Fixed String 'CA'. This format is combinable.
Null Value
Masks the column with a value of NULL. It does not use a lookup or a mapping table.
Preserve Original Data
Preserves the original column value. Used in conditional masking with a combination of other formats where only a subset of values needs to be masked based on a condition.
Random Dates
The format creates a mapping table. The mapping table CTAS contains code to generate random dates within a user specified date range. A random date is generated using the following logic:
```
TO_DATE(start_date','YYYY-DD-MM HH24:MI:SS') + 
mask_util.genrnd(0, <#of days between the specified date range>)
```
This format is combinable at the end after data range.
Random Decimal Numbers
If used as part of a mixed random string, these have limited usage for generating unique values. This masking format generates unique values within the specified range. For example, a starting value of 5.5 and ending value of 9.99 generates a decimal number ranging from 5.5 to 9.99, both inclusive. This masking format is combinable.
Random Digits
This format generates unique values within the specified range. For example, for a random digit with a length of [5,5], an integer between [0, 99999] is randomly generated, left padded with '0's to satisfy the length and uniqueness requirement. This is a complementary type of random number, which will not be padded. When using random digits, the random digit pads to the appropriate length in a string. It does not pad when used for a number column. This format is combinable.

Data masking ensures that the generated values are unique, but if you do not specify enough digits, you could run out of unique values in that range.
Random Numbers

If used as part of a mixed random string, these have limited usage for generating unique values. This format generates unique values within the specified range. For example, a starting value of 100 and ending value of 200 generates an integer number ranging from 100 to 200, both inclusive. This format is combinable.
Random Strings

This format generates unique values within the specified range. For example, a starting length of 2 and ending length of 6 generates a random string of 2 - 6 characters in length. This format is combinable.
Regular Expression

The format uses a lookup table. No mapping table is created. The PL/SQL function that implements the format is invoked directly from the final CTAS which creates the masked table. The lookup table has two columns to store the regular expression and the replacement value specified by the user. The SQL REGEXP_REPLACE function is used to implement this format.

The function has the signature:
```
regexp_replace(column_value, regex, replacement_val);
```
For example, phone numbers in the format
```
 nnn.nnn.nnnn
```
can be masked using a regex
```
[1-9]{3}[.][0-9]{3}[.][0-9]{4}
```
with a replacement value
```
***.***.****
```
The format invokes the regexp_replace for each regexp-replacement value pair. If the phone number column was masked using regular expression:
```
EMPID < 100
  Regular Expression  	Regex: [1-9]{3}[.][0-9]{3}[.][0-9]{4}  Replacement Value: 999.444.5555
  Regular Expression  	Regex: [9]{3}[.][4]{3}[.][5]{4} 	  Replacement Value: ***.***.****
```
Each column value matching the first regular expression is first replaced with 999.444.5555, this value then matches the second regular expression and is replaced with ***.***.****. The example is not a real world use case. The behavior probably is a side effect of how the format is implemented, the real use case of specifying multiple regular expression formats to mask a column is to handle cases when the data in the column could match multiple regular expressions. For example:
```
EMPID < 100
  Regular Expression  	Regex: [1-9]{3}[.][0-9]{3}[.][0-9]{4}  Replacement Value: ***.***.****
  Regular Expression  	Regex: [1-9]{3}[.][0-9]{3}[.][0-9]{3}  Replacement Value: ***.***.***
```
can be used to mask column values that store 10 digit - nnn.nnn.nnnn - or 9 digit - nnn.nnn.nnn - phone numbers.
Shuffle

This format does not use a lookup or a mapping table. The final CTAS which creates the mapping table includes a sub query to order the column contents randomly using the DBMS_RANDOM.VALUE function. If shuffle is used with a grouping column, the PARTITION clause is used to partition by the grouping column and the column is ordered randomly within each partition. The implementation is similar to that of Array List and Table Column. The random ordering for the shuffle format occurs on the column being “shuffle masked”, whereas in Array List, it is on the user-passed list, and in Table Column, the ordering is on the user-specified column.
Substitute

The format creates a mapping table. It uses a user specified “substitution” table as a source for masked values. The format uses the Oracle supplied hash based partitioning function ORA_HASH to map a column value to its mask value in a lookup (substitution) table. Deterministic substitution is used, and uniqueness is not guaranteed. The new dropdown lists can be used to choose attributes like objects, columns, and schemas.

Processing involves querying the substitution table to get a count of distinct values in the mask column. This count - n - is then used as the max_bucket parameter of ORA_HASH to hash the original column values into n buckets. For example, if we are masking EMPLOYEE.SALARY and using SUBST.SUB_COL column as the substitution column, we first get the count of distinct values in SUB_COL. The mapping table CTAS then queries:
1. The original column, EMPLOYEE.SALARY
2. The user provided substitution table to fetch all the distinct values in SUBST.SUB_COL and also fetches the ROWNUM associated with each row
The CTAS SQL then joins 1 and 2 using ORA_HASH and equating its output to the ROWNUM from step 2. The SELECT part of the CTAS SQL is listed below. max_bckt is the count of distinct values in the substitution column SUBST.SUB_COL:
```
select s.orig_val,
       a0.new_val
from ( select orig_val
       from (select "SALARY" orig_val
        	 from "TESTU"."EMPLOYEE") 
       group by orig_val) s,
     (select rownum rn, 
             SUB_COL new_val 
      from (select distinct SUB_COL
            from TESTU.SUBST
            order by SUB_COL)) a0
 where ora_hash(s.orig_val, max_bckt, seed)+1 = a0.rn
```
SQL Expression

The format does not create a mapping table. It allows a user to use a SQL Expression for masking a column. Data masking uses this expression to generate masked values to replace the original values. The expression is invoked directly from the masking CTAS SQL. The SQL Expression can consist of one or more values, operators, and SQL functions that evaluates to a value. It can also contain substitution columns (columns from the same table as the masked column). Some examples of valid expressions:
```
1.	dbms_random.string('u', 8) || '@company.com'
2.	%first_name% || '.' || %last_name% || '@company.com'
```
Substring

The format creates a mapping table. The mapping table CTAS invokes the Oracle SUBSTR function on the input column. The format accepts a start position and length as input, extracts that data from the input column using SUBSTR, and uses that as a mask value.
Table Column

The format creates a mapping table. The format maps original column values to column values in a user specified table. The processing is similar to the array list format. The values in the user specified table are randomly ordered using DBMS_RANDOM.VALUE before mapping each value to the original column. Unlike the Substitute format, the format is not deterministic since the substitution column is randomly ordered.
Truncate

The format truncates all rows in a table. It does not create a mapping table. If one of the columns in a table is masked using this format, so no other mask formats can be specified for any of the other columns.
User Defined Function

The format creates a mapping table. The return value of the user defined function is used to mask the column. The function is invoked as part of the mapping table CTAS. The function has a fixed signature:
```
 function userdef_func(rowid varchar2, col_name varchar2, orig_val varchar2) returns varchar2;
```
Post Processing Function

Optionally, some formats can also include a post-processing function that takes the output after executing all the format entries, runs the function and produces the final masked value.

The format allows users to use a custom function to process column values after they are masked using standard data masking formats. For example, the SALARY column can be masked with a SQL expression first, and a post processing function can be applied on the masked values to add a currency symbol, like ‘$’. The function has a fixed signature:
```
function post_proc_func(rowid varchar2, column_name varchar2, mask_value varchar2) returns varchar2;
```
The ROWID input allows a user to fetch column values from the masked table. The function could use these values to mask the input column value, basically to transform the column further after a standard format is applied on the column. This format creates a mapping table. The post processing function gets invoked as part of the mapping table CTAS SQL. The input to the mask_value argument of the function is the masked value of the original column. For example, say we are masking the SALARY column and the mask definition has conditions on the EMPID column and formats are defined this way:
```
EMPID < 100 
      RANDOM NUMBERS [START:100000 END: 10000000]
      POST PROCESSING FUNCTION ppf 
EMPID < 200
	    FIXED NUMBER 100000
DEFAULT
	    PRESERVE ORIGINAL DATA
```

Note:

Refer to Frequently Asked Questions for recommended customize masking formats using above entry options.

Deterministic Masking Using the Substitute Format

You may occasionally need to consistently mask multiple, distinct databases. For instance, if you run HR, payroll, and benefits that have an employee ID concept on three separate databases, the concept may be consistent for all of these databases, in that an employee's ID can be selected to retrieve the employee's HR, payroll, or benefits information. Based on this premise, if you were to mask the employee's ID because it actually contains his/her social security number, you would have to mask this consistently across all three databases.

Deterministic masking provides a solution for this problem. You can use the Substitute format to mask employee ID column(s) in all three databases. The Substitute format uses a table of values from which to substitute the original value with a mask value. As long as this table of values does not change, the mask is deterministic or consistent across the three databases.

Using the Shuffle Format

A shuffle format is available, but it does not keep the original data distribution if the column values are not unique or when conditional masking is used. For example, in an Original Table with two columns, EmpName and Salary, the Salary column might have distinct values like 10, 90, and 20.

Original Table

EmpName	Salary
A	10
B	90
C	10
D	10
E	90
F	20

If you mask the Salary column with this format, each of the original values is replaced with one of the values from this set. Assume that the shuffle format replaces 10 with 20, 90 with 10, and 20 with 90.

Mapping Table (Non-preservation)

EmpName	Salary
10	20
90	10
20	90

The result is a shuffled Salary column as shown in the Masked Table, but the data distribution is changed. While the value 10 occurs three times in the Salary column of the Original Table, it occurs only twice in the Masked Table.

Masked Table (Non-preservation)

EmpName	Salary
A	20
B	10
C	20
D	20
E	10
F	90

If the salary values had been unique, the format would have maintained data distribution.

Using Group Shuffle

Group shuffle enables you to perform a shuffle within discrete units, or groups, where there is a relationship among the members of the group. Consider the case of shuffling the salaries of employees. This table illustrates the group shuffle mechanism, where employees are categorized as managers (M) or workers (W), and salaries are shuffled within job category.

Employee	Job Category	Salary	Shuffled Salary
Alice	M	90	88
Bill	M	88	90
Carol	W	72	70
Denise	W	57	45
Eddie	W	70	57
Frank	W	45	72

Using Conditional Masking

To demonstrate how conditional masking can handle duplicate values, add another job category, assistant (A), where the employee in this category, George, earns the same as Frank. Assume the following conditions:

If job category is M, replace salary with a random number between 1 and 10.
If job category is W, set salary to a fixed number (01).
Default is to preserve the existing value.

Applying these conditions results in the masked values shown in the following table:

Employee	Job Category	Salary	Conditional Result
Alice	M	90	5
Bill	M	88	7
Carol	W	72	01
Denise	W	57	01
Eddie	W	70	01
Frank	W	45	01
George	A	45	45

Conditional masking works when there are duplicate values provided there are no dependent columns or foreign keys. If either of these is present, a "bleeding condition" results in the first of two duplicate values becoming the value of the second. So, in the example, Frank's salary is not preserved, but becomes 01.

Providing User-defined and Post-processing Functions

If desired, you can provide user-defined and post-processing functions on the Create Masking Format page. A user-defined choice is available in the Custom Format Entry list, and a post-processing function field is available at the bottom of the page.

User-defined functions

To provide a user-defined function, select User Defined Function from the drop-down list to access the input fields.

A user-defined function passes in the original value as input, and returns a mask value. The data type and uniqueness of the output values must be compatible with the original output values. Otherwise, a failure occurs when the job runs. Combinable, a user-defined function is a PL/SQL function that can be invoked in a SELECT statement. Its signature is returned as:
```
Function udf_func (rowid varchar2, column_name varchar2, original_value varchar2) return varchar2;
```
- rowid is the min (rowid) of the rows that contain the value original_value 3rd argument.
- column_name is the name of the column being masked.
- original_value is the value being masked.
That is, it accepts the original value as an input string, and returns the mask value.

Both the input and output values are varchar2. For instance, a user-defined function to mask a number could receive 100 as input, the string representation of the number 100, and return 99, the string representation of the number 99. Values are cast appropriately when inserting to the table. If the value is not castable, masking fails.
Post-processing functions

To provide a post-processing function, enter it in the Post Processing Function field.

A post-processing function has the same signature as a user-defined function, but passes in the mask value the masking engine generates, and returns the mask value that should be used for masking, as shown in the following example:
```
Function post_proc_udf_func (rowid varchar2, column_name varchar2, mask_value varchar2) return varchar2;
```
- rowid is the min (rowid) of the rows that contain the value mask_value.
- column_name is the name of the column being masked.
- mask_value is the value being masked.

Patterns of Format Definitions

All of the predefined format definitions adhere to these typical patterns:

Generate a random number or random digits.
Perform post-processing on the above-generated value to ensure that the final result is a valid, realistic value.

For example, a valid credit card number must pass Luhn's check. That is, the last digit of any credit card number is a checksum digit, which is always computed. Also, the first few digits indicate the card type (MasterCard, Amex, Visa, and so forth). Consequently, the format definition of a credit card would be as follows:

Generate random and unique 10-digit numbers.
Using a post-processing function, transform the values above to a proper credit card number by adding well known card type prefixes and computing the checksum digit.

This format is capable of generating 10 billion unique credit card numbers.

Masking Definitions

A masking definition defines a data masking operation to be implemented on one or more tables in a database. Masking definitions associate table columns with masking formats to use for masking the data.

Prerequisites

Before creating a masking definition, review requirements and advisory information here: Prerequisites

Ensure the format you select does not violate check constraints and does not break any applications that use the data.
For triggers and PL/SQL packages, data masking recompiles the object.
Exercise caution when masking partitioned tables, especially if you are masking the partition key. In this circumstance, the row may move to another partition.
Data Masking does not support clustered tables, masking information in object tables, XML tables, and virtual columns. Relational tables are supported for masking.
If objects are layered on top of a table such as views, materialized views, and PL/SQL packages, they are recompiled to be valid.

If you plan to mask a test system intended for evaluating performance, try to preserve the production statistics and SQL profiles after masking by adding a pre-masking script to export the SQL profiles and statistics to a temporary table, then restoring after masking completes.

Note:

With the introduction of Enterprise Manager 24.1 Release 1 Update 2 (RU02), it is now possible to re-associate masking definitions from dropped database targets to existing target databases from the Enterprise Manager Masking Definitions page.

Creating a Masking Definition

From the Targets menu, select Databases, then click the Security menu and navigate to Data Masking and Subsetting then Data Masking. Alternatively, you can right-click on your specific database and go to Security then Data Masking and Subsetting then Data Masking.

The Masking Definitions page appears, where you can create new or manage existing masking definitions.
Click Create to go to the Create Masking Definition page.

A masking definition includes information regarding table columns and the format for each column. You can choose which columns to mask, leaving the remaining columns intact.
Provide required details, including Name, Application Data Models, Associated Database, and Database Named Credentials.
Click Next to go to the Create Masking Definition: Masking Formats page, where you can choose which sensitive columns in the ADM you want to mask.
Masking formats are automatically assigned by default for the discovered columns. You can review these formats and can edit based on your requirement. In order to edit the masking format, select the desired sensitive columns you want to mask.

Note:
There can be columns for which a masking format is not defined. You would need to review such columns and define a masking format.

Click Define Masking Format to define the format for the column. Also, this option can save significant time. When you select multiple columns to add that have the same data type, you do not need to define the format for each column one by one.

For instance, if you search for Social Security numbers (SSN) and the search yields 100 SSN columns, you could select them all, then click Define Masking Format to import the SSN format for all of them.
Optionally, below are the other masking options available:
- Group Masking: If you want to mask selected columns as a group, enable Group Masking. The columns that you want to mask as a group must all be from the same table. Any formats defined while this checkbox is selected will be applied to all the selected columns as a group.
- Select all filtered rows: Enable the Select all filtered rows check box if you want to mask more than one column together rather than separately.
  After you define the masking format for the grouped columns and return to this page, the Column Group in the table shows an identical number for each entry row in the table for all members of the group. For example, if you have defined your first group containing four columns, each of the four entries in this page will show a number 1 in the Column Group column. This helps you to distinguish which columns belong to which column groups.
The Define Masking Format page appears, where you can define the format for all selected columns, as explained below:
- Choose the masking format either from the predefined masking formats drop-down or customize your own. Click Import.
- You can provide one or more format entries.
- You can generate Sample Data by clicking Generate.
- When you have finished formatting the column, click Add. The sensitive columns you selected and assigned masking formats to earlier now appear in the next section. Click Next.
Review the Data Masking options. Before clicking Next, you can also provide the pre- and post-masking script. Refer to Selecting Data Masking Advanced Options for more information regarding Pre Mask Scripts and Post Mask Scripts.
Review the final page and click Create to return to the Masking Definitions page. The new masking definition will appear at the top of the table.

Selecting Data Masking Advanced Options

Data Masking Options

The data masking options include:

Disable redo log generation during masking:

Masking disables redo logging and flashback logging to purge any original unmasked data from logs. However, in certain circumstances when you only want to test masking, roll back changes, and retry with more mask columns, it is easier to uncheck this box and use a flashback database to retrieve the old unmasked data after it has been masked. You can use Enterprise Manager to flashback a database.

Note:

Disabling this option compromises security. You must ensure this option is enabled in the final mask performed on the copy of the production database.
Refresh statistics after masking:

If you have already enabled statistics collection and would like to use special options when collecting statistics, such as histograms or different sampling percentages, it is beneficial to turn off this option to disable default statistics collection and run your own statistics collection jobs.
Drop temporary tables created during masking:

Masking creates temporary tables that map the original sensitive data values to mask values. In some cases, you may want to preserve this information to track how masking changed your data. Note that doing so compromises security. These tables must be dropped before the database is available for unprivileged users.
Decrypt encrypted columns:

This option decrypts columns that were previously masked using Encrypt format. To decrypt a previously encrypted column, the seed value must be the same as the value used to encrypt.

Decrypt only recovers the original value if the original format used for the encryption matches the original value. If the originally encrypted value did not conform to the specified regular expression, when decrypted, the encrypted value cannot reproduce the original value.
Use parallel execution when possible:

Oracle Database can make parallel various SQL operations that can significantly improve their performance. Data Masking uses this feature when you select this option. You can enable Oracle Database to automatically determine the degree of parallelism, or you can specify a value. For more information about using parallel execution and the degree of parallelism, see the Oracle Database Data Warehousing Guide.
Recompile invalid dependent objects after masking:

The masking process re-creates the table to be masked and as a consequence, all existing dependent objects (packages, procedures, functions, MViews, Views, Triggers) become invalid. You can specify that the masking process recompile these invalid objects after creating the table, by selecting the check box. Otherwise, invalid objects are not recompiled using utl_comp procedures at the end of masking.

If you choose this option, indicate whether to use serial or parallel execution. You can enable Oracle Database to automatically determine the degree, or you can specify a value. For more information about using parallel execution and the degree of parallelism, see the Oracle Database Data Warehousing Guide.

Random Number Generator

The random number generation options include:

Favor Speed:

The DBMS_RANDOM package is used for random number generation.
Favor Security:

The DBMS_CRYPTO package is used for random number generation. Additionally, if you use the Substitute format, a seed value is required when you schedule the masking job or database clone job.

Pre Mask and Post Mask Scripts

Pre and post masking scripts are free flow scripts where a user can execute SQL statements and PL/SQL stored program units (procedures, functions, packages, anonymous block) before and after masking as required.

Use the Pre Mask Script text box to specify any user-specified SQL script that must run before masking starts.

Use the Post Mask Script text box to specify any user-specified SQL script that must run after masking completes.

Pre-Masking Checks

Prior to generating or running a masking job, a pre-masking check is recommended. The pre-masking check performs a number of checks on the selected target database to determine if it is properly configured for a masking job. The pre-masking check performs the following checks:

Table 4-2 Pre-Masking Checks

Check	Description
Database target user has been granted all required privileges	Check if the masking user has sufficient privileges to perform masking. For more information see, Grant Roles to the Oracle Data Safe Service Account on Your Target Database.
No invalid objects found	Check if there are invalid dependent objects of the tables being masked. Masking involves dropping and recreating the existing tables being masked and their dependent objects. Masking might run into issues recreating the invalid objects.
Sufficient default and TEMP tablespace	Check whether the user's default tablespace and TEMP tablespace have sufficient free space for masking. Masking requires the TEMP tablespace to be at least twice the size of the largest table being masked, and the default tablespace to be at least three times its size.
Database/system level triggers	Check if there are database/system-level triggers. Masking involves dropping and recreating the existing tables being masked and their dependent objects, such as triggers. Masking users may not have the privileges to drop database/system-level triggers, so masking might run into errors.
`AUTOEXTEND` is disabled for the undo tablespace, but there is still available space remaining	Check whether `AUTOEXTEND` is enabled for the undo tablespace.
Tables in the masking policy have up-to-date statistics	Check if the tables being masked have up-to-date statistics. Current statistics allow precise computation of space usage and other operations that rely on dictionary statistics. For more information see, GATHER_TABLE_STATS Procedure in the Oracle Database PL/SQL Packages and Types Reference guide.
Tables in the masking policy have no Oracle Label Security (OLS) policies	Check if the tables being masked have Oracle Label Security (OLS) policies applied. In order for the masking job to run properly, OLS policies affecting the table need to be dropped or disabled. Post-masking, you will need to rebuild the OLS policies.
Virtual Private Database (VPD): The masking user has access or VPD policies are disabled on the tables in the masking policy	Check if the tables being masked have Virtual Private Database (VPD) policies applied. In order for the masking job to run properly, the masking user must be granted sufficient privileges to access the table data, or the VPD policies affecting the table need to be dropped or disabled. The data masking job always drops VPD policies on the tables post masking, you will need to rebuild the VPD policies.
Data Redaction: The masking user has access, or Data Redaction policies are disabled on the tables in the masking policy	Check if the tables being masked have Data Redaction policies applied. In order for the masking job to run properly, the Data Safe user must be granted sufficient privileges to access the table data, or the Data Redaction policies affecting the table need to be dropped or disabled. The data masking job always drops the Data Redaction policies on the tables. Post-masking, you will need to rebuild the Data Redaction policies.
Database Vault: The masking user has access, or Database Vault policies are disabled on the tables in the masking policy	Check if the Data Safe user has permissions to access the target tables if they are covered by Database Vault policies. In order for the masking job to run properly, the Data Safe must be able to access all the data on the target tables. Ensure that the Data Safe user is granted proper privileges. If Database Vault is not enabled on the table, this check will pass. For more information on Database Vault, see the Oracle Database Vault Administrator’s Guide.
Columns defined in the masking definition exist in the target	Check if the masking columns are available in the target database
There is at-least one masking column defined for masking definition	Check if there is at-least one masking column defined in the masking definition
Required packages are in valid state	Check if any of the following packages are in invalid state: `SYS.DBMS_CRYPTO`, `SYS.UTL_RECOMP`
Tables in the masking definition do not have indexes present in SYS schema	Check if the tables present in the masking definition have any SYS indexes present. Masking involves dropping and recreating indexes associated with the tables being masked. So indexes present in the SYS may cause failure during masking

If any of the above checks fail, it is recommended to perform the remediation actions listed in the pre-masking report. Once the issues have been remediated, perform the pre-masking check again to determine if the database is properly configured for the masking job. Once all of the checks have passed, you can generate the masking script and perform the masking job.

Performing a Pre-Masking Check

Prior to generating a masking script, you should perform a pre-masking check to determine if the database is properly configured for the masking job. If the pre-masking check produces any failures, then you should perform the remediation recommendations.

Select your Masking Definition under Data Masking, and click Actions.
Click Pre-Masking checks.
Click Schedule and provide the necessary details.
Click Submit.
Wait for the pre-masking check to finish. Perform any remediation actions and ensure all checks pass prior to generating the masking script. This may require running an additional pre-masking check.

View a Pre-Masking Check Report

After performing a pre-masking check, you will need to view the report to determine if checks were failed or passed. If the pre-masking check produces any failures, then you should perform the remediation recommendations.

Select your Masking Definition under Data Masking, and click Actions.
Click Pre-Masking checks. The Pre-Masking Checks window is displayed.
From the list of Pre-Masking Check Reports, select the one you want to view.
Check the log messages.

Generating Masking Script

To schedule a script generation job:

Select the masking definition to generate a script for, then click on the Actions button and select Manage Masking Script then Generate Masking Script.
Select a Data Masking option:
- In-Database Masking – to replace sensitive data in-place with masked data on a specified database (usually copied from production).
- In-Export Masking – to export masked data from the specified source database (usually production) using Oracle Data Pump.
- You can choose both options; that is, a script to mask the database directly and a script to create a masked dump.
Select an Associated Database and Database Named Credentials from the drop-down list.
Change the default job name to something meaningful, if desired, and provide an optional job description.
Specify to start the job immediately or at a later specified date and time, then click Generate.
A message confirms that the Generate script job has been submitted successfully. Click on the Refresh button to see the latest job status.

Scheduling Masking Job

To set up the data masking job and schedule its execution:

Select the masking definition for which a script has been generated, then click Actions and select Schedule Masking.
Select a Data Masking Option button to indicate your preference:
- In-Database Masking – Replaces sensitive data directly in the specified database (usually a copy from production) with masked data. This option is intended for non-production environments only.
  
  Note:
  You must enable the check box indicating that the selected database is not a production database in order to proceed.
- In-Export Masking – Exports masked data from the specified source database (usually production) using Oracle Data Pump. This option is safe for production environments, as it does not modify customer data. However, it creates temporary tables that will be dropped once the masking operation is completed. Your selection will update the checkbox text below the buttons and other areas on the page.
From the dropdown menus, select the required Associated Database, Database Named Credentials, and Host Named Credentials.
Choose a Tablespace for Temporary Objects. There are three options:
1. Default Tablespace
2. Temporary Tablespace
3. Custom Tablespace
Select the appropriate option based on your needs.
Enter a Job Name or accept the default pre-filled value.
Specify whether to start the job immediately or at a later date and time. Then, click Next.
If you selected In-Database Masking in Step 2:
1. Enter the script file location and name (pre-filled), and then click Submit.
If you selected In-Export Masking in Step 2:
1. Enter the Export Filename (pre-filled).
2. Specify a directory to save the mask dump files to by selecting an option from Type of Export Directory. The available options are a Database Directory, Custom Directory or an External Directory.
3. Select whether to export the masked data only or the entire database along with the masked data through Data to Export. Exporting the entire database might significantly increase the overall job execution time.
4. Enter the Maximum Export File Size (MB) (pre-filled). Optionally, choose to Enable Export File Compression and/or Enable Export File Encryption according to your needs. If Export File Encryption is enabled, enter and confirm an encryption password.
5. Specify the Maximum Number of Threads (pre-filled).
6. Generate Data Dump Logs is enabled by default but you can choose to turn it off as necessary.
7. Click Submit.
A message confirms that the job has been scheduled and you can use the refresh button above the table to update the job status.