3. Preliminaries
In this section, we review regular expressions and their probabilistic variants. We show that regular expressions are concise and powerful when used in a pattern definition language. However, as a DGL, the lack of support for data domains is a critical drawback.
3.1 Regular Expressions
A regular expression (regex for short) is a sequence of characters that defines patterns in text [24-26]. A pattern consists of one or more characters, operators, or constructs. Each character in a regular expression is either a meta-character with a special meaning or a regular character with a literal meaning.
Character set []. By placing the characters to be matched between square brackets, such as in [aeiou], a character set is defined such that any character in it can be matched. The regex engine matches only one of the characters in a character set. A hyphen inside a character set specifies a range of characters, for instance, [a-z], [0-9a-fA-F], and [SA-E]. Here, \d, and \w are shorthand for the character sets [0-9] and [0-9a-zA-Z_].
Quantifiers ?, *, +, {n, m}. A quantifier defines the number of times a preceding token is repeated. An asterisk (*) means that the preceding token is repeated zero or more times. The question mark (?) tells the regex engine to optionally match the preceding token. The plus operator (+) tells the regex engine to repeat the token one or more times. Here, {n, m} is an additional quantifier that specifies the number of times a token can be repeated. The value of n is zero or a positive integer number, indicating the minimum number of matches, and m is an integer equal to or greater than n indicating the maximum number of matches.
Alternation |. The alternation operator has the lowest precedence among all regression operators. The regex engine will match everything to the left of an alternation operator or everything to its right.
Capturing group (). A regular expression can be grouped by placing a part of it inside parentheses. This allows the application of a quantifier to the entire group or restricts an alternation to a part of the regex. In addition to grouping part of a regular expression together, parentheses also create a numbered capturing group. A portion of the string matched by a part of the regular expression is stored inside the parentheses. Capturing groups make it easy to extract a part of a regex match. The numbered capturing groups can be reused later inside the regular expression via back references.
Backreference, \1–\9. Backreferences match the same text as previously matched by a capturing group. By placing the opening tag into a group, the captured tag name can be reused by a backreference to match the closing tag. For example, in <(h[1-6])>.*?</\1>, captures one of the headline tags and references them later when matching the closing tag.
3.2 Probabilistic Regular Expressions
Probabilistic regular expressions are succinct representations of a corresponding probabilistic regular language. Probabilistic regular grammar is regular grammar with the probability of production rules. A probabilistic grammar G can be defined by a quintuple G = (N, T, R, S, P), where N is the set of nonterminal symbols, T is the set of terminal symbols, R is the set of production rules, S is the start symbol, and P is the set of probabilities of production rules. Consider a grammar with N = {E, A}, T = {a, b, c}, R = {[TeX:] $$E \rightarrow a E, E \rightarrow b E, E \rightarrow b A, A \rightarrow a A, A \rightarrow c$$} and S = {E}, P = {(0.50, 0.25, 0.25), (0.40, 0.60)}. The production rules for non-terminal E are as follows:
The language defined by this grammar is a set of strings with at least two c endings, such as abcc and aabcc. With the specified production probabilities, a string aaabcc of length 6 is more likely to occur than bbbbcc of the same length. By contrast, in classic regular grammar, uniform distributions are always assumed for rules with the same left-hand nonterminal. We extend the classic regular expressions to a probabilistic version by introducing a probabilistic alternation:
In (1), [TeX:] $$E_i(i=1,2, \ldots, n)$$ is either a classic or probabilistic regular expression, and [TeX:] $$p_i$$ is the probability of [TeX:] $$E_i$$. The following is an example defining the grades for a college class (S = excellent, A = good, B = well done, C = pass, and D = do not pass).
A, B, and C are the most common grades, whereas S and D are less common, and ABC is therefore chosen uniformly at random.
4. Enhanced Regex as a DGL
Although regular expressions are a simple language for specifying patterns in text, their context-free characteristics limit their expressiveness. For example, a regular expression cannot naturally define valid data types, such as dates. A regular expression has difficulty validating months in [1...12] and days in [1...31]. In this section, we introduce data domains and other important enhancements to make regex a powerful data generation language.
4.1 Data Domains
Synthetic datasets have many attributes, each of which has a data domain. A data domain is the collection of values permitted for an attribute. It is important to provide a suitable domain for synthesizing meaningful data for each attribute. The standard domain types in most database management systems (DBMSs) include texts, numerals, dates, and times. These primitive data domains are insufficient for data synthesis because the data values in a real dataset only contain meaningful data that are usually a subset of a data domain. For example, a reasonable domain for a name attribute can be a variable character, such as VARCHAR(16), and thus any string no longer than 16 characters in length is valid. However, in a real database, only a small subset of meaningful names is considered valid.
In this section, we describe the domain definition mechanism. Basically, we provide the types of domain definitions, i.e., set and range. A set is a finite collection of values in memory or external storage, such as files or databases. A range is a sequence of values in which the next value can be obtained by computing the previous one.
4.1.1 Set domains
A set domain is defined as a set in the following form:
The resource parameter specifies where to obtain the data. A resource can be a set of values in memory or values from a file or database. The case option defines whether or how to convert values in a list. The case can be lower, upper, or title case. Now, let us explain how resources are organized and utilized in the domain definition.
In-memory resources: In most programming languages, in-memory resources are a list or array. The following list of fruit names is an example of an in-memory resource:
File-based resources: Resources are organized into files with a tree-like directory structure (Fig. 1). All files and directories are under a common root dict and their paths are specified in a dot-separated form, that is, root.parent.child. The child part in the path can be either a file (leaf) or directory (non-leaf). If it is a file, the content of the file (item per line) is read and parsed into a list. If it is a directory, all files under the directory, and recursively the files in the subdirectories, are combined into a single resource.
Directory structure of file-based resources.
Wildcards * can be used for abbreviated path specifications such root.parent.*.descendant, and can be shortened as root.parent.descendant if unambiguous (if descendant is not a real directory under parent). For example, dict.person.firstname refers to the first name of a person, both men and women, which is a shorthand of dict.person.*.firstname. To use last names, and women’s first names, we can specify the resources as follows:
The files are in dict/person/lastname and dict/person/woman/firstname, where lastname and firstname are the target resource files. We can use abbreviated path specifications to name the resources.
All first names under dict/person form a single resource, including dict/person/man/firstname and dict/person/woman/firstname.
DB-based resources: Resources are also supported by backend databases. Suppose there is a table called fruits in the database, containing columns such as name and color. A resource can be defined through SQL:
4.1.2 Sequence domains
Many data domains can be treated as a sequence of consecutive values defined for certain data types. Such domains can be defined as (2). A sequence is built from a starting point, and the next values can be obtained by applying a delta to the previous value.
Delta is the difference between the consecutive values. Each delta value has a number and optional unit in the form of [TeX:] $$<\text { delta }>:=<\text { number }>[<\text { unit }>] .$$ When more than one delta value is given, it is repeatedly applied individually until reaching the stop point. A negative delta value is allowed, which means a decrement of the value. For example,
delta = [5]: increment of 5.
delta = [5, 2]: increment of 5, followed by increment of 2
delta = [-0.2]: decrement of 0.2
Note that because no data type is explicitly given in (2), it must be determined implicitly. Alternatives to (2) can be used when the data type is determined.
integer (start,stop=infinity,delta,format)
decmial (start,stop=infinity,delta,format)
datetime (start,stop=infinity,delta,format)
A unit is associated with the type of data time that specifies the time unit for a delta value. A unit can be D for day, M for month, Y for year, W for week, h for hour, m for minute, and s for second. For example,
delta = ['2D']: increment of 2 days
delta = ['W']: increment of 1 week, or equal to 7 days (7D)
delta = ['-3h', '30m']: initial decrement of 3 h, followed by increment of 30 min.
Finally, a domain can have a format definition when formatted values are required, for example, precision for a float number or leading zeros for integers. In the following section, we provide some examples of sequence domains.
Integer domains
[TeX:] $$\boldsymbol{r a n g e}(1)=[1,2,3, \cdots, \infty]$$
[TeX:] $$\boldsymbol{r a n g e}(1,120)=[1,2,3, \cdots, 119]$$
[TeX:] $$\boldsymbol{r a n g e}(001,120)=[001,002,003, \cdots, 119]$$
[TeX:] $$\boldsymbol{r a n g e}\left(1,120, \boldsymbol{f o r m a t}=^{\prime} 03 d^{\prime}\right)=[001,002,003, \cdots, 119]$$
[TeX:] $$\boldsymbol{r a n g e}(1, \boldsymbol{ delta }=[3])=[1,4,7,10, \cdots, \infty]$$
[TeX:] $$\boldsymbol{r a n g e}(1,120, \boldsymbol{ delta }=[3,1])=[1,4,5,8,9, \cdots, 116,117]$$
Decimal domains
[TeX:] $$\boldsymbol{r a n g e}(0,5, \boldsymbol{ delta }=[0.01])=[0,0.01,0.02, \cdots, 4.49]$$
[TeX:] $$\boldsymbol{r a n g e}(0,00, 5.00, \boldsymbol{ delta }=['0.01'])=[0.00,0.01,0.02, \cdots, 4.49]$$
[TeX:] $$\boldsymbol{r a n g e}\left(0,5, \boldsymbol{ delta }=[0.01], \boldsymbol{ format }=^{\prime} .2 f^{\prime}\right)=[0.00,0.01,0.02, \cdots, 4.49]$$
[TeX:] $$\boldsymbol{r a n g e}(2, -2, \boldsymbol{ delta }=[-0.1])=[2,1.9,1.8, \cdots, -1.8,-1.9]$$
[TeX:] $$\boldsymbol{r a n g e}\left(2,-2, \boldsymbol{ delta }=[-0.1], \boldsymbol{ format }=^{\prime} .2 f^{\prime}\right)=[2.00,1.90,1.80, \cdots, -1.80,-1.90]$$
Datetime domains
[TeX:] $$\boldsymbol{r a n g e}\left(' 2020-4^{\prime},{ }^{\prime} 2021-3^{\prime}, \boldsymbol{ delta }=\left[{ }^{\prime} W^{\prime}\right]=\left[' 2020-4-1^{\prime},{ }^{\prime} 2020-4-88^{\prime}, \cdots\right]\right.$$
[TeX:] $$\boldsymbol{r a n g e}\left(' 9: 00^{\prime},{ }^{\prime} 19: 00^{\prime}, \boldsymbol{ delta }=\left[{ }^{\prime} 100m^{\prime}\right]=\left[' 9: 00^{\prime},{ }^{\prime} 10: 40^{\prime},{ }^{\prime} 12: 20^{\prime},{ }^{\prime} 14: 00^{\prime}, \cdots\right]\right.$$
[TeX:] $$\boldsymbol{r a n g e}\left(' 9: 00^{\prime},{ }^{\prime} 19: 00^{\prime}, \boldsymbol{ delta }=\text { delta }=\left[{ }^{\prime} 100 m^{\prime},{ }^{\prime} 20 m^{\prime}\right]=\left[' 9: 00^{\prime},{ }^{\prime} 10: 40^{\prime},{ }^{\prime} 11: 00^{\prime},{ }^{\prime} 12: 40^{\prime}, \cdots\right]\right.$$
4.1.3 Probability distributions in a domain
Data synthesis is based on a random sampling of domains. Uniform sampling is not always suitable under all situations. Some skews often exist in real datasets. A domain can be associated with a probability distribution. A distribution can be given as a list of categorical domains as follows:
or as a function [TeX:] $$p=f(v)$$ for a numerical domain. For an example p = {^' apple^': 0.3,'banana^': 0.45} is a probability distribution in the domain ['apple','banana','cherry','pear'], where "apple" has a smaller probability than "banana" but a larger one than "cherry" and "pear" because the two share a probability of 0.25. For another example, p ={'Stewart':0.2,'James':0.35} indicates that "Stewart" and "James" are the most popular male first names in the domain.
By default, instances are uniformly extracted at random from the domain. Alternatively, the generation process can follow a statistical distribution. The distributions supported by DGL are listed in Table 1. In Section 4.2.2, we provide a formula for supporting this feature.
Probability distribution support
4.2 Primitive Generators
A generator is an iterator in a domain that traverses the domain and returns a set of samples from the domain. There are two types of primitive generators: sequence generator and random generator.
4.2.1 Sequence generator domain.seq()
domain.seq(size): A sequence generator returns each value while traversing a domain. Function seq() has a parameter size that specifies the sample size.
4.2.2 Random generator domain.rng()
domain.rng(size,replace=True): A random generator returns a random sample from the domain. By default, the samples were uniformly extracted at random. There are two parameters, i.e., one for the sample size and the other for sampling with/without a replacement.
In addition to a uniform generator, other generators for typical probability distributions are also supported.
domain.poisson(size,replace=True)
domain.normal(size,replace=True)
domain.zipf(size,replace=True)
domain.exponential(size,replace=True)
We explain the use of generators with different data domains. First, we define some domains (without a set wrapper for brevity).
grade=['S','A','B','C','D']
firstname = dict.person.firtname
lastname = dict.person.lastname
classtime = range ('9:00','19:00', delta='time[100m]')
When we call a generator grade.rng(3), a list of random samples ['B', 'A', 'S'] is returned. If we call firstname.rng(4), ['Douglas', 'Susan', 'Anne', 'Jennifer'] will be output.
A call of classtime.seq() will return ['9:00', '10:40', '12:20', '14:00', '15:40', '17:20'] as class times. In addition, classtime.rng(replace=False) returns ['17:20', '10:40', '15:40', '9:00', '12:20', '14:00'] as a random sample. When classtime.normal(replace = True) is called, we can obtain a random sample with a replacement ['14:00, '12:30', '15:40', '9:00', '12:20', '14:00'], which follows a normal distribution.
4.3 Regular Expressions as a DGL
We now present a regular expression-based DGL. Some features in regex, such as anchor, ^$, and assertion, are primarily provided for pattern-matching or replacement. We exclude these features and select the following features or operations for data generation:
(1) Literal, e.g., abc, k20rs121,
(2) Character set or character class, e.g., [a-z]
(3) Shorthand character class, e.g., \d, \s, \w
(4) Negated character set, e.g., [^aeiou]
(5) Concatenation
(6) Alternation/union |
(7) Quantifiers ?*+{n}{n,m}
(8) Grouping ()
(9) Backreference, e.g., \1, \2, ..., \9
4.3.1 Resource reference
To support data domains in a regular expression, we introduce a new feature called a resource reference for the use of resources outside a regular expression. Here, resources can be another regular expression or a domain and an associated generator. There are two types of resource references: named reference and index reference.
A named reference takes the form %{name}, and name refers to a named resource. The index reference uses a single digit (0–9) to specify the target resource. As a difference between them, the index reference is used when resources are provided as a list instead of individually.
Suppose we have the following provided resources.
man = dict.person.man.firstname
woman = dict.person.woman.firstname
lastname = dict.person.lastname
birthday = range('1980-1','2000-12', delta=['D'])
A regular expression with resource references is written as (5), which describes a pattern with a gender of M or F, followed by the first name (30% men and 70% women) and the last name. A birthday is a randomly chosen date from 1980-1-1 to 2020-12-31, and a cell phone number begins with 070, 080, or 090, followed by two 4-digit numbers.
4.3.2 Regex-based generator reg()
reg(E,size,resources): A regex-based generator, short for reg, is an iterator that returns a set of strings that exactly match the given regex E. The size parameter specifies the sample size. The resources parameter provides a list of resources. Each resource can be another reg or set/sequence generator in a certain domain.
Considering a reg for the regex in (5), we can define a generator with a sample size of 7 as follows:
When calling the generator, the output will be as follows:
Skew distributions can be generated by applying the probability distributions. It is important to achieve an inter-table dependency. Consider the following example:
p = {'apple': 0.3, 'banana': 0.45}
fruits = db('select name from fruits').rng(p)
person = db('select id as person_id from person')
date = range('2020-1', '2020-12', delta=['D'])
qty = range(100,500, delta=[10])
There were two index references, %0 and %1. These are provided in the list of resources. Here, %0 is for id and %1 is for the status regarding whether the person ran.
When calling the generator, the output will be as follows:
5. Case Study
In this section, we describe the application of the proposed DGL to the TPC-H benchmark [27]. The TPC-H benchmark was designed using TPC for decision-support systems. Commercial database vendors and related hardware vendors use these benchmarks to demonstrate the superiority and competitive edge of their products. TPC-H consists of separate and individual tables (base tables) and the relationships between columns in these tables. The data types in TPC-H include identifiers, integers, [big] decimals, fixed text, variable text, and dates, all of which can be easily translated into DGL data types.
5.1 Using Data Domains in TPC-H
TPC-H uses several dictionaries to generate meaningful text strings. For example, P_TYPE is a combination of words from three sets.
PartSize = ['Standard', 'Small', 'Medium', 'Large', 'Economy', 'Promo'},
PartCoat = ['Anodized', 'Burnished', 'Plated', 'Polished', 'Brushed'}
PartMaterial = {'Tin', 'Nickel', 'Brass', 'Steel', 'Copper'}
The following regular expression defines values for P_TYPE:
size = dict.tpc-h.PartSize
coat = dict.tpc-hPartCoat
material = dict.tpc-h.ParMaterial
E = r/%{size}\s%{cost}\s%{material}/
The output of reg(E, size=20) will be a list of strings such as
Here, P_CONTAINER is generated by concatenating syllables selected at random from each of the two lists and separated by a single space.
ContainerSize = ['SM', 'Med', 'Jumbo', 'Wrap']
ContainerType = ['Case', 'Box', 'Jar', 'Pkg', 'Pack', 'Can', 'Drum']
The following regular expression defines values for P_CONTAINER:
size = dict.tpc-h.ContainerSize
type = dict.tpc-h.ContainerType
E = r/%{size}\s%{type}/
The output of reg(E, size=20) will be a list of strings such as ['Med Can', 'Wrap Jar', … ].
5.2 Populating Tables in TPC-H
It is extremely easy to define single values for individual columns in TPC-H tables. Using resource references in the proposed DGL, it is also possible to generate inter-table dependencies between columns with respect to foreign keys. In this section, we present the table definition with DGL expressions for each column and the relationship between tables.
5.2.1 PART table (SF: Scaling Factor)
5.2.2 SUPPLIER table
5.2.3 PARTSUPP table
The skewness of the data distribution is an important factor in evaluating the join performance. As pointed out in [28-30], although uniform data distributions are a design choice for the TPC-H benchmark, it has been universally recognized that a data skew is prevalent in data warehousing. The following example demonstrates the capability of the proposed DGL to generate skewed distributions.
part = db('select P_PARTKEY from P_PART')
supply = db('select P_SUPPLYKEY from P_SUPPLY')
E = r/(%{part}:0.35,%{supply}:0.65),%0,%1, \w{1,199}/
This example describes the probabilistic correlation between foreign keys, with 35% from PART and 65% from SUPPLIER.