Lugaru's Epsilon Programmer's Editor
Regular Expressions
Most of Epsilon's searching commands, described in Searching, take a simple string to search for. Epsilon provides a more powerful regular expression search facility, and a regular expression replace facility.
Instead of a simple search string, you provide a pattern, which describes a set of strings. Epsilon searches the buffer for an occurrence of one of the strings contained in the set. You can think of the pattern as generating a (possibly infinite) set of strings, and the regex search commands as looking in the buffer for the first occurrence of one of those strings.
The following characters have special meaning in a regex search: vertical bar, parentheses, plus, star, question mark, square brackets, period, dollar, percent sign, left angle bracket ("<"), and caret ("^").
Plain Patterns
In a regular expression, a string that does not contain any of the above
characters denotes the set that contains precisely that one string. For
example, the regular expression
Alternation
To include more than one string in the set, you can use the
vertical bar character. For example, the regular expression
Grouping
You can enclose any regular expression in parentheses, and the
resulting expression refers to the same set. So searching for
Concatenation
You can concatenate two regular expressions to form a new
regular expression. Suppose the regular expressions p and
q denote sets P and Q, respectively. Then
the regular expression pq denotes the set of strings that
you can make by appending a string from Q to a string from P.
For example, suppose you concatenate the
regular expressions
Closure
Clearly, any regular expression must have finite length;
otherwise you couldn't type it in. But because of the closure
operators, the set to which the regular expression refers
may contain an infinite number of strings. If you append plus to a
parenthesized regular expression, the resulting expression denotes
the set of one or more repetitions of that string. For example,
the regular expression
Optionality
You can specify the question operator in the same place you might put a star or a plus. If you append a question mark to a parenthesized regular expression, the resulting expression denotes the set that contains that string, and the empty string. You would typically use the question operator to specify an optional subpart of the search string.
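The sets denoted by the plus, star, and question operators can be checked with a short sketch. It uses Python's re module as a stand-in, not Epsilon itself; the three operators happen to be written the same way in both, but the helper name in_set and the candidate strings are assumptions made for illustration.

```python
import re

def in_set(pattern, candidate):
    """True when candidate belongs to the set of strings the pattern denotes."""
    return re.fullmatch(pattern, candidate) is not None

candidates = ["", "ab", "abab", "aba"]

# Plus: one or more repetitions of "ab".
plus_members = [s for s in candidates if in_set(r"(ab)+", s)]
# Star: zero or more repetitions, so the empty string also belongs.
star_members = [s for s in candidates if in_set(r"(ab)*", s)]
# Question mark: the string itself or the empty string, nothing longer.
question_members = [s for s in candidates if in_set(r"(ab)?", s)]

print(plus_members)      # ['ab', 'abab']
print(star_members)      # ['', 'ab', 'abab']
print(question_members)  # ['', 'ab']
```

Using a full match rather than a search keeps the focus on set membership, which is how this section describes patterns.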
You can also use the plus, star, and question-mark operators with
subexpressions, and with non-parenthesized things. These operators
always apply to the smallest possible substring to their left. For
example, the regular expression
Entering special characters
In a regular expression, the percent ("
You can also quote characters by enclosing them in angle brackets.
The expression
To search for the NUL character (the character with ASCII code 0), you must use the expression <Nul>, because an actual NUL character may not appear in a regular expression.
Instead of the character's name, you can provide its numeric ASCII
value using the notation
Character Classes
In place of any letter, you can specify a character class. A
character class consists of a sequence of characters between square
brackets. For example, the character class
In place of a letter in a character class, you can specify a range of
characters using a hyphen: the character class
To specify the complement of a character class, put a caret as the
first character in the class. Using the above examples, the class
If you need to put a right square bracket character in
a character class, put it immediately after the opening
left square bracket, or in the case of an inverted character
class, immediately after the caret. For example, the class
To include the hyphen character
Any regular expression you can write with character classes you can also write without character classes. But character classes sometimes let you write much shorter regular expressions.
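The claim that a character class is only a shorthand can be illustrated with a small sketch. It uses Python's re module as a stand-in for Epsilon; square-bracket classes, ranges, and the vertical bar are written the same way in both. The specific classes and the helper same_set are assumptions chosen for the example.

```python
import re
import string

def same_set(pattern_a, pattern_b, alphabet):
    """True when both patterns accept exactly the same single characters."""
    a = re.compile(pattern_a)
    b = re.compile(pattern_b)
    return all(
        (a.fullmatch(ch) is None) == (b.fullmatch(ch) is None)
        for ch in alphabet
    )

alphabet = string.printable

# A class listing individual characters, and the equivalent alternation.
class_equals_alternation = same_set(r"[aeiou]", r"a|e|i|o|u", alphabet)
# A range written with a hyphen, and the equivalent alternation.
range_equals_alternation = same_set(r"[a-e]", r"a|b|c|d|e", alphabet)

print(class_equals_alternation, range_equals_alternation)  # True True
```

The alternation grows with every character added, which is why the class form is usually the one worth typing.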
The period character (outside a character class) represents any
character except a <Newline>. For example, the pattern
You can also specify a character class using a variant of the angle
bracket syntax described above. The expression
<Comma|Period|Question> represents any one of those three
punctuation characters. The expression
You can also use a few character class names that match some common sets of characters. Some use Epsilon's current syntax table, which an EEL program may modify, by way of the isalpha( ) primitive. Typically these include accented characters like ê or å.
More precisely, inside the angle brackets you can put one or more
character names, character ranges, or character class names,
separated by vertical bars. (A range means two character names with
a hyphen between them.) In place of a character name, you can put
Examples
An advanced example
Let's build a regular expression that includes precisely the set of legal strings in the C programming language. All C strings begin and end with double quote characters. The inside of the string denotes a sequence of characters. Most characters stand for themselves, but newline, double quote, and backslash must appear after a "quoting" backslash. Any other character may appear after a backslash as well.
We want to construct a pattern that generates the set of all possible C strings. To capture the idea that the pattern must begin and end with a double quote, we begin by writing
We still have to write the something part, to generate the inside of the C strings. We said that the inside of a C string consists of a sequence of characters. The star operator means "zero or more of something". That looks promising, so we write
Now we need to come up with a something part that
stands for an individual character in a C string. Recall that
characters other than newline, double quote, and backslash stand for
themselves. The pattern
which represents precisely the set of legal C strings. In fact, if you type this pattern into a regex-search command (described below), Epsilon will find the next C string in the buffer.
Searching Rules
Thus far, we have described regular expressions in terms of the abstract set of strings they generate. In this section, we discuss how Epsilon uses this abstract set when it does a regular expression search.
When you tell Epsilon to perform a forward regex search, it looks forward through the buffer for the first occurrence in the buffer of a string contained in the generated set. If no such string exists in the buffer, the search fails.
There may exist several strings in the buffer that match a string in the generated set. Which one qualifies as the first one? By default, Epsilon picks the string in the buffer that begins before any of the others. If there exist two or more matches in the buffer that begin at the same place, Epsilon by default picks the longest one. We call this a first-beginning, longest match.
For example, suppose you position point at the beginning of the following line,
then do a regex search for the pattern
Here, the underlined sections indicate portions of the buffer that match the description "s followed by a sequence of letters". We could identify 31 different occurrences of such strings on this line. Epsilon picks a match that begins first, and among those, a match that has maximum length. In our example, then, Epsilon would pick the following match:
since it begins as soon as possible, and goes on for as long as possible. The search would position point after the final "s" in "sessions".
In addition to the default first-beginning, longest match searching, Epsilon provides three other regex search modes. You can specify first-beginning or first-ending searches. For each of these, you can specify shortest or longest matches. Suppose, with point positioned at the beginning of the following line
you did a regex search with the pattern
By default, Epsilon uses first-beginning, longest matching. You can
include directives in the pattern itself to tell Epsilon to use one
of the other techniques. If you include the directive
You can change Epsilon's default regex searching mode. To make Epsilon use, by default, shortest searches, set the variable regex-shortest to a nonzero value. To specify first-ending searches, set the variable regex-first-end to a nonzero value. (Examples of regular expression searching in this documentation assume the default settings.)
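The four search modes can be sketched by listing every matching substring and then ranking the candidates. This is an illustration in Python, not Epsilon's implementation. The function names, the flags first_end and shortest (named after regex-first-end and regex-shortest), the sample line, and the pattern s[a-z]* for "s followed by a sequence of letters" are all assumptions. The rule that a first-ending search breaks ties by length is one reading of the mode names.

```python
import re

def all_matches(pattern, text):
    """Return every (start, end) span of text that the whole pattern matches."""
    compiled = re.compile(pattern)
    return [
        (start, end)
        for start in range(len(text) + 1)
        for end in range(start, len(text) + 1)
        if compiled.fullmatch(text, start, end)
    ]

def pick(spans, first_end=False, shortest=False):
    """Choose one span according to the two mode flags (hypothetical names)."""
    if not spans:
        return None  # no candidate at all: the search fails

    def rank(span):
        start, end = span
        length = end - start
        position = end if first_end else start  # which edge must come first
        return (position, length if shortest else -length)

    return min(spans, key=rank)

def found(text, span):
    """Return the matched text for a span chosen by pick()."""
    start, end = span
    return text[start:end]

line = "When to the sessions of sweet silent thought"  # sample line (assumption)
spans = all_matches(r"s[a-z]*", line)

default_match = found(line, pick(spans))                    # first-beginning, longest
shortest_match = found(line, pick(spans, shortest=True))    # first-beginning, shortest
first_end_match = found(line, pick(spans, first_end=True))  # first-ending, longest

print(len(spans))       # 31
print(default_match)    # sessions
print(shortest_match)   # s
print(first_end_match)  # s
```

Enumerating every substring is quadratic and only suitable for a one-line demonstration; it makes the choice among candidates explicit, which is the point of the mode settings.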
When Epsilon finds a regex match, it sets point to the end of the
match. It also sets the variables matchstart and
matchend to the beginning and end, respectively, of the match.
You can change what Epsilon considers the end of the match
using the "!" directive. For example, if you searched for
"
Without the "!" directive, the match would consist of the letters "I sought", but because of the "!" directive, the match consists of only the indicated section of the line. Notice that the first three characters of the line also consist of "I s", but Epsilon does not count that as a match. There must first exist a complete match in the buffer. If so, Epsilon will then set point and matchend according to any "!" directive.
You can force Epsilon to reject any potential match that does not
line up appropriately with a line boundary, by using the "^" and
"$" assertions. A "^" assertion specifies a beginning-of-line match,
and a "$" assertion specifies an end-of-line match. For example,
if you search for
Even though the word "new" occurs before "waste", it does not appear at the beginning of the line, so Epsilon rejects it.
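A beginning-of-line assertion can be tried out with Python's re module, where "^" behaves as a line-start assertion when the re.MULTILINE flag is set. The pattern and the sample line below are assumptions chosen to resemble the situation just described; they are not taken from the manual.

```python
import re

line = "And with old woes new wail my dear time's waste"  # sample line (assumption)

# "^new" may only match at the start of a line; "waste" may match anywhere.
anchored = re.search(r"^new|waste", line, flags=re.MULTILINE).group()

# Without the assertion, the earlier word wins.
unanchored = re.search(r"new|waste", line).group()

print(anchored)    # waste
print(unanchored)  # new
```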
Other assertions use Epsilon's angle-bracket syntax. Like the
assertions
For example, searching for
You can create new assertions from character classes specified
with the angle bracket syntax by adding
The class in the above syntax is a
For example,
Overgenerating regex sets
You can use Epsilon's regex search modes to simplify patterns that you write. You can sometimes write a pattern that includes more strings than you really want, and rely on a regex search mode to cut out strings that you don't want.
For example, recall the earlier example of
In C, a comment begins with
In this example, you could have written a more complicated regular expression that generated precisely the set of legal C comments, but this pattern proves easier to write.
The Regex Commands
You can invoke a forward regex search with the Ctrl-Alt-S key, which runs the command regex-search. The Ctrl-Alt-R key invokes a reverse incremental search. You can also enter regular expression mode from any search prompt by typing Ctrl-T at that prompt. For example, if you press Ctrl-S to invoke incremental-search, pressing Ctrl-T causes it to enter regular expression mode. See Searching for a description of the searching commands.
The key Alt-* runs the command regex-replace. This command works like the command query-replace, but interprets its search string as a regular expression.
In the replacement text of a regex replace, the # character followed by a digit n has a special meaning. Epsilon finds the nth parenthesized expression in the pattern, counting left parentheses from 1. It then substitutes the match of this subpattern for the #n in the replacement text. For example, replacing
with
changes
to
If #0 appears in the replacement text, Epsilon substitutes the entire match for the search string. To include the actual character # in a replacement text, use ##.
Other characters in the replacement text have no special meaning. To enter special characters, type a Ctrl-Q before each. Type Ctrl-Q Ctrl-C to include a Ctrl-C character. Type Ctrl-Q Ctrl-J to include a <Newline> character in the replacement text.
Standard bindings:
Ctrl-Alt-S  regex-search
Alt-*  regex-replace
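The #n substitution rule for replacement text can be sketched in Python. This is an illustration of the rule as described here, not Epsilon's code. The function names expand_replacement and regex_replace are hypothetical, and the search pattern uses Python's syntax (for example \w), which differs from Epsilon's.

```python
import re

def expand_replacement(template, match):
    """Expand #n, #0 and ## in template using the groups of match."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        nxt = template[i + 1] if i + 1 < len(template) else ""
        if ch == "#" and nxt == "#":
            out.append("#")  # ## stands for a literal # character
            i += 2
        elif ch == "#" and "0" <= nxt <= "9":
            # #0 is the entire match; #n is the nth parenthesized expression.
            out.append(match.group(int(nxt)) or "")
            i += 2
        else:
            out.append(ch)  # every other character stands for itself
            i += 1
    return "".join(out)

def regex_replace(pattern, template, text):
    """Replace each match of pattern in text with the expanded template."""
    return re.sub(pattern, lambda m: expand_replacement(template, m), text)

swapped = regex_replace(r"(\w+) (\w+)", "#2, #1 ## #0", "hello world")
print(swapped)  # world, hello # hello world
```

Passing a function to re.sub sidesteps Python's own backslash-based replacement syntax, so only the # rules described in this section apply.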