Regular Expressions

Lugaru's Epsilon Programmer's Editor


Most of Epsilon's searching commands, described in Searching, take a simple string to search for. Epsilon provides a more powerful regular expression search facility, and a regular expression replace facility.

Instead of a simple search string, you provide a pattern, which describes a set of strings. Epsilon searches the buffer for an occurrence of one of the strings contained in the set. You can think of the pattern as generating a (possibly infinite) set of strings, and the regex search commands as looking in the buffer for the first occurrence of one of those strings.

The following characters have special meaning in a regex search: vertical bar, parentheses, plus, star, question mark, square brackets, period, dollar, percent sign, left angle bracket ("<"), and caret ("^").

abc|def   Finds either abc or def.

(abc)   Finds abc.

abc+   Finds abc or abcc or abccc or ... .

abc*   Finds ab or abc or abcc or abccc or ... .

abc?   Finds ab or abc.

[abcx-z]   Finds any single character of a, b, c, x, y, or z.

[^abcx-z]   Finds any single character except a, b, c, x, y, or z.

.   Finds any single character except <Newline>.

abc$   Finds abc that occurs at the end of a line.

^abc   Finds abc that occurs at the beginning of a line.

%^abc   Finds a literal ^abc.

<Tab>   Finds a <Tab> character.

<#123>   Finds the character with ASCII code 123.

Plain Patterns

In a regular expression, a string that does not contain any of the above characters denotes the set that contains precisely that one string. For example, the regular expression abc denotes the set that contains, as its only member, the string "abc". If you search for this regular expression, Epsilon will search for the string "abc", just as in a normal search.

Alternation

To include more than one string in the set, you can use the vertical bar character. For example, the regular expression abc|xyz denotes the set that contains the strings "abc" and "xyz". If you search for that pattern, Epsilon will find the first occurrence of either "abc" or "xyz". The alternation operator (|) always applies as widely as possible, limited only by grouping parentheses.

Grouping

You can enclose any regular expression in parentheses, and the resulting expression refers to the same set. So searching for (abc|xyz) has the same effect as searching for abc|xyz, which works as in the previous paragraph. You would use parentheses for grouping purposes in conjunction with some of the operators described below.

Concatenation

You can concatenate two regular expressions to form a new regular expression. Suppose the regular expressions p and q denote sets P and Q, respectively. Then the regular expression pq denotes the set of strings that you can make by taking a member of P and appending a member of Q. For example, suppose you concatenate the regular expressions (abc|xyz) and (def|ghi) to yield (abc|xyz)(def|ghi). From the previous section, we know that (abc|xyz) denotes the set that contains "abc" and "xyz"; the expression (def|ghi) denotes the set that contains "def" and "ghi". Applying the rule, we see that (abc|xyz)(def|ghi) denotes the set that contains the following four strings: "abcdef", "abcghi", "xyzdef", "xyzghi".
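
The concatenation rule can be checked mechanically. The following is a minimal sketch in Python (not Epsilon's own extension language); the helper name concatenate and the variables P, Q, and PQ are made up for this illustration.

```python
from itertools import product

def concatenate(p_set, q_set):
    """Return every string formed by a member of p_set followed by a member of q_set."""
    return {p + q for p, q in product(p_set, q_set)}

P = {"abc", "xyz"}   # the set denoted by (abc|xyz)
Q = {"def", "ghi"}   # the set denoted by (def|ghi)

PQ = concatenate(P, Q)
print(sorted(PQ))    # ['abcdef', 'abcghi', 'xyzdef', 'xyzghi']
```

The result has one string for each pairing of a member of P with a member of Q, which is why two two-element sets yield four strings here.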

Closure

Clearly, any regular expression must have finite length; otherwise you couldn't type it in. But because of the closure operators, the set to which the regular expression refers may contain an infinite number of strings. If you append plus to a parenthesized regular expression, the resulting expression denotes the set of one or more repetitions of that string. For example, the regular expression (ab)+ refers to the set that contains "ab", "abab", "ababab", "abababab", and so on. Star works similarly, except it denotes the set of zero or more repetitions of the indicated string.

Optionality

You can specify the question operator in the same place you might put a star or a plus. If you append a question mark to a parenthesized regular expression, the resulting expression denotes the set that contains that string, and the empty string. You would typically use the question operator to specify an optional subpart of the search string.

You can also use the plus, star, and question-mark operators with subexpressions, and with non-parenthesized things. These operators always apply to the smallest possible substring to their left. For example, the regular expression abc+ refers to the set that contains "abc", "abcc", "abccc", "abcccc", and so on. The expression a(bc)*d refers to the set that contains "ad", "abcd", "abcbcd", "abcbcbcd", and so on. The expression a(b?c)*d denotes the set that contains all strings that start with "a" and end with "d", with the inside consisting of any number of the letter "c", each optionally preceded by "b". The set includes such strings as "ad", "acd", "abcd", "abccccbcd".
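
As a rough check of how these operators bind, the sketch below tests set membership with Python's re module. It assumes only that Python treats +, *, ?, and parentheses the same way for these three patterns; the helper name in_set is hypothetical, and Python's syntax differs from Epsilon's in other respects (quoting, for instance).

```python
import re

def in_set(pattern, candidate):
    """True if the whole candidate string belongs to the set the pattern denotes."""
    return re.fullmatch(pattern, candidate) is not None

# The plus applies only to the "c", the smallest unit to its left.
print(in_set(r"abc+", "abccc"))          # True
print(in_set(r"abc+", "abcabc"))         # False

# Parentheses make the star apply to "bc" as a unit.
print(in_set(r"a(bc)*d", "ad"))          # True
print(in_set(r"a(bc)*d", "abcbcd"))      # True

# Any number of "c", each optionally preceded by "b".
print(in_set(r"a(b?c)*d", "abccccbcd"))  # True
print(in_set(r"a(b?c)*d", "abbcd"))      # False
```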

Name          Char  Name         Char  Name            Char
<Comma>       ,     <Nul>        ^@    <Period>        .
<Space>             <Star>       *     <Plus>          +
<Enter>       ^M    <Percent>    %     <Vbar>          |
<Return>      ^M    <Lparen>     (     <Question>      ?
<Newline>     ^J    <Rparen>     )     <Query>         ?
<Linefeed>    ^J    <Langle>     <     <Caret>         ^
<Tab>         ^I    <Rangle>     >     <Dollar>        $
<Bell>        ^G    <LSquare>    [     <Bang>          !
<Backspace>   ^H    <RSquare>    ]     <Exclamation>   !
<FormFeed>    ^L    <Lbracket>   [     <Quote>         '
<Esc>         ^[    <Rbracket>   ]     <SQuote>        '
<Escape>      ^[    <Dot>        .     <DQuote>        "
<Null>        ^@

Entering special characters

In a regular expression, the percent ("%") character quotes the next character, removing any special meaning that character may have. For example, the expression x%+ refers to the string "x+", whereas the pattern x+ refers to the set that contains "x", "xx", "xxx", and so on.

You can also quote characters by enclosing them in angle brackets. The expression x<+> refers to the string "x+", the same as x%+. In place of the character itself, you can provide the name of the character inside the angle brackets. The table lists all the character names Epsilon recognizes.

To search for the NUL character (the character with ASCII code 0), you must use the expression <Nul>, because an actual NUL character may not appear in a regular expression.

Instead of the character's name, you can provide its numeric ASCII value using the notation <#number>. The sequence <#number> denotes the character with ASCII code number. For example, the pattern <#0> provides another way to specify the NUL character, and the pattern abc<#10>+ specifies the set of strings that begin with "abc" and end with one or more newline characters (newline has ASCII value 10). You can enter the ASCII value in hexadecimal, octal, or binary by prefixing the number with "0x", "0o", or "0b", respectively. For example, <#32>, <#0x20>, <#0o40>, and <#0b100000> all yield a <Space> character (ASCII code 32).
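
To see that the four spellings name the same character, here is a small Python sketch. The function parse_char_code is a hypothetical helper written for this illustration, not part of Epsilon; it relies on Python's int() with base 0, which happens to accept the same 0x, 0o, and 0b prefixes.

```python
def parse_char_code(spec):
    """Return the character named by a <#number> sequence such as '<#0x20>'."""
    if not (spec.startswith("<#") and spec.endswith(">")):
        raise ValueError("expected <#number>, got %r" % spec)
    digits = spec[2:-1]
    # Base 0 infers the radix from a 0x, 0o, or 0b prefix and otherwise reads
    # decimal. (Python rejects decimal numbers with leading zeros in this mode.)
    return chr(int(digits, 0))

for spec in ["<#32>", "<#0x20>", "<#0o40>", "<#0b100000>"]:
    print(spec, repr(parse_char_code(spec)))   # each line shows ' '
```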

Character Classes

In place of any letter, you can specify a character class. A character class consists of a sequence of characters between square brackets. For example, the character class [adef] stands for any of the following characters: "a", "d", "e", or "f".

In place of a letter in a character class, you can specify a range of characters using a hyphen: the character class [a-m] stands for the characters "a" through "m", inclusively. The class [ae-gr] stands for the characters "a", "e", "f", "g", or "r". The class [a-zA-Z0-9] stands for any alphanumeric character.

To specify the complement of a character class, put a caret as the first character in the class. Using the above examples, the class [^a-m] stands for any character other than "a" through "m", and the class [^a-zA-Z0-9] stands for any non-alphanumeric character. Inside a character class, only ^ and - have special meaning. All other characters stand for themselves, including plus, star, question mark, etc.

If you need to put a right square bracket character in a character class, put it immediately after the opening left square bracket, or in the case of an inverted character class, immediately after the caret. For example, the class []x] stands for the characters "]" or "x", and the class [^]x] stands for any character other than "]" or "x".

To include the hyphen character - in a character class, make it the first character in the class, after any ^ or ]. For example, the pattern [^]-q] matches any character except ], -, or q.

Any regular expression you can write with character classes you can also write without character classes. But character classes sometimes let you write much shorter regular expressions.

The period character (outside a character class) represents any character except a <Newline>. For example, the pattern a.c matches any three-character sequence on a single line where the first character is "a" and the last is "c".

You can also specify a character class using a variant of the angle bracket syntax described above. The expression <Comma|Period|Question> represents any one of those three punctuation characters. The expression <a-z|A-Z|?> represents either a letter or a question mark, the same as [a-zA-Z]|<?>, for example. The expression <^Newline> represents any character except newline, just as the period character by itself does.

You can also use a few character class names that match some common sets of characters. Some use Epsilon's current syntax table, which an EEL program may modify, by way of the isalpha( ) primitive. Typically the letters recognized this way include accented characters like ê or å.

Class   Meaning

<digit>   A digit, 0 to 9.

<alpha>   A letter, according to isalpha( ).

<alphanum>   Either of the above.

<word>   All of the above, plus the _ character.

<hspace>   The same as <Space|Tab>.

<wspace>   The same as <Space|Tab|Newline>.

<any>   Any character including <Newline>.

More precisely, inside the angle brackets you can put one or more character names, character ranges, or character class names, separated by vertical bars. (A range means two character names with a hyphen between them.) In place of a character name, you can put # and the ASCII number of a character, or you can put the character itself (for any character except >, |, -, or <Nul>). Finally, just after the opening <, you can put a ^ to specify the complement of the character class.

Examples

The pattern if|else|for|do|while|switch specifies the set of statement keywords in C and EEL.
The pattern c[ad]+r specifies strings like "car", "cdr", "caadr", "caaadar". These correspond to compositions of the car and cdr Lisp operations.
The pattern c[ad][ad]?[ad]?[ad]?r specifies the strings that represent up to four compositions of car and cdr in Lisp.
The pattern [a-zA-Z]+ specifies the set of all sequences of 1 or more letters. The character class part denotes any upper- or lower-case letter, and the plus operator specifies one or more of those.
Epsilon's commands to move by words accomplish their task by performing a regular expression search. They use a pattern similar to [a-zA-Z0-9_]+, which specifies one or more letters, digits, or underscore characters. (The actual pattern includes national characters as well.)
The pattern (<Newline>|<Return>|<Tab>|<Space>)+ specifies nonempty sequences of the whitespace characters newline, return, tab, and space. You could also write this pattern as <Newline|Return|Tab|Space>+ or as <Wspace|Return>+, using a character class name.
The pattern /%*.*%*/ specifies a set that includes all 1-line C-language comments. The percent character quotes the first and third stars, so they refer to the star character itself. The middle star applies to the period, denoting zero or more occurrences of any character other than newline. Taken together then, the pattern denotes the set of strings that begin with "slash star", followed by any number of non-newline characters, followed by "star slash". You can also write this pattern as /<Star>.*<Star>/.
The pattern /%*(.|<Newline>)*%*/ looks like the previous pattern, except that instead of ".", we have (.|<Newline>). So instead of "any character except newline", we have "any character except newline, or newline", or more simply, "any character at all". This set includes all C comments, with or without newlines in them. You could also write this as /%*<Any>*%*/ instead.
The pattern <^digit|a-f> matches any character except one of these: 0123456789abcdef.
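
The two car/cdr patterns above happen to be written identically in Python's re module, so they can be compared directly. This is only an illustrative sketch; the variable names are invented here.

```python
import re

any_depth = re.compile(r"c[ad]+r")                   # one or more a/d letters
up_to_four = re.compile(r"c[ad][ad]?[ad]?[ad]?r")    # one to four a/d letters

names = ["car", "cdr", "caadr", "caaadar", "cr", "cxr"]
for name in names:
    print(name,
          bool(any_depth.fullmatch(name)),
          bool(up_to_four.fullmatch(name)))
```

Only "caaadar" separates the two patterns: it has five letters between "c" and "r", so the unbounded pattern accepts it and the bounded one does not.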

An advanced example

Let's build a regular expression that includes precisely the set of legal strings in the C programming language. All C strings begin and end with double quote characters. The inside of the string denotes a sequence of characters. Most characters stand for themselves, but newline, double quote, and backslash must appear after a "quoting" backslash. Any other character may appear after a backslash as well.

We want to construct a pattern that generates the set of all possible C strings. To capture the idea that the pattern must begin and end with a double quote, we begin by writing

"something"

We still have to write the something part, to generate the inside of the C strings. We said that the inside of a C string consists of a sequence of characters. The star operator means "zero or more of something". That looks promising, so we write

"(something)*"

Now we need to come up with a something part that stands for an individual character in a C string. Recall that characters other than newline, double quote, and backslash stand for themselves. The pattern <^Newline|"|\> captures precisely those characters. In a C string, a "quoting" backslash must precede the special characters (newline, double quote, and backslash). In fact, a backslash may precede any character in a C string. The pattern \(.|<Newline>) means precisely "backslash followed by any character". Putting those together with the alternation operator (|), we get the pattern <^Newline|"|\>|\(.|<Newline>) which generates either a single "normal" character or any character preceded by a backslash. Substituting this pattern for the something yields

"(<^Newline|"|\>|\(.|<Newline>))*"

which represents precisely the set of legal C strings. In fact, if you type this pattern into a regex-search command (described below), Epsilon will find the next C string in the buffer.
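
For readers more familiar with other regex dialects, here is one possible transcription of the same construction into Python's re syntax. It is a sketch, not Epsilon syntax: Python quotes with backslash rather than percent, and the class [^\n"\\] stands in for <^Newline|"|\>. The names C_STRING and source are made up for the example.

```python
import re

# "  ( normal character | backslash followed by any character )*  "
C_STRING = re.compile(r'"(?:[^\n"\\]|\\(?:.|\n))*"')

# A line of C source containing two string literals, one with escapes.
source = 'printf("say \\"hi\\"\\n"); puts("ok");'

for literal in C_STRING.findall(source):
    print(literal)
```

The escaped double quotes inside the first literal do not end the match, because each one is consumed by the "backslash followed by any character" alternative.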

Searching Rules

Thus far, we have described regular expressions in terms of the abstract set of strings they generate. In this section, we discuss how Epsilon uses this abstract set when it does a regular expression search.

When you tell Epsilon to perform a forward regex search, it looks forward through the buffer for the first occurrence of a string contained in the generated set. If no such string exists in the buffer, the search fails.

There may exist several strings in the buffer that match a string in the generated set. Which one qualifies as the first one? By default, Epsilon picks the string in the buffer that begins before any of the others. If there exist two or more matches in the buffer that begin at the same place, Epsilon by default picks the longest one. We call this a first-beginning, longest match. For example, suppose you position point at the beginning of the following line,

When to the sessions of sweet silent thought

then do a regex search for the pattern s[a-z]*. That pattern describes the set of strings that start with "s", followed by zero or more letters. We can find quite a few strings on this line that match that description. Any portion of the line that consists of "s" followed by a sequence of letters qualifies: "s", "se", "ses", and so on up to "sessions", but also "ssions", "sions", "sweet", and "silent", among others. We could identify 31 different occurrences of such strings on this line. Epsilon picks a match that begins first, and among those, a match that has maximum length. In our example, then, Epsilon would pick the following match:

When to the sessions of sweet silent thought

Epsilon matches the word "sessions", since it begins as soon as possible, and goes on for as long as possible. The search would position point after the final "s" in "sessions".
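
The count of 31 and the choice of "sessions" can be reproduced by brute force. The sketch below is written in Python and says nothing about how Epsilon actually searches; it simply tries every substring. The helper name all_matches is hypothetical.

```python
import re

def all_matches(pattern, text):
    """Return (start, end) for every nonempty substring the whole pattern matches."""
    compiled = re.compile(pattern)
    return [(i, j)
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if compiled.fullmatch(text, i, j)]

line = "When to the sessions of sweet silent thought"
spans = all_matches(r"s[a-z]*", line)
print(len(spans))                      # 31

# First-beginning: keep the matches that start earliest.
first_start = min(start for start, _ in spans)
# Longest: among those, take the one that ends latest.
best_end = max(end for start, end in spans if start == first_start)
print(line[first_start:best_end])      # sessions
```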

In addition to the default first-beginning, longest match searching, Epsilon provides three other regex search modes. You can specify first-beginning or first-ending searches. For each of these, you can specify shortest or longest matches. Suppose, with point positioned at the beginning of the following line

I summon up remembrance of things past,

you did a regex search with the pattern m.*c|I.*t. Depending on which regex mode you chose, you would get one of the four following matches:

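I summon up remembrance of things past,   (first-ending shortest: matches "mbranc")

I summon up remembrance of things past,   (first-ending longest: matches "mmon up remembranc")

I summon up remembrance of things past,   (first-beginning shortest: matches "I summon up remembrance of t")

I summon up remembrance of things past,   (first-beginning longest: matches "I summon up remembrance of things past")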
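
One way to see how the four modes differ is to enumerate every possible match and then apply each selection rule. The following Python sketch does that by brute force; it is a model of the rules as described here, not a description of Epsilon's implementation, and the function pick and its parameters are invented for the illustration.

```python
import re

def pick(pattern, text, first="begin", length="longest"):
    """Choose one match of pattern in text under the given search mode."""
    compiled = re.compile(pattern)
    spans = [(i, j)
             for i in range(len(text))
             for j in range(i + 1, len(text) + 1)
             if compiled.fullmatch(text, i, j)]
    if not spans:
        return None
    if first == "begin":
        # Earliest start wins; length decides among matches starting there.
        start = min(i for i, _ in spans)
        ends = [j for i, j in spans if i == start]
        end = max(ends) if length == "longest" else min(ends)
    else:
        # Earliest end wins; length decides among matches ending there.
        end = min(j for _, j in spans)
        starts = [i for i, j in spans if j == end]
        start = min(starts) if length == "longest" else max(starts)
    return text[start:end]

line = "I summon up remembrance of things past,"
for first in ("end", "begin"):
    for length in ("shortest", "longest"):
        print(first, length, repr(pick(r"m.*c|I.*t", line, first, length)))
```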

By default, Epsilon uses first-beginning, longest matching. You can include directives in the pattern itself to tell Epsilon to use one of the other techniques. If you include the directive <Min> anywhere in the pattern, Epsilon will use shortest-matching instead of longest-matching. Putting <FirstEnd> selects first-ending instead of first-beginning. You can also put <Max> for longest-matching, and <FirstBegin> for first-beginning. These last two might come in handy if you've changed Epsilon's default regex mode. The sequences <FE> and <FB> provide shorthand equivalents for <FirstEnd> and <FirstBegin>, respectively. As an example, you could use the following patterns to select each of the matches listed in the previous example:

<FE><Min>m.*c|I.*t   (first-ending shortest)

<FE><Max>m.*c|I.*t   (first-ending longest)

<FB><Min>m.*c|I.*t   (first-beginning shortest)

<FB><Max>m.*c|I.*t   (first-beginning longest)

You can change Epsilon's default regex searching mode. To make Epsilon use, by default, shortest-match searches, set the variable regex-shortest to a nonzero value. To specify first-ending searches, set the variable regex-first-end to a nonzero value. (Examples of regular expression searching in this documentation assume the default settings.)

When Epsilon finds a regex match, it sets point to the end of the match. It also sets the variables matchstart and matchend to the beginning and end, respectively, of the match. You can change what Epsilon considers the end of the match using the "!" directive. For example, if you searched for "I s!ought" in the following line, Epsilon would match only the "I s" at the start of "I sought":

I sigh the lack of many a thing I sought,

Without the "!" directive, the match would consist of the letters "I sought", but because of the "!" directive, the match consists of only those first three characters, "I s". Notice that the first three characters of the line also consist of "I s", but Epsilon does not count that as a match. There must first exist a complete match in the buffer. If so, Epsilon will then set point and matchend according to any "!" directive.
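
Other regex dialects express a similar idea with a lookahead. The Python sketch below is an analogy only, on the assumption that "text after ! must be present but is not part of the match" corresponds to Python's (?=...) construct; it is not Epsilon syntax.

```python
import re

line = "I sigh the lack of many a thing I sought,"

# "I s" must be followed by "ought", but "ought" is not included in the match.
m = re.search(r"I s(?=ought)", line)

print(repr(m.group()))                      # 'I s'
print(m.start() == line.index("I sought"))  # True: not the "I s" of "I sigh"
```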

You can force Epsilon to reject any potential match that does not line up appropriately with a line boundary, by using the "^" and "$" assertions. A "^" assertion specifies a beginning-of-line match, and a "$" assertion specifies an end-of-line match. For example, if you search for ^new|waste in the following line, it would match the word "waste":

And with old woes new wail my dear time's waste;

Even though the word "new" occurs before "waste", it does not appear at the beginning of the line, so Epsilon rejects it.
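
The same pattern behaves the same way in Python's re module when its MULTILINE flag is set, which makes ^ match at the start of every line. This is a sketch for comparison, assuming that flag is the closest equivalent to the line-based behavior described above.

```python
import re

line = "And with old woes new wail my dear time's waste;"

# "new" is not at the start of the line, so only "waste" can match.
found = re.search(r"^new|waste", line, re.MULTILINE)
print(found.group())        # waste

# When "new" does begin a line, the ^new alternative matches.
two_lines = "old woes\nnew wail"
found_new = re.search(r"^new|waste", two_lines, re.MULTILINE)
print(found_new.group())    # new
```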

Other assertions use Epsilon's angle-bracket syntax. Like the assertions ^ and $, these don't match any specific characters, but a potential match will be rejected if the assertion isn't true at that point in the pattern.

Assertion   Meaning

^   At the start of a line.

$   At the end of a line.

<bob> or <bof>   At the start of the buffer.

<eob> or <eof>   At the end of the buffer.

For example, searching for <bob>sometext<eob> won't succeed unless the buffer contains only the eight-character string sometext.

You can create new assertions from character classes specified with the angle bracket syntax by adding [, ] or / just after the opening <.

Assertion   Meaning

<[class>   The next character matches class, the previous one does not.

<]class>   The previous character matches class, the next one does not.

</class>   Either of the above.

The class in the above syntax is a |-separated list of one or more single characters, character names like Space or Tab, character numbers like #32 or #9, ranges of any of these, or character class names like Word or Digit.

For example, </word> matches at a word boundary, and <]word> matches at the end of a word. The pattern <]0-9|a-f> matches at the end of a run of hexadecimal digits. And the pattern (cat|[0-9])</digit>(dog|[0-9]) matches cat3 or 4dog, but not catdog or 42.
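
The last example can be imitated in Python with lookarounds. This sketch assumes that </digit> means "exactly one of the two neighboring characters is a digit", as the table above describes; the name DIGIT_EDGE is invented, and how Epsilon treats the edges of the buffer is not modeled here.

```python
import re

DIGIT = "[0-9]"
# Either: previous is not a digit and next is; or previous is a digit and next is not.
DIGIT_EDGE = "(?:(?<!%s)(?=%s)|(?<=%s)(?!%s))" % (DIGIT, DIGIT, DIGIT, DIGIT)

pattern = re.compile("(?:cat|[0-9])" + DIGIT_EDGE + "(?:dog|[0-9])")

for candidate in ["cat3", "4dog", "catdog", "42"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

"catdog" fails because neither neighbor at the junction is a digit, and "42" fails because both are.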

Overgenerating regex sets

You can use Epsilon's regex search modes to simplify patterns that you write. You can sometimes write a pattern that includes more strings than you really want, and rely on a regex search mode to cut out strings that you don't want.

For example, recall the earlier example of /%*(.|<Newline>)*%*/. This pattern generates the set of all strings that begin with /* and end with */. This set includes all the C-language comments, but it includes some additional strings as well. It includes, for example, the following illegal C comment:

/* inside /* still inside */ outside */

In C, a comment begins with /* and ends with the very next occurrence of */. You can effectively get that by modifying the above pattern to specify a first-ending, longest match, with <FE><Max>/%*(.|<Newline>)*%*/. In the line above, it would match only the text through the first */:

/* inside /* still inside */

In this example, you could have written a more complicated regular expression that generated precisely the set of legal C comments, but this pattern proves easier to write.
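
Python has no first-ending mode, but for this particular pattern its non-greedy star gives the same result, so it can serve as a rough illustration of the idea. The sketch below is an analogy under that assumption, not a general equivalence.

```python
import re

text = "/* inside /* still inside */ outside */"

# Greedy: runs to the last */, so it overshoots the real comment.
greedy = re.search(r"/\*.*\*/", text, re.DOTALL).group()

# Non-greedy: stops at the first */ after the opening /*.
lazy = re.search(r"/\*.*?\*/", text, re.DOTALL).group()

print(greedy)
print(lazy)
```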

The Regex Commands

You can invoke a forward regex search with the Ctrl-Alt-S key, which runs the command regex-search. The Ctrl-Alt-R key runs the command reverse-regex-search, which searches backward. You can also enter regular expression mode from any search prompt by typing Ctrl-T to that prompt. For example, if you press Ctrl-S to invoke incremental-search, pressing Ctrl-T causes it to enter regular expression mode. See Searching for a description of the searching commands.

The key Alt-* runs the command regex-replace. This command works like the command query-replace, but interprets its search string as a regular expression.

In the replacement text of a regex replace, the # character followed by a digit n has a special meaning. Epsilon finds the nth parenthesized expression in the pattern, counting left parentheses from 1. It then substitutes the match of this subpattern for the #n in the replacement text. For example, replacing

([a-zA-Z0-9_]+) = ([a-zA-Z0-9_]+)

with

#2 := #1

changes

variable = value;

to

value := variable;

If #0 appears in the replacement text, Epsilon substitutes the entire matched text for it. To include the actual character # in a replacement text, use ##.
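
The #n convention maps closely onto the group references of other regex tools. As a sketch, the Python code below translates replacement text written in the style described here into a template for re.sub; the function to_python_replacement is a hypothetical helper, and the search pattern is used unchanged only because this one happens to be valid in both dialects.

```python
import re

def to_python_replacement(epsilon_text):
    """Translate #n and ## replacement text into a Python re.sub template."""
    out = []
    i = 0
    while i < len(epsilon_text):
        ch = epsilon_text[i]
        nxt = epsilon_text[i + 1] if i + 1 < len(epsilon_text) else ""
        if ch == "#" and nxt == "#":
            out.append("#")                  # ## stands for a literal #
            i += 2
        elif ch == "#" and nxt != "" and nxt in "0123456789":
            out.append("\\g<" + nxt + ">")   # #n becomes \g<n>; \g<0> is the whole match
            i += 2
        elif ch == "\\":
            out.append("\\\\")               # keep backslashes literal in the template
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

pattern = r"([a-zA-Z0-9_]+) = ([a-zA-Z0-9_]+)"
template = to_python_replacement("#2 := #1")
result = re.sub(pattern, template, "variable = value;")
print(result)   # value := variable;
```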

Other characters in the replacement text have no special meaning. To enter special characters, type a Ctrl-Q before each. Type Ctrl-Q Ctrl-C to include a Ctrl-C character. Type Ctrl-Q Ctrl-J to include a <Newline> character in the replacement text.

Standard bindings:

Ctrl-Alt-S    regex-search

Ctrl-Alt-R    reverse-regex-search

Alt-*    regex-replace
