Graduate Program KB

Fundamentals of Regular Expressions

Agenda

  1. Background
  2. RegExp Objects in JavaScript
  3. How Do Regular Expressions Work?
  4. Character Sets
  5. Character Classes
  6. Unicode
  7. Quantifiers
  8. Capturing groups
  9. Backreferences in pattern: \N and \k<name>
  10. Alternation (OR) |
  11. Anchors: start ^ and end $
  12. Word Boundary: \b
  13. Lookahead and lookbehind
  14. Methods of Regexp and String
  15. Sources and Further Reading

Background

Regular Expressions have their root in automata theory, the study of modelling state machines. Regular languages were first described by Stephen Cole Kleene in 1951 as a way to represent state machines in a textual format, unlike state diagrams which were the dominant model of the time.

The first use of regular expressions in programs for pattern matching was in the text editor QED,developed by Ken Thompson, who later integrated it into ed. Regular expressions searches feature in successors to ed, such as Vim and Emacs. Their inclusion in ed later led to the creation of grep (Global Regular Expression search and Print), which inspired an explosion of regex implementations in Unix tools.

RegExp Objects in JavaScript

Regular expressions have two parts:

  • A pattern, which consists of a sequence of characters, character sets, character classes, quantifiers, groups, backreferences and/or assertions.
  • A set of optional flags that determine how methods dealing with the RegExp will behave.

RegExp Objects can be created in two ways: literal notation and use of a constructor function.

let literal = /pattern/flags // surround pattern with slashes, like you would surround a string with quotation marks

let constructed1 = new RegExp("pattern", "flags")
let constructed2 = new RegExp(/pattern/, "flags")

Note that as RegExp objects can be constructed using strings, it is possible to create regular expressions dynamically through the use of template strings.

Eight flags can be applied to RegExp objects. The descriptions given here are generalized, check the docs to see how certain flags will change the behaviour of the functions you want to use:

  • i - Matches will be case insensitive
  • g - Methods will try to find all matches instead of stopping at the first match.
  • m - Multiline mode. Beginning and endings assertions will match at the beginning and ending of each line, not just at the beginning and ending of a given string.
  • s - "Dotall" mode. Allows the . character class to match newline characters.
  • u - Enables support for Unicode characters
  • v - Enables support for Unicode characters and Unicode character sets
  • y - "Sticky" mode. Will match only from the index indicated by lastIndex property of the RegExp object and will not attempt to find further matches.

How Do Regular Expressions Work?

Regular expression patterns evaluate strings from the first character until a 'match' is found. Remembering that regular expressions originated as notation for state machines, patterns can be thought of as a sequence of criteria that must be met before reaching an end state.

Consider this example, a regular expression that matches positive dollar values:

Let's go through, step-by-step, how this state machine will attempt a match on the string "It's 9:30, do you have $20?":

  1. The matcher will check each character to until it finds a $ character, present at index 23.
  2. The matcher will move to the next character and check to see if it is in the character range 1-9.
  3. While the matcher could take the path that leads to the end state (the gray circle) now, matchers are 'greedy' by default and will try to match as many characters as possible when using quantifiers. It finds the character 0 is in the desired range 0-9, so it follows that path.
  4. The matcher notices the next character, ?, will not be in the range it wants so it takes the path to the next state, which happens to be our end state. Every character involved in a state transition that led to it reaching the end state ($, 2 and 0) is returned as the match.

Every regular expression can thought through step-by-step like this and be represented like as in the image above (although note that lookarounds are not compatible with most existing visualizer tools). If you have trouble thinking through more complicated regular expressions, Regex101.com can do it for you. Enter in your pattern and your test string and select Regex Debugger under TOOLS in the sidebar menu and it will show you step-by-step how the matcher works with your inputs.


Character Sets

Character sets can be expressed using [square brackets].

  • /gr[ae]y/ will be able to match either grey or gray.
  • /[A-Z]/ will be able to match on any letter in the range A-Z (i.e. capital letters)
  • Ranges can be combined. /[A-Za-z]/ will be able to match on any letter from A-Z or a-z.
  • Sets and ranges can be exclusionary. /[^A-Z]/ will match on any character that is not in the range A-Z.

Character Classes

Character classes essentially work as useful shorthand for important character sets:

  • \d Digits. Equivalent to [0-9]
  • \D Non-Digits. Equivalent to [^0-9]
  • \w 'Word' characters. Equivalent to [A-Za-z0-9_]
  • \W 'Non-word' characters. Equivalent to [^A-Za-z0-9_]
  • \s Whitespace symbols.
  • \S Non-whitespace symbols
  • . Any character (includes newlines when in "dotall" mode)

Unicode

  • When using the u flag, the matcher will be able to interpret Unicode. /\u{1f604}/u will match with the Unicode character '😄'. Use \u for Unicode 'code points'.
  • /\p{Emoji}/u will match with any emoji character including '😄'. Use \p to match using Unicode properties.

Quantifiers

  • {n} matches the last character/character class if it occurs n times
  • {n, m} matches the last character/character class if it occurs between n and m times (inclusive)
  • {n,} matches the last character if it occurs n or more times
  • + means "one or more". Same as {1,}
  • ? means "zero or one". Same as {0,1}
  • * means "zero or more". Same as {0,}

Note that since quantifiers are greedy by default, adding ? to any quantifier will ensure the matcher will match with the fewest characters possible (i.e. take the shortest path to the end state).


Capturing groups

  • Capturing Groups, marked by parentheses, group characters together. This allows:
    • getting part of the match as a separate item in the result array.
    • usage of quantifiers on character sequences (essentially sub-patterns).
  • Parentheses contents are put in the match. Capturing groups are numbered from least-to-most nested and from left-to-right.
    • If there is no g flag, a regex match will return an array, which at index 0 holds the full match, 1 holds the contents of the first parentheses, 2 holds the contents of the second parentheses and so on.
  • matchAll is like match but:
    • Returns an iterable instead of an array.
    • When the g flag is present, returns every match as an array with groups
    • If there are no matches, returns an empty iterable object instead of null
  • Named Groups: Put ?<name> immediately after the opening paren
    • Match will have .groups property, which will itself have properties for each named group
  • Reference capturing groups in replacement with $n where n is the group number $<name>
  • Groups can be non-capturing if you put ?: just after the left parentheses

Backreferences in pattern: \N and \k<name>

let str = `He said: "She's the one!".`;
let regexp = /(['"])(.*?)\1/g
// \1 references the content captured by capturing group 1 (ie. ['"])
// The matcher remembers which character in the set was used, so the backreference will only match with the encountered quotation mark type that was found first.

Alternation (OR) |

  • gr(a|e)y means exactly the same as gr[ae]y. Matches gray or grey
  • gra|ey matches gra or ey

Anchors: start ^ and end $

  • ^ anchors to the beginning of string (or line with m flag). ^first only matches the substring 'first' if the source string begins with it
  • $ anchors to the end of string (or line with m flag). last$ only matches the substring 'last' if the source string ends with it

Word Boundary: \b

  • A word boundary \b is an assertion, like ^ and $. Three positions may qualify as word boundaries:
    • At string start, if first string character is a word character
    • Between two characters in the string, where one is a word character and the other is not
    • At string end, if the last string character is a word character

Lookahead and lookbehind

  • Lookahead: Regex match assert modifier. A match will only fire if something immediately after also matches. Types:
    • X(?=Y) - Positive lookahead. Matches an X so long as a Y is after those two characters
    • X(?Y) - Negative lookahead. Matches an X so long as a Y is not after those two characters
  • Lookbehind: Another match assert modifier. Matches only fire if something immediately behind also matches. Types:
    • (?<=Y)X - Positive lookbehind. Matches an X so long as they are preceded by a Y.
    • (?<!Y)X - Negative lookbehind. Matches an X so long as they are not preceded by a Y.

Methods of Regexp and String

Bonus section! This content contains content not covered in the lightning talk.

  • str.match(regexp) finds matches for regexp in str. If flag g is used, returns an array of all matches as strings, without capturing groups and other details. If there are no matches, null is returned
  • str.matchAll(regexp) returns an iterable object with matches. Each match is returned as an array with capturing groups. Returns empty iterable if there are no matches.
  • str.split(regexp|substr, limit) splits a string using the regexp (or substring) as a delimiter
  • str.search(regexp) returns the position of the first match
  • str.replace(str|regexp, str|func) replaces matches found using regexp in a string with replacement. Replaces all instances if using g, otherwise only replaces the first one.
    • Replacement string can be special combination characters:
      • $& inserts the whole match
      • $` <string> inserts a string before the match
      • $' <string> inserts a string after the match
      • $n if n if a 1-2 digit number, inserts the contents of the nth capturing group
      • $name if name is a named capturing group, insert its contents
      • $$ insert the character $
    • When the first argument of replace is a string, it only replaces the first match
    • If a function is used as the second argument, it is called with arguments func(match, p1, p2, ..., pn, offset, input, groups) where:
      • match is the match
      • p1, p2, ..., pn are the contents of the capturing groups
      • offset is the position of the match
      • input is the source string
      • groups is an object with named groups as properties
  • str.replaceAll(str|regexp, str|func) same as str.replace, but:
    • if the first arg is a string, replaces all occurrences of the string
    • if the first arg is a regex without the g flag, there'll be an error
  • regexp.match(str) if there is no g, behaves the same as str.match(regexp). If flag g is used:
    • A call to regexp.exec(str) returns the first match and saves the position immediately after it in the property regexp.lastIndex
    • The next call starts the search from position regexp.lastIndex. Keeps going until end of string
    • If there are no matches, regexp.exec returns null and resets regexp.lastIndex to 0
  • regexp.test(str) returns true/false to indicate whether match exists in str
    • NOTE: Same global regexp tested repeatedly on different sources may fail

Sources and Further Reading