7.1. Regex About

  • Also known as: re, regex, regexp, Regular Expressions

The W3C HTML5 standard [4] gives the following regexp for validating an email input field:

>>> pattern = r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"
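A minimal sketch of how the pattern above could be used for validation. The helper name `is_valid_email` and the sample addresses are assumptions made for illustration; they are not part of the standard.

```python
import re

# Pattern copied verbatim from the example above.
pattern = r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"


def is_valid_email(address):
    """Hypothetical helper: True if address matches the pattern."""
    return re.match(pattern, address) is not None


# Sample inputs chosen for illustration only.
candidates = [
    'alice@example.com',      # ordinary address
    'alice@example',          # no dot in the domain
    'alice.example.com',      # missing @
    '<alice@example.com>',    # angle brackets are not allowed characters
]

results = {address: is_valid_email(address) for address in candidates}

for address, valid in results.items():
    print(f'{address!r}: {valid}')
```

Note that this pattern accepts `alice@example`, because the dotted part of the domain is optional (`(?:...)*` may repeat zero times).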

7.1.1. SetUp

  • string = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

  • string is short

  • string has name

  • string has date

  • string has time

  • string has punctuation (comma, colon, angle brackets, at sign, dot)

  • string has digits and numbers

  • string has an ordinal suffix (st) - one of st, nd, rd, th

  • string has lowercase and uppercase letters

  • string has email address

  • string has prepositions (on, at)

>>> import re
>>> string = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> STRING = """Apollo 11 was the American spaceflight that first landed
... humans on the Moon. Commander (CDR) Neil Armstrong and lunar module
... pilot (LMP) Buzz Aldrin landed the Apollo Lunar Module (LM) Eagle on
... July 20th, 1969 at 20:17 UTC, and Armstrong became the first person
... to step (EVA) onto the Moon's surface (EVA) 6 hours 39 minutes later,
... on July 21st, 1969 at 02:56:15 UTC. Aldrin joined him 19 minutes later.
... They spent 2 hours 31 minutes exploring the site they had named Tranquility
... Base upon landing. Armstrong and Aldrin collected 47.5 pounds (21.5 kg)
... of lunar material to bring back to Earth as pilot Michael Collins (CMP)
... flew the Command Module (CM) Columbia in lunar orbit, and were on the
... Moon's surface for 21 hours 36 minutes before lifting off to rejoin
... Columbia."""

7.1.2. Python

  • import re

  • re.findall() - find all occurrences of pattern in string, returns list[str]

  • re.finditer() - find all occurrences of pattern in string, lazily, returns Iterator[re.Match]

  • re.search() - find first occurrence of pattern anywhere in string, returns re.Match or None (stops after first match)

  • re.match() - check if the beginning of string matches pattern, used in validation: phone, email, tax id, etc., returns re.Match or None

  • re.compile() - compile pattern into a reusable object, for example before using it in a loop, returns re.Pattern

  • re.split() - split string by pattern, returns list[str]

  • re.sub() - substitute pattern in string with something else, returns str
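The functions listed above can be sketched on the `string` from SetUp. The patterns used here (`\d+`, `<(.+)>`, and so on) are chosen only for illustration.

```python
import re

string = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

# findall: every non-overlapping match, as a list of strings
numbers = re.findall(r'\d+', string)
# ['1', '2000', '12', '00']

# finditer: the same matches, but as re.Match objects produced lazily
spans = [match.span() for match in re.finditer(r'\d+', string)]
# [(12, 13), (17, 21), (25, 27), (28, 30)]

# search: first match anywhere in the string, or None
year = re.search(r'\d{4}', string).group()
# '2000'

# match: anchored at the beginning of the string
starts_with_on = re.match(r'On', string) is not None    # True
starts_with_sun = re.match(r'Sun', string) is not None  # False

# compile: build the pattern once, reuse it many times
email = re.compile(r'<(.+)>')
address = email.search(string).group(1)
# 'alice@example.com'

# split: cut the string wherever the pattern matches
parts = re.split(r',\s*', string)
# ['On Sun', 'Jan 1st', '2000 at 12:00 AM Alice <alice@example.com> wrote']

# sub: replace every match with something else
masked = re.sub(r'\d', '#', string)
# 'On Sun, Jan #st, #### at ##:## AM Alice <alice@example.com> wrote'
```

Both `re.search()` and `re.match()` return `None` when nothing matches, so real code should check the result before calling `.group()`.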

7.1.3. Syntax

  • Character Class - what to find (single character)

  • Qualifiers - range to find (range)

  • Negation

  • Quantifiers - how many occurrences of preceding qualifier or character class

  • Groups

  • Look Ahead and Look Behind

  • Flags

  • Extensions

  • [] - Qualifier

  • {} - Quantifier

  • () - Groups
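A short sketch of the syntax elements listed above, applied to the `string` from SetUp. Each pattern is an illustrative assumption, not the only way to express the idea.

```python
import re

string = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

# Qualifier []: one character out of a range
uppercase = re.findall(r'[A-Z]', string)
# ['O', 'S', 'J', 'A', 'M', 'A']

# Negation [^...]: one character NOT in the set
punctuation = re.findall(r'[^a-zA-Z0-9 ]', string)
# [',', ',', ':', '<', '@', '.', '>']

# Quantifier {}: how many times the preceding element repeats
four_digits = re.findall(r'[0-9]{4}', string)
# ['2000']

# Groups (): capture parts of the match separately
hour, minute = re.search(r'(\d{2}):(\d{2})', string).groups()
# ('12', '00')

# Look ahead (?=...): require what follows without consuming it
ordinal_days = re.findall(r'\d+(?=st|nd|rd|th)', string)
# ['1']

# Flags: change how the whole pattern is interpreted
any_case_a = re.findall(r'a', string, flags=re.IGNORECASE)
# ['a', 'a', 'A', 'A', 'a', 'a']
```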

7.1.4. Under the Hood

  • ASCII table

  • chr()

  • ord()

  • re.DEBUG

>>> chr(97)
'a'
>>> ord('a')
97
>>> string = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>>
>>> [ord(x) for x in string]
[79, 110, 32, 83, 117, 110, 44, 32, 74, 97, 110, 32, 49, 115, 116, 44, 32, 50, 48, 48, 48, 32, 97, 116, 32, 49, 50, 58, 48, 48, 32, 65, 77, 32, 65, 108, 105, 99, 101, 32, 60, 97, 108, 105, 99, 101, 64, 101, 120, 97, 109, 112, 108, 101, 46, 99, 111, 109, 62, 32, 119, 114, 111, 116, 101]
>>> string = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> pattern = r'a'
>>>
>>> re.findall(pattern, string, flags=re.DEBUG)
LITERAL 97

 0. INFO 8 0b11 1 1 (to 9)
      prefix_skip 1
      prefix [0x61] ('a')
      overlap [0]
 9: LITERAL 0x61 ('a')
11. SUCCESS
['a', 'a', 'a', 'a']
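The link between `ord()`, `chr()` and regex ranges can be sketched as follows: a range such as `[a-z]` covers every character whose code point lies between the two endpoints. The variable names below are illustrative assumptions.

```python
import re

string = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

# [a-z] spans code points ord('a') == 97 through ord('z') == 122
lowercase = ''.join(chr(code) for code in range(ord('a'), ord('z') + 1))
# 'abcdefghijklmnopqrstuvwxyz'

# The regex range and an explicit code point comparison select the same characters
by_regex = re.findall(r'[a-z]', string)
by_ord = [char for char in string if ord('a') <= ord(char) <= ord('z')]
same = by_regex == by_ord
# True

# Pitfall: [A-z] spans code points 65..122, which also includes
# the six characters between 'Z' (90) and 'a' (97)
between = ''.join(chr(code) for code in range(ord('Z') + 1, ord('a')))
# '[\\]^_`'
surprising = re.findall(r'[A-z]', '[_]')
# ['[', '_', ']']
```

This is why `[a-zA-Z]` is the usual way to write "any ASCII letter", rather than `[A-z]`.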

7.1.5. Visualization

Figure 7.1. Visualization for pattern r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,20}$' [1]
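As a textual counterpart to the figure, the same pattern can be written with `re.VERBOSE`, which ignores whitespace outside character classes and allows comments. The per-line comments are an interpretation of the pattern, and the name `EMAIL` is an assumption for illustration.

```python
import re

EMAIL = re.compile(r"""
    ^                     # start of string
    [a-zA-Z0-9]           # first character: one ASCII letter or digit
    [\w.+-]*              # rest of the local part: word characters, dot, plus, hyphen
    @                     # literal at sign
    [a-zA-Z0-9-]+         # first part of the domain: letters, digits, hyphen
    \.                    # literal dot
    [a-zA-Z0-9-.]{2,20}   # rest of the domain: 2 to 20 letters, digits, hyphens or dots
    $                     # end of string
""", flags=re.VERBOSE)

accepted = EMAIL.match('alice@example.com') is not None      # True
no_dot = EMAIL.match('alice@example') is not None            # False
leading_dot = EMAIL.match('.alice@example.com') is not None  # False
```

Unlike the W3C pattern, this one requires at least one dot in the domain and a letter or digit as the first character.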

7.1.6. Further Reading

7.1.7. References