Intermediate Python

Introduction to Regular Expressions

Regular expressions, also known as regex, are a powerful tool for searching and manipulating text. They let you define a pattern of characters to look for in a string, and can be used for tasks such as finding all the email addresses in a document or replacing every occurrence of one word with another.

How to Use Regular Expressions

In Python, regular expressions are supported by the re module from the standard library. To use them, import the re module and then call one of its functions to search or manipulate a string.

For example, to search for a pattern in a string, you can use the search() function:

import re

pattern = r'\d+'  # a pattern to match one or more digits
string = 'There are 3 apples and 2 bananas'
match = re.search(pattern, string)
print(match.group())  # Output: 3

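In this example, the search() function scans the string for the first occurrence of the pattern defined by pattern and returns a match object. The group() method of the match object returns the matched text. If the pattern is not found, search() returns None, so check the result before calling group() whenever a match is not guaranteed.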
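Because search() returns None when nothing matches, calling group() on the result without a check raises an AttributeError. The following is a minimal sketch of guarding against that; the helper name first_number is hypothetical and only used for illustration.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r'\d+', text)
    # search() returns None when nothing matches, so test before calling group()
    if match is None:
        return None
    return match.group()

print(first_number('There are 3 apples and 2 bananas'))  # 3
print(first_number('No numbers here'))  # None
```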

To find all the occurrences of a pattern in a string, you can use the findall() function:

import re

pattern = r'\d+'  # a pattern to match one or more digits
string = 'There are 3 apples and 2 bananas'
matches = re.findall(pattern, string)
print(matches)  # Output: ['3', '2']

In this example, the findall() function searches the string for all the occurrences of the pattern defined by pattern, and returns a list of all the matches.
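Two details of findall() are worth seeing in a short sketch. When the pattern contains capturing groups, findall() returns the group contents rather than the whole match, and when you need positions as well as text, the related finditer() function yields match objects instead of strings. The variable names below are illustrative.

```python
import re

text = 'There are 3 apples and 2 bananas'

# with two capturing groups, findall() returns one tuple of groups per match
pairs = re.findall(r'(\d+) (\w+)', text)
print(pairs)  # [('3', 'apples'), ('2', 'bananas')]

# finditer() yields Match objects, so the start index of each match is available
positions = [(m.group(), m.start()) for m in re.finditer(r'\d+', text)]
print(positions)  # [('3', 10), ('2', 23)]
```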

To replace all the occurrences of a pattern in a string, you can use the sub() function:

import re

pattern = r'\d+'  # a pattern to match one or more digits
replacement = 'X'  # the string to use as a replacement
string = 'There are 3 apples and 2 bananas'
new_string = re.sub(pattern, replacement, string)
print(new_string)  # Output: There are X apples and X bananas

In this example, the sub() function replaces all the occurrences of the pattern defined by pattern in the string with the replacement string.
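As a small extension of the example above, sub() also accepts a count argument to limit the number of replacements, and the replacement can be a function that receives each match object. This is a sketch; the function name double is made up for the example.

```python
import re

text = 'There are 3 apples and 2 bananas'

# count=1 limits the substitution to the first match only
first_only = re.sub(r'\d+', 'X', text, count=1)
print(first_only)  # There are X apples and 2 bananas

def double(match):
    # called once per match; the returned string replaces the matched text
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, text)
print(doubled)  # There are 6 apples and 4 bananas
```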

Regular expressions are written in a special syntax in which certain characters have a special meaning. For example, the . character matches any single character except a newline, the * character matches zero or more occurrences of the preceding element (a character, character class, or group), and the + character matches one or more occurrences of the preceding element.
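The difference between these three characters is easiest to see side by side. The sample strings in this sketch are arbitrary and chosen only to show which inputs each pattern accepts.

```python
import re

# . stands for exactly one character, so 'ac' (nothing between a and c) is skipped
dot = re.findall(r'a.c', 'abc a-c ac')
# * allows zero or more b's between a and c
star = re.findall(r'ab*c', 'ac abc abbc')
# + requires at least one b between a and c
plus = re.findall(r'ab+c', 'ac abc abbc')

print(dot)   # ['abc', 'a-c']
print(star)  # ['ac', 'abc', 'abbc']
print(plus)  # ['abc', 'abbc']
```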

Examples of Regular Expressions

Here are some examples of regular expressions and their meanings:

  • \d: a digit ([0-9]; for str patterns Python also matches other Unicode decimal digits unless the re.ASCII flag is set)
  • \w: a word character ([a-zA-Z0-9_]; for str patterns Python also matches Unicode letters and digits unless the re.ASCII flag is set)
  • \s: a whitespace character (space, tab, newline, etc.)
  • .: any single character except a newline
  • *: zero or more occurrences of the preceding element (a character, character class, or group)
  • +: one or more occurrences of the preceding element
  • ?: zero or one occurrence of the preceding element
  • {n}: exactly n occurrences of the preceding element
  • {n,}: at least n occurrences of the preceding element
  • {m,n}: at least m and at most n occurrences of the preceding element
  • [abc]: any single character from the set {a, b, c}
  • [^abc]: any single character that is not in the set {a, b, c}
  • (abc): a capturing group that matches the sequence abc
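A few of the items in the list above, sketched in code. The sample strings are arbitrary. The last two lines show how \w behaves on a non-ASCII letter with and without the re.ASCII flag.

```python
import re

# {m,n} after a character class: runs of 2 to 4 lowercase vowels
vowel_runs = re.findall(r'[aeiou]{2,4}', 'queue book sky')
print(vowel_runs)  # ['ueue', 'oo']

# a quantifier after a group repeats the whole group, not only its last character
repeated = re.search(r'(ab)+', 'xababab').group()
print(repeated)  # ababab

# [^...] negates the set: runs of characters that are neither digits nor whitespace
non_digits = re.findall(r'[^\d\s]+', 'a1 b22 cc')
print(non_digits)  # ['a', 'b', 'cc']

# ? makes the preceding element optional
spellings = re.findall(r'colou?r', 'color colour')
print(spellings)  # ['color', 'colour']

# \w is Unicode-aware for str patterns; re.ASCII restricts it to [a-zA-Z0-9_]
unicode_words = re.findall(r'\w+', 'café')
ascii_words = re.findall(r'\w+', 'café', flags=re.ASCII)
print(unicode_words, ascii_words)  # ['café'] ['caf']
```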

You can use these special characters to define a pattern that matches a specific set of characters. For example:

import re

pattern = r'\d{3}-\d{3}-\d{4}'  # a pattern to match a US phone number
string = 'My phone number is 555-555-1212'
match = re.search(pattern, string)
print(match.group())  # Output: 555-555-1212

In this example, the pattern matches a US phone number with the format XXX-XXX-XXXX.
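Note that search() is satisfied as soon as the pattern appears anywhere in the string, even inside a longer run of digits. If the goal is to validate that a whole string is a phone number in this format, one option is fullmatch(), sketched below. The names phone_re and is_phone_number are hypothetical, and re.compile() is used only because the pattern is reused.

```python
import re

# compile once when the same pattern is used several times
phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')

def is_phone_number(text):
    """Return True only if the whole string has the form XXX-XXX-XXXX."""
    return phone_re.fullmatch(text) is not None

print(is_phone_number('555-555-1212'))   # True
print(is_phone_number('555-555-12125'))  # False: one digit too many

# search() still finds a hit inside the longer string
partial_hit = phone_re.search('1555-555-12125') is not None
print(partial_hit)  # True
```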

You can also use regular expressions to split a string into a list of substrings, using the split() function:

import re

pattern = r'\s+'  # a pattern to match one or more whitespace characters
string = 'This  is  a  test'
substrings = re.split(pattern, string)
print(substrings)  # Output: ['This', 'is', 'a', 'test']

In this example, the pattern matches one or more whitespace characters, and the split() function splits the string into a list of substrings, using the whitespace characters as the delimiter.
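Two behaviours of split() are easy to miss, so here is a short sketch with illustrative inputs: delimiters at the start or end of the string produce empty strings in the result, and the maxsplit argument limits how many splits are made. For plain whitespace, the built-in str.split() with no arguments already discards the empty pieces.

```python
import re

# leading and trailing delimiters produce empty strings in the result
padded = re.split(r'\s+', '  This  is  a  test ')
print(padded)  # ['', 'This', 'is', 'a', 'test', '']

# maxsplit limits the number of splits; the rest of the string is kept intact
head_tail = re.split(r'\s+', 'This  is  a  test', maxsplit=1)
print(head_tail)  # ['This', 'is  a  test']

# str.split() with no arguments handles whitespace without a regular expression
plain = '  This  is  a  test '.split()
print(plain)  # ['This', 'is', 'a', 'test']
```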

Conclusion

Regular expressions are very useful for searching and manipulating text, but complex patterns quickly become hard to read and debug. Use them when simpler string methods are not enough, and test each pattern carefully, including against inputs that should not match, to make sure it works as intended.
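One lightweight way to test a pattern is a small table of inputs and expected results, as in the sketch below. It reuses the phone number pattern from earlier in the lesson; the test cases themselves are made up for illustration.

```python
import re

pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# (input, expected) pairs, including inputs that should NOT match
cases = [
    ('555-555-1212', True),
    ('5555551212', False),
    ('555-555-121', False),
    ('', False),
]

results = []
for text, expected in cases:
    actual = pattern.fullmatch(text) is not None
    # fail loudly with the offending input if the pattern misbehaves
    assert actual == expected, f'{text!r}: expected {expected}, got {actual}'
    results.append(actual)

print('all cases passed')
```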

Exercises

To review these concepts, we will go through a series of exercises designed to test your understanding and apply what you have learned.

Use a regular expression to extract the domain name from a list of email addresses.

import re

pattern = r'@([\w.]+)'  # a pattern to match the domain part of an email address
emails = ['john@example.com', 'jane@example.net', 'bob@example.org']

for email in emails:
    match = re.search(pattern, email)
    domain = match.group(1)
    print(domain)

# Output:
# example.com
# example.net
# example.org

Use a regular expression to extract all the hashtags from a tweet.

import re

pattern = r'#(\w+)'  # a pattern to match hashtags
tweet = 'Check out this cool new feature #newfeature #awesome'

matches = re.findall(pattern, tweet)
print(matches)

# Output: ['newfeature', 'awesome']

Use a regular expression to replace all the occurrences of a word with another word in a string.

import re

pattern = r'\bold\b'  # a pattern to match the word "old"
replacement = 'new'  # the replacement string
string = 'This old house is in need of some repairs'

new_string = re.sub(pattern, replacement, string)
print(new_string)

# Output: This new house is in need of some repairs
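The \b in this pattern is a word boundary: it matches the position between a word character and a non-word character without consuming any text. The sketch below, using an invented sentence, shows what happens when the boundaries are left out.

```python
import re

text = 'This old house has a bold, golden door'

# without \b, "old" is also replaced inside "bold" and "golden"
without_boundary = re.sub(r'old', 'new', text)
print(without_boundary)  # This new house has a bnew, gnewen door

# with \b on both sides, only the standalone word is replaced
with_boundary = re.sub(r'\bold\b', 'new', text)
print(with_boundary)  # This new house has a bold, golden door
```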

Use a regular expression to split a string into a list of words, ignoring punctuation.

import re

pattern = r'\W+'  # a pattern to match one or more non-word characters (punctuation and whitespace)
string = 'This, is a test! Do you see any words?'

# the final "?" leaves an empty string at the end of the split, so filter it out
words = [word for word in re.split(pattern, string) if word]
print(words)

# Output: ['This', 'is', 'a', 'test', 'Do', 'you', 'see', 'any', 'words']

Use a regular expression to match a string that starts with a digit and ends with a letter.

import re

pattern = r'^\d.*[a-zA-Z]$'  # a pattern to match a string that starts with a digit and ends with a letter
string1 = '1abc'
string2 = 'abc1'
string3 = '1abc1'

print(re.match(pattern, string1))  # Output: <re.Match object; span=(0, 4), match='1abc'>
print(re.match(pattern, string2))  # Output: None
print(re.match(pattern, string3))  # Output: None

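In this example, the pattern matches a string that starts with a digit and ends with a letter. The match() function tries the pattern only at the beginning of the string; it returns a Match object if the pattern matches there, and None if it does not. The $ at the end of the pattern is what requires the match to extend to the end of the string (or to just before a final newline).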
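Since match(), search(), and fullmatch() differ only in where the pattern is allowed to match, a side-by-side sketch may help. The test string and variable names are illustrative.

```python
import re

text = 'abc1'

# match() tries the pattern only at position 0
starts_with_digit = re.match(r'\d', text) is not None
# search() tries every position in the string
contains_digit = re.search(r'\d', text) is not None
# fullmatch() requires the pattern to cover the entire string
is_single_digit = re.fullmatch(r'\d', text) is not None
# match() succeeds on a prefix even when more text follows
prefix_ok = re.match(r'\d', '1abc') is not None

print(starts_with_digit)  # False
print(contains_digit)     # True
print(is_single_digit)    # False
print(prefix_ok)          # True
```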