Searching and Replacing Text Using Regular Expressions

Regular expressions, also known as regex, are a powerful tool for searching and manipulating text. In Python, the re module provides functions for working with regular expressions. With regular expressions, you can define a pattern of characters to search for in a string, and perform tasks such as replacing all the occurrences of a word with another word.

Searching Text Using Regular Expressions

One of the most common tasks you might need to do with regular expressions is searching for a specific pattern in a string. To search for a pattern in a string, you can use the search() function of the re module:

import re

pattern = r'\bthe\b'  # a pattern to match the word "the"
string = 'The quick brown fox jumps over the lazy dog.'

match = re.search(pattern, string)
if match:
    print(match.group())  # Output: the

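In this example, the search() function scans the string for the first occurrence of the pattern. Matching is case-sensitive by default, so the capitalized "The" at the start of the sentence is skipped and the first match is the lowercase "the" after "over". If a match is found, search() returns a Match object, and its group() method returns the matched text. If nothing matches, search() returns None, which is why the example checks match before calling group().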
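A Match object carries more than the matched text. The following is a minimal sketch of inspecting one and of handling the no-match case; the variable names and the word "cat" are chosen only for illustration.

```python
import re

text = 'The quick brown fox jumps over the lazy dog.'

# Matching is case-sensitive by default, so the capitalized "The" is skipped.
match = re.search(r'\bthe\b', text)
first_word = match.group()   # the matched text
first_span = match.span()    # (start, end) indexes into text

# search() returns None when nothing matches, so check before calling group().
missing = re.search(r'\bcat\b', text)

print(first_word, first_span)  # the (31, 34)
print(missing)                 # None
```

Slicing the original string with the span, as in text[31:34], gives back the matched text.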

To find all the occurrences of a pattern in a string, you can use the finditer() function:

import re

pattern = r'\bthe\b'  # a pattern to match the word "the"
string = 'The quick brown fox jumps over the lazy dog.'

# Matching is case-sensitive, so only the lowercase "the" is found.
for match in re.finditer(pattern, string):
    print(match.group())  # Output: the

In this example, the finditer() function searches the string for all the occurrences of the pattern defined by pattern, and returns an iterator of Match objects. You can use a for loop to iterate through the Match objects, and use the group() method to get the matched string for each one.
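Because matching is case-sensitive by default, the loop above finds only the lowercase "the". If the capitalized "The" should count as well, one option is the re.IGNORECASE flag. This is a small sketch, with the variable names chosen for illustration:

```python
import re

text = 'The quick brown fox jumps over the lazy dog.'

# re.IGNORECASE makes the pattern match "The" as well as "the".
# Each Match object also reports where the match starts.
found = [
    (m.group(), m.start())
    for m in re.finditer(r'\bthe\b', text, flags=re.IGNORECASE)
]
print(found)  # [('The', 0), ('the', 31)]
```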

Replacing Text Using Regular Expressions

To replace all the occurrences of a pattern in a string, you can use the sub() function:

import re

pattern = r'\bthe\b'  # a pattern to match the word "the"
replacement = 'a'  # the replacement string
string = 'The quick brown fox jumps over the lazy dog.'

# Matching is case-sensitive, so the capitalized "The" is left unchanged.
new_string = re.sub(pattern, replacement, string)
print(new_string)  # Output: The quick brown fox jumps over a lazy dog.

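In this example, the sub() function replaces every match of the pattern in the string with the replacement string and returns the result as a new string; the original string is not modified. Only the lowercase "the" matches here, because matching is case-sensitive by default.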
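The sub() function accepts a few optional arguments that are often useful. The sketch below shows three variations on the same sentence: a case-insensitive replacement, a replacement limited by count, and a callable replacement. The helper keep_case is a hypothetical name introduced only for this illustration.

```python
import re

text = 'The quick brown fox jumps over the lazy dog.'

# flags=re.IGNORECASE also replaces the capitalized "The".
all_replaced = re.sub(r'\bthe\b', 'a', text, flags=re.IGNORECASE)

# count=1 limits the substitution to the first match.
first_only = re.sub(r'\bthe\b', 'a', text, count=1, flags=re.IGNORECASE)

# A callable replacement receives each Match object and returns the new text.
def keep_case(m):
    return 'A' if m.group()[0].isupper() else 'a'

case_preserved = re.sub(r'\bthe\b', keep_case, text, flags=re.IGNORECASE)

print(all_replaced)    # a quick brown fox jumps over a lazy dog.
print(first_only)      # a quick brown fox jumps over the lazy dog.
print(case_preserved)  # A quick brown fox jumps over a lazy dog.
```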

Regular expressions can also be used to split a string into substrings, using the split() function:

import re

pattern = r'\s+'  # a pattern to match one or more whitespace characters
string = 'This  is  a  test'

substrings = re.split(pattern, string)
print(substrings)  # Output: ['This', 'is', 'a', 'test']

In this example, the pattern matches one or more whitespace characters, and the split() function splits the string into a list of substrings, using the whitespace characters as the delimiter.
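Two further behaviors of split() are worth knowing: maxsplit limits the number of splits, and a capturing group in the pattern keeps the delimiters in the result. A short sketch, with the second input string made up for illustration:

```python
import re

text = 'This  is  a  test'

# maxsplit=1 stops after the first split; the remainder is left untouched.
head_and_rest = re.split(r'\s+', text, maxsplit=1)

# Wrapping the pattern in a capturing group keeps the delimiters in the result.
with_delims = re.split(r'(\s+)', 'a  b c')

print(head_and_rest)  # ['This', 'is  a  test']
print(with_delims)    # ['a', '  ', 'b', ' ', 'c']
```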

You can also use the findall() function to extract all the occurrences of a pattern from a string:

import re

pattern = r'\b\d+\b'  # a pattern to match one or more digits
string = 'There are 100 apples and 50 oranges.'

matches = re.findall(pattern, string)
print(matches)  # Output: ['100', '50']

In this example, the pattern matches one or more digits, and the findall() function returns a list of all the matches.
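When the pattern contains capturing groups, findall() returns the groups rather than the whole match. With two groups, each result is a tuple. The sketch below uses this to pair each number with the word that follows it; the pattern and the counts dictionary are illustrative choices.

```python
import re

text = 'There are 100 apples and 50 oranges.'

# With two capturing groups, findall() returns a list of (group1, group2) tuples.
pairs = re.findall(r'(\d+) (\w+)', text)

# Turn the pairs into a dictionary that maps each item to its count.
counts = {item: int(number) for number, item in pairs}

print(pairs)   # [('100', 'apples'), ('50', 'oranges')]
print(counts)  # {'apples': 100, 'oranges': 50}
```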

Conclusion

Regular expressions are very useful for searching and manipulating text, but patterns can quickly become hard to read and debug. For simple tasks, plain string methods such as str.replace() and str.split(), or the in operator, are often clearer. Use regular expressions when the task genuinely needs a pattern, and test each pattern against inputs that should match as well as inputs that should not.
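One simple way to test a pattern is to keep a small table of inputs and the matches you expect from each. The sketch below does this for the whole-number pattern used earlier; the test cases are hypothetical and would be replaced with examples from your own data.

```python
import re

# Compile once so the same pattern object can be reused across checks.
NUMBER = re.compile(r'\b\d+\b')

# Hypothetical test cases: input text -> matches we expect.
cases = {
    'There are 100 apples': ['100'],
    'room42 is not a standalone number': [],
    '3 and 4': ['3', '4'],
    '': [],
}

results = {text: NUMBER.findall(text) for text in cases}
failures = [text for text, expected in cases.items() if results[text] != expected]

print(failures)  # an empty list means every case behaved as expected
```

Including inputs that should not match, such as "room42", helps catch patterns that are too permissive.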

Exercises

To review these concepts, we will go through a series of exercises designed to test your understanding and apply what you have learned.

Use a regular expression to extract the domain name from a list of email addresses.

import re

pattern = r'@([\w.]+)'  # a pattern to match the domain part of an email address
emails = ['john@example.com', 'jane@example.net', 'bob@example.org']

for email in emails:
    match = re.search(pattern, email)
    domain = match.group(1)
    print(domain)

# Output:
# example.com
# example.net
# example.org
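The solution above assumes every string contains an "@" and that domains contain only word characters and dots. A slightly more defensive sketch, assuming hyphenated domains and malformed entries may appear in the input, could look like this; the extra addresses are made up for illustration.

```python
import re

# Allow hyphens in the domain and anchor the match at the end of the string.
DOMAIN = re.compile(r'@([\w.-]+)$')

addresses = ['john@example.com', 'ann@my-site.co.uk', 'not-an-email']

domains = []
for address in addresses:
    m = DOMAIN.search(address)
    if m:  # skip entries with no "@domain" part instead of raising an error
        domains.append(m.group(1))

print(domains)  # ['example.com', 'my-site.co.uk']
```

This is still not a full email validator; it only extracts the part after the "@".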

Use a regular expression to extract all the words that start with a vowel from a string.

import re

pattern = r'\b[aeiouAEIOU]\w*'  # a vowel at a word boundary, then zero or more word characters
string = 'This is a test sentence with some vowels'
matches = re.findall(pattern, string)
print(matches)  # Output: ['is', 'a']

Use a regular expression to replace all the numbers in a string with the word “NUMBER”.

import re

pattern = r'\b\d+\b'  # a pattern to match one or more digits surrounded by word boundaries
replacement = 'NUMBER'  # the replacement string
string = 'There are 3 apples and 2 bananas'
new_string = re.sub(pattern, replacement, string)
print(new_string)  # Output: There are NUMBER apples and NUMBER bananas

Use a regular expression to split a string into a list of words, ignoring punctuation.

import re

pattern = r'\W+'  # a pattern to match runs of non-word characters (punctuation and whitespace)
string = 'This, is a test! Do you see any words?'

# Splitting leaves an empty string after the final "?", so filter it out.
words = [word for word in re.split(pattern, string) if word]
print(words)  # Output: ['This', 'is', 'a', 'test', 'Do', 'you', 'see', 'any', 'words']

Use a regular expression to match a string that starts with a digit and ends with a letter.

import re

pattern = r'^\d.*[a-zA-Z]$'  # a pattern to match a string that starts with a digit and ends with a letter
string1 = '1abc'
string2 = 'abc1'
string3 = '1abc1'

print(re.match(pattern, string1))  # Output: <re.Match object; span=(0, 4), match='1abc'>
print(re.match(pattern, string2))  # Output: None
print(re.match(pattern, string3))  # Output: None

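In this example, the pattern matches a string that starts with a digit and ends with a letter. The match() function only tries to match at the beginning of the string, so the ^ anchor is redundant here but makes the intent explicit, while the $ anchor requires the final letter to sit at the end of the string. match() returns a Match object if the pattern matches, and None if it does not.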
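The re module offers three related functions that differ in where they are allowed to match. The following sketch contrasts them using a simpler, unanchored pattern chosen only for illustration:

```python
import re

# match() only matches at the start of the string, search() scans the
# whole string, and fullmatch() requires the entire string to match.
pattern = r'\d[a-z]+'

at_start = re.match(pattern, '1abc!')       # matches '1abc'
not_at_start = re.match(pattern, 'x1abc')   # None: the string starts with 'x'
anywhere = re.search(pattern, 'x1abc')      # finds '1abc' after the 'x'
whole = re.fullmatch(pattern, '1abc!')      # None: the '!' is left over

print(at_start.group(), not_at_start, anywhere.group(), whole)
```

When the whole string must match, fullmatch() can be a clearer alternative to adding ^ and $ to the pattern.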