Regular expressions, also known as regex, are a powerful tool for searching and manipulating text. In Python, the re module provides functions for working with regular expressions. With regular expressions, you can define a pattern of characters to search for in a string and perform tasks such as replacing every occurrence of a word with another word.
Searching Text Using Regular Expressions
One of the most common tasks with regular expressions is searching for a specific pattern in a string. To do this, you can use the search() function of the re module:
import re
pattern = r'\bthe\b' # a pattern to match the word "the"
string = 'The quick brown fox jumps over the lazy dog.'
match = re.search(pattern, string)
if match:
    print(match.group())  # Output: the
In this example, the search() function scans the string for the first occurrence of the pattern. Matching is case-sensitive, so the capitalized "The" at the start is skipped and the lowercase "the" is found. If a match is found, search() returns a Match object; otherwise it returns None. You can use the group() method of the Match object to get the matched string.
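The behaviour described above can be checked directly. The following is a minimal sketch that reuses the same pattern and string; the variable names match_ci and missing are illustrative choices, not part of the original example. It shows where the match was found via span(), how the re.IGNORECASE flag changes which occurrence comes first, and what search() returns when nothing matches.

```python
import re

string = 'The quick brown fox jumps over the lazy dog.'

# Case-sensitive by default: the capitalized "The" at index 0 is skipped.
match = re.search(r'\bthe\b', string)
print(match.group(), match.span())  # the (31, 34)

# With re.IGNORECASE the first occurrence is the leading "The".
match_ci = re.search(r'\bthe\b', string, re.IGNORECASE)
print(match_ci.group(), match_ci.span())  # The (0, 3)

# No match at all: search() returns None, so check before calling group().
missing = re.search(r'\bcat\b', string)
print(missing)  # None
```

Because a failed search returns None, calling group() without the if check would raise an AttributeError.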
To find all the occurrences of a pattern in a string, you can use the finditer() function:
import re
pattern = r'\bthe\b' # a pattern to match the word "the"
string = 'The quick brown fox jumps over the lazy dog.'
# re.IGNORECASE makes the pattern match "The" as well as "the"
for match in re.finditer(pattern, string, re.IGNORECASE):
    print(match.group())  # Prints: The, then the
In this example, the finditer() function searches the string for all the occurrences of the pattern and returns an iterator of Match objects. You can use a for loop to iterate through the Match objects and call the group() method on each one to get the matched string.
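Because finditer() yields Match objects and not plain strings, each result also knows where it sits in the text. As a rough sketch of that idea, the hypothetical helper mark_matches() below (not part of the re module) uses start() and end() to draw a caret under every matched character.

```python
import re

def mark_matches(pattern, text):
    """Return a line with '^' under every character covered by a match."""
    markers = [' '] * len(text)
    for m in re.finditer(pattern, text):
        # start() and end() give the half-open index range of this match
        for i in range(m.start(), m.end()):
            markers[i] = '^'
    return ''.join(markers).rstrip()

text = 'the fox and the dog'
print(text)
print(mark_matches(r'\bthe\b', text))
```

The second printed line has carets at positions 0 to 2 and 12 to 14, directly under the two occurrences of "the".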
Replacing Text Using Regular Expressions
To replace all the occurrences of a pattern in a string, you can use the sub() function:
import re
pattern = r'\bthe\b' # a pattern to match the word "the"
replacement = 'a' # the replacement string
string = 'The quick brown fox jumps over the lazy dog.'
new_string = re.sub(pattern, replacement, string, flags=re.IGNORECASE) # ignore case so the leading "The" is replaced too
print(new_string) # Output: a quick brown fox jumps over a lazy dog.
In this example, the sub() function replaces all the occurrences of the pattern in the string with the replacement string and returns the result as a new string; the original string is left unchanged.
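The sub() function accepts more than a fixed replacement string. The sketch below, using made-up sample strings, illustrates three variations: limiting the number of replacements with count, referring to captured groups with \1 and \2 in the replacement, and passing a function that computes each replacement from the Match object.

```python
import re

string = 'the cat and the hat and the bat'

# count=1 replaces only the first occurrence.
first_only = re.sub(r'\bthe\b', 'a', string, count=1)
print(first_only)  # a cat and the hat and the bat

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+) and (\w+)', r'\2 and \1', 'salt and pepper')
print(swapped)  # pepper and salt

# A function receives each Match object and returns the replacement text.
shouted = re.sub(r'\b\w*at\b', lambda m: m.group().upper(), string)
print(shouted)  # the CAT and the HAT and the BAT
```

A replacement function is the usual choice when the new text depends on what was matched, as in the upper-casing case here.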
Regular expressions can also be used to split a string into substrings, using the split() function:
import re
pattern = r'\s+' # a pattern to match one or more whitespace characters
string = 'This is a test'
substrings = re.split(pattern, string)
print(substrings) # Output: ['This', 'is', 'a', 'test']
In this example, the pattern matches one or more whitespace characters, and the split() function splits the string into a list of substrings, using each run of whitespace as the delimiter.
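The difference from the built-in str.split() shows up when the whitespace is irregular. The following sketch uses an assumed sample string with a double space and a tab, and also illustrates the maxsplit argument and the effect of wrapping the pattern in a capturing group.

```python
import re

string = 'This  is\ta test'  # two spaces, then a tab, then single spaces

# Splitting on a single space leaves an empty string and an embedded tab.
print(string.split(' '))                     # ['This', '', 'is\ta', 'test']

# \s+ treats any run of whitespace as one delimiter.
print(re.split(r'\s+', string))              # ['This', 'is', 'a', 'test']

# maxsplit limits the number of splits performed.
print(re.split(r'\s+', string, maxsplit=1))  # ['This', 'is\ta test']

# A capturing group keeps the delimiters in the result.
print(re.split(r'(\s+)', 'a b'))             # ['a', ' ', 'b']
```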
You can also use the findall() function to extract all the occurrences of a pattern from a string:
import re
pattern = r'\b\d+\b' # a pattern to match one or more digits
string = 'There are 100 apples and 50 oranges.'
matches = re.findall(pattern, string)
print(matches) # Output: ['100', '50']
In this example, the pattern matches runs of one or more digits delimited by word boundaries, and the findall() function returns a list of all the matches as strings.
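What findall() returns depends on the capturing groups in the pattern. As a small sketch built on the same sentence (the names pairs and totals are illustrative), a pattern with no groups yields whole matches, one group yields that group's text, and several groups yield tuples.

```python
import re

string = 'There are 100 apples and 50 oranges.'

# No groups: a list of whole matches.
print(re.findall(r'\d+', string))        # ['100', '50']

# One group: a list containing only that group's text.
print(re.findall(r'\d+ (\w+)', string))  # ['apples', 'oranges']

# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r'(\d+) (\w+)', string)
print(pairs)                             # [('100', 'apples'), ('50', 'oranges')]

# The matches are strings, so convert explicitly before doing arithmetic.
totals = {fruit: int(count) for count, fruit in pairs}
print(totals)                            # {'apples': 100, 'oranges': 50}
```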
Conclusion
Regular expressions can be very useful for searching and manipulating text, but complex patterns quickly become hard to read and debug. Where a plain string method such as str.replace() or str.split() does the job, prefer it, and test your regular expressions against both matching and non-matching inputs to make sure they work as intended.
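One lightweight way to test a pattern is to keep a small table of inputs and expected results next to it. The sketch below is only one possible approach: the names WORD_THE, CASES and check() are hypothetical, and the cases were chosen to include inputs that should not match.

```python
import re

# Hypothetical pattern under test: the whole word "the", in any case.
WORD_THE = re.compile(r'\bthe\b', re.IGNORECASE)

# Each entry is (input text, expected list of matches).
CASES = [
    ('the end', ['the']),
    ('The End', ['The']),
    ('other', []),        # "the" inside another word must not match
    ('bathe, then', []),  # neither a suffix nor a prefix counts
]

def check(compiled, cases):
    """Return the inputs whose findall() result differs from the expectation."""
    return [text for text, expected in cases if compiled.findall(text) != expected]

failures = check(WORD_THE, CASES)
print(failures)  # []
```

An empty list means every case behaved as expected; any input that appears in the list points to a pattern that is too strict or too loose.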
Exercises
To review these concepts, we will go through a series of exercises designed to test your understanding and apply what you have learned.
Use a regular expression to extract the domain name from a list of email addresses.
import re
pattern = r'@([\w.]+)' # a pattern to match the domain part of an email address
emails = ['john@example.com', 'jane@example.net', 'bob@example.org']
for email in emails:
    match = re.search(pattern, email)
    if match: # search() returns None when the address has no "@domain" part
        domain = match.group(1)
        print(domain)
# Output:
# example.com
# example.net
# example.org
Use a regular expression to extract all the words that start with a vowel from a string.
import re
pattern = r'\b[aeiouAEIOU]\w*' # a vowel at a word boundary, followed by zero or more word characters
string = 'This is a test sentence with some vowels'
matches = re.findall(pattern, string)
print(matches) # Output: ['is', 'a']
Use a regular expression to replace all the numbers in a string with the word “NUMBER”.
import re
pattern = r'\b\d+\b' # a pattern to match one or more digits surrounded by word boundaries
replacement = 'NUMBER' # the replacement string
string = 'There are 3 apples and 2 bananas'
new_string = re.sub(pattern, replacement, string)
print(new_string) # Output: There are NUMBER apples and NUMBER bananas
Use a regular expression to split a string into a list of words, ignoring punctuation.
import re
pattern = r'\W+' # a pattern to match one or more non-word characters (punctuation and whitespace)
string = 'This, is a test! Do you see any words?'
words = [word for word in re.split(pattern, string) if word] # drop the empty string left after the trailing "?"
print(words) # Output: ['This', 'is', 'a', 'test', 'Do', 'you', 'see', 'any', 'words']
Use a regular expression to match a string that starts with a digit and ends with a letter.
import re
pattern = r'^\d.*[a-zA-Z]$' # a pattern to match a string that starts with a digit and ends with a letter
string1 = '1abc'
string2 = 'abc1'
string3 = '1abc1'
print(re.match(pattern, string1)) # Output: <re.Match object; span=(0, 4), match='1abc'>
print(re.match(pattern, string2)) # Output: None
print(re.match(pattern, string3)) # Output: None
In this example, the pattern matches a string that starts with a digit and ends with a letter. The match() function tries to match the pattern at the beginning of the string only; it returns a Match object if the pattern matches there, and None if it does not.
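The re module offers three related functions that differ in where the match is allowed to occur. The following sketch, using an assumed digits-only pattern, contrasts search(), match() and fullmatch() on the same kind of input.

```python
import re

pattern = r'\d+'
string = 'abc123def'

# search() looks for the pattern anywhere in the string.
found = re.search(pattern, string)
print(found.group())                  # 123

# match() only tries at the beginning of the string.
print(re.match(pattern, string))      # None

# fullmatch() requires the entire string to match.
whole = re.fullmatch(pattern, '123')
print(whole.group())                  # 123
print(re.fullmatch(pattern, '123x'))  # None
```

When a pattern must describe the whole string, fullmatch() expresses that intent without writing the ^ and $ anchors by hand.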