Regular expressions, also known as regex, are a powerful tool for searching and manipulating text. They let you define a pattern of characters to look for in a string, and can be used for tasks such as finding all the email addresses in a document or replacing every occurrence of one word with another.
How to Use Regular Expressions
In Python, regular expressions are supported by the re module. To use them, first import the re module, and then call one of the functions it provides to search or manipulate a string.
For example, to search for a pattern in a string, you can use the search() function:
import re
pattern = r'\d+' # a pattern to match one or more digits
string = 'There are 3 apples and 2 bananas'
match = re.search(pattern, string)
print(match.group()) # Output: 3
In this example, the search() function scans string for the first occurrence of the pattern defined by pattern. The group() method of the match object returns the matched text.
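One detail the example above does not show: search() returns None when nothing matches, so calling group() on the result directly would raise an AttributeError. The following is a minimal sketch of a guarded lookup; the helper name first_number is made up for this illustration.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r'\d+', text)
    # search() returns None when nothing matches, so check before calling group()
    if match is None:
        return None
    return match.group()

print(first_number('There are 3 apples and 2 bananas'))  # Output: 3
print(first_number('No digits here'))                    # Output: None
```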
To find all the occurrences of a pattern in a string, you can use the findall() function:
import re
pattern = r'\d+' # a pattern to match one or more digits
string = 'There are 3 apples and 2 bananas'
matches = re.findall(pattern, string)
print(matches) # Output: ['3', '2']
In this example, the findall() function scans string for every non-overlapping occurrence of the pattern defined by pattern, and returns a list of all the matches.
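Two related behaviours are easy to miss. When you also need the position of each match, finditer() yields Match objects instead of plain strings. When the pattern contains a capturing group, findall() returns the group contents and not the whole match. A short sketch, using an example sentence chosen for this illustration:

```python
import re

text = 'There are 3 apples and 12 bananas'

# finditer() yields Match objects, which carry the position of each match
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]
print(positions)  # Output: [('3', 10, 11), ('12', 23, 25)]

# With one capturing group, findall() returns the group contents, not the whole match
fruits = re.findall(r'\d+ (\w+)', text)
print(fruits)  # Output: ['apples', 'bananas']
```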
To replace all the occurrences of a pattern in a string, you can use the sub() function:
import re
pattern = r'\d+' # a pattern to match one or more digits
replacement = 'X' # the string to use as a replacement
string = 'There are 3 apples and 2 bananas'
new_string = re.sub(pattern, replacement, string)
print(new_string) # Output: There are X apples and X bananas
In this example, the sub() function replaces every occurrence of the pattern defined by pattern in string with the replacement string, and returns the result as a new string.
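sub() also accepts a few variations that are worth knowing. The sketch below illustrates three of them: limiting the number of replacements with count, referring to the matched text with \g<0>, and passing a function that computes each replacement. The variable names are chosen for this example only.

```python
import re

text = 'There are 3 apples and 2 bananas'

# count=1 limits the substitution to the first match
first_only = re.sub(r'\d+', 'X', text, count=1)
print(first_only)  # Output: There are X apples and 2 bananas

# \g<0> in the replacement refers to the whole matched text
bracketed = re.sub(r'\d+', r'[\g<0>]', text)
print(bracketed)  # Output: There are [3] apples and [2] bananas

# A function receives each Match object and returns the replacement string
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)
print(doubled)  # Output: There are 6 apples and 4 bananas
```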
Regular expressions are defined using a special syntax in which some characters have a special meaning. For example, the . character matches any single character except a newline, the * character matches zero or more occurrences of the preceding element (a character, set, or group), and the + character matches one or more occurrences of the preceding element.
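The difference between these three characters is easiest to see side by side. A small sketch, with test strings invented for the purpose:

```python
import re

# '.' matches any single character except a newline
dot_matches = re.findall(r'c.t', 'cat cot cut ct')
print(dot_matches)   # Output: ['cat', 'cot', 'cut']

# '*' allows zero or more of the preceding element, so 'ct' matches too
star_matches = re.findall(r'ca*t', 'ct cat caat')
print(star_matches)  # Output: ['ct', 'cat', 'caat']

# '+' requires at least one occurrence, so 'ct' is skipped
plus_matches = re.findall(r'ca+t', 'ct cat caat')
print(plus_matches)  # Output: ['cat', 'caat']
```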
Examples of Regular Expressions
Here are some examples of regular expressions and their meanings:
\d: a digit (equivalent to [0-9] for ASCII text)
\w: a word character (equivalent to [a-zA-Z0-9_] for ASCII text)
\s: a whitespace character (space, tab, newline, etc.)
.: any single character except a newline
*: zero or more occurrences of the preceding element
+: one or more occurrences of the preceding element
?: zero or one occurrence of the preceding element
{n}: exactly n occurrences of the preceding element
{n,}: at least n occurrences of the preceding element
{m,n}: at least m and at most n occurrences of the preceding element
[abc]: any single character from the set {a, b, c}
[^abc]: any single character that is not in the set {a, b, c}
(abc): a group that matches the sequence abc
You can use these special characters to define a pattern that matches a specific set of characters. For example:
import re
pattern = r'\d{3}-\d{3}-\d{4}' # a pattern to match a US phone number
string = 'My phone number is 555-555-1212'
match = re.search(pattern, string)
print(match.group()) # Output: 555-555-1212
In this example, the pattern matches a US phone number in the format XXX-XXX-XXXX.
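The counted quantifiers, character sets, and groups listed earlier can be tried out in the same way. The sketch below is only illustrative: the sample strings are invented, and \b (a word boundary) is used to keep a match from starting or ending inside a longer word or number.

```python
import re

# {n}: exactly n repetitions; \b stops the match from being part of a longer number
zip_codes = re.findall(r'\b\d{5}\b', 'ZIP 90210, order 1234567, ZIP 10001')
print(zip_codes)    # Output: ['90210', '10001']

# {m,n}: at least m and at most n repetitions
short_words = re.findall(r'\b[a-z]{2,3}\b', 'a an the them')
print(short_words)  # Output: ['an', 'the']

# [^...]: any single character that is not in the set
non_vowels = re.findall(r'[^aeiou\s]', 'regex is fun')
print(non_vowels)   # Output: ['r', 'g', 'x', 's', 'f', 'n']

# (...): each group captures its part of the match separately
match = re.search(r'(\d{3})-(\d{3})-(\d{4})', 'My phone number is 555-555-1212')
parts = match.groups()
print(parts)        # Output: ('555', '555', '1212')
```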
You can also use regular expressions to split a string into a list of substrings, using the split() function:
import re
pattern = r'\s+' # a pattern to match one or more whitespace characters
string = 'This is a test'
substrings = re.split(pattern, string)
print(substrings) # Output: ['This', 'is', 'a', 'test']
In this example, the pattern matches one or more whitespace characters, and the split() function splits string into a list of substrings, using each run of whitespace as the delimiter.
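split() becomes more useful than str.split() once several different delimiters are involved. The following sketch, with made-up input, shows a set of delimiters, the maxsplit argument, and the effect of wrapping the delimiter in a capturing group.

```python
import re

line = 'alice,bob;carol  dave'

# A character set lets one pattern cover several delimiters
fields = re.split(r'[,;\s]+', line)
print(fields)       # Output: ['alice', 'bob', 'carol', 'dave']

# maxsplit limits the number of splits; the remainder is left untouched
head_tail = re.split(r'[,;\s]+', line, maxsplit=1)
print(head_tail)    # Output: ['alice', 'bob;carol  dave']

# A capturing group in the pattern keeps the delimiters in the result
with_delims = re.split(r'([,;])', 'a,b;c')
print(with_delims)  # Output: ['a', ',', 'b', ';', 'c']
```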
Conclusion
Regular expressions are very useful for searching and manipulating text, but patterns can quickly become hard to read and maintain. Use them when simpler string methods are not enough, and test each pattern carefully, including against inputs that should not match, to make sure it works as intended.
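One possible way to test a pattern carefully is to keep a small table of inputs together with the result you expect for each. The sketch below assumes the phone number format used earlier; the name PHONE and the test cases are invented for this illustration. It also uses re.VERBOSE, which allows whitespace and comments inside the pattern so that longer patterns stay readable.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments
PHONE = re.compile(r'''
    \d{3}   # area code
    -
    \d{3}   # exchange
    -
    \d{4}   # line number
''', re.VERBOSE)

# (input, should the whole string be a valid phone number?)
cases = [
    ('555-555-1212', True),
    ('5555-555-1212', False),  # too many digits in the first part
    ('555-555-121', False),    # too few digits in the last part
    ('555 555 1212', False),   # wrong separator
]

for text, expected in cases:
    actual = PHONE.fullmatch(text) is not None
    assert actual == expected, f'{text!r}: expected {expected}, got {actual}'

print('all cases passed')
```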
Exercises
To review these concepts, we will go through a series of exercises designed to test your understanding and apply what you have learned.
Use a regular expression to extract the domain name from a list of email addresses.
import re
pattern = r'@([\w.]+)' # a pattern to match the domain part of an email address
emails = ['john@example.com', 'jane@example.net', 'bob@example.org']
for email in emails:
    match = re.search(pattern, email)
    domain = match.group(1)  # group 1 is the text captured after the "@"
    print(domain)
# Output:
# example.com
# example.net
# example.org
Use a regular expression to extract all the hashtags from a tweet.
import re
pattern = r'#(\w+)' # a pattern to match hashtags
tweet = 'Check out this cool new feature #newfeature #awesome'
matches = re.findall(pattern, tweet)
print(matches)
# Output: ['newfeature', 'awesome']
Use a regular expression to replace all the occurrences of a word with another word in a string.
import re
pattern = r'\bold\b' # a pattern to match the word "old"
replacement = 'new' # the replacement string
string = 'This old house is in need of some repairs'
new_string = re.sub(pattern, replacement, string)
print(new_string)
# Output: This new house is in need of some repairs
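The \b in the pattern above marks a word boundary, and it matters more than it may seem. As a small illustration with a sentence invented for the purpose, compare the result with and without it:

```python
import re

text = 'The old man told a bold story'

# Without \b, "old" is also replaced inside "told" and "bold"
without_boundaries = re.sub(r'old', 'new', text)
print(without_boundaries)  # Output: The new man tnew a bnew story

# With \b on both sides, only the standalone word is replaced
with_boundaries = re.sub(r'\bold\b', 'new', text)
print(with_boundaries)     # Output: The new man told a bold story
```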
Use a regular expression to split a string into a list of words, ignoring punctuation.
import re
pattern = r'\W+' # a pattern to match one or more non-word characters (punctuation and whitespace)
string = 'This, is a test! Do you see any words?'
words = [word for word in re.split(pattern, string) if word] # drop the empty string left after the final "?"
print(words)
# Output: ['This', 'is', 'a', 'test', 'Do', 'you', 'see', 'any', 'words']
Use a regular expression to match a string that starts with a digit and ends with a letter.
import re
pattern = r'^\d.*[a-zA-Z]$' # a pattern to match a string that starts with a digit and ends with a letter
string1 = '1abc'
string2 = 'abc1'
string3 = '1abc1'
print(re.match(pattern, string1)) # Output: <re.Match object; span=(0, 4), match='1abc'>
print(re.match(pattern, string2)) # Output: None
print(re.match(pattern, string3)) # Output: None
In this example, the pattern matches a string that starts with a digit and ends with a letter. The match() function returns a Match object if the pattern matches the string, and None if it does not.
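An alternative worth considering is fullmatch(), which only succeeds when the pattern covers the entire string, so the ^ and $ anchors are no longer needed. It is also slightly stricter: $ still matches just before a trailing newline, while fullmatch() does not accept one. The helper name below is hypothetical and used only for this sketch.

```python
import re

pattern = r'\d.*[a-zA-Z]'  # no ^ or $ needed: fullmatch() anchors both ends

def starts_with_digit_ends_with_letter(text):
    """Return True if the whole string starts with a digit and ends with a letter."""
    return re.fullmatch(pattern, text) is not None

print(starts_with_digit_ends_with_letter('1abc'))    # Output: True
print(starts_with_digit_ends_with_letter('abc1'))    # Output: False
print(starts_with_digit_ends_with_letter('1abc1'))   # Output: False

# '$' tolerates a trailing newline, fullmatch() does not
print(re.match(r'^\d.*[a-zA-Z]$', '1abc\n') is not None)  # Output: True
print(starts_with_digit_ends_with_letter('1abc\n'))       # Output: False
```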