Regular expressions, also known as regex, are a powerful tool for extracting data from strings. In Python, the re module provides functions for working with regular expressions. With a regular expression, you define a pattern of characters to search for in a string, and then extract the specific pieces of data that match it.
Extracting Data from Structured Text
One common use case for regular expressions is extracting data from structured text, such as HTML or XML documents. For example, you might want to extract the title and body of an HTML article, or the attributes of an XML element. To do this, you can use the search() or finditer() functions of the re module to find the patterns that define the data you want to extract.
Here is an example of extracting the title and body of an HTML article:
import re

html = '<html><head><title>My Article</title></head><body>This is the body of my article.</body></html>'

# define a pattern to match the title
title_pattern = r'<title>(.+?)</title>'
# search for the title; fall back to None so the name is always defined
match = re.search(title_pattern, html)
title = match.group(1) if match else None  # get the title from the match

# define a pattern to match the body
body_pattern = r'<body>(.+?)</body>'
# search for the body
match = re.search(body_pattern, html)
body = match.group(1) if match else None  # get the body from the match

print(title)  # Output: My Article
print(body)   # Output: This is the body of my article.
In this example, title_pattern and body_pattern are regular expressions that define the patterns to search for in the html string. The search() function scans the html string for the first occurrence of the pattern and returns a Match object if a match is found, or None otherwise. You can use the group() method of the Match object to get the matched text: group(1) returns the text captured by the first pair of parentheses, while group() or group(0) returns the entire match, tags included.
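To make the difference between the whole match and a captured group concrete, here is a minimal sketch that inspects a single Match object. The shortened html string and the <h1> pattern are made up for illustration; the methods used (group(), groups(), span()) are part of the standard re module.

```python
import re

# Illustrative input, shortened from the example above
html = '<html><head><title>My Article</title></head><body>Text</body></html>'

match = re.search(r'<title>(.+?)</title>', html)
if match:
    whole = match.group(0)       # entire matched text, tags included
    title = match.group(1)       # text captured by the first pair of parentheses
    captured = match.groups()    # tuple holding every captured group
    start, end = match.span(1)   # start and end indexes of group 1 within html
    print(whole)                 # Output: <title>My Article</title>
    print(title)                 # Output: My Article
    print(html[start:end])       # Output: My Article

# search() returns None when nothing matches, so check before calling group()
missing = re.search(r'<h1>(.+?)</h1>', html)
print(missing is None)           # Output: True
```

Checking for None before calling group() avoids an AttributeError on input that does not contain the pattern.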
You can also use the finditer() function to extract multiple pieces of data from a string. For example, you might want to extract all the links from an HTML document:
import re

html = '<html><body><p>This is a paragraph with a <a href="http://example.com">link</a>.</p></body></html>'

# define a pattern to match links
link_pattern = r'<a href="(.+?)">(.+?)</a>'

# find all the links
for match in re.finditer(link_pattern, html):
    url = match.group(1)   # get the URL from the match
    text = match.group(2)  # get the link text from the match
    print(url, text)       # Output: http://example.com link
In this example, link_pattern is a regular expression that defines a pattern to match links in the html string. The finditer() function searches the html string for all non-overlapping occurrences of the pattern and returns an iterator of Match objects. You can use a for loop to iterate through the Match objects, and use the group() method to get the URL and link text for each one.
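When a pattern has several groups, naming them can make the extraction code easier to read than numbered groups. The sketch below is one possible variation, not the only way to write it: the two-link html string and the group names url and text are chosen for illustration, and the pattern assumes href is the only attribute on each <a> tag.

```python
import re

# Illustrative input with two links
html = (
    '<p><a href="http://example.com">Example</a> and '
    '<a href="http://example.org/docs">Docs</a></p>'
)

# (?P<name>...) names a group; [^"]+ stops at the closing quote of the attribute
link_pattern = re.compile(r'<a href="(?P<url>[^"]+)">(?P<text>.+?)</a>')

# Collect one (url, text) pair per match
links = [(m.group('url'), m.group('text')) for m in link_pattern.finditer(html)]

for url, text in links:
    print(url, text)

# Output:
# http://example.com Example
# http://example.org/docs Docs
```

Compiling the pattern once with re.compile() is also convenient when the same pattern is applied to many strings.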
Extracting Data from Unstructured Text
Regular expressions can also be used to extract data from text that has no markup, such as log files or simple CSV files. For example, you might want to extract the date and time from a log file, or the name and email from a CSV file. To do this, you define patterns for the data you want to extract and use the search(), finditer(), or findall() functions to find the matches.
Here is an example of extracting the date and time from a log file:
import re

log = '2022-12-24T12:34:56: This is a log message.'

# define a pattern to match the date and time
dt_pattern = r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})'

# search for the date and time
match = re.search(dt_pattern, log)
if match:
    dt = match.group(1)  # get the date and time from the match
    print(dt)            # Output: 2022-12-24T12:34:56
In this example, dt_pattern is a regular expression that defines a pattern to match the date and time in the log string. The search() function searches the log string for the first occurrence of the pattern, and returns a Match object if a match is found. You can use the group() method of the Match object to get the date and time from the match.
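The findall() function mentioned earlier returns the captured text directly instead of Match objects, which is handy when you only need the values. Here is a small sketch using a made-up two-line log; note how the shape of the result depends on the number of groups in the pattern.

```python
import re

# Illustrative two-line log
log = (
    '2022-12-24T12:34:56: Service started.\n'
    '2022-12-24T12:35:10: Connection accepted.\n'
)

# One group: findall() returns a list of strings
timestamps = re.findall(r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})', log)
print(timestamps)
# Output: ['2022-12-24T12:34:56', '2022-12-24T12:35:10']

# Two groups: findall() returns a list of (date, time) tuples
parts = re.findall(r'(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})', log)
print(parts)
# Output: [('2022-12-24', '12:34:56'), ('2022-12-24', '12:35:10')]
```

If you also need positions or named groups, finditer() is usually the better fit, since findall() discards the Match objects.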
Conclusion
Regular expressions can be very useful for extracting data from strings, but they can also be complex to write and hard to read. Use them only when a simpler approach will not do, and test them carefully against both matching and non-matching input to make sure they work as intended.
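One lightweight way to test a pattern is to keep a small table of inputs and expected results and check them all in a loop. The following is a minimal sketch of that idea; the helper name extract_timestamp and the sample lines are hypothetical, chosen only for this illustration.

```python
import re

dt_pattern = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}')

def extract_timestamp(line):
    """Return the first timestamp found in line, or None if there is none."""
    match = dt_pattern.search(line)
    return match.group(0) if match else None

# (input, expected result) pairs; None means the pattern should not match
cases = [
    ('2022-12-24T12:34:56: ok', '2022-12-24T12:34:56'),
    ('2022-12-24T12:34:56.789: with milliseconds', '2022-12-24T12:34:56'),
    ('2022-12-24 12:34:56: space instead of T', None),
    ('no timestamp here', None),
]

for line, expected in cases:
    # The second argument is shown in the error message if the check fails
    assert extract_timestamp(line) == expected, line

print('all cases passed')
```

Including inputs that should not match is as important as including those that should, because an overly permissive pattern passes every positive case.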
Exercises
To review these concepts, we will go through a series of exercises designed to test your understanding and apply what you have learned.
Use a regular expression to extract the title and author from an XML document.
import re

xml = '<book><title>War and Peace</title><author>Leo Tolstoy</author></book>'

# define a pattern to match the title
title_pattern = r'<title>(.+?)</title>'
# search for the title; fall back to None so the name is always defined
match = re.search(title_pattern, xml)
title = match.group(1) if match else None  # get the title from the match

# define a pattern to match the author
author_pattern = r'<author>(.+?)</author>'
# search for the author
match = re.search(author_pattern, xml)
author = match.group(1) if match else None  # get the author from the match

print(title)   # Output: War and Peace
print(author)  # Output: Leo Tolstoy
Use a regular expression to extract the name and email from a CSV file.
import re

csv = 'John,Doe,john.doe@example.com\nJane,Doe,jane.doe@example.net'

# define a pattern to match first name, last name, and email;
# [^,\n]+ keeps a field from running across a comma or a line break
record_pattern = r'([^,\n]+),([^,\n]+),([\w.-]+@[\w.-]+)'

# find all the records
for match in re.finditer(record_pattern, csv):
    name = match.group(1)   # get the first name from the match
    email = match.group(3)  # get the email from the match
    print(name, email)

# Output:
# John john.doe@example.com
# Jane jane.doe@example.net
Use a regular expression to extract the date and time from a log file, ignoring the milliseconds.
import re

log = '2022-12-24T12:34:56.789: This is a log message.'

# define a pattern to match the date and time, ignoring the milliseconds
dt_pattern = r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})'

# search for the date and time
match = re.search(dt_pattern, log)
if match:
    dt = match.group(1)  # get the date and time from the match
    print(dt)            # Output: 2022-12-24T12:34:56
Use a regular expression to extract the domain name from a URL.
import re

url = 'http://www.example.com/path/to/page'

# define a pattern to match the domain name; https? accepts both schemes,
# and [^/]+ also works when the URL has no path after the domain
domain_pattern = r'https?://([^/]+)'

# search for the domain name
match = re.search(domain_pattern, url)
if match:
    domain = match.group(1)  # get the domain name from the match
    print(domain)            # Output: www.example.com
Use a regular expression to extract the phone numbers from a string.
import re

string = 'My phone numbers are 555-123-4567 and (123) 456-7890.'

# define a pattern to match phone numbers, with optional parentheses
# around the area code and an optional -, ., or space between parts
phone_pattern = r'(?<!\d)\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b'

# find all the phone numbers
for match in re.finditer(phone_pattern, string):
    phone = match.group()  # get the phone number from the match
    print(phone)

# Output:
# 555-123-4567
# (123) 456-7890