Extracting Data Using Regular Expressions

Yasin Cakal

Regular expressions, also known as regex, are a powerful tool for extracting data from strings. In Python, the re module provides functions for working with regular expressions. With regular expressions, you can define a pattern of characters to search for in a string, and extract specific pieces of data from the string.

Extracting Data from Structured Text

One common use case for regular expressions is extracting data from structured text, such as HTML or XML documents. For example, you might want to extract the title and body of an HTML article, or the attributes of an XML element. To do this, you can use the search() or finditer() functions of the re module to find the patterns that surround the data you want to extract. This approach works best on small, predictable snippets; for arbitrary real-world documents, a dedicated HTML or XML parser is usually more robust.

Here is an example of extracting the title and body of an HTML article:

import re

html = '<html><head><title>My Article</title></head><body>This is the body of my article.</body></html>'

# define a pattern to match the title
title_pattern = r'<title>(.+?)</title>'

# search for the title
match = re.search(title_pattern, html)
if match:
    title = match.group(1)  # get the title from the match

# define a pattern to match the body
body_pattern = r'<body>(.+?)</body>'

# search for the body
match = re.search(body_pattern, html)
if match:
    body = match.group(1)  # get the body from the match

print(title)  # Output: My Article
print(body)  # Output: This is the body of my article.

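In this example, title_pattern and body_pattern are regular expressions that define the patterns to search for in the html string. The search() function scans the html string for the first occurrence of the pattern and returns a Match object if a match is found, or None otherwise, which is why each result is checked with if match before use. The group(1) call returns the text captured by the first pair of parentheses in the pattern, that is, the content between the tags; group() or group(0) would return the entire match, tags included.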
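The difference between the whole match and a captured group, and the case where nothing matches, can be sketched as follows. The helper name extract_first and its default parameter are illustrative assumptions, not part of the re module.

```python
import re

html = '<html><head><title>My Article</title></head><body>Body text.</body></html>'


def extract_first(pattern, text, default=None):
    """Hypothetical helper: return capture group 1 of the first match, or `default`."""
    match = re.search(pattern, text)
    return match.group(1) if match else default


match = re.search(r'<title>(.+?)</title>', html)
whole_match = match.group(0)  # the entire matched text, tags included
captured = match.group(1)     # only the text inside the parentheses

title = extract_first(r'<title>(.+?)</title>', html)
missing = extract_first(r'<h1>(.+?)</h1>', html, default='')  # no <h1> in the input

print(whole_match)    # <title>My Article</title>
print(captured)       # My Article
print(repr(missing))  # ''
```

Wrapping the None check in a small function like this avoids a NameError later in the script when a pattern does not match.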

You can also use the finditer() function to extract multiple pieces of data from a string. For example, you might want to extract all the links from an HTML document:

import re

html = '<html><body><p>This is a paragraph with a <a href="http://example.com">link</a>.</p></body></html>'

# define a pattern to match links
link_pattern = r'<a href="(.+?)">(.+?)</a>'

# find all the links
for match in re.finditer(link_pattern, html):
    url = match.group(1)  # get the URL from the match
    text = match.group(2)  # get the link text from the match
    print(url, text)  # Output: http://example.com link

In this example, the link_pattern is a regular expression that defines a pattern to match links in the html string. The finditer() function searches the html string for all the occurrences of the pattern, and returns an iterator of Match objects. You can use a for loop to iterate through the Match objects, and use the group() method to get the URL and link text for each one.
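When a document contains several links, it is often convenient to collect the results into a list. The sketch below uses a made-up two-link snippet and compares finditer() with findall(), which returns a list of tuples when the pattern has more than one group.

```python
import re

# Illustrative input with two links (not from the original article).
html = (
    '<p>See <a href="http://example.com">the first link</a> and '
    '<a href="http://example.org/docs">the second link</a>.</p>'
)

# Compiling the pattern once is handy when it is reused.
link_pattern = re.compile(r'<a href="(.+?)">(.+?)</a>')

# finditer() yields Match objects; build (url, text) pairs from the groups.
links = [(m.group(1), m.group(2)) for m in link_pattern.finditer(html)]

# findall() returns the captured groups directly as a list of tuples.
same_links = link_pattern.findall(html)

for url, text in links:
    print(url, text)
# http://example.com the first link
# http://example.org/docs the second link
```

finditer() is the better fit when you also need match positions (for example match.start()), while findall() is shorter when only the captured text matters.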

Extracting Data from Unstructured Text

Regular expressions can also be used to extract data from text without markup, such as log files or simple CSV files. For example, you might want to extract the date and time from a log file, or the name and email from a CSV file. To do this, you can use regular expressions to define the patterns of the data you want to extract, and use the search(), finditer(), or findall() functions to find the matches.

Here is an example of extracting the date and time from a log file:

import re

log = '2022-12-24T12:34:56: This is a log message.'

# define a pattern to match the date and time
dt_pattern = r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})'

# search for the date and time
match = re.search(dt_pattern, log)
if match:
    dt = match.group(1)  # get the date and time from the match

print(dt)  # Output: 2022-12-24T12:34:56

In this example, the dt_pattern is a regular expression that defines a pattern to match the date and time in the log string. The search() function searches the log string for the first occurrence of the pattern, and returns a Match object if a match is found. You can use the group() method of the Match object to get the date and time from the match.
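If the date and the time are needed separately, named groups make the pattern easier to read. The following is a minimal sketch; the group names date and time are arbitrary choices, and converting the result with datetime.fromisoformat() is an optional extra step, not something the example above requires.

```python
import re
from datetime import datetime

log = '2022-12-24T12:34:56: This is a log message.'

# (?P<name>...) gives each captured group a name.
dt_pattern = re.compile(r'(?P<date>\d{4}-\d{2}-\d{2})T(?P<time>\d{2}:\d{2}:\d{2})')

match = dt_pattern.search(log)
if match:
    date_part = match.group('date')  # '2022-12-24'
    time_part = match.group('time')  # '12:34:56'
    # group(0) is the whole timestamp; fromisoformat() also rejects
    # impossible values such as month 13 that the pattern alone would accept.
    timestamp = datetime.fromisoformat(match.group(0))
    print(date_part, time_part, timestamp.year)
```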

Conclusion

Regular expressions can be very useful for extracting data from strings, but they can also be complex to write and hard to read. Use them when simpler tools such as string methods are not enough, and test each pattern against both matching and non-matching input to make sure it behaves as intended.
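One lightweight way to test a pattern is to run it against a small table of inputs and expected results. The sketch below does this for the timestamp pattern from this lesson; the sample log lines and the helper name first_timestamp are made up for illustration.

```python
import re

dt_pattern = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}')

# (input text, expected result) pairs, including inputs that should NOT match.
cases = [
    ('2022-12-24T12:34:56: started', '2022-12-24T12:34:56'),
    ('2022-12-24T12:34:56.789: with milliseconds', '2022-12-24T12:34:56'),
    ('no timestamp here', None),
    ('2022-12-24 12:34:56: space instead of T', None),
]


def first_timestamp(text):
    """Hypothetical helper: return the first timestamp in `text`, or None."""
    match = dt_pattern.search(text)
    return match.group(0) if match else None


for text, expected in cases:
    actual = first_timestamp(text)
    assert actual == expected, f'{text!r}: expected {expected!r}, got {actual!r}'

print('all cases passed')
```

Adding a new line to cases whenever a pattern misbehaves keeps earlier fixes from regressing.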

Exercises

To review these concepts, we will go through a series of exercises designed to test your understanding and apply what you have learned.

Use a regular expression to extract the title and author from an XML document.

import re

xml = '<book><title>War and Peace</title><author>Leo Tolstoy</author></book>'

# define a pattern to match the title
title_pattern = r'<title>(.+?)</title>'

# search for the title
match = re.search(title_pattern, xml)
if match:
    title = match.group(1)  # get the title from the match

# define a pattern to match the author
author_pattern = r'<author>(.+?)</author>'

# search for the author
match = re.search(author_pattern, xml)
if match:
    author = match.group(1)  # get the author from the match

print(title)  # Output: War and Peace
print(author)  # Output: Leo Tolstoy

Use a regular expression to extract the name and email from a CSV file.

import re

csv_text = 'John,Doe,john.doe@example.com\nJane,Doe,jane.doe@example.net'

# define a pattern to match the first name, last name, and email
# (a literal space instead of \s keeps a field from swallowing the newline between records)
record_pattern = r'([\w ]+),([\w ]+),([\w.-]+@[\w.-]+)'

# find all the records
for match in re.finditer(record_pattern, csv_text):
    name = match.group(1)  # get the first name from the match
    email = match.group(3)  # get the email from the match
    print(name, email)

# Output:
# John john.doe@example.com
# Jane jane.doe@example.net

Use a regular expression to extract the date and time from a log file, ignoring the milliseconds.

import re

log = '2022-12-24T12:34:56.789: This is a log message.'

# define a pattern to match the date and time, ignoring the milliseconds
dt_pattern = r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})'

# search for the date and time
match = re.search(dt_pattern, log)
if match:
    dt = match.group(1)  # get the date and time from the match

print(dt)  # Output: 2022-12-24T12:34:56

Use a regular expression to extract the domain name from a URL.

import re

url = 'http://www.example.com/path/to/page'

# define a pattern to match the domain name
# (accepts http or https, and does not require a path after the domain)
domain_pattern = r'https?://([^/]+)'

# search for the domain name
match = re.search(domain_pattern, url)
if match:
    domain = match.group(1)  # get the domain name from the match

print(domain)  # Output: www.example.com

Use a regular expression to extract the phone numbers from a string.

import re

string = 'My phone numbers are 555-123-4567 and (123) 456-7890.'

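# define a pattern to match phone numbers
# (the area code is either "(123)" with an optional space, or "123" with an optional separator)
phone_pattern = r'(?:\(\d{3}\) ?|\b\d{3}[-. ]?)\d{3}[-. ]?\d{4}\b'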

# find all the phone numbers
for match in re.finditer(phone_pattern, string):
    phone = match.group()  # get the phone number from the match
    print(phone)

# Output:
# 555-123-4567
# (123) 456-7890
