By Konstantinos Patronas — 24 Jul 2022

Python: simple regular expression matching

Regular expressions, or regex for short, are text patterns that help match text. Python has built-in support for them through the `re` module. Many people find regular expressions hard to understand, but learning them can save you hours if you constantly work with text or parse large amounts of data.

What is the purpose of this article?

In this article I will present the simplest concepts of text matching in Python. They are easy to pick up and will let you start using regular expressions in no time.

Simple string matching

Let's start with the simplest form of regex: checking whether one string occurs inside another.

In the Python shell, enter:

>>> import re
>>> result = re.search("cat","catastrophic")
>>> print(result)
<re.Match object; span=(0, 3), match='cat'>
>>>

The first parameter of the search function is the regular expression that we want to match, and the second parameter is the string that we want to search. Python returns a match object with the following properties:

span=(0, 3): the string “cat” was found in “catastrophic” starting at index 0 and ending just before index 3, that is, in the first three characters
match='cat': the text that was matched
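
To make the span values more concrete, here is a small sketch showing that slicing the original string with the span gives back the matched text. The variable names are only for illustration.

```python
import re

text = "catastrophic"
result = re.search("cat", text)

# span() returns a (start, end) tuple; end is exclusive, like a Python slice
start, end = result.span()
print(start, end)        # 0 3

# Slicing the searched string with the span reproduces the match
print(text[start:end])   # cat
```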

If the regex does not match, the search function returns None:

>>> result = re.search("cat","dogs only")
>>> print(result)
None
>>>
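
Because a failed search gives back None, a script usually has to check the result before using it. The following is a minimal sketch of that check; the helper name `contains` is made up for this example and is not part of the `re` module.

```python
import re

def contains(pattern, text):
    """Return True if pattern is found anywhere in text (hypothetical helper)."""
    # re.search returns a match object on success and None on failure,
    # so comparing against None gives a clean boolean.
    return re.search(pattern, text) is not None

print(contains("cat", "catastrophic"))  # True
print(contains("cat", "dogs only"))     # False
```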

Note that the search function returns only the first occurrence of a match. If we search for ‘x’ in the string ‘xerox’, it reports the span of the first occurrence only:

>>> import re
>>> result = re.search("x","xerox")
>>> print(result)
<re.Match object; span=(0, 1), match='x'>

How to find all occurrences of a string in another string

To find all occurrences and their spans, use the finditer function:

>>> import re
>>> matches = re.finditer('a','abcabcabc')
>>> for match in matches:
...     print(match)
...
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(6, 7), match='a'>
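
If you only need the matched text and not the spans, `re.findall` may be a simpler choice, since it returns a plain list of strings. A short sketch, using the same example string:

```python
import re

text = "abcabcabc"

# findall returns the matched strings themselves, without position info
found = re.findall("a", text)
print(found)       # ['a', 'a', 'a']

# The length of the list is the number of occurrences
print(len(found))  # 3
```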

That was the simplest form of search. It works much like the search in Word or Windows Notepad: enter a string and check whether it is present in another string.

How to find strings that start with a specific character

But what if we want more complex rules, such as matching only when a string starts or ends with a specific character?

This is where special characters come in. Special characters express conditions on a match. Suppose we want to find the “x” character in a string, but only when the string starts with “x”. To express this, we put the ^ character before “x”, which tells Python to match “x” only at the start of the string.

The search returns only one “x” match, and its span shows that the match is at the beginning of the string:

>>> import re
>>> matches = re.finditer('^x','xerox')
>>> for match in matches:
...     print(match)
...
<re.Match object; span=(0, 1), match='x'>
>>>
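
A common use of ^ is filtering a list of strings by how they start. The sketch below assumes a made-up word list and keeps only the words that begin with “x”.

```python
import re

# Hypothetical sample data for illustration
words = ["xerox", "box", "xylophone", "taxi"]

# re.search returns None for words that do not start with "x",
# and None is falsy, so those words are filtered out
starts_with_x = [w for w in words if re.search("^x", w)]
print(starts_with_x)  # ['xerox', 'xylophone']
```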

How to find strings that end with a specific character

What about matching only strings that end with a specific character? We put the $ character after “x”, which tells Python to match “x” only at the end of the string.

The search returns only one “x” match, and its span shows that the match is at the end of the string:

>>> import re
>>> matches = re.finditer('x$','xerox')
>>> for match in matches:
...     print(match)
...
<re.Match object; span=(4, 5), match='x'>
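
The $ character works with longer endings too, not just a single character. As a rough sketch, assuming a made-up list of names, we can keep only the ones that end in “log”:

```python
import re

# Hypothetical sample data for illustration
names = ["app.log", "log.txt", "catalog"]

# "log$" matches only when "log" is the very end of the string,
# so "log.txt" is skipped even though it contains "log"
ends_with_log = [n for n in names if re.search("log$", n)]
print(ends_with_log)  # ['app.log', 'catalog']
```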

How to match any character

Another case is matching a string in which one of the characters can be anything. Suppose we have the string “ping pang bang” and we want to match everything that starts with “p” and ends in “ng”. Both “ping” and “pang” qualify, and they differ only in the second character. To indicate that we don't care about the value of a character, we use the “.” character, which matches any single character (except a newline):

>>> import re
>>> matches = re.finditer('p.ng','ping pang bang')
>>> for match in matches:
...     print(match)
...
<re.Match object; span=(0, 4), match='ping'>
<re.Match object; span=(5, 9), match='pang'>
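
It helps to remember that “.” stands for exactly one character, and that the character can be a digit or a space as well as a letter. The sketch below tests a made-up list of candidates against the whole-string pattern ^p.ng$ to show what does and does not qualify.

```python
import re

# Hypothetical sample data for illustration
candidates = ["ping", "pang", "p9ng", "p ng", "png", "piing"]

# ^ and $ force the pattern to cover the whole string, so "." must
# stand for exactly one character between "p" and "ng"
matched = [c for c in candidates if re.search("^p.ng$", c)]
print(matched)  # ['ping', 'pang', 'p9ng', 'p ng']
```

“png” fails because there is no character for “.” to consume, and “piing” fails because there is one character too many.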

You might need to access the span and match values programmatically. This is easy with the start(), end(), and group() methods of the returned match object:

>>> import re
>>> matches = re.finditer('p.ng','ping pang bang')
>>> for match in matches:
...     print("Start: %s Stop: %s Match: %s"%(match.start(),match.end(),match.group(0)))
...
Start: 0 Stop: 4 Match: ping
Start: 5 Stop: 9 Match: pang
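
In a script you would probably collect these values instead of printing them. Here is one possible sketch; the function name `match_details` is made up for this example.

```python
import re

def match_details(pattern, text):
    """Return a list of (start, end, matched_text) tuples (hypothetical helper)."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]

print(match_details("p.ng", "ping pang bang"))
# [(0, 4, 'ping'), (5, 9, 'pang')]
```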

Conclusion

We just saw a tiny part of what regexes can do. Regexes can get really complex, but in many cases you need only a small subset, like the pieces we learned here!