Learn Regular Expressions the Easy Way
A regular expression (regex) is a sequence of characters that specifies a search pattern. Regexes are very useful in administration and programming tasks: it is often far more efficient to search or substitute text with tools like grep or sed than to write code that manipulates the text.
Let's see them in action with a simple example.
Create the file ip_addresses.txt with the following content:
1.1.1.1
adsds.3.4.5
127.0.0.1
192.168.0.1
Now let's try to match everything in this file that looks like an IP address using grep:
$ grep -P "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" ip_addresses.txt
1.1.1.1
127.0.0.1
192.168.0.1
As we can see, grep ignored adsds.3.4.5, which is not a valid IP address. Let's see how this regular expression works:
- \d{1,3}: matches a run of one to three digits.
- \.: matches a literal dot. The backslash removes the dot's special meaning.
In this case grep reports a match when it finds four numbers separated by dots, each one to three digits long. Note that the pattern checks only the shape of the address, not the value of each number, so a string like 999.999.999.999 would also match.
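Because \d{1,3} accepts any number of up to three digits, a stricter check needs a pattern that limits each number to 0-255. The following is a minimal sketch using grep -E; the octet variable and the sample addresses are assumptions made for illustration.

```shell
# Hypothetical helper pattern: one octet in the range 0-255.
octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'

# Anchor with ^ and $ so the whole line must be four octets separated by dots.
valid=$(printf '%s\n' 1.1.1.1 adsds.3.4.5 127.0.0.1 192.168.0.1 999.1.1.1 |
  grep -E "^${octet}(\.${octet}){3}\$")

printf '%s\n' "$valid"
```

Only the three well-formed addresses should be printed; 999.1.1.1 is rejected because 999 is not a valid octet.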
Literal pattern matching, be careful!
$ grep -P "1.1.1.1" ip_addresses.txt
1.1.1.1
We can see that it matched the IP address, but things are not as clear as they seem.
Open ip_addresses.txt and change 1.1.1.1 to 1a1.1.1.
Now re-run grep with the same parameters:
$ grep -P "1.1.1.1" ip_addresses.txt
1a1.1.1
We can see that it matched 1a1.1.1, but why?
This happens because the regex engine does not interpret the dot as the character “.” but as a metacharacter that means “any character”. To match a literal dot we must use the \. syntax.
Executing this should not return anything:
$ grep -P "1\.1\.1\.1" ip_addresses.txt
But if we change 1a1.1.1 back to 1.1.1.1 and re-run this grep command, we should get a match:
$ grep -P "1\.1\.1\.1" ip_addresses.txt
1.1.1.1
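When the search string contains no regex syntax at all, another option is to tell grep to treat it as a fixed string with -F, so nothing needs escaping. A small sketch, with sample lines assumed for illustration:

```shell
# Two sample lines (assumed): one with a stray letter, one real address.
sample=$(printf '%s\n' '1a1.1.1' '1.1.1.1')

# As a regex, each dot matches any character, so both lines match.
regex_hits=$(printf '%s\n' "$sample" | grep '1.1.1.1')

# With -F the pattern is a plain string, so only the exact text matches.
fixed_hits=$(printf '%s\n' "$sample" | grep -F '1.1.1.1')

printf 'regex:\n%s\nfixed:\n%s\n' "$regex_hits" "$fixed_hits"
```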
Characters and words
Create the file data.txt with the following content:
A-100 AAA $1000
A-200 BBB $1000
A-500 BBB $1500
B-300 CCC $3000
A-600 BBB $XXXX
A-600 GGG $2500
C-100 BBB $2500
C-100 BBB $2500
D-100 BBB $2500
D-100 BBB $2500
Let's find the lines that contain “C-” (in this file, those are the lines that start with it). That is easy:
$ grep "C-" data.txt
C-100 BBB $2500
C-100 BBB $2500
Let's make things a bit more complicated and match every line that contains A, B or C followed by “-”:
$ grep "[A-C]-" data.txt
A-100 AAA $1000
A-200 BBB $1000
A-500 BBB $1500
B-300 CCC $3000
A-600 BBB $XXXX
A-600 GGG $2500
C-100 BBB $2500
C-100 BBB $2500
[]: a character set. [A-C] matches any single character in the range A to C. If we replace the range with [AC], only lines that contain A or C followed by “-” match:
$ grep "[AC]-" data.txt
A-100 AAA $1000
A-200 BBB $1000
A-500 BBB $1500
A-600 BBB $XXXX
A-600 GGG $2500
C-100 BBB $2500
C-100 BBB $2500
Let's try to match all lines that contain A or C, followed by a non-word character (the “-”) and three word characters (here, the 3 digits):
$ grep "[AC]\W\w\w\w" data.txt
A-100 AAA $1000
A-200 BBB $1000
A-500 BBB $1500
A-600 BBB $XXXX
A-600 GGG $2500
C-100 BBB $2500
C-100 BBB $2500
- \W: matches a non-word character (anything other than A-Z, a-z, 0-9 and underscore)
- \w: matches a word character (A-Z, a-z, 0-9 or underscore)
So in this case we instruct grep to match any line that contains A or C, followed by a non-word character (here the “-”) and three word characters in a row (here the 3 digits).
Digits
Instead of using \w to match numerical characters we can use \d, which matches digits exclusively.
Let's match every line that contains A or D, followed by “-”, then a 1 or a 5, then two more digits. Note that inside a character set the comma is a literal character, not a separator, so [1,5] also accepts a comma and [15] is the tighter form:
$ grep -P "[AD]-[1,5]\d\d" data.txt
A-100 AAA $1000
A-500 BBB $1500
D-100 BBB $2500
D-100 BBB $2500
Character sets also support ranges of digits. If we change [1,5] to [1-5] we get an extra line, A-200, because “2” is in the range [1-5]:
$ grep -P "[AD]-[1-5]\d\d" data.txt
A-100 AAA $1000
A-200 BBB $1000
A-500 BBB $1500
D-100 BBB $2500
D-100 BBB $2500
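The \d shorthand needs grep -P. Where PCRE support is not available, the same idea can be sketched with grep -E, a plain [0-9] set and a {2} repetition count. The sample lines below are a shortened copy of data.txt, included only to keep the example self-contained.

```shell
# Shortened sample of data.txt (an assumption for this sketch).
sample=$(printf '%s\n' 'A-100 AAA $1000' 'A-200 BBB $1000' \
  'A-500 BBB $1500' 'B-300 CCC $3000' 'D-100 BBB $2500')

# [15] is "1 or 5" with no comma; [0-9]{2} is exactly two digits.
hits=$(printf '%s\n' "$sample" | grep -E '[AD]-[15][0-9]{2}')

printf '%s\n' "$hits"
```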
sed
sed, the stream editor, uses regular expressions to manipulate text. A very common task is to replace every run of whitespace with a single space. Whitespace may be the space character or a tab.
Create the following file and save it as test.txt:
sdsdsd dsdsds dsdsd dsdsd dsdsd
fdsafadsfs dsfsfedf dfsfd fdf
sdsd d sd sd s fe f ef s cs fe f ef
Note that some of the gaps in this file are not spaces but tabs. We can view such special characters with cat -t, which prints each tab as ^I:
$ cat -t ./test.txt
sdsdsd dsdsds dsdsd dsdsd dsdsd
fdsafadsfs^I^Idsfsfedf dfsfd fdf
sdsd d sd sd s fe f ef s cs fe f ef
Now, to collapse every run of spaces and tabs into a single space, we can use the following sed command:
$ sed -r 's/\s+/ /g' test.txt
sdsdsd dsdsds dsdsd dsdsd dsdsd
fdsafadsfs dsfsfedf dfsfd fdf
sdsd d sd sd s fe f ef s cs fe f ef
Without the -i parameter, sed writes the processed text to stdout and leaves the file untouched. To change the file itself, use the following:
$ sed -r -i.backup 's/\s+/ /g' test.txt
- \s+: matches one or more whitespace characters in a row
- -r: use extended regular expressions
- -i: edit the file in place. The .backup suffix tells sed to keep a copy of the original file with the .backup extension (we can use any extension we like, I just like full names).
-rw-r--r-- 1 kpatronas kpatronas 106 Jul 11 00:41 test.txt.backup
-rw-r--r-- 1 kpatronas kpatronas 100 Jul 11 00:51 test.txt
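The \s shorthand and the -r flag are GNU conveniences. A more portable variant can be sketched with -E and the POSIX class [[:space:]], here chained with two extra substitutions that trim the ends of the line. The input line is an assumption chosen to show leading spaces, tabs and a trailing space.

```shell
# Sample line (assumed): leading spaces, two tabs, and a trailing space.
cleaned=$(printf '  a \t\tb   c \n' |
  sed -E -e 's/[[:space:]]+/ /g' -e 's/^ //' -e 's/ $//')

# Brackets make any leftover leading or trailing space visible.
printf '[%s]\n' "$cleaned"
```

Each -e expression runs in order on every line, so the trimming steps see the already collapsed whitespace.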
Location
One very common task is to find only text that starts or ends with specific characters. Let's go back to the file data.txt:
A-100 AAA $1000
A-200 BBB $1000
A-500 BBB $1500
B-300 CCC $3000
A-600 BBB $XXXX
A-600 GGG $2500
C-100 BBB $2500
C-100 BBB $2500
D-100 BBB $2500
D-100 BBB $2500
Let's filter only the lines that start with B. To do this we use the ^ character:
$ grep '^B' data.txt
B-300 CCC $3000
Now let's filter only the lines that start with C and end with $2500:
$ grep '^C.*\$2500$' data.txt
C-100 BBB $2500
C-100 BBB $2500
- ^C: the line starts with C
- .*: any character, zero or more times
- \$: a literal dollar sign
- $: end of line, so the line must end with $2500
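To see what the anchors change, it can help to count matches with and without them. The sketch below uses grep -c on three sample lines taken from data.txt; the variable names are only illustrative.

```shell
# Three sample lines from data.txt (kept inline so the sketch is self-contained).
sample=$(printf '%s\n' 'A-200 BBB $1000' 'B-300 CCC $3000' 'C-100 BBB $2500')

# No anchor: every line that has a B anywhere is counted.
anywhere=$(printf '%s\n' "$sample" | grep -c 'B')

# ^ anchors the match to the start of the line.
at_start=$(printf '%s\n' "$sample" | grep -c '^B')

# $ anchors the match to the end of the line.
at_end=$(printf '%s\n' "$sample" | grep -c '2500$')

printf 'anywhere=%s at_start=%s at_end=%s\n' "$anywhere" "$at_start" "$at_end"
```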
Boundaries
Sometimes we want to find a string only when it is not part of a longer string, that is, when the characters immediately before and after it are not word characters (letters, digits or underscore).
Let's try to find lines that contain 500 in the file data.txt:
$ grep '500' data.txt
A-500 BBB $1500
A-600 GGG $2500
C-100 BBB $2500
C-100 BBB $2500
D-100 BBB $2500
D-100 BBB $2500
It works, somehow, but it also matches 1500 and 2500. To find only the lines where 500 is not surrounded by any digit or letter, we can use \b, which matches a word boundary:
$ grep '\b500\b' data.txt
A-500 BBB $1500
Only this line is found, because in A-500 the “-” before 500 and the space after it are not word characters.
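For the common case of matching a whole word, grep also offers the -w flag, which behaves much like wrapping the pattern in \b on both sides. A short sketch on two sample lines from data.txt:

```shell
# Two sample lines from data.txt (inline for a self-contained sketch).
sample=$(printf '%s\n' 'A-500 BBB $1500' 'A-600 GGG $2500')

# -w: the match must not have word characters directly before or after it.
whole_word=$(printf '%s\n' "$sample" | grep -w '500')

printf '%s\n' "$whole_word"
```

The second line is skipped because its only 500 sits inside 2500.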
Alternation
What if we want to find the lines that contain 100 or 200, again not surrounded by any letter or digit?
$ grep -E '\b(100|200)\b' data.txt
A-100 AAA $1000
A-200 BBB $1000
C-100 BBB $2500
C-100 BBB $2500
D-100 BBB $2500
D-100 BBB $2500
- -E: use extended regular expressions
- (string1|string2): matches string1 or string2
- \b: a word boundary
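Alternation can also be expressed outside the regex by passing several patterns with -e; grep then prints a line when any of them matches. A hedged sketch, combined with -w for the word-boundary part and run on sample lines assumed from data.txt:

```shell
# Sample lines assumed from data.txt.
sample=$(printf '%s\n' 'A-100 AAA $1000' 'A-200 BBB $1000' 'A-500 BBB $1500')

# Each -e adds one alternative; -w keeps 1000 and 1500 from matching.
either=$(printf '%s\n' "$sample" | grep -w -e '100' -e '200')

printf '%s\n' "$either"
```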
Repetition
Let's consider the URL https://www.public.gr/blog. It consists of:
- https: the protocol
- www: subdomain
- public: domain
- gr: top-level domain
- blog: path
Save the following as webs.txt:
http://google.com
http://www.public.gr/blog
https://microsoft.com
http://zzz.com
The first thing to match is the protocol, which can be http or https.
- ?: placed after a character, makes that character optional
$ grep -Po '^https?' webs.txt
http
http
https
http
Then we need to match ://
- \: placed before a character, matches that character literally. The slash is not special in grep, so the backslashes before / are optional here.
$ grep -Po '^https?:\/\/' webs.txt
http://
http://
https://
http://
Now we need to match the subdomain, which might not be present in every URL.
- (www\.)?: strings or regex rules inside parentheses form a group. This group ends with “?”, which makes the preceding character or group optional.
$ grep -Po '^https?:\/\/(www\.)?' webs.txt
http://
http://www.
https://
http://
Now we have to match the domain name.
- [a-z,A-Z,0-9]: the first character of the domain name may be a lowercase letter, an uppercase letter or a digit. The commas inside the brackets are literal characters, not separators, so [a-zA-Z0-9] is the tighter form.
- [a-z,A-Z,0-9\-]: every following character may be a letter, a digit or a dash
- *: the preceding character or set may appear zero or more times
- \.: the match ends with a dot
$ grep -Po '^https?:\/\/(www\.)?([a-z,A-Z,0-9][a-z,A-Z,0-9\-]*)\.' webs.txt
http://google.
http://www.public.
https://microsoft.
http://zzz.
We need to break down ([a-z,A-Z,0-9][a-z,A-Z,0-9\-]*)\.
- (): a group
- [a-z,A-Z,0-9]: the first character may be a-z, A-Z or 0-9
- [a-z,A-Z,0-9\-]*: the remaining characters, zero or more of them, may be a-z, A-Z, 0-9 or “-”
- \.: a dot, which marks the end of the second-level domain
Now we want to capture the top-level domain as well:
$ grep -Po '^https?:\/\/(www\.)?([a-z,A-Z,0-9][a-z,A-Z,0-9\-]*)\.([a-z,A-Z,0-9]{2,63})' webs.txt
http://google.com
http://www.public.gr
https://microsoft.com
http://zzz.com
To capture the top-level domain we use:
- ([a-z,A-Z,0-9]{2,63}): any run of letters or digits with a minimum length of 2 and a maximum of 63
Let's get the optional path of the URL:
$ grep -Po '^https?:\/\/(www\.)?([a-z,A-Z,0-9][a-z,A-Z,0-9\-]*)\.([a-z,A-Z]{2,63})\/?([a-z,A-Z,0-9][a-z,A-Z,0-9\-]*)?' webs.txt
http://google.com
http://www.public.gr/blog
https://microsoft.com
http://zzz.com
To capture the optional path of the URL we use the following expression:
\/?([a-z,A-Z,0-9][a-z,A-Z,0-9\-]*)?
- \/?: an optional /
- ([a-z,A-Z,0-9][a-z,A-Z,0-9\-]*)?: the optional path segment
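Once the URL is described by groups, the pieces can be pulled apart. The following is a rough sketch with sed -E rather than grep: it uses a tightened version of the pattern (no commas in the sets, and a simplified (/.*)? for the path) and prints the protocol, domain, top-level domain and path. The output layout is an assumption made for illustration.

```shell
# \1 = protocol, \2 = optional www., \3 = domain, \4 = TLD, \5 = optional path.
# '#' is used as the s command delimiter so the slashes need no escaping.
url_parts=$(printf '%s\n' 'http://www.public.gr/blog' |
  sed -E 's#^(https?)://(www\.)?([A-Za-z0-9][A-Za-z0-9-]*)\.([A-Za-z]{2,63})(/.*)?$#\1 \3 \4 \5#')

printf '%s\n' "$url_parts"
```

Group 2 is matched but deliberately left out of the replacement, which is how the www. prefix is dropped.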
Greedy and lazy regular expressions
- Greedy regular expressions try to match the longest possible string:
$ echo "konstantinos" | grep -Po 'k.*o'
konstantino
In this case ‘k.*o’ stops matching at the last “o” found.
- Lazy regular expressions try to match the shortest possible string:
$ echo "konstantinos" | grep -Po 'k.*?o'
ko
In this case ‘k.*?o’ stops matching at the first “o” found.
- ?: placed after a quantifier such as *, makes it lazy, so it stops at the first possible match
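Lazy quantifiers need a PCRE-style engine (grep -P). In engines without them, a similar effect can often be sketched with a negated character set: instead of "anything, lazily, up to o", ask for "anything that is not o, then o".

```shell
# Greedy: .* runs to the last "o" in the string.
greedy=$(echo "konstantinos" | grep -o 'k.*o')

# Negated set: [^o]* cannot step over an "o", so the match ends at the first one.
shortest=$(echo "konstantinos" | grep -o 'k[^o]*o')

printf 'greedy=%s shortest=%s\n' "$greedy" "$shortest"
```

This works here because the terminator is a single character; for a multi-character terminator a lazy quantifier is usually the simpler tool.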
Backreferences
Create the following file and save it as data.txt:
<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>
<DATA-TAG> DFGEFGDFFDSFS <DATA-TAG>
<DATA-TAG> DSFDFDGFSDSDS <DATA-TAG2>
If we want to grep the lines that start and end with <DATA-TAG>, we can apply the following regex:
$ grep -Po '(<DATA-TAG>)(.*)(<DATA-TAG>)' data.txt
<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>
<DATA-TAG> DFGEFGDFFDSFS <DATA-TAG>
Any regex between parentheses is a numbered group. In this case we have three groups, numbered 1 to 3. This is very useful because we can refer back to a group by its number instead of rewriting the rule:
$ grep -Po '(<DATA-TAG>)(.*)\1' data.txt
<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>
<DATA-TAG> DFGEFGDFFDSFS <DATA-TAG>
In this case \1 is a reference to the first group: (<DATA-TAG>)
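A backreference becomes more interesting when the group is a pattern and not a fixed string, because \1 then means "the same text the group matched on this line". The sketch below keeps only lines whose opening and closing tags are identical, whatever the tag is. The <OTHER> line is an assumption added for illustration, and backreferences with -E are assumed to be supported, as they are in GNU grep.

```shell
# Sample lines; the <OTHER> line is made up for this sketch.
sample=$(printf '%s\n' '<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>' \
  '<DATA-TAG> DSFDFDGFSDSDS <DATA-TAG2>' '<OTHER> ABC <OTHER>')

# (<[^>]+>) captures any tag; \1 requires exactly the same tag at the end.
balanced=$(printf '%s\n' "$sample" | grep -E '^(<[^>]+>) .* \1$')

printf '%s\n' "$balanced"
```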
Groups
Apart from numbered groups we also have named groups. A named group is just a group with a label that we can use to refer to the group instead of its number.
- ?'mytag': the name of the group
- \k'mytag': the reference to the group
$ grep -Po "(?'mytag'<DATA-TAG>)(.*)\k'mytag'" data.txt
<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>
<DATA-TAG> DFGEFGDFFDSFS <DATA-TAG>
What if we want to print only specific groups?
Unfortunately grep cannot print specific groups, but we can work around this with tools like perl.
- perl: the Perl programming language, available by default on most Linux distros
- -lne: -n runs the code once per input line, -l handles the line endings and -e takes the code from the command line
- '/regex/': the regex
- && print $2: print the second capturing group
- data.txt: the file to parse
$ perl -lne '/(<DATA-TAG>)(.*)(<DATA-TAG>)/ && print $2' data.txt
SDSFDFDFDSSDS
DFGEFGDFFDSFS
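If perl is not at hand, sed can do a similar job: replace the whole line with the group we want and print only the lines where the substitution happened. A minimal sketch, with the sample lines inlined and the surrounding spaces placed outside the group so they are not captured:

```shell
# -n suppresses normal output; the trailing p prints only substituted lines.
payload=$(printf '%s\n' '<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>' \
  '<DATA-TAG> DSFDFDGFSDSDS <DATA-TAG2>' |
  sed -n -E 's/^<DATA-TAG> (.*) <DATA-TAG>$/\1/p')

printf '%s\n' "$payload"
```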
Lookahead
Create the following file and save it as webs.txt:
http://google.com
1 http://www.public.gr/blog
2 https://www.public.gr
https://microsoft.com
http://zzz.com
Positive lookahead
- (?=...): a positive lookahead requires /blog to follow the match but does not include it in the output
$ grep -Po '(\d\shttps?://www.public.gr)(?=/blog)' webs.txt
1 http://www.public.gr
Negative lookahead
- (?!...): a negative lookahead succeeds only when /blog does not follow the match
$ grep -Po '(\d\shttps?://www.public.gr)(?!/blog)' webs.txt
2 https://www.public.gr
Lookbehind
Edit webs.txt to match the following content:
http://google.com
http://www.public.gr/blog
https://www.public.gr
https://microsoft.com
http://zzz.com
Positive lookbehind
- (?<=...): a positive lookbehind requires http:// to precede the match but does not include it in the output
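As an illustration of the positive lookbehind, the following sketch prints whatever follows http:// on each line without printing the prefix itself. It assumes a grep built with PCRE support (the -P flag); the \S+ part, which takes the rest of the non-space characters, is a choice made for this example.

```shell
# Three of the sample URLs, inlined to keep the sketch self-contained.
hosts=$(printf '%s\n' 'http://google.com' 'http://www.public.gr/blog' \
  'https://www.public.gr' |
  grep -Po '(?<=http://)\S+')

printf '%s\n' "$hosts"
```

The https line produces no output, because the text before its host is https:// and not http://.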
How to run regexes in parallel
A limitation of grep, sed and similar tools is that they process data serially. Combining them with GNU parallel can speed them up by processing the data in parallel.
- --pipepart: pipe parts of a physical file
- --block: the size of each part
- -P: the number of parallel jobs
- -a: the input file
$ parallel --pipepart --block 1M -P 8 -a file.txt grep regex-pattern
I hope you find this article useful!