Learn regular expressions the easy way

A regular expression is a sequence of characters that specifies a search pattern. Regexes are very useful in administration and programming tasks. In particular, it is often more efficient to use tools like grep or sed for matching and substitution than to write code that manipulates text.

Let's see them in action with a simple example.

Create the file ip_addresses.txt with the following content:

1.1.1.1
adsds.3.4.5
127.0.0.1
192.168.0.1

Now let's try to match all the valid IP addresses in this file using grep:

$ grep -P "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" ip_addresses.txt
1.1.1.1
127.0.0.1
192.168.0.1

As we can see, grep ignored adsds.3.4.5, which is not a valid IP address. Let's see how this regular expression works:

  • -P: tells grep to use Perl-compatible regular expressions, which is what makes \d available.
  • \d{1,3}: instructs grep to match a run of one to three digits.
  • \.: a literal match; the backslash makes the dot match the “.” character itself.

In this case grep reports a match when it finds four numbers separated by dots, where each number is one to three digits long.
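One thing worth keeping in mind is that this pattern only checks the shape of the text, not whether each number is in the 0 to 255 range, and it is not anchored to the whole line. The sketch below uses a made-up three-line sample (not the article's file) to show the difference that the ^ and $ anchors make:

```shell
# Hypothetical sample input: only the first line is a real IPv4 address.
sample='10.0.0.1
999.999.999.999
x1.2.3.4.5y'

# Unanchored pattern: a line matches if it CONTAINS four dotted digit groups.
loose=$(printf '%s\n' "$sample" | grep -cP '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

# Anchored pattern: the WHOLE line must be four dotted digit groups.
anchored=$(printf '%s\n' "$sample" | grep -P '^\d{1,3}(\.\d{1,3}){3}$')

echo "lines matched by the loose pattern: $loose"
echo "lines matched by the anchored pattern:"
echo "$anchored"
```

Even the anchored version still accepts 999.999.999.999, so a real validity check would need more than this.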

Literal pattern matching, be careful!

$ grep -P "1.1.1.1" ip_addresses.txt
1.1.1.1

We can see that in this case it matched the IP address, but things are not as clear as they seem.

Open ip_addresses.txt and change 1.1.1.1 to 1a1.1.1.

Now re-run grep with the same parameters:

$ grep -P "1.1.1.1" ip_addresses.txt
1a1.1.1

We can see that it matched 1a1.1.1, but why?

This happens because the regex engine does not interpret the dot as the character “.” but as a regex rule that means “any character”. To match a literal dot we must use the \. syntax.

Executing this should not return anything:

$ grep -P "1\.1\.1\.1" ip_addresses.txt

But if we change 1a1.1.1 back to 1.1.1.1 and re-run this grep statement, we should get back a match:

$ grep -P "1\.1\.1\.1" ip_addresses.txt
1.1.1.1
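When the whole pattern is meant literally, another option is grep's -F flag, which treats the pattern as a fixed string instead of a regex. A minimal sketch, using a made-up two-line sample rather than the article's file:

```shell
# Hypothetical sample: one real dotted address and one look-alike.
sample='1.1.1.1
1a1.1.1'

# As a regex, each unescaped dot matches any character: both lines match.
regex_hits=$(printf '%s\n' "$sample" | grep -c '1.1.1.1')

# With -F the pattern is a fixed string: only the literal text matches.
fixed_hits=$(printf '%s\n' "$sample" | grep -F '1.1.1.1')

echo "regex matched $regex_hits lines"
echo "fixed string matched: $fixed_hits"
```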

Characters and words

Create the file data.txt with the following content:

A-100   AAA $1000
A-200   BBB $1000
A-500   BBB $1500
B-300   CCC $3000
A-600   BBB $XXXX
A-600   GGG $2500
C-100   BBB $2500
C-100   BBB $2500
D-100   BBB $2500
D-100   BBB $2500

Let's find only the lines that contain “C-”, which in this file are the lines that start with it. That is easy:

$ grep  "C-" data.txt
C-100   BBB $2500
C-100   BBB $2500

Let's make things a bit more complicated and match every line that contains A, B or C followed by “-”:

$ grep  "[A-C]-" data.txt
A-100   AAA $1000
A-200   BBB $1000
A-500   BBB $1500
B-300   CCC $3000
A-600   BBB $XXXX
A-600   GGG $2500
C-100   BBB $2500
C-100   BBB $2500

[]: a character set. [A-C] matches any single character in the range A to C. If we omit the “-” inside the brackets, [AC] matches only A or C, so we get the lines that contain A or C followed by “-”:

$ grep  "[AC]-" data.txt
A-100   AAA $1000
A-200   BBB $1000
A-500   BBB $1500
A-600   BBB $XXXX
A-600   GGG $2500
C-100   BBB $2500
C-100   BBB $2500

Let's try to match the lines that contain A or C, followed by a non-word character (the “-”) and three word characters (here, 3 digits):

$ grep  "[AC]\W\w\w\w" data.txt
A-100   AAA $1000
A-200   BBB $1000
A-500   BBB $1500
A-600   BBB $XXXX
A-600   GGG $2500
C-100   BBB $2500
C-100   BBB $2500

  • \W: matches a non-word character (anything other than A-Z, a-z, 0-9 and the underscore)
  • \w: matches a word character (A-Z, a-z, 0-9 and the underscore)

So in this case we instruct grep to match any line that contains A or C, followed by a non-word character (here the “-”) and three word characters in a row (here 3 digits).

Digits

Instead of using \w to match digits, we can use \d, which matches digits only. With grep, \d needs the -P option.
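To see why the difference matters, here is a small sketch on a made-up sample (the three lines are not from data.txt). \w happily accepts letters and underscores, while \d accepts digits only:

```shell
# Hypothetical sample: digits, letters, and an underscore after the dash.
sample='A-100
A-abc
A-_x1'

# \w matches letters, digits and the underscore: all three lines match.
word_hits=$(printf '%s\n' "$sample" | grep -c 'A\W\w\w\w')

# \d matches digits only: just the first line matches.
digit_hits=$(printf '%s\n' "$sample" | grep -P 'A\W\d\d\d')

echo "\\w matched $word_hits lines"
echo "\\d matched: $digit_hits"
```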

Let's try to match every line that contains A or D, followed by “-”, then a 1 or a 5, then two more digits:

$ grep -P "[AD]-[15]\d\d" data.txt
A-100   AAA $1000
A-500   BBB $1500
D-100   BBB $2500
D-100   BBB $2500

Character sets also support ranges of digits. If we change [15] to [1-5] we get an extra line, A-200, because “2” is in the range [1-5]:

$ grep -P "[AD]-[1-5]\d\d" data.txt
A-100   AAA $1000
A-200   BBB $1000
A-500   BBB $1500
D-100   BBB $2500
D-100   BBB $2500
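A detail that is easy to miss: the members of a character set are written one after another with no separator. A comma inside the brackets is matched as a literal comma. The sketch below shows this on a made-up sample:

```shell
# Hypothetical input: the second line has a comma where a digit should be.
sample='A-100
A-,00
A-300'

# [1,5] means "1, a comma, or 5", so the odd second line matches too.
with_comma=$(printf '%s\n' "$sample" | grep -P 'A-[1,5]\d\d')

# [15] means "1 or 5" and nothing else.
without_comma=$(printf '%s\n' "$sample" | grep -P 'A-[15]\d\d')

echo "with comma in the set:"
echo "$with_comma"
echo "without comma in the set:"
echo "$without_comma"
```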

sed

sed, the stream editor, uses regular expressions to manipulate text. A very common task for sed is to replace every run of whitespace with a single space. Whitespace can be the space character or a tab.

Create the following file and save it as test.txt:

sdsdsd     dsdsds dsdsd  dsdsd dsdsd
fdsafadsfs              dsfsfedf dfsfd fdf
sdsd d sd sd s fe f ef s cs fe f ef

Note that some of the gaps in this file are tabs, not spaces. We can make the tabs visible with the cat -t command:

$ cat -t ./test.txt
sdsdsd     dsdsds dsdsd  dsdsd dsdsd
fdsafadsfs^I^Idsfsfedf dfsfd fdf
sdsd d sd sd s fe f ef s cs fe f ef

Now, to collapse every run of spaces and tabs into a single space, we can use the following sed command:

$ sed -r 's/\s+/ /g' test.txt
sdsdsd dsdsds dsdsd dsdsd dsdsd
fdsafadsfs dsfsfedf dfsfd fdf
sdsd d sd sd s fe f ef s cs fe f ef

Without the -i parameter, sed writes the processed text to stdout. To change the file itself, use the following:

$ sed -r -i.backup 's/\s+/ /g' test.txt

  • \s+: matches one or more whitespace characters in a row
  • -r: use extended regular expressions

The -i option instructs sed to edit the file in place, and the .backup suffix instructs sed to keep a backup of the original file with the .backup extension (we can use any extension we like, I just like full names).

-rw-r--r-- 1 kpatronas kpatronas  106 Jul 11 00:41  test.txt.backup
-rw-r--r-- 1 kpatronas kpatronas  100 Jul 11 00:51  test.txt
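The -r flag and \s are GNU sed spellings. As a sketch of a more portable variant (an alternative, not the command used above), -E together with the POSIX class [[:space:]] does the same job. The one-line input is made up for illustration:

```shell
# Hypothetical one-line input with a run of spaces and two tabs.
collapsed=$(printf 'aaa   bbb\t\tccc\n' | sed -E 's/[[:space:]]+/ /g')

echo "$collapsed"
```

This prints aaa bbb ccc, with every run of whitespace replaced by one space.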

Location

One very common task is to find only text that starts or ends with specific characters. Let's go back to the file data.txt:

A-100   AAA $1000
A-200   BBB $1000
A-500   BBB $1500
B-300   CCC $3000
A-600   BBB $XXXX
A-600   GGG $2500
C-100   BBB $2500
C-100   BBB $2500
D-100   BBB $2500
D-100   BBB $2500

Let's filter only the lines that start with B. To do this we have to use the ^ character:

$ grep '^B' data.txt
B-300   CCC $3000

Now let's filter only the lines that start with C and end with $2500:

$ grep '^C.*\$2500$' data.txt
C-100   BBB $2500
C-100   BBB $2500

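  • ^C: the ^ anchors the match to the start of the line, so the line must start with C
  • .*: any characters, any number of times
  • \$: matches the literal dollar sign
  • 2500$: the $ anchors the match to the end of the line, so the line must end with 2500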
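The effect of the anchors is easiest to see by comparing an anchored and an unanchored pattern on the same input. A minimal sketch, using three lines in the style of data.txt:

```shell
# Three sample lines in the style of data.txt.
sample='A-200   BBB $1000
B-300   CCC $3000
C-100   BBB $2500'

# Without an anchor, B may appear anywhere in the line: all three lines match.
anywhere=$(printf '%s\n' "$sample" | grep -c 'B')

# With ^, B must be the first character of the line.
at_start=$(printf '%s\n' "$sample" | grep '^B')

# \$ is a literal dollar sign, the final $ anchors to the end of the line.
at_end=$(printf '%s\n' "$sample" | grep -c '\$2500$')

echo "B anywhere: $anywhere lines"
echo "B at start: $at_start"
echo "ends with \$2500: $at_end line"
```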

Boundaries

There might be cases where we want to find a string that is not part of a longer string. This means that the characters right before and right after the string must not be letters or digits.

Let's try to find the lines that contain 500 in the file data.txt:

$ grep '500' data.txt
A-500   BBB $1500
A-600   GGG $2500
C-100   BBB $2500
C-100   BBB $2500
D-100   BBB $2500
D-100   BBB $2500

It works, in a way, but what if we want only the lines where 500 is not surrounded by any digit or letter? We can use \b, which matches a word boundary:

$ grep '\b500\b' data.txt
A-500   BBB $1500

grep found only this line because the “-” before 500 and the space after it are not word characters.
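For simple cases like this one, grep also has a -w flag that only accepts matches forming a whole word. The sketch below (an alternative spelling, shown on two lines in the style of data.txt) gives the same result as the \b version here:

```shell
# Two sample lines in the style of data.txt.
sample='A-500   BBB $1500
A-600   GGG $2500'

# \b on both sides: 500 must not touch another letter, digit or underscore.
with_b=$(printf '%s\n' "$sample" | grep '\b500\b')

# -w asks grep for whole-word matches only.
with_w=$(printf '%s\n' "$sample" | grep -w '500')

echo "$with_b"
echo "$with_w"
```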

Alternation

What if we want to select the lines that contain 100 or 200, not surrounded by any letter or digit?

$ grep -E '\b(100|200)\b' data.txt
A-100   AAA $1000
A-200   BBB $1000
C-100   BBB $2500
C-100   BBB $2500
D-100   BBB $2500
D-100   BBB $2500

  • -E: use extended regular expressions
  • (string1|string2): match string1 or string2
  • \b: a word boundary
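The -E flag matters here. Without it, grep uses basic regular expressions, where the plain characters (, | and ) are taken literally. A small sketch on three lines in the style of data.txt:

```shell
# Three sample lines in the style of data.txt.
sample='A-100   AAA $1000
A-200   BBB $1000
A-500   BBB $1500'

# Extended syntax: ( | ) build an alternation, so two lines match.
extended=$(printf '%s\n' "$sample" | grep -cE '\b(100|200)\b')

# Basic syntax: the same pattern looks for the literal text "(100|200)".
# grep exits non-zero when nothing matches, hence the "|| true".
basic=$(printf '%s\n' "$sample" | grep -c '\b(100|200)\b' || true)

echo "with -E: $extended lines"
echo "without -E: $basic lines"
```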

Repetition

Let's consider the following URL: https://www.public.gr/blog. This URL consists of:

  • https: the protocol
  • www: subdomain
  • public: domain
  • gr: top-level domain
  • blog: path

Save the following as webs.txt:

http://google.com
http://www.public.gr/blog
https://microsoft.com
http://zzz.com

The first thing to match is the protocol, which can be http or https.

  • ?: using ? after a character makes the matching of this character optional
  • -o: makes grep print only the part of each line that matched

    $ grep -Po '^https?' webs.txt
    http
    http
    https
    http

Then we need to match ://

  • \: placed before a character, it makes that character match literally (the / is not special in grep, so escaping it here is optional)

    $ grep -Po '^https?:\/\/' webs.txt
    http://
    http://
    https://
    http://

Now we need to match the subdomain, which might not be present in all URLs.

  • (www\.)?: strings or regex rules inside parentheses form a group. Note that this group ends with “?”, which means that matching the preceding character or group is optional

    $ grep -Po '^https?:\/\/(www\.)?' webs.txt
    http://
    http://www.
    https://
    http://

Now we have to match the domain name

  • [a-zA-Z0-9]: the first rule inside the group means that the first character of the domain name can be a lowercase letter, an uppercase letter or a digit
  • [a-zA-Z0-9\-]: from the second character on, the name can contain lowercase letters, uppercase letters, digits and dashes
  • *: an asterisk means that the preceding character or set can appear zero or more times
  • \.: the match ends with a dot

Note that the members of a set are written without separators; a comma inside the brackets would be matched as a literal comma.

    $ grep -Po '^https?:\/\/(www\.)?([a-zA-Z0-9][a-zA-Z0-9\-]*)\.' webs.txt
    http://google.
    http://www.public.
    https://microsoft.
    http://zzz.

Let's break down ([a-zA-Z0-9][a-zA-Z0-9\-]*)\.

  • (): a group
  • [a-zA-Z0-9]: the first character can be a-z, A-Z or 0-9
  • [a-zA-Z0-9\-]*: the rest of the characters might or might not exist, and can be a-z, A-Z, 0-9 or “-”
  • \.: a dot, this is the end of the second-level domain

Now we want to capture the top-level domain:

$ grep -Po '^https?:\/\/(www\.)?([a-zA-Z0-9][a-zA-Z0-9\-]*)\.([a-zA-Z]{2,63})' webs.txt
http://google.com
http://www.public.gr
https://microsoft.com
http://zzz.com

To capture the top-level domain we use:

  • ([a-zA-Z]{2,63}): matches a run of letters with a minimum length of 2 and a maximum length of 63

Let's get the optional part of the URL, the path:

$ grep -Po '^https?:\/\/(www\.)?([a-zA-Z0-9][a-zA-Z0-9\-]*)\.([a-zA-Z]{2,63})\/?([a-zA-Z0-9][a-zA-Z0-9\-]*)?' webs.txt
http://google.com
http://www.public.gr/blog
https://microsoft.com
http://zzz.com

To capture the optional part of the URL we use the following expression:

\/?([a-zA-Z0-9][a-zA-Z0-9\-]*)?

  • \/?: an optional /
  • ([a-zA-Z0-9][a-zA-Z0-9\-]*)?: the optional part of the URL
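The repetition operators used above (?, * and {m,n}) are easier to compare side by side on a tiny input. This is only a sketch with made-up lines, anchored with ^ and $ so that the whole line has to match:

```shell
# Hypothetical input: the letter s repeated zero, one and two times.
sample='http
https
httpss'

# s? : zero or one s.
optional=$(printf '%s\n' "$sample" | grep -P '^https?$')

# s* : zero or more s, so every line matches.
star=$(printf '%s\n' "$sample" | grep -cP '^https*$')

# s{2,3} : between two and three s.
bounded=$(printf '%s\n' "$sample" | grep -P '^https{2,3}$')

echo "s? matched:"
echo "$optional"
echo "s* matched $star lines"
echo "s{2,3} matched: $bounded"
```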

Greedy and lazy regular expressions

  • Greedy regular expressions try to match the longest possible string.

    $ echo "konstantinos" | grep -Po 'k.*o'
    konstantino

In this case 'k.*o' stops matching at the last “o” found.

  • Lazy regular expressions try to match the shortest possible string.

    $ echo "konstantinos" | grep -Po 'k.*?o'
    ko

In this case 'k.*?o' stops matching at the first “o” found.

  • ?: placed after a quantifier such as *, it makes the quantifier lazy, so it stops at the first possible match
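A case where the difference shows up in practice is text with repeated delimiters. The sketch below uses a made-up HTML-like string: the greedy pattern swallows everything between the first opening tag and the last closing tag, while the lazy one returns each pair separately.

```shell
# Hypothetical input with two tag pairs on one line.
html='<b>one</b> and <b>two</b>'

# Greedy: .* runs to the LAST </b>, giving a single long match.
greedy=$(printf '%s\n' "$html" | grep -Po '<b>.*</b>')

# Lazy: .*? stops at the FIRST </b>, so -o prints two short matches.
lazy=$(printf '%s\n' "$html" | grep -Po '<b>.*?</b>')

echo "greedy:"
echo "$greedy"
echo "lazy:"
echo "$lazy"
```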

Backreferences

Create the following file and save it as data.txt (this replaces the earlier data.txt):

<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>
<DATA-TAG> DFGEFGDFFDSFS <DATA-TAG>
<DATA-TAG> DSFDFDGFSDSDS <DATA-TAG2>

If we want to grep the lines that start and end with <DATA-TAG>, we can apply the following regex:

$ grep -Po '(<DATA-TAG>)(.*)(<DATA-TAG>)' data.txt
<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>
<DATA-TAG> DFGEFGDFFDSFS <DATA-TAG>

Any regex between parentheses is a numbered capturing group. In this case we have three groups, numbered 1 to 3. This is very useful because we can refer to a group by its number instead of writing the same rule again:

$ grep -Po '(<DATA-TAG>)(.*)\1' data.txt
<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>
<DATA-TAG> DFGEFGDFFDSFS <DATA-TAG>

In this case \1 is a reference to the first group: (<DATA-TAG>)
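Backreferences become really useful when the repeated text is not known in advance. A classic sketch is finding a word that was typed twice in a row; the sentence here is made up:

```shell
# (\w+) captures any word, \1 requires the very same word again.
doubled=$(printf 'this is is a test\n' | grep -Po '\b(\w+) \1\b')

echo "$doubled"
```

This prints is is, the doubled word, without us ever writing “is” in the pattern.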

Groups

Apart from numbered groups there are also named groups. A named group is just a group with a label, which we can use to refer to the group instead of its number.

  • (?'mytag'...): defines a group named mytag
  • \k'mytag': a reference to the group named mytag

    $ grep -Po "(?'mytag'<DATA-TAG>)(.*)\k'mytag'" data.txt
    <DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>
    <DATA-TAG> DFGEFGDFFDSFS <DATA-TAG>

What if we want to print only specific groups?

Unfortunately, grep cannot print only specific groups, but we can work around this with tools like perl.

  • perl: the Perl programming language, available in most Linux distros by default
  • -lne: -n runs the code once for every input line, -l takes care of the line endings and -e supplies the code on the command line
  • /regex/: the regex
  • && print $2: print the second capturing group when the regex matches
  • data.txt: the file to parse

    $ perl -lne '/(<DATA-TAG>)(.*)(<DATA-TAG>)/ && print $2' data.txt
    SDSFDFDFDSSDS
    DFGEFGDFFDSFS
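If perl is not at hand, two other approaches can get at the middle part. The sketch below is an alternative, not the method used above: sed prints a group through a substitution, and grep -P can trim the match with \K (which drops everything matched so far) plus a lookahead. The spaces around the payload are part of the pattern here, so they are not printed.

```shell
# One sample line in the style of data.txt.
line='<DATA-TAG> SDSFDFDFDSSDS <DATA-TAG>'

# sed: replace the whole line with group 1; -n plus the p flag prints
# only the lines where the substitution happened.
with_sed=$(printf '%s\n' "$line" | sed -nE 's/^<DATA-TAG> (.*) <DATA-TAG>$/\1/p')

# grep -P: \K discards the opening tag from the match, and the
# lookahead (?=...) checks for the closing tag without matching it.
with_grep=$(printf '%s\n' "$line" | grep -Po '<DATA-TAG> \K.*(?= <DATA-TAG>)')

echo "$with_sed"
echo "$with_grep"
```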

Lookahead

Create the following file and save it as webs.txt:

http://google.com
1 http://www.public.gr/blog
2 https://www.public.gr
https://microsoft.com
http://zzz.com

Positive lookahead
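A positive lookahead, written (?=...), requires the text inside it to come right after the match without making it part of the match. The following is only an illustrative sketch with made-up URLs (not the webs.txt file above): it prints the protocol, but only for the URLs whose host starts with www.

```shell
# Hypothetical URLs, similar to the ones used in this article.
urls='http://google.com
http://www.public.gr/blog
https://www.public.gr'

# (?=://www\.) checks that "://www." follows, but does not print it.
protocols=$(printf '%s\n' "$urls" | grep -Po '^https?(?=://www\.)')

echo "$protocols"
```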

Negative lookahead
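A negative lookahead, written (?!...), is the opposite: the match succeeds only if the text inside it does not come next. As a sketch on made-up URLs, this keeps only the lines whose host does not start with www.

```shell
# Hypothetical URLs, similar to the ones used in this article.
urls='http://google.com
http://www.public.gr/blog
https://www.public.gr'

# (?!www\.) rejects the match when "www." follows the "://".
no_www=$(printf '%s\n' "$urls" | grep -P '^https?://(?!www\.)')

echo "$no_www"
```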

Lookbehind

Edit webs.txt to match the following content:

http://google.com
http://www.public.gr/blog
https://www.public.gr
https://microsoft.com
http://zzz.com

Positive lookbehind

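  • (?<=...): a positive lookbehind requires the text inside it to appear right before the match, without making it part of the match. For example, (?<=http://) matches only text that follows http://, but the http:// itself is not printed.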
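A minimal sketch of a positive lookbehind, on made-up URLs. Depending on the PCRE version, the pattern inside a lookbehind may need a fixed length, so this example sticks to the plain http:// and the https line is simply not matched:

```shell
# Hypothetical URLs, similar to the ones in webs.txt.
urls='http://google.com
https://microsoft.com
http://zzz.com'

# (?<=http://) requires "http://" right before the match; \S+ then
# matches the rest of the URL, which is the only part -o prints.
hosts=$(printf '%s\n' "$urls" | grep -Po '(?<=http://)\S+')

echo "$hosts"
```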

How to run regexes in parallel

A limitation of grep, sed and similar tools is that they process data serially. Combining them with GNU parallel can enhance them and process the data in parallel.

  • --pipepart: pipe parts of a physical file
  • --block: the size of each part
  • -P: the number of parallel jobs
  • -a: the input file

    $ parallel --pipepart --block 1M -P 8 -a file.txt grep regex-pattern

I hope you find this article useful!
