Python Regular Expressions
Regular expressions are a powerful tool for many kinds of string manipulation. A regular expression is a special text string that describes a search pattern, which can be used to extract information from text such as code, files, logs, and spreadsheets.
Regular expressions are a domain-specific language (DSL) that is available as a library in most modern programming languages; in Python, that library is the re module. A regular expression is a special sequence of characters that helps to match or find strings within another string.
The match() function tries to match a pattern at the beginning of a string, with optional flags. It has the following syntax:
re.match(pattern, string, flags=0)
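If the pattern matches at the start of the string, a match object is returned; otherwise the result is None. The flags argument is optional, and some of its values are listed in the following table:
|Flag|Meaning|
|re.I|Case-insensitive matching|
|re.M|Makes ^ and $ match at the start and end of each line, not only of the whole string|
|re.X|Ignores white-space characters in the pattern and allows comments in it|
|re.U|Interprets letters according to the Unicode character set (the default for str patterns in Python 3)|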
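The behaviour of match() and of the flags above can be tried out with a short sketch. The sample strings and variable names are made up for illustration; re.findall() and re.compile() are used only because they make the effect of re.M and re.X easy to see.

```python
import re

text = "Python is fun"  # sample string, made up for this sketch

# re.match only succeeds when the pattern matches at the start of the string.
m_plain = re.match(r"python", text)        # None: the case differs
m_icase = re.match(r"python", text, re.I)  # re.I ignores case
m_later = re.match(r"fun", text)           # None: "fun" is not at the start

print(m_plain)
print(m_icase.group())
print(m_later)

# re.M makes ^ match at the start of every line, not only at the start of the string.
line_starts = re.findall(r"^\w+", "first line\nsecond line", re.M)
print(line_starts)

# re.X allows white space and comments inside the pattern.
phone = re.compile(r"""
    (\d{3})   # first three digits
    -         # literal dash
    (\d{4})   # last four digits
""", re.X)
phone_groups = phone.match("555-1234").groups()
print(phone_groups)
```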
The search() function searches for the first occurrence of a pattern within a string, with optional flags. If the search is successful, a match object is returned; otherwise None is returned. It has the following syntax:
re.search(pattern, string, flags=0)
Note: re.search() finds a match of a pattern anywhere in the string.
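A small sketch of the difference between match() and search(), using a made-up sample string: match() fails because the digits are not at the start, while search() finds them further along.

```python
import re

text = "Order number: 4521"  # sample string, made up for this sketch

at_start = re.match(r"\d+", text)   # None: the string does not start with digits
anywhere = re.search(r"\d+", text)  # match object for the first run of digits

print(at_start)
if anywhere:
    # group() is the matched text; start() and end() give its position.
    print(anywhere.group(), anywhere.start(), anywhere.end())
```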
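The sub() function in the re module searches for a pattern in a string and replaces each match with a replacement string. The optional count argument limits the number of replacements; the default of 0 replaces all occurrences. It has the following syntax:
re.sub(pattern, repl, string, count=0, flags=0)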
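As a rough illustration of sub(), the sketch below replaces dashes in a made-up string, first everywhere, then only once via count, and finally uses backreferences (\1, \2, \3) in the replacement to reorder captured groups.

```python
import re

text = "2021-03-04 and 2022-11-30"  # sample string, made up for this sketch

# Replace every dash with a slash.
all_replaced = re.sub(r"-", "/", text)

# Replace only the first dash.
first_only = re.sub(r"-", "/", text, count=1)

# Reorder year-month-day into day/month/year using group backreferences.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(all_replaced)
print(first_only)
print(swapped)
```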
The findall() function searches a string and returns a list of all non-overlapping matches of the pattern. If no match is found, the returned list is empty. It has the following syntax:
matchlist=re.findall(pattern, input_str, flags=0)
Note: re.findall() function returns a list of all substrings that match a pattern.
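A minimal sketch of findall() on a made-up string. It also shows one detail worth knowing: when the pattern contains several groups, findall() returns tuples of the group contents rather than the whole matched substrings.

```python
import re

text = "cat bat rat cot"  # sample string, made up for this sketch

words = re.findall(r"[cbr]at", text)   # every substring matching the pattern
missing = re.findall(r"dog", text)     # no match gives an empty list
parts = re.findall(r"(\w)(at)", text)  # with groups, each item is a tuple of groups

print(words)
print(missing)
print(parts)
```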
The finditer() function is similar to findall(), but instead of returning a list of strings, it returns an iterator of match objects. Each match object can be used to print the index of the match in the given string.
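The following sketch iterates over the result of finditer() on a made-up string and reads the position of each match from its match object.

```python
import re

text = "spam and eggs and spam"  # sample string, made up for this sketch

# Each item produced by finditer() is a match object, not a plain string.
for m in re.finditer(r"spam", text):
    print(m.group(), m.start(), m.end())

# The same positions collected into a list of (start, end) pairs.
positions = [(m.start(), m.end()) for m in re.finditer(r"spam", text)]
print(positions)
```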
A group is created by surrounding a part of the regular expression with parentheses. A metacharacter such as * can then be applied to the whole group, as in the following example:
import re

pattern = r"gr(ea)*t"

if re.match(pattern, "great"):
    print("Ram is ea")
if re.match(pattern, "greaeaeaeaeaeaeat"):
    print("Ram is greaeaeaeaeaeaeat")

Output:
Ram is ea
Ram is greaeaeaeaeaeaeat
Python supports two useful types of groups:
1. Named Group
2. Non-capturing Group
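A named group has the format (?P<name>...), where name is the name of the group and ... is its content. Named groups behave like normal groups, but they can be accessed by group(name) in addition to their number.
A non-capturing group has the format (?:...). Non-capturing groups are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.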
Example of Named Group and Non-Capturing Group:
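import re

pattern = r"Go(?P<FIRST>od)Go(?:in)gPy(th)on"
match = re.match(pattern, "GoodGoingPythonGoodGoingPythonGoodGoingPython")

# Test the match object itself; re.match (the function) is always truthy.
if match:
    print(match.group("FIRST"))  # named group
    print(match.group(1))        # the same group, by number
    print(match.group(2))        # (?:in) is skipped in the numbering
    print(match.groups())

Output:
od
od
th
('od', 'th')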
Application of Regular Expressions:
We can use regular expressions to extract dates, times, e-mail addresses, etc. from text.
Example: An e-mail address has a username, which consists of word characters and may include dots or dashes. The username is followed by the @ sign and the domain name. The domain name may also include word characters, dashes, and dots.
Now consider an e-mail address of this form. The regular expression representing its structure can be given as:
pattern = r"[\w.-]+@[\w.-]+"
where [\w.-]+ matches one or more occurrences of word characters, dots, or dashes.
import re

pattern = r"[\w.-]+@[\w.-]+"
# info@example.com is a placeholder address used for illustration.
string = " Please write us at info@example.com"

match = re.search(pattern, string)
if match:
    print("Email to:", match.group())
else:
    print("No Match")

Output:
Email to: info@example.com
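The same approach extends to dates and times. The sketch below assumes dates written as YYYY-MM-DD and times written as HH:MM; the sample sentence and both patterns are illustrative only and do not check that the values are valid (for example, month 13 would still match).

```python
import re

# Sample sentence, made up for this sketch.
text = "Meeting on 2023-05-17 at 14:30, follow-up on 2023-06-01 at 09:15."

# Four digits, dash, two digits, dash, two digits.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Two digits, colon, two digits, delimited by word boundaries.
times = re.findall(r"\b\d{2}:\d{2}\b", text)

print(dates)
print(times)
```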