What does the following re code mean? - python

I don't understand this code in Python. I know what the function does, but I don't understand how it works or what the \\ do in the replacement string.
import re
caps="([A-Z])"
pre="(Mr|mr|Mr|St|st|ST|Mrs|MRS|mrs|Ms|MS|ms|Dr|DR|dr|miss|Miss|MISS)[\.\.\.]"
def tokenize_sentence(text):
    text=re.sub(" ?"+pre,"\\1<dot>",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    print(text)
tokenize_sentence("Mr. Ansh sahajpal A.B.C.")
The output is:
Mr<dot> Ansh sahajpal A<prd>B<prd>C<prd>
It would be great if you could explain what \\1, \\2 and \\3 do in the sub calls. Also, I changed the input to Mr... Ansh sahajpal and changed line 5 to:
text=re.sub(" ?"+pre,"\\1<dot>\\2<dot>",text)
I thought this would replace the first . and the second ., but it gave me an error.
Any explanation would help.

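The \1, \2, etc. refer to the text matched by capture groups (subexpressions enclosed in parentheses) in the regular expression. \1 is the first group, \2 is the second, and so on. In the replacement string they mark where the text matched by each group is inserted. In a normal (non-raw) Python string literal the backslash itself has to be escaped, which is why the code writes \\1; the raw string r'\1' is equivalent.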
Capture groups are written with parentheses. Here is an example:
s = 'an example word:cat!!'
print(re.sub(r'(word:)\w+', r'\1dog', s))
which matches word: followed by one or more word characters, puts the captured word: back via \1, and yields:
an example word:dog!!
In this case, (word:) is the capture group, and \w+ matches one or more word characters (letters, digits, and "_").
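As a small sketch of how the group references behave (the shortened title list, the sample strings and the helper name mark_dots are only illustrative), the following shows that "\\1" and r"\1" are the same replacement string, that \1 re-inserts the text captured by the first pair of parentheses, and why \\2 fails when the pattern contains only one group:

```python
import re

# Shortened, illustrative version of the title pattern: exactly ONE capture group.
pre = r"(Mr|Mrs|Ms|Dr|St)[.]"

# In a normal string literal "\\1" is a backslash followed by 1, the same as r"\1".
same_string = "\\1" == r"\1"

# \1 puts back whatever group 1 captured ("Mr"); the dot is replaced by <dot>.
one_group = re.sub(pre, r"\1<dot>", "Mr. Ansh")

# The pattern has no second group, so re.sub rejects the reference to \2.
try:
    re.sub(pre, r"\1<dot>\2<dot>", "Mr. Ansh")
    error_message = None
except re.error as exc:
    error_message = str(exc)

# One possible way to turn every dot into <dot>: capture the run of dots as
# group 2 and build the replacement in a function (hypothetical helper).
def mark_dots(match):
    return match.group(1) + "<dot>" * len(match.group(2))

all_dots = re.sub(r"(Mr)([.]+)", mark_dots, "Mr... Ansh")

print(same_string, one_group, error_message, all_dots, sep="\n")
```

The replacement function is just one option; adding a second pair of parentheses to the pattern is what makes a second group available at all.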

Related

Ignore an optional word if present in a string - regular expression in python

I'm trying to match a string with regular expression using Python, but ignore an optional word if it's present.
For example, I have the following lines:
First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]
I'm looking to capture everything before [Ignore This Part]. Notice I also want to exclude the whitespace before [Ignore This Part]. Therefore my results should look like this:
First string
Second string
Third string (1)
I have tried the following regular expression with no luck, because it still captures [Ignore This Part]:
.+(?:\s\[.+\])?
Any assistance would be appreciated.
I'm using Python 3.8 on Windows 10.
Edit: The examples are meant to be processed one line at a time.
Use [^[\n] instead of . so the match stops at the first square bracket and doesn't cross newlines.
^[^[\n]+(?:\s\[.+\])?
Perhaps you can remove the part that you don't want to match:
[^\S\n]*\[[^][\n]*]$
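Explanation
[^\S\n]* Match optional whitespace characters other than newlines
\[[^][\n]*] Match from [ to the closing ] without crossing brackets or newlines
$ End of the line (with re.M) or end of the string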
Example
import re
pattern = r"[^\S\n]*\[[^][\n]*]$"
s = ("First string\n"
     "Second string [Ignore This Part]\n"
     "Third string (1) [Ignore This Part]")
# Pass flags by keyword; re.M makes $ match at the end of every line
result = re.sub(pattern, "", s, flags=re.M)
if result:
    print(result)
Output
First string
Second string
Third string (1)
If you don't want to be left with an empty string, you can assert a non whitespace char to the left:
(?<=\S)[^\S\n]*\[[^][\n]*]$
With your shown samples, please try the following code, written and tested in Python 3.
import re
var="""First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]"""
[x for x in list(map(lambda x:x.strip(),re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])',var))) if x]
The output is a list, whose items can then be accessed as needed:
['First string', 'Second string', 'Third string (1)']
Here is a detailed explanation of the Python 3 code above:
First, re.split is called with the regex (.*?)(?:$|\s\[[^]]*\]) and the inline multiline flag (?m): re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])',var). Because the pattern contains a capture group, the captured text is included in the resulting list.
Then map applies a lambda that calls strip on each item, which turns the leftover newline pieces into empty strings.
list collects the mapped items.
Finally, the list comprehension drops the empty strings, leaving only the parts the OP asked for.
You may use this regex:
^.+?(?=$|\s*\[[^]]*]$)
If you want a better-performing regex, I suggest:
^\S+(?:\s+\S+)*?(?=$|\s*\[[^]]*]$)
RegEx Details:
^: Start
.+?: Match 1+ of any characters (lazy match)
(?=: Start lookahead
$: End
|: OR
\s*: Match 0 or more whitespaces
\[[^]]*]: Match [...] text
$: End
): Close lookahead
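A minimal sketch of how the lookahead pattern could be applied in Python, assuming the three sample lines from the question sit in one multi-line string and re.M is used so that ^ and $ work per line:

```python
import re

lines = (
    "First string\n"
    "Second string [Ignore This Part]\n"
    "Third string (1) [Ignore This Part]"
)

# Lazy match from the start of each line, stopping where the lookahead sees
# either the end of the line or optional whitespace plus a trailing [...] block.
pattern = re.compile(r"^.+?(?=$|\s*\[[^]]*]$)", re.M)

results = pattern.findall(lines)
print(results)
```

Because the bracketed part is only looked at, never consumed, the match itself ends before the whitespace that precedes it.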

How to parse parameters from text?

I have a text that looks like:
ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c,
,second d), third, fourth)
The engine can differ (instead of CollapsingMergeTree there can be another word: ReplacingMergeTree, SummingMergeTree...), but the text is always in the format ENGINE = word (). There may be spaces around the "=" sign, but they are not mandatory.
Inside the parentheses are several parameters, usually a single word followed by a comma, but some parameters are themselves in parentheses, like the second one in the example above.
Line breaks can be anywhere. A line can end with a comma, a parenthesis, or anything else.
I need to extract n parameters (I don't know how many in advance). In the example above, there are 4 parameters:
first = first_param
second = (second_a, second_b, second_c, second_d) [extract with parenthesis]
third = third
fourth = fourth
How to do that with python (regex or anything else)?
You'd probably want to use a proper parser (and so look up how to hand-roll a parser for a simple language) for whatever language that is, but since what little you show here looks Python-compatible you could just parse it as if it were Python using the ast module (from the standard library) and then manipulate the result.
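A rough sketch of the ast idea, assuming the typos in the sample are corrected (a single comma after second_c, and second_d with an underscore) so that the text is valid Python syntax; the render helper is a hypothetical name and only handles plain names and parenthesized groups:

```python
import ast

# Corrected sample text (assumption: the double comma and "second d" were typos).
text = """ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)"""

def render(node):
    """Turn a parsed parameter back into text (names and tuples only)."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Tuple):
        return "(" + ", ".join(render(element) for element in node.elts) + ")"
    raise ValueError("unsupported parameter: " + type(node).__name__)

tree = ast.parse(text)
call = tree.body[0].value      # the CollapsingMergeTree(...) call on the right of "="
engine = call.func.id          # the engine name
params = [render(arg) for arg in call.args]

print(engine)
print(params)
```

This works for any engine name and any number of parameters, but only while the text happens to be valid Python; anything else raises SyntaxError in ast.parse.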
I came up with a regex solution for your problem. I tried to keep the regex pattern as 'generic' as I could, because I don't know if there will always be newlines and whitespace in your text, which means the pattern selects a lot of whitespace, which is then removed afterwards.
# Import the module for regular expressions
import re
# Text to search. I corrected it a bit: your example said "second d", and second_c was followed by two commas. I am assuming those were typos.
text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''
# Regex search pattern, written as a raw string so the backslashes reach the regex engine unchanged.
# re.S makes . match ANY character, including \n (newlines).
pattern = re.compile(r'ENGINE = CollapsingMergeTree \((.*?),\((.*?)\),(.*?), (.*?)\)', re.S)
# Apply the pattern to the text and save the match in 'result'. result[0] would return the whole matched text.
# The items you want are the sub-expressions enclosed in parentheses (), accessed as result[1] and above.
result = pattern.match(text)
# result[1] is everything after the opening parenthesis that follows CollapsingMergeTree, up to the first comma, including whitespace and newlines. re.sub replaces all whitespace, including newlines, with nothing.
first = re.sub(r'\s', '', result[1])
# result[2] is second_a to second_d, again with whitespace and newlines, which re.sub removes.
second = re.sub(r'\s', '', result[2])
third = re.sub(r'\s', '', result[3])
fourth = re.sub(r'\s', '', result[4])
print(first)
print(second)
print(third)
print(fourth)
OUTPUT:
first_param
second_a,second_b,second_c,second_d
third
fourth
Regex explanation:
\ = Escapes a metacharacter, which is a character regex would otherwise interpret to mean something special.
\( = A literal opening parenthesis
() = Mark the expression in the parentheses as a sub-group. See result[1] and so on.
. = Matches any character (including newline, because of re.S)
* = Matches 0 or more occurrences of the preceding expression.
? = Matches 0 or 1 occurrence of the preceding expression.
NOTE: *? combined is a non-greedy repetition, meaning the preceding expression is repeated as few times as possible while still letting the rest of the pattern match, instead of as many times as possible.
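To see the difference between greedy and non-greedy repetition, here is a tiny sketch on a made-up string (the sample text is only an illustration):

```python
import re

sample = "f(a), g(b)"

# Greedy: .* grabs as much as possible, so the match runs to the LAST ")".
greedy = re.search(r"\((.*)\)", sample).group(1)

# Non-greedy: .*? grabs as little as possible, so the match stops at the FIRST ")".
lazy = re.search(r"\((.*?)\)", sample).group(1)

print(repr(greedy))
print(repr(lazy))
```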
I am no expert, but I hope I got the explanations right.
I hope this helps.
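If the engine name and the number of parameters are not known in advance, one possible alternative is to split on top-level commas while tracking parenthesis depth. This is only a sketch: split_top_level is a hypothetical helper, it assumes balanced parentheses, and it uses the corrected sample text from above.

```python
import re

def split_top_level(text):
    """Return the top-level parameters of 'ENGINE = word ( ... )'."""
    header = re.search(r"ENGINE\s*=\s*\w+\s*\(", text)
    if header is None:
        return []
    params, current, depth = [], [], 0
    for ch in text[header.end():]:
        if ch == ")" and depth == 0:
            break                      # closing parenthesis of the engine call
        if ch == "," and depth == 0:
            params.append("".join(current))   # a comma outside nested parentheses ends a parameter
            current = []
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current.append(ch)
    params.append("".join(current))
    # Remove all whitespace, including newlines, from each parameter.
    return [re.sub(r"\s+", "", p) for p in params]

text = """ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)"""

params = split_top_level(text)
print(params)
```

Nested parameters keep their parentheses, as the question asked, and the same function works for ReplacingMergeTree, SummingMergeTree and so on.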

Regex to check if it is exactly one single word

I am basically trying to match a string pattern (wildcard match).
Please look at this carefully:
*(star) means exactly one word.
This is not a regex pattern; it is a convention.
So, there are patterns like:
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' is followed by exactly one word.
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in the pattern, I have to replace it with a regex that means exactly one word (a-zA-Z0-9_). Can anyone please help me do this in Python?
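To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS. Note that sre_parse is an undocumented module and has been deprecated since Python 3.11; re.escape is the documented way to escape a literal string.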
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse
def convert_to_regexp(pattern):
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern ])
    return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend to look for the converted pattern with re.search or re.findall, you might want to wrap it in \b word boundaries.
The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth.
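The conversion rules above can be sketched with the documented re.escape function and re.fullmatch, which anchors at both ends without writing ^ and $. The helper names wildcard_to_regex and matches are hypothetical:

```python
import re

def wildcard_to_regex(pattern):
    # Escape every literal piece, then put \w+ wherever a '*' stood.
    return r"\w+".join(re.escape(piece) for piece in pattern.split("*"))

def matches(pattern, text):
    # fullmatch requires the whole string to match, like ^...$.
    return re.fullmatch(wildcard_to_regex(pattern), text) is not None

print(matches("*.key", "door.key"))
print(matches("*.key", "brown.door.key"))
print(matches("*.key.*", "brown.key.door"))
print(matches("*.key.*", "brown.iron.key.door"))
```

Splitting on '*' first means the wildcard itself never passes through re.escape, so no special case is needed for it.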
^ anchors the match at the start of the string.
$ anchors the match at the end of the string.
\s means a whitespace character.
\S means a non-whitespace character.
+ means one or more repetitions of the preceding element.
Now, you want to match just a single word, meaning a string that consists only of non-whitespace characters from start to end. So, the required regular expression is:
^\S+$
You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.

Using Python Regex to find a phrase between 2 tags

I've got a string that I want to use regex to find the characters encapsulated between two known patterns, "Cp_6%3A" then some characters then "&" and potentially more characters, or no & and just the end of string.
My code looks like this:
def extract_id_from_ref(ref):
    id = re.search("Cp\_6\%3A(.*?)(\& | $)", ref)
    print(id)
But this isn't producing anything. Any ideas?
Thanks in advance
Note that (\& | $) matches either the & char and a space after it, or a space and end of string (the spaces are meaningful here!).
Use a negated character class [^&]* (zero or more chars other than &) to simplify the regex (no need for an alternation group or lazy dot matching pattern) and then access .group(1):
import re
def extract_id_from_ref(ref):
    m = re.search(r"Cp_6%3A([^&]*)", ref)
    if m:
        print(m.group(1))
Note that neither _ nor % are special regex metacharacters, and do not have to be escaped.
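As a quick comparison on made-up inputs (the two ref strings below are hypothetical), this sketch shows the original pattern failing because of its literal spaces, while the negated character class and a space-free alternation both capture the id:

```python
import re

with_amp = "page?ref=Cp_6%3Aabc123&lang=en"   # hypothetical input with a trailing &
at_end = "page?ref=Cp_6%3Aabc123"             # hypothetical input ending after the id

# Original pattern: the group demands "& " or " " followed by the end of the string.
original = re.search(r"Cp\_6\%3A(.*?)(\& | $)", with_amp)

# Negated character class: take everything up to the next & (or the end).
negated = [re.search(r"Cp_6%3A([^&]*)", ref).group(1) for ref in (with_amp, at_end)]

# Lazy dot with an alternation that contains no spaces.
lazy = [re.search(r"Cp_6%3A(.*?)(?:&|$)", ref).group(1) for ref in (with_amp, at_end)]

print(original)
print(negated)
print(lazy)
```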
The problem is that spaces in a regex pattern are also taken into account. Furthermore, in order to put a backslash into the string, you either have to write \\ (two backslashes) or use a raw string:
So you should write:
r"Cp_6\%3A(.*?)(?:\&|$)"
If you then match with:
def extract_id_from_ref(ref):
    id = re.search(r"Cp_6\%3A(.*?)(?:\&|$)", ref)
    print(id)
It should work.

Regexp Word within a word with a fullstop

I'm having trouble matching a string with regexp (I'm not that experienced with regexp). I have a string which contains a forward slash after each word and a tag. An example:
led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION
In those strings, I am only interested in all strings that precede /PERSON. Here's the regexp pattern that I came up with:
(\w)*\/PERSON
And my code:
match = re.findall(r'(\w)*\/PERSON', string)
Basically, I am matching any word that comes before /PERSON. The output:
>>> reg
['Timothy', '', 'Geithner']
My problem is that the second match, matched to an empty string as for R./PERSON, the dot is not a word character. I changed my regexp to:
match = re.findall(r'(\w|.*?)\/PERSON', string)
But the match now is:
['led/O by/O Timothy', ' R.', ' Geithner']
It is taking everything prior to the first /PERSON which includes led/O by/O instead of just matching Timothy. Could someone please help me on how to do this matching, while including a full stop as an abbreviation? Or at least, not have an empty string match?
Thanks,
Match everything but a space character ([^ ]*). You also need the star (*) inside the capture:
match = re.findall(r'([^ ]*)\/PERSON', string)
Firstly, (\w|.) matches "a word character, or any character" (dot matches any character which is why you're getting those spaces).
Escaping this with a backslash will do the trick: (\w|\.)
Second, as @Ionut Hulub points out, you may want to use + instead of * to ensure you match something. Quantifiers are greedy by default, though, so the group will always try to match the longest part that it can before the slash.
If you want to match any non-whitespace character you can use \S instead of (\w|\.), which may actually be what you want.
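A short sketch on the sample string from the question, assuming the goal is the full token before each /PERSON. It also shows why the quantifier belongs inside the capture group: with (\w)* the group is overwritten on every repetition, so only its last character is returned.

```python
import re

tagged = ("led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O "
          "the/O president/O of/O the/O New/ORGANIZATION")

# Quantifier outside the group: each repetition overwrites the capture.
last_chars = re.findall(r"(\w)*/PERSON", tagged)

# Quantifier inside the group, using \S so that "R." is captured with its dot.
names = re.findall(r"(\S+)/PERSON", tagged)

print(last_chars)
print(names)
```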
