Verbatim-like context in a regular expression - python

Question:
Is there any way to tell a regular expression engine to treat a certain part of a regular expression as verbatim (i.e. look for that part exactly as it is, without the usual parsing) without manually escaping special characters?
Some context:
I'm trying to backreference a group on a given regular expression from another regular expression. For instance, suppose I want to match hello(.*?)olleh against text 1 and then look for bye$1eyb in text 2, where $1 will be replaced by whatever matched group 1 in text 1. Therefore, if text 1 happens to contain the string "helloFOOolleh", the program will look for "byeFOOeyb" in text 2.
The above works fine in most cases, but if text 1 were to contain something like "hello.olleh", the program would match not only "bye.eyb" but also "byeXeyb", "byeueyb", etc. in text 2, because it interprets . as a regex special character rather than a plain dot.
Additional comments:
I can't simply do a plain (non-regex) string search for whatever group 1 matched, because the pattern I search for in text 2 may itself contain other, unrelated regular expression parts.
I have been trying to avoid parsing the match returned from text 1 and escaping every special character by hand, but if anyone knows of a neat way to do that, it could also work.
I'm currently working on this in Python, but if it can be done easily with any other language/program I'm happy to give it a try.

You can use the re.escape function to escape the text you want to match literally. So after you extract your match text (e.g., "." in "hello.olleh"), apply re.escape to it before inserting it into your second regex.
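For instance, here is a quick illustration of what re.escape produces (the strings are just made-up examples):
import re
print(re.escape("hello.olleh"))   # hello\.olleh
print(re.escape("a(b)*c"))        # a\(b\)\*c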

To illustrate what BrenBarn wrote,
import re
text1 = "hello.olleh"
text2_match = "bye.eyb"
text2_nomatch = "byeXeyb"
found = re.fullmatch(r"hello(.*?)olleh", text1).group(1)
You can then make a new search with the re.escape:
new_search = "bye{}eyb".format(re.escape(found))
Tests:
re.search(new_search, text2_match)
#>>> <_sre.SRE_Match object; span=(0, 7), match='bye.eyb'>
re.search(new_search, text2_nomatch)
#>>> None

Related

Regex search fail when input has line breaks [duplicate]

This question already has an answer here:
Why is Python Regex Wildcard only matching newLine
The following regular expression is not returning any match:
import re
regex = '.*match.*fail.*'
pattern = re.compile(regex)
text = '\ntestmatch\ntestfail'
match = pattern.search(text)
I managed to solve the problem by changing text to repr(text) or setting text as a raw string with r'\ntestmatch\ntestfail', but I'm not sure if these are the best approaches. What is the best way to solve this problem?
Using repr or a raw string on the target string is a bad idea!
Doing that turns the newline characters into the literal two-character sequence '\n', which is likely to cause unexpected behaviour on other test cases.
The real problem is that . matches any character EXCEPT a newline.
If you want to match everything, replace . with [\s\S].
This means "whitespace or non-whitespace", i.e. "anything".
Other character classes such as [\w\W] also work, and this is more efficient than adding an explicit alternative just for the newline.
One more thing: it is good practice to use a raw string for the pattern string (not for the match target).
That eliminates the need to escape every character that has a special meaning in normal Python strings.
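To make that concrete, here is a small check against the question's input (the match repr shown is what Python 3.7+ prints):
import re
text = '\ntestmatch\ntestfail'
# '.' cannot cross the newlines, so the original pattern finds nothing.
print(re.search(r'.*match.*fail.*', text))
#>>> None
# '[\s\S]' matches any character, including '\n'.
print(re.search(r'[\s\S]*match[\s\S]*fail[\s\S]*', text))
#>>> <re.Match object; span=(0, 19), match='\ntestmatch\ntestfail'>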
You could add the newline as an alternative, but make sure you escape the backslash in the regex string so that the regex engine actually gets \n and not an actual newline.
Something like this:
regex = '.*match(.|\\n)*fail.*'
This would match anything from the last newline up to match, then any mix and number of characters and newlines until fail. You can change this however you want, but the idea is the same: put what you want into a group, then use | as an "or".
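A quick sanity check of that pattern against the question's input:
import re
text = '\ntestmatch\ntestfail'
regex = '.*match(.|\\n)*fail.*'
m = re.search(regex, text)
print(repr(m.group(0)))
#>>> 'testmatch\ntestfail'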

How to replace '..' and '?.' with single periods and question marks in pandas? df['column'].str.replace not working

This is a follow-up to this SO post, which gives a solution for replacing text in a string column:
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0
Add the regex=False parameter because, as you can see in the docs, regex is True by default:
regex : bool, default True
Determines if the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
And ? and . are special characters in regular expressions.
So one way to do it without regex is this double replace:
testDf['strings'].str.replace('..', '.',regex=False).str.replace('?.', '?',regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time
Replace using a regular expression. In this case, replace the special character '.' wherever it is immediately followed by whitespace. This is a bit convoluted; I advise you to go with @Mark Reed's answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time
str.replace() works with a regex by default, where . is a special character that means "any character". If you want a literal dot, you need to escape it: "\.". The same goes for other special regex characters like ?.
First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDf.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.
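As a quick end-to-end check on the question's data (a sketch of how the call above behaves; the expected output is shown in the comment):
import pandas as pd

testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])

# Keep the captured '.' or '?' (referenced as \1) and drop the '.' that follows it.
result = testDf.replace(regex=r'([.?])\.', value=r'\1')
print(result['strings'].tolist())
#>>> ['this is a. test stence', 'for which is ? was a time']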
To replace both the ? and . at the same time you can separate by | (the regex OR operator).
testDf['strings'].str.replace('\?.|\..', '.')
Prefix the . with a \, because . is a regex special character and needs to be escaped:
testDf['strings'].str.replace('\..', '.')
You can do the same with the ?, which is another regex special character.
testDf['strings'].str.replace('\?.', '?')

findall() behaviour (python 2.7)

Suppose I have the following string:
"<p>Hello</p>NOT<p>World</p>"
and I want to extract the words Hello and World.
I created the following script for the job
#!/usr/bin/env python
import re
string = "<p>Hello</p>NOT<p>World</p>"
match = re.findall(r"(<p>[\w\W]+</p>)", string)
print match
I'm not particularly interested in stripping <p> and </p> so I never bothered doing it within the script.
The interpreter prints
['<p>Hello</p>NOT<p>World</p>']
so it obviously sees the first <p> and the last </p> while disregarding the tags in between. Shouldn't findall() return all three matching strings, though (the whole string it prints, plus the two words)?
And if it shouldn't, how can I alter the code to do so?
PS: This is for a project and I found an alternative way to do what I needed, so this is for educational reasons, I guess.
The reason that you get the entire contents in a single match is because [\w\W]+ will match as many things as it can (including all of your <p> and </p> tags). To prevent this, you want to use the non-greedy version by appending a ?.
match = re.findall(r"(<p>[\w\W]+?</p>)", string)
# ['<p>Hello</p>', '<p>World</p>']
From the documentation:
*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.
If you don't want the <p> and </p> tags in the result, you will want to use look-ahead and look-behind assertions so that they are not included in the match.
match = re.findall(r"((?<=<p>)\w+?(?=</p>))", string)
# ['Hello', 'World']
As a side note though, if you are trying to parse HTML or XML with regular expressions, it is preferable to use a library such as BeautifulSoup which is intended for parsing HTML.
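For reference, a minimal sketch of the BeautifulSoup route (this assumes the beautifulsoup4 package is installed):
from bs4 import BeautifulSoup

string = "<p>Hello</p>NOT<p>World</p>"
soup = BeautifulSoup(string, "html.parser")
# Collect the text of every <p> element; anything outside the tags is ignored.
print([p.get_text() for p in soup.find_all("p")])
#>>> ['Hello', 'World']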

Python not recognizing regex

I'm using the solution obtained from this question Regular expression to match any character being repeated more than 10 times
The regex you need is /(.)\1{9,}/.
https://regex101.com/ is recognizing it, grep recognizes it, but python does not.
Ultimately I want to replace the match with a single space, for example:
>> text = 'this is text???????????????'
>> pattern = re.compile(r'/(.)\1{5,}/')
>> re.sub(pattern,'\s',text)
'this is text '
However, search, findall, and even match do not recognize the pattern. Any idea why?
re.sub(r'(.)\1{9,}', ' ',text)
The slashes are not part of the regex; they are a syntactic construct by which some languages form regex literals (and, in the case of PHP's preg module, an oddity).
With your regexp you would only have matched text like "this is text/?????????????/" (with literal slashes), and it would have been transformed into "this is text\s" (note that \s has no special meaning in the replacement string).
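Putting that together against the question's example (the replacement here is a literal space, since \s only has meaning inside a pattern, not in the replacement):
import re
text = 'this is text???????????????'
print(repr(re.sub(r'(.)\1{9,}', ' ', text)))
#>>> 'this is text '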

Regular expression to find syllables in Bengali word

Here is the code:
BanglaAlphabet = {
'Consonant' : '[\u0995-\u09B9\u09CE\u09DC-\u09DF]',
'IndependantVowel' : '[\u0985-\u0994]',
'DependantVowel' : '[\u09BE-\u09CC\u09D7]',
'Nukta' : '[\u09BC]'
}
BanglaWordPattern = ur"""(
({DependantVowel}{Nukta}{Consonant}) |
({DependantVowel}{Consonant}) |
{IndependantVowel} |
{Consonant} |
)+""".format(**BanglaAlphabet)
BanglaWordPattern = re.compile(BanglaWordPattern, re.VERBOSE)
The matching is done with:
re.match(BanglaWordPattern, w[::-1])
This is meant to match a valid Bengali word when matched from right to left.
However, it is matching invalid words, such as োগাড় and িদগ.
What could be the problem?
Edit
After numerous corrections as suggested by @GarethRees and @ChrisMorgan, I ended up with:
bangla_alphabet = dict(
consonant = u'[\u0995-\u09b9\u09ce\u09dc-\u09df]',
independent_vowel = u'[\u0985-\u0994]',
dependent_vowel = u'[\u09be-\u09cc\u09d7]',
dependent_sign = u'[\u0981-\u0983\u09cd]',
virama = u'[\u09cd]'
)
bangla_word_pattern = re.compile(ur'''(?:
{consonant}
({virama}{consonant})?
({virama}{consonant})?
{dependent_vowel}?
{dependent_sign}?
|
{independent_vowel}
{dependent_sign}?
)+'''.format(**bangla_alphabet), re.VERBOSE)
The matching is now:
bangla_word_pattern.match(w)
This code not only corrects errors, but accounts for more characters and valid constructs than before.
I am happy to report that it is working as expected. As such, this code now serves as a very basic regular expression for validating the syntax of Bengali words.
There are several special rules / exceptions not implemented. I will be looking into those and adding them to this basic structure incrementally.
Your string কয়া is made up of these characters:
>>> import unicodedata
>>> map(unicodedata.name, u'কয়া')
['BENGALI LETTER KA', 'BENGALI LETTER YA', 'BENGALI SIGN NUKTA', 'BENGALI VOWEL SIGN AA']
U+09BC BENGALI SIGN NUKTA is not matched by your regular expression.
Looking at the Bengali code chart it seems possible that you missed some other characters.
OK, to answer your updated question. You are making three mistakes:
Your strings in the BanglaAlphabet dictionary are lacking the u (Unicode) flag. This means that Unicode escape sequences like \u0995 are not being translated into Unicode characters. You just get backslashes, letters, and digits.
In the BanglaWordPattern regular expression, there is a vertical bar | near the end with nothing after it. That means the whole regular expression looks like (stuff1|stuff2|stuff3|stuff4|)+. So there are really five alternatives, the last of which is empty, and the empty alternative matches the empty string, so the pattern as a whole can succeed without consuming anything.
You are not actually looking at the result of your program to see what it actually matched. If you write m = re.match(BanglaWordPattern, w[::-1]); print m.group(0) you'll see that what actually matched was the empty string.
I think the following are also mistakes, but you haven't explained what you are trying to do, so I'm not so confident:
You are doing the match backwards, which is unnecessary. It would be simpler and easier to understand if you turned your patterns round and matched forwards.
You are using capturing parentheses in your regular expressions. If you don't need the results, use non-capturing parentheses (?:...) instead.
The inner sets of parentheses are unnecessary anyway.
You are not anchoring the end of your regular expression at a word boundary or the end of the string.
I would write something like this:
import re
bangla_categories = dict(
consonant = u'[\u0995-\u09B9\u09CE\u09DC-\u09DF]',
independent_vowel = u'[\u0985-\u0994]',
dependent_vowel = u'[\u09BE-\u09CC\u09D7]',
nukta = u'[\u09BC]',
)
bangla_word_re = re.compile(ur"""(?:
{consonant}{nukta}{dependent_vowel} |
{consonant}{dependent_vowel} |
{independent_vowel} |
{consonant}
)+(?:\b|$)""".format(**bangla_categories), re.VERBOSE)
But I would also look at the other Bangla signs in the code charts that you've omitted. What about U+0981 BENGALI SIGN CANDRABINDU and U+0982 BENGALI SIGN ANUSVARA (which nasalise vowels)? What about U+09CD BENGALI SIGN VIRAMA (which cancels a vowel)? And so on.
There are a few issues with what you've got:
Your regular expressions end up including the literal \u0995 etc. in them, because they're not Unicode strings; you need to include the actual Unicode character.
You want $ at the end of the regular expression so that it will only be matching the whole string.
You had an empty string in your group as valid (by ending the first group with a pipe, leaving an empty option). This, in combination with the lack of a $ symbol, meant that it wouldn't work.
It's not complete (as observed by Gareth).
Also be aware that you can also do bengali_word_pattern.match(s) instead of re.match(bengali_word_pattern, s) once you've got a compiled regular expression object.
bengali_alphabet = {
'consonant': u'[\u0995-\u09B9\u09CE\u09DC-\u09DF]',
'independent_vowel': u'[\u0985-\u0994]',
'dependent_vowel': u'[\u09BE-\u09CC\u09D7]',
'nukta': u'\u09BC'
}
bengali_word_pattern = ur'''^(?:
(?:{dependent_vowel}{nukta}{consonant}) |
(?:{dependent_vowel}{consonant}) |
{independent_vowel} |
{consonant}
)+$'''.format(**bengali_alphabet)
bengali_word_pattern = re.compile(bengali_word_pattern, re.VERBOSE)
Now,
>>> bengali_word_pattern.match(u'বাংলা'[::-1])
This one doesn't work because of the "ং" character, U+0982; it's not in any of your ranges. Not sure what category that bit falls into off-hand; if we just take out the offending character it works. (Google Translate tells me that the resulting word could be translated "bracelet"; I don't know, I'd need to ask my sister. Approximately all I can truthfully say is আমি বাংলা বলতে পারি না. Almost all I know is convenient everyday phrases like মুরগি চোর. And the first word of that contains a vowel missed thus far, too. Anyway, that's beside the point.)
>>> bengali_word_pattern.match(u'বালা'[::-1])
<_sre.SRE_Match object at 0x7f00f5bf9620>
It works on the "chicken thief" phrase, too.
>>> [bengali_word_pattern.match(w[::-1]) for w in u'মুরগি চোর'.split()]
[<_sre.SRE_Match object at 0x7f00f5bf9620>, <_sre.SRE_Match object at 0x7f00f5bf97e8>]
And it doesn't give a match for those two examples of incomplete words:
>>> bengali_word_pattern.match(u'োগাড়'[::-1])
>>> bengali_word_pattern.match(u'িদগ'[::-1])
I will also at this point admit myself puzzled as to why you are parsing the strings backwards; I would have thought it would make more sense to do it forwards (this regular expression works correctly that way, and then you don't need to use [::-1]):
^(?:
{consonant}
(?:
{nukta}?
{dependent_vowel}
)?
|
{independent_vowel}
)+$
At each conjunct/letter, get either an independent vowel or a consonant possibly followed by a dependent vowel, with a nukta possibly between them.
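A sketch of how that forward pattern could be wired up with the bengali_alphabet dictionary above (Python 2 syntax to match the rest of this answer; if the Bengali literals go in a source file, remember a coding declaration):
import re

forward_word_pattern = re.compile(ur'''^(?:
    {consonant}
    (?:
        {nukta}?
        {dependent_vowel}
    )?
    |
    {independent_vowel}
)+$'''.format(**bengali_alphabet), re.VERBOSE)

# No [::-1] needed: the string is consumed left to right.
print forward_word_pattern.match(u'বালা') is not None    # True
print forward_word_pattern.match(u'োগাড়') is not None    # False, starts with a dependent vowel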
Other alterations that I have made:
Variable/item naming, to fit in with standard Python coding style;
Replaced (...) (matching group) with (?:...) (non-matching group) for performance (see docs);
Corrected the spelling of "dependent";
Changed "bangla" to "bengali" as in English it is Bengali; I prefer when speaking English to use the standard English name for a language rather than the native language pronunciation, Anglicised if necessary—e.g. French rather than le français. On the other hand, I do realise that Bengali is regularly called Bangla by English speakers.
