Regular Expression fails if newline is included [duplicate]


This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 2 years ago.
I'm trying to extract a simple sentence from a string delimited by # characters.
str = "#text text text \n text#"
with this pattern:
pattern = '#(.+)#'
The funny thing is that the regular expression doesn't match when the string contains a newline character:
out = re.findall(pattern, str)  # out is an empty list: []
But if I remove \n from the string, it works fine. Any idea how to fix this?

Also pass the re.DOTALL flag, which makes the . match truly everything.
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

Use re.DOTALL if you want . to match newlines as well:
>>> import re
>>> my_str = "#text text text \n text#"
>>> out = re.findall('#(.+)#', my_str, re.DOTALL)
>>> out
['text text text \n text']
Also, it's not a good idea to use built-in names as variable names. Use my_str instead of str.

Try this regex: "#([^#]+)#"
It will match everything between the delimiters, newlines included, because the negated character class [^#] matches any character except #.
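A small sketch of how the negated class differs from a greedy .+ when the string holds more than one delimited segment. The sample text is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample with two '#'-delimited segments, one spanning a newline.
text = "#first\nsentence# filler #second#"

# Greedy .+ with DOTALL runs from the first '#' to the last one.
greedy = re.findall(r"#(.+)#", text, re.DOTALL)

# [^#]+ cannot cross a '#', so each segment is captured on its own,
# and no flag is needed because [^#] already matches newlines.
negated = re.findall(r"#([^#]+)#", text)

print(greedy)
print(negated)
```

Which of the two is preferable depends on whether a # may legitimately appear inside the delimited text.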

Add the DOTALL flag to your compile or match.
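For completeness, here is a sketch of the three usual places the flag can go, using the string from the question (renamed to my_str). All three should give the same result.

```python
import re

my_str = "#text text text \n text#"

# Option 1: pass the flag when compiling the pattern.
compiled = re.compile(r"#(.+)#", re.DOTALL)
from_compile = compiled.findall(my_str)

# Option 2: pass the flag to the module-level function.
from_function = re.findall(r"#(.+)#", my_str, flags=re.DOTALL)

# Option 3: switch the flag on inside the pattern with the inline (?s) modifier.
from_inline = re.findall(r"(?s)#(.+)#", my_str)

print(from_compile, from_function, from_inline)
```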

Related

Regex search fail when input has line breaks [duplicate]

This question already has an answer here:
Why is Python Regex Wildcard only matching newLine
(1 answer)
Closed 1 year ago.
The following regular expression is not returning any match:
import re
regex = '.*match.*fail.*'
pattern = re.compile(regex)
text = '\ntestmatch\ntestfail'
match = pattern.search(text)
I managed to solve the problem by changing text to repr(text) or setting text as a raw string with r'\ntestmatch\ntestfail', but I'm not sure if these are the best approaches. What is the best way to solve this problem?

Using repr or a raw string on the target string is a bad idea!
Either way, the target no longer contains newline characters, only the two literal characters \ and n.
This is likely to cause unexpected behavior in other test cases.
The real problem is that . matches any character EXCEPT a newline.
If you want to match everything, replace . with [\s\S].
This means "whitespace or not whitespace" = "anything".
Other complementary pairs such as [\w\W] also work,
and a character class is more efficient than adding an alternative just for the newline.
One more thing: it is good practice to use a raw string for the pattern (not for the target string).
This removes the need to double every backslash that would otherwise be interpreted as a Python string escape.
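To make the points above concrete, here is a minimal sketch using the text from the question. It only illustrates the behavior described and is not the only way to write the pattern.

```python
import re

text = "\ntestmatch\ntestfail"

# '.' stops at newlines, so the original pattern cannot span two lines.
dot_match = re.search(r".*match.*fail.*", text)

# [\s\S] matches any character, newlines included.
class_match = re.search(r"[\s\S]*match[\s\S]*fail[\s\S]*", text)

# A raw string literal as the *target* changes the data itself:
# it holds backslash + 'n' pairs and no newline at all.
raw_text = r"\ntestmatch\ntestfail"
raw_match = re.search(r".*match.*fail.*", raw_text)

print(dot_match)
print(repr(class_match.group()))
print("\n" in raw_text, len(raw_text) - len(text))
```

The raw-string version "works" only because the newlines are gone, which is why it is not a real fix.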

You could add the newline as an alternative, but make sure you escape the backslash in the regex string, so that the regex engine receives \n rather than an actual newline.
Something like this:
regex = '.*match(.|\\n)*fail.*'
This matches everything from the last \n up to match, then any mix of characters and \n up to fail, as in testfail. You can change this however you want, but the idea is the same: put what you want into a group, and use | as an or.
On the left is what this regex pattern matched from your example.
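The match can also be checked directly in the interpreter. A quick sketch using the text from the question:

```python
import re

text = "\ntestmatch\ntestfail"

# '\\n' in a normal string literal reaches the regex engine as the
# two-character escape \n, which the engine treats as "match a newline".
regex = ".*match(.|\\n)*fail.*"
m = re.search(regex, text)

# The leading newline is not part of the match, because the first '.*'
# cannot cross it; the match starts on the second line.
print(m.start(), repr(m.group()))
```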

Python regex doesnt match when string contains the special character '+' [duplicate]

This question already has answers here:
Escape special characters in a Python string
(7 answers)
Escaping regex string
(4 answers)
Closed 2 years ago.
import re
response = 'string contains+ as special character'
match = re.match(response, response)
print(match)
The match is not successful because the string contains the special character '+'. With any other special character, the match is successful.
Even with a backslash in front of the special character, it doesn't match.
Neither of these matches:
response = r'string contains\+ as special character'
response = 'string contains\\+ as special character'
How can I match it when the string is a variable and contains this special character?

If you want to use an arbitrary string in a regex but treat it as plain text (so the special regex characters don't take effect), you can escape the whole string with re.escape.
>>> import re
>>> response = 'string contains+ as special character'
>>> re.match(re.escape(response), response)
<re.Match object; span=(0, 37), match='string contains+ as special character'>

In the general case, an arbitrary string does not match itself, though of course any string which doesn't contain regex metacharacters does.
There are several characters which are regex metacharacters and which do not match themselves. A corner case is . which matches any character (except newline, by default), and so of course it also matches a literal ., but not exclusively. The quantifiers *, +, and ? as well as the generalized repetition operator {m,n} modify the preceding regular expression, round parentheses are reserved for grouping, | for alternation, square brackets define character classes, and finally of course the backslash \ is used to escape any of the preceding metacharacters, or itself.
Depending on what you want to accomplish, you can convert a string to a regex which matches exactly that literal string with re.escape(); but perhaps you simply need to have an incorrect assumption corrected.
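A short sketch of the difference. The last part rests on an assumption about the question: that the hand-escaped string was used as both the pattern and the target, which would explain why adding the backslash did not help.

```python
import re

response = "string contains+ as special character"

# As a pattern, "s+" means "one or more 's'", so nothing in the pattern
# accounts for the literal '+' in the text and the match fails.
unescaped = re.match(response, response)

# re.escape backslash-escapes the metacharacters so the pattern is literal.
escaped_pattern = re.escape(response)
escaped = re.fullmatch(escaped_pattern, response)

# Assumption: the hand-escaped string served as both pattern and target.
# The pattern then expects '+', but the target has a backslash there.
hand_escaped = r"string contains\+ as special character"
both_escaped = re.match(hand_escaped, hand_escaped)
pattern_only = re.match(hand_escaped, response)

print(unescaped, both_escaped)
print(escaped is not None, pattern_only is not None)
```

In other words, the escaping belongs on the pattern only; the text being searched should stay as it is.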

str.split(r'\n') doesn't split a string on a newline character in a raw string literal as expected [duplicate]

This question already has an answer here:
Different way to specify matching new line in Python regex
(1 answer)
Closed 4 years ago.
Suppose I have a string s = 'hi\nhello\nwhatsup' and I want to split it.
If I use s.split('\n'), I get the expected output:
['hi', 'hello', 'whatsup']
However, if I use re.split(r'\n', s), with a raw string literal as the pattern, I also get the same output:
['hi', 'hello', 'whatsup']
Why does splitting on a raw string literal with re.split() work?
What is this black magic?

\n is both the Python string escape for a newline and the regex escape meaning "match a newline". So with a raw string, re.split receives the two characters \ and n and interprets them as the regex escape; with a non-raw string, it receives the newline character itself and looks for that. Either way it finds the newline to split on.
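A sketch that makes the two layers of escaping visible, using the string from the question. The last line shows the contrast with str.split, which does no regex translation.

```python
import re

s = "hi\nhello\nwhatsup"

# "\n" is one character (a newline); r"\n" is two (backslash and 'n').
one_char, two_chars = len("\n"), len(r"\n")

# The regex engine translates the two-character escape into a newline,
# so both patterns split in the same places.
plain = re.split("\n", s)
raw = re.split(r"\n", s)

# str.split looks for a literal backslash followed by 'n', finds none,
# and returns the string unsplit.
literal = s.split(r"\n")

print(one_char, two_chars)
print(plain, raw, literal)
```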

How to remove ALL kind of linebreaks or formattings from strings in python [duplicate]

This question already has answers here:
How to delete a character from a string using Python
(17 answers)
Regular expression to remove line breaks
(3 answers)
Closed 6 years ago.
I know the classic way of dealing with line breaks, tabs, etc. is to use .strip() or .replace('\n', '').
But sometimes there are special cases in which these methods fail, e.g.
'H\xf6cke\n\n:\n\nDie'.strip()
gives: 'H\xf6cke\n\n:\n\nDie'
How can I catch these rare cases without having to cover them one by one (e.g. with .replace('*', ''))? The above is just one example I came across.

In [1]: import re
In [2]: text = 'H\xf6cke\n\n:\n\nDie'
In [3]: re.sub(r'\s+', '', text)
Out[3]: 'Höcke:Die'
\s:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v],
and also many other characters, for example the non-breaking spaces
mandated by typography rules in many languages). If the ASCII flag is
used, only [ \t\n\r\f\v] is matched (but the flag affects the entire
regular expression, so in such cases using an explicit [ \t\n\r\f\v]
may be a better choice).
'+'
Causes the resulting RE to match 1 or more repetitions of the
preceding RE.
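A sketch of what the quoted documentation means in practice. The sample string extends the one from the question with a non-breaking space and a tab, purely for illustration.

```python
import re

# Hypothetical sample: the question's text plus a non-breaking space (\u00a0) and a tab.
text = "H\xf6cke\n\n:\n\nDie\u00a0Welt\t!"

# Default (Unicode) matching: \s also covers the non-breaking space.
removed = re.sub(r"\s+", "", text)

# With re.ASCII, \s is limited to [ \t\n\r\f\v], so \u00a0 survives.
ascii_only = re.sub(r"\s+", "", text, flags=re.ASCII)

# Collapsing each whitespace run to one space keeps the words apart.
collapsed = re.sub(r"\s+", " ", text).strip()

print(repr(removed))
print(repr(ascii_only))
print(repr(collapsed))
```

Whether to delete the whitespace or collapse it to a single space depends on whether word boundaries matter for the text at hand.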

Use replace if you don't want to import anything:
a = "H\xf6cke\n\n:\n\nDie"
print(a.replace("\n",""))
# Höcke:Die

Strip's documentation:
Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
That's why it didn't remove the '\n' within the text.
If you want to remove the '\n' occurrences you can use
'H\xf6cke\n\n:\n\nDie'.replace('\n','')
Output: Höcke:Die
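A sketch contrasting strip with two import-free ways of removing interior whitespace. The sample string adds leading and trailing whitespace to the one from the question so that strip has something to do.

```python
# Hypothetical sample: the question's text with extra whitespace at both ends.
text = "\n\tH\xf6cke\n\n:\n\nDie \n"

# strip only touches the two ends; interior newlines stay.
stripped = text.strip()

# With a chars argument, strip removes those characters instead of whitespace.
custom = "xxhixx".strip("x")

# split() with no argument splits on any run of whitespace,
# so joining the pieces removes all of it, not just '\n'.
joined = "".join(text.split())

print(repr(stripped))
print(repr(custom))
print(repr(joined))
```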

using Python to search for keywords in pdf [duplicate]

This question already has answers here:
Searching text in a PDF using Python? [duplicate]
(11 answers)
Closed 8 years ago.
I'm searching for keywords in a PDF file, so I'm trying to search for /AA or /Acroform like this:
import re
l = "/Acroform "
s = "/Acroform is what I'm looking for"
if re.search(r"\b" + l.rstrip() + r"\b", s):
    print("yes")
Why don't I get "yes"? I want the "/" to be part of the keyword I'm looking for, if it exists.
Can anyone help me with this?

\b only matches in between a \w (word) and a \W (non-word) character, or vice versa, or when a \w character is at the edge of a string (start or end).
Your string starts with a forward slash /, which is a non-word character (\W), so \b will never match between the start of the string and the /. Don't use \b here; use an explicit negative look-behind for a word character instead:
re.search(r'(?<!\w){}\b'.format(re.escape(l)), s)
The (?<!...) syntax defines a negative look-behind; like \b it matches a position in the string. Here it'll only match if the preceding character (if there is any) is not a word character.
I used string formatting instead of concatenation here, and used re.escape() to make sure that any regular expression meta characters in the string you are searching for are properly escaped.
Demo:
>>> import re
>>> l = "/Acroform "
>>> s = "/Acroform is what I'm looking for"
>>> if re.search(r'(?<!\w){}\b'.format(re.escape(l)), s):
...     print('Found')
...
Found
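To see where the two approaches differ, here is a small sketch. The keyword is written without the trailing space, and the second target string ("x/Acroform") is invented for the comparison.

```python
import re

keyword = "/Acroform"
s = "/Acroform is what I'm looking for"

boundary = r"\b" + re.escape(keyword) + r"\b"
lookbehind = r"(?<!\w)" + re.escape(keyword) + r"\b"

# \b needs a word character on one side; at the start of the string,
# before '/', there is none, so the pattern cannot match.
boundary_hit = re.search(boundary, s)

# The look-behind only requires that no word character precedes '/'.
lookbehind_hit = re.search(lookbehind, s)

# Hypothetical counter-example: a word character right before the slash.
glued = "x/Acroform"
boundary_glued = re.search(boundary, glued)
lookbehind_glued = re.search(lookbehind, glued)

print(boundary_hit, lookbehind_hit)
print(boundary_glued, lookbehind_glued)
```

So the leading \b version accepts the keyword only when it is glued to a preceding word character, which is the opposite of what the question wants.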
