Odd behavior on negative look behind in python

I am trying to do a re.split using a regex that is utilizing look-behinds. I want to split on newlines that aren't preceded by a \r. To complicate things, I also do NOT want to split on a \n if it's preceded by a certain substring: XYZ.
I can solve my problem by installing the regex module which lets me do variable width groups in my look behind. I'm trying to avoid installing anything, however.
My working regex looks like:
regex.split("(?<!(?:\r|XYZ))\n", s)
And an example string:
s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
Which when split would look like:
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
My closest non-working expression without the regex module:
re.split("(?<!(?:..\r|XYZ))\n", s)
But this split results in:
['DATA1', 'DA\r\n \r', ' \r', 'TA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
And this I don't understand. From what I understand about look behinds, this last expression should work. Any idea how to accomplish this with the base re module?

You can use:
>>> re.split(r"(?<!\r)(?<!XYZ)\n", s)
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
Here we have broken your lookbehind assertions into two assertions:
(?<!\r) # previous char is not \r
(?<!XYZ) # previous text is not XYZ
Python's re module won't accept (?<!(?:\r|XYZ)) as a lookbehind because the two alternatives have different widths; compiling it raises this error:
error: look-behind requires fixed-width pattern
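As for why the padded attempt from the question misfires: `.` does not match `\n` by default, so `..\r` cannot match when one of the two padding characters is a newline, as in `\n \r`. The sketch below compares that attempt, the same pattern with `re.DOTALL` (an addition for illustration, not part of the answer above), and the chained lookbehinds.

```python
import re

s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"

# Alternatives of different widths inside one lookbehind are rejected by re.
try:
    re.compile(r"(?<!(?:\r|XYZ))\n")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# The question's padded attempt: '.' cannot match '\n', so '..\r' fails on "\n \r".
padded = re.split(r"(?<!(?:..\r|XYZ))\n", s)

# Illustration only: with re.DOTALL the padding dots may match '\n' as well.
padded_dotall = re.split(r"(?<!(?:..\r|XYZ))\n", s, flags=re.DOTALL)

# The answer's approach: two separate fixed-width lookbehinds.
chained = re.split(r"(?<!\r)(?<!XYZ)\n", s)

print(variable_width_ok)
print(padded)
print(padded_dotall == chained)
```

Even with `re.DOTALL`, the padded pattern needs two characters in front of every `\r`, so a `\r\n` within the first two characters of the string would still be split. The chained lookbehinds have no such edge case.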

You could use re.findall
>>> s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
>>> re.findall(r'(?:(?:XYZ|\r)\n|.)+', s)
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
Explanation:
(?:(?:XYZ|\r)\n|.)+ works as follows. At each position the engine first tries (?:XYZ|\r)\n, which consumes a newline only when it directly follows XYZ or \r. If that fails, control passes to the second alternative, ., which matches any single character except \n. The + after the non-capturing group repeats the whole group one or more times, so each match runs until it reaches a newline that is not preceded by XYZ or \r.
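One difference between the two approaches is worth checking against your own data: the findall pattern needs at least one character per match, so it never returns empty strings, whereas re.split yields one for every pair of adjacent separators. A small sketch, using a made-up input string:

```python
import re

sample = "a\n\nb\r\nc"  # hypothetical input containing an empty line

# split keeps the empty field between the two bare newlines
pieces_split = re.split(r"(?<!\r)(?<!XYZ)\n", sample)

# findall silently skips it, because '+' requires at least one character
pieces_findall = re.findall(r"(?:(?:XYZ|\r)\n|.)+", sample)

print(pieces_split)
print(pieces_findall)
```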

Related

re.sub match with first occurrence of bracketed characters

I'm trying to capture the first group of characters before one or more underscores or dashes in a string using re.sub in Python 3.7. My current function is:
re.sub(r'(\w+)[-_]?.*', r'\1', x).
Example strings:
x = 'CAM14_20190417121301_000'
x = 'CAM16-20190417121301_000'
Actual output:
CAM14_20190417121301_000
CAM16
Desired output:
CAM14
CAM16
Why is it working when there is a dash after the first group, but not an underscore? I also tried re.sub(r'(\w+)_?.*', r'\1', x) to try and force it to catch the underscore, but that returned the same result. I would like the code to be flexible enough to catch either.
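\w also matches underscores, so (\w+) greedily consumes the whole of CAM14_20190417121301_000 and leaves nothing for [-_]?.* to strip. With a dash, \w+ stops at the dash, which is why that case works. Consider using an explicit character class instead: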
re.sub(r'([a-zA-Z0-9]+)[-_]?.*', r'\1', x)
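A quick way to see the difference is to run both patterns over the two example strings. The `re.match` variant at the end is an alternative sketch, not part of the answer above.

```python
import re

names = ["CAM14_20190417121301_000", "CAM16-20190417121301_000"]

# \w covers letters, digits and '_', so \w+ swallows the underscores too.
original = [re.sub(r"(\w+)[-_]?.*", r"\1", x) for x in names]

# Leaving '_' out of the class makes the group stop at the first separator.
fixed = [re.sub(r"([a-zA-Z0-9]+)[-_]?.*", r"\1", x) for x in names]

# Alternative sketch: match the leading alphanumeric run directly.
prefixes = [re.match(r"[a-zA-Z0-9]+", x).group() for x in names]

print(original)
print(fixed)
print(prefixes)
```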

Python - How to remove spaces between Chinese characters while remaining the spaces in between a character and a number?

The real issue may be more complicated, but for now I'm trying to accomplish something a bit easier. I'm trying to remove the space between two Chinese/Japanese characters while keeping the space between a number and a character. An example:
text = "今天特别 热,但是我买了 3 个西瓜。"
The output I want to get is
text = "今天特别热,但是我买了 3 个西瓜。"
I tried to use Python script and regular expression:
import re
text = re.sub(r'\s(?=[^A-z0-9])', '', text)
However, the result is
text = '今天特别热,但是我买了 3个西瓜。'
So I'm struggling with how to keep the space between a character and a number at all times. And I don't want to resort to adding a space back between "3" and "个".
I'll continue to think about it, but let me know if you have ideas...Thank you so much in advance!
I understand the spaces you need to remove reside in between letters.
Use
re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)
Details:
(?<=[^\W\d_]) - a positive lookbehind requiring a Unicode letter immediately to the left of the current location
\s+ - 1+ whitespaces (remove + if only one is expected)
(?=[^\W\d_]) - a positive lookahead that requires a Unicode letter immediately to the right of the current location.
You do not need the re.U flag since Unicode matching is the default for str patterns in Python 3. You do need it in Python 2, though.
You may also use capturing groups:
re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)
where the non-consuming lookarounds are turned into consuming capturing groups ((...)). The \1 and \2 in the replacement pattern are backreferences to the capturing group values.
See a Python 3 online demo:
import re
text = "今天特别 热,但是我买了 3 个西瓜。"
print(re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text))
# => 今天特别热,但是我买了 3 个西瓜。
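The two variants are not fully interchangeable. Because the capturing version consumes the letter on each side, a letter that sits between two spaces can take part in only one match; the lookaround version has no such limit. A minimal sketch, where `sample` is a made-up string:

```python
import re

lookaround = r"(?<=[^\W\d_])\s+(?=[^\W\d_])"
capturing = r"([^\W\d_])\s+([^\W\d_])"

text = "今天特别 热,但是我买了 3 个西瓜。"
sample = "我 是 人"  # hypothetical: single letters separated by spaces

# On the question's text both variants give the same result.
same_on_text = (re.sub(lookaround, "", text)
                == re.sub(capturing, r"\1\2", text))

# On the sample, the capturing variant leaves the second space in place,
# because '是' was already consumed by the first match.
with_lookaround = re.sub(lookaround, "", sample)
with_capturing = re.sub(capturing, r"\1\2", sample)

print(same_on_text, with_lookaround, with_capturing)
```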

Python: Regular Expressions on getting repeating set of numbers

I'm working with a file, that is a Genbank entry (similar to this)
My goal is to extract the numbers in the CDS line, e.g.:
CDS join(1200..1401,3490..4302)
but my regex should also be able to extract the numbers from multiple lines, like this:
CDS join(1200..1401,1550..1613,1900..2010,2200..2250,
2300..2660,2800..2999,3100..3333)
I'm using this regular expression:
import re
match=re.compile('\w+\D+\W*(\d+)\D*')
result=match.findall(line)
print(result)
This gives me the correct numbers but also numbers from the rest of the file, like
gene complement(3300..4037)
so how can I change my regex to get the numbers?
I should only use regex on it..
I'm going to use the numbers to print the coding part of the base sequence.
You could use the heavily improved regex module by Matthew Barnett (which provides the \G functionality). With this, you could come up with the following code:
import regex as re
rx = re.compile(r"""
(?:
CDS\s+join\( # look for CDS, followed by whitespace and join(
| # OR
(?!\A)\G # make sure it's not the start of the string and \G
[.,\s]+ # followed by ., or whitespace
)
(\d+) # capture these digits
""", re.VERBOSE)
string = """
CDS join(1200..1401,1550..1613,1900..2010,2200..2250,
2300..2660,2800..2999,3100..3333)
"""
numbers = rx.findall(string)
print(numbers)
# ['1200', '1401', '1550', '1613', '1900', '2010', '2200', '2250', '2300', '2660', '2800', '2999', '3100', '3333']
\G makes sure the regex engine looks for the next match at the end of the last match.
See a demo on regex101.com (in PHP as the emulator does not provide the same functionality for Python [it uses the original re module]).
A far inferior solution (if you are only allowed to use the re module) would be to use lookarounds:
(?<=[(.,\s])(\d+)(?=[,.)])
(?<=) is a positive lookbehind, while (?=) is a positive lookahead, see a demo for this approach on regex101.com. Be aware though there might be a couple of false positives.
The following re pattern might work:
>>> match = re.compile(r'\s+CDS\s+\w+\([^\)]*\)')
But you'll need to call findall on the whole text body, not just a line at a time.
You can use parentheses just to grab out the numbers:
>>> match = re.compile(r'\s+CDS\s+\w+\(([^\)]*)\)')
>>> match.findall(stuff)
['1200..1401,3490..4302'] # Numbers only
Let me know if that achieves what you want!
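If only the standard re module is available, a two-step approach avoids the false positives: first isolate the CDS join(...) block, then pull the numbers out of it. This is a sketch combining the ideas above, not code from any of the answers; the sample record and the variable names are invented for illustration.

```python
import re

# Hypothetical excerpt of a GenBank feature table.
record = """\
     gene            complement(3300..4037)
     CDS             join(1200..1401,1550..1613,1900..2010,2200..2250,
                     2300..2660,2800..2999,3100..3333)
"""

# Step 1: capture everything between "CDS join(" and the closing ")".
# [^)]* also matches newlines, so multi-line entries are covered.
blocks = re.findall(r"\bCDS\s+join\(([^)]*)\)", record)

# Step 2: extract (start, end) pairs from each captured block.
ranges = [(int(start), int(end))
          for block in blocks
          for start, end in re.findall(r"(\d+)\.\.(\d+)", block)]

print(ranges)
```

Keeping the pairs as integer tuples makes it straightforward to slice the base sequence afterwards; the gene complement(...) line is ignored because it never matches step 1.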

search a repeated structure with regex

I have a string of the structure:
A_1: text
a lot more text
A_2: some text
a lot more other text
Now I want to extract the descriptive title (A_1) and the following text. Something like
[("A_1", "text\na lot more text"),("A_2", "some text\na lot more other text")]
My expression I use is
(A_\d+):([.\s]+)
But I get only [('A_1', ' '), ('A_2', ' ')].
Has someone an idea for me?
Thanks in advance,
Martin
You can use a lookahead to limit the match so that it stops before the next occurrence of the start indicator.
(?s)A_\d+:.*?(?=\s*A_\d+:|$)
(?s) dotall flag to make dot also match newlines
A_\d+: your start indicator
.*? match as few as possible (lazy dot)
(?=\s*A_\d+:|$) until start pattern with optional spaces ahead or $ end
See demo at regex101.com (Python code generator)
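To get the (title, text) tuples the question asks for, the same lookahead idea can be combined with two capturing groups. A sketch in Python; the groups and the `\s*` after the colon are additions made here, not part of the pattern above.

```python
import re

test_str = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"

# re.DOTALL plays the role of the inline (?s) flag.
pattern = re.compile(r"(A_\d+):\s*(.*?)(?=\s*A_\d+:|$)", re.DOTALL)

pairs = pattern.findall(test_str)
print(pairs)
```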
Your [.\s]+ matches one or more literal dots (since . inside a character class loses its special meaning) and whitespaces. I think you meant to use . with a re.DOTALL flag. However, you can use something different, a tempered greedy token (there are other ways, too).
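The point about the character class can be checked directly. In this short sketch, `[.\s]+` only ever matches literal dots and whitespace, which is why the question's pattern captured a single space after each title:

```python
import re

test_str = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"

# Inside [...] the dot is a literal '.', not "any character".
dot_hits = re.findall(r"[.]", "a.b")

# The question's pattern: the second group stops at the first non-space, non-dot.
question_result = re.findall(r"(A_\d+):([.\s]+)", test_str)

print(dot_hits)
print(question_result)
```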
You can use
(?s)(A_\d+):\s*((?:(?!A_\d).)+)
See regex demo
IDEONE demo:
import re
p = re.compile(r'(A_\d+):\s*((?:(?!A_\d).)+)', re.DOTALL)
test_str = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"
print(p.findall(test_str))
The (?:(?!A_\d).)+ tempered greedy token will match any text up to the first A_+digit pattern.

Regex that considers custom escape characters in the string (not in the pattern)

I'm building a regex that must match a certain pattern that starts with a specific symbol, but at the same time it must not match a pattern that starts with two or more occurrences of that same specific symbol.
To elaborate better, this is my scenario. I have a string like this:
Hello %partials/footer/mail,
%no_slashes_here
%{using_braces}_here
%%should_not_be_matched
And I'm trying to match those substrings that start with exactly one % symbol (since in my case a double %% means "escaping" and should not be matched) and they could optionally be surrounded by curly braces. And at the end, I need to capture the matched substrings but without the % symbol.
So far my regular expression is:
%\{*([0-9a-zA-Z_/]+)\}*
And the captured matches result is:
partials/footer/mail
no_slashes_here
using_braces
should_not_be_matched
Which is very close to what I need, but I got stuck into the double %% escaping part. I don't know how to negate two or more % symbols at the beginning and at the same time allow exactly one occurrence at the beginning too.
EDIT:
Sorry that I missed that, I'm using python.
With negative lookbehind:
%(?<!%%)\{*([0-9a-zA-Z_\/]+)\}*
Regex 101
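Run against the sample from the question in Python, this pattern skips the escaped %% entry. The second pattern below is an alternative spelling that is not taken from the answer: a lookbehind plus a lookahead around a single %, which some readers may find easier to follow.

```python
import re

txt = """\
Hello %partials/footer/mail,
%no_slashes_here
%{using_braces}_here
%%should_not_be_matched"""

# The answer's pattern: after consuming '%', fail if the two characters
# just consumed/behind are '%%'.
answer = re.findall(r"%(?<!%%)\{*([0-9a-zA-Z_/]+)\}*", txt)

# Alternative (assumption, not from the answer): a '%' with no '%' on either side.
alternative = re.findall(r"(?<!%)%(?!%)\{?([0-9a-zA-Z_/]+)\}?", txt)

print(answer)
print(alternative)
```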
If this is line based -- you can do:
(?:^|[^%])%\{?([^%}]+)\}?
Demo
Python demo:
txt='''\
Hello %partials/footer/mail,
%no_slashes_here
%{using_braces}_here
%%should_not_be_matched
This %% niether'''
import re
for line in txt.splitlines():
    m = re.search(r'(?:^|[^%])%\{?([^%}]+)\}?', line)
    if m:
        print(m.group(1))
It is unclear from your question how % this % should be treated.
What about
(?<=%)([^%]+)
Regex101 demo
I've assumed PCRE, as you've not declared which flavour of Regex you're using.
