Python regex not behaving as I think it should

I am trying to sort through some files using a regular expression.
I have a file that contains the following two lines:
NET "MBC_ADR_I1<1>" LOC = "R2";
NET "GP_O<7>" LOC = "R20";
I am using the following expression to match only one of the lines:
f2MatchLoc = re.search('(LOC)[ ]+=[ ]+["]?({})'.format(f1LocValue), f2Line, re.IGNORECASE)
where f1LocValue = R2. However, I'm getting a match on both lines.
I've tried entering the same expression at regex101.com, which suggests that my pattern is correctly formatted.

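f2MatchLoc = re.search(r'(LOC)[ ]+=[ ]+["]?({}\b)'.format(f1LocValue), f2Line, re.IGNORECASE)
You need \b (a word boundary) after the inserted value so that R2 cannot match the first two characters of R20. See demo. Also use a raw string (the r prefix); otherwise Python reads \b as a backspace character before the regex engine ever sees it.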
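As a quick check of the word-boundary fix, the sketch below runs the corrected pattern against both lines from the question. The helper name matches_loc is made up for illustration, and the call to re.escape is an extra precaution that is not part of the original answer (it only matters if the value contains regex metacharacters).

```python
import re

lines = [
    'NET "MBC_ADR_I1<1>" LOC = "R2";',
    'NET "GP_O<7>" LOC = "R20";',
]


def matches_loc(line, loc_value):
    """Return True if the line assigns loc_value to LOC (hypothetical helper)."""
    # \b after the value stops R2 from matching the start of R20.
    pattern = r'(LOC)[ ]+=[ ]+["]?({}\b)'.format(re.escape(loc_value))
    return re.search(pattern, line, re.IGNORECASE) is not None


matched = [line for line in lines if matches_loc(line, "R2")]
print(matched)
```

With the boundary in place, only the first line should be reported for R2, while R20 still matches its own line.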

That happens because you have no condition on how the value should end:
'(LOC)[ ]+=[ ]+["]?({})'
So it matches anything that starts with LOC = "R2. All of the following are valid search results:
LOC = "R2
LOC = "R2asd
LOC = "R2121
LOC = "R2 "
You can simply use the closing double quote or the semicolon to mark the end of the value. You can also use \s to match whitespace, and you can drop the [] around single-character classes:
r'(LOC)\s+=\s+"?({})"?;'
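To see how the closing delimiter changes the result, here is a minimal sketch that tries this pattern on a few candidate strings. The candidate list and the variable names are assumptions chosen for illustration; re.escape is again an added precaution.

```python
import re

candidates = ['LOC = "R2";', 'LOC = "R20";', 'LOC = "R2asd";', 'LOC = R2;']
loc_value = "R2"  # assumed value, as in the question

# The optional closing quote plus the mandatory semicolon pin down where the value ends.
pattern = re.compile(
    r'(LOC)\s+=\s+"?({})"?;'.format(re.escape(loc_value)), re.IGNORECASE
)

hits = [c for c in candidates if pattern.search(c)]
print(hits)
```

Note that this variant requires the semicolon to follow the value directly, so it would miss a line with a space before the semicolon.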

Related

Regex to ignore Semicolon

I have one column in a dataframe with key value pairs I would like to extract.
'AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619'
I would like to parse key value pairs like so
('AF_ESP', '0.00546')
('AF_EXAC', '0.00165')
('AF_TGP', '0.00619')
Here is my regex.
([^=]+)=([^;]+)
This gets me most of the way there:
('AF_ESP', '0.00546')
(';AF_EXAC', '0.00165')
(';AF_TGP', '0.00619')
How can I adjust it so semicolons are not captured in the result?

You can consume the semi-colon or start of string in front:
(?:;|^)([^=]+)=([^;]+)
See the regex demo. Details:
(?:;|^) - a non-capturing group matching ; or start of string
([^=]+) - Group 1: one or more chars other than =
= - a = char
([^;]+) - Group 2: one or more chars other than ;.
See the Python demo:
import re
text = "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619"
print( re.findall(r'(?:;|^)([^=]+)=([^;]+)', text) )
# => [('AF_ESP', '0.00546'), ('AF_EXAC', '0.00165'), ('AF_TGP', '0.00619')]
A non-regex solution is also possible:
text = "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619"
print( [x.split('=') for x in text.split(';')] )
# => [['AF_ESP', '0.00546'], ['AF_EXAC', '0.00165'], ['AF_TGP', '0.00619']]
See this Python demo.

This can be also solved with a split method:
text = "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619"
print([tuple(i.split('=')) for i in text.split(';')])
output:
[('AF_ESP', '0.00546'), ('AF_EXAC', '0.00165'), ('AF_TGP', '0.00619')]
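If the pairs are going to be looked up by key afterwards, the same split can feed a dict. This is only a sketch: it assumes the keys are unique and that every non-empty item contains an = sign (an item without one would raise a ValueError).

```python
text = "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619"

# split("=", 1) splits only on the first "=", so a value containing "=" stays intact.
# The "if item" filter skips the empty string produced by a trailing ";".
pairs = dict(item.split("=", 1) for item in text.split(";") if item)
print(pairs["AF_EXAC"])
```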

An alternative and somewhat simpler approach to @Wiktor's solution is, in steps:
Capture everything until the =.
Get the = but don't capture that.
Get everything after the = up until an optional ; if that exists.
This would translate to the following regex:
([^=]+)=([^;]+);?
And in python:
>>> re.findall(r'([^=]+)=([^;]+);?', "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619")
[('AF_ESP', '0.00546'), ('AF_EXAC', '0.00165'), ('AF_TGP', '0.00619')]

How to extract multiple searches with regular expression in python?

I have a string from which I want to extract certain words and I have used Python's regular expression but unfortunately I am only getting one word from all the matches.
Here is my input string
go = "GO:0030054^cellular_component^cell junction`GO:0005813^cellular_component^centrosome`GO:0036064^cellular_component^ciliary basal body`GO:0005737^cellular_component^cytoplasm"
Here is my regular expression I tried
go_search = re.search(r'(GO:\d+)', go, re.MULTILINE|re.IGNORECASE)
But the only output I got is
go_search.group(1)
'GO:0030054'
How can I get all four GO IDs (including GO:0005813, GO:0036064, and GO:0005737) and not just the first one, which is all I am getting now?

You need re.findall instead of re.search:
re.findall(r'GO:\d+', go, re.MULTILINE|re.IGNORECASE)
# ['GO:0030054', 'GO:0005813', 'GO:0036064', 'GO:0005737']
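The difference between the two calls can be sketched on a shortened version of the input (the string is truncated here only to keep the example small). It also shows what findall returns once the pattern contains a single capturing group.

```python
import re

# Shortened sample of the input string from the question.
go = "GO:0030054^cellular_component^cell junction`GO:0005813^cellular_component^centrosome"

first = re.search(r'GO:\d+', go)            # one match object (or None): the first hit only
all_ids = re.findall(r'GO:\d+', go)         # list of every non-overlapping match
digits_only = re.findall(r'GO:(\d+)', go)   # with one group, findall returns the group text

print(first.group(0))
print(all_ids)
print(digits_only)
```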

Replace re.search() with re.finditer(). This gives you an iterator over the matches.
go = "GO:0030054^cellular_component^cell junction`GO:0005813^cellular_component^centrosome`GO:0036064^cellular_component^ciliary basal body`GO:0005737^cellular_component^cytoplasm"
go_search_iter = re.finditer(r'(GO:\d+)', go, re.MULTILINE|re.IGNORECASE)
for go_search in go_search_iter:
    print(go_search.group(1))
    # do stuff with go_search

How to return a list of strings that do not match a specific pattern?

I am trying to return all results that do not match a specific pattern from a text file, but I am having difficulty with the syntax.
pattern is [A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}
Tried the following with no success:
'^(?![A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}$).*$'
r'^(?!([A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}).)*$'
Below is the code to match the pattern, now I need to find all those entries that do not match.
pattern = r'[A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}'
regex1 = re.compile(pattern, flags = re.IGNORECASE)
regex1.findall(text1)
Sample of the data is as follows:
plos_annotate5_1375_1.txt plos_annotate5_1375_2.txt plos_anno%tate5_1375_3.txt plos_annotate6_1032_1.txt
The third string is what I would like to pull

Why do negation in the regexp if you can do it in Python?
strings_without_rx = [s for s in the_strings if not regex1.search(s)]
If you want to scan file lines, you don't even need to store them all, because an open file is an iterable of its lines:
with open("some.file") as source:
    lines_without_rx = [s for s in source if not regex1.search(s)]
# Here the file is auto-closed.
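Applied to the sample data from the question, the filtering idea might look like the sketch below. It assumes the entries are separated by whitespace, and it uses fullmatch instead of search so that an entry counts as matching only when the whole entry fits the pattern; the name non_matching is made up for the example.

```python
import re

text1 = (
    "plos_annotate5_1375_1.txt plos_annotate5_1375_2.txt "
    "plos_anno%tate5_1375_3.txt plos_annotate6_1032_1.txt"
)

regex1 = re.compile(r'[A-Z]+_[A-Z0-9]+_[0-9]+_[0-9]+\.[A-Z]{3}', flags=re.IGNORECASE)

# Keep only the whitespace-separated entries that the pattern does not fully match.
non_matching = [token for token in text1.split() if not regex1.fullmatch(token)]
print(non_matching)
```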

You can just check that your regex is not matching:
if regex1.match(text1) is None:
    # Do the magic you need
    pass

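I would suggest using a negative lookahead assertion with your pattern. A lookahead on its own consumes nothing, so it has to be anchored at the start of each whitespace-separated entry and followed by \S+ to consume the entry:
r'(?<!\S)(?![A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}(?!\S))\S+'
Without any loop, it will give you all entries that do not match if you use it with findall:
re.findall(r'(?<!\S)(?![A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}(?!\S))\S+', text1, flags=re.IGNORECASE)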

Python split before a certain character

I have following string:
BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6
I am trying to split it in a way I would get back the following dict / other data structure:
BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/
I can somehow split it if I only have one BUCKET, not multiple, like this:
res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect
Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/
Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)]
As per Wiktor Stribiżew's comments, the following regex does the job:
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

If you're experienced, I'd recommend learning regex, just as the others have suggested. However, if you're looking for an alternative, here's a way of doing it without regex. It also produces the output you're looking for.
string = input("Enter:")  # Put your own input here.
tempList = string.replace("BUCKET", ":").split(":")
outputList = []
for i in range(1, len(tempList) - 1, 2):
    someTuple = ("BUCKET" + tempList[i], tempList[i + 1])
    outputList.append(someTuple)
print(outputList)  # Put your own output here.
This will produce:
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
This code is hopefully easier to understand and manipulate if you're unfamiliar with Regex, although I'd still personally recommend Regex to solve this if you're familiar with how to use it.

Use the re.findall() function:
import re
s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
result = re.findall(r'(\w+):([^:]+\/)', s)
print(result)
The output:
[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]

Use regex instead?
import re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'
output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)
Which gives
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.
That means, the easiest way to match these key-value pairs is by matching one of the buckets, then a colon and then any chars not starting a sequence of chars equal to those bucket names.
You may use
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
Compile with re.S / re.DOTALL if your values span across multiple lines. See the regex demo.
Details:
(BUCKET1|BUCKET2) - capture group one that matches and stores in .group(1) any of the bucket names
: - a colon
(.*?) - any 0+ chars, as few as possible (as *? is a lazy quantifier), up to the first occurrence of (but not including)...
(?=(?:BUCKET1|BUCKET2)|$) - any of the bucket names or end of string.
Build it dynamically while escaping bucket names (just to play it safe in case those names contain * or + or other special chars):
import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
# => [('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
See the online Python demo.

Python regex to match multiple times

I'm trying to match a pattern against strings that could have multiple instances of the pattern. I need every instance separately. re.findall() should do it but I don't know what I'm doing wrong.
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
I need 'http://url.com/123', 'http://url.com/456' and the two numbers 123 & 456 to be different elements of the match list.
I have also tried '/review: ((http://url.com/(\d+)\s?)+)/' as the pattern, but no luck.

Use this. You need to place 'review' outside the capturing group to achieve the desired result.
pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)
This gives output
>>> match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
>>> match
[('http://url.com/123', '123'), ('http://url.com/456', '456')]

You've got extra /'s in the regex. In Python the pattern is just a string, with no surrounding delimiters. For example, instead of this:
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
It should be:
pattern = re.compile('review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
Also, in Python you'd typically use a "raw" string like this:
pattern = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
The r prefix keeps Python from interpreting backslashes in the string itself, which saves you from a lot of backslash escaping.
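One caveat worth checking: removing the slashes makes the pattern match, but a capturing group that is repeated with + keeps only its last repetition, so findall still reports a single URL. The sketch below compares that with a pattern that has no outer repetition. The variable names are made up, and the dot in url.com is escaped here as a small tightening that is not in the original.

```python
import re

msg = 'this is the message. review: http://url.com/123 http://url.com/456'

# A quantified group is overwritten on each repetition, so only the last URL survives.
repeated = re.findall(r'review: (http://url\.com/(\d+)\s?)+', msg, re.IGNORECASE)

# Without the outer repetition, findall reports every URL as its own match.
separate = re.findall(r'(http://url\.com/(\d+))', msg, re.IGNORECASE)

print(repeated)
print(separate)
```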

Use a two-step approach: first get everything from "review:" to the end of the line, then tokenize that.
import re
msg = 'this is the message. review: http://url.com/123 http://url.com/456'
review_pattern = re.compile(r'.*review: (.*)$')
urls = review_pattern.findall(msg)[0]
url_pattern = re.compile(r"(http://url.com/(\d+))")
url_pattern.findall(urls)
