Regular expression not capturing all the elements - python

I have created a regular expression to match a string that has a "K" preceded by 10 characters and followed by 10 characters.
Check Demo Here
However, it does not detect every place where a K exists. I would like to get a separate string for every K that is present.

You can use re.findall():
print(re.findall(r'([\w\n]{10}K[\w\n]{10})', s))
Result:
['GGKKKTKICDKVSHEEDRISQ', 'ISEILFHLSTKDSVRTSALST', 'FDSHRDSWIRKLRLDLGYHHD', 'HLDVHCFHDNKIPLSIYTCTT', 'PEFVSLP\nCLKIMHFENVSYP', 'ELILFSTMYPKGNVLQLRSDT', 'YAPLLQCLRAKMYSTK\nNFQI', 'DFVNTGGRYQKKKVIEDILID', 'RDLVISSNTWKEFFLYSKSRP', 'MLPTLLESCPKLESLILVMSS']
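Note that re.findall() returns non-overlapping matches, so a K that falls inside a window already consumed by an earlier match gets no window of its own. One possible way around this is to wrap the pattern in a lookahead, which consumes no characters. This is a minimal sketch, assuming the same 10-character context on each side; the helper name windows_around and the sample string are made up for illustration and are not from the question.

```python
import re

def windows_around(s, letter="K", width=10):
    """Return every `width` + letter + `width` window, overlapping ones included."""
    # The lookahead (?=...) consumes nothing, so the scan advances one
    # character at a time and overlapping windows are all reported.
    pattern = rf"(?=([\w\n]{{{width}}}{re.escape(letter)}[\w\n]{{{width}}}))"
    return re.findall(pattern, s)

# Hypothetical sample: two K's whose windows share the run of B's.
sample = "A" * 10 + "K" + "B" * 10 + "K" + "C" * 10

print(windows_around(sample))                        # overlapping windows
print(re.findall(r"[\w\n]{10}K[\w\n]{10}", sample))  # non-overlapping only
```

On this sample the plain pattern reports one window, while the lookahead version reports one per K.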

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to run a for loop that matches the strings with a specific number, which is the number between the second occurrence of "_" and ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matched string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use a regular expression. Based on this answer, I have tested this regex on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and it also doesn't make sure that it takes the exact number, so if the number is 1, it might also select items with the number 10.
My end goal is to match the items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regular expression. I am looking for help with that expression.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything because it requires an underscore \_ (note that you don't have to escape it) after a dot, and that does not occur in the example data.
It then expects a digit, but the digits in the data occur only before the dots.
In your question, the numbers that you are referring to come after the 3rd occurrence of _.
To get the path(s), start from the pattern below and replace \d+ with the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
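^ Start of string
\S*? Match optional non-whitespace characters, as few as possible
/ Match a literal /
(?:[^\s_/]+_){3} Match 3 repetitions of 1+ characters other than whitespace, _ or /, each followed by _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif, a word boundary, then optional characters other than whitespace or /
$ End of string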
See a regex demo and a Python demo.
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
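numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
Splitting once from the right on "_" gives, for the first path, ["imgs/foldeer/img_ABC_21389", "1.tif.tif"]. Splitting from the right matters because one path has an extra "_" in its directory name (imgs_foldeer).
Extract the last element, "1.tif.tif"
Split again, by the dot this time, and take the first element (that's your number as a string), then cast it to int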
Please keep in mind it's gonna work only for the format you provided, no flexibility at all in this solution (on the other hand regex doesn't give any neither)
Mostly without regex, if that is allowed:
import re
s= 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last =s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Gives:
10
[0-9]+(?=\.tif\.tif)
This regular expression uses a lookahead to match the last run of digits (what you're looking for), i.e. the one immediately followed by .tif.tif.
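Since the goal is the whole path rather than just the number, the lookahead can be used to pull out the trailing number and compare it as an integer, which avoids confusing 1 with 10. This is a rough sketch under the assumption that every path of interest ends in ".tif.tif"; the function name paths_with_number is hypothetical, and the pattern uses + rather than * so that an empty match is impossible.

```python
import re

# Digits that are immediately followed by ".tif.tif" (lookahead, not consumed).
TRAILING_NUMBER = re.compile(r"[0-9]+(?=\.tif\.tif)")

def paths_with_number(paths, number):
    """Return the paths whose trailing number equals `number` exactly."""
    result = []
    for path in paths:
        m = TRAILING_NUMBER.search(path)
        # Comparing as int means 1 does not match "10" or "21".
        if m and int(m.group()) == number:
            result.append(path)
    return result

list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]

print(paths_with_number(list_paths, 1))
```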
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

How to return regular expression match as one entire string?

I want to match phone numbers, and return the entire phone number but only the digits. Here's an example:
(555)-555-5555
555.555.5555
But I want to use regular expressions to return only:
5555555555
But, for some reason I can't get the digits to be returned:
import re
phone_number='(555)-555-5555'
regex = re.compile('[0-9]')
r = regex.search(phone_number)
regex.match(phone_number)
print(r.groups())
But for some reason it just prints an empty tuple? What is the obvious thing I am missing here? Thanks.
You're getting an empty result because your pattern has no capturing groups; refer to the documentation for details.
You should use group() instead, which gives you the first digit as the match. But this is not what you want, because [0-9] matches a single digit and search() returns only the first match.
You can simply remove all non-numeric characters:
re.sub('[^0-9]', '', '(555)-555-5555')
The range 0-9 is negated, so the regex matches anything that's not a digit, then it replaces it with the empty string.
You can do it without a regular expression, using str.join and str.isdigit:
s = "(555)-555-5555"
print("".join([ch for ch in s if ch.isdigit()]))
5555555555
If you printed r.group() you would get some output, but search is not the right way to find all the matches: it returns only the first match, and since you are looking for a single digit it would return 5. Even with '[0-9]+' to match one or more digits, you would still get only the first run of consecutive digits, i.e. 555 in the string above. Using "".join(regex.findall(s)) would get all the digits, but that can obviously be done with str.isdigit.
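The difference between search and findall described above can be seen side by side. This is a small illustrative sketch using the phone number from the question; the variable names are arbitrary.

```python
import re

phone_number = "(555)-555-5555"
digits = re.compile(r"[0-9]+")

# search() stops at the first match: only the first run of digits.
first_run = digits.search(phone_number).group()

# findall() returns every non-overlapping run of digits.
all_runs = digits.findall(phone_number)

# Joining the runs gives the digits-only phone number.
joined = "".join(all_runs)

print(first_run)  # 555
print(all_runs)   # ['555', '555', '5555']
print(joined)     # 5555555555
```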
If you knew the potential non-digit chars then str.translate would be the best approach:
s = "(555)-555-5555"
print(s.translate(str.maketrans("", "", "()-.")))
5555555555
The simplest way is here:
>>> import re
>>> s = "(555)-555-5555"
>>> x = re.sub(r"\D+", r"", s)
>>> x
'5555555555'

Python parentheses and returning only certain part of regex

I have a list of strings that I'm looping through. I have the following regular expression (item is the string I'm looping through at any given moment):
regularexpression = re.compile(r'set(\d+)e', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
What I want it to do is return numbers that have the word set before them and the letter e after them.
However, I also want it to return numbers that have set before them and x after them. If I use the following code:
regularexpression = re.compile(r'set(\d+)(e|x)', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
Instead of returning just the number, it also returns e or x. Is there a way to use parentheses to group my regular expression into bits without it returning everything in the parentheses?
Your example code seems fine already, but to answer your question, you can make a non-capturing group using the (?:) syntax, e.g.:
set(\d+)(?:e|x)
Additionally, in this specific example you can just use a character class:
set(\d+)[ex]
It appears you are looking at more than just .group(1); you have two capturing groups defined in your regular expression.
You can make the second group non-capturing by using (?:...) instead of (...):
regularexpression = re.compile(r'set(\d+)(?:e|x)', re.IGNORECASE)
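To see what each kind of group contributes, it may help to compare groups() for the two patterns. This is a short sketch; the input string "Set12x" is made up for illustration.

```python
import re

item = "Set12x"  # hypothetical input

capturing = re.search(r"set(\d+)(e|x)", item, re.IGNORECASE)
non_capturing = re.search(r"set(\d+)(?:e|x)", item, re.IGNORECASE)

# group(1) is the number in both cases.
print(capturing.group(1))      # 12
print(non_capturing.group(1))  # 12

# groups() shows the difference: (?:...) does not create a group.
print(capturing.groups())      # ('12', 'x')
print(non_capturing.groups())  # ('12',)
```

Either way, .group(1) alone returns only the number; the extra letter shows up only in .groups() or .group(2) when the second group is capturing.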

Python regexp: get all group's sequence

I have a regex like '^(a|ab|1|2)+$' and want to get the whole sequence of matched blocks.
For example, for re.search(reg, 'ab1') I want to get ('ab', '1').
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know in advance how many blocks will be matched by (pattern)+.
Is this possible, and if so, how?
Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in preference to just a.
The findall method applies your regular expression repeatedly and returns a list of matches. In this simple example you get back just a list of strings, one string per match. If you had more groups, you would get back a list of tuples, each containing one string per group.
This should work for your second example:
import re
pattern = '(7325189|7325|9087|087|18)'
s = '7325189087'
res = re.compile(pattern).findall(s)
print(pattern, s, res)
I removed the ^ and $ anchors from the pattern because findall has to find more than one substring, so it should search anywhere in the string. I also removed the + so that each match is a single occurrence of one of the options in the pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
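The overwriting behaviour can be checked directly. This is a small sketch comparing the repeated group with findall on the question's 'ab1' input; it also validates the whole string first, which is one possible way to keep the meaning of the original ^...$ anchors.

```python
import re

text = "ab1"

# With a repeated group, only the last iteration is kept in group(1).
m = re.search(r"^(a|ab|1|2)+$", text)
print(m.group(0))  # ab1  (the whole match)
print(m.group(1))  # 1    (last repetition only)

# Validate the full string, then list the individual blocks with findall.
# "ab" is listed before "a" so the longer option is tried first.
if re.fullmatch(r"(?:ab|a|1|2)+", text):
    blocks = re.findall(r"ab|a|1|2", text)
else:
    blocks = []
print(blocks)  # ['ab', '1']
```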
I think you don't need regexes for this problem;
you need a recursive graph search function.
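One way to read that suggestion is a recursive search over all ways of splitting the string into allowed tokens. The following is only a sketch of that idea, not the answerer's code: the function name segmentations is hypothetical, and it assumes every token is a non-empty string.

```python
def segmentations(text, tokens):
    """Return every way to split `text` into a sequence of `tokens`."""
    if not text:
        return [[]]  # one way to split the empty string: no tokens
    results = []
    for token in tokens:
        if text.startswith(token):
            # Recurse on the remainder and prepend the token we consumed.
            for rest in segmentations(text[len(token):], tokens):
                results.append([token] + rest)
    return results

print(segmentations("ab1", ["a", "ab", "1", "2"]))
# [['ab', '1']]
print(segmentations("7325189087", ["7325189", "7325", "9087", "087", "18"]))
# [['7325189', '087'], ['7325', '18', '9087']]
```

Unlike findall, this returns every valid split rather than the first greedy one, and the order of the tokens does not affect which splits are found.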

Isolate the first number after a letter with regular expressions

I am trying to parse a chemical formula that is given to me in Unicode in the format C7H19N3.
I wish to isolate the position of the first digit after each letter, i.e. 7 is at index 1 and 1 is at index 3. With this I want to insert "sub" in front of the digits.
My first couple of attempts had me looping through the string trying to isolate the positions of only the first digits, but to no avail.
I think that regular expressions can accomplish this, though I'm quite lost with them.
My end goal is to output the formula Csub7Hsub19Nsub3 so that my text editor can properly format it.
How about this?
>>> re.sub(r'(\d+)', r'sub\g<1>', "C7H19N3")
'Csub7Hsub19Nsub3'
(\d+) is a capturing group that matches 1 or more digits. \g<1> is a way of referring to the saved group in the substitute string.
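If the positions themselves are needed, as the question mentions, re.finditer exposes the start index of each run of digits. This is a brief sketch using the formula from the question; the variable names are arbitrary.

```python
import re

formula = "C7H19N3"

# start() of each match is the index of the first digit in that run.
positions = [m.start() for m in re.finditer(r"\d+", formula)]
print(positions)  # [1, 3, 6]

# The substitution itself does not need the indices.
print(re.sub(r"(\d+)", r"sub\g<1>", formula))  # Csub7Hsub19Nsub3
```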
Something like this with lookahead and lookbehind:
>>> strs = 'C7H19N3'
>>> re.sub(r'(?<!\d)(?=\d)','sub',strs)
'Csub7Hsub19Nsub3'
This matches the following positions in the string:
C^7H^19N^3 # ^ represents the positions matched by the regex.
Here is one which literally matches the first digit after a letter:
>>> re.sub(r'([A-Z])(\d)', r'\1sub\2', "C7H19N3")
'Csub7Hsub19Nsub3'
It's functionally equivalent but perhaps more expressive of the intent? \1 is a shorter version of \g<1>, and I also used raw string literals (r'\1sub\2' instead of '\1sub\2').
