How do I make a regex that only matches integers? - python

I have this polynomial in a string.
x^3+0.125x+2
I want to match here the 3 and the 2, but not the 0.125. Just the integers. Be best I came with so far is this, but this still matches the 25 in 0.125.
(?<!\.)\d+(?!\.)

You can try this:
>>> import re
>>> re.findall(r'(?<!\.)\b\d+\b(?!\.)', "x^3+0.125x+2")
['3', '2']
use \b\d+\b to make sure that matching entire number

An integer is a number that contains only digits, an optional e or E (only if followed by numbers) and optionally starts with a -. To the left there can only be a non-number and non-letter (since x2 would be considered a variable name) or nothing. To the right there can only be a non-number or nothing (2x on the right would be 2*x).
The following pattern should match all integers in a string according to the given specification:
r'(?:^|(?<=[^\d\w\.]))(?:(?:(?<![\d\w])|^)\-)?\d+(?:[eE]\d+)?(?!\.)(?=[^\d]|$)''

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop to match string with specific number,which is the number between the second occurance of "_" to the ".tif.tif", for example, when number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif" , for number 2, the match string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use regex expression. Based on this answer, I have tested this regex expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and also doesn't make sure that it will take the exact number, so if number is 1, it might also select items with number 10 .
My end goal is to be able to match items in the list that have the request number between the 2nd occurrence of "_" to the first occirance of ".tif" , using regex expression, looking for help with the regex expression.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _
What you could do to get the path(s) is to make the number a variable for the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 times (non consecutive) _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif followed by any char except /
$ End of string
See a regex demo and a Python demo.
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
numbers = [int(x.split("_",3)[-1].split(".")[0]) for x in data]
First split gives ".tif.tif"
extract the last element
split again by the dot this time, take the first element (thats your number as a string), cast it to int
Please keep in mind it's gonna work only for the format you provided, no flexibility at all in this solution (on the other hand regex doesn't give any neither)
without regex if allowed.
import re
s= 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last =s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Gives #
10
[0-9]*(?=\.tif\.tif)
This regex expression uses a lookahead to capture the last set of numbers (what you're looking for)
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(f".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(f".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I to want extract a string from a text that has the following characteristic:
It always begins with the letter C, in either lowercase or
uppercase, which is then followed by a number of hexadecimal
characters (meaning it can contain the letters A to F and numbers
from 1 to 9, with no zeros included).
After those hexadecimal
characters comes a letter P, also either in lowercase or uppercase
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produces the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters one or more times
(?=[Pp] - Positive Lookahead for case insensitive p or P
([1-9a-fA-F]+$) - Cature group for one or more hexadecimal characters following the pP
View Demo
Your main problem is you're using a look behind (?<=[pP]) for something ahead, which will never work: You need a look ahead (?=...).
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
See live demo.
Unless you need to use \A, prefer ^ (start of input) over \A (start of all input in multi line scenario) because ^ easier to read and \A won't match every line, which is what many situations and tools expect. I've used ^.

Python regular expression for substring

All I want is to grab the first 3 numeric characters of string:
st = '123_456'
import re
r = re.match('([0-9]{3})', st)
print r.groups()[0]
Am I doing the right thing for grabbing first 3 characters?
This returns 123 but what if I want to get the first 3 characters regardless of numbers and alphabets or special characters?
When given 12_345, I want to grab only 12_
Thanks,
If you always need first three characters in a string, then you can use the below:
first_3_charaters = st[:3]
There is no need of regular expression in your case.
You are really close, just drop the extra set of parenthesis and use the proper indexing of zero instead of one. Python indexing starts at zero. See below.
This works:
import re
mystring = '123_456'
check = re.search('^[0-9]{3}', mystring)
if check:
print check.group(0)
the ^ anchors to the beginning of the string which will ensure a match to the first three numeric digits only. If you do not use the carrot the regexp will match any three digits in a row in the string.
Some may suggest \d but this includes more than 0-9.
As others will surely point out a simple substring operation will do the trick if all the fields start with three numeric digits that you want to extract.
Good luck!
If all digits are separated by _, then you can simply use this regular expression which greedily matches all numeric characters before the first _ .
r = re.match('([0-9]*)_', st)
Actually, the _ in this RE is not necessary,so you can simplify it to (so that any separator is accepted ):
r = re.match('(\d*)', st)
But this solution will give you 1234 if st = '1234_56'. I'm not sure whether it is your intention.
So, if you want at most 3 numeric characters, you can just modify the regular expression to:
r = re.match('(\d{,3})', st)

regex conditional matching

I am trying to use re.findall to find this pattern:
01-234-5678
regex:
(\b\d{2}(?P<separator>[-:\s]?)\d{2}(?P=separator)\d{3}(?P=separator)\d{3}(?:(?P=separator)\d{4})?,?\.?\b)
however, some cases have shortened to 01-234-5 instead of 01-234-0005 when the last four digits are 3 zeros followed by a non-zero digit.
Since there does't seem to be any uniformity in formatting I had to account for a few different separator characters or possibly none at all. Luckily, I have only noticed this shortening when some separator has been used...
Is it possible to use a regex conditional to check if a separator does exist (not an empty string), then also check for the shortened variation?
So, something like if separator != '': re.findall(r'(\b\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)(\d{4}|\d{1})\.?\b)', text)
Or is my only option to include all the possibly incorrect 6 digit patterns then check for a separator with python?
If you want the last group of digits to be "either one or four digits", try:
>>> import re
>>> example = "This has one pattern that you're expecting, 01-234-5678, and another that maybe you aren't: 23:456:7"
>>> pattern = re.compile(r'\b(\d{2}(?P<sep>[-:\s]?)\d{3}(?P=sep)\d(?:\d{3})?)\b')
>>> pattern.findall(example)
[('01-234-5678', '-'), ('23:456:7', ':')]
The last part of the pattern, \d(?:\d{3})?), means one digit, optionally followed by three more (i.e. one or four). Note that you don't need to include the optional full stop or comma, they're already covered by \b.
Given that you don't want to capture the case where there is no separator and the last section is a single digit, you could deal with that case separately:
r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
# ^ exactly nine digits
# ^ or
# ^ sep not optional
See this demo.
It is not clear why you are using word boundaries, but I have not seen your data.
Otherwise you can shorten the entire this to this:
re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')
Note that \d{1,4} matched a string with 1, 2, 3 or 4 digits
If there is no separator, e.g. "012340008" will match the regex above as you are using [-:\s]? which matches 0 or 1 times.
HTH

Isolate the first number after a letter with regular expressions

I am trying to parse a chemical formula that is given to me in unicode in the format C7H19N3
I wish to isolate the position of the first number after the letter, I.e 7 is at index 1 and 1 is at index 3. With is this i want to insert "sub" infront of the digits
My first couple attempts had me looping though trying to isolate the position of only the first numbers but to no avail.
I think that Regular expressions can accomplish this, though im quite lost in it.
My end goal is to output the formula Csub7Hsub19Nsub3 so that my text editor can properly format it.
How about this?
>>> re.sub('(\d+)', 'sub\g<1>', "C7H19N3")
'Csub7Hsub19Nsub3'
(\d+) is a capturing group that matches 1 or more digits. \g<1> is a way of referring to the saved group in the substitute string.
Something like this with lookahead and lookbehind:
>>> strs = 'C7H19N3'
>>> re.sub(r'(?<!\d)(?=\d)','sub',strs)
'Csub7Hsub19Nsub3'
This matches the following positions in the string:
C^7H^19N^3 # ^ represents the positions matched by the regex.
Here is one which literally matches the first digit after a letter:
>>> re.sub(r'([A-Z])(\d)', r'\1sub\2', "C7H19N3")
'Csub7Hsub19Nsub3'
It's functionally equivalent but perhaps more expressive of the intent? \1 is a shorter version of \g<1>, and I also used raw string literals (r'\1sub\2' instead of '\1sub\2').

Categories