Extracting float or int number and substring from a string - python

I've just learned regex in python3 and was trying to solve a problem.
The problem is something like this:
You are given a string where the first part is a float or integer number and the next part is a substring. You must split the number and the substring and return them as a list. The substring will only contain letters from a-z and A-Z. The numbers can be negative.
For example:
Input: 2.5ax
Output:['2.5','ax']
Input: -5bcf
Output:['-5','bcf']
Input:-69.67Gh
Output:['-69.67','Gh']
and so on.
I made several attempts with regex to solve the problem.
1st attempt:
import re
i=input()
print(re.findall(r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$',i))
For the input -2.55xy, the expected output was ['-2.55','xy']
But the output came:
[('-2.55', '.55'), ('', '')]
2nd attempt:
My second attempt was similar to my first attempt just a little different:
import re
i=input()
print(re.findall(r'^(-?(\d+\.\d+)|\d+)|[a-zA-Z]+$',i))
For the same input -2.55xy, the output came as:
[('-2.55', '2.55'), ('', '')]
3rd attempt:
My next attempt was like that:
import re
i=input()
print(re.findall(r'^-?[1-9.]+|[a-z|A-Z]+$',i))
which matched the expected output for -2.55xy and also with the sample examples. But when the input is 2..5 or something like that, it considers that also as a float.
4th attempt:
import re
i=input()
value=re.findall(r"[a-zA-Z]+",i)
print([i.replace(value[0],""),value[0]])
which also matches the expected output but has the same problem as 3rd one that goes with it. Also, it doesn't look like an effective way to do it.
Conclusion:
So I don't know why my 1st and 2nd attempts aren't working. The output is a list of tuples, which is maybe because of the groups, but I don't know the exact reason or how to fix it. Maybe I didn't understand how the pattern works. Also, why didn't the substring show up in the output?
In the end, I want to know what's the mistake in my code and how can I write better and more efficient code to solve the problem. Thank you and sorry for my bad English.

The alternation | matches either the left part or the right part.
If the chars a-zA-Z are after the digit, you don't need the alternation | and you can use 2 capture groups to get the matches in that order.
Then using re.findall will return a list of tuples for the capture group values.
(-?\d+(?:\.\d+)?)([a-zA-Z]+)
Explanation
( Capture group 1
-?\d+ Match an optional - followed by 1+ digits
(?:\.\d+)? Optionally match . and 1+ digits using a non capture group (so it is not outputted separately by re.findall)
) Close group 1
( Capture group 2
[a-zA-Z]+ Match 1+ times a char a-z or A-Z
) Close group 2
regex demo
import re
strings = [
"2.5ax",
"-5bcf",
"-69.67Gh",
]
pattern = r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)"
for s in strings:
    print(re.findall(pattern, s))
Output
[('2.5', 'ax')]
[('-5', 'bcf')]
[('-69.67', 'Gh')]
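To get the flat list the question asks for, and to see why the first attempt printed tuples, here is a minimal sketch. The helper name split_number_and_letters is made up for illustration. It assumes the whole input must be one number followed by letters, so it uses re.fullmatch, which rejects inputs such as 2..5ax.

```python
import re

# Hypothetical helper (not part of the answer above): one number, then letters.
PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)")

def split_number_and_letters(text):
    """Return [number, letters], or None if the whole string does not fit."""
    m = PATTERN.fullmatch(text)
    return list(m.groups()) if m else None

for s in ["2.5ax", "-5bcf", "-69.67Gh", "2..5ax"]:
    print(s, "->", split_number_and_letters(s))

# Why attempt 1 printed tuples: when a pattern contains capture groups,
# re.findall returns the group values instead of the whole match. The letters
# are matched by the ungrouped alternative, so both groups are empty for them.
print(re.findall(r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$', "-2.55xy"))
```

Returning None for malformed input is a design choice of this sketch; raising ValueError would work just as well.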

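Lookahead and lookbehind in re.split sometimes simplify things.
(?<=\d) lookbehind: a digit comes right before this position
(?=[a-zA-Z]) lookahead: a letter comes right after this position
Together they split at the position between the digit and the letter.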
import re
strings = [
"2.5ax",
"-5bcf",
"-69.67Gh",
]
# Splitting on an empty (zero-width) match requires Python 3.7+
for s in strings:
    print(re.split(r'(?<=\d)(?=[a-zA-Z])', s))
['2.5', 'ax']
['-5', 'bcf']
['-69.67', 'Gh']

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop to match the string with a specific number, which is the number between the second occurrence of "_" and ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif", and for number 2, the matched string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use regex expression. Based on this answer, I have tested this regex expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and it also doesn't make sure that it takes the exact number, so if the number is 1, it might also select items with number 10.
My end goal is to match the items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regex. I'm looking for help with the regex expression.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _
What you could do to get the path(s) is to make the number a variable for the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Repeat 3 times: 1+ chars other than whitespace, _ or /, followed by _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif, a word boundary, then optional chars other than whitespace or /
$ End of string
See a regex demo and a Python demo.
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
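numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
Splitting once from the right on "_" leaves e.g. "1.tif.tif" as the last element
extract the last element
split again, by the dot this time, take the first element (that's your number as a string), and cast it to int
rsplit is used rather than split("_", 3) because the last path has an extra underscore in "imgs_foldeer"; counting from the left would yield "871125_21", which int() accepts as 87112521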
Please keep in mind it's gonna work only for the format you provided, no flexibility at all in this solution (on the other hand regex doesn't give any neither)
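Since the question asks for the whole path for a given number, the splitting idea can be extended to group paths by their trailing number. This is only a sketch: trailing_number and by_number are illustrative names, and it assumes every path ends in _<digits>.<extension>.

```python
from collections import defaultdict

def trailing_number(path):
    """Number between the last "_" and the first "." that follows it."""
    tail = path.rsplit("_", 1)[-1]       # e.g. "10.tif.tif"
    return int(tail.split(".", 1)[0])    # e.g. 10

list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]

# Map each number to the list of paths that carry it.
by_number = defaultdict(list)
for p in list_paths:
    by_number[trailing_number(p)].append(p)

print(by_number[1])
print(by_number[10])
```

Building the dictionary once avoids rescanning the list for every number in the loop.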
A mostly regex-free alternative: slice off everything after the last underscore, then take the first run of digits.
import re
s= 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last =s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Output
10
[0-9]+(?=\.tif\.tif)
This regex uses a lookahead to match the run of digits immediately before .tif.tif (what you're looking for) without consuming the extension.
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

ignore first occurance of letter regex

using the following strings
9989S90K72MF-1
9989S90S-1
9989S75K60MF-1
9989S75S-1
I would like to extract the following from those strings:
9989S90
9989S90
9989S75
9989S75
So far I have:
(^.*?(?=K|-))
Which gives me:
9989S90
9989S90S
9989S75
9989S75S
Here's a link https://regex101.com/r/d1nQj0/1
I've tried a few different regex but can't seem to nail it. Is there a way to ignore the first occurrence of a digit/letter? Which in my case would be S
The following regex matches from the beginning of a line through the first S, up to but not including the next occurrence of K or S:
^(.*?S.*?)(?=K|S)
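A quick way to check this pattern against the sample strings from the question is sketched below. The variable names are illustrative only.

```python
import re

lines = ["9989S90K72MF-1", "9989S90S-1", "9989S75K60MF-1", "9989S75S-1"]

# Lazy match through the first S, then lazily up to the next K or S.
pattern = re.compile(r"^(.*?S.*?)(?=K|S)")

results = []
for line in lines:
    m = pattern.match(line)
    if m:
        results.append(m.group(1))

print(results)  # ['9989S90', '9989S90', '9989S75', '9989S75']
```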
For the example data, you could also match 1+ digits, then S followed by 1+ digits.
^\d+S\d+
Regex demo
If there has to be an S, K or - on the right:
^\d+S\d+(?=[KS-])
Regex demo
Example
import re
regex = r"^\d+S\d+(?=[KS-])"
s = ("9989S90K72MF-1\n"
"9989S90S-1\n"
"9989S75K60MF-1\n"
"9989S75S-1")
print(re.findall(regex, s, re.MULTILINE))
Output
['9989S90', '9989S90', '9989S75', '9989S75']

Extract date from inside a string with Python

I have the following string, while the first letters can differ and can also be sometimes two, sometimes three or four.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
I tried this:
filename_without_ending.split(".")[0][-6:]
But sometimes the string looks like this:
PZA191030_392001_USB
So this solution is not valid, since the rest might also differ from time to time. The only REAL pattern is the first six digits.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
Regex demo | Python demo
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
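The question also asks to convert the digits to a valid date. A minimal sketch follows, assuming the six digits are in YYMMDD order (the question does not state the format) and dropping the trailing dot from the pattern so that the underscore variant matches too. extract_date is a hypothetical helper name.

```python
import re
from datetime import datetime

def extract_date(name):
    """Return a date parsed from the first six digits after 2-4 capital letters."""
    m = re.match(r"[A-Z]{2,4}(\d{6})", name)
    if m is None:
        return None
    # Assumption: YYMMDD. strptime raises ValueError for impossible dates.
    return datetime.strptime(m.group(1), "%y%m%d").date()

print(extract_date("PR191030.213101.ABD"))   # 2019-10-30
print(extract_date("PZA191030_392001_USB"))  # 2019-10-30
```

Letting strptime raise on values such as month 13 is what makes the result a "valid" date rather than just six digits.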
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done with:
filename_without_ending.split(".")[0][2:]
This takes the part before the first dot and slices it from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the first letters can differ, we have to ignore the letters and extract the digits.
Using the re module (regular expressions), apply a pattern to the string; it returns the parts of the string that match.
\d matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in the string, while search() finds only the first occurrence.
import re
s = "PR191030.213101.ABD"  # not named str, which would shadow the built-in
print(re.findall(r"\d+", s)[0])
print(re.search(r"\d+", s).group())

Regex match adjacent digits after second occurrence of character

Stuck with the following issue:
I have a string 'ABC.123.456XX' and I want to use regex to extract the 3 numeric characters that come after the second period. I'm really struggling with this and would appreciate any new insights. This is the closest I got, but it's not really close to what I want:
'.*\.(.*?\.\d{3})'
I appreciate any help in advance - thanks.
If your input will always be in a similar format, like xxx.xxx.xxxxx, then one solution is string manipulation:
>>> s = 'ABC.123.456XX'
>>> '.'.join(s.split('.')[2:])[0:3]
Explanation
In the line '.'.join(s.split('.')[2:])[0:3]:
s.split('.') splits the string into the list ['ABC', '123', '456XX']
'.'.join(s.split('.')[2:]) joins the remainder of the list after the second element, so '456XX'
[0:3] selects the substring from index 0 to index 2 (inclusive), so the result is 456
This expression might also work:
[^\r\n.]+\.[^\r\n.]+\.([0-9]{3})
Test
import re
regex = r'[^\r\n.]+\.[^\r\n.]+\.([0-9]{3})'
string = '''
ABC.123.456XX
ABCOUOU.123123123.000871XX
ABCanything_else.123123123.111871XX
'''
print(re.findall(regex, string))
Output
['456', '000', '111']
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
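Not-dot, then dot plus not-dot twice, then the 3 digits in capture group 1. Note that [^.]* can backtrack, so for a longer digit run such as 000871XX the group captures 871 rather than 000; anchoring as ^(?:[^.]*\.){2}(\d{3}) captures the digits directly after the second period.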
[^.]*(?:\.[^.]*){2}(\d{3})
https://regex101.com/r/qWpfHx/1
Expanded
[^.]*
(?: \. [^.]* ){2}
( \d{3} ) # (1)
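The effect of the greedy [^.]* in this pattern shows up when more than three digits follow the second period. The sketch below compares it with an anchored variant; the anchored pattern and the variable names are additions for illustration, not part of the answer above.

```python
import re

samples = ["ABC.123.456XX", "ABCOUOU.123123123.000871XX"]

# Pattern from the answer: [^.]* after the second dot can swallow digits
# and then give back only as many characters as needed for \d{3} to match.
backtracking = re.compile(r"[^.]*(?:\.[^.]*){2}(\d{3})")

# Illustrative alternative: consume exactly two "not-dots then dot" parts.
anchored = re.compile(r"^(?:[^.]*\.){2}(\d{3})")

for s in samples:
    print(s, backtracking.search(s).group(1), anchored.search(s).group(1))
```

For the single example in the question both patterns return 456; they differ only on the longer digit run.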

Regex for string that has 5 numbers or IND/5numbers

I am trying to build a regex to match 5 digit numbers or those 5 digit numbers preceded by IND/
10223 match to return 10223
IND/10110 match to return 10110
ID is 11233 match to return 11233
Ref is:10223 match to return 10223
Ref is: th10223 not match
SBI12234 not match
MRF/10234 not match
RBI/10229 not match
I have used the following regex, which selects the 5 digits correctly using word boundaries, but I'm not sure how to allow IND/ and not allow anything else like MRF/, etc.:
\b\d{5}\b
If I put (IND)? at the beginning of the regex, it doesn't help. Any hints?
Use a look behind:
(?<=^IND\/|^ID is |^)\d{5}\b
See live demo.
Because the look behind doesn’t consume any input, the entire match is your target number (ie there’s no need to use a group).
Variable-length lookbehind is not supported by Python's re module, so use lookbehind alternatives of equal width plus an alternation for the start-of-string case:
(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)
Demo
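A small check of this pattern against the examples from the question, assuming Python's re module. find_id and cases are illustrative names.

```python
import re

# Both lookbehind alternatives ("IND/" and " is" plus ":" or " ") are
# 4 characters wide, which is what Python's re requires.
pattern = re.compile(r"(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)")

def find_id(text):
    """Return the 5-digit number if the text matches, else None."""
    m = pattern.search(text)
    return m.group() if m else None

cases = {
    "10223": "10223",
    "IND/10110": "10110",
    "ID is 11233": "11233",
    "Ref is:10223": "10223",
    "Ref is: th10223": None,
    "SBI12234": None,
    "MRF/10234": None,
    "RBI/10229": None,
}

for text, expected in cases.items():
    print(f"{text!r}: {find_id(text)} (expected {expected})")
```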
This should work: (?<=IND/|\s|^)(\d{5})(?=\s|$) .
Try this: (?:IND\/|ID is |^)\b(\d{5})\b
Explanation:
(?: ALLOWED TEXT): A non-capture group with all allowed segments inside. In your example, IND\/ for "IND/", ID is for "ID is ...", and ^ for the beginning of the string (in case of only the number / no text at start: 12345).
\b(\d{5})\b: Your existing pattern w/ capture group for 5-digit number
I feel like this will need some logic to it. The regex can find the 5 digits, but maybe a second regex pattern to find IND, then join them together if need be. Not sure if you are using Python, .Net, or Java, but should be doable
