python regular expression for "_n1_n2_n1_n3_n1_n1_n2" - python

Let's say I have the following string:
My ID is _n1_n2_n1_n3_n1_n1_n2
I want to extract _n1_n2_n1_n3_n1_n1_n2. Only words in which _n occurs between 5 and 10 times should be considered, and each _n is followed by a digit from 0 to 9.
import re
str = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
match = re.search(r'_n\d{0,9}', str)
if match:
    print('found', match.group())
else:
    print('did not find')
I was able to extract _n1 with _n\d{0,9}, but I am unable to extend it further. Can anyone help me extend this in Python?

You need a regex that matches _n\d exactly 7 times in a row: '(_n\d){7}'
match = re.search(r'(_n\d){7}', value)
(_n\d){4,8} matches a range of repetitions (here 4 to 8)
(_n\d)+ matches one or more repetitions

I'm not sure if this is what you want but how about:
(_n\d)+
Explanation:
(..) signifies a group
+ means we want the group to repeat 1 or more times
_n\d means we want to have _n followed by a number
To extract the complete match, we can use regex group 0 which refers to the full match:
import re
test_str = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
match = re.search(r'(_n\d)+', test_str)
print(match.group(0))
Will output: _n1_n2_n1_n3_n1_n1_n2

In regex, {0,9} does not mean a number between 0 and 9. It is a quantifier: the number of times the preceding term (a single character, or a group in parentheses) may occur.
If you want a single digit from 0 to 9, that is [0-9], which is almost identical to \d (in Python 3, \d on str patterns also matches non-ASCII decimal digits unless re.ASCII is set).
So, what you need is either
(_n[0-9])+
or
(_n\d)+
where + means one or more occurrences.
From the comment
@KellyBundy I mean _n occurs 5-10 times, sorry for the wrong phrasing of the question.
you can further restrict + to be
(_n\d){5,10}
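As a hedged sketch of the 5-10 restriction applied to a whole word: the lookarounds (?<!\S) and (?!\S) are an assumption about what counts as a "word" here (a whitespace-delimited token), and the second token in the sample text is made up for illustration.

```python
import re

# Hypothetical sample text: the second token has only 2 repetitions.
text = "My ID is _n1_n2_n1_n3_n1_n1_n2 but _n1_n2 is too short"

# (?<!\S) and (?!\S) require whitespace (or a string edge) on both sides,
# so the whole token must consist of 5 to 10 repetitions of _n<digit>.
pattern = re.compile(r"(?<!\S)(?:_n\d){5,10}(?!\S)")

# The group is non-capturing, so findall returns the full matches.
matches = pattern.findall(text)
print(matches)
```

Without the lookarounds, a token with 12 repetitions would still yield a match on its first 10, which may or may not be what you want.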
As per the comment
how about extracting _n1 _n2 _n1 _n4 _n1 _n1 ?
you would construct the Regex for an individual part only and use findall() like so:
import re
str = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
match = re.findall(r'_n\d', str)
if match:
    print('found', match)
else:
    print('did not find')
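If you're not that comfortable with regex, you could also try simpler string operations, e.g.
result = str.split("_n")
print(result[1:])
Note that this yields only the digits ('1', '2', ...) without the _n prefix, and that str here is the variable defined above, which shadows the built-in str type.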

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop that matches the string with a specific number, which is the number between the second occurrence of "_" and the ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matched string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use regex expression. Based on this answer, I have tested this regex expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and it also doesn't make sure that it takes the exact number, so if the number is 1, it might also select items with number 10.
My end goal is to match items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regex. I am looking for help with the regex.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _
What you could do to get the path(s) is to make the number a variable for the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 repetitions of 1+ characters other than whitespace, _ or /, each followed by _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif, a word boundary, then optional characters other than whitespace or /
$ End of string
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
I'll show you something that works and is just as ugly as regex, which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
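numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
The rsplit on the last "_" gives e.g. ["imgs/foldeer/img_ABC_21389", "1.tif.tif"]
Take the last element, "1.tif.tif"
Split it on the dot and take the first element (that's your number as a string), then cast it to int
Splitting from the right matters here: x.split("_", 3) would return the wrong field for the "imgs_foldeer" path, which contains an extra underscore.
Please keep in mind that this only works for the format you provided; there is no flexibility at all in this solution (on the other hand, the regex doesn't give much either).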
With plain string slicing for most of the work, if that is allowed:
import re
s= 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last =s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Gives:
10
[0-9]+(?=\.tif\.tif)
This regex uses a lookahead to match the run of digits immediately before .tif.tif (what you're looking for) without including the extension in the match. Using + rather than * avoids empty matches.
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']
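Since the question asks for a loop over several numbers, one possible alternative is to parse the number out of each path once and group the paths by it. This is only a sketch: the names suffix_number and paths_by_number are made up for illustration, and it assumes the number always sits directly between an underscore and the first ".tif".

```python
import re
from collections import defaultdict

list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]

# Capture the digits between an underscore and the first ".tif".
suffix_number = re.compile(r"_(\d+)\.tif")

# Map each number to the list of full paths that carry it.
paths_by_number = defaultdict(list)
for path in list_paths:
    m = suffix_number.search(path)
    if m:
        paths_by_number[int(m.group(1))].append(path)

for number in (1, 2, 10, 21):
    print(number, paths_by_number[number])
```

Because the captured digits are compared as a whole integer, looking up 1 cannot accidentally return the paths for 10 or 21.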

Extract date from inside a string with Python

I have the following string. The leading letters can differ, and there can be two, three or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert it to a valid date. I tried:
filename_without_ending.split(".")[0][-6:]
This solution is not valid, since the separator might also differ from time to time. Sometimes the string looks like this:
PZA191030_392001_USB
The only REAL pattern is the first six digits.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
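The question also asks to turn the six digits into a date, which the pattern alone does not do. A minimal sketch, assuming the digits are in YYMMDD order (the question does not say so) and dropping the trailing \. so that names like PZA191030_392001_USB also match; extract_date is a hypothetical helper name:

```python
import re
from datetime import datetime

def extract_date(name):
    """Return the leading 6-digit block as a date, or None if absent."""
    # One or more leading letters, then exactly six captured digits.
    m = re.match(r"[A-Za-z]+(\d{6})", name)
    if m is None:
        return None
    # Assumption: the digits are year, month, day (YYMMDD).
    # strptime raises ValueError if they do not form a valid date.
    return datetime.strptime(m.group(1), "%y%m%d").date()

print(extract_date("PR191030.213101.ABD"))
print(extract_date("PZA191030_392001_USB"))
```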
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2:]
This slices the first part of the name from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the leading letters can differ, we have to ignore the letters and extract the digits.
Using the re module (for regular expressions), apply a regex pattern to the string; it returns the part of the string that matches the pattern.
\d matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in a given string, while search() finds only the first occurrence.
import re
str="PR191030.213101.ABD"
print(re.findall(r"\d+",str)[0])
print(re.search(r"\d+",str).group())

Python; parse version number with dots from string via regex

So I have this kind of string:
some_string-1.4.2.4-RELEASE.some_extension
And I want to parse the version number (in my example: 1.4.2.4)
But the number between the dots will not always be 1 digit, it could be something like: 1.40.2.4 or 11.4.2.4.
This is what I have tried:
(\d+\.)?\d+\.\d+
And this does not parse all the numbers.
EDIT
I tried to use the answer from the duplicate link: \d+(\.\d+)+
And according to regex101 I get this result:
Full match 17-24 1.4.2.4
Group 1. 22-24 .4
But in my code I got only .4:
file_name = 'some_string-1.4.2.4-RELEASE.some_extension'
match = re.findall('\d+(\.\d+)+', file_name)
if len(match) == 0:
    print('Failed to match version number')
else:
    print(match[0])
    return match[0]
You might want to consider the following pattern:
file_name = 'some_string-1.4.2.4-RELEASE.some_extension'
pattern = r'\-([0-9.-]*)\-'
match = re.findall(pattern, file_name)
if len(match) == 0:
    print('Failed to match version number')
else:
    print(match[0])
output:
1.4.2.4
Your pattern is almost right.
Use
(\d+(?:\.\d+)+)
This makes the first group capture the entire version number and turns the inner repeating group into a non-capturing one.
import re
str = "some_string-1.4.2.4-RELEASE.some_extension"
regex = r'(\d+(?:\.\d+)+)'
print(re.findall(regex, str)) # prints ['1.4.2.4']
The pattern \d+(\.\d+)+ contains a repeated capturing group. The group keeps only the value of its last iteration, which is .4, and findall returns the group's value rather than the full match.
If you make it a non-capturing group, findall returns the whole match, but the pattern also matches values like 1.1 and 9.9.9.99.9.9
\d+(?:\.\d+)+
If the version must contain exactly 3 dots and sit between hyphens, you might use a capturing group:
-(\d+(?:\.\d+){3})-
Or use lookarounds to get a match without using a group:
(?<=-)\d+(?:\.\d+){3}(?=-)
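The difference between the capturing and non-capturing variants can be checked side by side. The following is a small sketch using the file name from the question; converting the version to a tuple of ints at the end is an optional extra, not something the question asked for.

```python
import re

file_name = 'some_string-1.4.2.4-RELEASE.some_extension'

# With a capturing group, findall returns the group (its last iteration).
capturing = re.findall(r'\d+(\.\d+)+', file_name)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'\d+(?:\.\d+)+', file_name)

# search().group(0) is always the whole match, whatever the groups are.
full = re.search(r'\d+(\.\d+)+', file_name).group(0)

# Optional: a tuple of ints compares correctly (e.g. 1.40 > 1.4).
version = tuple(int(part) for part in full.split('.'))

print(capturing, non_capturing, full, version)
```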

Regular expression match last occurence of year in string

I have written a python script with the following function, which takes as input a file name that contains multiple dates.
CODE
import re
from datetime import datetime
def ExtractReleaseYear(title):
    rg = re.compile('.*?([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    match = rg.search(title) # Using non-greedy match on filler
    if match:
        releaseYear = match.group(1)
        try:
            if int(releaseYear) >= 1900 and int(releaseYear) <= int(datetime.now().year) and int(releaseYear) <= 2099: # Film between 1900-2099
                return releaseYear
        except ValueError:
            print("ERROR: The film year in the file name could not be converted to an integer for comparison.")
            return ""
print(ExtractReleaseYear('2012.(2009).3D.1080p.BRRip.SBS.x264'))
print(ExtractReleaseYear('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))
print(ExtractReleaseYear('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))
OUTPUT
Returned: 2012 -- I'd like this to be 2009 (i.e. last occurrence of year in string)
Returned: 2012 -- This is correct! (last occurrence of year is the first one, thus right)
Returned: 2001 -- I'd like this to be 1968 (i.e. last occurrence of year in string)
ISSUE
As can be observed, the regex will only target the first occurrence of a year instead of the last. This is problematic because some titles (such as the ones included here) begin with a year.
Searching for ways to get the last occurrence of the year has led me to resources such as negative lookahead, last occurrence of repeated group and last 4 digits in URL, none of which have gotten me any closer to the desired result. No existing question currently answers this case.
INTENDED OUTCOME
I would like to extract the LAST occurrence (instead of the first) of a year from the given file name and return it using the existing definition/function as stated in the output quote above.
While I have used online regex references, I am new to regex and would appreciate someone showing me how to implement this filter to work on the file names above. Cheers guys.
As per @kenyanke's answer, choosing findall() over search() is a better option, as the former returns all non-overlapping matches. You can then pick the last match as releaseYear. Here is my regex to find releaseYear:
rg = re.compile(r'[^a-z](\d{4})[^a-z]', re.IGNORECASE)
match = rg.findall(title)
if match:
    releaseYear = match[-1]
The above regex assumes that the characters immediately before and after the year are not letters. It also requires a character on both sides, so a year at the very start or end of the title is not matched. The results (match) for the three strings are:
['2009']
['2012']
['1968']
There are two things you need to change:
The first .*? lazy pattern must be turned to greedy .* (in this case, the subpatterns after .* will match the last occurrence in the string)
The group you need to use is Group 2, not Group 1 (as it is the group that stores the year data). Or make the first capturing group non-capturing.
See this demo:
rg = re.compile('.*([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
...
releaseYear = match.group(2)
Or:
rg = re.compile('.*(?:[\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
...
releaseYear = match.group(1)
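The effect of switching from lazy to greedy can be seen in isolation. This sketch uses a simplified year subpattern, (?:19|20)\d{2}, rather than the bracket-aware one from the question, purely to keep the comparison short:

```python
import re

title = '2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'

# Lazy filler: .*? consumes as little as possible, so the group
# lands on the FIRST year-like run of digits.
lazy = re.search(r'.*?((?:19|20)\d{2})', title).group(1)

# Greedy filler: .* consumes as much as possible and backtracks,
# so the group lands on the LAST year-like run of digits.
greedy = re.search(r'.*((?:19|20)\d{2})', title).group(1)

print(lazy, greedy)
```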
Consider using findall() instead of search().
It puts all matches into a list from left to right, so you just have to access the rightmost value to get what you want.
import re
from datetime import datetime
def ExtractReleaseYear(title):
    rg = re.compile('.*?([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    match = rg.findall(title)
    if match:
        try:
            releaseYear = match[-1][-1]
            if int(releaseYear) >= 1900 and int(releaseYear) <= int(datetime.now().year) and int(releaseYear) <= 2099: # Film between 1900-2099
                return releaseYear
        except ValueError:
            print("ERROR: The film year in the file name could not be converted to an integer for comparison.")
            return ""
print(ExtractReleaseYear('2012.(2009).3D.1080p.BRRip.SBS.x264'))
print(ExtractReleaseYear('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))
print(ExtractReleaseYear('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))
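A more compact variant of the same findall() idea, offered as a sketch rather than a drop-in replacement: the pattern below is a simplification (any 19xx or 20xx not adjacent to other digits), and YEAR and last_year are names made up for this example.

```python
import re

# (?<!\d) and (?!\d) keep the match from starting or ending inside a
# longer run of digits.
YEAR = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")

def last_year(title):
    """Return the last year-like token in title, or "" if there is none."""
    years = YEAR.findall(title)
    return years[-1] if years else ""

print(last_year('2012.(2009).3D.1080p.BRRip.SBS.x264'))
print(last_year('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))
print(last_year('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))
```

Unlike the original function, this sketch does not check the year against the current date; that check could be added around the return value if needed.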

Searching for multiple substrings of unknown size in string in python

I've seen lots of RE stuff in python but nothing for the exact case and I can't seem to get it. I have a list of files with names that look like this:
summary_Cells_a_01_2_1_45000_it_1.txt
summary_Cells_a_01_2_1_40000_it_2.txt
summary_Cells_bb_01_2_1_36000_it_3.txt
The "summary_Cells_" is always present. Then there is a string of letters, either 1, 2 or 3 long. Then there is "_01_2_1_" always. Then there is a number between 400 and 45000. Then there is "it" and then a number from 0-9, then ".txt"
I need to extract the letter(s) piece.
I was trying:
match = re.search('summary_Cells_(\w)_01_2_1_(\w)_it_(\w).txt', filename)
but was not getting anything for the match. I'm trying to get just the letters, but later might want the it number (last number) or the step (the middle number).
Any ideas?
Thanks
You're missing repetitions, i.e.:
re.search('summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+).txt', filename)
\w will only match a single character
\w+ will match at least one
\w* will match any amount (0 or more)
Reference: Regular expression syntax
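To make the difference concrete, here is a short sketch that runs both variants on one of the file names from the question (with the dot before txt escaped, which the original pattern leaves unescaped):

```python
import re

filename = "summary_Cells_bb_01_2_1_36000_it_3.txt"

# \w matches exactly one character, so "bb" and "36000" cannot fit.
single = re.search(r'summary_Cells_(\w)_01_2_1_(\w)_it_(\w)\.txt', filename)

# \w+ matches one or more characters, so each group can grow as needed.
repeated = re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt', filename)

print(single)             # no match object
print(repeated.groups())  # letters, step, iteration
```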
You were almost there. All you need to do is repeat the pattern inside each capture group:
summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+).txt
Example usage
>>> filename="summary_Cells_a_01_2_1_45000_it_1.txt"
>>> match = re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+).txt', filename)
>>> match.group()
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(0)
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(1)
'a'
>>> match.group(2)
'45000'
>>> match.group(3)
'1'
Note
match.group(n) returns the value captured by the nth capture group
You don't need a regex, there is nothing complex about the pattern and it does not change:
s = "summary_Cells_a_01_2_1_45000_it_1.txt"
print(s.split("_")[2])
a
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
print(s.split("_")[2])
bb
If you want both sets of letters:
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
spl = s.split("_")
a,b = spl[2],spl[7]
print(a,b)
('bb', 'it')
Since you only want to capture the letters at the beginning, you could do:
re.search(r'summary_Cells_(\w+)_01_2_1_[0-9]{3,6}_it_[0-9]\.txt', filename)
Which doesn't bother giving you the groups you don't need.
[0-9] matches a single digit and [0-9]{3,6} allows for 3 to 6 digits.
You're on the right track with your regex, but, as everyone else seems to forget, \w matches alphanumerics and the underscore, so you should use [a-z] instead.
re.search(r"summary_Cells_([a-z]+)_\w+\.txt", filename)
Or, as Padraic mentioned, you can just use str.split("_").
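Since the question mentions possibly needing the step and the iteration number later, one option is a stricter pattern with named groups. This is a sketch under the format described in the question; the group names (letters, step, iteration) and the helper parse_name are invented for illustration.

```python
import re

# Named groups document what each part means; [A-Za-z]{1,3} and \d
# follow the format described in the question.
NAME = re.compile(
    r"summary_Cells_(?P<letters>[A-Za-z]{1,3})_01_2_1_"
    r"(?P<step>\d+)_it_(?P<iteration>\d)\.txt"
)

def parse_name(filename):
    """Return the parsed parts as a dict, or None if the name does not fit."""
    m = NAME.fullmatch(filename)
    if m is None:
        return None
    return {
        "letters": m.group("letters"),
        "step": int(m.group("step")),
        "iteration": int(m.group("iteration")),
    }

print(parse_name("summary_Cells_a_01_2_1_45000_it_1.txt"))
print(parse_name("summary_Cells_bb_01_2_1_36000_it_3.txt"))
```

Using fullmatch means a name with extra leading or trailing text is rejected instead of being partially matched.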
