I have written a python script with the following function, which takes as input a file name that contains multiple dates.
CODE
import re
from datetime import datetime

def ExtractReleaseYear(title):
    rg = re.compile('.*?([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    match = rg.search(title) # Using non-greedy match on filler
    if match:
        releaseYear = match.group(1)
        try:
            if int(releaseYear) >= 1900 and int(releaseYear) <= int(datetime.now().year) and int(releaseYear) <= 2099: # Film between 1900-2099
                return releaseYear
        except ValueError:
            print("ERROR: The film year in the file name could not be converted to an integer for comparison.")
    return ""

print(ExtractReleaseYear('2012.(2009).3D.1080p.BRRip.SBS.x264'))
print(ExtractReleaseYear('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))
print(ExtractReleaseYear('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))
OUTPUT
Returned: 2012 -- I'd like this to be 2009 (i.e. last occurrence of year in string)
Returned: 2012 -- This is correct! (last occurrence of year is the first one, thus right)
Returned: 2001 -- I'd like this to be 1968 (i.e. last occurrence of year in string)
ISSUE
As can be observed, the regex will only target the first occurrence of a year instead of the last. This is problematic because some titles (such as the ones included here) begin with a year.
Searching for ways to get the last occurrence of the year led me to resources such as negative lookahead, last occurrence of repeated group and last 4 digits in URL, none of which have gotten me any closer to achieving the desired result. No existing question currently answers this unique case.
INTENDED OUTCOME
I would like to extract the LAST occurrence (instead of the first) of a year from the given file name and return it using the existing definition/function as stated in the output quote above.
While I have used online regex references, I am new to regex and would appreciate someone showing me how to implement this filter to work on the file names above. Cheers guys.
As per @kenyanke's answer, choosing findall() over search() is the better option, because findall() returns all non-overlapping matches of the pattern. You can then pick the last match as releaseYear. Here is my regex to find releaseYear:
rg = re.compile(r'[^a-z](\d{4})[^a-z]', re.IGNORECASE)
match = rg.findall(title)
if match:
    releaseYear = match[-1]
The regex above assumes that the characters immediately before and after the release year are not letters. The results (match) for the three strings are:
['2009']
['2012']
['1968']
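One caveat with the pattern above: [^a-z] has to consume a character on each side, so a year at the very start or very end of a title is never matched. A minimal sketch of a variant that uses lookarounds instead. This is a swapped-in technique, not the answer's original pattern, and the helper name last_year_candidate is hypothetical. Like the original, it accepts any 4-digit run and does no range check.

```python
import re

# Lookarounds assert "no letter or digit next to the 4 digits" without
# consuming a character, so a year at the start or end of the title counts too.
FOUR_DIGITS = re.compile(r'(?<![a-z0-9])\d{4}(?![a-z0-9])', re.IGNORECASE)

def last_year_candidate(title):
    """Return the last standalone 4-digit run in title, or "" if there is none."""
    candidates = FOUR_DIGITS.findall(title)
    return candidates[-1] if candidates else ""

print(last_year_candidate('2012.(2009).3D.1080p.BRRip.SBS.x264'))                  # 2009
print(last_year_candidate('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))         # 2012
print(last_year_candidate('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))   # 1968
```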
There are two things you need to change:
The first .*? lazy pattern must be turned to greedy .* (in this case, the subpatterns after .* will match the last occurrence in the string)
The group you need to use is Group 2, not Group 1 (as it is the group that stores the year data). Or make the first capturing group non-capturing.
See this demo:
rg = re.compile('.*([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
...
releaseYear = match.group(2)
Or:
rg = re.compile('.*(?:[\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
...
releaseYear = match.group(1)
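Putting the greedy variant together with the question's range check, a minimal sketch might look like this. The function name extract_release_year is hypothetical, the optional bracket parts are dropped because they do not affect the captured year, and the pattern keeps the question's limit of years 1900 to 2019.

```python
import re
from datetime import datetime

# Greedy ".*" consumes as much as possible and then backtracks, so the
# year captured in group 1 is the last one in the title.
LAST_YEAR = re.compile(r'.*((?:19[0-9]|20[01])[0-9])', re.DOTALL)

def extract_release_year(title):
    """Return the last year (as a string) found in title, or "" if there is none."""
    match = LAST_YEAR.search(title)
    if match is None:
        return ""
    year = match.group(1)
    # Same sanity check as in the question: 1900 up to the current year.
    if 1900 <= int(year) <= datetime.now().year:
        return year
    return ""

print(extract_release_year('2012.(2009).3D.1080p.BRRip.SBS.x264'))                 # 2009
print(extract_release_year('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))        # 2012
print(extract_release_year('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))  # 1968
```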
Consider using findall() over search()?
It puts all the matches it finds into a list, from left to right, so you just need to access the rightmost value to get what you want.
import re
from datetime import datetime

def ExtractReleaseYear(title):
    rg = re.compile('.*?([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    match = rg.findall(title)
    if match:
        try:
            releaseYear = match[-1][-1]
            if int(releaseYear) >= 1900 and int(releaseYear) <= int(datetime.now().year) and int(releaseYear) <= 2099: # Film between 1900-2099
                return releaseYear
        except ValueError:
            print("ERROR: The film year in the file name could not be converted to an integer for comparison.")
    return ""

print(ExtractReleaseYear('2012.(2009).3D.1080p.BRRip.SBS.x264'))
print(ExtractReleaseYear('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))
print(ExtractReleaseYear('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))
Related
I have a series of strings some of which have a year string at the end in the format -2022. I'm looking to match everything up to but excluding the - before 4 digit year string but if there is no year present then I would like to return the entire string. The following:
import re
x = "itf-m15-cancun-15-men-2022"
re.search(r"^.+?(?=-\d\d\d\d)", x).group()
Gets me 'itf-m15-cancun-15-men' which I'm looking for. However, the following:
import re
x = "itf-m15-cancun-15-men"
re.search(r"^.+?(?=-\d\d\d\d)", x).group()
Raises an error, because no match is found (re.search returns None). How do I capture everything up to but excluding the - before the 4-digit year string, or return the whole string if the year string isn't present?
Add an alternative for the end of the string, |$, inside your lookahead:
^.+?(?=-\d{4}|$)
See demo at regex101
Alternatively an explicit greedy alternation could be used here like in this demo.
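In Python, the lookahead with the |$ alternative can be sketched as follows. The helper name strip_year_suffix is hypothetical. If the year must sit at the very end of the string, -\d{4}$ inside the lookahead would be a stricter variant.

```python
import re

# Lazy ".+?" stops as soon as the lookahead sees "-" plus 4 digits,
# or, failing that, at the end of the string.
PREFIX = re.compile(r'^.+?(?=-\d{4}|$)')

def strip_year_suffix(text):
    """Return text up to (not including) the first "-YYYY", or all of it."""
    match = PREFIX.search(text)
    return match.group() if match else text

print(strip_year_suffix("itf-m15-cancun-15-men-2022"))  # itf-m15-cancun-15-men
print(strip_year_suffix("itf-m15-cancun-15-men"))       # itf-m15-cancun-15-men
```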
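Making the (?=-\d\d\d\d) lookahead optional by adding a ? after it has also been suggested (tested in JavaScript). Note that because of the trailing $, the lazy .+? still has to run to the end of the string, so this pattern returns the whole string even when a year is present: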
/^.+?(?=-\d\d\d\d)?$/
Let's say I have the following string:
My ID is _n1_n2_n1_n3_n1_n1_n2
I'm looking to extract _n1_n2_n1_n3_n1_n1_n2. We only need to consider words in which _n occurs 5 to 10 times, and the digit that follows each _n is anywhere between 0 and 9.
import re
str = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
match = re.search(r'_n\d{0,9}', str)
if match:
    print('found', match.group())
else:
    print('did not find')
I was able to extract _n1 with _n\d{0,9} but am unable to extend it further. Can anyone help me extend this in Python?
You need a regex that matches _n\d exactly 7 times: '(_n\d){7}'
match = re.search(r'(_n\d){7}', value)
(_n\d){4,8} for a range of repetition counts
(_n\d)+ for any number of repetitions (at least one)
I'm not sure if this is what you want but how about:
(_n\d)+
Explanation:
(..) signifies a group
+ means we want the group to repeat 1 or more times
_n\d means we want to have _n followed by a digit
To extract the complete match, we can use regex group 0 which refers to the full match:
import re
test_str = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
match = re.search(r'(_n\d)+', test_str)
print(match.group(0))
Will output: _n1_n2_n1_n3_n1_n1_n2
In regex, {0,9} is not a digit between 0 and 9. It is a quantifier giving the number of occurrences of the term in front of it, which can be a single character or a group (in parentheses).
If you want a single digit from 0 to 9, that is [0-9], which is almost identical to \d (in Python 3, \d in str patterns also matches non-ASCII Unicode decimal digits).
So, what you need is either
(_n[0-9])+
or
(_n\d)+
(online), where + is the number of occurrences from 1 to infinity.
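To illustrate the difference between a quantifier and a character class, and the remark about \d, here is a small sketch. The Arabic-Indic digit is just an example input chosen for illustration.

```python
import re

text = 'My ID is _n1_n2_n1_n3_n1_n1_n2'

# {0,9} is a quantifier: "\d{0,9}" means zero to nine digits in a row.
print(re.search(r'_n\d{0,9}', text).group())   # _n1

# "+" repeats the whole group, so the full run is matched.
print(re.search(r'(_n[0-9])+', text).group())  # _n1_n2_n1_n3_n1_n1_n2

# For str patterns, \d also matches non-ASCII decimal digits; [0-9] does not.
arabic_indic_three = '\u0663'
print(bool(re.fullmatch(r'\d', arabic_indic_three)))     # True
print(bool(re.fullmatch(r'[0-9]', arabic_indic_three)))  # False
```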
From the comment
@KellyBundy I mean _n occurs 5-10 times, sorry for phrasing the question wrongly.
you can further restrict + to be
(_n\d){5,10}
(online)
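A short sketch of the {5,10} restriction in Python. Splitting on whitespace and using fullmatch() is an assumption about what "word" means in the question; it ensures that a run longer than 10 is rejected instead of being cut off after 10 repetitions. The helper name find_id and the second sample string are made up for illustration.

```python
import re

# A non-capturing group repeated 5 to 10 times.
ID_WORD = re.compile(r'(?:_n\d){5,10}')

def find_id(text):
    """Return the first whitespace-separated word made of 5 to 10 "_n<digit>" parts."""
    for word in text.split():
        if ID_WORD.fullmatch(word):
            return word
    return None

print(find_id('My ID is _n1_n2_n1_n3_n1_n1_n2'))  # _n1_n2_n1_n3_n1_n1_n2
print(find_id('My ID is _n1_n2_n3'))              # None
```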
As per the comment
how about extracting _n1 _n2 _n1 _n4 _n1 _n1 ?
you would construct the Regex for an individual part only and use findall() like so:
import re
str = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
match = re.findall(r'_n\d', str)
if match:
    print('found', match)
else:
    print('did not find')
If you're not that comfortable with regex, you could also try much simpler string operations (note that this yields only the digits, without the _n prefix), e.g.
result = str.split("_n")
print(result[1:])
I am searching for a specific string within a document that will have known words before and after a date, and I want to extract the date. For example, if the substring is "dated as of 29 Jan 2017 to the schedule", I want to extract "29 Jan 2017".
My code is:
m = re.search(r'dated as of \w+\s+(.+?)+to the schedule', text, re.IGNORECASE)
if m:
    items["date"] = m.group(1)
But - this just gives me "Jan 2017" - it misses the day.
I have tried various variations on the regex search string, but still can't get the day. Any thoughts?
Your capturing group (the parentheses) does not enclose the first part, which is matched by \w+.
Try mixing capturing group (for the whole part) and non-capturing group for your current parentheses:
r'dated as of (\w+\s+(?:.+?)+) to the schedule'
As you can see, we have a simple grouping with no repetition that encloses both \w+ and your previous parentheses.
And your previous parentheses were changed to non-capturing group with ?: just inside them.
Better yet, your existing parentheses and the combination of +? and + don't make much sense, so you can just remove them:
r'dated as of (\w+\s+.+) to the schedule'
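A runnable sketch of the simplified pattern. It uses a lazy .+? instead of the greedy .+, an assumption that stops at the first "to the schedule" in case the phrase appears again later in a longer document. The surrounding sample text is made up for illustration.

```python
import re

# The single capturing group encloses the day (\w+) as well as the rest of the date.
DATE_BETWEEN = re.compile(r'dated as of (\w+\s+.+?) to the schedule', re.IGNORECASE)

text = 'an amendment dated as of 29 Jan 2017 to the schedule attached hereto'
items = {}
m = DATE_BETWEEN.search(text)
if m:
    items["date"] = m.group(1)
print(items)  # {'date': '29 Jan 2017'}
```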
"re" module included with Python primarily used for string searching and manipulation
\w = a word character (alphanumeric, including "_")
\d = any digit
+ = matches 1 or more of the preceding item
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
import re
data = "dated as of 29 Jan 2017 to the schedule"
match = re.findall(r'\d+ \w+ \d{4}', data)
print(match[0])
output:
29 Jan 2017
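The remark about groups in the findall() description matters in practice. A small sketch of the three cases, using the same sample string (the variable names are arbitrary):

```python
import re

data = "dated as of 29 Jan 2017 to the schedule"

# No groups: a list of whole matches.
whole = re.findall(r'\d+ \w+ \d{4}', data)
print(whole)        # ['29 Jan 2017']

# One group: a list containing only that group's text.
one_group = re.findall(r'\d+ (\w+) \d{4}', data)
print(one_group)    # ['Jan']

# Several groups: a list of tuples, one tuple per match.
many_groups = re.findall(r'(\d+) (\w+) (\d{4})', data)
print(many_groups)  # [('29', 'Jan', '2017')]
```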
This works fine:
import re
text = "dated as of 29 Jan 2017"
m = re.search(r'\d\d\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4}',
              text, re.IGNORECASE)
if m:
    print(m.group(0))
So I have this kind of string:
some_string-1.4.2.4-RELEASE.some_extension
And I want to parse the version number (in my example: 1.4.2.4)
But the number between the dots will not always be 1 digit, it could be something like: 1.40.2.4 or 11.4.2.4.
This is what I have tried:
(\d+\.)?\d+\.\d+
And this does not parse all the numbers.
EDIT
I tried to use the answer from the duplicate link: \d+(\.\d+)+
And according to regex101 I get this result:
Full match 17-24 1.4.2.4
Group 1. 22-24 .4
But in my code I got only .4:
file_name = 'some_string-1.4.2.4-RELEASE.some_extension'
match = re.findall('\d+(\.\d+)+', file_name)
if len(match) == 0:
    print('Failed to match version number')
else:
    print(match[0])
    return match[0]
You might want to consider the following pattern :
file_name = 'some_string-1.4.2.4-RELEASE.some_extension'
pattern = r'\-([0-9.-]*)\-'
match = re.findall(pattern, file_name)
if len(match) == 0:
    print('Failed to match version number')
else:
    print(match[0])
output:
1.4.2.4
Your pattern is almost right.
Use
(\d+(?:\.\d+)+)
This makes the first group capture the entire version number and turns the inner repeating group into a non-capturing one.
import re
file_name = "some_string-1.4.2.4-RELEASE.some_extension"
regex = r'(\d+(?:\.\d+)+)'
print(re.findall(regex, file_name)) # prints ['1.4.2.4']
The pattern \d+(\.\d+)+ contains a repeated capturing group. The group only keeps the value of its last iteration, which is .4, and that is what findall returns.
If you make it a non-capturing group, findall returns the whole match, but the pattern also matches values like 1.1 and 9.9.9.99.9.9:
\d+(?:\.\d+)+
If the version must contain exactly 3 dots and sit between hyphens, you might use a capturing group:
-(\d+(?:\.\d+){3})-
Regex demo
Or use lookarounds to get a match without using a group:
(?<=-)\d+(?:\.\d+){3}(?=-)
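The three behaviours described above can be checked side by side in Python. This is a small sketch; the file name uses a multi-digit component (1.40.2.4), as mentioned in the question, and the variable names are arbitrary.

```python
import re

file_name = 'some_string-1.40.2.4-RELEASE.some_extension'

# Repeated capturing group: findall returns only the last iteration.
last_only = re.findall(r'\d+(\.\d+)+', file_name)
print(last_only)   # ['.4']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'\d+(?:\.\d+)+', file_name)
print(whole)       # ['1.40.2.4']

# Lookarounds: require the surrounding hyphens without including them.
m = re.search(r'(?<=-)\d+(?:\.\d+){3}(?=-)', file_name)
version = m.group() if m else None
print(version)     # 1.40.2.4
```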
I have to find dates in multiple formats in a text.
I have some regex like this one:
# Detection of:
# 25/02/2014 or 25/02/14 or 25.02.14
regex = r'\b(0?[1-9]|[12]\d|3[01])[-/\._](0?[1-9]|1[012])[-/\._]((?:19|20)\d\d|\d\d)\b'
The problem is that it also matches dates like 25.02/14 which is not good because the splitting character is not the same.
I could of course do multiple regex with a different splitting character for every regex, or do a post-treatment on the matching results, but I would prefer a complete solution using only one good regex. Is there a way to do so?
In addition to my comment (the original word boundary approach lets the pattern match "dates" that are in fact parts of other entities, like IPs, serial numbers, product IDs, etc.), see the improved version of your regex in comparison with yours:
import re
s = '25.02.19.35 6666-20-03-16-67875 25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = [m.group() for m in re.finditer(r'\b(?:0?[1-9]|[12]\d|3[01])([./-])(?:0?[1-9]|1[012])\1(?:19|20)?\d\d\b', s)]
print(found_dates) # initial regex
# => ['25.02.19', '20-03-16', '11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
found_dates = [m.group() for m in re.finditer(r'(?<![\d.-])(?:0?[1-9]|[12]\d|3[01])([./-])(?:0?[1-9]|1[012])\1(?:19|20)?\d\d(?!\1\d)', s)]
print(found_dates) # fixed boundaries
# => ['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
See, your regex extracts '25.02.19' (part of a potential IP) and '20-03-16' (part of a potential serial number/product ID).
Note I also shortened the regex and extraction code a bit.
Pattern details:
(?<![\d.-]) - a negative lookbehind making sure there is no digit, . or - immediately to the left of the current location (/ has been discarded since dates are often found inside URLs)
(?:0?[1-9]|[12]\d|3[01]) - 01 / 1 to 31 (day part)
([./-]) - Group 1 (technical group to hold the separator value) matching either ., or / or -
(?:0?[1-9]|1[012]) - month part: 01 / 1 to 12
\1 - backreference to the Group 1 value to make sure the same separator comes here
(?:19|20)?\d\d - year part: 19 or 20 (optional values) and then any two digits.
(?!\1\d) - negative lookahead making sure there is no separator (captured into Group 1) followed with any digit immediately to the right of the current location.
Based on the comment of Rawing, this did the trick:
regex = r'\b(0?[1-9]|[12]\d|3[01])([./-])(0?[1-9]|1[012])\2((?:19|20)\d\d|\d\d)\b'
So, the complete code is:
import re
s = '25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = []
for m in re.finditer(r'\b(0?[1-9]|[12]\d|3[01])([./-])(0?[1-9]|1[012])\2((?:19|20)\d\d|\d\d)\b', s):
    found_dates.append(m.group(0))
print(found_dates)
The output is, as desired:
['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
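The numbered backreference \2 depends on how many groups precede the separator. A sketch of the same pattern with a named group instead, which is a stylistic variation and not part of the original answer (the group name sep is arbitrary):

```python
import re

DATE = re.compile(
    r'\b(?:0?[1-9]|[12]\d|3[01])'   # day: 1 to 31, optional leading zero
    r'(?P<sep>[./-])'               # separator, remembered by name
    r'(?:0?[1-9]|1[012])'           # month: 1 to 12, optional leading zero
    r'(?P=sep)'                     # must be the same separator again
    r'(?:(?:19|20)\d\d|\d\d)\b'     # 4-digit (19xx/20xx) or 2-digit year
)

s = '25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = [m.group(0) for m in DATE.finditer(s)]
print(found_dates)
# ['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
```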