Extract date from inside a string with Python

Extract date from inside a string with Python - python

I have the following string, while the first letters can differ and can also be sometimes two, sometimes three or four.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
filename_without_ending.split(".")[0][-6:]
PZA191030_392001_USB
Sometimes it looks liket his
This solution is not valid since this is also might differ from time to time. The only REAL pattern is really the first six numbers.
How do I do this?
Thank you!

You could get the first 6 digits using a pattern an a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
Regex demo | Python demo
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
print(matches.group(1))
Output
191030

You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030

This can also be done by:
filename_without_ending.split(".")[0][2::]
This splits the string from the 3rd letter to the end.

Since first letters can differ we have to ignore alphabets and extract digits.
So using re module (for regular expressions) apply regex pattern on string. It will give matching pattern out of string.
'\d' is used to match [0-9]digits and + operator used for matching 1 digit atleast(1/more).
findall() will find all the occurences of matching pattern in a given string while #search() is used to find matching 1st occurence only.
import re
str="PR191030.213101.ABD"
print(re.findall(r"\d+",str)[0])
print(re.search(r"\d+",str).group())

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop to match string with specific number,which is the number between the second occurance of "_" to the ".tif.tif", for example, when number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif" , for number 2, the match string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use regex expression. Based on this answer, I have tested this regex expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and also doesn't make sure that it will take the exact number, so if number is 1, it might also select items with number 10 .
My end goal is to be able to match items in the list that have the request number between the 2nd occurrence of "_" to the first occirance of ".tif" , using regex expression, looking for help with the regex expression.
EDIT: The output should be the whole path and not only the number.

Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _
What you could do to get the path(s) is to make the number a variable for the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 times (non consecutive) _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif followed by any char except /
$ End of string
See a regex demo and a Python demo.
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']

I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
numbers = [int(x.split("_",3)[-1].split(".")[0]) for x in data]
First split gives ".tif.tif"
extract the last element
split again by the dot this time, take the first element (thats your number as a string), cast it to int
Please keep in mind it's gonna work only for the format you provided, no flexibility at all in this solution (on the other hand regex doesn't give any neither)

without regex if allowed.
import re
s= 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last =s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Gives #
10

[0-9]*(?=\.tif\.tif)
This regex expression uses a lookahead to capture the last set of numbers (what you're looking for)

Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(f".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(f".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

ignore first occurance of letter regex

using the following strings
9989S90K72MF-1
9989S90S-1
9989S75K60MF-1
9989S75S-1
I Would like to extract the below from those strings.
9989S90
9989S90
9989S75
9989S75
So far I have:
(^.*?(?=K|-))
Which gives me:
9989S90
9989S90S
9989S75
9989S75S
Here's a link https://regex101.com/r/d1nQj0/1
I've tried a few different regex but can't seem to nail it. Is there a way to ignore the first occurrence of a digit/letter? Which in my case would be S

The following regex matches a string at the beginning of a line that contains a single S up to but not including the first occurrence of S or K
^(.*?S.*?)(?=K|S)

For the example data, you could also match 1+ digits, then S followed by 1+ digits.
^\d+S\d+
Regex demo
If there has to be a S K or - at the right:
^\d+S\d+(?=[KS-])
Regex demo
Example
import re
regex = r"^\d+S\d+(?=[KS-])"
s = ("9989S90K72MF-1\n"
"9989S90S-1\n"
"9989S75K60MF-1\n"
"9989S75S-1")
print(re.findall(regex, s, re.MULTILINE))
Output
['9989S90', '9989S90', '9989S75', '9989S75']

Using Regex to search for a string unless it finds another string first

Hello I'm trying to use regex to search through a markdown file for a date and only get a match if it finds an instance of a specific string before it finds another date.
This is what I have right now and it definitely doesn't work.
(\d{2}\/\d{2}\/\d{2})(string)?(^(\d{2}\/\d{2}\/\d{2}))
So in this instance It would throw a match since the string is before the next date:
01/20/20
string
01/21/20
Here it shouldn't match since the string is after the next date:
01/20/20
this isn't the phrase you're looking for
01/21/20
string
Any help on this would be greatly appreciated.

You could match a date like pattern. Then use a tempered greedy token approach (?:(?!\d{2}\/\d{2}\/\d{2}).)* to match string without matching another date first.
If you have matched the string, use a non greedy dot .*? to match the first occurrence of the next date.
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string.*?\d{2}\/\d{2}\/\d{2}
Regex demo | Python demo
For example (using re.DOTALL to make the dot match a newline)
import re
regex = r"\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}"
test_str = """01/20/20\n\n"
"string\n\n"
"01/21/20\n\n"
"01/20/20\n\n"
"this isn't the phrase you're looking for\n\n"
"01/21/20\n\n"
"string"""
print(re.findall(regex, test_str, re.DOTALL))
Output
['01/20/20\n\n"\n\t"string\n\n"\n\t"01/21/20']
If the string can not occur 2 times between the date, you might use
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}|string).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}
Regex demo
Note that if you don't want the string and the dates to be part of a larger word, you could add word boundaries \b

One approach here would be to use a tempered dot to ensure that the regex engine does not cross over the ending date while trying to find the string after the starting date. For example:
inp = """01/20/20
string # <-- this is matched
01/21/20
01/20/20
01/21/20
string""" # <-- this is not matched
matches = re.findall(r'01/20/20(?:(?!\b01/21/20\b).)*?(\bstring\b).*?\b01/21/20\b', inp, flags=re.DOTALL)
print(matches)
This prints string only once, that match being the first occurrence, which legitimately sits in between the starting and ending dates.

Why does this regex to find repeated characters fail?

I'm trying to build a regex to match any occurrence of two or more repeated alphanumeric characters. The following regex fails:
import re
s = '__commit__'
m = re.search(r'([a-zA-Z0-9])\1\1', s)
But when I change it to this it works:
m = re.search(r'([a-zA-A0-9])\1+', s)
I'm pretty baffled as to why this is the way it is. Can anyone provide some insight?

Look at this line.
m = re.search(r'([a-zA-Z0-9])\1\1', s)
You are using a pattern and two backreferences (A reference of already matched pattern). So, it will match only when minimum of three consecutive characters appear. You can do:
m = re.search(r'([a-zA-Z0-9])\1', s)
Which will match when minimum of two consecutive character appears.
However, the following one is much better.
m = re.search(r'([a-zA-A0-9])\1+', s)
That's because, now you are trying to match at least one or more backreferences \1+, that is minimum two consecutive characters.

The \1 is a back-reference to any of the previously matching groups. So the original regex that does not work for you essentially means :
Match alphanumeric strings that contain 3 occurences of the previously matchd group. In this case the previously matched group ([a-zA-Z0-9]) contains a single character a-z or A-Z or 0-9. You then have two '\1 in your regex which accounts for two back-references to the previously matched character.
In the second regex the back-reference \1 has a + in front of it which means match atleast one occurence of the previously captured character - which means that the string confirming to this pattern has to be atleast 2 characters in length.
Hope this helps.

python regex: get end digits from a string

I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.

You can use re.match to find only the characters:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes only as much as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of capturing digits. Note that the latter also matches digits in other writing schemes, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means multiple entries (at least one number at the end).
$ matches only the end of the input.

Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print re.findall('^.*-([0-9]+)$',s)
>>> ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anthing
- # Upto the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Upto the end of the string
Or more simply just match the digits followed at the end of the string '([0-9]+)$'

Your Regex should be (\d+)$.
\d+ is used to match digit (one or more)
$ is used to match at the end of string.
So, your code should be: -
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.

Use the below regex
\d+$
$ depicts the end of string..
\d is a digit
+ matches the preceding character 1 to many times

Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'

I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
if re.match('.*?([0-9]+)$', W)== None:
last_digits = "None"
else:
last_digits = re.match('.*?([0-9]+)$', W).group(1)
print("Last digits of "+W+" are "+last_digits)

Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract date from inside a string with Python - python

You can do: a = 'PR191030.213101.ABD' int(''.join([c for c in a if c.isdigit()][:6])) Output: 191030

This can also be done by: filename_without_ending.split(".")[0][2::] This splits the string from the 3rd letter to the end.

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

ignore first occurance of letter regex

Using Regex to search for a string unless it finds another string first

Why does this regex to find repeated characters fail?

python regex: get end digits from a string

Categories

Resources