Check for very specified numbers padding - python

I am trying to check a list of items in my scene to see whether their names end in a 3-digit, zero-padded version number, e.g. test_model_001. Items that do pass the check, and items that do not should be handled by a certain function.
Suppose if my list of items is as follows:
test_model_01
test_romeo_005
test_charlie_rig
I tried the following code:
import re
eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']
for item in eg_list:
    mo = re.sub('.*?([0-9]*)$', r'\1', item)
    print(mo)
It returns 01 and 005, whereas I was hoping for just 005. How do I make it check for exactly 3 digits? Also, is it possible to include the underscore in the check? Is this the best way?

You can use the {3} to ask for 3 consecutive digits only and prepend underscore:
eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']
for item in eg_list:
    match = re.search(r'_([0-9]{3})$', item)
    if match:
        print(match.group(1))
This would print 005 only.
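Since the question is really about acting on the items that fail the check, the same pattern can be used to split the list in two. This is only a minimal sketch; the helper name split_by_padding is made up for illustration:

```python
import re

# Names pass when they end in "_" followed by exactly three digits.
PADDED = re.compile(r'_([0-9]{3})$')

def split_by_padding(names):
    """Hypothetical helper: return (passing, failing) lists."""
    passing, failing = [], []
    for name in names:
        (passing if PADDED.search(name) else failing).append(name)
    return passing, failing

eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']
passing, failing = split_by_padding(eg_list)
print(passing)  # ['test_romeo_005']
print(failing)  # ['test_model_01', 'test_charlie_rig']
```

The failing list is then what you would hand to your own function.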

The asterisk after [0-9] means zero or more occurrences of the digits 0-9, so technically this expression matches test_charlie_rig as well (with an empty group). You can test that out here http://pythex.org/
Replacing the asterisk with a {3} says that you want 3 digits.
.*?([0-9]{3})$
If you know your format will be close to the examples you showed, you can be a bit more explicit with the regex pattern to prevent even more accidental matches
^.+_(\d{3})$
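To see what the more explicit pattern buys you, here is a small comparison of the two patterns. The extra names test_model_0001 and shot12345 are invented for the demonstration:

```python
import re

loose = re.compile(r'.*?([0-9]{3})$')   # any name whose last three characters are digits
strict = re.compile(r'^.+_(\d{3})$')    # underscore, then exactly three digits, then end

for name in ['test_romeo_005', 'test_model_0001', 'shot12345']:
    print(name, bool(loose.match(name)), bool(strict.match(name)))
# test_romeo_005 True True
# test_model_0001 True False
# shot12345 True False
```

The loose pattern happily takes the last three digits of a longer number, which is probably not what you want for a padding check.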

for item in eg_list:
    if re.match(r".*_\d{3}$", item):
        print(item.split('_')[-1])
This matches anything which ends in:
_ an underscore, \d a digit, {3} three of them, and $ the end of the line.
To print the item, we split it on _ underscores and take the last value, index [-1].
The reason .*?([0-9]*)$ doesn't work is because [0-9]* matches 0 or more times, so it can match nothing. This means it will also match .*?$, which will match any string.
See the example on regex101.com
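The empty match is easy to confirm in the interpreter. A quick check, using the pattern from the question as-is:

```python
import re

pattern = re.compile(r'.*?([0-9]*)$')
for item in ['test_model_01', 'test_romeo_005', 'test_charlie_rig']:
    m = pattern.match(item)
    # group(1) is '' when the name has no trailing digits
    print(repr(item), '->', repr(m.group(1)))
# 'test_model_01' -> '01'
# 'test_romeo_005' -> '005'
# 'test_charlie_rig' -> ''
```

Every name matches, so the match itself tells you nothing about the padding.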

I usually don't like regex unless needed. This should work and be more readable.
def name_validator(name, padding_count=3):
    number = name.split("_")[-1]
    if number.isdigit() and number == number.zfill(padding_count):
        return True
    return False
name_validator("test_model_01") # Returns False
name_validator("test_romeo_005") # Returns True
name_validator("test_charlie_rig") # Returns False
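One caveat: zfill only pads, so a longer suffix such as 0005 also passes the check above, and so does a bare "005" with no underscore. If exactly padding_count digits after an underscore are required, comparing the length is one option. A sketch, with has_exact_padding being a made-up name:

```python
def has_exact_padding(name, padding_count=3):
    """True if the text after the last underscore is exactly padding_count digits."""
    head, sep, number = name.rpartition("_")
    # sep is empty when the name contains no underscore at all
    return bool(sep) and len(number) == padding_count and number.isdigit()

print(has_exact_padding("test_romeo_005"))   # True
print(has_exact_padding("test_model_0005"))  # False
print(has_exact_padding("test_model_01"))    # False
```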

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list:
list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
    ...
]
I want to run a for loop that matches the string with a specific number, namely the number between the second occurrence of "_" and ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matched string is "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use a regular expression. Based on this answer, I tested this one on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and it also doesn't ensure an exact number: if the number is 1, it might also select items with number 10.
My end goal is to match the items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif". I am looking for help with the regular expression.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _
What you could do to get the path(s) is to make the number a variable for the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
^ Start of string
\S*? Match optional non-whitespace characters, as few as possible
/ Match / literally
(?:[^\s_/]+_){3} Repeat 3 times: 1+ characters other than whitespace, _ or /, followed by _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif, then a word boundary, then optional characters other than whitespace or /
$ End of string
See a regex demo and a Python demo.
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
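If you need this for several numbers in a loop, the pattern can be wrapped in a small function. This is just a sketch built on the pattern above; the name paths_for_number is made up, and int() is an extra guard I added so that only a plain integer ends up in the pattern:

```python
import re

def paths_for_number(paths, number):
    """Hypothetical helper: paths whose third "_" is followed by `number` and ".tif"."""
    pattern = re.compile(
        rf"^\S*?/(?:[^\s_/]+_){{3}}{int(number)}\.tif\b[^\s/]*$"
    )
    return [p for p in paths if pattern.search(p)]

list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]
print(paths_for_number(list_paths, 1))   # ['imgs/foldeer/img_ABC_21389_1.tif.tif']
print(paths_for_number(list_paths, 21))  # ['imgs_foldeer/img_BCL_871125_21.tif.tif']
```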
I'll show you something that works and is just as ugly as regex, which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
The first split (from the right, on the last underscore) gives e.g. ["imgs/foldeer/img_ABC_21389", "1.tif.tif"]. Splitting from the right matters because a directory name such as imgs_foldeer contains an underscore too.
Take the last element.
Split again, by the dot this time, and take the first element (that's your number as a string), then cast it to int.
Please keep in mind it will only work for the format you provided; there is no flexibility at all in this solution (on the other hand, regex doesn't give much either).
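Since the question's edit asks for the whole path rather than the number, the same splitting idea can drive a filter. A rough sketch, assuming every path ends in _<number>.<something>; trailing_number is a made-up name:

```python
def trailing_number(path):
    """Number between the last "_" and the first "." after it."""
    return int(path.rsplit("_", 1)[-1].split(".")[0])

paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]
wanted = 1
matches = [p for p in paths if trailing_number(p) == wanted]
print(matches)  # ['imgs/foldeer/img_ABC_21389_1.tif.tif']
```

Comparing integers also avoids the 1 versus 10 confusion from the question.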
Mostly without regex, if that is allowed:
import re
s = 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last = s[s.rindex('_') + 1:]
print(re.findall(r'\d+', last)[0])
Gives:
10
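[0-9]+(?=\.tif\.tif)
This regular expression uses a lookahead to match the last run of digits, the one directly before .tif.tif (what you're looking for). The + requires at least one digit, so it cannot match an empty string.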
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

python regular expression for "_n1_n2_n1_n3_n1_n1_n2"

Let's say I have the following string,
My ID is _n1_n2_n1_n3_n1_n1_n2 ,
I want to extract _n1_n2_n1_n3_n1_n1_n2. We only need to consider a word in which _n occurs between 5 and 10 times; each _n is followed by a digit between 0 and 9.
import re
text = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
match = re.search(r'_n\d{0,9}', text)
if match:
    print('found', match.group())
else:
    print('did not find')
I was able to extract _n1 with _n\d{0,9} but was unable to extend it further. Can anyone help me extend it in Python?
You need a regex that matches _n\d 7 times in a row: '(_n\d){7}'
match = re.search(r'(_n\d){7}', value)
(_n\d){4,8} for a range of repetitions
(_n\d)+ for any number of repetitions (at least one)
I'm not sure if this is what you want but how about:
(_n\d)+
Explanation:
(..) signifies a group
+ means we want the group to repeat 1 or more times
_n\d means we want to have _n followed by a number
To extract the complete match, we can use regex group 0 which refers to the full match:
import re
test_str = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
match = re.search(r'(_n\d)+', test_str)
print(match.group(0))
Will output: _n1_n2_n1_n3_n1_n1_n2
In Regex, {0,9} is not a number between 0 and 9, it's an amount of occurrences for the term that is in front of that, which can be a single character or a group (in parentheses).
If you want single digits from 0 to 9, that is [0-9], which is almost identical to \d (in Python 3, \d on str patterns also matches other Unicode decimal digits unless re.ASCII is used).
So, what you need is either
(_n[0-9])+
or
(_n\d)+
(online), where + is the number of occurrences from 1 to infinity.
From the comment
@KellyBundy I mean _n occurs 5-10 times, sorry for wrong phrasing the question.
you can further restrict + to be
(_n\d){5,10}
(online)
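Note that (_n\d){5,10} on its own also matches inside a longer run: with twelve repetitions it simply returns the first ten. If the count should apply to the whole word, as the question describes, one option is to add lookarounds. A sketch, assuming words are separated by whitespace; find_id is a made-up name:

```python
import re

# (?<!\S) and (?!\S) require whitespace or the string edge on both sides
word_of_ids = re.compile(r'(?<!\S)(?:_n\d){5,10}(?!\S)')

def find_id(text):
    m = word_of_ids.search(text)
    return m.group(0) if m else None

print(find_id('My ID is _n1_n2_n1_n3_n1_n1_n2'))  # _n1_n2_n1_n3_n1_n1_n2
print(find_id('My ID is _n1_n2'))                 # None (only 2 repetitions)
print(find_id('My ID is ' + '_n1' * 12))          # None (12 repetitions)
```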
As per the comment
how about extracting _n1 _n2 _n1 _n4 _n1 _n1 ?
you would construct the Regex for an individual part only and use findall() like so:
import re
text = 'My ID is _n1_n2_n1_n3_n1_n1_n2'
matches = re.findall(r'_n\d', text)
if matches:
    print('found', matches)
else:
    print('did not find')
but if you're not so comfortable with regex, you could also try much simpler string operations, e.g.
result = text.split("_n")
print(result[1:])

Python: Search a string for a variable repeating characters

I'm trying to write a function that will search a string (all numeric, 0-9) for a variable sequence of 4 or more repeating characters.
Here are some example inputs:
"14888838": the function would return True because it found "8888".
"1111": the function would return True because it found "1111".
"1359": the function would return False because it didn't find 4 repeating characters in a row.
My first inclination is to use re, so I thought the pattern :"[0-9]{4}" would work but that returns true as long as it finds any four numerics in a row, regardless of whether they are matching or not.
Anyway, thanks in advance for your help.
Dave
You may rely on capturing and backreferences:
if re.search(r'(\d)\1{3}', s):
    print(s)
Here, (\d) captures a digit into Group 1 and \1{3} matches 3 occurrences of the value captured that are immediately to the right of that digit.
See the regex demo and a Python demo
import re
values = ["14888838", "1111", "1359"]
for s in values:
    if re.search(r'(\d)\1{3}', s):
        print(s)
Output:
14888838
1111
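Since the question asks for a function with a variable run length, the backreference pattern can be built from a parameter. Below is a sketch of that, plus a regex-free variant using itertools.groupby for comparison; both function names are made up, and the groupby version looks at runs of any character, which is fine for an all-numeric string:

```python
import re
from itertools import groupby

def has_repeat_regex(s, n=4):
    """True if some digit occurs n or more times in a row (regex version)."""
    return re.search(r'(\d)\1{%d}' % (n - 1), s) is not None

def has_repeat_groupby(s, n=4):
    """Same check without regex: measure each run of equal characters."""
    return any(len(list(run)) >= n for _, run in groupby(s))

for s in ["14888838", "1111", "1359"]:
    print(s, has_repeat_regex(s), has_repeat_groupby(s))
# 14888838 True True
# 1111 True True
# 1359 False False
```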

Searching for multiple substrings of unknown size in string in python

I've seen lots of RE stuff in python but nothing for the exact case and I can't seem to get it. I have a list of files with names that look like this:
summary_Cells_a_01_2_1_45000_it_1.txt
summary_Cells_a_01_2_1_40000_it_2.txt
summary_Cells_bb_01_2_1_36000_it_3.txt
The "summary_Cells_" is always present. Then there is a string of letters, either 1, 2 or 3 long. Then there is "_01_2_1_" always. Then there is a number between 400 and 45000. Then there is "it" and then a number from 0-9, then ".txt"
I need to extract the letter(s) piece.
I was trying:
match = re.search('summary_Cells_(\w)_01_2_1_(\w)_it_(\w).txt', filename)
but was not getting anything for the match. I'm trying to get just the letters, but later might want the it number (last number) or the step (the middle number).
Any ideas?
Thanks
You're missing repetitions, i.e.:
re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt', filename)
\w will only match a single character
\w+ will match at least one
\w* will match any amount (0 or more)
Reference: Regular expression syntax
You were almost there; all you need to do is add a repetition inside each capture group:
summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt
Example usage
>>> filename="summary_Cells_a_01_2_1_45000_it_1.txt"
>>> match = re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt', filename)
>>> match.group()
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(0)
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(1)
'a'
>>> match.group(2)
'45000'
>>> match.group(3)
'1'
Note
match.group(n) returns the value captured by the nth capture group.
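If you later want the step and the iteration number as well, named groups keep things readable. This is only a sketch based on the format described in the question (1 to 3 letters, a numeric step, a single-digit iteration); the group names and parse_name are my own invention:

```python
import re

FILENAME = re.compile(
    r'summary_Cells_(?P<letters>[A-Za-z]{1,3})_01_2_1_(?P<step>\d+)_it_(?P<it>\d)\.txt$'
)

def parse_name(filename):
    """Return (letters, step, iteration) or None if the name does not fit."""
    m = FILENAME.match(filename)
    if m is None:
        return None
    return m.group('letters'), int(m.group('step')), int(m.group('it'))

print(parse_name("summary_Cells_bb_01_2_1_36000_it_3.txt"))  # ('bb', 36000, 3)
print(parse_name("summary_Cells_a_01_2_1_45000_it_1.txt"))   # ('a', 45000, 1)
```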
You don't need a regex, there is nothing complex about the pattern and it does not change:
s = "summary_Cells_a_01_2_1_45000_it_1.txt"
print(s.split("_")[2])
a
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
print(s.split("_")[2])
bb
If you want both sets of letters:
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
spl = s.split("_")
a, b = spl[2], spl[7]
print(a, b)
bb it
Since you only want to capture the letters at the beginning, you could do:
re.search(r'summary_Cells_(\w+)_01_2_1_[0-9]{3,6}_it_[0-9]\.txt', filename)
This doesn't bother capturing the groups you don't need.
[0-9] matches a single digit and [0-9]{3,6} allows for 3 to 6 digits.
You're on the right track with your regex, but as everyone else forgets, \w includes alphanumerics and the underscore, so you should use [a-z] instead.
re.search(r"summary_Cells_([a-z]+)_\w+\.txt", filename)
Or, as Padraic mentioned, you can just use str.split("_").

Check String for / against Characters in Python

I need to be able to tell the difference between a string that can contain letters and numbers, and a string that can contain numbers, colons and hyphens.
>>> def checkString(s):
...     pattern = r'[-:0-9]'
...     if re.search(pattern, s):
...         print("Matches pattern.")
...     else:
...         print("Does not match pattern.")
# 3 numbers separated by colons: 12, 24 and minus 14
>>> s1 = "12:24:-14"
# String containing letters and string containing letters/numbers.
>>> s2 = "hello"
>>> s3 = "hello2"
When I run the checkString method on each of the above strings:
>>> checkString(s1)
Matches pattern.
>>> checkString(s2)
Does not match pattern.
>>> checkString(s3)
Matches pattern.
s3 is the only one that doesn't do what I want. I'd like to be able to create a regex that allows numbers, colons and hyphens, but excludes EVERYTHING else (or just alphabetical characters). Can anyone point me in the right direction?
EDIT:
Therefore, I need a regex that would accept:
229 // number
187:657 //two numbers
187:678:-765 // two pos and 1 neg numbers
and decline:
Car //characters
Car2 //characters and numbers
you need to match the whole string, not a single character as you do at the moment:
>>> re.search('^[-:0-9]+$', "12:24:-14")
<_sre.SRE_Match object at 0x01013758>
>>> re.search('^[-:0-9]+$', "hello")
>>> re.search('^[-:0-9]+$', "hello2")
To explain regex:
within square brackets (character class): match digits 0 to 9, hyphen and colon, only once.
+ is a quantifier, that indicates that preceding expression should be matched as many times as possible but at least once.
^ and $ match start and end of the string. For one-line strings they're equivalent to \A and \Z.
This way you restrict the content of the whole string to be at least one character long and to contain any permutation of characters from the character class. What you were doing beforehand was searching for a single character from the character class within the subject string. This is why s3, which contains a digit, matched.
SilentGhost's answer is pretty good, but take note that it would also match strings like "---::::" with no digits at all.
I think you're looking for something like this:
'^(-?\d+:)*-?\d+$'
^ Matches the beginning of the line.
(-?\d+:)* Possible - sign, at least one digit, a colon. That whole pattern 0 or many times.
-?\d+ Then the pattern again, at least once, without the colon
$ The end of the line
This will better match the strings you describe.
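A quick way to convince yourself is to run the pattern over the examples from the question, plus the "---::::" case. A small sketch; is_number_list is a made-up name:

```python
import re

NUMBERS = re.compile(r'^(-?\d+:)*-?\d+$')

def is_number_list(s):
    """True for colon-separated integers, each with an optional leading minus."""
    return NUMBERS.match(s) is not None

for s in ['229', '187:657', '187:678:-765', 'Car', 'Car2', '---::::']:
    print(s, is_number_list(s))
# 229 True
# 187:657 True
# 187:678:-765 True
# Car False
# Car2 False
# ---:::: False
```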
pattern = r'\A([^-:0-9]+|[A-Za-z0-9])\Z'
Your regular expression is almost fine; you just need to make it match the whole string. Also, as a commenter pointed out, you don't really need a raw string (the r prefix on the string) in this case. Voila:
def checkString(s):
    if re.match('[-:0-9]+$', s):
        print("Matches pattern.")
    else:
        print("Does not match pattern.")
The '+' means "match one or more of the previous expression". (With '+', checkString reports no match for an empty string. If you want an empty string to match, change the '+' to a '*'.) The '$' means "match the end of the string".
re.match means "the string must match the regular expression starting at the first character"; re.search means "the regular expression can match a sequence anywhere inside the string".
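The difference is easiest to see side by side. A small demonstration, with re.fullmatch (Python 3.4+) thrown in as a third option that anchors both ends without needing '$':

```python
import re

pattern = re.compile(r'[-:0-9]+')

print(bool(pattern.search('hello2')))         # True: a digit occurs somewhere
print(bool(pattern.match('hello2')))          # False: the string does not start with one
print(bool(pattern.match('2hello')))          # True: match only anchors the start
print(bool(pattern.fullmatch('12:24:-14')))   # True: the whole string fits
print(bool(pattern.fullmatch('12:24:-14x')))  # False: trailing 'x' is not allowed
```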
Also, if you like premature optimization--and who doesn't!--note that re.match has to look the pattern up on every call (the re module caches recently compiled patterns, so the saving is small). This version compiles the regular expression explicitly, only once:
__checkString_re = re.compile('[-:0-9]+$')
def checkString(s):
    if __checkString_re.match(s):
        print("Matches pattern.")
    else:
        print("Does not match pattern.")
