Python: Search a string for a variable repeating characters


I'm trying to write a function that will search a string (all numeric, 0-9) for a variable sequence of 4 or more repeating characters.
Here are some example inputs:
"14888838": the function would return True because it found "8888".
"1111": the function would return True because it found "1111".
"1359": the function would return False because it didn't find 4 repeating characters in a row.
My first inclination is to use re, so I thought the pattern "[0-9]{4}" would work, but that returns True as long as it finds any four digits in a row, regardless of whether they match each other.
Anyway, thanks in advance for your help.
Dave

You may rely on capturing and backreferences:
if re.search(r'(\d)\1{3}', s):
    print(s)
Here, (\d) captures a digit into Group 1 and \1{3} matches 3 occurrences of the value captured that are immediately to the right of that digit.
A complete example:
import re
values = ["14888838", "1111", "1359"]
for s in values:
    if re.search(r'(\d)\1{3}', s):
        print(s)
Output:
14888838
1111
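Since the question asks for a variable run length, the same backreference idea can be parameterized. This is a minimal sketch; the helper name has_repeated_digit and its parameter n are assumptions for illustration and not part of the answer above.

```python
import re

def has_repeated_digit(s, n=4):
    """Return True if s contains the same digit at least n times in a row."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (\d) captures one digit; \1{n-1} requires n-1 more copies right after it.
    pattern = r'(\d)\1{%d}' % (n - 1)
    return re.search(pattern, s) is not None

for s in ["14888838", "1111", "1359"]:
    print(s, has_repeated_digit(s))
```

Because re.search scans the whole string, a longer run such as "88888" also counts as "4 or more".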

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop that matches the string containing a specific number, namely the number between the second occurrence of "_" and ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matched string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use a regular expression. Based on this answer, I tested the following expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and it also doesn't ensure that the exact number is matched, so if the number is 1, it might also select items with the number 10.
My end goal is to match the items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regular expression. I am looking for help with the expression.
EDIT: The output should be the whole path and not only the number.

Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to come after the 3rd occurrence of the _.
To get the matching path(s), you could put the number you want to find into the pattern as a variable. The general form, which matches any number, is:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 times: 1+ chars other than whitespace, _ or /, followed by _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif followed by any char except /
$ End of string
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
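An alternative sketch that avoids interpolating the number into the pattern: capture the digits in front of the first .tif and compare them as an integer. The function name paths_for_number and the simplified pattern are assumptions for illustration, not part of the answer above.

```python
import re

# Digits between an "_" and the first ".tif" of the file name.
NUMBER_RE = re.compile(r'_(\d+)\.tif\b[^/]*$')

def paths_for_number(paths, number):
    """Return every path whose trailing number equals `number` exactly."""
    result = []
    for path in paths:
        match = NUMBER_RE.search(path)
        # Comparing integers means 1 never matches 10 or 21.
        if match and int(match.group(1)) == number:
            result.append(path)
    return result

list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.png.tif",
]
print(paths_for_number(list_paths, 1))
```

Note that the integer comparison treats a zero-padded "01" the same as 1, which may or may not be what you want.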

Here is something that works and is just as ugly as regex, which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
        "imgs/foldeer/img_ABC_21389_1.tif.tif",
        "imgs/foldeer/img_ABC_15431_10.tif.tif",
        "imgs/foldeer/img_GHC_561321_2.tif.tif",
        "imgs_foldeer/img_BCL_871125_21.tif.tif"]
numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
The first split (from the right, on the last "_") leaves the number plus ".tif.tif" as the last element, e.g. "1.tif.tif"
extract the last element
split again, by the dot this time, take the first element (that's your number as a string), and cast it to int
Splitting from the right matters because a directory name such as imgs_foldeer contains an extra underscore, which would shift the result of split("_", 3).
Please keep in mind that this only works for the format you provided; there is no flexibility at all in this solution (on the other hand, regex doesn't give you much either).
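To get the whole paths rather than just the numbers, as the question's edit asks, the split idea can be wrapped in a filter. This is only a sketch, assuming every path ends in _<number>.tif.tif; trailing_number and wanted are hypothetical names.

```python
def trailing_number(path):
    """Number between the last "_" and the first "." that follows it."""
    tail = path.rsplit("_", 1)[-1]      # e.g. "10.tif.tif"
    return int(tail.split(".", 1)[0])   # e.g. 10

data = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]
wanted = 10
matches = [p for p in data if trailing_number(p) == wanted]
print(matches)
```

A path that does not follow the assumed format makes int() raise ValueError, so this version fails loudly instead of silently skipping the entry.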

Mostly without regex, if allowed (slicing finds the tail; a small regex only picks out the digits).
import re
s = 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last = s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Output:
10

[0-9]+(?=\.tif\.tif)
This regular expression uses a lookahead to match the last run of digits, the one directly before .tif.tif (what you're looking for). The + requires at least one digit, so the match is never empty.
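A short sketch of how a lookahead pattern like this might be used from Python. It uses + (one or more digits) so that the match can never be empty; the variable names are illustrative only.

```python
import re

path = "imgs/foldeer/img_ABC_15431_10.tif.tif"

# (?=\.tif\.tif) checks that ".tif.tif" follows without consuming it,
# so match.group() contains only the digits.
match = re.search(r"[0-9]+(?=\.tif\.tif)", path)
number = int(match.group()) if match else None
print(number)
```

This extracts the number; to select whole paths you would still compare the number against the one you are looking for.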

Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

Extract date from inside a string with Python

I have the following string; the leading letters can differ, and there can be two, three or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date. My current attempt:
filename_without_ending.split(".")[0][-6:]
Sometimes the name looks like this:
PZA191030_392001_USB
so this solution is not valid, since the separator might also differ from time to time. The only REAL pattern is the first six digits.
How do I do this?
Thank you!

You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
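The question also asks for a conversion to a valid date and mentions names without a dot after the digits. The sketch below drops the trailing \. from the pattern and parses the digits with datetime.strptime. The function name extract_date and the YYMMDD interpretation of the digits are assumptions, since the question does not state the layout.

```python
import re
from datetime import datetime

def extract_date(name):
    """Parse the first six digits after the leading letters as YYMMDD."""
    match = re.match(r"[A-Z]{2,4}(\d{6})", name)
    if match is None:
        return None
    # %y%m%d is an assumption about the digit layout (year, month, day).
    return datetime.strptime(match.group(1), "%y%m%d").date()

print(extract_date("PR191030.213101.ABD"))
print(extract_date("PZA191030_392001_USB"))
```

If the six digits do not form a real date under that layout, strptime raises ValueError, which is one way to detect an invalid name.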

You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030

This can also be done by:
filename_without_ending.split(".")[0][2:]
This takes the part before the first dot and slices it from the 3rd character to the end, so it only works when the prefix is exactly two letters.

Since the leading letters can differ, we ignore the letters and extract the digits.
Using the re module (regular expressions), apply a pattern to the string to get the matching part.
\d matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in the string, while search() finds only the first occurrence.
import re
s = "PR191030.213101.ABD"
print(re.findall(r"\d+", s)[0])
print(re.search(r"\d+", s).group())

Check for very specified numbers padding

I am trying to check a list of items in my scene to see whether their names end with a 3-digit (version) padding, e.g. test_model_001. Items that do will pass, and items that do not meet the condition will be processed by a certain function.
Suppose if my list of items is as follows:
test_model_01
test_romeo_005
test_charlie_rig
I tried and used the following code:
import re
eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']
for item in eg_list:
    mo = re.sub('.*?([0-9]*)$', r'\1', item)
    print(mo)
It returns 01 and 005, whereas I was hoping it would return just 005. How do I check that the name contains exactly 3 digits of padding? Also, is it possible to include the underscore in the check? Is this the best way?

You can use {3} to ask for exactly 3 consecutive digits and prepend an underscore:
import re
eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']
for item in eg_list:
    match = re.search(r'_([0-9]{3})$', item)
    if match:
        print(match.group(1))
This would print 005 only.

The asterisk after [0-9] means that you accept any number of occurrences of the digits 0-9, including zero. Technically this expression matches test_charlie_rig as well. You can test that out here http://pythex.org/
Replacing the asterisk with {3} says that you want 3 digits.
.*?([0-9]{3})$
If you know your format will be close to the examples you showed, you can be a bit more explicit with the regex pattern to prevent even more accidental matches
^.+_(\d{3})$

for item in eg_list:
    if re.match(r".*_\d{3}$", item):
        print(item.split('_')[-1])
This matches anything which ends in:
_ an underscore, \d a digit, {3} three of them, and $ the end of the line.
To print the item, we split it on _ underscores and take the last value, index [-1].
The reason .*?([0-9]*)$ doesn't work is that [0-9]* matches 0 or more times, so it can match nothing. The pattern is then equivalent to .*?$, which matches any string.
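The difference between the two quantifiers can be checked directly. A small sketch using the list from the question; the names loose and strict are illustrative only.

```python
import re

eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']

# [0-9]* may match zero digits, so this pattern matches every name.
loose = [item for item in eg_list if re.search(r'.*?([0-9]*)$', item)]

# _\d{3}$ requires an underscore followed by exactly three digits at the end.
strict = [item for item in eg_list if re.search(r'_\d{3}$', item)]

print(loose)
print(strict)
```

Only the strict pattern separates the correctly padded name from the others.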

I usually don't like regex unless needed. This should work and be more readable.
def name_validator(name, padding_count=3):
    number = name.split("_")[-1]
    if number.isdigit() and number == number.zfill(padding_count):
        return True
    return False
name_validator("test_model_01") # Returns False
name_validator("test_romeo_005") # Returns True
name_validator("test_charlie_rig") # Returns False

How to return regular expression match as one entire string?

I want to match phone numbers, and return the entire phone number but only the digits. Here's an example:
(555)-555-5555
555.555.5555
But I want to use regular expressions to return only:
5555555555
But, for some reason I can't get the digits to be returned:
import re
phone_number='(555)-555-5555'
regex = re.compile('[0-9]')
r = regex.search(phone_number)
regex.match(phone_number)
print(r.groups())
But it just prints an empty tuple. What obvious thing am I missing here? Thanks.

You're getting empty result because you don't have any capturing groups, refer to the documentation for details.
If you change it to group() instead, you'll get the first digit as a match. But this is not what you want: the pattern [0-9] matches a single digit, and search() returns only the first match.
You can simply remove all non-numeric characters:
re.sub('[^0-9]', '', '(555)-555-5555')
The range 0-9 is negated, so the regex matches anything that's not a digit, then it replaces it with the empty string.
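Applied to both formats from the question, the substitution might look like the sketch below. The helper name digits_only is an assumption for illustration.

```python
import re

def digits_only(text):
    """Remove every character that is not an ASCII digit 0-9."""
    return re.sub(r'[^0-9]', '', text)

for phone_number in ['(555)-555-5555', '555.555.5555']:
    print(digits_only(phone_number))
```

Both inputs reduce to the same ten-digit string, which makes the result easy to compare or store.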

You can do it without a regular expression using str.join and str.isdigit:
s = "(555)-555-5555"
print("".join([ch for ch in s if ch.isdigit()]))
5555555555
If you printed r.group() you would get some output, but search is not the right way to find all the matches: it returns only the first match, and since the pattern matches a single digit, that is 5. Even with '[0-9]+' to match one or more digits, you would still only get the first group of consecutive digits, i.e. 555 in the string above. Using "".join(regex.findall(s)) would collect all the digits, but that can obviously be done with str.isdigit as shown.
If you know the potential non-digit chars, then str.translate would be the best approach (in Python 3 the characters to delete are passed through str.maketrans):
s = "(555)-555-5555"
print(s.translate(str.maketrans("", "", "()-.")))
5555555555

The simplest way is here:
>>> import re
>>> s = "(555)-555-5555"
>>> x = re.sub(r"\D+", r"", s)
>>> x
'5555555555'

Find 1 letter and 2 numbers using RegEx

I have been writing a program recently and a part of it requires me to get information from inside a string. I need to find where there is 1 letter immediately followed by 2 numbers (e.g. S07) and I can't work out the RegEx for it.
def get_season(filenames):
    pattern = "^[a-zA-z]{1}[\d]{2}$"
    found = re.search(filenames[0], pattern)
    season_name = found.string
    season = season_name[1:3]
    print(season)
I know that this information is in the string but it keeps giving me "None" in response
(I'm not too sure if the code section has formatted correctly, in the preview it shows as on the same line, but the indentation in my program is correct)

You swapped the arguments to re.search(). The first argument is the pattern, not the string to match:
found = re.search(pattern, filenames[0])
Your pattern is also overly wide; A-z matches everything between Z (uppercase) and a (lowercase) too. The correct pattern is:
pattern = r"^[a-zA-Z]\d{2}$"
where {1} is the default, so I omitted that.
If you are matching this against filenames, you probably do not want to use the start or end anchors, that would limit matches to exact strings only:
>>> re.search(r"^[a-zA-Z]\d{2}$", "S07").string
'S07'
>>> re.search(r"^[a-zA-Z]\d{2}$", "S07E01 - Meet the New Boss.avi") is None
True
>>> re.search(r"[a-zA-Z]\d{2}", "S07E01 - Meet the New Boss.avi").string
'S07E01 - Meet the New Boss.avi'
And you want to use .group() to get the matched portion, not string (which is the original input string):
>>> re.search(r"[a-zA-Z]\d{2}", "S07E01 - Meet the New Boss.avi").group()
'S07'
If you only wanted the numbers, you need to add a group, and pick that. You create a capturing group with parentheses:
>>> re.search(r"[a-zA-Z](\d{2})", "S07E01 - Meet the New Boss.avi").group(1)
'07'
This selects the first group (.group(1)), which is formed by the parentheses around the 2-digit portion.
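Putting these corrections together, the asker's function might end up looking like the sketch below. Returning the digits (or None when nothing matches) instead of printing them is an assumption about what the caller needs.

```python
import re

def get_season(filenames):
    """Return the two season digits from the first filename, or None."""
    # Pattern first, string second; no ^ or $ anchors, so the pattern
    # can match anywhere in the filename.
    found = re.search(r"[a-zA-Z](\d{2})", filenames[0])
    if found is None:
        return None
    return found.group(1)

print(get_season(["S07E01 - Meet the New Boss.avi"]))
```

Checking for None before calling .group() avoids an AttributeError on filenames that contain no season marker.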

Your regex will only match a string that consists of exactly one letter and two digits. To search the whole string for occurrences, use this regex:
[a-zA-Z]\d{2}
INPUT
asdasdasS01asfasfsa
OUTPUT
S01
If you want to find a word which consists only of a letter followed by two digits, use this regex:
\b[a-zA-Z]\d{2}\b
Regex that captures only the numbers:
[a-zA-Z](\d{2})
INPUT
asdasdasS01asfasfsa
OUTPUT
01
Also swap the arguments in the search method.
