Regular expressions in Python for searching files

I want to know how to get files whose names match this form:
recording_i.file_extension
For example:
recording_1.mp4
recording_112.mp4
recording_11.mov
I have a regular expression:
(recording_\d*)(\..*)
My regular expression doesn't work as I want.
These file names do not fit my format: lalala_recording_1.mp4, recording_.mp4
My regex nevertheless matches them, whereas my code should return [] for these examples.
Can you fix my regular expression, please?
Thanks.

Use
(^recording_\d+)(\.\w{3}$)
Test
import re
s = """recording_1.mp4
recording_112.mp4
recording_11.mov
lalala_recording_1.mp4,
recording_.mp4"""
pattern = re.compile(r"(^recording_\d+)(\.\w{3}$)")
for l in s.split():
    if pattern.match(l):
        print(l)
Output (only the desired files)
recording_1.mp4
recording_112.mp4
recording_11.mov
Explanation
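With r"(^recording_\d+)(\.\w{3}$)":
- \d+ requires at least one digit (\d* also accepts none, which is why recording_.mp4 matched)
- \w{3} matches a three-character suffix (letters, digits, or underscore)
- ^ ensures the name starts with recording
- $ ensures the name ends right after the suffix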
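The same checks can be applied directly to a list of file names. The following is a minimal sketch, assuming the names are already available as strings. The helper name matching_recordings and the use of \w+ for the extension (any length rather than exactly three characters) are illustrative assumptions, not part of the answer above. It relies on re.fullmatch, which requires the whole name to match and so makes ^ and $ unnecessary.

```python
import re

# Assumption: the extension may be any run of word characters, not just three.
RECORDING = re.compile(r"recording_\d+\.\w+")

def matching_recordings(names):
    """Return only the names of the form recording_<digits>.<extension>."""
    return [name for name in names if RECORDING.fullmatch(name)]

names = [
    "recording_1.mp4",
    "recording_112.mp4",
    "recording_11.mov",
    "lalala_recording_1.mp4",
    "recording_.mp4",
]
print(matching_recordings(names))
```

If exactly three characters are required, replacing \w+ with \w{3} restores the behaviour of the original pattern.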
Particular Suffixes
import re
# List of suffixes to match
suffixes_list = ['mp4', 'mov']
suffixes = '|'.join(suffixes_list)
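# Use suffixes in pattern (rather than accepting
# any 3 letter word). The alternation is wrapped in (?:...)
# so that \. and $ apply to every suffix, not just the first/last.
pattern = re.compile(fr"(^recording_\d+)(\.(?:{suffixes})$)")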
Test
s = """recording_1.mp4
recording_112.mp4
recording_11.mov
lalala_recording_1.mp4,
recording_.mp4
dummy1.exe
dummy2.pdf
dummy3.exe"""
for l in s.split():
    if pattern.match(l):
        print(l)
Output
recording_1.mp4
recording_112.mp4
recording_11.mov

Related

Filtering a list of strings using regex

I have a list of strings that looks like this,
strlist = [
'list/category/22',
'list/category/22561',
'list/category/3361b',
'list/category/22?=1512',
'list/category/216?=591jf1!',
'list/other/1671',
'list/1y9jj9/1yj32y',
'list/category/91121/91251',
'list/category/0027',
]
I want to use a regex to find the strings in this list that consist of list/category/ followed by an integer of any length and nothing else: no letters or symbols may come after it.
So in my example, the output should look like this
list/category/22
list/category/22561
list/category/0027
I used the following code:
newlist = []
for i in strlist:
    if re.match('list/category/[0-9]+[0-9]',i):
        newlist.append(i)
        print(i)
but this is my output:
list/category/22
list/category/22561
list/category/3361b
list/category/22?=1512
list/category/216?=591jf1!
list/category/91121/91251
list/category/0027
How do I fix my regex? And also is there a way to do this in one line using a filter or match command instead of a for loop?
You can try the below regex:
^list\/category\/\d+$
Explanation of the above regex:
^ - Represents the start of the given test String.
\d+ - Matches digits that occur one or more times.
$ - Matches the end of the test string. This is the part your regex missed.
IMPLEMENTATION IN PYTHON
import re
pattern = re.compile(r"^list\/category\/\d+$", re.MULTILINE)
match = pattern.findall("list/category/22\n"
"list/category/22561\n"
"list/category/3361b\n"
"list/category/22?=1512\n"
"list/category/216?=591jf1!\n"
"list/other/1671\n"
"list/1y9jj9/1yj32y\n"
"list/category/91121/91251\n"
"list/category/0027")
print (match)
You can find the sample run of the above implementation here.
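The question also asks for a one-line version without an explicit for loop. A possible sketch, assuming the strlist from the question, is to pass the compiled pattern's fullmatch method to filter. Since fullmatch already requires the entire string to match, the ^ and $ anchors are left out here.

```python
import re

strlist = [
    'list/category/22',
    'list/category/22561',
    'list/category/3361b',
    'list/category/22?=1512',
    'list/category/216?=591jf1!',
    'list/other/1671',
    'list/1y9jj9/1yj32y',
    'list/category/91121/91251',
    'list/category/0027',
]

pattern = re.compile(r"list/category/\d+")
# filter keeps the items for which fullmatch returns a match object (truthy).
newlist = list(filter(pattern.fullmatch, strlist))
print(newlist)
```

A list comprehension, [s for s in strlist if pattern.fullmatch(s)], is an equivalent alternative.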

Python regex that matches any word that contains exactly n digits, but can contain other characters too

e.g. if n=10, then the regex:
Should match:
(123)456-7890
(123)456-(7890)
a1b2c3ddd4e5ff6g7h8i9jj0k
But should not match:
(123)456-789
(123)456-(78901)
etc.
Note: I'm strictly looking for a regex and that is a hard constraint.
======================================
Edit: Other constraints
I am looking for a solution of the form:
regex = re.compile(r'?????????')
where:
regex.findall(s)
... returns a non-empty array for s in ['(123)456-7890','(123)456-(7890)', 'a1b2c3ddd4e5ff6g7h8i9jj0k']
and returns an empty array for s in ['(123)456-789', '(123)456-(78901)']
The regex ^\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*\d\D*$ will find all the matches. To make this work for n digits, build the pattern as r"^" + r"\D*\d"*n + r"\D*$" (raw strings, so that \D and \d are not treated as string escapes):
import re
n=10
regex = r"^" + r"\D*\d"*n + r"\D*$"
numbers='''(123)456-7890
(123)456-(7890)
a1b2c3ddd4e5ff6g7h8i9jj0k
(123)456-789
(123)456-(78901)'''
matches=re.findall(regex,numbers,re.M)
print(matches)
Or for a single match
pattern = re.compile(r"^" + r"\D*\d"*n + r"\D*$")
m = pattern.match('(123)456-7890')
print(m.group(0) if m else None) # (123)456-7890, or None if there is no match
Simply by replacing all non-digit characters from an input string:
import re
def ensure_digits(s, limit=10):
    return len(re.sub(r'\D+', '', s)) == limit
print(ensure_digits('(123)456-(7890)', 10)) # True
print(ensure_digits('a1b2c3ddd4e5ff6g7h8i9jj0k', 10)) # True
print(ensure_digits('(123)456-(78901)', 10)) # False
\D+ - matches one or more non-digit characters
Version for a list of words:
def ensure_digits(words_lst, limit=10):
    pat = re.compile(r'\D+')
    return [w for w in words_lst if len(pat.sub('', w)) == limit]
print(ensure_digits(['(123)456-7890','(123)456-(7890)', 'a1b2c3ddd4e5ff6g7h8i9jj0k'], 10))
print(ensure_digits(['(123)456-789', '(123)456-(78901)'], 10))
prints consecutively:
['(123)456-7890', '(123)456-(7890)', 'a1b2c3ddd4e5ff6g7h8i9jj0k']
[]
You can use string formatting to inject the number of digits n into your pattern. You also need the MULTILINE flag so that ^ and $ match at every line.
import re
txt = """(123)456-7890
(123)456-(7890)
a1b2c3ddd4e5ff6g7h8i9jj0k
(123)456-789
(123)456-(78901)"""
n = 10
rgx = re.compile(r"^(?:\D*\d\D*){%d}$" % n, re.MULTILINE)
result = rgx.findall(txt)
print(result)
Prints:
['(123)456-7890', '(123)456-(7890)', 'a1b2c3ddd4e5ff6g7h8i9jj0k']
This expression would likely validate the 10 digits:
^(?:\D*\d|\d\D*){10}\D*$
where we can simply replace 10 with a variable n.
The expression is explained on the top right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs.
Test
import re
print(re.findall(r"^(?:\D*\d|\d\D*){10}\D*$", "a1b2c3ddd4e5ff6g7h8i9jj0k"))
Output
['a1b2c3ddd4e5ff6g7h8i9jj0k']
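The answers above all build the same idea: n repetitions of "any non-digits, then one digit", followed by optional trailing non-digits. A minimal sketch that wraps this in a reusable helper might look as follows. The function name exactly_n_digits is hypothetical, and the sketch tests one string at a time with fullmatch instead of the findall form requested in the question.

```python
import re

def exactly_n_digits(n):
    """Compile a pattern that fully matches strings containing exactly n digits."""
    # (?:\D*\d){n} consumes exactly n digits, each preceded by optional non-digits;
    # the final \D* allows trailing non-digit characters.
    return re.compile(r"(?:\D*\d){%d}\D*" % n)

ten = exactly_n_digits(10)
candidates = [
    '(123)456-7890',
    '(123)456-(7890)',
    'a1b2c3ddd4e5ff6g7h8i9jj0k',
    '(123)456-789',
    '(123)456-(78901)',
]
accepted = [s for s in candidates if ten.fullmatch(s)]
print(accepted)
```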

Match string using regular expression except specific string combinations python

In a list I need to match specific instances, except for a specific combination of strings:
let's say I have a list of strings like the following:
l = [
'PSSTFRPPLYO',
'BNTETNTT',
'DE52 5055 0020 0005 9287 29',
'210-0601001-41',
'BSABESBBXXX',
'COMMERZBANK'
]
I need to match all the words that look like a SWIFT/BIC code. Such a code has the following form:
6 letters followed by
2 letters/digits followed by
3 optional letters / digits
hence I have written the following regex to match such specific pattern
import re
regex = re.compile(r'(?<!\w)[a-zA-Z]{6}[a-zA-Z0-9]{2}([a-zA-Z0-9]{3})?(?!\w)')
for item in l:
    match = regex.search(item)
    if match:
        print('found a match, the matched string {} the match {}'.format(item, item[match.start() : match.end()]))
    else:
        print('found no match in {}'.format(item))
I need the following cases to be matched:
result = ['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX' ]
but instead I get
result = ['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX', 'COMMERZBANK' ]
So what I need is to match only the strings that don't contain the word 'bank'.
To do so I refined my regex to:
regex = re.compile((?<!bank/i)(?<!\w)[a-zA-Z]{6}[a-zA-Z0-9]{2}([a-zA-Z0-9]{3})?(?!\w)(?!bank/i))
That is, I added a negative lookbehind and a negative lookahead.
My regex doesn't do the filtering it was intended to do. What did I miss?
You can try this:
import re
final_vals = [i for i in l if re.findall(r'^[a-zA-Z]{6}\w{2}|(^[a-zA-Z]{6}\w{2}\w{3})', i) and not re.findall('BANK', i, re.IGNORECASE)]
Output:
['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX']
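As for what the refined regex missed: Python has no /i suffix, so (?<!bank/i) looks for the literal text bank/i, and a lookbehind only inspects the characters immediately before the current position. Case-insensitivity is requested with the re.IGNORECASE flag instead. The following is a sketch of a single-pattern alternative, assuming each list item is a whole candidate code (so fullmatch is used rather than search with \w boundaries); the variable name swift is illustrative.

```python
import re

l = [
    'PSSTFRPPLYO',
    'BNTETNTT',
    'DE52 5055 0020 0005 9287 29',
    '210-0601001-41',
    'BSABESBBXXX',
    'COMMERZBANK',
]

# (?!.*bank) rejects any string that contains "bank" anywhere (case-insensitive),
# then the body requires 6 letters, 2 letters/digits, and 3 optional letters/digits.
swift = re.compile(
    r'(?!.*bank)[a-z]{6}[a-z0-9]{2}(?:[a-z0-9]{3})?',
    re.IGNORECASE,
)

result = [item for item in l if swift.fullmatch(item)]
print(result)
```

Note that excluding every string containing "bank" is the question's own heuristic; a real code could in principle contain those letters.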

Using regex to extract information from string

I am trying to write a regex in Python to extract some information from a string.
Given:
"Only in Api_git/Api/folder A: new.txt"
I would like to print:
Folder Path: Api_git/Api/folder A
Filename: new.txt
After having a look at some examples on the re manual page, I'm still a bit stuck.
This is what I've tried so far
m = re.match(r"(Only in ?P<folder_path>\w+:?P<filename>\w+)","Only in Api_git/Api/folder A: new.txt")
print m.group('folder_path')
print m.group('filename')
Can anybody point me in the right direction??
Get the matched group from index 1 and 2 using capturing groups.
^Only in ([^:]*): (.*)$
sample code:
import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"
print(re.findall(p, test_str))  # [('Api_git/Api/folder A', 'new.txt')]
If you want to print in the below format then try with substitution.
Folder Path: Api_git/Api/folder A
Filename: new.txt
sample code:
import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"
# Python's re.sub refers to groups as \1 and \2 ($1 and $2 are not supported)
subst = r"Folder Path: \1\nFilename: \2"
result = re.sub(p, subst, test_str)
print(result)
Your pattern: (Only in ?P<folder_path>\w+:?P<filename>\w+) has a few flaws in it.
The ?P construct is only valid as the first bit inside a parenthesized expression,
so we need this.
(Only in (?P<folder_path>\w+):(?P<filename>\w+))
The \w character class only matches letters, digits, and underscores. It won't match /, a space, or ., for example. We need to use a different character class that more closely aligns with requirements. In fact, we can just use ., the class of nearly all characters:
(Only in (?P<folder_path>.+):(?P<filename>.+))
The colon has a space after it in your example text. We need to match it:
(Only in (?P<folder_path>.+): (?P<filename>.+))
The outermost parentheses are not needed. They aren't wrong, just not needed:
Only in (?P<folder_path>.+): (?P<filename>.+)
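To see what the named groups capture, here is a small sketch using the pattern at this stage together with the sample line from the question. The second input, "Only in a: b: c", is a made-up example added only to illustrate that the greedy .+ in folder_path extends to the last ": " in the line.

```python
import re

pattern = r'Only in (?P<folder_path>.+): (?P<filename>.+)'

m = re.match(pattern, "Only in Api_git/Api/folder A: new.txt")
# groupdict() returns the named groups as a dictionary.
print(m.groupdict())

# Hypothetical input with two ": " separators: the greedy folder_path
# group keeps everything up to the last one.
tricky = re.match(pattern, "Only in a: b: c")
print(tricky.group('folder_path'))
print(tricky.group('filename'))
```

If folder names may never contain a colon, a character class such as [^:]+ for folder_path would split at the first colon instead.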
It is often convenient to provide the regular expression separate from the call to the regular expression engine. This is easily accomplished by creating a new variable, for example:
regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
... # several lines later
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt")
The above is purely for the convenience of the programmer: it neither saves nor squanders time or memory space. There is, however, a technique that can save some of the time involved in regular expressions: compiling.
Consider this code segment:
regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
for line in input_file:
    m = re.match(regex, line)
    ...
In principle, each iteration of the loop requires the regular expression engine to interpret the regular expression before applying it to the line variable (in practice the re module caches recently used patterns, so the saving is usually modest). The re module allows us to separate the interpretation from the application; we can interpret once but apply several times:
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
for line in input_file:
    m = regex.match(line)
    ...
Now, your original program should look like this:
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
m = regex.match("Only in Api_git/Api/folder A: new.txt")
print(m.group('folder_path'))
print(m.group('filename'))
However, I'm a fan of using comments to explain regular expressions. My version, including some general cleanup, looks like this:
import re
regex = re.compile(r'''(?x) # Verbose
Only\ in\ # Literal match
(?P<folder_path>.+) # match longest sequence of anything, and put in 'folder_path'
:\ # Literal match
(?P<filename>.+) # match longest sequence of anything and put in 'filename'
''')
with open('diff.out') as input_file:
    for line in input_file:
        m = regex.match(line)
        if m:
            print(m.group('folder_path'))
            print(m.group('filename'))
It really depends on how constrained the input is; if this is the only kind of input, this will do the trick (note the escaped dot before txt):
^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$

Python Regular Expression - right-to-left

I am trying to use regular expressions in python to match the frame number component of an image file in a sequence of images. I want to come up with a solution that covers a number of different naming conventions. If I put it into words I am trying to match the last instance of one or more numbers between two dots (eg .0100.). Below is an example of how my current logic falls down:
import os
import re
def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        return path
    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)
# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.####.exr
# Failure
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.###.0100.exr
I realise there are other ways in which I can solve this issue (I have already implemented a solution where I split the path at the dots and take the last item that is a number), but I am taking this opportunity to learn something about regular expressions. It appears the regular expression creates the groups from left to right and cannot use characters in the pattern more than once. Firstly, is there any way to search the string from right to left? Secondly, why doesn't the pattern find two matches in eg2 (123 and 0100)?
Cheers
finditer will return an iterator "over all non-overlapping matches in the string".
In your example, the last . of the first match will "consume" the first . of the second. Basically, after making the first match, the remaining string of your eg2 example is 0100.exr, which doesn't match.
To avoid this, you can use a lookahead assertion (?=...), which checks for the trailing dot without consuming it, so the same dot can start the next match:
>>> pattern = re.compile(r'\.(\d+)(?=\.)')
>>> pattern.findall(eg1)
['0100']
>>> pattern.findall(eg2)
['123', '0100']
>>> eg3 = 'xx01_010_animation.123.0100.500.9000.1234.exr'
>>> pattern.findall(eg3)
['123', '0100', '500', '9000', '1234']
# and "right to left"
>>> pattern.findall(eg3)[::-1]
['1234', '9000', '500', '0100', '123']
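Building on the lookahead pattern, the original goal (replacing only the last frame number with a token) could be sketched as below. The function name sub_last_frame is hypothetical and the sketch works on a bare file name rather than a full path; it takes the last item from finditer and rebuilds the string around the captured digits.

```python
import re

def sub_last_frame(name, token='#'):
    """Replace the last dot-delimited run of digits in name with token characters."""
    # The lookahead (?=\.) leaves the trailing dot unconsumed, so adjacent
    # numbers such as .123.0100. are all found.
    matches = list(re.finditer(r'\.(\d+)(?=\.)', name))
    if not matches:
        return name
    start, end = matches[-1].span(1)  # span of the digits only
    return name[:start] + token * (end - start) + name[end:]

print(sub_last_frame('xx01_010_animation.0100.exr'))
print(sub_last_frame('xx01_010_animation.123.0100.exr'))
```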
My solution uses a very simple, hackish way of fixing it: it reverses the file name at the beginning of your function, searches the reversed name, and reverses the result again at the end. Because the search then effectively runs from right to left, the first match found in the reversed name is the last frame number in the original one. Hackish, but it works. I used the slice syntax [::-1] to reverse the string.
import os
import re
def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    # Reverse only the file name; the folder part is left untouched.
    name = os.path.basename(path)[::-1]
    pattern = r'\.(\d+)\.'
    # First match in the reversed name == last match in the original name.
    match = re.search(pattern, name)
    if not match:
        return path
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name[::-1])
# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.####.exr
# Now also a success
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.123.####.exr
print(eg1)
print(eg2)
I believe the problem is that finditer returns only non-overlapping matches. Because both '.' characters are part of the regular expression, it doesn't consider the second dot as a possible start of another match. You can probably use the lookahead construct (?=...) to match the second dot without consuming it, i.e. end the pattern with (?=\.).
Because of the way regular expressions work, I don't think there is an easy way to search right-to-left (though I suppose you could reverse the string and write the pattern backwards...).
If all you care about is the last \.(\d+)\., you might try anchoring your pattern to the end of the string and doing a simple re.search():
\.(\d+)\.(?:.*?)$
where (?:.*?) is non-capturing and non-greedy, so those trailing characters will not show up in the captured group. Note, however, that re.search() scans from left to right and the lazy tail can still stretch all the way to the end of the string, so on eg2 this pattern captures 123, the first frame-like number, rather than 0100.
(Caveat: that is one ugly regex, so add a comment explaining what it's doing.)
UPDATE: Actually I guess you could just do a ^.*(\.\d+\.) and let the implicitly greedy .* match as much as possible (including matches that occur earlier in the string) while still matching your group. That makes for a simpler regex, but I think it makes your intentions less clear.
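A quick sketch of the greedy variant, to check that it does select the last frame number. The function name replace_last_frame is hypothetical, and the pattern here captures the prefix and the digits in separate groups (a slight variation on the one in the update) so that the digit span can be replaced directly.

```python
import re

def replace_last_frame(name, token='#'):
    """Replace the last .digits. component of name using a greedy prefix."""
    # The greedy (.*) grabs as much as it can, then backtracks just far
    # enough for \.(\d+)\. to match, which lands on the last candidate.
    m = re.match(r'^(.*)\.(\d+)\.', name)
    if not m:
        return name
    start, end = m.span(2)
    return name[:start] + token * (end - start) + name[end:]

print(replace_last_frame('xx01_010_animation.0100.exr'))
print(replace_last_frame('xx01_010_animation.123.0100.exr'))
```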
