I have a huge collection of files that I am trying to rename in bulk. The filename patterns are fairly consistent, but there are a few irregularities that my basic regex knowledge cannot handle.
The filenames usually go like this:
1050327473 {913EDD51} 1st Filename [2nd Edition].txt
I could remove the strings between {} and [], plus a few other special characters, with this piece of code:
new_file_name = re.sub(r'{.+?}', '', filename)
new_file_name = re.sub(r'\[.+?]', '', new_file_name)
new_file_name = ((new_file_name.split(" .pdf", 1)[0]) + '.pdf').translate({ord(i):None for i in '/\:*?"<>|_'})
and it successfully outputs this:
1050327473 1st Filename
However, some of the original filenames deviate from this pattern, and I still have to remove the 10-digit number. A few of the other patterns look like this:
785723041X, 4844004976 {2C5ACB07} 1st Filename.txt
0383948600 {6A7528B5} 2nd Filename.txt
3263031418, 7966530910, 8070331430 {DCBAD13B} 3rd Filename.txt
The expected output is
1st Filename.txt
2nd Filename.txt
3rd Filename.txt
Now, I could remove every digit, but the file name would also lose a meaningful part and become st Filename.txt. Slicing the string with something like [10:] would not work either, because the length of the numeric prefix varies.
I thought the most logical thing would be to remove every 10-digit sequence, but some of these sequences end with an X instead of a 10th digit, like 785723041X. Also, if the 10-digit sequence is followed by a comma, the comma should be removed too.
What would be the best approach to solve this problem? Is it doable with regex only?
With specific regex pattern:
import re
filenames = ['785723041X, 4844004976 {2C5ACB07} 1st Filename.txt',
'0383948600 {6A7528B5} 2nd Filename.txt',
'3263031418, 7966530910, 8070331430 {DCBAD13B} 3rd Filename.txt']
pat = re.compile(r'\{[^{}]+\}|\[[^\[\]]+\]|\b\d{9}[\dX],?')
filenames = [pat.sub('', f).strip() for f in filenames]
print(filenames)
The output:
['1st Filename.txt', '2nd Filename.txt', '3rd Filename.txt']
Regex details:
..|..|.. - alternation group (to match a single regular expression out of several possible regular expressions)
\{[^{}]+\} - match any characters enclosed in {} (except the braces themselves, ensured by the character class [^{}]+)
\[[^\[\]]+\] - match any characters enclosed in [] (except the brackets themselves, ensured by the character class [^\[\]]+; the brackets are escaped inside the class, because an unescaped [^[] would end at the first ])
\b\d{9}[\dX],? - match a 9-digit sequence followed by either a 10th digit or an X char, and an optional trailing , char
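As a minimal sketch of how the three alternatives work together, the example below also covers the [2nd Edition] case from the question. The helper name clean_filename is hypothetical, and it assumes every filename has an extension after its last dot.

```python
import re

# Same alternation as described above; the brackets are escaped inside
# the character class so that it excludes both [ and ].
PAT = re.compile(r'\{[^{}]+\}|\[[^\[\]]+\]|\b\d{9}[\dX],?')

def clean_filename(name):
    """Hypothetical helper: strip {...}, [...] and 10-character number tokens."""
    # Work on the stem only, so the extension is never touched.
    stem, dot, ext = name.rpartition('.')
    cleaned = PAT.sub('', stem)
    # Collapse the runs of spaces left behind by the removals.
    cleaned = ' '.join(cleaned.split())
    return cleaned + dot + ext

print(clean_filename('1050327473 {913EDD51} 1st Filename [2nd Edition].txt'))
print(clean_filename('785723041X, 4844004976 {2C5ACB07} 1st Filename.txt'))
```

Both calls print 1st Filename.txt. Splitting off the extension first avoids the stray space that would otherwise remain before .txt once the bracketed part is removed.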
I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop that matches the string with a specific number, which is the number between the second occurrence of "_" and the ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matched string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use a regular expression. Based on this answer, I have tested this regex on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and it also doesn't make sure that it takes the exact number, so if the number is 1, it might also select items with number 10.
My end goal is to match the items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regular expression. I am looking for help with the regex.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to come after the 3rd occurrence of the _, not the 2nd.
To get the matching path(s), you could build the pattern from a variable holding the number you want to find. The general form, with \d+ standing in for that number, is:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 times (non consecutive) _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif followed by any char except /
$ End of string
See a regex demo and a Python demo.
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
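If the goal is to loop over many numbers, compiling one pattern per number can be avoided by capturing the number once and grouping the paths by it. This is only a sketch: the names TRAILING_NUMBER and group_by_number are hypothetical, and the pattern is a simplified variant that anchors on the last underscore before .tif instead of counting three underscores.

```python
import re
from collections import defaultdict

# Capture the digits that sit directly between an underscore and ".tif".
TRAILING_NUMBER = re.compile(r"_(\d+)\.tif\b[^\s/]*$")

def group_by_number(paths):
    """Hypothetical helper: map each trailing number to the paths that carry it."""
    groups = defaultdict(list)
    for path in paths:
        m = TRAILING_NUMBER.search(path)
        if m:
            groups[int(m.group(1))].append(path)
    return dict(groups)

list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.png.tif",
]

by_number = group_by_number(list_paths)
print(by_number[1])
print(by_number.get(3, []))
```

Here by_number[1] holds only the path ending in _1.tif.tif, and the .png.tif path is skipped because the number is not followed by .tif.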
I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
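numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
Splitting once from the right on "_" gives e.g. ["imgs/foldeer/img_ABC_21389", "1.tif.tif"]
extract the last element ("1.tif.tif")
split again, by the dot this time, and take the first element (that's your number as a string), then cast it to int
Splitting from the right matters here: split("_", 3) would leave "871125_21.tif.tif" for the "imgs_foldeer" path, and int("871125_21") silently returns 87112521 because int() accepts underscores between digits.
Please keep in mind it's only going to work for the format you provided; there is no flexibility at all in this solution (on the other hand, the regex doesn't give much either).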
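Mostly without regex, if that is allowed: slice off everything after the last underscore, and use a regex only to pull the digits out of that tail.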
import re
s= 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last =s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Gives
10
[0-9]+(?=\.tif\.tif)
This regex uses a lookahead to match the last run of digits, the one directly followed by .tif.tif (what you're looking for). The + rather than * prevents empty matches.
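A minimal sketch of the lookahead in use, written with + so that an empty match is impossible. The variable names are illustrative only.

```python
import re

s = "imgs/foldeer/img_ABC_15431_10.tif.tif"

# The lookahead (?=...) checks that ".tif.tif" follows, without consuming it.
m = re.search(r"[0-9]+(?=\.tif\.tif)", s)
number = int(m.group()) if m else None
print(number)
```

This prints 10. The earlier run 15431 is skipped because it is followed by _ rather than .tif.tif.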
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']
I need to extract a string from a document with the following pattern in Python.
The string will always start with either "AK" or "BK", followed by numbers, letters, - or / (in any order).
This string can appear anywhere in the document.
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
I have written following code.
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=re.findall(pattern,document_text)
but the list I get also contains strings made only of letters,
something like this
res_list=['AKBN','AK3418CPMP','BKCPU']
when I just use
res_grp=re.search(pattern,document_text)
res=res_grp.group()
I just get 'AKBN'
So findall is also matching the words "AKBN" and "BKCPU"
along with the required "AK3418CPMP".
I want conditions to be following to extract only 1 string "AK3418CPMP":
string should start with AK or BK
It should be followed by letters and numbers or numbers and letters
It can contain "-" or "/"
How can I only extract "AK3418CPMP"
You can make sure to match at least a single digit after matching AK or BK. It is also safer to move the - to the end of the character class, so that it cannot be read as a range.
\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*
\b A word boundary to prevent a partial match
[AB]K Match either AK or BK
[A-Za-z/-]* Optionally repeat matching chars A-Za-z / or - without a digit
[0-9] Match at least a single digit
[A-Za-z0-9/-]* Optionally match what is listed in the character class including the digit
Regex demo
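A short check of this pattern against the document text from the question; only the variable names are new.

```python
import re

document_text = """
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

# AK or BK, then at least one digit somewhere in the allowed characters.
pattern = re.compile(r"\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*")
matches = pattern.findall(document_text)
print(matches)
```

This prints ['AK3418CPMP']. AKBN and BKCPU are rejected because the mandatory [0-9] never finds a digit.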
You can keep your regex, and make Python do the filtering.
import re
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=[x for x in
          re.findall(pattern,document_text)
          # check only the part after the AK/BK prefix: it needs a digit and a letter
          if re.search(r'\d', x[2:])
          and re.search(r'[A-Za-z]', x[2:])]
print(res_list)
You can include a 'match at least' clause like: ([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,}). This would cover your 1st and 2nd needs. You can customize this regex condition to track the '-' and '/' cases too.
Let's suppose you would like to track cases where the '-' or '/' would separate your substrings :
([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})
I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile(r'\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.
If you want to match the whole string with a regex,
you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name).
It tries to match just the whole string.
Or put $ anchor at the end of your regex and call re.match(pattern, string).
It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of the regex (keeping $ at the end) and call re.search(pattern,
string), but it would be a very strange combination.
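The options above can be compared side by side. This sketch reuses the pattern from the question; the function name whole_string_checks is hypothetical.

```python
import re

pattern = r"\$?\s?[0-9]+"   # the pattern from the question

def whole_string_checks(text):
    """Hypothetical helper: report which call accepts the given text."""
    return {
        "fullmatch": re.fullmatch(pattern, text) is not None,
        "match + $": re.match(pattern + r"$", text) is not None,
        "search + ^ and $": re.search(r"^" + pattern + r"$", text) is not None,
        "match, unanchored": re.match(pattern, text) is not None,
    }

for text in ["416", "$40 million"]:
    print(text, whole_string_checks(text))
```

For "416" all four checks are True. For "$40 million" only the unanchored re.match succeeds, which is why that line showed up in the original output.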
I also have a remark about how you specified your conditions, which may be
incomplete: you gave e.g. the string $40 million and stated that the only reason
to reject it is the space and letters after $40.
So actually you should have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
And one more remark concerning Python literals: Apparently you have forgotten to prepend the pattern with r.
If you use r-string literal, you do not have to double backslashes inside.
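A small illustration of why the r prefix matters, using \b because it is a valid escape in both worlds: in a normal string literal it is a backspace character, while in a raw string it reaches the regex engine as a word boundary. The variable names are illustrative.

```python
import re

plain = "\bcat"    # \b is converted to a backspace character (0x08) by Python
raw = r"\bcat"     # backslash and b are kept, so the regex sees a word boundary

print(len(plain), len(raw))
print(re.search(plain, "a cat"))   # looks for a literal backspace before "cat"
print(re.search(raw, "a cat"))     # finds "cat" at a word boundary
```

The first search returns None and the second returns a match object for cat.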
So I think the most natural solution is to call a function devoted just to
match the whole string (i.e. fullmatch), without adding start / end
anchors and the whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And optional space.
)? - End of the non-capturing group and ? stating that the whole
group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [
and ] the dot represents itself, so no backslash quotation is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive
dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.][\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.] - Either a comma or a dot (single).
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, it may occur several times.
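A quick check of the stricter form, sketched with the separator required inside the repeated group (exactly one comma or dot between runs of digits). The sample strings other than those from the question are made up for illustration.

```python
import re

# Optional "$" prefix, then digit runs joined by single commas or dots.
strict = re.compile(r"(?:\$\s?)?\d+(?:[,.]\d+)*")

samples = ["416", "15,704", "$ 1.250,75", "2...5", "3.,44", "$40 million"]
accepted = [s for s in samples if strict.fullmatch(s)]
print(accepted)
```

This prints ['416', '15,704', '$ 1.250,75']. Strings with consecutive separators, or with words after the number, are rejected by fullmatch.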
With a little modification to your code:
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$') # Numbers separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704
You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches any run of digits, commas and dots, followed by zero or more whitespace characters. If you want to allow at most one , or . inside the number, use the following pattern: '^\d+(?:[,.]\d+)?\s*$', code:
import re
pattern = re.compile(r'^\d+(?:[,.]\d+)?\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^\d+(?:[,.]\d+)?\s*$ matches a group of digits (\d+), optionally followed by a single , or . and another group of digits ((?:[,.]\d+)?), followed by optional trailing whitespace (\s*). Making the separator and the second group of digits optional together means that a single-digit number such as 5 is accepted as well.
This is my first post and I am a newbie to Python. I am trying to get this to work.
string 1 = [1/0/1, 1/0/2]
string 2 = [1/1, 1/2]
I am trying to check each string: if I see two /, then I just need to replace the 0 with 1, so it becomes 1/1/1 and 1/1/2.
If I don't have two /, then I need to add one in along with a 1 and change it to the format 1/1/1 and 1/1/2, so string 2 becomes [1/1/1,1/1/2].
The ultimate goal is to get all strings to match the pattern x/1/x. Thanks for all the input on this. I tried this and it seems to work:
for a in Port:
    if re.search(r'././', a):
        z.append(a.replace('/0/','/1/') )
    else:
        t1= a.split('/')
        if len(t1)>1 :
            t2= t1[0] + "/1/" + t1[1]
            z.append(t2)
A few more lines are there to take care of some exceptions, but this seems to do the job.
The regex pattern for identifying a / is just / itself; in Python it does not need escaping, although \/ works too.
This could be solved rather simply using the built in string functions without having to add all of the overhead and additional computational time caused by using the RegEx engine.
For example:
# The string to test:
sTest = '1/0/2'
# Test the string:
if sTest.count('/') == 2:
    # There are two forward slashes in the string
    # If the middle number is a 0, we'll replace it with a one:
    sTest = sTest.replace('/0/', '/1/')
elif sTest.count('/') == 1:
    # One forward slash in string
    # Insert a 1 between first portion and the last portion:
    sTest = sTest.replace('/', '/1/')
else:
    print('Error: Test string is of an unknown format.')
# End If
If you really want to use RegEx, though, you could simply match the string against these two patterns: \d+/0/\d+ and \d+/\d+(?!/). If matching against the first pattern fails, then attempt to match against the second pattern. Then you can use either grouping, splitting, or simply calling .replace() (like I'm doing above) to format the string as you need.
EDIT: for clarification, I'll explain the two patterns:
Pattern 1: \d+/0/\d+ could essentially be read as "match any number (consisting of one (1) or more digits) followed by a forward slash, a zero (0), another forward slash and then followed by any number (consisting of one (1) or more digits)."
Pattern 2: \d+/\d+(?!/) could be read as "match any number (consisting of one (1) or more digits) followed by a forward slash and any other number (consisting of one (1) or more digits) which is then NOT followed by another forward slash." The last part in this pattern could be a little confusing because it uses the negative lookahead abilities of the RegEx engine.
If you wanted to add stricter rules to these patterns to make sure there are not any leading or trailing non-digit characters, you could add ^ to the start of the patterns and $ to the end, to signify the start of the string and the end of the string respectively. This would also allow you to remove the lookahead expression from the second pattern ((?!/)). As such, you would end up with the following patterns: ^\d+/0/\d+$ and ^\d+/\d+$.
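The anchored patterns can be combined with capture groups to do the rewrite directly. This is a sketch under the assumption that any string matching neither pattern should be left untouched; the names THREE_PART, TWO_PART and normalize are hypothetical.

```python
import re

THREE_PART = re.compile(r"^(\d+)/0/(\d+)$")   # e.g. 1/0/2
TWO_PART = re.compile(r"^(\d+)/(\d+)$")       # e.g. 1/2

def normalize(port):
    """Hypothetical helper: bring a port string into the x/1/x form."""
    m = THREE_PART.match(port)
    if m is None:
        # Only try the second pattern when the first one fails.
        m = TWO_PART.match(port)
    if m is None:
        return port   # unknown format: leave it as it is
    return f"{m.group(1)}/1/{m.group(2)}"

print([normalize(p) for p in ["1/0/1", "1/0/2", "1/1", "1/2", "1/1/2"]])
```

This prints ['1/1/1', '1/1/2', '1/1/1', '1/1/2', '1/1/2']. The last item already has the target form and passes through unchanged.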
https://regex101.com/r/rE6oN2/1
Click code generator on the left side. You get:
import re
p = re.compile(r'\d/1/\d')
test_str = "1/1/2"
re.search(p, test_str)
I have a pattern which is looking for word1 followed by word2 followed by word3 with any number of characters in between.
My file however contains many random newline and other white space characters - which means that between word 1 and 2 or word 2 and 3 there could be 0 or more words and/or 0 or more newlines randomly
Why isn't this code working? (It's not matching anything)
strings = re.findall('word1[.\s]*word2[.\s]*word3', f.read())
[.\s]* - What I mean by this - find either '.'(any char) or '\s'(newline char) multiple times(*)
Your regex is not working for two reasons. Inside a character class, . loses its special meaning, so [.\s]* only matches literal dots and whitespace, not the words between word1 and word2. And outside a character class, . matches any character except a newline (\n) by default, so a plain .* cannot cross line breaks either.
To make . match newline characters as well, pass re.DOTALL as the third argument (flags) to the findall function:
strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
You have two problems:
1) . doesn't mean anything special inside brackets [].
Change your [] to use () instead, like this: (.|\s)
2) \ doesn't mean what you think it does inside regular strings.
Try using raw strings:
re.findall(r'word1 ..blah..')
Notice the r prefix of the string.
Putting them together:
strings = re.findall(r'word1(.|\s)*word2(.|\s)*word3', f.read())
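However, do note that this changes the returned list: when the pattern contains capturing groups, findall returns the captured groups instead of the whole match. Use the non-capturing form (?:.|\s) if you want the matched text itself.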
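The differences between these variants can be seen on a small made-up sample. The text and variable names below are assumptions for illustration, and the quantifiers are written lazily (*?) to keep each match as short as possible.

```python
import re

text = "word1 foo\nbar word2\n\n baz word3 tail"

# [.\s] only allows literal dots and whitespace, so "foo" blocks the match.
original = re.findall(r"word1[.\s]*word2[.\s]*word3", text)

# With re.DOTALL the dot also matches newlines.
dotall = re.findall(r"word1.*?word2.*?word3", text, re.DOTALL)

# Capturing groups: findall returns one tuple of groups per match.
capturing = re.findall(r"word1(.|\s)*?word2(.|\s)*?word3", text)

# Non-capturing groups: findall returns the matched text again.
non_capturing = re.findall(r"word1(?:.|\s)*?word2(?:.|\s)*?word3", text)

print(original)
print(dotall)
print(capturing)
print(non_capturing)
```

The first call returns an empty list, the second and fourth return the full span from word1 to word3, and the third returns a list of tuples rather than the matched text.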