I have a whole set of files (10.000+) that include the date and time in the filename. The problem is that the date and time are not zero padded, causing problems with sorting.
The filenames are in the format: output 5-11-2018 9h0m.xml
What I would like is it to be in the format: output 05-11-2018 09h00m.xml
I've searched for different solutions, but most seem to split the string and then recombine it. That seems pretty cumbersome, since in my case day, month, hour and minute would each need to be separated, padded and then recombined.
I thought regex might give me some better solution, but I can't quite figure it out.
I've edited my original code based on the suggestion of Wiktor Stribiżew that you can't use regex in the replacement and to use groups instead:
import os
import glob
import re
old_format = 'output [1-9]-11-2018 [1-2]?[1-9]h[0-9]m.xml'
# a raw string cannot end in a backslash; os.path.join adds the separator
dir = r'D:\Gebruikers\<user>\Documents\datatest'
old_pattern = re.compile(r'([1-9])-11-2018 ([1-2][1-9])h([0-9])m')
filelist = glob.glob(os.path.join(dir, old_format))
for file in filelist:
    print(file)
    newfile = re.sub(old_pattern, r'0\1-11-2018 \2h0\3m', file)
    os.rename(file, newfile)
But this still doesn't function completely as I would like, since it wouldn't change hours under 10. What else could I try?
You can pad the numbers in your file names with .zfill(2) using a lambda expression passed as the replacement argument to the re.sub method.
Also, fix the regex pattern to allow 1 or 2 digits: (3[01]|[12][0-9]|0?[1-9]) for a date, (2[0-3]|[10]?\d) for an hour (24h), and ([0-5]?[0-9]) for minutes:
old_pattern = re.compile(r'\b(3[01]|[12][0-9]|0?[1-9])-11-2018 (2[0-3]|[10]?\d)h([0-5]?[0-9])m')
See the regex demo.
Then use:
for file in filelist:
    newfile = re.sub(old_pattern, lambda x: '{}-11-2018 {}h{}m'.format(x.group(1).zfill(2), x.group(2).zfill(2), x.group(3).zfill(2)), file)
    os.rename(file, newfile)
See Python re.sub docs:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
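To make the callable-replacement idea concrete, here is a minimal sketch that only computes the new name and does not touch the file system. The names OLD_PATTERN, pad_match and padded_name are made up for this illustration, and the pattern is a simplified variant that assumes the month and year are fixed at 11-2018, as in the question.

```python
import re

# Day, hour and minute may each be one or two digits.
# Assumption: month and year are fixed ("11-2018"), as in the question.
OLD_PATTERN = re.compile(r'\b(\d{1,2})-11-2018 (\d{1,2})h(\d{1,2})m')

def pad_match(m):
    """Called by re.sub for each match; returns the padded timestamp."""
    day, hour, minute = (g.zfill(2) for g in m.groups())
    return '{}-11-2018 {}h{}m'.format(day, hour, minute)

def padded_name(filename):
    """Return filename with day, hour and minute padded to two digits."""
    return OLD_PATTERN.sub(pad_match, filename)

print(padded_name('output 5-11-2018 9h0m.xml'))
```

Names that are already padded come back unchanged, so running the rename loop twice should be harmless.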
I suggest going more generic with old_pattern for simplicity, assuming the digit fields are the only thing wrong with your filenames.
Listing every combination of single-digit and double-digit fields explicitly would need a long regex. A much simpler pattern matches any filename that contains at least one single-digit field, i.e. non-digit, digit, non-digit. It is easier to write and to read at a glance, but it matches more widely, so it assumes the directory holds only files of this type. Note that it is a regular expression, not a glob pattern, so use it with re.match on the directory listing rather than with glob.glob:
old_format = r'output.*\D\d\D.*\.xml'
The fixing re.sub statement could then be:
# lookarounds keep the delimiters out of the match, so adjacent fields such as 9h0m are both padded
newfile = re.sub(r'(?<!\d)\d(?=[hm-])', lambda x: x.group().zfill(2), file)
This would also catch unicode non-ascii digits unless the appropriate re module flags are set.
If the year (2018 in the example) might be given as just '18', it would need special handling. That could be a separate case, together with adding a space to the character set in the re.sub pattern (i.e. [-hm ]).
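One pitfall with a generic single-digit pattern is worth checking by hand: re.sub only replaces non-overlapping matches, so a pattern that consumes the delimiter after one field cannot reuse it as the non-digit before the next field. The sketch below compares a consuming pattern with a lookaround version on the sample name from the question. The variable names are only illustrative.

```python
import re

name = 'output 5-11-2018 9h0m.xml'

# Consuming version: the match " 9h" swallows the 'h', so the following
# "0m" has no preceding non-digit left to match and stays unpadded.
consuming = re.sub(
    r'\D(\d)[hm-]',
    lambda m: m.group()[0] + m.group(1).zfill(2) + m.group()[2],
    name,
)

# Lookaround version: (?<!\d) and (?=[hm-]) only inspect the neighbours,
# so every single-digit field is matched on its own.
lookaround = re.sub(r'(?<!\d)\d(?=[hm-])', lambda m: m.group().zfill(2), name)

print(consuming)   # output 05-11-2018 09h0m.xml
print(lookaround)  # output 05-11-2018 09h00m.xml
```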
Related
I'm really new to python and programming in general, and to practice I'm doing projects where I try to tackle problems from my day to day work, so please excuse me if this may be a silly question. I'm trying to combine a group of files located on a remote folder into a single monthly one based on the date, I've already combined files based on date so I think I can do that, but I'm having trouble with the regex to pick the date from the file name string, the string with the filepath is as follows
\\machinename123\main folder\subfolder\2021-01-24.csv
The file name will always have the same format, since it's an automated process; only the date in the name changes. I was trying to pick the date from this string using a regex that selects the text between the last \ of the string and the . before the extension, so that I get 2021-01-24 as a result. But at the level I'm at, regexes are like witchcraft and I don't really know what I'm doing. I've been trying for a few hours with no success. So far the closest I can get by trial and error is (?:[0-9\-]), but this selects all the numbers in the string, including the ones in the machine name. Besides, I don't know why it works the way it works (for example, I know from testing that the ?: works, but I don't understand the theory behind it, so I couldn't replicate it in the future).
How can I make it ignore the other numbers on the string, or more specifically pick only the text between the last \ and the . from the csv, xlsx or whatever the format is?
I'd like the former option better, since it would allow me to learn how to make it do what I need it to do and not get the result by coincidence.
Thanks for any help
Use re.search() to find a pattern: <4 digits>-<2 digits>-<2 digits>.
s = r'\\machinename123\main folder\subfolder\2021-01-24.csv'
m = re.search(r'\d{4}-\d{2}-\d{2}', s).group(0)
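Since the goal is to group files by month, it may help to turn the matched text into a real date straight away. The following is a small sketch along those lines; extract_date is a name invented here, and it assumes the first YYYY-MM-DD run in the path is the one you want.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_date(path):
    """Return the first YYYY-MM-DD found in path as a date, or None if absent."""
    m = DATE_RE.search(path)
    if m is None:
        return None
    # strptime raises ValueError for impossible dates such as 2021-13-45
    return datetime.strptime(m.group(0), '%Y-%m-%d').date()

s = r'\\machinename123\main folder\subfolder\2021-01-24.csv'
d = extract_date(s)
print(d, d.year, d.month)
```

The digits in machinename123 are ignored because they are not followed by the two hyphen-separated groups the pattern requires.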
You can use the following regex, that summarizes the structure of your full path:
import re
filename_regex = re.compile(r'^\\\\(?:[\w ]+\\)+(\d{4}-\d{2}-\d{2})\.csv$')
m = filename_regex.match(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
if m is not None:
    print(f'File found with date: {m.group(1)}')
else:
    print('Filename not of interest')
The output will be:
File found with date: 2021-01-24
filename_regex accepts a string starting with \\, followed by one or more path components (word characters, i.e. letters, digits and underscores, plus spaces), each terminated by \, and ending in 4 digits, a hyphen, 2 digits, another hyphen, 2 digits and the string .csv. The regular expression used here to match the date is very simple, but you can use a stricter one if you prefer.
Another, simpler approach is to use the ntpath module to extract the file name from the full path and apply the regular expression only to the file name:
import ntpath
import re
filename_regex = re.compile(r'^(\d{4}-\d{2}-\d{2})\.csv$')
filename = ntpath.basename(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
m = filename_regex.match(filename)
Here m.group(1) holds the same date as before.
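The question also asked how to pick only the text between the last \ and the ., whatever the extension is. Here is a sketch of two ways to do that, one with pathlib and one with a commented regex. The function names are invented for this example, and PureWindowsPath is used so that the backslash is treated as the separator on any operating system.

```python
import re
from pathlib import PureWindowsPath

path = r'\\machinename123\main folder\subfolder\2021-01-24.csv'

def stem_by_path(p):
    """File name without directory and without the last extension."""
    return PureWindowsPath(p).stem

# ([^\\]+)   one or more characters that are not a backslash (captured)
# \.         a literal dot
# [^\\.]+$   the extension: no backslash or dot, up to the end of the string
LAST_PART_RE = re.compile(r'([^\\]+)\.[^\\.]+$')

def stem_by_regex(p):
    m = LAST_PART_RE.search(p)
    return m.group(1) if m else None

print(stem_by_path(path))   # 2021-01-24
print(stem_by_regex(path))  # 2021-01-24
```

The regex cannot match inside the machine or folder names because none of them contains a dot followed by an extension that runs to the end of the string.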
I'm using Python to create HTML links from a listing of filenames.
The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf.
They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf
My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results with just: lyrics
I have the following Python code getting to my goal, but seems like I'm doing it the hard way.
Is there a more efficient way to do this?
testString = 'file1_with_extra_underscores_lead.pdf'
#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lead.pdf')
#Step 2: Separate the suffix and .pdf extension.
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')
#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead
I'm thinking what I'm trying to do is possible with much fewer lines of code, and not having to set HTMLtext multiple times as I'm doing now.
You can use Path from pathlib to extract the final path component without its suffix:
from pathlib import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]
Output:
'lead'
As @wwii said in a comment, you should use os.path.splitext, which is designed to separate a filename from its extension, and str.split/str.rsplit, which are designed to cut strings at a character. Using those functions, there are several ways to achieve what you want.
Unlike @wwii, I would start by discarding the extension:
import os
test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename) # 'file1_with_extra_underscores_lead'
Then I would use split or rsplit, either with the maxsplit argument or by selecting the last (or second) item of the resulting list, depending on which method is used. The following lines are all equivalent, in terms of functionality at least:
filename.split('_')[-1] # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1] # same as previous line except it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1] # split only one time from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1] # same as the previous line, except it selects the second chunk (which is the last, since only one split occurred)
The best is probably one of the last two, since they do not perform unnecessary splits.
Why is this answer better than others? (in my opinion at least)
Using pathlib is fine but a bit overkill for separating a filename from its extension, os.path.splitext could be more efficient.
Using a slice with rfind works, but it does not clearly express the intention of the code and it is not very readable.
Using endswith('.pdf') is OK if you are sure you will never use anything else than PDF. If one day you use a .txt, you will have to rework your code.
I love regex, but in this case it suffers from the same caveats as the two previously discussed solutions: no clear intention, not very readable, and you will have to rework it if one day you use another extension.
Using splitext clearly indicates that you do something with the extension, and the first item selection is quite explicit. This will still work with any other extension.
Using rsplit('_', maxsplit=1) and selecting the last index is also quite expressive and far clearer than an arbitrary-looking slice.
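Putting the two steps together, a small helper might look like the sketch below. The name last_chunk and the choice to return None when there is no underscore are assumptions made for this example, not something the question specifies.

```python
import os

def last_chunk(filename, sep='_'):
    """Return the text after the last sep, without the extension.

    Returns None when the name contains no sep at all.
    """
    stem, _ext = os.path.splitext(filename)   # drop '.pdf', '.txt', ...
    parts = stem.rsplit(sep, maxsplit=1)      # at most one split, from the right
    return parts[1] if len(parts) == 2 else None

for name in ('song1_lead.pdf', 'song2_with_extra_underscores_vocals.pdf'):
    print(name, '->', last_chunk(name))
```

Because the extension is removed by splitext rather than by a fixed slice, the same helper should also work for .txt or any other extension.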
This should do fine:
testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]
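But there is no error checking here. If there is no "_" in the string, nothing is raised: rfind returns -1, so the slice silently returns the whole name minus its last four characters.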
You could use a regex as well. That shouldn't be difficult.
Basically I wouldn't do it this way myself. It's better to do some exception handling unless you are 100% sure that there is no need for exception handling.
This will work with "..._lead.pdf" or "..._lead.pDf":
import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search(r'_([^_]+)\.pdf$', testString, flags=re.I)
print(m.group(1) if m else "No match")
I'm trying to filter out strings in file names that appear in a for loop
if search == "List":
    onlyfiles = [f for f in listdir("path") if isfile(join("path", f))]
    for i in onlyfiles:
        print(i)
Now it outputs all the filenames, as expected and wanted, but I want to filter out the .json at the end of each name, as well as a few other elements of the name, so that I see just the file name.
For example: filename-IDENTIFIER.json
I want to filter out "-IDENTIFIER.json" out from the for loop's output
Thanks for any help
There are a few approaches here, based on how much your data can vary:
So let's try to build a get_filename(f) method
Quick and dirty
If you know that f always ends in exactly the same way, you can simply remove those characters. Here that means removing the last 16 characters, the length of "-IDENTIFIER.json". It's useful to know that in Python a string behaves like an (immutable) sequence of characters, so you can slice it just like a list.
def get_filename(f: str) -> str:
    return f[:-16]
This will however fail if the Identifier or suffix changes in length.
Varying lengths
If the suffix varies in length, then you should split the string on a fixed delimiter and return the relevant part. In this case you want to split on -.
def get_filename(f: str) -> str:
    return f.split("-")[0]
Note however that this will fail if the filename also contains a -.
You can fix that by dropping the last part and rejoining all the earlier pieces, in the following way.
def get_filename(f: str) -> str:
    return "-".join(f.split("-")[:-1])
Using regexes to match the format
The most general approach would be to use python regexes to select the relevant part. These allow you to very specifically target a specific pattern. The exact regex that you'll need will depend on the complexity of your strings.
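As a rough illustration of the regex route, the sketch below strips a trailing "-IDENTIFIER.json" part. It assumes the identifier itself contains no dashes or dots, and it returns the input unchanged when the name does not fit the pattern; both choices are assumptions made for this example.

```python
import re

# name:  everything up to the last dash (greedy, so earlier dashes are kept)
# then:  a dash, an identifier without dashes or dots, and ".json" at the end
SUFFIX_RE = re.compile(r'^(?P<name>.+)-[^-.]+\.json$')

def get_filename(f: str) -> str:
    m = SUFFIX_RE.match(f)
    return m.group('name') if m else f

print(get_filename('filename-IDENTIFIER.json'))  # filename
print(get_filename('my-file-ABC123.json'))       # my-file
```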
Split the string on "-" and get the first element:
filename = f.split("-")[0]
This will break if the filename itself contains "-", though.
This should work:
i.split('-')[0].split('.')[0]
Case 1: filename-IDENTIFIER.json
It takes the substring before the dash, so output will become filename
Case 2: filename.json
There is no dash in the string, so the first split does nothing (full string will be in the 0th element), then it takes the substring before the point. Output will be filename
Case 3: filename
Nothing to split, output will be filename
If it's always .json and -IDENTIFIER, then it's safer to use this:
i.split('-IDENTIFIER')[0].split('.json')[0]
Case 4: filename-blabla.json
If the filename has an extra dash in it, it won't be a problem, output will be filename-blabla
(First post... very new to programming)
I needed to rename a bunch of files from 'This is a filename-123456.ext' to '123456-This is a filename.ext'
I managed to solve the problem using Python with the code below. I had to make 2 scripts because sometimes there are 5 numbers, but for the most part 6.
import os
for filename in os.listdir('.'): #not sure how to rename recursive sub-directories
    if filename != 'ren6.py': #included to not rename the script file
        start = filename[:-11]
        number = filename[-10:-4]
        ext = filename[-4:]
        newname = str(number) + '-' + str(start)+str(ext) #Unnecessary variable creation?
        os.rename(filename,newname)
I'm still learning and very curious to see more efficient and elegant ways to accomplish the same thing.
It may be safer and more powerful to use regular expressions. This will only rename files that match the given pattern, which is [ANY SEQUENCE OF CHARACTERS][A DASH][NUMBERS][EXTENSION]
An added benefit to using this method is that you can run it multiple times on the same directory and it won't affect already renamed files.
You might also want to do a check to make sure the file you're renaming it to doesn't already exist (so that you don't overwrite an existing file).
import os
import re
for filename in os.listdir('.'):
    m = re.match(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$', filename)
    if m:
        newname = '{num}-{name}{ext}'.format(**m.groupdict())
        if not os.path.exists(newname):
            os.rename(filename, newname)
I'll break down the regular expression
^(?P<name>.+)
The ^ indicates we will start matching at the beginning of the filename (as opposed to matching a middle part of the filename). The () make this a regex group, so that we can access just that one part of the string match. The ?P<name> is just a way to apply a label to a particular group, so that we can refer it to by name later on. In this case, we've given this group a label of name.
. will match any character, and + tells it to match 1 or more characters.
-
This will only match the - character
(?P<num>\d+)
Again, we've made this a group and given it a label of num. \d will only match numbers and the + means it will match 1 or more numbers.
(?P<ext>\.\w+)$
Another group, another label. The \. will only match a . and the \w will match word characters (i.e. letters, numbers, underscores). Again, the + means it will match 1 or more characters. The $ ensures it matches all the way to the end of the string.
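To see the groups in action without renaming anything, you could try the pattern on a few sample names first. This is only a sketch; the helper name swapped is made up for the example.

```python
import re

PATTERN = re.compile(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$')

def swapped(filename):
    """Return the new name, or None if filename does not fit the pattern."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return '{num}-{name}{ext}'.format(**m.groupdict())

for sample in ('This is a filename-123456.ext',   # 6 digits
               'Another file-12345.ext',          # 5 digits, same pattern
               '123456-This is a filename.ext',   # already renamed
               'ren6.py'):                        # the script itself
    print(repr(sample), '->', repr(swapped(sample)))
```

Because \d+ accepts any number of digits, one script covers both the 5-digit and 6-digit cases, and names that were already renamed, or that have no dash such as the script itself, are left alone.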
How about this?
import os
for filename in os.listdir('.'):
    name, extension = os.path.splitext(filename)
    if '-' not in name:
        continue
    # split the name (without its extension) at the last dash only
    part1, part2 = name.rsplit('-', 1)
    os.rename(filename, "{1}-{0}{2}".format(part1, part2, extension))
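The question's code comment also wondered how to handle sub-directories. One possible way is os.walk, sketched below together with a dry-run switch so you can inspect the planned renames before committing to them. The function names plan_renames and rename_all and the dry_run parameter are invented for this sketch, and it reuses the regex from the answer above.

```python
import os
import re

PATTERN = re.compile(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$')

def plan_renames(root):
    """Yield (old_path, new_path) for every matching file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            m = PATTERN.match(filename)
            if m is None:
                continue
            newname = '{num}-{name}{ext}'.format(**m.groupdict())
            yield os.path.join(dirpath, filename), os.path.join(dirpath, newname)

def rename_all(root, dry_run=True):
    # Collect the plan first so renaming does not interfere with the walk.
    for old, new in list(plan_renames(root)):
        if os.path.exists(new):
            print('skipping, target exists:', new)
        elif dry_run:
            print(old, '->', new)
        else:
            os.rename(old, new)
```

Calling rename_all('.') only prints what would happen; rename_all('.', dry_run=False) performs the renames.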
As an input, I have a filename (e.g. "bla150420.txt") containing a date in a specific format. I need to look into a given folder (containing many files) and find the latest version of my file. (And I have to do it automatically - many times for different files in different folders.)
Example directory (dirname):
...
bla150420.txt
bla150425.txt
bla150510.txt
Example output:
bla150510.txt
How can I do so? My original approach was to parse a date out of the file name, substitute the date with RE pattern and to search this pattern in the list of all filenames. This doesn't seem to work. Any idea? Or different approach?
def get_date(file_name):
    DATE_RE = re.compile('([0-9]{6})') #EDITED - TYPO
    try:
        match = DATE_RE.search(file_name).group()
    except AttributeError:
        sys.stderr.write('ERROR! No date matches string!\n\t' + file_name)
    else:
        # two-digit year, e.g. 150420
        date = datetime.datetime.strptime(match, '%y%m%d')
        return match, date
date_string, current_date = get_date(fname)
# fname is a given file name (e.g. bla150420.txt)
pattern1 = re.compile(re.sub(date_string, '(.*)', fname))
# pattern1 returns value 'kds_docs-(.*).zip'
pattern2 = re.compile('kds_docs-(.*).zip')
if os.path.isdir(dirname):
    matching_files = [x for x in os.listdir(dirname)
                      if pattern1.search(x)]
It is a wonder to me, my program works with pattern2, but not with pattern1. If I print those two (using .pattern), it looks like the same result, if I compare it with '==' it returns False. I have no idea whether it is because of encoding/whitespace/something else nor how to find the difference. Could you please help?
I think you're merely having issues with producing a working regex in an automated fashion.
Serge pointed out that your code as presented should get tripped up by the fact that your dates seem to have 6 digits instead of 8, but that first regex requires 8 digits - correct or explain that if it's more than a typo.
I suppose you're looking to verify that any string of digits actually is a date, but this seems unnecessary, since a filename could have a string of digits which parses as a date, but isn't the date you're looking for - not ideal. Let me know if it MUST be a date.
I'm unfamiliar with the intricacies of Python, but I would suggest simplifying your regex production by not even using your function:
pattern1 = re.compile(re.sub('([0-9]{6})', '(.*)', fname))
Just make the replacement directly. I would say it would probably be safer to go a little further like so:
# the backslash is doubled because re.sub processes escapes in the replacement string
pattern1 = re.compile(re.sub(r'[0-9]{6}', r'(\\d{6})', fname))
...and if there are any other possible restrictions, you could limit 6 digit matches further. For instance, the 6 digit string might always be at the end of the file name, just before the extension:
pattern1 = re.compile(re.sub(r'[0-9]{6}(?=\..*$)', r'(\\d{6})', fname))
# should turn 'kds_docs-120501-151023.zip' into 'kds_docs-120501-(\d{6}).zip'
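For the original goal of finding the latest version, the pieces can be combined roughly as in the sketch below. It assumes the first run of six digits in the name is a YYMMDD date and that all dates fall in the same century, so comparing the digit strings is enough. latest_version is a hypothetical helper name, and the literal parts of the name are passed through re.escape so that the . in the extension is not treated as a wildcard.

```python
import re

DATE_RE = re.compile(r'\d{6}')  # first run of six digits, assumed to be YYMMDD

def latest_version(fname, candidates):
    """Return the candidate shaped like fname that carries the latest date."""
    m = DATE_RE.search(fname)
    if m is None:
        return None
    # Same name, but with the date replaced by a capturing group.
    pattern = re.compile(
        re.escape(fname[:m.start()]) + r'(\d{6})' + re.escape(fname[m.end():])
    )
    best_date, best_name = None, None
    for name in candidates:
        found = pattern.fullmatch(name)
        if found and (best_date is None or found.group(1) > best_date):
            best_date, best_name = found.group(1), name
    return best_name

files = ['bla150420.txt', 'bla150425.txt', 'bla150510.txt', 'other150601.txt']
print(latest_version('bla150420.txt', files))  # bla150510.txt
```

In a real script, candidates would come from os.listdir(dirname).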