I have files with names like the ones below, and I need to rename them to the format on the right.
CK-123443-1.dft - CK-123443.dft
CK-123344-A.dft - CK-123344.dft
123322-B.dft - 123322.dft
I tried using split('-'), but this does not work for all files because some names have two hyphens and some have one. Is there another solution for this problem?
My code with re:
I am not sure about the regular expression.
import re
new = re.sub('-', '.', old)
If you are sure that every filename in the directory has a hyphen that needs to be removed, you can split at hyphens and only exclude the last split part.
So, something like this:
name, ext = file_name.split('.') # Get the 'dft' part aside
new_name = '-'.join(name.split('-')[:-1]) + f'.{ext}' # Rejoin with '-' so 'CK-123443' keeps its hyphen
SOLUTION:
import re
# assuming that your file name is stored in file_name
new_name = re.sub(r'-[A-Za-z0-9]\.', '.', file_name)
# this removes a hyphen plus the single letter or digit directly before the extension dot
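If the part after the last hyphen can be longer than one character, a regex-free variant may be easier to reason about. The following is only a minimal sketch: `strip_last_hyphen_part` is a hypothetical helper name, and it assumes that everything after the last hyphen in the base name should be dropped.

```python
import os

def strip_last_hyphen_part(file_name):
    """Drop the text after the last hyphen in the base name, keeping the extension."""
    base, ext = os.path.splitext(file_name)   # e.g. 'CK-123443-1' and '.dft'
    head, sep, _tail = base.rpartition('-')   # split at the last hyphen only
    if not sep:                               # no hyphen: leave the name unchanged
        return file_name
    return head + ext

for old in ['CK-123443-1.dft', 'CK-123344-A.dft', '123322-B.dft']:
    print(old, '->', strip_last_hyphen_part(old))
```

Names without any hyphen are returned unchanged, so the helper can be run over a whole directory listing.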
I think you can easily do it using regex. Before I suggest a pattern, though, I need more clarity on how you want the name to change: do you want to keep any leading alphabetic characters and remove everything after the last hyphen and before the extension?
I'm trying to filter out strings in file names that appear in a for loop
if search == "List":
    onlyfiles = [f for f in listdir("path") if isfile(join("path", f))]
    for i in onlyfiles:
        print(i)
Now it outputs all the filenames, as expected and wanted, but I want to filter out the .json at the end of each name, as well as a few other elements in the name, so that I just see the file name.
For example: filename-IDENTIFIER.json
I want to filter out "-IDENTIFIER.json" out from the for loop's output
Thanks for any help
There are a few approaches here, based on how much your data can vary:
So let's try to build a get_filename(f) method
Quick and dirty
If you know that f always ends in exactly the same way, you can simply remove those characters. Here we have to remove the last 16 characters (the length of "-IDENTIFIER.json"). It's useful to know that in Python a string can be treated as an (immutable) sequence of characters, so you can use slicing on it as well.
def get_filename(f: str):
    return f[:-16]
This will however fail if the Identifier or suffix changes in length.
Varying lengths
If the suffix varies in length, you should split the string on a fixed delimiter and return the relevant part. In this case you want to split on -.
def get_filename(f: str):
    return f.split("-")[0]
Note however that this will fail if the filename also contains a -.
You can fix that by dropping the last part and rejoining all the earlier pieces, in the following way.
def get_filename(f: str):
    return "-".join(f.split("-")[:-1])
Using regexes to match the format
The most general approach would be to use python regexes to select the relevant part. These allow you to very specifically target a specific pattern. The exact regex that you'll need will depend on the complexity of your strings.
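As a rough illustration of the regex approach, the sketch below assumes the identifier is the last dash-separated part, contains no dash or dot itself, and is followed by `.json`. The pattern is an assumption about the file format, not something stated in the question.

```python
import re

# Assumed format: <name>-<identifier>.json, where the identifier has no dash or dot.
SUFFIX = re.compile(r'-[^-.]+\.json$')

def get_filename(f: str) -> str:
    """Return f without its trailing '-IDENTIFIER.json' part, if present."""
    return SUFFIX.sub('', f)

print(get_filename('filename-IDENTIFIER.json'))
print(get_filename('my-file-IDENTIFIER.json'))
```

Because the pattern is anchored to the end of the string, dashes earlier in the name are left alone, and names that do not match are returned unchanged.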
Split the string on "-" and get the first element:
filename = f.split("-")[0]
This will break if the filename itself contains "-", though.
This should work:
i.split('-')[0].split('.')[0]
Case 1: filename-IDENTIFIER.json
It takes the substring before the dash, so output will become filename
Case 2: filename.json
There is no dash in the string, so the first split does nothing (full string will be in the 0th element), then it takes the substring before the point. Output will be filename
Case 3: filename
Nothing to split, output will be filename
If it's always .json and -IDENTIFIER, then it's safer to use this:
i.split('-IDENTIFIER')[0].split('.json')[0]
Case 4: filename-blabla.json
If the filename has an extra dash in it, it won't be a problem, output will be filename-blabla
I have a whole set of files (10.000+) that include the date and time in the filename. The problem is that the date and time are not zero padded, causing problems with sorting.
The filenames are in the format: output 5-11-2018 9h0m.xml
What I would like is it to be in the format: output 05-11-2018 09h00m.xml
I've searched for different solutions, but most seem to split the string and then recombine the pieces. That seems pretty cumbersome, since in my case the day, month, hour and minute would each need to be separated, padded and then recombined.
I thought regex might give me some better solution, but I can't quite figure it out.
I've edited my original code based on the suggestion of Wiktor Stribiżew that you can't use regex in the replacement and to use groups instead:
import os
import glob
import re
old_format = 'output [1-9]-11-2018 [1-2]?[1-9]h[0-9]m.xml'
dir = r'D:\Gebruikers\<user>\Documents\datatest'
old_pattern = re.compile(r'([1-9])-11-2018 ([1-2][1-9])h([0-9])m')
filelist = glob.glob(os.path.join(dir, old_format))
for file in filelist:
    print(file)
    newfile = re.sub(old_pattern, r'0\1-11-2018 \2h0\3m', file)
    os.rename(file, newfile)
But this still doesn't function completely as I would like, since it wouldn't change hours under 10. What else could I try?
You can pad the numbers in your file names with .zfill(2) using a lambda expression passed as the replacement argument to the re.sub method.
Also, fix the regex pattern to allow 1 or 2 digits: (3[01]|[12][0-9]|0?[1-9]) for a date, (2[0-3]|[10]?\d) for an hour (24h), and ([0-5]?[0-9]) for minutes:
old_pattern = re.compile(r'\b(3[01]|[12][0-9]|0?[1-9])-11-2018 (2[0-3]|[10]?\d)h([0-5]?[0-9])m')
Then use:
for file in filelist:
    newfile = re.sub(old_pattern, lambda x: '{}-11-2018 {}h{}m'.format(x.group(1).zfill(2), x.group(2).zfill(2), x.group(3).zfill(2)), file)
    os.rename(file, newfile)
See Python re.sub docs:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
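To try the callable replacement without touching any files, the pattern and the padding step can be wrapped in small functions. This is a sketch under the question's assumptions (month and year fixed at 11-2018); `pad_match` and `pad_name` are hypothetical names introduced here for illustration.

```python
import re

# Day, hour and minute may each be one or two digits; month and year are fixed as in the question.
old_pattern = re.compile(
    r'\b(3[01]|[12][0-9]|0?[1-9])-11-2018 (2[0-3]|[10]?\d)h([0-5]?[0-9])m'
)

def pad_match(m):
    """Rebuild the matched date/time with every captured field padded to two digits."""
    day, hour, minute = (g.zfill(2) for g in m.groups())
    return '{}-11-2018 {}h{}m'.format(day, hour, minute)

def pad_name(name):
    """Apply the padding to one filename; names that do not match are returned unchanged."""
    return old_pattern.sub(pad_match, name)

print(pad_name('output 5-11-2018 9h0m.xml'))
```

Checking the new names this way before calling os.rename makes it easier to spot a pattern that matches too much or too little.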
I suggest a more generic old_pattern for simplicity, assuming digits are the only thing wrong with your filenames.
A regex that explicitly lists every combination (a single-digit field in one position, double digits in the others) would be long. This much simpler one matches any filename containing at least one single-digit field, i.e. non-digit, digit, non-digit. It casts a wider net in exchange for being easier to write and read at a glance, so it assumes the directory contains only files of this type. Note that it is a regular expression, so it has to be applied with the re module rather than passed to glob:
old_format = r'output.*\D\d\D.*\.xml'
The fixing re.sub statement could then be:
# Lookarounds keep the neighbouring characters out of the match, so adjacent fields such as 9h0m are both padded
newfile = re.sub(r'(?<!\d)(\d)(?=[hm-])', lambda x: x.group(1).zfill(2), file)
This would also catch unicode non-ascii digits unless the appropriate re module flags are set.
If the year (2018 in the example) might be given as just '18', it would need special handling, possibly as a separate case, along with adding a space to the character class in the re.sub pattern (i.e. [-hm ]).
I am creating a regex that matches a web url that ends in a filename with an image extension. The base url, everything before the filename, will be dynamic. Here's what I got:
import re
text = 'google.com/dsadasd/dsd.jpg'
dynamic_url = 'google.com/dsadasd'
regex = '{}/(.*)(.gif|.jpg|.jpeg|.tiff|.png)'.format(dynamic_url)
re.search(regex, text)
This works, but it also matches the following URL, which should fail:
text = 'google.com/dsadasd/.jpg'
It should only match if there is a filename for the image file. Any way to account for this?
If you see improvements to this approach that would let the regular expression handle other edge cases I missed, feel free to say so. Alternative approaches that do not rely on regex are appreciated as well (maybe a URL parse?). The two most important things to me are performance and clarity (speed foremost).
You may also directly apply os.path.splitext():
In [1]: import os
In [2]: text = 'google.com/dsadasd/dsd.jpg'
In [3]: _, extension = os.path.splitext(text)
In [4]: extension
Out[4]: '.jpg'
Then, you may check the extension against a set of supported file extensions.
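A minimal sketch of that check follows. The extension set is taken from the question's regex; `is_image_url` is a hypothetical helper, and the rule that the filename must sit directly under the base URL (no further slashes) is an assumption.

```python
import os

# Accepted extensions, copied from the alternation in the question's regex.
IMAGE_EXTENSIONS = {'.gif', '.jpg', '.jpeg', '.tiff', '.png'}

def is_image_url(text, dynamic_url):
    """True if text is dynamic_url + '/' + a non-empty name with an image extension."""
    prefix = dynamic_url + '/'
    if not text.startswith(prefix):
        return False
    filename = text[len(prefix):]
    if '/' in filename:                 # assumption: no nested path segments
        return False
    root, extension = os.path.splitext(filename)
    # splitext treats a leading dot as part of the name, so '.jpg' has no extension here.
    return bool(root) and extension in IMAGE_EXTENSIONS

print(is_image_url('google.com/dsadasd/dsd.jpg', 'google.com/dsadasd'))
print(is_image_url('google.com/dsadasd/.jpg', 'google.com/dsadasd'))
```

Since this only uses string operations and one set lookup, it avoids compiling a new regex for every base URL.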
You could try this: (.*)(\w+)\.(gif|jpg|jpeg|tiff|png). It just adds a check that there is at least one word character before the extension.
What you might do is use anchors to assert the beginning ^ and the end $ of the line, or use a word boundary \b.
To prevent matching for example .jpg right after the forward / slash, you could add a character class and add the characters you want to allow for the filename.
In this example I have added one or more word characters and a hyphen [\w-]+ but you can update that to your requirements
The regex part of your code could look like:
^{}/[\w-]+\.(?:gif|jpg|jpeg|tiff|png)$
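Plugged into Python, the pattern might be used as in the sketch below. Passing the base URL through re.escape is an addition not present in the question's code; it keeps the dots in the URL from acting as wildcards. `build_image_regex` is a hypothetical helper name.

```python
import re

def build_image_regex(dynamic_url):
    """Compile the anchored pattern for one base URL."""
    # re.escape stops characters such as '.' in the base URL from acting as regex operators.
    return re.compile(
        r'^{}/[\w-]+\.(?:gif|jpg|jpeg|tiff|png)$'.format(re.escape(dynamic_url))
    )

regex = build_image_regex('google.com/dsadasd')
print(bool(regex.search('google.com/dsadasd/dsd.jpg')))
print(bool(regex.search('google.com/dsadasd/.jpg')))
```

If the same base URL is checked many times, compiling the pattern once and reusing it avoids rebuilding it on every call.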
I have a filename like "Planning_Group_20180108.ind" and I only want Planning_Group out of it. The file name can also be like Soldto_20180108; in that case the output should be just Soldto.
A solution without regex is preferable, as it is easier to read for someone who hasn't used regex yet.
The following should work for you
s="Planning_Group_20180108.ind"
'_'.join(s.split('_')[:-1])
This way you create a list which is the string split at the _. With the [:-1] you remove the last part. '_'.join() combines your list elements in the resulting list.
First I would extract filename itself. I'd split it from the extension.
The easy way is:
path = "Planning_Group_20180108.ind"
filename, ext = path.split(".")
This assumes that path is only a filename and an extension. To stay safe and platform independent, I'd use the os module for that:
import os
fullpath = "this/could/be/a/full/path/Planning_Group_20180108.ind"
path, filename = os.path.split(fullpath)
And then extract "root" and extension:
root, ext = os.path.splitext(filename)
That should leave me with Planning_Group_20180108 as root.
To discard "_20180108" we need to split string by "_" delimiter, going from the right end, and do it only once. I would use .rsplit() method of string, which lets me specify delimiter, and number of times I want to make splits.
what_i_want, the_rest = root.rsplit("_", 1)
what_i_want should contain the left part of Planning_Group_20180108, cut at the first "_" counting from the right, so it should be Planning_Group.
The more compact way of writing the same, but not that easy to read, would be:
what_i_want = os.path.splitext(os.path.split("/my/path/to/Planning_Group_20180108.ind")[1])[0].rsplit("_", 1)[0]
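The same steps can also be wrapped in a small function, which keeps them readable and copes with names that contain no underscore. This is a sketch; `strip_trailing_field` is a hypothetical name, and it uses os.path.basename, which returns the second element of os.path.split.

```python
import os

def strip_trailing_field(fullpath, sep="_"):
    """Return the base name without its extension and without the last sep-separated field."""
    filename = os.path.basename(fullpath)      # drop any directory part
    root, _ext = os.path.splitext(filename)    # drop the extension
    return root.rsplit(sep, 1)[0]              # cut once, starting from the right

print(strip_trailing_field("this/could/be/a/full/path/Planning_Group_20180108.ind"))
print(strip_trailing_field("Soldto_20180108"))
```

When the separator is absent, rsplit returns a one-element list, so the whole root comes back unchanged.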
PS.
You may skip the step that extracts root and extension if you're sure that the extension will not contain an underscore. If you're unsure, that step is necessary. You also need to think about the case of multiple extensions, like /path/to/file/which_has_a.lot.of.periods.and_extentions. In that case, would you like to get which_has_a.lot.of.periods.and, or which_has?
Think about it while planning your app. If you need the latter, you may want to extract root with filename.split(".", 1) instead of using os.path.splitext().
reference:
os.path.split(path),
os.path.splitext(path)
str.rsplit(sep=None, maxsplit=-1)
print("Planning_Group_20180108.ind".rsplit("_", 1)[0])
print("Soldto_20180108".rsplit("_", 1)[0])
rsplit lets you split a given number of times starting from the end, wherever "_" is found. In your case, it splits the string into a list of two strings, ["Planning_Group", "20180108.ind"], and you just need to take the first element, [0] (http://python-reference.readthedocs.io/en/latest/docs/str/rsplit.html).
You can use re:
import re
s = ["Planning_Group_20180108.ind", 'Soldto_20180108']
new_s = list(map(lambda x:re.findall('[a-zA-Z_]+(?=_\d)', x)[0], s))
Output:
['Planning_Group', 'Soldto']
Using regex here is pretty pythonic.
import re
newname = re.sub(r'_[0-9]+', '', 'Planning_Group_20180108.ind')
Results in:
'Planning_Group.ind'
And the same regex produces 'Soldto' from 'Soldto_20180108'.
(First post... very new to programming)
I needed to rename a bunch of files from 'This is a filename-123456.ext' to '123456-This is a filename.ext'
I managed to solve the problem using Python with the code below. I had to make 2 scripts because sometimes there are 5 numbers, but for the most part 6.
import os
for filename in os.listdir('.'): #not sure how to rename recursive sub-directories
    if filename != 'ren6.py': #included to not rename the script file
        start = filename[:-11]
        number = filename[-10:-4]
        ext = filename[-4:]
        newname = str(number) + '-' + str(start)+str(ext) #Unnecessary variable creation?
        os.rename(filename,newname)
I'm still learning and very curious about more efficient and elegant ways to accomplish the same thing.
It may be safer and more powerful to use regular expressions. This will only rename files that match the given pattern, which is [ANY SEQUENCE OF CHARACTERS][A DASH][NUMBERS][EXTENSION]
An added benefit to using this method is that you can run it multiple times on the same directory and it won't affect already renamed files.
You might also want to do a check to make sure the file you're renaming it to doesn't already exist (so that you don't overwrite an existing file).
import os
import re
for filename in os.listdir('.'):
    m = re.match(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$', filename)
    if m:
        newname = '{num}-{name}{ext}'.format(**m.groupdict())
        if not os.path.exists(newname):
            os.rename(filename, newname)
I'll break down the regular expression
^(?P<name>.+)
The ^ indicates we will start matching at the beginning of the filename (as opposed to matching a middle part of the filename). The () make this a regex group, so that we can access just that one part of the string match. The ?P<name> is just a way to apply a label to a particular group, so that we can refer to it by name later on. In this case, we've given this group a label of name.
. will match any character, and + tells it to match 1 or more characters.
-
This will only match the - character
(?P<num>\d+)
Again, we've made this a group and given it a label of num. \d will only match digits and the + means it will match 1 or more of them.
(?P<ext>\.\w+)$
Another group, another label. The \. will only match a . and the \w will match word characters (i.e. letters, numbers, underscores). Again, the + means it will match 1 or more characters. The $ ensures it matches all the way to the end of the string.
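To see the pieces working together without renaming anything, the pattern can be applied to a few sample names. This is a small sketch; `swap` is a hypothetical helper name.

```python
import re

PATTERN = re.compile(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$')

def swap(filename):
    """Return the renamed form, or None when the filename does not match the pattern."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return '{num}-{name}{ext}'.format(**m.groupdict())

print(swap('This is a filename-123456.ext'))   # matches: number moves to the front
print(swap('123456-This is a filename.ext'))   # already renamed: no match
print(swap('ren6.py'))                         # no dash: no match
```

The second call shows why the script can be run repeatedly: once the number is at the front, the name no longer ends in a dash followed by digits.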
How about this?
import os
for filename in os.listdir('.'):
    name, extension = os.path.splitext(filename)
    if '-' not in name:
        continue
    # Split the name (without the extension) at the last dash only
    part1, part2 = name.rsplit('-', 1)
    os.rename(filename, "{0}-{1}{2}".format(part2, part1, extension))
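The question also asks about sub-directories. One possible approach is to walk the tree with os.walk and reuse the regular expression from the earlier answer. The sketch below is an illustration only: `rename_tree` and its `dry_run` flag are hypothetical, and with the default `dry_run=True` nothing is renamed, so the planned changes can be reviewed first.

```python
import os
import re

PATTERN = re.compile(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$')

def rename_tree(top, dry_run=True):
    """Walk top recursively and move the trailing number to the front of matching names.

    Returns the list of (old_path, new_path) pairs; files are only renamed
    when dry_run is False.
    """
    planned = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for filename in filenames:
            m = PATTERN.match(filename)
            if not m:
                continue                      # skip names that do not fit the pattern
            new_name = '{num}-{name}{ext}'.format(**m.groupdict())
            old_path = os.path.join(dirpath, filename)
            new_path = os.path.join(dirpath, new_name)
            if os.path.exists(new_path):
                continue                      # never overwrite an existing file
            planned.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
    return planned
```

Calling `rename_tree('.')` first and printing the returned pairs gives a preview; `rename_tree('.', dry_run=False)` then performs the renames.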