Split a filename on underscore in Python - python

I have a filename such as "Planning_Group_20180108.ind" and I only want Planning_Group out of it. The file name can also look like Soldto_20180108, in which case the output should be just Soldto.
A solution without regex is preferable, as it is easier to read for someone who hasn't used regex yet.

The following should work for you
s="Planning_Group_20180108.ind"
'_'.join(s.split('_')[:-1])
This creates a list by splitting the string at each _. The [:-1] slice removes the last part, and '_'.join() combines the remaining list elements back into a single string.

First I would extract the filename itself, splitting it from the extension.
You can go the easy way by doing:
path = "Planning_Group_20180108.ind"
filename, ext = path.split(".")
This assumes that path is just a filename with a single extension (exactly one dot); otherwise the unpacking fails. If I wanted to stay safe and platform independent, I'd use the os module for that:
import os
fullpath = "this/could/be/a/full/path/Planning_Group_20180108.ind"
path, filename = os.path.split(fullpath)
And then extract "root" and extension:
root, ext = os.path.splitext(filename)
That should leave me with Planning_Group_20180108 as root.
To discard "_20180108" we need to split the string on the "_" delimiter, starting from the right end, and do it only once. I would use the string's .rsplit() method, which lets me specify the delimiter and the maximum number of splits.
what_i_want, the_rest = root.rsplit("_", 1)
what_i_want should contain the left side of Planning_Group_20180108, cut at the first "_" counting from the right, so it should be Planning_Group.
A more compact, but less readable, way of writing the same thing would be:
what_i_want = os.path.splitext(os.path.split("/my/path/to/Planning_Group_20180108.ind")[1])[0].rsplit("_", 1)[0]
PS.
You may skip the part that extracts root and extension if you're sure the extension will not contain an underscore. If you're unsure, this step is necessary. You also need to think about names with multiple extensions, like /path/to/file/which_has_a.lot.of.periods.and_extentions. In that case, would you like to get which_has_a.lot.of.periods.and, or which_has?
Think about it while planning your app. If you need the latter, you may want to extract root with filename.split(".", 1)[0] instead of using os.path.splitext().
reference:
os.path.split(path),
os.path.splitext(path)
str.rsplit(sep=None, maxsplit=-1)
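The steps above can be collected into one small helper. This is only a sketch: the function name strip_trailing_field is made up for illustration, and it assumes that a name without any "_" should be returned unchanged.

```python
import os

def strip_trailing_field(path: str) -> str:
    """Return the file name without directory, extension, or last "_" field."""
    filename = os.path.basename(path)         # drop any directory part
    root, _ext = os.path.splitext(filename)   # drop the final extension, if any
    head, sep, _tail = root.rpartition("_")   # split once at the last "_"
    return head if sep else root              # no "_" found: keep root as is

print(strip_trailing_field("this/could/be/a/full/path/Planning_Group_20180108.ind"))
print(strip_trailing_field("Soldto_20180108"))
```

rpartition is used here instead of rsplit("_", 1) only because it makes the "no underscore" case easy to detect; either method does the splitting itself.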

print("Planning_Group_20180108.ind".rsplit("_", 1)[0])
print("Soldto_20180108".rsplit("_", 1)[0])
rsplit lets you split up to X times, starting from the end, wherever "_" is found. In your case it splits the string into a list of two strings, ["Planning_Group", "20180108.ind"], and you just need to take the first element with [0] (http://python-reference.readthedocs.io/en/latest/docs/str/rsplit.html).
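One difference between rsplit and the split/join approach shows up when a name has no underscore at all. A small comparison, where the third name is a hypothetical input that is not in the question:

```python
names = ["Planning_Group_20180108.ind", "Soldto_20180108", "Soldto"]

results = {}
for name in names:
    via_join = "_".join(name.split("_")[:-1])  # drops the last piece, even if it is the only one
    via_rsplit = name.rsplit("_", 1)[0]        # leaves the name alone when there is no "_"
    results[name] = (via_join, via_rsplit)
    print(repr(name), repr(via_join), repr(via_rsplit))
```

For "Soldto" the join version yields an empty string, while the rsplit version returns "Soldto" unchanged.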

You can use re:
import re
s = ["Planning_Group_20180108.ind", 'Soldto_20180108']
new_s = list(map(lambda x: re.findall(r'[a-zA-Z_]+(?=_\d)', x)[0], s))
Output:
['Planning_Group', 'Soldto']

Using regex here is pretty pythonic.
import re
newname = re.sub(r'_[0-9]+', '', 'Planning_Group_20180108.ind')
Results in:
'Planning_Group.ind'
And the same regex produces 'Soldto' from 'Soldto_20180108'.
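Note that the pattern above removes every "_digits" run and keeps the extension. If the goal is to drop only a trailing date together with the extension, an anchored pattern is one option. This sketch assumes the date is always exactly 8 digits; the names DATE_SUFFIX and drop_date_suffix are illustrative only.

```python
import re

# Assumption: the part to drop is "_" plus 8 digits, optionally followed by one extension.
DATE_SUFFIX = re.compile(r"_\d{8}(?:\.\w+)?$")

def drop_date_suffix(filename: str) -> str:
    """Remove a trailing "_YYYYMMDD" (and extension) from the end of the name."""
    return DATE_SUFFIX.sub("", filename)

print(drop_date_suffix("Planning_Group_20180108.ind"))
print(drop_date_suffix("Soldto_20180108"))
```

Because the pattern is anchored with $, digits elsewhere in the name are left alone.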

Related

simplest way to rename multiple files in a folder?

I have files with names like the ones below, and I need to change them to the format on the right side.
CK-123443-1.dft - CK-123443.dft
CK-123344-A.dft - CK-123344.dft
123322-B.dft - 123322.dft
I tried using split('-'), but this does not work for all files because some names have two hyphens and some have one. Is there another solution for this problem?
My code with re (I am not sure about the regular expression):
import re
new = re.sub('-', '.', old)
If you are sure that every filename in the directory has a hyphen that needs to be removed, you can split at hyphens and only exclude the last split part.
So, something like this:
name, ext = file_name.split('.') # Set the 'dft' part aside
new_name = '-'.join(name.split('-')[:-1]) + f'.{ext}' # Rejoin with '-' so 'CK-123443-1' becomes 'CK-123443'
SOLUTION:
import re
# assuming that your file name is stored in file_name
new_name = re.sub(r'-[A-Za-z0-9]\.', '.', file_name)
# this removes the hyphen and the single character between the last hyphen and the extension
I think you can easily do it using regex - before I tell you what pattern to match though, I need some more clarity on how you want the name to change - do you want to maintain any leading alphabet characters and remove all trailing characters after a hyphen and before the extension?
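A regex-free sketch of the same idea, assuming that everything after the last hyphen (up to the extension) should be dropped, whatever its length. The helper name drop_last_hyphen_part is hypothetical.

```python
import os

def drop_last_hyphen_part(file_name: str) -> str:
    """Remove the text after the last hyphen, keeping the extension."""
    stem, ext = os.path.splitext(file_name)   # 'CK-123443-1', '.dft'
    head, sep, _tail = stem.rpartition("-")   # split once at the last hyphen
    return (head if sep else stem) + ext      # names without a hyphen stay unchanged

for old in ["CK-123443-1.dft", "CK-123344-A.dft", "123322-B.dft"]:
    print(old, "->", drop_last_hyphen_part(old))
```

To rename files on disk, the returned name could be passed to os.rename for each entry of os.listdir, ideally after checking that the target does not already exist.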

Extracting the suffix of a filename in Python

I'm using Python to create HTML links from a listing of filenames.
The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf.
They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf
My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results with just: lyrics
I have the following Python code that reaches my goal, but it seems like I'm doing it the hard way.
Is there a more efficient way to do this?
testString = 'file1_with_extra_underscores_lead.pdf'
#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lead.pdf')
#Step 2: Separate the suffix and .pdf extension.
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')
#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead
I'm thinking what I'm trying to do is possible with much fewer lines of code, and not having to set HTMLtext multiple times as I'm doing now.
You can use Path from pathlib to extract the final path component without its suffix:
from pathlib import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]
Output:
'lead'
As @wwii said in their comment, you should use os.path.splitext, which is designed to separate filenames from their extensions, and str.split/str.rsplit, which are designed to cut strings at a character. Using those functions, there are several ways to achieve what you want.
Unlike @wwii, I would start by discarding the extension:
import os
test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename) # 'file1_with_extra_underscores_lead'
Then I would use split or rsplit, either with the maxsplit argument or without it, and select the last element (or the second one, depending on the method used) of the resulting list. All of the following lines are equivalent (in terms of functionality, at least):
filename.split('_')[-1] # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1] # same as the previous line, except that it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1] # splits only once from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1] # same as the previous line, except that it selects the second chunk (which is the last, since only one split occurred)
The best is probably one of the last two solutions, since they do not perform useless splits.
Why is this answer better than others? (In my opinion, at least.)
Using pathlib is fine but a bit overkill for separating a filename from its extension; os.path.splitext could be more efficient.
Using a slice with rfind works, but it does not clearly express the intention of the code and it is not very readable.
Using endswith('.pdf') is OK if you are sure you will never use anything other than PDF. If one day you use a .txt, you will have to rework your code.
I love regex, but in this case it suffers from the same caveats as the two previously discussed solutions: no clear intention, not very readable, and you will have to rework it if one day you use another extension.
Using splitext clearly indicates that you do something with the extension, and the first item selection is quite explicit. This will still work with any other extension.
Using rsplit('_', maxsplit=1) and selecting the last index is also quite expressive and far clearer than an arbitrary-looking slice.
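Put together, the two steps fit in a short function. A minimal sketch; the name last_field is made up here, and a name without any underscore simply comes back as the whole stem.

```python
import os

def last_field(filename: str) -> str:
    """Return the text after the last underscore, without the extension."""
    stem = os.path.splitext(filename)[0]      # works for .pdf, .txt, or any other extension
    return stem.rsplit("_", maxsplit=1)[-1]   # one split from the right, keep the last chunk

for name in ["song1_lead.pdf", "song1_lyrics.pdf", "song2_with_extra_underscores_vocals.pdf"]:
    print(name, "->", last_field(name))
```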
This should do fine:
testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]
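But there is no error checking here. If there is no "_" in the string, rfind returns -1 and the slice silently returns the whole name minus its last four characters instead of raising an error.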
You could use a regex as well. That shouldn't be difficult.
Basically I wouldn't do it this way myself. It's better to do some exception handling unless you are 100% sure that there is no need for exception handling.
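One possible shape for that error handling, as a sketch only. The function name suffix_or_error and the choice of ValueError are assumptions for illustration, not something from the answers above.

```python
def suffix_or_error(filename: str, extension: str = ".pdf") -> str:
    """Return the text between the last "_" and the extension, or raise ValueError."""
    if not filename.lower().endswith(extension.lower()):
        raise ValueError(f"{filename!r} does not end with {extension!r}")
    stem = filename[:-len(extension)]          # strip the extension
    _head, sep, tail = stem.rpartition("_")    # split once at the last "_"
    if not sep or not tail:
        raise ValueError(f"{filename!r} has no text after a final underscore")
    return tail

print(suffix_or_error("file1_with_extra_underscores_lead.pdf"))
```

A caller can then wrap the call in try/except ValueError and decide whether to skip or report the offending file.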
This will work with "..._lead.pdf" or "..._lead.pDf":
import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search(r'_([^_]+)\.pdf$', testString, flags=re.I)
print(m.group(1) if m else "No match")

Trying to find a way to filter out parts of a string python

I'm trying to filter out strings in file names that appear in a for loop
if search == "List":
    onlyfiles = [f for f in listdir("path") if isfile(join("path", f))]
    for i in onlyfiles:
        print(i)
Now it outputs all the filenames, as expected and wanted, but I want to filter out the .json at the end of each name, as well as a few other elements of the name, so that I only see the file name itself.
For example: filename-IDENTIFIER.json
I want to filter out "-IDENTIFIER.json" out from the for loop's output
Thanks for any help
There are a few approaches here, based on how much your data can vary:
So let's try to build a get_filename(f) function.
Quick and dirty
If you know that f always ends in exactly the same way, then you can directly remove those characters. Here we have to remove the last 16 characters. It's useful to know that in Python a string can be treated as an (immutable) sequence of characters, so you can slice it just like a list.
def get_filename(f: str) -> str:
    return f[:-16]
This will however fail if the Identifier or suffix changes in length.
Varying lengths
If the suffix varies in length, then you should split the string on a fixed delimiter and return the relevant part. In this case you want to split on -.
def get_filename(f: str) -> str:
    return f.split("-")[0]
Note however that this will fail if the filename also contains a -.
You can fix that by dropping the last part and rejoining all the earlier pieces, in the following way.
def get_filename(f: str) -> str:
    return "-".join(f.split("-")[:-1])
Using regexes to match the format
The most general approach would be to use python regexes to select the relevant part. These allow you to very specifically target a specific pattern. The exact regex that you'll need will depend on the complexity of your strings.
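As a rough illustration of the regex route, assuming names look like "<name>-<IDENTIFIER>.json" and that the identifier itself contains no "-" or "." (both assumptions, since the real naming scheme is not shown):

```python
import re

# Assumption: "<name>-<IDENTIFIER>.json", identifier without "-" or "."
PATTERN = re.compile(r"^(?P<name>.+)-[^-.]+\.json$")

def get_filename(f: str) -> str:
    """Return the part before the final "-IDENTIFIER.json", or f unchanged if it does not match."""
    m = PATTERN.match(f)
    return m.group("name") if m else f

print(get_filename("filename-IDENTIFIER.json"))
print(get_filename("my-file-ABC123.json"))
```

Because .+ is greedy, only the last dash-separated piece is treated as the identifier, so dashes inside the name survive.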
Split the string on "-" and get the first element:
filename = f.split("-")[0]
This will give the wrong result if the filename itself contains "-", though.
This should work:
i.split('-')[0].split('.')[0]
Case 1: filename-IDENTIFIER.json
It takes the substring before the dash, so output will become filename
Case 2: filename.json
There is no dash in the string, so the first split does nothing (full string will be in the 0th element), then it takes the substring before the point. Output will be filename
Case 3: filename
Nothing to split, output will be filename
If it's always .json and -IDENTIFIER, then it's safer to use this:
i.split('-IDENTIFIER')[0].split('.json')[0]
Case 4: filename-blabla.json
If the filename has an extra dash in it, it won't be a problem, output will be filename-blabla

(Beginner) Rename file using Python - Refactor for a better answer

(First post... very new to programming)
I needed to rename a bunch of files from 'This is a filename-123456.ext' to '123456-This is a filename.ext'
I managed to solve the problem using Python with the code below. I had to make 2 scripts because sometimes there are 5 numbers, but for the most part 6.
import os
for filename in os.listdir('.'): #not sure how to rename recursive sub-directories
    if filename != 'ren6.py': #included to not rename the script file
        start = filename[:-11]
        number = filename[-10:-4]
        ext = filename[-4:]
        newname = str(number) + '-' + str(start)+str(ext) #Unnecessary variable creation?
        os.rename(filename,newname)
I'm still learning and very curious of more efficient and elegant examples of to accomplish the same thing.
It may be safer and more powerful to use regular expressions. This will only rename files that match the given pattern, which is [ANY SEQUENCE OF CHARACTERS][A DASH][NUMBERS][EXTENSION]
An added benefit to using this method is that you can run it multiple times on the same directory and it won't affect already renamed files.
You might also want to do a check to make sure the file you're renaming it to doesn't already exist (so that you don't overwrite an existing file).
import os
import re
for filename in os.listdir('.'):
    m = re.match(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$', filename)
    if m:
        newname = '{num}-{name}{ext}'.format(**m.groupdict())
        if not os.path.exists(newname):
            os.rename(filename, newname)
I'll break down the regular expression
^(?P<name>.+)
The ^ indicates we will start matching at the beginning of the filename (as opposed to matching a middle part of the filename). The () make this a regex group, so that we can access just that one part of the string match. The ?P<name> is just a way to apply a label to a particular group, so that we can refer to it by name later on. In this case, we've given this group a label of name.
. will match any character, and + tells it to match 1 or more characters.
-
This will only match the - character
(?P<num>\d+)
Again, we've made this a group and given it a label of num. \d will only match digits and the + means it will match 1 or more of them.
(?P<ext>\.\w+)$
Another group, another label. The \. will only match a . and the \w will match word characters (i.e. letters, numbers, underscores). Again, the + means it will match 1 or more characters. The $ ensures it matches all the way to the end of the string.
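To see the groups in action without touching any files, the pattern can be tried on sample names from the question:

```python
import re

PATTERN = re.compile(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$')

m = PATTERN.match('This is a filename-123456.ext')
groups = m.groupdict()
print(groups)  # the three named groups as a dict
print('{num}-{name}{ext}'.format(**groups))

# An already renamed file no longer matches, so a second run leaves it alone.
already_renamed = PATTERN.match('123456-This is a filename.ext')
print(already_renamed)
```

The last call returns None because the text after the dash is no longer a run of digits followed by the extension.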
How about this?
import os
for filename in os.listdir('.'):
    name, extension = os.path.splitext(filename)
    if '-' not in name:
        continue
    # split the name (without extension) once at the last dash
    part1, part2 = name.rsplit('-', 1)
    os.rename(filename, "{0}-{1}{2}".format(part2, part1, extension))
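The question also mentions sub-directories. One way to cover them is os.walk, sketched below with the regex from the earlier answer. The function names plan_renames and rename_tree and the dry_run flag are made up for this example; nothing is renamed unless dry_run=False is passed.

```python
import os
import re

PATTERN = re.compile(r'^(?P<name>.+)-(?P<num>\d+)(?P<ext>\.\w+)$')

def plan_renames(root: str):
    """Yield (old_path, new_path) for every matching file under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            m = PATTERN.match(filename)
            if m:
                newname = '{num}-{name}{ext}'.format(**m.groupdict())
                yield os.path.join(dirpath, filename), os.path.join(dirpath, newname)

def rename_tree(root: str, dry_run: bool = True):
    """Rename matching files under root; with dry_run=True only report what would change."""
    done = []
    for old, new in list(plan_renames(root)):  # collect first, then rename
        if os.path.exists(new):                # never overwrite an existing file
            continue
        if not dry_run:
            os.rename(old, new)
        done.append((old, new))
    return done

# Example: for old, new in rename_tree('.'): print(old, '->', new)
```

Running it once in dry-run mode and reading the reported pairs before renaming for real is a cheap safeguard.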

split twice in the same expression?

Imagine I have the following:
inFile = "/adda/adas/sdas/hello.txt"
# that instruction gives me hello.txt
Name = inFile.split("/")[-1]
# that one gives me the name I want - just hello
Name1 = Name.split(".")[0]
Is there a way to simplify this and do the same job in just one expression?
You can get what you want platform independently by using os.path.basename to get the last part of a path and then use os.path.splitext to get the filename without extension.
from os.path import basename, splitext
pathname = "/adda/adas/sdas/hello.txt"
name, extension = splitext(basename(pathname))
print(name) # --> "hello"
Using os.path.basename and os.path.splitext instead of str.split or re.split is more proper (and therefore received more points than any other answer) because it does not break down on other platforms that use different path separators (you would be surprised how varied these can be).
It also carries the most points because it answers your "one line" question precisely and is aesthetically more pleasing than your example (even though that is debatable, as are all questions of taste).
Answering the question in the topic rather than trying to analyze the example...
You really want to use Florian's solution if you want to split paths, but if you promise not to use this for path parsing...
You can use re.split() to split on several separators by or-ing them with a '|'. Have a look at this:
import re
inFile = "/adda/adas/sdas/hello.txt"
print(re.split(r'\.|/', inFile)[-2])
>>> inFile = "/adda/adas/sdas/hello.txt"
>>> inFile.split('/')[-1]
'hello.txt'
>>> inFile.split('/')[-1].split('.')[0]
'hello'
If it is always going to be a path like the one above, you can use os.path.split and os.path.splitext.
The following example will print just hello:
from os.path import split, splitext
path = "/adda/adas/sdas/hello.txt"
print(splitext(split(path)[1])[0])
For more info see https://docs.python.org/library/os.path.html
I'm pretty sure some Regex-Ninja*, would give you a more or less sane way to do that (or as I now see others have posted: ways to write two expressions on one line...)
But I'm wondering why you want to split it with just one expression?
For such a simple split, it's probably faster to do two splits than to create some advanced either-or logic. If you split twice, it's safer too:
I guess you want to separate the path, the file name and the file extension. If you split on '/' first, you know the filename should be in the last list index; then you can try to split just that last element to see whether you can find a file extension. Then you don't need to care whether there are dots in the path names.
*(Any sane users of regular expressions, should not be offended. ;)
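A small comparison of why the order of the splits matters. The path below is a hypothetical one, chosen to have a dot in a directory name and two extensions:

```python
import os

path = "/adda/ad.as/sdas/hello.tar.gz"  # hypothetical path, not from the question

last = path.split("/")[-1]              # step 1: file name only -> "hello.tar.gz"
first_dot = last.split(".")[0]          # step 2a: cut at the first dot
last_dot = os.path.splitext(last)[0]    # step 2b: cut at the last dot only
naive = path.split(".")[0]              # single split on the whole path

print(first_dot)  # hello
print(last_dot)   # hello.tar
print(naive)      # /adda/ad
```

Splitting on "/" first keeps dots in directory names out of the picture; whether to cut at the first or the last dot then depends on how multi-part extensions should be treated.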
