Getting file names without file extensions with glob - python

I'm searching for .txt files only
from glob import glob
result = glob('*.txt')
>>> result
['text1.txt','text2.txt','text3.txt']
but I'd like result without the file extensions
>>> result
['text1','text2','text3']
Is there a regex pattern that I can use with glob to exclude the file extensions from the output, or do I have to use a list comprehension on result?

There is no way to do that with glob(). You need to take the list it gives you and build a new one with the extensions stripped:
import os
from glob import glob
[os.path.splitext(val)[0] for val in glob('*.txt')]
os.path.splitext(val) splits each filename into a (root, extension) pair; the [0] keeps just the root, i.e. the filename without its extension.

Since you’re trying to split off a filename extension, not split an arbitrary string, it makes more sense to use os.path.splitext (or the pathlib module). While it’s true that it makes no practical difference on the only platforms that currently matter (Windows and *nix), it’s still conceptually clearer what you’re doing. (And if you later start using path-like objects instead of strings, it will continue to work unchanged, to boot.)
So:
paths = [os.path.splitext(path)[0] for path in paths]
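For instance, a minimal pathlib sketch of the same idea (Python 3.4+), globbing the current directory as in the question:
from pathlib import Path

# Path.stem is the filename without its final suffix ('text1.txt' -> 'text1')
result = [p.stem for p in Path('.').glob('*.txt')]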
Meanwhile, if this really offends you for some reason: what glob does under the covers is just call fnmatch to turn your glob expression into a regular expression and then apply it to all of the filenames. So you can replace it by writing the regex yourself and using a capture group:
import os
import re

rtxt = re.compile(r'(.*?)\.txt')
files = (rtxt.match(file) for file in os.listdir(dirpath))
files = [match.group(1) for match in files if match]
This way, you’re not doing a listcomp on top of the one that’s already in glob; you’re doing one instead of the one that’s already in glob. I’m not sure if that’s a useful win or not, but since you seem to be interested in eliminating a listcomp…

This glob only selects files without an extension: **/*/!(*.*)
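Note that !( ) is shell "extglob" syntax, which Python's glob module doesn't support. If the goal is to pick out extensionless files from Python, a rough sketch (Python 3.5+ for the recursive **) might look like:
import glob
import os

# keep only regular files whose splitext() extension is empty
no_ext = [p for p in glob.glob('**/*', recursive=True)
          if os.path.isfile(p) and not os.path.splitext(p)[1]]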

Use index slicing (this assumes every extension is exactly four characters long, like .txt):
result = [i[:-4] for i in result]

Another way using rsplit:
>>> result = ['text1.txt','text2.txt.txt','text3.txt']
>>> [x.rsplit('.txt', 1)[0] for x in result]
['text1', 'text2.txt', 'text3']
You could do it as a list comprehension:
result = [x.rsplit(".txt", 1)[0] for x in glob('*.txt')]

Use str.split:
>>> result = [r.split('.')[0] for r in glob('*.txt')]
>>> result
['text1', 'text2', 'text3']
(Note that split('.')[0] cuts at the first dot, so a name like 'text2.txt.txt' becomes just 'text2', unlike the rsplit version above.)

Related

Python 3.6 - enumerate files

I am trying to loop over a series of jpg files in a folder. I found this example code:
for n, image_file in enumerate(os.scandir(image_folder)):
which loops through the image files in image_folder. However, it doesn't seem to follow any sequence. My files are named 000001.jpg, 000002.jpg, 000003.jpg, ... and so on, but when the code runs, it does not follow that sequence:
000213.jpg
000012.jpg
000672.jpg
....
What seems to be the issue here?
Here's the relevant bit on os.scandir():
os.scandir(path='.')
Return an iterator of os.DirEntry objects
corresponding to the entries in the directory given by path. The
entries are yielded in arbitrary order, and the special entries '.'
and '..' are not included.
You should not expect it to be in any particular order. The same goes for listdir() if you were considering this as an alternative.
If you strictly need them to be in order, consider sorting them first:
scanned = sorted(os.scandir(image_folder), key=lambda f: f.name)
for n, image_file in enumerate(scanned):
    # ... rest of your code
I prefer to use glob:
The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell, although results are
returned in arbitrary order. No tilde expansion is done, but *, ?, and
character ranges expressed with [] will be correctly matched.
You will need this if you handle more complex file structures, so starting with glob isn't a bad idea. For your case you can also use os.scandir() as mentioned above.
Reference: glob module
import glob
files = sorted(glob.glob(r"C:\Users\Fabian\Desktop\stack\img\*.jpg"))
for key, myfile in enumerate(files):
    print(key, myfile)
Notice that even if there are other files, like .txt files, they won't end up in your list.
Output:
C:\Users\Fabian\Desktop\stack>python c:/Users/Fabian/Desktop/stack/img.py
0 C:\Users\Fabian\Desktop\stack\img\img0001.jpg
1 C:\Users\Fabian\Desktop\stack\img\img0002.jpg
2 C:\Users\Fabian\Desktop\stack\img\img0003.jpg
....

Grab part of filename with Python

Newbie here.
I've just been working with Python/coding for a few days, but I want to create a script that grabs the parts of filenames matching a certain pattern and outputs them to a text file.
So in my case, let's say I have four .pdf files like this:
aaa_ID_8423.pdf
bbbb_ID_8852.pdf
ccccc_ID_7413.pdf
dddddd_ID_4421.pdf
(Note that they are of variable length.)
I want the script to go through these filenames, grab the string after "ID_" and before the filename extension.
Can you point me in the direction to which Python modules and possibly guides that could assist me?
Here's a simple solution using the re module as mentioned in other answers.
# Libraries
import re
# Example filenames. Use glob as described below to grab your pdf filenames
file_list = ['name_ID_123.pdf','name2_ID_456.pdf'] # glob.glob("*.pdf")
for fname in file_list:
    res = re.findall(r"ID_(\d+)\.pdf", fname)
    if not res:
        continue
    print(res[0])  # You can append the result to a list
And below should be your output. You should be able to adapt this to other patterns.
# Output
123
456
Good luck!
Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):
>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>>
If the numbers are variable length, you'll want the regex module "re"
import re
# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")
pattern.search("abc_ID_8423.pdf").group(1)
Out[23]: '8423'
Regex is generally used to match variable strings. The regex I just wrote says:
Find an underscore ("_"), followed by a variable number of digits ("[0-9]+"), followed by the last period in the string ("\.[^\.]+$")
You can use the os module in python and do a listdir to get a list of filenames present in that path like so:
import os
filenames = os.listdir(path)
Now you can iterate over the filenames list and look for the pattern which you need using regular expressions:
import re
for filename in filenames:
    m = re.search(r'(?<=ID_)\w+', filename)
    if m:
        print(m.group(0))
The above snippet prints the part of each filename that follows ID_, up to the next non-word character; for your example it would print 8423, 4421, and so on, with the .pdf extension already left off.
You probably want to use glob, which is a Python module for file globbing. From the Python help page, the usage is as follows:
>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']

how can I save the output of a search for files matching *.txt to a variable?

I'm fairly new to python. I'd like to save the text that is printed by this script as a variable. (The variable is meant to be written to a file later, if that matters.) How can I do that?
import fnmatch
import os
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
You can store it in a variable like this:
import fnmatch
import os
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
        my_var = file
        # do your stuff
Or you can store it in a list for later use:
import fnmatch
import os
my_match = []
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
        my_match.append(file)  # append adds the value at the end of the list
# do stuff with the my_match list
You can store it in a list:
import fnmatch
import os
matches = []
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        matches.append(file)
Both answers already provided are correct, but Python provides a nice alternative. Since iterating through an array and appending to a list is such a common pattern, the list comprehension was created as a one-stop shop for the process.
import fnmatch
import os
matches = [filename for filename in os.listdir("/Users/x/y") if fnmatch.fnmatch(filename, "*.txt")]
While NSU's answer and the others are all perfectly good, there may be a simpler way to get what you want.
Just as fnmatch tests whether a certain file matches a shell-style wildcard, glob lists all files matching a shell-style wildcard. In fact:
This is done by using the os.listdir() and fnmatch.fnmatch() functions in concert…
So, you can do this:
import glob
matches = glob.glob("/Users/x/y/*.txt")
But notice that in this case, you're going to get full pathnames like '/Users/x/y/spam.txt' rather than just 'spam.txt', which may not be what you want. Often, it's easier to keep the full pathnames around and os.path.basename them when you want to display them, than to keep just the base names around and os.path.join them when you want to open them… but "often" isn't "always".
Also notice that I had to manually paste the "/Users/x/y/" and "*.txt" together into a single string, the way you would at the command line. That's fine here, but if, say, the first one came from a variable, rather than hardcoded into the source, you'd have to use os.path.join(basepath, "*.txt"), which isn't quite as nice.
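For instance, a minimal sketch of that (basepath here is just a stand-in for wherever the directory actually comes from):
import glob
import os

basepath = "/Users/x/y"  # stand-in: this could come from user input, config, etc.
matches = glob.glob(os.path.join(basepath, "*.txt"))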
By the way, if you're using Python 3.4 or later, you can get the same thing out of the higher-level pathlib library:
import pathlib
matches = list(pathlib.Path("/Users/x/y/").glob("*.txt"))
Maybe defining a utility function is the right path to follow...
def list_ext_in_dir(e, d):
    """e=extension, d=directory => list of matching filenames.
    If the directory d cannot be listed, returns None."""
    from fnmatch import fnmatch
    from os import listdir
    try:
        dirlist = listdir(d)
    except OSError:
        return None
    return [fname for fname in dirlist if fnmatch(fname, e)]
I have put the listdir call inside a try/except clause to catch the possibility that we cannot list the directory (non-existent, no read permission, etc.). The treatment of errors is a bit simplistic, but...
The list of matching filenames is built using a so-called list comprehension, something you should investigate as soon as possible if you're going to use Python for your programs.
To close my post, a usage example:
l_txtfiles = list_ext_in_dir('*.txt', '/Users/x/y')

Accounting for lower and uppercase in glob

I need to pass an extension to a function, and then have that function pull all files with that extension in both lower- and uppercase form.
For example, if I passed mov, I'd need the function to do:
videos = [file for file in glob.glob(os.path.join(dir, '*.[mM][oO][vV]'))]
How would I accomplish the lower + upper combination above given a lowercase input?
Something like this?
>>> def foo(extension):
...     return '*.' + ''.join('[%s%s]' % (e.lower(), e.upper()) for e in extension)
...
>>> foo('mov')
'*.[mM][oO][vV]'
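You could then feed the result straight to glob, e.g. videos = glob.glob(os.path.join(dir, foo('mov'))), mirroring the line in the question.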
Since glob just calls os.listdir and fnmatch.fnmatch, you could just call listdir yourself, and do your own matching. If all you're looking for is a matching extension, it's a pretty simple test, and shouldn't be hard to write either as a regular expression or with [-3:]
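For instance, a minimal sketch of that do-it-yourself matching (the function name files_with_ext and its parameters are just illustrative):
import os

def files_with_ext(dirpath, extension):
    # case-insensitive extension match, e.g. 'mov' matches .mov, .MOV, .Mov
    suffix = '.' + extension.lower()
    return [f for f in os.listdir(dirpath) if f.lower().endswith(suffix)]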
You can convert strings between upper and lowercase easily:
>>> ext = 'mov'
>>> ext.upper()
'MOV'
So just use that in your function.
If you are running this on Unix, you can try to call this:
from subprocess import Popen, PIPE
#replace the "/tmp/test/" and "*.test" with your search path and extension
args = ["find", "/tmp/test/", "-iname", "*.test"]
files = Popen(args, stdout=PIPE).stdout.readlines()
>>> files
['/tmp/test/a.Test\n', '/tmp/test/a.TEST\n', '/tmp/test/a.TeST\n', '/tmp/test/a.test\n']
More detail can be found in the subprocess documentation.

batch search and replace strings in filenames with python

I am trying to write a small Python script to rename a bunch of files by searching and replacing strings in their names. For example:
Original filename:
MyMusic.Songname.Artist-mp3.iTunes.mp3
Intended result:
Songname.Artist.mp3
What I've got so far is:
#!/usr/bin/env python
from os import rename, listdir
mustgo = "MyMusic."
filenames = listdir('.')
for fname in filenames:
    if fname.startswith(mustgo):
        rename(fname, fname.replace(mustgo, '', 1))
(I got it from this site as far as I can remember.)
Anyway, this only gets rid of the string at the beginning, not of the strings elsewhere in the filename.
Also I would like to use a separate file (e.g. badwords.txt) containing all the strings that should be searched for and replaced, so that I can update them without having to edit the whole code.
Content of badwords.txt
MyMusic.
-mp3
-MP3
.iTunes
.itunes
I have been searching for quite some time now but haven't found anything. I'd appreciate any help!
Thank you!
import fnmatch
import re
import os

with open('badwords.txt', 'r') as f:
    pat = '|'.join(fnmatch.translate(badword)[:-1]
                   for badword in f.read().splitlines())

for fname in os.listdir('.'):
    new_fname = re.sub(pat, '', fname)
    if fname != new_fname:
        print('{o} --> {n}'.format(o=fname, n=new_fname))
        os.rename(fname, new_fname)
# MyMusic.Songname.Artist-mp3.iTunes.mp3 --> Songname.Artist.mp3
Note that it is possible for some files to be overwritten (and thus
lost) if two names get reduced to the same shortened name after
badwords have been removed. A set of new fnames could be kept and
checked before calling os.rename to prevent losing data through
name collisions.
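A minimal sketch of that collision check, reusing the pat built above:
import os
import re

seen = set()
for fname in os.listdir('.'):
    new_fname = re.sub(pat, '', fname)
    if new_fname == fname:
        continue
    # refuse to rename if the shortened name is already taken
    if new_fname in seen or os.path.exists(new_fname):
        print('skipping {o}: {n} already exists'.format(o=fname, n=new_fname))
        continue
    seen.add(new_fname)
    os.rename(fname, new_fname)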
fnmatch.translate takes shell-style patterns and returns the equivalent regular expression. It is used above to convert badwords (e.g. '.iTunes') into regular expressions (e.g. r'\.iTunes'); the [:-1] strips the end-of-string anchor that translate appends (note that the exact suffix varies between Python versions, so you may need to adjust that slice).
Your badwords list seems to indicate you want to ignore case. You
could ignore case by adding '(?i)' to the beginning of pat:
with open('badwords.txt', 'r') as f:
    pat = '(?i)' + '|'.join(fnmatch.translate(badword)[:-1]
                            for badword in f.read().splitlines())
