Accounting for lower and uppercase in glob - python

I need to pass an extension to function, and then have that function pull all files with that extension in both lower and uppercase form.
For example, if I passed mov, I need to function to do:
videos = [file for file in glob.glob(os.path.join(dir, '*.[mM][oO][vV]'))]
How would I accomplish the lower + upper combination above given a lowercase input?

Something like this?
>>> def foo(extension):
... return '*.' + ''.join('[%s%s]' % (e.lower(), e.upper()) for e in extension)
...
>>> foo('mov')
'*.[mM][oO][vV]'

Since glob just calls os.listdir and fnmatch.fnmatch, you could just call listdir yourself, and do your own matching. If all you're looking for is a matching extension, it's a pretty simple test, and shouldn't be hard to write either as a regular expression or with [-3:]

You can convert strings between upper and lowercase easily:
>>> ext = 'mov'
>>> ext.upper()
'MOV'
So just use that in your function.

If you are running this on Unix, you can try to call this:
from subprocess import Popen, PIPE
#replace the "/tmp/test/" and "*.test" with your search path and extension
args = ["find", "/tmp/test/", "-iname", "*.test"]
files = Popen(args, stdout=PIPE).stdout.readlines()
>>> files
['/tmp/test/a.Test\n', '/tmp/test/a.TEST\n', '/tmp/test/a.TeST\n', '/tmp/test/a.test\n']
more detail on subprocess

Related

return found path in glob

If I have a glob('path/to/my/**/*.json', recursive = True), function returns something like:
path/to/my/subfolder1/subfolder2/file1.json
path/to/my/subfolder1/subfolder2/file2.json
path/to/my/subfolder1/subfolder2/file3.json
path/to/my/subfolder1/file4.json
path/to/my/file5.json
...
I'd like to get only part that starts after ** in the glob, so
subfolder1/subfolder2/file1.json
subfolder1/subfolder2/file2.json
subfolder1/subfolder2/file3.json
subfolder1/file4.json
file5.json
...
What is the best way to do it? Does glob support it natively? Glob input is provided as commmand line hence direct str-replace may be difficult.
Use os.path.commonprefix on the returned paths, then os.path.relpath using the common prefix to get paths relative to it.
An example from a Node.js project with a whole bunch of package.jsons.
>>> pkgs = glob.glob("node_modules/**/package.json", recursive=True)[:10]
['node_modules/queue-microtask/package.json', 'node_modules/callsites/package.json', 'node_modules/sourcemap-codec/package.json', 'node_modules/reusify/package.json', 'node_modules/is-bigint/package.json', 'node_modules/which-boxed-primitive/package.json', 'node_modules/jsesc/package.json', 'node_modules/#types/scheduler/package.json', 'node_modules/#types/react-dom/package.json', 'node_modules/#types/prop-types/package.json']
>>> pfx = os.path.commonprefix(pkgs)
'node_modules/'
>>> [os.path.relpath(pkg, pfx) for pkg in pkgs]
['queue-microtask/package.json', 'callsites/package.json', 'sourcemap-codec/package.json', 'reusify/package.json', 'is-bigint/package.json', 'which-boxed-primitive/package.json', 'jsesc/package.json', '#types/scheduler/package.json', '#types/react-dom/package.json', '#types/prop-types/package.json']
>>>

Python string alphabet removal?

So in my program, I am reading in files and processing them.
My output should say just the file name and then display some data
When I am looping through files and printing output by their name and data,
it displays for example: myfile.txt. I don't want the .txt part. just myfile.
how can I remove the .txt from the end of this string?
The best way to do it is in the example
import os
filename = 'myfile.txt'
print(filename)
print(os.path.splitext(filename))
print(os.path.splitext(filename)[0])
More info about this very useful builtin module
https://docs.python.org/3.8/library/os.path.html
The answers given are totally right, but if you have other possible extensions, or don't want to import anything, try this:
name = file_name.rsplit(".", 1)[0]
You can use pathlib.Path which has a stem attribute that returns the filename without the suffix.
>>> from pathlib import Path
>>> Path('myfile.txt').stem
'myfile'
Well if you only have .txt files you can do this
file_name = "myfile.txt"
file_name.replace('.txt', '')
This uses the built in replace functionality. You can find more info on it here!

Getting file names without file extensions with glob

I'm searching for .txt files only
from glob import glob
result = glob('*.txt')
>> result
['text1.txt','text2.txt','text3.txt']
but I'd like result without the file extensions
>> result
['text1','text2','text3']
Is there a regex pattern that I can use with glob to exclude the file extensions from the output, or do I have to use a list comprehension on result?
There is no way to do that with glob(), You need to take the list given and then create a new one to store the values without the extension:
import os
from glob import glob
[os.path.splitext(val)[0] for val in glob('*.txt')]
os.path.splitext(val) splits the file names into file names and extensions. The [0] just returns the filenames.
Since you’re trying to split off a filename extension, not split an arbitrary string, it makes more sense to use os.path.splitext (or the pathlib module). While it’s true that the it makes no practical difference on the only platforms that currently matter (Windows and *nix), it’s still conceptually clearer what you’re doing. (And if you later start using path-like objects instead of strings, it will continue to work unchanged, to boot.)
So:
paths = [os.path.splitext(path)[0] for path in paths]
Meanwhile, if this really offends you for some reason, what glob does under the covers is just calling fnmatch to turn your glob expression into a regular expression and then applying that to all of the filenames. So, you can replace it by just replacing the regex yourself and using capture groups:
rtxt = re.compile(r'(.*?)\.txt')
files = (rtxt.match(file) for file in os.listdir(dirpath))
files = [match.group(1) for match in files if match]
This way, you’re not doing a listcomp on top of the one that’s already in glob; you’re doing one instead of the one that’s already in glob. I’m not sure if that’s a useful win or not, but since you seem to be interested in eliminating a listcomp…
This glob only selects files without an extension: **/*/!(*.*)
Use index slicing:
result = [i[:-4] for i in result]
Another way using rsplit:
>>> result = ['text1.txt','text2.txt.txt','text3.txt']
>>> [x.rsplit('.txt', 1)[0] for x in result]
['text1', 'text2.txt', 'text3']
You could do as a list-comprehension:
result = [x.rsplit(".txt", 1)[0] for x in glob('*.txt')]
Use str.split
>>> result = [r.split('.')[0] for r in glob('*.txt')]
>>> result
['text1', 'text2', 'text3']

Grab part of filename with Python

Newbie here.
I've just been working with Python/coding for a few days, but I want to create a script that grabs parts of filenames corresponding to a certain pattern, and outputs it to a textfile.
So in my case, let's say I have four .pdf like this:
aaa_ID_8423.pdf
bbbb_ID_8852.pdf
ccccc_ID_7413.pdf
dddddd_ID_4421.pdf
(Note that they are of variable length.)
I want the script to go through these filenames, grab the string after "ID_" and before the filename extension.
Can you point me in the direction to which Python modules and possibly guides that could assist me?
Here's a simple solution using the re module as mentioned in other answers.
# Libraries
import re
# Example filenames. Use glob as described below to grab your pdf filenames
file_list = ['name_ID_123.pdf','name2_ID_456.pdf'] # glob.glob("*.pdf")
for fname in file_list:
res = re.findall("ID_(\d+).pdf", fname)
if not res: continue
print res[0] # You can append the result to a list
And below should be your output. You should be able to adapt this to other patterns.
# Output
123
456
Goodluck!
Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):
>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>>
If the numbers are variable length, you'll want the regex module "re"
import re
# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")
pattern.search("abc_ID_8423.pdf").group(1)
Out[23]: '8423'
Regex is generally used to match variable strings. The regex I just wrote says:
Find an underscore ("_"), followed by a variable number of digits ("[0-9]+"), followed by the last period in the string ("\.[^\.]+$")
You can use the os module in python and do a listdir to get a list of filenames present in that path like so:
import os
filenames = os.listdir(path)
Now you can iterate over the filenames list and look for the pattern which you need using regular expressions:
import re
for filename in filenames:
m = re.search('(?<=ID_)\w+', filename)
print (m)
The above snippet will return the part of the filename following ID_ and prints it out. So, for your example, it would return 4421.pdf, 8423.pdf etc. You can write a similar regex to remove the .pdf part.
You probably want to use glob, which is a python module for file globbing. From the python help page the usage is as follows:
>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']

python:extract certain part of string

I have a string from which I would like to extract certain part. The string looks like :
E:/test/my_code/content/dir/disp_temp_2.hgx
This is a path on a machine for a specific file with extension hgx
I would exactly like to capture "disp_temp_2". The problem is that I used strip function, does not work for me correctly as there are many '/'. Another problem is that, that the above location will change always on the computer.
Is there any method so that I can capture the exact string between the last '/' and '.'
My code looks like:
path = path.split('.')
.. now I cannot split based on the last '/'.
Any ideas how to do this?
Thanks
Use the os.path module:
import os.path
filename = "E:/test/my_code/content/dir/disp_temp_2.hgx"
name = os.path.basename(filename).split('.')[0]
Python comes with the os.path module, which gives you much better tools for handling paths and filenames:
>>> import os.path
>>> p = "E:/test/my_code/content/dir/disp_temp_2.hgx"
>>> head, tail = os.path.split(p)
>>> tail
'disp_temp_2.hgx'
>>> os.path.splitext(tail)
('disp_temp_2', '.hgx')
Standard libs are cool:
>>> from os import path
>>> f = "E:/test/my_code/content/dir/disp_temp_2.hgx"
>>> path.split(f)[1].rsplit('.', 1)[0]
'disp_temp_2'
Try this:
path=path.rsplit('/',1)[1].split('.')[0]
path = path.split('/')[-1].split('.')[0] works.
You can use the split on the other part :
path = path.split('/')[-1].split('.')[0]

Categories