Glob to match files except certain extension - python

Newbie to Python! I'm trying to use glob in conjunction with max to find the last modified file in a folder, excluding files with the pdf extension.
Without the exclusion I have this, which works fine:
crshLogs = glob.glob(homePath+crshLogPath+'*.*')
currCrshLog = max(crshLogs , key = os.path.getmtime)
To try and exclude the pdf I've tried:
crshLogs = glob.glob(homePath+crshLogPath+'!(*.pdf)')
and also
crshLogs = glob.glob(homePath+crshLogPath+'*.*') - glob.glob(homePath+crshLogPath+'*.pdf')
But in both cases the next line of code fails with ValueError: max() arg is an empty sequence so presumably nothing is being returned.
Any help would be gratefully received!

[filename for filename in glob.glob(homePath+crshLogPath+'*.*') if not filename.endswith('pdf')]
Also I would change
crshLogs = glob.glob(homePath+crshLogPath+'*.*')
to
crshLogs = glob.glob(os.path.join(homePath, crshLogPath, '*.*'))
This takes care of awkward edge cases, such as homePath not ending in / and crshLogPath not starting with /, where plain concatenation would produce a broken path.
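As a rough illustration of the difference between concatenation and joining, here is a small sketch. The two path values are made up for the example, and posixpath is used instead of os.path only so that the result does not depend on the operating system. Note one caveat: if a later component is itself absolute, the join discards everything before it.

```python
import posixpath

# Hypothetical values: no trailing slash on the first, no leading slash on the second.
home_path = "/home/user"
crash_log_path = "crashlogs"

# Plain concatenation glues the parts together without any separator.
concatenated = home_path + crash_log_path + "*.*"

# join() inserts the missing separators.
joined = posixpath.join(home_path, crash_log_path, "*.*")

# Caveat: an absolute later component throws away the earlier ones.
absolute_later = posixpath.join(home_path, "/crashlogs", "*.*")
```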

You could create a list and not put pdfs in it:
file_list = []
for filename in glob.glob(homePath+crshLogPath+'*.*'):
    if not filename.endswith(".pdf"):
        file_list.append(filename)
And then get your filenames from that list.
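Putting the filtering and the max() call together might look like the sketch below. The function name newest_non_pdf and its folder parameter are invented for illustration. It compares the extension case-insensitively and returns None instead of raising ValueError when nothing matches, which is a design choice rather than something the question asked for.

```python
import glob
import os


def newest_non_pdf(folder):
    """Return the most recently modified non-PDF file in folder, or None."""
    candidates = [
        path
        for path in glob.glob(os.path.join(folder, "*.*"))
        # Skip directories and anything whose extension is .pdf (any case).
        if os.path.isfile(path)
        and os.path.splitext(path)[1].lower() != ".pdf"
    ]
    # default=None avoids "max() arg is an empty sequence" on an empty list.
    return max(candidates, key=os.path.getmtime, default=None)
```

It would be called as newest_non_pdf(os.path.join(homePath, crshLogPath)), assuming those two variables hold the directory parts from the question.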

Related

Python glob.glob can't find the file or returns empty list

I am trying this code:
import glob
temp_path = glob.glob("/FileStorage/user/final/**/**/sample.json")[-1][5:]
sample = (spark.read.json(f"FileStorage:{temp_path}"))
However, when I run this command in databricks, the error message is:
IndexError: list index out of range
When I print glob.glob("/FileStorage/user/final/**/**/sample.json"), the result is an empty list.
The issue is that "/FileStorage/user/final/** /**/sample.json" is probably not the correct pathname for what you are trying to express. What you probably want is:
glob.glob("/FileStorage/user/final/**/sample.json", recursive=True)
You need to remove the space from the pathname and add recursive=True.
import glob
path = r"C:User/FileStorage/user/final/*" # path to directory + '*'
for file in glob.iglob(path, recursive=True):
    print(file)
    # if you want to filter json format file
    if file.endswith(".json"):
        print(file)
I think that the issue is with the '**'.
If you only want to match a fixed number of directory levels, use a single '*' per level ("/FileStorage/user/final/*/*/sample.json").
If you want the search to be recursive or to include hidden files, remove the space after the first '**' and pass recursive=True or include_hidden=True (the latter requires Python 3.11 or later), according to what you want, when calling glob.glob. For example, glob.glob("/FileStorage/user/final/**/**/sample.json", include_hidden=True) will also return hidden files that are in this path.
If recursive is true, the pattern “**” will match any files and zero
or more directories, subdirectories and symbolic links to directories.
If the pattern is followed by an os.sep or os.altsep then files will
not match.
If include_hidden is true, “**” pattern will match hidden directories.
see documentation here:
https://docs.python.org/3/library/glob.html
Edit:
If this does not work, validate the path in your file system
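A minimal sketch of the recursive form, wrapped in a helper so it can be tried on any directory. The name find_recursive is made up for this example. glob.escape is applied to the root so that special characters in the directory name are not treated as wildcards.

```python
import glob
import os


def find_recursive(root, filename):
    """Return sorted paths of every file named filename at any depth under root."""
    # With recursive=True, "**" matches zero or more directory levels,
    # so a file directly inside root is found as well.
    pattern = os.path.join(glob.escape(root), "**", filename)
    return sorted(glob.glob(pattern, recursive=True))
```

If find_recursive("/FileStorage/user/final", "sample.json") still returns an empty list, the directory itself is probably not visible at that path from the Python process.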

Run only if "if " statement is true.!

I'm reading a FITS file and then using information from its header to build the names of other files related to the original FITS file. But for some of the FITS files, the other files (blaze_file, bis_file, ccf_table) are not available, and my code then fails with the obvious error: No such file or directory.
import pandas as pd
import sys, os
import numpy as np
from glob import glob
from astropy.io import fits
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        e2ds_hdu = fits.open(filename)
        e2ds_header = e2ds_hdu[0].header
        date = e2ds_header['DATE-OBS']
        date2 = date = date[0:19]
        blaze_file = e2ds_header['HIERARCH ESO DRS BLAZE FILE']
        bis_file = glob('HARPS.' + date2 + '*_bis_G2_A.fits')
        ccf_table = glob('HARPS.' + date2 + '*_ccf_G2_A.tbl')
        if not all(file in os.listdir(PATH) for file in [blaze_file,bis_file,ccf_table]):
            continue
What I want is for my code to run only if all the files are available, and to skip the file otherwise. The problem is that I define the other file names inside the for loop, because they depend on the header information. How can I define them before the for loop?
Can anyone help me with this?
The filenames returned by os.listdir() are always relative to the path given there.
In order to be used, they have to be joined with this path.
Example:
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        filepath = os.path.join(PATH, filename)
        e2ds_hdu = fits.open(filepath)
        …
Let the filenames be ['a', 'b', 'a_e2ds_A.fits', 'b_e2ds_A.fits']. The code now skips the first two names and then prepends the file path to the remaining two.
a_e2ds_A.fits becomes /home/Desktop/2d_spectra/a_e2ds_A.fits and
b_e2ds_A.fits becomes /home/Desktop/2d_spectra/b_e2ds_A.fits.
Now they can be accessed from everywhere, not just from the given file path.
I should become accustomed to reading a question in full before trying to answer it.
The problem I mentioned only arises if you start the script from a path outside the said directory. Nevertheless, fixing it will make your code much more consistent.
Your real problem, however, lies somewhere else: you examine a file and then, after checking its contents, want to read files whose names depend on information from that first file.
There are several ways to accomplish your goal:
1. Just extend your loop with the proper tests.
Pseudo code:
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if all files exist:
            proceed
or
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if not all files exist:
            continue # actual keyword, no pseudo code!
        proceed
2. Put some functionality into functions (variation of 1.)
3. Create a loop in a generator function which yields the "interesting information" of one fits file (or alternatively nothing) and have another loop run over them to actually work with the data.
If I am still missing some points or am not detailed enough, please let me know.
Since you have to read the fits file to know the names of the dependent files, there's no way you can avoid reading the fits file first. The only thing you can do is test for the existence of the dependent files before trying to read them, and skip the rest of the loop (using continue) if they are missing.
Edit this line
e2ds_hdu = fits.open(filename)
And replace with
e2ds_hdu = fits.open(os.path.join(PATH, filename))
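The generator-function idea mentioned above could be sketched roughly as follows. Everything here is an assumption for illustration: the name iter_complete_sets, and in particular the related_names callback, which stands in for the part that opens the FITS header and builds the companion file names (astropy is deliberately left out so the sketch runs on its own). Also keep in mind that glob() returns a list of matches, not a single name, so a result from glob() cannot be tested directly with "in os.listdir(PATH)".

```python
import os


def iter_complete_sets(directory, suffix, related_names):
    """Yield (main_path, related_paths) only when every related file exists.

    related_names is a caller-supplied function (hypothetical) that takes the
    main file's path and returns the companion file names derived from it.
    """
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(suffix):
            continue
        main_path = os.path.join(directory, filename)
        related_paths = [
            os.path.join(directory, name) for name in related_names(main_path)
        ]
        # Skip this file if any companion is missing.
        if not all(os.path.isfile(path) for path in related_paths):
            continue
        yield main_path, related_paths
```

The processing loop then becomes "for main_path, related in iter_complete_sets(PATH, "_e2ds_A.fits", my_name_builder):", where my_name_builder would read the header and return the blaze, bis and ccf file names.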

Read only the first file from a given image sequence path

I have an image sequence path that is as follows : /host_server/master/images/set01a/env_basecolor_default_v001/basecolor_default.*.jpg
1. In a pythonic way, is it possible to read only the first file matching the above file path?
2. If not, can I list the entire sequence, but only the files with that naming? Assume there is another sequence called basecolor_default_beta.*.jpg in the same directory.
For #2, if I used os.listdir('/host_server/master/images/set01a/env_basecolor_default_v001'), it would list the files of both image sequences.
The simplest solution seems to be to use several functions.
1) To get ALL of the full filepaths, use
main_path = "/host_server/master/images/set01a/env_basecolor_default_v001/"
all_files = [os.path.join(main_path, filename) for filename in os.listdir(main_path)]
2) To choose only those of a certain kind, use a filter.
beta_files = list(filter(lambda x: "beta" in x, all_files))
beta_files.sort()
read the first file based on the above file path given?
Use the lazy glob.iglob(pathname, recursive=False) if you only need the name/path of the first file found:
import glob
path = '/host_server/master/images/set01a/env_basecolor_default_v001/basecolor_default.*.jpg'
it = glob.iglob(path)
first = next(it)
glob.iglob() - Return an iterator which yields the same values as
glob() without actually storing them all simultaneously.
Try using glob. Something like:
import glob
import os
path = '/host_server/master/images/set01a/env_basecolor_default_v001'
pattern = 'basecolor_default.*.jpg'
filenames = glob.glob(os.path.join(path, pattern))
# read filenames[0]
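One thing to watch with both glob.glob and glob.iglob is that the order of the results is not guaranteed to be sorted, so "the first match" is not necessarily the lowest frame number. A small sketch that picks the lexicographically smallest match instead (the helper name first_in_sequence is made up, and it assumes the frame numbers are zero-padded so that string order equals frame order):

```python
import glob


def first_in_sequence(pattern):
    """Return the lexicographically smallest path matching pattern, or None."""
    # min() consumes the iterator without building an intermediate list;
    # default=None covers the case where nothing matches.
    return min(glob.iglob(pattern), default=None)
```

Because the pattern basecolor_default.*.jpg requires a dot right after "basecolor_default", files from the basecolor_default_beta sequence are not matched.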

Python: rename files based on a Python list

I have a directory that contains a thousand or so txt files. I've read and explored these files with glob (searching for specific strings within the file) and I've appended the filenames of the files of interest to a list in Python, like so:
list_of_chosen_files = ['file2.txt', 'file10.txt', 'file17.txt', ...]
I'll be using this list for other things as well, but now I'm trying to figure out how to use the os module to cross-reference the filenames in the directory against the list above and, if the filename is in that list, to add "1-" to the beginning of the filename. I've saved "1-" in a variable for reuse as well. Here's what I have so far:
var = "1-"
import os
for filename in os.listdir("."):
    if filename == list_of_chosen_files[:]:
        os.rename(filename, var+filename)
        print filename
It's running without any errors in anaconda, but nothing's printing and none of the files are getting renamed. I feel like it should be such an easy fix, but I'm concerned about poking around directories with the OS module if I don't really know what I'm doing yet.
Any help would be greatly appreciated! Thanks!
Your issue is this line:
if filename == list_of_chosen_files[:]:
You're comparing a single string (filename) to an entire list (list_of_chosen_files[:] just gives you back the whole list). If you want to check if the filename is in the list, use this:
if filename in list_of_chosen_files:
This will check to see if list_of_chosen_files contains filename.
The error is in this line: if filename == list_of_chosen_files[:]:
Also note that os.listdir() only gives you back base names, not full paths. With "." that works because the names are relative to your current working directory, but for any other directory you would need to join them back with the root:
root = 'full path of your directory'
for item in os.listdir(root):
    if item in list_of_chosen_files:
        fullpath = os.path.join(root, item)
        os.rename(fullpath, os.path.join(root, var + item))
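Combining the membership test with full paths, a self-contained version might look like this sketch. The function name add_prefix and its parameters are invented for the example. It converts the list to a set for faster lookups and returns the new names so the caller can check what was changed.

```python
import os


def add_prefix(directory, chosen, prefix):
    """Rename each file in directory whose name is in chosen; return new names."""
    chosen = set(chosen)  # set membership is faster than scanning a list
    renamed = []
    # os.listdir() returns a complete list up front, so renaming inside
    # the loop does not disturb the iteration.
    for name in sorted(os.listdir(directory)):
        if name in chosen:
            new_name = prefix + name
            os.rename(
                os.path.join(directory, name),
                os.path.join(directory, new_name),
            )
            renamed.append(new_name)
    return renamed
```

For the question's case this would be called as add_prefix(".", list_of_chosen_files, var). Trying it on a copy of the directory first is a sensible precaution.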

glob function in python with one wildcard

I have a problem with the glob.glob function in Python.
This line works perfectly for me getting all text files with the name 002 in the two subsequent folders of Models:
All_txt = glob.glob("C:\Users\EDV\Desktop\Peter\Models\*\*\002.txt")
But going into one subfolder and asking the same:
All_txt = glob.glob('C:\Users\EDV\Desktop\Peter\Models\Texte\*\002.txt')
results in an empty list. Does anybody know what the problem here is (or knows another function which expresses the same)?
I double-checked the folder paths and that all folders contain these text-files.
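Try putting an r in front of the string to make a raw string: glob.glob(r'C:\Users\EDV\Desktop\Peter\Models\Texte\*\002.txt'). This way the backslashes aren't used for escaping the next character. Without it, \002 is read as an octal escape for a single control character rather than as a backslash followed by 002.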
You could also do it without glob like so:
import os
all_txt = []
root = r'C:\Users\EDV\Desktop\Peter\Models\Texte'
for d in os.listdir(root):
    abs_d = os.path.join(root, d)
    if os.path.isdir(abs_d):
        txt = os.path.join(abs_d, '002.txt')
        if os.path.isfile(txt):
            all_txt.append(txt)
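To see what the r prefix changes, here is a short sketch using only the last part of the path. The variable names are chosen for the example.

```python
# In a normal string literal, "\002" is an octal escape: one control character.
plain = "\002.txt"

# A raw string keeps the backslash and the three digits as written.
raw = r"\002.txt"

# Doubling the backslash in a normal literal gives the same text as the raw form.
escaped = "\\002.txt"
```

Here plain is five characters long and starts with chr(2), while raw and escaped are identical eight-character strings that start with a real backslash.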
