How to iterate over files in specific directories? - python

I'd like to iterate over files in two folders in a directory only, and ignore any other files/directories.
e.g in path: "dirA/subdirA/folder1" and "dirA/subdirA/folder2"
I tried passing both to pathlib as:
root_dir_A = "dirA/subdirA/folder1"
root_dir_B = "dirA/subdirA/folder2"
for file in Path(root_dir_A, root_dir_B).glob('**/*.json'):
    json_data = open(file, encoding="utf8")
    ...
But it only iterates over the 2nd path in Path(root_dir_A,root_dir_B).

You can't pass two separate directories to Path(). You'll need to loop over them.
for dirpath in (root_dir_A, root_dir_B):
    for file in Path(dirpath).glob('**/*.json'):
        ...
According to the documentation, Path("foo", "bar") should produce "foo/bar"; but it seems to actually use only the second path segment if it is absolute. Either way, it doesn't do what you seemed to hope it would.
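A quick illustration of that joining behavior (a minimal sketch with made-up segments; the absolute-path case is shown POSIX-style):
from pathlib import Path

# Relative segments are simply joined:
print(Path("dirA/subdirA/folder1", "data.json"))   # dirA/subdirA/folder1/data.json
# But a later absolute segment replaces everything before it:
print(Path("dirA/subdirA/folder1", "/tmp/other"))  # /tmp/other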

Please check the output of Path(root_dir_A,root_dir_B) to see if it returns what you want.
In your specific case this should work:
path_root = Path('dirA')
for path in path_root.glob('subdirA/folder[12]/**/*.json'):
    ...
If your paths aren't homogeneous enough you might have to chain generators, i.e.:
from itertools import chain
content_dir_A = Path(root_dir_A).glob('**/*.json')
content_dir_B = Path(root_dir_B).glob('**/*.json')
content_all = chain(content_dir_A, content_dir_B)
for path in content_all:
    ...

Related

Python 3.6 - enumerate files

I am trying to loop over a series of jpg files in a folder. I found this example code:
for n, image_file in enumerate(os.scandir(image_folder)):
which will loop through the image files in image_folder. However, it does not seem to follow any sequence. I have my files named like 000001.jpg, 000002.jpg, 000003.jpg, ... and so on. But when the code runs, it does not follow that sequence:
000213.jpg
000012.jpg
000672.jpg
....
What seems to be the issue here?
Here's the relevant bit on os.scandir():
os.scandir(path='.')
Return an iterator of os.DirEntry objects
corresponding to the entries in the directory given by path. The
entries are yielded in arbitrary order, and the special entries '.'
and '..' are not included.
You should not expect it to be in any particular order. The same goes for listdir() if you were considering this as an alternative.
If you strictly need them to be in order, consider sorting them first:
scanned = sorted(os.scandir(image_folder), key=lambda f: f.name)
for n, image_file in enumerate(scanned):
    # ... rest of your code
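Since the file names in the question are zero-padded, sorting by name is enough; if they were not, you could sort on the numeric part instead (a small sketch reusing image_folder from the question, assuming every name is just digits plus an extension):
import os

scanned = sorted(os.scandir(image_folder),
                 key=lambda f: int(os.path.splitext(f.name)[0]))
for n, image_file in enumerate(scanned):
    # ... rest of your code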
I prefer to use glob:
The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell, although results are
returned in arbitrary order. No tilde expansion is done, but *, ?, and
character ranges expressed with [] will be correctly matched.
You will need this if you handle more complex file structures, so starting with glob isn't a bad idea. For your case you could also use os.scandir() as mentioned above.
Reference: glob module
import glob
files = sorted(glob.glob(r"C:\Users\Fabian\Desktop\stack\img\*.jpg"))
for key, myfile in enumerate(files):
    print(key, myfile)
Notice that even if there are other files like .txt in the folder, they won't be in your list.
Output:
C:\Users\Fabian\Desktop\stack>python c:/Users/Fabian/Desktop/stack/img.py
0 C:\Users\Fabian\Desktop\stack\img\img0001.jpg
1 C:\Users\Fabian\Desktop\stack\img\img0002.jpg
2 C:\Users\Fabian\Desktop\stack\img\img0003.jpg
....

Read only the first file from a given image sequence path

I have an image sequence path that is as follows : /host_server/master/images/set01a/env_basecolor_default_v001/basecolor_default.*.jpg
In a pythonic way, is it possible to read just the first file of the sequence, given the file path above?
If not, can I have it list the entire sequence, but only files with that naming? Assume that there is another sequence called basecolor_default_beta.*.jpg in the same directory.
For #2, if I use os.listdir('/host_server/master/images/set01a/env_basecolor_default_v001'), it lists the files of both image sequences.
The simplest solution seems to be to use several functions.
1) To get ALL of the full filepaths, use
main_path = "/host_server/master/images/set01a/env_basecolor_default_v001/"
all_files = [os.path.join(main_path, filename) for filename in os.listdir(main_path)]
2) To choose only those of a certain kind, use a filter.
beta_files = list(filter(lambda x: "beta" in x, all_files))
beta_files.sort()
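If you only need the first frame of that filtered, sorted sequence, just take the first element (a minimal sketch; first_beta is None when nothing matched):
first_beta = beta_files[0] if beta_files else None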
read the first file based on the above file path given?
With glob.iglob(pathname, recursive=False), if you need the name/path of the first matching file:
import glob
path = '/host_server/master/images/set01a/env_basecolor_default_v001/basecolor_default.*.jpg'
it = glob.iglob(path)
first = next(it)
glob.iglob() - Return an iterator which yields the same values as
glob() without actually storing them all simultaneously.
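Note that next(it) raises StopIteration when nothing matches the pattern; passing a default avoids that (a small sketch):
first = next(glob.iglob(path), None)
if first is None:
    print('no frames matched', path)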
Try using glob. Something like:
import glob
import os
path = '/host_server/master/images/set01a/env_basecolor_default_v001'
pattern = 'basecolor_default.*.jpg'
filenames = glob.glob(os.path.join(path, pattern))
# read filenames[0]

move folders from folder list to other folder list using python

Hello, I want to move or copy many folders from one folder list to another folder list. I use the glob and shutil libraries for this.
First I create a folder list:
import glob
#paths from source folder
sourcepath='C:/my/store/path/*'
paths = glob.glob(sourcepath)
my_file='10'
selected_path = filter(lambda x: my_file in x, paths)
#paths from destination folder
destpath='C:/my/store/path/*'
paths2 = glob.glob(destpath)
my_file1='20'
selected_path1 = filter(lambda x: my_file1 in x, paths2)
And now I have two lists of paths (selected_path, selected_path1).
Now I want to move or copy folders from the first list (selected_path) to the second list (selected_path1).
Finally I tried this code to move the folders, but without success:
import shutil
for I,j in zip(selected_path,selected_path1)
    shutil.move(i, j)
But that doesn't work. Any idea how to get my code working?
First, your use of lambda with filter isn't really needed here; the glob function can perform this filtering itself. That is what glob is for, so you're just adding extra function calls, which costs a bit of performance.
Look at this example, identical to yours:
import glob
# Find all .py files
sourcepath= 'C:/my/store/path/*.py'
paths = glob.glob(sourcepath)
# Find files that end with 'codes'
destpath= 'C:/my/store/path/*codes'
paths2 = glob.glob(destpath)
Second, the second glob call may or may not return a list of directories to move your directories/files to. This makes your code dependent on what C:/my/store/path contains: you must guarantee that C:/my/store/path contains only directories and never files, so that glob returns only directories to be used in shutil.move. If the user later adds files (not folders) to C:/my/store/path whose names happen to end with 'codes' (e.g. codes.txt, codes.py, ...), those files will end up in the list glob returns for paths2. Of course, guaranteeing that a directory contains only subdirectories is problematic and not a good idea at all. You can test for directories with os.path.isdir, as in the sketch below.
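For example, you could keep only the entries of the second glob that really are directories (a small sketch reusing destpath from the question):
import glob
import os

paths2 = [p for p in glob.glob(destpath) if os.path.isdir(p)]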
Notice something, you're using lambda with the help of filter to filter out any string that doesn't contain 10 in your first call to filter, something you can achieve with glob itself:
glob.glob('C:/my/store/path/*10*')
Now any file or subdirectory of C:/my/store/path that contains 10 in its name will be collected in the returned list of the glob function.
Third, zip truncates to the shortest iterable in its argument list. In other words, if you would like to move every path in paths to every path in paths2, you need len(paths) == len(paths2) so each file or directory in paths has a directory to be moved to in paths2.
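A quick illustration of that truncation:
>>> list(zip(['a', 'b', 'c'], [1, 2]))
[('a', 1), ('b', 2)]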
Fourth, you missed the colon after the for loop, and in the call to shutil.move you used i instead of I. Python is a case-sensitive language; uppercase I is not the same name as lowercase i:
import shutil
for I,j in zip(selected_path,selected_path1)  # missing :
    shutil.move(i, j)  # i instead of I
Corrected code:
import shutil
for I, j in zip(selected_path, selected_path1):  # colon added
    shutil.move(I, j)  # I matches the loop variable
Assuming paths2 contains only subdirectories of the C:/my/store/path directory, this is a better way to write your code, though definitely not the best:
import glob
#paths from source folder
sourcepath='C:/my/store/path/*10*'
paths = glob.glob(sourcepath)
#paths from destination folder
destpath='C:/my/store/path/*20*'
paths2 = glob.glob(destpath)
import shutil
for i,j in zip(paths,paths2):
    shutil.move(i, j)
* Some of the issues I mentioned above still apply to this code.
And now that you finished the long marathon of reading this answer, what would you like to do to improve your code? I'll be glad to help if you still find something ambiguous.
Good luck :)

Find first occurrence of file in list of directories

I have a list of directories. In this list I want to find the first directory with a certain file and return the abspath of the file. I currently have the following code that works:
from os.path import exists, join, abspath
path = ["/some/where", "/some/where/else", "/another/location"]
file_name = "foo.bar"
try:
    file = [abspath(join(d, file_name)) for d in path if exists(join(d, file_name))][0]
except IndexError:
    file = ""
How can I do this more elegantly? What I particularly dislike are the two joins.
You could pull the join out into a genexp:
>>> paths = ["/some/where", "/some/where/else", "/another/location", "/tmp"]
>>> file_name = "foo.bar"
>>> joined = (join(p, file_name) for p in paths)
>>> next((abspath(f) for f in joined if exists(f)), '')
'/tmp/foo.bar'
(You could trivially make this a one-liner if you wanted by inlining it.)
Note that this differs from your code because it stops after finding the first one, whereas your code finds them all.
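A pathlib variant of the same idea, in case you prefer Path objects (a sketch reusing paths and file_name from above; note that resolve() also resolves symlinks, which plain abspath does not):
from pathlib import Path

candidates = (Path(d, file_name) for d in paths)
found = next((str(p.resolve()) for p in candidates if p.exists()), '')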
Even if you joined the directories with the filename beforehand to avoid joining twice, you would still be joining all the directories. For example, if your list has 10 directories, you will call os.path.join() 10 times, even if the directory which contains the file is first in the list. Worse yet, when you have to do this several thousand or million times, it adds up.
I could not see any elegant solution using list comprehension, so I designed an iterative one. In my solution, as soon as we find a directory which contains the file, we immediately return the full, absolute path to that file and do not process any further. This solution is not elegant, but it is faster.
The downside of this solution is the overhead of calling a function. If what you find is at the end of the list, my solution might be slower than the list comprehension solution.
import os

def find_first(directories, filename):
    '''
    Given a list of directories and a file name, find the first existing
    occurrence.
    '''
    for directory in directories:
        fullpath = os.path.abspath(os.path.join(directory, filename))
        if os.path.exists(fullpath):
            return fullpath
    return False

directories = ['/foo', '/bin', '/usr/bin']
filename = 'bash'
print(find_first(directories, filename))  # /bin/bash

Python - Efficiently building a dictionary

I am trying to build a dict(dict(dict())) out of multiple files, which are stored in different numbered directories, i.e.
/data/server01/datafile01.dat
/data/server01/datafile02.dat
...
/data/server02/datafile01.dat
/data/server02/datafile02.dat
...
/data/server86/datafile01.dat
...
/data/server86/datafile99.dat
I have a couple problems at the moment:
Switching between directories
I know that I have 86 servers, but the number of files per server may vary. I am using:
for i in range(1,86):
    basedir = '/data/server%02d' % i
    for file in glob.glob(basedir+'*.dat'):
        # do reading and sorting here
But I can't seem to switch between the directories properly. It just sits in the first one, and seems to get stuck when there are no files in the directory.
Checking if key already exists
I would like to have a function that checks whether a key is already present and, if it isn't, creates that key and certain subkeys, since one can't just write dict[Key1][Subkey1][Subsubkey1] = value.
BTW, I am using Python 2.6.6.
Björn helped with the defaultdict half of your question. His suggestion should get you very close to where you want to be in terms of the default value for keys that do not yet exist.
The best tool for walking a directory and looking at files is os.walk. You can combine the directory and filename names that you get from it with os.path.join to find the files you are interested in. Something like this:
import os
data_path = '/data'
# option 1 using a nested comprehension (generator expression)**
data_files = (os.path.join(root, f)
              for (root, dirs, files) in os.walk(data_path)
              for f in files)  # can use [] instead of () for a list
# option 2 using nested for loops
data_files = []
for root, dirs, files in os.walk(data_path):
    for f in files:
        data_files.append(os.path.join(root, f))

for data_file in data_files:
    # ... process data_file ...
**Docs for list comprehensions.
I can't help you with your first problem, but the second one can be solved by using a defaultdict. This is a dictionary that calls a factory function to generate a value when a requested key does not exist. Using lambda you can nest them:
>>> from collections import defaultdict
>>> your_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
>>> your_dict[1][2][3]
0
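Assignment works the same way, so you can write straight to a deep key without creating the intermediate levels yourself:
>>> your_dict[4][5][6] = 42
>>> your_dict[4][5][6]
42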
I'm assuming these 'directories' are remotely mounted shares?
Couple of things:
I'd use os.path.join instead of 'basedir' + '*.dat'
For FS-related stuff I've had very good results parallelizing the computation using multiprocessing.Pool, to get around those times where a remote fs might be extremely slow and hold up the whole process.
import os
import glob
import multiprocessing as mp
def processDir(path):
    results = {}
    for file in glob.iglob(os.path.join(path, '*.dat')):
        # read the file and add entries to results here
        ...
    return results

dirpaths = ['/data/server%02d' % i for i in range(1, 87)]
_results = mp.Pool(8).map(processDir, dirpaths)
# combine the per-directory dicts in _results into a single dict here
For your dict-related problems, use defaultdict as mentioned in the other answers, or even your own dict subclass or a helper function:
def addresult(results, key, subkey, subsubkey, value):
    if key not in results:
        results[key] = {}
    if subkey not in results[key]:
        results[key][subkey] = {}
    if subsubkey not in results[key][subkey]:
        results[key][subkey][subsubkey] = value
There are almost certainly more efficient ways to accomplish this, but that's a start.
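A slightly more compact variant of the same idea, using dict.setdefault to create the missing levels (a sketch; like the function above, it does not overwrite an existing value):
def addresult(results, key, subkey, subsubkey, value):
    results.setdefault(key, {}).setdefault(subkey, {}).setdefault(subsubkey, value)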
