removing file names from a list python - python

I have all filenames of a directory in a list named files. And I want to filter it so only the files with the .php extension remain.
for x in files:
    if x.find(".php") == -1:
        files.remove(x)
But this seems to skip filenames. What can I do about this?

How about a simple list comprehension?
files = [f for f in files if f.endswith('.php')]
Or if you prefer a generator as a result:
files = (f for f in files if f.endswith('.php'))

>>> files = ['a.php', 'b.txt', 'c.html', 'd.php']
>>> [f for f in files if f.endswith('.php')]
['a.php', 'd.php']

Most of the answers provided give list / generator comprehensions, which are probably the way you want to go 90% of the time, especially if you don't want to modify the original list.
However, for those situations where (say for size reasons) you want to modify the original list in place, I generally use the following snippet:
idx = 0
while idx < len(files):
    if files[idx].find(".php") == -1:
        del files[idx]
    else:
        idx += 1
As to why your original code wasn't working: it changes the list as you iterate over it. The "for x in files" implicitly creates an iterator, just as if you'd written "for x in iter(files)", and deleting elements from the list confuses the iterator about what position it is at. The iterator tracks an index, so when an element is removed the remaining items shift left by one and the item that moves into the current slot is never visited. For such situations, I generally use the above code, or if it happens a lot in a project, factor it out into a function, e.g.:
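def filter_in_place(func, target):
    # Keep only the items for which func(item) is true, modifying target itself.
    idx = 0
    while idx < len(target):
        if func(target[idx]):
            idx += 1
        else:
            # Do not advance: the next item has shifted into position idx.
            del target[idx]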
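The skipping is easy to reproduce on a four-element list. The sketch below is only an illustration: `remove_while_iterating` and `keep_php_in_place` are hypothetical helper names, and the slice assignment is a swapped-in alternative to the `while` loop, not the code from the answer.

```python
def remove_while_iterating(names):
    """Reproduce the question's loop on a copy, so the caller's list is untouched."""
    names = list(names)
    for x in names:
        if not x.endswith(".php"):
            names.remove(x)  # shifts later items left; the next one gets skipped
    return names


def keep_php_in_place(names):
    """Alternative in-place filter: rebuild the contents via slice assignment."""
    names[:] = [n for n in names if n.endswith(".php")]
    return names


sample = ['a.php', 'b.txt', 'c.html', 'd.php']
print(remove_while_iterating(sample))  # ['a.php', 'c.html', 'd.php'] ('c.html' was skipped)

files = list(sample)
alias = files                          # a second reference to the same list object
keep_php_in_place(files)
print(alias)                           # ['a.php', 'd.php']
```

Slice assignment keeps the same list object, so any other reference to it sees the filtered contents, but it does build a temporary list first; the `while` loop avoids that extra copy.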

Just stumbled across this old question. Many of the solutions here will do the job, but they ignore the case where a filename is just ".php". I suspect the question was about filtering PHP scripts, and a file named ".php" may not be a PHP script. The solution I propose is as follows:
>>> import os.path
>>> files = ['a.php', 'b.txt', 'c.html', 'd.php', '.php']
>>> [f for f in files if os.path.splitext(f)[1] == ".php"]
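To see why this treats ".php" differently, it helps to look at what `os.path.splitext` returns for a name that starts with a dot: the leading dot counts as part of the root, so the extension is empty. A small sketch (`php_scripts` is a hypothetical helper name used only for illustration):

```python
import os.path


def php_scripts(names):
    """Keep names whose extension, as reported by os.path.splitext, is '.php'."""
    return [n for n in names if os.path.splitext(n)[1] == ".php"]


files = ['a.php', 'b.txt', 'c.html', 'd.php', '.php']

print(os.path.splitext('.php'))                   # ('.php', '')
print(php_scripts(files))                         # ['a.php', 'd.php']
print([f for f in files if f.endswith('.php')])   # ['a.php', 'd.php', '.php']
```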

Related

Is there a better way to do this? Counting Files, and directories via for loop vs map

Folks,
I'm trying to optimize this to help speed up the process...
What I am doing is creating a dictionary of scandir entries...
e.g.
fs_data = {}
for item in Path(fqpn).iterdir():
    # snipped out a bunch of normalization code
    fs_data[item.name.title().strip()] = item
{'file1': <file1 scandisk data>, etc}
and then later using a function to gather the count of files, and directories in the data.
Now I suspect that the new code, using map, could be optimized to be faster than the old code. I suspect the cost is in having to run the generator expression twice, once for files and once for directories.
But I can't think of a way to optimize it to only have to run once.
Can anyone suggest a way to sum the files, and directories at the same time in the new version? (I could fall back to the old code, if necessary)
But I might be over optimizing at this point?
Any feedback would be welcome.
def new_fs_counts(fs_entries) -> (int, int):
    """
    Quickly count the files vs directories in a dict of scandir entries
    Used primarily by sync_database_disk to count a path's files & directories

    Parameters
    ----------
    fs_entries (dict) - mapping of name to scandir entry

    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    def counter(fs_entry):
        # Anything that is not a file is counted as a directory.
        return (fs_entry.is_file(), not fs_entry.is_file())

    mapdata = list(map(counter, fs_entries.values()))
    files = sum(files for files, _ in mapdata)
    dirs = sum(dirs for _, dirs in mapdata)
    return (files, dirs)
vs
def old_fs_counts(fs_entries) -> (int, int):
    """
    Quickly count the files vs directories in a dict of scandir entries
    Used primarily by sync_database_disk to count a path's files & directories

    Parameters
    ----------
    fs_entries (dict) - mapping of name to scandir entry

    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    files = 0
    dirs = 0
    for fs_item in fs_entries:
        is_file = fs_entries[fs_item].is_file()
        files += is_file
        dirs += not is_file
    return (files, dirs)
map is fast here if you map the is_file function directly:
files = sum(map(os.DirEntry.is_file, fs_entries.values()))
dirs = len(fs_entries) - files
(Something with filter might be even faster, at least if most entries aren't files. Or filter with is_dir if that works for you and most entries aren't directories. Or itertools.filterfalse with is_file. Or using itertools.compress. Also, counting True with list.count or operator.countOf instead of summing bools might be faster. But all of these ideas take more code (and some also memory). I'd prefer my above way.)
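A self-contained way to try the `map` version is to build a small throwaway directory and count its entries. This is only a sketch: `count_files_and_dirs` and `demo` are hypothetical names, and the temporary directory with three files and two subdirectories is invented test data, not the asker's gallery path.

```python
import os
import tempfile


def count_files_and_dirs(fs_entries):
    """Single pass over the entries; everything that is not a file counts as a dir."""
    files = sum(map(os.DirEntry.is_file, fs_entries.values()))
    return files, len(fs_entries) - files


def demo():
    """Create 3 files and 2 directories in a temp dir and count them."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("a.txt", "b.txt", "c.txt"):
            open(os.path.join(root, name), "w").close()
        for name in ("d1", "d2"):
            os.mkdir(os.path.join(root, name))
        with os.scandir(root) as it:
            fs_data = {entry.name: entry for entry in it}
        return count_files_and_dirs(fs_data)


print(demo())  # (3, 2)
```

As in the question's code, "dirs" here really means "not a regular file", so symlinks to nothing or special files would be counted as directories too.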
Okay, map is definitely not the right answer here.
This morning I got up and created a test using timeit...
and it was a bit of a splash of reality to the face.
Without optimizations, new vs old, the new map code was roughly 2x the time.
New : 0.023185124970041215
old : 0.011841499945148826
I really ended up falling for a bit of click bait, and thought that rewriting with MAP would gain some better efficiency.
For the sake of completeness.
from timeit import timeit
import os

new = '''
def counter(fs_entry):
    files = fs_entry.is_file()
    return (files, not files)

mapdata = list(map(counter, fs_entries.values()))
files = sum(files for files, _ in mapdata)
dirs = sum(dirs for _, dirs in mapdata)
#dirs = len(fs_entries)-files
'''
#dirs = sum(dirs for _, dirs in mapdata)

old = '''
files = 0
dirs = 0
for fs_item in fs_entries:
    is_file = fs_entries[fs_item].is_file()
    files += is_file
    dirs += not is_file
'''

fs_location = '/Volumes/4TB_Drive/gallery/albums/collection1'
fs_data = {}
for item in os.scandir(fs_location):
    fs_data[item.name] = item

print("New : ", timeit(stmt=new, number=1000, globals={'fs_entries':fs_data}))
print("old : ", timeit(stmt=old, number=1000, globals={'fs_entries':fs_data}))
And while I was able to close the gap with some optimizations... (Thank you Lee for your suggestion)
New : 0.10864979098550975
old : 0.08246175001841038
It is clear that the for loop solution is easier to read, faster, and just simpler.
The speed difference between new and old, doesn't seem to be map specifically.
The duplicate sum statement added .021, and the biggest slowdown was from the second fs_entry.is_file call, which added .06x to the timings...

python print directories listing from input list of files

I want to print a high level directory structure (without duplication) from given input list of files.
Ex: Input files list is,
li=['a/b/c.txt','a/b/d/cc.txt','a/e/f.txt', 'g/h/i.txt','j/k.txt','l/m.txt']
and output to be like
a
+----b
     +----d
+----e
g
+----h
j
l
I did go through similar posts on stack overflow (before posting this question), but most of the posts had inputs with no duplicates or tree like structure or directory listing from local (and none of those cases match with the problem I'm looking at)
Here is the solution that worked for me.
def generate_nested_dirs(dir_list):
    # Build a nested dict (a trie keyed by path component).
    nested_dirs = {}
    for d in dir_list:
        temp = nested_dirs
        for sub_dir in d.split("/"):
            if temp.get(sub_dir) is None:
                temp[sub_dir] = {}
            temp = temp[sub_dir]
    return nested_dirs

def print_dirs(input_dict, indent):
    for dir_name in list(input_dict):
        if indent == 0:
            print(dir_name)
        else:
            print('\t' * indent, '+--->', dir_name)
        if input_dict[dir_name]:
            # Recurse into the subdirectories, one level deeper.
            print_dirs(input_dict[dir_name], indent + 1)
And finally calling the above two functions,
li = ['a/b/c.txt', 'a/b/d/cc.txt', 'a/e/f.txt', 'g/h/i.txt', 'j/k.txt', 'l/m.txt']
all_dirs = []
for f in li:
    # Drop only the file name; keep every directory component.
    all_dirs.append("/".join(f.split("/")[:-1]))
all_dirs = sorted(set(all_dirs))
to_print = generate_nested_dirs(all_dirs)
print_dirs(to_print, 0)
And output will be
a
+----b
     +----d
+----e
g
+----h
j
l
Note: Part of the solution was through 'trie' approach
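The trie idea can be shown in a compact form that returns the lines instead of printing them, which makes the result easy to check. This is a sketch under assumptions: `build_dir_trie` and `render` are hypothetical names, paths are assumed to use "/" as the separator, and the five-space indent per level is a guess at the intended layout.

```python
def build_dir_trie(paths):
    """Nested dict keyed by directory name; the file name (last part) is dropped."""
    root = {}
    for path in paths:
        node = root
        for part in path.split("/")[:-1]:
            node = node.setdefault(part, {})
    return root


def render(trie, depth=0):
    """Return one line per directory, indented by depth, in sorted order."""
    lines = []
    for name in sorted(trie):
        prefix = "" if depth == 0 else "     " * (depth - 1) + "+----"
        lines.append(prefix + name)
        lines.extend(render(trie[name], depth + 1))
    return lines


li = ['a/b/c.txt', 'a/b/d/cc.txt', 'a/e/f.txt', 'g/h/i.txt', 'j/k.txt', 'l/m.txt']
print("\n".join(render(build_dir_trie(li))))
```

Because each path is walked from the root of the trie, duplicate directories collapse automatically and no separate `set()` step is needed.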

Speed up file matching based on names of files

I have 2 directories with 2 different file types (e.g. .csv, .png) but with the same basename (e.g. 1001_12_15.csv, 1001_12_15.png). I have many thousands of files in each directory.
What I want to do is get the full paths of the files after matching the basenames, and then DO something with the full path of both files.
I am asking for some help with speeding up the procedure.
My approach is:
csvList=[a list with the full path of each .csv file]
pngList=[a list with the full path of each .png file]
for i in range(0, len(csvList)):
    csv_base = os.path.basename(csvList[i])
    # e.g. 1001
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    for j in range(0, len(pngList)):
        png_base = os.path.basename(pngList[j])
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # DO SOMETHING
Moreover, I tried fnmatch, something like:
for csv_file in csvList:
    csv_base = os.path.basename(csv_file)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    rel_path = "/path/to/file"
    pattern = "*" + csv_id + "*.png"
    reg_match = fnmatch.filter(pngList, pattern)
    reg_match = " ".join(str(x) for x in reg_match)
    if reg_match:
        pass  # DO something
It seems that using the nested for loops is faster. But I want it to be even faster. Are there any other approaches that I could speed up my code?
First of all, simplify your existing loop by iterating over the lists directly, like this:
for csv in csvlist:
    csv_base = os.path.basename(csv)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    for png in pnglist:
        png_base = os.path.basename(png)
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # do something here
Nested loops are very slow because the body of the inner png loop runs n² times (for n files in each list).
Then you can use list comprehensions and list.index to speed it up more:
## create lists of processed values
## so you don't have to keep calling os.path
csv_base_list = [os.path.basename(csv) for csv in csvlist]
csv_id_list = [os.path.splitext(csv_base)[0].split("_")[0] for csv_base in csv_base_list]
png_base_list = [os.path.basename(png) for png in pnglist]
png_id_list = [os.path.splitext(png_base)[0].split("_")[0] for png_base in png_base_list]
## run a single loop with list.index to find the matching pair and record the base values
csv_png_base = [(csv_base_list[csv_id_list.index(png_id)], png_base)
                for png_id, png_base in zip(png_id_list, png_base_list)
                if png_id in csv_id_list]
## csv_png_base contains tuples of (csv_base, png_base)
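This logic moves the inner search into list.index and the "in" test, which run in C, and there are no repeated os.path calls. Note that each of those list lookups is still a linear scan, so the total work remains quadratic in the worst case.
A list comprehension is slightly faster than a normal loop.
You can then loop through the list and do something with the values, e.g.
for csv_base, png_base in csv_png_base:
    pass  # do something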
pandas will do the job much much faster though because it will run the loop using a C library
You can build up a search index in O(n), then seek items in it in O(1) each. If you have exact matches as your question implies, a flat lookup dict suffices:
from os.path import basename, splitext
png_lookup = {
    splitext(basename(png_path))[0]: png_path
    for png_path in pngList
}
This allows you to directly look up the png file corresponding to each csv file:
for csv_file in csvList:
    csv_id = splitext(basename(csv_file))[0]
    try:
        png_file = png_lookup[csv_id]
    except KeyError:
        pass  # no png with the same basename
    else:
        pass  # do something with csv_file and png_file
In the end, you have an O(n) lookup construction and a separate O(n) iteration with a nested O(1) lookup. The total complexity is O(n) compared to your initial O(n^2).
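The question's loops compare only the leading id (e.g. 1001), and several files could share it, so a variation is to map each id to a list of paths. The following is a sketch under assumptions: `leading_id` and `match_by_id` are hypothetical names, the sample paths are invented, and ids are compared as strings rather than via `float()` as in the question.

```python
from collections import defaultdict
from os.path import basename, splitext


def leading_id(path):
    """'/x/1001_12_15.csv' -> '1001' (text before the first underscore)."""
    return splitext(basename(path))[0].split("_")[0]


def match_by_id(csv_paths, png_paths):
    """Pair every csv with every png that shares its leading id."""
    png_by_id = defaultdict(list)
    for png in png_paths:                      # O(n) index construction
        png_by_id[leading_id(png)].append(png)
    return [
        (csv, png)
        for csv in csv_paths                   # O(n) iteration, O(1) lookup each
        for png in png_by_id.get(leading_id(csv), [])
    ]


csvs = ["/data/csv/1001_12_15.csv", "/data/csv/1002_01_02.csv", "/data/csv/1003_05_05.csv"]
pngs = ["/data/png/1002_01_02.png", "/data/png/1001_12_15.png", "/data/png/9999_00_00.png"]
for csv_path, png_path in match_by_id(csvs, pngs):
    print(csv_path, png_path)
```

If the ids can differ in formatting (say "1001" vs "1001.0"), normalizing them inside `leading_id` would restore the behaviour of the `float()` comparison.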

Recursively list all files in directory (Unix)

I am trying to list all files in a directory recursively using Python. I saw many solutions using os.walk, but I don't want to use os.walk. Instead I want to implement the recursion myself.
import os
fi = []
def files(a):
    f = [i for i in os.listdir(a) if os.path.isfile(i)]
    if len(os.listdir(a)) == 0:
        return
    if len(f) > 0:
        fi.extend(f)
    for j in [i for i in os.listdir(a) if os.path.isdir(i)]:
        files(j)
files('.')
print(fi)
I am trying to learn recursion. I saw the following Q&A, but I am not able to implement it correctly in my code.
Python recursive directory reading without os.walk
os.listdir returns only the filenames (without the full path),
so I think calling files(j) will not work correctly.
Try using files(os.path.join(a, j))
or something like this:
def files(a):
    entries = [os.path.join(a, i) for i in os.listdir(a)]
    f = [i for i in entries if os.path.isfile(i)]
    if len(os.listdir(a)) == 0:
        return
    if len(f) > 0:
        fi.extend(f)
    for j in [i for i in entries if os.path.isdir(i)]:
        files(j)
I tried to stay close to your structure. However, I would write it with only one loop over the entries, something like this:
def files(a):
    entries = [os.path.join(a, i) for i in os.listdir(a)]
    if len(entries) == 0:
        return
    for e in entries:
        if os.path.isfile(e):
            fi.append(e)
        elif os.path.isdir(e):
            files(e)
Another way is not to use a global variable. This can be done using the following. Just modified the previous answer a little bit. I think this might be a little more readable ...
def files(a):
    entries = [os.path.join(a, i) for i in os.listdir(a)]
    folders = filter(os.path.isdir, entries)
    # list() is needed on Python 3, where filter returns an iterator
    normalFiles = list(filter(os.path.isfile, entries))
    for f in folders:
        normalFiles += files(f)
    return normalFiles
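The same recursion can also be written as a generator, which avoids both the global list and building intermediate lists. This is a sketch, not one of the answers above: `iter_files` is a hypothetical name, it uses `os.scandir` rather than `os.listdir`, and it is assumed that symlinked directories should not be followed.

```python
import os


def iter_files(top):
    """Yield the path of every regular file below top, recursing by hand."""
    with os.scandir(top) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Recurse into the subdirectory and pass its results through.
                yield from iter_files(entry.path)
            elif entry.is_file():
                yield entry.path


if __name__ == "__main__":
    for path in iter_files("."):
        print(path)
```

`entry.path` already includes the directory it was found in, which is exactly the piece that was missing when recursing on bare `os.listdir` names.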

How to match a python formatstring to elements in a list?

I use Python and there's a list of file names of different file types. Text files may look like these:
01.txt
02.txt
03.txt
...
Let's assume the text files are all numbered in this manner. Now I want to get all the text files with the number ranging from 1 to 25. So I would like to provide a formatstring like %02i.txt via GUI in order to identify all the matching file names.
My solution so far is a nested for loop. The outer loop iterates over the whole list and the inner loop counts from 1 to 25 for every file:
fmt = '%02i.txt'
for f in files:
    for i in range(1, 25+1):
        if f == fmt % i:
            pass  # do stuff
This nested loop doesn't look very pretty and the complexity is O(n²). So it could take a while on very long lists. Is there a smarter/pythonic way of doing this?
Well, yes, I could use a regular expression like ^\d{2}\.txt$, but a formatstring with % is way easier to type.
You can use a set:
fmt = '%02i.txt'
targets = {fmt % i for i in range(1, 25+1)}
then
for f in files:
    if f in targets:
        pass  # do stuff
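Wrapped in a function, the set-based approach can be checked on a small list. This is only a sketch: `matching_names` is a hypothetical name and the sample file names are invented.

```python
def matching_names(files, fmt, start, stop):
    """Return the entries of files that equal fmt % i for some i in start..stop."""
    targets = {fmt % i for i in range(start, stop + 1)}  # built once
    return [f for f in files if f in targets]            # O(1) membership test each


files = ['01.txt', '02.txt', '26.txt', '03.png', '25.txt', '1.txt']
print(matching_names(files, '%02i.txt', 1, 25))  # ['01.txt', '02.txt', '25.txt']
```

The set is built once from the 25 candidate names, so the work is one hash lookup per file instead of up to 25 string comparisons.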
A more pythonic way to iterate through files is through use of the glob module.
>>> import glob
>>> for f in glob.iglob('[0-9][0-9].txt'):
...     print(f)
...
01.txt
02.txt
03.txt
