Python 3.6 - enumerate files

I am trying to loop over a series of jpg files in a folder. I found example code for that:
for n, image_file in enumerate(os.scandir(image_folder)):
which will loop through the image files in image_folder. However, it does not seem to follow any sequence. My files are named 000001.jpg, 000002.jpg, 000003.jpg, and so on, but when the code runs it does not follow that order:
000213.jpg
000012.jpg
000672.jpg
....
What seems to be the issue here?

Here's the relevant bit on os.scandir():
os.scandir(path='.')
Return an iterator of os.DirEntry objects
corresponding to the entries in the directory given by path. The
entries are yielded in arbitrary order, and the special entries '.'
and '..' are not included.
You should not expect it to be in any particular order. The same goes for listdir() if you were considering this as an alternative.
If you strictly need them to be in order, consider sorting them first:
scanned = sorted(os.scandir(image_folder), key=lambda f: f.name)
for n, image_file in enumerate(scanned):
    # ... rest of your code
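Note that sorting by name works here because your names are zero-padded to the same width; a plain string sort would misorder unpadded names like 2.jpg and 10.jpg. If your names weren't padded, you could sort on the numeric stem instead. A minimal sketch, reusing image_folder from the question and assuming every stem is all digits:

import os

def numeric_key(entry):
    # "000213.jpg" -> 213; assumes the stem contains only digits
    return int(os.path.splitext(entry.name)[0])

scanned = sorted(os.scandir(image_folder), key=numeric_key)
for n, image_file in enumerate(scanned):
    # ... rest of your code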

I prefer to use glob:
The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell, although results are
returned in arbitrary order. No tilde expansion is done, but *, ?, and
character ranges expressed with [] will be correctly matched.
You will need glob if you handle more complex file structures, so starting with it isn't a bad idea. For your case you can also use os.scandir() as mentioned above.
Reference: glob module
import glob

files = sorted(glob.glob(r"C:\Users\Fabian\Desktop\stack\img\*.jpg"))
for key, myfile in enumerate(files):
    print(key, myfile)
Notice that even if there are other files in the folder, like .txt files, they won't be in your list.
Output:
C:\Users\Fabian\Desktop\stack>python c:/Users/Fabian/Desktop/stack/img.py
0 C:\Users\Fabian\Desktop\stack\img\img0001.jpg
1 C:\Users\Fabian\Desktop\stack\img\img0002.jpg
2 C:\Users\Fabian\Desktop\stack\img\img0003.jpg
....

Related

How to iterate over files in specific directories?

I'd like to iterate over files in two folders in a directory only, and ignore any other files/directories.
e.g. the paths "dirA/subdirA/folder1" and "dirA/subdirA/folder2".
I tried passing both to pathlib as:
root_dir_A = "dirA/subdirA/folder1"
root_dir_B = "dirA/subdirA/folder2"
for file in Path(root_dir_A,root_dir_B).glob('**/*.json'):
    json_data = open(file, encoding="utf8")
    ...
But it only iterates over the 2nd path in Path(root_dir_A,root_dir_B).
You can't pass two separate directories to Path(). You'll need to loop over them.
for dirpath in (root_dir_A, root_dir_B):
    for file in Path(dirpath).glob('**/*.json'):
        ...
According to the documentation, Path("foo", "bar") produces "foo/bar"; the docs also say that when a segment is an absolute path, all previous segments are ignored, which is why you only see the second path. Either way, it doesn't do what you hoped it would.
Please check the output of Path(root_dir_A,root_dir_B) to see if it returns what you want.
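To see concretely what happens, a quick demonstration (POSIX-style paths assumed; the /tmp path is just an illustration):

from pathlib import Path

# Relative segments are simply joined, which is probably not what you want:
print(Path("dirA/subdirA/folder1", "dirA/subdirA/folder2"))
# dirA/subdirA/folder1/dirA/subdirA/folder2

# An absolute second segment discards everything before it:
print(Path("dirA/subdirA/folder1", "/tmp/folder2"))
# /tmp/folder2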
In your specific case this should work:
path_root = Path('dirA')
for path in path_root.glob('subdirA/folder[12]/**/*.json'):
    ...
If your paths aren't homogeneous enough, you might have to chain generators, i.e.:
from itertools import chain
content_dir_A = Path(root_dir_A).glob('**/*.json')
content_dir_B = Path(root_dir_B).glob('**/*.json')
content_all = chain(content_dir_A, content_dir_B)

for path in content_all:
    ...

Grouping and deleting Files

I have to come up with a solution to delete all files but the newest two in a directory structure of our ownCloud installation. To be exact, it's the file versioning folder. The files in each folder have the following structure:
Filename.Ext.v[random_Number]
The hard part is that there are different files in one folder that I need to keep.
E.g. the content of folder A:
HelloWorld.txt.v123
HelloWorld.txt.v555
HelloWorld.txt.v666
OtherFile.pdf.v143
OtherFile.pdf.v1453
OtherFile.pdf.v123
OtherFile.pdf.v14345
YetOtherFile.docx.v11113
In this case we have 3 "basefiles", and I would have to keep the newest two files of each "basefile".
I tried Python 3 with os.walk and a regex to filter out the basename. I tried built-in Linux tools like find with -ctime. I could also use bash.
But my real problem is more the logic. How would you approach this task?
EDIT 2:
Here my progress:
import os
from itertools import groupby
directory = 'C:\\Users\\x41\\Desktop\\Test\\'
def sorted_ls(directory):
    mtime = lambda f: os.stat(os.path.join(directory, f)).st_mtime
    return list(sorted(os.listdir(directory), key=mtime))

print(sorted_ls(directory))

for basename, group in groupby(sorted_ls(directory), lambda x: x.rsplit('.')[0]):
    for i in basename:
        finallist = []
        for a in group:
            finallist.append(a)
        print(finallist[:-2])
I am almost there. The sorted_ls() function sorts the files in the directory by their mtime value, and groupby() then groups that sorted list with my key function.
Now the problem is that groupby() needs the list pre-sorted by the grouping key, but sorting by name before the groupby() would reset my custom mtime sort. And as it stands, groupby() also returns more groups than anticipated.
If my sorted list looks like this:
['A.txt.1', 'B.txt.2', 'B.txt.1', 'B.txt.3', 'A.txt.2']
I would get three groups: A, B, and A again.
Any suggestions?
FINAL RESULT
Here is my final version, with recursion added:
import os
from itertools import groupby
directory = r'C:\Users\x41\Desktop\Test'
for dirpath, dirs, files in os.walk(directory):
    output = []
    for basename, group in groupby(sorted(files), lambda x: x.rsplit('.')[0]):
        output.extend(sorted(group, key=lambda x: os.stat(os.path.join(dirpath, x)).st_mtime)[:-2])
    for file in output:
        os.remove(os.path.join(dirpath, file))
You need to do a simple sort first on the file names so that they are in alphabetical order to allow the groupby function to work correctly.
With each of the resulting file groups, you can then sort using your os.stat key as follows:
import os
from itertools import groupby
directory = r'C:\Users\x41\Desktop\Test'
output = []
for basename, group in groupby(sorted(os.listdir(directory)), lambda x: x.rsplit('.')[0]):
    output.extend(sorted(group, key=lambda x: os.stat(os.path.join(directory, x)).st_mtime)[-2:])

print(output)
This will produce a single list containing the latest two files from each group.
The logic isn't extremely hard here, if that's the only thing you're looking for.
Group the files by base name, in a Python dictionary for example, where the key is the "base filename" (such as "HelloWorld.txt") and the value is a list of all files with the same base name sorted by ctime (or some other time metric, depending on how you define "newest"); then delete all files in each list from index 2 onwards.
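A minimal sketch of that approach, assuming everything before the ".v" suffix is the base name and using st_ctime as the time metric:

import os
from collections import defaultdict

directory = r'C:\Users\x41\Desktop\Test'

groups = defaultdict(list)
for fname in os.listdir(directory):
    base = fname.rsplit('.v', 1)[0]   # "HelloWorld.txt.v123" -> "HelloWorld.txt"
    groups[base].append(fname)

for versions in groups.values():
    # newest first, by ctime
    versions.sort(key=lambda f: os.stat(os.path.join(directory, f)).st_ctime,
                  reverse=True)
    for old in versions[2:]:          # keep the two newest, delete the rest
        os.remove(os.path.join(directory, old))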

Iterate over infinite files in a directory in Python

I'm using Python 3.3.
If I'm manipulating a potentially infinite number of files in a directory (bear with me; just pretend I have a filesystem that supports that), how do I do that without encountering a MemoryError? I only want the string name of one file to be in memory at a time. I don't want them all in an iterable, as that would cause a memory error when there are too many.
Will os.walk() work just fine, since it returns a generator? Or, do generators not work like that?
Is this possible?
If you have a system for naming the files that can be figured out computationally, you can do something like the following. It iterates over any number of numbered txt files, with only one name in memory at a time; you could convert to another calculable naming scheme to get shorter filenames for large numbers:
import os

def infinite_files(path):
    num = 0
    while True:
        fname = os.path.join(path, str(num) + ".txt")
        if not os.path.exists(fname):
            break
        # perform operations on the file: fname
        num += 1
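The same idea reads a little more naturally as a generator, so the caller can iterate lazily; a sketch, assuming the same zero-based naming scheme:

import os

def iter_numbered_files(path):
    """Yield consecutively numbered .txt files until one is missing."""
    num = 0
    while True:
        fname = os.path.join(path, str(num) + ".txt")
        if not os.path.exists(fname):
            return
        yield fname
        num += 1

# Usage: only one filename is in memory at a time.
# for fname in iter_numbered_files("/some/dir"):
#     process(fname)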
[My old inapplicable answer is below]
glob.iglob seems to do exactly what the question asks for. [EDIT: It doesn't. It actually seems less efficient than listdir(), but see my alternative solution above.] From the official documentation:
glob.glob(pathname, *, recursive=False)
Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).
glob.iglob(pathname, *, recursive=False)
Return an iterator which yields the same values as glob() without actually storing them all simultaneously.
iglob returns an "iterator which yields" or-- more concisely-- a generator.
Since glob.iglob has the same behavior as glob.glob, you can search with wildcard characters:
import glob

for x in glob.iglob("/home/me/Desktop/*.txt"):
    print(x)  # prints all txt files in that directory
I don't see a way for it to differentiate between files and directories without doing it manually. That is certainly possible, however.
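Doing it manually is straightforward, though; a sketch that filters out directories with os.path.isfile():

import glob
import os

for x in glob.iglob("/home/me/Desktop/*"):
    if os.path.isfile(x):  # skip anything that isn't a regular file
        print(x)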

Scanning duplicate file names

Imagine several folders such as
d:\myfolder\abc
d:\myfolder\ard
d:\myfolder\kjes
...
And in each folder, there are files such as
0023.txt, 0025.txt, 9932.txt in d:\myfolder\abc
2763.txt, 1872.txt, 0023.txt, 7623.txt in d:\myfolder\ard
2763.txt, 2873.txt, 0023.txt in d:\myfolder\kjes
So, there are three 0023.txt files, and two 2763.txt files.
I want to create a file (say, d:\myfolder\dup.txt) which contains the following information:
0023 3
0025 1
9932 1
2763 2
1872 1
7623 1
2873 1
How can I implement that in Python? Thanks.
Not extensively tested, but this works:
import os, os.path

dupnames = {}
for root, dirs, files in os.walk('myfolder'):
    for file in files:
        fulpath = os.path.join(root, file)
        if file in dupnames:
            dupnames[file].append(fulpath)
        else:
            dupnames[file] = [fulpath]

for name in sorted(dupnames):
    print(name, len(dupnames[name]))
This works in the following way:
Creates an empty dict;
Walks the file hierarchy;
Creates an entry in a dict of lists (or appends to an existing list), mapping each base name to [paths to file].
After the os.walk you will have a dict like:
{'0023.txt': ['d:\\myfolder\\abc\\0023.txt', 'd:\\myfolder\\kjes\\0023.txt'], '0025.txt': ['d:\\myfolder\\abc\\0025.txt']}
So to get your output, just iterate over the sorted dict and count the entries in the list. You can either redirect the output of this to a file or open your output file directly in Python.
You show your output with the extension stripped -- 0023 vs 0023.txt. What should happen if you have 0023.txt and 0023.py? Same file or different? To the OS they are different files so I kept the extension. It is easily stripped if that is your desired output.
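For example, with os.path.splitext:

import os

name, ext = os.path.splitext('0023.txt')  # name == '0023', ext == '.txt'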
Step 1: use os.walk to collect all the file names
Step 2: count how many times each name occurs with collections.Counter
Step 3: go through the counts and report the duplicates.
import os
import collections

path = r"d:\myfolder"

filelist = []
for (dirpath, dirs, files) in os.walk(path):
    filelist.extend(files)

filecount = collections.Counter(filelist)
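Building on the filecount Counter above, a short sketch that writes the dup.txt format from the question (stripping the extension, as in the expected output):

import os

with open(r"d:\myfolder\dup.txt", "w") as out:
    for name, count in filecount.items():
        stem = os.path.splitext(name)[0]  # "0023.txt" -> "0023"
        out.write("{} {}\n".format(stem, count))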
This isn't precisely what you've asked for, but it might work for you without writing a line of code, albeit at a bit of a performance penalty. As a bonus, it'll group together files that have the same content but different filenames:
http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html
The latest version is almost always O(n), without sacrificing even a little bit of accuracy.

batch search and replace strings in filenames with python

I am trying to write a small python script to rename a bunch of filenames by searching and replacing. For example:
Original filename:
MyMusic.Songname.Artist-mp3.iTunes.mp3
Intended result:
Songname.Artist.mp3
what i've got so far is:
#!/usr/bin/env python
from os import rename, listdir

mustgo = "MyMusic."
filenames = listdir('.')

for fname in filenames:
    if fname.startswith(mustgo):
        rename(fname, fname.replace(mustgo, '', 1))
(I got it from this site, as far as I can remember.)
Anyway, this will only get rid of the string at the beginning, but not of the other substrings in the filename.
I would also like to use a separate file (e.g. badwords.txt) containing all the strings that should be searched for and replaced, so that I can update them without having to edit the whole code.
Content of badwords.txt
MyMusic.
-mp3
-MP3
.iTunes
.itunes
I have been searching for quite some time now but haven't found anything. I would appreciate any help!
Thank you!
import fnmatch
import re
import os

with open('badwords.txt', 'r') as f:
    # fnmatch.translate anchors each pattern at the end of the string;
    # strip the anchor so badwords can match anywhere in the filename
    pat = '|'.join(fnmatch.translate(badword).replace(r'\Z', '')
                   for badword in f.read().splitlines())

for fname in os.listdir('.'):
    new_fname = re.sub(pat, '', fname)
    if fname != new_fname:
        print('{o} --> {n}'.format(o=fname, n=new_fname))
        os.rename(fname, new_fname)

# MyMusic.Songname.Artist-mp3.iTunes.mp3 --> Songname.Artist.mp3
Note that it is possible for some files to be overwritten (and thus
lost) if two names get reduced to the same shortened name after
badwords have been removed. A set of new fnames could be kept and
checked before calling os.rename to prevent losing data through
name collisions.
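A sketch of that safeguard, reusing pat from above and simply skipping any rename that would collide (the skip policy is just one choice; appending a suffix would also work):

import os
import re

seen = set()
for fname in os.listdir('.'):
    new_fname = re.sub(pat, '', fname)
    if new_fname == fname:
        continue
    if new_fname in seen or os.path.exists(new_fname):
        print('skipping {0}: {1} is already taken'.format(fname, new_fname))
        continue
    seen.add(new_fname)
    os.rename(fname, new_fname)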
fnmatch.translate takes shell-style patterns and returns the equivalent regular expression, anchored at the end of the string. It is used above to convert badwords (e.g. '.iTunes') into regular expressions (e.g. r'(?s:\.iTunes)\Z' on Python 3), with the trailing anchor stripped so each badword can match anywhere in the name.
Your badwords list seems to indicate you want to ignore case. You could ignore case by adding '(?i)' to the beginning of pat:

with open('badwords.txt', 'r') as f:
    pat = '(?i)' + '|'.join(fnmatch.translate(badword).replace(r'\Z', '')
                            for badword in f.read().splitlines())
