Why does appending to a list take forever?

I wrote the following code:
import fnmatch
import os

ll = []
for items in src:
    for file in os.listdir('/Users/swaghccc/Downloads/PulledImages/'):
        if fnmatch.fnmatch(file, items.split('/')[-1]):
            print file
            ll.append(file)
My src list contains paths to images, something like:
/path/to/image.jpg
These images are a subset of the images contained in the directory PulledImages.
The printing of the matched images works correctly.
But when I try to put those matched image names into the list ll, it seems to take forever.
What on earth am I doing wrong?

Appending doesn't take forever. Searching through a list, however, takes more time the longer your list is; and os.listdir(), being an operating system call, can be unavoidably slow when running against a large directory.
To avoid that, use a dictionary or set, not a list, to track the names you want to compare against -- and build that set only once, outside your loop.
# run os.listdir only once, storing the results in a set for constant-time lookups
files = set(os.listdir('/Users/swaghccc/Downloads/PulledImages/'))

ll = []
for item in src:
    name = item.split('/')[-1]
    if name in files:
        ll.append(name)
Community Wiki because I don't believe this question to be within topic guidelines without a MCVE; thus, not taking rep/credit for this answer.

Related

Find and remove duplicate files using Python

I have several folders which contain duplicate files that have slightly different names (e.g. file_abc.jpg, file_abc(1).jpg), i.e. a "(1)" suffix on the end. I am trying to develop a relatively simple method to search through a folder, identify the duplicates, and then delete them. The criterion for a duplicate is "(1)" at the end of the file name, so long as the original also exists.
I can identify the duplicates okay, however I am having trouble creating the text string in the right format to delete them. It needs to be "C:\Data\temp\file_abc(1).jpg", however using the code below I end up with r"C:\Data\temp''file_abc(1).jpg".
I have looked at answers like Finding duplicate files and removing them, however that seems to be far more sophisticated than what I need.
If there are better (and simpler) ways to do this then let me know, however I only have around 10,000 files in total in 50-odd folders, so not a great deal of data to crunch through.
My code so far is:
import os

file_path = r"C:\Data\temp"
file_list = os.listdir(file_path)
print(file_list)

for file in file_list:
    if ("(1)" in file):
        index_no = file_list.index(file)
        print("!! Duplicate file, number in list: " + str(file_list.index(file)))
        file_remove = ('r"%s' % file_path + "'\'" + file + '"')
        print("The text string is: " + file_remove)
        os.remove(file_remove)
Your code is just a little more complex than necessary, and you didn't use a proper way to build a file path out of a directory path and a file name. I also think you should not remove files which have no original (i.e. which aren't duplicates even though their name looks like it).
Try this:
for file_name in file_list:
    if "(1)" not in file_name:
        continue
    original_file_name = file_name.replace('(1)', '')
    if not os.path.exists(os.path.join(file_path, original_file_name)):
        continue  # do not remove files which have no original
    os.remove(os.path.join(file_path, file_name))
Mind though, that this doesn't work properly for files which have multiple occurrences of (1) in them, and files with (2) or higher numbers also aren't handled at all. So my real proposition would be this:
Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
sort all files by size, then
walk linearly through this list, identify the doubles (which are neighbours in this list) and
yield each such double-group (i. e. a small list of files (typically just two) which are identical).
Of course you should then compare the contents of these few files to be sure that two of them aren't merely the same size by accident without being identical. If you are sure you have a group of identical files, remove all but the one with the simplest name (e.g. without suffixes like (1)).
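A minimal sketch of that idea, assuming the goal is just to yield groups of same-sized candidates below a given root (the function name and the grouping-by-dict shortcut are mine, and a content comparison still has to follow before deleting anything):

import os
from collections import defaultdict

def find_duplicate_candidates(root_dir_path):
    # Collect every file below root_dir_path, keyed by its size.
    files_by_size = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root_dir_path):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            files_by_size[os.path.getsize(full_path)].append(full_path)

    # Only groups with more than one file of the same size can contain duplicates.
    for size, paths in files_by_size.items():
        if len(paths) > 1:
            yield paths  # still needs a content check before removing anything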
By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).

How can I improve performance of finding all files in a folder created at a certain date?

There are 10,000 files in a folder. A few files were created on 2018-06-01, a few on 2018-06-09, and so on.
I need to find all files which were created on 2018-06-09, but it is taking too much time (almost 2 hours) to read each file, get its creation date, and then keep only the files created on 2018-06-09.
import os
from datetime import datetime

for file in os.scandir(Path):
    if file.is_file():
        file_ctime = datetime.fromtimestamp(os.path.getctime(file)).strftime('%Y-%m-%d %H:%M:%S')
        if file_ctime[0:10] == '2018-06-09':
            pass  # ...
You could try using os.listdir(path) to get all the files and dirs from the given path.
Once you have all the files and directories you could use filter and a lambda function to create a new list of only the files with the desired timestamp.
You could then iterate through that list to do what work you need to on the correct files.
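A rough sketch of that suggestion (the directory path and target date below are placeholders, and the creation-date check mirrors the question's getctime approach):

import os
from datetime import datetime

dir_path = '/path/to/folder'    # placeholder
target_date = '2018-06-09'

# filter() with a lambda keeps only regular files whose creation date matches target_date
matching_files = list(filter(
    lambda name: os.path.isfile(os.path.join(dir_path, name))
    and datetime.fromtimestamp(
        os.path.getctime(os.path.join(dir_path, name))
    ).strftime('%Y-%m-%d') == target_date,
    os.listdir(dir_path)
))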
Let's start with the most basic thing - why are you building a datetime only to re-format it as a string and then do a string comparison?
Then there is the whole point of using os.scandir() over os.listdir() - os.scandir() returns os.DirEntry objects, which cache the file stats through the os.DirEntry.stat() call.
Depending on the checks you need to perform, os.listdir() might even perform better if you expect to do a lot of filtering on the file names, as then you won't need to build up a whole os.DirEntry just to discard it.
So, to optimize your loop, if you don't expect a lot of filtering on the name:
for entry in os.scandir(Path):
    if entry.is_file() and 1528495200 <= entry.stat().st_ctime < 1528581600:
        pass  # do whatever you need with it
If you do, then better stick with os.listdir() as:
import stat

for entry in os.listdir(Path):
    # do your filtering on the entry name first...
    path = os.path.join(Path, entry)  # build the path to the listed entry...
    stats = os.stat(path)             # cache the file entry statistics
    if stat.S_ISREG(stats.st_mode) and 1528495200 <= stats.st_ctime < 1528581600:
        pass  # do whatever you need with it
If you want to be flexible with the timestamps, use datetime.datetime.timestamp() beforehand to get the POSIX timestamps and then you can compare them against what stat_result.st_ctime returns directly without conversion.
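For example, bounds equivalent to the literals above could be produced for your local timezone like this (assuming Python 3):

from datetime import datetime

# POSIX timestamps bounding 2018-06-09 in local time; compare st_ctime against these directly.
day_start = datetime(2018, 6, 9).timestamp()
day_end = datetime(2018, 6, 10).timestamp()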
However, even your original, non-optimized approach should be significantly faster than 2 hours for a mere 10k entries. I'd check the underlying filesystem, too; something seems wrong there.

Merging Two Lists Where one is a Sublist Created From Another List

True to the adage that weeks of programming will save you hours of work, I started writing a little .py to back up my project files by going through all the network drives, locating each root project folder, having a look inside to see if there is a settings folder, and if there is, backing it up off site.
I am firstly creating a list of the folders I am interested in:
import os
import re

Ant = "I:"
antD = os.listdir(Ant + '\\')
# antD is the list of all stuff on ANT's D:\
antDpath = (Ant + '\\')
namingRegex = re.compile(r"\d\d\d\d_\w[^.]*$")
# the above regex returns all folders that begin with four digits
# and a _ but do not contain a . so as not to include files
toCopyAnt = filter(namingRegex.match, antD)
# List of folders to look in
So there I have a nice list of the folders I am interested in, in this form:
>>>toCopyAnt
['9001_Project44_IDSF', '5015_Project0_517', '8012_Project_whatever']
Next up I go on and make this mess:
inAnt = []
for i in range(len(toCopyAnt)):
    dirAnt = (antDpath + toCopyAnt[i])
    inAnt.append(dirAnt)
This part above merely builds the full paths so that when os.listdir below scans the directories I get proper output.
antList = []
for i in range(len(toCopyAnt)):
    antL = os.listdir(inAnt[i])
    antList.append(antL)
This part here goes through each of the directories I have narrowed down and lists the contents.
The list antList looks like this:
>>>antList
[['Data', 'MagicStuff', 'Info', 'settings'], ['Data', 'processing', 'Settings', 'modularTakeOff'], ['Blue_CCheese', 'Data', 'Rubbishbin', 'songBird', 'images', 'Settings', 'Cakes'], ['canDelete', 'Backup']]
it is a list of some lists... ugh.
My aim is to join these two fellas together such that I have another list which gives me the full path names for each subdirectory, like so in the case of toCopyAnt[2] and antList[2]:
['I:\8012_Project_whatever\canDelete', 'I:\8012_Project_whatever\Backup']
With the final goal being to use this new list, remove the strings I don't need, and then pass that to tree_copy to make my life better in every way.
I figure there is likely a far more efficient way to do this and I would love to know it, but at the same time I would also like to tackle the issue as I have outlined it, as it would increase my yield of lists greatly. I've only been at this Python stuff for a few days now, so I haven't got all the tools out of the box just yet.
If what I am doing here is as painful and laboured as I suspect it is please point me in the right direction and I'll start over.
Ngiyabonga!
Ingwe
###########NEXT DAY EDIT###########
So I slept on this and have been attacking the logic again. Still convinced that this is messy and clumsy but I must soldier on.
Below is the first part of what I had working last night, with some different comments:
inant = []
for i in range(len(toCopyAnt)):
    dirant = (Ant + '\\' + toCopyAnt[i])  ## makes the full path of the higher folder
    inant.append(dirant)

antList = []
for i in range(len(toCopyAnt)):
    antL = os.listdir(inant[i])  ## provides names of all subfolders
    antList.append(antL)

for i in range(len(antList)):
    antList[i] = filter(settingRegex.match, antList[i])  ## filters out only inpho project folders
## This mess here next builds a list of full path folders to backup.
preAntpath = []
postAntpath = []
for i in range(len(inant)):
    try:
        preAntpath = ("%s\\%s\\" % (inant[i], antList[i][0]))
        postAntpath.append(preAntpath)
        preAntpath = ("%s\\%s\\" % (inant[i], antList[i][1]))
        postAntpath.append(preAntpath)
        preAntpath = ("%s\\%s\\" % (inant[i], antList[i][2]))
        postAntpath.append(preAntpath)
        preAntpath = ("%s\\%s\\" % (inant[i], antList[i][3]))
        postAntpath.append(preAntpath)
    except:
        pass
Ok, so the reason there are 4 iterations is just to make sure that I include the backup copies of the folders I want to keep. Sometimes there is a 'settings' and a 'settings - copy'; I want both.
##
##
## some debugging next...
for i in range(len(postAntpath)):
    print postAntpath[i]
It works. I get a new list with the merged data from the two lists.
So where inant[x] and antList[x] line up (they are always the same length), each entry of antList[x][0-3] gets joined onto inant[x].
If, as is often the case, there is no settings folder in the project folder, the try: seems to take care of it, although I will need to test some situations where the project folder is 1st or 2nd in the list (in my current case it is last). I don't know if try: tries each iteration or stops when it hits the first error...
So that is how I made my two lists (one being a list of lists) into a single usable list.
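For comparison, here is a sketch of the same merge written with os.path.join and a nested list comprehension (using the variable names from above; this is just an illustration of the idea, not a drop-in replacement for the try/except block):

import os

# Join each project folder with each of its (already filtered) subfolders.
postAntpath = [
    os.path.join(Ant + '\\', folder, sub)
    for folder, subs in zip(toCopyAnt, antList)
    for sub in subs
]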
Next up is to try and figure out how to copy the settings folder as is and not just all of its contents. Copytree is driving me mad.
Again, any comments or advice would be welcome.
Ingwe.

Data structures suitable for searching

I have a common problem: I have some data and I want to search in it. My issue is that I don't know the proper data structures and algorithms for this situation.
There are two kinds of objects - Process and Package. Both have some properties, but they are only data structures (they don't have any methods). Next, there are a PackageManager and something that could be called a ProcessManager, which both have a function returning the list of files that belong to some Package or that are used by some Process.
So semantically, we can imagine this data as:
Packages:
    Package_1
        file_1
        file_2
        file_3
    Package_2
        file_4
        file_5
        file_6
Note that a file that belongs to Package_k cannot belong to Package_l for k != l :-)
Processes:
    Process_1
        file_2
        file_3
    Process_2
        file_1
The files used by processes correspond to the files owned by packages. However, the rule above doesn't apply here as it does for packages - that means n processes can use the same file at the same time.
Now for the task: I need to find matches between processes and packages - for a given list of packages, I need to find the list of processes which use any of the files owned by those packages.
My temporary solution was to make a list of [package_name, package_files] and a list of [process_name, process_files], and for every file from every package I searched through every file of every process looking for a match. Of course this can only be a temporary solution given its horrible time complexity (even when I sort the files and use binary search).
What can you recommend for this kind of searching, please?
(I am coding it in python)
Doing the matching with sets should be faster:
watched_packages = [Package_1, Package_3]  # Packages to consider
watched_files = {  # set comprehension
    file_
    for package in watched_packages
    for file_ in package.list_of_files
}
watched_processes = [
    process
    for process in all_processes
    if any(
        file_ in watched_files
        for file_ in process.list_of_files
    )
]
Based on my understanding of what you are trying to do - given a file name, you want to find a list of all the processes that use that file, this snippet of code should help:
from collections import defaultdict

# First make a dictionary that maps each file to all the processes it is a member of.
file_process_map = defaultdict(list)
for p in processes:
    for fn in p.file_list:
        file_process_map[fn].append(p)
Basically, we're converting your existing structure (where a process has one or more files) into a structure where we have a filename, and a list of processes that use it.
Now when you have a file you need to search for (in the processes) just look it up in the "file_process_map" dictionary and you'll have a list of all the processes that use the given file.
It is assumed here that "processes" is a list of objects, and that each object has a file_list attribute containing a list of associated files. Obviously, depending on your data structures you might need to alter the code.
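A hypothetical lookup on top of that map might then look like this (some_package and its list_of_files attribute are assumptions, borrowed from the naming in the first answer):

# Hypothetical usage: which processes touch any file owned by a given package?
processes_using_package = {
    proc
    for file_name in some_package.list_of_files  # attribute name assumed
    for proc in file_process_map[file_name]
}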

Limitation to Python's glob?

I'm using glob to feed file names to a loop like so:
import glob

inputcsvfiles = glob.iglob('NCCCSM*.csv')
for x in inputcsvfiles:
    csvfilename = x
    # do stuff here
The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.
Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959) is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?
PS - initially I was using glob.glob, but glob.iglob fares no better.
Edit:
An expansion of above for more context...
# typical input files look like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
inputcsvfiles = glob.iglob('NCCCSM*.csv')

# loop over individual input files
for x in inputcsvfiles:
    csvfile = x
    modelname = x[0:5]
    # ArcPy
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # do more stuff after
The script fails at the ArcPy line, where the "csvfile" variable gets passed into the command. The error reported is that it can't find a specified csv file (e.g., "NCCSM20110101.csv"), when in fact, the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times as I have above? Again, this will work fine if the directory being glob'd only has 100 or so files, but if there's a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.
Try doing an ls * in a shell on those 10,000 entries and the shell would fail too. How about walking the directory and yielding those files one by one instead?
# credit - @dabeaz - generators tutorial
import os
import fnmatch

def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Example use
if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print name
One issue that arose was not with Python per se, but rather with ArcPy and/or MS handling of CSV files (more the latter, I think). As the loop iterates, it creates a schema.ini file to which information on each CSV file processed in the loop gets added and stored. Over time the schema.ini gets rather large, and I believe that's when the performance issues arise.
My solution, although perhaps inelegant, was to delete the schema.ini file during each loop iteration to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and BASH scripting in the end.
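A rough sketch of that workaround, assuming the schema.ini is written into the same working directory as the CSVs (the path is an assumption):

import os
import glob

schema_path = 'schema.ini'  # assumed to be created in the current working directory

for csvfile in glob.iglob('NCCCSM*.csv'):
    # ... ArcPy join and further processing here ...
    # remove the schema.ini so it cannot grow with every processed CSV
    if os.path.exists(schema_path):
        os.remove(schema_path)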
If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.
There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
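For reference, the same limit can also be read from within Python on Unix-like systems via the resource module:

import resource

# soft and hard limits on open file descriptors for the current process;
# the soft limit is what `ulimit -n` reports
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)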
