I am executing the python code that follows.
I am running it on a folder ("articles") which has a couple hundred subfolders and 240,226 files in all.
I am timing the execution. At first the times were pretty stable but went non-linear after 100,000 files. Now the times (I am timing at 10,000-file intervals) can go non-linear after 30,000 files or so (or not).
I have the Task Manager open and can correlate the slow-downs with 99% disk usage by python.exe. I have tried gc.collect(), dels, etc., and turned off Windows indexing. I have restarted Windows and emptied the trash (I have a few hundred GB free). Nothing helps; if anything, the disk usage seems to be getting more erratic.
Sorry for the long post - Thanks for the help
import glob
import os

def get_filenames():
    dirs = []        # not shown in the original post; assumed to start empty
    nxml_files = []  # not shown in the original post; assumed to start empty
    for (dirpath, dirnames, filenames) in os.walk("articles/"):
        dirs.extend(dirnames)
    for dir in dirs:
        path = "articles" + "\\" + dir
        nxml_files.extend(glob.glob(path + "/*.nxml"))
    return nxml_files

def extract_text_from_files(nxml_files):
    for nxml_file in nxml_files:
        fast_parse(nxml_file)

def fast_parse(infile):
    file = open(infile,"r")
    filetext = file.read()
    tag_breaks = filetext.split('><')
    paragraphs = [tag_break.strip('p>').strip('</') for tag_break in tag_breaks if tag_break.startswith('p>')]

def run_files():
    nxml_files = get_filenames()
    extract_text_from_files(nxml_files)

if __name__ == "__main__":
    run_files()
There are some things that could be optimized.
First, if you open files, close them as well. A with open(...) as name: block does that for you. BTW, in Python 2 file is a bad choice for a variable name, since it is the name of a built-in.
You can remove one disk read by doing string comparisons on the file names instead of calling glob.
And last but not least: os.walk yields its results lazily, so don't buffer them into a list; process everything inside one loop. This will save a lot of memory.
That is what I can advise from the code. For more details on what is causing the I/O you should use profiling. See https://docs.python.org/2/library/profile.html for details.
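A minimal sketch of the three suggestions combined, assuming Python 3 and the same paragraph-splitting logic as in the question. The function name iter_nxml_paragraphs is made up for illustration, and unlike the original it picks up .nxml files at every folder level, not just the first:

```python
import os

def iter_nxml_paragraphs(top):
    """Yield (path, paragraphs) for every .nxml file below top, one file at a time."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # Plain string comparison instead of a second directory read via glob.
            if not name.endswith(".nxml"):
                continue
            path = os.path.join(dirpath, name)
            # The with block closes the file as soon as it has been read.
            with open(path, "r") as handle:
                filetext = handle.read()
            tag_breaks = filetext.split("><")
            paragraphs = [t.strip("p>").strip("</") for t in tag_breaks if t.startswith("p>")]
            yield path, paragraphs
```

Because this is a generator, only one file's text is held in memory at a time; a caller would loop over iter_nxml_paragraphs("articles") and handle each result as it arrives.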
os.walk has a helpful example:
import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories
The documentation notes that os.walk got faster in Python 3.5 by switching to os.scandir, but it doesn't mention that the implementation is still sub-optimal on Windows.
https://www.python.org/dev/peps/pep-0471/ does describe this and gets it almost right. However, it recommends using recursion. With arbitrary folder structures this doesn't work so well, because you can hit Python's recursion limit: by default you can only descend about 1000 folders deep, which isn't necessarily unrealistic if you start at the root of the filesystem. The real limit isn't actually 1000 either. It's 1000 minus your Python call depth at the point where you call the function. If you're doing this in response to a web service request through Django with lots of business logic layers, it wouldn't be unrealistic to get close to this limit.
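The shrinking headroom is easy to observe. This is a rough sketch using only the standard library; the helper names remaining_recursion_depth and nested are invented for the illustration, and the number is approximate because the interpreter's own bookkeeping varies between versions:

```python
import inspect
import sys

def remaining_recursion_depth():
    """Return roughly how many more nested calls fit before RecursionError."""
    # len(inspect.stack()) counts the frames currently on the call stack,
    # including this function's own frame.
    return sys.getrecursionlimit() - len(inspect.stack())

def nested(levels):
    """Call remaining_recursion_depth() from `levels` frames further down."""
    if levels == 0:
        return remaining_recursion_depth()
    return nested(levels - 1)
```

Calling nested(10) reports ten fewer available frames than nested(0) from the same call site, which is exactly the effect a deep framework call stack has on a recursive directory walker.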
The following snippet should be optimal on all operating systems & handle any folder structure you throw at it. Memory usage will obviously grow the more folders you encounter but to my knowledge there's nothing you can really do about that as you somehow have to keep track of where you need to go.
import os

def get_tree_size(path):
    total_size = 0
    dirs = [path]  # explicit stack of directories still to visit
    while dirs:
        next_dir = dirs.pop()
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size
It's possible that using a collections.deque would speed things up compared with the naive use of a list here, but I suspect it would be hard to write a benchmark that shows this with disk speeds what they are today.
I have been using OneDrive to store a large number of images and now I need to process them, so I have synced my OneDrive folder to my computer, which takes almost no space on disk. However, since I have to open() them in my code, they all get downloaded, which would take much more space than is available on my computer. I can manually use the Free up space action in the right-click context menu, which keeps them synced without taking up space.
I'm looking for a way to do the same thing but in my code instead, after every image I process.
Trying to find how to get the commands of contextual menu items led me to these two places in the registry:
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\Directory\shell
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\*\shellex\ContextMenuHandlers
However I couldn't find anything related to it and those trees have way too many keys to check blindly. Also this forum post (outside link) shows a few ways to free up space automatically, but it seems to affect all files and is limited to full days intervals.
So is there any way either to access that command or to free up the space from Python?
According to this Microsoft post, it is possible to call Attrib.exe to do that sort of manipulation on files.
This little snippet does the job on a per-file basis. As shown in the linked post, it's also possible to apply it to the full contents of a folder using the /s argument, and much more.
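import subprocess

def process_image(path):
    # Open the file (binary mode, since it is an image), which downloads it automatically
    with open(path, 'rb') as img:
        print(img)
    # Free up space (OneDrive) after usage: +U sets the unpinned attribute, -P clears the pinned one.
    # Passing a list lets subprocess handle quoting of paths that contain spaces.
    subprocess.run(['attrib', '+U', '-P', path])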
The download and freeing up space are fairly quick, but in the case of running this heavily in parallel, it is possible that some disk space will be consumed for a short amount of time. In general though, this is pretty instantaneous.
In addition to Mat's answer: if you are working on a Mac, you can replace Attrib.exe with "/Applications/OneDrive.App/Contents/MacOS/OneDrive /unpin" to make the file online-only.
import subprocess
path = "/Users/OneDrive/file.png"
subprocess.run(["/Applications/OneDrive.App/Contents/MacOS/OneDrive", "/unpin", path])
Free up space for multiple files:
import os
import subprocess
path = r"C:\Users\yourUser\Folder"
diret = os.listdir(path)
for di in diret:
    dir_atual = path + "\\" + di
    for root, dirs, files in os.walk(dir_atual):
        for file in files:
            arquivos = (os.path.join(root, file))
            print (arquivos)
            subprocess.run('attrib +U -P "' + arquivos + '"')
I programmed a scanner that looks for certain files on all hard drives of a system that gets scanned. Some of these systems are pretty old, running Windows 2000 with 256 or 512 MB of RAM but the file system structure is complex as some of them serve as file servers.
I use os.walk() in my script to parse all directories and files.
Unfortunately, we noticed that the scanner consumes a lot of RAM after scanning for some time, and we figured out that the os.walk function alone uses about 50 MB of RAM after 2 hours of walking the file system. This RAM usage keeps increasing over time: we had about 90 MB of RAM in use after 4 hours of scanning.
Is there a way to avoid this behaviour? We also tried "betterwalk.walk()" and "scandir.walk()". The result was the same.
Do we have to write our own walk function that removes already scanned directory and file objects from memory so that the garbage collector can remove them from time to time?
Thanks
Have you tried the glob module?
import os, glob
def globit(srchDir):
    srchDir = os.path.join(srchDir, "*")
    for file in glob.glob(srchDir):
        print(file)
        # Recurse; for a regular file the pattern matches nothing, so this is a no-op
        globit(file)
if __name__ == '__main__':
    dir = r'C:\working'
    globit(dir)
If you are running in the os.walk loop, del() everything that you don't need anymore. And try running gc.collect() at the end of every iteration of os.walk.
Generators are a better solution, as they compute results lazily.
Here is one example implementation.
import os
import fnmatch
# this may or may not be implemented
def list_dir(path):
    for name in os.listdir(path):
        yield os.path.join(path, name)
# modify this to take some pattern as input
def os_walker(top):
    for root, dlist, flist in os.walk(top):
        for name in fnmatch.filter(flist, '*.py'):
            yield os.path.join(root, name)
all_dirs = list_dir("D:\\tuts\\pycharm")
for l in all_dirs:
    for name in os_walker(l):
        print(name)
Thanks to David Beazley
I'm using glob to feed file names to a loop like so:
inputcsvfiles = glob.iglob('NCCCSM*.csv')
for x in inputcsvfiles:
    csvfilename = x
    # do stuff here
The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.
Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959) is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?
PS: initially I was using glob.glob, but glob.iglob fares no better.
Edit:
An expansion of above for more context...
# typical input file looks like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
inputcsvfiles = glob.iglob('NCCCSM*.csv')
# loop over individual input files
for x in inputcsvfiles:
    csvfile = x
    modelname = x[0:5]
    # ArcPy
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # do more stuff after
The script fails at the ArcPy line, where the "csvfile" variable gets passed into the command. The error reported is that it can't find a specified csv file (e.g., "NCCCSM20110101.csv"), when in fact the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times as I have above? Again, this works fine if the directory being globbed only has 100 or so files, but if there are a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.
Try running ls * in a shell on those 10,000 entries and the shell may fail too. How about walking the directory and yielding those files one by one?
# credit - dabeaz - generators tutorial
import os
import fnmatch
def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)
# Example use
if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print(name)
One issue that arose was not with Python per se, but rather with ArcPy and/or MS handling of CSV files (more the latter, I think). As the loop iterates, it creates a schema.ini file whereby information on each CSV file processed in the loop gets added and stored. Over time, the schema.ini gets rather large and I believe that's when the performance issues arise.
My solution, although perhaps inelegant, was to delete the schema.ini file on each pass through the loop to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and Bash scripting in the end.
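For anyone wanting to try the same workaround, here is a small sketch, assuming Python 3 and that schema.ini is written to the same directory as the CSV files. The helper name remove_schema_ini is hypothetical and is not part of ArcPy:

```python
import os

def remove_schema_ini(directory):
    """Delete schema.ini from directory if present; return True if a file was removed."""
    schema_path = os.path.join(directory, "schema.ini")
    try:
        os.remove(schema_path)
    except FileNotFoundError:
        # Nothing has been written yet, so there is nothing to clean up.
        return False
    return True
```

Calling remove_schema_ini(".") at the end of each loop iteration, after the join has finished with the current CSV, keeps the file from growing across iterations.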
If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.
There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
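The same limit can be read from inside Python on Unix-like systems. This is only a sketch: the resource module does not exist on Windows (where ArcPy usually runs), so the hypothetical helper open_file_limits returns None there:

```python
try:
    import resource  # standard library, but Unix-only
except ImportError:
    resource = None

def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors, or None where unsupported."""
    if resource is None:
        return None
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

The soft limit is the figure that ulimit -n reports by default in most shells.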
I'm writing two Python scripts that both parse files. One is a standard Unix logfile and the other is a binary file. I'm trying to figure out the best way to monitor these so I can read data as soon as they're updated. Most of the solutions I've found so far are Linux-specific, but I need this to work on FreeBSD.
Obviously one approach would be to just run my script every X amount of time, but this seems gross and inefficient. If I want my Python app running continuously in the background monitoring a file and acting on it once it's changed/updated, what's my best bet?
Have you tried KQueue events?
http://docs.python.org/library/select.html#kqueue-objects
kqueue is the FreeBSD / OS X equivalent of inotify (a file change notification service). I haven't used this, but I think it's what you want.
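A rough, untested-on-FreeBSD sketch of what that could look like with the select module. The function name wait_for_write is made up for the example, and select.kqueue only exists on BSD and macOS builds of Python, so the function raises RuntimeError elsewhere:

```python
import os
import select

def wait_for_write(path, timeout=None):
    """Block until path is written to or extended; return True, or False on timeout."""
    if not hasattr(select, "kqueue"):
        raise RuntimeError("select.kqueue is not available on this platform")
    fd = os.open(path, os.O_RDONLY)
    kq = select.kqueue()
    try:
        event = select.kevent(
            fd,
            filter=select.KQ_FILTER_VNODE,
            flags=select.KQ_EV_ADD | select.KQ_EV_CLEAR,
            fflags=select.KQ_NOTE_WRITE | select.KQ_NOTE_EXTEND,
        )
        # Register the event and wait for at most one notification.
        triggered = kq.control([event], 1, timeout)
        return bool(triggered)
    finally:
        kq.close()
        os.close(fd)
```

A long-running monitor would call this in a loop and read the new data each time it returns True, instead of waking up on a fixed schedule.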
I once had to build a sort of daemon process for a parser written in Python. I needed to watch a series of files and process them in Python, and it had to be a truly multi-OS solution (Windows and Linux in this case). I wrote a program that watches a list of files by checking their modification times. The program sleeps for a while and then checks the modification time of each watched file. If the modification time is newer than the one previously recorded, the file has changed and needs to be processed.
Something like this:
import os
import time
# absolute directory of this script, so the watcher works from any working directory
path = os.path.dirname(os.path.abspath(__file__))
print("Looking for files in", path, "...")
# get interesting files (stored as full paths)
files = [{"file": os.path.join(path, f)} for f in os.listdir(path)
         if os.path.isfile(os.path.join(path, f)) and os.path.splitext(f)[1].lower() == ".src"]
for f in files:
    f["output"] = os.path.splitext(f["file"])[0] + ".out"
    f["modtime"] = os.path.getmtime(f["file"]) - 10
    print(" watching:", f["file"])
while True:
    # sleep for a while
    time.sleep(0.5)
    # check if anything changed
    for f in files:
        # is the mod time of the file newer than the one registered?
        modtime = os.path.getmtime(f["file"])
        if modtime > f["modtime"]:
            # store new time and...
            f["modtime"] = modtime
            print(f["file"], "has changed...")
            # do your stuff here
It does not look like top notch code, but it works quite well.
There are other SO questions about this, but I don't know if they'll provide a direct answer to your question:
How to implement a pythonic equivalent of tail -F?
How do I watch a file for changes?
How can I "watch" a file for modification / change?
Hope this helps!