Limitation to Python's glob?

I'm using glob to feed file names to a loop like so:
import glob

inputcsvfiles = glob.iglob('NCCCSM*.csv')
for x in inputcsvfiles:
    csvfilename = x
    # do stuff here
The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.
Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959) is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?
PS: initially I was using glob.glob, but glob.iglob fares no better.
Edit:
An expansion of the above, for more context...
# typical input file looks like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
inputcsvfiles = glob.iglob('NCCCSM*.csv')

# loop over individual input files
for x in inputcsvfiles:
    csvfile = x
    modelname = x[0:5]
    # ArcPy
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # do more stuff after
The script fails at the ArcPy line, where the "csvfile" variable gets passed into the command. The error reported is that it can't find a specified csv file (e.g., "NCCSM20110101.csv"), when in fact, the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times as I have above? Again, this will work fine if the directory being glob'd only has 100 or so files, but if there's a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.

Try running ls * in a shell on those 10,000 entries and the shell would fail too. How about walking the directory and yielding those files one by one instead?
# credit - dabeaz - generators tutorial
import os
import fnmatch

def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Example use
if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print name

One issue that arose was not with Python per se, but rather with ArcPy and/or the Microsoft handling of CSV files (more the latter, I think). As the loop iterates, it creates a schema.ini file to which information on each CSV file processed in the loop gets appended. Over time, the schema.ini gets rather large, and I believe that is when the performance issues arise.
My solution, although perhaps inelegant, was to delete the schema.ini file during each loop iteration to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and BASH scripting in the end.
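For reference, the workaround looked roughly like this; a minimal sketch, where csv_folder is a placeholder for the directory holding the CSVs (which is where the schema.ini shows up):
import os

schema_ini = os.path.join(csv_folder, "schema.ini")  # csv_folder is hypothetical
for x in inputcsvfiles:
    csvfile = x
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # ... do more stuff ...
    # delete the growing schema.ini so the next iteration starts clean
    if os.path.exists(schema_ini):
        os.remove(schema_ini)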

If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.
There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
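From inside Python the same limit can be checked with the standard resource module (Unix only); a quick sketch:
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print "open-file limit (soft/hard):", soft, hard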

Related

HTCondor output files: obtain created directory

I am using HTCondor to generate some data (txt, png). When my program runs, it creates a directory next to the .sub file, named datasets, where the datasets are stored. Unfortunately, Condor does not give me back this created data when it finishes. In other words, my goal is to get the created data in a "Datasets" subfolder next to the .sub file.
I tried:
1) Not putting the data under the datasets subfolder, and I got the files back as expected. However, this is not a clean solution, since I generate around 100 files which then end up mixed in with the .sub file and everything else.
2) I also tried to set this up in the .sub file, leading to this:
notification = Always
should_transfer_files = YES
RunAsOwner = True
When_To_Transfer_Output = ON_EXIT_OR_EVICT
getenv = True
transfer_input_files = main.py
transfer_output_files = Datasets
universe = vanilla
log = log/test-$(Cluster).log
error = log/test-$(Cluster)-$(Process).err
output = log/test-$(Cluster)-$(Process).log
executable = Simulation.bat
queue
This time I get the error that Datasets was not found. The spelling was already checked.
3) Another option would be to pack everything into a zip, but since I have to run hundreds of jobs, I do not want to unpack all these files afterwards.
I hope somebody comes up with a good idea on how to solve this.
Just for the record: HTCondor does not transfer created directories or their contents at the end of the run. The best way to get the content back is to write a wrapper script that runs your executable and then compresses the created directory at the root of the working directory. That archive will be transferred along with all the other files. For example, create run.exe:
./Simulation.bat
tar zcf Datasets.tar.gz Datasets
and in your condor submission script put:
executable = run.exe
However, if you do not want to do this, and if HTCondor is using a common shared filesystem such as AFS, you can simply copy the whole directory out:
./Simulation.bat
cp -r Datasets <AFS location>
The other alternative is to define an initialdir as described at the end of: https://research.cs.wisc.edu/htcondor/manual/quickstart.html
But one must create the directory structure by hand.
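A rough sketch of what that could look like in the submit description (the job_$(Process) directory names are hypothetical and, as noted, must be created by hand before submitting):
initialdir = job_$(Process)
queue 100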
Also, look around page 65 of: https://indico.cern.ch/event/611296/contributions/2604376/attachments/1471164/2276521/TannenbaumT_UserTutorial.pdf
This document is, in general, a very useful one for beginners.

Windows disk usage issues with python

I am executing the python code that follows.
I am running it on a folder ("articles") which has a couple hundred subfolders and 240,226 files in all.
I am timing the execution. At first the times were pretty stable, but they went non-linear after 100,000 files. Now the times (I am timing at 10,000-file intervals) can go non-linear after 30,000 or so (or not).
I have the Task Manager open and can correlate the slow-downs with 99% disk usage by python.exe. I have done gc.collect(), dels, etc., and turned off Windows indexing. I have restarted Windows and emptied the trash (I have a few hundred GB free). Nothing helps; the disk usage seems to be getting more erratic if anything.
Sorry for the long post - Thanks for the help
import os
import glob

dirs = []
nxml_files = []

def get_filenames():
    for (dirpath, dirnames, filenames) in os.walk("articles/"):
        dirs.extend(dirnames)
    for dir in dirs:
        path = "articles" + "\\" + dir
        nxml_files.extend(glob.glob(path + "/*.nxml"))
    return nxml_files

def extract_text_from_files(nxml_files):
    for nxml_file in nxml_files:
        fast_parse(nxml_file)

def fast_parse(infile):
    file = open(infile, "r")
    filetext = file.read()
    tag_breaks = filetext.split('><')
    paragraphs = [tag_break.strip('p>').strip('</') for tag_break in tag_breaks if tag_break.startswith('p>')]

def run_files():
    nxml_files = get_filenames()
    extract_text_from_files(nxml_files)

if __name__ == "__main__":
    run_files()
There are some things that could be optimized.
First, if you open files, close them as well. A with open(...) as name: block will do that for you. By the way, in Python 2 file is a bad choice for a variable name; it is the name of a built-in.
You can remove one disk read by doing string comparisons on the filenames os.walk already gives you, instead of the extra glob.
And last but not least: os.walk yields its results lazily, so don't buffer them into a list; process everything inside one loop. This will save a lot of memory.
That is what I can advise from the code. For more details on what is causing the I/O you should use profiling. See https://docs.python.org/2/library/profile.html for details.
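Putting those three points together, a minimal sketch of the reshaped loop (just the shape, not a drop-in replacement for the code above):
import os
import fnmatch

def fast_parse(infile):
    # the with-block closes the file even if parsing fails
    with open(infile, "r") as f:
        filetext = f.read()
    tag_breaks = filetext.split('><')
    return [t.strip('p>').strip('</') for t in tag_breaks if t.startswith('p>')]

def run_files():
    # consume os.walk lazily: no directory list, no glob, no big file list in memory
    for dirpath, dirnames, filenames in os.walk("articles/"):
        for name in fnmatch.filter(filenames, "*.nxml"):
            fast_parse(os.path.join(dirpath, name))

if __name__ == "__main__":
    run_files()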

How to loop through the list of .tar.gz files using linux command in python

Using python 2.7
I have a list of *.tar.gz files on a Linux box. Using Python, I want to loop through the files and extract them in a different location, under their respective folders.
For example: if my file name is ~/TargetData/zip/1440198002317590001.tar.gz
then I want to untar and ungzip this file in a different location under its
respective folder name i.e. ~/TargetData/unzip/1440198002317590001.
I have written some code but I am not able to loop through the files. On the command line I am able to untar using the $ tar -czf 1440198002317590001.tar.gz 1440198002317590001 command. But I want to be able to loop through the .tar.gz files. The code is mentioned below. Here, I'm not able to loop over just the files, or even print only the files. Can you please help?
import os

inF = []
inF = str(os.system('ls ~/TargetData/zip/*.tar.gz'))
#print(inF)
if inF is not None:
    for files in inF[:-1]:
        print files
        """
        os.system('tar -czf files /unzip/files[:-7]')
        # This is what I am expecting here: files = "1440198002317590001.tar.gz" and files[:-7] = "1440198002317590001"
        """
Have you ever worked on this type of use case? Your help is greatly appreciated!! Thank you!
I think you misunderstood the meaning of os.system(). It will run the command, but its return value is not what you expected: it returns 0 when the command finishes successfully, so you cannot assign the command's output to a variable that way. You may consider the subprocess module instead (see its docs). However, I DO NOT recommend that way of listing files (it gives you the output as a single string rather than a list; see the docs for the details).
The best way, I think, would be the glob module (see its docs). With glob.glob(pattern) you can put all files matching the pattern into a list, and then you can loop over it easily.
Of course, if you are familiar with the os module, you can also use os.listdir(), os.path.join(), or even os.path.expanduser() to do this. (Unlike glob, os.listdir() only puts the filenames, without the full path, into a list, so you need to reconstruct the file paths yourself.)
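A small sketch of that os-module variant, reconstructing the full paths by hand:
import os

zip_dir = os.path.expanduser('~/TargetData/zip')
inF = [os.path.join(zip_dir, name)
       for name in os.listdir(zip_dir)
       if name.endswith('.tar.gz')]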
By the way, for your purpose here there is no need to declare an empty list first (i.e. inF = []).
For the extraction part, you can do it with os.system, but again I recommend the subprocess module over os.system; you will find the reasons in the subprocess documentation.
DO NOT look at the following code until you really cannot solve this by yourself.
import os
import glob
import subprocess

# glob does not expand "~", so expand it first
inF = glob.glob(os.path.expanduser('~/TargetData/zip/*.tar.gz'))
if inF:
    for files in inF:
        unzip_name = files.replace('zip', 'unzip')[:-7]
        # make sure the target directory exists, otherwise create it
        if not os.path.isdir(unzip_name):
            os.makedirs(unzip_name)
        # subprocess.call() instead of os.system; each argument is a separate list item
        subprocess.call(['tar', '-xzf', files, '-C', unzip_name])

Reading updated files on the fly in Python

I'm writing two Python scripts that both parse files. One is a standard unix logfile and the other is a binary file. I'm trying to figure out the best way to monitor these so I can read data as soon as they're updated. Most of the solutions I've found thus far are linux specific, but I need this to work in FreeBSD.
Obviously one approach would be to just run my script every X amount of time, but this seems gross and inefficient. If I want my Python app running continuously in the background monitoring a file and acting on it once it's changed/updated, what's my best bet?
Have you tried KQueue events?
http://docs.python.org/library/select.html#kqueue-objects
kqueue is the FreeBSD / OS X equivalent of inotify (the file change notification service). I haven't used this, but I think it's what you want.
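A minimal sketch of what that could look like, assuming a FreeBSD box and a placeholder log file name; the KQ_NOTE_* flags pick which filesystem events you care about:
import select

f = open("mylog.log", "r")          # placeholder path
kq = select.kqueue()
ev = select.kevent(f.fileno(),
                   filter=select.KQ_FILTER_VNODE,
                   flags=select.KQ_EV_ADD | select.KQ_EV_CLEAR,
                   fflags=select.KQ_NOTE_WRITE | select.KQ_NOTE_EXTEND)
while True:
    # blocks until the file is written to or grows
    for event in kq.control([ev], 1, None):
        print "file changed, read the new data here"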
I once made a sort of daemon process for a parser built in Python. I needed to watch a series of files and process them in Python, and it had to be a truly multi-OS solution (Windows and Linux in this case). I wrote a program that watches over a list of files by checking their modification times. The program sleeps for a while and then checks the modification times of the files being watched. If a modification time is newer than the one previously registered, then the file has changed and stuff has to be done with it.
Something like this:
import os
import time

path = os.path.dirname(__file__)
print "Looking for files in", path, "..."

# get interesting files
files = [{"file": f} for f in os.listdir(path)
         if os.path.isfile(f) and os.path.splitext(f)[1].lower() == ".src"]
for f in files:
    f["output"] = os.path.splitext(f["file"])[0] + ".out"
    f["modtime"] = os.path.getmtime(f["file"]) - 10
    print " watching:", f["file"]

while True:
    # sleep for a while
    time.sleep(0.5)
    # check if anything changed
    for f in files:
        # is the mod time of the file newer than the one registered?
        if os.path.getmtime(f["file"]) > f["modtime"]:
            # store the new time and...
            f["modtime"] = os.path.getmtime(f["file"])
            print f["file"], "has changed..."
            # do your stuff here
It does not look like top notch code, but it works quite well.
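If you only need the data appended since the last check (as with a growing log file), one common addition, not part of the code above, is to remember the last read offset in the same per-file dict (the "offset" key here is hypothetical):
def read_new_data(f):
    # f is one of the dicts from the watcher above; "offset" starts out missing (treated as 0)
    with open(f["file"], "r") as fh:
        fh.seek(f.get("offset", 0))
        new_data = fh.read()
        f["offset"] = fh.tell()
    return new_data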
There are other SO questions about this, but I don't know if they'll provide a direct answer to your question:
How to implement a pythonic equivalent of tail -F?
How do I watch a file for changes?
How can I "watch" a file for modification / change?
Hope this helps!

Simple way to storing data from multiple processes

I have a Python script that does something along the line of:
def MyScript(input_filename1, input_filename2):
    # ... some very intensive computation ...
    return val
i.e., for every pair of inputs, I calculate some float value. Note that val is a simple double/float.
Since this computation is very intensive, I will be running them across different processes (might be on the same computer, might be on multiple computers).
What I did before was to output this value to a text file named input1_input2.txt. I then end up with 1,000,000 files that I need to reduce into one file. This process is not very fast, since the OS doesn't handle folders with that many files well.
How do I efficiently get all these data into one single computer? Perhaps having MongoDB running on a computer and all the processes send the data along?
I want something easy. I know that I can do this in MPI but I think it is overkill for such a simple task.
If the inputs have a natural order to them, and each worker can find out "which" input it's working on, you can get away with one file per machine. Since Python floats are 8 bytes long, each worker would write the result to its own 8-byte slot in the file.
import struct

RESULT_FORMAT = 'd'  # Double-precision float.
RESULT_SIZE = struct.calcsize(RESULT_FORMAT)
RESULT_FILE = '/tmp/results'

def worker(position, input_filename1, input_filename2):
    val = MyScript(input_filename1, input_filename2)
    with open(RESULT_FILE, 'rb+') as f:
        f.seek(RESULT_SIZE * position)
        f.write(struct.pack(RESULT_FORMAT, val))
Compared to writing a bunch of small files, this approach should also be a lot less I/O intensive, since many workers will be writing to the same pages in the OS cache.
(Note that on Windows, you may need some additional setup to allow sharing the file between processes.)
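To round this out, a hedged sketch of the other two halves: pre-sizing the result file once before the workers start, and unpacking every slot afterwards (NUM_RESULTS is a hypothetical job count; the constants are the ones defined above):
import struct

NUM_RESULTS = 1000000

# run once, before any worker starts
with open(RESULT_FILE, 'wb') as f:
    f.truncate(RESULT_SIZE * NUM_RESULTS)

# run once, after all workers have finished
with open(RESULT_FILE, 'rb') as f:
    data = f.read()
results = struct.unpack('%d%s' % (len(data) // RESULT_SIZE, RESULT_FORMAT), data)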
You can use Python's parallel processing support.
http://wiki.python.org/moin/ParallelProcessing
In particular, I would mention NetWorkSpaces.
http://www.drdobbs.com/web-development/200001971
You can generate a folder structure that contains generated subfolders, which themselves contain generated subfolders.
For example, you have a main folder that contains 256 subfolders, and each subfolder contains 256 subfolders. Three levels deep will be enough. You can use substrings of GUIDs to generate unique folder names.
So the GUID AB67E4534678E4E53436E becomes folder AB, which contains subfolder 67, which in turn contains folder E4534678E4E53436E.
Using two 2-character substrings makes it possible to generate 256 * 256 folders, more than enough to store 1 million files.
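A small sketch of turning a GUID into such a nested path (the "output" root folder name is just an example):
import os

guid = "AB67E4534678E4E53436E"
subdir = os.path.join("output", guid[:2], guid[2:4], guid[4:])
if not os.path.isdir(subdir):
    os.makedirs(subdir)  # creates the intermediate levels as needed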
You could run one program that collects the outputs, for example over XML-RPC.
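As a rough sketch of that idea with the Python 2 standard library (host, port, and the submit signature are made up for illustration):
# collector process
from SimpleXMLRPCServer import SimpleXMLRPCServer

results = {}

def submit(job_id, val):
    results[job_id] = val
    return True

server = SimpleXMLRPCServer(("0.0.0.0", 8000))
server.register_function(submit)
server.serve_forever()
and on the worker side:
import xmlrpclib

proxy = xmlrpclib.ServerProxy("http://collector-host:8000")
proxy.submit(position, val)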
