I am using HTCondor to generate some data (txt, png). When my program runs, it creates a directory named datasets next to the .sub file, where the generated datasets are stored. Unfortunately, Condor does not give this created data back to me when the job finishes. In other words, my goal is to get the created data into a "Datasets" subfolder next to the .sub file.
I tried:
1) Not putting the data under the datasets subfolder; I then got the files back as expected. However, this is not a clean solution, since I generate around 100 files, which end up mixed in with the .sub file and everything else.
2) I also tried to set this up in the .sub file, leading to this:
notification = Always
should_transfer_files = YES
RunAsOwner = True
When_To_Transfer_Output = ON_EXIT_OR_EVICT
getenv = True
transfer_input_files = main.py
transfer_output_files = Datasets
universe = vanilla
log = log/test-$(Cluster).log
error = log/test-$(Cluster)-$(Process).err
output = log/test-$(Cluster)-$(Process).log
executable = Simulation.bat
queue
This time I get an error that Datasets was not found. I have already checked the spelling.
3) Another option would be to pack everything into a zip, but since I have to run hundreds of jobs, I do not want to unpack all these files afterwards.
I hope somebody comes up with a good idea on how to solve this.
Just for the record here: HTCondor does not transfer directories created during the run, nor their contents, back at the end. The best way to get the content back is to write a wrapper script that runs your executable and then compresses the created directory at the root of the working directory. That archive will be transferred with all the other files. For example, create run.exe:
#!/bin/bash
# run the real job, then archive the directory it creates
./Simulation.bat
tar zcf Datasets.tar.gz Datasets
and in your condor submission script put:
executable = run.exe
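Assuming the wrapper above names the archive Datasets.tar.gz, you could also list it explicitly in the submit file, replacing the directory line that failed:
transfer_output_files = Datasets.tar.gz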
However, if you do not want to do this, and if HTCondor is using a common shared filesystem such as AFS, you can simply copy the whole directory out:
./Simulation.bat
cp -r Datasets <AFS location>
The other alternative is to define an initialdir as described at the end of: https://research.cs.wisc.edu/htcondor/manual/quickstart.html
But one must create the directory structure by hand.
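For illustration, a minimal sketch of that approach, with hypothetical per-job directories run0, run1, … that you would create by hand before submitting:
initialdir = run$(Process)
queue 10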
Also, look around page 65 of: https://indico.cern.ch/event/611296/contributions/2604376/attachments/1471164/2276521/TannenbaumT_UserTutorial.pdf
This document is, in general, a very useful one for beginners.
Related
I'm working on a very common set of commands used to analyze RNA-seq data. However, since this question is not specific to bioinformatics, I've chosen to post here instead of BioStars, etc.
Specifically, I am trimming Illumina Truseq adapters from paired end sequencing data. To do so, I use Trimmomatic 0.36.
I have two input files:
S6D10MajUnt1-1217_S12_R1_001.fastq.gz
S6D10MajUnt1-1217_S12_R2_001.fastq.gz
And the command generates five output files:
S6D10MajUnt1-1217_S12_R1_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R1_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12.trimlog
I'm trying to write a python or bash script to recursively loop over all the contents of a folder and perform the trim command with appropriate files and outputs.
#!/bin/bash
for DR in *.fastq.gz
do
    FL1=$(ls ~/home/path/to/files/${DR}*_R1_*.fastq.gz)
    FL2=$(ls ~/home/path/to/files/${DR}*_R2_*.fastq.gz)
    java -jar ~/data2/RJG_Seq/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 12 -phred33 \
        -trimlog ~/data2/RJG_Seq/trimming/sample_folder/$FL1.trimlog \
        ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL1 ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL2 \
        ~/data2/RJG_Seq/trimming/sample_folder/$FL1.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL1.unpair.fq.gz \
        ~/data2/RJG_Seq/trimming/sample_folder/$FL2.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.unpair.fq.gz \
        ILLUMINACLIP:/media/RJG_Seq/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
done
I believe there's something wrong with the way I am assigning and invoking FL1 and FL2, and ultimately I'm looking for help creating an executable command trim-my-reads.py or trim-my-reads.sh that could be modified to accept any arbitrarily named input R1.fastq.gz and R2.fastq.gz files.
You can write a simple Python script to loop over all the files in a folder.
Note: I have assumed that the output files will be generated in a folder named "example".
import glob

for file in glob.glob("*.fastq.gz"):
    # here you'll unzip the file to a folder, assumed to be "example"
    for output in glob.glob("example/*"):
        # here you can parse all the files inside the output folder
        print(output)
Each pair of samples has a matching string (SN = sample N). A solution to this question in bash could be:
#!/bin/bash
# apply the loop to samples 1-12
for SAMPLE in {1..12}
do
    # set input file 1 to "FL1", input file 2 to "FL2"
    FL1=$(ls ~/path/to/input/files/*_S${SAMPLE}_*_R1_*.gz)
    FL2=$(ls ~/path/to/input/files/*_S${SAMPLE}_*_R2_*.gz)
    # invoke java, send FL1 and FL2 to the appropriate output folders
    java -jar ~/path/to/trimming/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
        -threads 12 -phred33 -trimlog ~/path/to/output/folders/${FL1}.trimlog \
        ~/path/to/input/file1/${FL1} ~/path/to/input/file2/${FL2} \
        ~/path/to/paired/output/folder/${FL1}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL1}.unpair.fq.gz \
        ~/path/to/paired/output/folder/${FL2}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL2}.unpair.fq.gz \
        ILLUMINACLIP:/path/to/trimming/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
    # echo progress so each sample can be tracked
    echo "Sample ${SAMPLE} done"
done
This is an inelegant solution, because it depends on the naming format I'm using. A better method would be to grep each filename and assign the pairs to FL1 and FL2 accordingly, because this would generalize the method. Still, this is what worked for me, and I can easily control which samples are subjected to the for loop, as long as I always have the _S*_ format in the filename strings.
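For reference, here is a minimal, hypothetical Python sketch of that more general approach; it pairs each _R1_ file with its _R2_ mate by name (the jar path and output names are placeholders to adapt):

#!/usr/bin/env python3
# hypothetical sketch: pair R1/R2 mates by name, then run Trimmomatic on each pair
import glob
import subprocess

for r1 in sorted(glob.glob("*_R1_*.fastq.gz")):
    r2 = r1.replace("_R1_", "_R2_")   # the matching mate file
    stem = r1.split("_R1_")[0]        # shared sample prefix, e.g. S6D10MajUnt1-1217_S12
    subprocess.run([
        "java", "-jar", "trimmomatic-0.36.jar", "PE", "-threads", "12", "-phred33",
        "-trimlog", stem + ".trimlog", r1, r2,
        stem + "_R1.paired.fq.gz", stem + "_R1.unpaired.fq.gz",
        stem + "_R2.paired.fq.gz", stem + "_R2.unpaired.fq.gz",
        "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10",
        "LEADING:5", "TRAILING:5", "SLIDINGWINDOW:4:15", "MINLEN:28",
    ], check=True)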
I have a number of file objects which I would like to redeploy to a directory with a new structure, based on requirements stated by the user.
As an example, I might have these file objects:
1)root\name\date\type\filename
2)root\name\date\type\filename
3)root\name\date\type\filename
...that I want to save (or a copy of them) in a new structure like the one below, after the user has defined a need to split type->date->name:
1)root\type\date\name\filename
2)root\type\date\name\filename
3)root\type\date\name\filename
... or even losing levels such as:
1)root\type\filename
2)root\type\filename
3)root\type\filename
I can only come up with the option to go the long way round: taking the initial list and, through a process of filtering, simply deploying to the newly calculated folder structure using basic string operations.
I feel, though, that someone has probably done this before in a smart way, and potentially a library/module already exists to do this. Does anyone have any ideas?
Here is a solution using Python glob:
The current levels are: name, date, type and filename:
curr_levels = "name\\date\\type\\filename"
curr_levels = curr_levels.split("\\")
The user want other levels: type, date, name and filename:
user_levels = "type\\date\\name\\filename"
user_levels = user_levels.split("\\")
We can use glob.iglob to iterate the tree structure on 4 levels.
The glob pattern is something like: <src_dir>\*\*\*\* (but, we use a more generic way here).
The user structure can be defined with a simple string format.
For instance: {type}\{date}\{name}\{filename} on Windows.
We need to create the directory structure first and then copy (or move) the file.
import glob
import os
import shutil

# source_dir and target_dir are the roots of the existing and the new tree
pattern = os.path.join(source_dir, *("*" * len(curr_levels)))
fmt = os.sep.join(['{{{key}}}'.format(key=key) for key in user_levels])

for source_path in glob.iglob(pattern):
    source_relpath = os.path.relpath(source_path, source_dir)
    parts = source_relpath.split(os.sep)
    values = dict(zip(curr_levels, parts))
    target_relpath = fmt.format(**values)
    target_path = os.path.join(target_dir, target_relpath)
    parent_dir = os.path.dirname(target_path)
    if not os.path.exists(parent_dir):
        os.makedirs(parent_dir)
    shutil.copy2(source_path, target_path)
Note: if your source_dir is the same as target_dir (the root in your question), you need to replace glob.iglob with glob.glob in order to store the whole list of files in memory before processing. This is required to avoid glob.iglob browsing the directory tree while you are creating it.
If you are in a UNIX environment, the simplest way to achieve this is a shell script with the cp command.
For example, to copy all files from /root/name/date/type/filename to /root/date/filename, you just need:
cp /root/*/date/*/filename /root/date/filename
Or, if you want to move the files, use the mv command:
mv /root/*/date/*/filename /root/date/filename
You may run these commands via Python as well, using os.system():
import os
os.system("cp /root/*/date/*/filename /root/date/filename")
For details, check: Calling an external command in Python
Edit based on comment: for copying /root/name/date/type/filename into /root/date/name/type/filename, you just need:
cp /root/name/date/type/filename /root/date/name/type/filename
But make sure that the directory /root/date/name/type exists before doing it. To ensure it exists, create it if needed using mkdir with the -p option:
mkdir -p /root/date/name/type
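If you prefer to stay in Python, a small hedged equivalent of the mkdir -p plus cp combination (paths are the example ones from above):

import os
import shutil

src = "/root/name/date/type/filename"
dst_dir = "/root/date/name/type"
os.makedirs(dst_dir, exist_ok=True)          # like mkdir -p
shutil.copy2(src, os.path.join(dst_dir, os.path.basename(src)))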
The purpose of this program is to zip a directory or a folder as simply as possible, and write the generated .tar.gz to one of my USB flash drives (or any other location). Plans are to add a function that will also use GnuPG to encrypt the folder, and another that will allow the user to input a time in order to perform this task daily, weekly, monthly, etc. I also want the user to be able to choose the destination of the zipped folder. I just wanted to post this up now to see if it works on a first attempt and to get a bit of feedback.
My main question is why I lose the main folder upon extraction of the compressed files. For example, suppose I compress "Documents", which contains the two folders "Videos" and "Pictures" and the file "manual.txt". When I extract the archive, it does not dump "Documents" into the extraction point; it dumps "Videos", "Pictures", and "manual.txt". Which is fine and all, no data loss and everything is still intact, but it creates a bit of clutter, and I would like to keep the original directory.
I am also wondering why in the world this program takes so long to create the archive, and why in some cases the .tar.gz file is just as large as the original folder. This happens with video files; it does seem to compress text files well, and much more quickly.
Are video files just hard to compress? It takes about 5 minutes to process 2 GB of video files, and then they are the same as the original size. That seems kind of pointless.
Also, would it make sense to use regex to validate user input in this case? I could just use a couple of if statements instead, no? For example, the preferred input in this program is 'root', not '/root'. Couldn't I just have it cut the '/' off if the input starts with one?
I mainly want to see if this is the right, most efficient way of doing things. I'd rather not be given the answer in the usual Stack Overflow copy/paste way; let's get a discussion going.
So why is this program so slow when processing larger amounts of data? I expect a reduction in speed, but not by that much.
#!/usr/bin/env python3
'''
author: ryan st***
date: 12/5/2015
time: 18:55 Eastern time (GMT -5)
language: python 3.4
'''
# Import, import, import.
import os
import shutil
import time

# Backup (zip) files
def zipDir():
    try:
        # Get the folder to be zipped and the zip file destination from the user
        Dir = "~"
        str1 = input('Input directory to be zipped (e.g. Documents, Downloads, Desktop/programs): ')
        # an input example that works: "bin/mans"
        str2 = input('Zipped output directory (e.g. root, myBackups): ')
        # an output example that works: "bin2/test"
        zipName = input("What would you like to name your zipped folder? ")
        path1 = Dir, str1, "/"
        path2 = Dir, str2, "/"
        # Zip it up
        # print(zipName, ".tar.gz will be created from the folder ", path1[0]+path1[1]+path1[2],
        #       "and placed into the folder ", path2[0]+path2[1]+path2[2])
        zipDirTo = os.path.expanduser(os.path.join(path2[0], path2[1] + path2[2], zipName))
        zipDir = os.path.expanduser(os.path.join(path1[0], path1[1]))
        print('Directory "', zipDir, '" will be zipped and saved to the location: "', zipDirTo, '.tar.gz"')
        shutil.make_archive(zipDirTo, 'gztar', zipDir)
        print("file zipped")
    # In case of mistake
    except Exception:
        print("Something went wrong in compression.\n",
              "Ending task, please try again")
        quit()

# Execute the program
def main():
    print("It will be a fucking miracle if this succeeds.")
    zipDir()
    print("Success!!!!!!")
    time.sleep(2)
    quit()

# Wrap it all up
if __name__ == '__main__':
    main()
Video files are normally compressed already, and recompressing them doesn't help. For image and video files, use tar only.
My main question is why I lose the main folder upon extraction of the compressed files
Because you're not storing that folder's name in the archive. The paths you're using don't include Documents; they start with the names of the items inside Documents.
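A minimal sketch of the fix, assuming a layout like the one in the question: shutil.make_archive can keep the top-level folder if you archive base_dir relative to its parent root_dir.

import shutil

# archive the "Documents" folder itself, not just its contents;
# the paths here are assumptions matching the example in the question
shutil.make_archive("backup", "gztar", root_dir="/home/user", base_dir="Documents")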
Are video files just hard to compress?
Any file that is already compressed, such as most video and audio formats, will be hard to compress further, and it will take quite a bit of time to find that out if the file is large. You might consider detecting compressed files and storing them in the zip file without further compression using the ZIP_STORED constant.
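For instance, a hypothetical zipfile snippet (the file names are placeholders):

import zipfile

# compress text normally, but store the already-compressed video as-is
with zipfile.ZipFile("backup.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("manual.txt")                                    # text compresses well
    zf.write("movie.mp4", compress_type=zipfile.ZIP_STORED)   # skip recompression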
let[']s get a discussion going.
Stack Overflow's format is not really suited to discussions.
I'm using glob to feed file names to a loop like so:
inputcsvfiles = glob.iglob('NCCCSM*.csv')

for x in inputcsvfiles:
    csvfilename = x
    # do stuff here
The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.
Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959) is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?
PS: initially I was using glob.glob, but glob.iglob fares no better.
Edit:
An expansion of above for more context...
# typical input files look like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
inputcsvfiles = glob.iglob('NCCCSM*.csv')

# loop over individual input files
for x in inputcsvfiles:
    csvfile = x
    modelname = x[0:5]
    # ArcPy
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # do more stuff after
The script fails at the ArcPy line, where the csvfile variable gets passed into the command. The error reported is that it can't find the specified csv file (e.g., "NCCSM20110101.csv"), when in fact the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times, as I have above? Again, this works fine if the directory being glob'd only has 100 or so files, but if there are a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.
Try doing ls * in a shell on those 10,000 entries and the shell would fail too. How about walking the directory tree and yielding those files one by one for your purpose?
# credit - @dabeaz - generators tutorial
import os
import fnmatch

def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Example use
if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print(name)
One issue that arose was not with Python per se, but rather with ArcPy and/or the MS handling of CSV files (more the latter, I think). As the loop iterates, it creates a schema.ini file to which information on each CSV file processed in the loop gets added. Over time the schema.ini grows rather large, and I believe that is when the performance issues arise.
My solution, although perhaps inelegant, was to delete the schema.ini file during each loop iteration to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and BASH scripting in the end.
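For reference, a minimal sketch of that workaround, assuming the CSV files live in csv_dir:

import os

# hypothetical: delete the schema.ini that grows with every CSV the loop touches
schema = os.path.join(csv_dir, "schema.ini")
if os.path.exists(schema):
    os.remove(schema)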
If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.
There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
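On Unix you can also query that limit from inside Python; a small sketch (note the resource module is Unix-only):

import resource

# soft and hard limits on simultaneously open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit:", soft, hard)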
I'm working on a python program that will automatically combine sets of files based on their names.
Being a newbie, I wasn't quite sure how to go about it, so I decided to just brute force it with the win32api.
So I'm attempting to do everything with virtual keys. I run the script, it selects the top file (after arranging them by name), then sends a right-click command, selects 'Combine as Adobe PDF', and then pushes Enter. This launches the Acrobat combine window, where I send another Enter command. Here's where I hit the problem.
The folder where I'm converting these things loses focus, and I'm unsure how to get it back. Sending alt+tab commands seems somewhat unreliable; it sometimes switches to the wrong thing.
A much bigger issue for me: different combinations of files take different times to combine. Though I haven't gotten that far in my code, my plan was to set some arbitrarily long time.sleep() call before finally sending the last Enter command to confirm the file name and complete the combination process. Is there a way to monitor another program's progress? Is there a way to have Python not execute any more code until something else has finished?
I would suggest using a command-line tool like pdftk (http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/); it does exactly what you want, it's cross-platform, it's free, and it's a small download.
You can easily call it from Python with (for example) subprocess.Popen.
Edit: sample code as below:
import subprocess
import os

def combine_pdfs(infiles, outfile, basedir=''):
    """
    Accept a list of pdf filenames,
    merge the files,
    save the result as outfile

    @param infiles: list of string, names of PDF files to combine
    @param outfile: string, name of merged PDF file to create
    @param basedir: string, base directory for PDFs (if filenames are not absolute)
    """
    # From the pdftk documentation:
    # Merge Two or More PDFs into a New Document:
    #   pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
    if basedir:
        infiles = [os.path.join(basedir, i) for i in infiles]
        outfile = os.path.join(basedir, outfile)
    pdftk = [r'C:\Program Files (x86)\Pdftk\pdftk.exe']  # or wherever you installed it
    op = ['cat']
    outcmd = ['output']
    args = pdftk + infiles + op + outcmd + [outfile]
    res = subprocess.call(args)

combine_pdfs(
    ['p1.pdf', 'p2.pdf'],
    'p_total.pdf',
    'C:\\Users\\Me\\Downloads'
)