Suppose one has the following script:
import csv
import re
import sys

# This script is designed to take a list of probe files and generate a
# corresponding swarm file for blasting them against some user-defined database.

def __main__():
    # args
    infile = sys.argv[1]   # list of sequences to run
    outfile = sys.argv[2]  # location of resulting swarm file
    outdir = sys.argv[3]   # location to store results from blast run
    db = sys.argv[4]       # database to query against

    with open(infile) as fi:
        data = [x[0].strip('\n') for x in list(csv.reader(fi))]

    cmd = open(outfile, 'w')
    blast = 'module load blast/2.2.26; blastall -v 5 -b 5 -a 4 -p blastn '
    db = ' -d ' + db

    def f(x):
        input = ' -i ' + str(x)
        out = re.search('(?<=./)([^\/]*)(?=\.)', x).group(0)
        out = ' -o ' + outdir + out + '.out'
        cmd.write(blast + db + input + out + '\n')

    map(f, data)

__main__()
If I run it with:
python blast-probes.py /data/cornishjp/array-annotations/agilent_4x44k_human/probe-seq-fasta-list.csv ./human.cmd ./ x
An example from human.cmd would be:
module load blast/2.2.26; blastall -v 5 -b 5 -a 4 -p blastn -d x -i /data/cornishjp/array-annotations/agilent_4x44k_human/probe-seqs/A_33_P3344603.fas -o ./A_33_P3344603.out
If I run it with:
python blast-probes.py /data/cornishjp/array-annotations/agilent_4x44k_mouse/probe-seq-fasta-list.csv ./mouse.cmd ./ x
An example from mouse.cmd would be:
module load blast/2.2.26; blastall -v 5 -b 5 -a 4 -p blastn -d x -i /data/cornishp/array-annotations/agilent_4x44k_mouse/probe-seqs/A_51_P100327.fas -o ./A_51_P100327.out
The difference: when the ending of agilent_4x44k_ is human, the directory is written to the file correctly as cornishjp. When the ending is mouse, the directory is written incorrectly as cornishp; the j is left out. I've swapped everything around (saving human as mouse.cmd, and so on) and I cannot for the life of me figure it out.
The only thing that comes to mind is that when I generate the arguments for the python script, I use tab to autocomplete (Linux). Could this be the problem? The script is correctly reading the input file, since otherwise it would fail.
Related
I have a Python script which counts the words for a given file and saves the output to a "result.txt" file after execution. I want my Docker container to do this as the container starts and display the output to the console. Below are my Dockerfile and Python file.
FROM python:3
RUN mkdir /home/data
RUN mkdir /home/output
RUN touch /home/output/result.txt
WORKDIR /home/code
COPY word_counter.py ./
CMD ["python", "word_counter.py"]
ENTRYPOINT cat ../output/result.txt
import glob
import os
from collections import OrderedDict
import socket
from pathlib import Path

dir_path = os.path.dirname(os.path.realpath(__file__))
# print(type(dir_path))
parent_path = Path(dir_path).parent
data_path = str(parent_path) + "/data"
# print(data_path)
os.chdir(data_path)
myFiles = glob.glob('*.txt')

output = open("../output/result.txt", "w")
output.write("files in home/data are : ")
output.write('\n')
for x in myFiles:
    output.write(x)
    output.write('\n')
output.close()

total_words = 0
for x in myFiles:
    file = open(x, "r")
    data = file.read()
    words = data.split()
    total_words = total_words + len(words)
    file.close()

output = open("../output/result.txt", "a")
output.write("Total number of words in both the files : " + str(total_words))
output.write('\n')
output.close()

frequency = {}
for x in myFiles:
    if x == "IF.txt":
        curr_file = x
        document_text = open(curr_file, 'r')
        text_string = document_text.read()
        words = text_string.split()
        for word in words:
            count = frequency.get(word, 0)
            frequency[word] = count + 1

frequency_list_desc_order = sorted(frequency, key=frequency.get, reverse=True)

output = open("../output/result.txt", "a")
output.write("Top 3 words in IF.txt are :")
output.write('\n')
ip_addr = socket.gethostbyname(socket.gethostname())
for word in frequency_list_desc_order[:3]:
    line = word + " : " + str(frequency[word])
    output.write(line)
    output.write('\n')
output.write("ip address of the machine : " + ip_addr + "\n")
output.close()
I am mapping a local directory, which has two text files IF.txt and Limerick1.txt, from the host machine to the directory "/home/data" inside the container, and the Python code inside the container reads the files and saves the output to result.txt in "/home/output" inside the container.
I want my container to print the output in "result.txt" to the console when I start the container using the docker run command.
Issue: docker does not execute the following statement when starting a container using docker run.
CMD ["python", "word_counter.py"]
command to run the container:
docker run -it -v /Users/xyz/Desktop/project/docker:/home/data proj2docker bash
But when I run the same command "python word_counter.py" from within the container it executes perfectly fine.
Can someone help me with this?
You have an ENTRYPOINT in your Dockerfile, written in shell form. Docker wraps a shell-form entrypoint in /bin/sh -c and effectively ignores the CMD, so the command that actually runs when the container starts is roughly
/bin/sh -c 'cat ../output/result.txt'
and python is never executed. This is likely not what you want. I suggest removing that entrypoint, or fixing it according to your needs.
If you want to print that file and still execute that command, you can do something like the below.
CMD ["python", "word_counter.py"]
ENTRYPOINT ["/bin/sh", "-c", "cat ../output/result.txt; exec \"$0\" \"$@\""]
It will run some command(s) as the entrypoint, in this case printing the contents of that file, and then execute the CMD, which /bin/sh -c receives as its positional arguments: the first CMD word becomes $0 and the remaining words become $@, so exec "$0" "$@" re-assembles and runs it. The benefit of using exec here is that python replaces the shell and runs as process ID 1, which is useful when you want to send signals into the container to the python process, for example with docker stop or kill.
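As a side note, here is a minimal sketch (a hypothetical addition, not part of the original word_counter.py) of how the python process can react to such a signal once it runs as PID 1:

import signal
import sys

# When python runs as PID 1 via exec, `docker stop` delivers SIGTERM directly
# to it; PID 1 gets no default signal behaviour, so without a handler the
# container is only killed after the stop timeout. This handler exits cleanly.
def handle_sigterm(signum, frame):
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)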
Lastly, when you start the container with the command you show
docker run -it -v /Users/xyz/Desktop/project/docker:/home/data proj2docker bash
You are overriding the CMD from the Dockerfile, so in that case it is expected that it doesn't run python, even if your entrypoint didn't have the issue mentioned above.
If you want to always run the python program, then you need to make it part of the entrypoint. The problem is that the entrypoint would then run to completion before your command, in this case bash.
You could run the python program in the background, if that's what you want. Note that there is no default CMD here, but the exec "$0" "$@" still lets you run an arbitrary command such as bash while python is running in the background.
ENTRYPOINT ["/bin/sh", "-c", "cat ../output/result.txt; python word_counter.py & exec \"$0\" \"$@\""]
If you do a lot of work in the entrypoint, it is probably cleaner to move it into a dedicated script and run that script as the entrypoint; with a dedicated script the CMD arrives as the script's arguments, so you can still call exec "$@" at the end of it.
According to your comment, you want to run python first and then cat on the file. You could drop the entrypoint and do it just with the command.
CMD ["/bin/sh", "-c", "python word_counter.py && cat ../output/result.txt"]
I'm having an issue running a simple Python script that reads a helm command from a .sh script and runs it.
When I run the command directly in the terminal, it runs fine:
helm list | grep prod- | cut -f5
# OUTPUT: prod-L2.0.3.258
But when I run python test.py (see below for whole source code of test.py), I get an error as if the command I'm running is helm list -f5 and not helm list | grep prod- | cut -f5:
user#node1:$ python test.py
# OUTPUT:
# Opening file 'helm_chart_version.sh' for reading...
# Running command 'helm list | grep prod- | cut -f5'...
# Error: unknown shorthand flag: 'f' in -f5
The test.py script:
import subprocess

# Open file for reading
file = "helm_chart_version.sh"
print("Opening file '" + file + "' for reading...")
bashCommand = ""
with open(file) as fh:
    next(fh)
    bashCommand = next(fh)

print("Running command '" + bashCommand + "'...")
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
if error is None:
    print output
else:
    print error
Contents of helm_chart_version.sh:
cat helm_chart_version.sh
# OUTPUT:
## !/bin/bash
## helm list | grep prod- | cut -f5
Try to avoid running complex shell pipelines from higher-level languages. Given the command you show, you can run helm list as a subprocess, and then do the post-processing on it in Python.
process = subprocess.run(["helm", "list"], capture_output=True, text=True, check=True)
for line in process.stdout.splitlines():
    if 'prod-' not in line:
        continue
    words = line.split()
    print(words[4])
The actual Python script you show doesn't seem to be semantically different from just running the shell script directly. You can use the sh -x option, or the shell's set -x command, to make it print out each line as it executes.
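For example, a minimal sketch of that idea (assuming helm_chart_version.sh sits in the current directory):

import subprocess

# Run the existing shell script with tracing enabled, so each command is
# echoed (prefixed with +) before it executes.
subprocess.run(["sh", "-x", "helm_chart_version.sh"], check=True)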
I have a bash script that gets called at the end of a Python script. If I call the bash script from a simple Python script (test.py below) it works fine; however, when it is called from my actual script (long.py) it fails. long.py runs, calls rename.sh at the end, and passes it a Linux directory path, source_dir. rename.sh renames a file in that path. Here's a relevant excerpt of that script:
long.py
PATH = '/var/bigbluebutton/published/presentation/'
LOGS = '/var/log/bigbluebutton/download/'
source_dir = PATH + meetingId + "/"
...

def main():
    ...
    try:
        create_slideshow(dictionary, length, slideshow, bbb_version)
        ffmpeg.trim_audio_start(dictionary, length, audio, audio_trimmed)
        ffmpeg.mux_slideshow_audio(slideshow, audio_trimmed, result)
        serve_webcams()
        # zipdir('./download/')
        copy_mp4(result, source_dir + meetingId + '.mp4')
    finally:
        print >> sys.stderr, source_dir
        # PROBLEM LINE
        subprocess.check_call(["/scripts/./rename.sh", str(source_dir)])
        print >> sys.stderr, "Cleaning up temp files..."
        cleanup()
        print >> sys.stderr, "Done"

if __name__ == "__main__":
    main()
Here's the problem:
long.py uses the following line to call rename.sh:
subprocess.check_call(["/scripts/./rename.sh", str(source_dir)])
it gives the error:
subprocess.CalledProcessError: Command '['/scripts/./rename.sh', '/var/bigbluebutton/published/presentation/5b64bdbe09fdefcc3004c987f22f163ca846f1ea-1574708322765/']' returned non-zero exit status 1
The script otherwise works perfectly.
test.py, a shortened version of long.py, contains only the following two lines:
test.py
source_dir = '/var/bigbluebutton/published/presentation/5b64bdbe09fdefcc3004c987f22f163ca846f1ea-1574708322765/'
subprocess.check_call(["/scripts/./rename.sh", str(source_dir)])
It does not encounter the error when run using python test.py.
Here's the contents of rename.sh:
rename.sh
#!/bin/bash
i=$1
a=$(grep '<name>' $i/metadata.xml | sed -e 's/<name>\(.*\)<\/name>/\1/' | tr -d ' ')
b=$(grep '<meetingName>' $i/metadata.xml | sed -e 's/<meetingName>\(.*\)<\/meetingName>/\1/' | tr -d ' ')
c=$(ls -alF $i/*.mp4 | awk '{ gsub(":", "_"); print $6"-"$7"-"$8 }')
d=$(echo $b)_$(echo $c).mp4
cp $i/*.mp4 /root/mp4s/$d
test.py and long.py are in the same location.
I'm not executing long.py manually; it gets executed by another program.
print >> sys.stderr, source_dir
confirms that exactly the same value I define explicitly in test.py is being passed by long.py to rename.sh.
I am new to Snakemake, and I am trying to run this code but I get an error.
I have my input directories structured like this:
Library:
  - MMETSP1:
      SRR1_1.fastq.gz
      SRR1_2.fastq.gz
  - MMETSP2:
      SRR2_1.fastq.gz
      SRR2_2.fastq.gz
What I want to do is run the rule once per directory, i.e. twice in total. For that I have used the expand function in rule all, and Snakemake counts two jobs, which is fine for me. But my problem is that I cannot retrieve the fastq files inside my directories. For that I have used a wildcard pattern in the shell command, but it is not working.
Can someone help me please?
Thank you in advance!
#!/usr/bin/python
import os
import glob
import sys

SALMON_BY_LIBRARY_DIR = OUT_DIR + "salmon_by_library_out"
salmon = config["software"]["salmon"]

(LIBRARY, FASTQ, SENS) = glob_wildcards(LIBRARY_DIR + "{mmetsp}/{reads}_{type}.fastq.gz")

rule all:
    input:
        salmon_by_library_out = expand(SALMON_BY_LIBRARY_DIR + "/" + "{mmetsp}", zip, mmetsp=LIBRARY),

rule salmon_by_library:
    input:
        transcript = TRINITY_DIR + "/Trinity.fasta",
        fastq = LIBRARY_DIR + "{mmetsp}",
    output:
        salmon_out = directory(SALMON_BY_LIBRARY_DIR + "/" + "{mmetsp}"),
    log:
        OUT_DIR + "{mmetsp}/salmon.log"
    threads:
        config["threads"]["salmon"]
    params:
        trimmomatic_dir = directory(TRIMMOMATIC_DIR)
    run:
        shell(""" mkdir -p {output.salmon_out}/index """)
        shell("""
        {salmon} index \
            -t {input.transcript} \
            -i {output.salmon_out}/index \
            --type quasi \
            -k 31 \
            -p {threads} > {log} &&
        {salmon} quant \
            -i {output.salmon_out}/index \
            -l A \
            -1 {input.fastq}/*_1.fastq.gz \
            -2 {input.fastq}/*_2.fastq.gz \
            -o {output.salmon_out} \
            -p {threads} > {log}
        """)
You need to use an input function to retrieve the paths to the fastq files. Did you do the official tutorial (http://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html)? It covers exactly this use case. For real-world best practices I also suggest having a look at https://github.com/snakemake-workflows/docs.
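As a rough sketch (reusing LIBRARY_DIR and the mmetsp wildcard from the question, so treat every name as a placeholder), an input function could resolve the paired fastq files for each library and hand them to the rule explicitly:

import glob

# Hypothetical input function: given the wildcards of one job (here the
# mmetsp directory name), return the paired fastq files for that library.
def get_fastqs(wildcards):
    return {
        "r1": sorted(glob.glob(LIBRARY_DIR + wildcards.mmetsp + "/*_1.fastq.gz")),
        "r2": sorted(glob.glob(LIBRARY_DIR + wildcards.mmetsp + "/*_2.fastq.gz")),
    }

The rule then references the function instead of a fixed path, for example input: unpack(get_fastqs), transcript = TRINITY_DIR + "/Trinity.fasta", and the shell command can use {input.r1} and {input.r2} directly.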
I'm trying to prevent uploads to S3 when any earlier command in the pipeline fails; unfortunately, neither of these two methods works as expected:
Shell pipeline
for database in sorted(databases):
    cmd = "bash -o pipefail -o errexit -c 'mysqldump -B {database} | gpg -e -r {GPGRCPT} | gof3r put -b {S3_BUCKET} -k {database}.sql.e'".format(database=database, GPGRCPT=GPGRCPT, S3_BUCKET=S3_BUCKET)
    try:
        subprocess.check_call(cmd, shell=True, executable="/bin/bash")
    except subprocess.CalledProcessError as e:
        print e
Popen with PIPEs
for database in sorted(databases):
    try:
        cmd_mysqldump = "mysqldump {database}".format(database=database)
        p_mysqldump = subprocess.Popen(shlex.split(cmd_mysqldump), stdout=subprocess.PIPE)

        cmd_gpg = "gpg -a -e -r {GPGRCPT}".format(GPGRCPT=GPGRCPT)
        p_gpg = subprocess.Popen(shlex.split(cmd_gpg), stdin=p_mysqldump.stdout, stdout=subprocess.PIPE)
        p_mysqldump.stdout.close()

        cmd_gof3r = "gof3r put -b {S3_BUCKET} -k {database}.sql.e".format(S3_BUCKET=S3_BUCKET, database=database)
        p_gof3r = subprocess.Popen(shlex.split(cmd_gof3r), stdin=p_gpg.stdout, stderr=open("/dev/null"))
        p_gpg.stdout.close()
    except subprocess.CalledProcessError as e:
        print e
I tried something like this with no luck:
....
if p_gpg.returncode == 0:
    cmd_gof3r = "gof3r put -b {S3_BUCKET} -k {database}.sql.e".format(S3_BUCKET=S3_BUCKET, database=database)
    p_gof3r = subprocess.Popen(shlex.split(cmd_gof3r), stdin=p_gpg.stdout, stderr=open("/dev/null"))
    p_gpg.stdout.close()
...
Basically gof3r is streaming data to S3 even if there are errors, for instance when I intentionally change mysqldump -> mysqldumpp to generate an error.
I had the exact same question, and I managed it with:
cmd = "cat file | tr -d '\\n'"
subprocess.check_call(['/bin/bash', '-o', 'pipefail', '-c', cmd])
Thinking back, and searching in my code, I used another method too:
subprocess.check_call("ssh -c 'make toto 2>&1 | tee log.txt ; exit ${PIPESTATUS[0]}'", shell=True)
All commands in a pipeline run concurrently, e.g.:
$ nonexistent | echo it is run
echo is always run even though the nonexistent command does not exist.
pipefail affects the exit status of the pipeline as a whole -- it does not make gof3r exit any sooner
errexit has no effect because there is a single pipeline here.
If you meant that you don't want to start the next pipeline if the one from the previous iteration fails, then put break after print e in the exception handler.
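A minimal sketch of that change, based on the first snippet from the question:

for database in sorted(databases):
    cmd = ("bash -o pipefail -o errexit -c "
           "'mysqldump -B {database} | gpg -e -r {GPGRCPT} | "
           "gof3r put -b {S3_BUCKET} -k {database}.sql.e'").format(
               database=database, GPGRCPT=GPGRCPT, S3_BUCKET=S3_BUCKET)
    try:
        subprocess.check_call(cmd, shell=True, executable="/bin/bash")
    except subprocess.CalledProcessError as e:
        print e
        break  # a failed pipeline stops the remaining databases from being processed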
p_gpg.returncode is None while gpg is running. If you don't want gof3r to run when gpg fails, then you have to save gpg's output somewhere else first, e.g. in a file:
filename = 'gpg.out'
for database in sorted(databases):
    pipeline_no_gof3r = ("bash -o pipefail -c 'mysqldump -B {database} | "
                         "gpg -e -r {GPGRCPT}'").format(**vars())
    with open(filename, 'wb', 0) as file:
        if subprocess.call(shlex.split(pipeline_no_gof3r), stdout=file):
            break  # don't upload to S3, don't run the next database pipeline

    # upload the file on success
    gof3r_cmd = 'gof3r put -b {S3_BUCKET} -k {database}.sql.e'.format(**vars())
    with open(filename, 'rb', 0) as file:
        if subprocess.call(shlex.split(gof3r_cmd), stdin=file):
            break  # don't run the next database pipeline