Snakemake: integrate multiple command lines in a rule - python

The output of my first command line, "bcftools query -l {input.invcf} | head -n 1", prints the name of the first individual of the vcf file (i.e. IND1). I want to use that output in GATK SelectVariants with the -sn IND1 option. How is it possible to integrate the first command line in Snakemake so that its output can be used in the next one?
rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ind= ???
        ref="ref.fasta"
    output:
        out="{family}.dn.vcf"
    shell:
        """
        bcftools query -l {input.invcf} | head -n 1 > {params.ind}
        gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn {params.ind} -O {output.out}
        """

There are several options, but the easiest one is to store the results into a temporary bash variable:
rule selectvar:
    ...
    shell:
        """
        myparam=$(bcftools query -l {input.invcf} | head -n 1)
        gatk -sn "$myparam" ...
        """
As noted by @dariober, if one modifies pipefail behaviour there can be unexpected results; see the example in this answer.
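For illustration, here is a minimal bash sketch of that pitfall (a hypothetical example, not part of the rule above): under pipefail, which recent Snakemake versions enable for shell commands, the producer is killed by SIGPIPE when head closes the pipe early, and the whole pipeline is then considered failed.

set -o pipefail
yes | head -n 1          # 'yes' is killed by SIGPIPE once head closes the pipe
echo "exit status: $?"   # prints 141 here, but 0 when pipefail is not set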

When I have to do these things I prefer to use run instead of shell, and then shell out only at the end.
The reason for this is because it makes it possible for snakemake to lint the run statement, and to exit early if something goes wrong instead of following through with a broken shell command.
rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ref="ref.fasta",
        gatk_opts='--java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants'
    output:
        out="{family}.dn.vcf"
    run:
        opts = "{params.gatk_opts} -R {params.ref} -V {input.invcf} -O {output.out}"
        # read=True makes shell() return the command's stdout so we can capture the sample name
        sn_parameter = shell("bcftools query -l {input.invcf} | head -n 1", read=True)
        if isinstance(sn_parameter, bytes):
            sn_parameter = sn_parameter.decode()
        sn_parameter = sn_parameter.strip()
        # we could add a sanity check here if necessary, before shelling out
        shell("gatk " + opts + " -sn " + sn_parameter)

I think I found a solution:
rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ref="ref.fasta"
    output:
        out="{family}.dn.vcf"
    shell:
        """
        gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn `bcftools query -l {input.invcf} | head -n 1` -O {output.out}
        """


How to convert a script that uses ssh to pbsdsh while using Ray?

I am stuck converting my script, which uses ssh to activate the nodes, to pbsdsh. I am using Ray for node communication. My script with ssh is:
#!/bin/bash
#PBS -N Experiment_1
#PBS -l select=2:ncpus=24:mpiprocs=24
#PBS -P CSCIxxxx
#PBS -q normal
#PBS -l walltime=01:30:00
#PBS -m abe
#PBS -M xxxxx#gmail.com
ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID
cd $PBS_O_WORKDIR
jobnodes=`uniq -c ${PBS_NODEFILE} | awk -F. '{print $1 }' | awk '{print $2}' | paste -s -d " "`
thishost=`uname -n | awk -F. '{print $1.}'`
thishostip=`hostname -i`
rayport=6379
thishostNport="${thishostip}:${rayport}"
echo "Allocate Nodes = <$jobnodes>"
export thishostNport
echo "set up ray cluster..."
for n in `echo ${jobnodes}`
do
if [[ ${n} == "${thishost}" ]]
then
echo "first allocate node - use as headnode ..."
module load chpc/python/anaconda/3-2019.10
source /apps/chpc/chem/anaconda3-2019.10/etc/profile.d/conda.sh
conda activate /home/mnasir/env1
ray start --head
sleep 5
else
ssh ${n} $PBS_O_WORKDIR/startWorkerNode.pbs ${thishostNport}
sleep 10
fi
done
python -u example_trainer.py
rm $PBS_O_WORKDIR/$PBS_JOBID
#
where startWorkerNode.pbs is:
#!/bin/bash -l
source $HOME/.bashrc
cd $PBS_O_WORKDIR
param1=$1
destnode=`uname -n`
echo "destnode is = [$destnode]"
module load chpc/python/anaconda/3-2019.10
source /apps/chpc/chem/anaconda3-2019.10/etc/profile.d/conda.sh
conda activate /home/mnasir/poet
ray start --address="${param1}" --redis-password='5241590000000000'
and the example_trainer.py is:
from collections import Counter
import os
import socket
import sys
import time
import ray

num_cpus = int(sys.argv[1])
ray.init(address=os.environ["thishostNport"])
print("Nodes in the Ray cluster:")
print(ray.nodes())  # This should print all N nodes we are trying to access

@ray.remote
def f():
    time.sleep(1)
    return socket.gethostbyname(socket.gethostname()) + "--" + str(socket.gethostname())

# The following takes one second (assuming that
# ray was able to access all of the allocated nodes).
for i in range(60):
    start = time.time()
    ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
    print("GOT IPs", ip_addresses)
    print(Counter(ip_addresses))
    end = time.time()
    print(end - start)
This works perfectly and communicates across all nodes, but when I try to change the command to pbsdsh it returns:
pbsdsh: task 0x00000000 exit status 254
pbsdsh: task 0x00000001 exit status 254
This is with mpiprocs=1; if it is set to 24, the message repeats 48 times.
To the best of my knowledge, Ray needs a head node to which the worker nodes connect, hence the for loop and the if statement inside it.
I have tried directly replacing ssh with pbsdsh in the script, with and without identifying the nodes. I have also put pbsdsh outside the loop and tried a whole lot of possible combinations.
I have followed these questions but could not get my code to communicate across the nodes:
PBS/TORQUE: how do I submit a parallel job on multiple nodes?
How to execute a script on every allocated node with PBS
Handle multiple nodes in one pbs job
I believe it is probably something small that I am not able to get right. Your help and guidance will be highly appreciated!
There are a few main things that needed to change to solve this problem:
#PBS -l select=2:ncpus=24:mpiprocs=1 should be used as the selection line; specifically, change mpiprocs from 24 to 1 so that pbsdsh launches only one process per node instead of 24.
Inside jobscript.sh, in the else branch, use pbsdsh -n $J -- $PBS_O_WORKDIR/startWorkerNode.pbs ${thishostNport} & to run pbsdsh on a single node, in the background. J is kept as a node index and is incremented at each iteration of the for loop, so ray start is run once on each node.
Inside startWorkerNode.pbs, add this code at the end:
# Here, sleep for the duration of the job, so ray does not stop
WALLTIME=$(qstat -f $PBS_JOBID | sed -rn 's/.*Resource_List.walltime = (.*)/\1/p')
SECONDS=`echo $WALLTIME | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'`
echo "SLEEPING FOR $SECONDS s"
sleep $SECONDS
This ensures that ray start is kept alive for the duration of the job rather than stopping as soon as the pbsdsh command returns. The & in the previous point is also necessary here, as pbsdsh would otherwise never return.
Here are the files for reference:
startWorkerNode.pbs
#!/bin/bash -l
source $HOME/.bashrc
cd $PBS_O_WORKDIR
param1=$1
destnode=`uname -n`
echo "destnode is = [$destnode]"
module load chpc/python/anaconda/3-2019.10
source /apps/chpc/chem/anaconda3-2019.10/etc/profile.d/conda.sh
conda activate /home/mnasir/poet
ray start --address="${param1}" --redis-password='5241590000000000'
# Here, sleep for the duration of the job, so ray does not stop
WALLTIME=$(qstat -f $PBS_JOBID | sed -rn 's/.*Resource_List.walltime = (.*)/\1/p')
SECONDS=`echo $WALLTIME | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'`
echo "SLEEPING FOR $SECONDS s"
sleep $SECONDS
jobscript.sh
#!/bin/bash
#PBS -N Experiment_1
#PBS -l select=2:ncpus=24:mpiprocs=1
#PBS -P CSCIxxxx
#PBS -q normal
#PBS -l walltime=01:30:00
#PBS -m abe
#PBS -M xxxxx#gmail.com
ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID
cd $PBS_O_WORKDIR
jobnodes=`uniq -c ${PBS_NODEFILE} | awk -F. '{print $1 }' | awk '{print $2}' | paste -s -d " "`
thishost=`uname -n | awk -F. '{print $1.}'`
thishostip=`hostname -i`
rayport=6379
thishostNport="${thishostip}:${rayport}"
echo "Allocate Nodes = <$jobnodes>"
export thishostNport
echo "set up ray cluster..."
J=0
for n in `echo ${jobnodes}`
do
if [[ ${n} == "${thishost}" ]]
then
echo "first allocate node - use as headnode ..."
module load chpc/python/anaconda/3-2019.10
source /apps/chpc/chem/anaconda3-2019.10/etc/profile.d/conda.sh
conda activate /home/mnasir/env1
ray start --head
sleep 5
else
# Run pbsdsh on the J'th node, and do it in the background.
pbsdsh -n $J -- $PBS_O_WORKDIR/startWorkerNode.pbs ${thishostNport} &
sleep 10
fi
J=$((J+1))
done
python -u example_trainer.py 48
rm $PBS_O_WORKDIR/$PBS_JOBID

Snakemake use file content as shell command

I'm trying to automate with Snakemake a shell command that uses the chopchop software:
./chopchop.py -G hg38 -o temp -Target chr16:46390060-46390782
In this command the 'chr16:46390060-46390782' input will change.
All the different inputs are in a file, which I'll have to parse to get the appropriate input format:
cat test.bed
chr16 46390060 46390782
chr21 33931554 33931728
I have a simple Snakemake rule that runs the shell command:
rule run_chopchop:
    input:
        "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/enhancer_dataset/jurkat/test.bed"
    output:
        "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/chopchop_output/guide_chopchop.txt"
    shell:'''
        set +u; source /gpfs/home/user/Apps/anaconda3/bin/activate chopchop; set -u
        ./gpfs/home/user/git/chopchop/chopchop.py -G hg38 -o temp -Target {input} > {output}
        '''
How can I use the content of the file as input in Snakemake and change the syntax of each line to get the appropriate format? I really have no idea of the syntax. If someone can help me.
Thanks
You can always replace the shell: section with the run: one. In this case you need to call the shell() function each time you need to run the script:
rule run_chopchop:
    input: "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/enhancer_dataset/jurkat/test.bed"
    output: "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/chopchop_output/guide_chopchop.txt"
    run:
        # for each line in the file
        input = ...
        output = ...
        shell(f'''
            set +u; source /gpfs/home/user/Apps/anaconda3/bin/activate chopchop; set -u
            ./gpfs/home/user/git/chopchop/chopchop.py -G hg38 -o temp -Target {input} > {output}
            ''')
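For the line-to-target conversion itself, an awk one-liner can turn each bed line into the chrN:start-end string that chopchop expects. A rough standalone bash sketch (assuming the whitespace-delimited test.bed shown above; paths are shortened here for illustration):

# Hypothetical sketch: run chopchop once per bed line
awk '{print $1 ":" $2 "-" $3}' test.bed | while read -r target; do
    ./chopchop.py -G hg38 -o temp -Target "$target" >> guide_chopchop.txt
done

The same awk expression could be used inside the run: block to build the -Target argument for each shell() call.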

Snakemake ambiguity

I have an ambiguity error and I can't figure out why and how to solve it.
Defining the wildcards:
rule all:
    input:
        xls = expand("reports/{sample}.xlsx", sample = config["samples"]),
        runfolder_xls = expand("{runfolder}.xlsx", runfolder = config["runfolder"])
Actual rules:
rule sample_report:
    input:
        vcf = "vcfs/{sample}.annotated.vcf",
        cov = "stats/{sample}.coverage.gz",
        mod_bed = "tmp/mod_ref_{sample}.bed",
        nirvana_g2t = "/mnt/storage/data/NGS/nirvana_genes2transcripts"
    output:
        "reports/{sample}.xlsx"
    params:
        get_nb_samples()
    log:
        "logs/{sample}.log"
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_sample_report.py -v {input.vcf} -c {input.cov} -r {input.mod_bed} -n {input.nirvana_g2t} -r {rule};
        exitcode=$? ;
        if [[ {params} > 1 ]]
        then
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.sample}
        elif [[ {params} == 1 ]]
        then
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r sample_mode -n {wildcards.sample}
        else
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e 1 -r {rule} -n {wildcards.sample}
        fi
        """

rule runfolder_report:
    input:
        sample_sheet = "SampleSheet.csv"
    output:
        "{runfolder}.xlsx"
    log:
        "logs/{runfolder}.log"
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {wildcards.runfolder} -s {input.sample_sheet} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.runfolder}
        """
Config file:
runfolder: "CP0340"
samples: ['C014044p', 'C130157', 'C014040p', 'C014054b-1', 'C051198-A', 'C014042p', 'C052007W-C', 'C051198-B', 'C014038p', 'C052004-B', 'C051198-C', 'C052004-C', 'C052003-B', 'C052003-A', 'C052004-A', 'C052002-C', 'C052005-C', 'C052002-A', 'C130157N', 'C052006-B', 'C014063pW', 'C014054b-2', 'C052002-B', 'C052006-C', 'C052007W-B', 'C052003-C', 'C014064bW', 'C052005-B', 'C052006-A', 'C052005-A']
Error:
$ snakemake -n -s ../niles/Snakefile --configfile logs/CP0340_config.yaml
Building DAG of jobs...
AmbiguousRuleException:
Rules runfolder_report and sample_report are ambiguous for the file reports/C014044p.xlsx.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
runfolder_report: runfolder=reports/C014044p
sample_report: sample=C014044p
Expected input files:
runfolder_report: SampleSheet.csv
sample_report: vcfs/C014044p.annotated.vcf stats/C014044p.coverage.gz tmp/mod_ref_C014044p.bed /mnt/storage/data/NGS/nirvana_genes2transcripts
Expected output files:
runfolder_report: reports/C014044p.xlsx
sample_report: reports/C014044p.xlsx
If I understand Snakemake correctly, the wildcards in the rules are defined in my all rule, so I don't understand why the runfolder_report rule tries to produce reports/C014044p.xlsx as an output, nor how the output got a sample name instead of the runfolder name (as defined in the config file).
As the error message suggests, you could assign a distinct prefix to the output of each rule. So your original code will work if you replace {runfolder}.xlsx with, e.g., "runfolder/{runfolder}.xlsx" in rule all and in runfolder_report. Alternatively, constrain the wildcards (my preferred solution) by adding before rule all something like:
wildcard_constraints:
    sample= '|'.join([re.escape(x) for x in config["samples"]]),
    runfolder= re.escape(config["runfolder"]),
The reason for this is that snakemake matches input and output strings using regular expressions (the fine details of how it's done, I must admit, escape me...)
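To see why {runfolder}.xlsx also matches reports/C014044p.xlsx: each wildcard is (roughly) turned into a regex group that by default matches any characters, including "/". A quick shell illustration of that matching, not Snakemake's exact internals:

# {runfolder}.xlsx behaves roughly like the anchored regex ^(.+)\.xlsx$,
# and ".+" happily matches across "/", so "reports/C014044p" becomes the runfolder value
echo "reports/C014044p.xlsx" | sed -E 's/^(.+)\.xlsx$/runfolder=\1/'
# prints: runfolder=reports/C014044p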
Ok here is my solution:
rule runfolder_report:
    input:
        "SampleSheet.csv"
    output:
        expand("{runfolder}.xlsx", runfolder = config["runfolder"])
    params:
        config["runfolder"]
    log:
        expand("logs/{runfolder}.log", runfolder = config["runfolder"])
    shell: """
        set +e ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {params} -s {input} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {params}
        """
However, I still don't understand why it raised errors, because I know that it worked previously.

Check 10000K+ URL

Well, I want to check 100000k+ URLs on Linux.
Those links are actually OTA [zip] files for my Android.
Among those links there is only one valid link; the rest give a 404 error.
So how do I check all the links in as little time as possible, on a Linux server or a web server [apache]?
structure of urls:
http://link.com/updateOTA_1.zip
http://link.com/updateOTA_2.zip
http://link.com/updateOTA_999999999.zip
Okay, what I tried:
I made this script, but it is really slow: http://pastebin.com/KVxnzttA. I also increased the threads up to 500, and then my server crashed :[
#!/bin/bash
for a in {1487054155500..1487055000000}
do
if [ $((a%50)) = 0 ]
then
curl -s -I http://link.com/updateOTA_$((a)).zip | head -n1 &
curl -s -I http://link.com/updateOTA_$((a+1)).zip | head -n1 &
curl -s -I http://link.com/updateOTA_$((a+2)).zip | head -n1 &
curl -s -I http://link.com/updateOTA_$((a+3)).zip | head -n1 &
curl -s -I http://link.com/updateOTA_$((a+4)).zip | head -n1 &
...
curl -s -I http://link.com/updateOTA_$((a+49)).zip | head -n1 &
curl -s -I http://link.com/updateOTA_$((a+50)).zip | head -n1
wait
echo "$((a))"
fi
done
I tried with aria2, but the highest number of threads in aria2 is 16, so that failed again.
I tried some online tools, but they have a 100-URL restriction.
Running curl 100,000+ times is going to be slow. Instead, write batches of URLs to a single instance of curl to reduce the overhead of starting curl.
# This loop doesn't require pre-generating a list of a million integers
for ((a=1487054155500; a<=1487055000000; a+=50)); do
    for ((k=0; k<50; k++)); do
        printf 'url = %s\n' "http://link.com/updateOTA_$((a+k)).zip"
    done | curl -I -K - -w 'result: %{http_code} %{url_effective}\n' | grep -F 'result:' > batch-$a.txt
done
The -w option is used to produce output associating each URL with its result, should you want that.
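As an alternative sketch (not part of the answer above), GNU xargs can keep several curl HEAD requests in flight at once; here -P 16 runs 16 curls in parallel and only non-404 results are kept (URL pattern and numeric range taken from the question):

# Hypothetical parallel check with xargs; tune -P to what the server and network tolerate
seq 1487054155500 1487055000000 \
    | xargs -P 16 -I{} curl -s -o /dev/null -I -w '%{http_code} %{url_effective}\n' "http://link.com/updateOTA_{}.zip" \
    | grep -v '^404' > found.txt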
However, I found a solution using aria2c.
Now it is scanning 7k URLs per minute.
Thanks to all.
aria2c -i url -s16 -x16 --max-concurrent-downloads=1000
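The url file passed to -i is presumably just the list of candidate links, one per line; it could be generated from the numeric range used in the script above, for example:

# Hypothetical helper: write one candidate URL per line for aria2c -i
seq 1487054155500 1487055000000 \
    | awk '{print "http://link.com/updateOTA_" $1 ".zip"}' > url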

Tail file till process exits

Going through the answers at superuser.
I'm trying to modify this to listen for multiple strings and echo custom messages such as 'Your server started successfully', etc.
I'm also trying to tack it onto another command, i.e. pip.
wait_str() {
    local file="$1"; shift
    local search_term="Successfully installed"; shift
    local search_term2='Exception'
    local wait_time="${1:-5m}"; shift # 5 minutes as default timeout
    (timeout $wait_time tail -F -n0 "$file" &) | grep -q "$search_term" && echo 'Custom success message' && return 0 || || grep -q "$search_term2" && echo 'Custom success message' && return 0
    echo "Timeout of $wait_time reached. Unable to find '$search_term' or '$search_term2' in '$file'"
    return 1
}
The usage I have in mind is:
pip install -r requirements.txt > /var/log/pip/dump.log && wait_str /var/log/pip/dump.log
To clarify, I'd like to get wait_str to stop tailing when pip exits, whether successfully or not.
The following is a general answer; tail can be replaced by any command that produces a stream of lines.
If different strings need different actions, then use the following:
tail -f var/log/pip/dump.log |awk '/condition1/ {action for condition-1} /condition-2/ {action for condition-2} .....'
If multiple conditions need the same action, separate them using the OR operator:
tail -f var/log/pip/dump.log |awk '/condition-1/ || /condition-2/ || /condition-n/ {take this action}'
Based on comments: a single awk can do this.
tail -f /path/to/file |awk '/Exception/{ print "Worked"} /compiler/{ print "worked"}'
or
tail -f /path/to/file | awk '/Exception/||/compiler/{ print "worked"}'
Or exit when a match is found:
tail -f logfile |awk '/Exception/||/compiler/{ print "worked";exit}'
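To also stop tailing when pip exits (the original goal), GNU tail's --pid option can be combined with the awk approach above. A minimal sketch, assuming GNU coreutils and the log path from the question:

# Run pip in the background, then tail until pip's process exits (--pid) while awk watches for patterns
pip install -r requirements.txt > /var/log/pip/dump.log 2>&1 &
pip_pid=$!
tail --pid="$pip_pid" -n +1 -F /var/log/pip/dump.log \
    | awk '/Successfully installed/{print "Custom success message"} /Exception/{print "Custom failure message"}'
wait "$pip_pid"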
