You have a pipeline where some samples have one input file and produce one output file, while other samples have two input files and produce two output files. The typical case in bioinformatics is NGS sequencing libraries where some samples are sequenced paired-end and others single-end. If you need to trim reads and align them, you have to account for the variable number of input/output files.
What is the most appropriate way to handle this? I feel that using checkpoints is overkill, since in my opinion checkpoints are a bit cryptic, but I may be wrong...
Here's how I would do it for the case where the number of input/output files can only be 1 or 2: have an if-else in the run directive based on the number of input files. If there is only one input, touch the second output file. In the following rules, have another if-else, this time checking whether the second file has 0 bytes.
Here's an example (not tested, but it should be about right):
import os

samples = {'S1': ['s1.R1.fastq'],
           'S2': ['s2.R1.fastq', 's2.R2.fastq']}

rule all:
    input:
        expand('bam/{sample_id}.bam', sample_id= samples),

rule cutadapt:
    input:
        fastq= lambda wc: samples[wc.sample_id],
    output:
        r1= 'cutadapt/{sample_id}.R1.fastq',
        r2= touch('cutadapt/{sample_id}.R2.fastq'),
    run:
        if len(input.fastq) == 1:
            # {output.r1} created, {output.r2} only touched
            shell('cutadapt ... -o {output.r1} {input.fastq}')
        else:
            shell('cutadapt ... -o {output.r1} -p {output.r2} {input.fastq}')

rule align:
    input:
        r1= 'cutadapt/{sample_id}.R1.fastq',
        r2= 'cutadapt/{sample_id}.R2.fastq',
    output:
        bam= 'bam/{sample_id}.bam',
    run:
        if os.path.getsize(input.r2) == 0:
            # or: if len(samples[wildcards.sample_id]) == 1:
            shell('hisat2 ... {input.r1} > {output.bam}')
        else:
            shell('hisat2 ... -1 {input.r1} -2 {input.r2} > {output.bam}')
This works, but I find it awkward to artificially create the second file just to keep the workflow happy. Are there better solutions?
There should be two separate rules, one for the single-end case and one for the paired-end case:
rule single_end:
    input:
        fastq = 's{n}.R1.fastq'
    output:
        r1 = 'cutadapt/S{n}.R1.fastq'
    shell:
        'cutadapt ... -o {output.r1} {input.fastq}'

rule pair_ends:
    input:
        r1 = 's{n}.R1.fastq',
        r2 = 's{n}.R2.fastq'
    output:
        r1 = 'cutadapt/S{n}.R1.fastq',
        r2 = 'cutadapt/S{n}.R2.fastq'
    shell:
        'cutadapt ... -o {output.r1} -p {output.r2} {input}'
Now there is an ambiguity: whenever pair_ends can be applied, single_end could be applied too. This can be resolved with ruleorder:
ruleorder: pair_ends > single_end
I agree that checkpoints would be overkill here.
You can compose arbitrary inputs and command lines using input functions and parameter functions. The tricky part is the optional r2 output from your cutadapt rule, which may or may not be present and which complicates the workflow's DAG. In this case it's probably appropriate to use Snakemake's directory output functionality:
CUTADAPT_R1 = 'cutadapt/{sample_id}/R1.fastq'
CUTADAPT_R2 = 'cutadapt/{sample_id}/R2.fastq'

def get_cutadapt_outargs(wldc, input):
    r1 = CUTADAPT_R1.format(**wldc)
    r2 = CUTADAPT_R2.format(**wldc)
    ret = f"-o '{r1}'"
    if len(input.fastq) > 1:
        ret += f" -p '{r2}'"
    return ret

def get_align_inargs(wldc):
    r1 = CUTADAPT_R1.format(**wldc)
    r2 = CUTADAPT_R2.format(**wldc)
    if len(samples[wldc.sample_id]) > 1:
        return "-1 '{}' -2 '{}'".format(r1, r2)
    else:
        return "'{}'".format(r1)

rule all:
    input:
        expand('bam/{sample_id}.bam', sample_id= samples),

rule cutadapt:
    input:
        fastq= lambda wc: samples[wc.sample_id],
    output:
        fq = directory('cutadapt/{sample_id}'),
    params:
        out_args = get_cutadapt_outargs,
    shell:
        """
        cutadapt ... {params.out_args} {input.fastq}
        """

rule align:
    input:
        fq = rules.cutadapt.output.fq,
    output:
        bam = 'bam/{sample_id}.bam',
    params:
        input_args = get_align_inargs,
    shell:
        """
        hisat2 ... {params.input_args} > {output.bam}
        """
I'm trying to run the following Blockworld Planning Program in python:
#include <incmode>.
#program base.
% Define
location(table).
location(X) :- block(X).
holds(F,0) :- init(F).
#program step(t).
% Generate
{ move(X,Y,t) : block(X), location(Y), X != Y } = 1.
% Test
:- move(X,Y,t), holds(on(A,X),t-1).
:- move(X,Y,t), holds(on(B,Y),t-1), B != X, Y != table.
% Define
moved(X,t) :- move(X,Y,t).
holds(on(X,Y),t) :- move(X,Y,t).
holds(on(X,Z),t) :- holds(on(X,Z),t-1), not moved(X,t).
#program check(t).
% Test
:- query(t), goal(F), not holds(F,t).
% Display
#show move/3.
#program base.
% Sussman Anomaly
%
block(b0).
block(b1).
block(b2).
%
% initial state:
%
% 2
% 0 1
% -------
%
init(on(b1,table)).
init(on(b2,b0)).
init(on(b0,table)).
%
% goal state:
%
% 2
% 1
% 0
% -------
%
goal(on(b1,b0)).
goal(on(b2,b1)).
goal(on(b0,table)).
I do this by making use of the following simple function
import clingo

def print_answer_sets(program):
    control = clingo.Control()
    control.add("base", [], program)
    control.ground([("base", [])])
    control.configuration.solve.models = 0
    for model in control.solve(yield_=True):
        sorted_model = [str(atom) for atom in model.symbols(shown=True)]
        sorted_model.sort()
        print("Answer set: {{{}}}".format(", ".join(sorted_model)))
To this I pass the program above as one long triple-quoted string, i.e.
print_answer_sets("""
#include <incmode>.
#program base.
% Define
location(table).
location(X) :- bl
...etc.
""")
This has generally worked with other clingo programs.
However, I can't seem to produce an answer set, and I get an error
no atoms over signature occur in program:
move/3
Note that running this same program directly with clingo in the terminal or on their web interface works fine.
I believe this is because this string makes use of multiple "#program" statements and an "#include" statement, which I might not be parsing correctly (or at all) in print_answer_sets. Is there a way to properly handle program parts in Python? I've tried looking at the docs and the user manual, but it's a bit dense.
Thanks!
I got help from my professor, who managed to find a solution. I added some comments for myself and others and am posting it below:
import clingo

asp_code_base = """
#include <incmode>.
% Define
location(table).
location(X) :- block(X).
holds(F,0) :- init(F).
block(b0).
block(b1).
block(b2).
init(on(b1,table)).
init(on(b2,b0)).
init(on(b0,table)).
goal(on(b1,b0)).
goal(on(b2,b1)).
goal(on(b0,table)).
#show move/3.
"""

asp_code_step = """
% Generate
{ move(X,Y,t) : block(X), location(Y), X != Y } = 1.
% Test
:- move(X,Y,t), holds(on(A,X),t-1).
:- move(X,Y,t), holds(on(B,Y),t-1), B != X, Y != table.
% Define
moved(X,t) :- move(X,Y,t).
holds(on(X,Y),t) :- move(X,Y,t).
holds(on(X,Z),t) :- holds(on(X,Z),t-1), not moved(X,t).
"""

asp_code_check = """
#external query(t).
% Test
:- query(t), goal(F), not holds(F,t).
"""

def on_model(model):
    print("Found solution:", model)

# this is an arbitrary upper limit set to ensure the process terminates
max_step = 5

control = clingo.Control()
control.configuration.solve.models = 1  # find one answer only

# add each #program part
control.add("base", [], asp_code_base)
control.add("step", ["t"], asp_code_step)
control.add("check", ["t"], asp_code_check)

# for grounding, we make use of a parts array
parts = []
parts.append(("base", []))
control.ground(parts)

ret, step = None, 1
# try solving until max number of steps or until solved
while step <= max_step and (step == 1 or not ret.satisfiable):
    parts = []
    # handle the #external declaration: release the query atom of the previous step
    control.release_external(clingo.Function("query", [clingo.Number(step - 1)]))
    parts.append(("step", [clingo.Number(step)]))
    parts.append(("check", [clingo.Number(step)]))
    control.cleanup()  # clean up the previous grounding call, so we can ground again
    control.ground(parts)
    # finish handling the #external declaration: set the current query atom to true
    control.assign_external(clingo.Function("query", [clingo.Number(step)]), True)
    print(f"Solving step: t={step}")
    ret = control.solve(on_model=on_model)
    print(f"Returned: {ret}")
    step += 1
This outputs, as expected (omitting warning)
Solving step: t=1
Returned: UNSAT
Solving step: t=2
Returned: UNSAT
Solving step: t=3
Found solution: move(b2,table,1) move(b1,b0,2) move(b2,b1,3)
Returned: SAT
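If you want the same sorted "Answer set: {...}" formatting as the original print_answer_sets helper, the on_model callback can build it itself; here is a small sketch reusing the model.symbols(shown=True) call from the question:
def on_model(model):
    # collect the atoms selected by #show, sort them, and print one answer set per line
    atoms = sorted(str(atom) for atom in model.symbols(shown=True))
    print("Answer set: {{{}}}".format(", ".join(atoms)))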
"TMalign..." is an executable file that I used to get data. How could I store the output into a variable so that I could extract target values from the output. The executable file is compiled from a long .cpp, so I do not think I could call the variable names from there.
import sys,os
os.system("./TMalign 3w4u.pdb 6bb5.pdb -u 139") #some command I have
The output looks like the following, and I need to extract the TM-score values:
*********************************************************************
* TM-align (Version 20190822): protein structure alignment *
* References: Y Zhang, J Skolnick. Nucl Acids Res 33, 2302-9 (2005) *
* Please email comments and suggestions to yangzhanglab@umich.edu *
*********************************************************************
Name of Chain_1: 3w4u.pdb (to be superimposed onto Chain_2)
Name of Chain_2: 6bb5.pdb
Length of Chain_1: 141 residues
Length of Chain_2: 139 residues
Aligned length= 139, RMSD= 1.07, Seq_ID=n_identical/n_aligned= 0.590
TM-score= 0.94726 (if normalized by length of Chain_1, i.e., LN=141, d0=4.42)
TM-score= 0.96044 (if normalized by length of Chain_2, i.e., LN=139, d0=4.38)
TM-score= 0.96044 (if normalized by user-specified LN=139.00 and d0=4.38)
(You should use TM-score normalized by length of the reference structure)
(":" denotes residue pairs of d < 5.0 Angstrom, "." denotes other aligned residues)
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::. .
-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK-Y
Total CPU time is 0.03 seconds
Thanks for help!
You should look toward the following approach:
import re
from subprocess import check_output

ret = check_output(['./TMalign', '3w4u.pdb', '6bb5.pdb', '-u', '139'])

tm_scores = []
for line in ret.decode().split('\n'):   # check_output returns bytes, so decode first
    if re.match(r'^TM-score=', line):
        score = line.split()[1:2]  # extract the value
        tm_scores.extend(score)    # saving only the values

# tm_scores now contains: ['0.94726', '0.96044', '0.96044']
While somewhat elaborate, it is a flexible and tunable solution. Note that if it will be used among other code, it would be better to wrap it into a function.
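For example, it could be wrapped roughly like this (the function name, arguments, and the float conversion are my own choices, not anything prescribed by TMalign):
import re
from subprocess import check_output

def get_tm_scores(pdb1, pdb2, user_length=None):
    """Run TMalign on two PDB files and return the TM-score values as floats."""
    cmd = ['./TMalign', pdb1, pdb2]
    if user_length is not None:
        cmd += ['-u', str(user_length)]
    out = check_output(cmd).decode()
    return [float(line.split()[1])
            for line in out.splitlines()
            if re.match(r'^TM-score=', line)]

# e.g. get_tm_scores('3w4u.pdb', '6bb5.pdb', 139) -> [0.94726, 0.96044, 0.96044]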
My own approach wasn't that smart: I wrote the output to a file and worked from there:
import os
cmd = './TMalign 3w4u.pdb 6bb5.pdb -u 139'
os.system(cmd + " >> 1.txt")
I want to create Venn diagrams with pybedtools. There is a special script using matplotlib called venn_mpl. It works perfectly when I try it out in my Jupyter notebook. You can use it from Python or via shell commands.
Unfortunately, something goes wrong when I use it in my Snakefile, and I can't really figure out what the problem is.
First of all, this is the script: venn_mpl.py
#!/gnu/store/3w3nz0h93h7jif9d9c3hdfyimgkpx1a4-python-wrapper-3.7.0/bin/python
"""
Given 3 files, creates a 3-way Venn diagram of intersections using matplotlib; \
see :mod:`pybedtools.contrib.venn_maker` for more flexibility.
Numbers are placed on the diagram. If you don't have matplotlib installed.
try venn_gchart.py to use the Google Chart API instead.
The values in the diagram assume:
* unstranded intersections
* no features that are nested inside larger features
"""
import argparse
import sys
import os
import pybedtools
def venn_mpl(a, b, c, colors=None, outfn='out.png', labels=None):
    """
    *a*, *b*, and *c* are filenames to BED-like files.

    *colors* is a list of matplotlib colors for the Venn diagram circles.

    *outfn* is the resulting output file. This is passed directly to
    fig.savefig(), so you can supply extensions of .png, .pdf, or whatever your
    matplotlib installation supports.

    *labels* is a list of labels to use for each of the files; by default the
    labels are ['a','b','c']
    """
    try:
        import matplotlib.pyplot as plt
        from matplotlib.patches import Circle
    except ImportError:
        sys.stderr.write('matplotlib is required to make a Venn diagram with %s\n' % os.path.basename(sys.argv[0]))
        sys.exit(1)

    a = pybedtools.BedTool(a)
    b = pybedtools.BedTool(b)
    c = pybedtools.BedTool(c)

    if colors is None:
        colors = ['r','b','g']

    radius = 6.0
    center = 0.0
    offset = radius / 2

    if labels is None:
        labels = ['a','b','c']
Then my code:
rule venndiagramm_data:
    input:
        data = expand("bed_files/{sample}_peaks.narrowPeak", sample=config["samples"]["data"])
    output:
        "figures/Venn_PR1_PR2_GUI_data.png"
    run:
        col = ['g','k','b']
        lab = ['PR1_data','PR2_data','GUI_data']
        venn_mpl(input.data[0], input.data[1], input.data[2], colors = col, labels = lab, outfn = output)
The error is:
SystemExit in line 62 of snakemake_generatingVennDiagramm.py:
1
The Snakemake log only gives me:
rule venndiagramm_data:
input: bed_files/A_peaks.narrowPeak,bed_files/B_peaks.narrowPeak, bed_files/C_peaks.narrowPeak
output: figures/Venn_PR1_PR2_GUI_data.png
jobid: 2
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I already tried to add the following, as written in the documentation:
rule error:
    shell:
        """
        set +e
        somecommand ...
        exitcode=$?
        if [ $exitcode -eq 1 ]
        then
            exit 1
        else
            exit 0
        fi
        """
but this changed nothing.
My next idea was to just do it using the shell command, which I had also tested before and which worked perfectly. But then I got a different, though I think quite similar, error message for which I didn't find a proper solution either:
rule venndiagramm_data_shell:
    input:
        data = expand("bed_files/{sample}_peaks.narrowPeak", sample=config["samples"]["data"])
    output:
        "figures/Venn_PR1_PR2_GUI_data.png"
    shell:
        "venn_mpl.py -a {input.data[0]} -b {input.data[1]} -c {input.data[2]} --color 'g,k,b' --labels 'PR1_data,PR2_data,GUI_data'"
The snakemake log:
[Thu May 23 16:37:27 2019]
rule venndiagramm_data_shell:
input: bed_files/A_peaks.narrowPeak, bed_files/B_peaks.narrowPeak, bed_files/C_peaks.narrowPeak
output: figures/Venn_PR1_PR2_GUI_data.png
jobid: 1
[Thu May 23 16:37:29 2019]
Error in rule venndiagramm_data_shell:
jobid: 1
output: figures/Venn_PR1_PR2_GUI_data.png
RuleException:
CalledProcessError in line 45 of snakemake_generatingVennDiagramm.py:
Command ' set -euo pipefail; venn_mpl.py -a input.data[0] -b input.data[1] -c input.data[2] --color 'g,k,b' --labels 'PR1_data,PR2_data,GUI_data' ' returned non-zero exit status 1.
Does anyone have an idea what the reason for this could be and how to fix it?
FYI: as I said, I tested it without running it through Snakemake. This is my working code:
from snakemake.io import expand
import yaml
import pybedtools
from pybedtools.scripts.venn_mpl import venn_mpl

config_text_real = """
samples:
    data:
        - A
        - B
        - C
    control:
        - A_input
        - B_input
        - C_input
"""

config_vennDiagramm = yaml.load(config_text_real)
config = config_vennDiagramm

data = expand("{sample}_peaks.narrowPeak", sample=config["samples"]["data"])

col = ['g','k','b']
lab = ['PR1_data','PR2_data','GUI_data']
venn_mpl(data[0], data[1], data[2], colors = col, labels = lab, outfn = 'Venn_PR1_PR2_GUI_data.png')

control = expand("{sample}_peaks.narrowPeak", sample=config["samples"]["control"])
lab = ['PR1_control','PR2_control','GUI_control']
venn_mpl(control[0], control[1], control[2], colors = col, labels = lab, outfn = 'Venn_PR1_PR2_GUI_control.png')
and within my Jupyter notebook for the shell version:
!A='../path/to/file/A_peaks.narrowPeak'
!B='../path/to/file/B_peaks.narrowPeak'
!C='../path/to/file/C_peaks.narrowPeak'
!col=g,k,b
!lab='PR1_data, PR2_data, GUI_data'
!venn_mpl.py -a ../path/to/file/A_peaks.narrowPeak -b ../path/to/file/B_peaks.narrowPeak -c ../path/to/file/C_peaks.narrowPeak --color "g,k,b" --labels "PR1_data, PR2_data, GUI_data"
The reason I used the full path instead of the variable is that, for some reason, the code didn't work when calling the variable with "$A".
Not sure if this fixes it, but one thing I notice is that:
shell:
    "venn_mpl.py -a input.data[0] -b input.data[1] -c input.data[2]..."
probably should be:
shell:
    "venn_mpl.py -a {input.data[0]} -b {input.data[1]} -c {input.data[2]}..."
Hi everyone / Python Gurus
I would like to know how to accomplish the following task, which so far I've been unable to do.
Here's what I have:
Q1 = 20e-6
Now this is a number in exponential notation, and if you print(Q1) as is, it will show 2e-05, which is fine mathematically speaking.
However, here's what I want to do with it:
I want Q1 to print only the number 20, and, depending on whether the exponent is e-6 or e-9, print uC or nC respectively.
Here's an example for better understanding:
Q1=20e-6
When I run print(Q1), it should show: 20uC.
Q2=20e-9
When I run print(Q2), it should show: 20nC.
Can you please help me figure this out?
Just replace the exponent using str.replace (this assumes the value is kept as a string):
q1 = 'XXXXXX'
q1 = q1.replace('e-9', 'nC').replace('e-6', 'uC')
print(q1)
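Note that this assumes you already have the value as a string like '20e-6'; a bare float such as 20e-6 prints as 2e-05, so you would first need to build such a string yourself. A rough sketch, where the exponent you want to expose is chosen by hand:
q1 = 20e-6

exp = -6                              # exponent to expose
mantissa = q1 / 10**exp               # 20.0
s = '{:g}e{}'.format(mantissa, exp)   # '20e-6'
print(s.replace('e-9', 'nC').replace('e-6', 'uC'))   # 20uC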
I recommend using si-prefix.
You can install it using pip:
sudo pip install si-prefix
Then you can use something like this:
from si_prefix import si_format

# precision: digits after the decimal point
# char: the unit's character to append
def get_format(a, char='C', precision=2):
    temp = si_format(a, precision)
    try:
        num, prefix = temp.split()
    except ValueError:
        num, prefix = temp, ''
    if '.' in num:
        aa, bb = num.split('.')
        if int(bb) == 0:
            num = aa
    if prefix:
        return num + ' ' + prefix + char
    else:
        return num

tests = [20e-6, 21.46e05, 33.32e-10, 0.5e03, 0.33e-2, 112.044e-6]
for k in tests:
    print(get_format(k))
Output:
20 uC
2.15 MC
3.33 nC
500
3.30 mC
112.04 uC
You can try splitting the string:
'20e-9'.split('e')
gives
['20', '-9']
From there on, you can insert whatever you want in between those values:
a[0] + ('u' if a[1] == '-6' else 'n') + 'C'
(with a = '20e-9'.split('e'), which gives '20nC')
You cannot. The behaviour you are looking for is called "monkey patching", and it is not allowed for int and float.
You can refer to this Stack Overflow question.
The only way I can think of is to create a class that extends float and then implement a __str__ method that prints as per your requirement.
------- More explanation -----
If you type
Q1 = 20e-6
in a Python shell and then type
type(Q1)
you will get
float
So basically your Q1 is considered a float by the Python type system.
When you type print(Q1),
the __str__ method of float is called.
The process of extending a core class is one example of a "monkey patch", and that is what I was referring to.
Now the problem is that you cannot "monkey patch" (or extend, if you prefer) core classes in Python (which you can in some languages, like Ruby).
[int, float, etc. are core classes and are written in C in the most common Python distribution.]
So how do you solve it?
You need to create a new class like this:
class Exponent(float):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return "ok"

x = Exponent(10.0)
print(x)  # ==> "ok"
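Building on that, a __str__ tailored to the question might look roughly like this (a sketch only: the class name is made up and only the two prefixes from the question are handled):
class Charge(float):
    # illustrative subclass: render the value with a uC or nC suffix
    def __str__(self):
        if abs(self) >= 1e-6:
            return '{:g}uC'.format(self / 1e-6)
        return '{:g}nC'.format(self / 1e-9)

q1 = Charge(20e-6)
q2 = Charge(20e-9)
print(q1)   # 20uC
print(q2)   # 20nC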
hope this helps
I've got about 3,500 files that consist of single-line character strings. The files vary in size (from about 200 B to 1 MB). I'm trying to compare each file with each other file and find a common subsequence of length 20 characters between two files. Note that the subsequence is only common between two files during each comparison, and not common among all files.
I've struggled with this problem a bit, and since I'm not an expert, I've ended up with a bit of an ad-hoc solution. I use itertools.combinations to build a list in Python that ends up with around 6,239,278 combinations. I then pass the files two at a time to a Perl script that acts as a wrapper for a suffix tree library written in C called libstree. I've tried to avoid this type of solution, but the only comparable C suffix tree wrapper in Python suffers from a memory leak.
So here's my problem. I've timed it, and on my machine the solution processes about 500 comparisons in 25 seconds. That means it'll take around 3 days of continuous processing to complete the task. And then I have to do it all again to look at, say, 25 characters instead of 20. Please note that I'm way out of my comfort zone and not a very good programmer, so I'm sure there is a much more elegant way to do this. I thought I'd ask here and share my code to see if anyone has suggestions on how I could complete this task faster.
Python code:
from itertools import combinations
import glob, subprocess

glist = glob.glob("Data/*.g")

i = 0
for a, b in combinations(glist, 2):
    i += 1
    p = subprocess.Popen(["perl", "suffix_tree.pl", a, b, "20"], shell=False, stdout=subprocess.PIPE)
    p = p.stdout.read()
    a = a.split("/")
    b = b.split("/")
    a = a[1].split(".")
    b = b[1].split(".")
    print str(i) + ":" + str(a[0]) + " --- " + str(b[0])
    if p != "" and len(p) == 20:
        with open("tmp.list", "a") as openf:
            openf.write(a[0] + " " + b[0] + "\n")
Perl code:
use strict;
use Tree::Suffix;

open FILE, "<$ARGV[0]";
my $a = do { local $/; <FILE> };
open FILE, "<$ARGV[1]";
my $b = do { local $/; <FILE> };

my @g = ($a, $b);
my $st = Tree::Suffix->new(@g);
my ($c) = $st->lcs($ARGV[2], -1);
print "$c";
Rather than writing Python to call Perl to call C, I'm sure you would be better off dropping the Python code and writing it all in Perl.
If your files are certain to contain exactly one line then you can read the pairs more simply by writing just
my @g = <>;
I believe the program below performs the same function as your Python and Perl code combined, but I cannot test it as I am unable to install libstree at present.
But as ikegami has pointed out, it would be far better to calculate and store the longest common subsequence for each pair of files and put them into categories afterwards. I won't go on to code this as I don't know what information you need - whether it is just subsequence length or if you need the characters and/or the position of the subsequences as well.
use strict;
use warnings;

use Math::Combinatorics;
use Tree::Suffix;

my @glist = glob "Data/*.g";
my $iterator = Math::Combinatorics->new(count => 2, data => \@glist);

open my $fh, '>', 'tmp.list' or die $!;

my $n = 0;
while (my @pair = $iterator->next_combination) {
    $n++;
    @ARGV = @pair;
    my @g = <>;
    my $tree = Tree::Suffix->new(@g);
    my $lcs = $tree->lcs;
    @pair = map m|/(.+?)\.|, @pair;
    print "$n: $pair[0] --- $pair[1]\n";
    print $fh "@pair\n" if $lcs and length $lcs >= 20;
}