I want to create Venn diagrams with pybedtools. There is a dedicated script that uses matplotlib, called venn_mpl. It works perfectly when I try it out in my Jupyter notebook. You can run it either from Python or from the shell.
Unfortunately, something goes wrong when I use it in my Snakefile, and I can't figure out what the problem is.
First of all, this is the script: venn_mpl.py
#!/gnu/store/3w3nz0h93h7jif9d9c3hdfyimgkpx1a4-python-wrapper-3.7.0/bin/python
"""
Given 3 files, creates a 3-way Venn diagram of intersections using matplotlib; \
see :mod:`pybedtools.contrib.venn_maker` for more flexibility.
Numbers are placed on the diagram. If you don't have matplotlib installed,
try venn_gchart.py to use the Google Chart API instead.
The values in the diagram assume:
* unstranded intersections
* no features that are nested inside larger features
"""
import argparse
import sys
import os
import pybedtools
def venn_mpl(a, b, c, colors=None, outfn='out.png', labels=None):
    """
    *a*, *b*, and *c* are filenames to BED-like files.
    *colors* is a list of matplotlib colors for the Venn diagram circles.
    *outfn* is the resulting output file. This is passed directly to
    fig.savefig(), so you can supply extensions of .png, .pdf, or whatever your
    matplotlib installation supports.
    *labels* is a list of labels to use for each of the files; by default the
    labels are ['a','b','c']
    """
    try:
        import matplotlib.pyplot as plt
        from matplotlib.patches import Circle
    except ImportError:
        sys.stderr.write('matplotlib is required to make a Venn diagram with %s\n' % os.path.basename(sys.argv[0]))
        sys.exit(1)
    a = pybedtools.BedTool(a)
    b = pybedtools.BedTool(b)
    c = pybedtools.BedTool(c)
    if colors is None:
        colors = ['r','b','g']
    radius = 6.0
    center = 0.0
    offset = radius / 2
    if labels is None:
        labels = ['a','b','c']
Then my code:
rule venndiagramm_data:
    input:
        data = expand("bed_files/{sample}_peaks.narrowPeak", sample=config["samples"]["data"])
    output:
        "figures/Venn_PR1_PR2_GUI_data.png"
    run:
        col = ['g','k','b']
        lab = ['PR1_data','PR2_data','GUI_data']
        venn_mpl(input.data[0], input.data[1], input.data[2], colors = col, labels = lab, outfn = output)
The error is:
SystemExit in line 62 of snakemake_generatingVennDiagramm.py:
1
The snakemake-log only gives me:
rule venndiagramm_data:
input: bed_files/A_peaks.narrowPeak,bed_files/B_peaks.narrowPeak, bed_files/C_peaks.narrowPeak
output: figures/Venn_PR1_PR2_GUI_data.png
jobid: 2
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I already tried adding the following, as described in the documentation:
rule error:
shell:
"""
set +e
somecommand ...
exitcode=$?
if [ $exitcode -eq 1 ]
then
exit 1
else
exit 0
fi
"""
but this changed nothing.
My next idea was to run it through the shell command instead, which I had also tested before and which worked perfectly. But then I got a different, though I think quite similar, error message, for which I also didn't find a proper solution:
rule venndiagramm_data_shell:
input:
data = expand("bed_files/{sample}_peaks.narrowPeak", sample=config["samples"]["data"])
output:
"figures/Venn_PR1_PR2_GUI_data.png"
shell:
"venn_mpl.py -a {input.data[0]} -b {input.data[1]} -c {input.data[2]} --color 'g,k,b' --labels 'PR1_data,PR2_data,GUI_data'"
The snakemake log:
[Thu May 23 16:37:27 2019]
rule venndiagramm_data_shell:
input: bed_files/A_peaks.narrowPeak, bed_files/B_peaks.narrowPeak, bed_files/C_peaks.narrowPeak
output: figures/Venn_PR1_PR2_GUI_data.png
jobid: 1
[Thu May 23 16:37:29 2019]
Error in rule venndiagramm_data_shell:
jobid: 1
output: figures/Venn_PR1_PR2_GUI_data.png
RuleException:
CalledProcessError in line 45 of snakemake_generatingVennDiagramm.py:
Command ' set -euo pipefail; venn_mpl.py -a input.data[0] -b input.data[1] -c input.data[2] --color 'g,k,b' --labels 'PR1_data,PR2_data,GUI_data' ' returned non-zero exit status 1.
Does anyone have an idea what could be the reason for this and how to fix it?
FYI: I said that I tested it without running it through Snakemake. This is my working code:
from snakemake.io import expand
import yaml
import pybedtools
from pybedtools.scripts.venn_mpl import venn_mpl
config_text_real = """
samples:
data:
- A
- B
- C
control:
- A_input
- B_input
- C_input
"""
config_vennDiagramm = yaml.safe_load(config_text_real)
config = config_vennDiagramm
data = expand("{sample}_peaks.narrowPeak", sample=config["samples"]["data"])
col = ['g','k','b']
lab = ['PR1_data','PR2_data','GUI_data']
venn_mpl(data[0], data[1], data[2], colors = col, labels = lab, outfn = 'Venn_PR1_PR2_GUI_data.png')
control = expand("{sample}_peaks.narrowPeak", sample=config["samples"]["control"])
lab = ['PR1_control','PR2_control','GUI_control']
venn_mpl(control[0], control[1], control[2], colors = col, labels = lab, outfn = 'Venn_PR1_PR2_GUI_control.png')
and within my jupyter notebook for shell:
!A='../path/to/file/A_peaks.narrowPeak'
!B='../path/to/file/B_peaks.narrowPeak'
!C='../path/to/file/C_peaks.narrowPeak'
!col=g,k,b
!lab='PR1_data, PR2_data, GUI_data'
!venn_mpl.py -a ../path/to/file/A_peaks.narrowPeak -b ../path/to/file/B_peaks.narrowPeak -c ../path/to/file/C_peaks.narrowPeak --color "g,k,b" --labels "PR1_data, PR2_data, GUI_data"
The reason I used the full path instead of the variable is that, for some reason, the code didn't work when I referred to the variable as "$A".
Not sure if this fixes it, but one thing I notice is that:
shell:
"venn_mpl.py -a input.data[0] -b input.data[1] -c input.data[2]..."
probably should be:
shell:
"venn_mpl.py -a {input.data[0]} -b {input.data[1]} -c {input.data[2]}..."
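One more thing that may be worth checking, as a guess rather than a confirmed diagnosis: the SystemExit with value 1 reported for the run: version matches the sys.exit(1) that the venn_mpl function shown above calls when matplotlib cannot be imported. If Snakemake runs in a different Python environment than the notebook, that import could fail there. A minimal sketch for testing this from inside the Snakemake environment follows; check_imports is a hypothetical helper name, not part of pybedtools or Snakemake.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and report the outcome.

    Returns a dict mapping each module name to None if the import
    succeeded, or to the ImportError message if it failed.
    """
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as exc:
            results[name] = str(exc)
    return results
```

Calling check_imports(["matplotlib.pyplot", "pybedtools"]) at the top of the Snakefile and printing the result would show whether the environment used by Snakemake is missing one of the modules.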
You have a pipeline where some samples have one input file and produce one output file, while other samples have two input files and produce two output files. The typical case in bioinformatics is NGS sequencing libraries where some samples are paired-end and others are single-end sequenced. If you need to trim reads and align them, you have to account for the variable number of input/output files.
What is the most appropriate way to handle this? I feel using checkpoints is overkill since in my opinion checkpoints are a bit cryptic, but I may be wrong...
Here's how I would do it for the case where the number of input/output files can only be 1 or 2: have an if-else in the run directive based on the number of input files. If the number of inputs is 1, touch the second file. In the following rules, have another if-else, this time checking whether the second file has 0 bytes.
Here's an example (not tested but it should be about right):
import os
samples = {'S1': ['s1.R1.fastq'],
           'S2': ['s2.R1.fastq', 's2.R2.fastq']}
rule all:
    input:
        expand('bam/{sample_id}.bam', sample_id= samples),
rule cutadapt:
    input:
        fastq= lambda wc: samples[wc.sample_id],
    output:
        r1= 'cutadapt/{sample_id}.R1.fastq',
        r2= touch('cutadapt/{sample_id}.R2.fastq'),
    run:
        if len(input.fastq) == 1:
            # {output.r1} created, {output.r2} touched
            shell('cutadapt ... -o {output.r1} {input.fastq}')
        else:
            shell('cutadapt ... -o {output.r1} -p {output.r2} {input.fastq}')
rule align:
    input:
        r1= 'cutadapt/{sample_id}.R1.fastq',
        r2= 'cutadapt/{sample_id}.R2.fastq',
    output:
        'bam/{sample_id}.bam',
    run:
        if os.path.getsize(input.r2) == 0:
            # or
            # if len(samples[wildcards.sample_id]) == 1:
            # the output is unnamed, so refer to it as {output}
            shell('hisat2 ... {input.r1} > {output}')
        else:
            shell('hisat2 ... -1 {input.r1} -2 {input.r2} > {output}')
This works, but I find it awkward to artificially create the second file just to keep the workflow happy. Are there better solutions?
You can use two separate rules for the single-end and paired-end cases:
rule single_end:
    input:
        fastq = 's{n}.R1.fastq'
    output:
        r1 = 'cutadapt/S{n}.R1.fastq'
    shell:
        'cutadapt ... -o {output.r1} {input.fastq}'
rule pair_ends:
    input:
        r1 = 's{n}.R1.fastq',
        r2 = 's{n}.R2.fastq'
    output:
        r1 = 'cutadapt/S{n}.R1.fastq',
        r2 = 'cutadapt/S{n}.R2.fastq'
    shell:
        'cutadapt ... -o {output.r1} -p {output.r2} {input}'
Now there is an ambiguity: whenever pair_ends can be applied, single_end can be applied too. This can be resolved with ruleorder:
ruleorder: pair_ends > single_end
I agree that checkpoints would be overkill here.
You can compose arbitrary inputs and command lines using input functions and parameter functions. The tricky part is the optional r2 output of your cutadapt rule, which may or may not be present and therefore complicates the workflow's DAG. In this case it's probably appropriate to use Snakemake's directory-as-output functionality:
CUTADAPT_R1 = 'cutadapt/{sample_id}/R1.fastq'
CUTADAPT_R2 = 'cutadapt/{sample_id}/R2.fastq'
def get_cutadapt_outargs(wldc, input):
    r1 = CUTADAPT_R1.format(**wldc)
    r2 = CUTADAPT_R2.format(**wldc)
    ret = f"-o '{r1}'"
    if len(input.fastq) > 1:
        ret += f" -p '{r2}'"
    return ret
def get_align_inargs(wldc):
    r1 = CUTADAPT_R1.format(**wldc)
    r2 = CUTADAPT_R2.format(**wldc)
    if len(samples[wldc.sample_id]) > 1:
        return "-1 '{}' -2 '{}'".format(r1, r2)
    else:
        return "'{}'".format(r1)
rule all:
    input:
        expand('bam/{sample_id}.bam', sample_id= samples),
rule cutadapt:
    input:
        fastq= lambda wc: samples[wc.sample_id],
    output:
        fq = directory('cutadapt/{sample_id}'),
    params:
        out_args = get_cutadapt_outargs,
    shell:
        """
        cutadapt ... {params.out_args} {input.fastq}
        """
rule align:
    input:
        fq = rules.cutadapt.output.fq,
    output:
        bam = 'bam/{sample_id}.bam',
    params:
        input_args = get_align_inargs,
    shell:
        """
        hisat2 ... {params.input_args} > {output.bam}
        """
"TMalign..." is an executable file that I used to get data. How could I store the output into a variable so that I could extract target values from the output. The executable file is compiled from a long .cpp, so I do not think I could call the variable names from there.
import sys,os
os.system("./TMalign 3w4u.pdb 6bb5.pdb -u 139") #some command I have
The output looks like this, and I need to extract the TM-score values:
*********************************************************************
* TM-align (Version 20190822): protein structure alignment *
* References: Y Zhang, J Skolnick. Nucl Acids Res 33, 2302-9 (2005) *
* Please email comments and suggestions to yangzhanglab#umich.edu *
*********************************************************************
Name of Chain_1: 3w4u.pdb (to be superimposed onto Chain_2)
Name of Chain_2: 6bb5.pdb
Length of Chain_1: 141 residues
Length of Chain_2: 139 residues
Aligned length= 139, RMSD= 1.07, Seq_ID=n_identical/n_aligned= 0.590
TM-score= 0.94726 (if normalized by length of Chain_1, i.e., LN=141, d0=4.42)
TM-score= 0.96044 (if normalized by length of Chain_2, i.e., LN=139, d0=4.38)
TM-score= 0.96044 (if normalized by user-specified LN=139.00 and d0=4.38)
(You should use TM-score normalized by length of the reference structure)
(":" denotes residue pairs of d < 5.0 Angstrom, "." denotes other aligned residues)
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::. .
-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK-Y
Total CPU time is 0.03 seconds
Thanks for help!
You could use the following approach:
import re
from subprocess import check_output
ret = check_output(['./TMalign', '3w4u.pdb', '6bb5.pdb', '-u', '139'])
tm_scores = []
# check_output returns bytes; decode them before splitting into lines
for line in ret.decode().splitlines():
    if re.match(r'^TM-score=', line):
        score = line.split()[1:2]  # Extract the value
        tm_scores.extend(score)  # Saving only values
# tm_scores now contains: ['0.94726', '0.96044', '0.96044']
While somewhat elaborate, this is a flexible and tunable solution. Note that if it is used among other code, it would be better to wrap it in a function.
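Wrapping the parsing step in a function could look roughly like the sketch below. The names parse_tm_scores and run_tmalign are made up for illustration, and the regular expression assumes the "TM-score= <number>" line format shown in the question's output.

```python
import re
import subprocess

# Matches lines such as "TM-score= 0.94726 (if normalized by ...)"
TM_SCORE_RE = re.compile(r"^TM-score=\s*([0-9.]+)", re.MULTILINE)


def parse_tm_scores(text):
    """Return all TM-score values found in TM-align output, as floats."""
    return [float(value) for value in TM_SCORE_RE.findall(text)]


def run_tmalign(chain_1, chain_2, extra_args=()):
    """Run ./TMalign (assumed to be in the current directory) and parse its output."""
    completed = subprocess.run(
        ["./TMalign", chain_1, chain_2, *extra_args],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return parse_tm_scores(completed.stdout)
```

With the files from the question, the call would be run_tmalign('3w4u.pdb', '6bb5.pdb', ['-u', '139']). Keeping the parsing separate from the subprocess call means the parser can be tested on saved output without the executable.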
My approach isn't that smart: I redirect the output to a file and process it afterwards.
import os
cmd = './TMalign 3w4u.pdb 6bb5.pdb -u 139'
os.system(cmd + ">> 1.txt")
I run my Python scripts from cmd and have trouble using other interfaces, so I need a solution that works from cmd only. The Python script has 113 lines of code. I want to run and test selected subsets of lines before executing the complete script, without creating new Python scripts, directly from the parent script.
The example below has 28 lines.
To run the parent script, we type in cmd:
C:\Users\X> python myMasterDummyScript.py
Can we run just lines 1 - 16?
Dummy Example:
import numpy as np
from six.moves import range
from six.moves import cPickle as pickle
pickle_file = "C:\\A.pickle"
with open(pickle_file, 'rb') as f:
    data = pickle.load(f, encoding ='latin1')
train_dataset = data['train_dataset']
test_dataset = data['test_dataset']
valid_dataset = data['valid_dataset']
train_labels = data['train_labels']
test_labels = data['test_labels']
valid_labels = data['valid_labels']
a = 28
b = 1
def reformat(dataset, labels):
    dataset = dataset.reshape(-1, a, a, b).astype(np.float32)
    labels = (np.arange(10)==labels[:,None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
Open the parent script in an IDE like PyCharm, select the lines you want to execute, then right-click -> Execute selection in console.
It would theoretically be possible with a bit of work, however please note that this is not how scripts work in general. Instead, you should consider grouping coherent routine sequences into named functions, and call them from command line.
Among other issues, you'll have to modify all calling code to your script every time you shift the line numbers, you'll have to repeat any imports any subsection would potentially need and it's generally not a good idea. I am still going to address it after I make a case for refactoring though...
Refactoring the script and calling specific functions
Consider this answer to Python: Run function from the command line
Your python script:
import numpy as np
from six.moves import range
from six.moves import cPickle as pickle

def load_data():
    pickle_file = "C:\\A.pickle"
    with open(pickle_file, 'rb') as f:
        data = pickle.load(f, encoding ='latin1')
    # return the loaded dictionary so that other functions can use it
    return data

def reformat(dataset, labels, a=28, b=1):
    dataset = dataset.reshape(-1, a, a, b).astype(np.float32)
    labels = (np.arange(10)==labels[:,None]).astype(np.float32)
    return dataset, labels

def main():
    data = load_data()
    train_dataset, train_labels = reformat(data['train_dataset'], data['train_labels'])
    test_dataset, test_labels = reformat(data['test_dataset'], data['test_labels'])
    valid_dataset, valid_labels = reformat(data['valid_dataset'], data['valid_labels'])

# run main() only when executed as a script, not when imported
if __name__ == '__main__':
    main()
Your cmd code would look like this:
REM
REM any logic to identify which function to call
REM
python -c "import myMasterDummyScript; myMasterDummyScript.load_data()"
It also enables you to pass arguments from cmd into the function call.
Now, if you're really adamant about running an arbitrary subset of lines from an overall Python script...
How to run specific lines from a script in cmd
Use cmd to read those lines out of the original script and write them to a temporary script.
Look at a proposed answer for batch script - read line by line. Adapting it slightly, without much error handling (which would significantly bloat this answer):
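@echo off
setlocal enabledelayedexpansion
SET startline=%1
SET endline=%2
SET originalscript=%3
SET tempscript=tempscript.py
SET line=1
REM erase tempscript
echo. > %tempscript%
REM "delims=" keeps leading whitespace, which Python indentation needs
REM note: for /f skips empty lines, so only non-empty lines are counted
for /f "delims=" %%a in (%originalscript%) do (
    REM !line! is expanded on every iteration, %line% only once
    if !line! GEQ %startline% (
        if !line! LEQ %endline% (
            echo %%a >> %tempscript%
        )
    )
    set /a line+=1
)
python %tempscript%
pause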
You'd call it like this:
C:\> runlines.cmd 1 16 myMasterDummyScript.py
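If a small Python helper is acceptable instead of a batch file, the same idea can be sketched in Python itself, which avoids cmd's line-parsing quirks. This is only a sketch: run_lines and the file name runlines.py are hypothetical, and the selected lines must form valid Python on their own (for example, they must not start in the middle of a block).

```python
import sys


def run_lines(path, start, end, namespace=None):
    """Execute lines start..end (1-indexed, inclusive) of the script at path.

    Returns the namespace in which the lines were executed.
    """
    with open(path, encoding="utf-8") as handle:
        lines = handle.readlines()
    # Pad with blank lines so tracebacks report the original line numbers.
    snippet = "\n" * (start - 1) + "".join(lines[start - 1:end])
    if namespace is None:
        namespace = {"__name__": "__main__"}
    exec(compile(snippet, path, "exec"), namespace)
    return namespace


# Command-line use, mirroring the batch file: <start> <end> <script>
if __name__ == "__main__" and len(sys.argv) == 4:
    run_lines(sys.argv[3], int(sys.argv[1]), int(sys.argv[2]))
```

Saved as runlines.py, it would be called like python runlines.py 1 16 myMasterDummyScript.py. Unlike the for /f loop, it counts blank lines, so the numbers match what an editor shows.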
You could use the command line debugger pdb. As an example, given the following script:
print('1')
print('2')
print('3')
print('4')
print('5')
print('6')
print('7')
print('8')
print('9')
print('10')
print('11')
print('12')
print('13')
print('14')
print('15')
Here's a debug session that runs only lines 5-9 by jumping to line 5, setting a breakpoint at line 10, listing the source to see the current line and the breakpoints, and then continuing execution. Type help to see all the commands available.
C:\>py -m pdb test.py
> c:\test.py(1)<module>()
-> print('1')
(Pdb) jump 5
> c:\test.py(5)<module>()
-> print('5')
(Pdb) b 10
Breakpoint 1 at c:\test.py:10
(Pdb) longlist
1 print('1')
2 print('2')
3 print('3')
4 print('4')
5 -> print('5')
6 print('6')
7 print('7')
8 print('8')
9 print('9')
10 B print('10')
11 print('11')
12 print('12')
13 print('13')
14 print('14')
15 print('15')
(Pdb) cont
5
6
7
8
9
> c:\test.py(10)<module>()
-> print('10')
(Pdb)
Option 1 - You can use a debugger to inspect everything at any moment of the code execution (python -m pdb myscript.py debugs your code too).
Option 2 - You can split your script into a main file and sub-files, import the sub-files in the main script, and execute either the main file or any of the separate files to test it.
Option 3 - You can use command-line arguments (using argparse, for example).
I don't have any more options at the moment.
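To illustrate Option 3, here is a minimal sketch of selecting which parts of a script to run through argparse. The stage names, the --stages flag, and the toy load/transform functions are all invented for the example; in a real script they would wrap the actual sections of code.

```python
import argparse


def load(state):
    """Stand-in for the data-loading part of the script."""
    state["data"] = [1, 2, 3]


def transform(state):
    """Stand-in for the processing part of the script."""
    state["data"] = [x * 2 for x in state.get("data", [])]


# Maps the names accepted on the command line to the functions to run.
STAGES = {"load": load, "transform": transform}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run selected stages of the script.")
    parser.add_argument(
        "--stages",
        nargs="+",
        choices=list(STAGES),
        default=list(STAGES),
        help="stages to run, in the given order (default: all)",
    )
    args = parser.parse_args(argv)
    state = {}
    for name in args.stages:
        STAGES[name](state)
    return state
```

If the script ends with a call to main(), then python myscript.py --stages load would run only the loading part, while python myscript.py would run everything.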
I want to pass a chunk of Python code to Python in R with something like system('python ...'), and I'm wondering if there is an easy way to emulate the python console in this case. For example, suppose the code is "print 'hello world'", how can I get the output like this in R?
>>> print 'hello world'
hello world
This only shows the output:
> system("python -c 'print \"hello world\"'")
hello world
Thanks!
BTW, I asked in r-help but have not got a response yet (if I do, I'll post the answer here).
Do you mean something like this?
export NUM=10
R -q -e "rnorm($NUM)"
You might also like to check out littler - http://dirk.eddelbuettel.com/code/littler.html
UPDATED
Following your comment below, I think I am beginning to understand your question better. You are asking about running python inside the R shell.
So here's an example:-
# code in a file named myfirstpythonfile.py
a = 1
b = 19
c = 3
mylist = [a, b, c]
for item in mylist:
    print item
In your R shell, therefore, do this:
> system('python myfirstpythonfile.py')
1
19
3
Essentially, you can simply call python /path/to/your/python/file.py to execute a block of python code.
In my case, I can simply call python myfirstpythonfile.py assuming that I launched my R shell in the same directory (path) my python file resides.
FURTHER UPDATED
And if you really want to print out the source code, here's a brute force method that might be possible. In your R shell:-
> system('python -c "import sys; sys.stdout.write(file(\'myfirstpythonfile.py\', \'r\').read());"; python myfirstpythonfile.py')
a = 1
b = 19
c = 3
mylist = [a, b, c]
for item in mylist:
print item
1
19
3
AND FURTHER FURTHER UPDATED :-)
So if the purpose is to print the python code before the execution of a code, we can use the python trace module (reference: http://docs.python.org/library/trace.html). In command line, we use the -m option to call a python module and we specify the options for that python module following it.
So for my example above, it would be:-
$ python -m trace --trace myfirstpythonfile.py
--- modulename: myfirstpythonfile, funcname: <module>
myfirstpythonfile.py(1): a = 1
myfirstpythonfile.py(2): b = 19
myfirstpythonfile.py(3): c = 3
myfirstpythonfile.py(4): mylist = [a, b, c]
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
1
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
19
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
3
myfirstpythonfile.py(5): for item in mylist:
--- modulename: trace, funcname: _unsettrace
trace.py(80): sys.settrace(None)
As we can see, this traces each line of Python code as it is executed and writes that line's output to stdout immediately afterwards.
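If the goal is output that looks like the interactive console, with a >>> prompt in front of each statement, the standard library's code module can replay source line by line. The sketch below is written for Python 3 (the examples above use Python 2 print statements), and replay is a hypothetical helper name. It handles simple indented blocks but not every case, for instance an else: following an if block.

```python
import code
import contextlib
import io


def replay(source):
    """Run source line by line and return a console-style transcript.

    Only stdout is captured; tracebacks still go to stderr.
    """
    console = code.InteractiveConsole()
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        more = False  # True while the console is waiting for more input
        for line in source.splitlines():
            # An unindented line after an indented block ends that block,
            # like pressing Enter on an empty line in the real console.
            if more and line and not line[0].isspace():
                more = console.push("")
            print(("... " if more else ">>> ") + line)
            more = console.push(line)
        if more:
            console.push("")  # finish a block left open at the end
    return captured.getvalue()
```

Printing replay("x = 1 + 1\nprint(x)") shows the two statements prefixed with >>> and the value 2 after the second one. From R, a script that reads the code chunk and prints the transcript could then be called through system().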
The system function has an argument called intern, which defaults to FALSE. Set it to TRUE, and the output that was previously only printed will be returned as a character vector.
Run your system command with this option and you get the output directly in your variable, like this:
tmp <- system("python -c 'print \"hello world\"'", intern = TRUE)
My workaround for this problem is to define my own functions that paste in parameters, write out a temporary .py file, and then execute the Python file via a system call. Here is an example that calls ArcGIS's Euclidean Distance function:
py.EucDistance = function(poly_path,poly_name,snap_raster,out_raster_path_name,maximum_distance,mask){
py_path = 'G:/Faculty/Mann/EucDistance_temp.py'
poly_path_name = paste(poly_path,poly_name, sep='')
fileConn<-file(paste(py_path))
writeLines(c(
paste('import arcpy'),
paste('from arcpy import env'),
paste('from arcpy.sa import *'),
paste('arcpy.CheckOutExtension("spatial")'),
paste('out_raster_path_name = "',out_raster_path_name,'"',sep=""),
paste('snap_raster = "',snap_raster,'"',sep=""),
paste('cellsize =arcpy.GetRasterProperties_management(snap_raster,"CELLSIZEX")'),
paste('mask = "',mask,'"',sep=""),
paste('maximum_distance = "',maximum_distance,'"',sep=""),
paste('sr = arcpy.Describe(snap_raster).spatialReference'),
paste('arcpy.env.overwriteOutput = True'),
paste('arcpy.env.snapRaster = "',snap_raster,'"',sep=""),
paste('arcpy.env.mask = mask'),
paste('arcpy.env.scratchWorkspace ="G:/Faculty/Mann/Historic_BCM/Aggregated1080/Scratch.gdb"'),
paste('arcpy.env.outputCoordinateSystem = sr'),
# get spatial reference for raster and force output to that
paste('sr = arcpy.Describe(snap_raster).spatialReference'),
paste('py_projection = sr.exportToString()'),
paste('arcpy.env.extent = snap_raster'),
paste('poly_name = "',poly_name,'"',sep=""),
paste('poly_path_name = "',poly_path_name,'"',sep=""),
paste('holder = EucDistance(poly_path_name, maximum_distance, cellsize, "")'),
paste('holder = SetNull(holder < -9999, holder)'),
paste('holder.save(out_raster_path_name) ')
), fileConn, sep = "\n")
close(fileConn)
system(paste('C:\\Python27\\ArcGIS10.1\\python.exe', py_path))
}
I am using rpy2-2.0.7 (I need this to work with Windows 7, and compiling the binaries for the newer rpy2 versions is a mess) to push a two-column data frame into R, create a few layers in ggplot2, and output the image to a .png file.
I have wasted countless hours fidgeting around with the syntax; I did manage to output the files I needed at one point, but (stupidly) did not notice and continued fidgeting around with my code ...
I would sincerely appreciate any help; below is a (trivial) example for demonstration. Thank you very much for your help!!! ~ Eric Butter
import rpy2.robjects as rob
from rpy2.robjects import r
import rpy2.rlike.container as rlc
from array import array
r.library("grDevices") # import r graphics package with rpy2
r.library("lattice")
r.library("ggplot2")
r.library("reshape")
picpath = 'foo.png'
d1 = ["cat","dog","mouse"]
d2 = array('f',[1.0,2.0,3.0])
nums = rob.RVector(d2)
name = rob.StrVector(d1)
tl = rlc.TaggedList([nums, name], tags = ('nums', 'name'))
dataf = rob.RDataFrame(tl)
## r['png'](file=picpath, width=300, height=300)
## r['ggplot'](data=dataf)+r['aes_string'](x='nums')+r['geom_bar'](fill='name')+r['stat_bin'](binwidth=0.1)
r['ggplot'](data=dataf)
r['aes_string'](x='nums')
r['geom_bar'](fill='name')
r['stat_bin'](binwidth=0.1)
r['ggsave']()
## r['dev.off']()
*The output is just a blank image (181 b).
Here are a couple of common errors R itself throws as I fiddle around in ggplot2:
r['png'](file=picpath, width=300, height=300)
r['ggplot']()
r['layer'](dataf, x=nums, fill=name, geom="bar")
r['geom_histogram']()
r['stat_bin'](binwidth=0.1)
r['ggsave'](file=picpath)
r['dev.off']()
*RRuntimeError: Error: No layers in plot
r['png'](file=picpath, width=300, height=300)
r['ggplot'](data=dataf)
r['aes'](geom="bar")
r['geom_bar'](x=nums, fill=name)
r['stat_bin'](binwidth=0.1)
r['ggsave'](file=picpath)
r['dev.off']()
*RRuntimeError: Error: When setting aesthetics, they may only take one value. Problems: fill,x
I use rpy2 solely through Nathaniel Smith's brilliant little module called rnumpy (see the "API" link at the rnumpy home page). With this you can do:
from rnumpy import *
r.library("ggplot2")
picpath = 'foo.png'
name = ["cat","dog","mouse"]
nums = [1.0,2.0,3.0]
r["dataf"] = r.data_frame(name=name, nums=nums)
r("p <- ggplot(dataf, aes(name, nums, fill=name)) + geom_bar(stat='identity')")
r.ggsave(picpath)
(I'm guessing a little about how you want the plot to look, but you get the idea.)
Another great convenience is entering "R mode" from Python with the ipy_rnumpy module. (See the "IPython integration" link at the rnumpy home page).
For complicated stuff, I usually prototype in R until I have the plotting commands worked out. Error reporting in rpy2 or rnumpy can get quite messy.
For instance, the result of an assignment (or other computation) is sometimes printed even when it should be invisible. This is annoying e.g. when assigning to large data frames. A quick workaround is to end the offending line with a trailing statement that evaluates to something short. For instance:
In [59] R> long <- 1:20
Out[59] R>
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20
In [60] R> long <- 1:100; 0
Out[60] R> [1] 0
(To silence some recurrent warnings in rnumpy, I've edited rnumpy.py to add 'from warnings import warn' and replace 'print "error in process_revents: ignored"' with 'warn("error in process_revents: ignored")'. That way, I only see the warning once per session.)
You have to engage the dev() before you shut it off, which means that you have to print() (like JD guesses above) prior to throwing dev.off().
from rpy2 import robjects
r = robjects.r
r.library("ggplot2")
robjects.r('p = ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()')
r.ggsave('/stackBar.jpeg')
robjects.r('print(p)')
r['dev.off']()
To make it slightly easier when you have to draw more complex plots:
from rpy2 import robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.lib.ggplot2 as ggplot2
r = robjects.r
grdevices = importr('grDevices')
p = r('''
library(ggplot2)
p <- ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
p <- p + opts(title = "{0}")
# add more R code if necessary e.g. p <- p + layer(..)
p'''.format("stackbar"))
# you can use format to transfer variables into R
# use var.r_repr() in case it involves a robject like a vector or data.frame
p.plot()
# grdevices.dev_off()