Getting part of R object from python using rpy2

I can get the output I need using R, but I cannot reproduce it with Python's rpy2 module.
In R:
> wilcox.test(c(1,2,3), c(100,200,300), alternative = "less")$p.value
[1] 0.05
In python:
import rpy2.robjects as robjects
rwilcox = robjects.r['wilcox.test']
x = robjects.IntVector([1,2,3,])
y = robjects.IntVector([100,200,300])
z = rwilcox(x,y, alternative = "less")
print z
Wilcoxon rank sum test
data: 1:3 and c(100L, 200L, 300L)
W = 0, p-value = 0.05
alternative hypothesis: true location shift is less than 0
z1 = z.rx('p.value')
print z1
[1] 0.05
I am still trying to get the final value of 0.05 stored as a variable, but this seems closer to a final answer.
I cannot figure out what my Python code needs to be to store the p.value in a new variable.

z1 is a ListVector containing one FloatVector with one element:
>>> z1
<ListVector - Python:0x4173368 / R:0x36fa648>
p.value: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x4173290 / R:0x35e6b38>
You can extract the float itself with z1[0][0] or just float(z1[0]):
>>> z1[0][0]
0.05
>>> type(z1[0][0])
<type 'float'>
>>> float(z1[0])
0.05
In general, you will have an easier time figuring out what is going on in an interactive session if you just type the name of the object you want to inspect. The print x statement converts the object through str(x), whereas the repr(x) representation that the interactive loop uses implicitly is much more helpful. If you are working in a script, use print repr(x) instead.
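The difference between the two representations can be seen without rpy2 at all. The sketch below uses a hypothetical stand-in class, FakeVector (not part of rpy2), whose __str__ mimics R-style console output while __repr__ exposes the wrapper type, roughly in the spirit of what rpy2 vectors do.

```python
class FakeVector:
    """Hypothetical stand-in for an rpy2 vector; this is not an rpy2 class."""

    def __init__(self, values):
        self.values = list(values)

    def __str__(self):
        # Mimics R's console output, which hides the Python-side type.
        return "[1] " + " ".join(str(v) for v in self.values)

    def __repr__(self):
        # Shows the wrapper type and its contents, which helps when debugging.
        return "<FakeVector values={!r}>".format(self.values)


v = FakeVector([0.05])
print(v)        # goes through str(v)
print(repr(v))  # what the interactive prompt shows for a bare `v`
```

With the str form alone you cannot tell a wrapped vector from a plain float; the repr form makes the nesting visible.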

How about just using list()?
pval = z.rx2('p.value')
print list(pval) # [0.05]
rpy2 also works well with numpy:
import numpy
pval = numpy.array(pval)
print pval # array([ 0.05])


Float divisions returning weird results

I'm working on a project, and for some reason divisions that should be equal give me different results. I am trying to check whether two divisions return the same result, but when I try 5.99/1 and 0.599/0.1 the script says they are different, although they are supposed to be the same. I figured out that the problem is that 5.99/1 = 5.99 and 0.599/0.1 = 5.989999999999999, but I can't find a fix for this.
You can find the reason in this answer:
I have written a possible solution for you:
a = 5.99 / 1
b = 0.599 / 0.1
a_str = "{:.4f}".format(5.99 / 1)
b_str = "{:.4f}".format(0.599 / 0.1)
print(a, b)
print(a_str, b_str)
print(a == b)
print(a_str == b_str)
Output:
5.99 5.989999999999999
5.9900 5.9900
False
True
As you can see above, I converted the result of each division to a formatted string and compared the strings instead of the default floating-point values.
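Comparing formatted strings works, but it ties the comparison to a fixed number of printed digits. A common alternative is to compare with a tolerance. The sketch below uses the standard library's math.isclose (available from Python 3.5); the tolerance of 1e-9 is an assumption chosen for illustration and should be adapted to your data.

```python
import math

a = 5.99 / 1
b = 0.599 / 0.1

# Exact comparison is sensitive to the last bit of the binary representation.
exactly_equal = (a == b)

# Tolerance-based comparison: equal if the relative difference is at most 1e-9.
close_enough = math.isclose(a, b, rel_tol=1e-9)

print(exactly_equal, close_enough)
```

When one of the values can be zero, a relative tolerance alone is not enough, and the abs_tol argument of math.isclose is the usual addition.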

Sympy simplification of maximum

I don't understand why Sympy won't return the expression below in simplified form (I'm not sure whether it's a bug in my code or a feature of Sympy).
import sympy as sp
a = sp.Symbol('a',finite = True, real = True)
b = sp.Symbol('b',finite = True, real = True)
I would expect the output to be $a+b$, but Sympy still gives me $Max(a-b,a+b)$.
Thanks; as you can see I am a beginner in Sympy so any hints/help are appreciated.
Surely the result should be a + b...
You can do this by setting the assumptions on the symbol as in:
In [2]: a = Symbol('a', negative=True)
In [3]: b = Symbol('b', positive=True)
In [4]: Max(a - b, a + b)
Out[4]: a + b
You are trying to use the new assumptions system but that system is still experimental and is not widely used within sympy. The new assumptions are not used in core evaluation so e.g. the Max function has no idea that you have declared global assumptions on a and b unless those assumptions are declared on the symbols as I show above.

KeyError with a poisson process using pandas

I am trying to create a function which will simulate a Poisson process for a changeable dt and total time, and have the following:
def compound_poisson(lamda,mu,sigma,dt,T):
    points = pd.Series(0)
    out = pd.Series(0)
    inds = simple_poisson(lamda,dt,T)
    for ind in inds.index:
        if inds[ind+dt] > inds[ind]:
            points[ind+dt] = np.random.normal(mu,sigma)
        else:
            points[ind+dt] = 0
    out = out.append(np.cumsum(points),ignore_index=True)
    out.index = np.linspace(0,T,int(T/dt + 1))
    return out
However, I receive a "KeyError: 0.010000000000000002", which should not be in the index at all. Is this a result of being lax with float objects?
In short, yes, it's a floating point error. It's quite hard to know how you got there, but probably something like this:
>>> 0.1 * 0.1
0.010000000000000002
Maybe use round?
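One way to apply the rounding suggestion is to never use a raw floating-point arithmetic result as an index label. The sketch below uses a plain dict rather than a pandas Series to stay self-contained; the helper name time_key and the choice of 9 decimal places are assumptions for illustration, not part of the original code.

```python
def time_key(step, dt, ndigits=9):
    """Build a label from an integer step count, rounded to a fixed precision."""
    return round(step * dt, ndigits)


dt = 0.01
T = 0.05
n_steps = int(round(T / dt))

# Labels are generated once, from integer steps, and rounded.
values = {time_key(i, dt): 0.0 for i in range(n_steps + 1)}

# Every later lookup goes through the same helper, so the labels always match.
for i in range(n_steps):
    values[time_key(i + 1, dt)] = values[time_key(i, dt)] + 1.0

print(sorted(values))
```

The same idea carries over to a Series: iterate over integer positions and derive the time label from the position, instead of computing ind+dt and hoping it lands exactly on an existing label.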

Understanding pandas.read_csv() float parsing

I am having problems reading probabilities from CSV using pandas.read_csv; some of the values are read as floats greater than 1.0.
Specifically, I am confused about the following behavior:
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999998"))["column"][0]
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999999"))["column"][0]
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000000"))["column"][0]
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000001"))["column"][0]
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000008"))["column"][0]
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000009"))["column"][0]
Default float-parsing behavior seems to be non-monotonic, and especially some values starting 0.9... are converted to floats that are strictly greater than 1.0, causing problems e.g. when feeding them into sklearn.metrics.
The documentation states that read_csv has a parameter float_precision that can be used to select “which converter the C engine should use for floating-point values”, and setting this to 'high' indeed solves my problem.
However, I would like to understand the default behavior:
Where can I find the source code of the default float converter?
Where can I find documentation on the intended behavior of the default float converter and the other possible choices?
Why does a single-figure change in the least significant position skip a value?
Why does this behave non-monotonically at all?
Edit regarding “duplicate question”: This is not a duplicate. I am aware of the limitations of floating-point math. I was specifically asking about the default parsing mechanism in Pandas, since the builtin float does not show this behavior:
>>> float("0.99999999999999999")
1.0
...and I could not find documentation.
@MaxU already showed the source code for the parser and the relevant tokenizer xstrtod, so I'll focus on the "why" part:
The code for xstrtod is roughly like this (translated to pure Python):
def xstrtod(p):
    number = 0.
    idx = 0
    ndecimals = 0
    while p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1
    idx += 1
    while idx < len(p) and p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1
        ndecimals += 1
    return number / 10**ndecimals
Which reproduces the "problem" you saw:
print(xstrtod('0.99999999999999997')) # 1.0
print(xstrtod('0.99999999999999998')) # 1.0
print(xstrtod('0.99999999999999999')) # 1.0000000000000002
print(xstrtod('1.00000000000000000')) # 1.0
print(xstrtod('1.00000000000000001')) # 1.0
print(xstrtod('1.00000000000000002')) # 1.0
print(xstrtod('1.00000000000000003')) # 1.0
print(xstrtod('1.00000000000000004')) # 1.0
print(xstrtod('1.00000000000000005')) # 1.0
print(xstrtod('1.00000000000000006')) # 1.0
print(xstrtod('1.00000000000000007')) # 1.0
print(xstrtod('1.00000000000000008')) # 1.0
print(xstrtod('1.00000000000000009')) # 1.0000000000000002
print(xstrtod('1.00000000000000019')) # 1.0000000000000002
The problem seems to be the 9 in the last place which alters the result. So it's floating point accuracy:
>>> float('100000000000000008')
1e+17
>>> float('100000000000000009')
1.0000000000000002e+17
It's the 9 in the last place that is responsible for the skewed results.
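The effect of the intermediate rounding can be isolated by accumulating the digits in an exact integer and dividing only once. The sketch below illustrates the explanation above and is not pandas' actual precise_xstrtod; parse_naive and parse_exact_int are hypothetical names, and both assume input of the form digits.digits with no sign or exponent.

```python
def parse_naive(s):
    """Accumulate digits in a float, rounding at every step (as in the xstrtod sketch)."""
    whole, frac = s.split(".")
    number = 0.0
    for ch in whole + frac:
        number = number * 10.0 + int(ch)
    return number / 10 ** len(frac)


def parse_exact_int(s):
    """Accumulate digits in an arbitrary-precision int, then round once in the division."""
    whole, frac = s.split(".")
    return int(whole + frac) / 10 ** len(frac)


for text in ("0.99999999999999999", "1.00000000000000009"):
    print(text, parse_naive(text), parse_exact_int(text), float(text))
```

For these inputs the single-rounding version agrees with the builtin float, while the step-by-step version overshoots because the accumulated integer is already rounded up before the final division.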
If you want high precision, you can define your own converters or use Python-provided ones, e.g. decimal.Decimal if you want arbitrary precision:
>>> import io
>>> import decimal
>>> import pandas as pd
>>> converter = {0: decimal.Decimal}  # parse column 0 as decimals
>>> def parse(string):
...     return '{:.30f}'.format(pd.read_csv(io.StringIO(string), converters=converter)["column"][0])
>>> print(parse("column\n0.99999999999999998"))
>>> print(parse("column\n0.99999999999999999"))
>>> print(parse("column\n1.00000000000000000"))
>>> print(parse("column\n1.00000000000000001"))
>>> print(parse("column\n1.00000000000000008"))
>>> print(parse("column\n1.00000000000000009"))
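which prints:
0.999999999999999980000000000000
0.999999999999999990000000000000
1.000000000000000000000000000000
1.000000000000000010000000000000
1.000000000000000080000000000000
1.000000000000000090000000000000
Exactly representing the input!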
If you want to understand how it works - look at the source code - file "_libs/parsers.pyx" lines: 492-499 for Pandas 0.20.1:
self.parser.double_converter_nogil = xstrtod  # <------- default converter
self.parser.double_converter_withgil = NULL
if float_precision == 'high':
    self.parser.double_converter_nogil = precise_xstrtod  # <------- 'high' converter
    self.parser.double_converter_withgil = NULL
elif float_precision == 'round_trip':  # avoid gh-15140
    self.parser.double_converter_nogil = NULL
    self.parser.double_converter_withgil = round_trip
Source code for xstrtod
Source code for precise_xstrtod

How to convert unit prefixes in Sympy

I would like to know how to convert a sympy value with a physical unit into the same unit with another prefix. For example:
>>> import sympy.physics.units as u
>>> x = 0.001 * u.kilogram
should be converted to grams. The approach I have taken so far is very bloated and delivers a wrong result.
>>> x / u.kilogram * u.gram
It should be 1g instead.
If you can accept printing 1 instead of 1g, you could just use division:
>>> x / u.g
Otherwise, you'd better switch to sympy.physics.unitsystems.
>>> from sympy.physics.unitsystems import Quantity
>>> from import mks
>>> Quantity(0.001, mks['kg'])
>>> _.convert_to(mks['g'])
>>> u.convert_to(x, u.gram)
