Understanding pandas.read_csv() float parsing - python

I am having problems reading probabilities from CSV using pandas.read_csv; some of the values are read back as floats strictly greater than 1.0.
Specifically, I am confused about the following behavior:
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999998"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999999"))["column"][0]
1.0000000000000002
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000000"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000001"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000008"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000009"))["column"][0]
1.0000000000000002
The default float-parsing behavior seems to be non-monotonic, and in particular some values starting with 0.9... are converted to floats that are strictly greater than 1.0, which causes problems e.g. when feeding them into sklearn.metrics.
The documentation states that read_csv has a parameter float_precision that can be used to select “which converter the C engine should use for floating-point values”, and setting this to 'high' indeed solves my problem.
However, I would like to understand the default behavior:
Where can I find the source code of the default float converter?
Where can I find documentation on the intended behavior of the default float converter and the other possible choices?
Why does a single-figure change in the least significant position skip a value?
Why does this behave non-monotonically at all?
Edit regarding “duplicate question”: This is not a duplicate. I am aware of the limitations of floating-point math. I was specifically asking about the default parsing mechanism in Pandas, since the builtin float does not show this behavior:
>>> float("0.99999999999999999")
1.0
...and I could not find documentation.

#MaxU already showed the source code for the parser and the relevant tokenizer xstrtod so I'll focus on the "why" part:
The code for xstrtod is roughly like this (translated to pure Python):
def xstrtod(p):
    number = 0.
    idx = 0
    ndecimals = 0
    # integer part: accumulate the digits before the decimal point
    while p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1
    # skip the decimal point
    idx += 1
    # fractional part: keep accumulating digits and count the decimals
    while idx < len(p) and p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1
        ndecimals += 1
    # one final division by the appropriate power of ten
    return number / 10**ndecimals
Which reproduces the "problem" you saw:
print(xstrtod('0.99999999999999997')) # 1.0
print(xstrtod('0.99999999999999998')) # 1.0
print(xstrtod('0.99999999999999999')) # 1.0000000000000002
print(xstrtod('1.00000000000000000')) # 1.0
print(xstrtod('1.00000000000000001')) # 1.0
print(xstrtod('1.00000000000000002')) # 1.0
print(xstrtod('1.00000000000000003')) # 1.0
print(xstrtod('1.00000000000000004')) # 1.0
print(xstrtod('1.00000000000000005')) # 1.0
print(xstrtod('1.00000000000000006')) # 1.0
print(xstrtod('1.00000000000000007')) # 1.0
print(xstrtod('1.00000000000000008')) # 1.0
print(xstrtod('1.00000000000000009')) # 1.0000000000000002
print(xstrtod('1.00000000000000019')) # 1.0000000000000002
The problem is the 9 in the last place, which alters the accumulated value. So it comes down to floating-point accuracy: the digits are accumulated into one large float before the final division, and at that magnitude a trailing 9 pushes the value to the next representable float:
>>> float('100000000000000008')
1e+17
>>> float('100000000000000009')
1.0000000000000002e+17
That 9 in the last place throws off the accumulated value before the final division by 10**ndecimals, which is what skews the results.
If you want high precision you can define your own converters or use Python-provided ones, e.g. decimal.Decimal if you want arbitrary precision:
>>> import pandas
>>> import decimal
>>> converter = {0: decimal.Decimal} # parse column 0 as decimals
>>> import io
>>> def parse(string):
...     return '{:.30f}'.format(pandas.read_csv(io.StringIO(string), converters=converter)["column"][0])
>>> print(parse("column\n0.99999999999999998"))
>>> print(parse("column\n0.99999999999999999"))
>>> print(parse("column\n1.00000000000000000"))
>>> print(parse("column\n1.00000000000000001"))
>>> print(parse("column\n1.00000000000000008"))
>>> print(parse("column\n1.00000000000000009"))
which prints:
0.999999999999999980000000000000
0.999999999999999990000000000000
1.000000000000000000000000000000
1.000000000000000010000000000000
1.000000000000000080000000000000
1.000000000000000090000000000000
Exactly representing the input!
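If it's more convenient, the converters mapping can also be keyed by column name instead of position; a minimal sketch of the same idea:

import io
import decimal
import pandas

# Same technique as above, but keyed by column name rather than column index.
converters = {"column": decimal.Decimal}
df = pandas.read_csv(io.StringIO("column\n0.99999999999999999"), converters=converters)
print(df["column"][0])        # 0.99999999999999999, kept exactly as a Decimal
print(type(df["column"][0]))  # <class 'decimal.Decimal'>

Note that the column then has dtype object, so downstream numeric code needs to cope with Decimal values.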

If you want to understand how it works, look at the source code: file "_libs/parsers.pyx", lines 492-499 for Pandas 0.20.1:
self.parser.double_converter_nogil = xstrtod             # <------- default converter
self.parser.double_converter_withgil = NULL
if float_precision == 'high':
    self.parser.double_converter_nogil = precise_xstrtod  # <------- 'high' converter
    self.parser.double_converter_withgil = NULL
elif float_precision == 'round_trip':  # avoid gh-15140
    self.parser.double_converter_nogil = NULL
    self.parser.double_converter_withgil = round_trip
Source code for xstrtod
Source code for precise_xstrtod
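In practice you pick between these converters with the float_precision argument of read_csv. A minimal sketch, assuming a pandas version that supports all three settings (the expected values follow from the behavior described above):

import io
import pandas

text = "column\n0.99999999999999999"

for precision in (None, 'high', 'round_trip'):
    value = pandas.read_csv(io.StringIO(text), float_precision=precision)["column"][0]
    print(precision, repr(value))

# Expected, per the behavior described above:
#   None         -> 1.0000000000000002   (default xstrtod)
#   'high'       -> 1.0                  (precise_xstrtod)
#   'round_trip' -> 1.0                  (Python's own round-trip parser)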

Related

How to convert voltage (or frequency) floating number read backs to mV (or kHz)?

I am successfully able to read back data from an instrument:
When the read back is a voltage, I typically read back values such as 5.34e-02 Volts.
When the read back is frequency, I typically read values like 2.95e+04 or 1.49e+05 with units Hz.
I would like to convert the voltage read back of 5.34e-02 to exponent e-3 (aka millivolts), i.e. 53.4e-3. Next, I would like to extract the mantissa 53.4 from this, because all my data needs to be in millivolts.
Similarly, I would like to convert all the frequencies such as 2.95e+04 (or 1.49e+05) to kHz, i.e. 29.5e+03 or 149e+03, and then extract the mantissas 29.5 and 149, since all my data needs to be in kHz.
Can someone suggest how to do this?
Well, to convert volts to millivolts, you multiply by 1000. To convert Hz to kHz, you divide by 1000.
>>> reading = 5.34e-02
>>> millivolts = reading * 1000
>>> print(millivolts)
53.400000000000006
>>> hz = 2.95e+04
>>> khz = hz / 1000
>>> khz
29.5
>>>
FOLLOW-UP
OK, assuming your real goal is to keep the units the same but adjust the exponent to a multiple of 3, see if this meets your needs.
def convert(val):
    if isinstance(val, int):
        return str(val)
    cvt = f"{val:3.2e}"
    if 'e' not in cvt:
        return cvt
    # a will be #.##
    # b will be -##
    a, b = cvt.split('e')
    exp = int(b)
    if exp % 3 == 0:
        return cvt
    if exp % 3 == 1:
        a = a[0] + a[2] + a[1] + a[3]
        exp = abs(exp - 1)
        return f"{a}e{b[0]}{exp:02d}"
    a = a[0] + a[2] + a[3] + a[1]
    exp = abs(exp - 2)
    return f"{a}e{b[0]}{exp:02d}"

for val in (5.34e-01, 2.95e+03, 5.34e-02, 2.95e+04, 5.34e-03, 2.95e+06):
    print(f"{val:3.2e} ->", convert(val))
Output:
5.34e-01 -> 534.e-03
2.95e+03 -> 2.95e+03
5.34e-02 -> 53.4e-03
2.95e+04 -> 29.5e+03
5.34e-03 -> 5.34e-03
2.95e+06 -> 2.95e+06
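If you would rather do the exponent adjustment numerically instead of by string slicing, here is a minimal sketch along the same lines (the helper name to_engineering is mine, not from the question):

import math

def to_engineering(value):
    """Return (mantissa, exponent) with the exponent a multiple of 3."""
    if value == 0:
        return 0.0, 0
    exponent = math.floor(math.log10(abs(value)))
    exponent -= exponent % 3              # snap down to a multiple of 3
    mantissa = value / 10 ** exponent
    return mantissa, exponent

for val in (5.34e-02, 2.95e+04, 1.49e+05):
    mantissa, exponent = to_engineering(val)
    print(f"{val:.2e} -> {mantissa:.4g}e{exponent:+03d}")
    # 5.34e-02 -> 53.4e-03
    # 2.95e+04 -> 29.5e+03
    # 1.49e+05 -> 149e+03

Floating-point rounding can still creep into the mantissa (e.g. 53.400000000000006), so format it to the precision you need, as the f-string above does.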
In this case, I think multiplying/dividing by 1000 is enough to move between SI prefixes. But when units get more complicated it might help to use a library like Pint to keep track of things and make sure you're calculating what you think you are.
In this case you might do:
import pint
ureg = pint.UnitRegistry()
Q = ureg.Quantity
reading_v = Q(5.34e-02, 'volts')
reading_mv = reading_v.to('millivolts')
print(reading_mv.magnitude)
but it seems overkill here.

Float divisions returning weird results

I'm trying to do a project and for some reason the same division gives me different results. I am trying to check whether two divisions are equal and give the same result, but when I try 5.99/1 and 0.599/0.1 the script says they are different, even though they are supposed to return the same result. I figured out that the problem is that 5.99/1 = 5.99 and 0.599/0.1 = 5.989999999999999, but I can't find a fix for this.
You can find the reason in this answer: https://stackoverflow.com/a/588014/11502612
I have written a possible solution for you:
Code:
a = 5.99 / 1
b = 0.599 / 0.1
a_str = "{:.4f}".format(5.99 / 1)
b_str = "{:.4f}".format(0.599 / 0.1)
print(a, b)
print(a_str, b_str)
print(a == b)
print(a_str == b_str)
Output:
>>> python3 test.py
5.99 5.989999999999999
5.9900 5.9900
False
True
As you can see above, I converted the result of each division to a formatted string and compare those instead of the raw floats.
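If you only need to test the two results for (near-)equality rather than display them, another common option (assuming Python 3.5+) is math.isclose, which compares within a tolerance instead of formatting to strings:

import math

a = 5.99 / 1
b = 0.599 / 0.1

# Compare within a relative tolerance instead of testing exact equality.
print(math.isclose(a, b, rel_tol=1e-9))  # True
print(a == b)                            # False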

What does the .N mean in this block of Python code?

I am working through Learn Python the Hard Way and am browsing through some code on GitHub before moving on. I am just curious what the .N does on the line with "tm.N = 1000" and how it relates to the rest of the code.
import matplotlib.pyplot as plt
import random
import pandas.util.testing as tm
tm.N = 1000
df = tm.makeTimeDataFrame()
import string
foo = list(string.letters[:5]) * 200
df['indic'] = list(string.letters[:5]) * 200
random.shuffle(foo)
df['indic2'] = foo
df.boxplot(by=['indic', 'indic2'], fontsize=8, rot=90)
plt.show()
N is a global in the testing.py module that is used throughout the module to build test arrays and other objects. Its default value is 30. E.g.
np.arange(N * K).reshape((N, K))
Series(randn(N), index=index)
In the code you're posting the usage is poor, because makeTimeDataFrame can be fed an nper parameter, which ends up being substituted by N when nper is not provided. This is the correct usage, which would not confuse you:
df = tm.makeTimeDataFrame(nper=1000)
The previous line, import pandas.util.testing as tm, imports the module pandas.util.testing and, for convenience, gives it the name tm. Thus, tm afterwards refers to this module, and so tm.N refers to the object named "N" (whatever that is) in the module.
Source: https://github.com/pydata/pandas/blob/master/pandas/util/testing.py
N is a variable in the pandas.util.testing library (imported as tm). It's used in a few of the functions defined in that library, including the makeTimeSeries function called by getTimeSeriesData, which is in turn called by the makeTimeDataFrame function that you call with df = tm.makeTimeDataFrame().
You can get information about pandas.util.testing.N from the docstring and the type() function:
>>> tm.N.__doc__
'int(x[, base]) -> integer\n\nConvert a string or number to an integer, if possible. A floating point\nargument will be truncated towards zero (this does not include a string\nrepresentation of a floating point number!) When converting a string, use\nthe optional base. It is an error to supply a base when converting a\nnon-string. If base is zero, the proper base is guessed based on the\nstring content. If the argument is outside the integer range a\nlong object will be returned instead.'
>>> print(tm.N.__doc__)
int(x[, base]) -> integer
Convert a string or number to an integer, if possible. A floating point
argument will be truncated towards zero (this does not include a string
representation of a floating point number!) When converting a string, use
the optional base. It is an error to supply a base when converting a
non-string. If base is zero, the proper base is guessed based on the
string content. If the argument is outside the integer range a
long object will be returned instead.
>>> type(tm.N)
<type 'int'>
In pandas, in the module pandas.util.testing, N sets the length of the TimeSeries objects created by the testing helpers.
See this reference in the section:
We could alternatively have used the unit testing function to create a TimeSeries of length 20:
>>> pandas.util.testing.N = 20
>>> ts = pandas.util.testing.makeTimeSeries()
It makes a timeseries of length 1000.
>>> df.head()
Out[7]:
A B C D
2000-01-03 -0.734093 -0.843961 -0.879394 0.415565
2000-01-04 0.028562 -1.098165 1.292156 0.512677
2000-01-05 1.135995 -0.864060 1.297646 -0.166932
2000-01-06 -0.738651 0.426662 0.505882 -0.124671
2000-01-07 -1.242401 0.225207 0.053541 -0.234740
>>> len(df)
Out[8]: 1000
.N provides the number of elements in an array-like object. For example, if you use a colormap,
plt.get_cmap('Pastel1').N will return 9 because it consists of 9 colors, whereas
plt.get_cmap('nipy_spectral').N will return 256.
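For completeness, a quick sketch of the colormap case mentioned above (assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Colormap.N is just the number of entries in the colormap's lookup table.
print(plt.get_cmap('Pastel1').N)        # 9   (a qualitative map with 9 colors)
print(plt.get_cmap('nipy_spectral').N)  # 256 (a continuous map sampled at 256 levels)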

Convert Scientific Notation to Float

Encountered a problem whereby my JSON data gets printed in scientific notation instead of as a plain decimal float.
import urllib2
import json
import sys
url = 'https://bittrex.com/api/v1.1/public/getmarketsummary?market=btc-quid'
json_obj = urllib2.urlopen(url)
QUID_data = json.load(json_obj)
QUID_MarketName_Trex = QUID_data["result"][0]["MarketName"][4:9]
QUID_Last_Trex = QUID_data["result"][0]["Last"]
QUID_High_Trex = QUID_data["result"][0]["High"]
QUID_Low_Trex = QUID_data["result"][0]["Low"]
QUID_Volume_Trex = QUID_data["result"][0]["Volume"]
QUID_BaseVolume_Trex = QUID_data["result"][0]["BaseVolume"]
QUID_TimeStamp_Trex = QUID_data["result"][0]["TimeStamp"]
QUID_Bid_Trex = QUID_data["result"][0]["Bid"]
QUID_Ask_Trex = QUID_data["result"][0]["Ask"]
QUID_OpenBuyOrders_Trex = QUID_data["result"][0]["OpenBuyOrders"]
QUID_OpenSellOrders_Trex = QUID_data["result"][0]["OpenSellOrders"]
QUID_PrevDay_Trex = QUID_data["result"][0]["PrevDay"]
QUID_Created_Trex = QUID_data["result"][0]["Created"]
QUID_Change_Trex = ((QUID_Last_Trex - QUID_PrevDay_Trex)/ QUID_PrevDay_Trex)*100
QUID_Change_Var = str(QUID_Change_Trex)
QUID_Change_Final = QUID_Change_Var[0:5] + '%'
print QUID_Last_Trex
It prints the following value; 1.357e-05.
I need this to be a float with 8 chars behind the decimal (0.00001370)
As you can see here --> http://i.imgur.com/FCVM1UN.jpg, my GUI displays the first row correctly (using the exact same code).
You are looking at the default str() formatting of floating point numbers, where scientific notation is used for sufficiently small or large numbers.
You don't need to convert this, the value itself is a proper float. If you need to display this in a different format, format it explicitly:
>>> print(0.00001357)
1.357e-05
>>> print(format(0.00001357, 'f'))
0.000014
>>> print(format(0.00001357, '.8f'))
0.00001357
Here the f format always uses fixed point notation for the value. The default precision is 6 digits; the .8 instructs the f formatter to show 8 digits instead.
In Python 3, str() on a float is the same as repr(): it produces the shortest string that round-trips to the same value, switching to scientific notation for sufficiently small or large magnitudes. Python 2's str() used the equivalent of '%.12g'; the g format uses either scientific or fixed-point presentation depending on the exponent of the number.
You can use print formatting:
x = 1.357e-05
print('%f' % x)
Edit:
print('%.08f' % x)
There are some approaches:
#1 float(...) + optionally round() or .format()
x = float(1.357e-05)
round(x, 6)
"{:.8f}".format(x)
#2 with decimal class
import decimal
tmp = decimal.Decimal('1.357e-05')
print('[0]', tmp)
# [0] 0.00001357
tmp = decimal.Decimal(1.357e-05)
print('[1]', tmp)
# [1] 0.0000135700000000000005188384444299032338676624931395053863525390625
decimal.getcontext().prec = 6
tmp = decimal.getcontext().create_decimal(1.357e-05)
print('[2]', tmp)
# [2] 0.0000135700
#3 with .rstrip(...)
x = ("%.17f" % n).rstrip('0').rstrip('.')
Note: there are counterparts to %f (see the short demo below):
%f uses fixed-point notation
%e uses scientific notation
%g picks between the two, switching to scientific notation when the exponent is below -4 or at least as large as the precision
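A quick sketch illustrating the three conversion types side by side (values as produced by CPython's % formatting):

x = 1.357e-05
print("%f" % x)    # 0.000014      (fixed point, default 6 digits)
print("%.8f" % x)  # 0.00001357    (fixed point, 8 digits)
print("%e" % x)    # 1.357000e-05  (scientific notation)
print("%g" % x)    # 1.357e-05     (general format picks scientific here)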

Getting part of R object from python using rpy2

I can get the output I need using R, but I cannot reproduce it with Python's rpy2 module.
In R:
> wilcox.test(c(1,2,3), c(100,200,300), alternative = "less")$p.value
gives
[1] 0.05
In python:
import rpy2.robjects as robjects
rwilcox = robjects.r['wilcox.test']
x = robjects.IntVector([1,2,3,])
y = robjects.IntVector([100,200,300])
z = rwilcox(x,y, alternative = "less")
print z
gives:
Wilcoxon rank sum test
data: 1:3 and c(100L, 200L, 300L)
W = 0, p-value = 0.05
alternative hypothesis: true location shift is less than 0
And:
z1 = z.rx('p.value')
print z1
gives:
$p.value
[1] 0.05
Still trying to get a final value of 0.05 stored as a variable, but this seems to be closer to a final answer.
I am unable to figure out what my Python code needs to be to store the p.value in a new variable.
z1 is a ListVector containing one FloatVector with one element:
>>> z1
<ListVector - Python:0x4173368 / R:0x36fa648>
[FloatVector]
p.value: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x4173290 / R:0x35e6b38>
[0.050000]
You can extract the float itself with z1[0][0] or just float(z1[0]):
>>> z1[0][0]
0.05
>>> type(z1[0][0])
<type 'float'>
>>> float(z1[0])
0.05
In general you are going to have an easier time figuring out what is going on in an interactive session if you just type the name of the object you want a representation of. The print x statement goes through str(x), whereas the repr(x) representation used implicitly by the interactive loop is much more helpful. If you are doing things in a script, use print repr(x) instead.
Just using list() ?
pval = z.rx2('p.value')
print list(pval) # [0.05]
rpy2 also works well with numpy:
import numpy
pval = numpy.array(pval)
print pval # array([ 0.05])
http://rpy.sourceforge.net/rpy2/doc-2.3/html/numpy.html#from-rpy2-to-numpy
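Putting the two suggestions together, a minimal sketch (assuming z is the wilcox.test result object from the question):

# rx2 extracts the named list component; indexing with [0] yields a plain Python float.
pvalue = z.rx2('p.value')[0]
print(pvalue)  # 0.05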
