Keep original precision in dataframes during calculation - python

I am a bit confused about how Python stores values and how I can keep the original precision.
My problem is that after some calculations I get a dataframe that contains Inf and NaN.
I then replicated the same steps in Excel. The strange thing is that Excel shows specific numbers for the cells that are NaN/Inf in Python, while all other values are the same.
The calculation involves division, and whenever a very small number is involved, for example 2.48065E-16 in Excel, I get 0.0 in Python.
So I tried df = df.round(20) at the very beginning; the NaN disappeared, but I still got Inf.
I also tried pd.set_option('float_format', '{:f}'.format), but it didn't work.
Any idea how I can let Python store and manipulate the values in dataframes without changing the original precision?
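For reference, a minimal sketch of how to check what pandas actually stores (the column names num and den and the sample values are illustrative, not taken from the question). pandas keeps ordinary float64 values, so a tiny number like 2.48065E-16 is representable; a cell that really is 0.0, or an Inf produced by dividing by zero, usually points to the input data rather than to lost precision, and widening the display format makes that visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": [1.0, 2.0, 3.0], "den": [4.0, 0.0, 2.48065e-16]})

# Show full precision instead of the abbreviated default display.
pd.set_option("display.float_format", "{:.20g}".format)

result = df["num"] / df["den"]
print(result)

# Locate the rows that produce Inf or NaN: these come from zero (or missing)
# denominators, not from pandas dropping precision.
print(df[np.isinf(result) | result.isna()])
```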

Related

Running the same script on the same pandas data produces very slightly different floating-point values in the dataframes

I am executing a script that I have run before on the same data. The dataframe I get is only very slightly different from the previous one (in the 10th decimal place or so). For example:
in some column (and row) the old dataframe contains the price 5673391.88.
In the same column and same row of the new dataframe the value seems to be exactly the same (5673391.88).
However if I subtract the two columns I get a difference of -9.445123e-10.
This is of course the case for the entire column, not just that particular row. How can that be? Please note that I cannot confirm the same environment (pandas or Python version) between the two script runs. Could it be one of these two reasons? Something else?
One possible reason: pandas 1.2.0, released on 26 Dec 2020, highlighted this issue in its release notes:
Change in default floating precision for read_csv and read_table
Previously, the methods read_csv() and read_table() could read floating-point numbers slightly incorrectly with respect to the last bit in precision.
Before this version, floating_precision="high" was already available to avoid the issue.
With this version, the default is now floating_precision=None, which uses the more accurate parser; according to the release notes, this has no significant impact on performance.
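For illustration, a hedged sketch of pinning the parser explicitly so that both runs use the same converter regardless of the installed pandas version. Note that the keyword argument on pd.read_csv itself is float_precision; the file and column names below are placeholders, not from the question.

```python
import numpy as np
import pandas as pd

# 'round_trip' guarantees a float written out by Python reads back bit-for-bit;
# 'high' selects the more accurate C-engine converter.
df_old = pd.read_csv("prices_old.csv", float_precision="round_trip")
df_new = pd.read_csv("prices_new.csv", float_precision="round_trip")

# When comparing two runs, tolerate last-bit differences instead of demanding
# exact equality of the raw subtraction.
matches = np.isclose(df_old["price"], df_new["price"], rtol=0, atol=1e-9)
print(matches.all())
```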

Python IDLE column limitation

I am trying to use IDLE to work with some data, and my problem is that when there are too many columns, some of them are omitted from the output and replaced with dots. Is there a way to increase the limit set by the IDLE IDE? I have seen sentdex using IDLE with up to 11 columns, all of which were displayed, hence my question.
Thank you very much for your responses.
What type of object are you printing? Some types have a reduced representation produced by default by their str conversion, to avoid flooding the terminal. You can get some of them to produce their full representation by applying repr and then printing the result.
That doesn't work for dataframes, though. They have their own adjustable row and column limits. See Pretty-print an entire Pandas Series / DataFrame
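A short sketch of the pandas display options the linked answer points to; the truncation with dots comes from pandas itself rather than from IDLE, so raising these limits makes every column print.

```python
import pandas as pd

# Show every column instead of eliding the middle ones with "...".
pd.set_option("display.max_columns", None)
# Let output lines use the full available width instead of wrapping early.
pd.set_option("display.width", None)
# Optionally raise the row limit as well.
pd.set_option("display.max_rows", 200)

df = pd.DataFrame([range(15)])  # small stand-in DataFrame with many columns
print(df)
```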

Python pandas.read_csv rounds values on import?

I have a 10000 x 250 dataset in a csv file. When I use the command
data = pd.read_csv('pool.csv', delimiter=',', header=None)
while in the correct path, I actually import the values.
First I get the DataFrame. Since I want to work with the numpy package, I need to convert it to its values using
data = data.values
And this is where it gets weird. At position [9999,0] in the file I have the value -0.3839. However, after importing and calculating with it, I noticed that Python (or numpy) does something strange during the import.
Calling data[9999,0] SHOULD give the expected -0.3839, but gives something like -0.383899892....
I have already imported the file in other languages like Matlab and there was no issue with rounding those values. I also tried to use the .to_csv command from the pandas package instead of .values, but the problem is exactly the same.
The last 10 elements of the first column are
-0.2716
0.3711
0.0487
-1.518
0.5068
0.4456
-1.753
-0.4615
-0.5872
-0.3839
Is there any import routine that does not have these rounding errors?
Passing float_precision='round_trip' should solve this issue:
data = pd.read_csv('pool.csv', delimiter=',', header=None, float_precision='round_trip')
That's a floating-point error, and it is simply a consequence of how computers represent numbers. (You can look it up if you really want to know how it works.) Don't be bothered by it; it is very small.
If you really want exact precision (because you are testing for exact values), you can look at Python's decimal module, but your program will be a lot slower (probably around 100 times slower).
You can read more here: https://docs.python.org/3/tutorial/floatingpoint.html
You should know that all languages have this problem; some are just better at hiding it. (Also note that in Python 3 this "hiding" of the floating-point error has been improved.)
Since this problem has no ideal one-size-fits-all solution, you have to choose the most appropriate workaround for your situation.
I don't know about 'round_trip' and its limitations, but it can probably help you. Another option is the float_format argument of the to_csv method. (https://docs.python.org/3/library/string.html#format-specification-mini-language)
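As a hedged sketch of the decimal route mentioned above (assuming the same pool.csv layout): read_csv's converters hook hands each cell to the function as the raw string from the file, so Decimal sees the exact text rather than an already-rounded binary float. The resulting column has dtype object, and arithmetic on it is considerably slower.

```python
from decimal import Decimal

import pandas as pd

# Parse column 0 as exact decimals instead of binary floats.
data = pd.read_csv(
    'pool.csv',
    delimiter=',',
    header=None,
    converters={0: Decimal},
)

print(data[0].iloc[-1])  # prints -0.3839, exactly as written in the file
```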

Causes of negative dimensions are not allowed

What would be possible causes of this error:
negative dimensions are not allowed
I am using crosstab on a pandas DataFrame with 8 million rows. It works fine with up to 2 million rows, but throws this error when the data gets larger.
I tried to achieve what I want (as described in this question) using unstack, and I get exactly the same error.
Can anyone give a hint on this? I checked this question and this one, but they were not helpful enough.

xlwings representation of #N/A

I'm using xlwings with numpy and unittest in Python to test an Excel spreadsheet. However, when xlwings imports a cell that contains #N/A, the result is -2146826246.
I understand that this may have something to do with xlwings importing values as floats, and there being no good float representation of #N/A.
I want to compare #N/A with nan. Any advice on how to accomplish this?
For anyone who may stumble across the same problem in the future: I used the very crude method of building a dictionary that maps the error numbers to the values I wanted to return.
error_dict = {-2146826281: np.inf, -2146826246: np.nan}
If anyone has a more elegant solution, please let me know!
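For illustration, a small sketch of applying such a mapping once the sheet values are in a pandas DataFrame (the sample data below is made up; only the error codes come from the answer above). DataFrame.replace accepts the whole dict in one call.

```python
import numpy as np
import pandas as pd

# Error codes handed back by xlwings for Excel errors, mapped to float values.
error_dict = {-2146826281: np.inf, -2146826246: np.nan}

# Illustrative data standing in for a range read via xlwings.
df = pd.DataFrame({"price": [1.5, -2146826246, 3.0],
                   "qty":   [10, 20, -2146826281]})

# replace applies the whole mapping across every column at once.
df = df.replace(error_dict)

# #N/A cells are now real NaNs and can be tested with isna()/np.isnan.
print(df)
print(df.isna().sum())
```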
Just to add another option: if you brought your Excel data in (or converted it) as a pandas DataFrame, you can always use replace to convert the old #N/A values to NaNs (which are a little easier to deal with in Python/pandas).
df.replace(-2146826246,float('nan'))
When converting the range to a DataFrame, use this option:
.options(empty=np.nan)
Then treating the different error codes as NaN is very easy:
df[df == -2146826246] = np.nan
This is very useful for avoiding errors when applying functions and calculations to the DataFrame.
