Python pandas read_csv rounding errors on import?

I have a 10000 x 250 dataset in a csv file. When I use the command
data = pd.read_csv('pool.csv', delimiter=',', header=None)
while in the correct path, the values import fine.
First I get a DataFrame. Since I want to work with the numpy package, I convert it to its values using
data = data.values
And this is where it gets weird. At position [9999,0] the file contains the value -0.3839. However, after importing and calculating with it, I noticed that Python (or numpy) does something strange during the import.
Calling data[9999,0] SHOULD give the expected -0.3839, but gives something like -0.383899892...
I already imported the file in other languages like Matlab and there was no rounding issue with those values. I also tried the .to_csv command from the pandas package instead of .values. However, the exact same problem occurs.
The last 10 elements of the first column are
-0.2716
0.3711
0.0487
-1.518
0.5068
0.4456
-1.753
-0.4615
-0.5872
-0.3839
Is there any import routine, which does not have those rounding errors?

Passing float_precision='round_trip' should solve this issue:
data = pd.read_csv('pool.csv',delimiter=',',header=None,float_precision='round_trip')
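To see the difference, here is a minimal comparison sketch (it assumes 'pool.csv' from the question; exactly what the default parser prints can vary with your pandas version):

import pandas as pd

default = pd.read_csv('pool.csv', delimiter=',', header=None)
exact = pd.read_csv('pool.csv', delimiter=',', header=None,
                    float_precision='round_trip')

# repr shows the full stored value; the default parser may not round-trip exactly,
# while 'round_trip' re-parses each field with Python's own float conversion.
print(repr(default.values[-1, 0]))
print(repr(exact.values[-1, 0]))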

That's a floating-point error. It comes from how computers represent real numbers in binary. (You can look it up if you really want to know how it works.) Don't be bothered by it; it is very small.
If you really want exact precision (because you are testing for exact values), you can look at Python's decimal module, but your program will be a lot slower (probably around 100 times slower).
You can read more here: https://docs.python.org/3/tutorial/floatingpoint.html
You should know that all languages have this problem; some are just better at hiding it. (Also note that in Python 3 the way floating-point values are displayed has been improved, which hides the error better.)
Since there is no single ideal solution to this problem, you have to choose the approach that is most appropriate for your situation.
I don't know about 'round_trip' and its limitations, but it can probably help you. Another option is to use float_format with the to_csv method (https://docs.python.org/3/library/string.html#format-specification-mini-language).
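As a hedged sketch of the decimal-module route mentioned above: read the column as strings so no binary float is ever created, then convert to Decimal (the file name and single unnamed column are taken from the question):

from decimal import Decimal
import pandas as pd

# Keep every field as text, then build exact Decimal values from the strings.
data = pd.read_csv('pool.csv', delimiter=',', header=None, dtype=str)
exact_col = data[0].map(Decimal)

print(exact_col.iloc[-1])       # Decimal('-0.3839'), exactly as written in the file
print(exact_col.iloc[-1] * 2)   # Decimal('-0.7678'), arithmetic stays exact (but slower)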

Related

Why does Bessel function in Scipy and EXCEL give different results?

I tried to use SciPy and Excel to calculate a Bessel function, but they give different results. Do you know why? Thanks in advance.
Python code:
import scipy.special as ss
result = ss.k1(0.2155481626213)
print(result)
Excel (I am using the current OneDrive Excel web app):
=BESSELK(0,2155481626213; 1)
The result from Python is 4.405746469429914
The result from Excel is 4,405746474969860.
Since the error in the result is quite small, the complexity of the numerical calculation and error propagation can explain the difference.
Side note:
even Wolfram Alpha got a different value:4.405746469430.
As @HubertusKaiser says, the error is so small that we can attribute it to rounding errors / floating-point representation.
There's an excellent explanation of why 0.1+0.2 != 0.3 on most computers here.
Now imagine doing a lot of those slightly-off floating-point calculations; you end up with the error difference you see.
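If you want to see which result is closer, an arbitrary-precision library such as mpmath can evaluate the same Bessel function with many more digits; a small sketch, assuming mpmath is installed:

import mpmath
import scipy.special as ss

mpmath.mp.dps = 30                                   # work with 30 significant digits
reference = mpmath.besselk(1, mpmath.mpf('0.2155481626213'))

print(reference)                # high-precision value of K_1(0.2155481626213)
print(ss.k1(0.2155481626213))   # SciPy's double-precision value, for comparison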

Python Idle column limitation

I am trying to use IDLE for working with some data, and my problem is that when there are too many columns, some of them are omitted in the output and replaced with dots. Is there a way to increase the limit set by the IDLE IDE? I have seen sentdex using IDLE with up to 11 columns and all of them were displayed, hence my question.
Thank you very much for your responses.
What type are you printing? Some have a reduced representation that is produced by default by their str conversion to avoid flooding the terminal. You can get some of them to produce their full representation by applying repr to them and then printing the result.
This doesn't work for dataframes. They have their own adjustable row and column limits. See Pretty-print an entire Pandas Series / DataFrame
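For a pandas DataFrame specifically, the display limits can be raised like this (the option values are just examples to adjust for your data):

import pandas as pd

pd.set_option('display.max_columns', None)   # show every column instead of eliding with '...'
pd.set_option('display.width', 200)          # widen the output line so columns aren't wrapped

df = pd.DataFrame({'col%d' % i: range(3) for i in range(15)})   # toy frame with 15 columns
print(df)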

Highly precise division, multiplication, and exponentiation of large complex numbers

I am working on a project that requires highly precise division of large numbers, which will sometimes be complex. I need to do this in Python, preferably Python 3.7, but nothing I have tried so far has worked.
With real numbers I can simply use the decimal module, but I found that the decimal module does not work for complex numbers. In addition, when I have tried to extend the decimal module to complex numbers, it has failed: I get inaccurate results with both large real and large complex inputs. Trying to install external modules with this functionality has not worked either.
from decimal import *

def div(a, b):
    y = b.real - (b.imag * 1j)        # complex conjugate of b
    a = a * y                         # multiply the numerator by conj(b)
    b = Decimal((b * y).real)         # b * conj(b) is real, so keep only the real part
    return [Decimal(a.real) / b, Decimal(a.imag) / b]
Here is my code for using the decimal module on complex numbers. To demonstrate what I mean (and to demonstrate that this method of division works at all), I'll show some inputs and outputs below. The first shows the method working with relatively small inputs, and the second shows it very much not working with a large input.
>>> div(13243,23)[0]*23
Decimal('13243.00000000000000000000000')
>>> div(15**17,23)[0]*23
Decimal('98526125335693355453.21739130')
The result for 15**17 is not only a few thousand higher than 15**17, it is also not a whole number, which is very wrong. As I said, I need this method to carry over to complex numbers, and as it stands, storing a complex number as a list of two parts is a pain and not ideal. It was only necessary so I could apply Decimal to the parts, and even then it clearly hasn't worked.
I thought at first that I just needed to set the precision higher, but even when set to 1000 it still fails.
At this point I tried to find modules that would do this for me. I found two: mpmath and gmpy. I tried to install gmpy via pip, on multiple versions of Python and with multiple versions of gmpy, and each time I got an error message, usually one about a server "actively refusing connection", as well as others saying it wasn't supported, etc.
This leaves me stuck. I can't get the modules that do it for me, and when I try to do it myself it quite blatantly isn't working. Is there another module that provides this functionality, or is there something in my attempts that can be fixed?
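One library that does handle this out of the box is mpmath, which has arbitrary-precision real and complex numbers; a sketch under the assumption that mpmath installs cleanly in your environment (unlike gmpy in your case):

import mpmath

mpmath.mp.dps = 50                        # work with 50 significant decimal digits

a = mpmath.mpf(15) ** 17                  # 98526125335693359375, held at full precision
b = mpmath.mpf(23)
print(a / b * 23)                         # recovers 15**17 to the working precision

z = mpmath.mpc(3, 4) / mpmath.mpc(1, 2)   # complex division works the same way
print(z)                                  # (2.2 - 0.4j)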

How to modify the default mapper of the StringConverter class?

I am trying to read a csv file in Python3 using the numpy genfromtxt function. In my csv file I have a field which is a string that looks like the following: "0x30375107333f3333".
I need to use the "dtype=None" option because I need this section of code to work with many different csv files, only some of which have such a field. Unfortunately numpy interprets this as a float128, which is a pain because 1) it is not a float and 2) I cannot find a way to convert it to an int after it has been read as a float128 (without losing precision).
What I would like to do is instead interpret this as a string because it is enough for me. I found on the Numpy documentation that there is a way of getting around this, but they give cryptic instructions:
This behavior may be changed by modifying the default mapper of the StringConverter class.
Unfortunately whenever I Google something related to this I fall back to this documentation page.
I would greatly appreciate either an explanation of what they mean in the above quoted text or a solution to my above stated problem.
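Short of patching StringConverter itself, a per-column converter is often enough to keep that field as a string while dtype=None still auto-detects the rest; a sketch with an assumed file name and column index (adjust both to your data):

import numpy as np

# Column 2 is forced to stay a string; every other column is still guessed
# because dtype=None is kept. The index 2 is only an assumption for this sketch.
data = np.genfromtxt('input.csv', delimiter=',', dtype=None,
                     encoding='utf-8',
                     converters={2: lambda s: s.strip('"')})
print(data)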

Python: handling a large set of data. Scipy or Rpy? And how?

In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is this:
1) A huge set of financial data is stored in a text file. Loading it into Excel is not possible.
2) I need to sum certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (SciPy or Rpy) is best suited for this task?
If so, could you give me some pointers (e.g. documentation or an online example) that can help me implement a solution?
Speed is a concern. Ideally SciPy or Rpy could handle large files even when the files are so large that they cannot fit into memory.
Neither Rpy nor SciPy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
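A sketch of that line-by-line approach (the file name and comma delimiter are assumptions; numpy.fromstring with a sep argument parses one text row into a float array):

import numpy as np

running_sum = None
with open('financial_data.txt') as f:          # hypothetical file name
    for line in f:
        row = np.fromstring(line, sep=',')     # one comma-separated row -> float array
        if row.size == 0:
            continue                           # skip blank or unparseable lines
        running_sum = row if running_sum is None else running_sum + row

print(running_sum)                             # column totals, without loading the whole file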
Python's file I/O doesn't have bad performance, so you can just use the built-in file support directly. You can see what functions are available by typing help(file) in the interactive interpreter. Opening a file is part of the core language and doesn't require an import.
Something like:
f = open ("C:\BigScaryFinancialData.txt", "r");
for line in f.readlines():
#line is a string type
#do whatever you want to do on a per-line basis here, for example:
print len(line)
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic and shouldn't require any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use a module for parsing, re for example (type help(re) in the interactive interpreter).
As @gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
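On the Python side, h5py is one common way to use HDF5; a minimal sketch with hypothetical file, dataset, and column names:

import h5py
import numpy as np

# Sum one column of a large 2-D dataset in chunks, so the whole dataset
# never has to fit in memory. 'financial.h5', 'table' and column 3 are assumptions.
with h5py.File('financial.h5', 'r') as f:
    dset = f['table']
    total = 0.0
    for start in range(0, dset.shape[0], 100000):
        total += np.sum(dset[start:start + 100000, 3])
print(total)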
How huge is your data; is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load the text data into a numpy array. For example:
import numpy as np

with open("data.csv", "r") as f:
    title = f.readline()                  # skip this line if your data has a title row
    data = np.loadtxt(f, delimiter=",")   # if your data is separated by ","
print(np.sum(data, axis=0))               # sum along axis 0 to get the total of every column
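Continuing from the loadtxt example above, if the data does fit in memory the top-10 step is then a one-liner with argsort (ranking by the value in column 0 is just an assumption):

top10 = data[np.argsort(data[:, 0])[-10:][::-1]]   # 10 rows with the largest column-0 values
print(top10)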
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums first and then choose the rows? To gather them on the fly you might use a dictionary to keep track of the current 10 best rows, with the keys storing the metric you use to rank them, so it is easy to find and toss out a row when another row supersedes it; a heap works for this as well (see the sketch below). If you need to find the rows after the computation is done, slurp all the data into a numpy array, or else just take a second pass through the file to pull out the ten rows.
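A sketch of that gather-as-you-go idea, using a heap from the standard-library heapq module to hold the 10 best rows; the file name and the per-row-sum ranking metric are assumptions:

import heapq

top10 = []                                     # min-heap of (metric, row_text) pairs
with open('financial_data.txt') as f:          # hypothetical file name
    for line in f:
        values = [float(x) for x in line.split(',')]
        metric = sum(values)                   # rank each row by its total, as an example
        if len(top10) < 10:
            heapq.heappush(top10, (metric, line))
        elif metric > top10[0][0]:
            heapq.heappushpop(top10, (metric, line))   # replace the current smallest

for metric, line in sorted(top10, reverse=True):
    print(metric, line.rstrip())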
Since this has the R tag I'll give some R solutions:
Overview: http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package: http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format: http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)
