I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places.
When I import the csv file via pandas read_csv (together with other columns), the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 which actually should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002.
This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away.
How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that sacrifices accuracy for speed.
Passing float_precision='round_trip' to read_csv fixes this.
Check out this page for more detail on this.
After processing your data, if you want to save it back to a csv file, you can pass float_format="%.nf" to the corresponding method.
A full example:
import pandas as pd
df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ... # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f") # for 3 decimal places
I realise this is an old question, but maybe this will help someone else:
I had a similar problem, but couldn't quite use the same solution. Unfortunately the float_precision option only exists when using the C engine and not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regular expressions as delimiters), this little "trick" worked for me:
In the pd.read_csv arguments, set dtype='str' and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64').
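A minimal sketch of that workaround (the file name, separator pattern, and column name here are just for illustration):
import pandas as pd

# a regex separator forces the python engine, which does not support float_precision
df = pd.read_csv('data.csv', sep=r'\s*,\s*', engine='python', dtype='str')

# convert the numeric column afterwards; Python's string-to-float conversion is correctly rounded
df['value'] = df['value'].astype('float64')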
Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.
I ran across a new problem today. Some data that I'm working with looks like this (in a csv file):
Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Felony - Drug,1989,30-31,31,"3,861","1,176","328,235"
Male,White,Felony - Other,1989,30-31,31,"2,626",800,"328,235"
Male,White,Felony - Property,1989,30-31,31,"3,468","1,057","328,235"
Male,White,Felony - Violent/Sex,1989,30-31,31,"3,150",960,"328,235"
Male,White,Misdemeanor,1989,30-31,31,"46,441","14,149","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"
It's hard to see the problem, so let me highlight the second to last column:
"18,141"
"1,176"
800
"1,057"
960
"14,149"
0
The problem is values with commas are being exported as strings, while values without commas are exported as numbers. To be clear, the data should be interpreted as:
18141
1176
800
1057
960
14149
0
That is, it should all be interpreted as numeric values.
However, it makes me think that some "standard" application is exporting data like this. For the moment, let's say that it is Excel.
Is there any effective way to try to import flat files with this varying data type within the same column? Both R (read_csv from readr library) and Python's Pandas (read_csv), using their standard flags, interpreted this data by doing the following:
Presuming they should all be numbers (regardless of whether or not quotes are present in all "cells").
Presuming that the commas, therefore, must be the European-style of using a comma for a decimal place (instead of the US period).
So, both packages interpreted that column as follows:
18.141
1.176
800
1.057
960
14.149
0
In a way, it's impressive that both R (read_csv from the readr library) and Pandas (read_csv) could handle this incongruity and get the guesses almost right.
However, is there a flag that I can set or a package out there which can handle this sort of thing? For instance, a flag to say "remove quoted commas, they are most certainly not European for our US decimal place".
If not, is there enough of a need to contribute to this via forking either of their GitHub repos?
pandas.read_csv has a thousands parameter which you can set to a comma so that the quoted thousands separators are stripped while parsing.
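For example (using the same hypothetical file name as the converters example below):
pd.read_csv('data.csv', thousands=',')
With that argument, pandas will read your column as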
0 18141
1 1176
2 800
3 1057
4 960
5 14149
6 0
There is also a converters parameter that takes a dictionary mapping column names to functions applied to each value. You can use it for more complex preprocessing; something like this does the same thing:
pd.read_csv('data.csv', converters={'column_name': lambda x: int(x.replace(',',''))})
I want the following records, created using pd.read_excel(), to be interpreted differently. They currently display as 3.200000e+18, but each should (hopefully) be a different long integer:
ipdb> self.after['class_parent_ref']
class_id
3200000000000515954 3.200000e+18
3200000000000515951 NaN
3200000000000515952 NaN
3200000000000515953 NaN
3200000000000515955 3.200000e+18
3200000000000515956 3.200000e+18
Name: class_parent_ref, dtype: float64
Currently, they seem to 'come out' as scientifically notated strings:
ipdb> self.after['class_parent_ref'].iloc[0]
3.2000000000005161e+18
Worse, though, it's not clear to me that the number has been read correctly from my .xlsx file:
ipdb> self.after['class_parent_ref'].iloc[0] - 3.2e+18
516096.0
The number in Excel (the data source) is 3200000000000515952.
This is not about the display, which I know I can change here. It's about keeping the underlying data in the same form it was in when read (so that if/when I write it back to Excel, it'll look the same and so that if I use the data, it'll look like it did in Excel and not Xe+Y). I would definitely accept a string if I could count on it being a string representation of the correct number.
You may notice that the number I want to see is in fact (incidentally) one of the labels. Pandas correctly read those in as strings (perhaps because Excel treated them as strings?) unlike this number which I entered. (Actually though, even when I enter ="3200000000000515952" into the cell in question before redoing the read, I get the same result described above.)
How can I get 3200000000000515952 out of the dataframe? I'm wondering if pandas has a limitation with long integers, but the only thing I've found on it is 1) a little dated, and 2) doesn't look like the same thing I'm facing.
Thank you!
Convert the NaN values in your column to 0, then typecast that column to integer:
df[['class_parent_ref']] = df[['class_parent_ref']].fillna(value = 0)
df['class_parent_ref'] = df['class_parent_ref'].astype(int)
Or, when reading your file, specify keep_default_na=False for pd.read_excel() or na_filter=False for pd.read_csv().
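A short sketch of that second approach (the file names here are hypothetical):
df = pd.read_excel('source.xlsx', keep_default_na=False)  # empty cells come through as '' instead of NaN
# or, for csv input:
df = pd.read_csv('source.csv', na_filter=False)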
I wrote a script that takes files of columnar data and plots whichever column the user wants to view. Well, I noticed that the plots look crazy and have all the wrong numbers, because Python is ignoring the exponent.
My numbers are in the format: 1.000000E+1 OR 1.000000E-1
What dtype is that? I am using numpy.genfromtxt to import with dtype=float. I know there are all sorts of dtypes you can enter, but I cannot find a comprehensive list of the options with examples.
Thanks.
Here is an example of my input (those spaces are tabs):
Time Stamp    T1_ModBt       T2_90Bend      T3_InPE        T5_Stg2Rfrg
5:22 AM       2.115800E+2    1.400000E+0    1.400000E+0    3.035100E+1
5:23 AM       2.094300E+2    1.400000E+0    1.400000E+0    3.034800E+1
5:24 AM       2.079300E+2    1.400000E+0    1.400000E+0    3.031300E+1
5:25 AM       2.069500E+2    1.400000E+0    1.400000E+0    3.031400E+1
5:26 AM       2.052600E+2    1.400000E+0    1.400000E+0    3.030400E+1
5:27 AM       2.040700E+2    1.400000E+0    1.400000E+0    3.029100E+1
Update
I figured out at least part of the reason why what I am doing does not work, but I still do not know how to define the dtypes the way I want to.
import numpy as np
file = np.genfromtxt('myfile.txt', usecols = (0,1), dtype = (str, float), delimiter = '\t')
That returns an array of strings for each column. How do I tell it I want column 0 to be a str, and all the rest of the columns to be float?
In [55]: type(1.000000E+1)
Out[55]: <type 'float'>
What does your input data look like? It's quite possible that it's in the wrong input format, but it should be fairly easy to convert it to the right one.
Numbers in the form 1.0000E+1 can be parsed by float(), so I'm not sure what the problem is:
>>> float('1.000E+1')
10.0
I think you'll want to get a text parser to parse the format into a native Python data type, so that 1.00000E+1 turns into 1.0 × 10^1 = 10.0, which can be expressed as a float.
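For what it's worth, here is a sketch for the tab-separated sample in the question, using a per-column structured dtype with numpy.genfromtxt (the file name comes from the question's update; 'U8' is an assumed string width big enough for the timestamps):
import numpy as np

# one (name, dtype) pair per column: the timestamp stays a string, the readings become float64
data = np.genfromtxt('myfile.txt', delimiter='\t', skip_header=1,
                     dtype=[('time', 'U8'), ('T1_ModBt', 'f8'), ('T2_90Bend', 'f8'),
                            ('T3_InPE', 'f8'), ('T5_Stg2Rfrg', 'f8')])

print(data['T1_ModBt'])  # the exponents are parsed as expected: [211.58 209.43 ...]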