I'm reading shipment data from a CSV file (edited here for privacy). In the original file, shown below, the ZIP code (01234) has a leading 0 and the order number (22276) is an integer.
After running pandas.read_csv and printing my data (and viewing it in a text editor), the leading 0 is gone from the ZIP code (it is now 1234) and the order number has become a float (22276.0).
Original:
GROUND,THIRD PARTY,Company Name,1 Road
Ave,Town,State,01234,,22276,22276,22276,,Customer Name,Street
Name,00000 00th Ave
Z.Z.,,Town,State,00001,V476V6,18001112222,,,,Package,1
After using pandas.read_csv:
GROUND,THIRD PARTY,Dreams,100 Higginson
Ave,LINCOLN,RI,1234,,22276.0,22276.0,22276.0,,Customer Name,Street
Name,00000 00th Ave
Z.Z.,,Town,State,00001,V476V6,18001112222,,,,Package,1
I've seen others have these issues as well, and in those questions you will see well-written answers about HOW to fix the problem. What I want to know is WHY the problem exists in the first place. Why does a reading function write out original data back to the file?
EDIT
Here's the code I'm currently working with, reference is the name of the column with the order number.
import pandas
grid = pandas.read_csv("thirdparty.csv", dtype={'ZIP': int, 'REFERENCE': int})
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(grid)
How
You'll want to use the dtype argument of pd.read_csv. One solution would be to read in all the columns as string type. This will preserve the values exactly as they were in your csv file.
import pandas as pd
data = pd.read_csv("thirdparty.csv", dtype=str)
Though a better solution would be to specify your desired dtype of each column:
data = pd.read_csv("thirdparty.csv", dtype={'ZIP': str, 'REFERENCE': int})
When writing the CSV file back out, you should also use the float_format argument of to_csv to ensure any floats are written the way you want.
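As a minimal sketch of that round trip, the snippet below reads a small in-memory stand-in for thirdparty.csv, keeps ZIP as text, and writes the data back out. The sample rows are made up, and using the nullable 'Int64' dtype for REFERENCE instead of plain int is an assumption on my part: plain int raises an error when the column contains empty cells.

```python
import io
import pandas as pd

# Made-up stand-in for thirdparty.csv; the second row has no order number.
raw = "ZIP,REFERENCE\n01234,22276\n00001,\n"

# ZIP stays text (leading zeros kept); REFERENCE uses pandas' nullable
# integer dtype so that a missing value does not force the column to float.
data = pd.read_csv(io.StringIO(raw), dtype={"ZIP": str, "REFERENCE": "Int64"})

# Write back to an in-memory buffer instead of a real file.
buffer = io.StringIO()
data.to_csv(buffer, index=False)
written_lines = buffer.getvalue().splitlines()
print(written_lines)
```

If the written lines match the input lines, the values survived the round trip unchanged.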
Why
You also asked why the "problem" exists.
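Essentially, when you use pd.read_csv without specifying a dtype, pandas infers a numeric type for any column whose values all look like numbers. So 01234 is parsed as the number 1234, and the leading zero is lost on read. A column of whole numbers that also contains empty cells is stored as float, because the default integer dtype cannot represent missing values; that is the most likely reason 22276 became 22276.0.
When you write back out to your file, the values are written as the numbers pandas now holds. The pd.read_csv function itself is not writing anything to the original file.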
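To see the inference in action, here is a small sketch on made-up data (the column names follow the question, the rows are invented). It reads without a dtype, prints what pandas inferred, and shows what a later to_csv would write.

```python
import io
import pandas as pd

# Made-up sample: one ZIP with a leading zero, one row with no order number.
raw = "ZIP,REFERENCE\n01234,22276\n00001,\n"

inferred = pd.read_csv(io.StringIO(raw))  # no dtype given, pandas infers types

# ZIP is all digits, so it is parsed as an integer and the leading zero is lost.
print(inferred["ZIP"].dtype, inferred["ZIP"].tolist())
# REFERENCE has an empty cell, so it is stored as float with NaN for the gap.
print(inferred["REFERENCE"].dtype, inferred["REFERENCE"].tolist())

# Writing back out serialises the inferred values, not the original text.
out = io.StringIO()
inferred.to_csv(out, index=False)
rewritten = out.getvalue().splitlines()
print(rewritten)
```

The change only reaches a file when the DataFrame is written with to_csv; the read by itself leaves the source untouched.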
Related
I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places.
When I import the csv file (and other columns) via pandas read_csv, the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 which actually should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002.
This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away.
How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that trades accuracy for speed.
Passing float_precision='round_trip' to read_csv fixes this.
Check out this page for more detail on this.
After processing your data, if you want to save it back to a CSV file, you can pass float_format="%.nf" (with n replaced by the number of decimal places) to to_csv.
A full example:
import pandas as pd
df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ... # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f") # for 3 decimal places
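A runnable variant of the same idea, as a sketch: the in-memory sample and the column name value are made up, and six decimals are used because that is what the question's data has.

```python
import io
import pandas as pd

raw = "value\n2470.691137\n2484.30691\n"

# Ask the C engine for its round-trip float converter.
df_in = pd.read_csv(io.StringIO(raw), float_precision="round_trip")
parsed = df_in["value"].tolist()

# Fix the number of decimals when writing back out.
out = io.StringIO()
df_in.to_csv(out, index=False, float_format="%.6f")
written = out.getvalue().splitlines()
print(parsed, written)
```

Note that float_format pads as well as rounds, so 2484.30691 is written with a trailing zero.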
I realise this is an old question, but maybe this will help someone else:
I had a similar problem, but couldn't quite use the same solution. Unfortunately the float_precision option only exists when using the C engine and not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regex literals as delimiters), this little "trick" worked for me:
In the pd.read_csv arguments, define dtype='str' and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64').
Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.
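For anyone who wants to try it, a minimal sketch of that trick. The sample data, the column names a and b, and the regex separator are all made up for illustration.

```python
import io
import pandas as pd

# Made-up sample with a regex-style delimiter, which needs the python engine.
raw = "a ; b\n2470.691137 ; 2484.30691\n"

# Read every field as text first, so no float parsing happens during the read.
df = pd.read_csv(io.StringIO(raw), sep=r"\s*;\s*", engine="python", dtype=str)

# Then convert the whole frame to float in a separate step.
df = df.astype("float64")
values = df.iloc[0].tolist()
print(values)
```

If only some columns are numeric, the same conversion can be applied per column instead of to the whole frame.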
I have a weird bug (?) when reading a csv with read_csv function. Some of the numbers (in my concrete case in 11 lines from a total of 500) are read with many trailing zeros and a seemingly random number at the end.
For example, for a value that is "0.052" in the csv, when I run pandas I get this:
values = pd.read_csv(filename, header=2)
values.column1[487]
0.052000000000000005
This is happening just for some columns, others are read normally.
Any ideas of what is going on here?
It probably is the data type. Specifying the datatype will solve it. If you just want to change the representation use:
# general form: pd.set_option("display.precision", <number of digits after the decimal point>)
pd.set_option("display.precision", 3)
This displays the value as 0.052. Put the pd.set_option call before the output (preferably at the top of the script). NOTE: This only changes what is shown; pandas still calculates with 0.052000000000000005, which in most cases isn't relevant, but in your case it might be.
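A small sketch of the difference between what is displayed and what is stored. It uses pd.option_context rather than pd.set_option so the setting does not leak outside the block, which is my choice here and not something the answer above requires.

```python
import pandas as pd

stored = 0.052000000000000005  # the value from the question
column = pd.Series([stored])

# option_context limits the setting to this block; set_option would be global.
with pd.option_context("display.precision", 3):
    shown = str(column)
print(shown)

# The stored float is untouched by the display setting.
unchanged = column.iloc[0] == stored
print(unchanged)
```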
I convert an XML file to CSV using some code plus specifications that I have added.
As a result I get a CSV file. When I open it, I see some weird numbers that look something like this:
1,25151E+21
Is there any way to eliminate this and show the whole number? The code that parses the XML to CSV is working fine, so I'm assuming it is an Excel thing.
I don't want to fix this manually every time I generate a new CSV file.
Additional
The entire code can be found HERE and I have only long numbers in Quality
for qu in sn.findall('.//Qualify'):
    repeated_values['qualify'] = qu.text
CSV doesn't pass any cell formatting rules to Excel. Hence, if you open a CSV that has very large numbers in it, the default cell formatting will likely be Scientific. You can try changing the cell formatting to Number, and if that changes the view to the entire number like you want, consider using XlsxWriter to apply cell formatting to the document while writing to XLSX instead of CSV.
I often end up running a lambda on dataframes with this issue when I bring in csv, fwf, etc, for ETL and back out to XLSX. In my case they are all account numbers, so it's pretty bad when Excel helpfully overrides it to scientific notation.
If you don't mind the long number being a string, you can do this:
# Requires: import numpy as np
# First I force it to be an int column, as I import everything as objects for unrelated reasons
df.thatlongnumber = df.thatlongnumber.astype(np.int64)
# Then I convert that to a string and store it back in the column
df.thatlongnumber = df.thatlongnumber.apply(lambda x: '{:d}'.format(x))
Let me know if this is useful at all.
Scientific notation is a pain. What I've used before to handle situations like this is to cast the value to a float and then use a format specifier. Something like this should work:
a = "1,25151E+21"
print(f"{float(a.replace(',', '.')):.0f}")
# output: 1251510000000000065536
I want the following records (currently displaying as 3.200000e+18 but actually (hopefully) each a different long integer), created using pd.read_excel(), to be interpreted differently:
ipdb> self.after['class_parent_ref']
class_id
3200000000000515954 3.200000e+18
3200000000000515951 NaN
3200000000000515952 NaN
3200000000000515953 NaN
3200000000000515955 3.200000e+18
3200000000000515956 3.200000e+18
Name: class_parent_ref, dtype: float64
Currently, they seem to 'come out' as scientifically notated strings:
ipdb> self.after['class_parent_ref'].iloc[0]
3.2000000000005161e+18
Worse, though, it's not clear to me that the number has been read correctly from my .xlsx file:
ipdb> self.after['class_parent_ref'].iloc[0] -3.2e+18
516096.0
The number in Excel (the data source) is 3200000000000515952.
This is not about the display, which I know I can change here. It's about keeping the underlying data in the same form it was in when read (so that if/when I write it back to Excel, it'll look the same and so that if I use the data, it'll look like it did in Excel and not Xe+Y). I would definitely accept a string if I could count on it being a string representation of the correct number.
You may notice that the number I want to see is in fact (incidentally) one of the labels. Pandas correctly read those in as strings (perhaps because Excel treated them as strings?) unlike this number which I entered. (Actually though, even when I enter ="3200000000000515952" into the cell in question before redoing the read, I get the same result described above.)
How can I get 3200000000000515952 out of the dataframe? I'm wondering if pandas has a limitation with long integers, but the only thing I've found on it is 1) a little dated, and 2) doesn't look like the same thing I'm facing.
Thank you!
Convert the NaN values in your column to 0, then typecast that column to integer:
df[['class_parent_ref']] = df[['class_parent_ref']].fillna(value = 0)
df['class_parent_ref'] = df['class_parent_ref'].astype(int)
Or, when reading your file, specify keep_default_na=False for pd.read_excel() and na_filter=False for pd.read_csv().
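One caveat worth checking: a float64 has 53 bits of precision, and near 3.2e18 neighbouring floats are 512 apart, so 3200000000000515952 cannot be stored in a float column at all. Casting to int afterwards does not bring the lost digits back. The sketch below illustrates this on a made-up CSV stand-in for the spreadsheet; pd.read_excel accepts the same dtype argument, but whether it helps depends on how the cell is stored in the workbook, which I can't verify here.

```python
import io
import pandas as pd

# Made-up CSV stand-in for the spreadsheet; the second row has no parent ref.
raw = (
    "class_id,class_parent_ref\n"
    "3200000000000515954,3200000000000515952\n"
    "3200000000000515951,\n"
)

# Default inference: the empty cell forces float64, which cannot hold 19 digits.
as_float = pd.read_csv(io.StringIO(raw))
lossy = int(as_float["class_parent_ref"].iloc[0])

# Reading as text keeps every digit; missing cells become empty strings.
as_text = pd.read_csv(io.StringIO(raw), dtype=str, keep_default_na=False)
exact = as_text["class_parent_ref"].tolist()
print(lossy, exact)
```

Once the column is text, int(exact[0]) gives a Python integer with all digits intact if arithmetic is needed.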