Pandas is Reading .xlsx Column as Datetime rather than float - python

I obtained an Excel file with complicated formatting for some cells. Here is a sample:
The "USDC Amount USDC" column has formatting of "General" for the header cell, and the following for cells C2 through C6:
I need to read this column into pandas as a float value. However, when I use
import pandas
df = pandas.read_excel('Book1.xlsx')
print(['USDC Amount USDC'])
print(df['USDC Amount USDC'])
I get
['USDC Amount USDC']
0 NaT
1 1927-06-05 05:38:32.726400
2 1872-07-25 18:21:27.273600
3 NaT
4 NaT
Name: USDC Amount USDC, dtype: datetime64[ns]
I do not want these as datetimes, I want them as floats! If I remove the complicated formatting in the Excel document (change it to "general" in column C), they are read in as float values, like this, which is what I want:
['USDC Amount USDC']
0 NaN
1 10018.235101
2 -10018.235101
3 NaN
4 NaN
Name: USDC Amount USDC, dtype: float64
The problem is that I have to download these Excel documents on a regular basis, and cannot modify them from the source. I have to get Pandas to understand (or ignore) this formatting and interpret the value as a float on its own.
I'm on Pandas 1.4.4, Windows 10, and Python 3.8. Any idea how to fix this? I cannot change the source Excel file, all the processing must be done in the Python script.
EDIT:
I added the sample Excel document in my comment below to download for reference. Also, here are some other package versions in case these matter:
openpyxl==3.0.3
xlrd==1.2.0
XlsxWriter==1.2.8

It appears that updating OpenPyXL from 3.0.3 to 3.1.0 resolved this issue. A quick glance at the changelog (https://openpyxl.readthedocs.io/en/stable/changes.html) suggests it is related to bugfix 1413 or 1500.
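If you want to confirm which openpyxl version pandas is actually picking up, a quick check:
import openpyxl
print(openpyxl.__version__)  # should report 3.1.0 or later after upgrading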

You could use the dtype argument of read_excel, along the lines of
import numpy as np
import pandas
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC': np.float64})
but that comes with an issue: your source data contains characters that can't be cast to a float. Your next best options are the object or string dtypes. So instead of np.float64, you would pass "string", resulting in
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC': "string"})
After that, you need to extract the numbers from the column. Here's a resource that could help you get an idea of the overall process, although the exact method is up to you.
Finally, convert the now numbers-only column to floats. You can do it with the built-in casting:
df["numbers_only"] = df["numbers_only"].astype(np.float64)

Related

Pandas read csv file with float values results in weird rounding and decimal digits

I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places.
When I import the csv file (and other columns) via pandas read_csv, the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 which actually should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002.
This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away.
How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that sacrifices accuracy for speed.
Passing float_precision='round_trip' to read_csv fixes this.
Check out this page for more detail on this.
After processing your data, if you want to save it back to a csv file, you can pass float_format="%.nf" (where n is the number of decimal places) to the corresponding method.
A full example:
import pandas as pd
df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ... # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f") # for 3 decimal places
I realise this is an old question, but maybe this will help someone else:
I had a similar problem, but couldn't quite use the same solution. Unfortunately the float_precision option only exists when using the C engine and not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regular expressions as delimiters), this little "trick" worked for me:
In the pd.read_csv arguments, define dtype='str' and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64').
Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.
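A minimal sketch of that workaround; the regex separator and the column name "value" are made up for illustration:
import pandas as pd

# Read everything as strings with the python engine (required for regex
# separators), then cast the numeric column back to float64.
df = pd.read_csv('data.csv', sep=r';+', engine='python', dtype=str)
df['value'] = df['value'].astype('float64')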

How to import csv's with "occasional" quotes (R and / or Pandas)?

I ran across a new problem today. I discovered some data that I'm working with, that looks like this (in a csv file):
Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Felony - Drug,1989,30-31,31,"3,861","1,176","328,235"
Male,White,Felony - Other,1989,30-31,31,"2,626",800,"328,235"
Male,White,Felony - Property,1989,30-31,31,"3,468","1,057","328,235"
Male,White,Felony - Violent/Sex,1989,30-31,31,"3,150",960,"328,235"
Male,White,Misdemeanor,1989,30-31,31,"46,441","14,149","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"
It's hard to see the problem, so let me highlight the second to last column:
"18,141"
"1,176"
800
"1,057"
960
"14,149"
0
The problem is values with commas are being exported as strings, while values without commas are exported as numbers. To be clear, the data should be interpreted as:
18141
1176
800
1057
960
14149
0
That is, it should all be interpreted as numeric values.
However, it makes me think that some "standard" application is exporting data like this. For the moment, let's say that it is Excel.
Is there any effective way to try to import flat files with this varying data type within the same column? Both R (read_csv from readr library) and Python's Pandas (read_csv), using their standard flags, interpreted this data by doing the following:
1. Presuming they should all be numbers (regardless of whether or not quotes are present in all "cells").
2. Presuming that the commas, therefore, must be the European style of using a comma for the decimal place (instead of the US period).
So, both packages interpreted that column as follows:
18.141
1.176
800
1.057
960
14.149
0
In a way, it's impressive that both R (read_csv from readr library) and Pandas (read_csv) could both handle this incongruity and get the guesses almost right.
However, is there a flag that I can set or a package out there which can handle this sort of thing? For instance, a flag to say "remove quoted commas, they are most certainly not European for our US decimal place".
If not, is there enough of a need to contribute to this via forking either of their GitHub repos?
pandas.read_csv has a thousands parameter; set it to ',' and pandas will read your column as
0 18141
1 1176
2 800
3 1057
4 960
5 14149
6 0
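For example, a short sketch (the file name is assumed; the sample has no header row, so columns are referred to by position):
import pandas as pd

# thousands=',' strips the grouping commas inside quoted fields
# before numeric conversion, so the whole column parses as integers.
df = pd.read_csv('data.csv', header=None, thousands=',')
print(df[7])        # the second-to-last column from the sample
print(df[7].dtype)  # int64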
There is also a converters parameter that takes a dictionary mapping column names to functions applied to each value in that column. You can use it for more complex preprocessing, something like this (which does the same thing):
pd.read_csv('data.csv', converters={'column_name': lambda x: int(x.replace(',',''))})

Converting long integers to strings in pandas (to avoid scientific notation)

I want the following records (currently displaying as 3.200000e+18 but actually (hopefully) each a different long integer), created using pd.read_excel(), to be interpreted differently:
ipdb> self.after['class_parent_ref']
class_id
3200000000000515954 3.200000e+18
3200000000000515951 NaN
3200000000000515952 NaN
3200000000000515953 NaN
3200000000000515955 3.200000e+18
3200000000000515956 3.200000e+18
Name: class_parent_ref, dtype: float64
Currently, they seem to 'come out' as scientifically notated strings:
ipdb> self.after['class_parent_ref'].iloc[0]
3.2000000000005161e+18
Worse, though, it's not clear to me that the number has been read correctly from my .xlsx file:
ipdb> self.after['class_parent_ref'].iloc[0] - 3.2e+18
516096.0
The number in Excel (the data source) is 3200000000000515952.
This is not about the display, which I know I can change here. It's about keeping the underlying data in the same form it was in when read (so that if/when I write it back to Excel, it'll look the same and so that if I use the data, it'll look like it did in Excel and not Xe+Y). I would definitely accept a string if I could count on it being a string representation of the correct number.
You may notice that the number I want to see is in fact (incidentally) one of the labels. Pandas correctly read those in as strings (perhaps because Excel treated them as strings?) unlike this number which I entered. (Actually though, even when I enter ="3200000000000515952" into the cell in question before redoing the read, I get the same result described above.)
How can I get 3200000000000515952 out of the dataframe? I'm wondering if pandas has a limitation with long integers, but the only thing I've found on it is 1) a little dated, and 2) doesn't look like the same thing I'm facing.
Thank you!
Convert the NaN values in your column to 0, then typecast the column to integer:
df[['class_parent_ref']] = df[['class_parent_ref']].fillna(value = 0)
df['class_parent_ref'] = df['class_parent_ref'].astype(int)
Or, when reading your file, specify keep_default_na=False for pd.read_excel() or na_filter=False for pd.read_csv().
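Alternatively, if the cells are stored as text in Excel, forcing pandas to read the column as strings preserves every digit. Note that Excel stores numeric cells as 64-bit floats (about 15 significant digits), so a 19-digit value entered as a number is rounded before pandas ever sees it. A sketch, with the file and column names assumed:
import pandas as pd

# Read the column as text so it never passes through float64.
# Only helps if the Excel cell itself holds text, not a number.
df = pd.read_excel('data.xlsx', dtype={'class_parent_ref': str})
print(df['class_parent_ref'].iloc[0])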

python data types

I wrote a script to take files of data that is in columns and plot it depending on which column the user wants to view. Well, I noticed that the plots look crazy and have all the wrong numbers, because Python is ignoring the exponent.
My numbers are in the format: 1.000000E+1 OR 1.000000E-1
What dtype is that? I am using numpy.genfromtxt to import with a dtype = float. I know there are all sorts of dtypes you can enter, but I cannot find a comprehensive list of the options, and examples.
Thanks.
Here is an example of my input (those spaces are tabs):
Time Stamp	T1_ModBt	T2_90Bend	T3_InPE	T5_Stg2Rfrg
5:22 AM	2.115800E+2	1.400000E+0	1.400000E+0	3.035100E+1
5:23 AM	2.094300E+2	1.400000E+0	1.400000E+0	3.034800E+1
5:24 AM	2.079300E+2	1.400000E+0	1.400000E+0	3.031300E+1
5:25 AM	2.069500E+2	1.400000E+0	1.400000E+0	3.031400E+1
5:26 AM	2.052600E+2	1.400000E+0	1.400000E+0	3.030400E+1
5:27 AM	2.040700E+2	1.400000E+0	1.400000E+0	3.029100E+1
Update
I figured out at least part of the reason why what I am doing does not work. Still do not know how to define dtypes the way I want to.
import numpy as np
file = np.genfromtxt('myfile.txt', usecols = (0,1), dtype = (str, float), delimiter = '\t')
That returns an array of strings for each column. How do I tell it I want column 0 to be a str, and all the rest of the columns to be float?
In [55]: type(1.000000E+1)
Out[55]: <type 'float'>
What does your input data look like? It's quite possible that it's in the wrong input format, but it should be fairly easy to convert it to the right one.
Numbers in the form 1.0000E+1 can be parsed by float(), so I'm not sure what the problem is:
>>> float('1.000E+1')
10.0
I think you'll want to get a text parser to parse the format into a native Python data type, e.g. 1.00000E+1 means 1.0 × 10^1 = 10.0, which can be expressed as a float.
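For the per-column dtype question in the update: genfromtxt accepts a structured dtype, a list of (name, format) pairs, one per column. A sketch, assuming the tab-delimited file and the five columns from the sample above:
import numpy as np

# The timestamp stays a string ('U8' = unicode, up to 8 characters);
# the four sensor columns parse as floats, E-notation included.
data = np.genfromtxt(
    'myfile.txt',
    delimiter='\t',
    skip_header=1,
    dtype=[('time', 'U8'), ('T1_ModBt', float), ('T2_90Bend', float),
           ('T3_InPE', float), ('T5_Stg2Rfrg', float)],
)
print(data['T1_ModBt'])  # e.g. [211.58 209.43 ...]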
