prevent pandas read_excel function from automatically converting boolean columns into float64 - python

I have an Excel file which contains all the data I need to read into memory. Each row is a data sample and each column is a feature. I am using the pandas.read_excel() function to read it.
The problem is that this function automatically converts some boolean columns into float64 type. I manually checked some columns. Only the columns with missing values are converted. The columns without missing values are still bool.
My question is: how can I prevent the read_excel() function from automatically converting boolean columns into float64?
Here is my code snippet:
>>> fp = open('myfile.xlsx', 'rb')
>>> df = pd.read_excel(fp, header=0)
>>> df['BooleanFeature'].dtype
dtype('float64')
Here BooleanFeature is a boolean feature, but with some missing values.
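A minimal sketch of one possible workaround, assuming pandas 1.0+ and its nullable 'boolean' dtype (this is not stated in the question, only the file and column names are); with that extension dtype, missing cells become pd.NA instead of forcing the column to float64:

import pandas as pd

# Sketch: ask read_excel to use the nullable boolean dtype for the affected
# column. Assumes pandas >= 1.0; 'BooleanFeature' and 'myfile.xlsx' are taken
# from the question.
df = pd.read_excel('myfile.xlsx', header=0,
                   dtype={'BooleanFeature': 'boolean'})
print(df['BooleanFeature'].dtype)  # expected: boolean (nullable), not float64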

Related

How to make sure all columns are read as string in pandas

I had thought this would make all values be read as strings, but it doesn't:
df = pd.read_csv(file, sep='\t', dtype=str, low_memory=False)
Because when I do this:
for index, row in df.iterrows():
    id_value = row['id']
    ...
My error message says that 'id_value' is a float, which can't do str concatenation.
Why can't dtype=str achieve that in the dataframe?
According to the read_csv documentation, you have to set both dtype=str and na_values=""
Use str or object together with suitable na_values settings to preserve and not interpret dtype.
NaN is a float type (unless converting to the new pandas.NA), so if you have missing values, this is likely the origin of your error.
Also, I am not sure which operation you want to do, but if you make it vectorized (i.e. not using iterrows) this should handle the NaNs automatically.
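A short sketch of both points, assuming a tab-separated file with an 'id' column as in the question; the file name, the 'sample_' prefix, and the 'id_label' column are illustrative only:

import pandas as pd

# Sketch: keep_default_na=False stops pandas from turning empty cells into
# float NaN, so with dtype=str every value stays a str.
df = pd.read_csv('file.tsv', sep='\t', dtype=str, keep_default_na=False)

# Vectorized concatenation instead of iterrows; operates on the whole column.
df['id_label'] = 'sample_' + df['id']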

Pandas adding extra zeros to decimal value

I'm importing a file into a pandas dataframe, but the dataframe is not retaining the original values as is; instead it's adding extra zeros to some float columns.
For example, the original value in the file is 23.84, but when I import it into the dataframe it has a value of 23.8400.
How can I fix this? Or is there a way to import the original values into the dataframe as they appear in the text file?
For anyone who encounters the same problem, I'm adding the solution I found. Pandas read_csv has a dtype argument where we can tell pandas to read all columns as strings, so it reads the data as is and doesn't interpret it based on its own logic.
df1 = pd.read_csv('file_location', sep=',', dtype={'col1': 'str', 'col2': 'str'})
I had too many columns so I first created a dictionary with all the columns as keys and 'str' as their values and passed this dictionary to the dtype argument.
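A sketch of that dictionary approach, assuming a comma-separated file; 'file_location' stands in for the actual path:

import pandas as pd

# Read only the header row to get the column names, then map every column to str.
cols = pd.read_csv('file_location', sep=',', nrows=0).columns
df1 = pd.read_csv('file_location', sep=',', dtype=dict.fromkeys(cols, str))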

Pandas - Parallelizing astype function

I'm dealing with a huge dataset with lots of features. Those features are actually int type, but since they have np.nan values, pandas assigns float64 type to them.
I'm casting those features to float32 by iterating over every single column. It takes about 10 minutes to complete. Is there any way to speed up this operation?
The data is read from a csv file. There are object and int64 columns in the data.
for col in float_cols:
    df[col] = df[col].astype(np.float32)
Use the dtype parameter with a dictionary in read_csv:
df = pd.read_csv(file, dtype=dict.fromkeys(float_cols, np.float32))
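If the data is already in memory, a sketch of an alternative is a single astype call with a dict, which casts all the listed columns at once instead of looping column by column; the small DataFrame and float_cols below are stand-ins for the ones in the question:

import numpy as np
import pandas as pd

# Stand-in data: two float columns with NaNs plus an object column.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, np.nan],
                   'c': ['x', 'y', 'z']})
float_cols = ['a', 'b']

# One vectorized cast for all listed columns.
df = df.astype(dict.fromkeys(float_cols, np.float32))
print(df.dtypes)  # a and b are float32, c stays object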

Pandas read_sql_query converting integer column to float

I have the following line
df = pandas.read_sql_query(sql = sql_script, con=conn, coerce_float = False)
that pulls data from Postgres using a SQL script. Pandas keeps setting some of the columns to type float64. They should be just int. These columns contain some null values. Is there a way to pull the data without having pandas set them to float64?
Thanks!
As per the documentation, the lack of an NA representation in NumPy means integer NA values can't be represented, so pandas promotes int columns to float.
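A sketch of a post-read workaround, assuming a pandas version that provides the nullable Int64 dtype: it can hold integers alongside missing values, so the column no longer has to stay float64. The DataFrame and 'int_col' below are stand-ins for the read_sql_query output:

import pandas as pd

# Stand-in for the query result: an int column promoted to float64 by a null.
df = pd.DataFrame({'int_col': [1.0, None, 3.0]})

# Cast to the nullable integer dtype; the null becomes pd.NA.
df['int_col'] = df['int_col'].astype('Int64')
print(df['int_col'].dtype)  # Int64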

Inconsistent pandas read_csv dtype inference on mostly-integer string column in huge TSV file

I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files, read_csv correctly interprets the column as a string after seeing some non-integer values, but with larger files this doesn't work:
import pandas as pd
df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000, 'b':['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print df2['a'].unique()
for a in df2['a'][262140:262150]:
    print repr(a)
output:
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
Interestingly, 262144 is a power of 2, so I think inference and conversion are happening in chunks but some chunks are being skipped.
I am fairly certain this is a bug, but would like a workaround that perhaps uses quoting, though adding
quoting=csv.QUOTE_NONNUMERIC
for reading and writing does not fix the problem. Ideally I could work around this by quoting my string data and somehow force pandas to not do any inference on quoted data.
Using pandas 0.12.0
To avoid having Pandas infer your data type, provide a converters argument to read_csv:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels
For your file this would look like:
df2 = pd.read_csv('test', sep='\t', converters={'a':str})
My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the datatype of unspecified columns.
You've tricked the read_csv parser here (and to be fair, I don't think it can always be expected to output correctly no matter what you throw at it)... but yes, it could be a bug!
As @Steven points out you can use the converters argument of read_csv:
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
A lazy solution is just to patch this up after you've read in the file:
In [11]: df2['a'] = df2['a'].astype('str')
# now they are equal
In [12]: pd.util.testing.assert_frame_equal(df, df2)
Note: If you are looking for a solution to store DataFrames, e.g. between sessions, both pickle and HDF5Store are excellent solutions which won't be affected by these types of parsing bugs (and will be considerably faster). See: How to store data frame using PANDAS, Python
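A sketch of the pickle route mentioned above, using a small stand-in DataFrame: serializing the frame preserves dtypes exactly, so reloading involves no CSV parsing or inference at all.

import pandas as pd

# Stand-in for the question's DataFrame of string values.
df = pd.DataFrame({'a': ['1'] * 3 + ['X'] * 3, 'b': ['b'] * 6})

# Round-trip through pickle; dtypes and values survive unchanged.
df.to_pickle('test.pkl')
df2 = pd.read_pickle('test.pkl')
print(df2['a'].dtype)  # object; the strings are preserved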
