How to make sure all columns are read as string in pandas - python

I thought this would make all values be read as strings, but it doesn't:
df = pd.read_csv(file, sep='\t', dtype=str, low_memory=False)
Because when I do this:
for index, row in df.iterrows():
id_value = row['id']
...
My error message says that 'id_value' is a float, which doesn't support string concatenation.
Why doesn't dtype=str achieve that in the DataFrame?

According to the read_csv documentation, you have to set both dtype=str and na_values=""
Use str or object together with suitable na_values settings to preserve and not interpret dtype.
NaN is a float type (unless converted to the new pandas.NA), so if you have missing values, this is likely the origin of your error.
Also, I am not sure which operation you want to do, but if you make it vectorized (i.e. not using iterrows) this should handle the NaNs automatically.
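As a minimal sketch (assuming the missing cells should simply stay as empty strings), you can disable NA parsing entirely so that nothing is ever converted to a float NaN:
import pandas as pd

# keep_default_na=False stops pandas from turning empty cells into NaN,
# so together with dtype=str every value stays a plain string
df = pd.read_csv(file, sep='\t', dtype=str, keep_default_na=False)

# vectorized string concatenation, no iterrows needed
df['id_plus'] = df['id'] + '_suffix'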

Related

.tolist() converting pandas dataframe values to bytes

I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna() #gets rid of None value at the end of the column
print(ch104)
When I print ch104 I get the following.
But I cannot do math on it yet, because it is a pandas.Series of strings; the datatype is not correct.
The error if I do calculations is this:
can't multiply sequence by non-int of type 'float'
So what I tried to do is use .tolist() or even list() on the data, but then ch104 looks like this.
I believe the values are now being written as bytes and then stored as a list of strings.
Does anyone know how I can get around this or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on but they would need to be a float or a double for that. What can I do?
You probably get this error because you're trying to perform calculations on columns that pandas considers non-numeric. The values (numbers, from your point of view) are for some reason interpreted as strings (from pandas' point of view).
To fix that, you can change the type of those columns with pandas.to_numeric:
import re

# strip the extra whitespace from all string (object) columns
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

# select the channel and alarm columns and convert them to numbers
cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

ch104 = df.ch104.dropna()
# do your calculations on the ch104 series here
Note that errors='coerce' will replace every value that cannot be parsed with NaN. From the to_numeric documentation:
errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
If ‘coerce’, then invalid parsing will be set as NaN.
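For example, on a small hypothetical series, the coercion looks like this:
s = pd.Series([' 1.5', '2.0', 'bad'])
print(pd.to_numeric(s.str.strip(), errors='coerce'))
# 0    1.5
# 1    2.0
# 2    NaN
# dtype: float64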

Groupby year dropping some variables

This is the original data, and I need the per-year mean of all the variables.
But when I am using groupby('year') command, it is dropping all variables except 'lnmcap' and 'epu'.
Why this is happening and what needs to be done?
Probably the other columns hold object (string) data instead of numeric data, which is why only 'lnmcap' and 'epu' get an average. Use ds.dtypes or simply ds.info() to check the data types of the columns.
If they come out as object/string type, then use:
ds = ds.drop('company', axis=1)
for col in ds.columns:
    ds[col] = ds[col].astype(str).astype(float)
This could work
You might want to convert all numerical columns to float before getting their mean, for example
cols = list(ds.columns)
# remove irrelevant columns
cols.pop(cols.index('company'))
cols.pop(cols.index('year'))
# convert the remaining relevant columns to float
for col in cols:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')
# after that you can apply the aggregation
ds.groupby('year').mean()
You will need to convert the numeric columns to float types. Use df.info() to check the various data types.
for col in ds.select_dtypes(['object']).columns:
    try:
        ds[col] = ds[col].astype('float')
    except ValueError:
        continue
After this, use df.info() to check again. Those columns with objects like '1.604809' will be converted to float 1.604809
Sometimes a column may contain some "dirty" data that cannot be converted to float. In this case, you can use the code below; errors='coerce' means non-numeric data becomes NaN.
column_names = list(ds.columns)
column_names.remove('company')
column_names.remove('year')
for col in column_names:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')  # non-numeric values become NaN
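Putting it together, a more compact equivalent of that pipeline (a sketch, assuming ds is the DataFrame from the question) would be:
numeric_cols = ds.columns.difference(['company', 'year'])
ds[numeric_cols] = ds[numeric_cols].apply(pd.to_numeric, errors='coerce')
yearly_means = ds.groupby('year')[list(numeric_cols)].mean()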

Pandas Python: read_excel: Is it possible to only read the column headers in a certain data type (read the headers as strings)?

I have a dataset in which some of the column names are numbers (integers or fractions). I want to keep the names as they are, but read_excel turns all of them into floats.
Can I declare only the headers as strings (with some combination of header and dtype)?
In this call, I want to make the column headers str:
df = pd.read_excel('file.xlsx',
                   sheet_name='sheet1',
                   index_col=None,
                   dtype=str,
                   engine='openpyxl')
If this is not possible, can I make the 0th row strings (no header, so the first row holds the column names) while reading the data, which would give me the column names as strings?
Thanks
Try it this way:
# with this setting your header will be pushed down to become the first row
df = pd.read_excel('file.xlsx', header=None)
# use the first row to set the column names, then drop that row
df = df.rename(columns=df.iloc[0]).drop(df.index[0])
# reset the index
df.reset_index(drop=True, inplace=True)
Note: the dtype keyword specifies the data type of the data in the entire DataFrame, or, if a dict is passed, of individual columns; it does not apply to the header row.
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}. Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
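If you do want string column names straight from the file, a possible sketch (assuming pandas >= 0.23, which added nrows to read_excel) is to read the header row separately, force it to strings, and pass it back via names:
# read only the header row, untyped, and force it to strings
header = pd.read_excel('file.xlsx', sheet_name='sheet1',
                       header=None, nrows=1).iloc[0].astype(str).tolist()
# read the data below it, using those strings as column names
df = pd.read_excel('file.xlsx', sheet_name='sheet1',
                   header=None, skiprows=1, names=header)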

prevent pandas read_excel function from automatic converting boolean columns into float64

I have an Excel file which contains all the data I need to read into memory. Each row is a data sample and each column is a feature. I am using the pandas.read_excel() function to read it.
The problem is that this function automatically converts some boolean columns into the float64 type. I manually checked some columns: only the columns with missing values are converted; the columns without missing values are still bool.
My question is: how can I prevent read_excel() from automatically converting boolean columns into float64?
Here is my code snippet:
>>> fp = open('myfile.xlsx', 'rb')
>>> df = pd.read_excel(fp, header=0)
>>> df['BooleanFeature'].dtype
dtype('float64')
Here BooleanFeature is a boolean feature, but with some missing values.
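One possible workaround (a sketch, not tested against this file, assuming pandas >= 1.0 and its nullable 'boolean' extension dtype) is to request that dtype explicitly, so missing cells become pd.NA instead of forcing the column to float:
# 'boolean' is pandas' nullable boolean dtype; missing cells become pd.NA
df = pd.read_excel('myfile.xlsx', header=0,
                   dtype={'BooleanFeature': 'boolean'})
df['BooleanFeature'].dtype  # BooleanDtype instead of float64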

Inconsistent pandas read_csv dtype inference on mostly-integer string column in huge TSV file

I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files, read_csv correctly interprets the column as a string after seeing some non-integer values, but with larger files this doesn't work:
import pandas as pd

df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print(df2['a'].unique())
for a in df2['a'][262140:262150]:
    print(repr(a))
output:
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
Interestingly, 262144 is a power of 2, so I think inference and conversion are happening in chunks but skipping some chunks.
I am fairly certain this is a bug, but would like a workaround that perhaps uses quoting, though adding
quoting=csv.QUOTE_NONNUMERIC
for reading and writing does not fix the problem. Ideally I could work around this by quoting my string data and somehow force pandas to not do any inference on quoted data.
Using pandas 0.12.0
To avoid having Pandas infer your data type, provide a converters argument to read_csv:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
For your file this would look like:
df2 = pd.read_csv('test', sep='\t', converters={'a':str})
My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the datatype of unspecified columns.
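A quick check (hypothetical output, just to confirm the converter takes effect):
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
print(df2['a'].unique())  # ['1' 'X'] - every value stays a string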
You've tricked the read_csv parser here (and to be fair, I don't think it can always be expected to output correctly no matter what you throw at it)... but yes, it could be a bug!
As #Steven points out you can use the converters argument of read_csv:
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
A lazy solution is just to patch this up after you've read in the file:
In [11]: df2['a'] = df2['a'].astype('str')
# now they are equal
In [12]: pd.util.testing.assert_frame_equal(df, df2)
Note: If you are looking for a solution to store DataFrames, e.g. between sessions, both pickle and HDF5Store are excellent solutions which won't be affected by these type of parsing bugs (and will be considerably faster). See: How to store data frame using PANDAS, Python
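For instance, a minimal round-trip sketch using pickle (HDFStore works similarly but requires PyTables):
df.to_pickle('test.pkl')
df2 = pd.read_pickle('test.pkl')  # dtypes come back exactly as stored
pd.util.testing.assert_frame_equal(df, df2)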
