I have a dataframe with a column of type float64. When I run the following, the column converts to object, which ends up breaking a downstream calculation where this column is the denominator and there are some 0s:
df.loc[some criteria,'my_column'] = pd.NA
After this statement, my column is now of type object. Can someone explain why that is, and how to correct this? I'd rather do it right the first time rather than adding this to the end:
df['my_column'] = df['my_column'].astype(float)
Thanks.
In pandas, you can use pd.NA or np.nan to represent missing values. np.nan is itself a float, so assigning it keeps the column as float64, which is what you want and what the documentation describes. pd.NA, on the other hand, is pandas' own missing-value marker: assigning it into a NumPy-backed float64 column upcasts the column to object. It is best used with pandas' nullable dtypes, such as string columns.
df.loc[some criteria,'my_column'] = np.nan
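A minimal sketch of the difference on a toy column (the exact upcasting behaviour can vary across pandas versions; the object upcast is the one described in the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({"my_column": [1.0, 2.0, 3.0]})

# Assigning np.nan keeps the column numeric
df.loc[df["my_column"] > 2, "my_column"] = np.nan
print(df["my_column"].dtype)  # float64

# Assigning pd.NA upcasts the NumPy-backed float column to object
df2 = pd.DataFrame({"my_column": [1.0, 2.0, 3.0]})
df2.loc[df2["my_column"] > 2, "my_column"] = pd.NA
print(df2["my_column"].dtype)  # object (as described in the question)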
Related
I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna() #gets rid of None value at the end of the column
print(ch104)
When I print ch104, I get the following.
But I cannot do math on it currently, as it is a pandas.Series of strings; the datatype is not correct yet.
The error if I do calculations is this:
can't multiply sequence by non-int of type 'float'
So what I tried to do is use .tolist() or even list() on the data, but then ch104 looks like this.
I believe the values are being written as bytes and then stored as a list of strings.
Does anyone know how I can get around this or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on but they would need to be a float or a double for that. What can I do?
You probably get this error because you're trying to do calculations on columns that pandas considers non-numeric: the values (numbers in your sense) are for some reason interpreted as strings (in the pandas sense).
To fix that, you can change the type of those columns using pandas.to_numeric:
import re

df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())  # get rid of the extra whitespace

cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
ch104 = df.ch104.dropna()
# do your calculations on the ch104 series here
Note that errors='coerce' will replace every bad value in your columns with NaN.
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    If 'coerce', then invalid parsing will be set as NaN.
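For instance, with a series containing one unparsable entry (toy data for illustration):

import pandas as pd

s = pd.Series(['1.5', '2.0', 'bad'])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.5
# 1    2.0
# 2    NaN
# dtype: float64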
I would like to iterate through a dataframe and create a new column with the index returned by enumerate(), but I'm unable to assign the value as an integer and need to convert it later. Is there a solution to do that in one shot?
As you can see, direct assignment of an integer fails.
df.loc[indexes, ('route', 'id')] = int(i)
print(df.loc[indexes, ('route', 'id')].dtypes) # float64
Conversion with a second line of code is necessary:
df.loc[indexes, ('route', 'id')] = df.loc[indexes, ('route', 'id')].astype(int)
print(df.loc[indexes, ('route', 'id')].dtypes) # int64
This link shows you how to assign the value of a column to an int type in pandas.
Basically you can do this by using:
to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).
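A minimal sketch of all four on a toy frame (column names are illustrative):

import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [1.0, 2.0, None]})

df['a'] = pd.to_numeric(df['a'])     # parse strings -> int64
df['a'] = df['a'].astype('float64')  # explicit cast to another type
df = df.infer_objects()              # best-effort conversion of object columns
df = df.convert_dtypes()             # nullable dtypes backed by pd.NA
print(df.dtypes)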
I do df = pandas.io.json.json_normalize(data) and I want to specify datatype of an integer column that has missing data. Since pandas doesn't have an integer NaN, I want to use dtype object, i.e. string. As far as I can tell, json_normalize doesn't have any dtype parameter. If I try to .astype(object) the columns afterwards, I'll end up with decimal points in the column.
How can I get the string format with no decimal point into this column?
There should be a more elegant solution, but this one should work for your specific task:
df['col'] = df['col'].astype(object).apply(lambda x: '%.f' % x)  # format each value with no decimal places
Hope it helps.
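For example, on a toy column (note that a missing value gets formatted as the literal string 'nan', which you may need to handle separately):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, np.nan, 30.0]})
df['col'] = df['col'].astype(object).apply(lambda x: '%.f' % x)
print(df['col'].tolist())  # ['1', 'nan', '30']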
I want to read a dataframe with a given datatype and missing values, but the following code fails. I have no idea why this happens!
from io import StringIO
import pandas as pd

myText = StringIO(r"""1,2
3,\N
5,6""")
myDf = pd.read_csv(myText, header=None, names=["a1", "a2"],
                   na_values=[r"\N"], dtype={"a1": "int", "a2": "int"})
I got the error message:
ValueError: Integer column has NA values in column 1
If I remove the dtype option dtype={"a1":"int", "a2":"int"}, it works fine. Do integer columns not allow missing values?
Integer doesn't allow missing values. Float allows missing values. If you need it to be integers, you'll need to use a sentinel for the missing ones, like 0 or 99999999 or something (not recommended). Otherwise, use a type like float64 that allows out-of-band values like NaN.
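A minimal sketch of that route, plus the nullable "Int64" dtype (capital I), a newer pandas feature that does allow missing values if your version supports it:

from io import StringIO
import pandas as pd

myText = StringIO(r"""1,2
3,\N
5,6""")

# Route 1: read the gappy column as float64, which tolerates NaN
df = pd.read_csv(myText, header=None, names=["a1", "a2"],
                 na_values=[r"\N"], dtype={"a1": "int", "a2": "float64"})

# Route 2: the nullable Int64 dtype stores integers alongside pd.NA
myText.seek(0)
df2 = pd.read_csv(myText, header=None, names=["a1", "a2"],
                  na_values=[r"\N"], dtype={"a1": "int", "a2": "Int64"})
print(df.dtypes, df2.dtypes, sep="\n")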
I have a dataframe like this:
This dataframe has several columns. Two are of type float: price and change, while volume and amount are of type int.
I use the method df.values.tolist() to change df to a list and get the data:
datatmp = df.values.tolist()
print(datatmp[0])
[20160108150023.0, 11.12, -0.01, 4268.0, 4746460.0, 2.0]
The int types in df all change to float types.
My question is why do int types change to the float types? How can I get the int data I want?
You can convert column-by-column:
by_column = [df[x].values.tolist() for x in df.columns]
This will preserve the data type of each column.
Then convert to the structure you want:
list(list(x) for x in zip(*by_column))
You can do it in one line:
list(list(x) for x in zip(*(df[x].values.tolist() for x in df.columns)))
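For example, on a toy two-column frame:

import pandas as pd

df = pd.DataFrame({'price': [11.12, 11.13], 'volume': [4268, 4270]})
by_column = [df[x].values.tolist() for x in df.columns]
rows = list(list(x) for x in zip(*by_column))
print(rows)  # [[11.12, 4268], [11.13, 4270]] -- ints stay ints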
You can check what datatypes your columns have with:
df.info()
Very likely your column amount is of type float. Do you have any NaN in this column? These are always of type float and would make the whole column float.
You can cast to int with:
df.values.astype(int).tolist()
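A quick sketch of what that does; note it casts every column, so float columns like price get truncated too:

import pandas as pd

df = pd.DataFrame({'price': [11.12], 'volume': [4268]})
print(df.values.astype(int).tolist())  # [[11, 4268]] -- price truncated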
I think the pandas documentation helps:
DataFrame.values
Numpy representation of NDFrame
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
So here apparently float is chosen to accommodate all component types. A simple method would be (though quite possibly there are more elegant solutions around; I'm not too familiar with pandas):
datatmp = list(map(lambda row: list(row[1:]), df.itertuples()))  # outer list() needed in Python 3, where map is lazy
Here itertuples() gives an iterator with elements of the form (rownumber, column1_entry, column2_entry, ...). The map takes each such tuple and applies the lambda, which removes the first component (the row number) and returns a list containing the components of a single row. You can also drop the inner list() call if it's OK for you to work with a list of tuples.
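A short usage example with toy data:

import pandas as pd

df = pd.DataFrame({'price': [11.12], 'volume': [4268]})
datatmp = list(map(lambda row: list(row[1:]), df.itertuples()))
print(datatmp)  # [[11.12, 4268]] -- each value keeps its column's type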
[DataFrame.values property](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html#pandas.DataFrame.values)