Integers with NA in pandas from json_normalize - python

I do df = pandas.io.json.json_normalize(data) and I want to specify datatype of an integer column that has missing data. Since pandas doesn't have an integer NaN, I want to use dtype object, i.e. string. As far as I can tell, json_normalize doesn't have any dtype parameter. If I try to .astype(object) the columns afterwards, I'll end up with decimal points in the column.
How can I get the string format with no decimal point into this column?

There may be a more elegant solution, but this one should work for your specific task:
df['col'] = df['col'].astype(object).apply(lambda x: '%.f' % x)
Hope it helps.
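A minimal sketch of that pattern on toy data (note that missing values come out as the string 'nan', which you may want to replace afterwards):

```python
import numpy as np
import pandas as pd

# a float column with a missing value, as json_normalize would produce it
df = pd.DataFrame({'col': [1.0, 2.0, np.nan]})

# format each value without a decimal point; NaN becomes the string 'nan'
df['col'] = df['col'].apply(lambda x: '%.f' % x)

# optionally turn the 'nan' strings into empty strings
df['col'] = df['col'].replace('nan', '')
print(df['col'].tolist())  # ['1', '2', '']
```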


.tolist() converting pandas dataframe values to bytes

I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna() #gets rid of None value at the end of the column
print(ch104)
When I print ch104 I get the following.
But I cannot do math on it yet: it is a pandas.Series of strings, so the datatype is not correct.
The error if I do calculations is this:
can't multiply sequence by non-int of type 'float'
So what I tried to do is use .tolist() or even list() on the data, but then ch104 looks like this.
I believe the values are being read as bytes and then stored as a list of strings.
Does anyone know how I can get around this or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on but they would need to be a float or a double for that. What can I do?
You probably get this error because you're trying to make calculations on columns that pandas considers non-numeric. The values (numbers in your sense) are for some reason interpreted as strings (in the pandas sense).
To fix that, change the type of those columns using pandas.to_numeric:
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip()) # to get rid of the extra whitespace
import re
cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
ch104 = df.ch104.dropna()
# do your calculations on the ch104 series here
Note that the 'coerce' argument will put NaN instead of every bad value in your columns.
From the docs:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
If 'coerce', then invalid parsing will be set as NaN.
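Put together as a self-contained illustration (toy column names and values standing in for the ch/alarm columns of the question):

```python
import re
import pandas as pd

# toy frame mimicking numeric columns read in as strings with stray whitespace
df = pd.DataFrame({'ch104': [' 1.5', '2.0 ', 'bad'],
                   'alarm104': ['0', ' 1', '']})

# strip the extra whitespace from all string columns
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

# convert every ch*/alarm* column to numeric; bad values become NaN
cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

ch104 = df.ch104.dropna()
print(ch104 * 2.0)  # arithmetic now works
```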

Pandas adding NA changes column from float to object

I have a dataframe with a column that is of type float64. Then when I run the following it converts to object, which ends up messing with a downstream calculation where this is the denominator and there are some 0s:
df.loc[some criteria,'my_column'] = pd.NA
After this statement, my column is now of type object. Can someone explain why that is, and how to correct this? I'd rather do it right the first time rather than adding this to the end:
df['my_column'] = df['my_column'].astype(float)
Thanks.
In pandas, you can use either pd.NA or np.nan to represent missing values. np.nan is itself a float, so assigning it keeps the column as float64, which is what you want and what the documentation states. pd.NA is an extension-type scalar: assigning it to a NumPy-backed float column upcasts the column to object. It is intended for pandas' nullable extension dtypes such as 'Int64' or 'string'.
df.loc[some criteria,'my_column'] = np.nan
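A quick check of the dtype behaviour on made-up values (the filter condition here is just a stand-in for "some criteria"):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'my_column': [1.0, 2.0, 3.0]})

# assigning np.nan keeps the column float64
df.loc[df['my_column'] > 2, 'my_column'] = np.nan
print(df['my_column'].dtype)  # float64
print(df['my_column'].isna().sum())  # 1
```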

Dataframe.max() giving all NaNs

I have a pandas DataFrame with data that looks like this:
The data extends beyond what you can see here. I can't tell if the blue cells hold numeric or string data, but they should be numeric, since I produced those values by multiplication. But I don't know pandas well enough to be sure.
Anyway, I call .max(axis=1) on this dataframe, and it gives me this:
As far as I know, there are no empty cells or cells with weird data. So why am I getting all nan?
First convert all values to numeric by DataFrame.astype:
df = df.astype(float)
If that does not work, use to_numeric with errors='coerce' to turn non-numeric values into NaN:
df = df.apply(pd.to_numeric, errors='coerce')
And then compute the max:
print(df.max(axis=1))
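On toy data (column names made up), the coerce-then-max sequence looks like this:

```python
import pandas as pd

# strings that look numeric, plus one genuinely bad cell
df = pd.DataFrame({'a': ['1.5', '2.5'], 'b': ['3.0', 'oops']})

# 'oops' becomes NaN; max() skips NaNs by default
df = df.apply(pd.to_numeric, errors='coerce')
print(df.max(axis=1))  # 3.0, 2.5
```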

Error when converting pandas data frame columns from string with comma to float with dot

I am trying to convert a pandas data frame (read in from a .csv file) from string to float. Columns 1 through 20 are recognized as "strings" by the system. However, they are float values in the format "10,93847722". Therefore, I tried the following code:
new_df = df[df.columns[1:20]].transform(lambda col: col.str.replace(',','.').astype(float))
The last line causes the Error:
AttributeError: 'DataFrame' object has no attribute 'transform'
Maybe important to know: I can only use pandas version 0.16.2.
Thank you very much for your help!
#all: Short extract from one of the columns
23,13854599
23,24945831
23,16853714
23,0876255
23,05908775
Use DataFrame.apply (DataFrame.transform was only added in pandas 0.20.0, so it does not exist in 0.16.2):
df[df.columns[1:20]].apply(lambda col: col.str.replace(',','.').astype(float))
EDIT: If non-numeric values are possible, use to_numeric with errors='coerce' to replace them with NaN:
df[df.columns[1:20]].apply(lambda col: pd.to_numeric(col.str.replace(',','.'),errors='coerce'))
You should load them directly as numbers:
pd.read_csv(..., decimal=',')
This will recognize , as decimal point for every column.
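A minimal round trip showing the decimal=',' option (an inline CSV string as a stand-in for the real file; the column names are made up):

```python
import io
import pandas as pd

csv = "name;value\nfoo;23,13854599\nbar;23,24945831\n"

# ',' is read as the decimal point, so 'value' comes in as float64
df = pd.read_csv(io.StringIO(csv), sep=';', decimal=',')
print(df['value'].dtype)  # float64
```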

How to sum up numbers in a DataFrame containing mixed numbers and strings

I am trying to sum up numbers (integers or floats) in a DataFrame column that contains mixed numbers and strings.
Actually I am new to pandas and I just tried the easy, straightforward way.
Assuming the following:
e = pd.DataFrame(np.random.randn(5,1))
e[0][1]='x'
e[0][4]='y'
Doing
e[0].sum()
certainly returns a type error.
I suspect I need to replace the strings with pd.np.nan, but how?
Use pandas.to_numeric:
import pandas as pd
pd.to_numeric(e[0], 'coerce').sum()
Output:
-1.8781900945531884
The 'coerce' option sets any invalid element to NaN, and pd.Series.sum by default excludes NaNs from its calculation.
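The same idea on fixed values instead of random ones, so the result is reproducible:

```python
import pandas as pd

# a column mixing numbers and strings, as in the question
s = pd.Series([1.0, 'x', 2.5, 'y'])

# non-numeric entries become NaN and are skipped by sum()
total = pd.to_numeric(s, errors='coerce').sum()
print(total)  # 3.5
```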
