Pandas: convert to numeric with callback - python

I'm converting columns & cell values of a dataframe to float64 using either:
df.infer_objects()
or
df.apply(pd.to_numeric)
The first keeps the columns that are not convertible as object dtype, while the second raises an exception if some values cannot be converted. My question is whether it's somehow possible to supply my own error/converter callback function. Something like this:
def my_converter(value: object) -> float:
    # add all your "known" value conversions and fallbacks
    converted_value = float(value)
    return converted_value

df.apply(pd.to_numeric, converter=my_converter)

I don't know of a way to concisely do exactly what you're asking. That would be a nice addition to the API, though. Here is something that will work; it's a little convoluted.
Set pd.to_numeric to not raise an exception but instead return NaN, using the errors parameter. Using the locations of those NaNs, apply your special converter function. Then use add with the parameter fill_value=0 to combine the DataFrame converted with pd.to_numeric and the one converted with your special converter.
You can find some information in the docs for to_numeric and add
It would look something like this.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(5),
                   'B': np.random.randn(5)})
A B
0 -0.619165 1.310489
1 0.908564 1.017284
2 0.046072 -1.059349
3 1.123730 -2.229261
4 0.689580 -0.200981
df1 = df[df < 1] # This is your DataFrame from df.apply(pd.to_numeric, errors='coerce')
df1
A B
0 -0.619165 NaN
1 0.908564 NaN
2 0.046072 -1.059349
3 NaN -2.229261
4 0.689580 -0.200981
df2 = df[df1.isnull()]
df2 # This is the DataFrame to which you would apply your converter: df2.apply(my_converter)
df2 = df2.apply(lambda x: x*10) # This is my dummy special converter
df2
A B
0 NaN 13.104885
1 NaN 10.172835
2 NaN NaN
3 11.237296 NaN
4 NaN NaN
df1.add(df2, fill_value=0) # This is the final dataframe you're looking for
A B
0 -0.619165 13.104885
1 0.908564 10.172835
2 0.046072 -1.059349
3 11.237296 -2.229261
4 0.689580 -0.200981
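Putting the pieces together with pd.to_numeric itself, a minimal sketch might look like this (the sample strings and the fallback table inside my_converter are hypothetical, only there to illustrate the pattern):
import pandas as pd
import numpy as np

def my_converter(value: object) -> float:
    # add all your "known" value conversions and fallbacks here
    known = {'one': 1.0, 'n/a': np.nan}  # hypothetical fallback table
    try:
        return float(value)
    except (TypeError, ValueError):
        return known.get(str(value).strip().lower(), np.nan)

df = pd.DataFrame({'A': ['1.5', 'one', '3'], 'B': ['2', 'n/a', 'oops']})

converted = df.apply(pd.to_numeric, errors='coerce')       # NaN wherever conversion failed
fallback = df[converted.isnull()].applymap(my_converter)   # run the converter only on those cells (DataFrame.map on pandas >= 2.1)
result = converted.add(fallback, fill_value=0)             # merge the two frames
print(result)
     A    B
0  1.5  2.0
1  1.0  NaN
2  3.0  NaN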


Pandas DataFrame Isnull Multiple Columns at Once

Given the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[np.nan,1,2],'b':[np.nan,np.nan,4]})
a b
0 NaN NaN
1 1.0 NaN
2 2.0 4.0
How do I return rows where both columns 'a' and 'b' are null without having to use pd.isnull for each column?
Desired result:
a b
0 NaN NaN
I know this works (but it's not how I want to do it):
df.loc[pd.isnull(df['a']) & pd.isnull(df['b'])]
I tried this:
df.loc[pd.isnull(df[['a', 'b']])]
...but got the following error:
ValueError: Cannot index with multidimensional key
Thanks in advance!
You are close:
df[pd.isnull(df[['a', 'b']]).all(1)]
Or
df[df[['a','b']].isna().all(1)]
How about:
df.dropna(subset=['a','b'], how='all')
With your shown samples, please try the following, using the isnull function:
mask1 = df['a'].isnull()
mask2 = df['b'].isnull()
df[mask1 & mask2]
The answer above creates 2 variables for better understanding. If you want to use the conditions inside df itself and don't want to create the condition variables (mask1 and mask2 in this case), then try the following.
df[df['a'].isnull() & df['b'].isnull()]
Output will be as follows.
a b
0 NaN NaN
You can use dropna() with the parameter how='all':
df.dropna(how='all')
Output:
a b
1 1.0 NaN
2 2.0 4.0
Since the question was updated: you can create a mask using either df.isnull() or df.isna() and filter accordingly.
df[df.isna().all(axis=1)]
a b
0 NaN NaN

Find any negative values in a given set of dataframes and replace whole column with np.nan

I have a large dataframe and I want to search 144 of the columns to check if there are any negative values in them. If there is even one negative value in a column, I want to replace the whole column with np.nan. I then want to use the new version of the dataframe for later analysis.
I've tried a variety of methods but can't seem to find one that works. I think this is almost there, but I can't find a solution to what I'm trying to do.
clean_data_df.loc[clean_data_df.cols < 0, cols] = np.nan #cols is a list of the column names I want to check
null_columns=clean_data_df.columns[clean_data_df.isnull().any(axis=1)]
clean_data_df[null_columns] = np.nan
When I run the above code I get the following error: AttributeError: 'DataFrame' object has no attribute 'cols'
Thanks in advance!
You could use a loop to iterate over the columns and blank out any column that contains a negative value:
for i in cols:
    if (df[i] < 0).any():
        df[i] = np.nan
Minimum reproducible example:
df = pd.DataFrame({'a':[1,-2,3], 'b':[-1,1,2], 'c':[1,2,3]})
for i in df:
    if (df[i] < 0).any():
        df[i] = np.nan
print(df)
Output:
a b c
0 NaN NaN 1
1 NaN NaN 2
2 NaN NaN 3
The idea is to test only the selected cols columns with DataFrame.lt and DataFrame.any, then extend the resulting boolean mask to all the other columns (filled with False) using Series.reindex, and finally set the values with DataFrame.loc, where the first : means all rows:
df = pd.DataFrame({'a':list('abc'), 'b':[-2,-1,-3],'c':[1,2,3]})
cols = ['b','c']
df.loc[:, df[cols].lt(0).any().reindex(df.columns, fill_value=False)] = np.nan
print(df)
a b c
0 a NaN 1
1 b NaN 2
2 c NaN 3
Detail:
print(df[cols].lt(0).any())
b True
c False
dtype: bool
print (df[cols].lt(0).any().reindex(df.columns, fill_value=False))
a False
b True
c False
dtype: bool
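An equivalent way to express the same idea (a sketch, assuming cols again holds the names of the columns to check) is to pick out the offending columns first and then assign NaN to them in one step:
# columns (within cols) that contain at least one negative value
bad_cols = df[cols].columns[df[cols].lt(0).any()]
df[bad_cols] = np.nan
print(df)
   a   b  c
0  a NaN  1
1  b NaN  2
2  c NaN  3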

Boolean values are turned into floats when appending two pandas dataframes

Appending two pandas dataframes has an unexpected behavior when one of the dataframes has a column with all null values (NaN) and the other one has boolean values in the same column.
The corresponding column in the resulting (from appending) dataframe is typed as float64 and the boolean values are turned into ones and zeros based on their original boolean values.
Example:
df1 = pd.DataFrame(data = [[1, 2, True], [10, 20, False]], columns=['a', 'b', 'c'])
df1
a b c
0 1 2 True
1 10 20 False
df2 = pd.DataFrame(data = [[1,2], [10,20]], columns=['a', 'b'])
df2['c'] = np.nan
df2
a b c
0 1 2 NaN
1 10 20 NaN
Appending:
df1.append(df2)
a b c
0 1 2 1.0
1 10 20 0.0
0 1 2 NaN
1 10 20 NaN
My workaround is to cast the column back to bool, but this turns the NaN values into True:
appended_df = df1.append(df2)
appended_df
a b c
0 1 2 1.0
1 10 20 0.0
0 1 2 NaN
1 10 20 NaN
appended_df['c'] = appended_df.c.astype(bool)
appended_df
a b c
0 1 2 True
1 10 20 False
0 1 2 True
1 10 20 True
Unfortunately, the pandas append documentation doesn't mention the problem. Any idea why pandas behaves this way?
Mixed element types in a DataFrame column are not allowed; see this discussion: Mixed types of elements in DataFrame's column.
The type of np.nan is float, so all the boolean values are cast to float when appending. To avoid this, you can change the type of the 'c' column to 'object' using .astype():
df1['c'] = df1['c'].astype(dtype='object')
df2['c'] = df2['c'].astype(dtype='object')
Then the append command has the desired result. However, as stated in the discussion mentioned above, having multiple types in the same column is not recommended. If instead of np.nan you use None, which is the NoneType object, you don't need to go through the type definition yourself. For the difference between NaN (Not a Number) and None types, see What is the difference between NaN and None?
You should think of what the 'c' column really represents, and choose the dtype accordingly.
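A minimal sketch of the None approach described above, using the question's frames (pd.concat is used here because DataFrame.append has since been deprecated and removed):
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2, True], [10, 20, False]], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(data=[[1, 2], [10, 20]], columns=['a', 'b'])
df2['c'] = None   # None (NoneType) keeps 'c' as an object column

print(pd.concat([df1, df2]))   # equivalent to df1.append(df2) on older pandas
    a   b      c
0   1   2   True
1  10  20  False
0   1   2   None
1  10  20   None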
You can use convert_dtypes if you are using pandas 1.0.0 or above. Refer to the linked docs for a description of convert_dtypes.
Solution code:
df1 = df1.convert_dtypes()
print(df1.append(df2))

Count all NaNs in a pandas DataFrame

I'm trying to count the NaN elements (data type class 'numpy.float64') in a pandas Series (whose data type is class 'pandas.core.series.Series') to know how many there are.
This is for counting null values in a pandas Series:
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
I expected the output of oc.count("NaN") to be 7, but instead it shows 'Level NaN must be same as name (None)'.
The argument to count isn't what you want counted (it's actually the axis name or index).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
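For example, on a small made-up frame (a sketch):
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, np.nan, 2.0]})
df.isna().values.sum()      # 3 -> NaNs in the whole DataFrame
len(df) - df['y'].count()   # 2 -> NaNs in column 'y' only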
You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size: returns the total element count of the DataFrame, including NaN.
oc.count().sum(): returns the total element count of the DataFrame, excluding NaN.
Therefore, another way to count the number of NaN in the DataFrame is to subtract the two:
NaN_count = oc.size - oc.count().sum()
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1,2,np.nan],[3,np.nan,5],[8,7,6],
                            [np.nan,np.nan,0]]), columns=['a','b','c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count NaN by column, you can try this:
aa.isnull().sum()
a 1
b 2
c 1
For the total count of NaN:
aa.isnull().values.sum()
4

Convert data type of a column containing nan, hyphen and comma in pandas data frame

df = pd.read_csv("data.csv", encoding = "ISO-8859-1")
Now, I have column 'A', whose values are a mix of plain numbers, numbers with thousands separators (commas), hyphens and NaN.
I want to convert column 'A' to a numeric format using the code below:
df[['A']] = df[['A']].astype(int)
and it gives me an error.
The problem is I have all three (nan, hyphen and comma) all in one column and need to address them together.
Is there any better way to convert these without replacing (nan to -1) and things like that?
Use the parameters thousands and na_values, but converting to integers is not possible with missing values, because even one NaN value casts the column to float, see this. So a possible solution is to replace the missing values with an integer, e.g. -1, and then cast to integer:
Notice: in newer versions of pandas (0.24.0+), pandas has gained the ability to hold integer dtypes with missing values, the Nullable Integer Data Type.
import pandas as pd
temp=u'''A
2254
"1,234"
"3,385"
nan
-
-
nan'''
#after testing, replace 'pd.compat.StringIO(temp)' with 'data.csv'
df = pd.read_csv(pd.compat.StringIO(temp),
                 encoding="ISO-8859-1",
                 thousands=',',
                 na_values='-')
print (df)
A
0 2254.0
1 1234.0
2 3385.0
3 NaN
4 NaN
5 NaN
6 NaN
df['A'] = df['A'].fillna(-1).astype(int)
print (df)
A
0 2254
1 1234
2 3385
3 -1
4 -1
5 -1
6 -1
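As mentioned in the notice above, with pandas 0.24.0 or later you can instead keep the missing values and still get an integer dtype via the nullable Integer type. A sketch replacing the fillna(-1) step, applied to the float column straight from read_csv (missing values then display as <NA> on recent versions):
df['A'] = df['A'].astype('Int64')
print(df)
      A
0  2254
1  1234
2  3385
3  <NA>
4  <NA>
5  <NA>
6  <NA>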
Maybe you should use pd.to_numeric with errors='coerce' together with str.replace:
df['A'] = pd.to_numeric(df['A'].str.replace(',',''),errors='coerce')
And now:
print(df['A'])
Is:
0 2254.0
1 1234.0
2 3385.0
3 NaN
4 NaN
5 NaN
6 NaN
Name: A, dtype: float64
