Given the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[np.nan,1,2],'b':[np.nan,np.nan,4]})
a b
0 NaN NaN
1 1.0 NaN
2 2.0 4.0
How do I return rows where both columns 'a' and 'b' are null without having to use pd.isnull for each column?
Desired result:
a b
0 NaN NaN
I know this works (but it's not how I want to do it):
df.loc[pd.isnull(df['a']) & pd.isnull(df['b'])]
I tried this:
df.loc[pd.isnull(df[['a', 'b']])]
...but got the following error:
ValueError: Cannot index with multidimensional key
Thanks in advance!
You are close:
df[pd.isnull(df[['a', 'b']]).all(1)]
Or
df[df[['a','b']].isna().all(1)]
How about:
df.dropna(subset=['a','b'], how='all')
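Note that dropna(subset=['a','b'], how='all') removes the rows where both columns are NaN. If instead you want to keep only those rows (the desired result above), one option, sketched here rather than taken from the original answer, is to invert the selection via the index:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [np.nan, 1, 2], 'b': [np.nan, np.nan, 4]})
# Rows dropna would keep (at least one of 'a'/'b' is non-null)
kept = df.dropna(subset=['a', 'b'], how='all')
# Keep only the rows dropna would have removed, i.e. both 'a' and 'b' null
df.loc[df.index.difference(kept.index)]
#     a   b
# 0 NaN NaN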
With your sample data, please try the following, using the isnull function:
mask1 = df['a'].isnull()
mask2 = df['b'].isnull()
df[mask1 & mask2]
The answer above creates two mask variables for clarity. If you would rather write the conditions inline and skip the intermediate variables (mask1 and mask2 in this case), try the following:
df[df['a'].isnull() & df['b'].isnull()]
Output will be as follows.
a b
0 NaN NaN
You can use dropna() with how='all':
df.dropna(how='all')
Output:
a b
1 1.0 NaN
2 2.0 4.0
Since the question was updated, you can instead build a mask with df.isnull() or df.isna() and filter on it:
df[df.isna().all(axis=1)]
a b
0 NaN NaN
I have the following Pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4], 'type': ['a,b,c,d', 'b,d', 'c,e', np.nan]})
I need to split the type column on the comma delimiter and pivot the values into multiple columns (one indicator column per type).
I looked at Pandas documentation for pivot() and also searched stackoverflow. I did not find anything that seems to achieve (directly or indirectly) what I need to do here. Any suggestions?
Edited: enke's solution works with pandas 1.3.5; however, it does not work with the latest version, 1.4.1.
You could use str.get_dummies to get the dummy variables; then join back to df:
out = df[['id']].join(df['type'].str.get_dummies(sep=',').add_prefix('type_').replace(0, float('nan')))
Output:
id type_a type_b type_c type_d type_e
0 1 1.0 1.0 1.0 1.0 NaN
1 2 NaN 1.0 NaN 1.0 NaN
2 3 NaN NaN 1.0 NaN 1.0
3 4 NaN NaN NaN NaN NaN
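If str.get_dummies misbehaves on your pandas version, one possible workaround, sketched here under the assumption that type always holds comma-separated strings or NaN, is to split and explode the values yourself and build the indicator table with pd.crosstab:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4], 'type': ['a,b,c,d', 'b,d', 'c,e', np.nan]})
# One row per (id, single type value); the NaN row survives the explode
long = df.assign(type=df['type'].str.split(',')).explode('type')
# Cross-tabulate id vs. type to get one indicator column per value
dummies = pd.crosstab(long['id'], long['type']).add_prefix('type_')
# Re-attach to the original ids so id 4 (all NaN) is kept
out = df[['id']].join(dummies.replace(0, np.nan), on='id')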
How do I combine values from two rows that have an identical index and no overlapping non-null values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Use stack(), which drops all NaNs, then unstack():
df.stack().unstack()
If a simplified solution is possible, take the first non-missing value per index label with GroupBy.first:
df1 = df.groupby(level=0).first()
If summing per label gives the same output for your sample data, use sum:
df1 = df.sum(level=0)
If there are multiple non-missing values per group, you need to specify the expected output; that case is obviously more complicated.
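For reference, a quick check on the sample frame above (a sketch, not part of the original answers) showing that both approaches produce the desired output here:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, 5, 6]], index=['a', 'b', 'b'])
# stack() drops the NaNs, unstack() reshapes back to one row per label
df.stack().unstack()
# First non-missing value per index label
df.groupby(level=0).first()
#      0    1    2
# a  1.0  2.0  3.0
# b  4.0  5.0  6.0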
I am trying to fill NaNs after a groupby and filter in pandas. For example, I want to group by 'label' and check whether a group contains both NaN and non-NaN values. If it does, I want to fill the NaNs with the value from the same group.
Here's what I'm working on so far:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b', 'c', 'c'],
                        'value': [np.nan, 'a1', 'b1', 'b1', np.nan, np.nan]})
#I am trying to do
df.groupby('label')\
.filter(lambda x:x.value.isna().values.any() and not x.value.isna().values.all())\
.apply(lambda x:x.sort_values('value').value.ffill())
I use sort_values because I want to put nan at the end so that I can use ffill()
But I got an error saying there is no axis named value. I wonder where it went wrong. Or is there a better way to do this? And how can the filled data be assigned back to the original dataframe?
Thanks for your help.
We can do groupby and then just fill the NaNs; if a group is all NaN, it will remain NaN:
df.groupby('label').value.apply(lambda x : x.ffill().bfill())
0 a1
1 a1
2 b1
3 b1
4 NaN
5 NaN
Name: value, dtype: object
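To assign the filled values back to the original dataframe (the last part of the question), one option, sketched here and not from the original answer, is to use transform so the result keeps the original index and can be assigned directly:
import numpy as np
import pandas as pd
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [np.nan, 'a1', 'b1', 'b1', np.nan, np.nan]})
# transform preserves the original index, so plain assignment works
df['value'] = df.groupby('label')['value'].transform(lambda s: s.ffill().bfill())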
I'm trying to count the NaN elements (data type class 'numpy.float64') in a pandas Series (data type class 'pandas.core.series.Series') to know how many there are.
This is for counting null values in a pandas Series:
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
I expected oc.count("NaN") to return 7, but instead it shows 'Level NaN must be same as name (None)'.
The argument to count isn't the value you want counted (it's actually the level/axis to count along).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
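For example, on a small Series (a made-up sample, since the original CSV isn't shown), both approaches agree:
import numpy as np
import pandas as pd
s = pd.Series([1.0, np.nan, 2.5, np.nan, np.nan])
s.isna().sum()       # 3
len(s) - s.count()   # 3 -- count() excludes NaN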
You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size: returns the total element count of the dataframe, including NaN
oc.count().sum(): returns the total element count of the dataframe, excluding NaN
Therefore, another way to count the number of NaNs in a dataframe is to subtract the two:
NaN_count = oc.size - oc.count().sum()
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1, 2, np.nan], [3, np.nan, 5], [8, 7, 6],
                            [np.nan, np.nan, 0]]), columns=['a', 'b', 'c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count NaN by columns, you can try this:
aa.isnull().sum()
a 1
b 2
c 1
For the total count of NaN:
aa.isnull().values.sum()
4
I'm converting columns & cell values of a dataframe to float64 using either:
df.infer_objects()
or
df.apply(pd.to_numeric)
The first keeps non-convertible columns as object type, while the second raises an exception if some objects cannot be converted. My question is whether it's somehow possible to supply my own error/converter callback function. Something like this:
def my_converter(value: object) -> float:
# add all your "known" value conversions and fallbacks
converted_value = float(value)
return converted_value
df.apply(pd.to_numeric, converter=my_converter)
I don't know of a way to concisely do exactly what you're asking. That would be a nice addition to the API, though. Here is something that will work; it's a little convoluted.
Have pd.to_numeric return NaN instead of raising an exception by using the errors parameter. Then, at the locations of those NaNs, apply your special converter function. Finally, use add with the parameter fill_value=0 to combine the DataFrame converted with pd.to_numeric and the one converted with your special converter.
You can find some information in the docs for to_numeric and add
It would look something like this.
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'A': np.random.randn(5),
'B': np.random.randn(5)})
A B
0 -0.619165 1.310489
1 0.908564 1.017284
2 0.046072 -1.059349
3 1.123730 -2.229261
4 0.689580 -0.200981
df1 = df[df < 1] # This is your DataFrame from df.apply(pd.to_numeric, errors='coerce')
df1
A B
0 -0.619165 NaN
1 0.908564 NaN
2 0.046072 -1.059349
3 NaN -2.229261
4 0.689580 -0.200981
df2 = df[df1.isnull()]
df2 # This is the DataFrame you want to apply your converter to: df2.apply(my_converter)
df2 = df2.apply(lambda x: x*10) # This is my dummy special converter
df2
A B
0 NaN 13.104885
1 NaN 10.172835
2 NaN NaN
3 11.237296 NaN
4 NaN NaN
df1.add(df2, fill_value=0) # This is the final dataframe you're looking for
A B
0 -0.619165 13.104885
1 0.908564 10.172835
2 0.046072 -1.059349
3 11.237296 -2.229261
4 0.689580 -0.200981
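Putting the pieces together on actual string data rather than the random-number stand-in above, a minimal sketch of the same idea (my_converter's fallbacks and the sample values are assumptions, not from the original question) might look like this:
import numpy as np
import pandas as pd
def my_converter(value: object) -> float:
    # add all your "known" value conversions and fallbacks here
    if isinstance(value, str):
        value = value.replace(',', '.').strip()  # e.g. handle decimal commas
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan
df = pd.DataFrame({'A': ['1.5', '2,5', 'n/a'], 'B': ['3', 'four', '5.0']})
df1 = df.apply(pd.to_numeric, errors='coerce')   # NaN wherever to_numeric fails
df2 = df[df1.isnull()].applymap(my_converter)    # custom converter on the failures only (applymap is DataFrame.map in newer pandas)
result = df1.add(df2, fill_value=0)              # combine the two conversions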