I have a Series and a DataFrame that share the same index:
s = pd.Series([300, 300])
df = pd.DataFrame({
    'A': [10, 20],
    'B': [20, 30]
})
When I do s.div(df), I see:
     A   B   0   1
0  NaN NaN NaN NaN
1  NaN NaN NaN NaN
I expect:
    A   B
0  30  15
1  15  10
pandas.__version__: 1.3.4.
Use DataFrame.rdiv to divide from the right side:
df1 = df.rdiv(s, axis=0)
print(df1)

      A     B
0  30.0  15.0
1  15.0  10.0
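The all-NaN result comes from alignment: s.div(df) matches the Series index (0, 1) against the DataFrame columns ('A', 'B'), which share no labels. A minimal sketch verifying the rdiv fix on the data above:

```python
import pandas as pd

s = pd.Series([300, 300])
df = pd.DataFrame({'A': [10, 20], 'B': [20, 30]})

# rdiv with axis=0 aligns the Series with the row index and
# computes s / df column by column.
result = df.rdiv(s, axis=0)
print(result)
```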
I have a dataframe as follows:
  2017           2018
     A    B    C    A    B    C
0   12  NaN  NaN   98  NaN  NaN
1  NaN   23  NaN  NaN   65  NaN
2  NaN  NaN   45  NaN  NaN   43
I want to convert this data frame into:
  2017         2018
     A   B   C   A   B   C
0   12  23  45  98  65  43
First back-fill missing values, then select the first row with a double [] to get a one-row DataFrame:
df = df.bfill().iloc[[0]]
#alternative
#df = df.ffill().iloc[[-1]]
print(df)

   2017             2018
      A     B     C     A     B     C
0  12.0  23.0  45.0  98.0  65.0  43.0
One could also sum down each column:
import pandas as pd
import numpy as np

# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 42.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.nan
df = pd.DataFrame(tmp)

# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
      0     1     2     3     4     5
0  12.0  23.0  42.0  98.0  65.0  43.0
This is convenient because DataFrame.sum ignores NaN by default. A couple of notes:
One loses the column names in this approach.
All-NaN columns will return 0 in the result.
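To avoid the all-NaN-columns-become-0 behaviour, sum accepts a min_count parameter: with min_count=1, a column with no valid values yields NaN instead of 0. A small sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [np.nan, np.nan]})

# Default: the all-NaN column 'b' sums to 0.0.
print(df.sum(axis=0))

# min_count=1 requires at least one valid value, so 'b' stays NaN.
print(df.sum(axis=0, min_count=1))
```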
How can I replace all the non-NaN values in a pandas dataframe with 1 but leave the NaN values alone? This almost does what I'm looking for. The problem is it also makes NaN values 0. Then I have to reset them to NaN after.
I would like this
     a    b
0  NaN  QQQ
1  AAA  NaN
2  NaN  BBB
to become this
     a    b
0  NaN    1
1    1  NaN
2  NaN    1
This code is almost what I want
newdf = df.notnull().astype('int')
The above code does this
   a  b
0  0  1
1  1  0
2  0  1
One way would be to select all non-null values from the original data frame and set them to one:
df[df.notnull()] = 1
This solution applied to your data:
df = pd.DataFrame({'a': [np.nan, 'AAA', np.nan], 'b': ['QQQ', np.nan, 'BBB']})
df[df.notnull()] = 1
df
     a    b
0  NaN    1
1    1  NaN
2  NaN    1
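An equivalent one-liner, sketched with DataFrame.where, which keeps values where the condition is True and substitutes elsewhere:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, 'AAA', np.nan], 'b': ['QQQ', np.nan, 'BBB']})

# Keep the NaNs (where isna() is True) and replace everything else with 1.
out = df.where(df.isna(), 1)
print(out)
```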
You can use np.where() with DataFrame.isna() to accomplish this:
df = pd.DataFrame(data=[[1, np.nan, 5],
                        ['q', np.nan, np.nan],
                        ['7', {'a': 1}, np.nan]],
                  columns=['a', 'b', 'c'])
   a         b    c
0  1       NaN  5.0
1  q       NaN  NaN
2  7  {'a': 1}  NaN
df1 = pd.DataFrame(np.where(df.isna(), df, 1), columns=df.columns)
   a    b    c
0  1  NaN    1
1  1  NaN  NaN
2  1    1  NaN
I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
   A  B
0  1  4
1  3  5
2  4  6
3  8  7
Make whole row NaN, if B > 5:
     A    B
0  1.0  4.0
1  3.0  5.0
2  NaN  NaN
3  NaN  NaN
Thank you.
Use boolean indexing to assign a value where the condition holds:
df[df['B'] > 5] = np.nan
print(df)

     A    B
0  1.0  4.0
1  3.0  5.0
2  NaN  NaN
3  NaN  NaN
Or DataFrame.mask, which by default replaces values with NaN where the condition is True:
df = df.mask(df['B'] > 5)
print(df)

     A    B
0  1.0  4.0
1  3.0  5.0
2  NaN  NaN
3  NaN  NaN
Thank you Bharath Shetty for the DataFrame.where variant:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example

In [14]: df
Out[14]:
   A  B
0  1  4
1  3  5
2  4  6
3  8  7

In [15]: df.loc[df.B > 5, :] = np.nan

In [16]: df
Out[16]:
     A    B
0  1.0  4.0
1  3.0  5.0
2  NaN  NaN
3  NaN  NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to: assign np.nan to every column (:) of the dataframe (df) in the rows where the condition df.B > 5 holds.
Or using reindex:
df.loc[df.B <= 5, :].reindex(df.index)
Out[83]:
     A    B
0  1.0  4.0
1  3.0  5.0
2  NaN  NaN
3  NaN  NaN
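The approaches above (boolean-indexing assignment, mask, where, and reindex) produce the same frame; a quick sketch checking that, using float columns so that assigning NaN needs no dtype upcast:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 3, 4, 8], 'B': [4, 5, 6, 7]}, dtype=float)

masked = df.mask(df['B'] > 5)                        # NaN where B > 5
whered = df.where(~(df['B'] > 5))                    # keep where B <= 5
reindexed = df.loc[df.B <= 5, :].reindex(df.index)   # drop rows, re-align

assigned = df.copy()
assigned[assigned['B'] > 5] = np.nan                 # boolean-indexing assignment

print(masked)
```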
I have two datasets like this
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'first': [np.nan, np.nan, 1, 0, np.nan],
                    'second': [1, np.nan, np.nan, np.nan, 0]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                    'first': [np.nan, 1, np.nan, np.nan, 0, 1],
                    'third': [1, 0, np.nan, 1, 1, 0]})
And I want to get
result = pd.merge(df1, df2, on='id', how='outer')
result['first'] = result[['first_x', 'first_y']].sum(axis=1)
result.loc[result['first_x'].isnull() & result['first_y'].isnull(), 'first'] = np.nan
result = result.drop(['first_x', 'first_y'], axis=1)
   id  second  third  first
0   1     1.0    1.0    NaN
1   2     NaN    0.0    1.0
2   3     NaN    NaN    1.0
3   4     NaN    1.0    0.0
4   5     0.0    1.0    0.0
5   6     NaN    0.0    1.0
The problem is that the real dataset includes about 200 variables and my way is very long. How to make it easier? Thanks
You should be able to use combine_first:
>>> df1.set_index('id').combine_first(df2.set_index('id'))
    first  second  third
id
1     NaN       1      1
2       1     NaN      0
3       1     NaN    NaN
4       0     NaN      1
5       0       0      1
6       1     NaN      0
Should probably use combine_first as mentioned by Alexander. If you want to keep id as a column afterwards, just reset the index:
merged = df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
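A runnable sketch of the combine_first approach on the question's data, which scales to any number of shared columns with no per-column code:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'first': [np.nan, np.nan, 1, 0, np.nan],
                    'second': [1, np.nan, np.nan, np.nan, 0]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                    'first': [np.nan, 1, np.nan, np.nan, 0, 1],
                    'third': [1, 0, np.nan, 1, 1, 0]})

# With 'id' as the index, combine_first aligns on id and fills df1's
# missing values (and missing rows/columns) from df2 in one call.
result = df1.set_index('id').combine_first(df2.set_index('id'))
print(result)
```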
I want to delete the values that are greater than a certain threshold from a pandas dataframe. Is there an efficient way to perform this? I am doing it with apply and lambda, which works fine but a bit slow for a large dataframe and I feel like there must be a better method.
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df
   A  B
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
How can this be done without apply and lambda?
df['A'] = df.apply(lambda x: x['A'] if x['A'] < 3 else None, axis=1)
df
     A  B
0  1.0  1
1  2.0  2
2  NaN  3
3  NaN  4
4  NaN  5
Use a boolean mask against the df:
In[21]:
df[df<3]
Out[21]:
     A
0  1.0
1  2.0
2  NaN
3  NaN
4  NaN
Here, where the boolean condition is not met, the value is masked out and NaN is returned.
If you want to keep this masked result then self-assign:
df = df[df<3]
To compare a specific column:
In[22]:
df[df['A']<3]
Out[22]:
   A
0  1
1  2
If you want NaN in the removed rows, you can use a trick: double square brackets return a single-column DataFrame, so we can mask the df:
In[25]:
df[df[['A']]<3]
Out[25]:
     A
0  1.0
1  2.0
2  NaN
3  NaN
4  NaN
If you have multiple columns, the above won't work because the boolean mask has to match the shape of the original df; in that case you can reindex against the original df index:
In[31]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df[df['A']<3].reindex(df.index)
Out[31]:
     A    B
0  1.0  1.0
1  2.0  2.0
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
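The reindex trick can also be written with DataFrame.where, whose Series condition broadcasts across all columns; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})

# The row condition A < 3 is broadcast across every column:
# rows failing it become NaN in both A and B.
out = df.where(df['A'] < 3)
print(out)
```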
EDIT
You've updated your question again, if you want to just overwrite the single column:
In[32]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['A'] = df.loc[df['A'] < 3,'A']
df
Out[32]:
     A  B
0  1.0  1
1  2.0  2
2  NaN  3
3  NaN  4
4  NaN  5
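To overwrite just the one column without apply, Series.where does the same thing in a vectorised way; a sketch (keep values where the condition holds, NaN elsewhere):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})

# Keep A where A < 3; other entries become NaN (A upcasts to float).
df['A'] = df['A'].where(df['A'] < 3)
print(df)
```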