I am trying to sort each row of a DataFrame element-wise, in descending order.
Input:
A B C
0 10 5 6
1 3 6 5
2 1 2 3
Output:
A B C
0 10 6 5
1 6 5 3
2 3 2 1
It feels like this should be easy, but I've been failing for a while... I'm very much a beginner in Python.
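For reference, the input frame above can be rebuilt with:

```python
import pandas as pd

# Rebuild the example input from the question
df = pd.DataFrame({"A": [10, 3, 1], "B": [5, 6, 2], "C": [6, 5, 3]})
print(df)
```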
Use np.sort and reverse the order with slicing:
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
index=df.index,
columns=df.columns)
print (df1)
A B C
0 10 6 5
1 6 5 3
2 3 2 1
A pure pandas solution, which is slower, is to apply sorting to each row separately, convert to an array and then back to a Series:
f = lambda x: pd.Series(x.sort_values(ascending=False).to_numpy(), index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10 6 5
1 6 5 3
2 3 2 1
If missing values are possible, this still works for me:
print (df)
A B C
0 10.0 6.0 5.0
1 5.0 3.0 NaN
2 2.0 1.0 NaN
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
index=df.index,
columns=df.columns)
print (df1)
A B C
0 10.0 6.0 5.0
1 NaN 5.0 3.0
2 NaN 2.0 1.0
In pandas you can use the na_position parameter to specify where the NaNs go:
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='first').to_numpy(),
index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10.0 6.0 5.0
1 NaN 5.0 3.0
2 NaN 2.0 1.0
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='last').to_numpy(),
index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10.0 6.0 5.0
1 5.0 3.0 NaN
2 2.0 1.0 NaN
Related
I have 2 DataFrames, which I want to combine as follows:
df1 = pd.DataFrame({"a": [1,2], "b":['A','B'], "c":[3,2]})
df2 = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[3, None,None,2,None,None,None,None]})
Output:
a b c
1 A 3.0
1 A NaN
1 A NaN
2 B 2.0
2 B NaN
2 B NaN
I had an earlier version of this question that only involved df2 and was solved with
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
but now I need to run it only for rows that appear in df1 (df2 contains the rows from df1, plus some extra rows that I don't want included).
Thanks!
You can turn the indicator on with merge:
out = df2.merge(df1,indicator=True,how='outer',on=['a','b'])
Out[91]:
a b c_x c_y _merge
0 1 A 3.0 3.0 both
1 1 A NaN 3.0 both
2 1 A NaN 3.0 both
3 2 B 2.0 2.0 both
4 2 B NaN 2.0 both
5 2 B NaN 2.0 both
6 3 C NaN NaN left_only
7 4 D NaN NaN left_only
out = out[out['_merge']=='both']
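A complete sketch that also drops the helper columns afterwards (the c_x/c_y names come from merge's default suffixes):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ["A", "B"], "c": [3, 2]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                    "b": ["A", "A", "A", "B", "B", "B", "C", "D"],
                    "c": [3, None, None, 2, None, None, None, None]})

# Outer merge with indicator, then keep only rows present in both frames
out = df2.merge(df1, indicator=True, how="outer", on=["a", "b"])
out = (out[out["_merge"] == "both"]
       .drop(columns=["c_y", "_merge"])   # drop df1's c and the indicator
       .rename(columns={"c_x": "c"}))     # restore the original column name
print(out)
```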
IIUC, you could merge:
out = df2.merge(df1[['a','b']])
or you could use chained isin:
out1 = df2[df2['a'].isin(df1['a']) & df2['b'].isin(df1['b'])]
Output:
a b c
0 1 A 3.0
1 1 A NaN
2 1 A NaN
3 2 B 2.0
4 2 B NaN
5 2 B NaN
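One caveat with the chained isin approach: it checks each column independently, so a row whose `a` value and `b` value both appear somewhere in df1, but never together in the same row, would slip through. For strict pair matching, one option (a sketch) is to compare tuples:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ["A", "B"], "c": [3, 2]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                    "b": ["A", "A", "A", "B", "B", "B", "C", "D"],
                    "c": [3, None, None, 2, None, None, None, None]})

# Keep only rows of df2 whose (a, b) pair occurs as a pair in df1
pairs = set(zip(df1["a"], df1["b"]))
mask = [pair in pairs for pair in zip(df2["a"], df2["b"])]
out = df2[mask]
print(out)
```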
I have two dataframes with the same column id, and for each id I need to apply the following function:
def findConstant(df1,df2):
c = df1.iloc[[0], df1.eq(df1.iloc[0]).all().to_numpy()].squeeze()
return pd.concat([df1, df2]).assign(**c).reset_index(drop=True)
What I am doing is the following:
df3 = pd.DataFrame()
for idx in df1['id']:
tmp1 = df1[df1['id']==idx]
tmp2 = df2[df2['id']==idx]
tmp3 = findConstant(tmp1,tmp2)
    df3 = pd.concat([df3, tmp3], ignore_index=True)
I would like to know how to avoid a loop like that.
Use:
print (df1)
A B C id val
0 ar 2 8 1 3.2
1 ar 3 7 1 5.6
3 ar1 0 3 2 7.8
4 ar1 4 3 2 9.2
5 ar1 5 3 2 3.4
print (df2)
id val
0 1 3.3
1 2 6.4
# get the number of unique values and the first value per group
df3 = df1.groupby('id').agg(['nunique','first'])
# a value is constant within its group if nunique equals 1
m = df3.xs('nunique', axis=1, level=1).eq(1)
# keep the constant values, filling the rest from the original df2
df = df3.xs('first', axis=1, level=1).where(m).combine_first(df2.set_index('id'))
print (df)
A B C val
id
1 ar NaN NaN 3.3
2 ar1 NaN 3.0 6.4
# join together
df = pd.concat([df1, df.reset_index()], ignore_index=True)
print (df)
A B C id val
0 ar 2.0 8.0 1 3.2
1 ar 3.0 7.0 1 5.6
2 ar1 0.0 3.0 2 7.8
3 ar1 4.0 3.0 2 9.2
4 ar1 5.0 3.0 2 3.4
5 ar NaN NaN 1 3.3
6 ar1 NaN 3.0 2 6.4
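Put together as one runnable sketch (frames rebuilt from the printed examples above):

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["ar", "ar", "ar1", "ar1", "ar1"],
    "B": [2, 3, 0, 4, 5],
    "C": [8, 7, 3, 3, 3],
    "id": [1, 1, 2, 2, 2],
    "val": [3.2, 5.6, 7.8, 9.2, 3.4],
})
df2 = pd.DataFrame({"id": [1, 2], "val": [3.3, 6.4]})

# number of unique values and first value per group
df3 = df1.groupby("id").agg(["nunique", "first"])
# a column is constant within its group if nunique equals 1
m = df3.xs("nunique", axis=1, level=1).eq(1)
# keep constant values, fill the rest from df2
df = df3.xs("first", axis=1, level=1).where(m).combine_first(df2.set_index("id"))
# append the per-id rows to the original frame
df = pd.concat([df1, df.reset_index()], ignore_index=True)
print(df)
```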
I have the following Dataset:
col value
0 A 1
1 A NaN
2 B NaN
3 B NaN
4 B NaN
5 B 1
6 C 3
7 C NaN
8 C NaN
9 D 5
10 E 6
There is only one value set per group; the rest are NaN. What I want to do now is fill the NaNs with the value of the group. If a group has no NaNs, I just want to leave it alone.
Outcome should look like this:
col value
0 A 1
1 A 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 3
7 C 3
8 C 3
9 D 5
10 E 6
What I've tried so far is the following:
df["value"] = df.groupby('col')["value"].transform(lambda x: x.fillna(x.mean()))
However, this method is not only super slow, it also doesn't give me the result I want.
Anybody an idea?
It depends on the data. If there is always at most one non-missing value per group, you can sort and then fill with GroupBy.ffill; this also works well if some groups contain only NaNs:
df = df.sort_values(['col','value'])
df["value"] = df.groupby('col')["value"].ffill()
# if there is always exactly one non-missing value per group; fails if some group is all NaNs
#df["value"] = df["value"].ffill()
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
Or, if there are multiple values per group and you need to replace by the mean, improve performance by computing only the mean with GroupBy.transform and passing it to Series.fillna:
df["value"] = df["value"].fillna(df.groupby('col')["value"].transform('mean'))
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
You can use ffill, which is the same as fillna() with method='ffill' (see the docs):
df["value"] = df["value"].ffill()
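Note that a plain ffill assumes each group's known value comes before its NaNs in row order (in the example, group B only looks right because group A's value is also 1). A group-aware sketch that fills in both directions, assuming at most one known value per group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col": ["A", "A", "B", "B", "B", "B", "C", "C", "C", "D", "E"],
    "value": [1, np.nan, np.nan, np.nan, np.nan, 1, 3, np.nan, np.nan, 5, 6],
})

# Fill forward and backward within each group, so the single known
# value propagates regardless of where it sits in the group
df["value"] = df.groupby("col")["value"].transform(lambda s: s.ffill().bfill())
print(df)
```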
I have a dataframe which looks like this:
0 1 2 3 4 5 6
0 a(A) b c c d a a
1 b h w k d c(A) k
2 g e(A) s g h s f
3 f d s h(A) c w n
4 e g s b c e w
I want to get the index of the cell which contains (A) in each column.
0 0
1 2
2 NaN
3 3
4 NaN
5 1
6 NaN
I tried this code, but the result doesn't match my expectation.
df.apply(lambda x: (x.str.contains(r'(A)')==True).idxmax(), axis=0)
Result looks like this:
0 0
1 2
2 0
3 3
4 0
5 1
6 0
I think it returns the first index if there is no (A) in that column.
How should I fix it?
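For reference, the frame above can be rebuilt with:

```python
import pandas as pd

# Rebuild the example frame from the question; columns are 0..6
df = pd.DataFrame([
    ["a(A)", "b", "c", "c", "d", "a", "a"],
    ["b", "h", "w", "k", "d", "c(A)", "k"],
    ["g", "e(A)", "s", "g", "h", "s", "f"],
    ["f", "d", "s", "h(A)", "c", "w", "n"],
    ["e", "g", "s", "b", "c", "e", "w"],
])
print(df)
```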
Use Series.where to set missing values where a column has no match, overwriting the default 0 returned by DataFrame.idxmax:
mask = df.apply(lambda x: x.str.contains('A'))
s1 = mask.idxmax().where(mask.any())
print (s1)
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
You could do what you're doing but explicitly check if the rows contain any matches:
In [51]: pred = df.applymap(lambda x: '(A)' in x)
In [52]: pred.idxmax() * np.where(pred.any(), 1, np.nan)
Out[52]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
Or alternatively, using DataFrame.where directly:
In [211]: pred.where(pred).idxmax()
Out[211]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
A slightly cheatier one-liner is to use DataFrame.where on the identity:
In [78]: df.apply(lambda x: x.str.contains('A')).where(lambda x: x).idxmax()
Out[78]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
Add an if condition at the end of the apply:
>>> df.apply(lambda x: x.str.contains('A').idxmax() if 'A' in x[x.str.contains('A').idxmax()] else np.nan)
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make the whole row NaN if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
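For reference, the unprocessed frame can be rebuilt with:

```python
import pandas as pd

# Rebuild the example input from the question
df = pd.DataFrame({"A": [1, 3, 4, 8], "B": [4, 5, 6, 7]})
print(df)
```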
Use boolean indexing to assign the value where the condition holds:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default inserts NaNs where the condition holds:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty; this also works:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to:
assign np.nan to every column (:) of the dataframe (df) where the
condition df.B > 5 holds.
Or using reindex:
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN