Find the index of first occurrence in DataFrame - python

I have a dataframe which looks like this:
0 1 2 3 4 5 6
0 a(A) b c c d a a
1 b h w k d c(A) k
2 g e(A) s g h s f
3 f d s h(A) c w n
4 e g s b c e w
I want to get the index of the cell which contains (A) in each column.
0 0
1 2
2 NaN
3 3
4 NaN
5 1
6 NaN
I tried this code but the result doesn't match my expectations.
df.apply(lambda x: (x.str.contains(r'(A)')==True).idxmax(), axis=0)
Result looks like this:
0 0
1 2
2 0
3 3
4 0
5 1
6 0
I think it returns the first index if there is no (A) in that column.
How should I fix it?

Use Series.where to set a missing value for columns with no match, overriding the default 0 returned by DataFrame.idxmax:
mask = df.apply(lambda x: x.str.contains('A'))
s1 = mask.idxmax().where(mask.any())
print (s1)
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
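Why the naive version returned 0: idxmax on an all-False boolean Series just returns the first label. A small sketch (the example Series here is made up for illustration), which also shows that the parentheses in '(A)' form a regex capture group:

```python
import pandas as pd

# idxmax on an all-False boolean Series returns the first label,
# which is why columns without '(A)' came back as 0.
s = pd.Series([False, False, False], index=[10, 20, 30])
print(s.idxmax())  # 10 -- the first label, not a real match

# str.contains(r'(A)') treats '(A)' as a regex capture group; to
# match the literal text, pass regex=False (or escape the parens).
col = pd.Series(['a(A)', 'b', 'c'])
print(col.str.contains('(A)', regex=False).tolist())  # [True, False, False]
```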

You could do what you're doing but explicitly check whether the columns contain any matches:
In [51]: pred = df.applymap(lambda x: '(A)' in x)  # DataFrame.map in pandas >= 2.1
In [52]: pred.idxmax() * np.where(pred.any(), 1, np.nan)
Out[52]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
Or alternatively, using DataFrame.where directly:
In [211]: pred.where(pred).idxmax()
Out[211]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
A slightly cheatier one-liner is to use DataFrame.where on the identity:
In [78]: df.apply(lambda x: x.str.contains('A')).where(lambda x: x).idxmax()
Out[78]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN

Add an if condition at the end of the apply:
>>> df.apply(lambda x: x.str.contains('A').idxmax() if 'A' in x[x.str.contains('A').idxmax()] else np.nan)
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
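If the float 0.0/2.0 indices in all these results bother you, one option (a sketch on a shortened version of the example frame) is pandas' nullable Int64 dtype, which can hold missing values without forcing floats:

```python
import pandas as pd

df = pd.DataFrame({0: ['a(A)', 'b', 'g', 'f', 'e'],
                   1: ['b', 'h', 'e(A)', 'd', 'g'],
                   2: ['c', 'w', 's', 's', 's']})

mask = df.apply(lambda x: x.str.contains('(A)', regex=False))
# Nullable Int64 keeps the missing markers while the real hits
# stay integers.
s1 = mask.idxmax().where(mask.any()).astype('Int64')
print(s1)  # 0, 2, <NA>
```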


combine and group rows from 2 dfs

I have 2 dfs, which I want to combine as the following:
df1 = pd.DataFrame({"a": [1,2], "b":['A','B'], "c":[3,2]})
df2 = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[3, None,None,2,None,None,None,None]})
Output:
a b c
1 A 3.0
1 A NaN
1 A NaN
2 B 2.0
2 B NaN
2 B NaN
I had an earlier version of this question that only involved df2 and was solved with
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
but now I need to run it only for rows that appear in df1 (df2 contains the rows from df1 plus some extra rows, which I want excluded).
Thanks!
You can turn on the indicator in merge:
out = df2.merge(df1,indicator=True,how='outer',on=['a','b'])
Out[91]:
a b c_x c_y _merge
0 1 A 3.0 3.0 both
1 1 A NaN 3.0 both
2 1 A NaN 3.0 both
3 2 B 2.0 2.0 both
4 2 B NaN 2.0 both
5 2 B NaN 2.0 both
6 3 C NaN NaN left_only
7 4 D NaN NaN left_only
out = out[out['_merge'] == 'both']
# drop the merge helpers to match the requested output
out = out.drop(columns=['c_y', '_merge']).rename(columns={'c_x': 'c'})
IIUC, you could merge:
out = df2.merge(df1[['a','b']])
or you could use chained isin:
out1 = df2[df2['a'].isin(df1['a']) & df2['b'].isin(df1['b'])]
Output:
a b c
0 1 A 3.0
1 1 A NaN
2 1 A NaN
3 2 B 2.0
4 2 B NaN
5 2 B NaN
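For completeness, a runnable sketch of the merge approach, with a caveat about the chained-isin variant:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ['A', 'B'], "c": [3, 2]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                    "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                    "c": [3, None, None, 2, None, None, None, None]})

# Merging on the key columns keeps only (a, b) pairs present in df1.
out = df2.merge(df1[['a', 'b']])
print(out)

# Caveat on the chained-isin variant: it checks 'a' and 'b'
# independently, so a row like (a=1, b='B') would slip through even
# though that pair never appears in df1. The merge matches on the
# pair, so it does not have this problem.
```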

Duplicate positions from group

I have the following Dataset:
col value
0 A 1
1 A NaN
2 B NaN
3 B NaN
4 B NaN
5 B 1
6 C 3
7 C NaN
8 C NaN
9 D 5
10 E 6
There is only one value set per group; the rest is NaN. What I want to do now is fill the NaNs with the value of the group. If a group has no NaNs, I just want to ignore it.
Outcome should look like this:
col value
0 A 1
1 A 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 3
7 C 3
8 C 3
9 D 5
10 E 6
What I've tried so far is the following:
df["value"] = df.groupby("col").transform(lambda x: x.fillna(x.mean()))
However, this method is not only super slow, but doesn't give me the wished result.
Anybody an idea?
It depends on the data. If there is always at most one non-missing value per group, you can sort and then forward-fill with GroupBy.ffill; this also works when some groups contain only NaNs:
df = df.sort_values(['col','value'])
df["value"] = df.groupby('col')["value"].ffill()
# if there is always exactly one non-missing value per group, a plain
# ffill works too, but it fails if some group is all-NaN:
# df["value"] = df["value"].ffill()
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
Or, if there are multiple values per group and you need to replace with the mean, improve the performance of your solution by computing the group means with GroupBy.transform and passing only that Series to Series.fillna:
df["value"] = df["value"].fillna(df.groupby('col')["value"].transform('mean'))
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
You can use ffill, which is the same as fillna() with method='ffill' (see docs). Beware that this ignores the groups entirely; on the sample data it only appears to work because the A and B groups both happen to have the value 1, so the groupby version above is the safe choice:
df["value"] = df["value"].ffill()
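One more option (a sketch, assuming at most one value set per group): GroupBy.first picks the first non-null value in each group, so transform('first') broadcasts it back without any sorting:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col': list('AABBBBCCCDE'),
                   'value': [1, np.nan, np.nan, np.nan, np.nan, 1,
                             3, np.nan, np.nan, 5, 6]})

# 'first' skips NaNs, so every row receives its group's single
# non-missing value regardless of where it sits in the group.
df['value'] = df.groupby('col')['value'].transform('first')
print(df)
```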

Python, element wise sorting of a DataFrame

I am trying to sort each row of a DataFrame element wise.
Input:
A B C
0 10 5 6
1 3 6 5
2 1 2 3
Output:
A B C
0 10 6 5
1 6 5 3
2 3 2 1
It feels like this should be easy, but I've been failing for a while... Very much a beginner in Python.
Use np.sort, reversing the order with indexing:
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
                   index=df.index,
                   columns=df.columns)
print (df1)
A B C
0 10 6 5
1 6 5 3
2 3 2 1
A pandas solution, slower, is to apply sorting to each row separately, convert to an array and then back to a Series:
f = lambda x: pd.Series(x.sort_values(ascending=False).to_numpy(), index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10 6 5
1 6 5 3
2 3 2 1
If missing values are possible, this still works for me:
print (df)
A B C
0 10.0 6.0 5.0
1 5.0 3.0 NaN
2 2.0 1.0 NaN
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
                   index=df.index,
                   columns=df.columns)
print (df1)
A B C
0 10.0 6.0 5.0
1 NaN 5.0 3.0
2 NaN 2.0 1.0
In pandas it is possible to use the na_position parameter to specify where they go:
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='first').to_numpy(),
                        index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10.0 6.0 5.0
1 NaN 5.0 3.0
2 NaN 2.0 1.0
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='last').to_numpy(),
                        index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10.0 6.0 5.0
1 5.0 3.0 NaN
2 2.0 1.0 NaN
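If you need descending order but with NaNs last instead of first, one NumPy trick (a sketch, assuming numeric data) is to sort the negated values: np.sort places NaN last in ascending order, and negating twice turns that into a descending sort that still keeps NaNs at the end of each row:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [10.0, 5.0, 2.0],
                   'B': [6.0, 3.0, 1.0],
                   'C': [5.0, np.nan, np.nan]})

# negate -> ascending sort (NaN last) -> negate back
df1 = pd.DataFrame(-np.sort(-df.to_numpy(), axis=1),
                   index=df.index, columns=df.columns)
print(df1)
```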

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign a value by condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default fills in NaN where the condition is True:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
in human language, df.loc[df.B > 5, :] = np.nan can be translated to:
assign np.nan to every column (:) of the dataframe (df) wherever the
condition df.B > 5 holds.
Or using reindex
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
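All the approaches above agree; a quick sanity check (the astype(float) is only there so the loc-assignment frame ends up with the same dtypes as the other two):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 3, 4, 8], 'B': [4, 5, 6, 7]})

masked = df.mask(df['B'] > 5)
where_ = df.where(~(df['B'] > 5))

loc_ = df.astype(float)
loc_.loc[loc_['B'] > 5, :] = np.nan

# All three produce identical frames
print(masked.equals(where_) and masked.equals(loc_))  # True
```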

DataFrameGroupBy diff() on condition

Suppose i have a DataFrame:
df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
                   'VALUE':[np.nan, 1, 0, 0, 5, 0, 4]})
which looks like
CATEGORY VALUE
0 a NaN
1 b 1
2 c 0
3 b 0
4 b 5
5 a 0
6 b 4
I group it:
df = df.groupby(by='CATEGORY')
And now let me show what I want, using one group, 'b', as an example:
df.get_group('b')
group b:
CATEGORY VALUE
1 b 1
3 b 0
4 b 5
6 b 4
I need: within the scope of each group, compute diff() between VALUE values, skipping all NaNs and 0s. So the result should be:
CATEGORY VALUE DIFF
1 b 1 -
3 b 0 -
4 b 5 4
6 b 4 -1
You can use diff to subtract values after dropping 0 and NaN values (pd.np has been removed from pandas, so use numpy directly):
import numpy as np
import pandas as pd

df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
                   'VALUE':[np.nan, 1, 0, 0, 5, 0, 4]})
grouped = df.groupby("CATEGORY")
# define diff func: drop 0/NaN, then diff within each group
diff = lambda x: x["VALUE"].replace(0, np.nan).dropna().diff()
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
print(df)
CATEGORY VALUE DIFF
0 a NaN NaN
1 b 1.0 NaN
2 c 0.0 NaN
3 b 0.0 NaN
4 b 5.0 4.0
5 a 0.0 NaN
6 b 4.0 -1.0
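A variant of the same idea without groupby.apply (just a sketch): drop the 0/NaN rows first, diff within each category, and let index alignment put the result back:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'CATEGORY': ['a', 'b', 'c', 'b', 'b', 'a', 'b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})

# Keep only the rows that should participate in the diff.
s = df['VALUE'].replace(0, np.nan).dropna()
# Group the surviving values by their category and diff; assigning
# back aligns on the original index, leaving NaN elsewhere.
df['DIFF'] = s.groupby(df.loc[s.index, 'CATEGORY']).diff()
print(df)
```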
Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values before grouping the data (take a copy, so the assignments below don't trigger a SettingWithCopyWarning):
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()].copy()
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift within the groups and calculate the diff:
nonull_df['prev_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['prev_value']
Lastly and optionally, you can join the new columns back onto the original dataframe:
df = df.join(nonull_df[['prev_value', 'diff']])
df
CATEGORY VALUE prev_value diff
0 a NaN NaN NaN
1 b 1.0 NaN NaN
2 c 0.0 NaN NaN
3 b 0.0 NaN NaN
4 b 5.0 1.0 4.0
5 a 0.0 NaN NaN
6 b 4.0 5.0 -1.0
