Pandas conditionally replace value if >1 unique values for other column - python

Given the following data frame:
import pandas as pd
df = pd.DataFrame(
    {'A': ['A', 'A', 'B', 'B', 'C', 'C'],
     'B': ['Y', 'Y', 'N', 'N', 'Y', 'N']})
df
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C N
I need a line of code that:
1. identifies whether there is more than 1 unique value in column B for each category of A (i.e. category "C" in column A has 2 unique values in column B, whereas categories "A" and "B" in column A have only 1 unique value each);
2. changes the value in column B to "Y" only if there is more than 1 unique value for that category (i.e. column B should have "Y" for both rows of category "C" in column A).
Here's the desired result:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y
Thanks in advance!

You could:
df['B'] = df.groupby('A')['B'].transform(lambda x: 'Y' if x.nunique() > 1 else x)
to get:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y
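An equivalent, fully vectorized variant (a sketch of an alternative, not part of the original answer) avoids the Python-level lambda by computing the per-group unique counts with the built-in nunique aggregation and assigning through a boolean mask:
# Rows whose group in A has more than one unique value in B
mask = df.groupby('A')['B'].transform('nunique') > 1
# Overwrite B for exactly those rows
df.loc[mask, 'B'] = 'Y'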

This should work:
import pandas as pd
df = pd.DataFrame(
    {'A': ['A', 'A', 'B', 'B', 'C', 'C'],
     'B': ['Y', 'Y', 'N', 'N', 'Y', 'N']})
# Count the unique values in column B within each group of column A
group_counts = df.groupby('A').B.apply(lambda x: len(x.unique()))
# Find the groups with more than 1 unique value
cols_to_impute = group_counts[group_counts > 1].index.values
# Set column B to 'Y' for rows belonging to those groups
df.loc[df.A.isin(cols_to_impute), 'B'] = 'Y'
In [20]: df
Out[20]:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y
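As a minor variation (my suggestion, not part of the original answer), the intermediate count can also be computed with the built-in nunique aggregation, which is equivalent to the apply/len(unique) combination:
group_counts = df.groupby('A')['B'].nunique()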

Related

Group the most frequent items in a column associated with the most frequent items in another column

Having the following data frame, I'd like to find the 2 most frequent groups in the 'first' column, and within each group find the 2 most frequent groups in the 'second' column.
df = pd.DataFrame({'first': list('cbbcbcbccabc'), 'second': list('zvvzwyzyxxwz')})
df
gets
first second
0 c z
1 b v
2 b v
3 c z
4 b w
5 c y
6 b z
7 c y
8 c x
9 a x
10 b w
11 c z
and by df.groupby(['first']).size() we get
first
a 1
b 5
c 6
so, 'c' and 'b' are the most frequent items in the 'first' column. We want the 2 most frequent items in the 'second' column within 'c' and 'b' groups. If we do df.groupby(['first', 'second']).size() we get
first second
a x 1
b v 2
w 2
z 1
c x 1
y 2
z 3
therefore we're interested in 'z' and 'y' within 'c', and 'v' and 'w' within 'b', that is
first second
c z 3
y 2
b v 2
w 2
Series.value_counts can be used here, because it sorts by counts in descending order by default: first the top-2 values of 'first' are filtered, then, on the filtered DataFrame, the top 2 'second' values per group are returned, reordered to match idx:
Note: filtering by m is not strictly necessary, but it improves performance (only 2 groups are processed instead of all of them).
df = pd.DataFrame({'first': list('cbbcbcbccabc'), 'second': list('zvvzwyzyxxwz')})
idx = df['first'].value_counts().head(2).index
m = df['first'].isin(idx)
df = (df[m].groupby(['first'])['second']
           .apply(lambda x: x.value_counts().iloc[:2])
           .reindex(idx, level=0)
           .rename_axis(['first', 'second']))
print(df)
first second
c z 3
y 2
b w 2
v 2
Name: second, dtype: int64
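A variant sketch (my assumption, not from the original answer) reaches the same top-2-per-group result with SeriesGroupBy.value_counts, which already sorts counts in descending order within each group; ties (here v and w, both with count 2) may come out in either order:
import pandas as pd

raw = pd.DataFrame({'first': list('cbbcbcbccabc'), 'second': list('zvvzwyzyxxwz')})
idx = raw['first'].value_counts().head(2).index
out = (raw[raw['first'].isin(idx)]
          .groupby('first')['second']
          .value_counts()               # counts sorted descending within each group
          .groupby(level=0).head(2)     # keep the top 2 rows of each group
          .reindex(idx, level=0))       # order groups as in idx ('c' first)
print(out)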
Solution for 3 levels:
df = pd.DataFrame({'second': list('cbbcbcbccabc'),
                   'third': list('zvvzwyzyxxwz')})
# build a 3-column df by stacking two copies under a new 'first' level
df = (pd.concat([df, df], keys=('a', 'b'))
        .reset_index(level=1, drop=True)
        .rename_axis('first')
        .reset_index())
# print(df)
idx = df['first'].value_counts().head(2).index
m = df['first'].isin(idx)
idx1 = (df[m].groupby(['first'])['second']
             .apply(lambda x: x.value_counts().iloc[:2])
             .index)
print(idx1)
df = df.set_index(['first', 'second'])
df = (df.loc[idx1].groupby(['first', 'second'], sort=False)['third']
        .apply(lambda x: x.value_counts().iloc[:2])
        .rename_axis(['first', 'second', 'third']))
print(df)
first second third
a c z 3
y 2
b w 2
v 2
b c z 3
y 2
b w 2
v 2
Name: third, dtype: int64

Aggregate over difference of levels of factor in Pandas DataFrame?

Given df1:
A B C
0 a 7 x
1 b 3 x
2 a 5 y
3 b 4 y
4 a 5 z
5 b 3 z
How do I get df2 where, for each value in C of df1, a new column D has the difference between the df1 values in column B where column A == a and where column A == b:
C D
0 x 4
1 y 1
2 z 2
I'd use a pivot table:
df = df1.pivot_table(columns='A', values='B', index='C')
df2 = pd.DataFrame({'D': df['a'] - df['b']})
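For reference, a minimal runnable version (constructing df1 from the table above; the trailing reset_index is my addition to match the requested layout):
import pandas as pd

df1 = pd.DataFrame({'A': list('ababab'),
                    'B': [7, 3, 5, 4, 5, 3],
                    'C': list('xxyyzz')})
# Wide table: one row per C, one column per level of A
df = df1.pivot_table(columns='A', values='B', index='C')
df2 = pd.DataFrame({'D': df['a'] - df['b']}).reset_index()
print(df2)   # D is 4 for x, 1 for y, 2 for z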
The risk in the answer given by @YOBEN_S is that it will fail if b appears before a for a given value of C.

Finding total number of unique values in a multi indexed dataframe?

For the given dataframe below, I want to know, for each index in X (i.e. 1, 2 and 3), whether the values in the other index Y are the same, and the total number of such matching groups.
So for X index 1, I want to know which values are in Y (a, b and c) and whether that equals the Y index values for 2 and 3. Here the Y values for X index 1 are equal to the Y values for X index 3, i.e. they both have a, b and c, while those for 2 are not the same.
X Y
1 a A
b B
c C
2 a A
b B
3 a A
b B
c D
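(For reference, the Series shown above can be reconstructed as follows; the exact construction is my assumption:)
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'A', 'B', 'A', 'B', 'D'],
              index=pd.MultiIndex.from_tuples(
                  [(1, 'a'), (1, 'b'), (1, 'c'),
                   (2, 'a'), (2, 'b'),
                   (3, 'a'), (3, 'b'), (3, 'c')],
                  names=['X', 'Y']))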
If the longest group in the Y level contains all possible unique values, you can reshape with Series.unstack:
print (type(s))
<class 'pandas.core.series.Series'>
print (s.unstack())
Y a b c
X
1 A B C
2 A B NaN
3 A B D
Then remove rows with incomplete values, i.e. rows with missing values, using DataFrame.dropna:
df1 = s.unstack().dropna()
print (df1)
Y a b c
X
1 A B C
3 A B D
print (df1.columns.tolist())
['a', 'b', 'c']
print (len(df1))
2
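A more direct way to answer the "are the Y values the same across X groups" part (a sketch of an alternative, assuming the s reconstructed above) is to collect each X group's Y labels into a comparable object:
# Gather the Y labels of each X group as a sorted tuple, then compare
y_sets = s.reset_index().groupby('X')['Y'].agg(lambda g: tuple(sorted(g)))
print(y_sets[1] == y_sets[3])    # True: groups 1 and 3 share the same Y values
print(y_sets.value_counts())     # how many X groups share each set of Y values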

Pandas: add rows for each unique element of a column

I've got a dataframe, like so:
ID A
0 z
2 z
2 y
5 x
To which I want to add rows for each unique value of an ID column:
ID A
0 z
2 z
2 y
5 x
0 b
2 b
5 b
I'm currently doing so in a very naïve way, which is quite inefficient/slow:
IDs = df["ID"].unique()
for ID in IDs:
    df = df.append(pd.DataFrame([[ID, "b"]], columns=df.columns), ignore_index=True)
How would I go to accomplish the same without the explicit foreach, only pandas function calls?
Use drop_duplicates, overwrite column A with assign, and append (or concat) the result to the original DataFrame:
df = df.append(df.drop_duplicates("ID").assign(A='b'), ignore_index=True)
#alternative
#df = pd.concat([df, df.drop_duplicates("ID").assign(A='b')], ignore_index=True)
print (df)
ID A
0 0 z
1 2 z
2 2 y
3 5 x
4 0 b
5 2 b
6 5 b
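Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas versions the pd.concat alternative is the one to use.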

pandas get last value of column x when column y is equal to z

Suppose I create a pandas DataFrame with two columns, one of which contains some numbers and the other contains letters. Like this:
import pandas as pd
from pprint import pprint
df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
pprint(df)
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Now say that I want to make a third column (c) whose value is equal to the last value of a when b was equal to x. In cases where an 'x' has not yet been encountered in b, the value in c should default to 0.
The procedure should produce pretty much the following result:
last_a = 0
c = []
for i, b in enumerate(df['b']):
    if b == 'x':
        last_a = df.iloc[i]['a']
    c += [last_a]
df['c'] = c
pprint(df)
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
Is there a more elegant way to accomplish this either with or without pandas?
In [140]: df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
In [141]: df
Out[141]:
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Find the rows where column 'b' == 'x' and take the corresponding values from column 'a'; assigning this partial selection leaves NaN in every other row:
In [142]: df['c'] = df.loc[df['b']=='x','a']
Fill the rest of the values forward, then fill holes with 0
In [143]: df['c'] = df['c'].ffill().fillna(0)
In [144]: df
Out[144]:
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
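The two steps can also be collapsed into a single expression with Series.where (a sketch of an equivalent alternative, not from the original answer):
# Keep 'a' where b == 'x' (NaN elsewhere), carry the last seen value
# forward, and default to 0 before the first 'x'; the int cast matches
# the display above.
df['c'] = df['a'].where(df['b'] == 'x').ffill().fillna(0).astype(int)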
