Groupby and keep rows depending on string value

I have this DF:
In [106]: dfTest = pd.DataFrame( {'name':['a','a','b','b'], 'value':['x','y','x','h']})
In [107]: dfTest
Out[107]:
name value
0 a x
1 a y
2 b x
3 b h
My intention is to obtain one row per name group, where the row to keep depends on the group's contents. If a group contains h in value, I'd like to keep that row. Otherwise, any row from the group would do, such as:
In [109]: dfTest
Out[109]:
name value
0 a x
1 b h

You can do it this way:
dfTest.reindex(dfTest.groupby('name')['value'].agg(lambda x: (x=='h').idxmax()))
Output:
      name value
value
0        a     x
3        b     h
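Why this works: `(x == 'h')` is a boolean Series, and `idxmax` returns the index label of the first maximum. That is the first `True` when the group contains an `'h'`, and simply the group's first label when it does not. Below is a minimal sketch of the same idea using `.loc`, which avoids the extra index name in the output. The names `is_h`, `keep` and `result` are mine, and the boolean mask is cast to int only to stay on the safe side across pandas versions.

```python
import pandas as pd

dfTest = pd.DataFrame({'name': ['a', 'a', 'b', 'b'],
                       'value': ['x', 'y', 'x', 'h']})

# 1 where the value is 'h', 0 elsewhere.
is_h = dfTest['value'].eq('h').astype(int)

# For each name, take the label of the first maximum: the 'h' row if there
# is one, otherwise the first row of the group.
keep = is_h.groupby(dfTest['name']).idxmax()

result = dfTest.loc[keep].reset_index(drop=True)
print(result)
```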

Another approach with drop_duplicates:
(dfTest.loc[dfTest['value'].eq('h').sort_values().index]
.drop_duplicates('name', keep='last')
)
Output:
name value
1 a y
3 b h

Related

Pandas: Groupby, concatenate one column and identify the row with maximums

I have a dataframe like this:
prefix input_text target_text score
X V A 1
X V B 2
X W C 1
X W B 3
I want to group them by some columns and concatenate the column target_text, meanwhile get the maximum of score in each group and identify the target_text with highest score, like this:
prefix input_text target_text score top
X V A, B 2 B
X W C, B 3 B
This is my code which does the concatenation, however I just don't know about the rest.
df['target_text'] = df[['prefix', 'target_text','input_text']].groupby(['input_text','prefix'])['target_text'].transform(lambda x: '<br />'.join(x))
df = df.drop_duplicates(subset=['prefix','input_text','target_text'])
In concatenation I use html code to concat them, if I could bold the target with highest score, then it would be nice.

Let us try
df.sort_values('score',ascending=False).\
drop_duplicates(['prefix','input_text']).\
rename(columns={'target_text':'top'}).\
merge(df.groupby(['prefix','input_text'],as_index=False)['target_text'].agg(','.join))
Out[259]:
prefix input_text top score target_text
0 X W B 3 C,B
1 X V B 2 A,B

groupby agg would be useful here:
new_df = (
df.groupby(['prefix', 'input_text'], as_index=False).agg(
target_text=('target_text', ', '.join),
score=('score', 'max'),
top=('score', 'idxmax')
)
)
new_df['top'] = df.loc[new_df['top'], 'target_text'].values
new_df:
prefix input_text target_text score top
0 X V A, B 2 B
1 X W C, B 3 B
Aggregations are as follows:
target_text is joined together using ', '.join.
score is aggregated to keep only the max value with 'max'.
top is the idxmax of the score column.
new_df = (
df.groupby(['prefix', 'input_text'], as_index=False).agg(
target_text=('target_text', ', '.join),
score=('score', 'max'),
top=('score', 'idxmax')
)
)
prefix input_text target_text score top
0 X V A, B 2 1
1 X W C, B 3 3
The values in top are the corresponding indexes from df:
prefix input_text target_text score
0 X V A 1
1 X V B 2 # index 1
2 X W C 1
3 X W B 3 # index 3
These values need to be "looked up" from df:
df.loc[new_df['top'], 'target_text']
1 B
3 B
Name: target_text, dtype: object
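The result is then assigned back to new_df. .values is needed to bypass index alignment, since the looked-up Series is indexed by the labels from df (1 and 3) rather than by new_df's index (0 and 1).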
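The question also asked about bolding the top-scoring target when the texts are joined with HTML, which none of the answers cover. The following is only a sketch of one way to do it, reusing the idxmax idea. The helper name `join_with_bold_top`, the frame name `html_df` and the choice of `<b>` markup are my assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'prefix': ['X', 'X', 'X', 'X'],
    'input_text': ['V', 'V', 'W', 'W'],
    'target_text': ['A', 'B', 'C', 'B'],
    'score': [1, 2, 1, 3],
})

def join_with_bold_top(group):
    # Hypothetical helper: join the group's target_text values with <br />,
    # wrapping the one with the highest score in <b>...</b>.
    top_label = group['score'].idxmax()
    parts = [
        f'<b>{text}</b>' if label == top_label else text
        for label, text in group['target_text'].items()
    ]
    return '<br />'.join(parts)

# Build one output row per (prefix, input_text) group.
rows = []
for (prefix, input_text), group in df.groupby(['prefix', 'input_text']):
    rows.append({
        'prefix': prefix,
        'input_text': input_text,
        'target_text': join_with_bold_top(group),
        'score': group['score'].max(),
    })
html_df = pd.DataFrame(rows)
print(html_df)
```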

Try sort_values(), groupby() and agg():
out=(df.sort_values('score')
.groupby(['prefix', 'input_text'], as_index=False)
.agg(target_text=('target_text', ', '.join), score=('score', 'max'), top=('target_text', 'last')))
output of out:
input_text prefix score target_text top
0 V X 2 A, B B
1 W X 3 C, B B
Explanation:
We sort by 'score', then group by the 'input_text' and 'prefix' columns and aggregate as follows:
the values of 'target_text' are joined together with ', '
only the max value of the 'score' column is kept, because we aggregate with max
the last value of 'target_text' is taken; since the rows were sorted by score beforehand, the last one in each group is the highest-scoring one
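Update:
If there are more columns to include, you can add them to the aggregation when there are only a few; otherwise:
newdf=df.sort_values('score',ascending=False).drop_duplicates(['prefix','input_text'],ignore_index=True)
#Finally merge on the group keys so the rows line up (newdf is ordered by score, out by group key, so a plain index join can pair rows from different groups)
#cols is the list of column names that you want
out=out.merge(newdf[['prefix','input_text']+cols],on=['prefix','input_text'])
#For example:
#out=out.merge(newdf[['prefix','input_text','target_first','target_last']],on=['prefix','input_text'])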

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up the names of only some of the columns in a dataframe.
Replacing the column names on a slice of the dataframe doesn't work. Why is that?
Let's say we have the following dataframe
(copy-able code to reproduce the data is at the bottom):
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
'ColAfjkj':['a', 'b', 'c', 'd'],
'ColBhuqwa':['e', 'f', 'g', 'h'],
'ColCouiqw':['i', 'j', 'k', 'l']})

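This happens because df.iloc[:, 1:] returns a new DataFrame, so assigning to its columns only renames that temporary object and leaves df untouched. In addition, a pandas Index is immutable, so part of df.columns cannot be changed in place. The documentation for class pandas.Index defines it as:
Immutable ndarray implementing an ordered, sliceable set
So in order to change the names you'll have to assign a complete new list of column names, for instance with: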
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace (rename returns a new DataFrame, so assign the result back):
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
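A short sketch showing that renaming the columns of a slice leaves the original frame untouched, followed by a variant of `rename` that takes a function instead of a dictionary. The names `part`, `keep_as_is`, `shorten` and `renamed` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

# The slice is a separate DataFrame object, so renaming its columns
# does not touch df.
part = df.iloc[:, 1:]
part.columns = part.columns.str[:4]
print(list(df.columns))       # original names are unchanged

# rename accepts a function; keep the first column as is, truncate the rest.
keep_as_is = df.columns[0]

def shorten(col):
    return col if col == keep_as_is else col[:4]

renamed = df.rename(columns=shorten)
print(list(renamed.columns))
```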

To overwrite column names you can use the .rename() method.
It will look like:
df.rename(columns={'ColAfjkj':'ColA',
'ColBhuqwa':'ColB',
'ColCouiqw':'ColC'}
, inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use list comprehension and a conditional to rename just the columns you want
df.columns = [x if x not in mask else x[:4] for x in df.columns]

Contains one of several values

I have a dataframe df with column x and a list lst =["apple","peach","pear"]
df
x
apple234
pear231
banana233445
If row1 in df["x"] contains any of the values in lst: then 1 else 0
Final data should look like this:
df
x y
apple234 -- 1
pear231 -- 1
banana233445 - 0

Use str.contains with the list values joined by the regex alternation |, then cast the boolean mask to 0/1 with astype:
lst =["apple","peach","pear"]
df['y'] = df['x'].str.contains('|'.join(lst)).astype(int)
print (df)
x y
0 apple234 1
1 pear231 1
2 banana233445 0
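One caveat: `str.contains` treats the pattern as a regular expression by default, so list entries containing characters such as `.` or `+` would be interpreted as regex syntax. A hedged sketch that escapes each entry with `re.escape` first (the entry `c++` and the sample rows are made up for illustration):

```python
import re
import pandas as pd

lst = ["apple", "peach", "c++"]
df = pd.DataFrame({'x': ['apple234', 'c++17', 'banana233445']})

# Escape each entry so it is matched literally, then join with | (regex OR).
pattern = '|'.join(re.escape(item) for item in lst)
df['y'] = df['x'].str.contains(pattern).astype(int)
print(df)
```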

pandas.DataFrame.apply() using index as an arg

I'm trying to apply a function to every row in a pandas dataframe. The number of columns is variable but I'm using the index in the function as well
def pretend(np_array, index):
    sum(np_array)*index
df = pd.DataFrame(np.arange(16).reshape(8,2))
answer = df.apply(pretend, axis=1, args=(df.index))
I shaped it to 8x2 but I'd like it to work on any shape I pass it.

the index values can be accessed via the .name attribute:
In [3]:
df = pd.DataFrame(data = np.random.randn(5,3), columns=list('abc'))
df
Out[3]:
a b c
0 -1.662047 0.794483 0.672300
1 -0.812412 -0.325160 -0.026990
2 -0.334991 0.412977 -2.016004
3 -1.337757 -1.328030 -1.005114
4 0.699106 -1.527408 -1.288385
In [8]:
def pretend(np_array):
    return (np_array.sum())*np_array.name
df.apply(lambda x: pretend(x), axis=1)
Out[8]:
0 -0.000000
1 -1.164561
2 -3.876037
3 -11.012701
4 -8.466748
dtype: float64
You can see that the first row becomes 0 as the index value is 0
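For this particular function a row-wise `apply` is not strictly needed. Assuming a numeric index as in the question, a vectorized sketch that gives the same numbers for any number of columns (the names `by_apply` and `vectorized` are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(8, 2))

# Row-wise version, as in the answer: the row's index label is row.name.
by_apply = df.apply(lambda row: row.sum() * row.name, axis=1)

# Vectorized version: sum across the columns, then multiply by the index values.
vectorized = df.sum(axis=1) * df.index.to_numpy()

print(vectorized)
```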

Count size of rolling intersection in pandas

I have a dataframe that consists of group labels ('B') and elements of each group ('A'). The group labels are ordered, and I want to know how many elements of group i show up in group i+1.
An example:
df= pd.DataFrame({ 'A': ['a','b','c','a','c','a','d'], 'B' : [1,1,1,2,2,3,3]})
A B
0 a 1
1 b 1
2 c 1
3 a 2
4 c 2
5 a 3
6 d 3
The desired output would be something like:
B
1 NaN
2 2
3 1
One way to go about this would be to compute the number of distinct elements in the union of group i and group i+1 and then subtract the number of distinct elements in each group. I've tried:
pd.rolling_apply(grp['A'], lambda x: len(x.unique()),2)
but this produces an error:
AttributeError: 'Series' object has no attribute 'type'
How do I get this to work with rolling_apply or is there a better way to attack this problem?

An approach with using sets and shifting the result:
First grouping the dataframe and then converting column A of each group into a set:
In [86]: grp = df.groupby('B')
In [87]: s = grp.apply(lambda x : set(x['A']))
In [88]: s
Out[88]:
B
1 set([a, c, b])
2 set([a, c])
3 set([a, d])
dtype: object
To calculate the intersection between consecutive sets, make a shifted version (I replace the NaN with an empty set for the next step):
In [89]: s2 = s.shift(1).fillna(set([]))
In [90]: s2
Out[90]:
B
1 set([])
2 set([a, c, b])
3 set([a, c])
dtype: object
Combine both series and calculate the length of the intersection:
In [91]: s.combine(s2, lambda x, y: len(x.intersection(y)))
Out[91]:
B
1 0
2 2
3 1
dtype: object
Another way to do the last step (for sets & means intersection):
df = pd.concat([s, s2], axis=1)
df.apply(lambda x: len(x[0] & x[1]), axis=1)
rolling_apply does not work here because 1) it was given a GroupBy object rather than a Series, and 2) it only works with numerical values.
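The session above dates from an old pandas and Python 2 setup, and newer pandas versions may not accept a set as a `fillna` value. A sketch of the same set-and-shift idea that sidesteps `fillna` by pairing each group's set with the previous one in plain Python (the names `sets`, `values` and `overlap` are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a', 'c', 'a', 'd'],
                   'B': [1, 1, 1, 2, 2, 3, 3]})

# One set of elements per group label, ordered by B.
sets = df.groupby('B')['A'].apply(set)

# Pair each group's set with the previous group's set; the first group
# has no predecessor, so it gets NaN.
values = [np.nan] + [
    len(prev & cur) for prev, cur in zip(sets.iloc[:-1], sets.iloc[1:])
]
overlap = pd.Series(values, index=sets.index)
print(overlap)
```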
