Pandas increment values on groupby with a condition - python

Let's say I have a df like this. I need to group by links,
and if a link is repeated more than 3 times, its page suffix should be incremented:
name links
A https://a.com/-pg0
B https://b.com/-pg0
C https://c.com/-pg0
D https://c.com/-pg0
x https://c.com/-pg0
y https://c.com/-pg0
z https://c.com/-pg0
E https://e.com/-pg0
F https://e.com/-pg0
Expected output: the link for names C, D, x, y, z is repeated more than 3 times, so the first 3 rows keep suffix 0 and the following rows are incremented:
name links
A https://a.com/-pg0
B https://b.com/-pg0
C https://c.com/-pg0
D https://c.com/-pg0
x https://c.com/-pg0
y https://c.com/-pg1
z https://c.com/-pg1
E https://e.com/-pg0
F https://e.com/-pg0

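You can try cumcount with floor division (//): cumcount numbers the rows within each links group, and // 3 turns that position into a block number that increases every 3 rows.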
s = df.groupby('links').cumcount()//3
Out[125]:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 0
8 0
dtype: int64
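# the links already end in a digit, so strip it before appending the new counter
df['links'] = df['links'].str.replace(r'\d+$', '', regex=True) + s.astype(str)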
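As a self-contained check of the cumcount idea, here is a minimal sketch that rebuilds the sample frame from the question and rewrites the trailing page number. The variable name `page` and the regex used to strip the old suffix are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": list("ABCDxyzEF"),
    "links": ["https://a.com/-pg0", "https://b.com/-pg0"]
             + ["https://c.com/-pg0"] * 5
             + ["https://e.com/-pg0"] * 2,
})

# Position of each row inside its link group, floor-divided into blocks of 3.
page = df.groupby("links").cumcount() // 3

# Drop the existing trailing digits, then append the block number.
df["links"] = df["links"].str.replace(r"\d+$", "", regex=True) + page.astype(str)

print(df)
```

The block number is computed before the links column is overwritten, so the grouping still sees the original, identical links.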

Related

Create a dataframe of all combinations of columns names per row based on mutual presence of columns pairs

I'm trying to create a dataframe based on another dataframe and a specific condition.
Given the pandas dataframe above, I'd like to build a two-column dataframe in which each row is a pair of words whose values are both different from 0 in the same source row (that is, they coexist in that row), starting with the first row.
For example, for this part of the image above, the new dataframe that I want looks like the following:
and so on...
Does anyone have a tip on how I can do it? I'm struggling... Thanks!
As you didn't provide a text example, here is a dummy one:
>>> df
A B C D E
0 0 1 1 0 1
1 1 1 1 1 1
2 1 0 0 1 0
3 0 0 0 0 1
4 0 1 1 0 0
You could use a combination of masking, explode and itertools.combinations:
import pandas as pd
from itertools import combinations
mask = df.gt(0)
series = (mask*df.columns).apply(lambda x: list(combinations(set(x).difference(['']), r=2)), axis=1)
pd.DataFrame(series.explode().dropna().to_list(), columns=['X', 'Y'])
output:
X Y
0 C E
1 C B
2 E B
3 E D
4 E C
5 E B
6 E A
7 D C
8 D B
9 D A
10 C B
11 C A
12 B A
13 A D
14 C B
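Because the answer goes through `set`, the order of the pairs in the output can vary. If a stable order matters, one possible variant (a sketch, not the answerer's method) walks the boolean mask row by row and takes combinations of the column names in column order. The frame below is the dummy example from the answer; the names `pairs` and `result` are illustrative.

```python
from itertools import combinations

import pandas as pd

# Dummy data copied from the answer above.
df = pd.DataFrame({
    "A": [0, 1, 1, 0, 0],
    "B": [1, 1, 0, 0, 1],
    "C": [1, 1, 0, 0, 1],
    "D": [0, 1, 1, 0, 0],
    "E": [1, 1, 0, 1, 0],
})

# For every row, keep the columns with a value > 0 and pair them up.
pairs = [
    pair
    for row in df.gt(0).to_numpy()
    for pair in combinations(df.columns[row], 2)
]

result = pd.DataFrame(pairs, columns=["X", "Y"])
print(result)
```

Rows with fewer than two non-zero columns (row 3 here) simply contribute no pairs.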

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
My goal is to get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match columns A and B, following the order of column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here, which is why the row from df_1 with idx = 4 does not match the row from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (
    pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx',
                  left_by='A', right_by='B', direction='backward', tolerance=2)
    .dropna()
    .drop(labels='idx', axis='columns')
    .reset_index(drop=True)
)
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
                       left_on=['idx', 'A'], right_on=['idx', 'B'])
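To try the merge_asof call from the question's edit on the sample data, here is a minimal sketch. It assumes idx is an ordinary integer column in both frames (the question does not say whether it is a column or the index) and passes it through `on=` instead of `left_on=`/`right_on=`.

```python
import pandas as pd

# Sample frames from the question; idx is assumed to be a regular column.
df_1 = pd.DataFrame({"idx": range(6),
                     "A": [1, 2, 3, 4, 1, 2],
                     "X": list("ABCDEF")})
df_2 = pd.DataFrame({"idx": range(6),
                     "B": [1, 2, 4, 2, 3, 1],
                     "Y": list("HIJKLM")})

# For each df_1 row, take the latest df_2 row with the same value (A == B)
# whose idx is at most 2 positions earlier; rows without a match get NaN.
df_result = (
    pd.merge_asof(df_1, df_2, on="idx", left_by="A", right_by="B",
                  direction="backward", tolerance=2)
    .dropna()
    .drop(columns="idx")
    .reset_index(drop=True)
)

print(df_result)
```

Both frames must already be sorted by idx, which merge_asof requires of its `on` key.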

Pandas: how to do value counts within groups

I have the following dataframe. I want to group by a and b first. Within each group, I need to do a value count on c and pick only the value with the highest count. If more than one c value ties for the highest count in a group, just pick any one.
a b c
1 1 x
1 1 y
1 1 y
1 2 y
1 2 y
1 2 z
2 1 z
2 1 z
2 1 a
2 1 a
The expected result would be
a b c
1 1 y
1 2 y
2 1 z
What is the right way to do it? It would be even better if I can print out each group with c's value counts sorted as an intermediate step.
You are looking for .value_counts():
df.groupby(['a', 'b'])['c'].value_counts()
a  b  c
1  1  y    2
      x    1
   2  y    2
      z    1
2  1  a    2
      z    2
Name: c, dtype: int64
Grouping the original dataframe by ['a', 'b'] and taking .max() returns the largest c value in each group, which is not necessarily the most frequent one:
df.groupby(['a', 'b'])['c'].max()
You can also aggregate the 'max' and 'count' values together, using named aggregation:
df.groupby(['a', 'b'])['c'].agg(max='max', count='count').reset_index()
Try:
df = (
    df.groupby(["a", "b", "c"])["c"].count()
    .sort_values(ascending=False)
    .reset_index(name="dropme")
    .drop_duplicates(subset=["a", "b"], keep="first")
    .drop("dropme", axis=1)
)
Outputs:
a b c
0 2 1 z
2 1 2 y
3 1 1 y
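Another way to get one most-frequent c per (a, b) group is to run value_counts inside each group and keep the label with the highest count. This is a sketch that combines the ideas from the answers above rather than any single answerer's code; the names `counts` and `top` are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "b": [1, 1, 1, 2, 2, 2, 1, 1, 1, 1],
    "c": ["x", "y", "y", "y", "y", "z", "z", "z", "a", "a"],
})

# Intermediate step: counts of c within each (a, b) group, sorted per group.
counts = df.groupby(["a", "b"])["c"].value_counts()
print(counts)

# Most frequent c per group; when counts tie, one of the tied values is kept.
top = (
    df.groupby(["a", "b"])["c"]
    .agg(lambda s: s.value_counts().idxmax())
    .reset_index()
)
print(top)
```

For the group (2, 1), where z and a both appear twice, either value may come back, which matches the "just pick any one" requirement.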

pandas groupby apply list from column based on binary column

I have a dataframe:
id to from flag
1 a x 1
1 a y 0
2 c z 1
2 c m 1
2 b v 0
2 b p 0
and I want to groupby(['id', 'to']) and return a list of the elements in from that have a flag 1 only. If no element has a flag 1, then the resulting output should be 'None'. The desired output should be:
id to from
1 a ['x']
2 c ['z','m']
2 b None
I can do it with apply, i.e.
out_df = df.groupby(['id', 'to']).apply(
    lambda x: match_to_list(x['from'], x['flag'])).reset_index()
where:
def match_to_list(to, flag):
    # keep the values whose flag is non-zero
    matches = list(to.iloc[flag.to_numpy().nonzero()[0]])
    if len(matches) == 0:
        return 'None'
    else:
        return matches
but this is taking too long and I think there must be a better way that I am missing.
Any help/insights would be very appreciated! TIA
IIUC, first create the index as a MultiIndex of the unique (id, to) pairs, then filter on flag, group and aggregate with agg, and reindex:
idx=pd.MultiIndex.from_tuples(list(map(tuple,df[['id','to']].drop_duplicates().values.tolist())))
yourdf=df.loc[df.flag==1].groupby(['id','to'])['from'].agg(list).reindex(idx).reset_index()
yourdf
Out[13]:
level_0 level_1 from
0 1 a [x]
1 2 c [z, m]
2 2 b NaN
Or just use apply, which is less efficient but more readable:
df.groupby(['id','to']).apply(lambda x : x['from'][x['flag']==1].tolist() if (x['flag']==1).any() else None).reset_index()
Out[17]:
id to 0
0 1 a [x]
1 2 b None
2 2 c [z, m]
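The reindex approach above returns columns named level_0 and level_1 because the MultiIndex is built from bare tuples. A small variation (a sketch, assuming the same sample data as the question) builds the index with `pd.MultiIndex.from_frame`, which keeps the id and to names:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 2],
    "to": ["a", "a", "c", "c", "b", "b"],
    "from": ["x", "y", "z", "m", "v", "p"],
    "flag": [1, 0, 1, 1, 0, 0],
})

# All (id, to) pairs in order of first appearance, with level names preserved.
keys = pd.MultiIndex.from_frame(df[["id", "to"]].drop_duplicates())

out = (
    df[df["flag"] == 1]           # keep only flagged rows
    .groupby(["id", "to"])["from"]
    .agg(list)                    # collect the remaining 'from' values
    .reindex(keys)                # re-add groups that had no flagged row (NaN)
    .reset_index()
)
print(out)
```

Groups without any flagged row come back as NaN here rather than None.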

Add a column results in difference of rows

Let's say I have a data frame:
A B
0 a b
1 c d
2 e f
and what I am aiming for is the difference between the first row of column A and each row of column A.
Like this:
A B Ic
0 a b (a-a)
1 c d (a-c)
2 e f (a-e)
This is what I tried:
df['dA'] = df['A'] - df['A']
But it doesn't give me the result I needed. Any help would be greatly appreciated.
Select the first value of column A, either with loc (by index label and column name) or with iat (by position within the column), and subtract column A from it:
df['Ic'] = df.loc[0,'A'] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
df['Ic'] = df['A'].iat[0] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
Detail:
print (df.loc[0,'A'])
4
print (df['A'].iat[0])
4
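The two selectors only agree when the first row happens to carry the label 0. The sketch below reuses the numeric values from the answer but gives the frame a hypothetical non-default index (10, 11, 12) to show that the positional form still picks the first row:

```python
import pandas as pd

# Values from the answer; the index labels 10..12 are made up for illustration.
df = pd.DataFrame({"A": [4, 1, 0], "B": ["b", "d", "f"]}, index=[10, 11, 12])

# iat is positional, so it returns the first row regardless of its label.
first = df["A"].iat[0]

# Subtract every value of A from the first one.
df["Ic"] = first - df["A"]
print(df)
```

With this index, `df.loc[0, 'A']` would raise a KeyError because no row is labelled 0, so the positional lookup is the safer choice when the index is not the default range.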
