I am trying to get the purchase frequency of all combinations of products.
Suppose my transactions are the following
userid product
u1 A
u1 B
u1 C
u2 A
u2 C
So the solution should be
combination count_of_distinct_users
A 2
B 1
C 2
A, B 1
A, C 2
B, C 1
A, B, C 1
i.e. 2 users have purchased product A, one user has purchased product B, ..., 2 users have purchased both products A and C, ...
Define a function combine to generate all non-empty combinations:
from itertools import combinations

def combine(s):
    # all non-empty combinations of the items in s, of every length
    result = []
    for i in range(1, len(s) + 1):
        for c in combinations(s, i):
            result += [c]
    return result
This will give all combinations in a column:
df.groupby('userid')['product'].apply(combine)
# Out:
# userid
# u1    [(A,), (B,), (C,), (A, B), (A, C), (B, C), (A,...
# u2    [(A,), (C,), (A, C)]
# Name: product, dtype: object
Now use explode():
df.groupby('userid')['product'].apply(combine).reset_index(name='product_combos') \
    .explode('product_combos').groupby('product_combos') \
    .size().reset_index(name='user_count')
# Out:
# product_combos user_count
# 0 (A,) 2
# 1 (A, B) 1
# 2 (A, B, C) 1
# 3 (A, C) 2
# 4 (B,) 1
# 5 (B, C) 1
# 6 (C,) 2
Be careful with the combinations: a user with n distinct products generates 2^n - 1 of them, so the exploded list gets very large with many different products!
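For reference, a minimal self-contained run of the whole pipeline on the sample data, with combine compacted into a comprehension (the DataFrame construction here is my own setup, matching the question's columns):
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'userid': ['u1', 'u1', 'u1', 'u2', 'u2'],
                   'product': ['A', 'B', 'C', 'A', 'C']})

def combine(s):
    # every non-empty combination of the group's products
    return [c for i in range(1, len(s) + 1) for c in combinations(s, i)]

counts = (df.groupby('userid')['product'].apply(combine)
            .reset_index(name='product_combos')
            .explode('product_combos')
            .groupby('product_combos').size()
            .reset_index(name='user_count'))
print(counts)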
My simple trick here is to convert the df to a dict mapping each product to its list of users, like {'A': [u1, u2], 'B': [u1]}, then generate every combination of products and merge the user counts of the products involved. For example, A: [u1, u2] and B: [u1] merge to the counts [2, 1], and taking the minimum of that list gives the final count of 1.
Code:
import pandas as pd
from more_itertools import powerset

d = df.groupby('product')['userid'].apply(list).to_dict()
# Output: {'A': ['u2', 'u1'], 'B': ['u1'], 'C': ['u1', 'u2']}
new = pd.DataFrame([', '.join(i) for i in list(powerset(d.keys()))[1:]],  # [1:] skips the empty set
                   columns=['combination'])
# Output: ['A', 'B', 'C', 'A, B', 'A, C', 'B, C', 'A, B, C']
new['count'] = new['combination'].apply(lambda x: min(len(d[y]) for y in x.split(', ')))
new
Output:
  combination  count
0           A      2
1           B      1
2           C      2
3        A, B      1
4        A, C      2
5        B, C      1
6     A, B, C      1
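Note that the minimum of the per-product counts matches the distinct-user count only when the smallest user list is contained in the others, as it happens to be in this sample. A sketch of a variant that intersects the user sets instead, which stays exact for arbitrary data:
sets = df.groupby('product')['userid'].apply(set).to_dict()
exact = pd.DataFrame({'combination': [', '.join(c) for c in list(powerset(sets.keys()))[1:]]})
# count the users present in every product of the combination
exact['count'] = exact['combination'].apply(
    lambda x: len(set.intersection(*(sets[y] for y in x.split(', ')))))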
Suppose I have a dataframe like
A B
1 a
1 b
1 c
1 d
1 e
2 a
2 v
2 r
Now suppose I want to select rows from each group of column A in batches of size 2, interleaving the groups. I want code that returns the following sequence:
A B
1 a
1 b
2 a
2 v
1 c
1 d
I omitted the remaining rows because they can't form a full batch of size 2 within the same A group.
Here's part of my code, but I am still looking for something shorter:
dict_dfs = dict(tuple(self.split_df.groupby("A")))
bs = self.batch_size
keys = list(dict_dfs)
total = len(self.split_df)
ii = 0
rr = 0
start = {key: 0 for key in keys}  # next unconsumed row per group
rows = []
while ii < total:
    key = keys[rr % len(keys)]  # round-robin over the groups
    rr += 1
    df = dict_dfs[key]
    kk = 0
    stop = start[key] + bs
    for index, row in df.iterrows():
        if kk < start[key]:  # skip rows consumed in earlier batches
            kk += 1
            continue
        batch = {}
        batch["A"] = row["A"]
        ...
        rows.append(batch)
        start[key] += 1
        ii += 1
        kk += 1
        if kk >= stop:  # a full batch was taken from this group
            break
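For comparison, a much shorter sketch of the same idea (my own take, not part of the original code): number the rows within each A group, keep only rows that complete a full batch, then sort by (batch, A) with a stable sort so the groups interleave:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1, 1, 2, 2, 2],
                   "B": ["a", "b", "c", "d", "e", "a", "v", "r"]})
bs = 2

tmp = df.assign(batch=df.groupby("A").cumcount() // bs)
full = tmp.groupby(["A", "batch"])["B"].transform("size") == bs
out = tmp[full].sort_values(["batch", "A"], kind="stable").drop(columns="batch")
# out: 1 a, 1 b, 2 a, 2 v, 1 c, 1 d -- the sequence from the question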
I have a dataframe similar to this one:
col inc_idx
0 A 1
1 B 1
2 C 1
3 A 2
4 A 3
5 B 2
6 D 1
7 E 1
8 F 1
9 F 2
10 Z 1
And I'm trying to iterate the df by batches:
First loop: All col rows with inc_idx >= 1 and inc_idx <=2
A 1
A 2
B 1
B 2
...
Second loop: All col rows with inc_idx >= 3 and inc_idx <=4
A 3
The way I'm doing it now leaves a lot of room for improvement:
i = 0
while True:
    for col, grouped_rows in df.groupby(by=['col']):
        from_idx = i * 2
        to_idx = from_idx + 2
        items = grouped_rows.iloc[from_idx:to_idx].values.tolist()
    i += 1
I think there's got to be a more efficient approach, and also a way to remove the "while True" loop and instead just let the inner loop run out of items.
I don't know exactly what you want to do. Here's something that groups the rows.
df.groupby((df.inc_idx + 1) // 2).agg(list)
col inc_idx
inc_idx
1 [A, B, C, A, B, D, E, F, F, Z] [1, 1, 1, 2, 2, 1, 1, 1, 2, 1]
2 [A] [3]
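If you want to iterate batch by batch instead of aggregating, the same grouping key drives a plain for loop with no while True (a small sketch):
for batch, chunk in df.groupby((df.inc_idx + 1) // 2):
    # chunk holds every row whose inc_idx falls in this batch
    print(batch, chunk['col'].tolist(), chunk['inc_idx'].tolist())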
I've found (I think) a simpler way to solve it. I'll add a new "batch" column:
df['batch'] = (df['inc_idx'] - 1) // 2  # inc_idx 1 and 2 -> batch 0, 3 and 4 -> batch 1
With this new column, now I can simply do something like:
df.groupby(by=['col', 'batch'])
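Putting it together on the sample data (the print is just to show which rows land in each (col, batch) group):
df['batch'] = (df['inc_idx'] - 1) // 2
for (col, batch), rows in df.groupby(by=['col', 'batch']):
    print(col, batch, rows['inc_idx'].tolist())
# ('A', 0) -> [1, 2]; ('A', 1) -> [3]; ('B', 0) -> [1, 2]; ...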
I have a dataframe with 3 columns. I need to get the values of columns A and B for the middle row among the rows where C = 1. If the number of rows with C = 1 is even, I want the first of the two middle rows.
For example, this one has an odd number of rows with C = 1:
A B C
w y 0
c v 0
t o 1
e p 1
t b 1
u e 0
q p 0
The row in the middle when C = 1 is
A B C
e p 1
Therefore, it should return
df_return
A B C
e p 1
When we have an even number of rows with C = 1, for example:
A B C
w y 0
c v 0
t o 1
e p 1
t b 1
u e 1
r e 1
u f 1
q p 0
The ones in the middle when C = 1 are
A B C
t b 1
u e 1
However, I want only 1 of them, and it should be the first one. So
df_return
A B C
t b 1
How can I do it?
One thing you should know is that A and B are ordered
Focus on the relevant part, discarding rows holding zeros:
df = df[df.C == 1]
Now it's simple. Just find the midpoint, based on length or .shape.
if len(df) > 0:
    mid = (len(df) - 1) // 2  # for an even count, this picks the first of the two middle rows
    return df.iloc[[mid]]     # double brackets keep a one-row DataFrame
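Wrapped as a function, with a quick check on the odd example (the DataFrame construction is my own setup):
import pandas as pd

def middle_of_ones(df):
    # return the middle row among rows with C == 1 (first of the two middles if even)
    ones = df[df.C == 1]
    if len(ones) == 0:
        return ones  # empty DataFrame when C is never 1
    mid = (len(ones) - 1) // 2
    return ones.iloc[[mid]]

df = pd.DataFrame({'A': ['w', 'c', 't', 'e', 't', 'u', 'q'],
                   'B': ['y', 'v', 'o', 'p', 'b', 'e', 'p'],
                   'C': [0, 0, 1, 1, 1, 0, 0]})
print(middle_of_ones(df))
#    A  B  C
# 3  e  p  1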
I have a list of pairs--stored in a DataFrame--each pair having an 'a' column and a 'b' column. For each pair I want to return the 'b's that have the same 'a'. For example, given the following set of pairs:
a b
0 c d
1 e f
2 c g
3 e h
4 i j
5 e k
I would like to end up with:
a b equivalents
0 c d [g]
1 e f [h, k]
2 c g [d]
3 e h [f, k]
4 i j []
5 e k [f, h]
I can do this with the following:
def equivalents(x):
    l = pairs[pairs["a"] == x["a"]]["b"].tolist()
    return [b for b in l if b != x["b"]]  # drop the row's own b

pairs["equivalents"] = pairs.apply(equivalents, axis=1)
But it is painfully slow on larger sets (e.g. 1 million plus pairs). Any suggestions how I could do this faster?
I think this ought to be a bit faster. First, just add them up.
df['equiv'] = df.groupby('a')['b'].transform('sum')
a b equiv
0 c d dg
1 e f fhk
2 c g dg
3 e h fhk
4 i j j
5 e k fhk
Now convert to a list and remove whichever letter is already in column 'b'.
df.apply(lambda x: [y for y in x.equiv if y != x.b], axis=1)
0 [g]
1 [h, k]
2 [d]
3 [f, k]
4 []
5 [f, h]
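The string-concatenation trick assumes every b is a single character. A sketch of a variant that also handles multi-character values (still assuming b values are unique within each 'a' group): build the list per group once, then drop each row's own b:
groups = df.groupby('a')['b'].agg(list)  # one list of b's per 'a' value
df['equivalents'] = [
    [b for b in groups[a] if b != own]
    for a, own in zip(df['a'], df['b'])
]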