I am trying to get the purchase frequency of all combinations of products.
Suppose my transactions are the following
userid product
u1 A
u1 B
u1 C
u2 A
u2 C
So the solution should be
combination count_of_distinct_users
A 2
B 1
C 2
A, B 1
A, C 2
B, C 1
A, B, C 1
i.e. 2 users have purchased product A, one user has purchased product B, ..., 2 users have purchased products A and C, ...
Define a function combine to generate all combinations:
from itertools import combinations
def combine(s):
    # Collect every non-empty combination of the items in s
    result = []
    for i in range(1, len(s)+1):
        for c in combinations(s, i):
            result.append(c)
    return result
This will give all combinations in a column:
df.groupby('user')['product'].apply(combine)
# Out:
# user
# 1 [(A,), (B,), (C,), (A, B), (A, C), (B, C), (A,...
# 2 [(A,), (C,), (A, C)]
# Name: product, dtype: object
Now use explode():
df.groupby('user')['product'].apply(combine).reset_index(name='product_combos') \
.explode('product_combos').groupby('product_combos') \
.size().reset_index(name='user_count')
# Out:
# product_combos user_count
# 0 (A,) 2
# 1 (A, B) 1
# 2 (A, B, C) 1
# 3 (A, C) 2
# 4 (B,) 1
# 5 (B, C) 1
# 6 (C,) 2
Careful with the combinations because the list gets large with many different products!
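To put a number on that warning: n distinct products have 2^n - 1 non-empty combinations. The following is only a small sketch that counts them with the same itertools.combinations call; the helper name count_combos is made up for illustration.

```python
from itertools import combinations

def count_combos(n_products):
    """Count the non-empty combinations of n_products distinct items."""
    items = range(n_products)
    return sum(
        1
        for size in range(1, n_products + 1)
        for _ in combinations(items, size)
    )

for n in (3, 10, 15):
    # The brute-force count agrees with the closed form 2**n - 1.
    print(n, count_combos(n), 2**n - 1)
```

So a user with 15 distinct products already contributes 32767 rows after explode().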
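My trick is to convert df to a dict that maps each product to its list of users, like {'A': [u1, u2], 'B': [u1]}. Then, for each combination, collect the number of users of each product in it (for A: [u1, u2] and B: [u1] that gives [2, 1]) and take the minimum of that list, so the final count is 1. Note that the minimum is only an upper bound on the number of users who bought every product in the combination. It matches the true count for this sample data, but in general the user lists themselves have to be intersected.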
Code:
import pandas as pd
from more_itertools import powerset
d = df.groupby('product')['user'].apply(list).to_dict()
##output: {'A': ['u2', 'u1'], 'B': ['u1'], 'C': ['u1', 'u2']}
new= pd.DataFrame([', '.join(i) for i in list(powerset(d.keys()))[1:]], columns =['users'])
## Output: ['A', 'B', 'C', 'A, B', 'A, C', 'B, C', 'A, B, C']
new['count'] = new['users'].apply(lambda x: min([len(d[y]) for y in x.split(', ')]))
new
Output:
users count
0 A 2
1 B 1
2 C 2
3 A, B 1
4 A, C 2
5 B, C 1
6 A, B, C 1
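Taking the minimum of the per-product counts can overcount: if only u1 bought A and only u2 bought B, the minimum is 1 although nobody bought both. A hedged sketch of the same idea with set intersection instead is shown here; the variable names buyers, rows and result are my own, and the sample frame assumes the column names user and product used in the answers above.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2"],
    "product": ["A", "B", "C", "A", "C"],
})

# Map each product to the set of users who bought it.
buyers = {product: set(users) for product, users in df.groupby("product")["user"]}

rows = []
products = sorted(buyers)
for size in range(1, len(products) + 1):
    for combo in combinations(products, size):
        # Users who bought every product in this combination.
        common = set.intersection(*(buyers[p] for p in combo))
        rows.append({"combination": ", ".join(combo), "user_count": len(common)})

result = pd.DataFrame(rows)
print(result)
```

This still enumerates every combination of products, so the warning about the number of combinations applies here as well.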
I have a dataset in the form of a pandas DataFrame.
In this dataset, I want to find the pairs of names whose values differ between the two directions. It should also work for non-square matrices. Example:
A to B is 4. So, B to A must be 4. But B to A is 8.
A to C is 5. So, C to A must be 5. OK.
A to D is 8. So, D to A must be 8. But D to A is 5.
B to C is 6. So, C to B must be 6. But C to B is 3.
and so on...
So, I want the output as:
(A,B,4) and (B,A,8)
(A,D,8) and (D,A,5)
(B,C,6) and (C,B,3)
Don't print pairs where the values are the same.
I am trying to do this with numpy arrays and dictionaries but can't figure out the exact logic.
Here's an approach using Pandas:
import pandas as pd
data = {'A':[0, 8,5,5,1], 'B':[4,0,3,7,2], 'C':[5,6,0,4,3], 'D':[8,9,2,0,4], 'F':[7,5,6,2,1]}
tdf = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])
for idx in tdf.index:
    if idx in tdf.columns:
        for col in tdf.columns:
            # tdf[idx][col] selects column idx first, then row col
            if col in tdf.index and col != idx and tdf[idx][col] != tdf[col][idx]:
                print(f'({idx}, {col}, {tdf[idx][col]}) and ({col}, {idx}, {tdf[col][idx]})')
tdf:
A B C D F
A 0 4 5 8 7
B 8 0 6 9 5
C 5 3 0 2 6
D 5 7 4 0 2
E 1 2 3 4 1
and output is of the form:
(A, B, 8) and (B, A, 4)
(A, D, 5) and (D, A, 8)
(B, A, 4) and (A, B, 8)
(B, C, 3) and (C, B, 6)
(B, D, 7) and (D, B, 9)
(C, B, 6) and (B, C, 3)
(C, D, 4) and (D, C, 2)
(D, A, 8) and (A, D, 5)
(D, B, 9) and (B, D, 7)
(D, C, 2) and (C, D, 4)
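The loop above reports every mismatch twice and, because tdf[idx][col] selects the column first, labels the values in the reverse direction from the question. A possible variation, offered only as a sketch, reads row-to-column with .loc and visits each unordered pair once; the names common and mismatches are my own.

```python
from itertools import combinations

import pandas as pd

data = {'A': [0, 8, 5, 5, 1], 'B': [4, 0, 3, 7, 2], 'C': [5, 6, 0, 4, 3],
        'D': [8, 9, 2, 0, 4], 'F': [7, 5, 6, 2, 1]}
tdf = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])

# Only names that appear as both a row and a column can be compared.
common = [name for name in tdf.index if name in tdf.columns]

mismatches = [
    ((a, b, int(tdf.loc[a, b])), (b, a, int(tdf.loc[b, a])))
    for a, b in combinations(common, 2)   # each unordered pair once
    if tdf.loc[a, b] != tdf.loc[b, a]     # row a -> column b vs. row b -> column a
]

for left, right in mismatches:
    print(left, "and", right)
```

For the sample frame this yields five pairs, starting with ('A', 'B', 4) and ('B', 'A', 8).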
Making a dataframe from the data that @itprorh66 provided:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[0, 8,5,5,1], 'B':[4,0,3,7,2], 'C':[5,6,0,4,3],
                   'D':[8,9,2,0,4], 'F':[7,5,6,2,1]}, index=['A', 'B', 'C', 'D', 'E'])
Intersect the index and the columns to create a square DataFrame:
cmg = df.index.intersection(df.columns)
df = df[cmg].loc[cmg]
We can use the numpy upper and lower triangle functions, and pull out the indices for the upper triangle:
mat = df.to_numpy()
tr = np.triu_indices(len(cmg),k=1)
Then put everything into a dataframe. Joining the row names and column names is a bit unsightly, but that's the best I can do for now:
mat = df.to_numpy()
tr = np.triu_indices(len(cmg),k=1)
match_tri = pd.DataFrame({'i1':df.index[tr[0]] + ',' + df.columns[tr[1]],
                          'v1':mat[tr],
                          'i2':df.index[tr[1]] + ',' + df.columns[tr[0]],
                          'v2':mat.T[tr]
                          })
Then we just subset based on the values:
match_tri[match_tri.v1 != match_tri.v2]
i1 v1 i2 v2
0 A,B 4 B,A 8
2 A,D 8 D,A 5
3 B,C 6 C,B 3
4 B,D 9 D,B 7
5 C,D 2 D,C 4
How can I iterate over groups of three consecutive items from an iterable? For pairs, I can do something like:
from itertools import tee
def func(iterate):
    i, j = tee(iterate)
    next(j, None)
    return zip(i, j)
l = [1,2,3,4,5]
for a, b in func(l):
    print(a, b)
> 1 2
> 2 3
> 3 4
> 4 5
You can expand on what you already did for groups of two, but with one more variable for the third item:
from itertools import tee
def func(iterate):
    i, j, k = tee(iterate, 3)
    # Advance j by one item and k by two
    next(j, None)
    next(k, None)
    next(k, None)
    return zip(i, j, k)
l = [1,2,3,4,5]
for a, b, c in func(l):
    print(a, b, c)
This outputs:
1 2 3
2 3 4
3 4 5
Note that your sample code in the question is incorrect as it is missing a call to zip in the returning value from func.
Use zip():
l = [1,2,3,4,5]
for a, b, c in zip(l, l[1:], l[2:]):
    print(a, b, c)
# 1 2 3
# 2 3 4
# 3 4 5
You can also create groups of two with this method:
l = [1,2,3,4,5]
for a, b in zip(l, l[1:]):
    print(a, b)
# 1 2
# 2 3
# 3 4
# 4 5
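Slicing as above needs a sequence, and the tee version needs one extra variable per item. If the group size should be a parameter, one possible sketch keeps a fixed-size deque as the sliding window. The generator name windows is hypothetical, and n is assumed to be at least 1.

```python
from collections import deque
from itertools import islice

def windows(iterable, n):
    """Yield overlapping tuples of n consecutive items from any iterable."""
    it = iter(iterable)
    # Fill the first window; maxlen makes later appends drop the oldest item.
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for item in it:
        window.append(item)
        yield tuple(window)

for a, b, c in windows([1, 2, 3, 4, 5], 3):
    print(a, b, c)
```

Because it only calls iter() on its input, this also works for generators and file objects, not just lists.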
Given a table of the form:
ID Sequence
1 A,C,D,E,F,G
2 D,F,G,B
3 A,B,A,C
and so on
Now I wish to arrange this data so that it can be fed into an RNN sequentially, so that I can predict the next entry in each sequence. Here is what is required (in a new dataframe), listing all possible sequences:
X Y
A,C,D E
C,D,E F
D,E,F G
D,F,G B
A,B,A C
X could be of length 3 or any custom length. How should I go about it?
Here's another way, using Series.str.split and applying pd.Series to the sublists:
In [623]: df.Sequence.str.split(',')\
...: .apply(lambda x: pd.Series([x[i : i + 3], x[i + 3]] for i in range(0, len(x)- 3))).stack()\
...: .apply(lambda x: pd.Series([x[0], x[1]]))\
...: .reset_index(drop=True)
Out[623]:
0 1
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
Setting the columns is as simple as df.columns = ['X', 'Y'].
This will do the job:
vals=[l.split(',') for l in df.Sequence.values]
X,Y=zip(*sum([[[','.join(el[i:i+3]),el[i+3]] for i in range(len(el)-3)] for el in vals],[]))
res=pd.DataFrame({'X':X,'Y':Y})
Then res is
X Y
0 A,C,D E
1 C,D,E F
2 D,E,F G
3 D,F,G B
4 A,B,A C
Here's one of the (many) ways of doing it.
In [52]: vals = df.Sequence.str.split(',')
In [53]: seqs = []
In [54]: for val in vals:
    ...:     seqs += [{'X': val[i:i+3], 'Y': val[i+3]} for i in range(len(val)-3)]
...:
In [55]: pd.DataFrame(seqs)
Out[55]:
X Y
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
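The question also asks for a custom length of X, while the answers above hard-code 3. A minimal sketch that turns the window length into a parameter follows; the function name make_windows and its argument n are assumptions of mine, and the sample frame just reproduces the table from the question.

```python
import pandas as pd

def make_windows(sequences, n=3):
    """Build (X, Y) rows: X is n consecutive items, Y is the item that follows."""
    rows = []
    for seq in sequences:
        items = seq.split(",")
        # A sequence with n items or fewer has no "next" item and yields no rows.
        for i in range(len(items) - n):
            rows.append({"X": ",".join(items[i:i + n]), "Y": items[i + n]})
    return pd.DataFrame(rows, columns=["X", "Y"])

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Sequence": ["A,C,D,E,F,G", "D,F,G,B", "A,B,A,C"],
})

res = make_windows(df["Sequence"], n=3)
print(res)
```

With n=4 only the first sequence is long enough, so the result shrinks to two rows.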
Hi I have a dataframe like this:
A B
0: some value [[L1, L2]]
I want to change it into:
A B
0: some value L1
1: some value L2
How can I do that?
Pandas >= 0.25
df1 = pd.DataFrame({'A':['a','b'],
'B':[[['1', '2']],[['3', '4', '5']]]})
print(df1)
A B
0 a [[1, 2]]
1 b [[3, 4, 5]]
df1 = df1.explode('B')
df1.explode('B')
A B
0 a 1
0 a 2
1 b 3
1 b 4
1 b 5
I don't know how good this approach is but it works when you have a list of items.
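Because each cell holds a list inside a list, explode has to run twice, and the original row labels are repeated in the result. The two calls can be chained and the index renumbered, as in this small sketch (the variable name flat is my own):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'b'],
                    'B': [[['1', '2']], [['3', '4', '5']]]})

# First explode unwraps the outer list, second explode splits the inner one;
# reset_index(drop=True) replaces the repeated labels 0, 0, 1, 1, 1 with 0..4.
flat = df1.explode('B').explode('B').reset_index(drop=True)
print(flat)
```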
You can do it this way:
In [84]: df
Out[84]:
A B
0 some value [[L1, L2]]
1 another value [[L3, L4, L5]]
In [85]: (df['B'].apply(lambda x: pd.Series(x[0]))
....: .stack()
....: .reset_index(level=1, drop=True)
....: .to_frame('B')
....: .join(df[['A']], how='left')
....: )
Out[85]:
B A
0 L1 some value
0 L2 some value
1 L3 another value
1 L4 another value
1 L5 another value
UPDATE: a more generic solution
Faster solution with chain.from_iterable and numpy.repeat:
from itertools import chain
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':['a','b'],
'B':[[['A1', 'A2']],[['A1', 'A2', 'A3']]]})
print (df)
A B
0 a [[A1, A2]]
1 b [[A1, A2, A3]]
df1 = pd.DataFrame({ "A": np.repeat(df.A.values,
[len(x) for x in (chain.from_iterable(df.B))]),
"B": list(chain.from_iterable(chain.from_iterable(df.B)))})
print (df1)
A B
0 a A1
1 a A2
2 b A1
3 b A2
4 b A3
Timings:
import random
import string
A = np.unique(np.random.randint(0, 1000, 1000))
B = [[list(string.ascii_letters[:random.randint(3, 10)])] for _ in range(len(A))]
df = pd.DataFrame({"A":A, "B":B})
print (df)
A B
0 0 [[a, b, c, d, e, f, g, h]]
1 1 [[a, b, c]]
2 3 [[a, b, c, d, e, f, g, h, i]]
3 5 [[a, b, c, d, e]]
4 6 [[a, b, c, d, e, f, g, h, i]]
5 7 [[a, b, c, d, e, f, g]]
6 8 [[a, b, c, d, e, f]]
7 10 [[a, b, c, d, e, f]]
8 11 [[a, b, c, d, e, f, g]]
9 12 [[a, b, c, d, e, f, g, h, i]]
10 13 [[a, b, c, d, e, f, g, h]]
...
...
In [67]: %timeit pd.DataFrame({ "A": np.repeat(df.A.values, [len(x) for x in (chain.from_iterable(df.B))]),"B": list(chain.from_iterable(chain.from_iterable(df.B)))})
1000 loops, best of 3: 818 µs per loop
In [68]: %timeit ((df['B'].apply(lambda x: pd.Series(x[0])).stack().reset_index(level=1, drop=True).to_frame('B').join(df[['A']], how='left')))
10 loops, best of 3: 103 ms per loop
I can't find an elegant way to handle this, but the following code works:
import pandas as pd
import numpy as np
df = pd.DataFrame([{"a":1,"b":[[1,2]]},{"a":4, "b":[[3,4,5]]}])
z = []
for k,row in df.iterrows():
    # Flatten the nested list and emit one record per element
    for j in list(np.array(row.b).flat):
        z.append({'a':row.a, 'b':j})
result = pd.DataFrame(z)
I think this is the fastest and simplest way:
df = pd.DataFrame({'A':['a','b'],
'B':[[['A1', 'A2']],[['A1', 'A2', 'A3']]]})
df.set_index('A')['B'].apply(lambda x: pd.Series(x[0]))
Here's another option
unpacked = (pd.melt(df.B.apply(pd.Series).reset_index(),id_vars='index')
            .merge(df, left_on = 'index', right_index = True))
unpacked = (unpacked.loc[unpacked.value.notnull(),:]
            .drop(columns=['index','variable','B'])
            .rename(columns={'value':'B'}))
Apply pd.Series to column B --> splits each list entry into a separate column
Melt this, so that each entry is a separate row (preserving index)
Merge this back on original dataframe
Tidy up - drop unnecessary columns and rename the values column
If I have a table like
ID Date Disease
1 03.07 A
1 03.07 B
1 03.09 A
1 03.09 C
1 03.10 D
I wrote a code like:
def combination(listData):
    comListData = []
    for datum in listData :
        start = listData.index(datum) + 1
        while start < len(listData) :
            if datum!=listData[start] :
                comStr = datum+':'+listData[start]
                if not comStr in comListData :
                    comListData.append(comStr)
            start+=1
    return comListData
def insertToDic(dic,comSick):
    for datum in comSick :
        if dic.has_key(datum) :
            dic[datum]+=1
        else :
            dic[datum] = 1
try:
    con = mdb.connect('blahblah','blah','blah','blah')
    cur = con.cursor()
    sql ="select * from table"
    cur.execute(sql)
    data = cur.fetchall()
    start = 0
    end = 1
    sick = []
    dic = {}
    for datum in data :
        end = datum[0]
        if end!=start:
            start = end
            comSick = combination(sick)
            insertToDic(dic,comSick)
            sick = []
        sick.append(datum[2])
        start = end
    comSick = combination(sick)
    insertToDic(dic,comSick)
    for k,v in dic.items():
        a,b = k.split(':')
        print >>f, a.ljust(0), b.ljust(0), v
    f.close()
then I got:
From To Count
A B 1
A A 1
A C 1
A D 1
B A 1
B C 1
B D 1
A C 1
A D 1
C D 1
and the final version of the table I got is (within the same ID, the same direction such as A --> C counts as 1, not 2; identical diseases such as A --> A don't count; A --> B is different from B --> A):
From To Count
A B 1
A C 1
A D 1
B A 1
B C 1
B D 1
C D 1
but what I want is (excluding same date cases version):
From To Count
A A 1
A C 1
A D 1
B A 1
B C 1
B D 1
A D 1
C D 1
and finally
From To Count
A C 1
A D 1
B A 1
B C 1
B D 1
C D 1
Which part of my code should I edit?
Let me try to rephrase your question. For each ID (ignoring the date to keep the problem simpler), you want all possible pairs of values in the Disease column and how often they occur, where the order within a pair matters. There is a built-in function in Python's standard library that generates such pairs:
from itertools import permutations
all_pairs = permutations(diseases, 2)
I am guessing your data is in a CSV file. If it is not, adjust the loading step accordingly. We will use pandas, a widely used data-analysis library. Here is how it goes:
from itertools import permutations
import pandas as pd
df = pd.read_csv('data.csv', header=0)
pairs_by_did = df.groupby('ID').apply(lambda grp: pd.Series(list(permutations(grp['Disease'], 2))))
all_pairs = pd.concat([v for i, v in pairs_by_did.iterrows()])
pair_counts = all_pairs.value_counts()
print(pair_counts)
For your example, it prints
>>> print(pair_counts)
(A, B) 2
(D, A) 2
(A, D) 2
(C, A) 2
(B, A) 2
(A, C) 2
(A, A) 2
(C, B) 1
(D, C) 1
(C, D) 1
(D, B) 1
(B, D) 1
(B, C) 1
Name: 1, dtype: int64
Now group by ID and date at the same time, and see what you get.
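One way that grouping could look is sketched below. It is not a fix of the original script: it assumes the table is already in a DataFrame with the columns ID, Date and Disease from the question, and that the Date strings sort chronologically (true for values like 03.07 within one year). The names counts, by_date and pairs are my own.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1],
    "Date": ["03.07", "03.07", "03.09", "03.09", "03.10"],
    "Disease": ["A", "B", "A", "C", "D"],
})

counts = Counter()
for _, grp in df.groupby("ID"):
    # Diseases per date; groupby iterates the dates in sorted order.
    by_date = {date: list(diseases) for date, diseases in grp.groupby("Date")["Disease"]}
    pairs = set()  # a set, so each direction counts once per ID
    for (_, earlier), (_, later) in combinations(by_date.items(), 2):
        # Only pair diseases from an earlier date with diseases from a later date,
        # and skip identical diseases such as A --> A.
        pairs.update((a, b) for a in earlier for b in later if a != b)
    counts.update(pairs)

for (a, b), n in sorted(counts.items()):
    print(a, b, n)
```

For the sample table this gives the six rows of the final table in the question, each with a count of 1.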