Check for pairs not having the same values in a pandas DataFrame - python

I have a dataset in the form of a pandas DataFrame.
In this dataset, I want to find the names and values where the value in one direction does not match the value in the reverse direction. It should also work for non-square matrices. Example:
A to B is 4. So, B to A must be 4. But B to A is 8.
A to C is 5. So, C to A must be 5. OK.
A to D is 8. So, D to A must be 8. But D to A is 5.
B to C is 6. So, C to B must be 6. But C to B is 3.
and so on...
So, I want output as:
(A,B,4) and (B,A,8)
(A,D,8) and (D,A,5)
(B,C,6) and (C,B,3)
Don't print pairs where the value is the same.
I am trying it with a NumPy array and dictionaries but can't figure out the exact logic.

Here's an approach using Pandas
import pandas as pd

data = {'A': [0, 8, 5, 5, 1], 'B': [4, 0, 3, 7, 2], 'C': [5, 6, 0, 4, 3],
        'D': [8, 9, 2, 0, 4], 'F': [7, 5, 6, 2, 1]}
tdf = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])
for idx in tdf.index:
    if idx in tdf.columns:
        for col in tdf.columns:
            if col in tdf.index and col != idx and tdf[idx][col] != tdf[col][idx]:
                print(f'({idx}, {col}, {tdf[idx][col]}) and ({col}, {idx}, {tdf[col][idx]})')
tdf:
A B C D F
A 0 4 5 8 7
B 8 0 6 9 5
C 5 3 0 2 6
D 5 7 4 0 2
E 1 2 3 4 1
and output is of the form:
(A, B, 8) and (B, A, 4)
(A, D, 5) and (D, A, 8)
(B, A, 4) and (A, B, 8)
(B, C, 3) and (C, B, 6)
(B, D, 7) and (D, B, 9)
(C, B, 6) and (B, C, 3)
(C, D, 4) and (D, C, 2)
(D, A, 8) and (A, D, 5)
(D, B, 9) and (B, D, 7)
(D, C, 2) and (C, D, 4)
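Note that this prints every mismatch twice, once in each direction. If each pair should appear only once, a small tweak (my sketch, not part of the original answer) is to compare only one orientation:
for idx in tdf.index:
    for col in tdf.columns:
        # idx < col visits each unordered pair once, so every mismatch prints once
        if idx in tdf.columns and col in tdf.index and idx < col \
                and tdf[idx][col] != tdf[col][idx]:
            print(f'({idx}, {col}, {tdf[idx][col]}) and ({col}, {idx}, {tdf[col][idx]})')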

Making a DataFrame from the data @itprorh66 has provided:
df = pd.DataFrame({'A': [0, 8, 5, 5, 1], 'B': [4, 0, 3, 7, 2], 'C': [5, 6, 0, 4, 3],
                   'D': [8, 9, 2, 0, 4], 'F': [7, 5, 6, 2, 1]},
                  index=['A', 'B', 'C', 'D', 'E'])
Intersect the index and columns and create a square DataFrame:
cmg = df.index.intersection(df.columns)
df = df[cmg].loc[cmg]
We can use NumPy's triangle functions and pull out the indices for the upper triangle:
mat = df.to_numpy()
tr = np.triu_indices(len(cmg),k=1)
Then put everything into a DataFrame. The joining of the row names and column names is a bit unsightly, but that's the best I can do for now:
match_tri = pd.DataFrame({'i1': df.index[tr[0]] + ',' + df.columns[tr[1]],
                          'v1': mat[tr],
                          'i2': df.index[tr[1]] + ',' + df.columns[tr[0]],
                          'v2': mat.T[tr]})
Then we just subset based on the values:
match_tri[match_tri.v1 != match_tri.v2]
i1 v1 i2 v2
0 A,B 4 B,A 8
2 A,D 8 D,A 5
3 B,C 6 C,B 3
4 B,D 9 D,B 7
5 C,D 2 D,C 4
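To reproduce the exact tuple format the question asks for, the filtered frame can be printed row by row; a small sketch (my addition, not part of the original answer):
mismatches = match_tri[match_tri.v1 != match_tri.v2]
for _, row in mismatches.iterrows():
    # prints e.g. (A,B,4) and (B,A,8), matching the requested output
    print(f'({row.i1},{row.v1}) and ({row.i2},{row.v2})')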

Get the frequency of all combinations in Pandas

I am trying to get the purchase frequency of all combinations of products.
Suppose my transactions are the following
userid product
u1 A
u1 B
u1 C
u2 A
u2 C
So the solution should be
combination count_of_distinct_users
A 2
B 1
C 2
A, B 1
A, C 2
B, C 1
A, B, C 1
i.e. 2 users have purchased product A, one user has purchased product B, ..., 2 users have purchased both products A and C, ...
Define a function combine to generate all non-empty combinations:
from itertools import combinations

def combine(s):
    result = []
    for i in range(1, len(s) + 1):
        for c in combinations(s, i):
            result.append(c)
    return result
This will give all combinations in a column:
df.groupby('userid')['product'].apply(combine)
# Out:
# userid
# u1    [(A,), (B,), (C,), (A, B), (A, C), (B, C), (A,...
# u2    [(A,), (C,), (A, C)]
# Name: product, dtype: object
Now use explode():
df.groupby('userid')['product'].apply(combine).reset_index(name='product_combos') \
  .explode('product_combos').groupby('product_combos') \
  .size().reset_index(name='user_count')
# Out:
# product_combos user_count
# 0 (A,) 2
# 1 (A, B) 1
# 2 (A, B, C) 1
# 3 (A, C) 2
# 4 (B,) 1
# 5 (B, C) 1
# 6 (C,) 2
Careful with the combinations because the list gets large with many different products!
Here my simple trick is to convert df to a dict mapping each product to its list of users, like {'A': [u1, u2], 'B': [u1]}, then build every combination of products and count the users that all products in the combination have in common. For example, for A: [u1, u2] and B: [u1], the intersection is {u1}, so the final count is 1.
Code:
from more_itertools import powerset

d = df.groupby('product')['userid'].apply(list).to_dict()
## output: {'A': ['u1', 'u2'], 'B': ['u1'], 'C': ['u1', 'u2']}
new = pd.DataFrame([', '.join(i) for i in list(powerset(d.keys()))[1:]],
                   columns=['products'])
## output: ['A', 'B', 'C', 'A, B', 'A, C', 'B, C', 'A, B, C']
# count the users common to every product in the combination
new['count'] = new['products'].apply(
    lambda x: len(set.intersection(*(set(d[y]) for y in x.split(', ')))))
new
Output:
products count
0 A 2
1 B 1
2 C 2
3 A, B 1
4 A, C 2
5 B, C 1
6 A, B, C 1

Arrange sequences of entries in pairs in a dataframe

Given a table of the form:
ID Sequence
1 A,C,D,E,F,G
2 D,F,G,B
3 A,B,A,C
and so on
Now I wish to arrange this data so that it can be fed into an RNN sequentially, so that I'm able to predict the next entry in each sequence. Here's what's required (in a new DataFrame), in the form of all possible sequences:
X Y
A,C,D E
C,D,E F
D,E,F G
D,F,G B
A,B,A C
X could be of length 3 or any custom length. How should I go about it?
Here's another way, using str.split and applying pd.Series to the sublists:
In [623]: df.Sequence.str.split(',')\
     ...:     .apply(lambda x: pd.Series([x[i:i + 3], x[i + 3]] for i in range(len(x) - 3))).stack()\
     ...:     .apply(lambda x: pd.Series([x[0], x[1]]))\
     ...:     .reset_index(drop=True)
Out[623]:
0 1
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
Setting the columns is as simple as df.columns = ['X', 'Y'].
This will do the job:
vals = [l.split(',') for l in df.Sequence.values]
X, Y = zip(*sum([[[','.join(el[i:i + 3]), el[i + 3]] for i in range(len(el) - 3)]
                 for el in vals], []))
res = pd.DataFrame({'X': X, 'Y': Y})
Then res is
X Y
0 A,C,D E
1 C,D,E F
2 D,E,F G
3 D,F,G B
4 A,B,A C
Here's one of the (many) ways of doing it.
In [52]: vals = df.Sequence.str.split(',')
In [53]: seqs = []
In [54]: for val in vals:
    ...:     seqs += [{'X': val[i:i+3], 'Y': val[i+3]} for i in range(len(val)-3)]
    ...:
In [55]: pd.DataFrame(seqs)
Out[55]:
X Y
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
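Since X could be of any custom length, the window size is easy to parameterize. A small sketch along the lines of the answers above (the function name and the n parameter are my own, not from the answers):
import pandas as pd

def make_windows(df, n=3):
    # one (X, Y) row per position that still leaves a next entry to predict
    rows = []
    for seq in df.Sequence.str.split(','):
        rows += [{'X': ','.join(seq[i:i + n]), 'Y': seq[i + n]}
                 for i in range(len(seq) - n)]
    return pd.DataFrame(rows)

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Sequence': ['A,C,D,E,F,G', 'D,F,G,B', 'A,B,A,C']})
print(make_windows(df, n=3))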

How to convert column with list of values into rows in Pandas DataFrame

Hi I have a dataframe like this:
A B
0: some value [[L1, L2]]
I want to change it into:
A B
0: some value L1
1: some value L2
How can I do that?
Pandas >= 0.25
df1 = pd.DataFrame({'A': ['a', 'b'],
                    'B': [[['1', '2']], [['3', '4', '5']]]})
print(df1)
A B
0 a [[1, 2]]
1 b [[3, 4, 5]]
# each cell of B holds a list inside a list, so explode twice
df1 = df1.explode('B')
df1.explode('B')
A B
0 a 1
0 a 2
1 b 3
1 b 4
1 b 5
I don't know how good this approach is but it works when you have a list of items.
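As a side note (my addition, not from the answer): starting from the original frame, the two explode calls can simply be chained:
# the first explode unwraps the outer list, the second the inner one
df1.explode('B').explode('B')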
you can do it this way:
In [84]: df
Out[84]:
A B
0 some value [[L1, L2]]
1 another value [[L3, L4, L5]]
In [85]: (df['B'].apply(lambda x: pd.Series(x[0]))
....: .stack()
....: .reset_index(level=1, drop=True)
....: .to_frame('B')
....: .join(df[['A']], how='left')
....: )
Out[85]:
B A
0 L1 some value
0 L2 some value
1 L3 another value
1 L4 another value
1 L5 another value
UPDATE: a more generic solution
Faster solution with chain.from_iterable and numpy.repeat:
from itertools import chain
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b'],
                   'B': [[['A1', 'A2']], [['A1', 'A2', 'A3']]]})
print (df)
A B
0 a [[A1, A2]]
1 b [[A1, A2, A3]]
df1 = pd.DataFrame({"A": np.repeat(df.A.values,
                                   [len(x) for x in chain.from_iterable(df.B)]),
                    "B": list(chain.from_iterable(chain.from_iterable(df.B)))})
print (df1)
A B
0 a A1
1 a A2
2 b A1
3 b A2
4 b A3
Timings:
import random
import string

A = np.unique(np.random.randint(0, 1000, 1000))
B = [[list(string.ascii_letters[:random.randint(3, 10)])] for _ in range(len(A))]
df = pd.DataFrame({"A": A, "B": B})
print (df)
A B
0 0 [[a, b, c, d, e, f, g, h]]
1 1 [[a, b, c]]
2 3 [[a, b, c, d, e, f, g, h, i]]
3 5 [[a, b, c, d, e]]
4 6 [[a, b, c, d, e, f, g, h, i]]
5 7 [[a, b, c, d, e, f, g]]
6 8 [[a, b, c, d, e, f]]
7 10 [[a, b, c, d, e, f]]
8 11 [[a, b, c, d, e, f, g]]
9 12 [[a, b, c, d, e, f, g, h, i]]
10 13 [[a, b, c, d, e, f, g, h]]
...
...
In [67]: %timeit pd.DataFrame({ "A": np.repeat(df.A.values, [len(x) for x in (chain.from_iterable(df.B))]),"B": list(chain.from_iterable(chain.from_iterable(df.B)))})
1000 loops, best of 3: 818 µs per loop
In [68]: %timeit ((df['B'].apply(lambda x: pd.Series(x[0])).stack().reset_index(level=1, drop=True).to_frame('B').join(df[['A']], how='left')))
10 loops, best of 3: 103 ms per loop
I can't find an elegant way to handle this, but the following code works...
import pandas as pd
import numpy as np

df = pd.DataFrame([{"a": 1, "b": [[1, 2]]}, {"a": 4, "b": [[3, 4, 5]]}])
z = []
for k, row in df.iterrows():
    # flatten the nested list and emit one record per element
    for j in np.array(row.b).flat:
        z.append({'a': row.a, 'b': j})
result = pd.DataFrame(z)
I think this is the fastest and simplest way:
df = pd.DataFrame({'A': ['a', 'b'],
                   'B': [[['A1', 'A2']], [['A1', 'A2', 'A3']]]})
df.set_index('A')['B'].apply(lambda x: pd.Series(x[0]))
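Note (my addition): this produces a wide frame with one column per element; chaining .stack() turns it into the long, one-row-per-value shape the question asks for:
# stack() pivots the numbered columns into rows and drops the NaN padding
df.set_index('A')['B'].apply(lambda x: pd.Series(x[0])).stack()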
Here's another option
unpacked = (pd.melt(df.B.apply(pd.Series).reset_index(), id_vars='index')
            .merge(df, left_on='index', right_index=True))
unpacked = (unpacked.loc[unpacked.value.notnull(), :]
            .drop(columns=['index', 'variable', 'B'])
            .rename(columns={'value': 'B'}))
1. Apply pd.Series to column B: this splits each list entry into a separate column
2. Melt the result, so that each entry becomes a separate row (preserving the index)
3. Merge this back onto the original DataFrame
4. Tidy up: drop the unnecessary columns and rename the values column

count cases in python

If I have a table like
ID Date Disease
1 03.07 A
1 03.07 B
1 03.09 A
1 03.09 C
1 03.10 D
I wrote code like:
def combination(listData):
    comListData = []
    for datum in listData:
        start = listData.index(datum) + 1
        while start < len(listData):
            if datum != listData[start]:
                comStr = datum + ':' + listData[start]
                if comStr not in comListData:
                    comListData.append(comStr)
            start += 1
    return comListData

def insertToDic(dic, comSick):
    for datum in comSick:
        if datum in dic:
            dic[datum] += 1
        else:
            dic[datum] = 1

try:
    con = mdb.connect('blahblah', 'blah', 'blah', 'blah')
    cur = con.cursor()
    sql = "select * from table"
    cur.execute(sql)
    data = cur.fetchall()
    start = 0
    end = 1
    sick = []
    dic = {}
    for datum in data:
        end = datum[0]
        if end != start:
            start = end
            comSick = combination(sick)
            insertToDic(dic, comSick)
            sick = []
        sick.append(datum[2])
    start = end
    comSick = combination(sick)
    insertToDic(dic, comSick)
    for k, v in dic.items():
        a, b = k.split(':')
        print(a.ljust(0), b.ljust(0), v, file=f)
    f.close()
then I got:
From To Count
A B 1
A A 1
A C 1
A D 1
B A 1
B C 1
B D 1
A C 1
A D 1
C D 1
and the final version of the table I got is (within the same ID, the same direction such as A to C counts as 1, not 2; identical diseases like A to A don't count; A to B is different from B to A):
From To Count
A B 1
A C 1
A D 1
B A 1
B C 1
B D 1
C D 1
but what I want is (the version that excludes pairs occurring on the same date):
From To Count
A A 1
A C 1
A D 1
B A 1
B C 1
B D 1
A D 1
C D 1
and finally
From To Count
A C 1
A D 1
B A 1
B C 1
B D 1
C D 1
which part of my code should I edit?
Let me try to rephrase your question. For each ID (ignoring the date to make the problem simpler), you want all possible pairs of values in the Disease column and how often they occur, where the order within a pair matters. Up front, there is a builtin function in Python that achieves this:
from itertools import permutations
all_pairs = permutations(diseases, 2)
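For example:
>>> from itertools import permutations
>>> list(permutations(['A', 'B', 'C'], 2))
[('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]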
Given your data, I am guessing it is in a CSV file; if not, adjusting the loading code is trivial. We will be using the famous data-science library Pandas. Here is how it goes:
from itertools import permutations
import pandas as pd

df = pd.read_csv('data.csv', header=0)
pairs_by_id = df.groupby('ID').apply(lambda grp: pd.Series(list(permutations(grp['Disease'], 2))))
all_pairs = pd.concat([v for i, v in pairs_by_id.iterrows()])
pair_counts = all_pairs.value_counts()
print(pair_counts)
For your example, it prints
>>> print(pair_counts)
(A, B) 2
(D, A) 2
(A, D) 2
(C, A) 2
(B, A) 2
(A, C) 2
(A, A) 2
(C, B) 1
(D, C) 1
(C, D) 1
(D, B) 1
(B, D) 1
(B, C) 1
Name: 1, dtype: int64
Now group by ID and date at the same time, and see what you get.
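A sketch of that hint (my addition, not part of the original answer): count all pairs per ID, count the pairs recorded within a single (ID, Date) group, and subtract the latter to exclude the same-date cases:
from itertools import permutations
import pandas as pd

# all ordered disease pairs per ID, and the subset occurring on one date
all_pairs = pd.Series([p for _, grp in df.groupby('ID')
                       for p in permutations(grp['Disease'], 2)])
same_date = pd.Series([p for _, grp in df.groupby(['ID', 'Date'])
                       for p in permutations(grp['Disease'], 2)])
# subtract the same-date pair counts from the overall pair counts
counts = all_pairs.value_counts().subtract(same_date.value_counts(), fill_value=0)
print(counts[counts > 0].astype(int))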

Optimizing pandas filter inside apply function

I have a list of pairs, stored in a DataFrame, each pair having an 'a' column and a 'b' column. For each pair, I want to return the other 'b's that have the same 'a'. For example, given the following set of pairs:
a b
0 c d
1 e f
2 c g
3 e h
4 i j
5 e k
I would like to end up with:
a b equivalents
0 c d [g]
1 e f [h, k]
2 c g [d]
3 e h [f, k]
4 i j []
5 e k [f, h]
I can do this with the following:
def equivalents(x):
    # all b's that share this row's a, minus the row's own b
    l = pairs[pairs["a"] == x["a"]]["b"].tolist()
    l.remove(x["b"])
    return l

pairs["equivalents"] = pairs.apply(equivalents, axis=1)
But it is painfully slow on larger sets (e.g. 1 million plus pairs). Any suggestions how I could do this faster?
I think this ought to be a bit faster. First, just add them up.
df['equiv'] = df.groupby('a')['b'].transform(sum)
a b equiv
0 c d dg
1 e f fhk
2 c g dg
3 e h fhk
4 i j j
5 e k fhk
Now convert to a list and remove whichever letter is already in column 'b'.
df.apply(lambda x: [y for y in x.equiv if y != x.b], axis=1)
0 [g]
1 [h, k]
2 [d]
3 [f, k]
4 []
5 [f, h]
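A further possible speedup (my sketch, not from the answer above): precompute each group's list once with groupby, then build the column in a single pass. This assumes, as above, that only values equal to the row's own 'b' should be dropped:
# build each a-group's list of b's once instead of filtering the frame per row
groups = df.groupby('a')['b'].agg(list)
df['equivalents'] = [[y for y in groups[a] if y != b]
                     for a, b in zip(df['a'], df['b'])]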
