Count cases in Python

If I have a table like
ID  Date   Disease
1   03.07  A
1   03.07  B
1   03.09  A
1   03.09  C
1   03.10  D
I wrote code like:
def combination(listData):
    comListData = []
    for datum in listData:
        start = listData.index(datum) + 1
        while start < len(listData):
            if datum != listData[start]:
                comStr = datum + ':' + listData[start]
                if not comStr in comListData:
                    comListData.append(comStr)
            start += 1
    return comListData
def insertToDic(dic, comSick):
    for datum in comSick:
        if dic.has_key(datum):  # Python 2; in Python 3 use `datum in dic`
            dic[datum] += 1
        else:
            dic[datum] = 1
try:
    con = mdb.connect('blahblah', 'blah', 'blah', 'blah')
    cur = con.cursor()
    sql = "select * from table"
    cur.execute(sql)
    data = cur.fetchall()

    start = 0
    end = 1
    sick = []
    dic = {}
    for datum in data:
        end = datum[0]
        if end != start:
            start = end
            comSick = combination(sick)
            insertToDic(dic, comSick)
            sick = []
        sick.append(datum[2])
        start = end
    comSick = combination(sick)
    insertToDic(dic, comSick)

    for k, v in dic.items():
        a, b = k.split(':')
        print >>f, a.ljust(0), b.ljust(0), v
    f.close()
then I got:
From  To  Count
A     B   1
A     A   1
A     C   1
A     D   1
B     A   1
B     C   1
B     D   1
A     C   1
A     D   1
C     D   1
and the final table I got is (within the same ID, the same direction such as A --> C counts as 1, not 2; identical diseases like A --> A don't count; A --> B is distinct from B --> A):
From  To  Count
A     B   1
A     C   1
A     D   1
B     A   1
B     C   1
B     D   1
C     D   1
but what I want is the version that excludes pairs occurring on the same date:
From  To  Count
A     A   1
A     C   1
A     D   1
B     A   1
B     C   1
B     D   1
A     D   1
C     D   1
and finally
From  To  Count
A     C   1
A     D   1
B     A   1
B     C   1
B     D   1
C     D   1
Which part of my code should I edit?

Let me try to rephrase your question. For each ID (ignoring the date to make the problem simpler), you want all possible pairs of values in the Disease column, where the order within a pair matters, and how often each pair occurs. Python's standard library has a function that achieves exactly this:
from itertools import permutations
all_pairs = permutations(diseases, 2)
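For instance, with the Disease values of ID 1 from your table, the first few ordered pairs look like this (a quick illustration):
diseases = ['A', 'B', 'A', 'C', 'D']   # the Disease column for ID 1
print(list(permutations(diseases, 2))[:4])
# [('A', 'B'), ('A', 'A'), ('A', 'C'), ('A', 'D')]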
Given your data, I am guessing it is in a CSV file; if not, adjusting the loading step is straightforward. We will use Pandas, the well-known data-science library. Here is how it goes:
from itertools import permutations
import pandas as pd

df = pd.read_csv('data.csv', header=0)
# for each ID, list every ordered pair of its diseases
pairs_by_did = df.groupby('ID').apply(lambda grp: pd.Series(list(permutations(grp['Disease'], 2))))
# flatten the per-ID pairs into one Series and count each pair
all_pairs = pd.concat([v for i, v in pairs_by_did.iterrows()])
pair_counts = all_pairs.value_counts()
print pair_counts
For your example, it prints
>>> print pair_counts
(A, B)    2
(D, A)    2
(A, D)    2
(C, A)    2
(B, A)    2
(A, C)    2
(A, A)    2
(C, B)    1
(D, C)    1
(C, D)    1
(D, B)    1
(B, D)    1
(B, C)    1
Name: 1, dtype: int64
Now group by ID and date at the same time, and see what you get.
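For example (a sketch building on the code above, not tested against your full data): count the pairs that occur within the same ID and date, then subtract them from the overall counts, leaving only the cross-date pairs you asked for:
# same-ID, same-date pairs (the ones to exclude)
same_date = df.groupby(['ID', 'Date'])['Disease'].apply(
    lambda s: pd.Series(list(permutations(s, 2))))
same_date_counts = same_date.value_counts()
# remove them from the overall pair counts
cross_date_counts = pair_counts.subtract(same_date_counts, fill_value=0).astype(int)
print(cross_date_counts[cross_date_counts > 0])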

Related

Get the frequency of all combinations in Pandas

I am trying to get the purchase frequency of all combinations of products.
Suppose my transactions are the following
userid  product
u1      A
u1      B
u1      C
u2      A
u2      C
So the solution should be
combination  count_of_distinct_users
A            2
B            1
C            2
A, B         1
A, C         2
B, C         1
A, B, C      1
i.e. 2 users have purchased product A, one user has purchased product B, ..., and 2 users have purchased both products A and C ...
Define a function combine to generate all combinations:
from itertools import combinations

def combine(s):
    result = []
    for i in range(1, len(s)+1):
        for c in combinations(s, i):
            result += [c]
    return result
This will give all combinations in a column:
df.groupby('user')['product'].apply(combine)
# Out:
# user
# 1 [(A,), (B,), (C,), (A, B), (A, C), (B, C), (A,...
# 2 [(A,), (C,), (A, C)]
# Name: product, dtype: object
Now use explode():
df.groupby('user')['product'].apply(combine).reset_index(name='product_combos') \
.explode('product_combos').groupby('product_combos') \
.size().reset_index(name='user_count')
# Out:
# product_combos user_count
# 0 (A,) 2
# 1 (A, B) 1
# 2 (A, B, C) 1
# 3 (A, C) 2
# 4 (B,) 1
# 5 (B, C) 1
# 6 (C,) 2
Careful with the combinations: the number of subsets grows exponentially with the number of distinct products!
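To put a number on that warning: there are 2**n - 1 non-empty combinations of n distinct products, so the list roughly doubles with every product added:
for n in (3, 10, 20, 30):
    print(n, 2**n - 1)
# 3 7
# 10 1023
# 20 1048575
# 30 1073741823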
Here my simple trick is to convert the df to a dict mapping each product to its list of users, like {'A': [u1, u2], 'B': [u1]}, then generate every combination of products and collect the user count of each member; e.g. for A: [u1, u2] and B: [u1] the counts are [2, 1], and the minimum of that list, 1, is the final count. (Note that taking the minimum is exact here only because these user lists happen to be nested; in general the size of the intersection of the user lists is what is wanted.)
Code:
import pandas as pd
from more_itertools import powerset

d = df.groupby('product')['user'].apply(list).to_dict()
# output: {'A': ['u2', 'u1'], 'B': ['u1'], 'C': ['u1', 'u2']}
new = pd.DataFrame([', '.join(i) for i in list(powerset(d.keys()))[1:]], columns=['users'])
# output: ['A', 'B', 'C', 'A, B', 'A, C', 'B, C', 'A, B, C']
new['count'] = new['users'].apply(lambda x: min([len(d[y]) for y in x.split(', ')]))
new
Output:
     users  count
0        A      2
1        B      1
2        C      2
3     A, B      1
4     A, C      2
5     B, C      1
6  A, B, C      1
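If the user lists were not nested the way they happen to be here, a variant that intersects the user sets (my suggestion, not part of the answer above) would give the exact distinct-user count:
# exact count: users who bought every product in the combination
new['count'] = new['users'].apply(
    lambda x: len(set.intersection(*(set(d[y]) for y in x.split(', ')))))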

How to select batches of data from a dataframe based on a column

Suppose I have a dataframe like
A  B
1  a
1  b
1  c
1  d
1  e
2  a
2  v
2  r
Now suppose I want to select batches of size 2 from each group of A, so that the batches alternate between groups. I want code that returns the following sequence:
A  B
1  a
1  b
2  a
2  v
1  c
1  d
I omitted the rest because they can't form a full batch of size 2 with matching values in column A.
This is part of my code, but I am still looking for something shorter:
dict_dfs = dict(tuple(self.split_df.groupby("A")))
bs = self.batch_size
keys = list(dict_dfs)
total = len(self.split_df)
ii = 0
rr = 0
start = {}
rows = []
while ii < total:
    if rr >= bs:
        rr += 1
    rr = rr % len(keys)
    key = keys[rr]
    df = dict_dfs[key]
    kk = 0
    for index, row in df.iterrows():
        if kk < start[key]:
            kk += 1
            continue
        batch = {}
        batch["A"] = row["A"]
        ...
        rows.append(batch)
        start[key] += 1
        kk += 1
        if kk > bs:
            break
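For comparison, here is a much shorter sketch (my own, under the assumption that interleaving full batches across groups is what you want): number the rows within each group, derive a batch index, drop incomplete batches, and sort by batch first:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 2, 2, 2],
                   'B': list('abcdeavr')})
bs = 2
df['pos'] = df.groupby('A').cumcount()    # 0, 1, 2, ... within each A
df['batch'] = df['pos'] // bs             # which batch a row belongs to
# drop batches that are not completely filled
full = df.groupby(['A', 'batch']).filter(lambda g: len(g) == bs)
# sorting by batch, then A interleaves the groups; mergesort keeps row order stable
out = full.sort_values(['batch', 'A'], kind='mergesort').drop(columns=['pos', 'batch'])
print(out)   # 1 a, 1 b, 2 a, 2 v, 1 c, 1 d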

Group by categorical column and by range of ids

I have a dataframe similar to this one:
    col  inc_idx
0     A        1
1     B        1
2     C        1
3     A        2
4     A        3
5     B        2
6     D        1
7     E        1
8     F        1
9     F        2
10    Z        1
And I'm trying to iterate the df by batches:
First loop: all col rows with inc_idx >= 1 and inc_idx <= 2
A  1
A  2
B  1
B  2
...
Second loop: all col rows with inc_idx >= 3 and inc_idx <= 4
A  3
The way I'm doing it now leaves a lot of room for improvement:
i = 0
while True:
    for col, grouped_rows in df.groupby(by=['col']):
        from_idx = i * 2
        to_idx = from_idx + 2
        items = grouped_rows.iloc[from_idx:to_idx].to_list()
    i += 2
I think there's got to be a more efficient approach, and also a way to remove the "while True" loop and instead just wait for the internal loop to run out of items.
I don't know exactly what you want to do. Here's something that groups the rows.
df.groupby((df.inc_idx + 1) // 2).agg(list)

                                    col                         inc_idx
inc_idx
1        [A, B, C, A, B, D, E, F, F, Z]  [1, 1, 1, 2, 2, 1, 1, 1, 2, 1]
2                                   [A]                             [3]
I've found (I think) a simpler way to solve it. I'll add a new "batch" column:
df['batch'] = df.apply(lambda x: x['inc_idx'] // 2, axis=1)
With this new column, now I can simply do something like:
df.groupby(by=['col', 'batch'])
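As a side note, the same column can be built without apply, which is usually faster on large frames:
df['batch'] = df['inc_idx'] // 2   # vectorized equivalent of the apply above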

Get the middle value from a column according to a criterion

I have a dataframe with 3 columns. I need to get the values from columns A and B at the middle row of those where C = 1. If the number of rows with C = 1 is even, I want the first of the two middle rows.
For example, this one is for an odd amount of C = 1
A  B  C
w  y  0
c  v  0
t  o  1
e  p  1
t  b  1
u  e  0
q  p  0
The row in the middle when C = 1 is
A  B  C
e  p  1
Therefore, it should return
df_return
A  B  C
e  p  1
When we have an even number of rows with C = 1:
df
A  B  C
w  y  0
c  v  0
t  o  1
e  p  1
t  b  1
u  e  1
r  e  1
u  f  1
q  p  0
The ones in the middle when C = 1 are
A  B  C
t  b  1
u  e  1
However, I want only 1 of them, and it should be the first one. So
df_return
A  B  C
t  b  1
How can I do it?
One thing you should know is that A and B are ordered.
Focus on the relevant part, discarding rows holding zeros:
df = df[df.C == 1]
Now it's simple. Just find the midpoint, based on length or .shape.
if len(df) > 0:
    mid = (len(df) - 1) // 2
    return df.iloc[mid, :]
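Putting it together as a self-contained sketch (the helper name middle_of_ones is mine):
import pandas as pd

def middle_of_ones(df):
    ones = df[df.C == 1]          # keep only rows where C == 1
    if len(ones) == 0:
        return None
    mid = (len(ones) - 1) // 2    # lower middle when the count is even
    return ones.iloc[[mid]]       # double brackets keep a DataFrame

df = pd.DataFrame({'A': list('wctetuq'),
                   'B': list('yvopbep'),
                   'C': [0, 0, 1, 1, 1, 0, 0]})
print(middle_of_ones(df))   # -> e p 1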

Optimizing pandas filter inside apply function

I have a list of pairs--stored in a DataFrame--each pair having an 'a' column and a 'b' column. For each pair I want to return the 'b's that have the same 'a'. For example, given the following set of pairs:
   a  b
0  c  d
1  e  f
2  c  g
3  e  h
4  i  j
5  e  k
I would like to end up with:
   a  b  equivalents
0  c  d  [g]
1  e  f  [h, k]
2  c  g  [d]
3  e  h  [f, k]
4  i  j  []
5  e  k  [f, h]
I can do this with the following:
def equivalents(x):
    # all b values that share this row's a, minus the row's own b
    l = pairs[pairs["a"] == x["a"]]["b"].tolist()
    l.remove(x["b"])
    return l

pairs["equivalents"] = pairs.apply(equivalents, axis=1)
But it is painfully slow on larger sets (e.g. 1 million plus pairs). Any suggestions on how I could do this faster?
I think this ought to be a bit faster. First, just add them up.
df['equiv'] = df.groupby('a')['b'].transform(sum)  # string sum concatenates per group

   a  b equiv
0  c  d    dg
1  e  f   fhk
2  c  g    dg
3  e  h   fhk
4  i  j     j
5  e  k   fhk
Now convert to a list and remove whichever letter is already in column 'b'.
df.apply(lambda x: [y for y in list(x.equiv) if y != x.b], axis=1)

0       [g]
1    [h, k]
2       [d]
3    [f, k]
4        []
5    [f, h]
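If the per-row apply for the final step is still too slow, a plain-Python variant (my sketch, not from the answer above) builds the group lists once and strips each row's own value in a list comprehension:
groups = df.groupby('a')['b'].agg(list).to_dict()   # a -> all of its b values
df['equivalents'] = [[y for y in groups[a] if y != b]
                     for a, b in zip(df['a'], df['b'])]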
