How to count string with pattern in series object? - python

Suppose a data like this:
>>> data
x
0 [wdq, sda, q]
1 [q, d, qasd]
2 [d, b, sdaaaa]
I want to know how many strings in each list contain the letter a, which means I need an answer like this:
>>> data
x count_a
0 [wdq, sda, q] 1
1 [q, d, qasd] 1
2 [d, b, sdaaaa] 1
How can I do this in python?

Assuming this is a pandas.DataFrame and x is a list object:
df['count_a'] = df['x'].apply(lambda x: sum('a' in e for e in x))
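A minimal runnable sketch of the line above, assuming the data is a pandas DataFrame whose x column holds Python lists of strings (the sample frame is rebuilt here from the question, so its construction is an assumption):

```python
import pandas as pd

# Sample data mirroring the question; each cell of "x" is a list of strings.
df = pd.DataFrame({"x": [["wdq", "sda", "q"], ["q", "d", "qasd"], ["d", "b", "sdaaaa"]]})

# For each list, count the strings that contain the letter "a".
df["count_a"] = df["x"].apply(lambda items: sum("a" in s for s in items))

print(df)
```

The generator inside sum yields True or False for each string, and Python adds those up as 1 and 0.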

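You can try this:
for i in my_series:
    print(i.count('a'))
This prints one count per element of the series. Note that if each element is a list, list.count('a') counts the items equal to 'a', not the strings that contain it.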

a = ['a', '12asf3', 'sdf']
b = ['gfdg5', ' ', 'vag gfd4']
c = [' fd4 ', 'sfsa fa', 'df4 a']
abc = [a, b, c]
for index, mem in enumerate(abc):
    counter = 0
    # count the strings in this list that contain 'a'
    for s in mem:
        if 'a' in s:
            counter += 1
    print(index, mem, counter)
Output:
0 ['a', '12asf3', 'sdf'] 2
1 ['gfdg5', ' ', 'vag gfd4'] 1
2 [' fd4 ', 'sfsa fa', 'df4 a'] 2

To find out how many strings in a list contain the letter a, you can use the following:
l = ['wdq', 'sda', 'qaaa']
print(sum([1 for x in l if 'a' in x]))
This prints the following:
2
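Since the question title mentions a pattern, the same count could be generalised from a single letter to a regular expression. This is only a sketch: the helper name count_matching is hypothetical and not part of any answer above.

```python
import re

def count_matching(strings, pattern):
    """Count the strings in which the regular expression `pattern` matches somewhere."""
    regex = re.compile(pattern)
    return sum(1 for s in strings if regex.search(s))

# The lists from the question, as plain Python lists.
rows = [["wdq", "sda", "q"], ["q", "d", "qasd"], ["d", "b", "sdaaaa"]]
counts = [count_matching(row, "a") for row in rows]
print(counts)
```

With a plain letter as the pattern this behaves like the 'a' in x test; a pattern such as "a{2,}" would count only the strings containing at least two consecutive a characters.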

Related

Get the frequency of all combinations in Pandas

I am trying to get the purchase frequency of all combinations of products.
Suppose my transactions are the following
userid product
u1 A
u1 B
u1 C
u2 A
u2 C
So the solution should be
combination count_of_distinct_users
A 2
B 1
C 2
A, B 1
A, C 2
B, C 1
A, B, C 1
i.e. 2 users have purchased product A, one user has purchased product B, ..., 2 users have purchased products A and C, ...
Define a function combine to generate all combinations:
from itertools import combinations
def combine(s):
    result = []
    for i in range(1, len(s)+1):
        for c in list(combinations(s, i)):
            result+=[c]
    return result
This will give all combinations in a column:
df.groupby('user')['product'].apply(combine)
# Out:
# user
# 1 [(A,), (B,), (C,), (A, B), (A, C), (B, C), (A,...
# 2 [(A,), (C,), (A, C)]
# Name: product, dtype: object
Now use explode():
df.groupby('user')['product'].apply(combine).reset_index(name='product_combos') \
.explode('product_combos').groupby('product_combos') \
.size().reset_index(name='user_count')
# Out:
# product_combos user_count
# 0 (A,) 2
# 1 (A, B) 1
# 2 (A, B, C) 1
# 3 (A, C) 2
# 4 (B,) 1
# 5 (B, C) 1
# 6 (C,) 2
Careful with the combinations because the list gets large with many different products!
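To make the warning concrete: n distinct products have 2**n - 1 non-empty combinations. One way to keep the list manageable is to cap the combination size. The sketch below assumes a hypothetical variant combine_limited with a max_size parameter; unlike combine above, it also de-duplicates and sorts the items first.

```python
from itertools import combinations

def combine_limited(items, max_size):
    """Return all combinations of `items` with between 1 and `max_size` elements."""
    items = sorted(set(items))
    result = []
    for size in range(1, min(max_size, len(items)) + 1):
        result.extend(combinations(items, size))
    return result

# The number of non-empty combinations of n distinct products is 2**n - 1.
for n in (3, 10, 20):
    print(n, 2 ** n - 1)

print(combine_limited(["A", "B", "C"], 2))
```

Passing a function like this to apply in place of combine would count only single products and pairs, which is often enough for basket analysis.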
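My simple trick is to convert df to a dict that maps each product to its list of users, like {'A': [u1, u2], 'B': [u1]}. Then, for each combination of products, collect the number of users of each product and take the minimum. For example, A: [u1, u2] and B: [u1] give the lengths [2, 1], so the count for the combination A, B is 1. Note that the minimum is only an upper bound on the number of users who bought every product in the combination; it matches the true count here because each smaller user list is contained in the larger ones.
Code:
import pandas as pd
from more_itertools import powerset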
d = df.groupby('product')['user'].apply(list).to_dict()
##output: {'A': ['u2', 'u1'], 'B': ['u1'], 'C': ['u1', 'u2']}
new= pd.DataFrame([', '.join(i) for i in list(powerset(d.keys()))[1:]], columns =['users'])
## Output: ['A', 'B', 'C', 'A, B', 'A, C', 'B, C', 'A, B, C']
new['count'] = new['users'].apply(lambda x: min([len(d[y]) for y in x.split(', ')]))
new
Output:
users count
0 A 2
1 B 1
2 C 2
3 A, B 1
4 A, C 2
5 B, C 1
6 A, B, C 1
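Taking the minimum of the list lengths can overcount: if A were bought only by u1 and B only by u2, the minimum would be 1 although no user bought both. A hedged alternative is to intersect the user sets. The sketch below uses only the standard library and the transactions from the question; the variable names are illustrative.

```python
from itertools import combinations

# (user, product) pairs from the question.
transactions = [("u1", "A"), ("u1", "B"), ("u1", "C"), ("u2", "A"), ("u2", "C")]

# Map each product to the set of users who bought it.
users_by_product = {}
for user, product in transactions:
    users_by_product.setdefault(product, set()).add(user)

# For every non-empty combination, count the users who bought all of its products.
products = sorted(users_by_product)
combo_counts = {}
for size in range(1, len(products) + 1):
    for combo in combinations(products, size):
        common = set.intersection(*(users_by_product[p] for p in combo))
        combo_counts[", ".join(combo)] = len(common)

for combo, count in combo_counts.items():
    print(combo, count)
```

On the sample data this yields the same counts as the table above, and it stays correct when the user lists of two products do not overlap.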

replace values of elements of a list using values of another list as a reference

I have the following 2 list
>>> a
['a', 'b', 'c', 'd']
>>> b
['b a d c f g', 'a b b c d f', 'b c d c h']
and I want to update/replace b to look like:
['b a d c var_1 var_2','a b b c d var_1','b c d c var_1']
i.e. every element in a chain of b that does not appear in a is replaced by var_1, var_2, and so on.
I'm trying to convert b into a list of lists, iterate over each element of the chain and compare it against a, but I don't know if this is the proper approach.
I'm not sure I totally understand your requirement, but my output is the same as yours:
a = ['a', 'b', 'c', 'd']
b = ['b a d c f g', 'a b b c d f', 'b c d c h']
def replace_sub_b(sub_b):
    bl = sub_b.split()
    index = 0
    rl = []
    for c in bl:
        if c in a:
            rl.append(c)
        else:
            index += 1
            rl.append(f'var_{index}')
    return ' '.join(rl)
new_b = [replace_sub_b(sub_b) for sub_b in b]
print(new_b)
I would do it this way.
set_a = set(a)
n = len(b)
for i in range(n):
    unknown_vars_mapping = dict()
    chain_vars = b[i].split()
    for j, var in enumerate(chain_vars):
        # if the variable is not in a, store it
        if var not in set_a and var not in unknown_vars_mapping:
            unknown_vars_mapping[var] = 'var_' + str(len(unknown_vars_mapping) + 1)
        # if the variable is not in a, but already been seen
        if var in unknown_vars_mapping:
            chain_vars[j] = unknown_vars_mapping[var]
    b[i] = " ".join(chain_vars)
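The two answers differ when an unknown token repeats within a chain: the first gives every occurrence a new var_N, while the second reuses the name already assigned. A small sketch of the second behaviour as a function follows; the name replace_unknown and the extra chain 'a f g f' are assumptions added for illustration.

```python
def replace_unknown(chain, known):
    """Replace tokens not in `known` with var_1, var_2, ... (same token, same name)."""
    mapping = {}
    out = []
    for token in chain.split():
        if token not in known:
            # First time we see this unknown token: give it the next var_N name.
            mapping.setdefault(token, f"var_{len(mapping) + 1}")
            out.append(mapping[token])
        else:
            out.append(token)
    return " ".join(out)

a = ["a", "b", "c", "d"]
b = ["b a d c f g", "a b b c d f", "b c d c h", "a f g f"]
new_b = [replace_unknown(chain, set(a)) for chain in b]
print(new_b)
```

For the last chain this gives 'a var_1 var_2 var_1', whereas numbering every occurrence separately would give 'a var_1 var_2 var_3'. The question's examples do not contain repeats, so either reading matches the expected output there.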

Python Printing Row and Column number on 2d matrix?

I'm trying to print output as follows.
Strings: ["cat", "dog", "big"]
Print:
  0 1 2
0 c a t
1 d o g
2 b i g
But I can't seem to print the indices properly
for i in a:
    for j in i:
        print(j, end=' ')
    print()
I know this prints the matrix itself, but it doesn't give me the row and column numbers I need.
Ideal job for pandas:
import pandas as pd
lst = ["cat", "dog", "big"]
df = pd.DataFrame([[y for y in x] for x in lst])
print(df)
#    0  1  2
# 0  c  a  t
# 1  d  o  g
# 2  b  i  g
Please try the following:
str_list = ["cat", "dog", "big"]
# header: one column index per character of the longest string
print(" ", " ".join([str(x) for x in range(max(map(len, str_list)))]))
for i, x in enumerate(str_list):
    print(i, " ".join(x))
Ta-da! This solution should work regardless of list dimensions.
str_list = ['cat', 'dog', 'loooong', 'big']
max_row_len = len(max(str_list, key=len))
# header (two leading spaces line it up with the row-index column)
print('  ', end='')
print(*range(max_row_len), sep=' ')
# rows
for idx, val in enumerate(str_list):
    print(idx, end=' ')
    print(*val, sep=' ')
Output:
  0 1 2 3 4 5 6
0 c a t
1 d o g
2 l o o o o n g
3 b i g
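One caveat with single-space separators: the columns stop lining up once a row or column index reaches 10. A possible sketch that pads every cell to the width of the largest index is shown below; the function name format_grid is hypothetical, and it returns the lines instead of printing them so the result is easy to inspect.

```python
def format_grid(strings):
    """Return the lines of a character grid with row and column indices."""
    n_cols = max((len(s) for s in strings), default=0)
    # Width of one cell: enough for the largest row or column index.
    width = len(str(max(n_cols - 1, len(strings) - 1, 0)))
    header = " ".join([" " * width] + [str(c).rjust(width) for c in range(n_cols)])
    lines = [header]
    for r, s in enumerate(strings):
        cells = [str(r).rjust(width)] + [ch.rjust(width) for ch in s]
        lines.append(" ".join(cells))
    return lines

for line in format_grid(["cat", "dog", "big"]):
    print(line)
```

For short inputs the output is identical to the answers above; for a 12-character string each cell becomes two characters wide, so index 11 still sits above its letter.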
You can get this output without any library using the code below; otherwise pandas gives you a one-liner.
words = ['cat','dog','big']
r = 0
c = 0
print('  ',end='')
for i in range(0,(max([len(i) for i in words]))):
    print(c,end=' ')
    c+=1
print()
for i in words:
    print(r,end=' ')
    for j in i:
        print(j , end=' ')
    print()
    r+=1
Output:
  0 1 2
0 c a t
1 d o g
2 b i g

count cases in python

If I have a table like
ID Date Disease
1 03.07 A
1 03.07 B
1 03.09 A
1 03.09 C
1 03.10 D
I wrote a code like:
def combination(listData):
    comListData = [];
    for datum in listData :
        start = listData.index(datum) + 1
        while start < len(listData) :
            if datum!=listData[start] :
                comStr = datum+':'+listData[start]
                if not comStr in comListData :
                    comListData.append(comStr)
            start+=1;
    return comListData
def insertToDic(dic,comSick):
    for datum in comSick :
        if dic.has_key(datum) :
            dic[datum]+=1
        else :
            dic[datum] = 1
try:
    con = mdb.connect('blahblah','blah','blah','blah')
    cur = con.cursor()
    sql ="select * from table"
    cur.execute(sql);
    data = cur.fetchall();
    start = 0
    end = 1
    sick = []
    dic = {}
    for datum in data :
        end = datum[0]
        if end!=start:
            start = end
            comSick = combination(sick)
            insertToDic(dic,comSick)
            sick = []
        sick.append(datum[2])
    start = end
    comSick = combination(sick)
    insertToDic(dic,comSick)
    for k,v in dic.items():
        a,b = k.split(':')
        print >>f, a.ljust(0), b.ljust(0), v
    f.close()
then I got:
From To Count
A B 1
A A 1
A C 1
A D 1
B A 1
B C 1
B D 1
A C 1
A D 1
C D 1
and the final table I got is this (within the same ID, the same direction such as A --> C counts as 1, not 2; same-disease pairs like A --> A don't count; A --> B is different from B --> A):
From To Count
A B 1
A C 1
A D 1
B A 1
B C 1
B D 1
C D 1
but what I want is this (the version that excludes same-date cases):
From To Count
A A 1
A C 1
A D 1
B A 1
B C 1
B D 1
A D 1
C D 1
and finally
From To Count
A C 1
A D 1
B A 1
B C 1
B D 1
C D 1
which part of my code should I edit?
Let me try to rephrase your question. For each ID (ignoring the date to make the problem simpler), you want all possible pairs of values in the Disease column and how often they occur, where the order within a pair matters. The standard library already has a function that does this:
from itertools import permutations
all_pairs = permutations(diseases, 2)
Given your data, I am guessing it is in a CSV file. If it is not, please adjust the loading step yourself (a quick search will show how). We will use pandas, a popular library in the data-science stack. Here is how it goes:
from itertools import permutations
import pandas as pd
df = pd.read_csv('data.csv', header=0)
pairs_by_did = df.groupby('ID').apply(lambda grp: pd.Series(list(permutations(grp['Disease'], 2))))
all_pairs = pd.concat([v for i, v in pairs_by_did.iterrows()])
pair_counts = all_pairs.value_counts()
print(pair_counts)
For your example, it prints
>>> print(pair_counts)
(A, B) 2
(D, A) 2
(A, D) 2
(C, A) 2
(B, A) 2
(A, C) 2
(A, A) 2
(C, B) 1
(D, C) 1
(C, D) 1
(D, B) 1
(B, D) 1
(B, C) 1
Name: 1, dtype: int64
Now group by ID and date at the same time, and see what you get.
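As one possible way to follow that hint without pandas, the sketch below takes the date into account directly: within each ID it keeps an ordered pair only when the second disease was recorded on a strictly later date and the two diseases differ, and it counts each direction once per ID. It assumes the rows are sorted by ID and that the date strings are zero-padded so that plain string comparison orders them chronologically; the variable names are illustrative.

```python
from collections import Counter
from itertools import groupby

# (ID, date, disease) rows from the question, already sorted by ID.
rows = [
    (1, "03.07", "A"),
    (1, "03.07", "B"),
    (1, "03.09", "A"),
    (1, "03.09", "C"),
    (1, "03.10", "D"),
]

pair_counts = Counter()
for patient_id, group in groupby(rows, key=lambda r: r[0]):
    records = list(group)
    # Collect each direction once per ID by using a set.
    pairs = set()
    for _, date_from, dis_from in records:
        for _, date_to, dis_to in records:
            # Keep only strictly later dates and different diseases.
            if date_from < date_to and dis_from != dis_to:
                pairs.add((dis_from, dis_to))
    pair_counts.update(pairs)

for (src, dst), count in sorted(pair_counts.items()):
    print(src, dst, count)
```

On the sample table this reproduces the final table the question asks for: A C, A D, B A, B C, B D and C D, each with a count of 1.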

How to concatenate and compress string variable at the same time in python ?

Say I have the variables below (id, a, b, c, d):
id  a  b  c  d
x   2  4  5  7
y   4  5     9
z      1     2
I want to create a new concatenated variable named 'total' from these strings, so I used the code below:
total = a + ' ' + b + ' ' + c + ' ' + d
Since I don't want the values run together as 2457, I put one blank space (' ') between the variables to get 2 4 5 7, and my result looks something like this:
id  a  b  c  d  total
x   2  4  5  7  2 4 5 7
y   4  5     9  4 5  9
z      1     2   1  2
My problem is that, for example in row y between 5 and 9, I only want one space instead of two. I want my result to look like the table below. Can anyone show me how to accomplish this? In SAS I can easily use something to compress the blanks, but I'm not sure how to do this in Python.
id  a  b  c  d  total
x   2  4  5  7  2 4 5 7
y   4  5     9  4 5 9
z      1     2  1 2
Hopefully I'm not confusing anyone ~ , thanks :-)
One of the reasons to use join instead of manually concatenating things is that you can do more complex stuff more easily.
First, if you turn your a + ' ' + b + ' ' + c + ' ' + d into a join:
' '.join((a, b, c, d))
That doesn't change anything yet.
2 4 5 7
4 5  9
 1  2
But now, how do we say "all of the non-empty strings in (a, b, c, d)"? Easy:
' '.join(x for x in (a, b, c, d) if x)
So:
2 4 5 7
4 5 9
1 2
That's it.
If the empty values aren't empty strings (or None) but, say, ' ', you need to change the test. For example, maybe:
' '.join(x for x in (a, b, c, d) if x.strip())
If you don't understand generator expressions, all of the following are roughly equivalent, and hopefully you'll understand one:
total = ' '.join(x for x in (a, b, c, d) if x)
total = ' '.join([x for x in (a, b, c, d) if x])
total = ' '.join(filter(bool, (a, b, c, d)))
non_zero_values = []
for x in (a, b, c, d):
    if x:
        non_zero_values.append(x)
total = ' '.join(non_zero_values)
In every case, the idea is the same: We have a sequence of 4 values, and we're filtering it down to a sequence of 0 to 4 values by keeping only the ones that aren't empty.
If we stuck with your explicit concatenation, this is still possible, it's just much harder and uglier:
(((a + ' ') if a else '') +
 ((b + ' ') if b else '') +
 ((c + ' ') if c else '') +
 (d if d else ''))
Which again gives you:
2 4 5 7
4 5 9
1 2
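The filtering idea can be wrapped in a small helper so that it also copes with None and whitespace-only values. This is only a sketch: the name join_nonempty and the sample rows dict are assumptions, with the empty cells placed as in the question's table.

```python
def join_nonempty(*values, sep=" "):
    """Join the values that are neither None nor blank, separated by `sep`."""
    return sep.join(str(v).strip() for v in values if v is not None and str(v).strip())

# Rows from the question; empty cells are modelled as "" or None.
rows = {
    "x": ("2", "4", "5", "7"),
    "y": ("4", "5", "", "9"),
    "z": ("", "1", None, "2"),
}
totals = {key: join_nonempty(*vals) for key, vals in rows.items()}
print(totals)
```

Converting each value with str means the helper also accepts numbers, which plain ' '.join would reject with a TypeError.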
Assuming your table data is in a list or tuple where each row has the id value as its first column and the value for a given column in a row is None if it is empty:
totals = [' '.join(str(value) for value in row[1:] if value is not None) for row in data]
Alternatively, you could put it in a dict, which might be more useful depending upon how you use it later.
data = {'x' : {'values' : (2, 4, 5, 7)},
        'y' : {'values' : (4, 5, None, 9)},
        'z' : {'values' : (None, 1, None, 2)}}
for data_set in data.values():
    # str() is needed because join only accepts strings and the values here are ints
    data_set['total'] = ' '.join(str(value) for value in data_set['values'] if value is not None)
