Fastest most efficient way to apply groupby that references multiple columns

Fastest most efficient way to apply groupby that references multiple columns - python

Suppose we have a dataset.
tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
'bye': [12,23,35,35,53,62,31,22,33,22,12],
'yes': [12,2,32,3,5,6,23,2,32,2,21],
'no': [1,92,93,3,95,6,33,2,33,22,1],
'maybe': [91,2,32,3,95,69,3,2,93,2,1]})
In python we can easily do tmp.groupby('hi').agg(total_bye = ('bye', sum)) to get the sum of bye for each group. However, if I want to reference multiple columns, what would be the fastest, most efficient and least amount of cleanly (easily readable) written code to do this in python? In particular, can I do this using df.groupby(my_cols).agg()? What are the fastest alternatives? I'm open (actually prefer) to using faster libraries than pandas such as dask or vaex.
For example, in R data.table we can do this pretty easily, and it's super fast
# In R, assume this object is a data.table
# In a single line, the below code groups by 'hi' and then creates my_new_col column based on if bye > 5 and yes <= 20, taking the sum of 'no' for each group.
tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = 'hi']
# output 1
hi my_new_col
1: 1 1
2: 2 116
3: 3 3
4: 5 95
5: 6 6
# Similarly, we can even group by a rule instead of creating a new col to group by. See below
tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = .(new_rule = ifelse(hi > 3, 1, 0))]
# output 2
new_rule my_new_col
1: 0 120
2: 1 101
# We can even apply multiple aggregate functions in parallel using data.table
agg_fns <- function(x) list(sum=sum(as.double(x), na.rm=T),
mean=mean(as.double(x), na.rm=T),
min=min(as.double(x), na.rm=T),
max=max(as.double(x), na.rm=T))
tmp[,
unlist(
list(N = .N, # add a N column (row count) to the summary
unlist(mclapply(.SD, agg_fns, mc.cores = 12), recursive = F)), # apply all agg_fns over all .SDcols
recursive = F),
.SDcols = !unique(c(names('hi'), as.character(unlist('hi'))))]
output 3:
N bye.sum bye.mean bye.min bye.max yes.sum yes.mean yes.min yes.max no.sum no.mean no.min
1: 11 340 30.90909 12 62 140 12.72727 2 32 381 34.63636 1
no.max maybe.sum maybe.mean maybe.min maybe.max
1: 95 393 35.72727 1 95
Do we have this same flexibility in python?

You can use agg on all wanted columns and add a prefix:
tmp.groupby('hi').agg('sum').add_prefix('total_')
output:
total_bye total_yes total_no total_maybe
hi
1 24 33 2 92
2 67 6 116 6
3 134 90 162 131
5 53 5 95 95
6 62 6 6 69
You can even combine columns and operations flexibly with a dictionary:
tmp.groupby('hi').agg(**{'%s_%s' % (label,c): (c, op)
for c in tmp.columns
for (label,op) in [('total', 'sum'), ('average', 'mean')]
})
output:
total_hi average_hi total_bye average_bye total_yes average_yes total_no average_no total_maybe average_maybe
hi
1 2 1 24 12.000000 33 16.5 2 1.000000 92 46.00
2 6 2 67 22.333333 6 2.0 116 38.666667 6 2.00
3 12 3 134 33.500000 90 22.5 162 40.500000 131 32.75
5 5 5 53 53.000000 5 5.0 95 95.000000 95 95.00
6 6 6 62 62.000000 6 6.0 6 6.000000 69 69.00

Related

Compare row wise elements of a single column. If there are 2 continuous L then select lowest from High column and ignore other. Conversly if 2 L

High D_HIGH D_HIGH_H
33 46.57 0 0L
0 69.93 42 42H
1 86.44 68 68H
34 56.58 83 83L
35 67.12 125 125L
2 117.91 158 158H
36 94.51 186 186L
3 120.45 245 245H
4 123.28 254 254H
37 83.20 286 286L
In column D_HIGH_H there is L & H at end.
If there are two continuous H then the one having highest value in High column has to be selected and other has to be ignored(deleted).
If there are two continuous L then the one having lowest value in High column has to be selected and other has to be ignored(deleted).
If the sequence is H,L,H,L then no changes to be made.
Output I want is as follows:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L
I tried various options using list map but did not work out.Also tried with groupby but no logical conclusion.

Here's one way:
g = ((l := df['D_HIGH_H'].str[-1]) != l.shift()).cumsum()
def f(x):
if (x['D_HIGH_H'].str[-1] == 'H').any():
return x.nlargest(1, 'D_HIGH')
return x.nsmallest(1, 'D_HIGH')
df.groupby(g, as_index=False).apply(f)
Output:
High D_HIGH D_HIGH_H
0 33 46.57 0 0L
1 1 86.44 68 68H
2 34 56.58 83 83L
3 2 117.91 158 158H
4 36 94.51 186 186L
5 4 123.28 254 254H
6 37 83.20 286 286L

You can use extract to get the letter, then compute a custom group and groupby.apply with a function that depends on the letter:
# extract letter
s = df['D_HIGH_H'].str.extract('(\D)$', expand=False)
# group by successive letters
# get the idxmin/idxmax depending on the type of letter
keep = (df['High']
.groupby([s, s.ne(s.shift()).cumsum()], sort=False)
.apply(lambda x: x.idxmin() if x.name[0] == 'L' else x.idxmax())
.tolist()
)
out = df.loc[keep]
Output:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L

Apply arithmetic operation from json file to dataframe without using eval function

I have json file contains arithmetic operation list to apply in dataframe columns,
I used eval function , but i could not use eval function due to security issue, how can apply arithmetic operation from json file without using eval function
my df function will be like the below table:
a
b
c
d
10
20
40
25
15
10
35
10
25
20
30
100
100
15
35
20
20
25
25
15
10
25
45
30
10
20
40
25
10
20
40
25
json_ops_list= {"1":"a/b",
"2":"a+b+c",
"3": "(a-b)/(b+d)",
"4":"(a-b)/(b*d+c)"}
def calculations(df, json_ops_list):
final_df = pd.DataFrame()
for i in json_ops_list:
col_name = i + '_final'
final_df[col_name] = df.eval(i)
return final_df```

I think security problem is with python eval, in pandas is no this problem:
Neither simple nor compound statements are allowed. This includes things like for, while, and if.
Your solution is necessary change for dictionary:
json_ops_d= {"1":"a/b",
"2":"a+b+c",
"3": "(a-b)/(b+d)",
"4":"(a-b)/(b*d+c)"}
def calculations(df, d):
final_df = pd.DataFrame()
for k, v in d.items():
col_name = k + '_final'
final_df[col_name] = df.eval(v)
return final_df
df = calculations(df, json_ops_d)
print (df)
1_final 2_final 3_final 4_final
0 0.500000 70 -0.222222 -0.018519
1 1.500000 60 0.250000 0.037037
2 1.250000 75 0.041667 0.002463
3 6.666667 150 2.428571 0.253731
4 0.800000 70 -0.125000 -0.012500
5 0.400000 80 -0.272727 -0.018868
6 0.500000 70 -0.222222 -0.018519
7 0.500000 70 -0.222222 -0.018519

Numpy: Use vectorization for loop while referring to previous row value?

I have the following dataframe for which I want to create a column named 'Value' using numpy for fast looping and at the same time refer to the previous row value in the same column.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"Product": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
"Inbound": [115, 220, 200, 402, 313, 434, 321, 343, 120],
"Outbound": [10, 20, 24, 52, 40, 12, 43, 23, 16],
"Is First?": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"],
}
)
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
The formula for Value column in pseudocode is:
if ['Is First?'] = 'Yes' then [Value] = [Inbound] + [Outbound]
else [Value] = [Previous Value] - [Outbound]
The ideal way of creating the Value column right now is to do a for loop and use shift to refer to the previous column (which I am somehow not able to make work). But since I will be applying this over a giant dataset, I want to use the numpy vectorization method on it.
for i in range(len(df)):
if df.loc[i, "Is First?"] == "Yes":
df.loc[i, "Value"] = df.loc[i, "Inbound"] + df.loc[i, "Outbound"]
else:
df.loc[i, "Value"] = df.loc[i, "Value"].shift(-1) + df.loc[i, "Outbound"]

One way:
You may use np.subtract.accumulate with transform
s = df['Is First?'].eq('Yes').cumsum()
df['value'] = ((df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'), df.Outbound)
.groupby(s)
.transform(np.subtract.accumulate))
Out[1749]:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
Another way:
Assign value for Yes. Create groupid s to use for groupby. Groupby and shift Outbound to calculate cumsum, and subtract it from 'Yes' value of each group. Finally, use it to fillna.
df['value'] = (df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'))
s = df['Is First?'].eq('Yes').cumsum()
s1 = df.value.ffill() - df.Outbound.shift(-1).groupby(s).cumsum().shift()
df['value'] = df.value.fillna(s1)
Out[1671]:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125.0
1 A 220 20 No 105.0
2 A 200 24 No 81.0
3 A 402 52 No 29.0
4 B 313 40 Yes 353.0
5 B 434 12 No 341.0
6 B 321 43 No 298.0
7 C 343 23 Yes 366.0
8 C 120 16 No 350.0

This is not a trivial task, the difficulty lies in the consecutive Nos. It's necessary to group consecutive no's together, the code below should do,
col_sum = df.Inbound+df.Outbound
mask_no = df['Is First?'].eq('No')
mask_yes = df['Is First?'].eq('Yes')
consec_no = mask_yes.cumsum()
result = col_sum.groupby(consec_no).transform('first')-df['Outbound'].where(mask_no,0).groupby(consec_no).cumsum()

Use:
df.loc[df['Is First?'].eq('Yes'),'Value']=df['Inbound']+df['Outbound']
df.loc[~df['Is First?'].eq('Yes'),'Value']=df['Value'].fillna(0).shift().cumsum()-df.loc[~df['Is First?'].eq('Yes'),'Outbound'].cumsum()

Annotated numpy code:
## 1. line up values to sum
ob = -df["Outbound"].values
# get yes indices
fi, = np.where(df["Is First?"].values == "Yes")
# insert yes formula at yes positions
ob[fi] = df["Inbound"].values[fi] - ob[fi]
## 2. calculate block sums and subtract each from the
## first element of the **next** block
ob[fi[1:]] -= np.add.reduceat(ob,fi)[:-1]
# now simply taking the cumsum will reset after each block
df["Value"] = ob.cumsum()
Result:
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350

compare 2 dataframe with pandas

It is the first time I use pandas and I do not really know how to deal with my problematic.
In fact I have 2 data frame:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an exemple of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
and
[78793 rows x 2 columns]
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
if you stay focus on the cluster database, the first column correspond to the cluster ID and inside those clusters there are several sequences ID.
What I need to to is first to split all my cluster (in R it would be like: liste=split(x = data$V2, f = data$V1) )
And then, creat a function which displays the most similarity paires sequence within each cluster.
here is an exemple:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
On the blast dataframe there is on the 3th column the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account the sequences paired by themselves
IF someone could give me some clues it would be fantastic.
Thank you all.

Firstly I assume that there are no Pairings in 'blast' with sequences from two different Clusters. In other words: in this solution the cluster-ID of a pairing will be evaluated by only one of the two sequence IDs.
Including cluster information and pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings which have the same substrings in their seqid, the most readable way would be to add data columns with these data:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec-values can be filtered the same way like it was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here - except the above mentioned assumption - is, that if there are several exactly equal max-values, only one of them would be taken into account.
Note: if you don't want the spec-columns to be of type string, you could easiliy turn them into integers on the fly by:
import numpy as np
data['qspec'] = [np.int(seqid.split('_')[1]) for seqid in data['qseqid'].values]

This merges the dataframes based first on sseqid, then on qseqid, and then returns results_df. Any with 100% match are filtered out. Let me know if this works. You can then order by cluster name.
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names',right_on='sseqid')
results_df = results_df.append(cluster.merge(blast, left_on='seq_names',right_on='qseqid'))

Pandas dataframe with multiindex column - merge levels

I have a dataframe, grouped, with multiindex columns as below:
import pandas as pd
codes = ["one","two","three"];
colours = ["black", "white"];
textures = ["soft", "hard"];
N= 100 # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
'weeks_elapsed' : [random.choice(range(1,25)) for i in range(1,N+1)],
'code' : [random.choice(codes) for i in range(1,N+1)],
'colour': [random.choice(colours) for i in range(1,N+1)],
'texture': [random.choice(textures) for i in range(1,N+1)],
'size': [random.randint(1,100) for i in range(1,N+1)],
'scaled_size': [random.randint(100,1000) for i in range(1,N+1)]
}, columns= ['id', 'weeks_elapsed', 'code','colour', 'texture', 'size', 'scaled_size'])
grouped = df.groupby(['code', 'colour']).agg( {'size': [np.sum, np.average, np.size, pd.Series.idxmax],'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}).reset_index()
>> grouped
code colour size scaled_size
sum average size idxmax sum average size idxmax
0 one black 1031 60.647059 17 81 185.153944 10.891408 17 47
1 one white 481 37.000000 13 53 204.139249 15.703019 13 53
2 three black 822 48.352941 17 6 123.269405 7.251141 17 31
3 three white 1614 57.642857 28 50 285.638337 10.201369 28 37
4 two black 523 58.111111 9 85 80.908912 8.989879 9 88
5 two white 669 41.812500 16 78 82.098870 5.131179 16 78
[6 rows x 10 columns]
How can I flatten/merge the column index levels as: "Level1|Level2", e.g. size|sum, scaled_size|sum. etc? If this is not possible, is there a way to groupby() as I did above without creating multi-index columns?

There is potentially a better way, more pythonic way to flatten MultiIndex columns.
1. Use map and join with string column headers:
grouped.columns = grouped.columns.map('|'.join).str.strip('|')
print(grouped)
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 862 53.875000 16 14
1 one white 554 46.166667 12 18
2 three black 842 49.529412 17 90
3 three white 740 56.923077 13 97
4 two black 1541 61.640000 25 50
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 6980 436.250000 16 77
1 6101 508.416667 12 13
2 7889 464.058824 17 64
3 6329 486.846154 13 73
4 12809 512.360000 25 23
2. Use map with format for column headers that have numeric data types.
grouped.columns = grouped.columns.map('{0[0]}|{0[1]}'.format)
Output:
code| colour| size|sum size|average size|size size|idxmax \
0 one black 734 52.428571 14 30
1 one white 1110 65.294118 17 88
2 three black 930 51.666667 18 3
3 three white 1140 51.818182 22 20
4 two black 656 38.588235 17 77
5 two white 704 58.666667 12 17
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 8229 587.785714 14 57
1 8781 516.529412 17 73
2 10743 596.833333 18 21
3 10240 465.454545 22 26
4 9982 587.176471 17 16
5 6537 544.750000 12 49
3. Use list comprehension with f-string for Python 3.6+:
grouped.columns = [f'{i}|{j}' if j != '' else f'{i}' for i,j in grouped.columns]
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 1003 43.608696 23 76
1 one white 1255 59.761905 21 66
2 three black 777 45.705882 17 39
3 three white 630 52.500000 12 23
4 two black 823 54.866667 15 33
5 two white 491 40.916667 12 64
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 12532 544.869565 23 27
1 13223 629.666667 21 13
2 8615 506.764706 17 92
3 6101 508.416667 12 43
4 7661 510.733333 15 42
5 6143 511.916667 12 49

you could always change the columns:
grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]

Based on Scott Boston's answer,
little update(it will be work for 2 or more levels column):
temp.columns.map(lambda x: '|'.join([str(i) for i in x]))
Thank you, Boston!

Full credit to suraj's concise answer: https://stackoverflow.com/a/72616083/317797
df.columns = df.columns.map('_'.join)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Fastest most efficient way to apply groupby that references multiple columns - python

Related

Compare row wise elements of a single column. If there are 2 continuous L then select lowest from High column and ignore other. Conversly if 2 L

Apply arithmetic operation from json file to dataframe without using eval function

Numpy: Use vectorization for loop while referring to previous row value?

compare 2 dataframe with pandas

Pandas dataframe with multiindex column - merge levels

Categories

Resources