Counting line frequencies and producing output files - python

With a textfile like this:
a;b
b;a
c;d
d;c
e;a
f;g
h;b
b;f
b;f
c;g
a;b
d;f
How can one read it, and produce two output text files: one keeping only the lines representing the most often occurring couple for each letter; and one keeping all the couples that include any of the top 25% of most commonly occurring letters.
Sorry for not sharing any code. Been trying lots of stuff with list comprehensions, counts, and pandas, but not fluent enough.

Here is an answer without frozen set.
# df is the two-column frame (columns 'A' and 'B') read with pd.read_csv in the older answer below
df1 = df.apply(sorted, axis=1, result_type='broadcast')  # sort each row so that a;b and b;a match
df_count = df1.groupby(['A', 'B']).size().reset_index().sort_values(0, ascending=False)
df_count.columns = ['A', 'B', 'Count']
df_all = pd.concat([df_count.assign(letter=lambda x: x['A']),
                    df_count.assign(letter=lambda x: x['B'])]).sort_values(['letter', 'Count'], ascending=[True, False])
df_first = df_all.groupby(['letter']).first().reset_index()
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
------------older answer --------
Since the order within a couple doesn't matter, you can turn each row into a frozenset and count those:
import pandas as pd
df = pd.read_csv('text.csv', header=None, names=['A','B'], sep=';')
s = df.apply(frozenset, 1)
df_count = s.value_counts().reset_index()
df_count.columns = ['Combos', 'Count']
Which will give you this
   Combos  Count
0  (a, b)      3
1  (b, f)      2
2  (d, c)      2
3  (g, f)      1
4  (b, h)      1
5  (c, g)      1
6  (d, f)      1
7  (e, a)      1
To get the highest combo for each letter we will concatenate this dataframe on top of itself and make another column that will hold either the first or second letter.
df_a = df_count.copy()
df_b = df_count.copy()
df_a['letter'] = df_a['Combos'].apply(lambda x: list(x)[0])
df_b['letter'] = df_b['Combos'].apply(lambda x: list(x)[1])
df_all = pd.concat([df_a, df_b]).sort_values(['letter', 'Count'], ascending =[True, False])
And since this is sorted by letter and count (descending) just get the first row of each group.
df_first = df_all.groupby('letter').first()
And to get the top 25%, just use
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
And then use .to_csv to output to file.
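The snippets above take the top 25% of couples, whereas the question asks for couples containing any of the top 25% of letters. The following is a minimal standard-library sketch of that reading, not the answerer's method. The names `lines`, `best`, `top_letters` and `kept` are made up for illustration, the input is inlined rather than read from a file, and it assumes "top 25%" means the top quarter of distinct letters by frequency.

```python
from collections import Counter

# The sample lines from the question (in practice: read them from the text file).
lines = ["a;b", "b;a", "c;d", "d;c", "e;a", "f;g",
         "h;b", "b;f", "b;f", "c;g", "a;b", "d;f"]

# Treat "a;b" and "b;a" as the same couple by sorting the two letters.
pairs = [tuple(sorted(line.split(";"))) for line in lines]
pair_counts = Counter(pairs)

# Most frequent couple for each letter; ties are broken alphabetically by couple.
best = {}
for pair, n in sorted(pair_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    for letter in pair:
        best.setdefault(letter, (pair, n))

# Letter frequencies, then the top quarter of the distinct letters.
# Assumption: ties at the cut-off are resolved arbitrarily by most_common().
letter_counts = Counter(letter for pair in pairs for letter in pair)
k = max(1, len(letter_counts) // 4)
top_letters = {letter for letter, _ in letter_counts.most_common(k)}

# Original lines that contain at least one of the top letters.
kept = [line for line in lines if set(line.split(";")) & top_letters]

print(best["a"])           # (('a', 'b'), 3)
print(letter_counts["b"])  # 6
```

Each result can then be written out with an ordinary `open(..., "w")` and `"\n".join(...)`.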

Related

Python script to sum values according to conditions in a loop

I need to sum the value contained in a column (column 9) if a condition is satisfied: the condition is that it needs to be a pair of individuals (column 1 and column 3), whether they are repeated or not.
My input file is made this way:
Sindhi_HGDP00171 0 Tunisian_39T 0 1 120437718 147097266 3.02 7.111
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 3.468
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 4.468
IBS_HG01768 2 Moroccan_MRA46 1 1 34186193 36027711 30.46 3.108
IBS_HG01710 1 Sardinian_HGDP01065 2 1 246117191 249120684 7.53 3.258
IBS_HG01768 2 Moroccan_MRA46 2 1 34186193 37320967 43.4 4.418
For instance, I need the values in column 9 to be summed for each pair. Some pairs appear multiple times: in this case I need the sum of column 9 between IBS_HG01768 and Moroccan_MRA46, and the sum between Sindhi_HGDP00183 and Sindhi_HGDP00206. Other pairs are not repeated, but I still need them to appear in the final results.
What I have managed so far is to sum by group (population), so I sum the column 9 values by pair of populations, such as Sindhi and Tunisian. I need to do the sum by pairs of individuals.
My script is this:
import pandas as pd
import numpy as np
import itertools
# defines columns names
cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']
# loads data (the file needs to be in the same folder where the script is)
data = pd.read_csv("./Roma_Ref_All_sorted.txt", sep = '\t', names = cols)
# removes the sample ID for ID1/ID2 columns and places it in two dedicated columns
data[['ID1', 'ID1_samples']] = data['ID1'].str.split('_', expand = True)
data[['ID2', 'ID2_samples']] = data['ID2'].str.split('_', expand = True)
# gets the groups list from both ID columns...
groups_id1 = list(data.ID1.unique())
groups_id2 = list(data.ID2.unique())
groups = list(set(groups_id1 + groups_id2))
# ... and all the possible pairs
group_pairs = [i for i in itertools.combinations(groups, 2)]
# subsets the pairs having Roma
group_pairs_roma = [x for x in group_pairs if ('Roma' in x[0] and x[0] != 'Romanian') or
                    ('Roma' in x[1] and x[1] != 'Romanian')]
# prepares output df
result = pd.DataFrame(columns = ['ID1', 'ID2', 'IBD_sum'])
# loops all the possible pairs and computes the sum of IBD length
for idx, group_pair in enumerate(group_pairs_roma):
    id1 = group_pair[0]
    id2 = group_pair[1]
    ibd_sum = round(data.loc[((data['ID1'] == id1) & (data['ID2'] == id2)) |
                             ((data['ID1'] == id2) & (data['ID2'] == id1)), 'IBDLENGTH'].sum(),3)
    result.loc[idx, ['ID1', 'ID2', 'IBD_sum']] = [id1, id2, ibd_sum]
# saves results
result.to_csv("./groups_pairs_sum_IBD.txt", sep = '\t', index = False)
My current output is something like this:
ID1 ID2 IBD_sum
Sindhi IBS 3.275
Sindhi Moroccan 74.201
Sindhi Sindhi 119.359
While I need something like:
ID1 ID2 IBD_sum
Sindhi_individual1 Moroccan_individual1 3.275
Sindhi_individual2 Moroccan_individual2 5.275
Sindhi_individual3 IBS_individual1 4.275
I have tried by substituting one line in my code, by writing
groups_id1 = list(data.ID1_samples.unique())
groups_id2 = list(data.ID2_samples.unique())
and later
ibd_sum = round(data.loc[((data['ID1_samples'] == id1) & (data['ID2_samples'] == id2)) |
                         ((data['ID1_samples'] == id2) & (data['ID2_samples'] == id1)), 'IBDLENGTH'].sum(),3)
Which in theory should work because I set the individuals as pairs instead of populations as pairs, but the output was empty. What could I do to edit the code for what I need?
I solved the problem myself, although in R rather than Python.
This is the code:
library(dplyr)  # provides %>%, group_by, summarise and n
ibd <- read.delim("input.txt", sep='\t')
ibd_sum_indv <- ibd %>%
  group_by(ID1, ID2) %>%
  summarise(SIBD = sum(IBDLENGTH),
            NIBD = n()) %>%
  ungroup()
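For readers who want to stay in Python, a rough pandas equivalent of the R pipeline above might look like the sketch below. It is an assumption-laden illustration, not the poster's code. The sample rows are the six lines from the question, inlined and whitespace-separated for readability, and the output column name `n_segments` is made up. Unlike the R code, it also sorts each pair so that (X, Y) and (Y, X) land in the same group.

```python
import io
import pandas as pd

cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']

# The six sample rows from the question (the real file is tab-separated).
sample = """\
Sindhi_HGDP00171 0 Tunisian_39T 0 1 120437718 147097266 3.02 7.111
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 3.468
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 4.468
IBS_HG01768 2 Moroccan_MRA46 1 1 34186193 36027711 30.46 3.108
IBS_HG01710 1 Sardinian_HGDP01065 2 1 246117191 249120684 7.53 3.258
IBS_HG01768 2 Moroccan_MRA46 2 1 34186193 37320967 43.4 4.418
"""
data = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=cols)

# Put the two individuals of each row in alphabetical order,
# so that a pair is counted once regardless of which column each ID is in.
sorted_pairs = [sorted(p) for p in zip(data['ID1'], data['ID2'])]
data['ID1'] = [p[0] for p in sorted_pairs]
data['ID2'] = [p[1] for p in sorted_pairs]

# Sum IBDLENGTH and count segments for every pair of individuals.
result = (data.groupby(['ID1', 'ID2'])['IBDLENGTH']
              .agg(['sum', 'size'])
              .reset_index()
              .rename(columns={'sum': 'IBD_sum', 'size': 'n_segments'}))
print(result)
```

Pairs that occur only once still appear in `result`, with `n_segments` equal to 1.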

Dataframe-Compare list of columns' values against one column's value with multiple criteria - vectorized

I have the below dataframe:
Com1 a b c d e f g
0.83389 0.83585 0.8351 0.83493 0.83683 0.83697 0.83622 0.83573
0.83375 0.83585 0.8351 0.83493 0.83683 0.83487 0.83357 0.83147
0.83366 0.83585 0.8351 0.83493 0.83683 0.83567 0.83487 0.83357
0.834 0.83585 0.8351 0.83493 0.83567 0.83147 0.82807 0.82257
What I need is to compare the letter columns to the Com1 column. In plain language, the formula is:
ColList = [a,b,c,d,e,f,g]
if any number in ColList is > Com1 and < (Com1 +0.001):
    Result = 0
else:
    Result = 1
The desired result is:
Com1 a b c d e f g Result
0.83389 0.83585 0.8351 [0.83473] 0.83683 0.83697 0.83622 0.83573 0
0.83375 0.83585 0.8351 0.83493 0.83683 0.83487 0.83357 0.83147 1
0.83366 0.83585 0.8351 0.83493 0.83683 0.83567 0.83487 0.83357 1
0.834 0.83585 0.8351 [0.83493] 0.83567 0.83147 0.82807 0.82257 0
(I bracketed the numbers that are causing the "0" result)
I tried to use np.where, but it did not yield the correct results when I manually compared in excel. The formula I tried was:
ColList = [a,b,c,d,e,f,g]
df['Result'] = np.where( (df[ColList ].gt(df['Com1'],axis=0).any(axis=1)) & (df[ColList ].lt(df['Com1'] + 0.001,axis=0).any(axis=1)),0,1)
I also considered using the between() function, but from my research it works on a Series, and since I have a list of columns rather than a single Series I could not use it.
I appreciate the assistance.
For a vectorized approach, use the gt and lt methods to get two boolean frames, combine them element-wise with the & operator, and only then reduce across each row with any:
cols = df.drop(columns='Com1')
out = (1-(cols.gt(df['Com1'], axis=0) &
          cols.lt(df['Com1']+0.001, axis=0))
       .any(axis=1).astype(int))
Output:
0 1
1 1
2 1
3 0
dtype: int64
Note that the output I get here is different from the one you get because the input given is different from the input used to derive the desired output.
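The reason the np.where attempt in the question gives different results is the order of the reductions: it applies any to each condition separately, so a row matches as soon as one column is above Com1 and some other column is below Com1 + 0.001. A small NumPy sketch with made-up numbers (the arrays `com1` and `vals` are illustrative, not the question's data) shows the difference:

```python
import numpy as np

com1 = np.array([0.5, 0.5])
vals = np.array([[0.4,    0.9],   # one value below the band, one above, none inside it
                 [0.5005, 0.9]])  # 0.5005 lies inside (com1, com1 + 0.001)

above = vals > com1[:, None]
below = vals < (com1 + 0.001)[:, None]

# Reducing each condition first: different columns may satisfy each half.
separate = above.any(axis=1) & below.any(axis=1)
# Combining element-wise first: the same value must satisfy both halves.
combined = (above & below).any(axis=1)

print(separate)      # [ True  True]
print(combined)      # [False  True]
print(1 - combined)  # [1 0]
```

Only the second form matches the "any number is > Com1 and < Com1 + 0.001" rule.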
You can use apply with axis=1 to work on the rows with the help of a function:
def between(s: pd.Series):
    ColList = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    return int(not any(s[ColList].between(s['Com1'], s['Com1']+0.001)))
df['result'] = df.apply(between, axis=1)
If you want, you could also pass the column list and the column to compare against as arguments, and bind them with functools.partial before applying the function.
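A possible shape for that parameterised version is sketched below. The function name `in_band` and its parameters `cols`, `ref` and `width` are hypothetical. It uses strict comparisons, as in the question's rule, rather than `between`, which includes the endpoints by default. The two rows are the first and last rows of the question's input table.

```python
from functools import partial
import pandas as pd

def in_band(row, cols, ref, width):
    """Return 0 if any value in `cols` lies strictly inside
    (row[ref], row[ref] + width), otherwise 1."""
    vals = row[cols]
    hit = ((vals > row[ref]) & (vals < row[ref] + width)).any()
    return int(not hit)

df = pd.DataFrame(
    [[0.83389, 0.83585, 0.8351, 0.83493, 0.83683, 0.83697, 0.83622, 0.83573],
     [0.834,   0.83585, 0.8351, 0.83493, 0.83567, 0.83147, 0.82807, 0.82257]],
    columns=['Com1', 'a', 'b', 'c', 'd', 'e', 'f', 'g'])

# Bind the fixed arguments once, then apply row by row.
check = partial(in_band, cols=['a', 'b', 'c', 'd', 'e', 'f', 'g'], ref='Com1', width=0.001)
df['Result'] = df.apply(check, axis=1)
print(df['Result'].tolist())  # [1, 0]
```

Row-wise apply is convenient but slower than the vectorized version on large frames.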

Comparing values in different pairs of columns in Pandas

I would like to count how many times column A has the same value as B, and as C. Similarly, I would like to count how many times A2 has the same value as B2, and as C2.
I have this dataframe:
,A,B,C,A2,B2,C2
2018-12-01,7,0,8,17,17,17
2018-12-02,0,0,8,20,18,18
2018-12-03,9,8,8,17,17,18
2018-12-04,8,8,8,17,17,18
2018-12-05,8,8,8,17,17,17
2018-12-06,9,8,8,15,17,17
2018-12-07,8,9,9,17,17,16
2018-12-08,0,0,0,17,17,17
2018-12-09,8,0,0,17,20,18
2018-12-10,8,8,8,17,17,17
2018-12-11,8,8,9,17,17,17
2018-12-12,8,8,8,17,17,17
2018-12-13,8,8,8,17,17,17
2018-12-14,8,8,8,17,17,17
2018-12-15,9,9,9,17,17,17
2018-12-16,12,0,0,17,19,17
2018-12-17,11,9,9,17,17,17
2018-12-18,8,9,9,17,17,17
2018-12-19,8,9,8,17,17,17
2018-12-20,9,8,8,17,17,17
2018-12-21,9,9,9,17,17,17
2018-12-22,10,9,0,17,17,17
2018-12-23,10,11,10,17,17,17
2018-12-24,10,10,8,17,19,17
2018-12-25,7,10,10,17,17,18
2018-12-26,10,0,10,17,19,17
2018-12-27,9,10,8,18,17,17
2018-12-28,9,9,9,17,17,17
2018-12-29,10,10,12,18,17,17
2018-12-30,10,0,10,16,19,17
2018-12-31,11,8,8,19,17,16
I expect the following value:
A with B = 14
A with C = 14
A2 with B2 = 14
A2 with C2 = 14
I have done this:
ia = 0
for i in range(0,len(dfr_h_max1)):
    if dfr_h_max1['A'][i] == dfr_h_max1['B'][i]:
        ia=ia+1
ib = 0
for i in range(0,len(dfr_h_max1)):
    if dfr_h_max1['A'][i] == dfr_h_max1['C'][i]:
        ib=ib+1
In order to take advantage of pandas, this is one possible solution:
import numpy as np
dfr_h_max1['que'] = np.where((dfr_h_max1['A'] == dfr_h_max1['B']), 1, 0)
After that I could sum all the elements in the new column 'que'.
Another possibility could be related to some sort of boolean variable. Unfortunately, I still do not have enough knowledge about that.
Any other more efficient or elegant solutions?
The primary calculation you need here is, for example, dfr_h_max1['A'] == dfr_h_max1['B'] - as you've done in your edit. That gives you a Series of True/False values based on the equality of each pair of items in the two series. Since True evaluates to 1 and False evaluates to 0, the .sum() is the count of how many True's there were - hence, how many matches.
Put that in a loop and add the required "text" for the output you want:
mains = ('A', 'A2') # the main columns
comps = (['B', 'C'], ['B2', 'C2']) # columns to compare each main with
for main, pair in zip(mains, comps):
    for col in pair:
        print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')
        # or without f-strings, do:
        # print(main, 'with', col, '=', (dfr_h_max1[main] == dfr_h_max1[col]).sum())
Output:
A with B = 14
A with C = 14
A2 with B2 = 21
A2 with C2 = 20
Btw, (df[main] == df[comp]).sum() for Series.sum() can also be written as sum(df[main] == df[comp]) for Python's builtin sum().
In case you have more than two "triplets" of columns (not just A & A2), change the mains and comps to this, so that it works on all triplets:
mains = dfr_h_max1.columns[::3] # main columns (A's), in steps of 3
comps = zip(dfr_h_max1.columns[1::3], # offset by 1 column (B's),
dfr_h_max1.columns[2::3]) # offset by 2 columns (C's),
# in steps of 3
(Or even using the column names / starting letter.)
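As a variation on the same idea, the inner loop can be dropped by comparing several columns against the main column at once with `DataFrame.eq(..., axis=0)`. This is only a sketch: the frame `df` below holds just the first four rows of the question's data, so the counts are for those four rows, not the full month.

```python
import pandas as pd

# First four rows of the question's data.
df = pd.DataFrame({'A':  [7, 0, 9, 8],     'B':  [0, 0, 8, 8],     'C':  [8, 8, 8, 8],
                   'A2': [17, 20, 17, 17], 'B2': [17, 18, 17, 17], 'C2': [17, 18, 18, 18]})

# eq(..., axis=0) aligns the Series with the rows and compares it to every column;
# summing the boolean frame counts the matches per column.
counts_a = df[['B', 'C']].eq(df['A'], axis=0).sum()
counts_a2 = df[['B2', 'C2']].eq(df['A2'], axis=0).sum()

print(counts_a.to_dict())   # {'B': 2, 'C': 1}
print(counts_a2.to_dict())  # {'B2': 3, 'C2': 1}
```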

Performance enhancement of ranking function by replacement of lambda x with vectorization

I have a ranking function that I apply to a large number of columns of several million rows which takes minutes to run. By removing all of the logic preparing the data for application of the .rank( method, i.e., by doing this:
ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))
I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame, together with my ranking functions below, i.e. an MCVE. Broadly, I think that my questions boil down to:
(i) How can one replace the .apply(lambda x usage in the code with a fast, vectorized equivalent?
(ii) How can one loop over multi-indexed, grouped, data frames and apply a function? In my case, to each unique combination of the date_id and category columns.
(iii) What else can I do to speed up my ranking logic? The main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps one can do most of this logic on df, perhaps via construction of temporary columns, before sending for ranking. Similarly, can one rank the sub-dataframe in one call?
(iv) Why use pd.qcut() rather than df.rank()? The latter is cythonized and seems to have more flexible handling of ties, but I cannot see a comparison between the two, and pd.qcut() seems most widely used.
Sample input data is as follows:
import pandas as pd
import numpy as np
import random
to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')
The two ranking functions are:
def rank_fun(df, to_rank): # calls ranking function f(x) to rank each category at each date
    #extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked
def f(x):
    nans = x[np.isnan(x)] # Remove nans as these will be ranked with 50
    sub_df = x.dropna() #
    nans_ranked = nans.replace(np.nan, 50) # give nans rank of 50
    if len(sub_df.index) == 0: #check not all nan. If no non-nan data, then return with rank 50
        return nans_ranked
    if len(sub_df.unique()) == 1: # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df
    #Check that we don't have too many clustered values, such that we can't bin due to overlap of ties, and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster
    if max_bins > 100: #if largest cluster <1% of available data, then we can percentile_rank
        max_bins = 100
    if max_bins < 5: #if we don't have the resolution to quintile rank then assume no data.
        sub_df[:] = 50
        return sub_df
    bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    sub_df_ranked = pd.qcut(sub_df, bins, labels=False) #currently using pd.qcut. pd.rank( seems to have extra functionality, but overheads similar in practice
    sub_df_ranked *= (100 / bins) #Since we bin using the resolution specified in bins, to convert back to decile rank, we have to multiply by 100/bins. E.g. with quintiles, we'll have scores 1 - 5, so have to multiply by 100 / 5 = 20 to convert to percentile ranking
    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df
And the code to call my ranking function and recombine with df is:
# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
df = df.join(ranked[ranked_cols])
I am trying to get this ranking logic as fast as I can, by removing both lambda x calls; I can remove the logic in rank_fun so that only f(x)'s logic is applicable, but I also don't know how to process multi-index dataframes in a vectorized fashion. An additional question would be on differences between pd.qcut( and df.rank(: it seems that both have different ways of dealing with ties, but the overheads seem similar, despite the fact that .rank( is cythonized; perhaps this is misleading, given the main overheads are due to my usage of lambda x.
I ran %lprun on f(x) which gave me the following results, although the main overhead is the use of .apply(lambda x rather than a vectorized approach:
Line # Hits Time Per Hit % Time Line Contents
2 def tst_fun(df, field):
3 1 685 685.0 0.2 x = df[field]
4 1 20726 20726.0 5.8 nans = x[np.isnan(x)]
5 1 28448 28448.0 8.0 sub_df = x.dropna()
6 1 387 387.0 0.1 nans_ranked = nans.replace(np.nan, 50)
7 1 5 5.0 0.0 if len(sub_df.index) == 0:
8 pass #check not empty. May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
9 return nans_ranked
10
11 1 65559 65559.0 18.4 if len(sub_df.unique()) == 1:
12 sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
13 return sub_df
14
15 #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
16 1 74610 74610.0 20.9 max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
17 # print(counts)
18 1 9 9.0 0.0 max_bins = len(sub_df) / max_cluster #
19
20 1 3 3.0 0.0 if max_bins > 100:
21 1 0 0.0 0.0 max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
22
23
24 1 0 0.0 0.0 if max_bins < 5:
25 sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
26
27 # return sub_df
28
29 1 1 1.0 0.0 bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
30
31 #should track bin resolution for all data. To add.
32
33 #if get here, then neither nans_ranked, nor sub_df are empty
34 # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
35 1 160530 160530.0 45.0 sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
36
37 1 5777 5777.0 1.6 ranked_df = pd.concat([sub_df_ranked, nans_ranked])
38
39 1 1 1.0 0.0 return ranked_df
I'd build a function using numpy
I plan on using this within each group defined within a pandas groupby
def rnk(df):
    a = df.values.argsort(0)
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame(b / n, df.index, df.columns)
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
var_1_ranked var_2_ranked var_3_ranked
0 0.333333 0.809524 0.428571
1 0.160000 0.360000 0.240000
2 0.153846 0.384615 0.461538
3 0.000000 0.315789 0.105263
4 0.560000 0.200000 0.160000
...
How It Works
Because I know that ranking is related to sorting, I want to use some clever sorting to do this quicker.
numpy's argsort will produce a permutation that can be used to slice the array into a sorted array.
a = np.array([25, 300, 7])
b = a.argsort()
print(b)
[2 0 1]
print(a[b])
[ 7 25 300]
So, instead, I'm going to use the argsort to tell me where the first, second, and third ranked elements are.
# create an empty array that is the same size as b or a
# but these will be ranks, so I want them to be integers
# so I use empty_like(b) because b is the result of
# argsort and is already integers.
u = np.empty_like(b)
# now just like when I sliced a above with a[b]
# I slice u the same way but instead I assign to
# those positions, the ranks I want.
# In this case, I defined the ranks as np.arange(b.size) + 1
u[b] = np.arange(b.size) + 1
print(u)
[2 3 1]
And that was exactly correct. The 7 was in the last position but was our first rank. 300 was in the second position and was our third rank. 25 was in the first position and was our second rank.
Finally, I divide by the number of items being ranked to get the percentiles. It so happens that because the function uses zero based ranking np.arange(n), as opposed to one based np.arange(1, n+1) or np.arange(n) + 1 as in our example, I can do the simple division to get the percentiles.
What's left to do is apply this logic to each group. We can do this in pandas with groupby
Some of the missing details include how I use argsort(0) to get independent sorts per column and that I do some fancy slicing to rearrange each column independently.
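To make those two missing details concrete, here is a small sketch of the column-wise version of the same trick. The array `vals` is made up for illustration; the indexing line is the one used in the function above.

```python
import numpy as np

vals = np.array([[ 25, 3],
                 [300, 1],
                 [  7, 2]])

# argsort(0) sorts each column independently:
# order[i, j] is the row that holds the i-th smallest value of column j.
order = vals.argsort(0)
n, m = vals.shape

ranks = np.empty_like(order)
# Fancy indexing with a (row, column) pair of index arrays:
# for every column j, the row order[i, j] receives rank i.
ranks[order, np.arange(m)[None, :]] = np.arange(n)[:, None]

print(ranks)
# [[1 2]
#  [2 0]
#  [0 1]]
print(ranks / n)  # zero-based ranks divided by the column length
```

The column index array has shape (1, m) and the rank values have shape (n, 1), so broadcasting fills all n x m positions in one assignment.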
Can we avoid the groupby and have numpy do the whole thing?
I'll also take advantage of numba's just in time compiling to speed up some things with njit
from numba import njit
@njit
def count_factor(f):
    c = np.arange(f.max() + 2) * 0
    for i in f:
        c[i + 1] += 1
    return c
@njit
def factor_fun(f):
    c = count_factor(f)
    cc = c[:-1].cumsum()
    return c[1:][f], cc[f]
def lexsort(a, f):
    n, m = a.shape
    f = f * (a.max() - a.min() + 1)
    return (f.reshape(-1, 1) + a).argsort(0)
def rnk_numba(df, gcols, rcols):
    tups = list(zip(*[df[c].values.tolist() for c in gcols]))
    f = pd.Series(tups).factorize()[0]
    a = lexsort(np.column_stack([df[c].values for c in rcols]), f)
    c, cc = factor_fun(f)
    c = c[:, None]
    cc = cc[:, None]
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame((b - cc) / c, df.index, rcols).add_suffix('_ranked')
How it works
Honestly, this is difficult to process mentally. I'll stick with expanding on what I explained above.
I want to use argsort again to drop rankings into the correct positions. However, I have to contend with the grouping columns. So what I do is compile a list of tuples and factorize them as was addressed in this question here
Now that I have a factorized set of tuples I can perform a modified lexsort that sorts within my factorized tuple groups. This question addresses the lexsort.
A tricky bit remains to be addressed, where I must offset the newly found ranks by the size of each preceding group so that I get fresh ranks for every group. This is taken care of in the tiny snippet b - cc in the code above. But calculating cc is a necessary component.
So that's some of the high level philosophy. What about @njit?
Note that when I factorize, I am mapping to the integers 0 to n - 1 where n is the number of unique grouping tuples. I can use an array of length n as a convenient way to track the counts.
In order to accomplish the groupby offset, I needed to track the counts and cumulative counts in the positions of those groups as they are represented in the list of tuples or the factorized version of those tuples. I decided to do a linear scan through the factorized array f and count the observations in a numba loop. While I had this information, I'd also produce the necessary information to produce the cumulative offsets I also needed.
numba provides an interface to produce highly efficient compiled functions. It is finicky and you have to acquire some experience to know what is possible and what isn't possible. I decided to numbafy two functions that are preceded with a numba decorator @njit. This code works just as well without those decorators, but is sped up with them.
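To see what the counts and cumulative offsets look like, the sketch below reproduces them with plain NumPy on a tiny made-up example. It swaps in `np.bincount` and `np.lexsort` in place of the answer's `count_factor` loop and custom `lexsort`; the arrays `f` and `values` are hypothetical.

```python
import numpy as np

f = np.array([0, 1, 0, 2, 1, 0])                  # factorized group codes (made up)
values = np.array([0.5, 0.1, 0.2, 0.9, 0.3, 0.8])

counts = np.bincount(f)                                 # group sizes: [3 2 1]
offsets = np.concatenate(([0], counts.cumsum()[:-1]))   # where each group starts: [0 3 5]
c, cc = counts[f], offsets[f]                           # per-row size and start of its group

# Sort by group first, then by value within the group.
order = np.lexsort((values, f))
b = np.empty_like(order)
b[order] = np.arange(len(values))     # global rank after the grouped sort

within = b - cc                       # rank restarted at 0 inside every group
print(within)                         # [1 0 0 0 1 2]
print(within / c)                     # within-group percentile
```

Here `c` and `cc` correspond to the two arrays returned by `factor_fun(f)`.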
Timing
%%timeit
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
1 loop, best of 3: 481 ms per loop
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
%timeit df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
100 loops, best of 3: 16.4 ms per loop
%timeit rnk_numba(df, gcols, rcols).head()
1000 loops, best of 3: 1.03 ms per loop
I suggest you try this code. It's 3 times faster than yours, and clearer.
rank function:
def rank(x):
    counts = x.value_counts()
    bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
    bins = 100 if bins > 100 else bins
    if bins < 5:
        return x.apply(lambda x: 50)
    else:
        return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)
single thread apply:
for col in to_rank:
    df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)
multi-process apply:
from multiprocessing import Pool
def tfunc(col):
    return df.groupby(['date_id', 'category'])[col].apply(rank)
pool = Pool(len(to_rank))
# the original passed sys.maxint (Python 2 only) as the timeout; on Python 3, .get() without a timeout blocks until all results are ready
result = pool.map_async(tfunc, to_rank).get()
for (col, val) in zip(to_rank, result):
    df[col + '_ranked'] = val

Advanced groupby column creation in pandas Dataframe

I have a catalogue of groups of galaxies in a DataFrame, 'compact', which consists mainly of
a group id ('CG', int),
a magnitude ('R', negative float)
and a morphology ('Morph', string, for example 'S' or 'E').
I'm trying to construct a second pandas DataFrame with the following properties of the groups:
'Morph' of the object having the lowest 'R' in the group
Difference between the second lowest and the lowest 'R' in the group
Difference between the lowest 'R' in the group and R of the group, defined as -2.5*log10(sum(10**(-0.4*R)))
Proportions of objects having a given 'Morph' (one column for 'S', one for other morphologies, for example) in the group, NOT COUNTING THE ONE HAVING THE LOWEST 'R'.
I'm having trouble with the last one; could you help me write it? The other ones work, but, as a secondary question, I would like to know whether I'm doing it cleanly or whether there is a better way.
Here is my code (with a line for my last column which works but doesn't give exactly what I want, and a try in comments which doesn't work):
GroupBy = compact.sort_values('R').groupby('CG', as_index=False)
R2 = GroupBy.head(2).groupby('CG', as_index=False).last().R
R1 = GroupBy.first().sort_values('CG').R
DeltaR12 = R2 - R1
MorphCen = GroupBy.first().sort_values('CG').Morph
Group = GroupBy.first().sort_values('CG').CG
RGroup = GroupBy.apply(lambda x: -2.5*np.log10((10**(-0.4*x.R)).sum()))
DeltaR1gr = R1 - RGroup
# Works, but counts the object with lowest R:
PropS = GroupBy.apply(lambda x: 1.0*x.loc[x['Morph'] == 'S'].shape[0]/x.shape[0])
# Tries to let aside lowest R, but doesn't work:
# PropS = GroupBy.apply(lambda x: 1.0*x.loc[x['Morph'] == 'S' &
# x['R']>x['R'].min()].shape[0]/x.shape[0])
# PropRed = same than PropS, but for 'Morph' != 'S'
CompactML = pd.DataFrame([Group,MorphCen,DeltaR12,DeltaR1gr]).transpose()
CompactML.columns = ['CG', 'MorphCen', 'DeltaR12','DeltaR1gr']
First, it's nice if you provide actual data or create some fake data. Below I have created some fake data with 5 different integer CG groups, 2 types of morphology (S and E) and random negative numbers for 'R'.
I have then redone all your aggregations in a custom function that computes each of the 4 aggregations in one line and returns them as a Series, so that each group becomes one row of the resulting DataFrame.
#create fake data
df = pd.DataFrame({'CG':np.random.randint(0, 5, 100), 'Morph':np.random.choice(['S', 'E'], 100), 'R':np.random.rand(100) * -100})
print(df.head())
CG Morph R
0 3 E -72.377887
1 2 E -26.126565
2 0 E -4.428494
3 0 E -2.055434
4 4 E -93.341489
# define custom aggregation function
def my_agg(x):
    x = x.sort_values('R')
    morph = x.head(1)['Morph'].values[0]
    diff = x.iloc[0]['R'] - x.iloc[1]['R']
    diff2 = -2.5*np.log10(sum(10**(-0.4*x['R'])))
    prop = (x['Morph'].iloc[1:] == 'S').mean()
    return pd.Series([morph, diff, diff2, prop], index=['morph', 'diff', 'diff2', 'prop'])
# apply custom agg function
df.groupby('CG').apply(my_agg)
morph diff diff2 prop
CG
0 E -1.562630 -97.676934 0.555556
1 S -3.228845 -98.398337 0.391304
2 S -6.537937 -91.092164 0.307692
3 E -0.023813 -99.919336 0.500000
4 E -11.943842 -99.815734 0.705882
So, here is the final code, thanks to Ted Pertou:
# define custom aggregation function
def my_agg(x):
    x = x.sort_values('R')
    morph = x.head(1)['Morph'].values[0]
    diff = x.iloc[1]['R'] - x.iloc[0]['R']
    diff2 = x.iloc[0]['R'] + 2.5*np.log10(sum(10**(-0.4*x['R'])))
    prop = (x['Morph'].iloc[1:] == 'S').mean()
    return pd.Series([morph, diff, diff2, prop], index=['MorphCen', 'DeltaR12', 'DeltaRGrp1', 'PropS'])
# apply custom agg function
compact.groupby('CG').apply(my_agg)
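As a side note on the commented-out attempt in the question: it fails because of operator precedence. In Python `&` binds more tightly than `==` and `>`, so each comparison needs its own parentheses. The sketch below shows the mask-based variant on a single made-up group. The frame `group` is hypothetical, and it assumes the denominator should also exclude the object with the lowest R, as the `.iloc[1:]` version above does.

```python
import pandas as pd

# One hypothetical group, already restricted to a single CG value.
group = pd.DataFrame({'Morph': ['E', 'S', 'S', 'E'],
                      'R': [-21.0, -20.5, -19.8, -19.1]})

# Build each condition separately (or wrap each comparison in parentheses)
# before combining them with &.
not_lowest = group['R'] > group['R'].min()
is_s = group['Morph'] == 'S'

# Share of 'S' objects among those that do not have the lowest R.
prop_s = (is_s & not_lowest).sum() / not_lowest.sum()
print(prop_s)  # 2 of the 3 remaining objects are 'S'
```

If several objects tie for the lowest R, this mask drops all of them, whereas sorting and taking `.iloc[1:]` drops only one.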
