Advanced groupby column creation in pandas Dataframe

Advanced groupby column creation in pandas Dataframe - python

I have a catalogue of groups of galaxies in a DataFrame, 'compact', which consists mainly in
a group id ('CG', int),
a magnitude ('R', negative float)
and a morphology ('Morph', string, for example 'S' or 'E').
I'm trying to construct a second pandas DataFrame with the following properties of the groups:
'Morph' of the object having the lowest 'R' in the group
Difference between the second lowest and the lowest 'R' in the group
Difference between the lowest 'R' in the group and R of the group, defined as -2.5*log10(sum(10**(-0.4*R)))
Proportions of objects having a given 'Morph' (on column for 'S', one for other morphologies, for example) in the group, NOT COUNTING THE ONE HAVING THE LOWEST 'R'.
I'm having troubles for the last one, could you help me to write it? The other ones work, but, as a secondary question, I would like if I'm doing it clean or if there's better to do.
Here is my code (with a line for my last column which works but doesn't give exactly what I want, and a try in comments which doesn't work):
GroupBy = compact.sort_values('R').groupby('CG', as_index=False)
R2 = GroupBy.head(2).groupby('CG', as_index=False).last().R
R1 = GroupBy.first().sort_values('CG').R
DeltaR12 = R2 - R1
MorphCen = GroupBy.first().sort_values('CG').Morph
Group = GroupBy.first().sort_values('CG').CG
RGroup = GroupBy.apply(lambda x: -2.5*np.log10((10**(-0.4*x.R)).sum()))
DeltaR1gr = R1 - RGroup
# Works, but counts the object with lowest R:
PropS = GroupBy.apply(lambda x: 1.0*x.loc[x['Morph'] == 'S'].shape[0]/x.shape[0])
# Tries to let aside lowest R, but doesn't work:
# PropS = GroupBy.apply(lambda x: 1.0*x.loc[x['Morph'] == 'S' &
# x['R']>x['R'].min()].shape[0]/x.shape[0])
# PropRed = same than PropS, but for 'Morph' != 'S'
CompactML = pd.DataFrame([Group,MorphCen,DeltaR12,DeltaR1gr]).transpose()
CompactML.columns = ['CG', 'MorphCen', 'DeltaR12','DeltaR1gr']

First, its nice if you provide actual data or create some fake data. Below I have created some fake data with 5 different integer CG groups, 2 types of morphology (S and E) and random negative numbers for 'R'.
I have then redone all your aggregations in a custom function that computes each of the 4 returning aggregations in one line and sends the results back as a Series which adds each output as row to your original DataFrame.
#create fake data
df = pd.DataFrame({'CG':np.random.randint(0, 5, 100), 'Morph':np.random.choice(['S', 'E'], 100), 'R':np.random.rand(100) * -100})
print(df.head())
CG Morph R
0 3 E -72.377887
1 2 E -26.126565
2 0 E -4.428494
3 0 E -2.055434
4 4 E -93.341489
# define custom aggregation function
def my_agg(x):
x = x.sort_values('R')
morph = x.head(1)['Morph'].values[0]
diff = x.iloc[0]['R'] - x.iloc[1]['R']
diff2 = -2.5*np.log10(sum(10**(-0.4*x['R'])))
prop = (x['Morph'].iloc[1:] == 'S').mean()
return pd.Series([morph, diff, diff2, prop], index=['morph', 'diff', 'diff2', 'prop'])
# apply custom agg function
df.groupby('CG').apply(my_agg)
morph diff diff2 prop
CG
0 E -1.562630 -97.676934 0.555556
1 S -3.228845 -98.398337 0.391304
2 S -6.537937 -91.092164 0.307692
3 E -0.023813 -99.919336 0.500000
4 E -11.943842 -99.815734 0.705882

So, here is the final code, thanks to Ted Pertou:
# define custom aggregation function
def my_agg(x):
x = x.sort_values('R')
morph = x.head(1)['Morph'].values[0]
diff = x.iloc[1]['R'] - x.iloc[0]['R']
diff2 = x.iloc[0]['R'] + 2.5*np.log10(sum(10**(-0.4*x['R'])))
prop = (x['Morph'].iloc[1:] == 'S').mean()
return pd.Series([morph, diff, diff2, prop], index=['MorphCen', 'DeltaR12', 'DeltaRGrp1', 'PropS'])
# apply custom agg function
compact.groupby('CG').apply(my_agg)

Related

Comparing values in different pairs of columns in Pandas

I would like to count how many times column A has the same value with B and with C. Similarly, I would like to count how many time A2 has the same value with B2 and with C2.
I have this dataframe:
,A,B,C,A2,B2,C2
2018-12-01,7,0,8,17,17,17
2018-12-02,0,0,8,20,18,18
2018-12-03,9,8,8,17,17,18
2018-12-04,8,8,8,17,17,18
2018-12-05,8,8,8,17,17,17
2018-12-06,9,8,8,15,17,17
2018-12-07,8,9,9,17,17,16
2018-12-08,0,0,0,17,17,17
2018-12-09,8,0,0,17,20,18
2018-12-10,8,8,8,17,17,17
2018-12-11,8,8,9,17,17,17
2018-12-12,8,8,8,17,17,17
2018-12-13,8,8,8,17,17,17
2018-12-14,8,8,8,17,17,17
2018-12-15,9,9,9,17,17,17
2018-12-16,12,0,0,17,19,17
2018-12-17,11,9,9,17,17,17
2018-12-18,8,9,9,17,17,17
2018-12-19,8,9,8,17,17,17
2018-12-20,9,8,8,17,17,17
2018-12-21,9,9,9,17,17,17
2018-12-22,10,9,0,17,17,17
2018-12-23,10,11,10,17,17,17
2018-12-24,10,10,8,17,19,17
2018-12-25,7,10,10,17,17,18
2018-12-26,10,0,10,17,19,17
2018-12-27,9,10,8,18,17,17
2018-12-28,9,9,9,17,17,17
2018-12-29,10,10,12,18,17,17
2018-12-30,10,0,10,16,19,17
2018-12-31,11,8,8,19,17,16
I expect the following value:
A with B = 14
A with C = 14
A2 with B2 = 14
A2 with C2 = 14
I have done this:
ia = 0
for i in range(0,len(dfr_h_max1)):
if dfr_h_max1['A'][i] == dfr_h_max1['B'][i]:
ia=ia+1
ib = 0
for i in range(0,len(dfr_h_max1)):
if dfr_h_max1['A'][i] == dfr_h_max1['C'][i]:
ib=ib+1
In order to take advantage of pandas, this is one possible solution:
import numpy as np
dfr_h_max1['que'] = np.where((dfr_h_max1['A'] == dfr_h_max1['B']), 1, 0)
After that I could sum all the elements in the new column 'que'.
Another possibility could be related to some sort of boolean variable. Unfortunately, I still do not have enough knowledge about that.
Any other more efficient or elegant solutions?

The primary calculation you need here is, for example, dfr_h_max1['A'] == dfr_h_max1['B'] - as you've done in your edit. That gives you a Series of True/False values based on the equality of each pair of items in the two series. Since True evaluates to 1 and False evaluates to 0, the .sum() is the count of how many True's there were - hence, how many matches.
Put that in a loop and add the required "text" for the output you want:
mains = ('A', 'A2') # the main columns
comps = (['B', 'C'], ['B2', 'C2']) # columns to compare each main with
for main, pair in zip(mains, comps):
for col in pair:
print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')
# or without f-strings, do:
# print(main, 'with', col, '=', (dfr_h_max1[main] == dfr_h_max1[col]).sum())
Output:
A with B = 14
A with C = 14
A2 with B2 = 21
A2 with C2 = 20
Btw, (df[main] == df[comp]).sum() for Series.sum() can also be written as sum(df[main] == df[comp]) for Python's builtin sum().
In case you have more than two "triplets" of columns (not just A & A2), change the mains and comps to this, so that it works on all triplets:
mains = dfr_h_max1.columns[::3] # main columns (A's), in steps of 3
comps = zip(dfr_h_max1.columns[1::3], # offset by 1 column (B's),
dfr_h_max1.columns[2::3]) # offset by 2 columns (C's),
# in steps of 3
(Or even using the column names / starting letter.)

Performance enhancement of ranking function by replacement of lambda x with vectorization

I have a ranking function that I apply to a large number of columns of several million rows which takes minutes to run. By removing all of the logic preparing the data for application of the .rank( method, i.e., by doing this:
ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))
I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame, together with my ranking functions below, i.e. an MCVE. Broadly, I think that my questions boil down to:
(i) How can one replace the .apply(lambda x usage in the code with a fast, vectorized equivalent? (ii) How can one loop over multi-indexed, grouped, data frames and apply a function? in my case, to each unique combination of the date_id and category columns.
(iii) What else can I do to speed up my ranking logic? the main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps one can do most of this logic on df, perhaps via construction of temporary columns, before sending for ranking. Similarly, can one rank the sub-dataframe in one call?
(iv) Why use pd.qcut() rather than df.rank()? the latter is cythonized and seems to have more flexible handling of ties, but I cannot see a comparison between the two, and pd.qcut() seems most widely used.
Sample input data is as follows:
import pandas as pd
import numpy as np
import random
to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')
The two ranking functions are:
def rank_fun(df, to_rank): # calls ranking function f(x) to rank each category at each date
#extra data tidying logic here beyond scope of question - can remove
ranked = df[to_rank].apply(lambda x: f(x))
return ranked
def f(x):
nans = x[np.isnan(x)] # Remove nans as these will be ranked with 50
sub_df = x.dropna() #
nans_ranked = nans.replace(np.nan, 50) # give nans rank of 50
if len(sub_df.index) == 0: #check not all nan. If no non-nan data, then return with rank 50
return nans_ranked
if len(sub_df.unique()) == 1: # if all data has same value, return rank 50
sub_df[:] = 50
return sub_df
#Check that we don't have too many clustered values, such that we can't bin due to overlap of ties, and reduce bin size provided we can at least quintile rank.
max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
max_bins = len(sub_df) / max_cluster
if max_bins > 100: #if largest cluster <1% of available data, then we can percentile_rank
max_bins = 100
if max_bins < 5: #if we don't have the resolution to quintile rank then assume no data.
sub_df[:] = 50
return sub_df
bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
sub_df_ranked = pd.qcut(sub_df, bins, labels=False) #currently using pd.qcut. pd.rank( seems to have extra functionality, but overheads similar in practice
sub_df_ranked *= (100 / bins) #Since we bin using the resolution specified in bins, to convert back to decile rank, we have to multiply by 100/bins. E.g. with quintiles, we'll have scores 1 - 5, so have to multiply by 100 / 5 = 20 to convert to percentile ranking
ranked_df = pd.concat([sub_df_ranked, nans_ranked])
return ranked_df
And the code to call my ranking function and recombine with df is:
# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
df = df.join(ranked[ranked_cols])
I am trying to get this ranking logic as fast as I can, by removing both lambda x calls; I can remove the logic in rank_fun so that only f(x)'s logic is applicable, but I also don't know how to process multi-index dataframes in a vectorized fashion. An additional question would be on differences between pd.qcut( and df.rank(: it seems that both have different ways of dealing with ties, but the overheads seem similar, despite the fact that .rank( is cythonized; perhaps this is misleading, given the main overheads are due to my usage of lambda x.
I ran %lprun on f(x) which gave me the following results, although the main overhead is the use of .apply(lambda x rather than a vectorized approach:
Line # Hits Time Per Hit % Time Line Contents
2 def tst_fun(df, field):
3 1 685 685.0 0.2 x = df[field]
4 1 20726 20726.0 5.8 nans = x[np.isnan(x)]
5 1 28448 28448.0 8.0 sub_df = x.dropna()
6 1 387 387.0 0.1 nans_ranked = nans.replace(np.nan, 50)
7 1 5 5.0 0.0 if len(sub_df.index) == 0:
8 pass #check not empty. May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
9 return nans_ranked
10
11 1 65559 65559.0 18.4 if len(sub_df.unique()) == 1:
12 sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
13 return sub_df
14
15 #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
16 1 74610 74610.0 20.9 max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
17 # print(counts)
18 1 9 9.0 0.0 max_bins = len(sub_df) / max_cluster #
19
20 1 3 3.0 0.0 if max_bins > 100:
21 1 0 0.0 0.0 max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
22
23
24 1 0 0.0 0.0 if max_bins < 5:
25 sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
26
27 # return sub_df
28
29 1 1 1.0 0.0 bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
30
31 #should track bin resolution for all data. To add.
32
33 #if get here, then neither nans_ranked, nor sub_df are empty
34 # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
35 1 160530 160530.0 45.0 sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
36
37 1 5777 5777.0 1.6 ranked_df = pd.concat([sub_df_ranked, nans_ranked])
38
39 1 1 1.0 0.0 return ranked_df

I'd build a function using numpy
I plan on using this within each group defined within a pandas groupby
def rnk(df):
a = df.values.argsort(0)
n, m = a.shape
r = np.arange(a.shape[1])
b = np.empty_like(a)
b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
return pd.DataFrame(b / n, df.index, df.columns)
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
var_1_ranked var_2_ranked var_3_ranked
0 0.333333 0.809524 0.428571
1 0.160000 0.360000 0.240000
2 0.153846 0.384615 0.461538
3 0.000000 0.315789 0.105263
4 0.560000 0.200000 0.160000
...
How It Works
Because I know that ranking is related to sorting, I want to use some clever sorting to do this quicker.
numpy's argsort will produce a permutation that can be used to slice the array into a sorted array.
a = np.array([25, 300, 7])
b = a.argsort()
print(b)
[2 0 1]
print(a[b])
[ 7 25 300]
So, instead, I'm going to use the argsort to tell me where the first, second, and third ranked elements are.
# create an empty array that is the same size as b or a
# but these will be ranks, so I want them to be integers
# so I use empty_like(b) because b is the result of
# argsort and is already integers.
u = np.empty_like(b)
# now just like when I sliced a above with a[b]
# I slice u the same way but instead I assign to
# those positions, the ranks I want.
# In this case, I defined the ranks as np.arange(b.size) + 1
u[b] = np.arange(b.size) + 1
print(u)
[2 3 1]
And that was exactly correct. The 7 was in the last position but was our first rank. 300 was in the second position and was our third rank. 25 was in the first position and was our second rank.
Finally, I divide by the number in the rank to get the percentiles. It so happens that because I used zero based ranking np.arange(n), as opposed to one based np.arange(1, n+1) or np.arange(n) + 1 as in our example, I can do the simple division to get the percentiles.
What's left to do is apply this logic to each group. We can do this in pandas with groupby
Some of the missing details include how I use argsort(0) to get independent sorts per column` and that I do some fancy slicing to rearrange each column independently.
Can we avoid the groupby and have numpy do the whole thing?
I'll also take advantage of numba's just in time compiling to speed up some things with njit
from numba import njit
#njit
def count_factor(f):
c = np.arange(f.max() + 2) * 0
for i in f:
c[i + 1] += 1
return c
#njit
def factor_fun(f):
c = count_factor(f)
cc = c[:-1].cumsum()
return c[1:][f], cc[f]
def lexsort(a, f):
n, m = a.shape
f = f * (a.max() - a.min() + 1)
return (f.reshape(-1, 1) + a).argsort(0)
def rnk_numba(df, gcols, rcols):
tups = list(zip(*[df[c].values.tolist() for c in gcols]))
f = pd.Series(tups).factorize()[0]
a = lexsort(np.column_stack([df[c].values for c in rcols]), f)
c, cc = factor_fun(f)
c = c[:, None]
cc = cc[:, None]
n, m = a.shape
r = np.arange(a.shape[1])
b = np.empty_like(a)
b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
return pd.DataFrame((b - cc) / c, df.index, rcols).add_suffix('_ranked')
How it works
Honestly, this is difficult to process mentally. I'll stick with expanding on what I explained above.
I want to use argsort again to drop rankings into the correct positions. However, I have to contend with the grouping columns. So what I do is compile a list of tuples and factorize them as was addressed in this question here
Now that I have a factorized set of tuples I can perform a modified lexsort that sorts within my factorized tuple groups. This question addresses the lexsort.
A tricky bit remains to be addressed where I must off set the new found ranks by the size of each group so that I get fresh ranks for every group. This is taken care of in the tiny snippet b - cc in the code below. But calculating cc is a necessary component.
So that's some of the high level philosophy. What about #njit?
Note that when I factorize, I am mapping to the integers 0 to n - 1 where n is the number of unique grouping tuples. I can use an array of length n as a convenient way to track the counts.
In order to accomplish the groupby offset, I needed to track the counts and cumulative counts in the positions of those groups as they are represented in the list of tuples or the factorized version of those tuples. I decided to do a linear scan through the factorized array f and count the observations in a numba loop. While I had this information, I'd also produce the necessary information to produce the cumulative offsets I also needed.
numba provides an interface to produce highly efficient compiled functions. It is finicky and you have to acquire some experience to know what is possible and what isn't possible. I decided to numbafy two functions that are preceded with a numba decorator #njit. This coded works just as well without those decorators, but is sped up with them.
Timing
%%timeit
ranked_cols = [col + '_ranked' for col in to_rank]

ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
1 loop, best of 3: 481 ms per loop
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
%timeit df.groupby(gcols)[rcols].apply(rnk_numpy).add_suffix('_ranked')
100 loops, best of 3: 16.4 ms per loop
%timeit rnk_numba(df, gcols, rcols).head()
1000 loops, best of 3: 1.03 ms per loop

I suggest you try this code. It's 3 times faster than yours, and more clear.
rank function:
def rank(x):
counts = x.value_counts()
bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
bins = 100 if bins > 100 else bins
if bins < 5:
return x.apply(lambda x: 50)
else:
return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)
single thread apply:
for col in to_rank:
df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)
mulple thread apply:
import sys
from multiprocessing import Pool
def tfunc(col):
return df.groupby(['date_id', 'category'])[col].apply(rank)
pool = Pool(len(to_rank))
result = pool.map_async(tfunc, to_rank).get(sys.maxint)
for (col, val) in zip(to_rank, result):
df[col + '_ranked'] = val

Counting line frequencies and producing output files

With a textfile like this:
a;b
b;a
c;d
d;c
e;a
f;g
h;b
b;f
b;f
c;g
a;b
d;f
How can one read it, and produce two output text files: one keeping only the lines representing the most often occurring couple for each letter; and one keeping all the couples that include any of the top 25% of most commonly occurring letters.
Sorry for not sharing any code. Been trying lots of stuff with list comprehensions, counts, and pandas, but not fluent enough.

Here is an answer without frozen set.
df1 = df.apply(sorted, 1)
df_count =df1.groupby(['A', 'B']).size().reset_index().sort_values(0, ascending=False)
df_count.columns = ['A', 'B', 'Count']
df_all = pd.concat([df_count.assign(letter=lambda x: x['A']),
df_count.assign(letter=lambda x: x['B'])]).sort_values(['letter', 'Count'], ascending =[True, False])
df_first = df_all.groupby(['letter']).first().reset_index()
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
------------older answer --------
Since order matters you can use a frozen set as the key to a groupby
import pandas as pd
df = pd.read_csv('text.csv', header=None, names=['A','B'], sep=';')
s = df.apply(frozenset, 1)
df_count = s.value_counts().reset_index()
df_count.columns = ['Combos', 'Count']
Which will give you this
Combos Count
0 (a, b) 3
1 (b, f) 2
2 (d, c) 2
3 (g, f) 1
4 (b, h) 1
5 (c, g) 1
6 (d, f) 1
7 (e, a) 1
To get the highest combo for each letter we will concatenate this dataframe on top of itself and make another column that will hold either the first or second letter.
df_a = df_count.copy()
df_b = df_count.copy()
df_a['letter'] = df_a['Combos'].apply(lambda x: list(x)[0])
df_b['letter'] = df_b['Combos'].apply(lambda x: list(x)[1])
df_all = pd.concat([df_a, df_b]).sort_values(['letter', 'Count'], ascending =[True, False])
And since this is sorted by letter and count (descending) just get the first row of each group.
df_first = df_all.groupby('letter').first()
And to get the top 25%, just use
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
And then use .to_csv to output to file.

Finding the position of a subsequence in a sequence

If T1 is this:
T1 = pd.DataFrame(data = {'val':['B','D','E','A','D','B','A','E','A','D','B']})
and P is this:
P = pd.DataFrame(data = {'val': ['E','A','D','B']})
how do I get the positions of P within T1 ?
In terms of min and max I would like to see this returned
min max
3 6
8 11
If these dataframes were represented as SQL tables I could use this SQL method translated to pandas:
DECLARE #Items INT = (SELECT COUNT(*) FROM #P);
SELECT MIN(t.KeyCol) AS MinKey,
MAX(t.KeyCol) AS MaxKey
FROM dbo.T1 AS t
INNER JOIN #P AS p ON p.Val = t.Val
GROUP BY t.KeyCol - p.KeyCol
HAVING COUNT(*) = #Items;
This SQL solution is from Pesomannen's reply to http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-2

well, you can always do a workaround like this:
t1 = ''.join(T1.val)
p = ''.join(P.val)
start, res = 0, []
while True:
try:
res.append(t1.index(p, start))
start = res[-1] + 1
except:
break
to get the starting indices and then figure out the ending indices by mathing and access the dataframe by using iloc. you should use 0-based indexing (not 1-based, like you do in the example)

Granted, this doesn't utilize P, but may serve your purposes.
groups = T1.groupby(T1.val).groups
pd.DataFrame({'min': [min(x) for x in groups.values()],
'max': [max(x) for x in groups.values()]}, index=groups.keys())
yields
max min
E 7 2
B 10 0
D 9 1
A 8 3
[4 rows x 2 columns]

I think I've worked it out by following the same approach as the SQL solution - a type of relational division (ie match up on the values, group by the differences in the key columns and select the group that has the count equal to the size of the subsequence):
import pandas as pd
T1 = pd.DataFrame(data = {'val':['B','D','E','A','D','B','A','E','A','D','B']})
# use the index to create a new column that's going to be the key (zero based)
T1 = T1.reset_index()
# do the same for the subsequence that we want to find within T1
P = pd.DataFrame(data = {'val': ['E','A','D','B']})
P = P.reset_index()
# join on the val column
J = T1.merge(P,on=['val'],how='inner')
# group by difference in key columns calculating the min, max and count of the T1 key
FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg({min,max,'count'})
# Final result is where the count is the size of the subsequence - in this case 4
FullResult[FullResult['count'] == 4]
Really enjoying using pandas !

List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a large matrix or Efficient way to get highly correlated pairs from large data set in Python or R), but I am wondering how to do it with pandas? In my case the matrix is 4460x4460, so can't do it visually.

You can use DataFrame.values to get an numpy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.
But if you want to do this in pandas, you can unstack and sort the DataFrame:
import pandas as pd
import numpy as np
shape = (50, 4460)
data = np.random.normal(size=shape)
data[:, 1000] += data[:, 2000]
df = pd.DataFrame(data)
c = df.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
print so[-4470:-4460]
Here is the output:
2192 1522 0.636198
1522 2192 0.636198
3677 2027 0.641817
2027 3677 0.641817
242 130 0.646760
130 242 0.646760
1171 2733 0.670048
2733 1171 0.670048
1000 2000 0.742340
2000 1000 0.742340
dtype: float64

#HYRY's answer is perfect. Just building on that answer by adding a bit more logic to avoid duplicate and self correlations and proper sorting:
import pandas as pd
d = {'x1': [1, 4, 4, 5, 6],
'x2': [0, 0, 8, 2, 4],
'x3': [2, 8, 8, 10, 12],
'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()
print("Correlation Matrix")
print(df.corr())
print()
def get_redundant_pairs(df):
'''Get diagonal and lower triangular pairs of correlation matrix'''
pairs_to_drop = set()
cols = df.columns
for i in range(0, df.shape[1]):
for j in range(0, i+1):
pairs_to_drop.add((cols[i], cols[j]))
return pairs_to_drop
def get_top_abs_correlations(df, n=5):
au_corr = df.corr().abs().unstack()
labels_to_drop = get_redundant_pairs(df)
au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
return au_corr[0:n]
print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))
That gives the following output:
Data Frame
x1 x2 x3 x4
0 1 0 2 -1
1 4 0 8 -4
2 4 8 8 -4
3 5 2 10 -4
4 6 4 12 -5
Correlation Matrix
x1 x2 x3 x4
x1 1.000000 0.399298 1.000000 -0.969248
x2 0.399298 1.000000 0.399298 -0.472866
x3 1.000000 0.399298 1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248 1.000000
Top Absolute Correlations
x1 x3 1.000000
x3 x4 0.969248
x1 x4 0.969248
dtype: float64

Few lines solution without redundant pairs of variables:
corr_matrix = df.corr().abs()
#the matrix is symmetric so we need to extract upper triangle matrix without diagonal (k = 1)
sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
.stack()
.sort_values(ascending=False))
#first element of sol series is the pair with the biggest correlation
Then you can iterate through names of variables pairs (which are pandas.Series multi-indexes) and theirs values like this:
for index, value in sol.items():
# do some staff

Combining some features of #HYRY and #arun's answers, you can print the top correlations for dataframe df in a single line using:
df.corr().unstack().sort_values().drop_duplicates()
Note: the one downside is if you have 1.0 correlations that are not one variable to itself, the drop_duplicates() addition would remove them

I liked Addison Klinke's post the most, as being the simplest, but used Wojciech Moszczyńsk’s suggestion for filtering and charting, but extended the filter to avoid absolute values, so given a large correlation matrix, filter it, chart it, and then flatten it:
Created, Filtered and Charted
dfCorr = df.corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr !=1.000)]
plt.figure(figsize=(30,10))
sn.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()
Function
In the end, I created a small function to create the correlation matrix, filter it, and then flatten it. As an idea, it could easily be extended, e.g., asymmetric upper and lower bounds, etc.
def corrFilter(x: pd.DataFrame, bound: float):
xCorr = x.corr()
xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
return xFlattened
corrFilter(df, .7)
Follow-Up
Eventually, I refined the functions
# Returns correlation matrix
def corrFilter(x: pd.DataFrame, bound: float):
xCorr = x.corr()
xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
return xFiltered
# flattens correlation matrix with bounds
def corrFilterFlattened(x: pd.DataFrame, bound: float):
xFiltered = corrFilter(x, bound)
xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
return xFlattened
# Returns correlation for a variable from flattened correlation matrix
def filterForLabels(df: pd.DataFrame, label):
try:
sideLeft = df[label,]
except:
sideLeft = pd.DataFrame()
try:
sideRight = df[:,label]
except:
sideRight = pd.DataFrame()
if sideLeft.empty and sideRight.empty:
return pd.DataFrame()
elif sideLeft.empty:
concat = sideRight.to_frame()
concat.rename(columns={0:'Corr'},inplace=True)
return concat
elif sideRight.empty:
concat = sideLeft.to_frame()
concat.rename(columns={0:'Corr'},inplace=True)
return concat
else:
concat = pd.concat([sideLeft,sideRight], axis=1)
concat["Corr"] = concat[0].fillna(0) + concat[1].fillna(0)
concat.drop(columns=[0,1], inplace=True)
return concat

You can do graphically according to this simple code by substituting your data.
corr = df.corr()
kot = corr[corr>=.9]
plt.figure(figsize=(12,8))
sns.heatmap(kot, cmap="Greens")

Use the code below to view the correlations in the descending order.
# See the correlations in descending order
corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)

Combining most the answers above into a short snippet:
def top_entries(df):
mat = df.corr().abs()
# Remove duplicate and identity entries
mat.loc[:,:] = np.tril(mat.values, k=-1)
mat = mat[mat>0]
# Unstack, sort ascending, and reset the index, so features are in columns
# instead of indexes (allowing e.g. a pretty print in Jupyter).
# Also rename these it for good measure.
return (mat.unstack()
.sort_values(ascending=False)
.reset_index()
.rename(columns={
"level_0": "feature_a",
"level_1": "feature_b",
0: "correlation"
}))

Lot's of good answers here. The easiest way I found was a combination of some of the answers above.
corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(np.bool))
corr = corr.unstack().transpose()\
.sort_values(by='column', ascending=False)\
.dropna()

The following function should do the trick. This implementation
Removes self correlations
Removes duplicates
Enables the selection of top N highest correlated features
and it is also configurable so that you can keep both the self correlations as well as the duplicates. You can also to report as many feature pairs as you wish.
def get_feature_correlation(df, top_n=None, corr_method='spearman',
remove_duplicates=True, remove_self_correlations=True):
"""
Compute the feature correlation and sort feature pairs based on their correlation
:param df: The dataframe with the predictor variables
:type df: pandas.core.frame.DataFrame
:param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
:param corr_method: Correlation compuation method
:type corr_method: str
:param remove_duplicates: Indicates whether duplicate features must be removed
:type remove_duplicates: bool
:param remove_self_correlations: Indicates whether self correlations will be removed
:type remove_self_correlations: bool
:return: pandas.core.frame.DataFrame
"""
corr_matrix_abs = df.corr(method=corr_method).abs()
corr_matrix_abs_us = corr_matrix_abs.unstack()
sorted_correlated_features = corr_matrix_abs_us \
.sort_values(kind="quicksort", ascending=False) \
.reset_index()
# Remove comparisons of the same feature
if remove_self_correlations:
sorted_correlated_features = sorted_correlated_features[
(sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
]
# Remove duplicates
if remove_duplicates:
sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]
# Create meaningful names for the columns
sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']
if top_n:
return sorted_correlated_features[:top_n]
return sorted_correlated_features

Use itertools.combinations to get all unique correlations from pandas own correlation matrix .corr(), generate list of lists and feed it back into a DataFrame in order to use '.sort_values'. Set ascending = True to display lowest correlations on top
corrank takes a DataFrame as argument because it requires .corr().
def corrank(X: pandas.DataFrame):
import itertools
df = pd.DataFrame([[(i,j),X.corr().loc[i,j]] for i,j in list(itertools.combinations(X.corr(), 2))],columns=['pairs','corr'])
print(df.sort_values(by='corr',ascending=False))
corrank(X) # prints a descending list of correlation pair (Max on top)

I didn't want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.
So I ended up with the following simplified solution:
# map features to their absolute correlation values
corr = features.corr().abs()
# set equality (self correlation) as zero
corr[corr == 1] = 0
# of each feature, find the max correlation
# and sort the resulting array in ascending order
corr_cols = corr.max().sort_values(ascending=False)
# display the highly correlated features
display(corr_cols[corr_cols > 0.8])
In this case, if you want to drop correlated features, you may map through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones.

I was trying some of the solutions here but then I actually came up with my own one. I hope this might be useful for the next one so I share it here:
def sort_correlation_matrix(correlation_matrix):
cor = correlation_matrix.abs()
top_col = cor[cor.columns[0]][1:]
top_col = top_col.sort_values(ascending=False)
ordered_columns = [cor.columns[0]] + top_col.index.tolist()
return correlation_matrix[ordered_columns].reindex(ordered_columns)

This is a improve code from #MiFi. This one order in abs but not excluding the negative values.
def top_correlation (df,n):
corr_matrix = df.corr()
correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
.stack()
.sort_values(ascending=False))
correlation = pd.DataFrame(correlation).reset_index()
correlation.columns=["Variable_1","Variable_2","Correlacion"]
correlation = correlation.reindex(correlation.Correlacion.abs().sort_values(ascending=False).index).reset_index().drop(["index"],axis=1)
return correlation.head(n)
top_correlation(ANYDATA,10)

simple is better
from collections import defaultdict
res = defaultdict(dict)
corr = returns.corr().replace(1, -1)
names = list(corr)
for name in names:
idx = corr[name].argmax()
max_pairwise_name = names[idx]
res[name][max_pairwise_name] = corr.loc[max_pairwisename, name]
Now res contains the maximum pairwise correlation for each pair

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Advanced groupby column creation in pandas Dataframe - python

Related

Comparing values in different pairs of columns in Pandas

Performance enhancement of ranking function by replacement of lambda x with vectorization

Counting line frequencies and producing output files

Finding the position of a subsequence in a sequence

List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

Categories

Resources