A GroupBy with combinations of the categorical variables - python

Let's say I have data:
pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])
which gives:
column
index
a 1
b 2
c 3
a 4
b 1
c 2
Then to get the mean of each subgroup one would:
df.groupby(df.index).mean()
column
index
a 2.5
b 1.5
c 2.5
However, what I've been trying to achieve, without constantly looping and slicing the data, is the mean for each pair of subgroups.
For instance, the mean of a & b is 2, as if their values were combined into a single group.
The output would be something akin to:
column
index
a & a 2.5
a & b 2.0
a & c 2.5
b & b 1.5
b & c 2.0
c & c 2.5
Preferably this would involve manipulating the parameters in groupby, but as it is, I'm having to resort to looping and slicing, and then building all combinations of subgroups at some point.

I've revisited this 3 years later with a general solution to this problem.
It's being used in this open source library, which is why I'm now able to post it here. It works with any number of indexes and creates combinations of them using numpy matrix broadcasting.
First of all, that DataFrame isn't really usable as-is: the index values aren't unique. Let's add another index to the object and turn it into a Series:
df = pd.DataFrame({
'unique': [1, 2, 3, 4, 5, 6],
'index': ['a','b','c','a','b','c'],
'column': [1,2,3,4,1,2]
}).set_index(['unique','index'])
s = df['column']
Let's unstack that index:
>>> idxs = ['index']  # set as a variable to be used later on
>>> unstacked = s.unstack(idxs)
index     a    b    c
unique
1       1.0  NaN  NaN
2       NaN  2.0  NaN
3       NaN  NaN  3.0
4       4.0  NaN  NaN
5       NaN  1.0  NaN
6       NaN  NaN  2.0
>>> vals = unstacked.values
array([[ 1., nan, nan],
[ nan, 2., nan],
[ nan, nan, 3.],
[ 4., nan, nan],
[ nan, 1., nan],
[ nan, nan, 2.]])
>>> sums = np.nansum(vals, axis=0)              # per-group sums
>>> counts = (~np.isnan(vals)).sum(axis=0)      # per-group non-NaN counts
>>> mean = (sums + sums[:, np.newaxis]) / (counts + counts[:, np.newaxis])  # pairwise combined means
array([[ 2.5, 2. , 2.5],
[ 2. , 1.5, 2. ],
[ 2.5, 2. , 2.5]])
Now recreate the output dataframe:
>>> new_df = pd.DataFrame(mean, unstacked.columns, unstacked.columns.copy())
index_ a b c
index
a 2.5 2.0 2.5
b 2.0 1.5 2.0
c 2.5 2.0 2.5
>>> idxs_ = [ x+'_' for x in idxs ]
>>> new_df.columns.names = idxs_
>>> new_df.stack(idxs_, dropna=False)
index index_
a a 2.5
b 2.0
c 2.5
b a 2.0
b 1.5
c 2.0
c a 2.5
b 2.0
c 2.5
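For reuse, the steps above can be wrapped into a single helper. This is a minimal sketch that just packages the code shown above; the function name is mine, and it assumes the Series carries the group labels as one (or more) of its index levels:
import numpy as np
import pandas as pd

def pairwise_group_means(s, idxs):
    # sketch: combined mean for every pair of groups in the given index level(s)
    unstacked = s.unstack(idxs)
    vals = unstacked.values
    sums = np.nansum(vals, axis=0)              # per-group sums
    counts = (~np.isnan(vals)).sum(axis=0)      # per-group non-NaN counts
    mean = (sums + sums[:, np.newaxis]) / (counts + counts[:, np.newaxis])
    out = pd.DataFrame(mean, unstacked.columns, unstacked.columns.copy())
    out.columns.names = [x + '_' for x in idxs]
    return out.stack(out.columns.names, dropna=False)

pairwise_group_means(s, ['index'])  # same result as the stacked output above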

My current implementation is:
import pandas as pd
import itertools
import numpy as np
# get the mean for every pair of categories
def all_pairs(df, ix):
    pairs = {
        ix: [],
        'p': []
    }
    for subset in itertools.combinations(np.unique(df.index), 2):
        pairs[ix].append(subset)
        # mean over the combined rows of this pair of labels
        pairs['p'].append(df.loc[list(subset), :].values.mean())
    return pd.DataFrame(pairs).set_index(ix)
This gets the combinations, collects them in a dict, and then builds that back up into a DataFrame. It's hacky though :(
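For reference, a hypothetical run of all_pairs on the sample frame from the question (the values follow from that data):
df = pd.DataFrame({'index': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'column': [1, 2, 3, 4, 1, 2]}).set_index('index')
print(all_pairs(df, 'index'))
#           p
# index
# (a, b)  2.0
# (a, c)  2.5
# (b, c)  2.0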

Here's an implementation that uses a MultiIndex and an outer join to handle the cross join.
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
df = pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])
groupedDF = df.groupby(df.index).mean()
# Create a new MultiIndex using from_product, which gives a pairing of the elements in each iterable
p = pd.MultiIndex.from_product([groupedDF.index, groupedDF.index])
# Add column for cross join
groupedDF[0] = 0
# Outer Join
groupedDF = pd.merge(groupedDF, groupedDF, how='outer', on=0).set_index(p)
# get mean for every row (which is the average for each pair)
# unstack to get matrix for deduplication
crossJoinMeans = groupedDF[['column_x', 'column_y']].mean(axis=1).unstack()
# Create Identity matrix because each pair of itself will be needed
b = np.identity(3, dtype='bool')
# set the first column to True because it contains the rest of the unique means (the identity portion covers the first part)
b[:,0] = True
# invert for proper use of DataFrame Mask
b = np.invert(b)
finalDF = crossJoinMeans.mask(b).stack()
I'd guess that this could be cleaned up and made more concise.
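A more concise variant of the same idea (a sketch, not the code above) is to aggregate sums and counts once and combine them over itertools.combinations_with_replacement, which keeps the a & a style pairs without needing a mask:
import itertools
import pandas as pd

g = df.groupby(df.index)['column'].agg(['sum', 'count'])
pairs = list(itertools.combinations_with_replacement(g.index, 2))
pairMeans = pd.Series(
    [(g.loc[i, 'sum'] + g.loc[j, 'sum']) / (g.loc[i, 'count'] + g.loc[j, 'count'])
     for i, j in pairs],
    index=pd.MultiIndex.from_tuples(pairs),
    name='column',
)
# a a    2.5, a b    2.0, a c    2.5, b b    1.5, b c    2.0, c c    2.5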

Related

How to get next rows after filtered rows in Pandas

I have a DataFrame called data. I wrote this filter to filter the rows:
data[data["Grow"] >= 1.5]
It returned some rows like these:
PriceYesterday Open High Low
------------------------------------------------------------------
7 6888.0 6881.66 7232.0 6882.0
53 7505.0 7555.72 7735.0 7452.0
55 7932.0 8093.08 8120.0 7974.0
64 7794.0 7787.29 8001.0 7719.0
...
As you see, there are rows at indexes 7, 53, 55, .... Now I want to get the rows at indexes 8, 54, 56, ... too. Is there any straightforward way to do this? Thanks
You can use Index.intersection to avoid an error when the last row matches the filter and the shifted index values don't exist:
data = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'Grow':[0,8,2,0.4,2,3.3],
})
df1 = data[data["Grow"] >= 1.5]
print (df1)
A B Grow
1 b 5 8.0
2 c 4 2.0
4 e 5 2.0
5 f 4 3.3
df2 = data.loc[data.index.intersection(df1.index + 1)]
print (df2)
A B Grow
2 c 4 2.0
3 d 5 0.4
5 f 4 3.3
Another idea is to select by the shifted values with Series.shift:
df1 = data[data["Grow"] >= 1.5]
df2 = data[data["Grow"].shift() >= 1.5]
print (df2)
A B Grow
2 c 4 2.0
3 d 5 0.4
5 f 4 3.3
df1 = data[data["Grow"] >= 1.5]
df2 = data.loc[df1.index + 1]
print (df2)
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([6], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
You should create a mask, and then shift that mask by one:
import numpy as np
df = pd.DataFrame({'a': np.random.random(20)})
print(df)
mask = df['a']>0.8
print("items that fit the mask:")
print(df.loc[mask])
print("items following these:")
print(df.loc[mask.shift().fillna(False)])
In your specific case I believe it would be
data.loc[(data["Grow"] >= 1.5).shift().fillna(False)]
data[data.shift()["Grow"] >= 1.5]
The shift moves every cell one step toward the end of the frame. So this says: give me those entries whose predecessors match my criteria.
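If, as the question's wording suggests, you want the matching rows and the rows right after them together, the two masks can simply be combined. A small sketch on the same sample data:
import pandas as pd

data = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'Grow': [0, 8, 2, 0.4, 2, 3.3],
})
mask = data['Grow'] >= 1.5
# rows that match, plus the rows immediately following them
both = data[mask | mask.shift(fill_value=False)]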

How do I use np.nanmin when comparing one column of a pandas dataframe with an integer?

import pandas as pd
import numpy as np
a = np.array([[1, 2], [3, np.nan]])
np.nanmin(a, axis=0)
array([1., 2.])
I want to use the same logic but on pandas dataframe columns, comparing each value of a column with an integer.
use case:
MC_cond = df['MODEL'].isin(["MC"])
df_lgd_type = df['LGD_TYPE'].isin(["FIXED"])
df_without_lgd_type = ~(df_lgd_type)
x = np.nanmin((1, df.loc[MC_cond & df_without_lgd_type, 'A'] +
                  df.loc[MC_cond & df_without_lgd_type, 'B']))
I'm comparing the sum of column A and column B with 1.
This should do the trick even without np.nanmin. I hope I've understood everything correctly from your sparse description.
I'm assuming you also want to replace those NaN values that are left after summation. So we fill those with 1 and then clip all values to max at 1.
a = df.loc[MC_cond & df_without_lgd_type, 'A']
b = df.loc[MC_cond & df_without_lgd_type, 'B']
x = (a + b).fillna(1).clip(upper=1)
Example:
df = pd.DataFrame({
'A': [-1, np.nan, 2, 3, 4],
'B': [-4, 5, np.nan, 7, -8]
})
(df.A + df.B).fillna(1).clip(upper=1)
# Output:
# 0 -5.0
# 1 1.0
# 2 1.0
# 3 1.0
# 4 -4.0
# dtype: float64
In case you don't want NaN values in one column leading to row sum being NaN too, just fill them before:
x = (a.fillna(0) + b.fillna(0)).fillna(1).clip(upper=1)
Just for completeness, this would be a pure numpy solution resembling your approach:
a = df.loc[MC_cond & df_without_lgd_type, 'A'].to_numpy()
b = df.loc[MC_cond & df_without_lgd_type, 'B'].to_numpy()
# optionally fill NaNs with 0
# a = np.nan_to_num(a)
# b = np.nan_to_num(b)
s = a + b
x = np.nanmin(np.stack((s, np.ones_like(s))), axis=0)
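Run on the illustrative df from the example above (skipping the MC_cond / LGD filtering, just to compare the two routes), this reproduces the pandas result:
s = (df['A'] + df['B']).to_numpy()
x = np.nanmin(np.stack((s, np.ones_like(s))), axis=0)
# array([-5.,  1.,  1.,  1., -4.])  -- same as the fillna(1).clip(upper=1) output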

Faster method to multiply column lookup values with vectorization

I have two Dataframes, one contains values and is the working dataset (postsolutionDF), while the other is simply for reference as a lookup table (factorimportpcntDF). The goal is to add a column to postsolutionDF that contains the product of the lookup values from each row of postsolutionDF (new column name = num_predict). That product is then multiplied by 2700. For example, on first row, the working values are 0.5, 2, -6. The equivalent lookup values for these are 0.1182, 0.2098, and 0.8455. The product of those is 0.0209, which when multiplied by 2700 is 56.61 as shown in output.
The code below works for this simplified example, but it is very slow in the real solution (1.6MM rows x 15 numbered columns). I'm sure there is a better way to do this by removing the 'for k in range' loop, but I'm struggling with how, since I'm already using apply on the rows. I've found many tangential solutions but nothing that has worked for my situation yet. Thanks for any help.
import pandas as pd
import numpy as np
postsolutionDF = pd.DataFrame({'SCRN': ['2019-01-22-0000001', '2019-01-22-0000002', '2019-01-22-0000003'],
                               '1': 0.5, '2': 2, '3': [-6, 1.0, 8.0]})
postsolutionDF = postsolutionDF[['SCRN', '1', '2', '3']]
print('printing initial postsolutionDF..')
print(postsolutionDF)
factorimportpcntDF = pd.DataFrame({'F1_Val': [0.5, 1, 1.5, 2], 'F1_Pcnt': [0.1182, 0.2938, 0.4371, 0.5433],
                                   'F2_Val': [2, 3, np.nan, np.nan], 'F2_Pcnt': [0.2098, 0.7585, np.nan, np.nan],
                                   'F3_Val': [-6, 1, 8, np.nan], 'F3_Pcnt': [0.8455, 0.1753, 0.072, np.nan]})
print('printing factorimportpcntDF..')
print(factorimportpcntDF)
def zero_filter(row):  # row is series
    inner_value = 1
    for k in range(1, 4):  # number of columns in postsolutionDF with numeric headers, dynamic in actual code
        inner_value *= factorimportpcntDF.loc[factorimportpcntDF['F' + str(k) + '_Val'] == row[0 + k],
                                              'F' + str(k) + '_Pcnt'].values[0]
    inner_value *= 2700
    return inner_value
postsolutionDF['num_predict'] = postsolutionDF.apply(zero_filter, axis=1)
print('printing new postsolutionDF..')
print(postsolutionDF)
Print Output:
printing initial postsolutionDF..
SCRN 1 2 3
0 2019-01-22-0000001 0.5 2 -6.0
1 2019-01-22-0000002 0.5 2 1.0
2 2019-01-22-0000003 0.5 2 8.0
printing factorimportpcntDF..
F1_Pcnt F1_Val F2_Pcnt F2_Val F3_Pcnt F3_Val
0 0.1182 0.5 0.2098 2.0 0.8455 -6.0
1 0.2938 1.0 0.7585 3.0 0.1753 1.0
2 0.4371 1.5 NaN NaN 0.0720 8.0
3 0.5433 2.0 NaN NaN NaN NaN
printing new postsolutionDF..
SCRN 1 2 3 num_predict
0 2019-01-22-0000001 0.5 2 -6.0 56.610936
1 2019-01-22-0000002 0.5 2 1.0 11.737312
2 2019-01-22-0000003 0.5 2 8.0 4.820801
I'm not sure how to do this in native pandas, but if you go back to numpy, it is pretty easy.
The numpy.interp function is designed to interpolate between values in the lookup table, but if the input values exactly match the values in the lookup table (like yours do), it becomes just a simple lookup instead of an interpolation.
postsolutionDF['1new'] = np.interp(postsolutionDF['1'].values, factorimportpcntDF['F1_Val'], factorimportpcntDF['F1_Pcnt'])
postsolutionDF['2new'] = np.interp(postsolutionDF['2'].values, factorimportpcntDF['F2_Val'], factorimportpcntDF['F2_Pcnt'])
postsolutionDF['3new'] = np.interp(postsolutionDF['3'].values, factorimportpcntDF['F3_Val'], factorimportpcntDF['F3_Pcnt'])
postsolutionDF['num_predict'] = postsolutionDF['1new'] * postsolutionDF['2new'] * postsolutionDF['3new'] * 2700
postsolutionDF.drop(columns=['1new', '2new', '3new'], inplace=True)
Gives the output:
In [167]: postsolutionDF
Out[167]:
SCRN 1 2 3 num_predict
0 2019-01-22-0000001 0.5 2 -6.0 56.610936
1 2019-01-22-0000002 0.5 2 1.0 11.737312
2 2019-01-22-0000003 0.5 2 8.0 4.820801
I had to pad out the factorimportpcntDF so all the columns had 4 values, otherwise looking up the highest value for a column wouldn't work. You can just use the same value multiple times, or split it into 3 lookup tables if you prefer, then the columns could be different lengths.
factorimportpcntDF = pd.DataFrame({'F1_Val' : [0.5, 1, 1.5, 2], 'F1_Pcnt' : [0.1182, 0.2938, 0.4371, 0.5433],
'F2_Val' : [2, 3, 3, 3], 'F2_Pcnt' : [0.2098, 0.7585, 0.7585, 0.7585],
'F3_Val' : [-6, 1, 8, 8], 'F3_Pcnt' : [0.8455, 0.1753, 0.072, 0.072]})
Note that the documentation specifies that your F1_Val etc. columns need to be in increasing order (yours are here, just an FYI). Otherwise interp will run, but won't necessarily give good results.
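If relying on np.interp's exact-match behaviour feels fragile, a hedged alternative is a plain dict lookup per factor with Series.map, which is also vectorized and needs no padding. A sketch reusing the question's frames and column names (it simply recomputes num_predict):
import numpy as np

result = np.full(len(postsolutionDF), 2700.0)
for k in range(1, 4):
    # drop the NaN padding rows, then build a value -> percent mapping for this factor
    lut = factorimportpcntDF[['F' + str(k) + '_Val', 'F' + str(k) + '_Pcnt']].dropna()
    mapping = dict(zip(lut['F' + str(k) + '_Val'], lut['F' + str(k) + '_Pcnt']))
    result *= postsolutionDF[str(k)].map(mapping).to_numpy()
postsolutionDF['num_predict'] = result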

In a Python DataFrame, how to find out the number of rows that have valid values in all columns

I want to find the number of rows that have certain values such as None or "" or NaN (basically empty values) in all columns of a DataFrame object. How can I do this?
Use pandas dataframe.isin to create a boolean array. Sum by row, then find
the number of rows with a result > 0.
Place one or more values in the search_values list to look for within the rows of the dataframe.
search_values = ['', np.nan, None]
(df.isin(search_values).sum(axis=1) > 0).sum()
If you would like the row count per column:
df.isin(search_values).sum(axis=0)
Use df.isnull().sum() to get the number of None and NaN values per column.
Use df.eq(value).sum() for any other kind of value, including the empty string "".
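For example, on a small frame with both kinds of empty values (the counts follow from this toy data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, '', 3], 'b': [np.nan, 'x', 'y']})
empty = df.isnull() | df.eq('')     # None/NaN or empty string
print(empty.any(axis=1).sum())      # 2 rows contain at least one empty value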
In a pandas.Series (think of it as the column of a normal pandas.DataFrame):
>> s = pd.Series([np.nan, np.nan, 1, 2, np.nan])
>> s
0 NaN
1 NaN
2 1.0
3 2.0
4 NaN
>> s.isnull().sum()
3
For a pandas.DataFrame it is quite similar:
>> pd.DataFrame(np.array([[np.nan, np.nan],
...: [ 0., np.nan],
...: [ 1., 1.],
...: [ 2., 2.],
...: [np.nan, np.nan]]))
>> df
0 1
0 NaN NaN
1 0.0 NaN
2 1.0 1.0
3 2.0 2.0
4 NaN NaN
>> df.isnull().sum(axis=0)
0 2
1 3
dtype: int64
To sum by row, just put .sum(axis=1).
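For example, with the same df as above (the row-wise counts follow from the values shown):
>> df.isnull().sum(axis=1)
0    2
1    1
2    0
3    0
4    2
dtype: int64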

Pandas - Find longest stretch without Nan values

I have a pandas dataframe "df", a sample of which is below:
   time    x
0     1    1
1     2  NaN
2     3    3
3     4  NaN
4     5    8
5     6    7
6     7    5
7     8  NaN
The real frame is much bigger. I am trying to find the longest stretch of non-NaN values in the "x" series, and print out the starting and ending index for this stretch. Is this possible?
Thank You
Here's a vectorized approach with NumPy tools -
a = df.x.values # Extract out relevant column from dataframe as array
m = np.concatenate(( [True], np.isnan(a), [True] )) # Mask
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2) # Start-stop limits
start,stop = ss[(ss[:,1] - ss[:,0]).argmax()] # Get max interval, interval limits
Sample run -
In [474]: a
Out[474]:
array([ 1., nan, 3., nan, nan, nan, nan, 8., 7., 5., 2.,
5., nan, nan])
In [475]: start, stop
Out[475]: (7, 12)
The intervals are set up such that the difference between each start and stop gives the length of each interval. So if by ending index you meant the last index of the non-NaN stretch, subtract one from stop.
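Applied to the question's sample frame (treating the Nan entries as real np.nan), this approach would give start, stop = (4, 7), i.e. the longest stretch spans rows 4 through 6:
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': range(1, 9),
                   'x': [1, np.nan, 3, np.nan, 8, 7, 5, np.nan]})
a = df.x.values
m = np.concatenate(([True], np.isnan(a), [True]))
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1, 2)
start, stop = ss[(ss[:, 1] - ss[:, 0]).argmax()]
print(start, stop)   # 4 7 -> last non-NaN index is stop - 1 = 6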
pandas
f = dict(
Start=pd.Series.first_valid_index,
Stop=pd.Series.last_valid_index,
Stretch='count'
)
agged = df.x.groupby(df.x.isnull().cumsum()).agg(f)
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values
array([ 4., 6.])
numpy
def pir(x):
    # pad with np.nan
    x = np.append(np.nan, np.append(x, np.nan))
    # find where null
    w = np.where(np.isnan(x))[0]
    # diff to find length of each stretch
    # argmax to find the largest stretch
    a = np.diff(w).argmax()
    # map the bounding nulls back to the original start/end positions of the stretch
    return w[[a, a + 1]] + np.array([0, -2])
demo
pir(df.x.values)
array([4, 6])
a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan])
pir(a)
array([ 7, 11])
So you can get the index values of the NaN's in the following way:
import numpy as np
index = df['x'].index[df['x'].apply(np.isnan)]
df_index = df.index.values.tolist()
[df_index.index(indexValue) for indexValue in index]
>>> [1, 3, 7]
Then one solution would be to look at the largest difference between subsequent NaN index values; that gap gives you the longest stretch of non-NaN values.
Maybe a faster way would be the following (given that you say you have a long dataframe, speed matters):
In [19]: df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.NAN,3,np.NAN,8,7,5,np.NAN]})
In [20]: index = df['x'].isnull()
In [21]: df[index].index.values
Out[21]: array([1, 3, 7])
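To finish that idea, a sketch of the "largest gap between consecutive NaN positions" approach (boundary sentinels added by hand; the values follow from the sample frame):
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 4, 5, 6, 7, 8],
                   'x': [1, np.nan, 3, np.nan, 8, 7, 5, np.nan]})
nan_pos = np.concatenate(([-1], np.flatnonzero(df['x'].isnull()), [len(df)]))
gap = np.diff(nan_pos).argmax()
start, end = nan_pos[gap] + 1, nan_pos[gap + 1] - 1
print(start, end)   # 4 6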
Another method is to use scipy.ndimage.measurements.label. It will segment your non-null rows into valid groups and label them differently. You can then group your dataframe using the labels and take the biggest group.
Set-up
import pandas as pd
import numpy as np
from scipy.ndimage.measurements import label
df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.NAN,3,np.NAN,8,7,5,np.NAN]})
Retrieving longest stretch without nan
valid_rows = ~df.isnull().any(axis=1)
labels, num_features = label(valid_rows)
label_of_biggest_group = valid_rows.groupby(labels).count().drop(0).idxmax()
print(df.loc[labels == label_of_biggest_group])
Result
time x
4 5 8.0
5 6 7.0
6 7 5.0
Note
Label 0 contains the background data, in our case the NaN rows, and it has to be dropped in case the number of NaNs is greater than or equal to the size of your biggest group. num_features is the number of homogeneous stretches without NaN.
