Pandas: row comparisons by group based on a condition across 2 columns - python

I want to do row comparisons by group, based on a condition across two columns. The condition is (col1(i) - col1(j)) * (col2(i) - col2(j)) <= 0, comparing every row i with every row j within columns col1 and col2. If the condition is satisfied for all row comparisons in a group, set True for that group, else False.
import pandas as pd

data = {'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'col1': [1, 2, 3, 2, 3, 1],
        'col2': [4, 3, 2, 2, 3, 1]}
df = pd.DataFrame(data)
df
with the desired output:
A True
B False

You can use shift to compare each row with the next one, along with groupby + all to check whether every comparison in the group holds. Comparisons that cross a group boundary (the last row of each group) are made vacuously True by OR-ing with a group-change check:
cond = (((df['col1'] - df['col1'].shift(-1)) * (df['col2'] - df['col2'].shift(-1)) <= 0)
        | (df['group'] != df['group'].shift(-1)))
cond.groupby(df['group']).all()
group
A True
B False
dtype: bool
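Note that the shift-based check only compares consecutive rows. If the pairwise condition must hold for every pair (i, j) regardless of row order, a broadcasted all-pairs check per group is safer. A minimal sketch (the helper name all_pairs_ok is illustrative):
import numpy as np

def all_pairs_ok(g):
    c1 = g['col1'].to_numpy()
    c2 = g['col2'].to_numpy()
    # outer differences give the product for every pair (i, j)
    prod = (c1[:, None] - c1[None, :]) * (c2[:, None] - c2[None, :])
    return bool((prod <= 0).all())

df.groupby('group').apply(all_pairs_ok)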

Related

Populate Pandas dataframe with random sample from another dataframe if condition is met, when columns to be assigned are not independent

I have two DataFrames, df1 and df2. The information in df1 has to be used to populate cells in df2 if a specific condition is met. This is an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],
                    "B": [1, 2, 3, 1, 2, 2, 3, 1, 2, 3, 4],
                    "C": [5, 3, 2, 10, 11, 12, 4, 5, 7, 2, 7],
                    "D": [0.5, 0.3, 0.5, 0.7, 0.5, 0.6, 0.1, 0.6, 0.6, 0.5, 0.6]})
df2 = pd.DataFrame({"A": [5, 5, 6, 6, 6], "B": [1, 2, 1, 2, 3], "C": np.nan, "D": np.nan})
The np.nan entries in df2 are meant to represent the cells that need to be populated. These are empty at the start of the process.
To populate df2, I need to use the values in the column df2['B']. Specifically, in this example, if the value of df2['B'] is equal to 1, then I need to get a random sample, with replacement, from df1[df1['B']==1], for both df1['C'] and df1['D']. Importantly, these values are not independent. Therefore, I need to draw a random row from the subset of rows of df1 where df1['B'] is equal to one. And then I need to do this for all rows in df2.
Doing df1[df1['B'] == 1][['C', 'D']].sample(replace=True) draws a random sample for one case, where the value of df1['B'] is one, but:
How do I assign the corresponding values to df2?
How do I do this for every row in df2?
I have tried several alternatives with loops, such as
for index, value in df2.iterrows():
    if df2.loc[index, 'B'] == 1:
        temp_df = df1[df1['B'] == 1][['C', 'D']].sample(n=1, replace=True)
    if df2.loc[index, 'B'] == 2:
        temp_df = df1[df1['B'] == 2][['C', 'D']].sample(n=1, replace=True)
    if df2.loc[index, 'B'] == 3:
        temp_df = df1[df1['B'] == 3][['C', 'D']].sample(n=1, replace=True)
    if df2.loc[index, 'B'] == 4:
        temp_df = df1[df1['B'] == 4][['C', 'D']].sample(n=1, replace=True)
    df2.loc[index, 'C'] = temp_df['C']
    df2.loc[index, 'D'] = temp_df['D']
but I get an error message saying
---> 15 df2.loc[index, 'C'] = temp_df['C']
16 df2.loc[index, 'D'] = temp_df['D']
...
ValueError: Incompatible indexer with Series
where the ... denotes lines from the error message that I skipped.
Here's one approach:
(i) get the sample sizes from df2 with groupby + size.
(ii) use groupby + apply with a lambda that samples items from df1 according to the sample sizes from (i) for each unique "B".
(iii) assign the sampled values back to df2; since "B" is not unique, sort df2 by "B" first so the rows line up, and assign the values as a NumPy array so the assignment is positional rather than index-aligned.
cols = ['C', 'D']
sample_sizes = df2.groupby('B')[cols].size()
df2 = df2.sort_values(by='B')
df2[cols] = (df1[df1['B'].isin(sample_sizes.index)]
             .groupby('B')[cols]
             .apply(lambda g: g.sample(sample_sizes[g.name], replace=True))
             .droplevel(1).reset_index(drop=True)
             .to_numpy())
df2 = df2.sort_index()
One sample (the draws are random, so yours will differ):
 A B C D
0 5 1 5 0.6
1 5 2 12 0.6
2 6 1 10 0.7
3 6 2 11 0.5
4 6 3 4 0.1
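As an aside, the loop in the question fails only at the assignment step: temp_df['C'] is a one-row Series, and assigning a Series to a single cell raises the "Incompatible indexer with Series" error. A minimal sketch of a fixed loop (slower than the groupby approach, and starting from the original df2 with NaNs):
for index in df2.index:
    b = df2.loc[index, 'B']
    # draw one row so that C and D stay paired
    temp_df = df1[df1['B'] == b][['C', 'D']].sample(n=1, replace=True)
    # extract scalars instead of assigning a Series to a cell
    df2.loc[index, ['C', 'D']] = temp_df.iloc[0].to_numpy()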

pandas: comparing non-identical list of panda dataframes based on values from a certain column

I have two lists of pandas DataFrames, as follows:
import pandas as pd
import numpy as np
list_one = [pd.DataFrame({'sent_a.1': [0, 3, 2, 1], 'sent_a.2': [0, 1, 4, 0],
                          'sent_b.3': [0, 6, 0, 8], 'sent_b.4': [1, 1, 8, 6],
                          'ID': ['id_1', 'id_1', 'id_1', 'id_1']}),
            pd.DataFrame({'sent_a.1': [0, 3], 'sent_a.2': [0, 2],
                          'sent_b.3': [0, 6], 'sent_b.4': [1, 1],
                          'ID': ['id_2', 'id_2']})]
list_two = [pd.DataFrame({'sent_a.1': [0, 5], 'sent_a.2': [0, 1],
                          'sent_b.3': [0, 6], 'sent_b.4': [1, 1],
                          'ID': ['id_2', 'id_2']}),
            pd.DataFrame({'sent_a.1': [0, 5, 3, 1], 'sent_a.2': [0, 2, 3, 1],
                          'sent_b.3': [0, 6, 6, 8], 'sent_b.4': [1, 5, 8, 5],
                          'ID': ['id_1', 'id_1', 'id_1', 'id_1']})]
I would like to compare the DataFrames in these two lists: if the values are the same, replace the value with 'True'; if the values differ, set it to 'False'; and save the result in a different list of pandas DataFrames. I have done the following,
for dfs in list_one:
    for dfs2 in list_two:
        g = np.where(dfs == dfs2, 'True', 'False')
        print(g)
but I get the error,
ValueError: Can only compare identically-labeled DataFrame objects
How can I sort the values in these two lists based on the values in column 'ID'?
Edit
I would like the DataFrames that have the same value for column 'ID' to be compared, meaning that DataFrames with 'ID' == 'id_1' are compared with one another, and DataFrames with 'ID' == 'id_2' are compared with each other (not a cross comparison).
so the desired output is:
output = [ sent_a.1 sent_a.2 sent_b.3 sent_b.4 ID
0 True True True True id_1
1 False False True False id_1
2 False False False True id_1
3 False False True True id_1,
sent_a.1 sent_a.2 sent_b.3 sent_b.4 ID
0 True True True True id_2
1 True True False False id_2]
Based on your current example
For your first question:
how can I sort values in these two lists, based on the values from column 'ID'?
list_one = sorted(list_one, key=lambda x: x['ID'].unique()[0][3:], reverse=False)
list_two = sorted(list_two, key=lambda x: x['ID'].unique()[0][3:], reverse=False)
The ValueError: Can only compare identically-labeled DataFrame objects error is raised when the DataFrames have indexes in a different order or are of different shapes.
First way of comparison:
for dfs in list_one:
    for dfs2 in list_two:
        if dfs.shape == dfs2.shape:
            g = np.where(dfs == dfs2, 'True', 'False')
            print(g)
Second way, per the requirement that DataFrames with the same value in column 'ID' be compared:
for dfs in list_one:
    for dfs2 in list_two:
        if (dfs['ID'].unique() == dfs2['ID'].unique()) and (dfs.shape == dfs2.shape):
            g = np.where(dfs == dfs2, 'True', 'False')
            print(g)
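To get the desired list of DataFrames directly (rather than printed arrays), one sketch pairs the frames by their ID and compares with DataFrame.eq, assuming each DataFrame carries a single ID value (lookup and output are illustrative names):
lookup = {d['ID'].iat[0]: d for d in list_two}
output = []
for d in list_one:
    other = lookup[d['ID'].iat[0]]
    cmp_cols = d.columns.drop('ID')
    # element-wise comparison; column labels and row order are preserved
    res = d[cmp_cols].eq(other[cmp_cols])
    res['ID'] = d['ID'].to_numpy()
    output.append(res)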

Pandas groupby get row with max in multiple columns

Looking to get the row of a group that has the maximum value across multiple columns:
pd.DataFrame([{'grouper': 'a', 'col1': 1, 'col2': 3, 'uniq_id': 1},
              {'grouper': 'a', 'col1': 2, 'col2': 4, 'uniq_id': 2},
              {'grouper': 'a', 'col1': 3, 'col2': 2, 'uniq_id': 3}])
col1 col2 grouper uniq_id
0 1 3 a 1
1 2 4 a 2
2 3 2 a 3
In the above, I'm grouping by the "grouper" column. Within the "a" group, I want to get the row that has the maximum value across col1 and col2. In this case that is the row with uniq_id 2, since its col2 value of 4 is the highest value in either column, so the outcome would be:
col1 col2 grouper uniq_id
1 2 4 a 2
In my actual example, I'm using timestamps, so I don't actually expect ties. But in the case of a tie, I am indifferent to which row I select in the group, so taking the first row of the group would be fine.
One more way you can try:
# find row wise max value
df['row_max'] = df[['col1','col2']].max(axis=1)
# filter rows from groups
df.loc[df.groupby('grouper')['row_max'].idxmax()]
col1 col2 grouper uniq_id row_max
1 2 4 a 2 4
Later you can drop row_max using df.drop('row_max', axis=1)
IIUC, using transform then comparing with the original dataframe:
g = df.groupby('grouper')
s1 = g.col1.transform('max')
s2 = g.col2.transform('max')
s = pd.concat([s1, s2], axis=1).max(axis=1)
df.loc[df[['col1', 'col2']].eq(s, axis=0).any(axis=1)]
Out[89]:
col1 col2 grouper uniq_id
1 2 4 a 2
Interesting approaches all around. Adding another one just to show the power of apply (which I'm a big fan of) and using some of the other mentioned methods.
import pandas as pd
df = pd.DataFrame(
    [
        {"grouper": "a", "col1": 1, "col2": 3, "uniq_id": 1},
        {"grouper": "a", "col1": 2, "col2": 4, "uniq_id": 2},
        {"grouper": "a", "col1": 3, "col2": 2, "uniq_id": 3},
    ]
)

def find_max(grp):
    # find the max value per row, then the index of the row holding the overall max
    max_row_idx = grp[["col1", "col2"]].max(axis=1).idxmax()
    return grp.loc[max_row_idx]
df.groupby("grouper").apply(find_max)
value = pd.concat([df['col1'], df['col2']], axis=0).max()
df.loc[(df['col1'] == value) | (df['col2'] == value), :]
 col1 col2 grouper uniq_id
1 2 4 a 2
This probably isn't the fastest way, but it will work in your case: concatenate both columns and find the (global) maximum, then search the df for rows where either column equals that value.
You can use numpy and pandas as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [3, 4, 2],
                   'grouper': ['a', 'a', 'a'],
                   'uniq_id': [1, 2, 3]})
df['temp'] = np.max([df.col1.values, df.col2.values], axis=0)
idx = df.groupby('grouper')['temp'].idxmax()
df.loc[idx].drop('temp', axis=1)
col1 col2 grouper uniq_id
1 2 4 a 2
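For completeness, the helper column can be avoided by grouping the row-wise max Series directly; a sketch of the same idxmax idea (ties resolve to the first row of the group, matching the question's preference):
# row-wise max, grouped by the grouper column, then the first index of each group's max
row_max = df[['col1', 'col2']].max(axis=1)
df.loc[row_max.groupby(df['grouper']).idxmax()]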

Changing output of `pd.DataFrame.isin` to `True` if column is omitted

If I pass a dictionary to the isin method of a dataframe, columns that are omitted in the dictionary are considered False by default.
import pandas as pd

df = pd.DataFrame()
df['foo'] = pd.Series([0, 0, 1, 1, 5, 5], dtype='category')
df['bar'] = pd.Series([4, 4, 2, 2, 1, 1], dtype='category')
values = {'foo': [0, 1]}
print(df.isin(values))
Out[1]:
foo bar
0 True False
1 True False
2 True False
3 True False
4 False False
5 False False
This is annoying* because if I have a dataframe with many columns and I only want to impose conditions on a subset of them, I still have to list all the other columns together with all their possible values. Is there an elegant way to avoid that?
*I later want to select the rows in which the condition holds, row_mask = df.isin(values).all(axis=1), so I would like all columns on which no condition is imposed to be True.
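One workaround, sketched under the assumption that the end goal is the row mask above: compute isin only over the columns named in values, so omitted columns can never veto a row.
# restrict the mask to the constrained columns; all others are implicitly True
cols = list(values)
row_mask = df[cols].isin(values).all(axis=1)
df[row_mask]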

What's the most efficient way to get a variable length of rows w.r.t each group of a dataframe

To illustrate my question clearly, for a dummy dataframe like this:
df = pd.DataFrame({'X' : ['B', 'B', 'A', 'A', 'A'], 'Y' : [1, 2, 3, 4, 5]})
How can I get the top 1 row of group A and the top 2 rows of group B, and get rid of the remaining rows of each group? By the way, the real dataset is big, with hundreds of thousands of rows and thousands of groups.
And the output looks like this:
pd.DataFrame({'X' : ['B', 'B', 'A'], 'Y' : [1, 2, 3]})
My main gripe is that .groupby().head() only gives me a fixed number of rows for each group, and I want a different number of rows for different groups.
One way to do this is to create a dictionary containing the number of rows each group should keep; then, in groupby.apply, use g.name as the key to look up that value in the dictionary, so that the head method keeps a different number of rows for each group:
rows_per_group = {"A": 1, "B": 2}
df.groupby("X", group_keys=False).apply(lambda g: g.head(rows_per_group[g.name]))
# X Y
#2 A 3
#0 B 1
#1 B 2
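Since the real dataset is large, a vectorized alternative (a sketch) avoids apply by comparing each row's within-group position, from cumcount, against the mapped per-group limit; note it returns rows in their original order:
rows_per_group = {"A": 1, "B": 2}
# cumcount numbers rows 0, 1, 2, ... within each group
keep = df.groupby("X").cumcount() < df["X"].map(rows_per_group)
df[keep]
# X Y
#0 B 1
#1 B 2
#2 A 3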
