Omit groups in pandas groupby based on a condition - Python

This is my dataframe:
df = pd.DataFrame({'sym': list('aaaaaabb'), 'key': [1, 1, 1, 1, 2, 2, 3, 3], 'x': [100, 100, 90, 100, 500, 500, 700, 700]})
I group them by key and sym:
groups = df.groupby(['key', 'sym'])
Now I want to check whether all x values in each group are equal or not. If they are not equal, I want to drop that group's rows from the df. In this case I want to omit the first group.
This is my desired df:
   key sym    x
4    2   a  500
5    2   a  500
6    3   b  700
7    3   b  700

Use GroupBy.transform with SeriesGroupBy.nunique, compare the result with 1, and filter by boolean indexing:
df1 = df[df.groupby(['key', 'sym'])['x'].transform('nunique').eq(1)]
print (df1)
  sym  key    x
4   a    2  500
5   a    2  500
6   b    3  700
7   b    3  700
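If you prefer to keep the group logic explicit, GroupBy.filter drops whole groups that fail a condition. A minimal sketch of the same filtering (transform is usually faster when there are many groups):

# Sketch: keep only the groups in which all x values are equal.
# filter() receives each sub-DataFrame and drops the group when it returns False.
df1 = df.groupby(['key', 'sym']).filter(lambda g: g['x'].nunique() == 1)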

Related

How to get the (relative) place of values in a dataframe when sorted using Python?

How can I create a Pandas DataFrame that shows the relative position of each value, when those values are sorted from low to high for each column?
So in this case, how can you transform 'df' into 'dfOut'?
import pandas as pd
import numpy as np

# create DataFrame
df = pd.DataFrame({'A': [12, 18, 9, 21, 24, 15],
                   'B': [18, 22, 19, 14, 14, 11],
                   'C': [5, 7, 7, 9, 12, 9]})

# How to assign a value to the order in the column, when sorted from low to high?
dfOut = pd.DataFrame({'A': [2, 4, 1, 5, 6, 3],
                      'B': [3, 5, 4, 2, 2, 1],
                      'C': [1, 2, 2, 3, 4, 3]})
If you need to map the same values to the same output, try using the rank method of a DataFrame. Like this:
>> dfOut = df.rank(method="dense").astype(int) # Type transformation added to match your output
>> dfOut
   A  B  C
0  2  3  1
1  4  5  2
2  1  4  2
3  5  2  3
4  6  2  4
5  3  1  3
The rank method computes the rank for each column following a specified criterion. According to the Pandas documentation, the "dense" method ensures that "rank always increases by 1 between groups", and that might match your use case.
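For comparison, here is a small sketch of how the rank methods differ on a column with ties, using column B from the example (the commented outputs are what pandas should produce):

s = pd.Series([18, 22, 19, 14, 14, 11])  # column B
print(s.rank(method="dense").astype(int).tolist())   # [3, 5, 4, 2, 2, 1]
print(s.rank(method="min").astype(int).tolist())     # [4, 6, 5, 2, 2, 1]
print(s.rank(method="average").tolist())             # [4.0, 6.0, 5.0, 2.5, 2.5, 1.0]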
Original answer: In case repeated numbers are not required to map to the same output value, np.argsort could be applied on each column to retrieve the position of each value that would sort the column. Combine this with the apply method of a DataFrame to apply the function to each column, and you have this:
>> dfOut = df.apply(lambda column: np.argsort(column.values))
>> dfOut
   A  B  C
0  2  5  0
1  0  3  1
2  5  4  2
3  1  0  3
4  3  2  5
5  4  1  4
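Note that applying np.argsort twice gives each element's 0-based position in sorted order, so a two-pass sketch would match the expected layout, except that ties still receive distinct values rather than the same one:

# Sketch: argsort of argsort = 0-based rank of each element; +1 to start at 1.
dfOut = df.apply(lambda column: np.argsort(np.argsort(column.values)) + 1)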
Here is my attempt using some functions:
def sorted_idx(l, num):
    x = sorted(list(set(l)))
    for i in range(len(x)):
        if x[i] == num:
            return i + 1

def output_list(l):
    ret = [sorted_idx(l, elem) for elem in l]
    return ret

dfOut = df.apply(lambda column: output_list(column))
print(dfOut)
I reduce the original list to unique values and then sort them. Finally, I return the index + 1 where the element in the original list matches this unique, sorted list, which gives the values in your expected output.
Output:
   A  B  C
0  2  3  1
1  4  5  2
2  1  4  2
3  5  2  3
4  6  2  4
5  3  1  3

Find duplicate rows among different groups with pandas

Problem
Consider the following dataframe:
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pandas.DataFrame(data_so, columns=['ID', 'letter'])
I want to obtain a new column where all duplicates in different groups are True. All other duplicates in the same group should be False.
What I've tried
I've tried using
df_so['dup'] = df_so.duplicated(subset=['letter'], keep=False)
but the result is not what I want:
The first occurrence of A in the first group (row 0) is True because there is a duplicate in another group (row 7). However, all other occurrences of A in the same group (row 2) should be False.
If row 7 is deleted, then row 0 should be False because A is not present anymore in any other group.
What you need is essentially the AND of two different duplicated() calls.
~df_so.duplicated() deals with duplicates within groups.
df_so.drop_duplicates().duplicated(subset='letter', keep=False).fillna(True) deals with duplicates between groups, ignoring duplicates inside the current group.
Code:
import pandas as pd
data_so = { 'ID': [100, 100, 100, 200, 200, 300, 300, 300], 'letter': ['A','B','A','C','D','E','D','A'], }
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
df_so['dup'] = ~df_so.duplicated() & df_so.drop_duplicates().duplicated(subset='letter',keep=False).fillna(True)
print(df_so)
Output:
    ID letter    dup
0  100      A   True
1  100      B  False
2  100      A  False
3  200      C  False
4  200      D   True
5  300      E  False
6  300      D   True
7  300      A   True
Other case:
data_so = { 'ID': [100, 100, 100, 200, 200, 300, 300], 'letter': ['A','B','A','C','D','E','D'] }
Output:
    ID letter    dup
0  100      A  False
1  100      B  False
2  100      A  False
3  200      C  False
4  200      D   True
5  300      E  False
6  300      D   True
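A variant of the same idea, sketched with groupby and transform: a letter qualifies when it appears under more than one distinct ID, and only its first occurrence within each group is flagged.

# Sketch: letter occurs under more than one distinct ID, AND the row is the
# first occurrence of its (ID, letter) pair.
in_many_groups = df_so.groupby('letter')['ID'].transform('nunique').gt(1)
first_in_group = ~df_so.duplicated(subset=['ID', 'letter'])
df_so['dup'] = in_many_groups & first_in_group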
As you clarified in the comment, you need an additional mask besides the current duplicated one:
m1 = df_so.duplicated(subset=['letter'], keep=False)
m2 = ~df_so.groupby('ID').letter.apply(lambda x: x.duplicated())
df_so['dup'] = m1 & m2
Out[157]:
    ID letter    dup
0  100      A   True
1  100      B  False
2  100      A  False
3  200      C  False
4  200      D   True
5  300      E  False
6  300      D   True
7  300      A   True
8  300      A  False
Note: I added row 8, as mentioned in the comment.
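Incidentally, m2 simply marks the first occurrence of each (ID, letter) pair, so the groupby can arguably be avoided (a sketch):

# Sketch: should be equivalent to the groupby-based m2 above.
m2 = ~df_so.duplicated(subset=['ID', 'letter'])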
My idea for this problem:
import datatable as dt
df = dt.Frame(df_so)
df[:1, :, dt.by("ID", "letter")]
I would group by both the ID and letter column. Then simply select the first row.
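For readers without datatable, a rough pandas equivalent of that selection might be the following sketch (note it deduplicates rows rather than adding the dup column the question asks for):

# Sketch: keep the first row of every (ID, letter) group in plain pandas.
df_first = df_so.drop_duplicates(subset=['ID', 'letter'])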

Filling Pandas columns with lists of unequal lengths

I am having trouble filling Pandas dataframes with values from lists of unequal lengths.
nx_lists_into_df is a list of numpy arrays.
I get the following error:
ValueError: Length of values does not match length of index
The code is below:
import pandas as pd
from numpy import array
from itertools import zip_longest

# Column headers
df_cols = ["f1", "f2"]

# Create one dataframe for each sheet
df1 = pd.DataFrame(columns=df_cols)
df2 = pd.DataFrame(columns=df_cols)

# Create list of dataframes to iterate through
df_list = [df1, df2]

# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]

# Loop through each sheet (i.e. each round of k folds)
for df, test_index_list in zip_longest(df_list, nx_lists_into_df):
    counter = -1
    # Loop through each column in that sheet (i.e. each fold)
    for col in df_cols:
        print(col)
        counter += 1
        # Add 1 to each index value to start indexing at 1
        df[col] = test_index_list[counter] + 1
Thank you for your help.
Edit: This is how the result should hopefully look:
print(df1)
   f1   f2
0   0    2
1   1    5
2   3    6
3   4    8
4   7  NaN

print(df2)
   f1   f2
0   0    3
1   1    4
2   2    5
3   6    8
4   7  NaN
We'll leverage pd.Series to attach an appropriate index, which allows us to use the pd.DataFrame constructor without it complaining about unequal lengths.
df1, df2 = (
    pd.DataFrame(dict(zip(df_cols, map(pd.Series, d))))
    for d in nx_lists_into_df
)
print(df1)
   f1   f2
0   0  2.0
1   1  5.0
2   3  6.0
3   4  8.0
4   7  NaN

print(df2)
   f1   f2
0   0  3.0
1   1  4.0
2   2  5.0
3   6  8.0
4   7  NaN
Setup

from numpy import array

nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]

# Column headers
df_cols = ["f1", "f2"]
You could predefine the size of your DataFrames (by setting the index range to the length of the longest column you want to add [or any size bigger than the longest column]) like so:
df1 = pd.DataFrame(columns=df_cols, index=range(5))
df2 = pd.DataFrame(columns=df_cols, index=range(5))
print(df1)
    f1   f2
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
(df2 is the same)
The DataFrame will be filled with NaNs automatically.
Then you use .loc to access each entry separately like so:
for x in range(len(nx_lists_into_df)):
    for col_idx, y in enumerate(nx_lists_into_df[x]):
        df_list[x].loc[range(len(y)), df_cols[col_idx]] = y
print(df1)
   f1   f2
0   0    2
1   1    5
2   3    6
3   4    8
4   7  NaN

print(df2)
   f1   f2
0   0    3
1   1    4
2   2    5
3   6    8
4   7  NaN
The first loop iterates over the first dimension of your array (or the number of DataFrames you want to create).
The second loop iterates over the column values for the DataFrame, where y are the values for the current column and df_cols[col_idx] is the respective column (f1 or f2).
Since the row & col indices are the same size as y, you don't get the length mismatch.
Also check out the enumerate(iterable, start=0) function to get around those "counter" variables.
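For instance, the original loop might be sketched like this, assuming the predefined-index frames from above (wrapping the array in pd.Series also sidesteps the length error, because assignment aligns on the index):

# Sketch: enumerate() replaces the manual counter; pd.Series pads the
# shorter column with NaN instead of raising a length mismatch.
for df, test_index_list in zip(df_list, nx_lists_into_df):
    for counter, col in enumerate(df_cols):
        df[col] = pd.Series(test_index_list[counter] + 1)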
Hope this helps.
If I understand correctly, this is possible via pd.concat.
But see #pir's solution for an extendable version.
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]

df1 = pd.concat([pd.DataFrame({'A': nx_lists_into_df[0][0]}),
                 pd.DataFrame({'B': nx_lists_into_df[0][1]})],
                axis=1)
#    A    B
# 0  0  2.0
# 1  1  5.0
# 2  3  6.0
# 3  4  8.0
# 4  7  NaN

df2 = pd.concat([pd.DataFrame({'C': nx_lists_into_df[1][0]}),
                 pd.DataFrame({'D': nx_lists_into_df[1][1]})],
                axis=1)
#    C    D
# 0  0  3.0
# 1  1  4.0
# 2  2  5.0
# 3  6  8.0
# 4  7  NaN
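If you do want an extendable pd.concat version, one sketch is to concatenate named Series, which pads unequal lengths with NaN the same way (df_cols as defined earlier):

# Sketch: generalizes the concat approach to any number of columns.
df1, df2 = (
    pd.concat([pd.Series(arr, name=col) for col, arr in zip(df_cols, d)], axis=1)
    for d in nx_lists_into_df
)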

Union of possible combinations of two columns

My DataFrame looks like this:
   A  B
 100  1
 100  2
 200  2
 200  3
I need to find all possible combinations of A and B values and create a new dataframe with these combinations and a third column indicating each combination's presence in the original df:
   A  B      C
 100  1   True
 100  2   True
 100  3  False
 200  1  False
 200  2   True
 200  3   True
How I'm doing it now:
import pandas as pd
df = pd.DataFrame({'A' : [100,100,200,200], 'B' : [1,2,2,3]})
df['D'] = 42
df2 = df[['A','D']].merge(df[['B','D']], on='D')[['A','B']].drop_duplicates()
i1 = df.set_index(['A','B']).index
i2 = df2.set_index(['A','B']).index
df2['C'] = i2.isin(i1)
print(df2)
It works, but looks ugly. Is there a cleaner way?
You can use:

- create a new column filled with Trues
- set_index from the columns for all combinations
- create MultiIndex.from_product from the levels of the df1 index
- reindex the original df, filling in False where values do not exist
- reset_index to move the MultiIndex levels back into columns
df['C'] = True
df1 = df.set_index(['A','B'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
df = df1.reindex(mux, fill_value=False).reset_index()
print (df)
     A  B      C
0  100  1   True
1  100  2   True
2  100  3  False
3  200  1  False
4  200  2   True
5  200  3   True
With the help of itertools and tuple:
import itertools

newdf = pd.DataFrame(list(itertools.product(df['A'].unique(), df['B'].unique())),
                     columns=df.columns)
dft = list(df.itertuples(index=False))
newdf['C'] = newdf.apply(lambda x: tuple(x) in dft, axis=1)
Output :
     A  B      C
0  100  1   True
1  100  2   True
2  100  3  False
3  200  1  False
4  200  2   True
5  200  3   True
Using cartesian_product and pd.merge
In [415]: combs = pd.core.reshape.util.cartesian_product(
     ...:     df.set_index(['A', 'B']).index.levels)

In [416]: combs
Out[416]:
[array([100, 100, 100, 200, 200, 200], dtype=int64),
 array([1, 2, 3, 1, 2, 3], dtype=int64)]
In [417]: (pd.DataFrame({'A': combs[0], 'B': combs[1]})
     ...:    .merge(df, how='left', indicator='C')
     ...:    .replace({'C': {'both': True, 'left_only': False}}))
Out[417]:
     A  B      C
0  100  1   True
1  100  2   True
2  100  3  False
3  200  1  False
4  200  2   True
5  200  3   True
For combs, you could also do:

In [432]: pd.core.reshape.util.cartesian_product([df.A.unique(), df.B.unique()])
Out[432]:
[array([100, 100, 100, 200, 200, 200], dtype=int64),
 array([1, 2, 3, 1, 2, 3], dtype=int64)]
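Note that pd.core.reshape.util is private pandas API and has moved between versions; here is a sketch of the same cartesian product through the public MultiIndex.from_product (assuming only the A and B columns, as in this question):

# Sketch: public-API replacement for the private cartesian_product helper.
mux = pd.MultiIndex.from_product([df.A.unique(), df.B.unique()], names=['A', 'B'])
result = (mux.to_frame(index=False)
             .merge(df, how='left', indicator='C')
             .replace({'C': {'both': True, 'left_only': False}}))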

Pandas: join series based on category index

I have two pd.Series:
A
idx
200    1
300    2
400    3

with length n, and

B
idx
200    4
350    5
360    6
370    7
380    8

with length m.
Note that the length of the Series can be different.
I would like to have a category index:
cat
[200, 300)
[300, 400)
[400, 500)
and perform a correlation between the following pd.Series:
            A        B
cat
[200, 300)  1        3
[300, 400)  2  4+5+6+7
[400, 500)  3      NaN
So how do I slot my data based on their index into the category index and perform a sum over entries that fall into the same category?
I have tried groupby, but I cannot manage to group over Categories.
THX
IIUC:
Data setup:
a = pd.Series(data=[1,2,3],index=[200,300,400])
b = pd.Series(data=[4,5,6,7,8], index=[200,350,360,370,380])
Convert each Series to a dataframe and create a category column using pd.cut:
df_a = a.to_frame()
df_a['cat'] = pd.cut(df_a.index, bins=[0, 100, 200, 300, 400, 500, 600],
                     labels=['0-99', '100-199', '200-299', '300-399', '400-499', '500-599'])
df_b = b.to_frame()
df_b['cat'] = pd.cut(df_b.index, bins=[0, 100, 200, 300, 400, 500, 600],
                     labels=['0-99', '100-199', '200-299', '300-399', '400-499', '500-599'])
Do a groupby on cat and use pd.concat:
group_b = df_b.groupby('cat')[0].apply(list)
group_b = group_b.where(group_b.str.len())
group_a = df_a.groupby('cat')[0].apply(list)
group_a = group_a.where(group_a.str.len())
pd.concat([group_a,group_b],axis=1,keys=['A','B'])
Output:
           A             B
cat
0-99     NaN           NaN
100-199  [1]           [4]
200-299  [2]           NaN
300-399  [3]  [5, 6, 7, 8]
400-499  NaN           NaN
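To get the sums the question actually asks for (rather than lists of values), one sketch is to cut with an IntervalIndex so the bins print as [200, 300)-style categories; min_count=1 keeps empty bins as NaN instead of 0:

# Sketch: sum per left-closed interval; empty bins stay NaN via min_count=1.
bins = pd.IntervalIndex.from_breaks([200, 300, 400, 500], closed='left')
sum_a = a.groupby(pd.cut(a.index, bins), observed=False).sum(min_count=1)
sum_b = b.groupby(pd.cut(b.index, bins), observed=False).sum(min_count=1)
out = pd.concat([sum_a, sum_b], axis=1, keys=['A', 'B'])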
