Groupby Duplicated in Python

I have a dataset of Order_ID and Item_ID:
Order_ID, Item_ID
101,121
101,121
101,223
101,234
I want to check which Item_ID appears more than once within any particular order.
Desired output:
Order_ID, Item_ID, freq
101,121,2
What would be the most efficient way to do this in Python?

Use groupby with size or value_counts first, then filter with query or boolean indexing (faster on larger DataFrames):
df1 = df.groupby(['Order_ID','Item_ID']).size().reset_index(name='freq').query('freq > 1')
Alternative:
df1=df.groupby('Order_ID')['Item_ID'].value_counts().reset_index(name='freq').query('freq>1')
Or:
df1 = df.groupby(['Order_ID','Item_ID']).size().reset_index(name='freq')
df1 = df1[df1['freq'] > 1]
print (df1)
Order_ID Item_ID freq
0 101 121 2
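For reference, a minimal sketch that builds the sample frame and, as an alternative that skips the frequency column, pulls the repeated pairs directly with DataFrame.duplicated:
import pandas as pd

# sample data from the question
df = pd.DataFrame({'Order_ID': [101, 101, 101, 101],
                   'Item_ID': [121, 121, 223, 234]})

# rows whose (Order_ID, Item_ID) pair occurs more than once
dupes = df[df.duplicated(['Order_ID', 'Item_ID'], keep=False)]
print(dupes.drop_duplicates())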

Related

Filter a pandas df with multiple checks on whether different columns are equal

,unique_system_identifier,call_sign,date1,date2,date3,date4
0,3929436,WQZL268,14-06-2023,,14-06-2023,
1,3929436,WQZL268,,,,
2,3929437,WQZL269,14-06-2023,,14-06-2023,
3,3929437,WQZL269,,,,
4,3929438,WQZL270,14-06-2023,,14-06-2023,
5,3929438,WQZL270,,,,
6,3929439,WQZL271,14-06-2023,,14-06-2023,
7,3929439,WQZL271,,,,
8,3929440,WQZL272,14-06-2023,,14-06-2023,
9,3929440,WQZL272,,,,
10,3929441,WQZL273,14-06-2023,,14-06-2023,
11,3929441,WQZL273,,,,
12,3929442,WQZL274,14-06-2023,,14-06-2023,
13,3929442,WQZL274,,,,
14,3929443,WQZL275,14-06-2023,,14-06-2023,
I have a df like the one above and need to keep only the rows where date1 and date3 differ, or where date2 and date4 differ (rows where both pairs differ should also be kept).
How can I do this with pandas? Note that the columns come in as pandas object dtype, not as datetime/string.
You can replace missing values first and then compare for not equal, chaining both masks with | for bitwise OR:
df1 = df.fillna('')
df = df[df1['date1'].ne(df1['date3']) | df1['date2'].ne(df1['date4'])]
print (df)
Empty DataFrame
Columns: [Unnamed: 0, unique_system_identifier, call_sign, date1, date2, date3, date4]
Index: []
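If the comparison should be made on real dates rather than strings, a hedged sketch converting first (the day-first format is assumed from the sample):
cols = ['date1', 'date2', 'date3', 'date4']
df[cols] = df[cols].apply(pd.to_datetime, format='%d-%m-%Y')
# NaT != NaT evaluates True, so fill with a sentinel before comparing
df1 = df[cols].fillna(pd.Timestamp.min)
df = df[df1['date1'].ne(df1['date3']) | df1['date2'].ne(df1['date4'])]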

Filter and change a data frame: sum() and remove rows based on conditions

From data frame df I need to produce data frame result. We need to iterate over df.id: if there are multiple rows with the same id and some of them are negative, we sum the quantities one by one; if the result is 0 or negative we remove the row from the data frame, and if the result is positive we keep it.
import pandas as pd

df = pd.DataFrame({'id': ['1a', '1a', 'b5', 'b5', '1a', '1a'],
                   'date': ['11-01-22', '12-01-22', '13-01-22', '21-01-22', '11-01-22', '18-01-22'],
                   'quantity': [2, 5, 3, -1, -2, 2]})
result = pd.DataFrame({'id': ['1a', 'b5', '1a'],
                       'date': ['12-01-22', '13-01-22', '18-01-22'],
                       'quantity': [5, 2, 2]})
IIUC, you want to aggregate per id/date and keep the positive quantities?
(df
.groupby(['id', 'date'], as_index=False, sort=False).sum()
.query('quantity > 0')
)
output:
id date quantity
1 1a 12-01-22 5
2 b5 13-01-22 3
4 1a 18-01-22 2
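If instead each id should collapse to a single row, summing all of its quantities and keeping the first date, a hedged sketch with named aggregation (a different reading of the question):
out = (df.groupby('id', as_index=False)
         .agg(date=('date', 'first'), quantity=('quantity', 'sum'))
         .query('quantity > 0'))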

How can I get the index values in DF1 where DF1's column values match DF2's custom MultiIndex values?

I have two data frames: DF1 and DF2.
DF2 is essentially a randomly generated subset of rows in DF1.
I want to get the (integer) indexes of the DF1 rows whose column values completely match a row in DF2.
I'm trying to do this with a multi-index:
So if I have the following:
DF1:
Index Name Age Gender Label
0 Kate 24 F 1
1 Bill 23 M 0
2 Bob 22 M 0
3 Billy 21 M 0
DF2:
MultiIndex Name Age Gender Label
(Bob,22,M) Bob 22 M 0
(Billy,21,M) Billy 21 M 0
Desired Output: [2,3]
How can I use that MultiIndex in DF2 to check DF1 for those matches?
I found this while searching but I think this requires you to specify what value you want beforehand? I can't find this exact use case.
df2.loc[(df2.index.get_level_values('Name') == 'xxx') &
        (df2.index.get_level_values('Age') == x) &
        (df2.index.get_level_values('Gender') == x)]
Please let me know the best way.
Thanks!
Edit (Code to generate df1):
Pseudocode: merge two dataframes to get a total of 10 columns and drop everything except 4 columns.
Edit (Code to generate df2):
if amount_needed - len(lowest_value_keys) > 0:
    extra_samples = df1[df1.Label == 0].sample(n=amount_needed - len(lowest_value_keys), replace=False)
    lowest_value_df = pd.DataFrame(data=lower_value_keys, columns=['Name', 'Age', 'Gender'])
    samples = pd.concat([lowest_value_df, extra_samples])
    samples.index = pd.MultiIndex.from_frame(samples[['Name', 'Age', 'Gender']])
else:
    all_samples = pd.DataFrame(data=lower_value_keys, columns=['Name', 'Age', 'Gender'])
    samples = all_samples.sample(n=amount_needed, replace=False)
    samples.index = pd.MultiIndex.from_frame(samples[['Name', 'Age', 'Gender']])
Not sure if this answers your query, but what if we first reset the index of df1 to get it as another column 'Index', then set_index on Name, Age, Gender to find the matches from df2, and just take the resulting Index column?
So that would be:
(df1.reset_index()
    .set_index(['Name', 'Age', 'Gender'])
    .loc[df2.set_index(['Name', 'Age', 'Gender']).index]['Index']
    .values)
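A quick check with the toy frames from the question; the construction of df2 here is hypothetical, mirroring the example (note that reset_index names the new column 'index' when the original index is unnamed):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Kate', 'Bill', 'Bob', 'Billy'],
                    'Age': [24, 23, 22, 21],
                    'Gender': ['F', 'M', 'M', 'M'],
                    'Label': [1, 0, 0, 0]})

# hypothetical reconstruction of df2 as a subset of df1
# carrying a (Name, Age, Gender) MultiIndex
df2 = df1.iloc[[2, 3]].copy()
df2.index = pd.MultiIndex.from_frame(df2[['Name', 'Age', 'Gender']])

idx = (df1.reset_index()
          .set_index(['Name', 'Age', 'Gender'])
          .loc[df2.set_index(['Name', 'Age', 'Gender']).index, 'index']
          .values)
print(idx)  # [2 3]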

Groupby id and create boolean columns

I have a dataframe of transactions:
id | type | date
453| online | 08-12-19
453| instore| 08-12-19
453| return | 10-5-19
There are 4 possible types: online, instore, return, other. I want to create boolean columns showing, for each unique customer, whether they ever had a given transaction type.
I tried the following code but it was not giving me what I wanted.
transactions.groupby('id')['type'].transform(lambda x: x == 'online') == 'online'
Use get_dummies with an aggregated max for indicator columns per group, and last add DataFrame.reindex for a custom column order, filling possible missing types with 0:
t = ['online', 'instore', 'return', 'other']
df = pd.get_dummies(df['type']).groupby(df['id']).max().reindex(t, axis=1, fill_value=0)
print (df)
online instore return other
id
453 1 1 1 0
Another idea with join per groups and Series.str.get_dummies:
t = ['online', 'instore', 'return', 'other']
df.groupby('id')['type'].agg('|'.join).str.get_dummies().reindex(t, axis=1, fill_value=0)
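If true booleans are preferred over the 0/1 indicators, a small sketch with pd.crosstab (same t as above):
pd.crosstab(df['id'], df['type']).reindex(t, axis=1, fill_value=0).gt(0)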

Create new column based on another column for a multi-index pandas dataframe

I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index pandas dataframe where the level=0 index is a series of month-end dates and the level=1 index is a simple integer ID. I want to create a new column of values ('new_var') where, for each month-end date, I look forward one month and get the values from another column ('some_var'); of course the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np
# Create some time series data
id = np.arange(0,5)
date = [pd.Timestamp(2017, 1, 31) + pd.offsets.MonthEnd(i) for i in [0, 1]]
my_data = []
for d in date:
    for i in id:
        my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)
# Drop an observation to reflect my true data
df.drop(index=[(pd.Timestamp('2017-02-28'), 3)], inplace=True)
df
# The desired output....
list1 = df.loc['2017-01-31'].index.tolist()
list2 = df.loc['2017-02-28'].index.tolist()
common = list(set(list1) & set(list2))
for i in common:
    df.loc[('2017-01-31', i), 'new_var'] = df.loc[('2017-02-28', i), 'some_var']
df
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create an integer column representing the date, subtract one from it (to shift it by one month), and then left-merge the values back onto the original dataframe.
Out[28]:
some_var
date id
2017-01-31 0 0.736003
1 0.248275
2 0.844170
3 0.671364
4 0.034331
2017-02-28 0 0.051586
1 0.894579
2 0.136740
4 0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
some_var new_var
date id
2017-01-31 0 0.736003 0.051586
1 0.248275 0.894579
2 0.844170 0.136740
3 0.671364 NaN
4 0.034331 0.902409
2017-02-28 0 0.051586 NaN
1 0.894579 NaN
2 0.136740 NaN
4 0.902409 NaN
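As an aside, a shorter sketch in the same spirit, assuming the level-0 dates are consecutive month-ends: unstack the ids into columns, shift the whole frame up one row, and stack back; index alignment on assignment fills the missing (date, id) pairs with NaN:
# next month's some_var, aligned on (date, id)
df['new_var'] = df['some_var'].unstack('id').shift(-1).stack()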
