Python Dataframe: Dropping duplicates based on certain conditions

I have a dataframe with duplicate Shop IDs, where some Shop IDs occur twice and some occur three times. I only want to keep unique Shop IDs based on the shortest Shop Distance assigned to their Area:
Area Shop Name Shop Distance Shop ID
0 AAA Ly 86 5d87790c46a77300
1 AAA Hi 230 5ce5522012138400
2 BBB Hi 780 5ce5522012138400
3 CCC Ly 450 5d87790c46a77300
...
91 MMM Ju 43 4f76d0c0e4b01af7
92 MMM Hi 1150 5ce5522012138400
...
Using pandas drop_duplicates drops the duplicate rows, but the condition is based on the first/last occurring Shop ID, which does not allow me to sort by distance:
shops_df = shops_df.drop_duplicates(subset='Shop ID', keep='first')
I also tried to group by Shop ID and then sort, but sort_values returns an error about duplicates:
bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)
So far I have tried up to this stage:
# filter all the duplicates into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep=False)]
# count the occurrences of each duplicated Shop ID
mask = df_toclean['Shop ID'].value_counts()
# Shop IDs that occurred 2 times
shop_2 = mask[mask == 2].index
# Shop IDs that occurred 3 times
shop_3 = mask[mask == 3].index
# mask for the shops within a 750 radius
dist_1 = df_toclean['Shop Distance'] <= 750
# all the Shop IDs that appeared twice and are within the 750 radius
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]
* If I use df_toclean['Shop Distance'].min() instead of dist_1, it returns 0 results.
I think I'm doing it the long way and still haven't figured out how to drop the duplicates. Does anyone know a shorter way to solve this? I'm new to Python, thanks for helping out!

Try to first sort the dataframe based on distance, then drop the duplicate shops.
df = shops_df.sort_values('Shop Distance')
df = df[~df['Shop ID'].duplicated()]  # The tilde (~) inverts the boolean mask.
Or just as one chained expression (per the comment from @chmielcode):
df = (
    shops_df
    .sort_values('Shop Distance')
    .drop_duplicates(subset='Shop ID', keep='first')
    .reset_index(drop=True)  # Optional.
)

You can use idxmin:
df.loc[df.groupby('Area')['Shop Distance'].idxmin()]
Area Shop Name Shop Distance Shop ID
0 AAA Ly 86 5d87790c46a77300
2 BBB Hi 780 5ce5522012138400
3 CCC Ly 450 5d87790c46a77300
4 MMM Ju 43 4f76d0c0e4b01af7
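The same idxmin idea also works per Shop ID rather than per Area, if the goal is one row per Shop ID keeping its shortest distance. A minimal sketch, assuming shops_df has the columns shown in the question (shortest is just an illustrative name):
# for each Shop ID, keep the row whose Shop Distance is smallest
shortest = shops_df.loc[shops_df.groupby('Shop ID')['Shop Distance'].idxmin()]
shortest = shortest.sort_index()  # optional: restore the original row order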

Related

Pandas filtering based on minimum data occurrences across multiple columns

I have a dataframe like this
country data_fingerprint organization
US 111 Tesco
UK 222 IBM
US 111 Yahoo
PY 333 Tesco
US 111 Boeing
CN 333 TCS
NE 458 Yahoo
UK 678 Tesco
I want the data_fingerprint values for rows where both the organization and the country are among the top 2 by count.
For organization the top 2 by occurrence are Tesco and Yahoo, and for country they are US and UK.
Based on that, the output for data_fingerprint should be:
data_fingerprint
111
678
What I have tried, for checking that the organization exists in my complete dataframe, is this:
# First find top 2 occurances of organization
nd = df['organization'].value_counts().groupby(level=0, group_keys=False).head(2)
# Then checking if the organization exist in the complete dataframe and filtering those rows
new = df["organization"].isin(nd)
But I am not getting any data here. Once I get data for this, I can do the same along with country.
Can someone please help me get the output? I have only a small amount of data, so I am using pandas.
Here is one way to do it:
df[
    df['organization'].isin(df['organization'].value_counts().head(2).index) &
    df['country'].isin(df['country'].value_counts().head(2).index)
]['data_fingerprint'].unique()
array([111, 678], dtype=int64)
Annotated code
# find top 2 most occurring country and organization
i1 = df['country'].value_counts().index[:2]
i2 = df['organization'].value_counts().index[:2]
# Create boolean mask to select the rows having top 2 country and org.
mask = df['country'].isin(i1) & df['organization'].isin(i2)
# filter the rows using the mask and drop dupes in data_fingerprint
df.loc[mask, ['data_fingerprint']].drop_duplicates()
Result
data_fingerprint
0 111
7 678
You can do
# First find the top 2 occurrences of organization
nd = df['organization'].value_counts().head(2).index
# Then check whether the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
Output - Only Tesco and Yahoo left
df[new]
country data_fingerprint organization
0 US 111 Tesco
2 US 111 Yahoo
3 PY 333 Tesco
6 NE 458 Yahoo
7 UK 678 Tesco
You can do the same for country
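A minimal sketch of that second step, reusing the new mask from the snippet above (nc and result are just illustrative names):
# top 2 countries by occurrence
nc = df['country'].value_counts().head(2).index

# combine the organization mask with the country mask, then take the fingerprints
result = df.loc[new & df['country'].isin(nc), 'data_fingerprint'].drop_duplicates()
print(result)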

Sum one column based on matching values in second column and put total into correct record in another column

I would like to ask for some guidance with a project. The dataframe scripts below show the before and after for your review.
Listed below is a hotel list of guests, rates, and status.
The ID column is for tracking each person staying with the hotel. The Rate column lists the rate per person.
The Status column is "P" for primary and "S" for shared.
The overview: guests with matching ID numbers are staying in a room together, so both rates should be summed, but listed under the "P" (primary) guest's record in the Total column. In the Total column, each "P" row should show the total for the two guests staying together, and each "S" (shared) row should be zero.
I have tried pandas groupby and sum, but this snippet removes some of the matching IDs' records. The sum works for my totals, but I still need to figure out how to put the total under the primary guest's record. I am still reviewing Stack Overflow for similar solutions that could help.
df = df.groupby(["ID"]).Rate.sum().reset_index()
import pandas as pd
print('Before')
df = pd.DataFrame({'ID': [1182, 2554, 1182, 2658, 5489, 2658],
                   'Fname': ['Tom', 'Harry', 'Trisha', 'Ben', 'Susan', 'Brenda'],
                   'Rate': [125.00, 89.00, 135.00, 25.00, 145.00, 19.00],
                   'Status': ['P', 'S', 'P', 'P', 'P', 'S'],
                   'Total': [0, 0, 0, 0, 0, 0]})
df1 = pd.DataFrame({'ID': [1182, 1182, 2658, 2658, 2554, 5489],
                    'Fname': ['Tom', 'Trisha', 'Ben', 'Brenda', 'Harry', 'Susan'],
                    'Rate': [125.00, 135.00, 245.00, 19.00, 89.00, 25.00],
                    'Status': ['P', 'S', 'P', 'S', 'P', 'P'],
                    'Total': [260.00, 0, 264.00, 0, 89.00, 25.00]})
print(df)
print()
print('After')
print(df1)
Assuming you have a unique P per group, you can use a GroupBy.transform with a mask:
df['Total'] = df.groupby('ID')['Rate'].transform('sum').where(df['Status'].eq('P'), 0)
output (using the data from df1):
ID Fname Rate Status Total
0 1182 Tom 125.0 P 260.0
1 1182 Trisha 135.0 S 0.0
2 2658 Ben 245.0 P 264.0
3 2658 Brenda 19.0 S 0.0
4 2554 Harry 89.0 P 89.0
5 5489 Susan 25.0 P 25.0
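For reference, a self-contained sketch of that answer against the "After"-style data from the question (one P per ID), which reproduces the output above:
import pandas as pd

df = pd.DataFrame({'ID': [1182, 1182, 2658, 2658, 2554, 5489],
                   'Fname': ['Tom', 'Trisha', 'Ben', 'Brenda', 'Harry', 'Susan'],
                   'Rate': [125.00, 135.00, 245.00, 19.00, 89.00, 25.00],
                   'Status': ['P', 'S', 'P', 'S', 'P', 'P']})

# total per ID, kept only on the primary ('P') row, zero on shared ('S') rows
df['Total'] = (df.groupby('ID')['Rate'].transform('sum')
                 .where(df['Status'].eq('P'), 0))
print(df)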

Changing value to be the maximum value per group

I have this kind of structure:
country product installs purchases
US T 100 100
US A 5 5
AU T 500 500
AU A 20 20
I am trying to get:
country product installs purchases
US T 100 100
US A 100 5
AU T 500 500
AU A 500 20
Each value in the installs column needs to be the value of installs where the product column's value is T.
I tried:
exp.groupby(['country','product'])['date_install_'] = max(exp.groupby(['country','product'])['date_install_'])
Which does not work and I am kind of lost. How can I achieve the result?
Find the rows where the product is T, group by the country, and get the maximum of the installs. Use this as a map to replace the values in installs:
df['installs'] = df['country'].map(df[df['product'] == 'T'].groupby('country')['installs'].max())
Result:
country product installs purchases
0 US T 100 100
1 US A 100 5
2 AU T 500 500
3 AU A 500 20
For clarity, this is what is being passed to map:
>>> df[df['product'] == 'T'].groupby('country')['installs'].max()
country
AU 500
US 100
Name: installs, dtype: int64
So you can use it like a dict with the index (country) as a key and the installs as a value.
If the T value is always the maximum, you can use an auxiliary df that holds the max value of installs per country, then merge that with the original df, swapping in that maximum for the installs value:
aux = df.groupby('country').installs.max().reset_index()
df.drop('installs', axis=1).merge(aux, how='left', on='country')
You reset the index so that you can use country as a column in the first line.
You drop installs before you merge because the aux df already has the value and name of the installs you want.
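A self-contained sketch of that auxiliary-merge approach, using the sample data from the question; the last line only restores the original column order, and aux/out are illustrative names:
import pandas as pd

df = pd.DataFrame({'country': ['US', 'US', 'AU', 'AU'],
                   'product': ['T', 'A', 'T', 'A'],
                   'installs': [100, 5, 500, 20],
                   'purchases': [100, 5, 500, 20]})

# max installs per country (the T row, assuming T always holds the maximum)
aux = df.groupby('country').installs.max().reset_index()

# swap the installs column for the per-country maximum
out = df.drop('installs', axis=1).merge(aux, how='left', on='country')
out = out[['country', 'product', 'installs', 'purchases']]
print(out)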

How to combine two dataframes and average like values?

I'm pretty new to machine learning. I have two dataframes that contain movie ratings. Some rows have the same movie title but different numeric ratings, while other rows have movie titles that the other dataframe doesn't have. I was wondering how I could combine the two dataframes and average any ratings that share the same movie name. Thanks for the help!
You can use pd.concat with GroupBy.agg:
# df = pd.DataFrame({'Movie': ['IR', 'R'], 'rating': [95, 90], 'director': ['SB', 'RC']})
# df1 = pd.DataFrame({'Movie': ['IR', 'BH'], 'rating': [93, 88], 'director': ['SB', 'RC']})
(pd.concat([df, df1]).groupby('Movie', as_index=False)
   .agg({'rating': 'mean', 'director': 'first'}))
Movie rating director
0 BH 88 RC
1 IR 94 SB
2 R 90 RC
Or df.append
df.append(df1).groupby('Movie',as_index=False).agg({'rating':'mean', 'director':'first'})
Movie rating director
0 BH 88 RC
1 IR 94 SB
2 R 90 RC
If you want the Movie column as the index, remove as_index=False from groupby; the as_index parameter of df.groupby defaults to True, so Movie will become the index.
If you want to maintain the original order of appearance, set the sort parameter to False in groupby:
(df.append(df1).groupby('Movie', as_index=False, sort=False)
   .agg({'rating': 'mean', 'director': 'first'}))
Movie rating director
0 IR 94 SB
1 R 90 RC
2 BH 88 RC
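Note that DataFrame.append was deprecated and later removed in pandas 2.0, so on recent versions the pd.concat form is the one to use. A minimal sketch of the last example without append, using the sample frames from the comments above (out is an illustrative name):
import pandas as pd

df = pd.DataFrame({'Movie': ['IR', 'R'], 'rating': [95, 90], 'director': ['SB', 'RC']})
df1 = pd.DataFrame({'Movie': ['IR', 'BH'], 'rating': [93, 88], 'director': ['SB', 'RC']})

# concat replaces append; sort=False keeps the first-seen order of Movie groups
out = (pd.concat([df, df1])
         .groupby('Movie', as_index=False, sort=False)
         .agg({'rating': 'mean', 'director': 'first'}))
print(out)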

Update a value based on another dataframe pairing

I have a problem where I need to update a value if two people were at the same table.
import pandas as pd

data = {"p1": ['Jen', 'Mark', 'Carrie'],
        "p2": ['John', 'Jason', 'Rob'],
        "value": [10, 20, 40]}
df = pd.DataFrame(data, columns=["p1", "p2", "value"])

meeting = {'person': ['Jen', 'Mark', 'Carrie', 'John', 'Jason', 'Rob'],
           'table': [1, 2, 3, 1, 2, 3]}
meeting = pd.DataFrame(meeting, columns=['person', 'table'])
df is a relationship table and value is the field I need to update. So if two people were at the same table in the meeting dataframe, then update the corresponding df row accordingly.
For example: Jen and John were both at table 1, so I need to update the row in df that has Jen and John and set their value to value + 100, i.e. 110.
I thought about doing a self join on meeting to get the format to match that of df, but I'm not sure if this is the easiest or fastest approach.
IIUC you could set the person as index in the meeting dataframe, and use its table values to replace the names in df. Then if both mappings have the same value (table), replace with df.value+100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
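For comparison, a minimal sketch of the lookup-based variant of the self join the asker hints at, starting from the original df and meeting frames defined in the question (tables and same_table are illustrative names):
# look up each person's table number
tables = meeting.set_index('person')['table']
same_table = df['p1'].map(tables) == df['p2'].map(tables)

# add 100 to value wherever both people sat at the same table
df.loc[same_table, 'value'] += 100
print(df)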
This could be an approach, using df.to_records():
groups = meeting.groupby('table').agg(set)['person'].to_list()
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1]
               for row in df.to_records()]
Output:
df
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140