How can I grab rows with max date from Pandas dataframe? - python

I have a Pandas dataframe that looks like this:
and I want to grab, for each distinct ID, the row with the max date, so that my final result looks something like this:
My date column is of data type 'object'. I have tried grouping and then grabbing the max, like the following:
idx = df.groupby(['ID','Item'])['date'].transform(max) == df_Trans['date']
df_new = df[idx]
However, I am unable to get the desired result.

idxmax
This should work so long as the index is unique, or at least the maximal index isn't repeated (otherwise .loc will pull in extra rows).
df.loc[df.groupby('ID').date.idxmax()]
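For illustration, a minimal sketch on made-up data (since the question says the date column is of dtype 'object', it is converted first):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Item': ['a', 'b', 'c', 'd'],
    'date': ['2020-01-01', '2020-03-01', '2020-02-01', '2020-01-15'],
})
df['date'] = pd.to_datetime(df['date'])  # the 'date' column arrives as object dtype

print(df.loc[df.groupby('ID')['date'].idxmax()])
#    ID Item       date
# 1   1    b 2020-03-01
# 2   2    c 2020-02-01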
OP's approach (edited)
This should work as long as the maximal values are unique; otherwise, you'll get all rows equal to the maximum.
df[df.groupby('ID')['date'].transform('max') == df['date']]
W-B's solution
Also a very good solution:
df.sort_values(['ID', 'date']).drop_duplicates('date', keep='last')

The last bit of code from piRSquared's answer is wrong.
We are trying to get distinct IDs, so the column used in drop_duplicates should be 'ID'. keep='last' would then retrieve the last (and max) date for each ID.
df.sort_values(['ID', 'date']).drop_duplicates('ID', keep='last')
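On the made-up frame from the sketch above, this corrected line gives the same result as the idxmax approach:
print(df.sort_values(['ID', 'date']).drop_duplicates('ID', keep='last'))
#    ID Item       date
# 1   1    b 2020-03-01
# 2   2    c 2020-02-01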

My answer is a generalization of piRSquared's answer:
manykey indicates the keys from which the mapping is desired (the "many" side)
onekey indicates the keys to which the mapping is desired (the "one" side)
sortkey is a sortable key; asc controls the sort direction and defaults to True (ascending, as is the Python standard)
def get_last(df: pd.DataFrame, manykey: list[str], onekey: list[str], sortkey, asc=True):
    # Sort on the key, keep the last row of each manykey group, then project onto the mapping columns
    return df.sort_values(sortkey, ascending=asc).drop_duplicates(subset=manykey, keep='last')[manykey + onekey]
In your case the answer should be:
get_last(df, ["ID"], ["Item"], "date")
Note that I am using onekey explicitly because I want to drop the rest of the columns (if there are others in the table) and create a mapping.
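A quick check on made-up data:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Item': ['a', 'b', 'c'],
    'date': pd.to_datetime(['2020-01-01', '2020-03-01', '2020-02-01']),
})

print(get_last(df, ['ID'], ['Item'], 'date'))
#    ID Item
# 2   2    c
# 1   1    b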

Related

Drop duplicate rows based on a column value

I'm trying to write a small piece of code to drop duplicate rows based on a column's unique values. What I'm trying to accomplish is: get all the unique values from user_id and, for each one, drop duplicates based on the date_time column using drop_duplicates, keeping the last occurrence.
code:
for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)
The problem with this code is that it literally does nothing; I tried and tried, and nothing happens.
Quick note: I have 100k different user_id values (unique), so I need a solution that works as fast as possible for this problem.
The problem is that df.loc returns a copy of the original dataframe, so your modification doesn't affect the original dataframe. See "What rules does Pandas use to generate a view vs a copy?" on Stack Overflow for more detail.
If you want to drop duplicates on part of the dataframe, you can get the duplicated items' indices and drop based on those:
for i in recommender_train_df['user_id'].unique():
    # Mark this user's rows that are duplicated on date_time, keeping the last occurrence
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    indices = mask[mask].index
    recommender_train_df.drop(indices, inplace=True)
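Given the 100k distinct users mentioned above, a vectorized alternative avoids the Python-level loop entirely; this is a sketch under the same column-name assumptions and is equivalent to the loop:
# Drop rows that share both user_id and date_time, keeping the last occurrence
recommender_train_df = recommender_train_df.drop_duplicates(subset=['user_id', 'date_time'], keep='last')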

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The 'ID' column in dataframe DT is a subset of the 'ID' column in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows from dataframe MR whose 'ID' values are equal to the values in DT['ID'], knowing that values in 'ID' can appear several times in the same column?
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed in https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
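For illustration, a minimal sketch on made-up frames (column names assumed from the question) showing how the two differ when IDs repeat:
import pandas as pd

MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'x': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 2], 'y': ['p', 'q']})

print(MR.loc[MR.ID.isin(DT.ID)])  # keeps MR's rows and columns only
print(pd.merge(MR, DT, on='ID'))  # adds DT's columns; repeated IDs multiply rows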

How to drop duplicated rows in data frame based on certain criteria?

Our objective right now is to drop the duplicate player rows, but keep the row with the highest count in the G column (Games played). What code can we use to achieve this? I've attached a link to the image of our Pandas output here.
You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence.
df.drop_duplicates(['Player'], keep='first')
There are two ways that I can think of:
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'] , keep = 'last')
The first method uses groupby to group values by Player and collapses each group to its maximum of G; note that this keeps only the Player and G columns. The second one uses the drop_duplicates method of Pandas to achieve the same while keeping the full rows.
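A minimal sketch on made-up data (column names assumed from the image in the question) showing that difference:
import pandas as pd

df = pd.DataFrame({'Player': ['A', 'A', 'B'],
                   'Team': ['X', 'Y', 'Z'],
                   'G': [10, 82, 50]})

print(df.groupby('Player', as_index=False)['G'].max())
#   Player   G
# 0      A  82
# 1      B  50

print(df.sort_values('G').drop_duplicates(['Player'], keep='last'))
#   Player Team   G
# 2      B    Z  50
# 1      A    Y  82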
Try this. Assume your dataframe object is df1; then:
series = df1.groupby('Player')['G'].max()  # this returns a Series
pd.DataFrame(series)
Let me know if this works for you.

Python Pandas sorting of the middle column when using [groupby]

I am using Python pandas and would like to sort the output by the middle column of the tables below (I have shown the output I am getting and the desired output that I want to get).
I am using the groupby function within pandas to get the output; however, it is sorting by the count column (see the output table below). Instead, I want to sort by the YOB column (please see the desired output table).
Also, how do I calculate the mean year of birth for each country?
import pandas as pd

xlpath = "C:/Users/Username/documents/Datafile.xlsx"
df = pd.read_excel(xlpath)  # no need to wrap the path in eval()
y = df.groupby('COUNTRY').YOB.value_counts(ascending=False)
print(y)
Output:
Desired Output:
Looking forward to your feedback.
Thanks
With the assumption that you do not care about the ordering of the COUNTRY column (as you have not specified that in the question), here is one way to get the count per country, per year, keeping years in ascending order:
df2 = df.groupby(["COUNTRY", "YOB"]).count()
df2 = df2.sort_index(ascending=True)  # sorts by the group keys: COUNTRY, then YOB
print(df2)
Or in one line:
print(df.groupby(["COUNTRY", "YOB"]).count().sort_index(ascending=True))
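The question also asks for the mean year of birth per country; assuming YOB is numeric, a minimal sketch:
print(df.groupby('COUNTRY')['YOB'].mean())  # mean YOB per country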
One of the ways you can try is to sort the dataframe on YOB before you apply groupby.

dropping all the columns after a matching value

I have the below data frame
and I have a variable ID = 1052107168068132864.
How can I filter the values so that everything after that row is dropped, to get a result like below? In other words, I want to drop all the rows after that ID, including the row itself,
and then update the variable ID to 1052121282324692992 as the current value.
I want to repeat this in a loop, so that every time I get a new data frame the same operation keeps going, and if that ID is already the top value then nothing should happen.
Assuming IDs are unique, using iloc
df.iloc[:df[df.ID == '1052121282324692992'].index.item()]
Using idxmax
idx = (df['ID'] == ID).idxmax()
new_df = df.iloc[:idx, :]
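A sketch with made-up IDs showing the idxmax route with a guard, since idxmax on an all-False mask silently returns the first index label instead of signalling "not found":
import pandas as pd

df = pd.DataFrame({'ID': [1052121282324692992, 1052107168068132864, 1052100000000000000]})
ID = 1052107168068132864

mask = df['ID'] == ID
if mask.any():
    df = df.iloc[:mask.idxmax(), :]  # assumes the default RangeIndex, so labels match positions
print(df)
#                     ID
# 0  1052121282324692992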
