How to drop duplicate rows in a data frame based on certain criteria? - python

Our objective is to drop the duplicate player rows, but keep the row with the highest count in the G column (games played). What code can we use to achieve this?

You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence.
df.drop_duplicates(['Player'], keep='first')

There are two ways that I can think of:
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'], keep='last')
The first method groups the values by Player and collapses each group to the row with the maximum of G, but note that it only returns the Player and G columns. The second uses Pandas' drop_duplicates method to achieve the same result while keeping all columns.
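If you need to keep every column while still taking the maximum of G, a third variant (my sketch, not part of the original answers) is to select the rows whose G is the group maximum via idxmax:
df.loc[df.groupby('Player')['G'].idxmax()]  # one full row per Player, the one with the highest G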

Try this:
Assuming your dataframe object is df1:
series = df1.groupby('Player')['G'].max()  # this returns a Series
pd.DataFrame(series)
Let me know if this works for you.
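A small follow-up: calling reset_index on that Series turns it straight into a DataFrame with Player as a regular column rather than the index:
df2 = series.reset_index()  # DataFrame with Player and G columns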

Related

How to reshape dataframe with pandas?

I have a data frame that contains product sales for each day from 2018 to 2021. The dataframe contains four columns (Date, Place, ProductCategory and Sales). In the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the data is filled in, I would like to delete the rows that have no data in ProductCategory. I would like to do this in Python pandas.
A sample of my data set, and the dataframe I would like to end up with, were shown as images.
Use fillna with method 'ffill', which propagates the last valid observation forward to the next valid one. Then drop the rows that still contain NAs.
df['Date'].fillna(method='ffill',inplace=True)
df['Place'].fillna(method='ffill',inplace=True)
df.dropna(inplace=True)
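Side note: on recent pandas (2.1 and later, if I remember the version right), fillna(method=...) is deprecated in favour of the dedicated ffill method, which does the same thing:
df['Date'] = df['Date'].ffill()
df['Place'] = df['Place'].ffill()
df.dropna(inplace=True)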
You can use the forward-filling method to replace null values with the nearest valid value above them, then drop the rows that are still missing ProductCategory:
df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill')
df.dropna(subset=['ProductCategory'], inplace=True)
Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
Compute the frequency of categories in the column by plotting; from the plot you can see bars representing the most repeated values:
df['column'].value_counts().plot.bar()
Then get the most frequent value using the index: index[0] gives the most repeated value, index[1] the second most repeated, and so on, so you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
Then fill the missing values with it:
df['column'].fillna(most_frequent_attribute, inplace=True)
To fill multiple columns with the same method, just define this as a function, like this:
def impute_nan(df, column):
    most_frequent_category = df[column].mode()[0]
    df[column].fillna(most_frequent_category, inplace=True)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)

Pandas DataFrame: info() function for one column only

I have a dataframe named df_train with 20 columns. Is there a pythonic way to view info on only one column by selecting its name?
Basically I am trying to loop through the df and extract the number of unique values and the number of missing values:
print("\nUnique Values:")
for col in df_train.columns:
print(f'{col:<25}: {df_train[col].nunique()} unique values. \tMissing values: {} ')
If you want the total number of null values, this is the pythonic way to achieve it:
df_train[col].isnull().sum()
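Plugging that into the loop from the question gives (my assembly of the two snippets):
print("\nUnique Values:")
for col in df_train.columns:
    print(f'{col:<25}: {df_train[col].nunique()} unique values. \tMissing values: {df_train[col].isnull().sum()}')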
Yes, there is a way to select individual columns from a dataframe.
df_train['your_column_name']
This will extract only the column with <your_column_name>.
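If your pandas version is 1.4 or newer, you should also (as far as I recall, Series.info() was added in 1.4) be able to call info() on the selected column directly:
df_train['your_column_name'].info()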
PS: This is my first StackOverflow answer. Please be nice.

Multiindex filtering of grouped data

I have a pandas dataframe where I have done a groupby. The groupby result (shown as an image in the question) has a multilevel index ('ga:dimension3', 'ga:data') and a single column ('ga:sessions').
I am looking to create a dataframe with the first level of the index ('ga:dimension3') and the first date for each first-level index value.
I can't figure out how to do this.
Guidance appreciated.
Thanks in advance.
Inspired by @ggaurav's suggestion of using first(), I think that the following should do the work (df is the data you provided, after the groupby):
result=df.reset_index(1).groupby('ga:dimension3').first()
You can directly use first. As you need data based on just 'ga:dimension3', you can group by it (or by level=0):
df.groupby(level=0).first()
Without groupby, you can get the level 0 index values and drop the duplicated ones, keeping the first:
df[~df.index.get_level_values(0).duplicated(keep='first')]
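For anyone wanting to test these, here is a toy frame (names and values guessed from the question text, since the screenshots are not reproduced) on which all three approaches can be checked:
import pandas as pd
df = pd.DataFrame(
    {'ga:sessions': [3, 1, 4, 1]},
    index=pd.MultiIndex.from_tuples(
        [('a', '2021-01-01'), ('a', '2021-01-02'),
         ('b', '2021-01-01'), ('b', '2021-01-03')],
        names=['ga:dimension3', 'ga:data']))
print(df.groupby(level=0).first())  # one row per ga:dimension3, keeping its first entry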

Plot the grouped fields in the pandas groupby function

I need to group by and aggregate a pandas df with the following columns:
['CpuEff',
'my_remote_host',
'GLIDEIN_CMSSite',
'BytesRecvd',
'BytesSent',
'CMSPrimaryPrimaryDataset',
'CMSPrimaryDataTier',
'DESIRED_CMSDataset',
'DESIRED_CMSPileups',
'type_prefix',
'CMS_Jobtype',
'CMS_Type',
'CommittedTime',
'CommittedSlotTime',
'CpusProvisioned',
'CpuTimeHr',
'JobRunCount',
'LastRemoteHost']
Then I apply the groupby, calculate the mean of each field and pass the result into a new df:
grouped = df.groupby(['DESIRED_CMSDataset'])
df_mean=grouped.mean()
df_mean
And check the new df's columns:
list(df_mean.columns)
['CpuEff',
'BytesRecvd',
'BytesSent',
'CommittedTime',
'CommittedSlotTime',
'CpusProvisioned',
'CpuTimeHr',
'JobRunCount']
The issue is that I want to plot a histogram showing 'DESIRED_CMSDataset' and the respective mean values of each column, but I can't, because that column disappears in the new dataframe.
Is there any way to perform the same operation without losing the grouped column?
If you aggregate this way, your group column becomes the index of the new df. Try running df_mean = df_mean.reset_index(). Adding as_index=False during the groupby also works. You could also plot df_mean.index if you want to keep it that way.
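A minimal sketch of both options, using the column names from the question (the y column is just an example):
df_mean = df.groupby('DESIRED_CMSDataset', as_index=False).mean(numeric_only=True)
df_mean.plot.bar(x='DESIRED_CMSDataset', y='CpuEff')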

How to drop row at certain index in every group in GroupBy object?

I'm trying to drop a row at certain index in every group inside a GroupBy object.
The best I have been able to manage is:
import pandas as pd
x_train = x_train.groupby('ID')
x_train.apply(lambda x: x.drop([0], axis=0))
However, this doesn't work. I have spent a whole day on this with no solution, so I have turned to Stack Overflow.
Edit: A solution for any index value is needed as well
You can do it with cumcount:
idx= x_train.groupby('ID').cumcount()
x_train = x_train[idx!=0]
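Since the edit asks for any index value: the same idea works for any position n within each group, for example
n = 2  # drop the third row of every group
x_train = x_train[x_train.groupby('ID').cumcount() != n]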
The problem with using drop inside the groupby is that the index numbers are still the same as before the groupby. So when using drop([0]), only the row that originally had 0 as its index will be dropped. In the other groups, there will not be any row with index 0, as long as the index is unique.
If you want to use drop then what you can do is to first use reset_index inside the grouped data:
x_train.groupby('ID').apply(lambda x: x.reset_index().drop([0]))
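Note that apply here leaves 'ID' as an extra outer index level on the result; since 'ID' is still present as a column, you can flatten the index afterwards if needed:
result = x_train.groupby('ID').apply(lambda x: x.reset_index().drop([0]))
result = result.reset_index(drop=True)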
