Drop duplicates keeping the row with the highest value in another column - python

import pandas as pd

a = {'Name': ['John', 'Mary', 'John'], 'Count': [10, 22, 50]}
df1 = pd.DataFrame(a)
Given a DataFrame like this, I want to compare all rows that share the same "Name" value and keep the one with the highest "Count". I'm not sure how to do this with a DataFrame in Python.
Ex: in the case above, the answer would be:
Name Count
Mary 22
John 50
The lower value John 10 has been dropped (I only want to see the highest value of "Count" based on the same value for "Name").
In SQL it would be something like a SELECT CASE query (where I compare rows with the same Name and keep the greater Count, repeatedly, to determine the highest number), or a for loop over each name. But as I understand it, looping over a DataFrame is a bad idea due to the nature of the object.
Is there a way to do this with a DataFrame in Python? I could create a new DataFrame for each unique name (one with only John) and then take the highest value (df.value()[:1] or similar), but as I have many hundreds of unique entries that seems like a terrible solution. :D

Either sort_values and drop_duplicates:
df1.sort_values('Count').drop_duplicates('Name', keep='last')
Name Count
1 Mary 22
2 John 50
Or, like miradulo said, groupby and max:
df1.groupby('Name')['Count'].max().reset_index()
Name Count
0 John 50
1 Mary 22
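Note that groupby('Name')['Count'].max() drops any other columns. If the frame has more columns whose values you want to keep from the winning row, one option (a minimal sketch on the df1 above) is to select the rows of each group's maximum with idxmax:
# idxmax returns the index label of each group's maximum Count;
# .loc then pulls those entire rows back out, keeping all columns.
df1.loc[df1.groupby('Name')['Count'].idxmax()]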

Related

Pandas Group By and Sorting by multiple columns

I have some initial data that looks like this:
code type value
1111 Golf Acceptable
1111 Golf Undesirable
1111 Basketball Acceptable
1111 Basketball Undesirable
1111 Basketball Undesirable
and I'm trying to group it on the code and type columns to get the row with the most occurrences. In the case of a tie, I want to select the row with the value Undesirable. So the example above would become this:
code type value
1111 Golf Undesirable
1111 Basketball Undesirable
Currently I'm doing it this way:
df = pd.DataFrame(df.groupby(['code', 'type', 'value']).size().reset_index(name='count'))
df = df.sort_values(['type', 'count'])
df = pd.DataFrame(df.groupby(['code', 'type']).last().reset_index())
I've done some testing of this and it seems to do what I want, but I don't like trusting the .last() call and hoping that, in the case of a tie, Undesirable happens to be sorted last. Is there a better way to group this to ensure I always get the higher count, or in the case of a tie select the Undesirable value?
Performance isn't too much of an issue as I'm only working with around 50k rows or so.
Case 1
If the value column only contains the two values ['Acceptable', 'Undesirable'], we can rely on the fact that 'Acceptable' < 'Undesirable' alphabetically. In this case you can use the following simplified solution.
Create an auxiliary column called count which contains the number of rows per code, type and value combination. Then sort the dataframe by count and value, and drop the duplicates per code and type, keeping the last row.
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
Case 2
If the value column contains other values and you can't rely on alphabetical ordering, use the following solution. It is similar to the one proposed in Case 1, but it first converts the value column to an ordered Categorical type before sorting:
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df['value'] = pd.Categorical(df['value'], categories=['Acceptable', 'Undesirable'], ordered=True)
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
Result
code type value count
1 1111 Golf Undesirable 1
4 1111 Basketball Undesirable 2
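If you don't want the auxiliary count column in the final result, you can chain a drop onto the pipeline above (a small sketch, reusing df and c from Case 1/Case 2):
# Same pipeline, minus the helper column.
out = (df.sort_values(['count', 'value'])
         .drop_duplicates(c, keep='last')
         .drop(columns='count'))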
Another possible solution, which is based on the following ideas:
Grouping the data by code and type.
If a group has more than one row (len(x) > 1) and all its rows have the same count ((x['count'] == x['count'].min()).all()), return the row with Undesirable.
Otherwise, return the row where the count is maximum (x.iloc[[x['count'].argmax()]]).
(df.groupby(['code', 'type', 'value'])['value'].size()
   .reset_index(name='count')
   .groupby(['code', 'type'])
   .apply(lambda x: x.loc[x['value'] == 'Undesirable']
          if (len(x) > 1) and (x['count'] == x['count'].min()).all()
          else x.iloc[[x['count'].argmax()]])
   .reset_index(drop=True)
   .drop('count', axis=1))
Output:
code type value
0 1111 Basketball Undesirable
1 1111 Golf Undesirable

extract the top values from one column based on another column

So basically I have a dataframe called df, whose columns are the user ID, the genre that each user played, and the total number of streams. How can I extract the top 10 genres with the most streams, while showing the total number of users who streamed them?
so what I thought of doing is to sort the column values like this:
df_genre.sort_values(by="total_streams",ascending=False)
and then take the top genres, but the result was not what I wanted: the same genre appeared multiple times instead of its streams being summed. How can I fix it?
I think this is what you are looking for:
Data:
ID,genre,plays
12345,pop,23
12345,pop,576
12345,dance,18
12345,world,45
12345,dance,23
12345,pop,456
Input:
df = df.groupby(['ID','genre'])['plays'].sum().reset_index()
df.sort_values(by=['plays'], ascending=False)
Output:
ID genre plays
1 12345 pop 1055
2 12345 world 45
0 12345 dance 41
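To limit the result to the top 10 genres, as the question asks, you can chain head(10) onto the sorted frame; a minimal sketch based on the snippet above, starting from the raw data:
# Aggregate plays per ID and genre, sort descending, keep the 10 largest.
top10 = (df.groupby(['ID', 'genre'])['plays'].sum()
           .reset_index()
           .sort_values(by='plays', ascending=False)
           .head(10))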

Get info about other columns in same row from pandas search

I have a .csv file that looks like the following:
Country Number
United 19
Ireland 17
Afghan 20
My goal is to use python-pandas to find the row with the smallest number, and get the country name of that row.
I know I can use this to get the value of the smallest number.
min = df['Number'].min()
How can I get the country name at the smallest number?
I couldn't figure out how to put in the variable "min" in an expression.
I would use a combination of finding the min and an iloc lookup:
min_number = df['Number'].min()
min_index = df.loc[df['Number'] == min_number].index.values[0]
df['Country'].iloc[min_index]
The only downside to this is if you have multiple countries with the same minimal number; in that case you would have to provide more criteria to determine the desired country.
If you expect the minimal value to be unique, use idxmin:
df.loc[df['Number'].idxmin(), 'Country']
Output: Ireland
If there are multiple minima, this will yield the first one.
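If instead you want every country tied for the minimum rather than just the first, a boolean mask works; a small sketch on the same df:
# All rows whose Number equals the overall minimum.
df.loc[df['Number'] == df['Number'].min(), 'Country']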

How to append dataframe with selected columns having higher feature score

Hi, I am new to Python, so let me know if the question is not clear.
Here is my dataframe:
df = pd.DataFrame(df_test)
age bmi children charges
0 19 27.900 0 16884.92400
1 18 33.770 1 1725.55230
2 28 33.000 3 4449.46200
3 33 22.705 0 21984.47061
I am applying SelectKBest feature selection with the chi-squared test to this numerical data:
from sklearn.feature_selection import SelectKBest, chi2

X_clf = numeric_data.iloc[:, 0:(col_len - 1)]
y_clf = numeric_data.iloc[:, -1]
bestfeatures = SelectKBest(score_func=chi2, k=2)
fit = bestfeatures.fit(X_clf, y_clf)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_clf.columns)
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Feature', 'Score']  # name the columns (matches the output below)
This is my output:
Feature Score
0 age 6703.764216
1 bmi 1592.481991
2 children 1752.136519
I now want my dataframe to contain only the features with the 2 highest scores, but without hardcoding the column names.
I have tried storing the column names in a list and keeping those with the highest scores, but I get a ValueError. Is there any method/function I could try that stores the selected columns and then filters the dataframe based on their scores?
Expected output: column 'bmi' is gone, as it has the lowest of the 3 scores.
age children charges
0 19 0 16884.92400
1 18 1 1725.55230
2 28 3 4449.46200
3 33 0 21984.47061
So first you want to find out which features have the largest scores, then collect the names of the columns you do not want to keep.
colToDrop = featureScores.drop(featureScores['Score'].nlargest(2).index)['Feature'].values
Next we just filter the original df and remove those columns from the columns list
df[df.columns.drop(colToDrop)]
I believe you need to work on the dataframe featureScores to keep the first 2 features with the highest Score and then use this values as a list to filter the columns in the original dataframe. Something along the lines of:
important_features = featureScores.sort_values('Score',ascending=False)['Feature'].values.tolist()[:2] + ['charges']
filtered_df = df[important_features]
The sort_values() call makes sure the features (in case there are more) are sorted from highest score to lowest. We then create a list of the first 2 values of the Feature column (which has already been sorted) with .values.tolist()[:2]. Since you also want to include the charges column in your output, we append it manually with + ['charges'] to our list of important_features.
Finally, we create filtered_df by selecting only the important_features columns from the original df.
Edit based on comments:
If you can guarantee charges will be the last column in your original df then you can simply do:
important_features = featureScores.sort_values('Score',ascending=False)['Feature'].values.tolist()[:2] + [df.columns[-1]]
filtered_df = df[important_features]
I see you have previously defined your y column with y_clf = numeric_data.iloc[:, -1]; since that is a Series, you can then use y_clf.name or df.columns[-1], either should work fine.
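As an aside, SelectKBest can report the selected columns directly via get_support(), which avoids sorting the scores by hand. A minimal sketch, assuming the fit object and X_clf from the question:
# Boolean mask of the k selected features, in X_clf column order.
mask = fit.get_support()
selected = X_clf.columns[mask].tolist()
# Keep the selected features plus the target column.
filtered_df = df[selected + [df.columns[-1]]]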

How to add the counts of the same name in a for loop and make lists (Python)

I am still relatively new to Python, and I am trying to do something more complicated. How can I use a for loop or iteration to count the names ranked 1, add up the counts, and place the totals in one list and the counted names in a separate list? The reason for this is that I will create a plot, which I can do, but I am stuck on how to separate the total counts from the names already counted.
Using this as an example DataFrame:
RANK NAME
0 1 EMILY
1 1 DANIEL
2 1 EMILY
3 1 ISABELLA
You can do this to get the counted names:
counted_names = name_file[name_file.RANK == 1]['NAME'].value_counts()
print(counted_names)
EMILY 2
DANIEL 1
ISABELLA 1
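Since the end goal is a plot, note that counted_names already pairs each name with its count, so you can pull the two lists directly or plot straight from it; a small sketch on the result above:
# The two lists the question asks for.
names = counted_names.index.tolist()   # ['EMILY', 'DANIEL', 'ISABELLA']
counts = counted_names.tolist()        # [2, 1, 1]
# Or plot the counts directly (requires matplotlib).
counted_names.plot(kind='bar')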
pandas.groupby()
To solve any form of aggregation in Python, all you need is to crack the groupby function.
For your case, if you want to sum over a 'Count' column for all the unique names and later plot it, use DataFrame.groupby().
Make sure you convert it into a DataFrame first and then apply the groupby magic:
name_file = pd.DataFrame(name_file)
name_file.groupby('Name').agg({'Count':'sum'})
This gives you the aggregated sum of counts for each unique name in your dataframe.
To get the count of occurrences of each name, use size().reset_index() as below:
pd.DataFrame(name_file).groupby('Name').size().reset_index()
This returns the frequency of occurrence of each unique name in the name_file.
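Note that size() followed by a bare reset_index() leaves the counts in a column labelled 0; passing a name gives it a proper label, a small tweak on the line above:
# Name the size column explicitly instead of the default 0.
pd.DataFrame(name_file).groupby('Name').size().reset_index(name='Count')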
Hope this helps! Cheers!
