Count the frequency that a value occurs in a dataframe column - python

I have a dataset
category
cat a
cat b
cat a
I'd like to be able to return something like (showing unique values and frequency)
category freq
cat a 2
cat b 1

Use value_counts() as @DSM commented.
In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df['a'].value_counts()
Out[37]:
b 3
a 2
s 2
dtype: int64
Also groupby and count. Many ways to skin a cat here.
In [38]:
df.groupby('a').count()
Out[38]:
a
a
a 2
b 3
s 2
[3 rows x 1 columns]
See the online docs.
If you wanted to add frequency back to the original dataframe use transform to return an aligned index:
In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df
Out[41]:
a freq
0 a 2
1 b 3
2 s 2
3 s 2
4 b 3
5 a 2
6 b 3
[7 rows x 2 columns]

If you want to apply to all columns you can use:
df.apply(pd.value_counts)
This will apply a column based aggregation function (in this case value_counts) to each of the columns.

df.category.value_counts()
This short little line of code will give you the output you want.
If your column name has spaces you can use
df['category'].value_counts()

df.apply(pd.value_counts).fillna(0)
value_counts - returns an object containing counts of unique values
apply - counts the frequency in every column; if you set axis=1, you get the frequency in every row
fillna(0) - makes the output tidier by replacing NaN with 0
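To see where fillna(0) matters, here is a minimal sketch with a hypothetical second column, so one value is missing from one column and shows up as NaN before the fill:
import pandas as pd

# Illustrative two-column frame (the second column 'b' is made up for this sketch)
df = pd.DataFrame({'a': list('abssbab'), 'b': list('bbbaaab')})
print(df.apply(pd.value_counts).fillna(0))
#    a    b
# a  2  3.0
# b  3  4.0
# s  2  0.0   <- 's' never appears in column 'b'; the NaN became 0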

In pandas 0.18.1, groupby together with count does not give the frequency of unique values:
>>> df
a
0 a
1 b
2 s
3 s
4 b
5 a
6 b
>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]
However, the unique values and their frequencies are easily determined using size:
>>> df.groupby('a').size()
a
a 2
b 3
s 2
With df.a.value_counts(), sorted values (in descending order, i.e. largest count first) are returned by default.
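If you need a different ordering or relative frequencies, value_counts takes a few keyword arguments; a minimal sketch on the same toy frame as above:
import pandas as pd

df = pd.DataFrame({'a': list('abssbab')})
df['a'].value_counts()                  # descending counts (default)
df['a'].value_counts(ascending=True)    # smallest count first
df['a'].value_counts(normalize=True)    # relative frequencies instead of counts
df['a'].value_counts().sort_index()     # order by the value labels instead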

Using list comprehension and value_counts for multiple columns in a df
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
https://stackoverflow.com/a/28192263/786326

As everyone said, the fastest solution is to do:
df.column_to_analyze.value_counts()
But if you want to use the output in your dataframe, with this schema:
df input:
category
cat a
cat b
cat a
df output:
category counts
cat a 2
cat b 1
cat a 2
you can do this:
df['counts'] = df.category.map(df.category.value_counts())
df

If your DataFrame has values with the same type, you can also set return_counts=True in numpy.unique().
index, counts = np.unique(df.values,return_counts=True)
np.bincount() could be faster if your values are integers.
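A minimal sketch of the integer case with np.bincount (assuming non-negative integers; the column here is illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 2, 2]})
counts = np.bincount(df['a'].to_numpy())   # counts[i] is how often the value i occurs
print(counts)                              # [1 2 3]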

You can also do this with pandas by first casting your columns to the categorical dtype (dtype="category"), e.g.
cats = ['client', 'hotel', 'currency', 'ota', 'user_country']
df[cats] = df[cats].astype('category')
and then calling describe:
df[cats].describe()
This will give you a nice table of value counts and a bit more :):
client hotel currency ota user_country
count 852845 852845 852845 852845 852845
unique 2554 17477 132 14 219
top 2198 13202 USD Hades US
freq 102562 8847 516500 242734 340992
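If you only want one of those summary rows, describe() returns a regular DataFrame, so you can slice it; a small sketch using the same cats list as above:
summary = df[cats].describe()
summary.loc['freq']     # modal frequency per column
summary.loc['unique']   # number of distinct values per column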

Without any libraries, you could do this instead:
def to_frequency_table(data):
    frequencytable = {}
    for key in data:
        if key in frequencytable:
            frequencytable[key] += 1
        else:
            frequencytable[key] = 1
    return frequencytable
Example:
to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}

I believe this should work fine for any DataFrame columns list.
def column_list(x):
    column_list_df = []
    for col_name in x.columns:
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return pd.DataFrame(column_list_df)

column_list(df).rename(columns={0: "Feature", 1: "Value_count"})
The function column_list iterates over the column names and reports the number of unique values in each column.

@metatoaster has already pointed this out.
Go for Counter. It's blazing fast.
import pandas as pd
from collections import Counter
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])
Timers
%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop
%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop
%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop
%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop
Cheers!

The following code creates a frequency table for the values in a column called "Total_score" in a dataframe called "smaller_dat1", and then returns the number of times the value 300 appears in that column.
valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]

n_values = data.income.value_counts()

# First unique value count
n_at_most_50k = n_values[0]

# Second unique value count
n_greater_50k = n_values[1]

n_values
Output:
<=50K    34014
>50K     11208
Name: income, dtype: int64

Output of (n_greater_50k, n_at_most_50k):
(11208, 34014)
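A hedged note on the indexing above: plain n_values[0] is positional access on a Series, which newer pandas versions deprecate; .iloc (by position) or the labels themselves are more explicit:
n_values = data.income.value_counts()
n_at_most_50k = n_values.iloc[0]     # largest count, by position
n_greater_50k = n_values.iloc[1]     # second count, by position
# or by label:
n_at_most_50k = n_values['<=50K']
n_greater_50k = n_values['>50K']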

Your data:
category
cat a
cat b
cat a
Solution:
df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()
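To get exactly the two-column category/freq table the question asks for, a minimal sketch (the column produced by reset_index is named differently across pandas versions, so the names are set explicitly):
freq_table = df['category'].value_counts().reset_index()
freq_table.columns = ['category', 'freq']
print(freq_table)
#   category  freq
# 0    cat a     2
# 1    cat b     1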

Related

I try to count the same numbers of a column in a DF with pandas [duplicate]


How can I group by and count similar elements in a pandas dataframe? [duplicate]


Multiindex pandas groupby + aggregate, keep full index

I have a two-level hierarchically-indexed sequence of integers.
>> s
id1 id2
1 a 100
b 10
c 9
2 a 2000
3 a 5
b 10
c 15
d 20
...
I want to group by id1, and select the maximum value, but have the full index in the result. I have tried the following:
>> s.groupby(level=0).aggregate(np.max)
id1
1 100
2 2000
3 20
But result is indexed by id1 only. I want my output to look like this:
id1 id2
1 a 100
2 a 2000
3 d 20
A related, but more complicated, question was asked here:
Multiindexed Pandas groupby, ignore a level?
As it states, the answer is kind of a hack.
Does anyone know a better solution? If not, what about the special case where every value of id2 is unique?
One way to select full rows after a groupby is to use groupby/transform to build a boolean mask and then use the mask to select the full rows from s:
In [110]: s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]
Out[110]:
id1 id2
1 a 100
2 a 2000
3 d 20
Name: s, dtype: int64
Another way, which is faster in some cases -- such as when there are a lot of groups -- is to merge the max values m into a DataFrame along with the values in s, and then select rows based on equality between m and s:
def using_merge(s):
    m = s.groupby(level=0).agg(np.max)
    df = s.reset_index(['id2'])
    df['m'] = m
    result = df.loc[df['s'] == df['m']]
    del result['m']
    result = result.set_index(['id2'], append=True)
    return result['s']
Here is an example showing using_merge, while more complicated, may be faster than using_transform:
import numpy as np
import pandas as pd
def using_transform(s):
    return s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]
N = 10**5
id1 = np.random.randint(100, size=N)
id2 = np.random.choice(list('abcd'), size=N)
index = pd.MultiIndex.from_arrays([id1, id2])
ss = pd.Series(np.random.randint(100, size=N), index=index)
ss.index.names = ['id1', 'id2']
ss.name = 's'
Timing these two functions using IPython's %timeit function yields:
In [121]: %timeit using_merge(ss)
100 loops, best of 3: 12.8 ms per loop
In [122]: %timeit using_transform(ss)
10 loops, best of 3: 45 ms per loop
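A hedged alternative sketch using idxmax: assuming the (id1, id2) pairs are unique, as in the question, the per-group maxima can be selected back with .loc so both index levels are kept (unlike the transform mask, ties keep only the first occurrence):
import pandas as pd

s = pd.Series(
    [100, 10, 9, 2000, 5, 10, 15, 20],
    index=pd.MultiIndex.from_tuples(
        [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'),
         (3, 'a'), (3, 'b'), (3, 'c'), (3, 'd')],
        names=['id1', 'id2']),
    name='s')

# idxmax per id1 group returns the full (id1, id2) label of each maximum
result = s.loc[s.groupby(level=0).idxmax().tolist()]
print(result)
# id1  id2
# 1    a       100
# 2    a      2000
# 3    d        20
# Name: s, dtype: int64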

pandas python how to count the number of records or rows in a dataframe

Obviously new to Pandas. How can I simply count the number of records in a dataframe?
I would have thought something as simple as this would do it, and I can't seem to even find the answer in searches... probably because it is too simple.
cnt = df.count
print cnt
the above code actually just prints the whole df
To get the number of rows in a dataframe use:
df.shape[0]
(and df.shape[1] to get the number of columns).
As an alternative you can use
len(df)
or
len(df.index)
(and len(df.columns) for the columns)
shape is more versatile and more convenient than len(), especially for interactive work (just needs to be added at the end), but len is a bit faster (see also this answer).
Avoid count(), because it returns the number of non-NA/null observations over the requested axis.
len(df.index) is faster
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(24).reshape(8, 3),columns=['A', 'B', 'C'])
df.loc[5, 'A'] = np.nan
df
# Out:
# A B C
# 0 0 1 2
# 1 3 4 5
# 2 6 7 8
# 3 9 10 11
# 4 12 13 14
# 5 NaN 16 17
# 6 18 19 20
# 7 21 22 23
%timeit df.shape[0]
# 100000 loops, best of 3: 4.22 µs per loop
%timeit len(df)
# 100000 loops, best of 3: 2.26 µs per loop
%timeit len(df.index)
# 1000000 loops, best of 3: 1.46 µs per loop
df.__len__ is just a call to len(df.index)
import inspect
print(inspect.getsource(pd.DataFrame.__len__))
# Out:
# def __len__(self):
# """Returns length of info axis, but here we use the index """
# return len(self.index)
Why you should not use count()
df.count()
# Out:
# A 7
# B 8
# C 8
Regarding your question... counting one field? I decided to make it an example, and I hope it helps...
Say I have the following DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"])
You could count a single column by
df.A.count()
#or
df['A'].count()
both evaluate to 5.
The cool thing (or one of many w.r.t. pandas) is that if you have NA values, count takes that into consideration.
So if I did
df.loc[1::2, 'A'] = np.nan
df.count()
The result would be
A 3
B 5
Simply, row_num = df.shape[0] # gives number of rows, here's the example:
import pandas as pd
import numpy as np
In [322]: df = pd.DataFrame(np.random.randn(5,2), columns=["col_1", "col_2"])
In [323]: df
Out[323]:
col_1 col_2
0 -0.894268 1.309041
1 -0.120667 -0.241292
2 0.076168 -1.071099
3 1.387217 0.622877
4 -0.488452 0.317882
In [324]: df.shape
Out[324]: (5, 2)
In [325]: df.shape[0] ## Gives no. of rows/records
Out[325]: 5
In [326]: df.shape[1] ## Gives no. of columns
Out[326]: 2
The NaN example above misses one piece, which makes it less generic. To do this more "generically", use df['column_name'].value_counts()
This will give you the counts of each value in that column.
d = ['A','A','A','B','C','C'," "," "," "," "," ","-1"]  # for simplicity
df = pd.DataFrame(d)
df.columns = ["col1"]
df["col1"].value_counts()
      5
A     3
C     2
-1    1
B     1
dtype: int64
len(df) gives you 12, so the remaining 5 entries are the blank strings; this also gives a peek at other invalid entries you might want to ignore, such as -1, 0 or "".
Simple method to get the records count:
df.count()[0]
I used the pandas library for this. Here is the code:
import pandas as pd
name_of_file = "test.xlsx"
data = pd.read_excel(name_of_file)
required_column_name = "Post test Number"
print(len(data[required_column_name]))
# this also works -> data["Post test Number"].count()

Remove duplicates by columns A, keeping the row with the highest value in column B

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.
So this:
A B
1 10
1 20
2 30
2 40
3 10
Should turn into this:
A B
1 20
2 40
3 10
I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
This takes the last. Not the maximum though:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10
You can do also something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10
The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
A B
1 1 20
3 2 40
4 3 10
Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()
Simplest solution:
To drop duplicates based on one column:
df = df.drop_duplicates('column_name', keep='last')
To drop duplicates based on multiple columns:
df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first
df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")
without any groupby
Try this:
df.groupby(['A']).max()
I was brought here by a link from a duplicate question.
For just two columns, wouldn't it be simpler to do:
df.groupby('A')['B'].max().reset_index()
And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):
df.loc[df.groupby(...)[column].idxmax()]
For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):
Setup:
n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})
(Adding sort_index() to ensure equal solution):
%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think in your case you don't really need a groupby. I would sort your B column in descending order, then drop duplicates on column A; if you want, you can also get a nice clean index like this:
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
Easiest way to do this:
# First you need to sort this DF as Column A as ascending and column B as descending
# Then you can drop the duplicate values in A column
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step.
d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df
A B
0 1 30
1 1 40
2 2 50
3 3 42
4 1 38
5 2 30
6 3 25
7 1 32
df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)
df
A B
0 1 40
1 2 50
2 3 42
You can try this as well
df.drop_duplicates(subset='A', keep='last')
I referred this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.
df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()
The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
For the original question, the corresponding approach simplifies to
df.groupby('columnA').columnB.agg('max').reset_index().
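If you want the tie-break to be deterministic, a hedged sketch of an alternative to .any(): take the first mode with .iloc[0] (mode() returns its results in sorted order, so ties resolve to the smallest). The column names here mirror the answer above, and the frame is illustrative:
import pandas as pd

df_example = pd.DataFrame({
    'columnA': ['x', 'x', 'x', 'y', 'y'],
    'columnB': ['p', 'p', 'q', 'r', 's'],   # group 'y' has a tie between 'r' and 's'
})
most_common = (
    df_example.groupby('columnA')['columnB']
              .agg(lambda s: s.mode().iloc[0])   # first of the (sorted) modes
              .reset_index()
)
print(most_common)
#   columnA columnB
# 0       x       p
# 1       y       r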
While already-given answers cover the question, I made a small change by adding the column name on which max() is applied, for better code readability.
df.groupby('A', as_index=False)['B'].max()
Very similar method to the selected answer, but sorting the data frame by multiple columns might be an easier way to code.
Firstly, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from highest value to lowest:
df.sort_values(["A", "B"], ascending=False, inplace=True)
Then, drop duplicates on "A" and keep only the first item, which is already the one with the highest value:
df.drop_duplicates(subset="A", inplace=True)
This also works:
a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})
I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():
>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
