How to group and get the three most frequent values? - python

I want to group by ID and get the three most frequent cities per group. For example, I have this original dataframe:
ID City
1 London
1 London
1 New York
1 London
1 New York
1 Berlin
2 Shanghai
2 Shanghai
and the result I want is like this:
ID first_frequent_city second_frequent_city third_frequent_city
1 London New York Berlin
2 Shanghai NaN NaN
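For reference, here is a minimal sketch that rebuilds the sample data above (column names and dtypes are assumed from the question), so the answers below can be run directly:
import pandas as pd
import numpy as np

# sample data from the question: six rows for ID 1, two rows for ID 2
df = pd.DataFrame({
    'ID':   [1, 1, 1, 1, 1, 1, 2, 2],
    'City': ['London', 'London', 'New York', 'London',
             'New York', 'Berlin', 'Shanghai', 'Shanghai'],
})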

The first step is to use SeriesGroupBy.value_counts to count the City values per ID; the advantage is that the values are already sorted by frequency. Then get a counter with GroupBy.cumcount, keep the first 3 values per group with loc, pivot with DataFrame.pivot, rename the columns, and finally convert ID back to a column with DataFrame.reset_index:
df = (df.groupby('ID')['City'].value_counts()
        .groupby(level=0).cumcount()
        .loc[lambda x: x < 3]
        .reset_index(name='c')
        .pivot(index='ID', columns='c', values='City')
        .rename(columns={0: 'first_', 1: 'second_', 2: 'third_'})
        .add_suffix('frequent_city')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
ID first_frequent_city second_frequent_city third_frequent_city
0 1 London New York Berlin
1 2 Shanghai NaN NaN

Another way: use the count as a reference to sort, then recreate the dataframe by looping through the groupby object:
df = (df.assign(count=df.groupby(["ID","City"])["City"].transform("count"))
        .drop_duplicates(["ID","City"])
        .sort_values(["ID","count"], ascending=False))
print(pd.DataFrame([i["City"].unique()[:3] for _, i in df.groupby("ID")]).fillna(np.nan))
0 1 2
0 London New York Berlin
1 Shanghai NaN NaN

A bit long. Essentially you groupby twice: the first part works on the idea that grouping sorts the data in ascending order, and the second part lets us split the data into individual columns:
(df
 .groupby("ID")
 .tail(3)
 .drop_duplicates()
 .groupby("ID")
 .agg(",".join)
 .City.str.split(",", expand=True)
 .set_axis(["first_frequent_city",
            "second_frequent_city",
            "third_frequent_city"],
           axis="columns")
)
first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai None None

Get the count per ID and City with transform('count'), then use np.where() with groupby transforms of max, median and min to label each row. Then set the index and unstack the label column from rows to columns.
df = df.assign(count=df.groupby(['ID', 'City'])['City'].transform('count')).drop_duplicates()
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('min')), 'third_frequent_city', np.nan)
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('median')), 'second_frequent_city', df['max'])
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('max')), 'first_frequent_city', df['max'])
df = df.drop('count',axis=1).set_index(['ID', 'max']).unstack(1)
output:
City
max first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai NaN NaN

Related

Check values of multiple categorical columns at the same time

I have multiple categorical columns like Marital Status, Education, Gender, City, and I want to check all the unique values inside these columns at once instead of writing this code every time:
df['Education'].value_counts()
I can only give an example with a few features, but I need a solution for when there are so many categorical features that it is not practical to write the same code again and again to examine them.
Maritial_Status Education City
Married UG LA
Single PHD CA
Single UG Ca
Expected output:
Maritial_Status Education City
Married 1 UG 2 LA 1
Single 2 PHD 1 CA 2
Is there any kind of method to do this in Python?
Thanks
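For reproducibility, a minimal sketch that builds the question's sample data (column names kept exactly as the asker wrote them, including the 'Maritial_Status' spelling):
import pandas as pd

# small sample of categorical columns from the question
df = pd.DataFrame({
    'Maritial_Status': ['Married', 'Single', 'Single'],
    'Education':       ['UG', 'PHD', 'UG'],
    'City':            ['LA', 'CA', 'Ca'],
})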
Yes, you can get what you're looking for with the following approach (and you don't have to worry if your df has more columns than the four you specified):
Get (only) all your categorical columns from your df in a list:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
Then, run a loop performing .size() on your grouped object, over your categorical columns, and store each result (which is a df object) in an empty list.
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col+'_count'))
Lastly, concat the newly created dataframes in your list into one:
dat = pd.concat(li,axis=1)
All in 1 block:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col+'_count'))
dat = pd.concat(li, axis=1)  # use axis=1, so that the concatenation is column-wise
Marital Status Marital Status_count ... City City_count
0 Divorced 4.0 ... Athens 4
1 Married 3.0 ... Berlin 2
2 Single 3.0 ... London 2
3 Widowed 2.0 ... New York 2
4 NaN NaN ... Singapore 2
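Applied to the question's small sample (as built above), the same block would produce something along these lines (a sketch; the exact dtypes and NaN padding come from the column-wise concat of frames with different lengths):
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
li = [df.groupby([col]).size().reset_index(name=col + '_count') for col in cat_cols]
dat = pd.concat(li, axis=1)
print(dat)
#   Maritial_Status  Maritial_Status_count Education  Education_count City  City_count
# 0         Married                    1.0       PHD              1.0   CA           1
# 1          Single                    2.0        UG              2.0   Ca           1
# 2             NaN                    NaN       NaN              NaN   LA           1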
Using value_counts, you can do the following
res = (df
       .apply(lambda x: x.value_counts())  # value_counts is applied column by column
       .stack()
       .reset_index(level=0).sort_index(axis=0)
       .rename(columns={'level_0': 'Value', 0: 'value_counts'}))
Another format of the output:
res['Id'] = res.groupby(level=0).cumcount()
res.set_index('Id', append=True)
Explanation:
After applying value_counts column by column, you get a wide frame of counts, with NaN wherever a value does not occur in that column. Then, using stack, you remove the NaN and get everything "stacked up" into a long format, after which you can do the formatting/ordering of the output.
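Since the intermediate screenshot is not reproduced here, a small sketch of that intermediate frame on the question's sample data (the df built above; row order may differ):
counts = df.apply(lambda x: x.value_counts())
print(counts)
#          Maritial_Status  Education  City
# CA                   NaN        NaN   1.0
# Ca                   NaN        NaN   1.0
# LA                   NaN        NaN   1.0
# Married              1.0        NaN   NaN
# PHD                  NaN        1.0   NaN
# Single               2.0        NaN   NaN
# UG                   NaN        2.0   NaN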
To know how many repeated unique values you have for each column, you can try the drop_duplicates() method:
dataset.drop_duplicates()

Column value from first df to another df based on condition

I have an original df with a column "average", which holds the average value computed per country. Now I have new_df, and I want to add these average values from df to it, matched on country.
df
id country value average
1 USA 3 2
2 UK 5 5
3 France 2 2
4 USA 1 2
new_df
country average
USA 2
Italy NaN
I had a solution that worked, but there is a problem when new_df contains a country for which I have not computed the average yet. In that case I just want to fill in NaN.
Can you please recommend me any solution?
Thanks
If you need to add the average column to df2, use DataFrame.merge with DataFrame.drop_duplicates:
df2.merge(df1.drop_duplicates('country')[['country','average']], on='country', how='left')
If you need to aggregate the mean instead:
df2.join(df1.groupby('country')['average'].mean(), on='country')
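A quick sketch of the merge variant on the question's data (here df1 stands for the question's df and df2 for new_df; the point is that the left join fills NaN for Italy, which has no computed average yet):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'country': ['USA', 'UK', 'France', 'USA'],
                    'value': [3, 5, 2, 1],
                    'average': [2, 5, 2, 2]})
df2 = pd.DataFrame({'country': ['USA', 'Italy']})

out = df2.merge(df1.drop_duplicates('country')[['country', 'average']],
                on='country', how='left')
print(out)
#   country  average
# 0     USA      2.0
# 1   Italy      NaN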

Get a count of combinations and its reverse from two columns

I'm trying to get a count of combinations from a pandas dataframe where the reversed form of a combination is treated as the same, i.e. A/B is the same as B/A.
Similar to what this user is trying to do, but in Python/pandas:
How to get count of two-way combinations from two columns?
Thank you for helping!
I've explored crosstab and grouping the data; it produces a count of the combinations, but it treats the reverse order as a unique combination.
Origin Destination
City 1 City 2
City 2 City 1
City 3 City 4
City 2 City 1
End result will look like
Route Count
City 1 - City 2 3
City 3 - City 4 1
Note: the order of the route does not matter. It could be City 2 - City 1, as long as it is counted as the same route.
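To make the answers below directly runnable, a small sketch rebuilding the sample data (column names as in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Origin':      ['City 1', 'City 2', 'City 3', 'City 2'],
                   'Destination': ['City 2', 'City 1', 'City 4', 'City 1']})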
You can define a route using np.sort
import numpy as np
import pandas as pd
df['Route'] = [' - '.join(x) for x in np.sort(df.to_numpy(), axis=1)]
df.groupby('Route').size()
#Route
#City 1 - City 2 3
#City 3 - City 4 1
#dtype: int64
You can also construct a new sorted DataFrame, which could be useful:
df = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index, columns=df.columns)
# Origin Destination
#0 City 1 City 2
#1 City 1 City 2
#2 City 3 City 4
#3 City 1 City 2
Now you can groupby ['Origin', 'Destination']:
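For completeness, the final groupby step on that sorted frame (a sketch; the counts match the expected output):
routes = df.groupby(['Origin', 'Destination']).size()
print(routes)
# Origin  Destination
# City 1  City 2         3
# City 3  City 4         1
# dtype: int64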
Check with sort
df.values.sort()
df.groupby(list(df)).size()
Origin  Destination
City 1  City 2         3
City 3  City 4         1
dtype: int64

isin pandas dataframe from 2 other dataframe

I have a pandas dataframe.
df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China', 'India', 'Pakistan', 'lanka'],
                   'id': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
I also have two more dataframes, df2 and df3.
df2 = pd.DataFrame({'countries': ['Germany', 'China'],
                    'capital': ['c', 'd']})
df3 = pd.DataFrame({'countries': ['lanka', 'USA'],
                    'capital': ['g', 'a']})
I want to find the rows in df whose id appears in df2 or df3.
I had this code:
df[df.id.isin(df2.capital)]
but it only finds the rows that are in df2.
Is there any way I can do it for both df2 and df3 in a single expression?
I.e. rows from df whose id is in df2 or df3.
I think you simply need to combine both lists together:
print (df[df.id.isin(df2.capital.tolist() + df3.capital.tolist())])
countries id
0 US a
2 Germany c
3 China d
6 lanka g
Another solution is to use numpy.setxor1d - set exclusive-or of two arrays:
print (df[df.id.isin(np.setxor1d(df2.capital, df3.capital))])
countries id
0 US a
2 Germany c
3 China d
6 lanka g
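One caveat worth noting: setxor1d keeps values that appear in exactly one of the two arrays, so it only matches here because df2.capital and df3.capital share no values. If overlap were possible, a plain union is the safer equivalent; a sketch:
import numpy as np

# union of both capital columns; values present in both are kept once
wanted = np.union1d(df2.capital, df3.capital)
print(df[df.id.isin(wanted)])
#   countries id
# 0        US  a
# 2   Germany  c
# 3     China  d
# 6     lanka  g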
Or the solution suggested in a comment, using | (or):
print (df[(df.id.isin(df2.capital)) | (df.id.isin(df3.capital))])
countries id
0 US a
2 Germany c
3 China d
6 lanka g

Python Pandas Dataframe pull column value/index down by one

I am using a pandas DataFrame and I would like to pull one column's values down by one, so the DataFrame length will be one less. Just like in my example image (not reproduced here):
The new DataFrame should be ids 2-5, but of course re-indexed after the manipulation to 1-4. There are more columns than just name and place.
How can I quickly manipulate the DataFrame like this?
Thank you very much.
You can shift the name column and then take a slice using iloc:
In [55]:
df = pd.DataFrame({'id':np.arange(1,6), 'name':['john', 'bla', 'tim','walter','john'], 'place':['new york','miami','paris','rome','sydney']})
df
Out[55]:
id name place
0 1 john new york
1 2 bla miami
2 3 tim paris
3 4 walter rome
4 5 john sydney
In [56]:
df['name'] = df['name'].shift(-1)
df = df.iloc[:-1]
df
Out[56]:
id name place
0 1 bla new york
1 2 tim miami
2 3 walter paris
3 4 john rome
If your 'id' column is your index the above still works:
In [62]:
df = pd.DataFrame({'name':['john', 'bla', 'tim','walter','john'], 'place':['new york','miami','paris','rome','sydney']},index=np.arange(1,6))
df.index.name = 'id'
df
Out[62]:
name place
id
1 john new york
2 bla miami
3 tim paris
4 walter rome
5 john sydney
In [63]:
df['name'] = df['name'].shift(-1)
df = df.iloc[:-1]
df
Out[63]:
name place
id
1 bla new york
2 tim miami
3 walter paris
4 john rome
