Finding the most frequent combination in DataFrame - python

I have a DataFrame with two columns From and To, and I need to know the most frequent combination of locations From and To.
Example:
From To
------------------
Home Office
Home Office
Home Office
Airport Home
Restaurant Office
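For reference, a minimal construction of this example frame (assuming the columns are simply named From and To, as shown):
import pandas as pd

df = pd.DataFrame({
    'From': ['Home', 'Home', 'Home', 'Airport', 'Restaurant'],
    'To':   ['Office', 'Office', 'Office', 'Home', 'Office'],
})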

If the order matters:
df['FROM_TO'] = df['From'] + df['To']
df['COUNT'] = 1
df.groupby(['FROM_TO'])['COUNT'].sum()
gives you all the occurrences in one go. Simply take the max to find the largest occurrence.
If the order does not matter, first sort the values row-wise beforehand:
import numpy as np
df.loc[:, :] = np.sort(df.values, axis=1)  # only if the df consists solely of the From and To columns
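Put together, a minimal sketch of this approach (assuming the example frame above; the separator is only there to keep the concatenated pairs readable):
import numpy as np
import pandas as pd

# Order matters: count each From -> To combination and take the most frequent one
ordered = (df['From'] + ' -> ' + df['To']).value_counts()
print(ordered.idxmax(), ordered.max())        # Home -> Office 3

# Order does not matter: sort each row first so (A, B) and (B, A) fall together
pairs = pd.DataFrame(np.sort(df[['From', 'To']].values, axis=1), columns=['a', 'b'])
unordered = (pairs['a'] + ' <-> ' + pairs['b']).value_counts()
print(unordered.idxmax(), unordered.max())    # Home <-> Office 3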

You can group by the two columns together and count the number of occurrences of each pair, then sort the pairs by this count.
The following code does the job:
df.groupby(["From", "To"]).size().sort_values(ascending=False)
and, for the example of the question, it returns:
From To
-----------------------
Home Office 3
Restaurant Office 1
Airport Home 1
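If you only need the single most frequent pair rather than the full ranking, the same grouping can be reduced with idxmax (a small sketch using the frame above):
df.groupby(["From", "To"]).size().idxmax()   # ('Home', 'Office')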

IIUC, use SeriesGroupBy.value_counts and Series.idxmax:
df.groupby('From')['To'].value_counts().idxmax()
Output
('Home', 'Office')
In general, groupby.value_counts is faster than groupby.size.
Another way:
df.apply(tuple, axis=1).value_counts().idxmax()
Or
df.apply(tuple, axis=1).mode()
Output
0 (Home, Office)
dtype: object
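The mode comes back as a one-element Series; to get the tuple itself, take the first entry (a small usage note, same frame as above):
df.apply(tuple, axis=1).mode().iloc[0]   # ('Home', 'Office')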

Related

How to search specific rows of a dataframe to find a match in a second dataframe?

I have a large data set and I want to do the following task efficiently. Suppose we have 2 data frames. For each element in df2 I want to search the first data set df1, only in rows where the first 2 letters are in common, and then the word with the most tokens in common is chosen.
Let's look at an example:
df1:
common work co
summer hot su
apple ap
colorful fall co
support it su
could comp co
df2:
condition work it co
common mistakes co
could comp work co
summer su
Take the first row of df2 as an example (condition work it). I want to find the row in df1 that has the same first_two and the most tokens in common.
The first_two of condition work it is co, so I want to search in df1 where first_two is co. The search is therefore done among: common work, colorful fall, could comp. Since condition work it has 1 token in common with common work, it is selected.
output:
df2:
name first_two match
condition work it co `common work`
common mistakes co `common work`
could comp work co `could comp`
summer su `summer hot`
appears ap None
The last row is None since there is no word in common between appears and apple.
I did the following:
df3 = (df1.groupby(['first_two'])
          .agg({'name': lambda x: ",".join(x)})
          .reset_index())
merge_ = df3.merge(df2, on='first_two', how='inner')
But now I have to search in name_x for each name_y. How do I find the element of name_x that has the most tokens in common with name_y?
You have pretty much explained the most efficient method already.
1. Extract the first 2 letters using .str[:2] on the series and assign it to a new column in both dataframes.
2. Extract the unique values of the 2-letter column from df2.
3. Inner join the result from #2 onto df1.
4. Perform a group-by count on the result of #3, sort descending on the count, and drop duplicates to get the most repeated item for each 2-letter value.
5. Left join the result of #4 onto df2.
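As for the token-comparison step the asker was left with after the merge (finding the entry in name_x with the most tokens in common with name_y), a minimal sketch, assuming the merge_ frame from the question where name_x holds the comma-joined candidates from df1 and name_y the name from df2:
# Hypothetical helper: pick, for each row, the candidate from name_x (comma-joined)
# that shares the most whitespace-separated tokens with name_y; None if no overlap.
def best_match(row):
    tokens_y = set(row['name_y'].split())
    candidates = row['name_x'].split(',')
    scored = [(len(tokens_y & set(c.split())), c) for c in candidates]
    best_score, best = max(scored)
    return best if best_score > 0 else None

merge_['match'] = merge_.apply(best_match, axis=1)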

Performing a calculation on a DataFrame based on condition in another DataFrame

I am working with COVID data and am trying to control for population and show incidence per 100,000.
I have one DataFrame with population:
Country Population
China 1389102
Israel 830982
Iran 868912
I have a second DataFrame showing COVID data:
Date Country Confirmed
01/Jan/2020 China 8
01/Jan/2020 Israel 3
01/Jan/2020 Iran 2
02/Jan/2020 China 15
02/Jan/2020 Israel 5
02/Jan/2020 Iran 5
I wish to perform a calculation on my COVID DataFrame using info from the population DataFrame. That is to say, to normalise cases per 100,000 for each datapoint via:
(Chinese Datapoint/Chinese Population) * 100,000
Likewise for my other countries.
I am stumped on this one and not too sure whether I should achieve my result via grouping data, zipping data, etc.
Any help welcome.
Edit: I should have added that confirmed cases are cumulative as each day goes on. So for example, for Jan 1st for China I wish to compute (8/China population)*100000, and likewise for Jan 2nd, Jan 3rd, Jan 4th... And again, likewise for each country. Essentially, performing a calculation on the entire DataFrame based on data in another DataFrame.
You could merge the 2 dataframes and perform the operation:
# Define the norm operation (cases per 100,000)
def norm_cases(cases, population):
    return (cases / population) * 100000

# If the column name for country is the same in both dataframes
covid_df = covid_df.merge(population_df, on='country_column', how='left')
# For different column names
covid_df = covid_df.merge(population_df, left_on='covid_country_column',
                          right_on='population_country_column', how='left')
covid_df['norm_cases'] = covid_df.apply(
    lambda x: norm_cases(x['cases_column'], x['population_column']), axis=1)
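Since both columns end up in the same merged frame, the calculation can also be vectorised instead of applied row by row (a sketch, keeping the same hypothetical column names as above):
covid_df['norm_cases'] = covid_df['cases_column'] / covid_df['population_column'] * 100000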
Assuming that your dataframes are called df1 and df2, and by "Datapoint" you mean the Confirmed column:
normed_cases = (
    df2.reset_index().groupby(['Country', 'Date']).sum()['Confirmed']
    / df1.set_index('Country')['Population'] * 100000)
Reset the index of df2 to make the date a column (only applicable if Date was the index before).
Group by country and date and sum the groups to get the total cases per country and date.
Set Country as the index of the first df, df1, to allow country-index-oriented division.
Divide by the population.
I took an approach combining many of your suggestions. Step one, I merged my two dataframes. Step two, I divided my Confirmed column by the population. Step three, I multiplied the same column by 100,000. There is probably a more elegant approach, but this works.
covid_df = covid_df.merge(population_df, on='Country', how='left')
covid_df["Confirmed"] = covid_df["Confirmed"].divide(covid_df["Population"], axis="index")
covid_df["Confirmed"] = covid_df["Confirmed"] * 100000
Suppose the DataFrame with population is df_pop, and the COVID data is df_data.
# Set Country as the index of df_pop
df_pop = df_pop.set_index(['Country'])
# Norm value
norm = 100000
# Calculate normalised cases
df_data['norm_cases'] = [(conf / df_pop.loc[country].Population) * norm
                         for (conf, country) in zip(df_data.Confirmed, df_data.Country)]
You can use df1.set_index('Country').join(df2.set_index('Country')) here; then it will be easy for you to perform these operations.
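For completeness, a minimal sketch of that join-based approach (assuming df1 holds the population and df2 the daily case counts, as in the question; the Incidence column name is just illustrative):
merged = df2.set_index('Country').join(df1.set_index('Country'))
merged['Incidence'] = merged['Confirmed'] / merged['Population'] * 100000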

Groupby and frequency count does not return the right value

I am using this code to group companies and do a frequency count. However, the returned result did not group the companies:
freq = df.groupby(['company'])['recruitment'].size()
I got a result similar to this:
recruitment
company
Data Co 3
Data Co 8
Apple Co 3
Apple Co 6
I have two questions:
Why did this groupby not group the same companies?
When I look at freq.columns, it only shows the recruitment column; company has disappeared. Is there any way to show both columns, company and recruitment?
If the company names look the 'same', then you likely have whitespace at the front or end. I am adding .str.upper() to convert everything to upper case as well:
freq = df.groupby(df['company'].str.strip().str.upper())['recruitment'].size()
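To keep company as a regular column in the result rather than as the index (the second question), one option is to reset the index afterwards; a small sketch assuming the same frame:
freq = (df.groupby(df['company'].str.strip().str.upper())['recruitment']
          .size()
          .reset_index(name='recruitment_count'))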

Group values in one column which share the same value in another column (pandas: groupby.apply/multi-index)

I am working with a dataframe that has title_no, release_no, name, and country_id columns (only a portion of the df is relevant here).
I want to create a multi-index dataframe where the top level index is the title_no and the sub-index is all of the release_no values that share the same title_no.
I have tried using the groupby.apply method but this groups the release_no's with the same title_no into lists and eliminates the rest of the columns:
df = pd.DataFrame(df.groupby('title_no')['release_no'].unique()).reset_index()
This is the result, but it loses the other columns.
Ideally, I would like my dataframe to look something like this:
title_no  release_no  name                  country_id
199034
          732644      Jurassic Park III     ES
          891376      Jurassic Park III     CA
          732658      Jurassic Park III     TH
199052
          1119213     Myth of Fingerprints  IT
          925041      Myth of Fingerprints  ES
          448432      Myth of Fingerprints  US
          564033      Myth of Fingerprints  FR
...
Is there a way to do this in pandas, where I could list out the rows under the same title_no and be able to index the rows on one level with title_no and then with release_no on a lower level?
You don't need a groupby for this; sorting will be sufficient:
df.sort_values(['title_no', 'release_no']).set_index(['title_no', 'release_no'])
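The result is then indexed first by title_no and then by release_no, so sub-rows can be selected per title; a short usage sketch (assuming the column names above):
result = df.sort_values(['title_no', 'release_no']).set_index(['title_no', 'release_no'])
result.loc[199034]             # all releases for that title, indexed by release_no
result.loc[(199034, 732644)]   # a single release row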

How do I groupby a dataframe based on values that are common to multiple columns?

I am trying to aggregate a dataframe based on values that appear in two columns: rows that have some value X in either column A or column B should be aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
awayTeam homeTeam awayGoals homeGoals
Chelsea Barca 1 2
R. Madrid Barca 2 5
Barca Valencia 2 2
Barca Sevilla 1 0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
team goalsFor goalsAgainst
Barca 10 5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data, then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution is two-fold: first compute the goals for each team when they are away and when they are home, then combine them. Something like:
goals_when_away = gameStats.groupby('awayTeam')[['awayGoals', 'homeGoals']].agg('sum').reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby('homeTeam')[['homeGoals', 'awayGoals']].agg('sum').reset_index().sort_values('homeTeam')
then combine them
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1, ignore_index=True)
Note the use of .values when summing (to get the result as a NumPy array) and ignore_index=True when concatenating; these avoid the pandas trap of aligning on column and index names.
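An alternative sketch that avoids the positional addition entirely: stack a "home view" and an "away view" of the fixtures and group once (column names assumed from the question; this is a different option, not the approach above):
import pandas as pd

home = gameStats.rename(columns={'homeTeam': 'team', 'homeGoals': 'goalsFor', 'awayGoals': 'goalsAgainst'})
away = gameStats.rename(columns={'awayTeam': 'team', 'awayGoals': 'goalsFor', 'homeGoals': 'goalsAgainst'})
cols = ['team', 'goalsFor', 'goalsAgainst']
# Each match contributes one row per team; summing per team gives goals for and against
result = pd.concat([home[cols], away[cols]]).groupby('team', as_index=False).sum()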
