Get a count of combinations and their reverses from two columns - python

I'm trying to get a count of combinations from a pandas dataframe where the reversed form of a combination is treated as the same, i.e. A/B is the same as B/A.
Similar to what this user is trying to do, but in Python/pandas:
How to get count of two-way combinations from two columns?
Thank you for helping!
I've explored crosstabs and grouping the data; both produce a count of the combinations, but they treat the reversed order as a distinct combination.
Origin Destination
City 1 City 2
City 2 City 1
City 3 City 4
City 2 City 1
The end result should look like:
Route Count
City 1 - City 2 3
City 3 - City 4 1
Note: the order within a route does not matter. It could be City 2 - City 1, as long as it is counted as the same route.
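For reference, a minimal sketch that rebuilds the sample dataframe the answers below assume (column names taken from the example above):
import pandas as pd

df = pd.DataFrame({
    'Origin':      ['City 1', 'City 2', 'City 3', 'City 2'],
    'Destination': ['City 2', 'City 1', 'City 4', 'City 1'],
})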

You can define a route using np.sort:
import numpy as np
import pandas as pd
# sort each row so the two cities appear in a consistent order, then join them into one string
df['Route'] = [' - '.join(x) for x in np.sort(df.to_numpy(), axis=1)]
df.groupby('Route').size()
#Route
#City 1 - City 2 3
#City 3 - City 4 1
#dtype: int64
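If you want the exact Route / Count layout from the question, a small follow-up on the same df (reset_index just turns the group sizes into a regular column):
df.groupby('Route').size().reset_index(name='Count')
#             Route  Count
#0  City 1 - City 2      3
#1  City 3 - City 4      1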
You can also construct a new sorted DataFrame, which could be useful:
df = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index, columns=df.columns)
# Origin Destination
#0 City 1 City 2
#1 City 1 City 2
#2 City 3 City 4
#3 City 1 City 2
Now you can groupby ['Origin', 'Destination'].
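A hedged sketch of that groupby on the sorted frame:
df.groupby(['Origin', 'Destination']).size()
#Origin  Destination
#City 1  City 2         3
#City 3  City 4         1
#dtype: int64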

Check with sort:
df.values.sort()             # sorts each row in place so both columns end up in a fixed order
df.groupby(list(df)).size()  # then group by both columns and count
Origin  Destination
City 1  City 2         3
City 3  City 4         1
dtype: int64

Related

Python Merging data frames

In Python, I have a df that looks like this:
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
And a df that looks like this:
Name ID
Dan 1
Hallie 2
Cam 2
Lacy 2
Ryan 3
Colt 4
Tia 4
How can I merge the dfs so that the ID column looks like this:
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
Dan 4
Hallie 5
Cam 5
Lacy 5
Ryan 6
Colt 7
Tia 7
This is just a minimal reproducible example; my actual data set has thousands of values. I'm basically merging data frames and want the IDs in numerical order (a continuation of the previous data frame) instead of restarting from one each time. I know that I could reset the index if ID were a unique identifier, but in this case more than one person can have the same ID. How can I account for that?
From the example you have provided, you can obtain the final dataframe by adding the maximum ID value from the first df to the IDs of the second, and then concatenating the two. To illustrate:
Name  df2  final_df
Dan   1    4
The value in final_df is obtained as 1 + (max ID from df1, i.e. 3), and the same shift is applied to every entry of the second dataframe.
Code:
import pandas as pd
df = pd.DataFrame({'Name':['Anna','Polly','Sarah','Max','Kate','Ally','Steve'],'ID':[1,1,2,3,3,3,3]})
df1 = pd.DataFrame({'Name':['Dan','Hallie','Cam','Lacy','Ryan','Colt','Tia'],'ID':[1,2,2,2,3,4,4]})
max_df = df['ID'].max()                            # largest ID in the first frame
df1['ID'] = df1['ID'].apply(lambda x: x + max_df)  # shift the second frame's IDs past it
final_df = pd.concat([df, df1])                    # stack the two frames
print(final_df)
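If you also want the row index of the combined frame to run 0..n-1 instead of restarting, a small variation on the same concat (assuming the df and df1 defined above):
final_df = pd.concat([df, df1], ignore_index=True)  # rebuild the row index while concatenating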

Pandas Dataframe Comparison - specify mismatched columns

I have two dataframes as shown below, df1 and df2:
df1 =
emp_name emp_city counts
emp_id
2 two city2 3
4 fourxxx city4 1
5 five city5 1
df2 =
emp_name emp_city counts
emp_id
2 two city2 1
3 three city3 1
4 four city4 1
Note: 'emp_id' acts as index.
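For reference, a minimal sketch that rebuilds both frames with emp_id as the index (values taken from the tables above):
import pandas as pd

df1 = pd.DataFrame({'emp_name': ['two', 'fourxxx', 'five'],
                    'emp_city': ['city2', 'city4', 'city5'],
                    'counts': [3, 1, 1]},
                   index=pd.Index([2, 4, 5], name='emp_id'))
df2 = pd.DataFrame({'emp_name': ['two', 'three', 'four'],
                    'emp_city': ['city2', 'city3', 'city4'],
                    'counts': [1, 1, 1]},
                   index=pd.Index([2, 3, 4], name='emp_id'))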
I want to find the difference between df1 and df2 and write the names of the columns that have mismatched values. The code snippet below does that.
df3 = df2.copy()
df3['mismatch_col'] = df2.ne(df1, axis=1).dot(df2.columns)
Results in df3:
df3 =
emp_name emp_city counts mismatch_col
emp_id
2 two city2 1 counts
3 three city3 1 emp_nameemp_citycounts
4 four city4 1 emp_name
Now the problem I have is with 'mismatch_col'. It gives me the names of the columns where df1 and df2 mismatch, but the column names are NOT separated. I want to separate the column names with commas. The expected output should look like below:
Expected_df3 =
emp_name emp_city counts mismatch_col
emp_id
2 two city2 1 counts
3 three city3 1 emp_name,emp_city,counts
4 four city4 1 emp_name
Can someone please help me on this?
You can use df2.columns + ',' to add commas and then str[:-1] to remove the last one:
df3['mismatch_col'] = df2.ne(df1, axis=1).dot(df2.columns + ',').str[:-1]
Result:
emp_name emp_city counts mismatch_col
emp_id
2 two city2 1 counts
3 three city3 1 emp_name,emp_city,counts
4 four city4 1 emp_name
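An equivalent sketch that skips the trailing-comma trick by joining only the mismatched column names row by row (same df1, df2 and df3 as above; mask is just an intermediate name here):
mask = df2.ne(df1, axis=1)  # boolean frame: True wherever the two frames disagree
df3['mismatch_col'] = mask.apply(
    lambda row: ','.join(col for col, differs in row.items() if differs), axis=1)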

Count the number of zeros with Pandas

I have the following dataframe:
import pandas as pd
data = {
'State': ['EU','EU','EU','US','US','US','UK','UK','UK'],
'Var': [4,1,6,2,1,6,2,0,1],
'Mark': [0,1,1,1,0,0,0,0,0]
}
Table = pd.DataFrame(data)
Table
I want to count the number of zeros in the "Mark" column for each country. The output should be a table like this:
State #ofZeros
EU 1
US 2
UK 3
I managed to count the number of "1s" in the "Mark" column for each country with groupby:
Table.groupby('State')['Mark'].sum()
and it would be great to know whether it is also possible to count the zeros (or any other value) with groupby.
Group the dataframe by State, then sum the boolean mask for Mark == 0.
>>> Table.groupby('State', sort=False)['Mark'].agg(lambda x:x.eq(0).sum())
State
EU 1
US 2
UK 3
Name: Mark, dtype: int64
You can also call to_frame to convert it to a dataframe, then reset the index if needed:
Table.groupby('State', sort=False)['Mark'].agg(lambda x:x.eq(0).sum()).to_frame('#of Zeros').reset_index()
State #of Zeros
0 EU 1
1 US 2
2 UK 3
On a side note, regarding what you mentioned in the question (I managed to count the number of "1s" in the "Mark" column for each country with groupby: Table.groupby('State')['Mark'].sum()):
No, you are not actually counting the number of 1s; you are just getting the sum of the values in the Mark column for each group. For your sample data the Mark column only contains 0 and 1, which is why the sum and the count of 1s happen to be equal. If it contained other values besides 0 and 1, the sum would differ from the count of 1s.
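A quick worked illustration of that difference, using hypothetical values other than 0 and 1:
import pandas as pd

s = pd.Series([0, 1, 2, 1])
s.sum()        # 4 -> sum of the values
s.eq(1).sum()  # 2 -> count of 1s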
You can check the occurrences of the value 0 in the "Mark" column using the code below.
Table[['State', 'Mark']][Table.Mark == 0].value_counts()
Table[['State', 'Mark']] narrows the result down to the columns that need to be shown.
The output should be
State Mark
UK 0 3
US 0 2
EU 0 1
dtype: int64
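If you want that back in the State / #ofZeros shape from the question, a hedged follow-up on the same expression:
(Table[['State', 'Mark']][Table.Mark == 0]
 .value_counts()
 .reset_index(name='#ofZeros')
 .drop(columns='Mark'))
#  State  #ofZeros
#0    UK         3
#1    US         2
#2    EU         1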
You could use value_counts after filtering the dataframe on each state, and then look up the count for the value 0.
states = set(Table.State)
# for each state, look up how many times 0 appears in Mark
count_data = [[state, Table[Table.State == state].Mark.value_counts()[0]] for state in states]
df = pd.DataFrame(count_data, columns=['State', 'zeros'])
print(df)
>>
State zeros
0 US 2
1 UK 3
2 EU 1
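One caveat with value_counts()[0]: it raises a KeyError for a state that has no zeros at all. A hedged tweak using Series.get with a default value avoids that:
count_data = [[state, Table[Table.State == state].Mark.value_counts().get(0, 0)]
              for state in states]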

Column value from first df to another df based on condition

I have an original df with a column "average", which holds the average value computed per country. Now I have new_df, to which I want to add these average values from df based on country.
df
id country value average
1 USA 3 2
2 UK 5 5
3 France 2 2
4 USA 1 2
new_df
country average
USA 2
Italy NaN
I had a solution that worked, but there is a problem when new_df contains a country for which I have not computed the average yet. In that case I just want to fill in NaN.
Can you please recommend a solution?
Thanks
If you need to add the average column to df2, use DataFrame.merge with DataFrame.drop_duplicates:
df2.merge(df1.drop_duplicates('country')[['country','average']], on='country', how='left')
If you need to aggregate the mean instead:
df2.join(df1.groupby('country')['average'].mean(), on='country')
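A minimal runnable sketch of the first option, assuming df1 is the original frame and df2 is new_df from the question:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'country': ['USA', 'UK', 'France', 'USA'],
                    'value': [3, 5, 2, 1],
                    'average': [2, 5, 2, 2]})
df2 = pd.DataFrame({'country': ['USA', 'Italy']})

df2.merge(df1.drop_duplicates('country')[['country', 'average']], on='country', how='left')
#  country  average
#0     USA      2.0
#1   Italy      NaN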

How to group and get the three most frequent values?

I want to group by ID and get the three most frequent cities. For example, I have this original dataframe:
ID City
1 London
1 London
1 New York
1 London
1 New York
1 Berlin
2 Shanghai
2 Shanghai
and the result I want is like this:
ID first_frequent_city second_frequent_city third_frequent_city
1 London New York Berlin
2 Shanghai NaN NaN
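For reference, a minimal construction of the sample dataframe the answers below assume:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 2, 2],
    'City': ['London', 'London', 'New York', 'London', 'New York', 'Berlin',
             'Shanghai', 'Shanghai'],
})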
First, use SeriesGroupBy.value_counts to count the City values per ID (an advantage is that the values are already sorted by frequency). Then get a counter with GroupBy.cumcount, keep the first 3 values per group with loc, pivot with DataFrame.pivot, change the column names, and finally convert ID back to a column with DataFrame.reset_index:
df = (df.groupby('ID')['City'].value_counts()
        .groupby(level=0).cumcount()
        .loc[lambda x: x < 3]
        .reset_index(name='c')
        .pivot(index='ID', columns='c', values='City')
        .rename(columns={0:'first_', 1:'second_', 2:'third_'})
        .add_suffix('frequent_city')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
ID first_frequent_city second_frequent_city third_frequent_city
0 1 London New York Berlin
1 2 Shanghai NaN NaN
Another way, using the count as a reference to sort, then recreating the dataframe by looping through the groupby object:
import numpy as np

df = (df.assign(count=df.groupby(["ID","City"])["City"].transform("count"))
        .drop_duplicates(["ID","City"])
        .sort_values(["ID","count"], ascending=False))
print (pd.DataFrame([i["City"].unique()[:3] for _, i in df.groupby("ID")]).fillna(np.nan))
0 1 2
0 London New York Berlin
1 Shanghai NaN NaN
A bit long; essentially you groupby twice. The first part relies on the idea that grouping keeps the data in order, and the second part splits the joined data back into individual columns:
(df
 .groupby("ID")
 .tail(3)
 .drop_duplicates()
 .groupby("ID")
 .agg(",".join)
 .City.str.split(",", expand=True)
 .set_axis(["first_frequent_city",
            "second_frequent_city",
            "third_frequent_city"],
           axis="columns")
 )
first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai None None
Get the count by ID and City, then use np.where() with .groupby() transforms of max, median and min to label each row. Then set the index and unstack the max column from rows to columns.
import numpy as np

df = df.assign(count=df.groupby(['ID', 'City'])['City'].transform('count')).drop_duplicates()
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('min')), 'third_frequent_city', np.nan)
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('median')), 'second_frequent_city', df['max'])
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('max')), 'first_frequent_city', df['max'])
df = df.drop('count', axis=1).set_index(['ID', 'max']).unstack(1)
output:
City
max first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai NaN NaN
