Multiple aggregated Counting in Pandas - python

I have a DataFrame:
import pandas as pd

data = [["John", "144", "Smith", "200"], ["Mia", "220", "John", "144"],
        ["Caleb", "155", "Smith", "200"], ["Smith", "200", "Jason", "500"]]
data_frame = pd.DataFrame(data, columns=["Name", "ID", "Manager_name", "Manager_ID"])
data_frame
Output:
    Name   ID Manager_name Manager_ID
0   John  144        Smith        200
1    Mia  220         John        144
2  Caleb  155        Smith        200
3  Smith  200        Jason        500
I am trying to count the number of people reporting under each person in the column Name.
Logic is:
Count each person's direct reports plus everyone below them in the chain. For example, with Smith: John and Caleb report to Smith (2), plus Mia reporting to John (who already reports to Smith), so the total is 3.
Similarly for Jason: Smith reports to him, and 3 people already report under Smith, so the total is 4.
I understand how to do it pythonically with some recursion; is there a way to do it efficiently in pandas? Any suggestions?
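For reference, the plain recursion alluded to above might look like this (a sketch built on the data list, not the OP's actual code):
# manager -> direct reports
reports = {}
for name, _id, mgr, _mgr_id in data:
    reports.setdefault(mgr, []).append(name)

def count_reports(person):
    # each direct report counts as 1, plus everyone under them
    return sum(1 + count_reports(r) for r in reports.get(person, []))

for p in ["John", "Mia", "Caleb", "Smith", "Jason"]:
    print(p, count_reports(p))  # John 1, Mia 0, Caleb 0, Smith 3, Jason 4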
Expected output:
Name   Number of people reporting
John   1
Mia    0
Caleb  0
Smith  3
Jason  4

Scott Boston's Networkx solution is the preferred solution...
There are two solutions to this problem. The first is a vectorized pandas solution that should be fast over larger datasets; the second is pythonic but does not scale to the dataset size the OP is working with (the original df has shape (223635, 4)).
PANDAS SOLUTION
This problem seeks to find out how many people each person in an organization manages, including subordinates' subordinates. This solution builds a dataframe by adding successive columns that hold the managers of the previous column, and then counts the occurrences of each employee in that dataframe to determine the total number under them.
First we set up the input.
import pandas as pd
import numpy as np
data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])
df = df[["SID", "Manager_SID"]]
# shortening the columns for convenience
df.columns = ["1", "2"]
print(df)
     1    2
0  144  200
1  220  144
2  155  200
3  200  500
First, the employees without subordinates must be counted and put into a separate dictionary.
df_not_mngr = df.loc[~df['1'].isin(df['2']), '1']
non_mngr_dict = {str(key):0 for key in df_not_mngr.values}
non_mngr_dict
{'220': 0, '155': 0}
Next we modify the dataframe by repeatedly adding a column holding the managers of the previous column. The loop stops when there are no employees left in the rightmost column.
for i in range(2, 10):
    df = df.merge(
        df[["1", "2"]], how="left", left_on=str(i), right_on="1", suffixes=("_l", "_r")
    ).drop("1_r", axis=1)
    df.columns = [str(x) for x in range(1, i + 2)]
    if df.iloc[:, -1].isnull().all():
        break
    else:
        continue
print(df)
     1    2    3    4    5
0  144  200  500  NaN  NaN
1  220  144  200  500  NaN
2  155  200  500  NaN  NaN
3  200  500  NaN  NaN  NaN
All columns except the first are then flattened, and each employee's occurrences are counted into a dictionary.
from collections import Counter
result = dict(Counter(df.iloc[:, 1:].values.flatten()))
The non-manager dictionary is merged into the result.
result.update(non_mngr_dict)
result
{'200': 3, '500': 4, nan: 8, '144': 1, '220': 0, '155': 0}
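The nan key only counts the padding cells left by the merges, so drop it; the SIDs can then be mapped back to names. A minimal sketch, assuming the data list and result dict from above:
# drop the NaN key, then translate SIDs to names
result = {k: v for k, v in result.items() if pd.notna(k)}
sid_to_name = {sid: name for name, sid, _, _ in data}
sid_to_name.update({mgr_sid: mgr_name for _, _, mgr_name, mgr_sid in data})
print({sid_to_name[s]: n for s, n in result.items()})
# {'Smith': 3, 'Jason': 4, 'John': 1, 'Mia': 0, 'Caleb': 0}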
RECURSIVE PYTHONIC SOLUTION
I think this is probably way more pythonic than you were looking for. First I created a list 'all_sids' to make sure we capture all employees, as not all of them appear in each column.
import pandas as pd
import numpy as np
data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])
all_sids = pd.unique(df[['SID', 'Manager_SID']].values.ravel('K'))
Then create a pivot table.
dfp = df.pivot_table(values='Name', index='SID', columns='Manager_SID', aggfunc='count')
dfp
Manager_SID  144  200  500
SID
144          NaN  1.0  NaN
155          NaN  1.0  NaN
200          NaN  NaN  1.0
220          1.0  NaN  NaN
Then a function that will go through the pivot table to total up all the reports.
def count_mngrs(SID, count=0):
    if str(SID) not in dfp.columns:
        return count
    else:
        count += dfp[str(SID)].sum()
        sid_list = dfp[dfp[str(SID)].notnull()].index
        for sid in sid_list:
            count = count_mngrs(sid, count)
        return count
Call the function for each employee and print the results.
print('SID', ' Number of People Reporting')
for sid in all_sids:
    print(sid, " ", int(count_mngrs(sid)))
Results are below; sorry, I was a bit lazy about putting the names with the SIDs (see the sketch after the results).
SID  Number of People Reporting
144  1
220  0
155  0
200  3
500  4
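To put the names next to the SIDs, a small sketch assuming the df, all_sids, and count_mngrs from above:
# build SID -> name from both sides of the reporting pairs
sid_to_name = dict(zip(df['SID'], df['Name']))
sid_to_name.update(dict(zip(df['Manager_SID'], df['Manager_name'])))
for sid in all_sids:
    print(sid_to_name[sid], int(count_mngrs(sid)))
# John 1, Mia 0, Caleb 0, Smith 3, Jason 4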
Look forward to seeing a more pandas type solution!

This is also a graph problem, and you can use Networkx:
import networkx as nx
import pandas as pd
data = [["John","144","Smith","200"], ["Mia","220","John","144"],["Caleb","155","Smith","200"],["Smith","200","Jason","500"]]
data_frame = pd.DataFrame(data,columns = ["Name","ID","Manager_name","Manager_ID"])
#create a directed graph object using nx.DiGraph
G = nx.from_pandas_edgelist(data_frame,
source='Name',
target='Manager_name',
create_using=nx.DiGraph())
#use nx.ancestors to get set of "ancenstor" nodes for each node in the directed graph
pd.DataFrame.from_dict({i:len(nx.ancestors(G,i)) for i in G.nodes()},
orient='index',
columns=['Num of People reporting'])
Output:
       Num of People reporting
John                         1
Smith                        3
Mia                          0
Caleb                        0
Jason                        4
Draw the networkx graph:
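(The original answer shows a plot of the graph here; a minimal sketch to reproduce it, assuming the G from above; the layout and styling choices are my own assumptions:)
import matplotlib.pyplot as plt

pos = nx.spring_layout(G, seed=42)  # any layout works; the seed just makes it repeatable
nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=1500, arrows=True)
plt.show()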


Incorrect output with np.random.choice

I am trying to randomly select records from a 17MM-row dataframe using np.random.choice because it runs faster than other methods, but I am getting an incorrect value in the output against each record. Example below:
import numpy as np
import pandas as pd

data = {
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
}
df = pd.DataFrame(data)
df2 = df.groupby(['id', 'yr'], as_index=False).agg(np.random.choice)
Output:
   id    yr  calories  duration  mth
0   1  2003       420        50    6
1   2  2009       390        45    9
2   2  2012       200       450    3
3   3  2003       500       210    6
The problem in the output: for id 3, the row with calories 500 should have duration 600 and mth 12 instead of 210 and 6. Can anyone please explain why it is choosing values from a different row?
Expected output:
The same row's values should be retained after the random selection.
This doesn't work because pandas applies the aggregate to each column independently. Try putting a print statement in, e.g.:
def fn(x):
    print(x)
    return np.random.choice(x)

df.groupby(['id', 'yr'], as_index=False).agg(fn)
This would let you see when the function is called and what it is called with.
I'm not an expert in Pandas, but using GroupBy.apply seems to be the easiest way I've found of keeping rows together.
Something like the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
})
df.groupby(['id', 'yr'], as_index=False).apply(lambda x: x.sample(1))
produces:
     calories  duration  id    yr  mth
0 1       380        40   1  2003    6
1 2       390        45   2  2009    9
2 4       200       450   2  2012    3
3 5       100       210   3  2003    6
The two numbers at the beginning appear because you end up with a MultiIndex. If you want to know where the rows were selected from, this contains useful information; otherwise you can discard the index, as shown below.
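A small sketch of discarding it:
df.groupby(['id', 'yr'], as_index=False).apply(lambda x: x.sample(1)).reset_index(drop=True)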
Note that there are warnings in the docs that this might not be very performant, but I don't know the details.
Update: I've just had more of a read of the docs, and noticed that there's a GroupBy.sample method, so you could instead just do:
df.groupby(['id', 'yr']).sample(1)
which would presumably be performant as well as being much shorter!

Python Dataframe: pivot rows as columns

I have raw files from different stations. When I combine them into a dataframe, I see three columns in which id and name repeat while component differs. I want to convert this into a dataframe where the name entries become the column names.
Code:
df =
   id           name component
0   1  Serial Number       103
1   2   Station Name        DC
2   1  Serial Number       114
3   2   Station Name        CA
4   1  Serial Number       147
5   2   Station Name        FL
Expected answer:
new_df =
  Station Name Serial Number
0           DC           103
1           CA           114
2           FL           147
My answer:
# Solution 1
df.pivot_table('id', 'name', 'component')
# result: all NaN

# Solution 2
df.pivot(index=None, columns='name')['component']
# result: all NaN
I am not getting the desired answer. Any help?
First you have to give every pair of rows the same id; after that you can use pivot.
import pandas as pd

df = pd.DataFrame({'id': ["1", "2", "1", "2", "1", "2"],
                   'name': ["Serial Number", "Station Name", "Serial Number", "Station Name", "Serial Number", "Station Name"],
                   'component': ["103", "DC", "114", "CA", "147", "FL"]})
new_column = [x // 2 + 1 for x in range(len(df))]
df["id"] = new_column
df = df.pivot(index='id', columns='name')['component']
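Printing the result (output added here for reference):
print(df)
# name  Serial Number Station Name
# id
# 1               103           DC
# 2               114           CA
# 3               147           FL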
If your Serial Number row always comes just before the Station Name row, you can pivot on the name column and then combine every two rows:
df_ = df.pivot(columns='name', values='component').groupby(df.index // 2).first()
print(df_)
name  Serial Number Station Name
0               103           DC
1               114           CA
2               147           FL

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
And I just need to create some sort of "sumif" statement (like the one in Excel) that adds up the amount each salesperson is due. I don't know how to iterate over each row, but I want it to add the values in "Amount" and "Amount2" for each of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There can be multiple ways of solving this. One option is to use pd.concat to stack the salesperson/amount column pairs and then use groupby:
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(
        columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'}),
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
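For reference, this is what the stacked merged_df looks like before the groupby (the duplicated index values are left over from the original rows):
print(merged_df)
#   Salesperson  Amount
# 0        John    1000
# 1        John    1000
# 2         Bob       0
# 3     Richard     500
# 0         Bob     400
# 1     Richard     200
# 2        John     300
# 3         Tom     500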
you get
  Salesperson  Amount
0         Bob     400
1        John    2300
2     Richard     700
3         Tom     500
Edit: If you have another salesperson/amount pair, you can add it to the concat:
d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500],
     'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)

merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(
        columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'}),
    df[['Salesperson 3', 'Amount3']].rename(
        columns={'Salesperson 3': 'Salesperson', 'Amount3': 'Amount'}),
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
  Salesperson  Amount
0         Bob     800
1        John    2300
2        Nick     400
3     Richard    1500
4         Sam     100
5         Tom     500
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson': 'Salesperson 1', 'Amount': 'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson', 'Amount'],
                              i='Client', j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df to:
   Salesperson  Amount
0         John    1000
1         John    1000
2          Bob       0
3      Richard     500
4          Bob     400
5      Richard     200
6         John     300
7          Tom     500
8         Nick     400
9      Richard     800
10         Sam     100
11         Bob     400
A simple groupby on reshaped_df will give you the required output:
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index="Client",
     names_to=".value",
     names_pattern=r"([a-zA-Z]+).*",
 )
 .groupby("Salesperson", as_index=False)
 .Amount
 .sum()
)
  Salesperson  Amount
0         Bob     400
1        John    2300
2     Richard     700
3         Tom     500
The .value tells the function to keep only those parts of the column names that match it as headers. The column names share a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. This pattern is captured in names_pattern; .value is paired with the regex group in the parentheses, and whatever falls outside the group does not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.
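For reference, the intermediate long form that pivot_longer produces before the groupby looks like this (a sketch, assuming the original two-pair df from the question; row order may differ):
long_df = df.pivot_longer(index="Client", names_to=".value", names_pattern=r"([a-zA-Z]+).*")
print(long_df)
#    Client Salesperson  Amount
# 0       1        John    1000
# 1       2        John    1000
# 2       3         Bob       0
# 3       4     Richard     500
# 4       1         Bob     400
# 5       2     Richard     200
# 6       3        John     300
# 7       4         Tom     500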

Fuzzy Lookup In Python

I have two CSV files: one that contains vendor data and one that contains employee data. Similar to what "Fuzzy Lookup" in Excel does, I'm looking to do two types of matches and output all columns from both CSV files, plus a new column with the similarity ratio for each row. In Excel I would use a 0.80 threshold. Below is sample data; my actual data has 2 million rows in one of the files, which is going to be a nightmare if done in Excel.
Output 1:
From Vendor file, fuzzy match "Vendor Name" with "Employee Name" from Employee file. Display all columns from both files and a new column for Similarity Ratio
Output 2:
From Vendor file, fuzzy match "SSN" with "SSN" from Employee file. Display all columns from both files and a new column for Similarity Ratio
These are two separate outputs
Dataframe 1: Vendor Data
Company  Vendor ID  Vendor Name     Invoice Number  Transaction Amt  Vendor Type  SSN
15       58421      CLIFFORD BROWN  854             500              Misc         668419628
150      9675       GREEN           7412            70               One Time     774801971
200      15789      SMITH, JOHN     80              40               Employee     965214872
200      69997      HAROON, SIMAN   964             100              Misc         741-98-7821
Dataframe 2: Employee Data
Employee Name    Employee ID  Manager    SSN
BROWN, CLIFFORD  1            Manager 1  668-419-628
BLUE, CITY       2            Manager 2  874126487
SMITH, JOHN      3            Manager 3  965-21-4872
HAROON, SIMON    4            Manager 4  741-98-7820
Expected output 1 - Match Name
Employee Name    Employee ID  Manager    SSN          Company  Vendor ID  Vendor Name     Invoice Number  Transaction Amt  Vendor Type  SSN          Similarity Ratio
BROWN, CLIFFORD  1            Manager 1  668-419-628  150      58421      CLIFFORD BROWN  854             500              Misc         668419628    1.00
SMITH, JOHN      3            Manager 3  965-21-4872  200      15789      SMITH, JOHN     80              40               Employee     965214872    1.00
HAROON, SIMON    4            Manager 4  741-98-7820  200      69997      HAROON, SIMAN   964             100              Misc         741-98-7821  0.96
BLUE, CITY       2            Manager 2  874126487                                                                                                   0.00
Expected output 2 - Match SSN
Employee Name    Employee ID  Manager    SSN          Company  Vendor ID  Vendor Name      Invoice Number  Transaction Amt  Vendor Type  SSN          Similarity Ratio
BROWN, CLIFFORD  1            Manager 1  668-419-628  150      58421      CLIFFORD, BROWN  854             500              Misc         668419628    0.97
SMITH, JOHN      3            Manager 3  965-21-4872  200      15789      SMITH, JOHN      80              40               Employee     965214872    0.97
BLUE, CITY       2            Manager 2  874126487                                                                                                    0.00
HAROON, SIMON    4            Manager 4  741-98-7820                                                                                                  0.00
I've tried the below code:
import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Employee Data.xlsx')

matched_names = []
for row1 in df1.index:
    name1 = df1._get_value(row1, 'Vendor Name')
    for row2 in df2.index:
        name2 = df2._get_value(row2, 'Full Name')
        match = fuzz.ratio(name1, name2)
        if match > 80:  # this is the threshold
            matched_names.append([name1, name2, match])

df_ratio = pd.DataFrame(columns=['Vendor Name', 'Employee Name', 'match'], data=matched_names)
df_ratio.to_csv(r'directory\MatchingResults.csv', encoding='utf-8')
I'm just not getting the results I want and am ready to reinvent the whole script. Any suggestions would help to improve my script. Please note, I'm fairly new to Python so be gentle. I am totally open to a new approach on this example.
September 23 Update:
Still having trouble. I'm able to get the similarity ratio now, but I'm not getting all the columns from both CSV files. The issue is that the two files are completely different, so when I concat, it gives NaN values. Any suggestions? New code below:
import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Workday Data.xlsx')
df1['full_name'] = df1['Vendor Name']
df2['full_name'] = df2['Employee Name']
df1_name = df1['full_name']
df2_name = df2['full_name']

frames = [pd.DataFrame(df1), pd.DataFrame(df2)]
df = pd.concat(frames).reset_index(drop=True)
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
dfresult = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]),
                        columns=df.full_name.values.tolist())

# create a list of dataframes
listOfDfs = [dfresult.loc[idx] for idx in np.split(dfresult.index, df.shape[0])]
DataFrameDict = {df['full_name'][i]: listOfDfs[i] for i in range(dfresult.shape[0])}
for name in DataFrameDict.keys():
    print(name)
    # print(DataFrameDict[name])

df = pd.DataFrame(list(DataFrameDict.items()))
df.to_excel(r'Directory\TestOutput.xlsx', index=False)
To concatenate the two DataFrames horizontally, I aligned the Employees DataFrame by the index of the matched Vendor Name. If no Vendor Name was matched, I just put an empty row instead.
In more detail:
I iterated over the vendor names, and for each vendor name I added the index of the employee name with the highest score to a list of indices. Note that I added at most one matched employee record to each vendor name.
If no match was found (too low a score), I added the index of an empty record that I appended manually to the Employees DataFrame.
This list of indices is then used to reorder the Employees DataFrame.
Finally, I just merge the two DataFrames horizontally. Note that the two DataFrames at this point don't have to be of the same size; in such a case, the concat method just fills the gap by appending missing rows to the smaller DataFrame.
The code is as follows:
import numpy as np
import pandas as pd
from thefuzz import process as fuzzy_process  # the new repository of fuzzywuzzy

# import dataframes
...

# adding empty row
employees_df = employees_df.append(pd.Series(dtype=np.float64), ignore_index=True)
index_of_empty = len(employees_df) - 1

# matching between vendor and employee names
indexed_employee_names_dict = dict(enumerate(employees_df["Employee Name"]))
matched_employees = set()
ordered_employees = []
scores = []
for vendor_name in vendors_df["Vendor Name"]:
    match = fuzzy_process.extractOne(
        query=vendor_name,
        choices=indexed_employee_names_dict,
        score_cutoff=80
    )
    score, index = match[1:] if match is not None else (0.0, index_of_empty)
    matched_employees.add(index)
    ordered_employees.append(index)
    scores.append(score)

# detect unmatched employees to be positioned at the end of the dataframe
missing_employees = [i for i in range(len(employees_df)) if i not in matched_employees]
ordered_employees.extend(missing_employees)
ordered_employees_df = employees_df.iloc[ordered_employees].reset_index()
merged_df = pd.concat([vendors_df, ordered_employees_df], axis=1)

# adding the scores column and sorting by its values
scores.extend([0] * len(missing_employees))
merged_df["Similarity Ratio"] = pd.Series(scores) / 100
merged_df = merged_df.sort_values("Similarity Ratio", ascending=False)
For the matching according to the SSN columns, it can be done in exactly the same way by just replacing the column names in the above code. Moreover, the process can be generalized into a function that accepts DataFrames and column names:
def match_and_merge(df1: pd.DataFrame, df2: pd.DataFrame, col1: str, col2: str, cutoff: int = 80):
    # adding empty row
    df2 = df2.append(pd.Series(dtype=np.float64), ignore_index=True)
    index_of_empty = len(df2) - 1

    # matching between vendor and employee names
    indexed_strings_dict = dict(enumerate(df2[col2]))
    matched_indices = set()
    ordered_indices = []
    scores = []
    for s1 in df1[col1]:
        match = fuzzy_process.extractOne(
            query=s1,
            choices=indexed_strings_dict,
            score_cutoff=cutoff
        )
        score, index = match[1:] if match is not None else (0.0, index_of_empty)
        matched_indices.add(index)
        ordered_indices.append(index)
        scores.append(score)

    # detect unmatched employees to be positioned at the end of the dataframe
    missing_indices = [i for i in range(len(df2)) if i not in matched_indices]
    ordered_indices.extend(missing_indices)
    ordered_df2 = df2.iloc[ordered_indices].reset_index()

    # merge rows of dataframes
    merged_df = pd.concat([df1, ordered_df2], axis=1)

    # adding the scores column and sorting by its values
    scores.extend([0] * len(missing_indices))
    merged_df["Similarity Ratio"] = pd.Series(scores) / 100
    return merged_df.sort_values("Similarity Ratio", ascending=False)

if __name__ == "__main__":
    vendors_df = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
    employees_df = pd.read_excel(r'Directory\Sample Workday Data.xlsx')
    merged_df = match_and_merge(vendors_df, employees_df, "Vendor Name", "Employee Name")
    merged_df.to_excel("merged_by_names.xlsx", index=False)
    merged_df = match_and_merge(vendors_df, employees_df, "SSN", "SSN")
    merged_df.to_excel("merged_by_ssn.xlsx", index=False)
The above code results in the following outputs:
merged_by_names.xlsx
Company  Vendor ID  Vendor Name     Invoice Number  Transaction Amt  Vendor Type  SSN          index  Employee Name    Employee ID  Manager    SSN          Similarity Ratio
200      15789      SMITH, JOHN     80              40               Employee     965214872    2      SMITH, JOHN      3            Manager 3  965-21-4872  1
15       58421      CLIFFORD BROWN  854             500              Misc         668419628    0      BROWN, CLIFFORD  1            Manager 1  668-419-628  0.95
200      69997      HAROON, SIMAN   964             100              Misc         741-98-7821  3      HAROON, SIMON    4            Manager 4  741-98-7820  0.92
150      9675       GREEN           7412            70               One Time     774801971    4      nan              nan          nan        nan          0
nan      nan        nan             nan             nan              nan          nan          1      BLUE, CITY       2            Manager 2  874126487    0
merged_by_ssn.xlsx
Company  Vendor ID  Vendor Name     Invoice Number  Transaction Amt  Vendor Type  SSN          index  Employee Name    Employee ID  Manager    SSN          Similarity Ratio
200      69997      HAROON, SIMAN   964             100              Misc         741-98-7821  3      HAROON, SIMON    4            Manager 4  741-98-7820  0.91
15       58421      CLIFFORD BROWN  854             500              Misc         668419628    0      BROWN, CLIFFORD  1            Manager 1  668-419-628  0.9
200      15789      SMITH, JOHN     80              40               Employee     965214872    2      SMITH, JOHN      3            Manager 3  965-21-4872  0.9
150      9675       GREEN           7412            70               One Time     774801971    4      nan              nan          nan        nan          0
nan      nan        nan             nan             nan              nan          nan          1      BLUE, CITY       2            Manager 2  874126487    0

pandas groupby conditional frequency per user

I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({
    'User': ['101', '101', '102', '102', '102'],
    'Product': ['x', 'x', 'x', 'z', 'z'],
    'Country': ['India,Brazil', 'India', 'India,Brazil,Japan', 'India,Brazil', 'Brazil']
})
and I want to get the count of each country and product combination per user, like below:
first split the countries, then combine each country with the product and take the count.
Wanted output:
Here is one way combining other answers on SO (which just shows the power of searching :D)
import pandas as pd

df = pd.DataFrame({
    'User': ['101', '101', '102', '102', '102'],
    'Product': ['x', 'x', 'x', 'z', 'z'],
    'Country': ['India,Brazil', 'India', 'India,Brazil,Japan', 'India,Brazil', 'Brazil']
})

# Making use of: https://stackoverflow.com/a/37592047/7386332
j = (df.Country.str.split(',', expand=True).stack()
       .reset_index(drop=True, level=1)
       .rename('Country'))
df = df.drop('Country', axis=1).join(j)

# Reformat to get the desired Country_Product
df = (df.drop(['Country', 'Product'], axis=1)
        .assign(Country_Product=['_'.join(i) for i in zip(df['Country'], df['Product'])]))
df2 = df.groupby(['User', 'Country_Product'])['User'].count().rename('Count').reset_index()
print(df2)
Returns:
  User Country_Product  Count
0  101        Brazil_x      1
1  101         India_x      2
2  102        Brazil_x      1
3  102        Brazil_z      2
4  102         India_x      1
5  102         India_z      1
6  102         Japan_x      1
How about get_dummies?
import numpy as np

(df.set_index(['User', 'Product'])
   .Country.str.get_dummies(sep=',')
   .replace(0, np.nan)
   .stack()
   .sum(level=[0, 1, 2]))
Out[658]:
User  Product
101   x        Brazil    1.0
               India     2.0
102   x        Brazil    1.0
               India     1.0
               Japan     1.0
      z        Brazil    2.0
               India     1.0
dtype: float64
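On newer pandas the level argument to sum was removed (in pandas 2.0); the equivalent is a groupby on the index levels:
(df.set_index(['User', 'Product'])
   .Country.str.get_dummies(sep=',')
   .replace(0, np.nan)
   .stack()
   .groupby(level=[0, 1, 2])
   .sum())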
