pandas group by conditional frequency for each user - python

I have a dataframe like this:
df = pd.DataFrame({
    'User': ['101', '101', '102', '102', '102'],
    'Product': ['x', 'x', 'x', 'z', 'z'],
    'Country': ['India,Brazil', 'India', 'India,Brazil,Japan', 'India,Brazil', 'Brazil']
})
and I want to get the count of each country and product combination per user, like below:
first split the countries, then combine each with the product and take the count.
Wanted output:

Here is one way combining other answers on SO (which just shows the power of searching :D)
import pandas as pd
df = pd.DataFrame({
'User':['101','101','102','102','102'],
'Product':['x','x','x','z','z'],
'Country':['India,Brazil','India','India,Brazil,Japan','India,Brazil','Brazil']
})
# Making use of: https://stackoverflow.com/a/37592047/7386332
j = (df.Country.str.split(',', expand=True)
       .stack()
       .reset_index(drop=True, level=1)
       .rename('Country'))
df = df.drop('Country', axis=1).join(j)
# Reformat to get desired Country_Product
df = (df.drop(['Country', 'Product'], axis=1)
        .assign(Country_Product=['_'.join(i) for i in zip(df['Country'], df['Product'])]))
df2 = df.groupby(['User', 'Country_Product'])['User'].count().rename('Count').reset_index()
print(df2)
Returns:
  User Country_Product  Count
0  101        Brazil_x      1
1  101         India_x      2
2  102        Brazil_x      1
3  102        Brazil_z      2
4  102         India_x      1
5  102         India_z      1
6  102         Japan_x      1
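On pandas 0.25+, the same split-then-count can be sketched more directly with Series.explode (an alternative to the stack/join route above, assuming the same input frame):

```python
import pandas as pd

df = pd.DataFrame({
    'User': ['101', '101', '102', '102', '102'],
    'Product': ['x', 'x', 'x', 'z', 'z'],
    'Country': ['India,Brazil', 'India', 'India,Brazil,Japan', 'India,Brazil', 'Brazil'],
})

# Split the comma-separated countries into lists, explode to one country per row,
# build the Country_Product label, then count rows per (User, label) pair
out = (df.assign(Country=df['Country'].str.split(','))
         .explode('Country')
         .assign(Country_Product=lambda d: d['Country'] + '_' + d['Product'])
         .groupby(['User', 'Country_Product'])
         .size()
         .rename('Count')
         .reset_index())
print(out)
```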

How about get_dummies?
import numpy as np
# .sum(level=[0, 1, 2]) in older pandas; use groupby(level=...).sum() on modern versions
df.set_index(['User','Product']).Country.str.get_dummies(sep=',').replace(0, np.nan).stack().groupby(level=[0, 1, 2]).sum()
Out[658]:
User  Product
101   x        Brazil    1.0
               India     2.0
102   x        Brazil    1.0
               India     1.0
               Japan     1.0
      z        Brazil    2.0
               India     1.0
dtype: float64

Related

append missing names to end of a groupby pandas

I am working on a report automation task. I used a groupby, which yielded this table:
function_d = {"AvgLoadCur": 'mean'}
newdf = df.groupby(['sitename']).agg(function_d)
sitename                    AvgLoadCur
Biocon-SEZD-66/11KV SS 11         23.0
Biocon-SEZD-GT 1 120V DC          24.2
Biocon-SEZD-GT 2 120V DC          23.9
Biocon-SEZD-PLC 24V               21.4
df contains only four sitenames, so the groupby table also contains only those four. How can I append the two missing sitenames, which are stored in another dataframe, site['sitename']?
sitename
Biocon-SEZD-GT 1 120V DC
Biocon-SEZD-GT 2 120V DC
Biocon-SEZD-SCADA UPS
Biocon-SEZD-66/11KV SS 11
Biocon-SEZD-PLC 24V DC
BIOCON SEZ-HT PANEL 220 V
The Final Dataframe should look like
sitename AvgLoadCur
Biocon-SEZD-66/11KV SS 11 23.0
Biocon-SEZD-GT 1 120V DC 24.2
Biocon-SEZD-GT 2 120V DC 23.9
Biocon-SEZD-PLC 24V 21.4
Biocon-SEZD-HT PANEL 220 V --
Biocon-SEZD-SCADA UPS --
In short: how do I append elements from another dataframe that are not present in a groupby table?
groupby table:
Fruit Price
apple 34
A df table
Fruit
--------
apple
orange
Final groupby table
Fruit Price
apple 34
orange --
You can first merge your dataframes and then groupby:
df = pd.DataFrame({'Fruit': {0: 'apple'}, 'Price': {0: 34}})
df2 = pd.DataFrame({'Fruit': {0: 'apple', 1: 'orange'}})
(
    pd.merge(df, df2, on='Fruit', how='right')
      .groupby('Fruit')
      .agg(avg=('Price', 'mean'))
      .reset_index()
)
    Fruit   avg
0   apple  34.0
1  orange   NaN
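Another option (a sketch of the same idea without a merge): aggregate first, then reindex on the full key list so missing groups come back as NaN rows:

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['apple'], 'Price': [34]})
all_fruits = ['apple', 'orange']  # e.g. site['sitename'] in the original question

# Aggregate, then reindex on the complete key list; absent keys get NaN
out = (df.groupby('Fruit')['Price'].mean()
         .reindex(all_fruits)
         .reset_index())
print(out)
```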
I hope this answers your question:
df = pd.DataFrame([['apple', 1], ['orange', 2]], columns=['Fruit', 'Price'])
df2 = pd.DataFrame(['guava', 'apple', 'orange'], columns=['Fruit'])
for value in df2['Fruit'].values:
    if value not in df['Fruit'].values:
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, pd.DataFrame([{'Fruit': value, 'Price': '--'}])], ignore_index=True)
df
Output
    Fruit Price
0   apple     1
1  orange     2
2   guava    --

Dataframe to extract 2 most recent rows of each group

A simple data-frame and I want to pick the most recent 2 rows (sorted by "Year") with all columns.
import pandas as pd
data = {'People': ["John", "John", "John", "Kate", "Kate", "David", "David", "David", "David"],
        'Year': ["2018", "2019", "2006", "2017", "2012", "2006", "2019", "2018", "2017"],
        'Sales': [120, 100, 60, 150, 135, 140, 90, 110, 160]}
df = pd.DataFrame(data)
I tried the below, but it doesn't produce what's wanted:
df = df.groupby('People')
df_1 = pd.concat([df.head(2)]).drop_duplicates().sort_values('Year').reset_index(drop=True)
What's the right way to write it? Thank you.
IIUC, use pandas.DataFrame.nlargest:
df['Year'] = df['Year'].astype(int)
df.groupby('People', as_index=False).apply(lambda x: x.nlargest(2, "Year"))
Output:
     People  Year  Sales
0 6   David  2019     90
  7   David  2018    110
1 1    John  2019    100
  0    John  2018    120
2 3    Kate  2017    150
  4    Kate  2012    135
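An apply-free equivalent (a sketch using the same data): sort newest-first, then take the first two rows of each group with groupby.head:

```python
import pandas as pd

data = {'People': ["John", "John", "John", "Kate", "Kate", "David", "David", "David", "David"],
        'Year': ["2018", "2019", "2006", "2017", "2012", "2006", "2019", "2018", "2017"],
        'Sales': [120, 100, 60, 150, 135, 140, 90, 110, 160]}
df = pd.DataFrame(data)
df['Year'] = df['Year'].astype(int)

# Sort by Year descending within each People group, then keep the top two rows
top2 = (df.sort_values(['People', 'Year'], ascending=[True, False])
          .groupby('People')
          .head(2))
print(top2)
```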

Multiple aggregated Counting in Pandas

I have a DF:
data = [["John", "144", "Smith", "200"], ["Mia", "220", "John", "144"],
        ["Caleb", "155", "Smith", "200"], ["Smith", "200", "Jason", "500"]]
data_frame = pd.DataFrame(data, columns=["Name", "ID", "Manager_name", "Manager_ID"])
data_frame
OP:
Name ID Manager_name Manager_ID
0 John 144 Smith 200
1 Mia 220 John 144
2 Caleb 155 Smith 200
3 Smith 200 Jason 500
I am trying to count the number of people reporting under each person in the column Name.
The logic is:
Count the people reporting directly, plus everyone reporting under them in the chain. For example, with Smith: John and Caleb report to Smith (2), plus Mia, who reports to John (who already reports to Smith), for a total of 3.
Similarly for Jason: Smith reports to him, and 3 people already report to Smith, so the total is 4.
I understand how to do it pythonically with some recursion; is there a way to do it efficiently in Pandas? Any suggestions?
Expected OP:
Name Number of people reporting
John 1
Mia 0
Caleb 0
Smith 3
Jason 4
Scott Boston's Networkx solution is the preferred solution...
There are two solutions to this problem. The first is a vectorized pandas solution and should be fast over larger datasets; the second is pythonic, but does not scale well to the dataset size the OP was working with (the original df's shape is (223635, 4)).
PANDAS SOLUTION
This problem seeks to find out how many people each person in an organization manages, including subordinates' subordinates. This solution builds a dataframe by adding successive columns that hold the managers of the previous column, and then counts the occurrences of each employee in that dataframe to determine the total number under them.
First we set up the input.
import pandas as pd
import numpy as np
data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])
df = df[["SID", "Manager_SID"]]
# shortening the columns for convenience
df.columns = ["1", "2"]
print(df)
1 2
0 144 200
1 220 144
2 155 200
3 200 500
First, the employees without subordinates must be counted and put into a separate dictionary.
df_not_mngr = df.loc[~df['1'].isin(df['2']), '1']
non_mngr_dict = {str(key):0 for key in df_not_mngr.values}
non_mngr_dict
{'220': 0, '155': 0}
Next we modify the dataframe by repeatedly adding a column holding the managers of the previous column. The loop stops when the right-most column contains no employees.
for i in range(2, 10):
    df = df.merge(
        df[["1", "2"]], how="left", left_on=str(i), right_on="1", suffixes=("_l", "_r")
    ).drop("1_r", axis=1)
    df.columns = [str(x) for x in range(1, i + 2)]
    if df.iloc[:, -1].isnull().all():
        break
print(df)
1 2 3 4 5
0 144 200 500 NaN NaN
1 220 144 200 500 NaN
2 155 200 500 NaN NaN
3 200 500 NaN NaN NaN
All columns except the first are flattened, and each employee occurrence is counted into a dictionary.
from collections import Counter
result = dict(Counter(df.iloc[:, 1:].values.flatten()))
The non manager dictionary is added to the result.
result.update(non_mngr_dict)
result
{'200': 3, '500': 4, nan: 8, '144': 1, '220': 0, '155': 0}
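To turn that result into the expected table (a sketch; the nan key only counts the padding cells and can be ignored, and the SID-to-name mapping below is taken from the question's data):

```python
import pandas as pd

# counts produced by the steps above (nan key dropped)
result = {'200': 3, '500': 4, '144': 1, '220': 0, '155': 0}
names = {'144': 'John', '220': 'Mia', '155': 'Caleb', '200': 'Smith', '500': 'Jason'}

# Map SIDs to names and reshape the dict into a two-column dataframe
report = (pd.Series(result)
            .rename(index=names)
            .rename('Number of people reporting')
            .reset_index()
            .rename(columns={'index': 'Name'}))
print(report)
```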
RECURSIVE PYTHONIC SOLUTION
I think this is probably way more pythonic than you were looking for. First I created a list all_sids to make sure we capture all employees, as not all appear in each column.
import pandas as pd
import numpy as np
data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])
all_sids = pd.unique(df[['SID', 'Manager_SID']].values.ravel('K'))
Then create a pivot table.
dfp = df.pivot_table(values='Name', index='SID', columns='Manager_SID', aggfunc='count')
dfp
Manager_SID 144 200 500
SID
144 NaN 1.0 NaN
155 NaN 1.0 NaN
200 NaN NaN 1.0
220 1.0 NaN NaN
Then a function that will go through the pivot table to total up all the reports.
def count_mngrs(SID, count=0):
    if str(SID) not in dfp.columns:
        return count
    else:
        count += dfp[str(SID)].sum()
        sid_list = dfp[dfp[str(SID)].notnull()].index
        for sid in sid_list:
            count = count_mngrs(sid, count)
        return count
Call the function for each employee and print the results.
print('SID', ' Number of People Reporting')
for sid in all_sids:
    print(sid, " ", int(count_mngrs(sid)))
Results are below; sorry, I was a bit lazy about putting the names with the SIDs.
SID Number of People Reporting
144 1
220 0
155 0
200 3
500 4
Look forward to seeing a more pandas type solution!
This is also a graph problem, and you can use Networkx:
import networkx as nx
import pandas as pd
data = [["John", "144", "Smith", "200"], ["Mia", "220", "John", "144"],
        ["Caleb", "155", "Smith", "200"], ["Smith", "200", "Jason", "500"]]
data_frame = pd.DataFrame(data, columns=["Name", "ID", "Manager_name", "Manager_ID"])
# create a directed graph object using nx.DiGraph
G = nx.from_pandas_edgelist(data_frame,
                            source='Name',
                            target='Manager_name',
                            create_using=nx.DiGraph())
# use nx.ancestors to get the set of "ancestor" nodes for each node in the directed graph
pd.DataFrame.from_dict({i: len(nx.ancestors(G, i)) for i in G.nodes()},
                       orient='index',
                       columns=['Num of People reporting'])
Output:
Num of People reporting
John 1
Smith 3
Mia 0
Caleb 0
Jason 4
Draw networkx:
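The figure itself isn't included here, but a drawing along those lines could be produced with nx.draw and matplotlib (a sketch; layout and styling are up to taste):

```python
import networkx as nx
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe backend for this sketch
import matplotlib.pyplot as plt

data = [["John", "144", "Smith", "200"], ["Mia", "220", "John", "144"],
        ["Caleb", "155", "Smith", "200"], ["Smith", "200", "Jason", "500"]]
data_frame = pd.DataFrame(data, columns=["Name", "ID", "Manager_name", "Manager_ID"])

G = nx.from_pandas_edgelist(data_frame, source='Name', target='Manager_name',
                            create_using=nx.DiGraph())

# Draw the reporting hierarchy with labeled nodes and save it to a file
nx.draw(G, with_labels=True, node_color='lightblue')
plt.savefig('reporting_graph.png')
```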

Aggregating column values of dataframe to a new dataframe

I have a dataframe which includes Vendor, Product, and Price of various listings on a market, among other columns.
I need a dataframe which has, as separate columns, the unique vendors, their number of products, the sum of their listings' prices, the average price per product, and (average * no. of sales).
Something like this -
What's the best way to make this new dataframe?
Thanks!
First multiply the Number of Sales column by Price, then use DataFrameGroupBy.agg with a dictionary mapping column names to aggregate functions, then flatten the MultiIndex columns with map and rename:
df['Number of Sales'] *= df['Price']
d1 = {'Product': 'size', 'Price': ['sum', 'mean'], 'Number of Sales': 'mean'}
df = df.groupby('Vendor').agg(d1)
df.columns = df.columns.map('_'.join)
d = {'Product_size': 'No. of Product',
     'Price_sum': 'Sum of Prices',
     'Price_mean': 'Mean of Prices',
     'Number of Sales_mean': 'H Factor'}
df = df.rename(columns=d).reset_index()
print(df)
Vendor No. of Product Sum of Prices Mean of Prices H Factor
0 A 4 121 30.25 6050.0
1 B 1 12 12.00 1440.0
2 C 2 47 23.50 587.5
3 H 1 45 45.00 9000.0
You can do it using groupby(), like this:
df.groupby('Vendor').agg({'Products': 'count', 'Price': ['sum', 'mean']})
That's just three columns, but you can work out the rest.
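To cover all the requested columns in one pass, named aggregation (pandas 0.25+) keeps the output flat; the data and output column names below are made up for illustration, since the question's dataframe isn't shown:

```python
import pandas as pd

# Hypothetical listings data in the shape the question describes
df = pd.DataFrame({
    'Vendor': ['A', 'A', 'B', 'C', 'C'],
    'Product': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'Price': [10, 20, 12, 25, 22],
    'Number of Sales': [100, 150, 120, 10, 15],
})

# One named aggregation per output column: (source column, aggregate function)
out = df.groupby('Vendor').agg(
    num_products=('Product', 'size'),
    sum_prices=('Price', 'sum'),
    mean_price=('Price', 'mean'),
    mean_sales=('Number of Sales', 'mean'),
).reset_index()
print(out)
```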
You can do this by using pandas pivot_table. Here is an example based on your data (d below stands for your original dataframe):
import pandas as pd
import numpy as np
>>> f = pd.pivot_table(d, index=['Vendor', 'Sales'], values=['Price', 'Product'], aggfunc={'Price': np.sum, 'Product': np.ma.count}).reset_index()
>>> f['Avg Price/Product'] = f['Price'] / f['Product']
>>> f['H Factor'] = f['Sales'] * f['Avg Price/Product']
>>> f.drop('Sales', axis=1)
Vendor Price Product Avg Price/Product H Factor
0 A 121 4 30.25 6050.0
1 B 12 1 12.00 1440.0
2 C 47 2 23.50 587.5
3 H 45 1 45.00 9000.0

Pandas Expanding Mean with Group By and before current row date

I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017', '10'],
                   ['John', '2/2/2017', '15'],
                   ['John', '2/2/2017', '20'],
                   ['John', '3/3/2017', '30'],
                   ['Sue', '1/1/2017', '10'],
                   ['Sue', '2/2/2017', '15'],
                   ['Sue', '3/2/2017', '20'],
                   ['Sue', '3/3/2017', '7'],
                   ['Sue', '4/4/2017', '20']],
                  columns=['Customer', 'Deposit_Date', 'DPD'])
What is the best way to calculate the PreviousMean column in the screenshot below?
The column is the year-to-date average of DPD for that customer, i.e. it includes all DPDs up to, but not including, rows that match the current deposit date. If no previous records exist, then it's null or 0.
Screenshot:
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
Instead of grouping & expanding the mean, filter the dataframe on these conditions and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all rows in the dataframe:
# convert dtypes first, so the date comparison is chronological and DPD can be averaged
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = df['DPD'].astype(int)
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)
outputs:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
Here's one way to exclude repeated days from the mean calculation:
# create helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# expanding mean per customer (pd.expanding_mean was removed; use .expanding().mean())
df['CumMean'] = df.groupby(['Customer Name'])['DPD2'].apply(lambda x: x.expanding().mean())
# drop helper series
df = df.drop('DPD2', axis=1)
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
Ok here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer & deposit date level containing a shifted mean. To calculate this mean you have to calculate the sum and the count first.
s = df.groupby(['Customer Name', 'Deposit_Date'], as_index=False)[['DPD']].agg(['count', 'sum'])
s.columns = [' '.join(col) for col in s.columns]
s.reset_index(inplace=True)
s['DPD_CumSum'] = s.groupby(['Customer Name'])['DPD sum'].cumsum()
s['DPD_CumCount'] = s.groupby(['Customer Name'])['DPD count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']
s['DPD_PrevMean'] = s.groupby(['Customer Name'])['DPD_CumMean'].shift(1)
df = df.merge(s[['Customer Name', 'Deposit_Date', 'DPD_PrevMean']], how='left', on=['Customer Name', 'Deposit_Date'])
