Aggregating column values of dataframe to a new dataframe - python

I have a dataframe which involves Vendor, Product, Price of various listings on a market among other column values.
I need a dataframe which has the unique vendors, number of products, sum of their product listings, average price/product and (average * no. of sales) as different columns.
Something like this -
What's the best way to make this new dataframe?
Thanks!

First multiply the Number of Sales column by Price, then use DataFrameGroupBy.agg with a dictionary of column names and aggregate functions, and finally flatten the MultiIndex in the columns with map and rename:
df['Number of Sales'] *= df['Price']
d1 = {'Product':'size', 'Price':['sum', 'mean'], 'Number of Sales':'mean'}
df = df.groupby('Vendor').agg(d1)
df.columns = df.columns.map('_'.join)
d = {'Product_size':'No. of Product',
'Price_sum':'Sum of Prices',
'Price_mean':'Mean of Prices',
'Number of Sales_mean':'H Factor'
}
df = df.rename(columns=d).reset_index()
print (df)
Vendor No. of Product Sum of Prices Mean of Prices H Factor
0 A 4 121 30.25 6050.0
1 B 1 12 12.00 1440.0
2 C 2 47 23.50 587.5
3 H 1 45 45.00 9000.0
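As a side note, the first line above overwrites the Number of Sales column in place. If the original frame should stay untouched, a small sketch of a non-mutating variant (reusing d1 and d from above) could look like this:
# non-mutating variant: compute Price * Number of Sales via assign instead of *=
out = (df.assign(**{'Number of Sales': df['Number of Sales'] * df['Price']})
         .groupby('Vendor')
         .agg(d1))
out.columns = out.columns.map('_'.join)
out = out.rename(columns=d).reset_index()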

You can do it using groupby(), like this:
df.groupby('Vendor').agg({'Products': 'count', 'Price': ['sum', 'mean']})
That's just three columns, but you can work out the rest.
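As a rough sketch of what "the rest" could look like with named aggregation (assuming the columns are called 'Vendor', 'Product', 'Price' and 'Number of Sales' as in the question):
# one aggregation per output column, then derive the product column
summary = (df.groupby('Vendor')
             .agg(no_of_products=('Product', 'size'),
                  sum_of_prices=('Price', 'sum'),
                  avg_price=('Price', 'mean'),
                  avg_sales=('Number of Sales', 'mean'))
             .reset_index())
# one reading of the "(average * no. of sales)" column from the question
summary['h_factor'] = summary['avg_price'] * summary['avg_sales']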

You can do this by using pandas pivot_table. Here is an example based on your data.
import pandas as pd
import numpy as np
>>> f = pd.pivot_table(df, index=['Vendor', 'Sales'], values=['Price', 'Product'], aggfunc={'Price': np.sum, 'Product': np.ma.count}).reset_index()
>>> f['Avg Price/Product'] = f['Price']/f['Product']
>>> f['H Factor'] = f['Sales']*f['Avg Price/Product']
>>> f.drop('Sales', axis=1)
Vendor Price Product Avg Price/Product H Factor
0 A 121 4 30.25 6050.0
1 B 12 1 12.00 1440.0
2 C 47 2 23.50 587.5
3 H 45 1 45.00 9000.0

Write python function that sums values from certain rows for each index type using groupby

In my dataframe, df, I am trying to sum the values from the value column for each Product and Year for two periods of the year (Month), specifically Months 1 through 3 and Months 9 through 11. I know I need to use groupby to group Products and Years, and possibly use a lambda function (or an if statement) to separate the two periods of time.
Here's my data frame df:
import pandas as pd
products = {'Product': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C',
'C','C','C'],
'Month': [1,1,3,4,5,10,4,5,10,11,2,3,5,3,9,
10,11,12],
'Year': [1999,1999,1999,1999,1999,1999,2017,2017,1988,1988,2002,2002,2002,2003,2003,
2003,2003,2003],
'value': [250,810,1200,340,250,800,1200,400,250,800,1200,300,290,800,1200,300, 1200, 300]
}
df = pd.DataFrame(products, columns= ['Product', 'Month','Year','value'])
df
And I want a table that looks something like this:
products = {'Product': ['A','A','B','B','C','C','C'],
'MonthGroups': ['Month1:3','Month9:11','Month1:3','Month9:11','Month1:3','Month1:3','Month9:11'],
'Year': [1999,1999,2017,1988,2002, 2003, 2003],
'SummedValue': [2260, 800, 0, 1050, 1500, 800, 2700]
}
new_df = pd.DataFrame(products, columns= ['Product', 'MonthGroups','Year','SummedValue'])
new_df
What I have so far is that I should use groupby to group Product and Year. What I'm stuck on is defining the two "Month Groups" (Months 1 through 3 and Months 9 through 11), for which value should be summed per year.
df.groupby(['Product','Year']).value.sum().loc[lambda p: p > 10].to_frame()
This isn't right though because it needs to sum based on the month groups.
First create the new column with numpy.select via DataFrame.assign, then aggregate by MonthGroups as well; because groupby by default removes rows with missing values in the by columns (here MonthGroups), the non-matching groups are omitted:
import numpy as np

df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                           df['Month'].between(9,11)],
                                          ['Month1:3','Month9:11'], default=None))
         .groupby(['Product','MonthGroups','Year']).value
         .sum()
         .reset_index(name='SummedValue')
       )
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
If 0 sums are also needed for the non-matched rows:
df2 = df[['Product','Year']].drop_duplicates().assign(MonthGroups='Month1:3',SummedValue=0)
df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                           df['Month'].between(9,11)],
                                          ['Month1:3','Month9:11'], default=None))
         .groupby(['Product','MonthGroups','Year']).value
         .sum()
         .reset_index(name='SummedValue')
         .append(df2)
         .drop_duplicates(['Product','MonthGroups','Year'])
       )
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
6 B Month1:3 2017 0
8 B Month1:3 1988 0
A little different approach using pd.cut:
bins = [0,3,8,11]
s = pd.cut(df['Month'],bins,labels=['1:3','irrelevant','9:11'])
(df[s.isin(['1:3','9:11'])].assign(MonthGroups=s.astype(str))
.groupby(['Product','MonthGroups','Year'])['value'].sum().reset_index())
Product MonthGroups Year value
0 A 1:3 1999 2260
1 A 9:11 1999 800
2 B 9:11 1988 1050
3 C 1:3 2002 1500
4 C 1:3 2003 800
5 C 9:11 2003 2700
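If the summed column should be called SummedValue, as in the desired output, a rename can be chained onto the end of the same expression, for example:
(df[s.isin(['1:3','9:11'])].assign(MonthGroups=s.astype(str))
   .groupby(['Product','MonthGroups','Year'])['value'].sum()
   .reset_index()
   .rename(columns={'value': 'SummedValue'}))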

How can I convert rows to columns (with custom names) after grouping?

I'm trying to get some row data as columns with pandas.
My original dataframe is something like the following (with a lot more columns). Most data repeats for the same employee, but some info changes, like salary in this example. Employees have different numbers of entries (in this case employee 1 has two entries, employee 2 has four, and so on).
employee_id salary other1 other2 other3
1 50000 somedata1 somedata2 somedata3
1 48000 somedata1 somedata2 somedata3
2 80000 somedata20 somedata21 somedata22
2 77000 somedata20 somedata21 somedata22
2 75000 somedata20 somedata21 somedata22
2 74000 somedata20 somedata21 somedata22
3 60000 somedata30 somedata31 somedata32
I'm trying to get something like the following. Salary data should span a few columns and use the last available salary for employees with fewer entries (the repeated salary values in this example).
employee_id salary prevsalary1 prevsalary2 prevsalary3 other1 other2 other3
1 50000 48000 48000 48000 somedata1 somedata2 somedata3
2 80000 77000 75000 74000 somedata20 somedata21 somedata22
3 60000 60000 60000 60000 somedata30 somedata31 somedata32
I tried grouping
df.groupby(["employee_id"])['salary'].nlargest(3).reset_index()
But I don't get all the columns. I can't find a way to preserve the rest of the columns. Do I need to merge, concatenate or something like that with the original dataframe?
Also, I get a column named "level_1". I think I could get rid of it by using reset_index(level=1, drop=True) but I believe this doesn't return a dataframe.
And finally, I guess if I get this grouping right, there's one more step to get the columns... maybe using pivot or unstack?
I'm starting my journey into machine learning and I keep scratching my head with this one, I hope you can help me :)
Creating dataset:
df = pd.DataFrame({'employee_id':[1,1,2,2,2,2,3],'salary':[50000,48000,80000,77000,75000,74000,60000]})
df['other1'] =['somedata1','somedata1','somedata20','somedata20','somedata20','somedata20','somedata30']
df['other2'] = df['other1'].apply(lambda x: x+'1')
df['other3'] = df['other1'].apply(lambda x: x+'2')
df
Out[59]:
employee_id salary other1 other2 other3
0 1 50000 somedata1 somedata11 somedata12
1 1 48000 somedata1 somedata11 somedata12
2 2 80000 somedata20 somedata201 somedata202
3 2 77000 somedata20 somedata201 somedata202
4 2 75000 somedata20 somedata201 somedata202
5 2 74000 somedata20 somedata201 somedata202
6 3 60000 somedata30 somedata301 somedata302
One way is using pd.pivot_table with ffill:
g = df.groupby('employee_id')
cols = g.salary.cumcount()
out = df.pivot_table(index='employee_id', values='salary', columns=cols).ffill(axis=1)
# Create list of column names matching the expected output
out.columns = ['salary'] + [f'prevsalary{i}' for i in range(1,len(out.columns))]
print(out)
salary prevsalary1 prevsalary2 prevsalary3
employee_id
1 50000.0 48000.0 48000.0 48000.0
2 80000.0 77000.0 75000.0 74000.0
3 60000.0 60000.0 60000.0 60000.0
Now we just need to join with the unique other columns from the original dataframe:
out = out.join(df.filter(like='other').groupby(df.employee_id).first())
print(out)
salary prevsalary1 prevsalary2 prevsalary3 other1 \
employee_id
1 50000.0 48000.0 48000.0 48000.0 somedata1
2 80000.0 77000.0 75000.0 74000.0 somedata20
3 60000.0 60000.0 60000.0 60000.0 somedata30
other2 other3
employee_id
1 somedata2 somedata3
2 somedata21 somedata22
3 somedata31 somedata32
Pivot the table of salaries first, then merge with the non-salary data:
# first create a copy of the dataset without the salary column
dataset_without_salaries = df.drop('salary', axis=1).drop_duplicates()
# pivot only salary column
temp = pd.pivot_table(data=df[['salary']], index=df['employee_id'], aggfunc=list)
# expand the list
temp2 = temp.apply(lambda x: pd.Series(x['salary']), axis=1)
# merge the two together
final = pd.merge(temp2, dataset_without_salaries, left_index=True, right_on='employee_id')
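As a follow-up sketch (not part of the original answer): temp2 comes out with integer column names 0, 1, 2, ... and NaN where an employee has fewer salary entries, so before merging it could be forward-filled and renamed to match the desired layout:
# fill missing previous salaries with the last available one and name the columns
temp2 = temp2.ffill(axis=1)
temp2.columns = ['salary'] + [f'prevsalary{i}' for i in range(1, temp2.shape[1])]
# then merge as above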

Calculating total monthly cumulative number of Order

I need to find the total monthly cumulative number of orders. I have 2 columns, OrderDate and OrderId. I can't use a list to find the cumulative numbers since the data is so large, and the result should be in year_month format along with the cumulative order total for each month.
orderDate OrderId
2011-11-18 06:41:16 23
2011-11-18 04:41:16 2
2011-12-18 06:41:16 69
2012-03-12 07:32:15 235
2012-03-12 08:32:15 234
2012-03-12 09:32:15 235
2012-05-12 07:32:15 233
Desired result:
Date CumulativeOrder
2011-11 2
2011-12 3
2012-03 6
2012-05 7
I have imported my Excel file into PyCharm and use pandas to read it.
I have tried to split the datetime column into year and month and then group, but I am not getting the correct result.
df1 = df1[['OrderId','orderDate']]
df1['year'] = pd.DatetimeIndex(df1['orderDate']).year
df1['month'] = pd.DatetimeIndex(df1['orderDate']).month
df1.groupby(['year','month']).sum().groupby('year','month').cumsum()
print (df1)
Convert the column to datetimes, then to monthly periods with to_period, add a counter column with numpy.arange, and finally remove duplicates by the Date column with DataFrame.drop_duplicates, keeping the last duplicate:
import numpy as np
df1['orderDate'] = pd.to_datetime(df1['orderDate'])
df1['Date'] = df1['orderDate'].dt.to_period('m')
#use if not sorted datetimes
#df1 = df1.sort_values('Date')
df1['CumulativeOrder'] = np.arange(1, len(df1) + 1)
print (df1)
orderDate OrderId Date CumulativeOrder
0 2011-11-18 06:41:16 23 2011-11 1
1 2011-11-18 04:41:16 2 2011-11 2
2 2011-12-18 06:41:16 69 2011-12 3
3 2012-03-12 07:32:15 235 2012-03 4
df2 = df1.drop_duplicates('Date', keep='last')[['Date','CumulativeOrder']]
print (df2)
Date CumulativeOrder
1 2011-11 2
2 2011-12 3
3 2012-03 4
Another solution:
df2 = (df1.groupby(df1['orderDate'].dt.to_period('m')).size()
.cumsum()
.rename_axis('Date')
.reset_index(name='CumulativeOrder'))
print (df2)
Date CumulativeOrder
0 2011-11 2
1 2011-12 3
2 2012-03 6
3 2012-05 7

Drop percentage of a dataframe [pandas]

I have a dataset with a very long tail and wish to sample only 90% of the data.
city score
bangkok 60
kl 20
sydney 10
melbourne 5
dhaka 5
should be:
city score
bangkok 60
kl 20
sydney 10
First, sort the values, since you want to keep the rows that make up the highest 90% of the data:
df.sort_values('score', ascending=False, inplace=True)
Then calculate the cumulative sum and divide by the total to build the filtering condition (you can replace 0.9 with your own limit):
df = df[df['score'].cumsum() / df['score'].sum() < 0.9]
Now df looks like
city score
bangkok 60
kl 20
sydney 10
I believe you need each score's share of the total sum, then filter by boolean indexing, and last sort_values; for better performance, sort only the filtered rows:
a = 0.9
df = df[df['score'].div(df['score'].sum()) >= 1 - a].sort_values('score', ascending=False)
Or:
df = df[df['score'].div(df['score'].sum()) >= 0.1].sort_values('score', ascending=False)
print (df)
city score
0 bangkok 60
1 kl 20
2 sydney 10
Detail:
print (df['score'].div(df['score'].sum()))
0 0.60
1 0.20
2 0.10
3 0.05
4 0.05
Name: score, dtype: float64

pandas group by conditional frequency for each user wise

I have a dataframe like this:
df = pd.DataFrame({
'User':['101','101','102','102','102'],
'Product':['x','x','x','z','z'],
'Country':['India,Brazil','India','India,Brazil,Japan','India,Brazil','Brazil']
})
and I want to get the count of each country and product combination per user, like below: first split the countries, then combine with the product and take the count.
wanted output:
Here is one way combining other answers on SO (which just shows the power of searching :D)
import pandas as pd
df = pd.DataFrame({
'User':['101','101','102','102','102'],
'Product':['x','x','x','z','z'],
'Country':['India,Brazil','India','India,Brazil,Japan','India,Brazil','Brazil']
})
# Making use of: https://stackoverflow.com/a/37592047/7386332
j = (df.Country.str.split(',', expand=True).stack()
.reset_index(drop=True, level=1)
.rename('Country'))
df = df.drop('Country', axis=1).join(j)
# Reformat to get desired Country_Product
df = (df.drop(['Country','Product'], axis=1)
.assign(Country_Product=['_'.join(i) for i in zip(df['Country'], df['Product'])]))
df2 = df.groupby(['User','Country_Product'])['User'].count().rename('Count').reset_index()
print(df2)
Returns:
User Country_Product Count
0 101 Brazil_x 1
1 101 India_x 2
2 102 Brazil_x 1
3 102 Brazil_z 2
4 102 India_x 1
5 102 India_z 1
6 102 Japan_x 1
How about get_dummies:
import numpy as np

(df.set_index(['User','Product']).Country.str.get_dummies(sep=',')
   .replace(0, np.nan).stack().sum(level=[0,1,2]))
Out[658]:
User Product
101 x Brazil 1.0
India 2.0
102 x Brazil 1.0
India 1.0
Japan 1.0
z Brazil 2.0
India 1.0
dtype: float64
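If a flat frame like the first answer's output is preferred, the resulting Series can be converted back with reset_index; a small sketch, assuming the expression above is assigned to a variable such as s:
out = s.reset_index()
out.columns = ['User', 'Product', 'Country', 'Count']  # name the unnamed country level
print(out)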
