Summarizing a dataset and creating new variables - python

I have a dataset that lists individual transactions by country, quarter, division, transaction type and value. I would like to sum it up by the first three variables but create new columns for the other two. The dataset looks like this:
Country  Quarter  Division   Type  Value
A        1        Sales      A     50
A        2        Sales      A     150
A        3        Sales      B     20
A        1        Sales      A     250
A        2        Sales      B     50
A        3        Sales      B     50
A        2        Marketing  A     50
Now I would like to aggregate the data to get the number of transactions by type as a new variable. The overall number of transactions grouped by the first three variables is easy:
df.groupby(['Country', 'Quarter', 'Division'], as_index=False).agg({'Type':'count', 'Value':'sum'})
However, I would like my new dataframe to look as follows:
Country  Quarter  Division   Type_A  Type_B  Value_A  Value_B
A        1        Sales      2       0       300      0
A        2        Sales      1       1       150      50
A        3        Sales      0       2       0        70
A        2        Marketing  1       0       50       0
How do I do that?

Select the Value column after the groupby and pass (name, function) tuples to agg to name the aggregated columns, then reshape with DataFrame.unstack, and finally flatten the resulting MultiIndex columns with map:
df1 = (df.groupby(['Country', 'Quarter', 'Division', 'Type'])['Value']
         .agg([('Type', 'count'), ('Value', 'sum')])
         .unstack(fill_value=0))
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index()
print(df1)
  Country  Quarter   Division  Type_A  Type_B  Value_A  Value_B
0       A        1      Sales       2       0      300        0
1       A        2  Marketing       1       0       50        0
2       A        2      Sales       1       1      150       50
3       A        3      Sales       0       2        0       70
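As a rough alternative sketch, pd.pivot_table can compute both aggregations per Type in one call (the sample df below is rebuilt from the question, not a tested session):

import pandas as pd

df = pd.DataFrame({
    'Country': ['A'] * 7,
    'Quarter': [1, 2, 3, 1, 2, 3, 2],
    'Division': ['Sales'] * 6 + ['Marketing'],
    'Type': ['A', 'A', 'B', 'A', 'B', 'B', 'A'],
    'Value': [50, 150, 20, 250, 50, 50, 50],
})

# one pivot_table call, two aggregation functions per Type
out = df.pivot_table(index=['Country', 'Quarter', 'Division'],
                     columns='Type', values='Value',
                     aggfunc=['count', 'sum'], fill_value=0)

# flatten ('count', 'A') -> 'Type_A', ('sum', 'A') -> 'Value_A', etc.
out.columns = [('Type_' if f == 'count' else 'Value_') + t for f, t in out.columns]
out = out.reset_index()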

Related

Changing column values based on previous rows' values in python

I have a data frame that looks like this:
Name          Order
Manufacturer  0
Company1      1
product       2
Company2      1
product       2
product       2
product       2
The only identifier for the values in the Name column is Order, where 0 represents manufacturers, 1 represents companies, and 2 represents products.
I want to add a column whose value is built by comparing the previous and current rows within the same group: basically, I want to identify that Company1 belongs to the Manufacturer, that each product belongs to its company, and so on:
Name          Order  Desired_Output
Manufacturer  0      Manufacturer
Company1      1      Manufacturer_Company1
product       2      Company1_product
Company2      1      Manufacturer_Company2
product       2      Company2_product
product       2      Company2_product
product       2      Company2_product
You can pivot the data, ffill and join the last two items:
df['output'] = (df
    .reset_index()
    .pivot(index='index', columns='Order', values='Name')
    .ffill()
    .apply(lambda d: '_'.join(d.dropna().iloc[-2:]), axis=1)
)
NB. This should work with any number of 0 values.
output:
           Name  Order                 output
0  Manufacturer      0           Manufacturer
1      Company1      1  Manufacturer_Company1
2       product      2       Company1_product
3      Company2      1       Company2_product
4       product      2       Company2_product
5       product      2       Company2_product
6       product      2       Company2_product
This one is a little tricky but will do the job. For each row, look back at the most recent Name one Order level up and prepend it:

df['output'] = df.apply(
    lambda x: df['Name'].where(df['Order'] == x['Order'] - 1).loc[:x.name].dropna().iloc[-1]
              + '_' + x['Name'] if x['Order'] > 0 else x['Name'],
    axis=1)

However, this is not the most efficient solution for large DataFrames.
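If the Order values are consecutive integers starting at 0 and the index is the default RangeIndex, a vectorized sketch building on the pivot idea above avoids the per-row scan (a hypothetical variant, not from the original answers):

levels = df.pivot(columns='Order', values='Name').ffill()  # latest Name seen at each level
parent = [levels.iat[i, o - 1] if o > 0 else None          # Name one level up, None at the root
          for i, o in enumerate(df['Order'])]
df['output'] = [n if p is None else f'{p}_{n}'
                for p, n in zip(parent, df['Name'])]

Because each row's parent is keyed off its own Order level, the Company2 row comes out as Manufacturer_Company2, matching the desired output.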

Python Pandas: Conditional subtraction of data between two dataframes?

I'm trying to use conditional subtraction between two dataframes.
Dataframe df1 has columns name and price; name is not unique.
>>> df1
    name  price
0   mark     50
1   mark    200
2   john     10
3  chris    500
Another dataframe, df2, has two columns, name and paid. Here name is unique:
>>> df2
   name  paid
0  mark   150
1  john    10
How can I conditionally subtract the two dataframes to get the following expected output?
    name  price  paid
0   mark     50    50
1   mark    200   100
2   john     10    10
3  chris    500     0
IIUC, you can use:
# mapper for paid values
s = df2.set_index('name')['paid']

df1['paid'] = (df1
    .groupby('name')['price']                       # for each name
    .apply(lambda g: g.cumsum()                     # running total owed
             .clip(upper=s.get(g.name, default=0))  # capped at the amount paid
             .pipe(lambda c: c.diff().fillna(c))    # undo the cumsum to get per-row paid
          )
)
output:
    name  price   paid
0   mark     50   50.0
1   mark    200  100.0
2   john     10   10.0
3  chris    500    0.0
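The same allocation written as an explicit greedy loop may be easier to follow (a sketch over the df1/df2 above; the allocate helper is made up for illustration):

paid = dict(zip(df2['name'], df2['paid']))
remaining = {n: paid.get(n, 0) for n in df1['name'].unique()}

def allocate(row):
    # hand out as much of this name's remaining paid amount as the row's price allows
    take = min(row['price'], remaining[row['name']])
    remaining[row['name']] -= take
    return take

df1['paid'] = df1.apply(allocate, axis=1)

This produces the same 50/100/10/0 split, as integers rather than floats.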

pandas groupby aggregate to find number of days customer made at least 1 transaction

I have a customer transaction dataset like this:
ID  Date    Amount
1   1-1-21  5
2   2-1-21  8
1   2-1-21  6
1   3-1-21  5
2   3-1-21  9
2   3-1-21  10
I have to group by and aggregate the data at the customer level like this:
ID  Total Amount  Number of days active
1   16            3
2   27            2
Total Amount = sum of the Amount column.
Number of days active = number of days the customer made 1 or more transactions.
How do I calculate the Number of days active column? So far I have tried:

df = df.groupby('ID').agg({'Amount': lambda price: price.sum(),
                           'Date': lambda date: len(date).days})

My Total Amount column is fine, but I cannot get the Number of days active.
Use groupby with agg: nunique for the dates plus sum for the amounts:

out = (df.groupby('ID')
         .agg(Numberofdaysactive=('Date', 'nunique'),
              TotalAmount=('Amount', 'sum'))
         .reset_index())
out
Out[384]:
   ID  Numberofdaysactive  TotalAmount
0   1                   3           16
1   2                   2           27
nunique should be what you need. That is, the aggregate df can be calculated by:
df_agg = df.groupby('ID').agg({"Amount": sum, "Date": pd.Series.nunique})
Note how you can pass function handles directly to agg.
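One caveat worth hedging: nunique counts distinct Date values, which works here because the dates are day-level strings. If Date ever carried timestamps, converting first and normalizing to calendar days would keep the count correct (the format string is assumed from the sample data):

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%y')
out = (df.groupby('ID')
         .agg(TotalAmount=('Amount', 'sum'),
              DaysActive=('Date', lambda s: s.dt.normalize().nunique()))
         .reset_index())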

How to get the Percentage of a Column based on a Condition? Python

I want to calculate the percentage of each value in my Products column according to its occurrences within the related Country. I would greatly appreciate your help.
Here is what I did so far,
I calculated my new dataframe with this code:
gb = data1.groupby(['Country', 'Products']).size()
df = gb.to_frame(name='ProductsCount').reset_index()
df
Which gives me something that look like this:
   Countries   Products  ProductsCount
0  Country 1  Product 1              5
1  Country 1  Product 2             31
2  Country 2  Product 1              2
3  Country 2  Product 2              1
Note: I have a couple of thousand rows of output.
My goal is to get the percentage of each product within its country directly, without the intermediate ProductsCount column, like this:
   Countries   Products  Percentage
0  Country 1  Product 1       0.138
1  Country 1  Product 2       0.861
2  Country 2  Product 1       0.667
3  Country 2  Product 2       0.333
Otherwise, if I can't get the output to show only the percentage, then I would like something like this:
   Countries   Products  ProductsCount  Products%
0  Country 1  Product 1              5      0.138
1  Country 1  Product 2             31      0.861
2  Country 2  Product 1              2      0.667
3  Country 2  Product 2              1      0.333
I managed to calculate only the % according to the whole dataset using this code:
df['Products%'] = df.ProductsCount/len(df.Country)
Thank you in advance!
Use SeriesGroupBy.value_counts with the normalize=True parameter:
df = (data1.groupby('Countries')['Products']
           .value_counts(normalize=True, sort=False)
           .reset_index(name='Percentage'))
print(df)
   Countries   Products  Percentage
0  Country 1  Product 1    0.138889
1  Country 1  Product 2    0.861111
2  Country 2  Product 1    0.666667
3  Country 2  Product 2    0.333333
EDIT: to keep the counts and add the percentage, divide each count by its country's total:

df = (data1.groupby('Countries')['Products']
           .value_counts(sort=False)
           .reset_index(name='ProductsCount')
           .assign(Percentage=lambda x: x['ProductsCount']
               .div(x.groupby('Countries')['ProductsCount'].transform('sum'))))
print(df)
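As another rough sketch, pd.crosstab with normalize='index' yields the same per-country proportions in one call (column names assumed from the answer above; combinations that never occur appear as 0.0 rather than being dropped):

pct = (pd.crosstab(data1['Countries'], data1['Products'], normalize='index')
         .stack()
         .reset_index(name='Percentage'))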

Change only a fraction of value for an entire column in dataframe based on another dataframe

Assuming I have a dataframe like the following:
df1
   Products  Cost
0      rice    12
1     beans    15
2      eggs    17
3  Tomatoes     5
And I have another dataframe that lists those same headers, each with a number saying how many trailing characters of that column's values should become the letter "a". For example:
df2
     Header  quantity
0  Products         2
1      Cost         1
Which should give me a result like this:
df3
   Products Cost
0      riaa   1a
1     beaaa   1a
2      egaa   1a
3  Tomatoaa  NaN
How should this case be resolved? I do not know whether the replace method works here.
Build a dict mapping each column name to its quantity, then rewrite each column (values with no characters left become NaN, matching the expected df3):

import numpy as np

d = dict(zip(df2.Header, df2.quantity))   # {'Products': 2, 'Cost': 1}
for col, y in d.items():
    s = df1[col].astype(str)
    df1[col] = np.where(s.str.len() > y, s.str[:-y] + 'a' * y, np.nan)

df1
Out[863]:
   Products Cost
0      riaa   1a
1     beaaa   1a
2      egaa   1a
3  Tomatoaa  NaN
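To the asker's point about replace: Series.replace swaps whole values, but Series.str.replace with a regex can do the same trailing-character rewrite; a minimal sketch, without the NaN handling for too-short values:

for col, y in zip(df2['Header'], df2['quantity']):
    # '.{y}$' matches the last y characters of each value
    df1[col] = df1[col].astype(str).str.replace(rf'.{{{y}}}$', 'a' * y, regex=True)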
