I have a dataframe df1 like:
cycleName quarter product qty price sell/buy
0 2020 q3 wood 10 100 sell
1 2020 q3 leather 5 200 buy
2 2020 q3 wood 2 200 buy
3 2020 q4 wood 12 40 sell
4 2020 q4 leather 12 40 sell
5 2021 q1 wood 12 80 sell
6 2021 q2 leather 12 90 sell
And another dataframe df2 as below. It has unique products of df1:
product currentValue
0 wood 20
1 leather 50
I want to create a new column in df2, called income2020, based on calculations on the df1 data. For example, if the product is wood, look at the df1 rows where cycleName is 2020: if sell/buy is sell, add qty * price, otherwise subtract qty * price.
product currentValue income2020
0 wood 20 10 * 100 - 2 * 200 + 12 * 40 (=1080)
1 leather 50 -5 * 200 + 12 * 40 (= -520)
I have a problem statement in Python which I am trying to solve using pandas dataframes, which I am very new to.
I am not able to work out how to create that column in df2 based on these conditions on df1.
You can map sell to 1 and buy to -1 using pd.Series.map, then multiply the columns qty, price and sell/buy together using df.prod. To keep only the 2020 cycleName rows, use df.query, then group by product and take the sum using GroupBy.sum:
df_2020 = df.query('cycleName == 2020').copy()  # or: df[df['cycleName'] == 2020].copy()
df_2020['sell/buy'] = df_2020['sell/buy'].map({'sell': 1, 'buy': -1})
df_2020[['qty', 'price', 'sell/buy']].prod(axis=1).groupby(df_2020['product']).sum()
product
leather -520
wood 1080
dtype: int64
Note:
Use .copy(), otherwise you would get a SettingWithCopyWarning.
To maintain the original product order, use sort=False in df.groupby:
(df_2020[['qty', 'price', 'sell/buy']]
    .prod(axis=1)
    .groupby(df_2020['product'], sort=False)
    .sum()
)
product
wood 1080
leather -520
dtype: int64
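To put this into df2 as the income2020 column the question asks for, one further step could be the following sketch (the income variable name is mine, not from the answer above):

income = (df_2020[['qty', 'price', 'sell/buy']]
          .prod(axis=1)
          .groupby(df_2020['product'], sort=False)
          .sum())
df2['income2020'] = df2['product'].map(income)  # aligns the per-product sums to df2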
I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column:
df1= df.groupby(df['Country'])
df2 = df1.sum()
I want to ask how to calculate the percentage of the sum of the "Sold" column.
You can get the percentage by adding this line:
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can divide the original Sold column by a new column of grouped sums that keeps the same length as the original DataFrame, using transform:
df.assign(
    pct_per=df['Sold'] / df.groupby('Country')['Sold'].transform('sum')
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
Simple Solution
You were almost there.
First you need to group by country.
Then create the new percentage column (by dividing the grouped sales by the sum of all sales):
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2
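For this sample data, df2 comes out as follows (values computed from the numbers above; display formatting may differ):

         Sold  percentage
Country
India    3472   14.705010
Japan    7796   33.018508
Korea    2231    9.448986
USA     10112   42.827496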
I've created a new row for storing the mean values of all columns. Now I'm trying to assign a name to the very first cell of that new row.
I've tried the conventional method of assigning a value by pointing at the cell index. It doesn't return any error, but it doesn't seem to store the value in the cell.
Items Description Duration China Japan Korea
0 GDP 2012-2013 40000 35000 12000
1 GDP 2013-2014 45000 37000 12500
2 NAN NAN 42500 36000 12250
data11.loc[2,'Items Description'] = 'Average GDP'
Instead of returning the dataframe below, the code still gives the previous output.
Items Description Duration China Japan Korea
0 GDP 2012-2013 40000 35000 12000
1 GDP 2013-2014 45000 37000 12500
2 Average GDP NAN 42500 36000 12250
This works fine for me, but here are two alternatives for setting a value by the last row and a column name.
The first is DataFrame.loc, specifying the last index value by indexing:
data11.loc[data11.index[-1], 'Items Description'] = 'Average GDP'
Or DataFrame.iloc with -1 to get the last row and Index.get_loc to get the position of the column Items Description:
data11.iloc[-1, data11.columns.get_loc('Items Description')] = 'Average GDP'
print (data11)
Items Description Duration China Japan Korea
0 GDP 2012-2013 40000 35000 12000
1 GDP 2013-2014 45000 37000 12500
2 Average GDP NAN 42500 36000 12250
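As an aside, the averages row could also be created with its label already in place. A sketch, assuming the column names from the question and starting from the two GDP rows (mean_row is a name I introduce):

mean_row = data11[['China', 'Japan', 'Korea']].mean()
mean_row['Items Description'] = 'Average GDP'
data11.loc[len(data11)] = mean_row  # aligns the Series to columns; Duration stays NaN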
I have to do ETL for each day and then add it to a single dataframe.
E.g., after each day's ETL the outputs are:
df1:
id category quantity date
1 abc 100 01-07-18
2 deg 175 01-07-18
.....
df2:
id category quantity date
1 abc 50 02-07-18
2 deg 300 02-07-18
3 zzz 250 02-07-18
.....
df3:
id category quantity date
1 abc 500 03-07-18
.....
df4:
id category quantity date
5 jjj 200 04-07-18
7 ddd 100 04-07-18
.....
For each day's ETL one dataframe is created (df1, df2, df3, ...), and after each day's ETL that dataframe should be appended to the earlier days' data.
Final output expected:
After day 2 output should be:
finaldf:
id category quantity date
1 abc 100 01-07-18
2 deg 175 01-07-18
1 abc 50 02-07-18
2 deg 300 02-07-18
3 zzz 250 02-07-18
.....
After day 4 output should be:
finaldf:
id category quantity date
1 abc 100 01-07-18
2 deg 175 01-07-18
1 abc 50 02-07-18
2 deg 300 02-07-18
3 zzz 250 02-07-18
1 abc 500 03-07-18
5 jjj 200 04-07-18
7 ddd 100 04-07-18
.....
I have done this in Pandas using the append function, but as the data size is very large I am getting a MemoryError.
Answer for PySpark
Put all the DataFrames into a list, then union them pairwise with reduce:
from functools import reduce

df_list = [df1, df2, df3, df4]
finaldf = reduce(lambda x, y: x.union(y), df_list)
finaldf will contain all the data.
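If you want to stay in pandas instead, a single pd.concat over the whole list is much more memory-friendly than appending day by day, because each append copies all earlier data. A minimal sketch (df1..df4 as in the question):

import pandas as pd

daily_frames = [df1, df2, df3, df4]
finaldf = pd.concat(daily_frames, ignore_index=True)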
I'm using Pandas 0.10.1
Consider this DataFrame:
Date State City SalesToday SalesMTD SalesYTD
20130320 stA ctA 20 400 1000
20130320 stA ctB 30 500 1100
20130320 stB ctC 10 500 900
20130320 stB ctD 40 200 1300
20130320 stC ctF 30 300 800
How can I get subtotals per state?
State City SalesToday SalesMTD SalesYTD
stA ALL 50 900 2100
stA ctA 20 400 1000
stA ctB 30 500 1100
I tried with a pivot table, but I can only get subtotals in columns:
table = pivot_table(df, values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                    rows=['State', 'City'], aggfunc=np.sum, margins=True)
I can achieve this in Excel with a pivot table.
If you don't put both State and City in the rows, you'll get separate margins. Reshape, and you get the table you're after:
In [10]: table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State'], cols=['City'], aggfunc=np.sum, margins=True)
In [11]: table.stack('City')
Out[11]:
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
All All 1900 130 5100
ctA 400 20 1000
ctB 500 30 1100
ctC 500 10 900
ctD 200 40 1300
ctF 300 30 800
I admit this isn't totally obvious.
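For readers on a modern pandas, where the rows/cols keywords were later replaced by index/columns, the same approach would look roughly like the following sketch (current API, not the 0.10.1 one used above):

table = pd.pivot_table(df, values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                       index=['State'], columns=['City'],
                       aggfunc='sum', margins=True)
print(table.stack('City'))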
You can get the summarized values by using groupby() on the State column.
Let's make some sample data first:
import pandas as pd
from io import StringIO
incsv = StringIO("""Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800""")
df = pd.read_csv(incsv, index_col=['Date'], parse_dates=True)
Then apply the groupby function and add a column City:
dfsum = df.groupby('State', as_index=False).sum()
dfsum['City'] = 'All'
print(dfsum)
State SalesToday SalesMTD SalesYTD City
0 stA 50 900 2100 All
1 stB 50 700 2200 All
2 stC 30 300 800 All
We can append the original data to the summed df by using append:
dfsum = dfsum.append(df).set_index(['State', 'City']).sort_index()
print(dfsum)
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
I added the set_index and sort_index to make it look more like your example output; they are not strictly necessary to get the results.
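Note for modern readers: DataFrame.append was removed in pandas 2.0, so on recent versions the combining step becomes:

dfsum = pd.concat([dfsum, df]).set_index(['State', 'City']).sort_index()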
I think this subtotal example code is what you want (similar to Excel's subtotal).
I assume that you want to group by columns A, B, C, D, then count the values of column E.
main_df.groupby(['A', 'B', 'C']).apply(
    lambda sub_df: sub_df.pivot_table(index=['D'], values=['E'],
                                      aggfunc='count', margins=True))
output:
E
A B C D
a a a a 1
b 2
c 2
all 5
b b a a 3
b 2
c 2
all 7
b b b a 3
b 6
c 2
d 3
all 14
How about this one?
table = pd.pivot_table(data, index=['State'], columns=['City'],
                       values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                       aggfunc=np.sum, margins=True)
If you are interested, I have just created a little function to make this easier, since you might want to apply this 'subtotal' function to many tables. It works for tables created via both pivot_table() and groupby(). An example of a table to use it on is provided in this Stack Overflow question: Sub Total in pandas pivot Table.
def get_subtotal(table, sub_total='subtotal', get_total=False, total='TOTAL'):
    """
    Parameters
    ----------
    table : DataFrame
        Table with a two-level MultiIndex resulting from pd.pivot_table() or
        df.groupby().
    sub_total : str, optional
        Name given to the subtotal rows. The default is 'subtotal'.
    get_total : bool, optional
        Whether to also add the final total (useful when the table came from
        groupby(), which has no margins). The default is False.
    total : str, optional
        Name given to the total row. The default is 'TOTAL'.

    Returns
    -------
    A table with the total and subtotals added.
    """
    index_name1 = table.index.names[0]
    index_name2 = table.index.names[1]
    # move the first index level into the columns
    pvt = table.unstack(0)
    # ignore a pre-existing 'All' margin column when summing
    mask = pvt.columns.get_level_values(index_name1) != 'All'
    pvt.loc[sub_total] = pvt.loc[:, mask].sum()
    # restore the original two-level row index
    pvt = pvt.stack().swaplevel(0, 1).sort_index()
    pvt = pvt[pvt.columns[1:].tolist() + pvt.columns[:1].tolist()]
    if get_total:
        mask = pvt.index.get_level_values(index_name2) != sub_total
        pvt.loc[(total, ''), :] = pvt.loc[mask].sum()
    return pvt
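A usage sketch on the State/City data from the question (the column selection and the 'ALL' label are my assumptions, not from the original answer):

table = df.groupby(['State', 'City'])[['SalesToday', 'SalesMTD', 'SalesYTD']].sum()
result = get_subtotal(table, sub_total='ALL', get_total=True)
print(result)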
table = pd.pivot_table(df, index=['A'], values=['B', 'C'], columns=['D', 'E'],
                       fill_value='0', aggfunc=np.sum,  # or 'count', etc.
                       margins=True, margins_name='Total')
print(table)