My dataset has price and sales information for different years. The problem is that each year is a separate column header, for both price and sales. For example, the CSV looks like this:
Items  Price in 2018  Price in 2019  Price in 2020  Sales in 2018  Sales in 2019  Sales in 2020
A      100            120            135            5000           6000           6500
B      110            130            150            2000           4000           4500
C      150            110            175            1000           3000           3000
I want to reshape it to something like this:
Items  Year  Price  Sales
A      2018  100    5000
A      2019  120    6000
A      2020  135    6500
B      2018  110    2000
B      2019  130    4000
B      2020  150    4500
C      2018  150    1000
C      2019  110    3000
C      2020  175    3000
I used the melt function from pandas like this:
df.melt(id_vars=['Items'], var_name="Year", value_name="Price")
But I'm struggling to get separate columns for Price and Sales, as this puts the Price and Sales values in one column. Thanks.
Let us try pandas.wide_to_long:
pd.wide_to_long(df, i='Items', j='year',
                stubnames=['Price', 'Sales'],
                suffix=r'\d+', sep=' in ').sort_index()
Price Sales
Items year
A 2018 100 5000
2019 120 6000
2020 135 6500
B 2018 110 2000
2019 130 4000
2020 150 4500
C 2018 150 1000
2019 110 3000
2020 175 3000
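If you need Items and year back as regular columns, as in the desired output, a reset_index on the result should get you there. A minimal follow-up sketch (the rename to Year is optional):
out = pd.wide_to_long(df, i='Items', j='year',
                      stubnames=['Price', 'Sales'],
                      suffix=r'\d+', sep=' in ').reset_index()
out = out.rename(columns={'year': 'Year'})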
I have this dataframe (sorry, not sure how to format it nicely here):
SRC SRCDate Ticker Coupon Vintage Bal ($bn) WAC WAM WALA LNSZ ... FICO Refi% Month_Assessed CPR Month_key SRC_year SRC_month Year Month Interest_Rate
JPM 02/05/2021 FNCI 1.5 2020 28.7 2.25 175 4 293 / 286 ... 777 91 Apr 7.536801 M+2 2021 2 2021 2 2.24
JPM 03/05/2021 FNCI 1.5 2020 28.7 2.25 175 4 293 / 286 ... 777 91 Apr 5.131145 M+1 2021 3 2021 3 2.39
JPM 04/07/2021 FNCI 1.5 2020 28 2.25 173 6 292 / 281 ... 777 91 Apr 7.233214 M 2021 4 2021 4 2.36
JPM 05/07/2021 FNCI 1.5 2020 27.6 2.25 171 7 292 / 279 ... 777 91 Apr 8.900000 M-1 2021 5 2021 5 2.28
And use this code:
cols = ['SRC_year','Bal ($bn)', 'WAC', 'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate']
jpm_2021[cols] = jpm_2021[cols].apply(pd.to_numeric, downcast='float', errors='coerce')
for col in cols:
    jpm_2021[col] = jpm_2021.groupby(['SRC_year', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed'])[col].transform('mean')
This normalizes the values of all the cols to their respective group means under the groupby. The reason for this is to be able to create a pivoted table with this code:
jpm_final = jpm_2021.pivot_table(index=['SRC', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed', 'Bal ($bn)', 'WAC', 'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate'],
columns="Month_key", values="CPR").rename_axis(columns=None).reset_index()
The problem is, taking the mean of all of those columns (especially Interest_Rate) renders the resulting table less than insightful. Instead, what I'd like to do is broadcast the values from the rows where Month_key is M to all the other rows within the same grouping defined in the groupby above. Any tips on how to do that?
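One possible approach (a sketch, untested against the real data: take each group's Month_key == 'M' row as the reference and merge its values onto every other row in the group):
group_cols = ['SRC_year', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed']
val_cols = ['Bal ($bn)', 'WAC', 'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate']
# Keep only the Month_key == 'M' rows, one per group, as the reference values.
m_rows = (jpm_2021.loc[jpm_2021['Month_key'] == 'M', group_cols + val_cols]
                  .drop_duplicates(subset=group_cols))
# Drop the old columns, then attach the M-row values to every row in each group.
jpm_2021 = jpm_2021.drop(columns=val_cols).merge(m_rows, on=group_cols, how='left')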
I'm trying to do a groupby and calculate the percentage change of revenue. Here is my DataFrame:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['1177 AVENUE OF THE AMERICAS',2020,10000], ['1177 AVENUE OF THE AMERICAS',2019,25000], ['1177 AVENUE OF THE AMERICAS',2018,5000], ['500 5th AVENUE',2020,30000], ['500 5th AVENUE',2019,5000],['500 5th AVENUE',2018,4000],['5 45th ST',2018,9000]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['site_name', 'year', 'revenue'])
df.sort_values(['site_name','year'], inplace = True, ascending=[False, False])
# print dataframe.
df
I tried this:
df['Percent_Change'] = df.revenue.pct_change()
df
It gives me this:
site_name year revenue Percent_Change
3 500 5th AVENUE 2020 30000 NaN
4 500 5th AVENUE 2019 5000 -0.833333
5 500 5th AVENUE 2018 4000 -0.200000
6 5 45th ST 2018 9000 1.250000
0 1177 AVENUE OF THE AMERICAS 2020 10000 0.111111
1 1177 AVENUE OF THE AMERICAS 2019 25000 1.500000
2 1177 AVENUE OF THE AMERICAS 2018 5000 -0.800000
I also tried this:
df['Percent_Change'] = df.groupby(['site_name','year'])['revenue'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
df
It gives me this:
site_name year revenue Percent_Change
3 500 5th AVENUE 2020 30000 0.0
4 500 5th AVENUE 2019 5000 0.0
5 500 5th AVENUE 2018 4000 0.0
6 5 45th ST 2018 9000 0.0
0 1177 AVENUE OF THE AMERICAS 2020 10000 0.0
1 1177 AVENUE OF THE AMERICAS 2019 25000 0.0
2 1177 AVENUE OF THE AMERICAS 2018 5000 0.0
The tricky part is getting Percent_Change to reset whenever site_name changes, so the calculation starts fresh for each site.
Remove year from the groupby:
df.groupby(['site_name'])['revenue'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
Also, we usually do this with transform:
s = df.groupby(['site_name'])['revenue'].transform('first')
df['Percent_Change'] = (df['revenue']/s-1)*100
You are grouping by both 'site_name' and 'year', hence the problem. I tried the code after removing 'year' and it gave the desired result.
df['Percent_Change'] = df.groupby(['site_name'])['revenue'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
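If you instead want the row-over-row change within each site, rather than the change relative to each site's first row, a grouped pct_change is another option; a small sketch, assuming the same sorted frame:
df['Percent_Change'] = df.groupby('site_name')['revenue'].pct_change() * 100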
I have a table of sales from different companies:
company_name sales year
A 200 2019
A 100 2018
A 30 2017
B 15 2019
B 30 2018
B 45 2017
Now I want to add the previous year's sales in the same row, like this:
company_name sales year previous_sales
A 200 2019 100
A 100 2018 30
A 30 2017 NaN
B 15 2019 30
B 30 2018 45
B 45 2017 NaN
I tried code like this, but I failed to get the right result:
df["previous_sales"] = df.groupby(['company_name', 'year'])['sales'].shift()
I have a df which looks like this:
Date Value
2020 0
2020 100
2020 200
2020 300
2021 100
2021 150
2021 0
I want the average of Value grouped by Date, using only rows where Value > 0. When I tried:
df['Yearly AVG'] = df[df['Value']>0].groupby('Date')['Value'].mean()
I get NaN values. When I print the right-hand side on its own, I get what I need, but indexed by the Date column:
Date
2020 200
2021 125
How can I get the following?
Date Value Yearly AVG
2020 0 200
2020 100 200
2020 200 200
2020 300 200
2021 100 125
2021 150 125
2021 0 125
Here is a trick: replace the non-matching values with missing values, then use GroupBy.transform to fill a new column with the aggregated values:
df['Yearly AVG'] = df['Value'].where(df['Value']>0).groupby(df['Date']).transform('mean')
print (df)
Date Value Yearly AVG
0 2020 0 200.0
1 2020 100 200.0
2 2020 200 200.0
3 2020 300 200.0
4 2021 100 125.0
5 2021 150 125.0
6 2021 0 125.0
Detail:
print (df['Value'].where(df['Value']>0))
0 NaN
1 100.0
2 200.0
3 300.0
4 100.0
5 150.0
6 NaN
Name: Value, dtype: float64
Your solution would need a change: map the per-Date means back onto the original rows:
df['Yearly AVG'] = df['Date'].map(df[df['Value']>0].groupby('Date')['Value'].mean())
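Here groupby('Date')['Value'].mean() returns a Series indexed by Date, and Series.map looks up each row's Date in it, broadcasting the per-year mean back onto a full-length column.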
I have created the following dataframe by parsing multiple CSVs in Spark. I need the average sales of each month, per city, per SKU, per year, with one column per month.
city  sku_id  year  month  avg_sales
A     SKU1    2017  Jan    100
A     SKU1    2017  Feb    120
..    ..      ..    ..     ..
Z     SKU100  2019  Dec    99
Desired output:
city  sku_id  year  Jan_avg_sales  Feb_avg_sales  ..  Dec_avg_sales
A     SKU1    2017  100            120            ..  320
A     SKU1    2017  98             118            ..  318
..    ..      ..    ..             ..             ..  ..
Z     SKU100  2019  99             114            ..  314
I have implemented the summary-table creation using a Python dictionary, but I'm not convinced by that solution. Here is the code snippet I have tried so far:
path = "s3a://bucket/city1*"
cleaned_df = spark.read.format('csv').options(header='true', inferSchema='true').load(path)
cleaned_df = cleaned_df.groupby(['Year','city','sku_id']).mean()
cleaned_df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("mydata4csv")
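The groupby().mean() above averages per group but does not turn the months into columns. A sketch of the month pivot in PySpark (assuming the column names shown in the table; listing the pivot values explicitly spares Spark an extra pass to discover them):
from pyspark.sql import functions as F

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
pivoted = (cleaned_df
           .groupBy('city', 'sku_id', 'year')
           .pivot('month', months)
           .agg(F.avg('avg_sales')))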
Did you try grouping them by the three attributes (city, sku_id, year)?
If you have a dataframe that looks like:
avg_sales city sku_id year
0 300 A sku1 2017
1 210 A sku1 2018
2 200 A sku2 2017
3 10 A sku2 2017
4 10 B sku1 2017
5 190 B sku1 2017
6 130 B sku2 2017
7 130 B sku2 2017
8 50 C sku2 2017
Then you can do:
dataframe.groupby(['city', 'sku_id', 'year']).mean()
And get:
avg_sales
city sku_id year
A sku1 2017 300
2018 210
sku2 2017 105
B sku1 2017 100
sku2 2017 130
C sku2 2017 50
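And if the frame still has the month column from the question's table, a pandas pivot_table can then produce the months-as-columns layout of the desired output; a sketch:
monthly = (dataframe.pivot_table(index=['city', 'sku_id', 'year'],
                                 columns='month', values='avg_sales',
                                 aggfunc='mean')
                    .add_suffix('_avg_sales')
                    .reset_index())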
If you share your python code I can touch up the answer to fit your case.