I have a table of sales from different companies.
company_name sales year
A 200 2019
A 100 2018
A 30 2017
B 15 2019
B 30 2018
B 45 2017
Now, I want to add the previous year's sales in the same row, like this:
company_name sales year previous_sales
A 200 2019 100
A 100 2018 30
A 30 2017 NaN
B 15 2019 30
B 30 2018 45
B 45 2017 NaN
I tried the following code, but it doesn't produce the right result:
df["previous_sales"] = df.groupby(['company_name', 'year'])['sales'].shift()
I'm trying to add a new line, with some values from an array, after the Gross Profit line in an income statement.
I tried just appending it at that location, but nothing changed:
income_statement.loc[["Gross Profit"]].append(gross)
The only way I succeeded in doing something similar was by making it another DataFrame and concatenating it to the end of the income_statement.
I'm trying to make it look like the attached picture (the 'gross' line in yellow).
How can I do it?
I created a sample df that I tried to make look similar to yours (see below).
df
Unnamed: 0 2010 2011 2012 2013 ... 2016 2017 2018 2019 TTM
0 gross profit 10 11 12 13 ... 16 17 18 19 300
1 total revenue 1 2 3 4 ... 7 8 9 10 400
The aim now would be to add a row between them ('gross'), with the values you have listed in the picture.
One way to add the row is with numpy.insert, which returns an array, so you have to convert back to a pd.DataFrame:
# Store the columns of your df
cols = df.columns

# Insert the row; the second argument is the index position for the new row
# (1 is the 2nd row, as Python indexes start from 0)
new = pd.DataFrame(np.insert(df.values, 1,
                             values=['gross', 22, 45, 65, 87, 108, 130, 151, 152, 156, 135, 133],
                             axis=0),
                   columns=cols)
Which gets back:
new
Unnamed: 0 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 TTM
0 gross profit 10 11 12 13 14 15 16 17 18 19 300
1 gross 22 45 65 87 108 130 151 152 156 135 133
2 total revenue 1 2 3 4 5 6 7 8 9 10 400
Hopefully this will work for you. Let me know if you run into issues.
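Since you mentioned concat, a hedged alternative along those lines (assuming the same sample df) is to slice around the insertion point; unlike np.insert on df.values, this keeps the numeric columns from being coerced to object dtype:

# Build the new row as a one-row DataFrame with the same columns
gross = pd.DataFrame([['gross', 22, 45, 65, 87, 108, 130, 151, 152, 156, 135, 133]],
                     columns=df.columns)

# Stitch together the rows above the insertion point, the new row,
# and the rows below it
new = pd.concat([df.iloc[:1], gross, df.iloc[1:]]).reset_index(drop=True)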
I have created the following dataframe by parsing multiple CSVs in Spark. I need the average sales of each month, grouped per city, per SKU, per year.
city  sku_id  year  month  avg_sales
A     SKU1    2017  Jan    100
A     SKU1    2017  Feb    120
..    ..      ..    ..     ..
Z     SKU100  2019  Dec    99
Desired output:
city  sku_id  year  Jan_avg_sales  Feb_avg_sales  ..  Dec_avg_sales
A     SKU1    2017  100            120            ..  320
A     SKU1    2017  98             118            ..  318
..    ..      ..    ..             ..             ..  ..
Z     SKU100  2019  99             114            ..  314
I have implemented the summary-table creation using a Python dictionary, but I'm not convinced by that solution.
Here is the code snippet I have tried so far:
path = "s3a://bucket/city1*"
cleaned_df = spark.read.format('csv').options(header='true', inferSchema='true').load(path)
cleaned_df = cleaned_df.groupby(['year', 'city', 'sku_id']).mean()
cleaned_df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("mydata4csv")
Did you try to group them based on the three attributes (city, SKU, year)?
If you have a dataframe that looks like:
avg_sales city sku_id year
0 300 A sku1 2017
1 210 A sku1 2018
2 200 A sku2 2017
3 10 A sku2 2017
4 10 B sku1 2017
5 190 B sku1 2017
6 130 B sku2 2017
7 130 B sku2 2017
8 50 C sku2 2017
Then you can do:
dataframe.groupby(['city', 'sku_id', 'year']).mean()
And get:
avg_sales
city sku_id year
A sku1 2017 300
2018 210
sku2 2017 105
B sku1 2017 100
sku2 2017 130
C sku2 2017 50
If you share your python code I can touch up the answer to fit your case.
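In the meantime, to get the month columns from your desired output, a pivot can follow the aggregation. A minimal pandas sketch, assuming a month column like the one in your original table:

# Average sales per city/SKU/year, spread across one column per month
wide = df.pivot_table(index=['city', 'sku_id', 'year'],
                      columns='month',
                      values='avg_sales',
                      aggfunc='mean').reset_index()

# Rename the month columns to the Jan_avg_sales style of the desired output
wide.columns = [c if c in ('city', 'sku_id', 'year') else f'{c}_avg_sales'
                for c in wide.columns]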
I have a pandas dataframe like the following:
Customer Id year
0 1510220024 2017
1 1510270013 2017
2 1511160047 2017
3 1512100014 2017
4 1603180006 2017
5 1605030030 2017
6 1605160013 2017
7 1606060008 2017
8 1510220024 2018
9 1606270014 2017
10 1608080011 2017
11 1608090002 2017
12 1511160047 2018
13 1606270014 2018
And I want to build the following matrix from the above dataframe:
2017 2018
2017 11 3
2018 3 3
This matrix says that there were 11 customers in total in 2017, three of whom also appeared in 2018, and so on. In actuality I have 7 years of data, so it would be a 7x7 matrix. I have been struggling with this for a while now but can't get it right.
merge + crosstab: a self-join on Customer Id pairs each year a customer appears in with every other year they appear in, and crosstab counts the pairs:
m = df.merge(df, on='Customer Id')
pd.crosstab(m.year_x, m.year_y)
year_y 2017 2018
year_x
2017 11 3
2018 3 3
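One hedged caveat: the self-join pairs every occurrence of a Customer Id with every other, so if a customer can appear more than once in the same year the counts inflate. Deduplicating first keeps each (customer, year) pair counted once:

# Drop duplicate (customer, year) rows before the self-join
m = df.drop_duplicates().merge(df.drop_duplicates(), on='Customer Id')
pd.crosstab(m.year_x, m.year_y)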
I have a DataFrame that looks like:
f_period f_year f_month subject month year value
20140102 2014 1 a 1 2018 10
20140109 2014 1 a 1 2018 12
20140116 2014 1 a 1 2018 8
20140202 2014 2 a 1 2018 20
20140209 2014 2 a 1 2018 15
20140102 2014 1 b 1 2018 10
20140109 2014 1 b 1 2018 12
20140116 2014 1 b 1 2018 8
20140202 2014 2 b 1 2018 20
20140209 2014 2 b 1 2018 15
The f_period is the date when a forecast for a SKU (column subject) was made. The month and year columns give the period for which the forecast was made. For example, the first row says that on 2014-01-02 the model forecast selling 10 units of product a in month 1 of year 2018.
I am trying to create a rolling average prediction by subject, by month for 2 f_months. The DataFrame should look like:
f_period f_year f_month subject month year value mnthly_avg rolling_2_avg
20140102 2014 1 a 1 2018 10 10 13
20140109 2014 1 a 1 2018 12 10 13
20140116 2014 1 a 1 2018 8 10 13
20140202 2014 2 a 1 2018 20 17.5 null
20140209 2014 2 a 1 2018 15 17.5 null
20140102 2014 1 b 1 2018 10 10 13
20140109 2014 1 b 1 2018 12 10 13
20140116 2014 1 b 1 2018 8 10 13
20140202 2014 2 b 1 2018 20 17.5 null
20140209 2014 2 b 1 2018 15 17.5 null
Things I tried:
I was able to get mnthly_avg with:
data_df['mnthly_avg'] = data_df.groupby(
    ['f_month', 'f_year', 'year', 'month', 'subject']).value.transform('mean')
I tried getting the rolling_2_avg:
rolling_monthly_df = data_df[['f_year', 'f_month', 'subject', 'month', 'year', 'value', 'f_period']] \
    .groupby(['f_year', 'f_month', 'subject', 'month', 'year']).value.mean().reset_index()
rolling_monthly_df['rolling_2_avg'] = rolling_monthly_df.groupby(['subject', 'month']) \
    .value.rolling(2).mean().reset_index(drop=True)
This gave me an unexpected output, and I don't understand how it calculated the values for rolling_2_avg.
How do I group by subject and month, sort by f_month, and then take the rolling average over two f_months?
Unless I'm misunderstanding, this seems simpler than what you've done. What about this?
grp = pd.DataFrame(df.groupby(['subject', 'month', 'f_month'])['value'].sum())
grp['rolling'] = grp.rolling(window=2).mean()
grp
Output:
value rolling
subject month f_month
a 1 1 30 NaN
2 35 32.5
b 1 1 30 32.5
2 35 32.5
I would be a bit careful with Josh's solution. If you want to group by subject, you can't use the rolling function like that, as it will roll across subjects (i.e., it will eventually take the mean of a month from subject A and a month from subject B, rather than giving the null you might prefer).
An alternative is to split the dataframe and run the rolling window on each piece individually (I noticed that you want the nulls at the end of each group, whereas rolling puts them at the start, so you may want to sort the dataframe before and after):
for unique_subject in df['subject'].unique():
    # copy() avoids SettingWithCopyWarning when assigning the new column
    df_subject = df[df['subject'] == unique_subject].copy()
    df_subject['rolling'] = df_subject['value'].rolling(window=2).mean()
    print(df_subject)  # just to print; you may want to concatenate these instead
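A split-free sketch of the same idea, assuming you first aggregate to one row per subject/month/f_month as in Josh's answer: applying rolling to a groupby keeps the window inside each group, so it never crosses subjects:

# Aggregate to one row per subject/month/f_month, then roll within groups only
grp = df.groupby(['subject', 'month', 'f_month'])['value'].sum().reset_index()
grp['rolling'] = (grp.groupby(['subject', 'month'])['value']
                     .rolling(window=2).mean()
                     .reset_index(drop=True))

The positional reset_index(drop=True) lines up here because groupby sorts by the same keys that grp is already sorted on.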
I have the following DataFrame df, and I want to calculate the average number of entries per hour over the year, grouped by runway:
year month day hour runway
2017 12 30 10 32L
2017 12 30 11 32L
2017 12 30 11 32L
2017 12 30 11 32L
2017 12 30 11 30R
2018 12 30 10 32L
2018 12 30 10 32L
2018 12 30 11 32L
2018 12 30 11 32L
The expected result is this one:
year runway avg. count per hour
2017 32L 2
2017 30R 0.5
2018 32L 2
2018 30R 0
I tried this, but it does not calculate the average count per hour:
result = df.groupby(['year','runway']).count()
Here's one way of achieving it, i.e.:
# Take the count of unique hours per year
s = df.groupby(['year'])['hour'].nunique()

# Take the count of entries per runway per year
n = df.groupby(['year', 'runway']).size().reset_index()

# Divide them
n['avg'] = n[0] / n['year'].map(s)
year runway 0 avg
0 2017 30R 1 0.5
1 2017 32L 4 2.0
2 2018 32L 4 2.0
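A hedged sketch of the same computation as a single chain (assuming the column names above): Series.div with level='year' broadcasts the per-year unique-hour counts across the runway level:

avg = (df.groupby(['year', 'runway']).size()
         .div(df.groupby('year')['hour'].nunique(), level='year')
         .rename('avg_count_per_hour')
         .reset_index())

Note that, like the answer above, this only produces rows for year/runway combinations that actually occur in the data.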