I have a data set where I'm trying to compute an average over rows whose days remaining are equal.
Example:
ship_date   order_date  cumulative_ordered  days_remaining
2018-07-01  2018-05-06  7                   56 days
2018-07-01  2018-05-07  10                  55 days
2018-07-01  2018-05-08  15                  54 days
The order_date counts down until it reaches the ship_date; by that time cumulative_ordered equals the total orders up until the shipping date. Then a new ship_date starts and the process repeats. I want to see the average percentage ordered on each day leading up to the ship date. For instance, if ship_date 2018-07-01 has a total of 100 orders and ship_date 2018-08-01 has a total of 200, then I want to see, on average, what percentage was ordered 54 days prior to the ship_date.
Thanks.
You can obtain the average of cumulative_ordered per days_remaining using groupby:
df.groupby("days_remaining")['cumulative_ordered'].mean()
This returns a Series with the average cumulative_ordered for each group of rows sharing a specific days_remaining, for example:
days_remaining
2 days 10.5
56 days 50.22
...
Name: cumulative_ordered, dtype: float64
To extract one of the mean values from that Series, assign it to a variable and use the index. Say you want the average cumulative_ordered for rows with days_remaining equal to 56 days; you would do:
g = df.groupby("days_remaining")['cumulative_ordered'].mean()
# value is the average cumulative_ordered for rows with 56 days remaining
value = g[g.index.days == 56].iloc[0]
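Note that the question actually asks for a percentage, not the raw mean. A possible sketch, assuming each ship_date's final cumulative_ordered equals that ship date's total orders (as the question states): first normalize within each ship_date, then average per days_remaining:
# Share of each ship_date's final total that had been ordered by each row's date
totals = df.groupby("ship_date")["cumulative_ordered"].transform("max")
df["pct"] = df["cumulative_ordered"] / totals * 100
# Average percentage ordered at each days_remaining, across ship dates
avg_pct = df.groupby("days_remaining")["pct"].mean()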
I'm new to Python and hope you can help me with the following:
I have a data frame that contains the daily demand of a certain product. However, the demand is shown cumulatively over time. I want to create a column that shows the actual daily demand (see table below).
Current data frame:
Day#  Cumulative Demand
1     10
2     15
3     38
4     44
5     53
What I want to achieve:
Day#  Cumulative Demand  Daily Demand
1     10                 10
2     15                 5
3     38                 23
4     44                 6
5     53                 9
Thank you!
First, pull the data out of the existing column:
# My DataFrame is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data:
daily_demand = [demand[0]]  # day 1's demand is the first cumulative value
for i, d in enumerate(demand[1:]):
    # d is demand[i + 1], so this appends the day-over-day difference
    daily_demand.append(d - demand[i])
Lastly, assign the data to a new column:
df["Daily Demand"] = daily_demand
Assuming what you shared above is representative of your actual data, meaning you have one row per day and the Day column is sorted in ascending order:
You can use shift() (please read what it does) and subtract the shifted version of the cumulative demand from the cumulative demand itself. This gives you back the actual daily demand.
To make sure it works, check that the cumulative sum of the new daily demand column, via cumsum(), reproduces the Cumulative Demand column.
import pandas as pd

# Calculate the Daily Demand column; the first row has no previous row,
# so fill its NaN with the first cumulative value
df['Daily Demand'] = (df['Cumulative Demand']
                      - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'].iloc[0])

# Check that the cumulative sum of daily demands reproduces Cumulative Demand
>>> all(df['Daily Demand'].cumsum() == df['Cumulative Demand'])
True
The data frame then looks like:
Day Cumulative Demand Daily Demand
0 1 10 10.0
1 2 15 5.0
2 3 38 23.0
3 4 44 6.0
4 5 53 9.0
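As a side note, the same subtraction can be written with pandas' built-in diff(), which computes exactly this row-over-row difference:
# One-liner equivalent: diff() subtracts the previous row; the first row
# becomes NaN, so fill it with the first cumulative value
df['Daily Demand'] = df['Cumulative Demand'].diff().fillna(df['Cumulative Demand'].iloc[0])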
I have a pandas dataframe named 'stock_data' with a MultiIndex of ('Date', 'StockID') and a column 'Price'. The rows are ordered by date, so for the same stock a later date has a higher row index. I want to add a new column that, for each stock (i.e. grouping by stock), contains the maximum positive difference between the prices of the stock through time, as in max_price - min_price with the max occurring after the min.
To explain this further, one could calculate this in O(stocks * rows^2) with:
for each stock:
    max_diff = 0.0
    for i in range(len(rows) - 1):
        for j in range(i + 1, len(rows)):
            if price[j] - price[i] > max_diff:
                max_diff = price[j] - price[i]
How do I do this in pandas without actually calculating every value and assigning it to the right spot of a new column one at a time, like the algorithm above (which could probably be improved by sorting, but that is beside the point)?
So far, I have only figured out that I can group by stock with stock_data.groupby(level='StockID') and pick the column with stock_data.groupby(level='StockID')['Price']. But something like:
stock_data.groupby(level='StockID')['Price'].max() - stock_data.groupby(level='StockID')['Price'].min()
is not what I described above, because there is no restriction that the max() must come after the min().
Edit: The accepted solution works. Now I am also wondering if there is a way to penalize that difference by how far the max is from the min, so that shorter-term gains rank higher (and are therefore preferred) over long-term ones with a somewhat bigger difference.
For example, maybe we could do cumsum() up to a certain length after the min and not till the end? Somehow?
Let's use [::-1] to reverse the row order so that cummax() yields the maximum "in the future", then take cummin() and cummax() after the groupby.
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
stock_data = pd.DataFrame({'Price': np.random.randint(0, 100, size=14)},
                          index=pd.MultiIndex.from_product(
                              [pd.date_range('2020-12-01', '2020-12-07', freq='D'),
                               list('ab')],
                              names=['date', 'stock']))
and assuming the dates are ordered in time, you can do:
stock_data['diff'] = (stock_data.loc[::-1, 'Price'].groupby(level='stock').cummax()
                      - stock_data.groupby(level='stock')['Price'].cummin())
print(stock_data)
Price diff
date stock
2020-12-01 a 37 42
b 12 59
2020-12-02 a 72 42
b 9 62
2020-12-03 a 75 42
b 5 66
2020-12-04 a 79 42
b 64 66
2020-12-05 a 16 60
b 1 70
2020-12-06 a 76 60
b 71 70
2020-12-07 a 6 0
b 25 24
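Regarding the edit about penalizing long-range differences: one possible sketch, under the assumption that "a certain length" means a fixed look-ahead window of N rows per stock (N is a made-up knob here), replaces the unbounded future cummax with a rolling max over only the next N rows:
N = 3  # assumed window length, tune to taste
# max over the next N prices (including the current one), per stock
future_max = stock_data.groupby(level='stock')['Price'].transform(
    lambda s: s[::-1].rolling(N, min_periods=1).max()[::-1])
stock_data['diff_windowed'] = (future_max
                               - stock_data.groupby(level='stock')['Price'].cummin())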
I have a dataframe, df, where I would like to take an average or mean of a given group per month.
group startdate enddate diff percent
A 04/01/2019 05/01/2019 160 11
A 05/01/2019 06/01/2019 136 8
B 06/01/2020 07/01/2020 202 5
B 07/01/2020 08/01/2020 283 7
For example:
I want to take the mean per group per month. For group 'A', the diff for the month of April is 160 and the diff for the month of May is 136, so:
The monthly diff mean for 'A' is 148.
The monthly percent mean for 'A' is 9.5.
Desired output
group diff_mean percent_Mean
A 148 9.5
B 242.5 6
This is what I am doing:
df.groupby['group'].mean()
I am not getting the correct output. I will keep researching. Any assistance is appreciated.
Most likely you just need parentheses on the groupby call (groupby is a method, so df.groupby('group'), not df.groupby['group']); also, you can't average the start/end dates, so leave those out of the selection:
df[['diff', 'percent', 'group']].groupby(['group']).mean()
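Given the sample data above, this returns the desired means:
        diff  percent
group
A      148.0      9.5
B      242.5      6.0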
I would like to create a column that contains the sum of all amounts that occurred in a given hour. For example, if the row I am looking at has 0 in the Hours column, I would like the total column for that row to hold the total volume of all amounts that occurred within hour 0.
So:
dat.groupby('Hours')['Amount'].sum()
By performing a groupby on hours and summing the amount, I get the total amount of transactions that were made in each hour.
Hours
0 257101.87
1 146105.69
2 108819.17
....
45 532181.83
46 448887.69
47 336343.60
Name: Amount, dtype: float64
The problem is that my database contains thousands of rows, so I can't simply create a new column from the groupby values by hand; I would need a condition stipulating that if the value in the Hours column is 0, return the sum of all amounts where the hour is 0.
So the desired result would be something like this
Hours Amount Total
0 20 100
0 20 100
0 60 100
1 10 20
1 10 20
2 50 50
In this scenario I would want to create the total column and return the sum of all amounts that occurred in a given hour
Groupby + transform should do it:
df["Total"] = df.groupby("Hours")["Amount"].transform("sum")
Why this works: a transform in pandas is like split-apply-combine and merge in one go; the groupby reduction is broadcast back, so you keep the same axis length as the original frame.
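A quick check on a toy frame mirroring the desired result above (the data here is made up to match the question's example):
import pandas as pd

dat = pd.DataFrame({"Hours": [0, 0, 0, 1, 1, 2],
                    "Amount": [20, 20, 60, 10, 10, 50]})
dat["Total"] = dat.groupby("Hours")["Amount"].transform("sum")
print(dat)
#    Hours  Amount  Total
# 0      0      20    100
# 1      0      20    100
# 2      0      60    100
# 3      1      10     20
# 4      1      10     20
# 5      2      50     50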
Alternatively, I would use the output of dat.groupby('Hours')['Amount'].sum() and merge it back onto the original frame. Since the sum keeps Hours in its index and its name would clash with the existing Amount column, rename it first and merge on the index:
totals = dat.groupby('Hours')['Amount'].sum().rename('Total')
dat_with_totals = dat.merge(totals, left_on='Hours', right_index=True)
I have a data frame that looks like this:
org date value
0 00C 2013-04-01 0.092535
1 00D 2013-04-01 0.114941
2 00F 2013-04-01 0.102794
3 00G 2013-04-01 0.099421
4 00H 2013-04-01 0.114983
Now I want to figure out:
1. the median value for each organisation in each month of the year
2. X for each organisation, where X is the percentage difference between the lowest monthly median value and the highest monthly median value.
What's the best way to approach this in Pandas?
I am trying to generate the medians by month as follows, but it's failing:
df['date'] = pd.to_datetime(df['date'])
ave = df.groupby(['row_id', 'date.month']).median()
This fails with KeyError: 'date.month'.
UPDATE: Thanks to @EdChum I'm now doing:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
which works great and gives me:
99P 1 0.106975
2 0.091344
3 0.098958
4 0.092400
5 0.087996
6 0.081632
7 0.083592
8 0.075258
9 0.080393
10 0.089634
11 0.085679
12 0.108039
99Q 1 0.110889
2 0.094837
3 0.100658
4 0.091641
5 0.088971
6 0.083329
7 0.086465
8 0.078368
9 0.082947
10 0.090943
11 0.086343
12 0.109408
Now I guess, for each item in the index, I need to find the min and max calculated values, then the difference between them. What is the best way to do that?
Your first attempt fails because groupby takes a list of the column names or the columns themselves; you passed a list of names, and no column called 'date.month' exists, hence the KeyError. Use the dt.month accessor instead:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
After that you can calculate the min and max for a specific index level, so:
((ave.groupby(level=0).max() - ave.groupby(level=0).min()) / ave.groupby(level=0).max()) * 100
should give you what you want.
This computes the difference between the max and min value for each organisation, divides by the max at that level, and multiplies by 100 to turn it into a percentage.
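Putting it all together, a minimal end-to-end sketch (assuming row_id is the organisation column, shown as org in the sample, and that df matches the question's frame):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
ave = df.groupby([df['row_id'], df['date'].dt.month])['value'].median()
mx = ave.groupby(level=0).max()
mn = ave.groupby(level=0).min()
pct_range = (mx - mn) / mx * 100  # percentage spread per organisation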