Creating a new column based on entries from another column in Python

I'm new to Python and hope you can help me with the following:
I have a data frame that contains the daily demand for a certain product. However, the demand is shown cumulatively over time. I want to create a column that shows the actual daily demand (see the tables below).
Current data frame:
Day#  Cumulative Demand
1     10
2     15
3     38
4     44
5     53
What I want to achieve:
Day#  Cumulative Demand  Daily Demand
1     10                 10
2     15                 5
3     38                 23
4     44                 6
5     53                 9
Thank you!

First, get the data from the existing column:
# My DataFrame is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data:
daily_demand = [demand[0]]
# enumerate starts at 0 while the slice starts at index 1, so demand[i] is the previous day's value
for i, d in enumerate(demand[1:]):
    daily_demand.append(d - demand[i])
Lastly, assign the data to a new column:
df["Daily Demand"] = daily_demand

This assumes that what you shared above is representative of your actual data, meaning you have one row per day and the Day column is sorted in ascending order.
You can use shift() (please read what it does) and subtract the shifted version of the cumulative demand from the cumulative demand itself. This gives you back the actual daily demand.
To make sure that it works, check whether the cumulative sum of the daily demand (the new column), computed with cumsum(), reproduces the original cumulative column.
import pandas as pd
# Calculate the Daily Demand column; the first row has no previous value, so fill it with the first cumulative value
df['Daily Demand'] = (df['Cumulative Demand'] - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'].iloc[0])
# Check that the cumulative sum of the daily demands reproduces the Cumulative Demand column
print(all(df['Daily Demand'].cumsum() == df['Cumulative Demand']))  # prints True
Printing df then gives:
   Day  Cumulative Demand  Daily Demand
0    1                 10          10.0
1    2                 15           5.0
2    3                 38          23.0
3    4                 44           6.0
4    5                 53           9.0
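As an alternative sketch, pandas also offers Series.diff(), which does the subtract-the-previous-row step in one call. The example below rebuilds the sample data from the question (the column name Day is an assumption) and casts the result back to integers; treat it as one possible variant, not the only way to do it.

```python
import pandas as pd

# Sample data rebuilt from the question; "Day" is an assumed column name.
df = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5],
    "Cumulative Demand": [10, 15, 38, 44, 53],
})

# diff() subtracts the previous row from each row. The first row has no
# predecessor and becomes NaN, so fill it with the first cumulative value.
daily = df["Cumulative Demand"].diff().fillna(df["Cumulative Demand"].iloc[0])

# diff() returns floats because of the NaN; cast back once it is filled.
df["Daily Demand"] = daily.astype(int)

print(df)
```

Like the shift() version, this assumes one row per day, sorted in ascending order.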

Related

Calculate column values in pandas based on previous rows of data in another column

Let's say I have a table with two columns: Date and Amount. The number of rows is at most 3000.
Row Date Amount
1 15/05/2021 248
2 16/05/2021 115
3 17/05/2021 387
4 18/05/2021 214
5 19/05/2021 678
6 20/05/2021 489
7 21/05/2021 875
8 22/05/2021 123
................
I need to add a third column which will calculate the trim mean values based on the Amount column.
I will be using this function: my_table['TrimMean'] = stats.trim_mean(my_table['Amount'], 0.1), but adapted for my problem.
The problem is that this is not a fixed range, but a dynamic one, following this logic: for each row in my table, the trim mean value will be calculated based on the previous 90 values of the Amount column, starting from the row above the current row. If there are fewer than 90 values, then calculate with however many rows are available.
e.g. TrimMean[1000]=stats.trim_mean(array from column Amount containing values from rows 910 to 999)
TrimMean[12]=stats.trim_mean(array from column Amount containing values from rows 1 to 11)
Hope that makes sense.
Is there any way I can calculate this in a simple way, without going through row by row iteration?

We can calculate the trimmed mean by applying trim_mean over a rolling window of size 90 with min_periods=1, then shifting the result down one row so that each window ends at the row above the current one:
from scipy.stats import trim_mean
df['Amount'].rolling(90, min_periods=1).apply(trim_mean, args=(0.1, )).shift()
0 NaN
1 248.000000
2 181.500000
3 250.000000
4 241.000000
5 328.400000
6 355.166667
7 429.428571
Name: Amount, dtype: float64
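To check that rolling-then-shift really matches the "previous N rows, excluding the current one" rule, here is a small sketch that compares it against an explicit row-by-row loop. The helper names trailing_trim_mean and trailing_trim_mean_loop are made up for this illustration, and a window of 3 is used only so that the truncation is visible on eight rows.

```python
import pandas as pd
from scipy.stats import trim_mean

def trailing_trim_mean(amounts, window=90, cut=0.1):
    """Trimmed mean of up to `window` rows above each row (NaN for the first row)."""
    return (
        amounts.rolling(window, min_periods=1)
               .apply(trim_mean, raw=True, args=(cut,))
               .shift()  # move each result down so the current row is excluded
    )

def trailing_trim_mean_loop(amounts, window=90, cut=0.1):
    """Same quantity computed row by row, for comparison only."""
    values = [
        trim_mean(amounts.iloc[max(0, i - window):i].to_numpy(), cut) if i > 0 else float("nan")
        for i in range(len(amounts))
    ]
    return pd.Series(values, index=amounts.index)

# Amounts taken from the sample rows in the question.
amounts = pd.Series([248, 115, 387, 214, 678, 489, 875, 123], name="Amount")

fast = trailing_trim_mean(amounts, window=3)
slow = trailing_trim_mean_loop(amounts, window=3)
print(pd.concat({"rolling": fast, "loop": slow}, axis=1))
```

Passing raw=True hands each window to trim_mean as a NumPy array, which is usually faster than the default of passing a Series.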

looking for better iteration approach for slicing a dataframe

First post: I apologize in advance for sloppy wording (and possibly poor searching if this question has been answered ad nauseam elsewhere; maybe I don't know the right search terms yet).
I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60 second period) and then store all ten calculations in a single list called output.
The 'seconds' column records the second from 1 to 600 in the 10-minute period. If no data was entered for a given second, there is no row for that number of seconds. So, some minutes have 60 rows of data, some have as few as one or two.
Note: the calculation (my_function) is not basic so I can't use groupby and np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.
I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).
output = []
seconds_slicer = 0
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)
If there is a cleaner way to do this, please let me know. Thanks!
Edit: Adding sample data and function details:
seconds input
1 1 0.000054
2 2 -0.000012
3 3 0.000000
4 4 0.000000
5 5 0.000045
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

For this answer, we're going to repurpose the approach from "Bin pandas dataframe by every X rows".
We'll create a dataframe with missing data in the 'seconds' column, which is how I understand your data to look from the description given:
secs=[1,2,3,4,5,6,7,8,9,11,12,14,15,17,19]
data = [np.random.randint(-25,54)/100000 for _ in range(15)]
df=pd.DataFrame(data=zip(secs,data), columns=['seconds','input'])
df
seconds input
0 1 0.00017
1 2 -0.00020
2 3 0.00033
3 4 0.00052
4 5 0.00040
5 6 -0.00015
6 7 0.00001
7 8 -0.00010
8 9 0.00037
9 11 0.00050
10 12 0.00000
11 14 -0.00009
12 15 -0.00024
13 17 0.00047
14 19 -0.00002
I didn't create 600 rows, but that's okay; we'll say we want to bin every 5 seconds instead of every 60. Because we're just trying to use equal time measures for grouping, we can use floor division to see which bin each time value ends up in. (In your case, you'd divide by 60 instead. Since your seconds run from 1 to 600, use (df['seconds'] - 1) // 60 so that second 60 stays in the first minute.)
# we drop the extra 'seconds' column because we don't care about the root sum of squares of seconds in the df
grouped=df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)
grouped
input
seconds
0 0.000441
1 0.000372
2 0.000711
3 0.000505
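Applied to the original 60-second blocks, the same idea might look like the sketch below. The sample values are hypothetical, and selecting the 'input' column before apply avoids having to drop 'seconds' afterwards. Subtracting 1 before the floor division maps seconds 1..60 to minute 0, 61..120 to minute 1, and so on.

```python
import numpy as np
import pandas as pd

def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return ** 2))

# Hypothetical sample: a few seconds present in minutes 1, 2 and 10 only.
df = pd.DataFrame({
    "seconds": [1, 30, 60, 61, 120, 600],
    "input":   [0.003, 0.004, 0.0, 0.006, 0.008, 0.005],
})

# (seconds - 1) // 60 maps 1..60 -> 0, 61..120 -> 1, ..., 541..600 -> 9
minute = (df["seconds"] - 1) // 60

per_minute = (
    df.groupby(minute)["input"]
      .apply(realized_volatility)
      .reindex(range(10))  # minutes with no rows at all show up as NaN
)

output = per_minute.tolist()
print(per_minute)
```

Note one difference from the loop in the question: a minute with no rows yields NaN here, whereas the loop would pass an empty slice to the function and get 0.0. Adding .fillna(0.0) would reproduce the loop's behaviour if that is what you want.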

How to get maximum difference of column "through time" in pandas dataframe

I have a pandas dataframe named 'stock_data' with a MultiIndex index of ('Date', 'StockID') and a column 'Price'. The rows are ordered by date, so for the same stock a later date will have a higher row index. I want to add a new column that for each stock (i.e. group by stock) contains a number with the maximum positive difference between the prices of the stock through time, as in max_price - min_price.
To explain this further, one could calculate this in O(stocks*rows^2) by:
for each stock:
    max_diff = 0.0
    for i in range(len(rows) - 1):
        for j in range(i + 1, len(rows)):
            if price[j] - price[i] > max_diff:
                max_diff = price[j] - price[i]
How do I do this in pandas without actually calculating every value and assigning it to the right spot of a new column of the dataframe one at a time, like the above algorithm does (which could probably be improved by sorting, but this is beside the point)?
So far, I have only figured out that I can group by 'StockID' with:
stock_data.groupby(level='StockID') and pick the column stock_data.groupby(level='StockID')['Price']. But something like:
stock_data.groupby(level='StockID')['Price'].max() - stock_data.groupby(level='StockID')['Price'].min()
is not what I described above because there is no restriction that the max() must come after the min().
Edit: The accepted solution works. Now I am also wondering if there is a way to penalize that difference by how far the max is from the min, so that short-term gains score higher (and are therefore preferred) over long-term ones with a somewhat bigger difference.
For example, maybe we could do cumsum() up to a certain length after min and not till the end? Somehow?

Let's try [::-1] to reverse the order to be able to get the maximum "in the future", then cummin and cummax after the groupby.
# sample data
import numpy as np
import pandas as pd
np.random.seed(1)
stock_data = pd.DataFrame({'Price': np.random.randint(0, 100, size=14)},
                          index=pd.MultiIndex.from_product(
                              [pd.date_range('2020-12-01', '2020-12-07', freq='D'),
                               list('ab')],
                              names=['date', 'stock'])
                          )
and assuming the dates are ordered in time, you can do:
# reversed cummax = highest price from this row onwards; cummin = lowest price up to this row
stock_data['diff'] = (stock_data.loc[::-1, 'Price'].groupby(level='stock').cummax()
                      - stock_data.groupby(level='stock')['Price'].cummin())
print(stock_data)
                  Price  diff
date       stock
2020-12-01 a         37    42
           b         12    59
2020-12-02 a         72    42
           b          9    62
2020-12-03 a         75    42
           b          5    66
2020-12-04 a         79    42
           b         64    66
2020-12-05 a         16    60
           b          1    70
2020-12-06 a         76    60
           b         71    70
2020-12-07 a          6     0
           b         25    24
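If a single number per stock is wanted (the largest price[j] - price[i] with j at or after i), one possible sketch is to take each price minus the running minimum for that stock and then broadcast the per-stock maximum back onto every row. The prices below are copied from the printed output above; the column name max_gain is made up for this example.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.date_range("2020-12-01", "2020-12-07", freq="D"), list("ab")],
    names=["date", "stock"],
)
# Prices copied from the printed output above, in row order.
prices = [37, 12, 72, 9, 75, 5, 79, 64, 16, 1, 76, 71, 6, 25]
stock_data = pd.DataFrame({"Price": prices}, index=index)

# Best gain when selling on a given row: price minus the lowest price so far.
running_min = stock_data.groupby(level="stock")["Price"].cummin()
gain_so_far = stock_data["Price"] - running_min

# Broadcast the per-stock maximum of that gain onto every row of the stock.
stock_data["max_gain"] = gain_so_far.groupby(level="stock").transform("max")

print(stock_data)
```

This gives 60 for stock a and 70 for stock b, which agrees with the largest values in the diff column above. It relies on the rows being ordered by date within each stock.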

How to resample/reindex/groupby a time series based on a column's data?

So I've got a pandas data frame that contains two measures of water use at a 1-second resolution. The values are "hotIn" and "hotOut". hotIn can record down to a tenth of a gallon at one-second resolution, while hotOut records whole-number pulses, each representing a gallon, i.e. when a pulse occurs, one gallon has passed through the meter. The pulses occur roughly every 14-15 seconds.
Data looks roughly like this:
Index hotIn(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:00 4 0
2019-03-23T00:00:01 5 0
2019-03-23T00:00:02 4 0
2019-03-23T00:00:03 4 0
2019-03-23T00:00:04 3 0
2019-03-23T00:00:05 4 1
2019-03-23T00:00:06 4 0
2019-03-23T00:00:07 5 0
2019-03-23T00:00:08 3 0
2019-03-23T00:00:09 3 0
2019-03-23T00:00:10 4 0
2019-03-23T00:00:11 4 0
2019-03-23T00:00:12 5 0
2019-03-23T00:00:13 5 1
What I'm trying to do is resample or reindex the data frame based on the occurrence of pulses and sum the hotIn between the new timestamps.
For example, sum the hotIn between 00:00:00 - 00:00:05 and 00:00:06 - 00:00:13.
Results would ideally look like this:
Index hotIn sum(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 33 1
I've explored using a two-step for/elif loop that just checks whether hotOut == 1. It works, but it's painfully slow on large datasets. I'm positive the timestamp functionality of pandas will be superior if this is possible.
I also can't simply resample at a set frequency, because the interval between pulses varies, so a general resample rule would not work. I've also run into problems with matching data frame lengths when pulling out the timestamps associated with pulses and applying them to the main data frame as a new index.

If I understand correctly, you can do:
# start a new group on the first row and on every row that follows a pulse
s = df['hotOut(pulse=1gal)'].shift().ne(0).cumsum()
(df.groupby(s)
   .agg({'Index':'last', 'hotIn(gpm)':'sum'})
   .reset_index(drop=True)
)
Output:
Index hotIn(gpm)
0 2019-03-23T00:00:05 24
1 2019-03-23T00:00:13 33

You don't want to group on the Index. You want to group whenever 'hotOut(pulse=1gal)' changes.
s = df['hotOut(pulse=1gal)'].cumsum().shift().bfill()
(df.reset_index()
   .groupby(s, as_index=False)
   .agg({'Index': 'last', 'hotIn(gpm)': 'sum', 'hotOut(pulse=1gal)': 'last'})
   .set_index('Index'))
                     hotIn(gpm)  hotOut(pulse=1gal)
Index
2019-03-23T00:00:05          24                   1
2019-03-23T00:00:13          33                   1
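For anyone who wants to run this end to end, here is a self-contained sketch built from the 14 sample rows in the question. It assumes the timestamps live in an ordinary column named Index (not in the data frame's index), and it uses shift(fill_value=0) in place of shift().bfill(); both are choices made for this example.

```python
import pandas as pd

# Sample rows from the question; timestamps kept in a regular "Index" column.
start = pd.Timestamp("2019-03-23 00:00:00")
df = pd.DataFrame({
    "Index": start + pd.to_timedelta(list(range(14)), unit="s"),
    "hotIn(gpm)": [4, 5, 4, 4, 3, 4, 4, 5, 3, 3, 4, 4, 5, 5],
    "hotOut(pulse=1gal)": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Number of pulses seen strictly before each row, so that the row carrying
# a pulse still belongs to (and closes) the group that led up to it.
group_id = df["hotOut(pulse=1gal)"].cumsum().shift(fill_value=0)

result = (
    df.groupby(group_id)
      .agg({"Index": "last", "hotIn(gpm)": "sum", "hotOut(pulse=1gal)": "last"})
      .set_index("Index")
)

print(result)
```

If the data ends with rows after the last pulse, they form one more group whose hotOut value is 0; filtering on result["hotOut(pulse=1gal)"] == 1 would drop that incomplete interval.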

Pandas: groupby and get median value by month?

I have a data frame that looks like this:
org date value
0 00C 2013-04-01 0.092535
1 00D 2013-04-01 0.114941
2 00F 2013-04-01 0.102794
3 00G 2013-04-01 0.099421
4 00H 2013-04-01 0.114983
Now I want to figure out:
the median value for each organisation in each month of the year
X for each organisation, where X is the percentage difference between the lowest median monthly value, and the highest median value.
What's the best way to approach this in Pandas?
I am trying to generate the medians by month as follows, but it's failing:
df['date'] = pd.to_datetime(df['date'])
ave = df.groupby(['row_id', 'date.month']).median()
This fails with KeyError: 'date.month'.
UPDATE: Thanks to @EdChum I'm now doing:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
which works great and gives me:
99P 1 0.106975
2 0.091344
3 0.098958
4 0.092400
5 0.087996
6 0.081632
7 0.083592
8 0.075258
9 0.080393
10 0.089634
11 0.085679
12 0.108039
99Q 1 0.110889
2 0.094837
3 0.100658
4 0.091641
5 0.088971
6 0.083329
7 0.086465
8 0.078368
9 0.082947
10 0.090943
11 0.086343
12 0.109408
Now I guess, for each item in the index, I need to find the min and max calculated values, then the difference between them. What is the best way to do that?

Your first error is a KeyError, not a syntax problem: groupby accepts either a list of column names or the columns themselves. You passed a list of names, and no column called 'date.month' exists, so you want:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
After that you can calculate the min and max for a specific index level so:
((ave.max(level=0) - ave.min(level=0))/ave.max(level=0)) * 100
should give you what you want.
This calculates the difference between the max and min values for each organisation, divides it by the max at that level, and multiplies by 100 to give a percentage.
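In recent pandas versions the level= argument to max() and min() is no longer available (it was deprecated and then removed in pandas 2.0), so the same calculation is usually written with groupby(level=0). The sketch below uses hypothetical values and assumes the organisation column is called org, as in the sample data frame in the question.

```python
import pandas as pd

# Hypothetical data: two organisations with two values in each of two months.
df = pd.DataFrame({
    "org": ["00C", "00C", "00C", "00C", "00D", "00D", "00D", "00D"],
    "date": pd.to_datetime([
        "2013-04-01", "2013-04-15", "2013-05-01", "2013-05-15",
        "2013-04-01", "2013-04-15", "2013-05-01", "2013-05-15",
    ]),
    "value": [0.10, 0.20, 0.30, 0.50, 0.40, 0.40, 0.10, 0.30],
})

# Median per organisation per calendar month (a Series with a two-level index).
ave = df.groupby(["org", df["date"].dt.month])["value"].median()

# Highest and lowest monthly median within each organisation.
highest = ave.groupby(level=0).max()
lowest = ave.groupby(level=0).min()

pct_diff = (highest - lowest) / highest * 100
print(pct_diff)
```

Grouping by dt.month pools the same month across different years; if the data spans several years and each year-month should stay separate, grouping by df["date"].dt.to_period("M") is one option.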
