Today I'm a bit stuck with a problem that I'm not able to resolve efficiently. I have a DataFrame like this:
id Date Days Value
1 20130101 95 100
1 20130102 100 100
.
1 20140101 120 90
.
1 20150101 150 90
.
1 20180101 190 85
2 20130101 98 80
.
2 20140101 70 80
.
2 20180101 150 80
So, it's monthly data, and I want to create a column named Value_t5 that takes the Value of a given row five years into the future, but only if Days was over 90 at each intermediate 12-month point. So, for the first row, I have to check 20140101, 20150101, 20160101, 20170101 and 20180101. Because Days is over 90 in all of those rows, Value_t5 will take the value 85 for the 20130101 row (nan for the rest, because I didn't add more data). Then, for id number 2, the 20130101 row would take a nan value, because on 20140101 Days was only 70, below the 90 threshold. So, the expected output would be:
id Date Days Value Value_t5
1 20130101 95 100 85
1 20130102 100 100 np.nan
.
1 20140101 120 90 np.nan
.
1 20150101 150 90 np.nan
.
1 20180101 190 85 np.nan
2 20130101 98 80 np.nan
.
2 20140101 70 80 np.nan
.
2 20180101 150 80 np.nan
I'm guessing some combination of groupby, .all() and pd.DateOffset() might be involved in the answer, but I haven't been able to find it without having to merge 5 offsetted DataFrames.
Also I've got 17 million rows of data, so apply is probably not the best idea.
My best bet would be to create an n x 5 matrix with all yearly Days values for each row and then process that. Is there any straightforward way to do this?
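For reference, here is a minimal sketch of that offset-merge / n x 5 matrix idea (my own illustration, assuming Date has been parsed with pd.to_datetime and that every yearly anniversary row actually exists); the answer below avoids the five merges:
import numpy as np
import pandas as pd

out = df.copy()
for k in range(1, 6):
    # pull in the row k years ahead by shifting its Date back k years
    future = df.rename(columns={'Days': f'Days_plus{k}', 'Value': f'Value_plus{k}'})
    future['Date'] = future['Date'] - pd.DateOffset(years=k)
    out = out.merge(future[['id', 'Date', f'Days_plus{k}', f'Value_plus{k}']],
                    on=['id', 'Date'], how='left')

# require Days > 90 at all five anniversaries, then take the Value five years out
all_over_90 = np.logical_and.reduce([out[f'Days_plus{k}'].gt(90) for k in range(1, 6)])
out['Value_t5'] = np.where(all_over_90, out['Value_plus5'], np.nan)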
If your data is monthly, you can simply do rolling:
import numpy as np
import pandas as pd

# toy data:
reps = 100000
dates = np.tile(pd.date_range('2005-01-01', '2020-12-01', freq='MS'), reps)
ids = np.repeat(np.arange(reps), len(dates)//reps)
np.random.seed(1)
df = pd.DataFrame({'id': ids,
                   'Date': dates,
                   'Days': np.random.randint(0, 20, len(dates)),
                   'Values': np.arange(len(dates))})
# threshold, put 90 here
thresh = 5
# rolling window length in months (also reused as the number of years to look ahead)
roll = 5
# flag months that clear the threshold
df['valid'] = df['Days'].ge(thresh).astype(int)
groups = df.groupby('id')
# number of valid months in each trailing window of length `roll`...
df['5m'] = groups['valid'].rolling(roll).sum().values
# ...aligned onto the current row, so '5m' now counts the next `roll` months
df['5m'] = groups['5m'].shift(-roll).values
# take the Value `roll` years ahead only where every one of those months was valid
df['value_t5'] = np.where(df['5m']==roll, groups['Values'].shift(-roll*12), np.nan)
Output (head):
id Date Days Values valid 5m value_t5
0 1 2013-01-01 5 0 1 5.0 60.0
1 1 2013-02-01 11 1 1 5.0 61.0
2 1 2013-03-01 12 2 1 5.0 62.0
3 1 2013-04-01 8 3 1 4.0 NaN
4 1 2013-05-01 9 4 1 4.0 NaN
Performance: On my computer, that took about 40 seconds (for 19MM rows).
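If the check should land exactly on the five yearly anniversaries rather than on five consecutive months (as the original question describes), here is a hedged variant of my own, assuming complete monthly data with no gaps so that anniversaries sit exactly 12 rows apart within each id:
g = df.groupby('id')
# valid flag 12, 24, ..., 60 rows ahead; missing future rows count as not valid
yearly_ok = np.logical_and.reduce(
    [g['valid'].shift(-12 * k).eq(1).to_numpy() for k in range(1, 6)]
)
df['value_t5_yearly'] = np.where(yearly_ok, g['Values'].shift(-60), np.nan)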
Related
I am trying to calculate a running total per customer for the previous 365 days using pandas but my code isn't working. My intended output would be something like this:
date        customer  daily_total_per_customer  rolling_total
2016-07-29  1         100                        100
2016-08-01  1         50                         150
2017-01-12  1         80                         230
2017-10-23  1         180                        260
2018-03-03  1         0                          180
2018-03-06  1         40                         220
2019-03-16  1         500                        500
2017-04-07  2         50                         50
2017-04-09  2         230                        280
2018-02-11  2         80                         360
2018-05-12  2         0                          80
2019-05-10  2         0                          0
I tried the following:
df_3 = df_3.set_index(['customer', 'date']).sort_values(by='date')
rolling_sum = df_3.rolling('365d', on='date')["daily_total_per_customer"].sum()
df_3["rolling_total"] = rolling_sum
And I get the following error:
ValueError: invalid on specified as date, must be a column (of DataFrame), an Index or None
To recreate the data:
import pandas as pd

dates = ['2016-07-29',
'2016-08-01',
'2017-01-12',
'2017-10-23',
'2018-03-03',
'2018-03-06',
'2019-03-16',
'2017-04-07',
'2017-04-09',
'2018-02-11',
'2018-05-12',
'2019-05-10',
]
customer = [1,1,1,1,1,1,1,2,2,2,2,2]
daily_total = [100,50,80,180,0,40,500,50,230,80,0,0]
df = pd.DataFrame({'date': dates,
'customer': customer,
'daily_total_per_customer':daily_total,})
Perhaps someone can point me in the right direction. Thanks!
Annotated code
# Parse the strings to datetime
df['date'] = pd.to_datetime(df['date'])
# Sort the dates in ASC order if not already sorted
df = df.sort_values(['customer', 'date'])
# Group the dataframe by customer then for each group
# calculate rolling sum on 'daily_total_per_customer'
s = df.groupby('customer').rolling('365d', on='date')['daily_total_per_customer'].sum()
# Merge the result with original df
df.merge(s.reset_index(name='rolling_total'))
date customer daily_total_per_customer rolling_total
0 2016-07-29 1 100 100
1 2016-08-01 1 50 150
2 2017-01-12 1 80 230
3 2017-10-23 1 180 260
4 2018-03-03 1 0 180
5 2018-03-06 1 40 220
6 2019-03-16 1 500 500
7 2017-04-07 2 50 50
8 2017-04-09 2 230 280
9 2018-02-11 2 80 360
10 2018-05-12 2 0 80
11 2019-05-10 2 0 0
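For what it's worth, the original attempt fails because set_index(['customer', 'date']) moves date into the MultiIndex, while rolling(..., on='date') expects a column. A rough alternative sketch (my own, assuming a reasonably recent pandas) is to roll on a DatetimeIndex instead:
df['date'] = pd.to_datetime(df['date'])
df_idx = df.set_index('date').sort_index()
rolling_total = (df_idx.groupby('customer')
                       .rolling('365d')['daily_total_per_customer']
                       .sum())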
My goal today is to follow each ID that belongs to Category==1 on a given date, one year later. So I have a DataFrame like this:
Period ID Amount Category
20130101 1 100 1
20130101 2 150 1
20130101 3 100 1
20130201 1 90 1
20130201 2 140 1
20130201 3 95 1
20130201 5 250 0
. . .
20140101 1 40 1
20140101 2 70 1
20140101 5 160 0
20140201 1 35 1
20140201 2 65 1
20140201 5 150 0
For example, in 20130201 I have 3 IDs that belong to Category 1 (1, 2 and 3), but just 2 of them are present in 20140201 (1 and 2). So I need to get the value of Amount one year later, only for those IDs, like this:
Period ID Amount Category Amount_t1
20130101 1 100 1 40
20130101 2 150 1 70
20130101 3 100 1 nan
20130201 1 90 1 35
20130201 2 140 1 65
20130201 3 95 1 nan
20130201 5 250 0 nan
. . .
20140101 1 40 1 nan
20140101 2 70 1 nan
20140101 5 160 0 nan
20140201 1 35 1 nan
20140201 2 65 1 nan
20140201 5 150 0 nan
So, if the ID doesn't appear next year or belongs to Category 0, I'll get a nan. My first approach was to get the list of unique IDs in each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:
aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)
But I didn't know how to keep going. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift() that would be easy, but I'm not sure how to get the value offset by a year within each group.
You can create lagged features with an exact merge after you manually lag one of the join keys.
import pandas as pd
# Datetime so we can do calendar year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
# Create one with the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period-pd.offsets.DateOffset(years=1) # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})
# Keep only values you want to merge
df2 = df2[df2.Category.eq(1)]
# Bring lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')
Period ID Amount Category Amount_t1
0 2013-01-01 1 100 1 40.0
1 2013-01-01 2 150 1 70.0
2 2013-01-01 3 100 1 NaN
3 2013-02-01 1 90 1 35.0
4 2013-02-01 2 140 1 65.0
5 2013-02-01 3 95 1 NaN
6 2013-02-01 5 250 0 NaN
7 2014-01-01 1 40 1 NaN
8 2014-01-01 2 70 1 NaN
9 2014-01-01 5 160 0 NaN
10 2014-02-01 1 35 1 NaN
11 2014-02-01 2 65 1 NaN
12 2014-02-01 5 150 0 NaN
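As a rough alternative sketch (my own, not part of the answer above), the same lookup can be done without a merge by reindexing a (Period, ID)-indexed Series, assuming (Period, ID) pairs are unique and Period has already been parsed to datetime:
import numpy as np

# keep only Category 1 rows as lookup targets, so a Category 0 future row
# naturally comes back as NaN
lookup = df.loc[df['Category'].eq(1)].set_index(['Period', 'ID'])['Amount']
future_keys = pd.MultiIndex.from_arrays(
    [df['Period'] + pd.offsets.DateOffset(years=1), df['ID']]
)
df['Amount_t1_alt'] = lookup.reindex(future_keys).to_numpy()
# also blank out rows whose own Category is 0, to match the merge above
df['Amount_t1_alt'] = df['Amount_t1_alt'].where(df['Category'].eq(1))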
I'm coding a Python script to recalculate the inventory of a specific SKU from today back over the past 365 days, given the current stock. For that I'm using a pandas DataFrame, as shown below:
Index DATE SUM_IN SUM_OUT
0 5/12/18 500 0
1 5/13/18 0 -403
2 5/14/18 0 -58
3 5/15/18 0 -39
4 5/16/18 100 0
5 5/17/18 0 -98
6 5/18/18 276 0
7 5/19/18 0 -139
8 5/20/18 0 -59
9 5/21/18 0 -70
The DataFrame presents the sum of quantities IN and OUT of the warehouse, grouped by date. My intention is to add a column named "STOCK" that presents the stock level of the SKU on the represented day. All I have is the current stock level (index 9). So what I need is to recalculate all the levels day by day through the whole date series (from index 9 back to index 0).
In Excel it's easy: I can put the current level in the last row and just extend the calculation until I reach the row of index 0, as presented in the spreadsheet (Column E is the formula, Column G is the desired output).
Can someone help me achieve this result?
I already have the stock level of the last day (i.e. 5/21/2018 is equal to 10). What I need is to place the number 10 at index 9 and calculate the stock levels of the previous days, from index 8 down to index 0.
The desired output should be:
Index DATE TRANSACTION_IN TRANSACTION_OUT SUM_IN SUM_OUT STOCK
0 5/12/18 1 0 500 0 500
1 5/13/18 0 90 0 -403 97
2 5/14/18 0 11 0 -58 39
3 5/15/18 0 11 0 -39 0
4 5/16/18 1 0 100 0 100
5 5/17/18 0 17 0 -98 2
6 5/18/18 1 0 276 0 278
7 5/19/18 0 12 0 -139 139
8 5/20/18 0 4 0 -59 80
9 5/21/18 0 7 0 -70 10
(Updated)
last_stock = 10  # known stock level on the last day; replace with your own value
# running net movement (IN + OUT), then shift the whole series so that the
# last value matches the known closing stock
a = (df.SUM_IN + df.SUM_OUT).cumsum()
df["STOCK"] = a - (a.iloc[-1] - last_stock)
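For comparison, here is a slower but very literal sketch of the Excel-style backward calculation (my own illustration; it assumes the known closing stock of 10 on the last row, as stated in the question):
stock = [0.0] * len(df)
stock[-1] = 10  # known stock level on the last day
for i in range(len(df) - 2, -1, -1):
    # today's stock = tomorrow's stock minus tomorrow's movements
    stock[i] = stock[i + 1] - df['SUM_IN'].iloc[i + 1] - df['SUM_OUT'].iloc[i + 1]
df['STOCK_loop'] = stock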
Use cumsum to create the key for the groupby, then use cumsum again:
df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()
Out[292]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 276.0
7 137.0
8 78.0
9 8.0
dtype: float64
Update
s=df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()-df.STOCK
df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()-s.groupby(df['SUM_IN'].gt(0).cumsum()).bfill().fillna(0)
Out[318]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 278.0
7 139.0
8 80.0
9 10.0
dtype: float64
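To make the grouping key used above a bit more concrete: every positive SUM_IN starts a new block, so the cumulative OUTs reset at each inflow. For the sample data:
key = df['SUM_IN'].gt(0).cumsum()
# key -> 1 1 1 1 2 2 3 3 3 3  (a new group at every replenishment)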
I have a table in a pandas df:
id count
0 10 3
1 20 4
2 30 5
3 40 NaN
4 50 NaN
5 60 NaN
6 70 NaN
I also have another pandas Series s:
0 1000
1 2000
2 3000
3 4000
What I want to do is replace the NaN values in my df with the respective values from Series s.
My final output should be:
id count
0 10 3
1 20 4
2 30 5
3 40 1000
4 50 2000
5 60 3000
6 70 4000
Any ideas how to achieve this?
Thanks in advance.
There is a problem: the length of the Series can differ from the number of NaN values in column count, so you need to reindex the Series by the number of NaNs:
s = pd.Series({0: 1000, 1: 2000, 2: 3000, 3: 4000, 5: 5000})
print (s)
0 1000
1 2000
2 3000
3 4000
5 5000
dtype: int64
df.loc[df['count'].isnull(), 'count'] = s.reindex(np.arange(df['count'].isnull().sum())).values
print (df)
id count
0 10 3.0
1 20 4.0
2 30 5.0
3 40 1000.0
4 50 2000.0
5 60 3000.0
6 70 4000.0
It's as simple as this (using bracket access, since df.count would hit the DataFrame.count method rather than the column):
df.loc[df['count'].isnull(), 'count'] = s.values
In this case, I prefer iterrows for its readability.
counter = 0
for index, row in df.iterrows():
    if pd.isnull(row['count']):              # a scalar has no .isnull(), so use pd.isnull
        df.at[index, 'count'] = s[counter]   # .at replaces the deprecated set_value
        counter += 1
I might add that this 'merging' of dataframe + series is a bit odd, and prone to bizarre errors. If you can somehow get the series into the same format as the dataframe (i.e. add some index/column tags), then you might be better served by the merge function.
You can re-index your Series with the indexes of the NaN rows from the dataframe and then fillna() with your Series:
s.index = np.where(df['count'].isnull())[0]
df['count'] = df['count'].fillna(s)
print(df)
id count
0 10 3.0
1 20 4.0
2 30 5.0
3 40 1000.0
4 50 2000.0
5 60 3000.0
6 70 4000.0
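A small follow-up sketch (my own variant): the same re-index idea without assuming that df has the default RangeIndex:
# one value of s per NaN, aligned by label instead of by position
s.index = df.index[df['count'].isnull()]
df['count'] = df['count'].fillna(s)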
I am trying to find the average monthly cost per user_id, but I am only able to get the average cost per user or the monthly cost per user.
Because I group by user and month, there is no way to get the average of the second groupby (month) unless I transform the groupby output into something else.
This is my df:
df = pd.DataFrame({'id': pd.Series([1,1,1,1,2,2,2,2]),
                   'cost': pd.Series([10,20,30,40,50,60,70,80]),
                   'mth': pd.Series([3,3,4,5,3,4,4,5])})
cost id mth
0 10 1 3
1 20 1 3
2 30 1 4
3 40 1 5
4 50 2 3
5 60 2 4
6 70 2 4
7 80 2 5
I can get the monthly sum, but I want the average of the months for each user_id.
df.groupby(['id','mth'])['cost'].sum()
id mth
1 3 30
4 30
5 40
2 3 50
4 130
5 80
I want something like this:
id average_monthly
1 (30+30+40)/3
2 (50+130+80)/3
Resetting the index should work. Try this:
In [19]: df.groupby(['id', 'mth']).sum().reset_index().groupby('id').mean()
Out[19]:
mth cost
id
1 4.0 33.333333
2 4.0 86.666667
You can just drop mth if you want. The logic is that after the sum part, you have this:
In [20]: df.groupby(['id', 'mth']).sum()
Out[20]:
cost
id mth
1 3 30
4 30
5 40
2 3 50
4 130
5 80
Resetting the index at this point will give you unique months.
In [21]: df.groupby(['id', 'mth']).sum().reset_index()
Out[21]:
id mth cost
0 1 3 30
1 1 4 30
2 1 5 40
3 2 3 50
4 2 4 130
5 2 5 80
It's just a matter of grouping it again, this time using mean instead of sum. This should give you the averages.
Let us know if this helps.
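A roughly equivalent sketch that skips the reset_index step by averaging over the id level of the grouped sums (same toy df as above):
# sum per (id, mth), then average those monthly sums per id
avg = (df.groupby(['id', 'mth'])['cost'].sum()
         .groupby(level='id')
         .mean()
         .rename('average_monthly'))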
df_monthly_average = (
    df.groupby(["InvoiceMonth", "InvoiceYear"])["Revenue"]
    .sum()
    .reset_index()
    .groupby("InvoiceYear")   # group the monthly sums by year, not by the summed Revenue
    .mean()
    .reset_index()
)