make a shift by index with a pandas dataframe - python

Is there a pandas way to do this?
from datetime import timedelta

predicted_sells = []
for row in df.values:
    index_tms = row[0]
    delta = index_tms + timedelta(hours=1)
    try:
        sells_to_predict = df.loc[delta]['cars_sold']
    except KeyError:
        # no row one hour later: nothing to predict
        sells_to_predict = None
    predicted_sells.append(sells_to_predict)
df['sell_to_predict'] = predicted_sells
Example explanation:
sell is the number of cars I sold at time tms; sell_to_predict is the number of cars I sold one hour later, and that is what I want to predict. So I want to build a new column containing, at time tms, the number of cars I will sell at time tms + 1h.
Before my code runs, it looks like this:
tms sell
2015-11-23 15:00:00 6
2015-11-23 16:00:00 2
2015-11-23 17:00:00 10
After, it looks like this:
tms sell sell_to_predict
2015-11-23 15:00:00 6 2
2015-11-23 16:00:00 2 10
2015-11-23 17:00:00 10 NaN
I create a new column based on a shift of another column, but it is not a shift by a fixed number of rows: it is a shift based on the index (here the index is a timestamp).
Here is another example, a little more complex:
before:
            sell  random
store hour
1     1        1       9
      2        7       7
2     1        4       3
      2        2       3
after:
            sell  random  predict
store hour
1     1        1       9        7
      2        7       7      NaN
2     1        4       3        2
      2        2       3      NaN

Have you tried shift? e.g.
import pandas as pd

df = pd.DataFrame(list(range(4)))
df.columns = ['sold']
df['predict'] = df.sold.shift(-1)  # positional shift: each row gets the next row's value
df

   sold  predict
0     0      1.0
1     1      2.0
2     2      3.0
3     3      NaN
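Note that a plain shift(-1) is positional, so it only means "one hour later" when the index has no holes. On a DatetimeIndex, shift also accepts a freq argument that shifts by the index rather than by position; a minimal sketch of that idea, built from the example frame above (use 'H' instead of 'h' on older pandas):
import pandas as pd

df = pd.DataFrame(
    {'sell': [6, 2, 10]},
    index=pd.to_datetime(['2015-11-23 15:00:00',
                          '2015-11-23 16:00:00',
                          '2015-11-23 17:00:00']))
# shift the index labels back one hour; the assignment then aligns
# on the index, so the row labelled tms gets the value sold at tms + 1h
df['sell_to_predict'] = df['sell'].shift(-1, freq='h')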

The answer was to resample, so the index no longer has holes, and then to apply the answer to this question: How do you shift Pandas DataFrame with a multiindex?
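A minimal sketch of that combination, assuming tms is the DatetimeIndex of the first example and (store, hour) the MultiIndex of the second:
# fill the holes so every hour has a row; after that, a positional
# shift of -1 really does mean "one hour later"
df = df.resample('h').asfreq()
df['sell_to_predict'] = df['sell'].shift(-1)

# for the (store, hour) MultiIndex example: shift within each store
df['predict'] = df.groupby(level='store')['sell'].shift(-1)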

Related

pandas get a sum column for next 7 days

I want to get, for each row, the sum of the value column over the next 7 days.
My dataframe:
date value
0 2021-04-29 1
1 2021-05-03 2
2 2021-05-06 1
3 2021-05-15 1
4 2021-05-17 2
5 2021-05-18 1
6 2021-05-21 2
7 2021-05-22 5
8 2021-05-24 4
I tried making a new column containing the date 7 days ahead of the current date:
df['temp'] = df['date'] + timedelta(days=7)
and then summing value over the date range:
df['next_7days'] = df[(df.date > df.date) & (df.date <= df.temp)].value.sum()
But this gives all 0s: df.date > df.date compares the column with itself row by row, so the mask is always False and the sum is taken over an empty selection.
intended result:
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
The method I am currently using is quite tedious; are there any better methods to get the intended result?
With a list comprehension:
tomorrow_dates = df.date + pd.Timedelta("1 day")
next_week_dates = df.date + pd.Timedelta("7 days")
df["next_7days"] = [df.value[df.date.between(tomorrow, next_week)].sum()
                    for tomorrow, next_week in zip(tomorrow_dates, next_week_dates)]
where we first compute tomorrow's and next week's dates and store them. Then we zip them together and use between of pd.Series to get a boolean series telling whether each date falls in the desired range, use boolean indexing to pick the actual values, and sum them. We do this for each date pair
to get
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
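For reference, a minimal self-contained version of the same approach (assuming date has been parsed to datetime dtype, which between needs for the comparison):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-04-29', '2021-05-03', '2021-05-06',
                            '2021-05-15', '2021-05-17', '2021-05-18',
                            '2021-05-21', '2021-05-22', '2021-05-24']),
    'value': [1, 2, 1, 1, 2, 1, 2, 5, 4],
})
tomorrow_dates = df.date + pd.Timedelta("1 day")
next_week_dates = df.date + pd.Timedelta("7 days")
df["next_7days"] = [df.value[df.date.between(tomorrow, next_week)].sum()
                    for tomorrow, next_week in zip(tomorrow_dates, next_week_dates)]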

extract chunk of Pandas dataframe from today's date up to "n" weeks ahead

I want to write code that cuts a dataframe of weekly prediction data down to an 'n'-week window starting from today's date.
A toy example of my dataframe looks like this:
data4 = pd.DataFrame({'Id': ['001', '002', '003'],
                      '2020-01-01': [4, 5, 6],
                      '2020-01-08': [3, 5, 6],
                      '2020-01-15': [2, 6, 7],
                      '2020-01-22': [2, 6, 7],
                      '2020-01-29': [2, 6, 7],
                      '2020-02-5': [2, 6, 7],
                      '2020-02-12': [4, 4, 4]})
    Id  2020-01-01  2020-01-08  2020-01-15  2020-01-22  2020-01-29  2020-02-5  2020-02-12
0  001           4           3           2           2           2          2           4
1  002           5           5           6           6           6          6           4
2  003           6           6           7           7           7          7           4
I am trying to get:
dataset_for_analysis = pd.DataFrame({'Id': ['001', '002', '003'],
                                     '2020-01-15': [2, 6, 7],
                                     '2020-01-22': [2, 6, 7],
                                     '2020-01-29': [2, 6, 7],
                                     '2020-02-5': [2, 6, 7]})

    Id  2020-01-15  2020-01-22  2020-01-29  2020-02-5
0  001           2           2           2          2
1  002           6           6           6          6
2  003           7           7           7          7
I have tried this, based on what I understood from the datetime documentation:
dataset_for_analysis = data4.datetime.datetime.today+ pd.Timedelta('3 weeks')
and gives me the error:
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'datetime'
I am a bit confused about how to use datetime.today and Timedelta, especially because I am working with weekly data. Is there a way to get the current week of the year I am in, rather than the day? Would anyone be able to help with this? Thank you!
You can do the following:
today = '2020-01-15'
n_weeks = 10
# get dates by n weeks
cols = [str((pd.to_datetime(today) + pd.Timedelta(weeks=x)).date()) for x in range(n_weeks)]
# pick the columns which exist in cols
use_cols = ['Id'] + [x for x in data4.columns if x in cols]
# select the columns
data4 = data4[use_cols]
    Id  2020-01-15  2020-01-22  2020-01-29  2020-02-12
0  001           2           2           2           4
1  002           6           6           6           4
2  003           7           7           7           4
(The '2020-02-5' column is dropped here because the generated strings are zero-padded, '2020-02-05', which never matches the unpadded column name.)
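To anchor the window at the actual current date rather than a hardcoded string, one option is a sketch along these lines (it assumes the column names are zero-padded ISO dates such as '2020-02-05', so the string comparison matches):
import pandas as pd

n_weeks = 3
today = pd.Timestamp.today().normalize()  # midnight today
cols = [str((today + pd.Timedelta(weeks=x)).date()) for x in range(n_weeks)]
use_cols = ['Id'] + [c for c in data4.columns if c in cols]
dataset_for_analysis = data4[use_cols]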

Elegant way to drop records in pandas based on size/count of a record

This isn't a duplicate; I am not trying to drop rows based on the index.
I have a dataframe like the one shown below:
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00',
               '2173-04-05 12:59:00', '2173-05-04 13:14:00',
               '2173-05-05 13:37:00', '2173-07-06 13:39:00',
               '2173-07-08 11:30:00', '2173-04-08 16:00:00',
               '2173-04-09 22:00:00', '2173-04-11 04:00:00',
               '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
    'val': [5, 2, 3, 1, 1, 6, 5, 5, 8, 3, 4, 6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
I would like to drop records based on subject_id if their count is <=5.
This is what I tried:
df1 = df.groupby(['subject_id']).size().reset_index(name='counter')
df1[df1['counter'] > 5]  # shows that only subject_id = 1 has a count above 5
Now, using this subject_id, I have to select the matching rows from the base dataframe.
There might be a more elegant way to do this. I would like to get the output shown below, i.e. the corresponding rows of my base dataframe.
Use:
df[df.groupby('subject_id')['subject_id'].transform('size')>5]
Output:
subject_id time_1 val day
0 1 2173-04-03 12:35:00 5 3
1 1 2173-04-03 12:50:00 2 3
2 1 2173-04-05 12:59:00 3 5
3 1 2173-05-04 13:14:00 1 4
4 1 2173-05-05 13:37:00 1 5
5 1 2173-07-06 13:39:00 6 6
6 1 2173-07-08 11:30:00 5 8
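An equivalent, arguably more readable spelling uses groupby().filter, which keeps every group whose sub-frame satisfies the condition (usually slower than transform on large frames):
df.groupby('subject_id').filter(lambda g: len(g) > 5)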

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product_mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas dataframe?
Sorry if this question is very basic, but I am pretty new to pandas.
For each row of the data, I want the amount by which the price changed relative to the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling-window calculation,
# taking the difference between the current value and the previous one
def calc_mrp(ser):
    # in case you want the relative change instead, just
    # divide by x[1] or x[0] in the lambda function;
    # raw=True hands the lambda a plain numpy array, so
    # positional indexing is safe
    return ser.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True)

# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change'] = df.groupby('product_id')['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
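For what it's worth, the same per-product difference can be computed more directly with groupby().diff(), which subtracts each group's previous value without a rolling window:
df['mrp_change'] = df.groupby('product_id')['product_mrp'].diff()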

3 Day running sum per user

I have data that looks like this
time user value
0 2012-01-01 01:01:01 1 1
1 2012-01-02 01:01:01 1 2
2 2012-01-04 01:01:01 2 3
3 2012-01-06 01:01:01 2 1
4 2012-01-07 01:01:01 2 2
5 2012-01-08 01:01:01 2 1
6 2012-01-10 01:01:01 2 2
7 2012-01-13 01:01:01 2 2
8 2012-01-14 01:01:01 3 1
...
and I need to know, for each user, whether there is any 3-day period in which the sum of the values is greater than 5 (1 means yes, 0 means no). The result should look like this:
user 3DS
1 0
2 1
3 0
...
I know there's some combination of groupby on the user with some type of apply, I think. I've found a windowing approach that may be useful:
three_days = timedelta(days=3)
lambda x: sum(df['value'][df['time'] <= x['time'] + three_days])
How do I use pandas to get the second data frame with users and their 3-day sum (3DS)?
This looks like a rolling sum over each user:
# 3-row rolling sum per user; the result is indexed by (user, time)
df_total = df.set_index('time').groupby('user').rolling(3).sum()
# 1 if any window sum exceeds 5 for that user, else 0
df_total.groupby(level='user').agg(lambda x: x.max() > 5) * 1
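Note that rolling(3) counts three consecutive rows, not three calendar days; it happens to reproduce the expected output on this sample. If the window really must be time-based, a sketch of that variant (assuming time is a proper datetime column) would be:
df_total = (df.set_index('time')
              .groupby('user')['value']
              .rolling('3D', closed='both')  # 3-day window, endpoints included
              .sum())
(df_total.groupby(level='user').max() > 5).astype(int)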
