I am trying to create a new column in my dataframe:
Let X be a variable number of days.
Date
Units Sold
Total Units sold in the last X days
0
2019-01-01 19:00:00
5
1
2019-01-01 15:00:00
4
2
2019-01-05 11:00:00
1
3
2019-01-12 12:00:00
3
4
2019-01-15 15:00:00
2
5
2019-02-04 18:00:00
7
For each row, I need to sum up units sold + all the units sold in the last 10 days (letting x = 10 days)
Desired Result:
Date
Units Sold
Total Units sold in the last X days
0
2019-01-01 19:00:00
5
5
1
2019-01-01 15:00:00
4
9
2
2019-01-05 11:00:00
1
10
3
2019-01-12 12:00:00
3
4
4
2019-01-15 15:00:00
2
6
5
2019-02-04 18:00:00
7
7
I have used the .rolling(window=) method before using periods and I think the following can help
df = df.rolling(window='10D', on='date').sum() but I can't get the syntax right!!
I have tried
df["Total Units sold in the last 10 days"] = df.rolling(on="date", window="10D", closed="both").sum()["Units Sold"] but get the error
"ValueError: Wrong number of items passed 2, placement implies 1" and "ValueError: Shape of passed values is (500, 2), indices imply (500, 1)"
Please please help!
Based on your sample data, you need to specify on parameter.
df = pd.DataFrame({'Date': [pd.Timestamp('2019-01-01 15:00:00'),
pd.Timestamp('2019-01-01 19:00:00'),
pd.Timestamp('2019-01-05 11:00:00'),
pd.Timestamp('2019-01-12 12:00:00'),
pd.Timestamp('2019-01-15 15:00:00'),
pd.Timestamp('2019-02-04 18:00:00')],
'Units Sold': [4, 5, 1, 3, 2, 7],
'Total Units sold in the last X days': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
df = df.sort_values("Date")
df["Total Units sold in the last X days"] = df.rolling("10D", on="Date").sum()["Units Sold"]
df
Date
Units Sold
Total Units sold in the last X days
0
2019-01-01 15:00:00
4
4
1
2019-01-01 19:00:00
5
9
2
2019-01-05 11:00:00
1
10
3
2019-01-12 12:00:00
3
4
4
2019-01-15 15:00:00
2
5
5
2019-02-04 18:00:00
7
7
Related
I am trying to create a new column in my dataframe:
Let X be a variable number of days.
Date
Units Sold
Total Units sold in the last X days
0
2019-01-01 19:00:00
5
1
2019-01-01 15:00:00
4
2
2019-01-05 11:00:00
1
3
2019-01-12 12:00:00
3
4
2019-01-15 15:00:00
2
5
2019-02-04 18:00:00
7
For each row, I need to sum up units sold + all the units sold in the last 10 days (letting x = 10 days)
Desired Result:
Date
Units Sold
Total Units sold in the last X days
0
2019-01-01 19:00:00
5
5
1
2019-01-01 15:00:00
4
9
2
2019-01-05 11:00:00
1
10
3
2019-01-12 12:00:00
3
4
4
2019-01-15 15:00:00
2
6
5
2019-02-04 18:00:00
7
7
I have used the .rolling(window=) method before using periods and I think the following can help
df = df.rolling("10D").sum() but I can't get the syntax right!!
Please please help!
Try:
df["Total Units sold in the last 10 days"] = df.rolling(on="Date", window="10D", closed="both").sum()["Units Sold"]
print(df)
Prints:
Date Units Sold Total Units sold in the last 10 days
0 2019-01-01 5 5.0
1 2019-01-01 4 9.0
2 2019-01-05 1 10.0
3 2019-01-12 3 4.0
4 2019-01-15 2 6.0
5 2019-02-04 7 7.0
I have the following Pandas dataframe and I want to drop the rows for each customer where the difference between Dates is less than 6 month per customer. For example, I want to keep the following dates for customer with ID 1 - 2017-07-01, 2018-01-01, 2018-08-01
Customer_ID Date
1 2017-07-01
1 2017-08-01
1 2017-09-01
1 2017-10-01
1 2017-11-01
1 2017-12-01
1 2018-01-01
1 2018-02-01
1 2018-03-01
1 2018-04-01
1 2018-06-01
1 2018-08-01
2 2018-11-01
2 2019-02-01
2 2019-03-01
2 2019-05-01
2 2020-02-01
2 2020-05-01
Define the following function to process each group of rows (for each customer):
def selDates(grp):
res = []
while grp.size > 0:
stRow = grp.iloc[0]
res.append(stRow)
grp = grp[grp.Date >= stRow.Date + pd.DateOffset(months=6)]
return pd.DataFrame(res)
Then apply this function to each group:
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
The result, for your data sample, is:
Customer_ID Date
0 1 2017-07-01
6 1 2018-01-01
11 1 2018-08-01
12 2 2018-11-01
15 2 2019-05-01
16 2 2020-02-01
I am trying to develop a more efficient loop to complete a problem. At the moment, the code below applies a string if it aligns with a specific value. However, the values are in identical order so a loop could make this process more efficient.
Using the df below as an example, using integers to represent time periods, each integer increase equates to a 15 min period. So 1 == 8:00:00 and 2 == 8:15:00 etc. At the moment I would repeat this process until the last time period. If this gets up to 80 it could become very inefficient. Could a loop be incorporated here?
import pandas as pd
d = ({
'Time' : [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6],
})
df = pd.DataFrame(data = d)
def time_period(row) :
if row['Time'] == 1 :
return '8:00:00'
if row['Time'] == 2 :
return '8:15:00'
if row['Time'] == 3 :
return '8:30:00'
if row['Time'] == 4 :
return '8:45:00'
if row['Time'] == 5 :
return '9:00:00'
if row['Time'] == 6 :
return '9:15:00'
.....
if row['Time'] == 80 :
return '4:00:00'
df['24Hr Time'] = df.apply(lambda row: time_period(row), axis=1)
print(df)
Out:
Time 24Hr Time
0 1 8:00:00
1 1 8:00:00
2 1 8:00:00
3 2 8:15:00
4 2 8:15:00
5 2 8:15:00
6 3 8:30:00
7 3 8:30:00
8 3 8:30:00
9 4 8:45:00
10 4 8:45:00
11 4 8:45:00
12 5 9:00:00
13 5 9:00:00
14 5 9:00:00
15 6 9:15:00
16 6 9:15:00
17 6 9:15:00
This is possible with some simple timdelta arithmetic:
df['24Hr Time'] = (
pd.to_timedelta((df['Time'] - 1) * 15, unit='m') + pd.Timedelta(hours=8))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time timedelta64[ns]
dtype: object
If you need a string, use pd.to_datetime with unit and origin:
df['24Hr Time'] = (
pd.to_datetime((df['Time']-1) * 15, unit='m', origin='8:00:00')
.dt.strftime('%H:%M:%S'))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time object
dtype: object
In general, you want to make a dictionary and apply
my_dict = {'old_val1': 'new_val1',...}
df['24Hr Time'] = df['Time'].map(my_dict)
But, in this case, you can do with time delta:
df['24Hr Time'] = pd.to_timedelta(df['Time']*15, unit='T') + pd.to_timedelta('7:45:00')
Output (note that the new column is of type timedelta, not string)
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
I end up using this
pd.to_datetime((df.Time-1)*15*60+8*60*60,unit='s').dt.time
0 08:00:00
1 08:00:00
2 08:00:00
3 08:15:00
4 08:15:00
5 08:15:00
6 08:30:00
7 08:30:00
8 08:30:00
9 08:45:00
10 08:45:00
11 08:45:00
12 09:00:00
13 09:00:00
14 09:00:00
15 09:15:00
16 09:15:00
17 09:15:00
Name: Time, dtype: object
A fun way is using pd.timedelta_range and index.repeat
n = df.Time.nunique()
c = df.groupby('Time').size()
df['24_hr'] = pd.timedelta_range(start='8 hours', periods=n, freq='15T').repeat(c)
Out[380]:
Time 24_hr
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
I need to resample timeseries data and interpolate missing values in 15 min intervals over the course of an hour. Each ID should have four rows of data per hour.
In:
ID Time Value
1 1/1/2019 12:17 3
1 1/1/2019 12:44 2
2 1/1/2019 12:02 5
2 1/1/2019 12:28 7
Out:
ID Time Value
1 2019-01-01 12:00:00 3.0
1 2019-01-01 12:15:00 3.0
1 2019-01-01 12:30:00 2.0
1 2019-01-01 12:45:00 2.0
2 2019-01-01 12:00:00 5.0
2 2019-01-01 12:15:00 7.0
2 2019-01-01 12:30:00 7.0
2 2019-01-01 12:45:00 7.0
I wrote a function to do this, however efficiency goes down drastically when trying to process a larger dataset.
Is there a more efficient way to do this?
import datetime
import pandas as pd
data = pd.DataFrame({'ID': [1,1,2,2],
'Time': ['1/1/2019 12:17','1/1/2019 12:44','1/1/2019 12:02','1/1/2019 12:28'],
'Value': [3,2,5,7]})
def clean_dataset(data):
ids = data.drop_duplicates(subset='ID')
data['Time'] = pd.to_datetime(data['Time'])
data['Time'] = data['Time'].apply(
lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour,15*(dt.minute // 15)))
data = data.drop_duplicates(subset=['Time','ID']).reset_index(drop=True)
df = pd.DataFrame(columns=['Time','ID','Value'])
for i in range(ids.shape[0]):
times = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
id_data = data[data['ID']==ids.iloc[i]['ID']]
clean_data = times.join(id_data.set_index('Time'), on='Time')
clean_data = clean_data.interpolate(method='linear', limit_direction='both')
clean_data.drop(clean_data.tail(1).index,inplace=True)
df = df.append(clean_data)
return df
clean_dataset(data)
Linear interpolation does become slow with a large data set. Having a loop in your code is also responsible for a large part of the slowdown. Anything that can be removed from the loop and pre-computed will help increase efficiency. For example, if you pre-define the data frame that you use to initialize times, the code becomes 14% more efficient:
times_template = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
for i in range(ids.shape[0]):
times = times_template.copy()
Profiling your code confirms that the interpolation takes the longest amount of time (22.7%), followed by the join (13.1%), the append (7.71%), and then the drop (7.67%) commands.
You can use:
#round datetimes by 15 minutes
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#change date range for 4 values (to `12:45`)
rng = pd.date_range('1/1/2019 12:00','1/1/2019 12:45',freq='15min')
#create MultiIndex and reindex
mux = pd.MultiIndex.from_product([data['ID'].unique(), rng], names=['ID','Time'])
data = data.set_index(['ID','Time']).reindex(mux).reset_index()
#interpolate per groups
data['Value'] = (data.groupby('ID')['Value']
.apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
4 2 2019-01-01 12:00:00 5.0
5 2 2019-01-01 12:15:00 7.0
6 2 2019-01-01 12:30:00 7.0
7 2 2019-01-01 12:45:00 7.0
If range cannot be change:
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#end in 13:00
rng = pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min')
mux = pd.MultiIndex.from_product([data['ID'].unique(), rng], names=['ID','Time'])
data = data.set_index(['ID','Time']).reindex(mux).reset_index()
data['Value'] = (data.groupby('ID')['Value']
.apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
#remove last row per groups
data = data[data['ID'].duplicated(keep='last')]
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
5 2 2019-01-01 12:00:00 5.0
6 2 2019-01-01 12:15:00 7.0
7 2 2019-01-01 12:30:00 7.0
8 2 2019-01-01 12:45:00 7.0
EDIT:
Another solution with merge and left join instead reindex:
from itertools import product
#round datetimes by 15 minutes
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#change date range for 4 values (to `12:45`)
rng = pd.date_range('1/1/2019 12:00','1/1/2019 12:45',freq='15min')
#create helper DataFrame and merge with left join
df = pd.DataFrame(list(product(data['ID'].unique(), rng)), columns=['ID','Time'])
print (df)
ID Time
0 1 2019-01-01 12:00:00
1 1 2019-01-01 12:15:00
2 1 2019-01-01 12:30:00
3 1 2019-01-01 12:45:00
4 2 2019-01-01 12:00:00
5 2 2019-01-01 12:15:00
6 2 2019-01-01 12:30:00
7 2 2019-01-01 12:45:00
data = df.merge(data, how='left')
##interpolate per groups
data['Value'] = (data.groupby('ID')['Value']
.apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
4 2 2019-01-01 12:00:00 5.0
5 2 2019-01-01 12:15:00 7.0
6 2 2019-01-01 12:30:00 7.0
7 2 2019-01-01 12:45:00 7.0
Say I have a pd.Series of daily S&P 500 values, and I would like to filter this series to get the first business day and the associated value of each week.
So, for instance, my filtered series would contain the 5 September 2017 (Tuesday - no value for the Monday), then 11 September 2017 (Monday).
Source series:
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-06 2465.54
2017-09-07 2465.10
2017-09-08 2461.43
2017-09-11 2488.11
2017-09-12 2496.48
Filtered series
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
My solution currently consists of:
mask = SP500.apply(lambda row: SP500[row.name - datetime.timedelta(days=row.name.weekday()):].index[0], axis=1).unique()
filtered = SP500.loc[mask]
This however feels suboptimal/non-pythonic. Any better/faster/cleaner solutions?
Using resample on pd.Series.index.to_series
s[s.index.to_series().resample('W').first()]
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
dtype: float64
df.sort_index().assign(week=df.index.get_level_values(0).week).drop_duplicates('week',keep='first').drop('week',1)
Out[774]:
price
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
I'm not sure that the solution you give works, since the .apply method for series can't access the index, and doesn't have an axis argument. What you gave would work on a DataFrame, but this is simpler if you have a dataframe:
#Make some fake data
x = pd.DataFrame(pd.date_range(date(2017, 10, 9), date(2017, 10, 23)), columns = ['date'])
x['value'] = x.index
print(x)
date value
0 2017-10-09 0
1 2017-10-10 1
2 2017-10-11 2
3 2017-10-12 3
4 2017-10-13 4
5 2017-10-14 5
6 2017-10-15 6
7 2017-10-16 7
8 2017-10-17 8
9 2017-10-18 9
10 2017-10-19 10
11 2017-10-20 11
12 2017-10-21 12
13 2017-10-22 13
14 2017-10-23 14
#filter
filtered = x.groupby(x['date'].apply(lambda d: d-timedelta(d.weekday())), as_index = False).first()
print(filtered)
date value
0 2017-10-09 0
1 2017-10-16 7
2 2017-10-23 14