I have the following DataFrame:
from datetime import datetime
from pandas import DataFrame
df = DataFrame({
    'Buyer': ['Carl', 'Carl', 'Carl', 'Carl', 'Joe', 'Carl'],
    'Quantity': [18, 3, 5, 1, 9, 3],
    'Date': [
        datetime(2013, 9, 1, 13, 0),
        datetime(2013, 9, 1, 13, 5),
        datetime(2013, 10, 1, 20, 0),
        datetime(2013, 10, 3, 10, 0),
        datetime(2013, 12, 2, 12, 0),
        datetime(2013, 9, 2, 14, 0),
    ]
})
First: I am looking to add another column to this DataFrame which sums up the purchases of the last 5 days for each buyer. In particular the result should look like this:
                  Quantity
Buyer Date
Carl  2013-09-01        21
      2013-09-02        24
      2013-10-01         5
      2013-10-03         6
Joe   2013-12-02         9
To do so I started with the following:
df1 = (df.set_index(['Date', 'Buyer'])
         .unstack(level=[1])
         .resample('D').sum()  # the how='sum' keyword is no longer accepted by resample
         .fillna(0))
However, I do not know how to add another column to this DataFrame which can add up for each row the previous 5 row entries.
Second: I would like to add another column to this DataFrame which not only sums up the purchases of the last 5 days as in (1), but also weights these purchases by their age. For example: purchases from 5 days ago should count 20%, those from 4 days ago 40%, those from 3 days ago 60%, those from 2 days ago 80%, and those from one day ago and from today 100%.
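One possible way to get both columns is sketched below. It is only a sketch under a few assumptions: the column names Sum5D and Weighted and the helper rolling_sums are made up for illustration, the plain sum uses a '5D' time window (the current day plus the four days before it, which reproduces the expected output above), and the weighted sum reaches back five full days as described in the second part.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    'Buyer': ['Carl', 'Carl', 'Carl', 'Carl', 'Joe', 'Carl'],
    'Quantity': [18, 3, 5, 1, 9, 3],
    'Date': [
        datetime(2013, 9, 1, 13, 0),
        datetime(2013, 9, 1, 13, 5),
        datetime(2013, 10, 1, 20, 0),
        datetime(2013, 10, 3, 10, 0),
        datetime(2013, 12, 2, 12, 0),
        datetime(2013, 9, 2, 14, 0),
    ],
})

# Total quantity per buyer and calendar day (the time of day is dropped).
daily = (df.assign(Date=df['Date'].dt.normalize())
           .groupby(['Buyer', 'Date'])['Quantity']
           .sum())

# Assumed weights by age in days: position 0 is today, position 5 is five days ago.
WEIGHTS = [1.0, 1.0, 0.8, 0.6, 0.4, 0.2]


def rolling_sums(per_day):
    """Plain and weighted rolling sums for one buyer's daily totals (hypothetical helper)."""
    # A '5D' window covers the current day and the four days before it.
    plain = per_day.rolling('5D').sum()
    # Fill missing calendar days with 0 so that shift(k) really means "k days ago".
    full = per_day.resample('D').sum()
    weighted = sum(w * full.shift(k, fill_value=0) for k, w in enumerate(WEIGHTS))
    return pd.DataFrame({'Sum5D': plain, 'Weighted': weighted.loc[per_day.index]})


# Process each buyer separately, then stack the pieces under a (Buyer, Date) index.
result = pd.concat(
    {buyer: rolling_sums(grp.droplevel('Buyer'))
     for buyer, grp in daily.groupby(level='Buyer')},
    names=['Buyer', 'Date'],
)
print(result)
```

Handling each buyer on its own keeps the time-based window from mixing purchases of different buyers, at the cost of a Python-level loop over the groups.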
I have a question about selecting a range in a pandas DataFrame in Python. I have a column with times and a column with values. I would like to select all the rows with times between 6 a.m. and 6 p.m. (so from 6:00:00 to 18:00:00). I've succeeded at selecting all the night times (between 18:00:00 and 6:00:00), but if I apply the same to the day times, it doesn't work. Is there something wrong with my syntax? Below is a minimal working example. timeslice2 returns an empty DataFrame in my case.
import pandas as pd
times = ("1:00:00", "2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00", "7:00:00", "8:00:00", "9:00:00", \
"10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00", "15:00:00", "16:00:00", "17:00:00", \
"18:00:00", "19:00:00", "20:00:00", "21:00:00", "22:00:00", "23:00:00")
values = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
data = zip(times, values)
colnames = ["Time", "values"]
df = pd.DataFrame(data=data, columns=colnames)
print(df)
# selecting only night times
timeslice1 = df[(df['Time'] > '18:00:00') & (df['Time'] <= '6:00:00')]
# selecting only day times
timeslice2 = df[(df['Time'] > '6:00:00') & (df['Time'] <= '18:00:00')]
print(timeslice1)
print(timeslice2)
I've been able to select the right range with this answer, but it seems strange to me that the above doesn't work. Moreover, if I convert my 'Time' column to 'datetime', as needed, it uses the date of today and I don't want that.
It works this way. Note that the first range, if treated as datetime, returns no rows: a time later than 18:00 and a time no later than 6:00 fall on two different days, so no single timestamp can satisfy both conditions.
import pandas as pd
times = ("1:00:00", "2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00", "7:00:00", "8:00:00", "9:00:00", \
"10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00", "15:00:00", "16:00:00", "17:00:00", \
"18:00:00", "19:00:00", "20:00:00", "21:00:00", "22:00:00", "23:00:00")
values = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
data = zip(times, values)
colnames = ["Time", "values"]
df = pd.DataFrame(data=data, columns=colnames)
print('Original df \n',df)
# selecting only night times
timeslice1 = df[(df['Time'] > '18:00:00') & (df['Time'] <= '6:00:00')]
# selecting only day times
# convert the Time column to datetime
df['Time'] = pd.to_datetime(df['Time'])
# .copy() avoids a SettingWithCopyWarning when the slice is modified below
timeslice2 = df[(df['Time'] > '6:00:00') & (df['Time'] <= '18:00:00')].copy()
# convert the Time column of the slice back to strings
timeslice2["Time"] = timeslice2["Time"].dt.strftime('%H:%M:%S')
print('Slice 1 \n', timeslice1)
print('Slice 2 \n', timeslice2)
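For what it's worth, the original comparison fails because the 'Time' column holds strings, and strings are compared character by character, so "7:00:00" sorts after "18:00:00". One way to avoid both that and today's date being attached is to parse the column as durations since midnight. This is only a sketch; the names since_midnight, six, eighteen, day and night are made up here.

```python
import pandas as pd

times = [f"{hour}:00:00" for hour in range(1, 24)]
df = pd.DataFrame({"Time": times, "values": range(1, 24)})

# Strings compare character by character, so this is False even though 7:00 is before 18:00.
print("7:00:00" <= "18:00:00")

# Parse the strings as durations since midnight; no calendar date is involved.
since_midnight = pd.to_timedelta(df["Time"])
six = pd.Timedelta(hours=6)
eighteen = pd.Timedelta(hours=18)

# Day: after 6:00 and up to and including 18:00.
day = df[(since_midnight > six) & (since_midnight <= eighteen)]
# Night: the range wraps around midnight, so the two conditions are joined with "or".
night = df[(since_midnight > eighteen) | (since_midnight <= six)]

print(day)
print(night)
```

The original 'Time' strings are left untouched, so nothing needs to be converted back afterwards.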
I have code that processes data for worked shifts.
In it, I have arrays for the start and end of shifts (e.g. shift_start[0] and shift_end[0] for shift #1), and for the time between them I need to know how many of the days are weekdays, holidays, or weekend days.
I have already defined the holidays in an array of datetime entries, which should represent the holidays of a specific country (the real list is not the same as the one shown here, and I am not looking for a more dynamic option yet).
So basically I have this:
import datetime
started = [datetime.datetime(2022, 2, 1, 0, 0), datetime.datetime(2022, 2, 5, 8, 0), datetime.datetime(2022, 2, 23, 11, 19, 28)]
ended = [datetime.datetime(2022, 2, 2, 16, 0), datetime.datetime(2022, 2, 5, 17, 19, 28), datetime.datetime(2022, 4, 26, 12, 30)]
holidays = [datetime.datetime(2022, 1, 3), datetime.datetime(2022, 3, 3), datetime.datetime(2022, 4, 22), datetime.datetime(2022, 4, 25)]
I'm looking for a way to go through each of the 3 ranges and count the days of each type it contains (e.g. the first range should contain 2 weekdays, and the second one weekend day).
So, based on the suggestion by @gimix, I was able to develop what I needed:
for each_start, each_end in zip(started, ended):  # For each period
    for single_date in self.daterange(each_start, each_end):  # For each day of each period
        # Checking if holiday or weekend
        if (single_date.replace(hour=0, minute=0, second=0) in holidays) or (single_date.weekday() > 4):
            set_special_days_worked(1)
        # If not holiday or weekend, then it is regular working day
        else:
            set_regular_days_worked(1)
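Since daterange and the two setters above belong to my own class, here is a self-contained sketch of the same idea for anyone who wants to run it. The function classify_days and the three category names are made up for this example, and it assumes that a holiday falling on a weekend is counted as a holiday only.

```python
import datetime
from collections import Counter

started = [datetime.datetime(2022, 2, 1, 0, 0), datetime.datetime(2022, 2, 5, 8, 0),
           datetime.datetime(2022, 2, 23, 11, 19, 28)]
ended = [datetime.datetime(2022, 2, 2, 16, 0), datetime.datetime(2022, 2, 5, 17, 19, 28),
         datetime.datetime(2022, 4, 26, 12, 30)]
holidays = [datetime.datetime(2022, 1, 3), datetime.datetime(2022, 3, 3),
            datetime.datetime(2022, 4, 22), datetime.datetime(2022, 4, 25)]


def classify_days(start, end, holidays):
    """Count holidays, weekend days and weekdays touched by [start, end] (hypothetical helper)."""
    # Compare calendar dates only, so the time of day never matters.
    holiday_dates = {h.date() for h in holidays}
    counts = Counter()
    day = start.date()
    while day <= end.date():
        if day in holiday_dates:
            counts["holiday"] += 1
        elif day.weekday() > 4:  # 5 = Saturday, 6 = Sunday
            counts["weekend"] += 1
        else:
            counts["weekday"] += 1
        day += datetime.timedelta(days=1)
    return counts


results = [classify_days(s, e, holidays) for s, e in zip(started, ended)]
for counts in results:
    print(dict(counts))
```

Working with date objects instead of zeroing out hours, minutes and seconds also sidesteps any leftover microseconds in the timestamps.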
I have two lists, one of which contains datetimes. How can I combine them into a DataFrame that uses the datetimes as its index and the values of lista2 as its data?
lista1 = [datetime.datetime(2017, 11, 11, 0, 0), datetime.datetime(2017, 11, 12, 0, 0), datetime.datetime(2017, 11, 13, 0, 0)]
lista2 = [31488, 14335, 89]
You can use the index parameter from the constructor to specify a list as indices, and use the other one as data:
pd.DataFrame(lista2,index=lista1)
For your sample data, this gives:
>>> pd.DataFrame(lista2,index=lista1)
                0
2017-11-11  31488
2017-11-12  14335
2017-11-13     89
Pass the two lists as a list of tuples, then set the first column as the index:
pd.DataFrame(list(zip(lista1,lista2))).set_index(0)
Out[646]:
                1
0
2017-11-11  31488
2017-11-12  14335
2017-11-13     89
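If the default labels 0 and 1 are not wanted, both the column and the index can be named at construction time. A small sketch, where the names "value" and "date" are arbitrary choices for illustration:

```python
import datetime

import pandas as pd

lista1 = [datetime.datetime(2017, 11, 11, 0, 0),
          datetime.datetime(2017, 11, 12, 0, 0),
          datetime.datetime(2017, 11, 13, 0, 0)]
lista2 = [31488, 14335, 89]

# Build the frame from a dict so the column gets a name, and wrap the
# datetimes in a named DatetimeIndex.
frame = pd.DataFrame({"value": lista2}, index=pd.DatetimeIndex(lista1, name="date"))
print(frame)
```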
Let's say I have the following data frame. I want to calculate the average number of days between all the activities for a particular account.
Below is my desired result:
Now I know how to calculate the number of days between two dates with the following code. But I don't know how to calculate what I am looking for across multiple dates.
from datetime import date
d0 = date(2016, 8, 18)
d1 = date(2016, 9, 26)
delta = d1 - d0  # later date minus earlier date, so the result is positive
print(delta.days)
I would do this as follows in pandas (assuming the Date column is a datetime64):
In [11]: df
Out[11]:
  Account Activity       Date
0       A        a 2015-10-21
1       A        b 2016-07-07
2       A        c 2016-07-07
3       A        d 2016-09-14
4       A        e 2016-10-12
5       B        a 2015-11-24
6       B        b 2015-12-30
In [12]: df.groupby("Account")["Date"].apply(lambda x: x.diff().mean())
Out[12]:
Account
A   89 days 06:00:00
B   36 days 00:00:00
Name: Date, dtype: timedelta64[ns]
If your dates are in a list:
>>> from datetime import date
>>> dates = [date(2015, 10, 21), date(2016, 7, 7), date(2016, 7, 7), date(2016, 9, 14), date(2016, 10, 12), date(2016, 10, 12), date(2016, 11, 22), date(2016, 12, 21)]
>>> differences = [(dates[i]-dates[i-1]).days for i in range(1, len(dates))] #[260, 0, 69, 28, 0, 41, 29]
>>> float(sum(differences))/len(differences)
61.0
>>>
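For completeness, the pandas approach above can be reproduced end to end with the sample frame built in code. This is a sketch: the frame is retyped from the printed output, the sort is added in case the rows are not already in date order, and avg_gap and avg_days are names chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["A", "A", "A", "A", "A", "B", "B"],
    "Activity": ["a", "b", "c", "d", "e", "a", "b"],
    "Date": pd.to_datetime(["2015-10-21", "2016-07-07", "2016-07-07", "2016-09-14",
                            "2016-10-12", "2015-11-24", "2015-12-30"]),
})

# Per account: differences between consecutive dates, then their mean.
avg_gap = (df.sort_values("Date")
             .groupby("Account")["Date"]
             .apply(lambda dates: dates.diff().mean()))

# Express the timedeltas as a plain number of days.
avg_days = avg_gap / pd.Timedelta(days=1)
print(avg_days)
```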
Short version: I have two TimeSeries (recording start and recording end) I would like to use as indices for data in a Panel (or DataFrame). Not hierarchical, but parallel. I am uncertain how to do this.
Long version:
I am constructing a pandas Panel with some data akin to temperature and density at certain distances from an antenna. As I see it, the most natural structure is having e.g. temp and dens as items (i.e. sub-DataFrames of the Panel), recording time as major axis (index), and thus distance from the antenna as minor axis (columns).
My problem is this: For each recording, the instrument averages/integrates over some amount of time. Thus, for each data dump, two timestamps are saved: start recording and end recording. I need both of those. Thus, I would need something which might be called "parallel indexing", where two different TimeSeries (startRec and endRec) work as indices, and I can get whichever I prefer for a certain data point. Of course, I don't really need to index by both, but both need to be naturally available in the data structure. For example, for any given temperature or density recording, I need to be able to get both the start and end time of the recording.
I could of course keep the two TimeSeries in a separate DataFrame, but with the main point of pandas being automatic data alignment, this is not really ideal.
How can I best achieve this?
Example data
Sample Panel with three recordings at two distances from the antenna:
import pandas as pd
import numpy as np
data = pd.Panel(data={'temp': np.array([[21, 20],
                                        [19, 17],
                                        [15, 14]]),
                      'dens': np.array([[1001, 1002],
                                        [1000, 998],
                                        [997, 995]])},
                minor_axis=['1m', '3m'])
Output of data:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: dens to temp
Major_axis axis: 0 to 2
Minor_axis axis: 1m to 3m
Here, the major axis is currently only an integer-based index (0 to 2). The minor axis is the two measurement distances from the antenna.
I have two TimeSeries I'd like to use as indices:
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 00),
                          datetime(2013, 11, 12, 15, 00, 00),
                          datetime(2013, 11, 13, 15, 00, 00)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 10),
                        datetime(2013, 11, 12, 15, 00, 10),
                        datetime(2013, 11, 13, 15, 00, 10)])
Output of startRec:
0 2013-11-11 15:00:00
1 2013-11-12 15:00:00
2 2013-11-13 15:00:00
dtype: datetime64[ns]
Being in a Panel makes this a little trickier. I typically stick with DataFrames.
But how does this look:
import numpy as np
import pandas as pd
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 0),
                          datetime(2013, 11, 12, 15, 0, 0),
                          datetime(2013, 11, 13, 15, 0, 0)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 10),
                        datetime(2013, 11, 12, 15, 0, 10),
                        datetime(2013, 11, 13, 15, 0, 10)])
_data1m = pd.DataFrame(data={
    'temp': np.array([21, 19, 15]),
    'dens': np.array([1001, 1000, 997]),
    'start': startRec,
    'end': endRec
})
_data3m = pd.DataFrame(data={
    'temp': np.array([20, 17, 14]),
    'dens': np.array([1002, 998, 995]),
    'start': startRec,
    'end': endRec
})
_data1m.set_index(['start', 'end'], inplace=True)
_data3m.set_index(['start', 'end'], inplace=True)
data = pd.Panel(data={'1m': _data1m, '3m': _data3m})
data.loc['3m'].select(lambda row: row[0] < pd.Timestamp('2013-11-12') or
                      row[1] < pd.Timestamp('2013-11-13'))
and that outputs:
                                         dens  temp
start               end
2013-11-11 15:00:00 2013-11-11 15:00:10  1002    20
2013-11-12 15:00:00 2013-11-12 15:00:10   998    17
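Note that pd.Panel, pd.TimeSeries and DataFrame.select are no longer available in recent pandas releases. The same layout can be approximated with a single DataFrame that has a (start, end) MultiIndex on the rows and a (distance, variable) MultiIndex on the columns. This is a sketch of that substitute, not the Panel-based code above; the variable names are chosen here for illustration.

```python
import pandas as pd

start_rec = pd.to_datetime(["2013-11-11 15:00:00",
                            "2013-11-12 15:00:00",
                            "2013-11-13 15:00:00"])
end_rec = start_rec + pd.Timedelta(seconds=10)

# Both timestamps live side by side in the row index.
index = pd.MultiIndex.from_arrays([start_rec, end_rec], names=["start", "end"])

data_1m = pd.DataFrame({"temp": [21, 19, 15], "dens": [1001, 1000, 997]}, index=index)
data_3m = pd.DataFrame({"temp": [20, 17, 14], "dens": [1002, 998, 995]}, index=index)

# Concatenating a dict along axis=1 puts the distance on the outer column level.
data = pd.concat({"1m": data_1m, "3m": data_3m}, axis=1)

# Same selection as above: start before Nov 12 or end before Nov 13.
starts = data.index.get_level_values("start")
ends = data.index.get_level_values("end")
mask = (starts < pd.Timestamp("2013-11-12")) | (ends < pd.Timestamp("2013-11-13"))
subset = data.loc[mask, "3m"]
print(subset)
```

With this layout, data["3m"] plays the role of one Panel item, and either timestamp can be pulled out of the index with get_level_values.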