timedeltas for a groupby column in pandas [duplicate] - python

This question already has an answer here:
How to calculate time difference by group using pandas?
(1 answer)
Closed 4 years ago.
For a given data frame df
import datetime
import pandas as pd

timestamps = [
datetime.datetime(2018, 1, 1, 10, 0, 0, 0), # person 1
datetime.datetime(2018, 1, 1, 10, 0, 0, 0), # person 2
datetime.datetime(2018, 1, 1, 11, 0, 0, 0), # person 2
datetime.datetime(2018, 1, 2, 11, 0, 0, 0), # person 2
datetime.datetime(2018, 1, 1, 10, 0, 0, 0), # person 3
datetime.datetime(2018, 1, 2, 11, 0, 0, 0), # person 3
datetime.datetime(2018, 1, 4, 10, 0, 0, 0), # person 3
datetime.datetime(2018, 1, 5, 12, 0, 0, 0) # person 3
]
df = pd.DataFrame({'person': [1, 2, 2, 2, 3, 3, 3, 3], 'timestamp': timestamps })
I want to calculate, for each person (df.groupby('person')), the time differences between all timestamps of that person, which I would do with diff().
df.groupby('person').timestamp.diff()
only gets me halfway there, because the mapping back to the person is lost.
What could a solution look like?

I think you should use
df.groupby('person').timestamp.transform(pd.Series.diff)

The problem is that diff does not aggregate values, so a possible solution is transform:
df['new'] = df.groupby('person').timestamp.transform(pd.Series.diff)
print (df)
person timestamp new
0 1 2018-01-01 10:00:00 NaT
1 2 2018-01-01 10:00:00 NaT
2 2 2018-01-01 11:00:00 0 days 01:00:00
3 2 2018-01-02 11:00:00 1 days 00:00:00
4 3 2018-01-01 10:00:00 NaT
5 3 2018-01-02 11:00:00 1 days 01:00:00
6 3 2018-01-04 10:00:00 1 days 23:00:00
7 3 2018-01-05 12:00:00 1 days 02:00:00

Related

Using Django ORM to sum annotated queries

I am using Django in combination with TimescaleDB to process energy data from PV installations. Now I would like to calculate how much energy multiple plants have generated within a given time bucket (usually per day). The plants write new values into a table which looks like this:
customer  datasource  value  timestamp
1         1           1      01.05.22 10:00
1         1           5      01.05.22 18:00
1         2           3      01.05.22 09:00
1         2           9      01.05.22 17:00
1         1           5      02.05.22 10:00
1         1           12     02.05.22 18:00
1         2           9      02.05.22 09:00
1         2           16     02.05.22 17:00
What I would like is the overall daily gain of values (so, for each day: last entry value minus first entry value) for each customer, i.e. the sum of the daily generated energy values from all datasources belonging to that customer.
In the above example that would be for customer 1:
Day 01.05.22:
Daily Gain of Datasource 1: 5 - 1 = 4
Daily Gain of Datasource 2: 9 - 3 = 6
Overall: 4 + 6 = 10
Day 02.05.22:
Daily Gain of Datasource 1: 12 - 5 = 7
Daily Gain of Datasource 2: 16 - 9 = 7
Overall: 7 + 7 = 14
The result should look like this:
customer  timestamp       value
1         01.05.22 00:00  10
1         02.05.22 00:00  14
What I have now in Code is this:
dpes = (DataPointEnergy.timescale
        .filter(source__in=datasources, time__range=(_from, _to))
        .values('source', interval_end=timebucket)
        .order_by('interval_end')
        .annotate(value=Last('value', 'time') - First('value', 'time'))
        .values('interval_end', 'value', 'source')
        )
Which gives me the following result:
{'source': 16, 'timestamp': datetime.datetime(2022, 1, 9, 0, 0, tzinfo=<UTC>), 'value': 2.0}
{'source': 17, 'timestamp': datetime.datetime(2022, 1, 9, 0, 0, tzinfo=<UTC>), 'value': 2.0}
{'source': 16, 'timestamp': datetime.datetime(2022, 1, 10, 0, 0, tzinfo=<UTC>), 'value': 2.0}
{'source': 17, 'timestamp': datetime.datetime(2022, 1, 10, 0, 0, tzinfo=<UTC>), 'value': 2.0}
{'source': 16, 'timestamp': datetime.datetime(2022, 1, 11, 0, 0, tzinfo=<UTC>), 'value': 2.0}
{'source': 17, 'timestamp': datetime.datetime(2022, 1, 11, 0, 0, tzinfo=<UTC>), 'value': 2.0}
However, I would still need to group the results by timestamp and sum the value column (which I am doing now using Pandas). Is there a possibility to let the database do the work?
I tried to use another .annotate() to sum up the values, but this results in the error:
django.core.exceptions.FieldError: Cannot compute Sum('value'): 'value' is an aggregate
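A rough sketch of that Pandas post-processing step (hypothetical, just to illustrate the grouping that is still needed; the key names follow the printed dicts above):
import pandas as pd

rows = list(dpes)                      # evaluate the queryset into dicts, as printed above
df = pd.DataFrame(rows)
# sum the per-datasource daily gains within each time bucket
daily_totals = df.groupby('timestamp', as_index=False)['value'].sum()
print(daily_totals)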

Grouping data by id, var1 into consecutive dates in python using pandas

I have some data that looks like:
df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
"val": [9, 2, 4, 7, 6, 3, 2],
"dates": [pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)]})
I want to group this data by id and var1 where the dates are consecutive; if a day is missed, I want to start a new record.
For example the final output should be:
df_end_result = pd.DataFrame({"id": [102, 102, 103, 103, 104], "var1": ['a', 'b', 'b', 'a', 'c'],
"val": [13, 2, 13, 3, 2],
"start_date": [pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)],
"end_date": [pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)]})
I have tried this a few ways and keep failing. The length of time that something can exist for is unknown, and the possible values of var1 can change with each id and with each date window as well.
For example, I have tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). I then thought I could take date(min) and date(min) + count_days to get 'start_date' and 'end_date':
s = df_raw_dates.groupby(['id','var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['count_days'] = np.where(s1, s1.groupby(df.id).cumsum(), 0)
I have also tried:
df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'date': ['first', 'last']}).reset_index()
This gets me closer, but I don't think it deals with the consecutive-days problem; it just provides the earliest and latest day overall, which unfortunately isn't something I can take forward.
EDIT: adding more context
Another approach is:
df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)
which provides a list of ids and dates, but I am getting stuck on finding the min/max consecutive dates within this new window.
Extended example that has a break in the date range for group (102,'a').
df_raw_dates = pd.DataFrame(
{
"id": [102, 102, 102, 103, 103, 103, 104, 102, 102, 102, 102, 108, 108],
"var1": ["a", "b", "a", "b", "b", "a", "c", "a", "a", "a", "a", "a", "a"],
"val": [9, 2, 4, 7, 6, 3, 2, 1, 2, 3, 4, 99, 99],
"dates": [
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 7),
pd.Timestamp(2020, 1, 8),
pd.Timestamp(2020, 1, 9),
pd.Timestamp(2020, 1, 21),
pd.Timestamp(2020, 1, 25),
],
}
)
Further example
This uses the answer below from wwii.
import pandas as pd
import collections
df_raw_dates1 = pd.DataFrame(
{
"id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
"var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
"val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
"dates": [
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18)
],
}
)
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k, g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k, groups, groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue
    for _, date_range in date_groups:
        start, end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))
>>> id var1 val start end
0 100 a 0.0 2021-01-22 2021-01-22
1 100 b 0.0 2021-01-22 2021-01-22
2 100 d 0.0 2021-01-22 2021-01-22
3 105 a 0.0 2021-01-22 2021-01-22
4 105 a 1.0 2021-01-21 2021-01-21
5 105 a 0.0 2021-01-20 2021-01-20
6 105 a 10.0 2021-01-19 2021-01-19
7 105 b 2.0 2021-01-22 2021-01-22
8 105 b 9.0 2021-01-21 2021-01-21
9 105 b 5.0 2021-01-20 2021-01-20
10 105 b 13.0 2021-01-19 2021-01-19
From the above I would have expected rows 3, 4, 5, 6 to be grouped together, and likewise rows 7, 8, 9, 10. I am not sure why this example now breaks.
I don't see what the difference between this example and the extended example above is, or why it seems not to work.
I don't have Pandas superpowers so I never try to do groupby one-liners, maybe someday.
Adapting the accepted answer to the SO question Find group of consecutive dates in Pandas DataFrame: first group by ['id','var1']; then, within each group, group by consecutive date ranges.
import pandas as pd
sep = "************************************\n"
day = pd.Timedelta('1d')
# using the extended example in the question.
gb = df_raw_dates.groupby(['id', 'var1'])
for k, g in gb:
    print(g)
    dt = g['dates']
    # find difference in days between rows
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    # create a Series to identify consecutive ranges to group by
    # this cumsum trick can be found in many SO answers
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    # split into date ranges
    date_groups = g.groupby(groups)
    for _, date_range in date_groups:
        print(date_range)
    print(sep)
You can see that the (102,'a') group has been split into two groups.
id var1 val dates
0 102 a 9 2020-01-01
2 102 a 4 2020-01-02
7 102 a 1 2020-01-03
id var1 val dates
8 102 a 2 2020-01-07
9 102 a 3 2020-01-08
10 102 a 4 2020-01-09
Going a bit further: while iterating, construct a dictionary to build a new DataFrame from.
import pandas as pd
import collections
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k, g in gb:
    # print(g)
    eyed, var = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k, groups, groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(g.val.mean())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue
    for _, date_range in date_groups:
        start, end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.mean()
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))
>>>
id var1 val start end
0 102 a 4.666667 2020-01-01 2020-01-03
1 102 a 3.000000 2020-01-07 2020-01-09
2 102 b 2.000000 2020-01-01 2020-01-01
3 103 a 3.000000 2020-01-05 2020-01-05
4 103 b 6.500000 2020-01-02 2020-01-03
5 104 c 2.000000 2020-03-12 2020-03-12
6 108 a 99.000000 2020-01-21 2020-01-25
Seems pretty tedious; maybe someone will come along with a less verbose solution. Maybe some of the operations could be put into functions and .apply, .transform, or .pipe could be used to make it a little cleaner.
It does not account for ('id','var1') groups that have more than one date but only single-date ranges (no consecutive dates), e.g.
id var1 val dates
11 108 a 99 2020-01-21
12 108 a 99 2020-01-25
You might need to detect if there are any gaps in a datetime Series and use that fact to accommodate.
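A somewhat less verbose sketch of the same cumsum-of-breaks idea (an assumption, not part of the answer above; it uses the extended df_raw_dates, sums val as the expected output does, and keeps single-date runs as their own rows):
import pandas as pd

df = df_raw_dates.sort_values(['id', 'var1', 'dates'])
# True at the start of every new run: first row of a group, or a gap larger than one day
new_run = df.groupby(['id', 'var1'])['dates'].diff() != pd.Timedelta('1d')
df['run'] = new_run.cumsum()

result = (df.groupby(['id', 'var1', 'run'])
            .agg(val=('val', 'sum'), start=('dates', 'min'), end=('dates', 'max'))
            .reset_index()
            .drop(columns='run'))
print(result)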

Is there a way to set a number to equal a date in python?

I have a few files that have a randomly generated number that corresponds with a date:
736815 = 01/05/2018
I need to create a function or process that applies logic to sequential numbers so that the next number equals the next calendar date.
Ideally I would need it in a key:value pair format, so that when I convert the file to a new format I can apply the date in place of the auto-generated file name.
Hopefully this makes sense; it is to be used to name a converted file.
I think the origin parameter can be used here; also add unit='D' to to_datetime:
df = pd.DataFrame({'col':[5678, 5679, 5680]})
df['date'] = pd.to_datetime(df['col'] - 5678, unit='D', origin=pd.Timestamp('2020-01-01'))
print (df)
col date
0 5678 2020-01-01
1 5679 2020-01-02
2 5680 2020-01-03
A non-pandas solution in pure Python, with the same idea:
from datetime import datetime, timedelta
L = [5678, 5679, 5680]
a = [timedelta(x-5678) + datetime(2020,1,1) for x in L]
print (a)
[datetime.datetime(2020, 1, 1, 0, 0),
datetime.datetime(2020, 1, 2, 0, 0),
datetime.datetime(2020, 1, 3, 0, 0)]
The number doesn't need to translate into the date directly in any way. You just need to pick a start date and a number, and add another number either via simple addition or via a timedelta:
from datetime import date, timedelta
from random import randint
start_date = date.today()
start_int = randint(1000, 10000)
for i in range(10):
print(start_int + i, start_date + timedelta(days=i))
6964 2020-01-29
6965 2020-01-30
6966 2020-01-31
6967 2020-02-01
6968 2020-02-02
6969 2020-02-03
6970 2020-02-04
6971 2020-02-05
6972 2020-02-06
6973 2020-02-07
If you're getting your list of numbers from somewhere else, add/subtract appropriately from a start int/date for the same effect.
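For instance, a small sketch of that idea using the anchor pair from the question (read as 1 May 2018; the list of file numbers here is made up). Incidentally, 736815 happens to equal date(2018, 5, 1).toordinal(), so date.fromordinal(n) would also work, but the anchor approach does not rely on that:
from datetime import date, timedelta

anchor_num, anchor_date = 736815, date(2018, 5, 1)   # pair given in the question

file_numbers = [736815, 736816, 736817]              # hypothetical numbers from the files
number_to_date = {n: anchor_date + timedelta(days=n - anchor_num) for n in file_numbers}
print(number_to_date)
# {736815: datetime.date(2018, 5, 1), 736816: datetime.date(2018, 5, 2), 736817: datetime.date(2018, 5, 3)}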
Another solution is to create an object, which encapsulates the date and the base number to count from. Each call to this object (implemented using the __call__ special method) will create a new date object using the time delta between the base number and the supplied number.
import datetime
class RelativeDate:
    def __init__(self, date, base):
        self.date = date
        self.base = base

    def __call__(self, number):
        delta = datetime.timedelta(days=number - self.base)
        return self.date + delta


def create_base_date(number, date):
    return RelativeDate(
        date=datetime.datetime.strptime(date, '%d/%m/%Y'),
        base=number,
    )
base_date = create_base_date(1, '03/01/2020')
base_date(3)
datetime.datetime(2020, 1, 5, 0, 0)
Example snippet:
base_date = create_base_date(1, '03/01/2020')
{i: base_date(i) for i in range(1, 10)}
Output:
{1: datetime.datetime(2020, 1, 3, 0, 0),
2: datetime.datetime(2020, 1, 4, 0, 0),
3: datetime.datetime(2020, 1, 5, 0, 0),
4: datetime.datetime(2020, 1, 6, 0, 0),
5: datetime.datetime(2020, 1, 7, 0, 0),
6: datetime.datetime(2020, 1, 8, 0, 0),
7: datetime.datetime(2020, 1, 9, 0, 0),
8: datetime.datetime(2020, 1, 10, 0, 0),
9: datetime.datetime(2020, 1, 11, 0, 0)}

export to csv and read multiIndex dataframe pandas

I need to export to csv and then import again a DataFrame that looks like this:
price ................................................................................................................... hold buy balance long_size short_size minute hour day week month
close high low open CCI12 ROC12 CCI15 ROC15 CCI21 ROC21 ...
Time
2015-01-02 14:20:00 97.8515 97.8595 97.8205 97.8345 91.168620 0.000557 95.323467 0.000394 68.073065 0.000348 ... 0.0 0.0 0.0 0.0 0.0 8.660254e-01 -0.500000 0.974928 1.205367e-01 5.000000e-01
where the row index is the timestamp, the first 39 columns are subcolumns of 'price', and the remaining ones are on the same level as 'price'. The MultiIndex looks like this:
MultiIndex(levels=[['price', 'tick_counts', 'sell', 'hold', 'buy', 'balance', 'long_size', 'short_size', 'minute', 'hour', 'day', 'week', 'month'], [0, 'close', 'high', 'low', 'open', 'CCI12', 'ROC12', 'CCI15', 'ROC15', 'CCI21', 'ROC21', 'CCI30', 'ROC30', 'CCI40', 'ROC40', 'CCI100', 'ROC100', 'SMA12', 'EWMA12', 'SMA21', 'EWMA21', 'SMA26', 'EWMA26', 'SMA50', 'EWMA50', 'SMA100', 'EWMA100', 'SMA200', 'EWMA200', 'MACD', 'UpperBB10', 'LowerBB10', 'UpperBB20', 'LowerBB20', 'UpperBB30', 'LowerBB30', 'UpperBB40', 'LowerBB40', 'UpperBB50', 'LowerBB50', '']],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 0, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40]])
I have no idea how to preserve this structure easily while exporting with df.to_csv() and importing with pd.read_csv(). All my attempts have been a mess so far.
EDIT: if I simply use as suggested pd.to_csv("/", index=True) and then I read it back with read_csv("/"), I get:
Unnamed: 0 price price.1 price.2 price.3 price.4 price.5 price.6 price.7 price.8 ... hold buy balance long_size short_size minute hour day week month
0 NaN close high low open CCI12 ROC12 CCI15 ROC15 CCI21 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Time NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2015-01-02 14:20:00 97.85149999999999 97.8595 97.82050000000001 97.83449999999999 91.16862020296143 0.0005572768080819476 95.32346677471595 0.0003936082115872622 68.07306512447788 ... 0.0 0.0 0.0 0.0 0.0 8.660254e-01 -0.500000 0.974928 1.205367e-01 5.000000e-01
where the second layer of the header became the first row of the dataFrame.
EDIT 2: Never mind, I've just discovered HDF5; apparently, unlike CSV, it preserves the structure even with a MultiIndex without additional work, so I will use df.to_hdf().
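A minimal sketch of that HDF5 route (assuming PyTables is installed; the file name is just an example):
df.to_hdf('frame.h5', key='df', mode='w')
df_back = pd.read_hdf('frame.h5', 'df')   # the MultiIndex columns survive the round trip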
I think if you use df.to_csv("/", index=True) it is saved with the index, and you can then read it back as normal.
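For the CSV route, a minimal round-trip sketch (not from the answer above; it relies on header=[0, 1] and index_col=0 in read_csv to rebuild the column MultiIndex):
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('price', 'close'), ('price', 'high'), ('meta', 'hold')])
df = pd.DataFrame([[97.8515, 97.8595, 0.0]],
                  index=['2015-01-02 14:20:00'], columns=cols)

df.to_csv('multiindex.csv')                            # index=True is the default
df2 = pd.read_csv('multiindex.csv', header=[0, 1], index_col=0)
print(df2.columns)                                     # two-level columns are restored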

Pandas: Combine different timespans and cumsum

I have the following DataFrame:
from datetime import datetime
from pandas import DataFrame
df = DataFrame({
'Buyer': ['Carl', 'Carl', 'Carl', 'Carl', 'Joe', 'Carl'],
'Quantity': [18, 3, 5, 1, 9, 3],
'Date': [
datetime(2013, 9, 1, 13, 0),
datetime(2013, 9, 1, 13, 5),
datetime(2013, 10, 1, 20, 0),
datetime(2013, 10, 3, 10, 0),
datetime(2013, 12, 2, 12, 0),
datetime(2013, 9, 2, 14, 0),
]
})
First: I am looking to add another column to this DataFrame which sums up the purchases of the last 5 days for each buyer. In particular the result should look like this:
Quantity
Buyer Date
Carl 2013-09-01 21
2013-09-02 24
2013-10-01 5
2013-10-03 6
Joe 2013-12-02 9
To do so I started with the following:
df1 = (df.set_index(['Date', 'Buyer'])
.unstack(level=[1])
.resample('D', how='sum')
.fillna(0))
However, I do not know how to add another column to this DataFrame which, for each row, adds up the previous 5 rows' entries.
Second:
Add another column to this DataFrame which not only sums up the purchases of the last 5 days as in (1), but also weights these purchases based on their dates. For example: purchases from 5 days ago should be counted at 20%, those from 4 days ago at 40%, those from 3 days ago at 60%, those from 2 days ago at 80%, and those from one day ago and from today at 100%.
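For the first part, one possible sketch (an assumption, not an accepted approach): collapse the purchases to daily totals per buyer and then take a 5-day time-based rolling sum, which reproduces the table above.
import pandas as pd

daily = (df.groupby(['Buyer', pd.Grouper(key='Date', freq='D')])['Quantity']
           .sum())

last_5_days = (daily.reset_index(level='Buyer')
                    .groupby('Buyer')['Quantity']
                    .rolling('5D')
                    .sum())
print(last_5_days)
The weighted version in the second part could follow the same shape after reindexing each buyer to a complete daily range and applying a fixed-length rolling window with a weight vector; that is not shown here.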
