Group timestamps by day in pandas - python

I want to combine multiple timestamps (datetime64) into a single row representing one day, and then sum the amounts in the last column to get the total sales per day.
In this case I want to end up with two rows: the two days and their total sales.
I have tried to solve this with the groupby operation, but I couldn't get it to work.

You could try resample:
df_1d = df.resample('1d', on='timestamp').sum()
It will sum all the data per day (or per any other frequency you pass).
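For reference, a minimal runnable sketch of that approach (column names assumed from the question):

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2022-09-30 11:21', '2022-09-30 20:55',
                                 '2022-10-01 10:35', '2022-10-01 22:42']),
    'sales': [99.90, 10.20, 5.99, 21.00],
})

# one output row per day; 'sales' is summed within each day
df_1d = df.resample('1d', on='timestamp').sum()
print(df_1d)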

The one-liner df.resample('1d', on='timestamp').sum()
from Aeroxer Support is perfect, but it does not explain why your attempts with groupby failed.
In order for groupby to work, you would need a column with just the day in it. Then you could group by that day column.
Below is the example code. I add the extra column with just the day in it at In [4] and then df.groupby('day').sum() is what you are looking for.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
...: 'timestamp': map(pd.Timestamp, ['2022-09-30 11:21', '2022-09-30 20:55', '2022-10-01 10:35', '2022-10-01 22:42']),
...: 'sales': [99.90, 10.20, 5.99, 21.00]
...: })
In [3]: df
Out[3]:
timestamp sales
0 2022-09-30 11:21:00 99.90
1 2022-09-30 20:55:00 10.20
2 2022-10-01 10:35:00 5.99
3 2022-10-01 22:42:00 21.00
In [4]: df['day'] = df.timestamp.dt.floor('1D')
In [5]: df
Out[5]:
timestamp sales day
0 2022-09-30 11:21:00 99.90 2022-09-30
1 2022-09-30 20:55:00 10.20 2022-09-30
2 2022-10-01 10:35:00 5.99 2022-10-01
3 2022-10-01 22:42:00 21.00 2022-10-01
In [6]: df.groupby('day').sum()
Out[6]:
sales
day
2022-09-30 110.10
2022-10-01 26.99
You don't have to explicitly save the day in a new column; this works just as well:
df.groupby(df.timestamp.dt.floor('1D')).sum()
although I find it hard to read. See the docs on Series.dt.floor().
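A third option, if you want resample's bucketing but prefer to stay inside groupby, is pd.Grouper; a sketch, reusing the df from above:

# group by calendar day without creating a helper column
df.groupby(pd.Grouper(key='timestamp', freq='1D'))['sales'].sum()

Unlike grouping by dt.floor, pd.Grouper (like resample) also emits empty bins for days with no data between the first and last timestamp.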

Related

Subtracting int from Dates

I have some data that looks like this:
Dates Delta
0 2022-10-01 10
1 2022-10-01 21
2 2022-10-01 34
I am trying to add a new column, where I subtract the number in the Delta column from the date in the Dates column. Ideally, the output will look like this (I did this by hand, so if the dates are wrong, please excuse me).
Dates Delta CalculatedDate
0 2022-10-01 10 2022-09-21
1 2022-10-01 21 2022-09-10
2 2022-10-01 34 2022-08-23
I've tried various versions of this and I'm not having any luck.
# importing libraries to create and manipulate toy data
import pandas as pd
from datetime import datetime, timedelta
# create toy data
df = pd.DataFrame({'Dates': ['2022-10-01', '2022-10-01', '2022-10-01'],
'Delta': [10, 21, 34]})
# cast the `Dates` column as dates
df['Dates'] = pd.to_datetime(df['Dates'])
##### Need help here
# Create a new column, showing the calculated date
df['CalculatedDate'] = df['Dates'] - timedelta(days=df['Delta'])  # raises TypeError: timedelta expects a scalar, not a Series

One fix is to build the per-row offsets with pd.TimedeltaIndex:
df['CalculatedDate'] = pd.to_datetime(df['Dates']) - pd.TimedeltaIndex(df['Delta'], unit='D')
df
Dates Delta CalculatedDate
0 2022-10-01 10 2022-09-21
1 2022-10-01 21 2022-09-10
2 2022-10-01 34 2022-08-28
Here is one way to do it
# for each row subtract the delta from the date in the row
# using Day offset
df['calculatedDate']= df.apply(lambda x: x['Dates'] - pd.offsets.Day(x['Delta']), axis=1)
df
Dates Delta calculatedDate
0 2022-10-01 10 2022-09-21
1 2022-10-01 21 2022-09-10
2 2022-10-01 34 2022-08-28
I see Naveed and Panda have fixes that work great; here is the one I came up with as well:
for x in range(len(df)):
    df.loc[x, 'CalculatedDate'] = df.loc[x, 'Dates'] - timedelta(days=int(df.loc[x, 'Delta']))
print(df)
Put it in a for loop so that you can index through and handle each row individually, and change df['CalculatedDate'] into df.loc[x, 'CalculatedDate']. Hope this helps.
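If speed matters, a fully vectorized version avoids both apply and the Python-level loop; a sketch using pd.to_timedelta, the functional counterpart of the pd.TimedeltaIndex constructor used above:

import pandas as pd

df = pd.DataFrame({'Dates': pd.to_datetime(['2022-10-01'] * 3),
                   'Delta': [10, 21, 34]})

# build one timedelta per row from the Delta column and subtract it
df['CalculatedDate'] = df['Dates'] - pd.to_timedelta(df['Delta'], unit='D')
print(df)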

How to Calculate Year over Year Percentage Change in Dataframe with Datetime Index based on Date and not number of Periods

I have multiple DataFrames of macroeconomic time series. In each of these DataFrames I want to add a column showing the year-over-year percentage change. Ideally I would do this with a for loop so I don't have to repeat the process multiple times. However, the series do not share the same frequency: GDP is quarterly, PCE is monthly, and S&P returns are daily, so I cannot specify a fixed number of periods. Since my DataFrames already have a DatetimeIndex, I would like the percentage change to be calculated based on the dates. Is that possible?
Please see examples of my Dataframes below:
print(gdp):
Date GDP
1947-01-01 2.034450e+12
1947-04-01 2.029024e+12
1947-07-01 2.024834e+12
1947-10-01 2.056508e+12
1948-01-01 2.087442e+12
...
2021-04-01 1.936831e+13
2021-07-01 1.947889e+13
2021-10-01 1.980629e+13
2022-01-01 1.972792e+13
2022-04-01 1.969946e+13
[302 rows x 1 columns]
print(pce):
Date PCE
1960-01-01 1.695549
1960-02-01 1.706421
1960-03-01 1.692806
1960-04-01 1.863354
1960-05-01 1.911975
...
2022-02-01 6.274030
2022-03-01 6.638595
2022-04-01 6.269216
2022-05-01 6.324989
2022-06-01 6.758935
[750 rows x 1 columns]
print(spx):
Date SPX
1928-01-03 17.76
1928-01-04 17.72
1928-01-05 17.55
1928-01-06 17.66
1928-01-09 17.59
...
2022-08-19 4228.48
2022-08-22 4137.99
2022-08-23 4128.73
2022-08-24 4140.77
2022-08-25 4199.12
[24240 rows x 1 columns]
Instead of doing this:
gdp['GDP'] = gdp['GDP'].pct_change(4)
pce['PCE'] = pce['PCE'].pct_change(12)
spx['SPX'] = spx['SPX'].pct_change(252)
I would like a for loop to do it for all Dataframes without specifying the periods but specifying that I want the percentage change from Year to Year.
Given:
d = {'Date': ['2021-02-01',
              '2021-03-01',
              '2021-04-01',
              '2021-05-01',
              '2021-06-01',
              '2022-02-01',
              '2022-03-01',
              '2022-04-01',
              '2022-05-01',
              '2022-06-01'],
     'PCE': [1.695549, 1.706421, 1.692806, 1.863354, 1.911975,
             6.274030, 6.638595, 6.269216, 6.324989, 6.758935]}
pce = pd.DataFrame(d)
pce = pce.set_index('Date')
pce.index = pd.to_datetime(pce.index)
You could create a new dataframe with a copy of the datetime index as a new column, resample that dataframe with annual frequency ('A'), and count the values per year in the Date column.
pce_annual_rows = pce.index.to_frame()
resampled_annual = pce_annual_rows.resample('A').count()
Next you can get the second-to-last count and use it as the periods value in the pct_change method.
The second-to-last, because if there is an incomplete year at the end, you would probably end up with a wrong periods value. This assumes that you have more than one year of data in every dataframe; otherwise you'll get an IndexError.
periods_per_year = resampled_annual['Date'].iloc[-2]
pce['ROC'] = pce['PCE'].pct_change(periods_per_year)
This produces the following output:
PCE ROC
Date
2021-02-01 1.695549 NaN
2021-03-01 1.706421 NaN
2021-04-01 1.692806 NaN
2021-05-01 1.863354 NaN
2021-06-01 1.911975 NaN
2022-02-01 6.274030 2.700294
2022-03-01 6.638595 2.890362
2022-04-01 6.269216 2.703446
2022-05-01 6.324989 2.394411
2022-06-01 6.758935 2.535054
This solution isn't very elegant; maybe someone will come up with a less complicated idea.
To build your for loop over every dataframe, it's probably best to use the same column name in each one for the column you want to apply the pct_change method to.
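Depending on how regular your timestamps are, there may be a simpler route: pct_change accepts a freq argument, which shifts by a calendar offset instead of a fixed number of rows. This only lines up when observations sit exactly one year apart on the calendar (true for first-of-month or first-of-quarter stamps, not for daily trading data). A sketch under that assumption:

import pandas as pd

pce = pd.DataFrame(
    {'PCE': [1.695549, 1.706421, 6.274030, 6.638595]},
    index=pd.to_datetime(['2021-02-01', '2021-03-01',
                          '2022-02-01', '2022-03-01']))

# compare each value with the one dated exactly one year earlier
pce['ROC'] = pce['PCE'].pct_change(freq=pd.DateOffset(years=1))
print(pce)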

Difference between pandas aggregators .first() and .last()

I'm curious as to what last() and first() do in this specific instance (when chained to a resample). Correct me if I'm wrong, but I understand that if you pass an argument into first or last, e.g. 3, it returns the first 3 months or first 3 years.
In this circumstance, since I'm not passing any argument into first() or last(), what are they actually doing when I resample like that? I know that if I resample by chaining .mean(), I'll resample into years with the mean score from averaging all the months, but what happens when I use last()?
More importantly, why do first() and last() give me different answers in this context? I see that numerically they are not equal, i.e.:
post2008.resample().first() != post2008.resample().last()
TLDR:
What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why does .resample().first() != .resample().last()?
This is the code before the aggregation:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]
# Print the last 8 rows of post2008
print(post2008.tail(8))
This is what print(post2008.tail(8)) outputs:
VALUE
DATE
2014-07-01 17569.4
2014-10-01 17692.2
2015-01-01 17783.6
2015-04-01 17998.3
2015-07-01 18141.9
2015-10-01 18222.8
2016-01-01 18281.6
2016-04-01 18436.5
Here is the code that resamples and aggregates by last():
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)
This is what yearly is like when it's post2008.resample('A').last():
VALUE
DATE
2008-12-31 14549.9
2009-12-31 14566.5
2010-12-31 15230.2
2011-12-31 15785.3
2012-12-31 16297.3
2013-12-31 16999.9
2014-12-31 17692.2
2015-12-31 18222.8
2016-12-31 18436.5
Here is the code that resamples and aggregates by first():
# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)
This is what yearly is like when it's post2008.resample('A').first():
VALUE
DATE
2008-12-31 14668.4
2009-12-31 14383.9
2010-12-31 14681.1
2011-12-31 15238.4
2012-12-31 15973.9
2013-12-31 16475.4
2014-12-31 17025.2
2015-12-31 17783.6
2016-12-31 18281.6
Before anything else, let's create a dataframe with example data:
import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)
The output will be
VALUE
2014-07-01 1000
2014-10-01 2000
2015-01-01 3000
2015-04-01 4000
2015-07-01 5000
2015-07-01 6000
2016-01-01 7000
2016-04-01 8000
If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we will be selecting the first six months of data, which in the example above consists of just two days:
print(df.first('6M'))
VALUE
2014-07-01 1000
2014-10-01 2000
Similarly, last returns only the rows that belong to the last six months of data:
print(df.last('6M'))
VALUE
2016-01-01 7000
2016-04-01 8000
In this context, not passing the required argument results in an error:
print(df.first())
TypeError: first() missing 1 required positional argument: 'offset'
On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods first, last, mean, etc. In this case, they keep only the first (respectively, last) values of each year (instead of e.g. averaging all values, or some other kind of aggregation):
print(df.resample('Y').first())
VALUE
2014-12-31 1000
2015-12-31 3000 # This is the first of the 4 values from 2015
2016-12-31 7000
print(df.resample('Y').last())
VALUE
2014-12-31 2000
2015-12-31 6000 # This is the last of the 4 values from 2015
2016-12-31 8000
As an extra example, consider also the case of grouping by a smaller period:
print(df.resample('M').last().head())
VALUE
2014-07-31 1000.0 # This is the last (and only) value from July, 2014
2014-08-31 NaN # No data for August, 2014
2014-09-30 NaN # No data for September, 2014
2014-10-31 2000.0
2014-11-30 NaN # No data for November, 2014
In this case, any periods for which there is no value will be filled with NaNs. Also, for this example, using first instead of last would have returned the same values, since each month has (at most) one value.
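As a side note, if you want both endpoints of each year side by side, the Resampler also accepts a list of aggregator names via agg; a sketch on the same df:

# first and last observed VALUE per year, as two columns
print(df.resample('Y')['VALUE'].agg(['first', 'last']))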

Count business day between using pandas columns

I have tried to calculate the number of business days between two dates (stored in separate columns of a dataframe).
MonthBegin MonthEnd
0 2014-06-09 2014-06-30
1 2014-07-01 2014-07-31
2 2014-08-01 2014-08-31
3 2014-09-01 2014-09-30
4 2014-10-01 2014-10-31
I have tried to apply numpy.busday_count but I get the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
I have tried to change the type into Timestamp as the following :
Timestamp('2014-08-31 00:00:00')
or datetime :
datetime.date(2014, 8, 31)
or to numpy.datetime64:
numpy.datetime64('2014-06-30T00:00:00.000000000')
Does anyone know how to fix it?
Note 1: I have tried np.busday_count in two ways:
1. Passing dataframe columns: t['Days'] = np.busday_count(t.MonthBegin, t.MonthEnd)
2. Passing arrays: np.busday_count(dt1, dt2)
Note 2: My dataframe has over 150K rows, so I need to use an efficient algorithm.
You can use bdate_range (I also corrected your input, since some of the MonthEnd values were earlier than their MonthBegin):
[len(pd.bdate_range(x, y)) for x, y in zip(df['MonthBegin'], df['MonthEnd'])]
Out[519]: [16, 21, 22, 23, 20]
Keep in mind that bdate_range includes both endpoints when they are business days, while np.busday_count excludes the end date, so the two approaches can differ by one.
I think the best way to do it is:
df.apply(lambda row: np.busday_count(row['MBegin'], row['MEnd']), axis=1)
For my dataframe df as below:
MBegin MEnd
0 2011-01-01 2011-02-01
1 2011-01-10 2011-02-10
2 2011-01-02 2011-02-02
doing :
df['MBegin'] = df['MBegin'].values.astype('datetime64[D]')
df['MEnd'] = df['MEnd'].values.astype('datetime64[D]')
df['busday'] = df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
>>df
MBegin MEnd busday
0 2011-01-01 2011-02-01 21
1 2011-01-10 2011-02-10 23
2 2011-01-02 2011-02-02 22
You need to provide the format in which your dates are written:
from datetime import datetime
a = datetime.strptime('2014-06-09', '%Y-%m-%d')
b = datetime.strptime('2014-06-30', '%Y-%m-%d')
Now their difference
c = b - a
will give you datetime.timedelta(21); to convert it into days, just use
c.days
which gives you the difference of 21 days. You can now use a list comprehension to get the difference between the two date columns as days. (Note that this counts calendar days, not business days.)
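A sketch of that list comprehension on the question's columns:

df['Days'] = [(end - begin).days
              for begin, end in zip(df['MonthBegin'], df['MonthEnd'])]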
You can modify your code to get the desired result as below:
df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01'],
'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30']})
df['MonthBegin'] = df['MonthBegin'].astype('datetime64[ns]')
df['MonthEnd'] = df['MonthEnd'].astype('datetime64[ns]')
df['BDays'] = np.busday_count(df['MonthBegin'].tolist(), df['MonthEnd'].tolist())
print(df)
MonthBegin MonthEnd BDays
0 2014-06-09 2014-06-30 15
1 2014-08-01 2014-08-31 21
2 2014-09-01 2014-09-30 21
3 2014-10-01 2014-10-31 22
4 2014-11-01 2014-11-30 20
Additionally, numpy.busday_count has a few other optional arguments like weekmask and holidays, which you can use according to your needs; see the sketch below.
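A sketch of those extra arguments; the weekmask and the holiday below are purely illustrative values, not anything from the question:

import numpy as np

begin = np.array(['2014-06-09'], dtype='datetime64[D]')
end = np.array(['2014-06-30'], dtype='datetime64[D]')

# Sunday-Thursday working week, with one ad-hoc holiday excluded
np.busday_count(begin, end,
                weekmask='Sun Mon Tue Wed Thu',
                holidays=np.array(['2014-06-16'], dtype='datetime64[D]'))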

Change year/quarter date format to previous period in python

I have a dataset containing monthly observations of a time-series.
What I want to do is transform the datetime to year/quarter format and then shift the first value, DATE[0], to the previous quarter. For example, 2006-10-31 belongs to Q4 of 2006, but I want to change it to 2006Q3.
For the extraction of the subsequent values I will just use the last value from each quarter.
So, for 2006Q4 I will keep BBGN, SSD, and QQ4567 values only from DATE[2]. Similarly, for 2007Q1 I will keep only DATE[5] values, and so forth.
Original dataset:
DATE BBGN SSD QQ4567
0 2006-10-31 00:00:00 1.210 22.022 9726.550
1 2006-11-30 00:00:00 1.270 22.060 9891.008
2 2006-12-31 00:00:00 1.300 22.080 10055.466
3 2007-01-31 00:00:00 1.330 22.099 10219.924
4 2007-02-28 00:00:00 1.393 22.110 10350.406
5 2007-03-31 00:00:00 1.440 22.125 10480.888
After processing the DATE
DATE BBGN SSD QQ4567
0 2006Q3 1.210 22.022 9726.550
2 2006Q4 1.300 22.080 10055.466
5 2007Q1 1.440 22.125 10480.888
The steps I have taken so far are:
Turn the values from the yyyy-mm-dd hh format to yyyyQQ format
DF['DATE'] = pd.to_datetime(DF['DATE']).dt.to_period('Q')
and I get this
DATE BBGN SSD QQ4567
0 2006Q4 1.210 22.022 9726.550
1 2006Q4 1.270 22.060 9891.008
2 2006Q4 1.300 22.080 10055.466
3 2007Q1 1.330 22.099 10219.924
4 2007Q1 1.393 22.110 10350.406
5 2007Q1 1.440 22.125 10480.888
The next step is to extract the last values from each quarter. But because I always want to keep the first row I will exclude DATE[0] from the function.
quarterDF = DF.iloc[1:,].drop_duplicates(subset='DATE', keep='last')
Now, my question is how can I change the value in DATE[0] to always be the previous quarter. So, from 2006Q4 to be 2006Q3. Also, how this will work if DATE[0] is 2007Q1, can I change it to 2006Q4?
My suggestion would be to create a new column with the date shifted 3 months into the past, like this:
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.to_datetime(['2006-10-31', '2007-01-31'])
one_quarter = pd.tseries.offsets.DateOffset(months=3)
df['Last_quarter'] = df.Date - one_quarter
This will give you
Date Last_quarter
0 2006-10-31 2006-07-31
1 2007-01-31 2006-10-31
Then you can do the same process as you described above on Last_quarter
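For instance, converting the shifted dates to quarter labels (a sketch):

df['Last_quarter'].dt.to_period('Q')
# 0    2006Q3
# 1    2006Q4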
Here is a pivot_table approach
# Roll each date back to the previous quarter end and save it in a column
df['Q'] = df['DATE'] - pd.tseries.offsets.QuarterEnd()
#0 2006-09-30
#1 2006-09-30
#2 2006-09-30
#3 2006-12-31
#4 2006-12-31
#5 2006-12-31
#Name: Q, dtype: datetime64[ns]
# Drop the date columns and pivot, keeping the last row of each quarter
ndf = df.drop(columns=['DATE', 'Q']).pivot_table(index=pd.to_datetime(df['Q']).dt.to_period('Q'), aggfunc='last')
BBGN QQ4567 SSD
Q
2006Q3 1.30 10055.466 22.080
2006Q4 1.44 10480.888 22.125
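As a direct answer to the "previous quarter" part of the question: pandas Period objects support integer arithmetic in units of their own frequency, so subtracting 1 rolls a quarterly period back one quarter, including across year boundaries. A sketch:

import pandas as pd

print(pd.Period('2006Q4', freq='Q') - 1)   # 2006Q3
print(pd.Period('2007Q1', freq='Q') - 1)   # 2006Q4

# so after DF['DATE'] = pd.to_datetime(DF['DATE']).dt.to_period('Q'),
# the first row can be rolled back with:
# DF.loc[0, 'DATE'] = DF.loc[0, 'DATE'] - 1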
