Downsample to quarter level and get quarter end date value in Pandas - python

My data frame has daily values from 2005-01-01 to 2021-10-31:
           |   C1    |   C2
-----------+---------+--------
2005-01-01 |  2.7859 | -7.790
2005-01-02 | -0.7756 | -0.97
2005-01-03 | -6.892  |  2.770
2005-01-04 |  2.785  | -0.97
...
2021-10-28 |  6.892  |  2.785
2021-10-29 |  2.785  | -6.892
2021-10-30 | -6.892  | -0.97
2021-10-31 | -0.7756 |  2.34
I want to downsample this data frame to quarterly values, as follows.
           |   C1    |   C2
-----------+---------+--------
2005-03-31 |  2.7859 | -7.790
2005-06-30 | -0.7756 | -0.97
2005-09-30 | -6.892  |  2.770
2005-12-31 |  2.785  | -0.97
I tried to do it with the Pandas resample method, but it requires an aggregation method:
df = df.resample('Q').mean()
I don't want the aggregated value; I want the value at the quarter-end date, as it is.

Your code is almost right; you are just not using the right aggregation function. Replace mean with last:
import numpy as np
import pandas as pd

dti = pd.date_range('2005-01-01', '2021-10-31', freq='D')
df = pd.DataFrame(np.random.random((len(dti), 2)), columns=['C1', 'C2'], index=dti)
dfQ = df.resample('Q').last()  # note: pandas >= 2.2 prefers freq='QE' over 'Q'
print(dfQ)
# Output:
C1 C2
2005-03-31 0.653733 0.334182
2005-06-30 0.425229 0.316189
2005-09-30 0.055675 0.746406
2005-12-31 0.394051 0.541684
2006-03-31 0.525208 0.413624
... ... ...
2020-12-31 0.662081 0.887147
2021-03-31 0.824541 0.363729
2021-06-30 0.064824 0.621555
2021-09-30 0.126891 0.549009
2021-12-31 0.126217 0.044822
[68 rows x 2 columns]

You can also do this:
df = df[df.index.is_quarter_end]
This filters the frame down to only the dates at the end of each quarter.
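One difference between the two approaches is worth noting: resample('Q').last() also emits a row for a final, partial quarter, while the is_quarter_end filter drops it. A minimal sketch with the random frame from above:
import numpy as np
import pandas as pd

dti = pd.date_range('2005-01-01', '2021-10-31', freq='D')
df = pd.DataFrame(np.random.random((len(dti), 2)), columns=['C1', 'C2'], index=dti)

# resample emits a 2021-12-31 row holding the 2021-10-31 values
print(df.resample('Q').last().tail(1))

# the filter stops at the last true quarter end, 2021-09-30
print(df[df.index.is_quarter_end].tail(1))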

find start and end date of previous 12 month from current date in python

I need to find the start and end dates of the previous 12 months from the current date.
If the current date is 05-May-2022, it should display the first and last date of each of the past 12 months, including the current month.
How can I achieve this, given that each month has a different number of days? Is there a function in datetime for this?
My code below only displays the previous month's first and last date; I want to print all of the previous 12 months.
from datetime import date, timedelta

this_first = date.today().replace(day=1)     # first day of current month
prev_last = this_first - timedelta(days=1)   # last day of previous month
prev_first = prev_last.replace(day=1)        # first day of previous month
prev_first, prev_last
Output:
(datetime.date(2021, 1, 1), datetime.date(2021, 1, 31))
Expected Output:
[('2021-06-01', '2021-06-30'), ('2021-07-01', '2021-07-31'),
('2021-08-01', '2021-08-31'), ('2021-09-01', '2021-09-30'),
('2021-10-01', '2021-10-31'), ('2021-11-01', '2021-11-30'),
('2021-12-01', '2021-12-31'), ('2022-01-01', '2022-01-31'),
('2022-02-01', '2022-02-28'), ('2022-03-01', '2022-03-31'),
('2022-04-01', '2022-04-30'), ('2022-05-01', '2022-05-31')]
Note: the dtype should be datetime.
You can use current_date.replace(day=1) to get the first day of the current month. If you then subtract datetime.timedelta(days=1), you get the last day of the previous month, and calling replace(day=1) again gives the first day of the previous month. Repeating this in a loop yields the first and last day of each of the 12 previous months.
import datetime

current = datetime.datetime(2022, 5, 5)
start = current.replace(day=1)
for x in range(1, 13):
    end = start - datetime.timedelta(days=1)  # last day of previous month
    start = end.replace(day=1)                # first day of that month
    print(f'{x:2} |', start.date(), '|', end.date())
Result:
1 | 2022-04-01 | 2022-04-30
2 | 2022-03-01 | 2022-03-31
3 | 2022-02-01 | 2022-02-28
4 | 2022-01-01 | 2022-01-31
5 | 2021-12-01 | 2021-12-31
6 | 2021-11-01 | 2021-11-30
7 | 2021-10-01 | 2021-10-31
8 | 2021-09-01 | 2021-09-30
9 | 2021-08-01 | 2021-08-31
10 | 2021-07-01 | 2021-07-31
11 | 2021-06-01 | 2021-06-30
12 | 2021-05-01 | 2021-05-31
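If you need the exact expected output, i.e. ascending (start, end) string tuples that include the current month, here is a minimal sketch built on the same replace/timedelta trick (the helper name last_12_months is just for illustration):
import datetime

def last_12_months(current):
    # walk backwards from the first day of the month *after* the current one
    if current.month == 12:
        nxt = current.replace(year=current.year + 1, month=1, day=1)
    else:
        nxt = current.replace(month=current.month + 1, day=1)
    pairs = []
    for _ in range(12):
        last = nxt - datetime.timedelta(days=1)   # last day of previous month
        first = last.replace(day=1)               # first day of that month
        pairs.append((first.isoformat(), last.isoformat()))
        nxt = first
    return pairs[::-1]                            # ascending order

print(last_12_months(datetime.date(2022, 5, 5)))
# [('2021-06-01', '2021-06-30'), ..., ('2022-05-01', '2022-05-31')]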
EDIT:
If you use pandas, you can use pd.date_range(), but it cannot count backwards, so you first have to work out the anchor dates yourself: '2021.04.05' (for freq='MS') and '2021.05.05' (for freq='M').
import pandas as pd

# all_starts = pd.date_range('2021.04.05', '2022.04.05', freq='MS')
all_starts = pd.date_range('2021.04.05', periods=12, freq='MS')
print(all_starts)

# all_ends = pd.date_range('2021.05.05', '2022.05.05', freq='M')
all_ends = pd.date_range('2021.05.05', periods=12, freq='M')
print(all_ends)

for start, end in zip(all_starts, all_ends):
    print(start.to_pydatetime().date(), '|', end.to_pydatetime().date())
DatetimeIndex(['2021-05-01', '2021-06-01', '2021-07-01', '2021-08-01',
               '2021-09-01', '2021-10-01', '2021-11-01', '2021-12-01',
               '2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01'],
              dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
               '2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30'],
              dtype='datetime64[ns]', freq='M')
2021-05-01 | 2021-05-31
2021-06-01 | 2021-06-30
2021-07-01 | 2021-07-31
2021-08-01 | 2021-08-31
2021-09-01 | 2021-09-30
2021-10-01 | 2021-10-31
2021-11-01 | 2021-11-30
2021-12-01 | 2021-12-31
2022-01-01 | 2022-01-31
2022-02-01 | 2022-02-28
2022-03-01 | 2022-03-31
2022-04-01 | 2022-04-30
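As a side note, pandas Period objects carry both ends of each month themselves, which avoids juggling two anchor dates; a minimal sketch:
import pandas as pd

# 12 monthly periods ending with the current month
for p in pd.period_range(end='2022-05', periods=12, freq='M'):
    print(p.start_time.date(), '|', p.end_time.date())
# 2021-06-01 | 2021-06-30
# ...
# 2022-05-01 | 2022-05-31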
EDIT:
I found out that the standard module calendar gives the number of days in a month. Note that calendar.monthrange(year, month) actually returns the weekday of the first day of the month and the number of days in the month:
first_weekday, days = calendar.monthrange(year, month)
Working example:
import calendar
year = 2022
month = 5
for number in range(1, 13):
    if month > 1:
        month -= 1
    else:
        month = 12
        year -= 1
    # monthrange returns (weekday of the 1st, number of days in month)
    first_weekday, days = calendar.monthrange(year, month)
    print(f'{number:2} | {year}.{month:02}.01 | {year}.{month:02}.{days}')
Result:
1 | 2022.04.01 | 2022.04.30
2 | 2022.03.01 | 2022.03.31
3 | 2022.02.01 | 2022.02.28
4 | 2022.01.01 | 2022.01.31
5 | 2021.12.01 | 2021.12.31
6 | 2021.11.01 | 2021.11.30
7 | 2021.10.01 | 2021.10.31
8 | 2021.09.01 | 2021.09.30
9 | 2021.08.01 | 2021.08.31
10 | 2021.07.01 | 2021.07.31
11 | 2021.06.01 | 2021.06.30
12 | 2021.05.01 | 2021.05.31

Differences in one column based on differences in another, pandas

How can I perform the manipulation below with pandas?
I have this dataframe:
weight | Date | dateDay
43 | 09/03/2018 08:48:48 | 09/03/2018
30 | 10/03/2018 23:28:48 | 10/03/2018
45 | 12/03/2018 04:21:44 | 12/03/2018
25 | 17/03/2018 00:23:32 | 17/03/2018
35 | 18/03/2018 04:49:01 | 18/03/2018
39 | 19/03/2018 20:14:37 | 19/03/2018
I want this:
weight | Date | dateDay | Fun_Cum
43 | 09/03/2018 08:48:48 | 09/03/2018 | NULL
30 | 10/03/2018 23:28:48 | 10/03/2018 | -13
45 | 12/03/2018 04:21:44 | 12/03/2018 | NULL
25 | 17/03/2018 00:23:32 | 17/03/2018 | NULL
35 | 18/03/2018 04:49:01 | 18/03/2018 | 10
39 | 19/03/2018 20:14:37 | 19/03/2018 | 4
Pseudocode:
if day does not immediately follow day-1 => Fun_Cum is NULL;
else Fun_Cum = (weight at day) - (weight at day-1)
Thank you
This is one way, using pd.Series.diff and pd.Series.shift. You can take the difference between consecutive datetime elements and access the pd.Series.dt.days attribute:
import numpy as np

df['Fun_Cum'] = df['weight'].diff()
df.loc[(df.dateDay - df.dateDay.shift()).dt.days != 1, 'Fun_Cum'] = np.nan
print(df)
weight Date dateDay Fun_Cum
0 43 2018-03-09 2018-03-09 NaN
1 30 2018-03-10 2018-03-10 -13.0
2 45 2018-03-12 2018-03-12 NaN
3 25 2018-03-17 2018-03-17 NaN
4 35 2018-03-18 2018-03-18 10.0
5 39 2018-03-19 2018-03-19 4.0
# import pandas as pd
# from datetime import datetime
# to_datetime = lambda d: datetime.strptime(d, '%d/%m/%Y')
# df = pd.read_csv('d.csv', converters={'dateDay': to_datetime})
The part above is only needed if you are reading from a file; otherwise .shift() is all you need:
a = df
b = df.shift()
df["Fun_Cum"] = (a.weight - b.weight) * ((a.dateDay - b.dateDay).dt.days == 1)
Note that this variant puts 0 rather than NULL on rows where the days are not consecutive, because the difference is multiplied by a False (zero) mask.

Remove rows with values repeated on specific columns [duplicate]

If I have following dataframe
| id | timestamp | code | id2
| 10 | 2017-07-12 13:37:00 | 206 | a1
| 10 | 2017-07-12 13:40:00 | 206 | a1
| 10 | 2017-07-12 13:55:00 | 206 | a1
| 10 | 2017-07-12 19:00:00 | 206 | a2
| 11 | 2017-07-12 13:37:00 | 206 | a1
...
I need to group by id, id2 columns and get the first occurrence of timestamp value, e.g. for id=10, id2=a1, timestamp=2017-07-12 13:37:00.
I googled it and found some possible solutions, but can't figure out how to apply them properly. This probably should be something like:
df.groupby(["id", "id2"])["timestamp"].apply(lambda x: ....)
I think you need GroupBy.first:
df.groupby(["id", "id2"])["timestamp"].first()
Or drop_duplicates:
df.drop_duplicates(subset=['id','id2'])
For same output:
df1 = df.groupby(["id", "id2"], as_index=False)["timestamp"].first()
print (df1)
id id2 timestamp
0 10 a1 2017-07-12 13:37:00
1 10 a2 2017-07-12 19:00:00
2 11 a1 2017-07-12 13:37:00
df1 = df.drop_duplicates(subset=['id','id2'])[['id','id2','timestamp']]
print (df1)
id id2 timestamp
0 10 a1 2017-07-12 13:37:00
1 10 a2 2017-07-12 19:00:00
2 11 a1 2017-07-12 13:37:00
One can create a new column by concatenating the id and id2 strings, then remove the rows where it is duplicated:
df['newcol'] = df.apply(lambda x: str(x.id) + str(x.id2), axis=1)
df = df[~df.newcol.duplicated()].iloc[:, :4]  # iloc drops the helper column
print(df)
Output:
id timestamp code id2
0 10 2017-07-12 13:37:00 206 a1
3 10 2017-07-12 19:00:00 206 a2
4 11 2017-07-12 13:37:00 206 a1
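Incidentally, the helper column is not strictly necessary: duplicated accepts a subset argument directly, so an equivalent filter would be:
df = df[~df.duplicated(subset=['id', 'id2'])]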


How to include strings with Pandas resample

I have time series data with a date column as my index, a numeric measurement, and three columns that each indicate whether a safety mechanism was activated to block the measurement. Example:
DateTime Safe1 Safe2 Safe3 Measurement
1/8/2013 6:06 N Y N
1/8/2013 6:23 N Y N
1/8/2013 6:40 N N N 28
1/8/2013 6:57 N N N 31
I need to resample the data with Pandas to clean half-hour intervals, taking the mean of the values where any exist. Of course, this drops the three safety string columns.
However, I would like to include a column that indicates Y if any of the safety mechanisms were activated during the half-hour interval.
How do I get this string column in the resampled data, showing Y whenever a Y was present among the three safety columns in the raw data, even for intervals with no Measurement values?
Desired Output based upon above:
DateTime Safe1 Measurement
1/8/2013 6:00 Y
1/8/2013 6:30 N 29.5
I don't think it's possible to do exactly this with the resample function, as there is not much you can customise. Instead, we can do a groupby with a TimeGrouper (pd.Grouper in modern pandas).
First, create the data:
import pandas as pd

index = ['1/8/2013 6:06', '1/8/2013 6:23', '1/8/2013 6:40', '1/8/2013 6:57']
data = {'Safe1': ['N', 'N', 'N', 'N'],
        'Safe2': ['Y', 'Y', 'N', 'N'],
        'Safe3': ['N', 'N', 'N', 'N'],
        'Measurement': [0, 0, 28, 31]}
df = pd.DataFrame(index=index, data=data)
df.index = pd.to_datetime(df.index)
df
Output:
Measurement Safe1 Safe2 Safe3
2013-01-08 06:06:00 0 N Y N
2013-01-08 06:23:00 0 N Y N
2013-01-08 06:40:00 28 N N N
2013-01-08 06:57:00 31 N N N
Then let's add a helper column, called Safe, that concatenates all of the Safex columns. If there is at least one Y in the Safe column, we know that a safety mechanism was activated.
df['Safe'] = df['Safe1'] + df['Safe2'] + df['Safe3']
print(df)
Output:
Measurement Safe1 Safe2 Safe3 Safe
2013-01-08 06:06:00 0 N Y N NYN
2013-01-08 06:23:00 0 N Y N NYN
2013-01-08 06:40:00 28 N N N NNN
2013-01-08 06:57:00 31 N N N NNN
Finally, we define a custom function that returns Y if there is at least one Y in the strings passed to it. That function is applied to the Safe column after grouping by 30-minute intervals:
def func(x):
    x = ''.join(x.values)
    return 'Y' if 'Y' in x else 'N'

# pd.TimeGrouper was removed in recent pandas; use pd.Grouper(freq='30Min') there
df.groupby(pd.TimeGrouper(freq='30Min')).agg({'Measurement': 'mean', 'Safe': func})
Output:
Safe Measurement
2013-01-08 06:00:00 Y 0.0
2013-01-08 06:30:00 N 29.5
Here's an answer using the pandas built-in resample function.
First combine the 3 Safe values into a single column:
df['Safe'] = df.Safe1 + df.Safe2 + df.Safe3
Turn the 3-letter strings into a 0-1 variable:
df.Safe = df.Safe.apply(lambda x: 1 if 'Y' in x else 0)
Write a custom resampling function for the Safe column:
def f(x):
    if sum(x) > 0:
        return 'Y'
    else:
        return 'N'
Finally, resample (recent pandas rejects the old renaming dict .agg({'Safe': f}), so aggregate the Series directly):
df.resample('30T').Safe.agg(f).to_frame().join(df.resample('30T').Measurement.mean())
Output:
Safe Measurement
2013-01-08 06:00:00 Y 0.0
2013-01-08 06:30:00 N 29.5
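Equivalently, on recent pandas both columns can be aggregated in a single call, reusing the same f and the 0/1 Safe encoding from above; a sketch, not the original answer's code:
out = df.resample('30Min').agg({'Safe': f, 'Measurement': 'mean'})
print(out)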
I manually resample the dates (easy if it is just rounding).
Here is an example.
from datetime import datetime, timedelta
from random import randint, seed

import pandas as pd
from tabulate import tabulate

def df_to_md(df):
    print(tabulate(df, tablefmt="pipe", headers="keys"))

seed(42)
# Three people: tom should score around 90%, dick 50% and poor harry only 10%
people = ['tom', 'dick', 'harry']
avg_score = [90, 50, 10]
date_times = list(pd.date_range(datetime.now() - timedelta(days=2), datetime.now(), freq='5 min'))
scale = 1 + int(len(date_times) / len(people))
score = [randint(i, 100) * i / 10000 for i in avg_score * scale]
df = pd.DataFrame.from_records(list(zip(date_times, people * scale, score)),
                               columns=['When', 'Who', 'Status'])

# Tom should score well
df_to_md(df[df.Who == 'tom'].head())
The table is in Markdown format, just to ease my cut and paste.
| | When | Who | Status |
|---:|:---------------------------|:------|---------:|
| 0 | 2019-06-18 14:07:17.457124 | tom | 0.9 |
| 3 | 2019-06-18 14:22:17.457124 | tom | 0.846 |
| 6 | 2019-06-18 14:37:17.457124 | tom | 0.828 |
| 9 | 2019-06-18 14:52:17.457124 | tom | 0.9 |
| 12 | 2019-06-18 15:07:17.457124 | tom | 0.819 |
Harry scores badly
df_to_md(df[df.Who=='harry'].head())
| | When | Who | Status |
|---:|:---------------------------|:------|---------:|
| 2 | 2019-06-18 14:17:17.457124 | harry | 0.013 |
| 5 | 2019-06-18 14:32:17.457124 | harry | 0.038 |
| 8 | 2019-06-18 14:47:17.457124 | harry | 0.023 |
| 11 | 2019-06-18 15:02:17.457124 | harry | 0.079 |
| 14 | 2019-06-18 15:17:17.457124 | harry | 0.064 |
Let's get the average per hour per person.
def round_to_hour(t):
    # Round to the nearest hour, adding one hour when minute >= 30
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
            + timedelta(hours=t.minute // 30))
And generate a new column using this method.
df['WhenRounded']=df.When.apply(lambda x: round_to_hour(x))
df_to_md(df[df.Who=='tom'].head())
This should be tom's data, showing the original and rounded timestamps.
| | When | Who | Status | WhenRounded |
|---:|:---------------------------|:------|---------:|:--------------------|
| 0 | 2019-06-18 14:07:17.457124 | tom | 0.9 | 2019-06-18 14:00:00 |
| 3 | 2019-06-18 14:22:17.457124 | tom | 0.846 | 2019-06-18 14:00:00 |
| 6 | 2019-06-18 14:37:17.457124 | tom | 0.828 | 2019-06-18 15:00:00 |
| 9 | 2019-06-18 14:52:17.457124 | tom | 0.9 | 2019-06-18 15:00:00 |
| 12 | 2019-06-18 15:07:17.457124 | tom | 0.819 | 2019-06-18 15:00:00 |
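As an aside, pandas can do this rounding vectorised; a hedged near-equivalent of the helper above (tie-breaking at exactly 30 minutes may differ) is:
df['WhenRounded'] = df['When'].dt.round('h')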
We can now "resample" by grouping and aggregating: group by the rounded date and the person (a datetime and a str), taking the mean value in this case, though other aggregations are also available.
df_resampled = df.groupby(by=['WhenRounded', 'Who']).agg({'Status': 'mean'}).reset_index()
# Output in Markdown format
df_to_md(df_resampled[df_resampled.Who=='tom'].head())
| | WhenRounded | Who | Status |
|---:|:--------------------|:------|---------:|
| 2 | 2019-06-18 14:00:00 | tom | 0.873 |
| 5 | 2019-06-18 15:00:00 | tom | 0.83925 |
| 8 | 2019-06-18 16:00:00 | tom | 0.86175 |
| 11 | 2019-06-18 17:00:00 | tom | 0.84375 |
| 14 | 2019-06-18 18:00:00 | tom | 0.8505 |
Let's check the mean for tom at 14:00:
print("Check tom 14:00 .86850 ... {:6.5f}".format((.900+.846+.828+.900)/4))
Check tom 14:00 .86850 ... 0.86850
Hope this assists
