Downsample to quarter level and get quarter end date value in Pandas - python

My data frame has daily values from 2005-01-01 to 2021-10-31:
           |   C1    |   C2
-----------+---------+--------
2005-01-01 |  2.7859 | -7.790
2005-01-02 | -0.7756 | -0.97
2005-01-03 | -6.892  |  2.770
2005-01-04 |  2.785  | -0.97
...
2021-10-28 |  6.892  |  2.785
2021-10-29 |  2.785  | -6.892
2021-10-30 | -6.892  | -0.97
2021-10-31 | -0.7756 |  2.34
I want to downsample this data frame to quarterly values, as follows.
           |   C1    |   C2
-----------+---------+--------
2005-03-31 |  2.7859 | -7.790
2005-06-30 | -0.7756 | -0.97
2005-09-30 | -6.892  |  2.770
2005-12-31 |  2.785  | -0.97
I tried to do it with the Pandas resample method, but it requires an aggregation method:
df = df.resample('Q').mean()
I don't want the aggregated value; I want the value at the quarter-end date, as it is.

Your code is almost right; you are just not using the right aggregation function. Replace mean with last:
import numpy as np
import pandas as pd

dti = pd.date_range('2005-01-01', '2021-10-31', freq='D')
df = pd.DataFrame(np.random.random((len(dti), 2)), columns=['C1', 'C2'], index=dti)
dfQ = df.resample('Q').last()  # note: pandas >= 2.2 prefers freq='QE' over 'Q'
print(dfQ)
# Output:
C1 C2
2005-03-31 0.653733 0.334182
2005-06-30 0.425229 0.316189
2005-09-30 0.055675 0.746406
2005-12-31 0.394051 0.541684
2006-03-31 0.525208 0.413624
... ... ...
2020-12-31 0.662081 0.887147
2021-03-31 0.824541 0.363729
2021-06-30 0.064824 0.621555
2021-09-30 0.126891 0.549009
2021-12-31 0.126217 0.044822
[68 rows x 2 columns]

You can also do this:
df = df[df.index.is_quarter_end]
This filters the frame down to only the dates at the end of each quarter.
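One difference between the two approaches is worth noting: resample('Q').last() also emits a row for a final, partial quarter, while the is_quarter_end filter drops it. A minimal sketch with the random frame from above:
import numpy as np
import pandas as pd

dti = pd.date_range('2005-01-01', '2021-10-31', freq='D')
df = pd.DataFrame(np.random.random((len(dti), 2)), columns=['C1', 'C2'], index=dti)

# resample emits a 2021-12-31 row holding the 2021-10-31 values
print(df.resample('Q').last().tail(1))

# the filter stops at the last true quarter end, 2021-09-30
print(df[df.index.is_quarter_end].tail(1))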

find start and end date of previous 12 month from current date in python

I need to find the start and end dates of the previous 12 months from the current date.
If the current date is 05-May-2022, it should display the first and last date of each of the past 12 months, including the current month.
How can I achieve this, given that each month has a different number of days? Is there a function in datetime for this?
My code below only displays the previous month's first and last date; I want to print all of the previous 12 months.
from datetime import date, timedelta

this_first = date.today().replace(day=1)     # first day of current month
prev_last = this_first - timedelta(days=1)   # last day of previous month
prev_first = prev_last.replace(day=1)        # first day of previous month
prev_first, prev_last
Output:
(datetime.date(2021, 1, 1), datetime.date(2021, 1, 31))
Expected Output:
[('2021-06-01', '2021-06-30'), ('2021-07-01', '2021-07-31'),
('2021-08-01', '2021-08-31'), ('2021-09-01', '2021-09-30'),
('2021-10-01', '2021-10-31'), ('2021-11-01', '2021-11-30'),
('2021-12-01', '2021-12-31'), ('2022-01-01', '2022-01-31'),
('2022-02-01', '2022-02-28'), ('2022-03-01', '2022-03-31'),
('2022-04-01', '2022-04-30'), ('2022-05-01', '2022-05-31')]
Note: the dtype should be datetime.
You can use current_date.replace(day=1) to get the first day of the current month. If you then subtract datetime.timedelta(days=1), you get the last day of the previous month, and calling replace(day=1) again gives the first day of the previous month. Repeating this in a loop yields the first and last day of each of the 12 previous months.
import datetime

current = datetime.datetime(2022, 5, 5)
start = current.replace(day=1)
for x in range(1, 13):
    end = start - datetime.timedelta(days=1)  # last day of previous month
    start = end.replace(day=1)                # first day of that month
    print(f'{x:2} |', start.date(), '|', end.date())
Result:
1 | 2022-04-01 | 2022-04-30
2 | 2022-03-01 | 2022-03-31
3 | 2022-02-01 | 2022-02-28
4 | 2022-01-01 | 2022-01-31
5 | 2021-12-01 | 2021-12-31
6 | 2021-11-01 | 2021-11-30
7 | 2021-10-01 | 2021-10-31
8 | 2021-09-01 | 2021-09-30
9 | 2021-08-01 | 2021-08-31
10 | 2021-07-01 | 2021-07-31
11 | 2021-06-01 | 2021-06-30
12 | 2021-05-01 | 2021-05-31
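If you need the exact expected output, i.e. ascending (start, end) string tuples that include the current month, here is a minimal sketch built on the same replace/timedelta trick (the helper name last_12_months is just for illustration):
import datetime

def last_12_months(current):
    # walk backwards from the first day of the month *after* the current one
    if current.month == 12:
        nxt = current.replace(year=current.year + 1, month=1, day=1)
    else:
        nxt = current.replace(month=current.month + 1, day=1)
    pairs = []
    for _ in range(12):
        last = nxt - datetime.timedelta(days=1)   # last day of previous month
        first = last.replace(day=1)               # first day of that month
        pairs.append((first.isoformat(), last.isoformat()))
        nxt = first
    return pairs[::-1]                            # ascending order

print(last_12_months(datetime.date(2022, 5, 5)))
# [('2021-06-01', '2021-06-30'), ..., ('2022-05-01', '2022-05-31')]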
EDIT:
If you use pandas, you can use pd.date_range(), but it cannot count backwards, so you first have to work out the anchor dates yourself: '2021.04.05' (for freq='MS') and '2021.05.05' (for freq='M').
import pandas as pd

# all_starts = pd.date_range('2021.04.05', '2022.04.05', freq='MS')
all_starts = pd.date_range('2021.04.05', periods=12, freq='MS')
print(all_starts)

# all_ends = pd.date_range('2021.05.05', '2022.05.05', freq='M')
all_ends = pd.date_range('2021.05.05', periods=12, freq='M')
print(all_ends)

for start, end in zip(all_starts, all_ends):
    print(start.to_pydatetime().date(), '|', end.to_pydatetime().date())
DatetimeIndex(['2021-05-01', '2021-06-01', '2021-07-01', '2021-08-01',
               '2021-09-01', '2021-10-01', '2021-11-01', '2021-12-01',
               '2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01'],
              dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
               '2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30'],
              dtype='datetime64[ns]', freq='M')
2021-05-01 | 2021-05-31
2021-06-01 | 2021-06-30
2021-07-01 | 2021-07-31
2021-08-01 | 2021-08-31
2021-09-01 | 2021-09-30
2021-10-01 | 2021-10-31
2021-11-01 | 2021-11-30
2021-12-01 | 2021-12-31
2022-01-01 | 2022-01-31
2022-02-01 | 2022-02-28
2022-03-01 | 2022-03-31
2022-04-01 | 2022-04-30
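As a side note, pandas Period objects carry both ends of each month themselves, which avoids juggling two anchor dates; a minimal sketch:
import pandas as pd

# 12 monthly periods ending with the current month
for p in pd.period_range(end='2022-05', periods=12, freq='M'):
    print(p.start_time.date(), '|', p.end_time.date())
# 2021-06-01 | 2021-06-30
# ...
# 2022-05-01 | 2022-05-31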
EDIT:
I found out that the standard module calendar gives the number of days in a month. Note that calendar.monthrange(year, month) actually returns the weekday of the first day of the month and the number of days in the month:
first_weekday, days = calendar.monthrange(year, month)
Working example:
import calendar
year = 2022
month = 5
for number in range(1, 13):
    if month > 1:
        month -= 1
    else:
        month = 12
        year -= 1
    # monthrange returns (weekday of the 1st, number of days in month)
    first_weekday, days = calendar.monthrange(year, month)
    print(f'{number:2} | {year}.{month:02}.01 | {year}.{month:02}.{days}')
Result:
1 | 2022.04.01 | 2022.04.30
2 | 2022.03.01 | 2022.03.31
3 | 2022.02.01 | 2022.02.28
4 | 2022.01.01 | 2022.01.31
5 | 2021.12.01 | 2021.12.31
6 | 2021.11.01 | 2021.11.30
7 | 2021.10.01 | 2021.10.31
8 | 2021.09.01 | 2021.09.30
9 | 2021.08.01 | 2021.08.31
10 | 2021.07.01 | 2021.07.31
11 | 2021.06.01 | 2021.06.30
12 | 2021.05.01 | 2021.05.31

Differences in one column based on differences in another, pandas

How can I perform the manipulation below with pandas?
I have this dataframe:
weight | Date | dateDay
43 | 09/03/2018 08:48:48 | 09/03/2018
30 | 10/03/2018 23:28:48 | 10/03/2018
45 | 12/03/2018 04:21:44 | 12/03/2018
25 | 17/03/2018 00:23:32 | 17/03/2018
35 | 18/03/2018 04:49:01 | 18/03/2018
39 | 19/03/2018 20:14:37 | 19/03/2018
I want this:
weight | Date | dateDay | Fun_Cum
43 | 09/03/2018 08:48:48 | 09/03/2018 | NULL
30 | 10/03/2018 23:28:48 | 10/03/2018 | -13
45 | 12/03/2018 04:21:44 | 12/03/2018 | NULL
25 | 17/03/2018 00:23:32 | 17/03/2018 | NULL
35 | 18/03/2018 04:49:01 | 18/03/2018 | 10
39 | 19/03/2018 20:14:37 | 19/03/2018 | 4
Pseudocode:
if day does not immediately follow day-1 => Fun_Cum is NULL;
else Fun_Cum = (weight at day) - (weight at day-1)
Thank you
This is one way, using pd.Series.diff and pd.Series.shift. You can take the difference between consecutive datetime elements and access the pd.Series.dt.days attribute:
import numpy as np

df['Fun_Cum'] = df['weight'].diff()
df.loc[(df.dateDay - df.dateDay.shift()).dt.days != 1, 'Fun_Cum'] = np.nan
print(df)
weight Date dateDay Fun_Cum
0 43 2018-03-09 2018-03-09 NaN
1 30 2018-03-10 2018-03-10 -13.0
2 45 2018-03-12 2018-03-12 NaN
3 25 2018-03-17 2018-03-17 NaN
4 35 2018-03-18 2018-03-18 10.0
5 39 2018-03-19 2018-03-19 4.0
# import pandas as pd
# from datetime import datetime
# to_datetime = lambda d: datetime.strptime(d, '%d/%m/%Y')
# df = pd.read_csv('d.csv', converters={'dateDay': to_datetime})
The part above is only needed if you are reading from a file; otherwise .shift() is all you need:
a = df
b = df.shift()
df["Fun_Cum"] = (a.weight - b.weight) * ((a.dateDay - b.dateDay).dt.days == 1)
Note that this variant puts 0 rather than NULL on rows where the days are not consecutive, because the difference is multiplied by a False (zero) mask.

Remove rows with values repeated on specific columns [duplicate]

If I have following dataframe
| id | timestamp | code | id2
| 10 | 2017-07-12 13:37:00 | 206 | a1
| 10 | 2017-07-12 13:40:00 | 206 | a1
| 10 | 2017-07-12 13:55:00 | 206 | a1
| 10 | 2017-07-12 19:00:00 | 206 | a2
| 11 | 2017-07-12 13:37:00 | 206 | a1
...
I need to group by id, id2 columns and get the first occurrence of timestamp value, e.g. for id=10, id2=a1, timestamp=2017-07-12 13:37:00.
I googled it and found some possible solutions, but can't figure out how to apply them properly. This probably should be something like:
df.groupby(["id", "id2"])["timestamp"].apply(lambda x: ....)
I think you need GroupBy.first:
df.groupby(["id", "id2"])["timestamp"].first()
Or drop_duplicates:
df.drop_duplicates(subset=['id','id2'])
For same output:
df1 = df.groupby(["id", "id2"], as_index=False)["timestamp"].first()
print (df1)
id id2 timestamp
0 10 a1 2017-07-12 13:37:00
1 10 a2 2017-07-12 19:00:00
2 11 a1 2017-07-12 13:37:00
df1 = df.drop_duplicates(subset=['id','id2'])[['id','id2','timestamp']]
print (df1)
id id2 timestamp
0 10 a1 2017-07-12 13:37:00
1 10 a2 2017-07-12 19:00:00
2 11 a1 2017-07-12 13:37:00
One can create a new column by concatenating the id and id2 strings, then remove the rows where it is duplicated:
df['newcol'] = df.apply(lambda x: str(x.id) + str(x.id2), axis=1)
df = df[~df.newcol.duplicated()].iloc[:, :4]  # iloc drops the helper column
print(df)
Output:
id timestamp code id2
0 10 2017-07-12 13:37:00 206 a1
3 10 2017-07-12 19:00:00 206 a2
4 11 2017-07-12 13:37:00 206 a1
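Incidentally, the helper column is not strictly necessary: duplicated accepts a subset argument directly, so an equivalent filter would be:
df = df[~df.duplicated(subset=['id', 'id2'])]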


How to include strings with Pandas resample

I have time series data with a date column as my index, a numeric measurement, and three columns that each indicate whether a safety mechanism was activated to block the measurement. Example:
DateTime Safe1 Safe2 Safe3 Measurement
1/8/2013 6:06 N Y N
1/8/2013 6:23 N Y N
1/8/2013 6:40 N N N 28
1/8/2013 6:57 N N N 31
I need to resample the data with Pandas to clean half-hour intervals, taking the mean of the values where any exist. Of course, this drops the three safety string columns.
However, I would like to include a column that indicates Y if any of the safety mechanisms were activated during the half-hour interval.
How do I get this string column in the resampled data, showing Y whenever a Y was present among the three safety columns in the raw data, even for intervals with no Measurement values?
Desired Output based upon above:
DateTime Safe1 Measurement
1/8/2013 6:00 Y
1/8/2013 6:30 N 29.5
I don't think it's possible to do exactly this with the resample function, as there is not much you can customise. Instead, we can do a groupby with a TimeGrouper (pd.Grouper in modern pandas).
First, create the data:
import pandas as pd

index = ['1/8/2013 6:06', '1/8/2013 6:23', '1/8/2013 6:40', '1/8/2013 6:57']
data = {'Safe1': ['N', 'N', 'N', 'N'],
        'Safe2': ['Y', 'Y', 'N', 'N'],
        'Safe3': ['N', 'N', 'N', 'N'],
        'Measurement': [0, 0, 28, 31]}
df = pd.DataFrame(index=index, data=data)
df.index = pd.to_datetime(df.index)
df
Output:
Measurement Safe1 Safe2 Safe3
2013-01-08 06:06:00 0 N Y N
2013-01-08 06:23:00 0 N Y N
2013-01-08 06:40:00 28 N N N
2013-01-08 06:57:00 31 N N N
Then let's add a helper column, called Safe, that concatenates all of the Safex columns. If there is at least one Y in the Safe column, we know that a safety mechanism was activated.
df['Safe'] = df['Safe1'] + df['Safe2'] + df['Safe3']
print(df)
Output:
Measurement Safe1 Safe2 Safe3 Safe
2013-01-08 06:06:00 0 N Y N NYN
2013-01-08 06:23:00 0 N Y N NYN
2013-01-08 06:40:00 28 N N N NNN
2013-01-08 06:57:00 31 N N N NNN
Finally, we define a custom function that returns Y if there is at least one Y in the strings passed to it. That function is applied to the Safe column after grouping by 30-minute intervals:
def func(x):
    x = ''.join(x.values)
    return 'Y' if 'Y' in x else 'N'

# pd.TimeGrouper was removed in recent pandas; use pd.Grouper(freq='30Min') there
df.groupby(pd.TimeGrouper(freq='30Min')).agg({'Measurement': 'mean', 'Safe': func})
Output:
Safe Measurement
2013-01-08 06:00:00 Y 0.0
2013-01-08 06:30:00 N 29.5
Here's an answer using the pandas built-in resample function.
First combine the 3 Safe values into a single column:
df['Safe'] = df.Safe1 + df.Safe2 + df.Safe3
Turn the 3-letter strings into a 0-1 variable:
df.Safe = df.Safe.apply(lambda x: 1 if 'Y' in x else 0)
Write a custom resampling function for the Safe column:
def f(x):
    if sum(x) > 0:
        return 'Y'
    else:
        return 'N'
Finally, resample (recent pandas rejects the old renaming dict .agg({'Safe': f}), so aggregate the Series directly):
df.resample('30T').Safe.agg(f).to_frame().join(df.resample('30T').Measurement.mean())
Output:
Safe Measurement
2013-01-08 06:00:00 Y 0.0
2013-01-08 06:30:00 N 29.5
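Equivalently, on recent pandas both columns can be aggregated in a single call, reusing the same f and the 0/1 Safe encoding from above; a sketch, not the original answer's code:
out = df.resample('30Min').agg({'Safe': f, 'Measurement': 'mean'})
print(out)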
I manually resample the dates (easy if it is just rounding).
Here is an example.
from datetime import datetime, timedelta
from random import randint, seed

import pandas as pd
from tabulate import tabulate

def df_to_md(df):
    print(tabulate(df, tablefmt="pipe", headers="keys"))

seed(42)
# Three people: tom should score around 90%, dick 50% and poor harry only 10%
people = ['tom', 'dick', 'harry']
avg_score = [90, 50, 10]
date_times = list(pd.date_range(datetime.now() - timedelta(days=2), datetime.now(), freq='5 min'))
scale = 1 + int(len(date_times) / len(people))
score = [randint(i, 100) * i / 10000 for i in avg_score * scale]
df = pd.DataFrame.from_records(list(zip(date_times, people * scale, score)),
                               columns=['When', 'Who', 'Status'])

# Tom should score well
df_to_md(df[df.Who == 'tom'].head())
The table is in Markdown format, just to ease my cut and paste.
| | When | Who | Status |
|---:|:---------------------------|:------|---------:|
| 0 | 2019-06-18 14:07:17.457124 | tom | 0.9 |
| 3 | 2019-06-18 14:22:17.457124 | tom | 0.846 |
| 6 | 2019-06-18 14:37:17.457124 | tom | 0.828 |
| 9 | 2019-06-18 14:52:17.457124 | tom | 0.9 |
| 12 | 2019-06-18 15:07:17.457124 | tom | 0.819 |
Harry scores badly
df_to_md(df[df.Who=='harry'].head())
| | When | Who | Status |
|---:|:---------------------------|:------|---------:|
| 2 | 2019-06-18 14:17:17.457124 | harry | 0.013 |
| 5 | 2019-06-18 14:32:17.457124 | harry | 0.038 |
| 8 | 2019-06-18 14:47:17.457124 | harry | 0.023 |
| 11 | 2019-06-18 15:02:17.457124 | harry | 0.079 |
| 14 | 2019-06-18 15:17:17.457124 | harry | 0.064 |
Let's get the average per hour per person.
def round_to_hour(t):
    # Round to the nearest hour, adding one hour when minute >= 30
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
            + timedelta(hours=t.minute // 30))
And generate a new column using this method.
df['WhenRounded']=df.When.apply(lambda x: round_to_hour(x))
df_to_md(df[df.Who=='tom'].head())
This should be tom's data, showing the original and rounded timestamps.
| | When | Who | Status | WhenRounded |
|---:|:---------------------------|:------|---------:|:--------------------|
| 0 | 2019-06-18 14:07:17.457124 | tom | 0.9 | 2019-06-18 14:00:00 |
| 3 | 2019-06-18 14:22:17.457124 | tom | 0.846 | 2019-06-18 14:00:00 |
| 6 | 2019-06-18 14:37:17.457124 | tom | 0.828 | 2019-06-18 15:00:00 |
| 9 | 2019-06-18 14:52:17.457124 | tom | 0.9 | 2019-06-18 15:00:00 |
| 12 | 2019-06-18 15:07:17.457124 | tom | 0.819 | 2019-06-18 15:00:00 |
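As an aside, pandas can do this rounding vectorised; a hedged near-equivalent of the helper above (tie-breaking at exactly 30 minutes may differ) is:
df['WhenRounded'] = df['When'].dt.round('h')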
We can now "resample" by grouping and aggregating: group by the rounded date and the person (a datetime and a str), taking the mean value in this case, though other aggregations are also available.
df_resampled = df.groupby(by=['WhenRounded', 'Who']).agg({'Status': 'mean'}).reset_index()
# Output in Markdown format
df_to_md(df_resampled[df_resampled.Who=='tom'].head())
| | WhenRounded | Who | Status |
|---:|:--------------------|:------|---------:|
| 2 | 2019-06-18 14:00:00 | tom | 0.873 |
| 5 | 2019-06-18 15:00:00 | tom | 0.83925 |
| 8 | 2019-06-18 16:00:00 | tom | 0.86175 |
| 11 | 2019-06-18 17:00:00 | tom | 0.84375 |
| 14 | 2019-06-18 18:00:00 | tom | 0.8505 |
Let's check the mean for tom at 14:00:
print("Check tom 14:00 .86850 ... {:6.5f}".format((.900+.846+.828+.900)/4))
Check tom 14:00 .86850 ... 0.86850
Hope this assists
