Pandas add missing weeks from range to dataframe - python

I am computing a DataFrame with weekly amounts and now I need to fill it with missing weeks from a provided date range.
This is how I'm generating the dataframe with the weekly amounts:
df['date'] = pd.to_datetime(df['date']) - timedelta(days=6)
weekly_data: pd.DataFrame = (df
.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
.sum()
.reset_index()
)
Which outputs:
date sum
0 2020-10-11 78
1 2020-10-18 673
If a date range is given as start='2020-08-30' and end='2020-10-30', then I would expect the following dataframe:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0.0
So far, I have managed to just add the missing weeks and set the sum to 0, but it also replaces the existing values:
weekly_data = weekly_data.reindex(pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN')).fillna(0)
Which outputs:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 0.0 # should be 78
7 2020-10-18 0.0 # should be 673
8 2020-10-25 0.0

Remove reset_index for DatetimeIndex, because reindex working with index and if RangeIndex get 0 values, because no match:
weekly_data = (df.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
.sum()
)
Then is possible use fill_value=0 parameter and last add reset_index:
r = pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN', name='date')
weekly_data = weekly_data.reindex(r, fill_value=0).reset_index()
print (weekly_data)
date sum
0 2020-08-30 0
1 2020-09-06 0
2 2020-09-13 0
3 2020-09-20 0
4 2020-09-27 0
5 2020-10-04 0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0

Related

How to get 1 for 8 days after a date in pandas and 0 otherwise?

I have two dataframes:
daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01",end="2021-04-29")})
pc21 = pd.DataFrame({'Date': ["2021-01-21", "2021-03-11", "2021-04-22"]})
pc21['Date'] = pd.to_datetime(pc21['Date'])
What I want to do is the following: for every date in pc21 and if the date in pc21 is in daily, I want to get, in a new column, values equal 1 for 8 days after the date and 0 otherwise.
This is an example of a desired output:
# 2021-01-21 is in either daframes so I want a new column in 'daily' that looks like this:
Date newcol
.
.
.
2021-01-20 0
2021-01-21 1
2021-01-22 1
2021-01-23 1
2021-01-24 1
2021-01-25 1
2021-01-26 1
2021-01-27 1
2021-01-28 1
2021-01-29 0
.
.
.
Can anyone help me achieve this?
Thanks!
you can try the following approach:
res = (daily
.merge(pd.concat([pd.date_range(d, freq="D", periods=8).to_frame(name="Date")
for d in pc21["Date"]]),
how="left", indicator=True)
.replace({"both": 1, "left_only":0})
.rename(columns={"_merge":"newcol"}))
result
In [15]: res
Out[15]:
Date newcol
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 0
4 2021-01-05 0
.. ... ...
114 2021-04-25 1
115 2021-04-26 1
116 2021-04-27 1
117 2021-04-28 1
118 2021-04-29 1
[119 rows x 2 columns]
daily['value'] = 0
pc21['value'] = 1
daily = pd.merge(daily, pc21, on='Date', how='left').rename(
columns={'value_y':'value'}).drop('value_x', 1).fillna(method="ffill", limit=7).fillna(0)
pc21.drop('value',1)
Output Subset
daily.query('value.eq(1)')
Date value
20 2021-01-21 1.0
21 2021-01-22 1.0
22 2021-01-23 1.0
23 2021-01-24 1.0
24 2021-01-25 1.0
25 2021-01-26 1.0
26 2021-01-27 1.0
27 2021-01-28 1.0
69 2021-03-11 1.0
daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].fillna(method="ffill", limit=7).fillna(0)
We generate the new column first:
If the Date of daily is in Date of pc21
then put 1
else
put a NaN
Then forward fill that column but with a limit of 7 so that we have 8 consecutive 1s
Lastly forward fill again the remaining NaNs with 0.
(you can put an astype(int) at the end to have integers).

How to groupby for dataframe which has a datetimeindex only using hours

I've got a dataframe called new_dh of web request that looks like (there are more columns
s-sitename sc-win32-status
date_time
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 02:00:00 W3SVC1 0.0
2007-02-28 02:00:00 W3SVC1 0.0
2007-02-28 10:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
What I would like to do is group by the hours(the actual date of the request does not matter, just the hour and all the times have already been rounded down to not include minutes) for the datetimeindex and instead return
count
hour
0 2
01 2
02 2
10 1
23 3
Any help would be much appreciated.
I have tried
new_dh.groupby([new_dh.index.hour]).count()
but find myself printing many columns of the same value whereas I only want the above version
If need DatetimeIndex in output use DataFrame.resample:
new_dh.resample('H')['s-sitename'].count()
Or DatetimeIndex.floor:
new_dh.groupby(new_dh.index.floor('H'))['s-sitename'].count()
Problem of your solution is if use GroupBy.count it count all columns value per Hours with exclude missing values, so if no missing values get multiple columns with same values. Possible solution is specify column after groupby:
new_dh.groupby([new_dh.index.hour])['s-sitename'].count()
So data was changed for see how count with exclude missing values:
print (new_dh)
s-sitename sc-win32-status
date_time
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 02:00:00 NaN 0.0
2007-02-28 02:00:00 W3SVC1 0.0
2007-02-28 10:00:00 W3SVC1 0.0
2007-02-28 23:00:00 NaN 0.0
2007-02-28 23:00:00 NaN 0.0
2007-02-28 23:00:00 W3SVC1 0.0
df = new_dh.groupby([new_dh.index.hour]).count()
print (df)
s-sitename sc-win32-status
date_time
0 2 2
1 2 2
2 1 2
10 1 1
23 1 3
So if column is specified:
s = new_dh.groupby([new_dh.index.hour])['s-sitename'].count()
print (s)
date_time
0 2
1 2
2 1
10 1
23 1
Name: s-sitename, dtype: int64
df = new_dh.groupby([new_dh.index.hour])['s-sitename'].count().to_frame()
print (df)
s-sitename
date_time
0 2
1 2
2 1
10 1
23 1
If need count also missing values then use GroupBy.size:
s = new_dh.groupby([new_dh.index.hour])['s-sitename'].size()
print (s)
date_time
0 2
1 2
2 2
10 1
23 3
Name: s-sitename, dtype: int64
df = new_dh.groupby([new_dh.index.hour])['s-sitename'].size().to_frame()
print (df)
s-sitename
date_time
0 2
1 2
2 2
10 1
23 3
new_dh['hour'] = new_dh.index.map(lambda x: x.hour)
new_dh.groupby('hour')['hour'].count()
Result
hour
0 2
1 2
2 2
10 1
23 3
Name: hour, dtype: int64
If you need a DataFrame as result:
new_dh.groupby('hour')['hour'].count().rename('count').to_frame()
In this case, the result will be:
count
hour
0 2
1 2
2 2
10 1
23 3
You can also do this by using groupby() and assign() method:
If 'date_time' column is not your index:
result=df.assign(hour=df['date_time'].dt.hour).groupby('hour').agg(count=('s-sitename','count'))
If It's your index then use:
result=df.groupby(df.index.hour)['s-sitename'].count().to_frame('count')
result.index.name='hour'
Now if you print result then you will get your desired output:
count
hour
0 1
1 2
2 2
10 1
23 3

Subtract value in row based on condition in pandas

I need to subtract dates based on the progression of fault count.
Below is the table that has the two input columns Date and Fault_Count. The output columns I need are Option1 and Option2. The last two columns show the date difference calculations. Basically when the Fault_Count changes I need to count the number of days from when the Fault_Count changed to the initial start of fault count. For example the Fault_Count changed to 2 on 1/4/2020, I need to get the number of days from when the Fault_Count started at 0 and changed to 2 (i.e. 1/4/2020 - 1/1/2020 = 3).
Date Fault_Count Option1 Option2 Option1calc Option2calc
1/1/2020 0 0 0
1/2/2020 0 0 0
1/3/2020 0 0 0
1/4/2020 2 3 3 1/4/2020-1/1/2020 1/4/2020-1/1/2020
1/5/2020 2 0 0
1/6/2020 2 0 0
1/7/2020 4 3 3 1/7/2020-1/4/2020 1/7/2020-1/4/2020
1/8/2020 4 0 0
1/9/2020 5 2 2 1/9/2020-1/7/2020 1/9/2020-1/7/2020
1/10/2020 5 0 0
1/11/2020 0 2 -2 1/11/2020-1/9/2020 (1/11/2020-1/9/2020)*-1 as the fault resets
1/12/2020 1 1 1 1/12/2020-1/11/2020 1/12/2020-1/11/2020
Below is the code.
import pandas as pd
d = {'Date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020', '1/6/2020', '1/7/2020', '1/8/2020', '1/9/2020', '1/10/2020', '1/11/2020', '1/12/2020'], 'Fault_Count' : [0, 0, 0, 2, 2, 2, 4, 4, 5, 5, 0, 1]}
df = pd.DataFrame(d)
df['Date'] = pd.to_datetime(df['Date'])
df['Fault_count_diff'] = df.Fault_Count.diff().fillna(0)
df['Cumlative_Sum'] = df.Fault_count_diff.cumsum()
I thought I could use cumulative sum and group by to get the groups and get the differences of the first value of groups. That's as far as I could get, also I noticed that using cumulative sum was not giving me ordered groups as some of the Fault_Count get reset.
Date Fault_Count Fault_count_diff Cumlative_Sum
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 2.0 2.0
4 2020-01-05 2 0.0 2.0
5 2020-01-06 2 0.0 2.0
6 2020-01-07 4 2.0 4.0
7 2020-01-08 4 0.0 4.0
8 2020-01-09 5 1.0 5.0
9 2020-01-10 5 0.0 5.0
10 2020-01-11 0 -5.0 0.0
11 2020-01-12 1 1.0 1.0
Desired output:
Date Fault_Count Option1 Option2
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 3.0 3.0
4 2020-01-05 2 0.0 0.0
5 2020-01-06 2 0.0 0.0
6 2020-01-07 4 3.0 3.0
7 2020-01-08 4 0.0 0.0
8 2020-01-09 5 2.0 2.0
9 2020-01-10 5 0.0 0.0
10 2020-01-11 0 2.0 -2.0
11 2020-01-12 1 1.0 1.0
Thanks for the help.
Use:
m1 = df['Fault_Count'].ne(df['Fault_Count'].shift(fill_value=0))
m2 = df['Fault_Count'].eq(0) & df['Fault_Count'].shift(fill_value=0).ne(0)
s = df['Date'].groupby(m1.cumsum()).transform('first')
df['Option1'] = df['Date'].sub(s.shift()).dt.days.where(m1, 0)
df['Option2'] = df['Option1'].where(~m2, df['Option1'].mul(-1))
Details:
Use Series.ne + Series.shift to create boolean mask m1 which represent the boundary condition when Fault_count changes, similarly use Series.eq + Series.shift and Series.ne to create a boolean mask m2 which represent the condition where Fault_count resets:
m1 m2
0 False False
1 False False
2 False False
3 True False
4 False False
5 False False
6 True False
7 False False
8 True False
9 False False
10 True True # --> Fault count reset
11 True False
Use Series.groupby on consecutive fault counts obtained using m1.cumsum and transform the Date column using groupby.first:
print(s)
0 2020-01-01
1 2020-01-01
2 2020-01-01
3 2020-01-04
4 2020-01-04
5 2020-01-04
6 2020-01-07
7 2020-01-07
8 2020-01-09
9 2020-01-09
10 2020-01-11
11 2020-01-12
Name: Date, dtype: datetime64[ns]
Use Series.sub to subtract Date for s shifted using Series.shift and use Series.where to fill 0 based on mask m2 and assign this to Option1. Similary we obtain Option2 from Option1 based on mask m2:
print(df)
Date Fault_Count Option1 Option2
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 3.0 3.0
4 2020-01-05 2 0.0 0.0
5 2020-01-06 2 0.0 0.0
6 2020-01-07 4 3.0 3.0
7 2020-01-08 4 0.0 0.0
8 2020-01-09 5 2.0 2.0
9 2020-01-10 5 0.0 0.0
10 2020-01-11 0 2.0 -2.0
11 2020-01-12 1 1.0 1.0
Instead of df['Fault_count_diff'] = ... and the next line, do:
df['cycle'] = (df.Fault_Count.diff() < 0).cumsum()
Then to get the dates in between each count change.
Option1. If all calendar dates are present in df:
ndays = df.groupby(['cycle', 'Fault_Count']).Date.size()
Option2. If there's the possibility of a date not showing up in df and you still want to get the calendar days between incidents:
ndays = df.groupby(['cycle', 'Fault_Count']).Date.min().diff().dropna()

Check if my time series index data has any missing values for weekdays

I have time series data from "5 Jan 2015" to "28 Dec 2018". I observed some of the working days' dates and their values are missing. How to check how many weekdays are missing in my time range? and what are those dates so that i can extrapolate the values for those dates.
Example:
Date Price Volume
2018-12-28 172.0 800
2018-12-27 173.6 400
2018-12-26 170.4 500
2018-12-25 171.0 2200
2018-12-21 172.8 800
On observing calendar, 21st Dec, 2018 was Friday. Then excluding Saturday and Sunday, the dataset should be having "24th Dec 2018" in the list, but its missing. I need to identify such missing dates from range.
My approach till now:
I tried using
pd.date_range('2015-01-05','2018-12-28',freq='W')
to identify the number of weeks and calculate the no. of weekdays from them manually, to identify number of missing dates.
But it dint solved purpose as I need to identify missing dates from range.
Let's say this is your full dataset:
Date Price Volume
2018-12-28 172.0 800
2018-12-27 173.6 400
2018-12-26 170.4 500
2018-12-25 171.0 2200
2018-12-21 172.8 800
And dates were:
dates = pd.date_range('2018-12-15', '2018-12-31')
First, make sure the Date column is actually a date type:
df['Date'] = pd.to_datetime(df['Date'])
Then set Date as the index:
df = df.set_index('Date')
Then reindex with unutbu's solution:
df = df.reindex(dates, fill_value=0.0)
Then reset the index to make it easier to work with:
df = df.reset_index()
It now looks like this:
index Price Volume
0 2018-12-15 0.0 0.0
1 2018-12-16 0.0 0.0
2 2018-12-17 0.0 0.0
3 2018-12-18 0.0 0.0
4 2018-12-19 0.0 0.0
5 2018-12-20 0.0 0.0
6 2018-12-21 172.8 800.0
7 2018-12-22 0.0 0.0
8 2018-12-23 0.0 0.0
9 2018-12-24 0.0 0.0
10 2018-12-25 171.0 2200.0
11 2018-12-26 170.4 500.0
12 2018-12-27 173.6 400.0
13 2018-12-28 172.0 800.0
14 2018-12-29 0.0 0.0
15 2018-12-30 0.0 0.0
16 2018-12-31 0.0 0.0
Do:
df['weekday'] = df['index'].dt.dayofweek
Finally, how many weekdays are missing in your time range:
missing_weekdays = df[(~df['weekday'].isin([5,6])) & (df['Volume'] == 0.0)]
Result:
>>> missing_weekdays
index Price Volume weekday
2 2018-12-17 0.0 0.0 0
3 2018-12-18 0.0 0.0 1
4 2018-12-19 0.0 0.0 2
5 2018-12-20 0.0 0.0 3
9 2018-12-24 0.0 0.0 0
16 2018-12-31 0.0 0.0 0

Time difference within group by objects in Python Pandas

I have a dataframe that looks like this:
from to datetime other
-------------------------------------------------
11 1 2016-11-06 22:00:00 -
11 1 2016-11-06 20:00:00 -
11 1 2016-11-06 15:45:00 -
11 12 2016-11-06 15:00:00 -
11 1 2016-11-06 12:00:00 -
11 18 2016-11-05 10:00:00 -
11 12 2016-11-05 10:00:00 -
12 1 2016-10-05 10:00:59 -
12 3 2016-09-06 10:00:34 -
I want to groupby "from" and then "to" columns and then sort the "datetime" in descending order and then finally want to calculate the time difference within these grouped by objects between the current time and the next time. For eg, in this case,
I would like to have a dataframe like the following:
from to timediff in minutes others
11 1 120
11 1 255
11 1 225
11 1 0 (preferrably subtract this date from the epoch)
11 12 300
11 12 0
11 18 0
12 1 25
12 3 0
I can't get my head around figuring this out!! Is there a way out for this?
Any help will be much much appreciated!!
Thank you so much in advance!
df.assign(
timediff=df.sort_values(
'datetime', ascending=False
).groupby(['from', 'to']).datetime.diff(-1).dt.seconds.div(60).fillna(0))
I think you need:
groupby with apply sort_values with diff, convert Timedelta to minutes by seconds and floor division 60
fillna and sort_index, remove level 2 in index
df = df.groupby(['from','to']).datetime
.apply(lambda x: x.sort_values().diff().dt.seconds // 60)
.fillna(0)
.sort_index()
.reset_index(level=2, drop=True)
.reset_index(name='timediff in minutes')
print (df)
from to timediff in minutes
0 11 1 120.0
1 11 1 255.0
2 11 1 225.0
3 11 1 0.0
4 11 12 300.0
5 11 12 0.0
6 11 18 0.0
7 12 3 0.0
8 12 3 0.0
df = df.join(df.groupby(['from','to'])
.datetime
.apply(lambda x: x.sort_values().diff().dt.seconds // 60)
.fillna(0)
.reset_index(level=[0,1], drop=True)
.rename('timediff in minutes'))
print (df)
from to datetime other timediff in minutes
0 11 1 2016-11-06 22:00:00 - 120.0
1 11 1 2016-11-06 20:00:00 - 255.0
2 11 1 2016-11-06 15:45:00 - 225.0
3 11 12 2016-11-06 15:00:00 - 300.0
4 11 1 2016-11-06 12:00:00 - 0.0
5 11 18 2016-11-05 10:00:00 - 0.0
6 11 12 2016-11-05 10:00:00 - 0.0
7 12 3 2016-10-05 10:00:59 - 0.0
8 12 3 2016-09-06 10:00:34 - 0.0
Almost as above, but without apply:
result = df.sort_values(['from','to','datetime'])\
.groupby(['from','to'])['datetime']\
.diff().dt.seconds.fillna(0)

Categories