Resampling a pandas DataFrame by business days gives bad results - python

When I run the following code, the results appear to add the non-business day data to the result.
Code
import pandas as pd
df = pd.DataFrame({'id': [30820864, 32295510, 30913444, 30913445],
                   'ticket_id': [100, 101, 102, 103],
                   'date_time': [
                       '6/1/17 9:48',
                       '6/2/17 13:11',
                       '6/3/17 13:15',
                       '6/5/17 13:15'],
                   })
df['date_time'] = pd.to_datetime(df['date_time'])
df.index = df['date_time']
x = df.resample('B').count()
print(x)
Result
id ticket_id date_time
date_time
2017-06-01 1 0 1
2017-06-02 2 0 2
2017-06-05 1 0 1
I would expect that the count for 2017-06-02 would be 1 and not 2. Shouldn't the data from a non-business day (6/3/17) be ignored?

This seems to be standard behaviour: events on weekends are grouped with the preceding Friday (another post describes something similar, and here it says that this is the convention).
One solution is to drop the weekends before resampling:
df = df[df['date_time'].apply(lambda x: x.weekday() not in [5,6])]
Output:
date_time id ticket_id
date_time
2017-06-01 1 1 1
2017-06-02 1 1 1
2017-06-05 1 1 1
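A vectorized variant of the same filter, as a minimal sketch: it assumes `date_time` is already a datetime column and uses `Series.dt.dayofweek` (Monday is 0, so Saturday and Sunday are 5 and 6) instead of `apply`. The dates are written in ISO form here only to keep parsing unambiguous, and the names `weekdays_only` and `counts` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [30820864, 32295510, 30913444, 30913445],
    'ticket_id': [100, 101, 102, 103],
    'date_time': pd.to_datetime([
        '2017-06-01 09:48',   # Thursday
        '2017-06-02 13:11',   # Friday
        '2017-06-03 13:15',   # Saturday, should not be counted
        '2017-06-05 13:15',   # Monday
    ]),
})

# Keep Monday-Friday rows only (dayofweek: Monday=0 ... Sunday=6).
weekdays_only = df[df['date_time'].dt.dayofweek < 5]

# Resample the filtered frame by business day and count rows per bin.
counts = weekdays_only.set_index('date_time').resample('B').count()
print(counts)
```

With the Saturday row removed before resampling, nothing is left to be folded into the Friday bin.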

Related

Compare two date columns of one data frame with two date columns of a second data frame in Python

I have two dataframes df1 and df2
df1 contains month and two date columns
df1
Month Month_Start Month_End
Month1 2022-03-27 2022-04-30
Month2 2022-05-01 2022-05-28
Month3 2022-05-01 2022-06-25
another data frame df2
start_Month end_Month price
2022-03-27 2260-12-31 1
2022-03-27 2260-12-31 2
2022-03-27 2260-12-31 3
If Month_Start and Month_End of df1 both fall between start_Month and end_Month of df2, assign the price value to the corresponding Month of df1,
giving the following result:
Month price
Month1 1
Month2 1
Month3 1
I tried using for loops:
new = pd.DataFrame(columns=['Month', 'price'])  # collects the matches; its definition is not shown in the original snippet
for i in range(len(df2)):
    for j in range(len(df1)):
        if df2['start_Month'][i] <= df1['Month_Start'][j] <= df1['Month_End'][j] <= df2['end_Month'][i]:
            new.loc[len(new.index)] = [df1['Month'][j], df2['price'][i]]
but this takes a long time to execute for 1000+ rows.
Any ideas?
Is there a common column on which you can combine these two dataframes, such as an id? If there is, it would be much more accurate to apply the conditions after merging the two tables. Based on the current data and conditions, you can try the code below. Note that it compares the two frames row by row, so dataframes that are not the same size may cause a problem.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data={'Month': ['Month1', 'Month2', 'Month3'],
                         'Month_Start': ['2022-03-27', '2022-05-01', '2022-05-01'],
                         'Month_End': ['2022-04-30', '2022-05-28', '2022-06-25']})
df2 = pd.DataFrame(data={'start_Month': ['2022-03-27', '2022-03-27', '2022-03-27'],
                         'end_Month': ['2260-12-31', '2260-12-31', '2260-12-31'],
                         'price': [1, 2, 3]})
con = [(df1['Month_Start'] >= df2['start_Month']) & (df1['Month_End'] <= df2['end_Month'])]
cho = [df2['price']]
df1['price'] = np.select(con, cho, default=np.nan)
Assuming these are your dataframes:
import pandas as pd
df1 = pd.DataFrame({'Month': ['Month1', 'Month2', 'Month3'],
                    'Month_Start': ['2022-03-27', '2022-05-01', '2022-05-01'],
                    'Month_End': ['2022-04-30', '2022-05-28', '2022-06-25']})
df1['Month_Start'] = pd.to_datetime(df1['Month_Start'])
df1['Month_End'] = pd.to_datetime(df1['Month_End'])
df2 = pd.DataFrame({'start_Month': ['2022-03-01', '2022-05-01', '2022-06-01'],
                    'end_Month': ['2022-04-30', '2022-05-30', '2022-06-30'],
                    'price': [1, 2, 3]})
df2['start_Month'] = pd.to_datetime(df2['start_Month'])
df2['end_Month'] = pd.to_datetime(df2['end_Month'])
print(df1)
Month Month_Start Month_End
0 Month1 2022-03-27 2022-04-30
1 Month2 2022-05-01 2022-05-28
2 Month3 2022-05-01 2022-06-25
print(df2) #note validity periods do not overlap, so only 1 price is valid!
start_Month end_Month price
0 2022-03-01 2022-04-30 1
1 2022-05-01 2022-05-30 2
2 2022-06-01 2022-06-30 3
I would define a separate function that checks the validity period and returns the corresponding price. Note that if more than one matching validity period is found, the first one is returned. If no matching period is found, a null value is returned.
def check_validity(row):
    # prices whose validity period fully contains this row's month
    matches = df2['price'][(df2['start_Month'] <= row['Month_Start']) & (row['Month_End'] <= df2['end_Month'])]
    try:
        return int(matches.values[0])
    except IndexError:
        # no validity period covers this month
        return None
df1['price'] = df1.apply(check_validity, axis=1)
print(df1)
Output:
Month Month_Start Month_End price
0 Month1 2022-03-27 2022-04-30 1.0
1 Month2 2022-05-01 2022-05-28 2.0
2 Month3 2022-05-01 2022-06-25 NaN
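Since the question was about speed on 1000+ rows, the row-wise `apply` could also be replaced by a cross join followed by a filter. The following is only a sketch under a few assumptions: pandas 1.2 or later (for `how='cross'`), unique values in `Month`, and enough memory for `len(df1) * len(df2)` intermediate rows. The names `pairs`, `inside`, `first_match` and `result` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Month': ['Month1', 'Month2', 'Month3'],
    'Month_Start': pd.to_datetime(['2022-03-27', '2022-05-01', '2022-05-01']),
    'Month_End': pd.to_datetime(['2022-04-30', '2022-05-28', '2022-06-25']),
})
df2 = pd.DataFrame({
    'start_Month': pd.to_datetime(['2022-03-01', '2022-05-01', '2022-06-01']),
    'end_Month': pd.to_datetime(['2022-04-30', '2022-05-30', '2022-06-30']),
    'price': [1, 2, 3],
})

# Every (df1 row, df2 row) combination.
pairs = df1.merge(df2, how='cross')

# Keep the pairs where the month lies entirely inside the validity period.
inside = pairs[(pairs['start_Month'] <= pairs['Month_Start'])
               & (pairs['Month_End'] <= pairs['end_Month'])]

# If several periods match, keep the first one in df2 order.
first_match = inside.drop_duplicates('Month')[['Month', 'price']]

# Left join back so months without any valid period get NaN.
result = df1.merge(first_match, on='Month', how='left')
print(result)
```

This trades memory for speed: the comparison runs once over the whole cross product instead of once per row.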

Python: how to groupby a pandas dataframe to count by hour and day?

I have a dataframe like the following:
df.head(4)
timestamp user_id category
0 2017-09-23 15:00:00+00:00 A Bar
1 2017-09-14 18:00:00+00:00 B Restaurant
2 2017-09-30 00:00:00+00:00 B Museum
3 2017-09-11 17:00:00+00:00 C Museum
I would like to count, for each hour of each day, the number of visitors in each category, and get a dataframe like the following:
df
year month day hour category count
0 2017 9 11 0 Bar 2
1 2017 9 11 1 Bar 1
2 2017 9 11 2 Bar 0
3 2017 9 11 3 Bar 1
Assuming you want to group by date and hour, you can use the following code, provided the timestamp column is a datetime column:
# use df['col'] = ... here: attribute-style assignment (df.year = ...) does not create a new column
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour
grouped_data = df.groupby(['year','month','day','hour','category']).count()
To get the count of user_id per hour per category, you can group directly on the parts of the datetime:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df_new = df.groupby([df.timestamp.dt.year,
                     df.timestamp.dt.month,
                     df.timestamp.dt.day,
                     df.timestamp.dt.hour,
                     'category']).count()['user_id']
df_new.index.names = ['year', 'month', 'day', 'hour', 'category']
df_new = df_new.reset_index()
When a dataframe column holds datetimes, you can use the dt accessor to access the individual parts of each datetime, e.g. the year.
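A compact variant of the same idea, as a sketch: each datetime part is renamed before grouping so the index levels already carry the wanted names, and `size()` counts rows per group. It uses only the four sample rows from the question, and the name `counts` is illustrative. Like the answers above, it only lists combinations that actually occur; it does not add rows with a count of 0.

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2017-09-23 15:00:00+00:00',
        '2017-09-14 18:00:00+00:00',
        '2017-09-30 00:00:00+00:00',
        '2017-09-11 17:00:00+00:00',
    ]),
    'user_id': ['A', 'B', 'B', 'C'],
    'category': ['Bar', 'Restaurant', 'Museum', 'Museum'],
})

ts = df['timestamp']
counts = (
    df.groupby([ts.dt.year.rename('year'),
                ts.dt.month.rename('month'),
                ts.dt.day.rename('day'),
                ts.dt.hour.rename('hour'),
                'category'])
      .size()                       # number of rows in each group
      .reset_index(name='count')    # back to flat columns
)
print(counts)
```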

Days before end of month in pandas

I would like to get the number of days before the end of the month, from a string column representing a date.
I have the following pandas dataframe :
df = pd.DataFrame({'date':['2019-11-22','2019-11-08','2019-11-30']})
df
date
0 2019-11-22
1 2019-11-08
2 2019-11-30
I would like the following output :
df
date days_end_month
0 2019-11-22 8
1 2019-11-08 22
2 2019-11-30 0
The pd.tseries.offsets.MonthEnd offset with rollforward seemed a good pick, but I can't figure out how to use it to transform a whole column.
Subtract the day of the month, extracted by Series.dt.day, from the number of days in the month, given by Series.dt.daysinmonth:
df['date'] = pd.to_datetime(df['date'])
df['days_end_month'] = df['date'].dt.daysinmonth - df['date'].dt.day
Or add offsets.MonthEnd(0) to roll each date forward to its month end, subtract the original date, and convert the resulting timedeltas to days with Series.dt.days:
df['days_end_month'] = (df['date'] + pd.offsets.MonthEnd(0) - df['date']).dt.days
print (df)
date days_end_month
0 2019-11-22 8
1 2019-11-08 22
2 2019-11-30 0
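For reference, here is how the offset behaves on single timestamps, as a small sketch. It shows `rollforward`, which the question mentions, and why `MonthEnd(0)` rather than `MonthEnd(1)` is used above: a date that is already a month end stays put with 0 but jumps to the next month end with 1. The variable names are illustrative.

```python
import pandas as pd

ts = pd.Timestamp('2019-11-22')

# rollforward moves a date to the next month end unless it already is one.
rolled = pd.offsets.MonthEnd().rollforward(ts)
days_left = (rolled - ts).days
print(rolled, days_left)

on_end = pd.Timestamp('2019-11-30')
stays = on_end + pd.offsets.MonthEnd(0)   # already a month end: unchanged
jumps = on_end + pd.offsets.MonthEnd(1)   # moves on to the next month end
print(stays, jumps)
```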

Pandas - Add at least one row for every day (datetimes include a time)

Edit: You can use the alleged duplicate's solution with reindex() if your dates don't include times; otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
df.index = df['Datetime']
values = df['Datetime'].tolist()
# timedelta below is presumably a one-day datetime.timedelta defined elsewhere; its definition is not shown
for i in range(len(values)-1):
    if values[i].date() + timedelta < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta))
print(df.reindex(values, fill_value=0))
This makes every row forget about the other columns and the same thing happens for asfreq('D') or resample()
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame that contains all the required datetimes and then left join it with your data frame.
A working code example is the following:
import datetime
df['Datetime'] = pd.to_datetime(df['Datetime']) # first convert to datetimes
datetimes = df['Datetime'].tolist() # existing datetimes; the missing dates are appended below
dates = [x.date() for x in datetimes] # the same values reduced to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(pd.Timestamp(forward_date)) # midnight of the missing date
# create new dataframe (sorted so the added dates fall in order), merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': sorted(datetimes)}).merge(df, on='Datetime', how='left')
df['Count'] = df['Count'].fillna(0)
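The same result can be sketched without an explicit loop, by computing the missing calendar days as an index difference and concatenating filler rows. This is an alternative under stated assumptions, not the answer's method: `Count` is stored as integers here (the question stores it as strings), and the names `days_present`, `missing_days`, `filler` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', 8],
                        ['2017-02-15 16:33:00', 'Scott', 10],
                        ['2017-02-15 16:45:00', 'Steve', 5]],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'])

# Calendar days that already have at least one row (time set to midnight).
days_present = df['Datetime'].dt.normalize()

# Every day between the first and last date, minus the ones already present.
all_days = pd.date_range(days_present.min(), days_present.max(), freq='D')
missing_days = all_days.difference(pd.DatetimeIndex(days_present))

# One filler row per missing day, then put everything in chronological order.
filler = pd.DataFrame({'Datetime': missing_days, 'Sender': None, 'Count': 0})
result = (pd.concat([df, filler])
            .sort_values('Datetime')
            .reset_index(drop=True))
print(result)
```

As with the merge approach, the dates do not need to be the index, and existing rows keep their original times.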

pd.to_datetime is getting half my dates with flipped day / months

My dataset has dates in the European format, and I'm struggling to convert them into the correct format when I pass them through pd.to_datetime: for every day <= 12, the month and day get switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Pass an explicit format:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
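A quick check of the explicit format on day-first strings, as a minimal sketch. The sample values are the ones used elsewhere in this thread, and the names `dates` and `parsed` are illustrative. With `format='%d/%m/%Y'` the first field is always read as the day, including for days of 12 or less.

```python
import pandas as pd

dates = pd.Series(['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018'])

# The format string fixes the field order: day, then month, then 4-digit year.
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```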
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
    {'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
