I need to add 12 hours to the dropoff_datetime column for all trips with a negative duration.
This is the prompt I was given:
Use the where function with three arguments: within the condition, compare the values of df['duration'] with a timedelta(0) object. inplace should be set to True, and the other argument needs to be set to the result of adding the dropoff_datetime column and a timedelta object of 12 hours.
Below is the code I have written, but the output still comes back incorrect. I think my "other" argument is the issue.
# Load libraries
import pandas as pd
from datetime import timedelta
# Load the dataset and convert the duration column to timedelta
url = 'https://drive.google.com/uc?id=1YV5bKobzYxVAWyB7VlxNH6dmfP4tHBui'
df = pd.read_csv(url, parse_dates = ['pickup_datetime', 'dropoff_datetime', 'dropoff_calculated'])
df["duration"] = pd.to_timedelta(df["duration"])
# Task 1: add 12 hours for trips with negative durations
df['duration'].where(~(df['duration'] < timedelta(0)), other = df['dropoff_datetime'] + timedelta(12), inplace = True)
# Task 2: recalculate duration column
df['duration'] = df['dropoff_datetime'] - df['pickup_datetime']
# Task 3: inspect the first 5 rows with negative duration
print(df[df['duration'] < timedelta(0)][["pickup_datetime", "dropoff_datetime", "trip_duration", "dropoff_calculated"]].head(5))
Output:
pickup_datetime dropoff_datetime trip_duration \
34 2016-09-19 11:47:23 2016-09-19 02:21:19 0 days 02:33:56
66 2016-09-20 12:11:43 2016-09-20 02:15:55 0 days 02:04:13
74 2016-09-20 12:55:00 2016-09-20 01:03:36 0 days 00:08:36
132 2017-04-22 12:38:41 2017-04-22 01:20:13 0 days 00:41:32
231 2017-04-24 12:56:31 2017-04-24 01:06:18 0 days 00:09:47
dropoff_calculated
34 2016-09-19 14:21:19
66 2016-09-20 14:15:56
74 2016-09-20 13:03:36
132 2017-04-22 13:20:13
231 2017-04-24 13:06:18
Three things: (1) Task 2 is overwriting what you're doing in Task 1 with the where command, so that has to be removed; (2) you had the wrong column in the where command, it should be duration; and (3) timedelta(12) is 12 days, so for 12 hours you need timedelta(hours=12). With those small changes I believe your code works as expected:
# Load libraries
import pandas as pd
from datetime import timedelta
# Load the dataset and convert the duration column to timedelta
url = 'https://drive.google.com/uc?id=1YV5bKobzYxVAWyB7VlxNH6dmfP4tHBui'
df = pd.read_csv(url, parse_dates = ['pickup_datetime', 'dropoff_datetime', 'dropoff_calculated'])
df["duration"] = pd.to_timedelta(df["duration"])
# Task 1: add 12 hours to duration for trips with negative durations
df['duration'].where(~(df['duration'] < timedelta(0)), other=df['duration'] + timedelta(hours=12), inplace=True)
# Task 2: recalculate duration column
# df['duration'] = df['dropoff_datetime'] - df['pickup_datetime']
# Task 3: inspect the first 5 rows with negative duration
print(df[df['duration'] < timedelta(0)][["pickup_datetime", "dropoff_datetime", "trip_duration", "dropoff_calculated"]].head(5))
Output:
Empty DataFrame
Columns: [pickup_datetime, dropoff_datetime, trip_duration, dropoff_calculated]
Index: []
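As an aside, the same update can also be written with a boolean mask and .loc, which sidesteps the inplace where call entirely; a minimal sketch, assuming the same df (and imports) as above:
# Equivalent mask-based update (assumes df['duration'] is already timedelta64)
mask = df['duration'] < pd.Timedelta(0)
df.loc[mask, 'duration'] += pd.Timedelta(hours=12)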
My DataFrame looks like this:
id  date        value
1   2021-07-16  100
2   2021-09-15   20
1   2021-04-10   50
1   2021-08-27   30
2   2021-07-22   15
2   2021-07-22   25
1   2021-06-30   40
3   2021-10-11  150
2   2021-08-03   15
1   2021-07-02   90
I want to group by id and return the difference between the total value over two 90-day periods.
Specifically, I want the sum of the values over the last 90 days counted from today, and over the 90 days counted from 30 days ago.
For example, considering today is 2021-10-13, I would like to get:
the sum of all values per id between 2021-10-13 and 2021-07-15
the sum of all values per id between 2021-09-13 and 2021-06-15
And finally, subtract them to get the variation.
I've already managed to calculate it by creating separate temporary dataframes containing only the dates in those 90-day periods, grouping by id, and then merging these temp dataframes into a final one.
But I guess there should be an easier or simpler way to do it. Appreciate any help!
Btw, sorry if the explanation was a little messy.
If I understood correctly, you need something like this:
import pandas as pd
import datetime
## Calculate the dates we are going to need.
today = datetime.datetime.now()
# Date 120 days ago
hundredTwentyDaysAgo = today - datetime.timedelta(days=120)
# Date 90 days ago
ninetyDaysAgo = today - datetime.timedelta(days=90)
# Date 30 days ago
thirtyDaysAgo = today - datetime.timedelta(days=30)
## Initialize an example df.
df = pd.DataFrame({"id": [1, 2, 1, 1, 2, 2, 1, 3, 2, 1],
                   "date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22",
                            "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
                   "value": [100, 20, 50, 30, 15, 25, 40, 150, 15, 90]})
## Cast the date column
df['date'] = pd.to_datetime(df['date']).dt.date
grouped = df.groupby('id')
# Sum of the last 90 days per id
ninetySum = grouped.apply(lambda x: x[x['date'] >= ninetyDaysAgo.date()]['value'].sum())
# Sum of the 90 days starting 120 days ago and ending 30 days ago, per id
hundredTwentySum = grouped.apply(lambda x: x[(x['date'] >= hundredTwentyDaysAgo.date()) & (x['date'] <= thirtyDaysAgo.date())]['value'].sum())
The output is
ninetySum - hundredTwentySum
id
1 -130
2 20
3 150
dtype: int64
You can double check to make sure these are the numbers you wanted by printing ninetySum and hundredTwentySum variables.
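If you want to avoid groupby.apply, a roughly equivalent sketch using plain boolean masks (same df and date variables as above; the recent/earlier names are just for illustration):
# Sum per id over each window, then subtract; fill_value=0 covers ids
# that appear in only one window (like id 3 here)
recent = df[df['date'] >= ninetyDaysAgo.date()].groupby('id')['value'].sum()
earlier = df[(df['date'] >= hundredTwentyDaysAgo.date())
             & (df['date'] <= thirtyDaysAgo.date())].groupby('id')['value'].sum()
variation = recent.sub(earlier, fill_value=0)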
I scraped a dataset of running finishing times, which includes runners finishing under the hour and over the hour. Runners under the hour are coded as M:S, e.g. 48:12 for a runner who finished in 48 minutes and 12 seconds. Runners over the hour are coded as H:M:S, e.g. 1:12:45.
Is there a way to pass two formats to datetime and have it encode all of them as H:M:S?
I tried:
df['Time'] = pd.to_datetime(df['Time'],format="%H:%M:%S")
this (rightly) gives an error for runners under the hour.
from datetime import datetime

for obs in range(1, len(df)):
    text = df.iloc[obs].loc['Time']
    for fmt in ('%M:%S', '%H:%M:%S'):
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            pass
    raise ValueError('no valid date format found')
This gives the ValueError that no valid format was found.
I want the solution to be something I can use for different datasets, so just finding the first runner over the hour and using a different format from then on doesn't really work.
Try this:
import pandas as pd

df = pd.DataFrame({'Time': ['1:01:02', '3:20', 'xyz']})
tmp = (df.Time
       .str.extract(r'(\d*):?(\d+):(\d+)$')
       .replace('', 0).astype(float)
      )
which gives you
0 1 2
0 1.0 1.0 2.0
1 0.0 3.0 20.0
2 NaN NaN NaN
and you can get total number of seconds by:
tmp[0] * 3600 + tmp[1] * 60 + tmp[2]
from which you can convert to a timedelta type.
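For instance, a minimal sketch of that last step (assuming the tmp frame from above; the Time_td column name is just for illustration):
# Total seconds per row, then convert; rows that failed to parse become NaT
seconds = tmp[0] * 3600 + tmp[1] * 60 + tmp[2]
df['Time_td'] = pd.to_timedelta(seconds, unit='s')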
Use pd.to_timedelta (or pd.to_datetime), but first ensure the time is in the 'HH:MM:SS' format by padding it appropriately.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Time': ['1', '8:12', '48:11', '1:12:13', '123:12:12']})
fill = '00:00:00'
s = df.Time.str.len()
# Prepend however much of '00:00:00' each value is missing, then parse in one go
pd.to_timedelta([fill[0:x] for x in np.clip(len(fill) - s, a_min=0, a_max=None)] + df.Time)
Output
0 0 days 00:00:01
1 0 days 00:08:12
2 0 days 00:48:11
3 0 days 01:12:13
4 5 days 03:12:12
Name: Time, dtype: timedelta64[ns]
I have the Following Dataframe:
ID Minutes Datetime
1 30 6/4/2018 23:47:00
2 420
3 433 6/10/2018 2:50
4 580 6/9/2018 3:10
5 1020
I want to count the number of times the Minutes value falls within certain ranges. I want to do a similar count for the Datetime field (timestamps falling within certain ranges of time of day).
Below is the output I want:
MIN_RANGE COUNT
6-8 hours 2
8-10 hours 1
10-12 hours 0
12-14 hours 0
14-16 hours 0
16+ hours 1
RANGE COUNT
8pm - 10pm 0
10pm - 12am 1
12am - 2am 0
2am - 4am 2
4am - 6am 0
6am - 8am 0
8am - 10am 0
10am - 12pm 0
12pm - 2pm 0
2pm - 4pm 0
4pm - 6pm 0
6pm - 8pm 0
I have searched around Google and Stack Overflow for how to do this (binning and such) but couldn't find anything directly related to what I am trying to do.
Help?
This is a complex problem that can be solved by using pd.date_range and pd.cut, and then some index manipulation.
First of all, you can start by cutting your data frame using pd.cut:
cuts = pd.cut(pd.to_datetime(df.Datetime), pd.date_range('02:00:00', freq='2H', periods=13))
0 (2018-07-09 22:00:00, 2018-07-10 00:00:00]
1 NaN
2 (2018-07-09 02:00:00, 2018-07-09 04:00:00]
3 (2018-07-09 02:00:00, 2018-07-09 04:00:00]
4 NaN
This will yield the cuts based on your Datetime column and the ranges defined. (Note that pd.date_range('02:00:00', freq='2H', periods=13) anchors the bin edges on the date the code is run, so the timestamps being cut need to fall on that same date; for real data you would first reduce the Datetime values to their time of day on a common date.)
Let's start with a base data frame with values set to 0, which we will update later with your counts. Using the cuts from above,
cats = cuts.cat.categories
bases = ["{}-{}".format(v.left.strftime("%H%p"),v.right.strftime("%H%p")) for v in cats]
df_base = pd.DataFrame({"Range": bases, "Count":0}).set_index("Range")
which yields
Count
Range
02AM-04AM 0
04AM-06AM 0
06AM-08AM 0
08AM-10AM 0
10AM-12PM 0
12PM-14PM 0
14PM-16PM 0
16PM-18PM 0
18PM-20PM 0
20PM-22PM 0
22PM-00AM 0
00AM-02AM 0
Now, you can use collections.Counter to quickly count your occurrences:
from collections import Counter

x = Counter(cuts.dropna())
Notice that I have used dropna() so as not to count NaNs. With your x variable, we can
values = {"{}-{}".format(k.left.strftime("%H%p"), k.right.strftime("%H%p")) : v for k,v in x.items()}
counts_df = pd.DataFrame([values]).T
which yields
0
02AM-04AM 2
22PM-00AM 1
Finally, we just update our previous data frame with these values
df_base.loc[counts_df.index, "Count"] = counts_df[0]
Count
Range
02AM-04AM 2
04AM-06AM 0
06AM-08AM 0
08AM-10AM 0
10AM-12PM 0
12PM-14PM 0
14PM-16PM 0
16PM-18PM 0
18PM-20PM 0
20PM-22PM 0
22PM-00AM 1
00AM-02AM 0
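As an aside, the categories produced by pd.cut can also be counted directly with value_counts, which already includes the empty bins; a sketch assuming the cuts variable from above:
# One row per interval, in bin order, zero counts included; NaNs are excluded by default
counts_per_bin = cuts.value_counts(sort=False)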
import numpy as np

# 2-hour bins from 6 to 16 hours (in minutes), plus a final catch-all bin up to 24 hours
counts = np.histogram(df['Minutes'],
                      bins=list(range(6*60, 18*60, 2*60)) + [24*60])[0]
bin_labels = ['6-8 hours',
              '8-10 hours',
              '10-12 hours',
              '12-14 hours',
              '14-16 hours',
              '16+ hours']
pd.Series(counts, index=bin_labels)
You can do a similar thing with the hours, using the hour attribute of datetime objects; you will have to fill in (or drop) the empty parts of the Datetime column first, as in the sketch below.
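A sketch of that hour-based version (here I simply drop the missing Datetime rows rather than filling them, which is an assumption):
import numpy as np
import pandas as pd

# Hour of day per row; NaT rows become NaN and are dropped
hours = pd.to_datetime(df['Datetime']).dt.hour.dropna()
# 2-hour-wide bins over the full day
hour_counts = np.histogram(hours, bins=range(0, 25, 2))[0]
print(pd.Series(hour_counts, index=[f'{h}-{h+2}h' for h in range(0, 24, 2)]))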
@RafaelC has already addressed the binning and counting, but I'll make a note about reading the data from a file.
First, let's assume you separate your columns by commas (CSV), and start with:
dates.csv
ID,Minutes,Datetime
1,30,6/4/2018 23:47:00
2,420,
3,433,6/10/2018 2:50
4,580,6/9/2018 3:10
5,1020,
Then, you can read the values and parse the third column as dates as follows.
import pandas as pd

def my_date_parser(date_str):
    # Allow empty values to be coerced to NaT (Not a Time)
    # rather than throw an exception
    return pd.to_datetime(date_str, errors='coerce')

df = pd.read_csv(
    './dates.csv',
    date_parser=my_date_parser,
    parse_dates=['Datetime']
)
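Note that date_parser has since been deprecated in newer pandas; an equivalent sketch that coerces after reading:
import pandas as pd

df = pd.read_csv('./dates.csv')
# errors='coerce' turns empty or unparseable values into NaT
df['Datetime'] = pd.to_datetime(df['Datetime'], errors='coerce')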
You can also get the counts by using the built-in floor method of datetime objects. In this case, you want to use a frequency of '2h' so that you are looking at 2-hour bins. Then just grab the time part:
import pandas as pd
df['Datetime'] = pd.to_datetime(df.Datetime)
df.Datetime.dt.floor('2h').dt.time
#0 22:00:00
#1 NaT
#2 02:00:00
#3 02:00:00
#4 NaT
(Alternatively you could also just use df.Datetime.dt.hour//2 to get the same grouping logic, but slightly different labels)
So you can easily group by this and then count:
df.groupby(df.Datetime.dt.floor('2h').dt.time).size()
#Datetime
#02:00:00 2
#22:00:00 1
#dtype: int64
Now to get the full list, we can just reindex, and change the index labels to be a bit more informative.
import datetime
import numpy as np
df_counts = df.groupby(df.Datetime.dt.floor('2h').dt.time).size()
ids = [datetime.time(2*x,0) for x in range(12)]
df_counts = df_counts.reindex(ids).fillna(0).astype('int')
# Appropriately label the ranges with more info if needed
df_counts.index = '['+df_counts.index.astype(str) + ' - ' + np.roll(df_counts.index.astype(str), -1)+')'
Output:
df_counts
[00:00:00 - 02:00:00) 0
[02:00:00 - 04:00:00) 2
[04:00:00 - 06:00:00) 0
[06:00:00 - 08:00:00) 0
[08:00:00 - 10:00:00) 0
[10:00:00 - 12:00:00) 0
[12:00:00 - 14:00:00) 0
[14:00:00 - 16:00:00) 0
[16:00:00 - 18:00:00) 0
[18:00:00 - 20:00:00) 0
[20:00:00 - 22:00:00) 0
[22:00:00 - 00:00:00) 1
dtype: int64
A column in my pandas data frame represents a time delta that I calculated with datetime, then exported to a CSV and read back into a pandas data frame. Now the column's dtype is object, whereas I want it to be a timedelta so I can perform a groupby on the dataframe. Below is what the strings look like. Thanks!
0 days 00:00:57.416000
0 days 00:00:12.036000
0 days 16:46:23.127000
49 days 00:09:30.813000
50 days 00:39:31.306000
55 days 12:39:32.269000
-1 days +22:03:05.256000
Update: my best attempt at writing a function to map over a specific column in my pandas dataframe:
def delta(i):
    days, timestamp = i.split(" days ")
    timestamp = timestamp[:len(timestamp) - 7]
    t = (datetime.datetime.strptime(timestamp, "%H:%M:%S")
         + datetime.timedelta(days=int(days)))
    delta = datetime.timedelta(days=t.day, hours=t.hour,
                               minutes=t.minute, seconds=t.second)
    return delta.total_seconds()

data['diff'].map(delta)
Use pd.to_timedelta
pd.to_timedelta(df.iloc[:, 0])
0 0 days 00:00:57.416000
1 0 days 00:00:12.036000
2 0 days 16:46:23.127000
3 49 days 00:09:30.813000
4 50 days 00:39:31.306000
5 55 days 12:39:32.269000
6 -1 days +22:03:05.256000
Name: 0, dtype: timedelta64[ns]
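Once converted, you can assign it back and group; a sketch (assuming, as in your attempt, the frame is called data and the column 'diff'):
data['diff'] = pd.to_timedelta(data['diff'])
# e.g. group by the whole-day part of the delta (a hypothetical grouping key)
print(data.groupby(data['diff'].dt.days).size())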
import datetime

# Parse your string
days, timestamp = "55 days 12:39:32.269000".split(" days ")
timestamp = timestamp[:len(timestamp) - 7]
# Generate a datetime object for the time-of-day part
t = datetime.datetime.strptime(timestamp, "%H:%M:%S")
# Generate a timedelta, using the parsed day count directly
# (adding the days to the datetime and reading back t.day would be wrong
# once the delta crosses a month boundary)
delta = datetime.timedelta(days=int(days), hours=t.hour, minutes=t.minute, seconds=t.second)
# Represent in seconds
delta.total_seconds()
You could do something like this, looping through each value from the CSV in place of stringdate:
import datetime

stringdate = "2 days 00:00:57.416000"
days_v_hms = stringdate.split('days')
hms = days_v_hms[1].split(':')
dt = datetime.timedelta(days=int(days_v_hms[0]), hours=int(hms[0]), minutes=int(hms[1]), seconds=float(hms[2]))
Cheers!
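A sketch of applying that same parsing to a whole column (the parse_delta name and the data['diff'] column are just for illustration):
import datetime
import pandas as pd

def parse_delta(s):
    # Same splitting as above, wrapped in a function for Series.apply
    days_v_hms = s.split('days')
    hms = days_v_hms[1].split(':')
    return datetime.timedelta(days=int(days_v_hms[0]), hours=int(hms[0]),
                              minutes=int(hms[1]), seconds=float(hms[2]))

data['diff_td'] = data['diff'].apply(parse_delta)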
I have a pandas DataFrame that is indexed by date. I would like to select all consecutive gaps by period and all consecutive days by period. How can I do this?
Example of a DataFrame with no columns but a date index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see, there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another DataFrame in each case, since I want to use other columns in the DataFrame to group by.
Pandas has a built-in method Series.diff() which you can use to accomplish this. One benefit is that you can use pandas series functions like mean() to quickly compute summary statistics on the resulting gaps series object.
from datetime import datetime, timedelta
import pandas as pd

# Construct dummy dataframe
dates = pd.to_datetime([
    '2016-08-03',
    '2016-08-04',
    '2016-08-05',
    '2016-08-17',
    '2016-09-05',
    '2016-09-06',
    '2016-09-07',
    '2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
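For the dummy data above, this should print something like:
3 gaps with average gap duration: 14 days 08:00:00
Start: 2016-08-05 | Duration: 12 days, 0:00:00
Start: 2016-08-17 | Duration: 19 days, 0:00:00
Start: 2016-09-07 | Duration: 12 days, 0:00:00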
Here's something to get started:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(5), columns=['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39',
                             '2016-09-05 11:11:30', '2016-09-05 11:10:46',
                             '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df.cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df to, say, a spreadsheet, it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
Now to complete:
df = df.reset_index()
df = df.groupby('ones').agg(first_spotted=('index', 'min'), gaps=('ones', 'count'))
which gives:
first_spotted gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1