I have a pandas dataframe that has columns:
['A'] departure time, stored as an integer (e.g. 700 or 403, meaning 7:00 and 4:03);
['B'] elapsed time in minutes, stored as an integer (e.g. 70 or 656, meaning 70 mins and 656 mins);
['C'] arrival time, stored as an integer (e.g. 1810 or 355, meaning 18:10 and 03:55).
I need a way to derive a new boolean column ['D'] that is True if the arrival is on the following day and False if it is on the same day.
My idea was to take the hour digits of column A (everything before the -2 index of the string), convert them to minutes, and add the remaining minutes to get the total minutes elapsed since the day started; if adding the elapsed time from column B pushes that total past the number of minutes in a day, I'd have my answer. I'm not sure how to do that, though, or whether there's a simpler way.
Similar to the method you outlined, you can accomplish this by converting the integers in column A to a datetime (pandas fills in the default date 1900-01-01), adding the integer minutes from column B as a timedelta, and then checking whether the result is still on day 1 of the month. As a sanity check, I made sure the last row returns True.
You can probably combine these steps without creating a new column, but I think the code is more readable this way.
import pandas as pd
df = pd.DataFrame({
'A':[700,403,2359],
'B':[70,656,2],
'C':[810,1059,1]
})
# convert to string, add leading zeros, then convert column A to datetime
df['arrival'] = pd.to_datetime(df['A'].astype(str).str.zfill(4), format='%H%M') + pd.to_timedelta(df['B'], unit='m')
# arrival rolls past day 1 of the (default 1900-01) month only when it lands on the next day
df['D'] = df['arrival'].dt.day > 1
Output:
A B C arrival D
0 700 70 810 1900-01-01 08:10:00 False
1 403 656 1059 1900-01-01 14:59:00 False
2 2359 2 1 1900-01-02 00:01:00 True
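Alternatively, staying close to the integer arithmetic you sketched in the question, you can skip datetimes entirely. A minimal sketch using the same example frame:
import pandas as pd
df = pd.DataFrame({'A': [700, 403, 2359], 'B': [70, 656, 2]})
# minutes since midnight: 700 -> 7*60 + 0, 2359 -> 23*60 + 59
depart_mins = (df['A'] // 100) * 60 + df['A'] % 100
# arrival falls on the next day if departure + elapsed spills past 24*60 minutes
df['D'] = (depart_mins + df['B']) >= 24 * 60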
I need to add 12 hours to the dropoff_datetime column for all trips with a negative duration.
This is the prompt I was given:
Use the where function with three arguments: within the condition, compare the values of df['duration'] with a timedelta(0) object; inplace should be set to True; and the other argument needs to be set to the result of adding the dropoff_datetime column and a timedelta object of 12 hours.
Below is the code I have written, but the output still comes back incorrect. I think my "other" argument is the issue.
# Load libraries
import pandas as pd
from datetime import timedelta
# Load the dataset and create the duration column
url = 'https://drive.google.com/uc?id=1YV5bKobzYxVAWyB7VlxNH6dmfP4tHBui'
df = pd.read_csv(url, parse_dates = ['pickup_datetime', 'dropoff_datetime', 'dropoff_calculated'])
df["duration"] = pd.to_timedelta(df["duration"])
# Task 1: add 12 hours to dropoff duration for negative durations
df['duration'].where(~(df['duration'] < timedelta(0)), other = df['dropoff_datetime'] + timedelta(12), inplace = True)
# Task 2: recalculate duration column
df['duration'] = df['dropoff_datetime'] - df['pickup_datetime']
# Task 3: inspect the first 5 rows with negative duration
print(df[df['duration'] < timedelta(0)][["pickup_datetime", "dropoff_datetime", "trip_duration", "dropoff_calculated"]].head(5))
Output:
pickup_datetime dropoff_datetime trip_duration \
34 2016-09-19 11:47:23 2016-09-19 02:21:19 0 days 02:33:56
66 2016-09-20 12:11:43 2016-09-20 02:15:55 0 days 02:04:13
74 2016-09-20 12:55:00 2016-09-20 01:03:36 0 days 00:08:36
132 2017-04-22 12:38:41 2017-04-22 01:20:13 0 days 00:41:32
231 2017-04-24 12:56:31 2017-04-24 01:06:18 0 days 00:09:47
dropoff_calculated
34 2016-09-19 14:21:19
66 2016-09-20 14:15:56
74 2016-09-20 13:03:36
132 2017-04-22 13:20:13
231 2017-04-24 13:06:18
Two things: (1) Task 2 is overwriting what you're doing in Task 1 with the where command, so it has to be removed, and (2) you had the wrong column in the where command's other argument: it should be duration. (Note also that timedelta(12) is 12 days, since days is the first positional argument; for 12 hours you want timedelta(hours=12).) With those small changes I believe your code works as expected:
# Load libraries
import pandas as pd
from datetime import timedelta
# Load the dataset and create the duration column
url = 'https://drive.google.com/uc?id=1YV5bKobzYxVAWyB7VlxNH6dmfP4tHBui'
df = pd.read_csv(url, parse_dates = ['pickup_datetime', 'dropoff_datetime', 'dropoff_calculated'])
df["duration"] = pd.to_timedelta(df["duration"])
# Task 1: add 12 hours to duration for negative durations
df['duration'].where(~(df['duration'] < timedelta(0)), other = df['duration'] + timedelta(hours=12), inplace = True)
# Task 2: recalculate duration column
# df['duration'] = df['dropoff_datetime'] - df['pickup_datetime']
# Task 3: inspect the first 5 rows with negative duration
print(df[df['duration'] < timedelta(0)][["pickup_datetime", "dropoff_datetime", "trip_duration", "dropoff_calculated"]].head(5))
Outputs:
Empty DataFrame
Columns: [pickup_datetime, dropoff_datetime, trip_duration, dropoff_calculated]
Index: []
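As an aside, the same fix can be written without inplace=True by using Series.mask, which replaces values where the condition is True. A minimal sketch, assuming df is loaded as above:
from datetime import timedelta
# replace negative durations with duration + 12 hours; leave the rest unchanged
neg = df['duration'] < timedelta(0)
df['duration'] = df['duration'].mask(neg, df['duration'] + timedelta(hours=12))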
I have created a dataframe with the code below. The objective is to find the weekly low and the date on which each weekly low took place.
To do this:
import pandas as pd
from pandas_datareader import data as web
import datetime
start = datetime.datetime(2021,1,1)
end = datetime.datetime.today()
df = web.DataReader('GOOG', 'yahoo', start, end)
df['Date1'] = df.index
df['month'] = df.index.month
df['week'] = df.index.isocalendar().week  # DatetimeIndex.week was removed in recent pandas
df['day'] = df.index.day
df.set_index('week',append=True,inplace=True)
df.set_index('day',append=True,inplace=True)
To get the weekly low:
df['Low'].groupby(['week']).min().tail(50)
I am trying to find the date on which each weekly low occurred, such as 1735.420044.
If I try this:
df['Low'].isin([1735.420044])
I get :
Date week day
2020-12-31 53 31 False
2021-01-04 1 4 False
2021-01-05 1 5 False
2021-01-06 1 6 False
2021-01-07 1 7 False
...
2021-08-02 31 2 False
2021-08-03 31 3 False
2021-08-04 31 4 False
2021-08-05 31 5 False
2021-08-06 31 6 False
Name: Low, Length: 151, dtype: bool
How can I get the actual dates of the lows?
To get the dates of the weekly lows, use idxmin rather than min: for each group, it returns the index label at which the minimum occurs.
res = df['Low'].groupby(['week']).idxmin()
res is a series, indexed by week, whose values are the full (Date, week, day) index tuples at which each weekly low occurred.
To get just the dates as a series, this should work:
dates = res.map(lambda t: t[0])
PS: Clarification from the comments
df['Low'].isin([1735.420044]).any() # returns False
The above doesn't work (it should return True if there were a match) because when you write .isin([<bunch of floats>]), you are comparing floats for exact equality. The decimal literal 1735.420044 is just the rounded representation that pandas printed; the binary float actually stored in the column is generally not exactly equal to it, so the equality test fails. Float comparisons should be done within a tolerance rather than exactly (this is not Python specific, it is true in any language). When exact matching does seem to work, it is coincidental: the printed literal happens to round-trip to the same binary value. Have a look at this thread to gain some (Python specific) insight into this.
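For a tolerant match, numpy's isclose works on the whole column at once. A minimal sketch (the tolerance atol is an assumption you would tune to your data's precision):
import numpy as np
# match within an absolute tolerance instead of exact equality
mask = np.isclose(df['Low'].to_numpy(), 1735.420044, atol=1e-4)
print(df[mask])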
I am reading some data from a csv file in which two columns hold times in hh:mm format. Here is an example:
Start End
11:15 15:00
22:30 2:00
In the above example, the End in the 2nd row happens on the next day. I am trying to get the time difference between these two columns in the most efficient way, as the dataset is huge. Is there a good pythonic way of doing this? Also, since there is no date and some End times fall on the next day, I get wrong results when I calculate the diff:
>>> import pandas as pd
>>> df = pd.read_csv(file_path)
>>> pd.to_datetime(df['End'])-pd.to_datetime(df['Start'])
0 0 days 03:45:00
1 -1 days +03:30:00
You can use the modular-arithmetic trick (a + x) % x with x a timedelta of 24 hours (or 1 day, same thing):
adding timedelta(hours=24) makes all values positive;
taking % timedelta(hours=24) folds values of 24h or more back down by 24h.
from datetime import timedelta
df['duration'] = (pd.to_datetime(df['End']) - pd.to_datetime(df['Start']) + timedelta(hours=24)) \
                 % timedelta(hours=24)
Gives
Start End duration
0 11:15 15:00 0 days 03:45:00
1 22:30 2:00 0 days 03:30:00
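To see the trick in isolation, plain datetime.timedelta supports the % operator directly; a quick sketch of the wrap-around row:
from datetime import timedelta
# the wrap-around case from row 1: 22:30 -> 2:00
raw = timedelta(hours=2) - timedelta(hours=22, minutes=30)  # -1 day, 3:30:00
fixed = (raw + timedelta(hours=24)) % timedelta(hours=24)
print(fixed)  # 3:30:00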
I'm working in python with a pandas df and trying to convert a column that contains nanoseconds to a time with days, hours, minutes and seconds, but I'm not succeeding.
My original df looks like this:
ID TIME_NANOSECONDS
1 47905245000000000
2 45018244000000000
3 40182582000000000
The result should look like this:
ID TIME_NANOSECONDS TIME
1 47905245000000000 554 days 11:00:45.000000000
2 45018244000000000 521 days 01:04:04.000000000
3 40182582000000000 465 days 01:49:42.000000000
I've found some answers that advised using timedelta, but the following code returns a date, which is not what I want.
temp_fc_col['TIME_TO_REPAIR_D'] = datetime.timedelta(temp_fc_col['TIME_TO_REPAIR'], unit='ns')
Alternatively,
temp_fc_col['TIME_TO_REPAIR_D'] = timedelta(microseconds=round(temp_fc_col['TIME_TO_REPAIR'], -3))
This returns an error: unsupported type for timedelta microseconds component: Series. Probably because this statement can only process one value at a time.
Use pd.to_timedelta, which works well with Series; unit='ns' can be omitted because integer input is interpreted as nanoseconds by default:
temp_fc_col['TIME_TO_REPAIR_D'] = pd.to_timedelta(temp_fc_col['TIME_NANOSECONDS'])
print (temp_fc_col)
ID TIME_NANOSECONDS TIME_TO_REPAIR_D
0 1 47905245000000000 554 days 11:00:45
1 2 45018244000000000 521 days 01:04:04
2 3 40182582000000000 465 days 01:49:42
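As a quick sanity check on the first value, a sketch using plain arithmetic:
import pandas as pd
ns = 47905245000000000
secs = ns / 1e9                  # 47,905,245 seconds
days, rem = divmod(secs, 86400)  # 554 days with 39,645 seconds left over
print(days, rem)                 # 554.0 39645.0 -> 11:00:45
print(pd.to_timedelta(ns))       # 554 days 11:00:45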
I have a csv file containing a column of data where each value is an integer meant to represent the hour and minute of the day. The problem is that the values do not all follow the same format. If the time is between 12:00 AM and 12:10 AM, the value has just one digit (the minute). If it is between 12:10 AM and 1:00 AM, the value has two digits (again the minute). If it is between 1:00 AM and 10:00 AM, the value has three digits (the hour and minute). Finally, for all other values (those between 10:00 AM and midnight), the value has four digits (again the hour and minute).
I tried using pandas' to_datetime function to operate on the whole column.
from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
    for column_name in column_names:
        try:
            df[column_name] = df[column_name].astype(int)
        except ValueError:
            # NaN values block the integer cast; drop those rows and retry
            df = df.dropna(subset=[column_name]).reset_index(drop=True)
            df[column_name] = df[column_name].astype(int)
    return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')
The result I got was:
ValueError: time data '0' does not match format '%H%M' (match).
This was as expected.
I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four-character string, but these files are relatively big, so that would be too slow a solution.
No need for if statements. Series.str.zfill will pad each value with the correct number of leading zeros to get it into the proper format. Then use pd.to_datetime, subtracting off 1900-01-01, which is the date pandas fills in when none of the date fields are present:
Input Data
import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234
pd.to_datetime
df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M')
- pd.to_datetime('1900-01-01'))
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
pd.to_timedelta
Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:
df['Time'] = df.Time.astype(str).str.zfill(4)
# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]
df['Time'] = pd.to_timedelta(df.Time)
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
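If the string round trip proves slow, another option is to split the integers arithmetically. A minimal sketch, assuming a Time column of H/HMM/HHMM integers as in the example above:
import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# hour is everything above the last two digits, minute is the last two
hours, minutes = df['Time'] // 100, df['Time'] % 100
df['Time'] = pd.to_timedelta(hours * 60 + minutes, unit='min')
# same result as the to_timedelta version above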