I have a DataFrame that looks like this:
df
Date Hr CO2_resp
0 5/1/02 600 0.000889
1 5/2/02 600 0.000984
2 5/4/02 900 0.000912
How would I go about creating a column Ind that holds the number of hours elapsed since midnight on 5/1/02? The result should look like this:
df
Date Hr Ind CO2_resp
0 5/1/02 600 6 0.000889
1 5/2/02 600 30 0.000984
2 5/4/02 900 81 0.000912
Thanks.
You can combine to_datetime with to_timedelta, then convert the resulting timedelta to hours by dividing by np.timedelta64(1, 'h'). If the result is always a whole number of hours, cast it to int with astype:
#convert column Date to datetime
df['Date'] = pd.to_datetime(df.Date)
df['Ind'] = ((df.Date
- pd.to_datetime('2002-05-01')
+ pd.to_timedelta(df.Hr / 100, unit='h')) / np.timedelta64(1, 'h')).astype(int)
print (df)
Date Hr CO2_resp Ind
0 2002-05-01 600 0.000889 6
1 2002-05-02 600 0.000984 30
2 2002-05-04 900 0.000912 81
If you do not divide the Hr column by 100, the output is different:
df['Ind'] = ((df.Date
- pd.to_datetime('2002-05-01')
+ pd.to_timedelta(df.Hr,unit='h')) / np.timedelta64(1, 'h')).astype(int)
print (df)
Date Hr CO2_resp Ind
0 2002-05-01 600 0.000889 600
1 2002-05-02 600 0.000984 624
2 2002-05-04 900 0.000912 972
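As a minimal sketch of the same idea, the snippet below assumes that Hr is a military-style HHMM integer (600 meaning 06:00), which the question does not state explicitly. The names start and stamp are illustrative only. It splits Hr into hours and minutes so that values with a minute part are not truncated.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["5/1/02", "5/2/02", "5/4/02"],
    "Hr": [600, 600, 900],
    "CO2_resp": [0.000889, 0.000984, 0.000912],
})

# Reference point: midnight on 5/1/02 (taken from the question).
start = pd.Timestamp("2002-05-01")

# Assumption: Hr is encoded as HHMM, so 600 is 6 hours and 0 minutes.
stamp = (
    pd.to_datetime(df["Date"], format="%m/%d/%y")
    + pd.to_timedelta(df["Hr"] // 100, unit="h")
    + pd.to_timedelta(df["Hr"] % 100, unit="min")
)

# Dividing a timedelta by one hour gives elapsed hours as floats.
df["Ind"] = (stamp - start) / pd.Timedelta(hours=1)
print(df)
```

Under that assumption, an Hr value such as 630 would contribute 6.5 hours rather than being cut down to 6.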
Assuming that Date is a string and Hr is an integer, you could apply a function that parses the Date, gets the elapsed hours (days * 24) from the timedelta relative to your reference date, and adds the hours from Hr.
Something like this:
import datetime
df['Ind'] = df.apply(lambda x:
(datetime.datetime.strptime(x['Date'], '%m/%d/%y')
- datetime.datetime.strptime('5/1/02', '%m/%d/%y')).days
* 24 + x['Hr'] / 100,
axis=1)
My DataFrame looks like this:
id date value
1 2021-07-16 100
2 2021-09-15 20
1 2021-04-10 50
1 2021-08-27 30
2 2021-07-22 15
2 2021-07-22 25
1 2021-06-30 40
3 2021-10-11 150
2 2021-08-03 15
1 2021-07-02 90
I want to group by id and return the difference in total value between two 90-day periods.
Specifically, I want the total for the 90 days ending today, and the total for the 90 days ending 30 days ago.
For example, considering today is 2021-10-13, I would like to get:
the sum of all values per id between 2021-10-13 and 2021-07-15
the sum of all values per id between 2021-09-13 and 2021-06-15
And finally, subtract them to get the variation.
I've already managed to calculate it by creating separate temporary dataframes that contain only the dates in each 90-day period, grouping each by id, and then merging these temporary dataframes into a final one.
But I guess there should be an easier or simpler way to do it. I appreciate any help!
By the way, sorry if the explanation was a little messy.
If I understood correctly, you need something like this:
import pandas as pd
import datetime
## Compute the boundary dates we are going to need.
today = datetime.datetime.now()
delta = datetime.timedelta(days = 120)
# Date 120 days ago
hundredTwentyDaysAgo = today - delta
delta = datetime.timedelta(days = 90)
# Date 90 days ago
ninetyDaysAgo = today - delta
delta = datetime.timedelta(days = 30)
# Date 30 days ago
thirtyDaysAgo = today - delta
## Initializing an example df.
df = pd.DataFrame({"id":[1,2,1,1,2,2,1,3,2,1],
"date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22", "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
"value": [100,20,50,30,15,25,40,150,15,90]})
## Casting date column
df['date'] = pd.to_datetime(df['date']).dt.date
grouped = df.groupby('id')
# Sum of last 90 days per id
ninetySum = grouped.apply(lambda x: x[x['date'] >= ninetyDaysAgo.date()]['value'].sum())
# Sum per id over the 90-day window that ends 30 days ago
hundredTwentySum = grouped.apply(lambda x: x[(x['date'] >= hundredTwentyDaysAgo.date()) & (x['date'] <= thirtyDaysAgo.date())]['value'].sum())
The output (with today being 2021-10-13, as in the question) is
ninetySum - hundredTwentySum
id
1 -130
2 20
3 150
dtype: int64
You can double-check that these are the numbers you wanted by printing the ninetySum and hundredTwentySum variables.
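For comparison, here is a sketch of the same calculation with a fixed reference date, so the result does not change depending on when the script is run. The helper window_sum and the variable names are made up for this illustration, and both ends of each window are treated as inclusive, which is an assumption about what the question intends.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1, 1, 2, 2, 1, 3, 2, 1],
    "date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27",
             "2021-07-22", "2021-07-22", "2021-06-30", "2021-10-11",
             "2021-08-03", "2021-07-02"],
    "value": [100, 20, 50, 30, 15, 25, 40, 150, 15, 90],
})
df["date"] = pd.to_datetime(df["date"])

# Fixed "today" taken from the question, for reproducibility.
today = pd.Timestamp("2021-10-13")


def window_sum(frame, start, end):
    """Sum of value per id for rows with start <= date <= end."""
    mask = frame["date"].between(start, end)
    return frame.loc[mask].groupby("id")["value"].sum()


recent = window_sum(df, today - pd.Timedelta(days=90), today)
earlier = window_sum(df,
                     today - pd.Timedelta(days=120),
                     today - pd.Timedelta(days=30))

# fill_value=0 keeps ids that appear in only one of the two windows.
variation = recent.sub(earlier, fill_value=0)
print(variation)
```

Filtering first and grouping once per window avoids calling a Python lambda for every group, and sub with fill_value=0 keeps an id such as 3 that has no rows in the earlier window.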
I would like to calculate the difference in days between every date in the "last_review" column and
2018-08-01. I want the output in whole days; for example, if the observation is 2018-07-31, the output should be 2. This should be done for every observation in the DataFrame column, so the output should have shape
48894 x 1.
You can do it like so:
df['last_review'] = pd.to_datetime(df['last_review'])
df['num_days'] = pd.to_datetime("2019-08-01") - df['last_review']
Output:
last_review num_days
0 2018-10-19 286 days
1 2019-05-21 72 days
2 2011-03-28 3048 days
You can use:
from datetime import datetime
sub_date = datetime(2018,8,1)
df['last_review'] = pd.to_datetime(df['last_review'])
df['diff'] = (sub_date - df['last_review']).dt.days
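A small self-contained sketch of the .dt.days approach, using made-up sample dates (the real last_review column is not shown in the question):

```python
import pandas as pd

# Hypothetical sample data standing in for the real column.
df = pd.DataFrame({"last_review": ["2018-07-31", "2018-07-01", "2017-08-01"]})

ref = pd.Timestamp("2018-08-01")
df["last_review"] = pd.to_datetime(df["last_review"])

# .dt.days turns the timedelta column into plain integers.
df["num_days"] = (ref - df["last_review"]).dt.days
print(df)
```

Note that plain subtraction gives 1 for 2018-07-31. If the count is meant to include both the review date and the reference date, which the question's expected value of 2 may imply, add 1 to the result.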
I have a column with timedelta and I would like to create an extra column extracting the hour and minute from the timedelta column.
df
time_delta hour_minute
02:51:21.401000 2h:51min
03:10:32.401000 3h:10min
08:46:43.401000 08h:46min
This is what I have tried so far:
df['rh'] = df.time_delta.apply(lambda x: round(pd.Timedelta(x).total_seconds() \
% 86400.0 / 3600.0) )
Unfortunately, I'm not quite sure how to extract the minutes without incl. the hour
Use Series.dt.components to get the hours and minutes, then join them together:
td = pd.to_timedelta(df.time_delta).dt.components
df['rh'] = (td.hours.astype(str).str.zfill(2) + 'h:' +
td.minutes.astype(str).str.zfill(2) + 'min')
print (df)
time_delta hour_minute rh
0 02:51:21.401000 2h:51min 02h:51min
1 03:10:32.401000 3h:10min 03h:10min
2 08:46:43.401000 08h:46min 08h:46min
If the hour values can exceed 24 hours, you also need to add the days:
print (df)
time_delta hour_minute
0 02:51:21.401000 2h:51min
1 03:10:32.401000 3h:10min
2 28:46:43.401000 28h:46min
td = pd.to_timedelta(df.time_delta).dt.components
print (td)
days hours minutes seconds milliseconds microseconds nanoseconds
0 0 2 51 21 401 0 0
1 0 3 10 32 401 0 0
2 1 4 46 43 401 0 0
df['rh'] = ((td.days * 24 + td.hours).astype(str).str.zfill(2) + 'h:' +
td.minutes.astype(str).str.zfill(2) + 'min')
print (df)
time_delta hour_minute rh
0 02:51:21.401000 2h:51min 02h:51min
1 03:10:32.401000 3h:10min 03h:10min
2 28:46:43.401000 28h:46min 28h:46min
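An alternative sketch that avoids the separate days and hours components by working from the total number of seconds. The variable names are illustrative, and the sample values are the ones from the example above.

```python
import pandas as pd

s = pd.to_timedelta(
    pd.Series(["02:51:21.401000", "03:10:32.401000", "28:46:43.401000"])
)

# Whole minutes elapsed; seconds and fractions are dropped.
total_minutes = s.dt.total_seconds() // 60

hours = (total_minutes // 60).astype(int)    # may exceed 23
minutes = (total_minutes % 60).astype(int)

labels = (hours.astype(str).str.zfill(2) + "h:"
          + minutes.astype(str).str.zfill(2) + "min")
print(labels)
```

Because the hours come from the total duration, values longer than a day need no special handling.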
See also this post which defines the function
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)
Then, e.g.
strfdelta(pd.Timedelta('02:51:21.401000'), '{hours}h:{minutes}min')
gives '2h:51min'.
For your full dataframe
df['rh'] = df.time_delta.apply(lambda x: strfdelta(pd.Timedelta(x), '{hours}h:{minutes}min'))
I would love some guidance on how I can compare the same dates over different years. I have daily mean temperature data for all March days from 1997-2018, and my goal is to see the mean temperature of each day over my time period. My df is simple, and the head and tail look like the following:
IndexType = Datetime
Date temp
1997-03-01 6.00
1997-03-02 6.22
1997-03-03 6.03
1997-03-04 4.41
1997-03-05 5.29
Date temp
2018-03-27 -2.44
2018-03-28 -1.01
2018-03-29 -1.08
2018-03-30 -0.53
2018-03-31 -0.11
I imagine the goal could be either 1) a DataFrame with days as the index and years as columns, or 2) a Series with days as the index and the average daily temperature over 1997-2018 as values.
My code:
df = pd.read_csv(file, sep=';', skiprows=9, usecols=[0, 1, 2, 3], parse_dates=[['Datum', 'Tid (UTC)']], index_col=0)
print(df.head())
df.columns = ['temp']
df.index.names = ['Date']
df_mar = df.loc[df.index.month == 3]
df_mar = df_mar.resample('D').mean().round(2)
You can use groupby to see lots of comparisons. Not sure if that's exactly what you're looking for?
Make sure your date column is a Timestamp.
import pandas as pd
df = df.reset_index(drop=False)
df['Date'] = pd.to_datetime(df['Date'])
I'll initialize a dataframe to practice on:
import datetime
import random
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 100000)]
df = pd.DataFrame({'date':date_list, 'temp':[random.randint(-30, 100) for x in range(100000)]})
march = df[df['date'].dt.month == 3]
g = march.groupby(march['date'].dt.day).agg({'temp':['max', 'min', 'mean']})
Alternatively, you can do this across your whole DataFrame, not just March:
df.groupby(df['date'].dt.month).agg({'temp':['max', 'min', 'mean', 'nunique']})
temp
max min mean nunique
date
1 100 -30 34.999765 131
2 100 -30 35.167485 131
3 100 -30 35.660215 131
4 100 -30 34.436264 131
5 100 -30 35.424371 131
6 100 -30 35.086253 131
7 100 -30 35.188133 131
8 100 -30 34.772781 131
9 100 -30 34.839173 131
10 100 -30 35.248528 131
11 100 -30 34.666302 131
12 100 -30 34.575583 131
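To connect this back to the two layouts mentioned in the question, here is a rough sketch on synthetic data (random temperatures for March 1997-1999, not the asker's file). The column names day and year are helpers added for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one value per March day for three years.
idx = pd.date_range("1997-03-01", "1999-03-31", freq="D")
idx = idx[idx.month == 3]
rng = np.random.default_rng(0)
df = pd.DataFrame({"temp": rng.normal(2.0, 4.0, len(idx)).round(2)},
                  index=idx)

df["day"] = df.index.day
df["year"] = df.index.year

# 1) Days as the index, years as columns.
by_year = df.pivot_table(index="day", columns="year", values="temp")

# 2) Mean temperature for each day of March across all years.
daily_mean = df.groupby("day")["temp"].mean()

print(by_year.head())
print(daily_mean.head())
```

With one observation per day and year, the row-wise mean of by_year matches daily_mean, so either layout can be derived from the other.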
I want to convert a column in my dataset from hh:mm format to minutes. I tried the following code, but it raises "AttributeError: 'Series' object has no attribute 'split'". The data is in the following format. I also have NaN values in the dataset, and the plan is to compute the median of the values and then fill the rows that have NaN with the median.
02:32
02:14
02:31
02:15
02:28
02:15
02:22
02:16
02:22
02:14
I have tried this so far
s = dataset['Enroute_time_(hh mm)']
hours, minutes = s.split(':')
int(hours) * 60 + int(minutes)
I suggest you avoid row-wise calculations. You can use a vectorised approach with Pandas / NumPy:
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': ['02:32', '02:14', '02:31', '02:15', '02:28', '02:15',
'02:22', '02:16', '02:22', '02:14', np.nan]})
values = df['time'].fillna('00:00').str.split(':', expand=True).astype(int)
factors = np.array([60, 1])
df['mins'] = (values * factors).sum(1)
print(df)
time mins
0 02:32 152
1 02:14 134
2 02:31 151
3 02:15 135
4 02:28 148
5 02:15 135
6 02:22 142
7 02:16 136
8 02:22 142
9 02:14 134
10 NaN 0
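Filling the missing value with '00:00' produces 0 minutes, which would pull the median down if it were computed afterwards. Since the question asks for a median fill, a possible variant is sketched below: it keeps NaN through the conversion and fills it at the end. The shortened sample data and the intermediate name parsed are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": ["02:32", "02:14", np.nan, "02:15"]})

# Missing strings become NaT, so .dt.hour and .dt.minute give NaN there.
parsed = pd.to_datetime(df["time"], format="%H:%M")
df["mins"] = parsed.dt.hour * 60 + parsed.dt.minute

# median() skips NaN by default, so it uses only the observed values.
df["mins"] = df["mins"].fillna(df["mins"].median())
print(df)
```

The resulting column is float because it held NaN during the computation; cast with astype(int) afterwards if integers are needed.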
If you want to use split, you will need to use the str accessor, i.e. s.str.split(':').
However, I think that in this case it makes more sense to use apply:
import pandas as pd
df = pd.DataFrame({'Enroute_time_(hh mm)': ['02:32', '02:14', '02:31',
'02:15', '02:28', '02:15',
'02:22', '02:16', '02:22', '02:14']})
def convert_to_minutes(value):
    hours, minutes = value.split(':')
    return int(hours) * 60 + int(minutes)
df['Enroute_time_(hh mm)'] = df['Enroute_time_(hh mm)'].apply(convert_to_minutes)
print(df)
# Enroute_time_(hh mm)
# 0 152
# 1 134
# 2 151
# 3 135
# 4 148
# 5 135
# 6 142
# 7 136
# 8 142
# 9 134
As I understand it, you have a DataFrame column with multiple timedeltas stored as strings. You want to extract the total minutes from each one and then fill the NaN values with the median of the total minutes.
import pandas as pd
df = pd.DataFrame(
{'hhmm' : ['02:32',
'02:14',
'02:31',
'02:15',
'02:28',
'02:15',
'02:22',
'02:16',
'02:22',
'02:14']})
Your timedeltas are not actual Timedelta objects; they are strings, so you need to convert them first.
df.hhmm = pd.to_datetime(df.hhmm, format='%H:%M')
df.hhmm = pd.to_timedelta(df.hhmm - pd.Timestamp(1900, 1, 1))
This gives you the following values (Note the dtype: timedelta64[ns] here)
0 02:32:00
1 02:14:00
2 02:31:00
3 02:15:00
4 02:28:00
5 02:15:00
6 02:22:00
7 02:16:00
8 02:22:00
9 02:14:00
Name: hhmm, dtype: timedelta64[ns]
Now that you have true timedeltas, you can use some cool functions like total_seconds() and then calculate the minutes.
df.hhmm.dt.total_seconds() / 60
If that is not what you wanted, you can also use the following.
df.hhmm.dt.components.minutes
This gives you the minutes from the HH:MM string, as if you had split it.
Fill the NaN values with the median of the minutes.
minutes = df.hhmm.dt.total_seconds() / 60
minutes.fillna(minutes.median())
or
minutes = df.hhmm.dt.components.minutes
minutes.fillna(minutes.median())