I have a dataframe with a 4 digit int column:
df['time'].head(10)
0 1844
1 2151
2 1341
3 2252
4 2252
5 1216
6 2334
7 2247
8 2237
9 1651
Name: DepTime, dtype: int64
I have verified that max is 2400 and min is 1. I would like to convert this to a date time column with hours and minutes. How would I do that?
If these are 4 digits, timedelta is more appropriate than datetime:
pd.to_timedelta(df['time']//100 * 60 + df['time'] % 100, unit='m')
Output:
0 18:44:00
1 21:51:00
2 13:41:00
3 22:52:00
4 22:52:00
5 12:16:00
6 23:34:00
7 22:47:00
8 22:37:00
9 16:51:00
Name: time, dtype: timedelta64[ns]
If you have another column date, you may want to merge date and time to create a datetime column.
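A minimal sketch of that merge, assuming a hypothetical `date` column of ISO date strings (the question only shows `time`, so the sample values here are made up). The HHMM integer is turned into a timedelta as above and added to midnight of each date:

```python
import pandas as pd

# Hypothetical frame: the 'date' column and all values are assumptions.
df = pd.DataFrame({
    "date": ["2019-07-01", "2019-07-01", "2019-07-02"],
    "time": [1844, 5, 2400],
})

# HHMM integer -> minutes since midnight -> timedelta
offset = pd.to_timedelta(df["time"] // 100 * 60 + df["time"] % 100, unit="m")

# Midnight of each date plus the offset gives a full timestamp.
df["datetime"] = pd.to_datetime(df["date"]) + offset
print(df)
```

With this arithmetic a value of 2400 becomes a 24 hour offset, so it lands on midnight of the following day rather than raising an error.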
If I understand correctly, you can parse the values as strings and format them as HH:MM:
pd.to_datetime(df.time.astype(str),format='%H%M').dt.strftime('%H:%M')
Out[324]:
0 21:51
1 13:41
2 22:52
3 22:52
4 12:16
5 23:34
6 22:47
7 22:37
8 16:51
Name: col2, dtype: object
Try this!
# zero-pad first so that e.g. 216 is read as 02:16 rather than 21:06
df['conversion'] = pd.to_datetime(df['time'].astype(str).str.zfill(4), format='%H%M').dt.strftime('%H:%M')
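Since the question says the column ranges from 1 to 2400, the '%H%M' approaches need some care at the edges. The following is only a sketch with made-up values: it pads to four digits and, as an assumption about what you want, lets 2400 become missing instead of raising.

```python
import pandas as pd

times = pd.Series([1844, 216, 1, 2400], name="time")

# Pad to four digits so that 216 reads as 02:16 and 1 as 00:01
padded = times.astype(str).str.zfill(4)

# '%H' only accepts hours 00-23, so '2400' cannot be parsed;
# errors='coerce' turns it into NaT instead of raising
parsed = pd.to_datetime(padded, format="%H%M", errors="coerce")

formatted = parsed.dt.strftime("%H:%M")
print(formatted)
```

If 2400 should count as midnight instead, you would have to replace it (for example with 0) before parsing; which meaning is right depends on your data.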
If you want the output as an HH:MM string, convert the column to zero-padded strings and insert ':' with str.slice_replace. (Note: I changed your sample to include a 3-digit integer.)
sample df:
time
0 1844
1 2151
2 1341
3 2252
4 2252
5 216
6 2334
7 2247
8 2237
9 1651
s = df['time'].map('{0:04}'.format)
out = s.str.slice_replace(2,2,':')
Out[666]:
0 18:44
1 21:51
2 13:41
3 22:52
4 22:52
5 02:16
6 23:34
7 22:47
8 22:37
9 16:51
Name: time, dtype: object
Or split and concat with :
s = df['time'].map('{0:04}'.format)
out = s.str[:2] + ':' + s.str[2:]
Out[665]:
0 18:44
1 21:51
2 13:41
3 22:52
4 22:52
5 02:16
6 23:34
7 22:47
8 22:37
9 16:51
Name: time, dtype: object
I have a dataframe that contains a customerId and a date fromDate. For each customer individually, I want to calculate when the next delivery is. For example, the customer with customerId = 1 bought something on 2021-03-18. I would like to find that customer's next date (2021-03-22) and output the distance in days (4). In simple terms, I want to calculate "next date in the future minus fromDate", i.e. n - (n-1). If a date has no next date, the result should be None; e.g. 2022-01-18 should give None.
My problem is that I get a lot of None values, and my code does not look at each customer separately. How can I do this?
Mathematical with an example
n - (n-1) = next_day_in_days
e.g.
2021-03-22 - 2021-03-18 = 4
[OUT]
customerId fromDate next_day_in_days
1 1 2021-03-18 4
Dataframe
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
8 3 2021-05-17
9 3 2021-07-17
10 3 2021-02-22
11 3 2021-02-22
Code
import pandas as pd
import datetime
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17',
                  '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']
    }
df = pd.DataFrame(data=d)
print(df)
def nearest(items, pivot):
    try:
        return min(items, key=lambda x: abs(x - pivot))
    except:
        return None
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
df["next_day_in_days"] = df['fromDate'].apply(lambda x: nearest(df['fromDate'], x))
Output
[OUT]
customerId fromDate next_in_days
0 1 2021-02-22 None
1 1 2021-03-18 None
2 1 2021-03-22 None
3 1 2021-02-10 None
4 1 2021-09-07 None
5 1 NaT None
6 1 2022-01-18 None
7 2 2021-05-17 None
8 3 2021-05-17 None
9 3 2021-07-17 None
10 3 2021-02-22 None
11 3 2021-02-22 None
Name: next_in_days, dtype: object
What I want
customerId fromDate next_day_in_days
0 1 2021-02-22 24
1 1 2021-03-18 4
2 1 2021-03-22 169
3 1 2021-02-10 12
4 1 2021-09-07 133
5 1 NaT None
6 1 2022-01-18 None
7 2 2021-05-17 None
8 3 2021-05-17 61
9 3 2021-07-17 None
10 3 2021-02-22 84
11 3 2021-02-22 84
First sort by customerId and fromDate. Because a customer can have duplicated dates, drop the duplicates on the same two columns before taking the difference; then DataFrameGroupBy.diff(-1) with Series.dt.days gives the gap to the next date:
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df = df.sort_values(['customerId','fromDate'])
df['next_day_in_days'] = (df.drop_duplicates(['customerId','fromDate'])
.groupby('customerId')['fromDate']
.diff(-1)
.dt.days
.abs())
Restore the original index order if necessary:
df = df.sort_index()
Finally, copy the value to the duplicated rows per ['customerId', 'fromDate'] (here the last value, 84.0) with GroupBy.ffill:
df['next_day_in_days'] = df.groupby(['customerId', 'fromDate'])['next_day_in_days'].ffill()
print (df)
customerId fromDate next_day_in_days
0 1 2021-02-22 24.0
1 1 2021-03-18 4.0
2 1 2021-03-22 169.0
3 1 2021-02-10 12.0
4 1 2021-09-07 133.0
5 1 NaT NaN
6 1 2022-01-18 NaN
7 2 2021-05-17 NaN
8 3 2021-05-17 61.0
9 3 2021-07-17 NaN
10 3 2021-02-22 84.0
11 3 2021-02-22 84.0
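For comparison, here is a sketch of a variant that swaps diff(-1) for shift(-1) and uses a left merge instead of ffill to copy the gap onto duplicated dates. It is not the method above, just another way to reach the same numbers; the helper names `unique`, `next_date` and `out` are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
    "fromDate": ["2021-02-22", "2021-03-18", "2021-03-22", "2021-02-10",
                 "2021-09-07", None, "2022-01-18", "2021-05-17",
                 "2021-05-17", "2021-07-17", "2021-02-22", "2021-02-22"],
})
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")

# Unique, sorted dates per customer; rows without a date are left out here
unique = (df.dropna(subset=["fromDate"])
            .drop_duplicates(["customerId", "fromDate"])
            .sort_values(["customerId", "fromDate"]))

# shift(-1) within each customer yields the following date (NaT for the last one)
next_date = unique.groupby("customerId")["fromDate"].shift(-1)
unique["next_day_in_days"] = (next_date - unique["fromDate"]).dt.days

# Left-merge back so duplicated dates get the same gap and the row order is kept
out = df.merge(unique, on=["customerId", "fromDate"], how="left")
print(out)
```

The merge keeps all twelve rows in their original order, and the row without a date simply finds no match and stays NaN.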
I want to turn the months of my dates into one running sequence across years. For example, I have a dataframe like this:
stuff_id date
1 2015-02-03
2 2015-03-03
3 2015-05-19
4 2015-10-13
5 2016-01-07
6 2016-03-20
I want to number the months of the dates sequentially. The desired output is:
stuff_id date month
1 2015-02-03 1
2 2015-03-03 2
3 2015-05-19 4
4 2015-10-13 9
5 2016-01-07 12
6 2016-03-20 14
That is, Feb 2015 is the first month in the date list, and Jan 2016 is the 12th month counting from Feb 2015.
If your date column is a datetime (if it's not, cast it to one), you can use the .dt.month and .dt.year properties for this!
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html
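Recast the column to datetime
(the text is read in as described in the answer to "Pasting data into a pandas dataframe")
>>> df = pd.read_table(io.StringIO(s), sep=r"\s+")  # s holds the text copied from SO; delim_whitespace is deprecated in recent pandas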
>>> df["date"] = pd.to_datetime(df["date"])
>>> df
stuff_id date
0 1 2015-02-03
1 2 2015-03-03
2 3 2015-05-19
3 4 2015-10-13
4 5 2016-01-07
5 6 2016-03-20
>>> df.dtypes
stuff_id int64
date datetime64[ns]
dtype: object
Convert year and month to an absolute month count (year * 12 + month), then subtract the smallest count so the numbering starts at 1:
>>> months = df["date"].dt.year * 12 + df["date"].dt.month # series
>>> df["months"] = months - min(months) + 1
>>> df
stuff_id date months
0 1 2015-02-03 1
1 2 2015-03-03 2
2 3 2015-05-19 4
3 4 2015-10-13 9
4 5 2016-01-07 12
5 6 2016-03-20 14
I am trying to filter my dataframe on the basis of the number of days. I want to keep the rows where it is more than 5 days.
x = df['gift_date'] - min(df['gift_date'])
The output I'm getting is:
6213959 196 days 00:01:45
6213960 196 days 00:01:48
6213961 197 days 00:01:49
6213962 196 days 00:01:48
6213963 196 days 00:01:48
6213964 197 days 00:01:50
Name: invitation_date, Length: 6213965, dtype: timedelta64[ns]
I only want number of days from this result.
Is there any other process?
You can use .dt.days for that, so in your case x = (df['gift_date'] - min(df['gift_date'])).dt.days:
In [69]: s = pd.Series(pd.timedelta_range(0, 1e15, periods=10))
In [70]: s
Out[70]:
0 0 days 00:00:00
1 1 days 06:51:51.111111
2 2 days 13:43:42.222222
3 3 days 20:35:33.333333
4 5 days 03:27:24.444444
5 6 days 10:19:15.555555
6 7 days 17:11:06.666666
7 9 days 00:02:57.777777
8 10 days 06:54:48.888888
9 11 days 13:46:40
dtype: timedelta64[ns]
In [71]: s.dt.days
Out[71]:
0 0
1 1
2 2
3 3
4 5
5 6
6 7
7 9
8 10
9 11
dtype: int64
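To get from there to the actual filter, something like the sketch below should work. The sample values are made up; only the column name gift_date comes from the question. It also shows that comparing whole days and comparing durations are not quite the same thing:

```python
import pandas as pd

# Hypothetical sample; only the column name 'gift_date' comes from the question.
df = pd.DataFrame({
    "gift_date": pd.to_datetime([
        "2020-01-01 08:00:00",
        "2020-01-03 09:30:00",
        "2020-01-06 07:59:59",
        "2020-01-06 20:00:00",
        "2020-01-10 12:00:00",
    ])
})

elapsed = df["gift_date"] - df["gift_date"].min()

# .dt.days is the whole-day component, so "5 days 12:00:00" counts as 5
by_whole_days = df[elapsed.dt.days > 5]

# Comparing against a Timedelta keeps the time-of-day part in the comparison
by_duration = df[elapsed > pd.Timedelta(days=5)]

print(by_whole_days)
print(by_duration)
```

Which of the two you want depends on whether "more than 5 days" means at least 6 full days or anything past the 5 day mark.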
In my dataframe there is a column containing values like this:
PowerPlayTimeOnIce
0:05
0:05
1:24
3:29
1:34
0
0:05
0
0
How do I convert these to floats?
This method didn't work:
df["powerPlayTimeOnIce"] = df["powerPlayTimeOnIce"].astype('float')
EDIT: Updated the data example to fit the problem better.
Using to_datetime
s=pd.to_datetime(df.PowerPlayTimeOnIce,format='%M:%S')
s.dt.minute*60+s.dt.second
Out[881]:
0 5
1 5
2 84
3 209
4 94
5 5
Name: PowerPlayTimeOnIce, dtype: int64
Update
s=pd.to_datetime(df.PowerPlayTimeOnIce,format='%M:%S',errors='coerce')
(s.dt.minute*60+s.dt.second).fillna(0)
Out[886]:
0 5.0
1 5.0
2 84.0
3 209.0
4 94.0
5 5.0
6 0.0
Name: PowerPlayTimeOnIce, dtype: float64
Data input
PowerPlayTimeOnIce
0 0:05
1 0:05
2 1:24
3 3:29
4 1:34
5 0:05
6 0
You could do something like this:
import pandas as pd
data = ['0:05',
'0:05',
'1:24',
'3:29',
'1:34',
'0:05']
def convert(s):
    minutes, seconds = map(int, s.split(":"))
    return 60 * minutes + seconds
df = pd.DataFrame(data=data, columns=['powerPlayTimeOnIce'])
print(df['powerPlayTimeOnIce'].apply(convert))
Output
0 5
1 5
2 84
3 209
4 94
5 5
Name: powerPlayTimeOnIce, dtype: int64
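The function above expects every value to contain a colon, so the bare 0 entries from the question would fail to unpack. A possible variant (a sketch only; `to_seconds` is a name I made up, and reading a colon-less value as whole minutes is my assumption) that also returns floats as asked:

```python
import pandas as pd

def to_seconds(value: str) -> float:
    """Convert 'M:SS' to seconds; a value without ':' is read as whole minutes (assumption)."""
    minutes, _, seconds = value.partition(":")
    # partition leaves seconds as '' when there is no colon, hence the fallback to 0
    return 60.0 * int(minutes) + int(seconds or 0)

df = pd.DataFrame({"powerPlayTimeOnIce": ["0:05", "1:24", "3:29", "0"]})
df["seconds"] = df["powerPlayTimeOnIce"].apply(to_seconds)
print(df)
```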
If you want a very verbose flow and you don't have a huge dataset, you could do:
df[['min', 'sec']] = df['powerPlayTimeOnIce'].str.split(':', expand=True)
df['min'] = df['min'].astype('int')
# a bare "0" has no seconds part, so the split leaves a missing value there
df['sec'] = df['sec'].fillna('0').astype('int')
df['diff_in_seconds'] = df['min'] * 60 + df['sec']
So I split your data into min and sec, and from there you can convert to whatever format you need.
You can use pd.to_timedelta and the .dt.total_seconds() method. First you need to format the strings properly (HH:mm:ss), because to_timedelta does not let you specify a format. Though perhaps not relevant for hockey times, this handles very large times without much trouble.
import pandas as pd
s = df.PowerPlayTimeOnIce.replace(':', '', regex=True).str.zfill(6)
pd.to_timedelta(s.str[0:-4]+':'+s.str[-4:-2]+':'+s.str[-2::]).dt.total_seconds()
Output:
0 5.0
1 5.0
2 84.0
3 209.0
4 94.0
5 5.0
6 0.0
7 446161.0
8 4046161.0
Name: PowerPlayTimeOnIce, dtype: float64
Input Data
PowerPlayTimeOnIce
0 0:05
1 0:05
2 1:24
3 3:29
4 1:34
5 0:05
6 0
7 123:56:01
8 1123:56:01
The datetime is given in the format YYYY-MM-DD HH:MM:SS in a dataframe. I want new Series for year, month and hour, for which I am trying the code below.
The problem is that Month and Hour are getting the same value; Year is fine.
Can anyone help me with this? I am using IPython notebook, pandas and NumPy.
Here is the code :
from datetime import datetime

def extract_hour(X):
    cnv = datetime.strptime(X, '%Y-%m-%d %H:%M:%S')
    return cnv.hour

def extract_month(X):
    cnv = datetime.strptime(X, '%Y-%m-%d %H:%M:%S')
    return cnv.month

def extract_year(X):
    cnv = datetime.strptime(X, '%Y-%m-%d %H:%M:%S')
    return cnv.year
#month column
train['Month']=train['datetime'].apply((lambda x: extract_month(x)))
test['Month']=test['datetime'].apply((lambda x: extract_month(x)))
#year column
train['Year']=train['datetime'].apply((lambda x: extract_year(x)))
test['Year']=test['datetime'].apply((lambda x: extract_year(x)))
#Hour column
train['Hour']=train['datetime'].apply((lambda x: extract_hour(x)))
test['Hour']=test['datetime'].apply((lambda x: extract_hour(x)))
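You can use the .dt accessors instead: train['datetime'].dt.month, train['datetime'].dt.year, train['datetime'].dt.hour (see the full list below). They only work on a datetime column, so if yours still holds strings, convert it first with pd.to_datetime.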
Demo:
In [81]: train = pd.DataFrame(pd.date_range('2016-01-01', freq='1999H', periods=10), columns=['datetime'])
In [82]: train
Out[82]:
datetime
0 2016-01-01 00:00:00
1 2016-03-24 07:00:00
2 2016-06-15 14:00:00
3 2016-09-06 21:00:00
4 2016-11-29 04:00:00
5 2017-02-20 11:00:00
6 2017-05-14 18:00:00
7 2017-08-06 01:00:00
8 2017-10-28 08:00:00
9 2018-01-19 15:00:00
In [83]: train.datetime.dt.year
Out[83]:
0 2016
1 2016
2 2016
3 2016
4 2016
5 2017
6 2017
7 2017
8 2017
9 2018
Name: datetime, dtype: int64
In [84]: train.datetime.dt.month
Out[84]:
0 1
1 3
2 6
3 9
4 11
5 2
6 5
7 8
8 10
9 1
Name: datetime, dtype: int64
In [85]: train.datetime.dt.hour
Out[85]:
0 0
1 7
2 14
3 21
4 4
5 11
6 18
7 1
8 8
9 15
Name: datetime, dtype: int64
In [86]: train.datetime.dt.day
Out[86]:
0 1
1 24
2 15
3 6
4 29
5 20
6 14
7 6
8 28
9 19
Name: datetime, dtype: int64
List of all .dt accessors:
In [77]: train.datetime.dt.
train.datetime.dt.ceil train.datetime.dt.hour train.datetime.dt.month train.datetime.dt.to_pydatetime
train.datetime.dt.date train.datetime.dt.is_month_end train.datetime.dt.nanosecond train.datetime.dt.tz
train.datetime.dt.day train.datetime.dt.is_month_start train.datetime.dt.normalize train.datetime.dt.tz_convert
train.datetime.dt.dayofweek train.datetime.dt.is_quarter_end train.datetime.dt.quarter train.datetime.dt.tz_localize
train.datetime.dt.dayofyear train.datetime.dt.is_quarter_start train.datetime.dt.round train.datetime.dt.week
train.datetime.dt.days_in_month train.datetime.dt.is_year_end train.datetime.dt.second train.datetime.dt.weekday
train.datetime.dt.daysinmonth train.datetime.dt.is_year_start train.datetime.dt.strftime train.datetime.dt.weekday_name
train.datetime.dt.floor train.datetime.dt.microsecond train.datetime.dt.time train.datetime.dt.weekofyear
train.datetime.dt.freq train.datetime.dt.minute train.datetime.dt.to_period train.datetime.dt.year
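Applied to the question, where the column holds strings, a sketch could look like this. The two sample rows are made up; the column names Year, Month and Hour are the ones from the question.

```python
import pandas as pd

# Hypothetical rows in the question's string format
train = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2012-07-15 18:30:00"]})

# Parse once; .dt only works on datetime64 columns, not on strings
parsed = pd.to_datetime(train["datetime"], format="%Y-%m-%d %H:%M:%S")

train["Year"] = parsed.dt.year
train["Month"] = parsed.dt.month
train["Hour"] = parsed.dt.hour
print(train)
```

This parses each value once instead of three times, and there is no per-row Python function involved. The same lines would apply to test.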