Convert day-time column to integer - python

I've looked everywhere for a solution to this issue but nothing seems to work.
I have a column in my dataframe df_jan
459984 0
451375 0
660585 0
722735 78 days 00:00:00
448295 0
...
585781 4 days 00:00:00
612351 22 days 00:00:00
631985 16 days 00:00:00
462341 0
450073 0
Name: delta_sale, Length: 12978, dtype: object
I want to change it so that it is simply the integer value of days.
I've tried the following:
pd.to_datetime()
df_jan['delta_sale'] / np.timedelta64(1, 'D')
.astype(int)
However, none of them have worked and I'm struggling to find any other questions that have the same issue. All I'm trying to achieve is this,
459984 0
451375 0
660585 0
722735 78
448295 0
...
585781 4
612351 22
631985 16
462341 0
450073 0
Name: delta_sale, Length: 12978, dtype: int
Any help would be greatly appreciated.

You can use .apply() in combination with a short anonymous function, lambda x: x.day.
import pandas as pd
df = pd.DataFrame({'date': [pd.Timestamp.now(), pd.Timestamp.now()]})
df['date'].apply(lambda x: x.day)
This yields (because today is the 16th)
0 16
1 16
Name: date, dtype: int64
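As a hedged follow-up for the column in the question, which appears to mix plain integer zeros with Timedelta objects (an assumption based on the object dtype shown), the same .apply() idea works if you read .days from the Timedelta entries and pass the zeros through unchanged. A minimal sketch:
import pandas as pd

# stand-in for the delta_sale column: plain int zeros mixed with Timedelta objects
s = pd.Series([0, pd.Timedelta('78 days'), pd.Timedelta('4 days'), 0], name='delta_sale')

# Timedelta exposes .days; the plain zeros are kept as ints
days = s.apply(lambda x: x.days if isinstance(x, pd.Timedelta) else int(x))
print(days)  # dtype: int64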

Related

Convert multiple time format object as datetime format

I have a dataframe with a column of time values stored as object and need to convert them to datetime. The issue is that they are not all in the same format, so when I try:
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M:%S')
it gives me an error
ValueError: time data '3:22' does not match format '%H:%M:%S' (match)
or if I use this code
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M')
I get this error
ValueError: unconverted data remains: :58
These are the values in my data:
Total call time
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
**45:48**
1:41:40
5:08:37
**3:22**
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58
times = """\
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
45:48
1:41:40
5:08:37
3:22
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58""".split()
import pandas as pd
df = pd.DataFrame(times, columns=['elapsed'])
def pad(s):
    if len(s) == 4:
        return '00:0' + s
    elif len(s) == 5:
        return '00:' + s
    return s
print(pd.to_timedelta(df['elapsed'].apply(pad)))
Output:
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 00:03:22
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58
Name: elapsed, dtype: timedelta64[ns]
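As a small follow-up sketch (assuming the timedelta Series produced above; the call_seconds column name is purely illustrative), a plain numeric column is sometimes handier than timedeltas for filtering or aggregation, and .dt.total_seconds() gives that directly:
td = pd.to_timedelta(df['elapsed'].apply(pad))
df['call_seconds'] = td.dt.total_seconds().astype(int)  # e.g. 2:04:07 -> 7447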
As an alternative to grovina's answer, instead of using apply you can use the dt accessor directly.
Here's a sample:
>>> data = [['2017-12-01'], ['2017-12-30'], ['2018-01-01']]
>>> df = pd.DataFrame(data=data, columns=['date'])
>>> df
date
0 2017-12-01
1 2017-12-30
2 2018-01-01
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: object
Note how df.date is an object? Let's turn it into a date like you want
>>> df.date = pd.to_datetime(df.date)
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: datetime64[ns]
The format you want is for string formatting. I don't think you'll be able to convert the actual datetime64 to look like that format. For now, let's make a newly formatted string version of your date in a separate column
>>> df['new_formatted_date'] = df.date.dt.strftime('%d/%m/%y %H:%M')
>>> df.new_formatted_date
0 01/12/17 00:00
1 30/12/17 00:00
2 01/01/18 00:00
Name: new_formatted_date, dtype: object
Finally, since the df.date column is now of type datetime64, you can use the dt accessor right on it. No need to use apply:
>>> df['month'] = df.date.dt.month
>>> df['day'] = df.date.dt.day
>>> df['year'] = df.date.dt.year
>>> df['hour'] = df.date.dt.hour
>>> df['minute'] = df.date.dt.minute
>>> df
         date new_formatted_date  month  day  year  hour  minute
0  2017-12-01     01/12/17 00:00     12    1  2017     0       0
1  2017-12-30     30/12/17 00:00     12   30  2017     0       0
2  2018-01-01     01/01/18 00:00      1    1  2018     0       0
Another idea is to test whether the value contains two colons; if it does not, adjust it before converting with to_timedelta: when the number before the first colon is greater than 23 the value is treated as MM:SS and '00:' is prepended, otherwise it is treated as HH:MM and ':00' is appended:
m1 = df['Total call time'].str.count(':').ne(2)
m2 = df['Total call time'].str.extract(r'^(\d+):', expand=False).astype(float).gt(23)

s = np.select([m1 & m2, m1 & ~m2],
              ['00:' + df['Total call time'], df['Total call time'] + ':00'],
              df['Total call time'])
df['Total call time'] = pd.to_timedelta(s)
print(df)
Total call time
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 03:22:00
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58

pandas to_datetime convert datetime string to 0

I have a column in a df which contains datetime strings,
inv_date
24/01/2008
15/06/2007 14:55:22
08/06/2007 18:26:12
15/08/2007 14:53:25
15/02/2008
07/03/2007
13/08/2007
I used pd.to_datetime with format %d%m%Y for converting the strings into datetime values;
pd.to_datetime(df.inv_date, errors='coerce', format='%d%m%Y')
I got
inv_date
24/01/2008
0
0
0
15/02/2008
07/03/2007
13/08/2007
The format is inferred from inv_date as the most common datetime format; I am wondering how to avoid converting 15/06/2007 14:55:22, 08/06/2007 18:26:12 and 15/08/2007 14:53:25 to 0s, and instead get 15/06/2007, 08/06/2007 and 15/08/2007.
Use the regular pd.to_datetime call then use .dt.date:
>>> pd.to_datetime(df.inv_date).dt.date
0 2008-01-24
1 2007-06-15
2 2007-08-06
3 2007-08-15
4 2008-02-15
5 2007-07-03
6 2007-08-13
Name: inv_date, dtype: object
>>>
Or, as @ChrisA mentioned, you can also use the following (the inferred pandas format is already fine here, so the explicit format is skipped):
>>> pd.to_datetime(df.inv_date.str[:10], errors='coerce')
0 2008-01-24
1 2007-06-15
2 2007-08-06
3 2007-08-15
4 2008-02-15
5 2007-07-03
6 2007-08-13
Name: inv_date, dtype: object
>>>
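One caveat worth adding as a hedged note (an assumption, since the question does not say the dates are day-first, although values like 24/01/2008 and 13/08/2007 only make sense that way): passing dayfirst=True stops pandas from flipping ambiguous values such as 08/06/2007 into August 6th; with pandas 2.x you may additionally need format='mixed', because the column mixes date-only and date-time strings:
>>> pd.to_datetime(df.inv_date, dayfirst=True).dt.date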
You can also try this:
df = pd.read_csv('myfile.csv', parse_dates=['inv_date'], dayfirst=True)
df['inv_date'].dt.strftime('%d/%m/%Y')
0 24/01/2008
1 15/06/2007
2 08/06/2007
3 15/08/2007
4 15/02/2008
5 07/03/2007
6 13/08/2007
Hope this will help too.

How can I simplify adding columns with certain values to my dataframe?

I have a big dataframe (more than 900000 rows) and want to add some columns depending on the first column (Timestamp with date and time). My code works, but I guess it's far too complicated and slow. I'm a beginner so help would be appreciated! Thanks!
df['seconds_midnight'] = 0
df['weekday'] = 0
df['month'] = 0
def date_to_new_columns(date_var, i):
    sec_after_midnight = dt.timedelta(hours=date_var.hour, minutes=date_var.minute, seconds=date_var.second).total_seconds()
    weekday = dt.date.isoweekday(date_var)
    month1 = date_var.month
    df.iloc[i, 24] = sec_after_midnight
    df.iloc[i, 25] = weekday
    df.iloc[i, 26] = month1
    return

for i in range(0, 903308):
    date_to_new_columns(df.timestamp.iloc[i], i)
So the reason this is slow is the for loop processing each row individually. One thing that makes pandas so nice is that you can quickly process whole columns/dataframes in one operation.
So create all the rows for each new column at the same time:
def date_to_new_columns(df):
    df['sec_after_midnight'] = (df.timestamp - df.timestamp.dt.normalize()).dt.seconds
    df['weekday'] = df.timestamp.dt.day_name()
    df['month1'] = df.timestamp.dt.month
    return
Note that the dt.day_name method is called dt.weekday_name prior to pandas version 0.23.0.
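The original loop stored numeric values (seconds after midnight, an isoweekday number and a month number); if you want the numeric weekday rather than the name, a one-line sketch of the vectorized equivalent (dt.dayofweek counts Monday as 0, so adding 1 matches isoweekday):
df['weekday'] = df.timestamp.dt.dayofweek + 1  # Monday = 1 ... Sunday = 7, like isoweekday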
If the column is a datetime64/Timestamp column you can use the .dt accessor:
In [11]: df = pd.DataFrame(pd.date_range('2019-01-23', periods=3), columns=['date'])
In [12]: df
Out[12]:
date
0 2019-01-23
1 2019-01-24
2 2019-01-25
In [13]: df.date - df.date.dt.normalize() # timedelta since midnight
Out[13]:
0 0 days
1 0 days
2 0 days
Name: date, dtype: timedelta64[ns]
In [14]: (df.date - df.date.dt.normalize()).dt.seconds # seconds since midnight
Out[14]:
0 0
1 0
2 0
Name: date, dtype: int64
In [15]: df.date.dt.day_name()
Out[15]:
0 Wednesday
1 Thursday
2 Friday
Name: date, dtype: object
In [16]: df.date.dt.month_name()
Out[16]:
0 January
1 January
2 January
Name: date, dtype: object

Pandas and DateTime TypeError: cannot compare a TimedeltaIndex with type float

I have a pandas DataFrame Series of time differences that looks like:
print(delta_t)
1 0 days 00:00:59
3 0 days 00:04:22
6 0 days 00:00:56
8 0 days 00:01:21
19 0 days 00:01:09
22 0 days 00:00:36
...
(the full DataFrame had a bunch of NaNs which I dropped).
I'd like to know which delta_t's are less than 1 day, 1 hour, 1 minute,
so I tried:
delta_t_lt1day = delta_t[np.where(delta_t < 30.)]
but then got a:
TypeError: cannot compare a TimedeltaIndex with type float
Little help?!?!
Assuming your Series is in timedelta format, you can skip the np.where, and index using something like this, where you compare your actual values to other timedeltas, using the appropriate units:
delta_t_lt1day = delta_t[delta_t < pd.Timedelta(1,'D')]
delta_t_lt1hour = delta_t[delta_t < pd.Timedelta(1,'h')]
delta_t_lt1minute = delta_t[delta_t < pd.Timedelta(1,'m')]
You'll get the following series:
>>> delta_t_lt1day
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1hour
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1minute
0
1 00:00:59
6 00:00:56
22 00:00:36
Name: 1, dtype: timedelta64[ns]
You could use the Timedelta class:
import pandas as pd
deltas = pd.to_timedelta(['0 days 00:00:59',
'0 days 00:04:22',
'0 days 00:00:56',
'0 days 00:01:21',
'0 days 00:01:09',
'0 days 00:31:09',
'0 days 00:00:36'])
for e in deltas[deltas < pd.Timedelta(value=30, unit='m')]:
    print(e)
Output
0 days 00:00:59
0 days 00:04:22
0 days 00:00:56
0 days 00:01:21
0 days 00:01:09
0 days 00:00:36
Note that this filters out '0 days 00:31:09' as expected. The expression pd.Timedelta(value=30, unit='m') creates a time delta of 30 minutes.
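For completeness, a hedged note on the original attempt: the TypeError comes from comparing timedeltas with a bare float (30.). Comparing against a numpy timedelta with an explicit unit also works, if you prefer to stay in numpy:
import numpy as np
delta_t_lt1day = delta_t[delta_t < np.timedelta64(1, 'D')]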

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate people's ages based on the time difference between 'entry_date' and 'dob', and to do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
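Applied to the columns in the question, a minimal sketch of the age calculation (assuming entry_date and dob are both datetime64 columns, as stated; the age column name is just illustrative):
days = (munged_data.entry_date - munged_data.dob).dt.days
munged_data['age'] = (days / 365.25).round()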
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12)
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494
