Changing a column with various string formats in pandas - python

I have been working on a dataframe where one of the columns (flight_time) contains flight durations. The strings come in 3 different formats, for example:
"07 h 05 m"
"13h 55m"
"2h 23m"
I would like to change them all to HH:MM format and finally change the data type from object to time.
Can somebody tell me how to do this?

It's not possible to have a time dtype. You can have a datetime64 (pd.DatetimeIndex) or a timedelta64 (pd.TimedeltaIndex). In your case, I think it's better to have a TimedeltaIndex so you can use the pd.to_timedelta function:
df['flight_time2'] = pd.to_timedelta(df['flight_time'])
print(df)
# Output
flight_time flight_time2
0 07 h 05 m 0 days 07:05:00
1 13h 55m 0 days 13:55:00
2 2h 23m 0 days 02:23:00
If you want individual datetime.time objects instead, use:
df['flight_time2'] = pd.to_datetime(df['flight_time'].str.findall(r'\d+')
                                    .str.join(':')).dt.time
print(df)
# Output
flight_time flight_time2
0 07 h 05 m 07:05:00
1 13h 55m 13:55:00
2 2h 23m 02:23:00
In this case, flight_time2 still has object dtype:
>>> df.dtypes
flight_time object
flight_time2 object
dtype: object
But each value is an instance of datetime.time:
>>> df.loc[0, 'flight_time2']
datetime.time(7, 5)
In the first case, you can use vectorized methods, while in the second this is not possible. Furthermore, you lose the .dt accessor.
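If the final goal really is an "HH:MM" string column, one possible follow-up (a sketch, assuming flight_time2 still holds the timedelta values from the first snippet; the flight_time_hhmm column name is made up) is to format the timedelta components yourself:
# Format the timedelta column as zero-padded "HH:MM" strings
# (assumes all durations are under 24 hours, as in the example data).
comp = df['flight_time2'].dt.components
df['flight_time_hhmm'] = (comp['hours'].astype(str).str.zfill(2) + ':'
                          + comp['minutes'].astype(str).str.zfill(2))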

Related

Issue in converting column to datetime in pandas

I have a csv and am reading it using the following code
df1 = pd.read_csv('dataDate.csv')
df1
Out[57]:
Date
0 01/01/2019
1 01/01/2019
2 01/01/2019
3 01/01/2019
4 01/01/2019
5 01/01/2019
Currently the column has dtype('O'). I am now running the following command to convert the dates to datetime in the format %d/%m/%Y:
df1.Date = pd.to_datetime(df1.Date, format='%d/%m/%Y')
It produces output as :
9 2019-01-01
35 2019-01-01
48 2019-01-01
38 2019-01-01
18 2019-01-01
36 2019-01-01
31 2019-01-01
6 2019-01-01
Not sure what is wrong here; I want the same format as the input for my process. Can anyone tell me what's wrong?
Thanks
The produced output is the default format for pandas' datetime objects, so there is nothing wrong. Still, you can play around with the format and produce a datetime string with the strftime method; this built-in Python method is also implemented in pandas.
You can try the following:
df1.Date = pd.to_datetime(df1.Date, format='%d/%m/%Y')
df1['my_date'] = df1.Date.dt.strftime('%d/%m/%Y')
Now the 'my_date' column has the desired format. You cannot do datetime operations with that column, but you can use it for display. Work with the Date column for your mathematical operations, etc., and represent the results with the my_date column.
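A quick dtype check (a small sketch, assuming the two lines above have been run) makes the split explicit:
print(df1.dtypes)
# Expected output, roughly:
# Date       datetime64[ns]
# my_date            object
# dtype: object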

calculate date difference between today's date and pandas date series

I want to calculate the difference in days between a pandas date series -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
and today's date.
I tried but could not come up with a logical solution.
Please help me with the code. I am new to Python, and a lot of syntax errors happen when I apply any function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns:
0 False
1 True
2 False
you could then use that to select the data
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
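If you only need the number of whole days rather than the full timedelta, the .dt accessor on the result gives it directly (a sketch; the values shown correspond to the example output above):
(data - pd.to_datetime("now")).dt.days
# 0   -122
# 1     24
# 2    -83
# dtype: int64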
You can try as follows:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime using to_datetime:
>>> df['col1'] = pd.to_datetime(df['col1'], infer_datetime_format=True)
Set the current datetime in order to then get the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
With numpy you can solve it as shown in difference-two-dates-days-weeks-months-years-pandas-python-2. Bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# for days use 'D' for weeks use 'W', for month use 'M' and for years use 'Y'
df['diff_days']=df['diff_days']/np.timedelta64(1,'D')
print(df)
If you want days as int and not as float, use:
df['diff_days']=df['diff_days']//np.timedelta64(1,'D')
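A minimal, self-contained sketch of that approach (the column names and dates here are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': pd.to_datetime(['2013-02-16', '2013-03-01']),
                   'second': pd.to_datetime(['2013-03-21', '2013-03-21'])})
df['diff_days'] = (df['second'] - df['first']) / np.timedelta64(1, 'D')   # float days
df['diff_weeks'] = (df['second'] - df['first']) / np.timedelta64(1, 'W')  # float weeks
print(df)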
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before but this suggests your pandas date series (a list-like object) is iterable and each element of this series is an instance of a class which has a to_datetime function.
Assuming my assumptions are correct, the following function would take such a list and return a list of timedeltas (a timedelta is an object representing the difference between two datetime objects).
from datetime import datetime

def convert(pandas_series):
    # get the current date
    now = datetime.now()
    # Use a list comprehension and the pandas to_datetime method to calculate timedeltas.
    return [now - pandas_element.to_datetime() for pandas_element in pandas_series]

# assuming 'some_pandas_series' is a list-like pandas series object
list_of_timedeltas = convert(some_pandas_series)
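For comparison, the same calculation can also be done without an explicit loop once the series holds datetime64 values (a sketch; some_pandas_series is the same placeholder as above):
date_series = pd.to_datetime(some_pandas_series)  # ensure datetime64 dtype
diff = pd.Timestamp.now() - date_series           # a timedelta64 series
days_old = diff.dt.days                            # difference in whole days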

groupby to find row with max value is converting object to datetime

I want to groupby two variables ['CIN','calendar'] and return the row of that group where the column MCelig is the largest in that specific group. It is likely that multiple rows will have the max value, but I only want one row.
for example:
AidCode CIN MCelig calendar
0 None 1e 1 2014-03-08
1 01 1e 2 2014-03-08
2 01 1e 3 2014-05-08
3 None 2e 4 2014-06-08
4 01 2e 5 2014-06-08
Since the first two rows are a group, I want the row where MCelig =2.
I came up with this line
test=dfx.groupby(['CIN','calendar'], group_keys=False).apply(lambda x: x.ix[x.MCelig.idxmax()])
and it seemed to work, except when a column has all 'None' or 'np.nan' values within a group; that column is then converted to a datetime! See the example below and watch AidCode go from an object to a date.
import datetime as DT
import numpy as np
import pandas as pd

d = {'CIN': pd.Series(['1e', '1e', '1e', '2e', '2e']),
     'AidCode': pd.Series([np.nan, '01', '01', np.nan, '01']),
     'calendar': pd.Series([DT.datetime(2014, 3, 8), DT.datetime(2014, 3, 8),
                            DT.datetime(2014, 5, 8), DT.datetime(2014, 6, 8),
                            DT.datetime(2014, 6, 8)]),
     'MCelig': pd.Series([1, 2, 3, 4, 5])}
dfx = pd.DataFrame(d)
#testing whether it was just the np.nan that was the problem, it isn't
#dfx = dfx.where((pd.notnull(dfx)), None)
test=dfx.groupby(['CIN','calendar'], group_keys=False).apply(lambda x: x.ix[x.MCelig.idxmax()])
output
Out[820]:
AidCode CIN MCelig calendar
CIN calendar
1e 2014-03-08 2015-01-01 1e 2 2014-03-08
2014-05-08 2015-01-01 1e 3 2014-05-08
2e 2014-06-08 2015-01-01 2e 5 2014-06-08
UPDATE:
just figured out this simple solution
x=dfx.sort(['CIN','calendar',"MCelig"]).groupby(["CIN",'calendar'], as_index=False).last();x
Since it works, I chose it for simplicity's sake.
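Note that DataFrame.sort was later removed from pandas; a sketch of the equivalent one-liner with sort_values (same logic, modern API):
x = (dfx.sort_values(['CIN', 'calendar', 'MCelig'])
        .groupby(['CIN', 'calendar'], as_index=False)
        .last())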
Pandas attempts to be extra helpful by recognizing columns that look like dates and converting the column to datetime64 dtype. It's being overly aggressive here.
A workaround would be to use transform to generate a boolean mask for each group which selects maximum rows:
def onemax(x):
    mask = np.zeros(len(x), dtype='bool')
    idx = np.argmax(x.values)
    mask[idx] = 1
    return mask

dfx.loc[dfx.groupby(['CIN','calendar'])['MCelig'].transform(onemax).astype(bool)]
yields
AidCode CIN MCelig calendar
1 01 1e 2 2014-03-08
2 01 1e 3 2014-05-08
4 01 2e 5 2014-06-08
Technical detail: When groupby-apply is used and the individual DataFrames (returned by the applied function) are glued back together into one DataFrame, Pandas tries to guess if columns with object dtype are date-like, and if so, converts the column to an actual date dtype. If the values are strings, it tries to parse them as dates using dateutil.parser:
For better or for worse, dateutil.parser interprets '01' as a date:
In [37]: import dateutil.parser as DP
In [38]: DP.parse('01')
Out[38]: datetime.datetime(2015, 1, 1, 0, 0)
This causes Pandas to attempt to convert the entire AidCode column into dates. Since no error occurs, it thinks it just helped you out :)
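As an aside, a shorter workaround that sidesteps the groupby-apply glue step entirely (a sketch; it uses plain .loc selection, so no dtype re-inference happens) is to select rows by the per-group idxmax:
dfx.loc[dfx.groupby(['CIN', 'calendar'])['MCelig'].idxmax()]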

Conversions of np.timedelta64 to days, weeks, months, etc

When I compute the difference between two pandas datetime64 dates I get np.timedelta64. Is there any easy way to convert these deltas into representations like hours, days, weeks, etc.?
I could not find any methods in np.timedelta64 that facilitate conversions between different units, but it looks like Pandas knows how to convert these units to days when printing timedeltas (e.g. I get 29 days, 23:20:00 in the string representation of dataframes). Any way to access this functionality?
Update:
Strangely, none of the following work:
df['column_with_times'].days
df['column_with_times'].apply(lambda x: x.days)
but this one does:
df['column_with_times'][0].days
pandas stores timedelta data in the numpy timedelta64[ns] type, but also provides the Timedelta type to wrap this for more convenience (e.g. to provide accessors for the days, hours, and other components).
In [41]: timedelta_col = pd.Series(pd.timedelta_range('1 days', periods=5, freq='2 h'))
In [42]: timedelta_col
Out[42]:
0 1 days 00:00:00
1 1 days 02:00:00
2 1 days 04:00:00
3 1 days 06:00:00
4 1 days 08:00:00
dtype: timedelta64[ns]
To access the different components of a full column (series), you have to use the .dt accessor. For example:
In [43]: timedelta_col.dt.components.hours
Out[43]:
0    0
1    2
2    4
3    6
4    8
Name: hours, dtype: int64
With timedelta_col.dt.components you get a frame with all the different components (days to nanoseconds) as different columns.
When accessing one value of the column above, this gives back a Timedelta, and on this you don't need to use the .dt accessor; you can access the components directly:
In [45]: timedelta_col[0]
Out[45]: Timedelta('1 days 00:00:00')
In [46]: timedelta_col[0].days
Out[46]: 1L
So the .dt accessor provides access to the attributes of the Timedelta scalar, but on the full column. That is the reason you see that df['column_with_times'][0].days works but df['column_with_times'].days does not.
The reason that df['column_with_times'].apply(lambda x: x.days) does not work is that apply is given the timedelta64 values (and not the Timedelta pandas type), and these don't have such attributes.
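To convert a whole timedelta column into a single unit (total hours, days, weeks, ...), a sketch using division by a timedelta of that unit:
timedelta_col / pd.Timedelta(hours=1)    # float hours
timedelta_col / pd.Timedelta(days=1)     # float days
timedelta_col.dt.total_seconds() / 3600  # equivalent, via total seconds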

Efficiently handling missing dates when aggregating Pandas Dataframe

Follow-up from "Summing across rows of Pandas Dataframe" and "Pandas Dataframe object types fillna exception over different datatypes".
I am aggregating one of the columns using
df.groupby(['stock', 'same1', 'same2'], as_index=False)['positions'].sum()
This method is not very forgiving if there is missing data. If any data is missing in same1, same2, etc., it pads totally unrelated values. A workaround is to do a fillna loop over the columns, replacing missing strings with '' and missing numbers with zero; this solves the problem.
I do however have one column with missing dates as well. The column type is 'object', with NaN (of type float) in the missing cells and datetime objects in the existing fields. It is important that I know the data is missing, i.e. the missing indicator must survive the groupby transformation.
Dataset outlining the problem:
csv file that I use as input is:
Date,Stock,Position,Expiry,same
2012/12/01,A,100,2013/06/01,AA
2012/12/01,A,200,2013/06/01,AA
2012/12/01,B,300,,BB
2012/6/01,C,400,2013/06/01,CC
2012/6/01,C,500,2013/06/01,CC
I then read in the file:
df = pd.read_csv('example', parse_dates=[0])
def convert_date(d):
    '''Converts YYYY/mm/dd to datetime object'''
    if type(d) != str or len(d) != 10:
        return np.nan
    dd = d[8:]
    mm = d[5:7]
    YYYY = d[:4]
    return datetime.datetime(int(YYYY), int(mm), int(dd))
df['Expiry'] = df.Expiry.map(convert_date)
df
df looks like:
Date Stock Position Expiry same
0 2012-12-01 00:00:00 A 100 2013-06-01 00:00:00 AA
1 2012-12-01 00:00:00 A 200 2013-06-01 00:00:00 AA
2 2012-12-01 00:00:00 B 300 NaN BB
3 2012-06-01 00:00:00 C 400 2013-06-01 00:00:00 CC
4 2012-06-01 00:00:00 C 500 2013-06-01 00:00:00 CC
I can quite easily change the convert_date function to put anything else in for missing data in the Expiry column.
Then using:
df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
to aggregate the Position column, I get a TypeError: can't compare datetime.datetime to str for any non-date value that I plug into the missing date cells. It is important for later functionality to know whether Expiry is missing.
You need to convert your dates to the datetime64[ns] dtype (which manages how datetimes work). An object column is not efficient, nor does it deal well with datelikes. datetime64[ns] allows missing values using NaT (not-a-time); see here: http://pandas.pydata.org/pandas-docs/dev/missing_data.html#datetimes
In [6]: df['Expiry'] = pd.to_datetime(df['Expiry'])
# alternative way of reading in the data (in 0.11.1, as ``NaT`` will be set
# for missing values in a datelike column)
In [4]: df = pd.read_csv('example',parse_dates=['Date','Expiry'])
In [9]: df.dtypes
Out[9]:
Date datetime64[ns]
Stock object
Position int64
Expiry datetime64[ns]
same object
dtype: object
In [7]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
Out[7]:
Stock Expiry same Position
0 A 2013-06-01 00:00:00 AA 300
1 B NaT BB 300
2 C 2013-06-01 00:00:00 CC 900
In [8]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum().dtypes
Out[8]:
Stock object
Expiry datetime64[ns]
same object
Position int64
dtype: object
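One caveat for recent pandas versions (an addition, not part of the original answer): groupby drops NA group keys, including NaT, by default, so to keep the missing-Expiry group as shown above you would pass dropna=False:
df.groupby(['Stock', 'Expiry', 'same'], as_index=False, dropna=False)['Position'].sum()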
