Efficiently handling missing dates when aggregating Pandas Dataframe

Efficiently handling missing dates when aggregating Pandas Dataframe - python

Follow up from Summing across rows of Pandas Dataframe and Pandas Dataframe object types fillna exception over different datatypes
One of the columns that I am aggregating using
df.groupby(['stock', 'same1', 'same2'], as_index=False)['positions'].sum()
this method is not very forgiving if there are missing data. If there are any missing data in same1, same2, etc it pads totally unrelated values. Workaround is to do a fillna loop over the columns to replace missing strings with '' and missing numbers with zero solves the problem.
I do however have one column with missing dates as well. column type is 'object' with nan of type float and in the missing cells and datetime objects in the existing data fields. important that I know that the data is missing, i.e. the missing indicator must survive the groupby transformation.
Dataset outlining the problem:
csv file that I use as input is:
Date,Stock,Position,Expiry,same
2012/12/01,A,100,2013/06/01,AA
2012/12/01,A,200,2013/06/01,AA
2012/12/01,B,300,,BB
2012/6/01,C,400,2013/06/01,CC
2012/6/01,C,500,2013/06/01,CC
I then read in file:
df = pd.read_csv('example', parse_dates=[0])
def convert_date(d):
'''Converts YYYY/mm/dd to datetime object'''
if type(d) != str or len(d) != 10: return np.nan
dd = d[8:]
mm = d[5:7]
YYYY = d[:4]
return datetime.datetime(int(YYYY), int(mm), int(dd))
df['Expiry'] = df.Expiry.map(convert_date)
df
df looks like:
Date Stock Position Expiry same
0 2012-12-01 00:00:00 A 100 2013-06-01 00:00:00 AA
1 2012-12-01 00:00:00 A 200 2013-06-01 00:00:00 AA
2 2012-12-01 00:00:00 B 300 NaN BB
3 2012-06-01 00:00:00 C 400 2013-06-01 00:00:00 CC
4 2012-06-01 00:00:00 C 500 2013-06-01 00:00:00 CC
can quite easily change the convert_date function to pop anything else for missing data in Expiry column.
Then using:
df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
to aggregate the Position column. Get a TypeError: can't compare datetime.datetime to str with any non date that I plug into missing date data. Important for later functionality to know if Expiry is missing.

You need to convert your dates to the datetime64[ns] dtype (which manages how datetimes work). An object column is not efficient nor does it deal well with datelikes. datetime64[ns] allow missing values usingNaT (not-a-time), see here: http://pandas.pydata.org/pandas-docs/dev/missing_data.html#datetimes
In [6]: df['Expiry'] = pd.to_datetime(df['Expiry'])
# alternative way of reading in the data (in 0.11.1, as ``NaT`` will be set
# for missing values in a datelike column)
In [4]: df = pd.read_csv('example',parse_dates=['Date','Expiry'])
In [9]: df.dtypes
Out[9]:
Date datetime64[ns]
Stock object
Position int64
Expiry datetime64[ns]
same object
dtype: object
In [7]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
Out[7]:
Stock Expiry same Position
0 A 2013-06-01 00:00:00 AA 300
1 B NaT BB 300
2 C 2013-06-01 00:00:00 CC 900
In [8]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum().dtypes
Out[8]:
Stock object
Expiry datetime64[ns]
same object
Position int64
dtype: object

Related

Pandas: Unable to merge on two date columns

I have two dataframes that look like:
df1:
Date Multiplier
0 1995-01-01 5.248256
1 1995-02-01 5.262376
2 1995-03-01 5.255998
3 1995-04-01 5.215762
4 1995-05-01 5.207806
df2:
PRICE Date
0 77500 1995-01-01
1 60000 1995-01-01
2 39250 1995-01-01
3 51250 1995-01-01
4 224950 1995-01-01
Both date columns have been made using the pd.to_datetime() method, and they both supposedly have <M8[ns] data types when using df1.Date.dtype and df2.Date.dtype. However when trying to merge the dataframes with pd.merge(df,hpi,how="left",on="Date") I get the error:
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat

Try to convert the Date column of df1 to a datetime64
Check dtypes first:
>>> df1.dtypes
Date object # <- Not a datetime
Multiplier float64
dtype: object
>>> df2.dtypes
PRICE int64
Date datetime64[ns] # <- Right dtype
dtype: object
Convert and merge:
df1['Date'] = pd.to_datetime(df1['Date'])
out = pd.merge(df1, df2,how='left',on='Date')

get next value in list Pandas

I have a list of unique dates in chronological order.
I have a dataframe with dates in it. I want to use the list of dates in the dataframe to get the NEXT date in the list (find the date in dataframe in the list, return the date to the right of it ( next chronological date).
Any ideas?

It appears that printing the list wouldn't work, and you haven't provided us with any code to work with, or an example print of what your date time looks like. My best suggestion is to use the sort function.
dataframe.sort()
If I wanted a specific date to print, I would have to say to print it by index number once you have it sorted. Without knowing what your computers ability is to handle print statements of this size, I suggest copying this sorted file to a out txt file to ensure that you are getting the proper response.

so for every item in the dataframe there is an exact match for its date in the list of unique dates and you want to move it to the next date
you should use a dictionary for this really
next_date_dictionary = dict(zip(sequential_list_of_dates,sequential_list_of_dates[1:]))
then you simply look up the next date in the dictionary
next_date = next_date_dictionary.get(row.date)
alternatively if you want to replace the date column you can use replace
data_frame.replace({"date":next_date_dictionary})

OK here is one way of doing this:
In [210]:
# generate some data
df = pd.DataFrame({'dates':pd.date_range(start=dt.datetime(2014,3,2), end=dt.datetime(2014,4,23))})
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 0 to 52
Data columns (total 1 columns):
dates 53 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 848.0 bytes
Now I'd create a df from your date list:
In [219]:
base = dt.datetime(2014,5,3)
date_list = [base - dt.timedelta(days=x) for x in range(0, 70)]
date_df = pd.DataFrame({'dates':date_list})
date_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70 entries, 0 to 69
Data columns (total 1 columns):
dates 70 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 1.1 KB
Then add a new column to this date_df that shifts the dates column by 1 and then set the index to be the dates:
In [220]:
date_df['date_lookup'] = date_df['dates'].shift(1)
date_df = date_df.set_index('dates')
date_df.head()
Out[220]:
date_lookup
dates
2014-05-03 NaT
2014-05-02 2014-05-03
2014-05-01 2014-05-02
2014-04-30 2014-05-01
2014-04-29 2014-04-30
Then call map on the orig df and pass the date_df and access the date_lookup column, map will use the index to perform a lookup which will return the corresponding next value:
In [221]:
df['date_next'] = df['dates'].map(date_df['date_lookup'])
df.head()
Out[221]:
dates date_next
0 2014-03-02 2014-03-03
1 2014-03-03 2014-03-04
2 2014-03-04 2014-03-05
3 2014-03-05 2014-03-06
4 2014-03-06 2014-03-07

Pandas - converting datetime64 to Period

I have a Series of dates in datetime64 format.
I want to convert them to a series of Period with a monthly frequency. (Essentially, I want to group dates into months for analytical purposes).
There must be a way of doing this - I just cannot find it quickly.
Note: these dates are not the index of the data frame - they are just a column of data in the data frame.
Example input data (as a Series)
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-12-01']))
print (data)
My current kludge/work around looks like
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data = pd.DatetimeIndex(data).to_period('M')
data = pd.Series(data.year).astype('str') + '-' + pd.Series((data.month).astype('int')).map('{:0>2d}'.format)
data = data.where(data != '2262-04', other='No Date')
print (data)

Their are some issues currently (even in master) dealing with NaT in PeriodIndex, so your approach won't work like that. But seems that you simply want to resample; so do this. You can of course specify a function for how if you want.
In [57]: data
Out[57]:
0 2014-10-01
1 2014-10-01
2 2014-10-31
3 2014-11-15
4 2014-11-30
5 NaT
6 2014-12-01
dtype: datetime64[ns]
In [58]: df = DataFrame(dict(A = data, B = np.arange(len(data))))
In [59]: df.dropna(how='any',subset=['A']).set_index('A').resample('M',how='count')
Out[59]:
B
A
2014-10-31 3
2014-11-30 2
2014-12-31 1

import pandas as pd
import numpy as np
datetime import datetime
data = pd.to_datetime(
pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data=pd.Series(['{}-{:02d}'.format(x.year,x.month) if isinstance(x, datetime) else "Nat" for x in pd.DatetimeIndex(data).to_pydatetime()])
0 2014-10
1 2014-10
2 2014-10
3 2014-11
4 2014-11
5 Nat
6 2014-01
dtype: object
Best I could come up with, if the only non datetimes objects possible are floats you can change if isinstance(x, datetime) to if not isinstance(x, float)

How do I convert strings in a Pandas data frame to a 'date' data type?

I have a Pandas data frame, one of the column contains date strings in the format YYYY-MM-DD
For e.g. '2013-10-28'
At the moment the dtype of the column is object.
How do I convert the column values to Pandas date format?

Essentially equivalent to #waitingkuo, but I would use pd.to_datetime here (it seems a little cleaner, and offers some additional functionality e.g. dayfirst):
In [11]: df
Out[11]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [12]: pd.to_datetime(df['time'])
Out[12]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
Name: time, dtype: datetime64[ns]
In [13]: df['time'] = pd.to_datetime(df['time'])
In [14]: df
Out[14]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
Handling ValueErrors
If you run into a situation where doing
df['time'] = pd.to_datetime(df['time'])
Throws a
ValueError: Unknown string format
That means you have invalid (non-coercible) values. If you are okay with having them converted to pd.NaT, you can add an errors='coerce' argument to to_datetime:
df['time'] = pd.to_datetime(df['time'], errors='coerce')

Use astype
In [31]: df
Out[31]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [32]: df['time'] = df['time'].astype('datetime64[ns]')
In [33]: df
Out[33]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00

I imagine a lot of data comes into Pandas from CSV files, in which case you can simply convert the date during the initial CSV read:
dfcsv = pd.read_csv('xyz.csv', parse_dates=[0]) where the 0 refers to the column the date is in.
You could also add , index_col=0 in there if you want the date to be your index.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Now you can do df['column'].dt.date
Note that for datetime objects, if you don't see the hour when they're all 00:00:00, that's not pandas. That's iPython notebook trying to make things look pretty.

If you want to get the DATE and not DATETIME format:
df["id_date"] = pd.to_datetime(df["id_date"]).dt.date

Another way to do this and this works well if you have multiple columns to convert to datetime.
cols = ['date1','date2']
df[cols] = df[cols].apply(pd.to_datetime)

It may be the case that dates need to be converted to a different frequency. In this case, I would suggest setting an index by dates.
#set an index by dates
df.set_index(['time'], drop=True, inplace=True)
After this, you can more easily convert to the type of date format you will need most. Below, I sequentially convert to a number of date formats, ultimately ending up with a set of daily dates at the beginning of the month.
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
#Convert to monthly dates
df.index = df.index.to_period(freq='M')
#Convert to strings
df.index = df.index.strftime('%Y-%m')
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
For brevity, I don't show that I run the following code after each line above:
print(df.index)
print(df.index.dtype)
print(type(df.index))
This gives me the following output:
Index(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='object', name='time')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', name='time', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
PeriodIndex(['2013-01', '2013-01', '2013-01'], dtype='period[M]', name='time', freq='M')
period[M]
<class 'pandas.core.indexes.period.PeriodIndex'>
Index(['2013-01', '2013-01', '2013-01'], dtype='object')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

For the sake of completeness, another option, which might not be the most straightforward one, a bit similar to the one proposed by #SSS, but using rather the datetime library is:
import datetime
df["Date"] = df["Date"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%d-%m').date())

# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null object
1 endDay 110526 non-null object
import pandas as pd
df['startDay'] = pd.to_datetime(df.startDay)
df['endDay'] = pd.to_datetime(df.endDay)
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null datetime64[ns]
1 endDay 110526 non-null datetime64[ns]

Try to convert one of the rows into timestamp using the pd.to_datetime function and then use .map to map the formular to the entire column

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns 'entry_date' and 'dob' which i have converted to Timestamps using pd.to_timestamp.I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob' and to do this i need to get the difference in days between the two columns ( so that i can then do somehting like round(days/365.25). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob i get the following :
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.

Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64

You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns], coming in 0.12)

Not sure if you still need it, but in Pandas 0.14 i usually use .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help

Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.

To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Efficiently handling missing dates when aggregating Pandas Dataframe - python

Related

Pandas: Unable to merge on two date columns

get next value in list Pandas

Pandas - converting datetime64 to Period

How do I convert strings in a Pandas data frame to a 'date' data type?

Pandas Timedelta in Days

Categories

Resources