How to populate dates in a dataframe using pandas in Python

I have a dataframe with two columns, Case and Date, where Date is actually the starting date. I want to expand it into a time series: add three (month_num) more dates to each case and remove the original rows.
original dataframe:
Case Date
0 1 2010-01-01
1 2 2011-04-01
2 3 2012-08-01
after populating dates:
Case Date
0 1 2010-02-01
1 1 2010-03-01
2 1 2010-04-01
3 2 2011-05-01
4 2 2011-06-01
5 2 2011-07-01
6 3 2012-09-01
7 3 2012-10-01
8 3 2012-11-01
I tried declaring an empty dataframe with the same column names and data types, then used a for loop over Case and month_num to add rows into the new dataframe.
import pandas as pd
data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns = ['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)
df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
month_num = 3
for c in df.Case:
    for m in range(1, month_num+1):
        temp = df.loc[df['Case']==c].copy()  # .copy() avoids SettingWithCopyWarning
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code works; however, when the original dataframe and month_num become large, it takes a huge amount of time to run. Are there any better ways to do what I need? Thanks a lot!

Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate the list once after the loop.
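A minimal sketch of that pattern on the question's data (variable names are my own). Applying DateOffset to the whole Date column also removes the need for the inner loop over cases:

```python
import pandas as pd

df = pd.DataFrame({'Case': [1, 2, 3],
                   'Date': pd.to_datetime(['2010-01-01', '2011-04-01', '2012-08-01'])})
month_num = 3

# Collect each shifted copy in a plain list instead of concatenating inside the loop
frames = []
for m in range(1, month_num + 1):
    temp = df.copy()
    temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
    frames.append(temp)

# A single concat at the end is much cheaper than one concat per iteration
df_new = pd.concat(frames, ignore_index=True)
df_new = df_new.sort_values(['Case', 'Date'], ignore_index=True)
```

Each pd.concat copies all data accumulated so far, so concatenating once turns a quadratic cost into a linear one.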

Given your input data this is what worked on my notebook:
df2 = pd.DataFrame()
df2['Date'] = df['Date'].apply(lambda x: pd.date_range(start=x, periods=3, freq='M')).explode()
df3 = pd.merge_asof(df2, df, on='Date')
df3['Date'] = df3['Date'] + pd.DateOffset(days=1)
df3[['Case', 'Date']]
We create df2 and populate 'Date' with the needed dates derived from the original df (note that freq='M' yields month-end dates).
Then df3 results from a merge_asof between df2 and df, which populates the 'Case' column.
Finally, we offset the resulting dates by one day, which lands them back on the first of each month.
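An alternative, fully vectorized sketch (mine, not part of the answer above) that avoids the month-end/plus-one-day round trip by using Period arithmetic. It assumes, as in the example, that every starting date is the first of a month:

```python
import pandas as pd

df = pd.DataFrame({'Case': [1, 2, 3],
                   'Date': pd.to_datetime(['2010-01-01', '2011-04-01', '2012-08-01'])})
month_num = 3

# Repeat each row month_num times
out = df.loc[df.index.repeat(month_num)].reset_index(drop=True)
# 1, 2, 3 within each case
steps = out.groupby('Case').cumcount() + 1
# Month arithmetic via Period is exact because every date is a month start
out['Date'] = (out['Date'].dt.to_period('M') + steps).dt.to_timestamp()
```

This never loops in Python, so it scales well when the dataframe and month_num grow.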

Related

How to re-index as multi-index pandas dataframe from index value that repeats

I have an index in a pandas dataframe which repeats the index value. I want to re-index as multi-index where repeated indexes are grouped.
The index contains repeated values (e.g. 112335586 appears on several consecutive rows), and I would like all rows sharing the same index value to be grouped under one entry of a multi-index.
I have looked at the question Create pandas dataframe by repeating one row with new multiindex, but there the multi-index values can be pre-defined, which is not possible here as my dataframe is far too large to hard-code them.
I also looked at the multi-index documentation, but it also pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10 0 1
1 2
20 0 3
1 4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
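For completeness, a runnable version of this second approach on toy data (the val column name is mine):

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4]}, index=[10, 10, 20, 20])
df.index.name = 'EVENT_ID'

# Number the repeats of each index value, then use that counter as a second level
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID', 'sub_idx'], inplace=True)
```

Afterwards df.loc[10] selects all rows that originally shared the index value 10.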

Pandas: How to sort dataframe rows by date of one column

So I have two different dataframes and I concatenated both. All columns are the same; however, the date column has all sorts of different dates in the M/D/YR format, and the rows get shuffled around later in the sequence.
Is there a way to keep the whole dataframe itself and just sort the rows based on the dates in the date column? I also want to keep the format the date is in.
so basically
date people
6/8/2015 1
7/10/2018 2
6/5/2015 0
gets converted into:
date people
6/5/2015 0
6/8/2015 1
7/10/2018 2
Thank you!
PS: I've tried the options in the other posts on this, but they do not work.
Trying to elaborate on what can be done:
Initialize/merge the dataframe and convert the column into datetime type:
df = pd.DataFrame({'people': [1, 2, 0], 'date': ['6/8/2015', '7/10/2018', '6/5/2015']})
df.date = pd.to_datetime(df.date, format="%m/%d/%Y")
print(df)
Output:
date people
0 2015-06-08 1
1 2018-07-10 2
2 2015-06-05 0
Sort on the basis of date
df = df.sort_values('date')
print(df)
Output:
date people
2 2015-06-05 0
0 2015-06-08 1
1 2018-07-10 2
Restore the original string format:
df['date'] = df['date'].dt.strftime('%m/%d/%Y')
print(df)
Output:
date people
2 06/05/2015 0
0 06/08/2015 1
1 07/10/2018 2
Try changing the 'date' column to pandas Datetime and then sort:
import pandas as pd
df = pd.DataFrame({'people': [1, 1, 1, 2],
                   'date': ['4/12/1961', '5/5/1961', '7/21/1961', '8/6/1961']})
df['date'] = pd.to_datetime(df.date)
df.sort_values(by='date')
Output:
date people
1961-04-12 1
1961-05-05 1
1961-07-21 1
1961-08-06 2
To get back the initial format:
df['date'] = df['date'].dt.strftime('%m/%d/%y')
Output:
date people
04/12/61 1
05/05/61 1
07/21/61 1
08/06/61 2
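On pandas 1.1 and later there is a third option (mine, not from the answers above) that never changes the stored values at all: sort_values accepts a key callable, so the dates can be parsed just for ordering while the column keeps its original strings. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'date': ['6/8/2015', '7/10/2018', '6/5/2015'],
                   'people': [1, 2, 0]})

# key= is applied to a copy of the column for ordering only;
# the stored values remain M/D/YYYY strings
df = df.sort_values('date',
                    key=lambda s: pd.to_datetime(s, format='%m/%d/%Y'),
                    ignore_index=True)
```

This skips the to_datetime/strftime round trip entirely.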
Why not simply?
dataset.sort_values('date')
Can you provide what you tried, or what your structure is?
In case you need to sort in reversed order, do:
dataset.sort_values('date', ascending=False)

Create new column based on another column for a multi-index pandas dataframe

I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index pandas dataframe where the level-0 index is a series of month-end dates and the level-1 index is a simple integer ID. I want to create a new column of values ('new_var') where, for each month-end date, I look forward one month and get the values from another column ('some_var'); of course the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np
# Create some time series data
id = np.arange(0, 5)
date = [pd.Timestamp(2017, 1, 31) + pd.offsets.MonthEnd(i) for i in [0, 1]]
my_data = []
for d in date:
    for i in id:
        my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)
# Drop an observation to reflect my true data
df.drop((pd.Timestamp('2017-02-28'), 3), inplace=True)
df
# The desired output....
list1 = df.loc['2017-01-31'].index.tolist()
list2 = df.loc['2017-02-28'].index.tolist()
common = list(set(list1) & set(list2))
for i in common:
    df.loc[(pd.Timestamp('2017-01-31'), i), 'new_var'] = df.loc[(pd.Timestamp('2017-02-28'), i), 'some_var']
df
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create an integer column representing the date, subtract one from it (to shift it by one month) and then merge the values back onto the original dataframe with a left join.
Out[28]:
some_var
date id
2017-01-31 0 0.736003
1 0.248275
2 0.844170
3 0.671364
4 0.034331
2017-02-28 0 0.051586
1 0.894579
2 0.136740
4 0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
some_var new_var
date id
2017-01-31 0 0.736003 0.051586
1 0.248275 0.894579
2 0.844170 0.136740
3 0.671364 NaN
4 0.034331 0.902409
2017-02-28 0 0.051586 NaN
1 0.894579 NaN
2 0.136740 NaN
4 0.902409 NaN
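Since the task is effectively "for each id, take some_var from the next date", another sketch (mine, not from the answer above) is a per-id shift. Note the assumption: each id's rows must be consecutive month-ends, because shift(-1) takes the next existing row for that id regardless of how far ahead it is:

```python
import pandas as pd
import numpy as np

# Rebuild a small frame shaped like the question's (values are arbitrary)
idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2017-01-31'), i) for i in range(5)] +
    [(pd.Timestamp('2017-02-28'), i) for i in (0, 1, 2, 4)],
    names=['date', 'id'])
df = pd.DataFrame({'some_var': np.arange(9, dtype=float)}, index=idx)

# For each id, new_var is that id's some_var on the following date
df['new_var'] = df.groupby(level='id')['some_var'].shift(-1)
```

An id with no forward observation (like id 3 here) naturally gets NaN.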

How to concat 2 pandas dataframes based on time ranges

I have two dataframes and I would like to concat them based on time ranges.
for example
dataframe A
user timestamp product
A 2015/3/13 1
B 2015/3/15 2
dataframe B
user time behavior
A 2015/3/1 2
A 2015/3/8 3
A 2015/3/13 1
B 2015/3/1 2
I would like to concat the 2 dataframes as below (frame B left-joined to frame A).
Column "time1" must fall within the 7 days before column "timestamp":
for example, when timestamp is 3/13, then 3/6-3/13 is in the range;
otherwise don't concat.
user timestamp product time1 behavior
A 2015/3/13 1 2015/3/8 3
A 2015/3/13 1 2015/3/13 1
B 2015/3/15 2 NaN NaN
the SQL code would look like:
select * from
B left join A
on user
where B.time >= A.timestamp - 7 and B.time <= A.timestamp
-- i.e. WHERE B.time BETWEEN DATE_SUB(A.timestamp, INTERVAL 7 DAY) AND A.timestamp
How can we do this in Python? I can only think of the following and don't know how to handle the time condition:
new = pd.merge(A, B, on='user', how='left')
Thanks, and sorry!
The few steps required to solve this:
from datetime import timedelta
First, convert your timestamps to pandas datetime (df1 refers to dataframe A and df2 refers to dataframe B):
df1['time'] = pd.to_datetime(df1['timestamp'])
df2['time'] = pd.to_datetime(df2['time'])
Merge as follows (based on your final dataset, I think your left join is more of a right join):
df3 = pd.merge(df1, df2, how='left')
Get your final df:
df4 = df3[(df3.time >= df3.timestamp - timedelta(days=7)) & (df3.time <= df3.timestamp)]
The row containing NaN is missing, and this is because of the way conditional joins are done in pandas.
Conditional joins are not a feature of pandas yet; a way to get past that is to filter after the join.
Here's one solution that relies on two merges - first, to narrow down dataframe B (df2), and then to produce the desired output:
We can read in the example dataframes with read_clipboard():
import pandas as pd
# copy dataframe A data, then:
df1 = pd.read_clipboard(parse_dates=['timestamp'])
# copy dataframe B data, then:
df2 = pd.read_clipboard(parse_dates=['time'])
Merge and filter:
# left merge df2, df1
tmp = df2.merge(df1, on='user', how='left')
# then drop rows which disqualify based on timestamp
mask = tmp.time < (tmp.timestamp - pd.Timedelta(days=7))
tmp.loc[mask, ['time', 'behavior']] = None
tmp = tmp.dropna(subset=['time']).drop(['timestamp','product'], axis=1)
# now perform final merge
merged = df1.merge(tmp, on='user', how='left')
Output:
user timestamp product time behavior
0 A 2015-03-13 1 2015-03-08 3.0
1 A 2015-03-13 1 2015-03-13 1.0
2 B 2015-03-15 2 NaT NaN

Fastest way to add an extra row to a groupby in pandas

I'm trying to create a new row for each group in a dataframe by copying the last row and then modifying some values. My approach is as follows, the concat step appears to be the bottleneck (I tried append too). Any suggestions?
def genNewObs(df):
    lastRowIndex = df.obsNumber.idxmax()
    # .loc with a list keeps the result as a one-row DataFrame
    row = df.loc[[lastRowIndex]].copy()
    # change some other values in row here
    df = pd.concat([df, row], ignore_index=True)
    return df

df = df.groupby(GROUP).apply(genNewObs)
Edit 1: Basically I have a bunch of data with the last observation on different dates. I want to create a final observation for all groups on the current date.
Group Date Days Since last Observation
A 1/1/2014 0
A 1/10/2014 9
B 1/5/2014 0
B 1/25/2014 20
B 1/27/2014 2
If we pretend the current date is 1/31/2014 this becomes:
Group Date Days Since last Observation
A 1/1/2014 0
A 1/10/2014 9
A 1/31/2014 21
B 1/5/2014 0
B 1/25/2014 20
B 1/27/2014 2
B 1/31/2014 4
I've tried setting with enlargement and it is the slowest of all techniques. Any ideas?
Thanks to user1827356, I sped it up by a factor of 100 by taking the operation out of the apply. For some reason first was dropping the Group column, so I used idxmax instead.
def genNewObs(df):
    lastRowIndex = df.groupby(Group).Date.idxmax()
    rows = df.loc[lastRowIndex]
    df = pd.concat([df, rows], ignore_index=True)
    df = df.sort_values([Group, Date], ascending=True)
    return df
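A fully vectorized sketch of that idea (column names follow the example in the edit; the Days column is recomputed here only to make the snippet self-contained): take each group's latest row with idxmax in one pass, adjust it, and append everything with a single concat:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'B'],
                   'Date': pd.to_datetime(['2014-01-01', '2014-01-10',
                                           '2014-01-05', '2014-01-25', '2014-01-27'])})
today = pd.Timestamp('2014-01-31')

# One row per group: the observation with the latest Date
last = df.loc[df.groupby('Group')['Date'].idxmax()].copy()
last['Days'] = (today - last['Date']).dt.days  # days since the last observation
last['Date'] = today

# Days since the previous observation within each group (0 for the first)
df['Days'] = df.groupby('Group')['Date'].diff().dt.days.fillna(0).astype(int)

out = pd.concat([df, last], ignore_index=True)
out = out.sort_values(['Group', 'Date'], ignore_index=True)
```

This avoids both the per-group apply and the repeated concats.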
