I have a data frame with a column indicating a number of months. I would like to create a new column that starts from an initial date, say 2015-01-01, and adds that many months to it. For example, if the month column has values [0, 1, 2, …, 72], then I would like a column called Date of the form [2015-01-01, 2015-02-01, 2015-03-01, …].
How could I achieve this?
Use offsets.DateOffset and add to datetime:
import pandas as pd

df = pd.DataFrame({'n': [0, 1, 2, 72]})
start = '2015-01-01'
df['new'] = pd.to_datetime(start) + df['n'].apply(lambda x: pd.offsets.DateOffset(months=x))
print(df)
    n        new
0   0 2015-01-01
1   1 2015-02-01
2   2 2015-03-01
3  72 2021-01-01
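For larger frames, a vectorized sketch (my suggestion, not part of the original answer, reusing the same df and start) assembles first-of-month dates from integer year/month arithmetic instead of applying a DateOffset per row:

ts = pd.to_datetime(start)
months = ts.month - 1 + df['n']  # total months elapsed since January of the start year
df['new'] = pd.to_datetime(pd.DataFrame({'year': ts.year + months // 12,
                                         'month': months % 12 + 1,
                                         'day': 1}))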
I have a dataframe with two columns, Case and Date, where Date is actually the starting date. I want to populate it as a time series: say, add three (month_num) more dates to each case and remove the original ones.
original dataframe:
   Case       Date
0     1 2010-01-01
1     2 2011-04-01
2     3 2012-08-01
after populating dates:
   Case       Date
0     1 2010-02-01
1     1 2010-03-01
2     1 2010-04-01
3     2 2011-05-01
4     2 2011-06-01
5     2 2011-07-01
6     3 2012-09-01
7     3 2012-10-01
8     3 2012-11-01
I tried declaring an empty dataframe with the same column names and data types, and used a nested for loop over Case and month_num to add rows to the new dataframe.
import pandas as pd

data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns=['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)

df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])

month_num = 3
for c in df.Case:
    for m in range(1, month_num + 1):
        temp = df.loc[df['Case'] == c]
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code works; however, when the original dataframe and month_num get large, it takes a huge amount of time to run. Are there any better ways to do what I need? Thanks a lot!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As that answer suggests, you may want to use an external list to collect all the dataframes you create in the loop, and then concatenate the list once at the end.
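A minimal sketch of that pattern, reusing the df and month_num from the question (it also drops the inner loop over cases, since each monthly offset can be applied to the whole frame at once):

frames = []
for m in range(1, month_num + 1):
    shifted = df.copy()
    shifted['Date'] = shifted['Date'] + pd.DateOffset(months=m)  # shift every case by m months
    frames.append(shifted)

# Concatenate a single time, outside the loop.
df_new = pd.concat(frames, ignore_index=True)
df_new = df_new.sort_values(['Case', 'Date'], ignore_index=True)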
Given your input data this is what worked on my notebook:
df2 = pd.DataFrame()
df2['Date'] = df['Date'].apply(lambda x: pd.date_range(start=x, periods=3, freq='M')).explode()
df3 = pd.merge_asof(df2, df, on='Date')
df3['Date'] = df3['Date'] + pd.DateOffset(days=1)
df3[['Case', 'Date']]
We create df2 and populate its 'Date' column with the needed dates derived from the original df (freq='M' yields month-end dates).
Then df3 results from a merge_asof between df2 and df, which populates the 'Case' column.
Finally, we offset the resulting column by 1 day, which lands each month-end on the first of the following month.
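On the sample input this should reproduce the expected frame above:

   Case       Date
0     1 2010-02-01
1     1 2010-03-01
2     1 2010-04-01
3     2 2011-05-01
4     2 2011-06-01
5     2 2011-07-01
6     3 2012-09-01
7     3 2012-10-01
8     3 2012-11-01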
I am a beginner in Python and I am trying to change column names that currently represent the week number to something easier to digest. I wanted to change them to show the date of the week commencing, but I am having issues with converting the types.
I have a table that looks similar to the following:
import pandas as pd

data = [[0, 'John', 1, 2, 3]]
df = pd.DataFrame(data, columns=['Index', 'Owner', '32.0', '33.0', '34.0'])
print(df)
I tried to use df.melt to get a column with the week numbers, and then convert them to datetime and obtain the week commencing from that, but I have not been successful.
df = df.melt(id_vars=['Owner'])
df['variable'] = pd.to_datetime(df['variable'], format='%U')
This is as far as I have gotten as I have not been able to obtain the week number as a datetime type to then use it to get the week commencing.
After this, I was going to then transform the dataframe back to its original shape and have the newly obtained week commencing date times as the column headers again.
Can anyone advise me on what I am doing wrong, or alternatively is there a better way to do this?
Any help would be greatly appreciated!
Add the Index column to melt first so that variable holds only the week values, then convert them to floats, integers and strings, so it is possible to match by weeks:
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
   Index Owner  32.0  33.0  34.0
0      0  John     1     2     3
df = df.melt(id_vars=['Index','Owner'])
s = df['variable'].astype(float).astype(int).astype(str) + '-0-2021'
print(s)
0    32-0-2021
1    33-0-2021
2    34-0-2021
Name: variable, dtype: object
#https://stackoverflow.com/a/17087427/2901002
df['variable'] = pd.to_datetime(s, format = '%W-%w-%Y')
print(df)
   Index Owner   variable  value
0      0  John 2021-08-15      1
1      0  John 2021-08-22      2
2      0  John 2021-08-29      3
EDIT:
To get back the original DataFrame (with integer columns for weeks), use DataFrame.pivot:
df1 = (df.pivot(index=['Index','Owner'], columns='variable', values='value')
         .rename_axis(None, axis=1))
df1.columns = df1.columns.strftime('%W')
df1 = df1.reset_index()
print(df1)

   Index Owner  32  33  34
0      0  John   1   2   3
One solution to convert a week number to a date is to use a timedelta. For example, you may have:

from datetime import timedelta, datetime

week_number = 5
first_monday_of_the_year = datetime(2021, 1, 4)  # 2021-01-04 was the first Monday of 2021
week_date = first_monday_of_the_year + timedelta(weeks=week_number)
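Applied to the example frame above, a hedged sketch (the week_commencing helper is mine, and it assumes weeks are counted from that first Monday) could rename the week-number columns in place:

def week_commencing(col):
    # '32.0' -> a 'YYYY-MM-DD' string for that week's Monday
    return (first_monday_of_the_year + timedelta(weeks=int(float(col)))).strftime('%Y-%m-%d')

df = df.rename(columns={c: week_commencing(c) for c in ['32.0', '33.0', '34.0']})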
I have a dataframe like this:
ID  Date
01  2020-01-02
01  2020-01-03
02  2020-01-02
I need to create a new column that, for each specific ID and Date, gives me the number of rows that have the same ID but an earlier date.
So the output for the example df will look like this:

ID  Date        Count
01  2020-01-02  0
01  2020-01-03  1
02  2020-01-02  0
I have tried working with auxiliary tables, and also with groupby using a lambda function, but I have no real idea how to continue.
This will create a new column with the count:

df['Date'] = pd.to_datetime(df['Date'])
# Rank each date within its ID group; rank minus 1 is the number of earlier
# rows. method='min' keeps the count correct (and integer) if an ID has
# duplicate dates.
df['Count'] = df.groupby('ID')['Date'].rank(method='min', ascending=True).astype(int) - 1
First you need to be sure that you are comparing dates.
df["Date"] = pd.to_datetime(df['Date'], format="%Y-%m-%d")
Then you can create the new column 'Count' by iterating over each row with df.apply.

def count_earlier_dates(row):
    # Rows with the same ID and a strictly earlier date.
    return df[(df['ID'] == row['ID']) & (df['Date'] < row['Date'])].count()['ID']

df['Count'] = df.apply(lambda row: count_earlier_dates(row), axis=1)
Let us try factorize
df['new'] = df.sort_values('Date').groupby('ID')['Date'].transform(lambda x : x.factorize()[0])
df
  ID       Date  new
0  1 2020-01-02    0
1  1 2020-01-03    1
2  2 2020-01-02    0
I have a table which contains ids, dates, a target (potentially multi-class, but for now binary, where 1 is a fail) and a yearmonth column based on the date column. Below are the first 8 rows of this table:
row  id  date        target  yearmonth
0    A   2015-03-16  0       2015-03
1    A   2015-05-29  1       2015-05
2    A   2015-08-02  1       2015-08
3    A   2015-09-05  1       2015-09
4    A   2015-09-22  0       2015-09
5    A   2015-10-15  1       2015-10
6    A   2015-11-09  1       2015-11
7    B   2015-04-17  0       2015-04
I want to create lookback features for, let's say, the last 3 months, so that for each single row we look into the past and see how that id performed over the last 3 months. So, for example, for row 6, where the date is 9 Nov 2015, the percentage of fails for id A in the last 3 calendar months (i.e. over the whole of August, September & October) would be 75% (using rows 2-5).
df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B'],
                   'date': ['2015-03-16','2015-05-29','2015-08-02','2015-09-05',
                            '2015-09-22','2015-10-15','2015-11-09','2015-04-17'],
                   'target': [0,1,1,1,0,1,1,0]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['yearmonth'] = df['date'].dt.to_period('M')

agg_dict = {
    "Total_Transactions": pd.NamedAgg(column='target', aggfunc='count'),
    "Fail_Count": pd.NamedAgg(column='target', aggfunc=lambda x: len(x[x == 1])),
    "Perc_Monthly_Fails": pd.NamedAgg(column='target', aggfunc=lambda x: len(x[x == 1]) / len(x) * 100)
}
df.groupby(['id', 'yearmonth']).agg(**agg_dict).reset_index(level=1)
I've done an aggregation by id and month (see below), and I've tried things like rolling windows, but I couldn't find a way to actually aggregate looking back over a specific period for each single row. Any help is appreciated.
id  yearmonth  Total_Transactions  Fail_Count  Perc_Monthly_Fails
A   2015-03    1                   0           0
A   2015-05    1                   1           100
A   2015-08    1                   1           100
A   2015-09    2                   1           50
A   2015-10    1                   1           100
A   2015-11    1                   1           100
B   2015-04    1                   0           0
You can do this by merging the DataFrame with itself on 'id'.
First we'll create a first-of-month 'fom' column, since your date logic looks back over prior months, not from the date itself. Then we merge the DataFrame with itself, bringing along the index so we can assign the result back in the end.
With month offsets we can then filter to keep only the observations within 3 months prior to each row's observation, group by the original index, and take the mean of 'target' to get the percent fail, which we can just assign back (alignment on index).
If there are NaNs in the output, it's because that row had no observations in the prior 3 months, so nothing can be calculated.
#df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['fom'] = df['date'].astype('datetime64[M]')  # Credit @anky

df1 = df.reset_index()
df1 = df1.drop(columns='target').merge(df1, on='id', suffixes=['', '_past'])
df1 = df1[df1.fom_past.between(df1.fom - pd.offsets.DateOffset(months=3),
                               df1.fom - pd.offsets.DateOffset(months=1))]
df['Pct_fail'] = df1.groupby('index').target.mean() * 100
  id       date  target        fom    Pct_fail
0  A 2015-03-16       0 2015-03-01         NaN  # No Rows to Avg
1  A 2015-05-29       1 2015-05-01    0.000000  # Avg Rows 0
2  A 2015-08-02       1 2015-08-01  100.000000  # Avg Rows 1
3  A 2015-09-05       1 2015-09-01  100.000000  # Avg Rows 2
4  A 2015-09-22       0 2015-09-01  100.000000  # Avg Rows 2
5  A 2015-10-15       1 2015-10-01   66.666667  # Avg Rows 2,3,4
6  A 2015-11-09       1 2015-11-01   75.000000  # Avg Rows 2,3,4,5
7  B 2015-04-17       0 2015-04-01         NaN  # No Rows to Avg
If you're having an issue with memory, we can take a very slow loop approach, which subsets for each row and then calculates the average from that subset.
import numpy as np

def get_prev_avg(row, df):
    df = df[df['id'].eq(row['id'])
            & df['fom'].between(row['fom'] - pd.offsets.DateOffset(months=3),
                                row['fom'] - pd.offsets.DateOffset(months=1))]
    if not df.empty:
        return df['target'].mean() * 100
    else:
        return np.NaN

#df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['fom'] = df['date'].astype('datetime64[M]')
df['Pct_fail'] = df.apply(lambda row: get_prev_avg(row, df), axis=1)
I have modified @ALollz's code so that it applies better to my original dataset, where I have a multi-class target and would like to obtain the fail percentages for classes 1 and 2 plus the number of transactions, and where I need to group by different columns over different periods of time. I also decided it's simpler and better to use the last x months prior to the date rather than calendar months. So my solution was this:
df = pd.DataFrame({'Id': ['A','A','A','A','A','A','A','B'],
                   'Type': ['T1','T3','T1','T2','T2','T1','T1','T3'],
                   'date': ['2015-03-16','2015-05-29','2015-08-10','2015-09-05',
                            '2015-09-22','2015-11-08','2015-11-09','2015-04-17'],
                   'target': [2,1,2,1,0,1,2,0]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

def get_prev_avg(row, df, columnname, lastxmonths):
    df = df[df[columnname].eq(row[columnname])
            & df['date'].between(row['date'] - pd.offsets.DateOffset(months=lastxmonths),
                                 row['date'] - pd.offsets.DateOffset(days=1))]
    if not df.empty:
        NrTransactions = len(df['target'])
        PctMinorFails = df['target'].where(df['target'] == 1).count() / len(df['target']) * 100
        PctMajorFails = df['target'].where(df['target'] == 2).count() / len(df['target']) * 100
        return pd.Series([NrTransactions, PctMinorFails, PctMajorFails])
    else:
        return pd.Series([np.NaN, np.NaN, np.NaN])

for lastxmonths in [3, 4]:
    for columnname in ['Id', 'Type']:
        df[['NrTransactionsBy' + columnname + 'Last' + str(lastxmonths) + 'Months',
            'PctMinorFailsBy' + columnname + 'Last' + str(lastxmonths) + 'Months',
            'PctMajorFailsBy' + columnname + 'Last' + str(lastxmonths) + 'Months'
           ]] = df.apply(lambda row: get_prev_avg(row, df, columnname, lastxmonths), axis=1)
Each iteration takes a couple of hours on my original dataset, which is not great, but I am unsure how to optimise it further.
I have a data frame with several columns and rows. One column, 'Date', holds values of the form month/day/year Hour:Min:Sec PM, and I need to get from the data frame only the rows that match the Hour:Min:Sec part of that column. The column has the data as object.
df.loc[df['Date'] == 'month/day/year 11:00:00 PM'].copy()
It only works when I specify the month/day/year, but I want to obtain the rows that correspond to the time no matter the day. Does anyone know how this can be achieved?
This is done in 2 steps. The first creates an intermediate column with the time only. The second does the filtering.
>>> import datetime
>>> import pandas as pd
>>> df = pd.DataFrame([[datetime.datetime(2018, 1, 1, 2, 2, 2), 1],
...                    [datetime.datetime(2018, 1, 1, 3, 3, 3), 2]], columns=['Date', 'Val'])
>>> df
                 Date  Val
0 2018-01-01 02:02:02    1
1 2018-01-01 03:03:03    2
1) Create intermediate col
>>> df['new'] = df['Date'].transform(lambda x: x.time())
>>> df
                 Date  Val       new
0 2018-01-01 02:02:02    1  02:02:02
1 2018-01-01 03:03:03    2  03:03:03
2) Do filtering
>>> df[df['new'] == datetime.time(2, 2, 2)]
                 Date  Val       new
0 2018-01-01 02:02:02    1  02:02:02
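Alternatively, a one-step sketch using the .dt accessor, which exposes the time component of a datetime column directly and skips the intermediate column:

>>> # Compare the time-of-day of each timestamp against a datetime.time.
>>> df[df['Date'].dt.time == datetime.time(2, 2, 2)]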