I have a dataframe like this,
ID DateCol
1 26/06/2017
2 Employee Not Found
I want to increase the date by 1 day.
The expected output is,
ID DateCol
1 27/06/2017
2 Employee Not Found
I tried,
temp_result_df['NewDateCol'] = pd.to_datetime(temp_result_df['DateCol']).apply(pd.DateOffset(1))
It is not working; I believe it is because there is a string in the column.
It is best to work with datetimes in the column, with NaT for missing values: use to_datetime with errors='coerce', then add a Timedelta or a Day offset:
temp_result_df['NewDateCol'] = pd.to_datetime(temp_result_df['DateCol'], errors='coerce')
temp_result_df['NewDateCol'] += pd.Timedelta(1, 'd')
#alternative
#temp_result_df['NewDateCol'] += pd.offsets.Day(1)
print(temp_result_df)
ID DateCol NewDateCol
0 1 26/06/2017 2017-06-27
1 2 Employee Not Found NaT
If you need strings like the original data, use strftime and then replace:
s = pd.to_datetime(temp_result_df['DateCol'], errors='coerce') + pd.Timedelta(1, 'd')
temp_result_df['NewDateCol'] = s.dt.strftime('%d/%m/%Y').replace('NaT','Employee Not Found')
print(temp_result_df)
ID DateCol NewDateCol
0 1 26/06/2017 27/06/2017
1 2 Employee Not Found Employee Not Found
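For reference, a minimal self-contained version of the above; the frame construction is assumed from the question's sample data, and fillna is used in case your pandas version returns NaN rather than the string 'NaT' from strftime:
import pandas as pd

# sample data from the question
temp_result_df = pd.DataFrame({'ID': [1, 2],
                               'DateCol': ['26/06/2017', 'Employee Not Found']})

# parse day-first dates; anything unparseable becomes NaT
s = pd.to_datetime(temp_result_df['DateCol'], format='%d/%m/%Y', errors='coerce')
new = (s + pd.Timedelta(1, 'd')).dt.strftime('%d/%m/%Y')

# fill missing entries back with the original placeholder text
temp_result_df['NewDateCol'] = new.fillna('Employee Not Found')
print(temp_result_df)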
I have a dataset containing a datetime column for one month, and I need to divide it into two blocks (day and night, or am/pm) and then discretize the time in each block into 10-minute bins. I can add a column of 0s and 1s to show whether it is am or pm, but I cannot discretize it! Can you please help me with it?
df['started_at'] = pd.to_datetime(df['started_at'])
df['start hour'] = df['started_at'].dt.hour.astype('int')
df['mor/aft'] = np.where(df['start hour'] < 12, 1, 0)
df['started_at']
0 16:05:36
2 06:22:40
3 16:08:10
4 12:28:57
6 15:47:30
...
3084526 15:24:24
3084527 16:33:07
3084532 14:08:12
3084535 09:43:46
3084536 17:02:26
If I understood correctly, you are trying to add a column for every ten-minute interval to indicate whether an observation falls in that interval.
You can use lambda expressions to loop through each observation in the series.
Dividing the minutes by 10 and truncating to an integer gives the tens digit of the minutes, based on which you can add indicator columns.
I also included how to extract the day indicator column with a lambda expression for you to compare; note that the condition is inverted relative to your np.where(), so here 1 marks the afternoon.
import pandas as pd

# make dataframe
df = pd.DataFrame({
    'started_at': ['14:20:56',
                   '00:13:24',
                   '16:01:33']
})

# convert column to datetime
df['started_at'] = pd.to_datetime(df['started_at'])

# make day indicator column (1 = afternoon)
df['day'] = df['started_at'].apply(lambda ts: 1 if ts.hour > 12 else 0)

# make indicator column for every ten minutes
for i in range(24):
    for j in range(6):
        col = 'hour_' + str(i) + '_min_' + str(j) + '0'
        df[col] = df['started_at'].apply(lambda ts: 1 if int(ts.minute / 10) == j and ts.hour == i else 0)

print(df)
Output (first columns):
started_at day hour_0_min_00 hour_0_min_10 hour_0_min_20
0 2021-11-21 14:20:56 1 0 0 0
1 2021-11-21 00:13:24 0 0 1 0
2 2021-11-21 16:01:33 1 0 0 0
...
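As an alternative to the nested loops (not from the original answer), a vectorized sketch using dt.floor and pd.get_dummies produces the same kind of indicator columns, though only for the bins that actually occur in the data and with zero-padded column names:
import pandas as pd

df = pd.DataFrame({'started_at': ['14:20:56', '00:13:24', '16:01:33']})
df['started_at'] = pd.to_datetime(df['started_at'])

# floor each timestamp to its 10-minute bin, then one-hot encode the bins
bins = df['started_at'].dt.floor('10min').dt.strftime('hour_%H_min_%M')
df = df.join(pd.get_dummies(bins))
print(df)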
I have followed the instructions from this thread, but have run into issues.
Converting month number to datetime in pandas
I think it may have to do with having an additional variable in my dataframe but I am not sure. Here is my dataframe:
0 Month Temp
1 0 2
2 1 4
3 2 3
What I want is:
0 Month Temp
1 1990-01 2
2 1990-02 4
3 1990-03 3
Here is what I have tried:
df = pd.to_datetime('1990-' + df.Month.astype(int).astype(str) + '-1', format='%Y-%m')
And I get this error:
ValueError: time data 1990-0-1 doesn't match format specified
IIUC, we can manually create your datetime object and then format it to your expected output. Since your months run from 0, add 1 to all of them before parsing:
m = df['Month'].add(1).astype(int).astype(str)

df['date'] = pd.to_datetime(
    '1990-' + m, format='%Y-%m'
).dt.strftime('%Y-%m')
print(df)
   Month  Temp     date
0      0     2  1990-01
1      1     4  1990-02
2      2     3  1990-03
Try .dt.strftime() to control how the date is displayed, because datetime values are stored in the full %Y-%m-%d 00:00:00 form by default.
import pandas as pd
df = pd.DataFrame({'month': [1, 2, 3]})
df['date'] = pd.to_datetime(df['month'], format='%m').dt.strftime('%Y-%m')
print(df)
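Note that when only %m is supplied, the parser falls back to the strptime default year of 1900, so this should print:
   month     date
0      1  1900-01
1      2  1900-02
2      3  1900-03
To display 1990 instead, hard-code it in the format string, e.g. .dt.strftime('1990-%m') as in the next answer.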
You have to explicitly tell pandas to add 1 to the months, as in your case they range from 0-11 rather than 1-12.
df = pd.DataFrame({'month': [11, 1, 2, 3, 0]})
df['date'] = pd.to_datetime(df['month'] + 1, format='%m').dt.strftime('1990-%m')
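which should give:
   month     date
0     11  1990-12
1      1  1990-02
2      2  1990-03
3      3  1990-04
4      0  1990-01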
Here is my solution:
import pandas as pd

data = pd.DataFrame({
    'Month': [1, 2, 3],
    'Temp': [2, 4, 3]
})

data['Month'] = pd.to_datetime('1990-' + data.Month.astype(int).astype(str) + '-1',
                               format='%Y-%m-%d').dt.to_period('M')
Month Temp
0 1990-01 2
1 1990-02 4
2 1990-03 3
If your months start at 0 (so Month 0 should mean January), add 1 first, as sketched below.
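A minimal sketch of that adjustment, assuming months run 0-11 as in the question:
# shift 0-11 to 1-12 before building the date string
data['Month'] = data['Month'] + 1
data['Month'] = pd.to_datetime('1990-' + data.Month.astype(int).astype(str) + '-1',
                               format='%Y-%m-%d').dt.to_period('M')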
I have a data frame with date columns formatted like 20,190,927, which means 2019/09/27.
I need to change the format to YYYY/MM/DD or something similar.
I thought of doing it manually like:
x = df_all['CREATION_DATE'].str[:2] + df_all['CREATION_DATE'].str[3:5] + "-" + \
df_all['CREATION_DATE'].str[5] + df_all['CREATION_DATE'].str[7] + "-" + df_all['CREATION_DATE'].str[8:]
print(x)
What's a more creative way of doing this? Could it be done with datetime module?
I believe this is what you want. First replace the , with nothing, so you get a yyyymmdd format, and then change it to datetime with pd.to_datetime by passing the correct format. One liner:
df['dates'] = pd.to_datetime(df['dates'].str.replace(',',''),format='%Y%m%d')
Full explanation:
import pandas as pd
a = {'dates':['20,190,927','20,191,114'],'values':[1,2]}
df = pd.DataFrame(a)
print(df)
Output, here's what the original dataframe looks like:
dates values
0 20,190,927 1
1 20,191,114 2
df['dates'] = df['dates'].str.replace(',','')
df['dates'] = pd.to_datetime(df['dates'],format='%Y%m%d')
print(df)
print(df.info())
Output of the newly formatted dataframe:
dates values
0 2019-09-27 1
1 2019-11-14 2
Printing .info() to ensure we have the correct format:
dates 2 non-null datetime64[ns]
values 2 non-null int64
dates = ['20,190,927', '20,190,928', '20,190,929']
df3 = pd.DataFrame(dates, columns=['Date'])
df3['Date'] = df3['Date'].replace(',', '', regex=True)
df3['Date'] = pd.to_datetime(df3['Date'])
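which should yield:
        Date
0 2019-09-27
1 2019-09-28
2 2019-09-29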
def checkDF():
    list1 = [{'BatchNumber':'b1','Reason':'r1.1','value':1,'date':datetime(1700,01,01)},
             {'BatchNumber':'b1','Reason':'r1.2','value':1,'date':'NA'},
             {'BatchNumber':'b2','Reason':'r2','value':2,'date':datetime(2001,03,04)}]
    df = pd.DataFrame(list1)
    df.loc[df['date']!='NA' & df['date'] < datetime(2000,01,01),'date'] = "NaT"

if __name__ == '__main__':
    checkDF()
I want to replace values in the date column when the date is 'NA' or earlier than 2000, but I am not able to combine these two conditions in pandas.
I think you first need to_datetime with errors='coerce' to convert non-datetime values to NaT (which is not a string but a missing value):
import numpy as np
import pandas as pd
from datetime import datetime

list1 = [{'BatchNumber':'b1','Reason':'r1.1','value':1,'date':datetime(1700,1,1)},
         {'BatchNumber':'b1','Reason':'r1.2','value':1,'date':'NA'},
         {'BatchNumber':'b2','Reason':'r2','value':2,'date':datetime(2001,3,4)}]
df = pd.DataFrame(list1)

df['date'] = pd.to_datetime(df['date'].astype(str), errors='coerce')
df.loc[df['date'] < '2000-01-01', 'date'] = np.nan
#if you want to check non-NaNs explicitly
#df.loc[df['date'].notnull() & (df['date'] < datetime(2000,1,1)), 'date'] = np.nan
print(df)
  BatchNumber Reason  value       date
0          b1   r1.1      1        NaT
1          b1   r1.2      1        NaT
2          b2     r2      2 2001-03-04
import pandas as pd
from datetime import datetime

list1 = [{'BatchNumber':'b1','Reason':'r1.1','value':1,'date':datetime(1700,1,1)},
         {'BatchNumber':'b1','Reason':'r1.2','value':1,'date':'NA'},
         {'BatchNumber':'b2','Reason':'r2','value':2,'date':datetime(2001,3,4)}]
df = pd.DataFrame(list1)

# Create a function to represent your change
def f(x):
    if (isinstance(x, datetime) and x.year < 2000) or x == 'NA':
        return pd.NaT
    else:
        return x

# Apply this function to only the date column
df['date'] = df.date.map(f)
# Output
  BatchNumber Reason  value       date
0          b1   r1.1      1        NaT
1          b1   r1.2      1        NaT
2          b2     r2      2 2001-03-04
Notes:
Can't use 01 to represent 1 in the dates (raises a SyntaxError)
NA in your data is a text value and not numpy.nan (the actual NA) hence we check for the string 'NA'
I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far; I managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assuming df is your second dataframe with weeks, you can try the following:
df.groupby('Week').sum()['value']
The documentation for groupby() and its application is here. It's similar to the GROUP BY function in SQL.
To obtain the second dataframe from the first one, try the following:
Firstly, prepare a function to map the day of the month to a week of the month
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Secondly, take the lists out from the first dataframe, and convert days to weeks
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Thirdly, initialize your second dataframe with only the 'Group' and 'Week' columns, leaving 'value' out. Assuming your initialized new dataframe is result, you can now do a join
result = result.join(df, on=['Group', 'Week'])
Last, write a function to fill the NaNs in the 'value' column from the nearby elements. The NaNs are what you need to interpolate. Since I am not sure how you want the interpolation to work, I will leave it to you; one possible sketch follows.
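A minimal sketch, assuming a forward-fill within each group (with the remaining gaps set to zero) matches the intermediate table in the question:
# carry each group's last observed value forward, then treat untouched weeks as zero
result['value'] = result.groupby('Group')['value'].ffill().fillna(0)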
Here is how you can change d2w_map to convert a date string to an integer day of the week:
from datetime import datetime

def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()

A returned value of 0 means Monday, 1 means Tuesday, and so on.
If you have the package dateutil installed, the function can be more robust:
from dateutil.parser import parse

def d2w_map(day_str):
    return parse(day_str).weekday()
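For example, both versions should agree on the question's dates (1/9/2000 was a Sunday):
print(d2w_map('1/9/2000'))  # 6, i.e. Sunday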
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
df_temp = (df.set_index('date')
             .groupby('Group')
             .resample('W', how='sum', fill_method='ffill'))

ts = (df_temp.reset_index()
             .groupby('date')
             .sum()['value'])
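Note that the how= and fill_method= keywords were later removed from resample(); a sketch of the same idea on recent pandas (assuming the same df with a parsed 'date' column) would be:
# weekly sums per group; min_count=1 leaves empty weeks as NaN,
# which ffill then carries forward within each group
df_temp = (df.set_index('date')
             .groupby('Group')['value']
             .resample('W')
             .sum(min_count=1)
             .groupby(level='Group')
             .ffill())

ts = df_temp.groupby(level='date').sum()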
Used this tab delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate dataframe as follows. Just play around with it to get it exactly right.
import pandas as pd

time_format = '%m/%d/%Y'
Y = pd.read_csv('test.txt', sep='\t')

# index the values by their parsed dates, then sort chronologically
X = pd.DataFrame(Y['value'])
X.index = pd.to_datetime(Y['Date'], format=time_format)
X = X.sort_index()
print(X)

# weekly sums; min_count=1 keeps empty weeks as NaN
print(X.resample('W', closed='right', label='right').sum(min_count=1))
The last print gives:
            value
2000-01-02    5.0
2000-01-09    3.0
2000-01-16    NaN
2000-01-23   37.0