Here is my DF
Start-Time Running-Time Speed-Avg HR-Avg
0 2016-12-18 10:8:14 0:24:2 20 138
1 2016-12-18 10:8:14 0:24:2 20 138
2 2016-12-23 8:52:36 0:31:19 16 134
3 2016-12-23 8:52:36 0:31:19 16 134
4 2016-12-25 8:0:51 0:30:10 50 135
5 2016-12-25 8:0:51 0:30:10 50 135
6 2016-12-26 8:41:26 0:10:1 27 116
7 2016-12-26 8:41:26 0:10:1 27 116
8 2017-1-7 11:16:9 0:26:15 22 124
9 2017-1-7 11:16:9 0:26:15 22 124
10 2017-1-10 19:2:54 0:53:51 5 142
11 2017-1-10 19:2:54 0:53:51 5 142
and I have been trying to format this column in H:M:S format using
timeDF = pd.to_datetime(cleanDF['Running-Time'], format='%H:%M:%S')
but I keep getting this error:
ValueError: time data ' 0:24:2' does not match format '%H:%M:%S' (match)
Thank you in advance.
There is a problem with leading whitespace in the values, so you need str.strip.
Or, if the DataFrame is created from a file by read_csv, add the parameter skipinitialspace=True:
cleanDF = pd.read_csv(file, skipinitialspace=True)
timeDF = pd.to_datetime(cleanDF['Running-Time'].str.strip(), format='%H:%M:%S')
print (timeDF)
0 1900-01-01 00:24:02
1 1900-01-01 00:24:02
2 1900-01-01 00:31:19
3 1900-01-01 00:31:19
4 1900-01-01 00:30:10
5 1900-01-01 00:30:10
6 1900-01-01 00:10:01
7 1900-01-01 00:10:01
8 1900-01-01 00:26:15
9 1900-01-01 00:26:15
10 1900-01-01 00:53:51
11 1900-01-01 00:53:51
Name: Running-Time, dtype: datetime64[ns]
But it may be better to convert it to timedeltas with to_timedelta:
timeDF = pd.to_timedelta(cleanDF['Running-Time'].str.strip())
print (timeDF)
0 00:24:02
1 00:24:02
2 00:31:19
3 00:31:19
4 00:30:10
5 00:30:10
6 00:10:01
7 00:10:01
8 00:26:15
9 00:26:15
10 00:53:51
11 00:53:51
Name: Running-Time, dtype: timedelta64[ns]
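One practical payoff of the timedelta conversion (a small sketch using three of the question's values) is that arithmetic and aggregation work directly, which datetimes anchored at 1900-01-01 do not give you:

```python
import pandas as pd

# Three Running-Time values from the question, leading whitespace included
running = pd.Series([' 0:24:2', ' 0:31:19', ' 0:30:10'], name='Running-Time')
td = pd.to_timedelta(running.str.strip())

print(td.sum())               # 0 days 01:25:31 -- total running time
print(td.dt.total_seconds())  # 1442.0, 1879.0, 1810.0
```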
I have the following dataframe structure:
exec_start_date exec_finish_date hour_start hour_finish session_qtd
2020-03-01 2020-03-02 22 0 1
2020-03-05 2020-03-05 22 23 3
2020-03-03 2020-03-04 18 7 4
2020-03-07 2020-03-07 18 18 2
As you can see above, I have three situations of session execution:
1) Starts on one day and finishes on another day, with different hours
2) Starts and finishes on the same day, with different hours
3) Starts and finishes on the same day, at the same hour
I need to create a column filled with the interval between hour_start and hour_finish, and another column with the execution date. Then:
If a session starts on one day and finishes on another, the hours executed on the start date need to be attributed to exec_start_date, and the hours executed on the following day to exec_finish_date.
So an intermediate dataset would look like this:
exec_start_date exec_finish_date hour_start hour_finish session_qtd hour_interval
2020-03-01 2020-03-02 22 0 1 [22,23,0]
2020-03-05 2020-03-05 22 23 3 [22,23]
2020-03-03 2020-03-04 20 3 4 [20,21,22,23,0,1,2,3]
2020-03-07 2020-03-07 18 18 2 [18]
And the final dataset would be like this:
exec_date session_qtd hour_interval
2020-03-01 1 22
2020-03-01 1 23
2020-03-02 1 0
2020-03-05 3 22
2020-03-05 3 23
2020-03-03 4 20
2020-03-03 4 21
2020-03-03 4 22
2020-03-03 4 23
2020-03-04 4 0
2020-03-04 4 1
2020-03-04 4 2
2020-03-04 4 3
2020-03-07 2 18
I have tried to create the interval with np.arange, but it didn't work properly for all cases, especially those that start on one day and finish on another.
Can you help me?
The way I would do this is to build the full datetimes, take the hourly range between them, and then pull the hours from that range. np.arange will not work because hours wrap around midnight.
#Create two temp columns holding the full start and end datetime strings
df['start'] = df['exec_start_date'].astype(str) + " " + df['hour_start'].astype(str) + ":00.000"
df['end'] = df['exec_finish_date'].astype(str) + " " + df['hour_finish'].astype(str) + ":00.000"
#Convert them both to datetimes
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
#Apply a function to get the hourly range between start and end
df['date'] = df.apply(lambda x: pd.date_range(x['start'], x['end'], freq='H').tolist(), axis = 1)
#Explode the new date column so each datetime in the interval gets its own row
df = df.explode(column = 'date')
#Build the two requested output columns from the exploded datetimes
df['exec_date'] = df['date'].dt.date
df['hour_interval'] = df['date'].dt.hour
#Copy the requested columns into a new df
df2 = df[['exec_date','session_qtd','hour_interval']].copy()
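Put together as a runnable sketch on the four sample rows (values taken from the question's table):

```python
import pandas as pd

df = pd.DataFrame({
    'exec_start_date':  ['2020-03-01', '2020-03-05', '2020-03-03', '2020-03-07'],
    'exec_finish_date': ['2020-03-02', '2020-03-05', '2020-03-04', '2020-03-07'],
    'hour_start':  [22, 22, 18, 18],
    'hour_finish': [0, 23, 7, 18],
    'session_qtd': [1, 3, 4, 2],
})

# Full start/end datetimes, then one row per hour in between
df['start'] = pd.to_datetime(df['exec_start_date'] + ' ' + df['hour_start'].astype(str) + ':00')
df['end'] = pd.to_datetime(df['exec_finish_date'] + ' ' + df['hour_finish'].astype(str) + ':00')
df['date'] = df.apply(lambda x: pd.date_range(x['start'], x['end'], freq='h').tolist(), axis=1)
df = df.explode('date')

df['exec_date'] = df['date'].dt.date
df['hour_interval'] = df['date'].dt.hour
df2 = df[['exec_date', 'session_qtd', 'hour_interval']].reset_index(drop=True)
print(df2)
```

This yields 3 + 2 + 14 + 1 = 20 rows, one per executed hour, with the overnight sessions split across their two dates.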
Cast the df to string dtype:
df=df.astype(str)
Concat date with hour and coerce to datetime
df['exec_start_date']=pd.to_datetime(df['exec_start_date'].str.cat(df['hour_start'], sep=' ')+ ':00:00')
df['exec_finish_date']=pd.to_datetime(df['exec_finish_date'].str.cat(df['hour_finish'], sep=' ')+ ':00:00')
Derive hourly periods between the start and end datetime
df['date'] = df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'],freq='H'), end=pd.Period(x['exec_finish_date'],freq='H'), freq='H').hour.tolist(), axis = 1)
Explode date to achieve outcome
df.explode('date')
You can skip the last step by exploding immediately after deriving the hourly periods:
df.assign(date=df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'],freq='H'), end=pd.Period(x['exec_finish_date'],freq='H'), freq='H').hour.tolist(), axis = 1)).explode('date')
Outcome
exec_start_date exec_finish_date hour_start hour_finish session_qtd \
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
1 2020-03-05 22:00:00 2020-03-05 23:00:00 22 23 3
1 2020-03-05 22:00:00 2020-03-05 23:00:00 22 23 3
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
3 2020-03-07 18:00:00 2020-03-07 18:00:00 18 18 2
date
0 22
0 23
0 0
1 22
1 23
2 18
2 19
2 20
2 21
2 22
2 23
2 0
2 1
2 2
2 3
2 4
2 5
2 6
2 7
3 18
Another approach without converting the date fields:
import pandas as pd
import numpy as np
ss = '''
exec_start_date,exec_finish_date,hour_start,hour_finish,session_qtd
2020-03-01,2020-03-02,22,0,1
2020-03-05,2020-03-05,22,23,3
2020-03-03,2020-03-04,18,7,4
2020-03-07,2020-03-07,18,18,2
'''.strip()
with open('data.csv','w') as f: f.write(ss)
########## main script #############
df = pd.read_csv('data.csv')
df['hour_start'] = df['hour_start'].astype('int')
df['hour_finish'] = df['hour_finish'].astype('int')
# get hour list, group by day
df['hour_interval'] = df.apply(lambda x: list(range(x[2],x[3]+1)) if x[0]==x[1] else [(101,) + tuple(range(x[2],24))]+[(102,)+tuple(range(0,x[3]+1))], axis=1)
lst_col = 'hour_interval'
# split day group to separate rows
hrdf = pd.DataFrame({
col:np.repeat(df[col].values, df[lst_col].str.len())
for col in df.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
# choose start/end day
hrdf['exec_date'] = hrdf.apply(lambda x: x[0] if type(x[5]) is int or x[5][0] == 101 else x[1], axis=1)
# remove day indicator
hrdf['hour_interval'] = hrdf['hour_interval'].apply(lambda x: [x] if type(x) is int else list(x[1:]))
# split hours to separate rows
df = hrdf
hrdf = pd.DataFrame({
col:np.repeat(df[col].values, df[lst_col].str.len())
for col in df.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
# columns for final output
df_final = hrdf[['exec_date','session_qtd','hour_interval']]
print(df_final.to_string(index=False))
Output
exec_date session_qtd hour_interval
2020-03-01 1 22
2020-03-01 1 23
2020-03-02 1 0
2020-03-05 3 22
2020-03-05 3 23
2020-03-03 4 18
2020-03-03 4 19
2020-03-03 4 20
2020-03-03 4 21
2020-03-03 4 22
2020-03-03 4 23
2020-03-04 4 0
2020-03-04 4 1
2020-03-04 4 2
2020-03-04 4 3
2020-03-04 4 4
2020-03-04 4 5
2020-03-04 4 6
2020-03-04 4 7
2020-03-07 2 18
I want to extract the year from a datetime column into a new 'yyyy' column, AND I want the missing values (NaT) to be displayed as NaN. I guess the datetime dtype of the new column has to be changed, but that's where I'm stuck.
Initial df:
Date ID
0 2016-01-01 12
1 2015-01-01 96
2 NaT 20
3 2018-01-01 73
4 2017-01-01 84
5 NaT 26
6 2013-01-01 87
7 2016-01-01 64
8 2019-01-01 11
9 2014-01-01 34
Desired df:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 NaN
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 NaN
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
Code:
import pandas as pd
import numpy as np
# example df
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"Date": ['2016-01-01', '2015-01-01', np.nan, '2018-01-01', '2017-01-01', np.nan, '2013-01-01', '2016-01-01', '2019-01-01', '2014-01-01']})
df.ID = pd.to_numeric(df.ID)
df.Date = pd.to_datetime(df.Date)
print(df)
#extraction of year from date
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y')
#Try to set NaT to NaN or datetime to numeric, PROBLEM: empty cells keep 'NaT'
df.loc[(df['yyyy'].isna()), 'yyyy'] = np.nan
#(try1)
df.yyyy = df.Date.astype(float)
#(try2)
df.yyyy = pd.to_numeric(df.Date)
#(try3)
print(df)
Use Series.dt.year, converting to integers with the nullable Int64 dtype:
df.Date = pd.to_datetime(df.Date)
df['yyyy'] = df.Date.dt.year.astype('Int64')
print (df)
ID Date yyyy
0 12 2016-01-01 2016
1 96 2015-01-01 2015
2 20 NaT <NA>
3 73 2018-01-01 2018
4 84 2017-01-01 2017
5 26 NaT <NA>
6 87 2013-01-01 2013
7 64 2016-01-01 2016
8 11 2019-01-01 2019
9 34 2014-01-01 2014
Without the conversion to integers (the NaNs force the column to float):
df['yyyy'] = df.Date.dt.year
print (df)
ID Date yyyy
0 12 2016-01-01 2016.0
1 96 2015-01-01 2015.0
2 20 NaT NaN
3 73 2018-01-01 2018.0
4 84 2017-01-01 2017.0
5 26 NaT NaN
6 87 2013-01-01 2013.0
7 64 2016-01-01 2016.0
8 11 2019-01-01 2019.0
9 34 2014-01-01 2014.0
Your solution converts NaT to the string 'NaT', so it is possible to use replace.
Btw, in recent versions of pandas the replace is not necessary; it works correctly without it.
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y').replace('NaT', np.nan)
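A quick two-row check of that chain (in older pandas, strftime yields the string 'NaT' for missing dates, which replace turns into NaN; in recent pandas the replace is a harmless no-op):

```python
import numpy as np
import pandas as pd

s = pd.to_datetime(pd.Series(['2016-01-01', None]))
out = s.dt.strftime('%Y').replace('NaT', np.nan)

print(out.iloc[0])           # '2016' -- a string, since strftime formats
print(pd.isna(out.iloc[1]))  # True
```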
Isn't it:
df['yyyy'] = df.Date.dt.year
Output:
Date ID yyyy
0 2016-01-01 12 2016.0
1 2015-01-01 96 2015.0
2 NaT 20 NaN
3 2018-01-01 73 2018.0
4 2017-01-01 84 2017.0
5 NaT 26 NaN
6 2013-01-01 87 2013.0
7 2016-01-01 64 2016.0
8 2019-01-01 11 2019.0
9 2014-01-01 34 2014.0
For pandas 0.24.2+, you can use Int64 data type for nullable integers:
df['yyyy'] = df.Date.dt.year.astype('Int64')
which gives:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 <NA>
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 <NA>
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
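A two-row sketch confirming the nullable-integer behavior:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2016-01-01', None])})
df['yyyy'] = df['Date'].dt.year.astype('Int64')

print(df['yyyy'].dtype)  # Int64 -- stays integer-typed despite the missing value
print(df)
```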
I have a dataframe with dates and values in columns A to H. I also have some fixed variables X1=5, X2=6, Y1=7, Y2=8, Z1=9
Date A B C D E F G H
0 2018-01-02 00:00:00 7161 7205 -44 54920 73 7 5 47073
1 2018-01-03 00:00:00 7101 7147 -46 54710 73 6 5 46570
2 2018-01-04 00:00:00 7146 7189 -43 54730 70 7 5 46933
3 2018-01-05 00:00:00 7079 7121 -43 54720 70 6 5 46404
4 2018-01-08 00:00:00 7080 7125 -45 54280 70 6 5 46355
5 2018-01-09 00:00:00 7060 7102 -43 54440 70 6 5 46319
6 2018-01-10 00:00:00 7113 7153 -40 54510 70 7 5 46837
7 2018-01-11 00:00:00 7103 7141 -38 54690 70 7 5 46728
8 2018-01-12 00:00:00 7074 7110 -36 54310 65 6 5 46357
9 2018-01-15 00:00:00 7181 7210 -29 54320 65 6 5 46792
10 2018-01-16 00:00:00 7036 7078 -42 54420 65 6 5 45709
11 2018-01-17 00:00:00 6994 7034 -40 53690 65 6 5 45416
12 2018-01-18 00:00:00 7032 7076 -44 53590 65 6 5 45705
13 2018-01-19 00:00:00 6999 7041 -42 53560 65 6 5 45331
14 2018-01-22 00:00:00 7025 7068 -43 53500 65 6 5 45455
15 2018-01-23 00:00:00 6883 6923 -41 53490 65 6 5 44470
16 2018-01-24 00:00:00 7111 7150 -39 52630 65 6 5 45866
17 2018-01-25 00:00:00 7101 7138 -37 53470 65 6 5 45663
18 2018-01-26 00:00:00 7043 7085 -43 53380 65 6 5 45087
19 2018-01-29 00:00:00 7041 7085 -44 53370 65 6 5 44958
20 2018-01-30 00:00:00 7010 7050 -41 53040 65 6 5 44790
21 2018-01-31 00:00:00 7079 7118 -39 52880 65 6 5 45248
What I want to do is add some column-wise calculations to this dataframe, using the values in columns A to H as well as those fixed variables.
The tricky part is that I need to apply different variables to different date ranges.
For example, during 2018-01-01 to 2018-01-10, I want a new column I whose value equals (A+B+C)*X1*Y1+Z1;
while during 2018-01-11 to 2018-01-25, the calculation needs to take (A+B+C)*X2*Y1+Z1. Similarly, Y1 and Y2 each apply to their own date ranges.
I know this calculates/creates a new column I:
df['I'] = (df['A'] + df['B'] + df['C']) * X1 * Y1 + Z1
but I am not sure how to get the flexibility to use different variables for different date ranges.
You can use np.select to define a value based on a condition:
cond = [df.Date.between('2018-01-01','2018-01-10'), df.Date.between('2018-01-11','2018-01-25')]
values = [(df['A']+df['B']+df['C'])*X1*Y1+Z1, (df['A']+df['B']+df['C'])*X2*Y2+Z1]
# select values depending on the condition
df['I'] = np.select(cond, values)
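One caveat worth knowing (illustrated with made-up A values): rows matching none of the conditions receive np.select's default, which is 0 unless you pass default= yourself:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2018-01-05', '2018-01-15', '2018-02-01']),
                   'A': [1, 2, 3]})
cond = [df.Date.between('2018-01-01', '2018-01-10'),
        df.Date.between('2018-01-11', '2018-01-25')]
values = [df['A'] * 10, df['A'] * 20]

# Without default=..., the 2018-02-01 row would silently become 0
df['I'] = np.select(cond, values, default=np.nan)
print(df['I'].tolist())  # [10.0, 40.0, nan]
```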
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format code for the week number. What am I missing here?
You need to add another parameter specifying the day of the week, otherwise the week number is ignored when parsing - check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
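The appended '-0' picks Sunday (%w weekday 0) as the anchor within each %W week; if you would rather anchor each week on its Monday, the same trick works with '-1' (a small variation on the same idea):

```python
import pandas as pd

yearweek = pd.Series(['2014-48', '2014-49'])
mondays = pd.to_datetime(yearweek.add('-1'), format='%Y-%W-%w')
print(mondays)  # 2014-12-01, 2014-12-08 -- the Mondays of those weeks
```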
I have a pandas dataFrame like this:
content
date
2013-12-18 12:30:00 1
2013-12-19 10:50:00 1
2013-12-24 11:00:00 0
2014-01-02 11:30:00 1
2014-01-03 11:50:00 0
2013-12-17 16:40:00 10
2013-12-18 10:00:00 0
2013-12-11 10:00:00 0
2013-12-18 11:45:00 0
2013-12-11 14:40:00 4
2010-05-25 13:05:00 0
2013-11-18 14:10:00 0
2013-11-27 11:50:00 3
2013-11-13 10:40:00 0
2013-11-20 10:40:00 1
2008-11-04 14:49:00 1
2013-11-18 10:05:00 0
2013-08-27 11:00:00 0
2013-09-18 16:00:00 0
2013-09-27 11:40:00 0
date being the index.
I reduce the values to months using:
dataFrame = dataFrame.groupby([lambda x: x.year, lambda x: x.month]).agg([sum])
which outputs:
content
sum
2006 3 66
4 65
5 48
6 87
7 37
8 54
9 73
10 74
11 53
12 45
2007 1 28
2 40
3 95
4 63
5 56
6 66
7 50
8 49
9 18
10 28
Now when I plot this dataFrame, I want the x-axis to show every month/year as a tick. I have tried setting xticks but it doesn't seem to work. How can this be achieved? This is my current plot using dataFrame.plot():
You can use set_xticks() and set_xticklabels():
idx = pd.date_range("2013-01-01", periods=1000)
val = np.random.rand(1000)
s = pd.Series(val, idx)
g = s.groupby([s.index.year, s.index.month]).mean()
ax = g.plot()
ax.set_xticks(range(len(g)));
ax.set_xticklabels(["%s-%02d" % item for item in g.index.tolist()], rotation=90);
output:
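An alternative sketch (same data shape as the answer's example): resampling to month-start keeps a real DatetimeIndex, so a subsequent plot gets genuine date ticks without manually setting tick positions and labels:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", periods=1000)
s = pd.Series(np.random.rand(1000), idx)

# One mean per calendar month, indexed at the month start
g = s.resample('MS').mean()
print(g.index.min(), g.index.max(), len(g))
# g.plot() would now draw a date-aware x-axis on its own
```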