Mixing days and months in datetime value column in Pandas - python

I want to format dates in pandas as year-month-day. My dates run from April to September; there are no values from January, February, etc., but sometimes pandas reads the day as the month and the month as the day. Look at index 16 or 84.
6 2019-08-26 15:10:00
7 2019-08-25 13:22:00
8 2019-08-24 16:06:00
9 2019-08-23 15:13:00
10 2019-08-22 14:24:00
11 2019-08-21 14:02:00
12 2019-08-16 12:31:00
13 2019-08-15 15:31:00
14 2019-08-14 14:46:00
15 2019-08-13 17:13:00
16 2019-11-08 15:54:00
17 2019-10-08 10:07:00
68 2019-06-06 11:22:00
69 2019-05-06 15:16:00
70 2019-01-06 17:02:00
75 2019-05-21 09:01:00
76 2019-05-19 16:52:00
77 2019-05-15 15:40:00
78 2019-10-05 13:34:00
81 2019-06-05 11:55:00
82 2019-03-05 17:28:00
83 2019-02-05 18:01:00
84 2019-01-05 17:05:00
85 2019-01-05 09:57:00
86 2019-04-30 10:16:00
87 2019-04-29 17:51:00
88 2019-04-27 17:42:00
How can I fix this?
I want to have date-type values (year-month-day), without the time, so that I can group by day or by month.
I have tried this, but it does not work:
df['Created'] = pd.to_datetime(df['Created'], format = 'something')
And for grouping by month, I have tried this:
df['Created'] = df['Created'].dt.to_period('M')

Solution for the sample data: parse the column twice, once with each possible format, using errors='coerce' so non-matching values become NaT, then fill the missing values in the second Series (YYYY-DD-MM) from the first Series (YYYY-MM-DD) with Series.combine_first or Series.fillna:
a = pd.to_datetime(df['Created'], format = '%Y-%m-%d %H:%M:%S', errors='coerce')
b = pd.to_datetime(df['Created'], format = '%Y-%d-%m %H:%M:%S', errors='coerce')
df['Created'] = b.combine_first(a).dt.to_period('M')
#alternative
#df['Created'] = b.fillna(a).dt.to_period('M')
print (df)
Created
6 2019-08
7 2019-08
8 2019-08
9 2019-08
10 2019-08
11 2019-08
12 2019-08
13 2019-08
14 2019-08
15 2019-08
16 2019-08
17 2019-08
68 2019-06
69 2019-06
70 2019-06
75 2019-05
76 2019-05
77 2019-05
78 2019-05
81 2019-05
82 2019-05
83 2019-05
84 2019-05
85 2019-05
86 2019-04
87 2019-04
88 2019-04
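To group by day instead of month, drop only the time component after parsing. A minimal sketch re-using the same two-format trick on a few of the sample values:

```python
import pandas as pd

# A few of the sample values, including the ambiguous ones
df = pd.DataFrame({'Created': ['2019-08-26 15:10:00', '2019-11-08 15:54:00',
                               '2019-01-05 17:05:00', '2019-08-26 09:00:00']})
a = pd.to_datetime(df['Created'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
b = pd.to_datetime(df['Created'], format='%Y-%d-%m %H:%M:%S', errors='coerce')
parsed = b.combine_first(a)

# normalize() zeroes out the time, so equal calendar days group together
per_day = df.groupby(parsed.dt.normalize()).size()
print(per_day)
```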

I created a dummy dataframe to demonstrate this. Try strftime:
from datetime import datetime
import time
import pandas as pd
time1 = datetime.now()
time.sleep(6)
time2 = datetime.now()
df = pd.DataFrame({'Created': [time1, time2]})
df['Created2'] = df['Created'].apply(lambda x: x.strftime('%Y-%m-%d'))
print(df.head())
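As a side note, the vectorized .dt.strftime accessor does the same thing without a Python-level lambda. A minimal sketch with made-up timestamps:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Created': [datetime(2019, 8, 26, 15, 10),
                               datetime(2019, 4, 30, 10, 16)]})
# .dt.strftime formats the whole datetime column at once
df['Created2'] = df['Created'].dt.strftime('%Y-%m-%d')
print(df['Created2'].tolist())
```

Note that the result is a column of strings; to keep datetime semantics for grouping, dt.normalize() or dt.to_period() are usually better choices.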

Related

Pandas Drop rows for current Year-month

My dataframe has multiple years and months in "yyyy-mm-dd" format.
I would like to dynamically drop all current Year-month rows from the df
You could use a simple strftime comparison, keeping the rows where the %Y%m string is not equal to the year-month of the current date.
df1 = df.loc[
    df['Date'].dt.strftime('%Y%m') != pd.Timestamp('today').strftime('%Y%m')]
Example
d = pd.date_range('01 oct 2021', '01 dec 2021',freq='d')
df = pd.DataFrame(d,columns=['Date'])
print(df)
Date
0 2021-10-01
1 2021-10-02
2 2021-10-03
3 2021-10-04
4 2021-10-05
.. ...
57 2021-11-27
58 2021-11-28
59 2021-11-29
60 2021-11-30
61 2021-12-01
print(df1)
Date
0 2021-10-01
1 2021-10-02
2 2021-10-03
3 2021-10-04
4 2021-10-05
5 2021-10-06
6 2021-10-07
7 2021-10-08
8 2021-10-09
9 2021-10-10
10 2021-10-11
11 2021-10-12
12 2021-10-13
13 2021-10-14
14 2021-10-15
15 2021-10-16
16 2021-10-17
17 2021-10-18
18 2021-10-19
19 2021-10-20
20 2021-10-21
21 2021-10-22
22 2021-10-23
23 2021-10-24
24 2021-10-25
25 2021-10-26
26 2021-10-27
27 2021-10-28
28 2021-10-29
29 2021-10-30
30 2021-10-31
61 2021-12-01

How to add new row in time series dataframe

My dataframe has an index column of dates and one column
var
date
2020-03-10 77
2020-03-11 88
2020-03-12 99
I have an array and I want to append it to the dataframe one value at a time. I have tried a few methods but nothing is working.
My code is something like this:
for i in range(20):
    x = i * i
    df.append(x)
After each iteration dataframe needs to be appended with x value.
Final output:
var
date
2020-03-10 77
2020-03-11 88
2020-03-12 99
2020-03-13 1
2020-03-14 4
2020-03-15 9
.
.
. 20 times
Will be grateful for any suggestions.
Try this
tmpdf = pd.DataFrame({"var":[77,88,99]},index=pd.date_range("2020-03-10",periods=3,freq='D'))
for i in range(1, 21):
    idx = tmpdf.tail(1).index[0] + pd.Timedelta(days=1)
    tmpdf.loc[idx] = i * i
output
2020-03-10 77
2020-03-11 88
2020-03-12 99
2020-03-13 1
2020-03-14 4
2020-03-15 9
2020-03-16 16
2020-03-17 25
2020-03-18 36
2020-03-19 49
2020-03-20 64
2020-03-21 81
2020-03-22 100
2020-03-23 121
2020-03-24 144
2020-03-25 169
2020-03-26 196
2020-03-27 225
2020-03-28 256
2020-03-29 289
2020-03-30 324
2020-03-31 361
2020-04-01 400
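Note that DataFrame.append was removed in pandas 2.0; the loop above still works because it assigns with .loc. If the new values are known up front, the rows can also be built in one step with pd.concat, which avoids growing the frame row by row. A sketch under the same setup:

```python
import pandas as pd

tmpdf = pd.DataFrame({"var": [77, 88, 99]},
                     index=pd.date_range("2020-03-10", periods=3, freq="D"))

# Build all new rows at once instead of extending the frame in a loop
vals = [i * i for i in range(1, 21)]
new_idx = pd.date_range(tmpdf.index[-1] + pd.Timedelta(days=1),
                        periods=len(vals), freq="D")
tmpdf = pd.concat([tmpdf, pd.DataFrame({"var": vals}, index=new_idx)])
print(tmpdf.tail(3))
```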

Pandas convert Column to time

Here is my DF
Start-Time Running-Time Speed-Avg HR-Avg
0 2016-12-18 10:8:14 0:24:2 20 138
1 2016-12-18 10:8:14 0:24:2 20 138
2 2016-12-23 8:52:36 0:31:19 16 134
3 2016-12-23 8:52:36 0:31:19 16 134
4 2016-12-25 8:0:51 0:30:10 50 135
5 2016-12-25 8:0:51 0:30:10 50 135
6 2016-12-26 8:41:26 0:10:1 27 116
7 2016-12-26 8:41:26 0:10:1 27 116
8 2017-1-7 11:16:9 0:26:15 22 124
9 2017-1-7 11:16:9 0:26:15 22 124
10 2017-1-10 19:2:54 0:53:51 5 142
11 2017-1-10 19:2:54 0:53:51 5 142
and I have been trying to format this column in H:M:S format using
timeDF = pd.to_datetime(cleanDF['Running-Time'], format='%H:%M:%S')
but I keep getting this error: ValueError: time data ' 0:24:2' does not match format '%M:%S' (match)
Thank you in advance.
The problem is leading whitespace in the values, so you need str.strip:
timeDF = pd.to_datetime(cleanDF['Running-Time'].str.strip(), format='%H:%M:%S')
Or, if you create the DataFrame from a file with read_csv, add the parameter skipinitialspace=True:
cleanDF = pd.read_csv(file, skipinitialspace=True)
print (timeDF)
0 1900-01-01 00:24:02
1 1900-01-01 00:24:02
2 1900-01-01 00:31:19
3 1900-01-01 00:31:19
4 1900-01-01 00:30:10
5 1900-01-01 00:30:10
6 1900-01-01 00:10:01
7 1900-01-01 00:10:01
8 1900-01-01 00:26:15
9 1900-01-01 00:26:15
10 1900-01-01 00:53:51
11 1900-01-01 00:53:51
Name: Running-Time, dtype: datetime64[ns]
But it may be better to convert the column to timedeltas with to_timedelta:
timeDF=(pd.to_timedelta(cleanDF['Running-Time'].str.strip()))
print (timeDF)
0 00:24:02
1 00:24:02
2 00:31:19
3 00:31:19
4 00:30:10
5 00:30:10
6 00:10:01
7 00:10:01
8 00:26:15
9 00:26:15
10 00:53:51
11 00:53:51
Name: Running-Time, dtype: timedelta64[ns]
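One advantage of timedeltas over the 1900-01-01 datetimes is that they aggregate naturally. A small sketch with values taken from the sample (note the stray leading space, which str.strip removes):

```python
import pandas as pd

running = pd.to_timedelta(
    pd.Series([' 0:24:2', '0:31:19', '0:10:1']).str.strip())

print(running.mean())               # timedeltas average directly
print(running.dt.total_seconds())   # or convert to plain seconds for other math
```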

Pandas if/then aggregation

I've been searching SO and haven't figured this out yet. Hoping someone can aid this Python newb in solving my problem.
I'm trying to figure out how to write an if/then statement in Python and perform an aggregation based on it. My end goal: if the date is 1/7/2017, use the value in the "fake" column; for all other dates, average the two columns together.
Here is what I have so far:
import pandas as pd
import numpy as np
import datetime
np.random.seed(42)
dte=pd.date_range(start=datetime.date(2017,1,1), end= datetime.date(2017,1,15))
fake=np.random.randint(15,100, size=15)
fake2=np.random.randint(300,1000,size=15)
so_df = pd.DataFrame({'date': dte,
                      'fake': fake,
                      'fake2': fake2})
so_df['avg']= so_df[['fake','fake2']].mean(axis=1)
so_df.head()
Assuming you have already computed the average column:
so_df['fake'].where(so_df['date']=='20170107', so_df['avg'])
Out:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 449.0
9 395.5
10 197.0
11 438.5
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
If not, you can replace the column reference with the same calculation:
so_df['fake'].where(so_df['date']=='20170107', so_df[['fake','fake2']].mean(axis=1))
To check for multiple dates, you need to use the element-wise version of the or operator (which is pipe: |). Otherwise it will raise an error.
so_df['fake'].where((so_df['date']=='20170107') | (so_df['date']=='20170109'), so_df['avg'])
The above checks for two dates. In the case of 3 or more, you may want to use isin with a list:
so_df['fake'].where(so_df['date'].isin(['20170107', '20170109', '20170112']), so_df['avg'])
Out[42]:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 38.0
9 395.5
10 197.0
11 67.0
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
Let's use np.where:
so_df['avg'] = np.where(so_df['date'] == pd.to_datetime('2017-01-07'),
                        so_df['fake'],
                        so_df[['fake', 'fake2']].mean(1))
Output:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
One way to do if-else in pandas is with np.where. It takes three arguments: the condition, the value if true, and the value if false.
so_df['avg']= np.where(so_df['date'] == '2017-01-07',so_df['fake'],so_df[['fake','fake2']].mean(axis=1))
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
We can also use the Series.where() method (note the line continuations; without them the chained calls are a syntax error):
In [141]: so_df['avg'] = so_df['fake'] \
     ...:     .where(so_df['date'].isin(['2017-01-07', '2017-01-09'])) \
     ...:     .fillna(so_df[['fake', 'fake2']].mean(1))
     ...:
In [142]: so_df
Out[142]:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 38.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5

How can I do computations on dataframes or series that have different indexes in PANDAS?

I have two Series of the same length and dtype (both float64). The only difference is the indexes: both are dates, but one is at the beginning of the month and the other is at the end of the month. How can I do computations like correlation or covariance on Series or DataFrames that have different indexes?
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
import Quandl
IPO=Quandl.get("RITTER/US_IPO_STATS", authtoken="api key")
ir=Quandl.get("FRBC/REALRT", authtoken="api key")
ipo_splice=IPO[264:662]
new_ipo=ipo_splice['Gross Number of IPOs'];
new_ipo=new_ipo.T
ir_splice=ir[0:398]
new_ir=ir_splice['RR 1 Month']
new_ir=new_ir.T
new_ipo.corr(new_ir)
Use reset_index(drop=True) on the things you want to correlate, then concat:
s1 = pd.DataFrame(np.random.rand(10), list('abcdefghij'), columns=['s1'])
s2 = pd.DataFrame(np.random.rand(10), list('ABCDEFGHIJ'), columns=['s2'])
print(pd.concat([s.reset_index(drop=True) for s in [s1, s2]], axis=1).corr())
s1 s2
s1 1.000000 -0.437945
s2 -0.437945 1.000000
You can use the resample() function to resample one of the indices (the goal is to have both indices at either BoM or EoM):
data:
In [63]: df_bom
Out[63]:
val
2015-01-01 76
2015-02-01 27
2015-03-01 65
2015-04-01 71
2015-05-01 9
2015-06-01 23
2015-07-01 52
2015-08-01 10
2015-09-01 62
2015-10-01 25
In [64]: df_eom
Out[64]:
val
2015-01-31 87
2015-02-28 16
2015-03-31 85
2015-04-30 4
2015-05-31 37
2015-06-30 63
2015-07-31 3
2015-08-31 73
2015-09-30 81
2015-10-31 69
Solution:
In [61]: df_eom.resample('MS').first() + df_bom
Out[61]:
val
2015-01-01 163
2015-02-01 43
2015-03-01 150
2015-04-01 75
2015-05-01 46
2015-06-01 86
2015-07-01 55
2015-08-01 83
2015-09-01 143
2015-10-01 94
In [62]: df_eom.resample('MS').first().join(df_bom, lsuffix='_lft')
Out[62]:
val_lft val
2015-01-01 87 76
2015-02-01 16 27
2015-03-01 85 65
2015-04-01 4 71
2015-05-01 37 9
2015-06-01 63 23
2015-07-01 3 52
2015-08-01 73 10
2015-09-01 81 62
2015-10-01 69 25
An alternative approach is to merge the DataFrames on the year and month parts of their indexes:
In [69]: pd.merge(df_bom, df_eom,
    ...:          left_on=[df_bom.index.year, df_bom.index.month],
    ...:          right_on=[df_eom.index.year, df_eom.index.month],
    ...:          suffixes=('_bom', '_eom'))
Out[69]:
key_0 key_1 val_bom val_eom
0 2015 1 76 87
1 2015 2 27 16
2 2015 3 65 85
3 2015 4 71 4
4 2015 5 9 37
5 2015 6 23 63
6 2015 7 52 3
7 2015 8 10 73
8 2015 9 62 81
9 2015 10 25 69
Setup:
In [59]: df_bom = pd.DataFrame({'val':np.random.randint(0,100, 10)}, index=pd.date_range('2015-01-01', periods=10, freq='MS'))
In [60]: df_eom = pd.DataFrame({'val':np.random.randint(0,100, 10)}, index=pd.date_range('2015-01-01', periods=10, freq='M'))
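Tying this back to the original correlation question: converting both indexes to monthly periods aligns them without resampling at all. A sketch under the same setup (random values, so the exact coefficient will vary):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df_bom = pd.DataFrame({'val': np.random.randint(0, 100, 10)},
                      index=pd.date_range('2015-01-01', periods=10, freq='MS'))
df_eom = pd.DataFrame({'val': np.random.randint(0, 100, 10)},
                      index=pd.date_range('2015-01-01', periods=10,
                                          freq=pd.offsets.MonthEnd()))

# Snap both indexes to the month period so the values line up
s1 = df_bom['val'].copy()
s1.index = s1.index.to_period('M')
s2 = df_eom['val'].copy()
s2.index = s2.index.to_period('M')

print(s1.corr(s2))
```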