I'm new to Python and I'm facing the following problem. I have a dataframe with 2 columns, one of which is a date (datetime64[ns]). I want to keep all records within the last 12 months. My code is the following:
today=start_time.date()
last_year = today + relativedelta(months = -12)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
When I run it, I get the following message:
TypeError: type object 2017-06-05
Any ideas?
last_year seems to bring me the date that I want in the following format: 2017-06-05
Create a Timedelta object in pandas to increment the date by 12 months. Call pandas.Timestamp('now') to get the current date, and then create a date_range. Here is an example that generates monthly data for 12 months.
import pandas as pd
import datetime
list_1 = [i for i in range(0, 12)]
list_2 = [i for i in range(13, 25)]
list_3 = [i for i in range(26, 38)]
data_frame = pd.DataFrame({'A': list_1, 'B': list_2, 'C': list_3},
                          pd.date_range(pd.Timestamp('now'),
                                        pd.Timestamp('now') + pd.Timedelta(weeks=53),
                                        freq='M'))
We create a timestamp for the current date and use that as our start date. Then we create a timedelta to increment that date by 53 weeks (or 52 if you'd like), which gets us 12 months of data. Below is the output:
A B C
2018-06-30 05:05:21.335625 0 13 26
2018-07-31 05:05:21.335625 1 14 27
2018-08-31 05:05:21.335625 2 15 28
2018-09-30 05:05:21.335625 3 16 29
2018-10-31 05:05:21.335625 4 17 30
2018-11-30 05:05:21.335625 5 18 31
2018-12-31 05:05:21.335625 6 19 32
2019-01-31 05:05:21.335625 7 20 33
2019-02-28 05:05:21.335625 8 21 34
2019-03-31 05:05:21.335625 9 22 35
2019-04-30 05:05:21.335625 10 23 36
2019-05-31 05:05:21.335625 11 24 37
Try:
today = datetime.datetime.now()
You can use pandas functionality with datetime objects. The syntax is often more intuitive and obviates the need for additional imports.
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
As such, we would need to see all your code to be sure of the reason behind your error; for example, how is start_time defined?
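To make the suggestion concrete, here is a minimal, runnable sketch; the sample data, the column name mydate, and the fixed reference date 2018-06-05 are invented for illustration:

```python
import pandas as pd

# Hypothetical sample data; substitute your own dataframe and column names.
df = pd.DataFrame({
    "mydate": ["2016-12-31", "2017-06-05", "2018-01-15", "2018-05-20"],
    "value": [0, 1, 2, 3],
})

# Comparing against a pandas Timestamp avoids mixing a datetime.date with a
# datetime64 column, which is the kind of mismatch that raises a TypeError.
last_year = pd.Timestamp("2018-06-05") - pd.DateOffset(years=1)
new_df = df[pd.to_datetime(df["mydate"]) >= last_year]
print(new_df)  # keeps the three rows on or after 2017-06-05
```

In real use you would replace the fixed timestamp with pd.Timestamp('today').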
I have a pandas dataframe of energy demand vs. time:
0 1
0 20201231T23-07 39815
1 20201231T22-07 41387
2 20201231T21-07 42798
3 20201231T20-07 44407
4 20201231T19-07 45612
5 20201231T18-07 44920
6 20201231T17-07 42617
7 20201231T16-07 41454
8 20201231T15-07 41371
9 20201231T14-07 41793
10 20201231T13-07 42298
11 20201231T12-07 42740
12 20201231T11-07 43185
13 20201231T10-07 42999
14 20201231T09-07 42373
15 20201231T08-07 41273
16 20201231T07-07 38909
17 20201231T06-07 37099
18 20201231T05-07 36022
19 20201231T04-07 35880
20 20201231T03-07 36305
21 20201231T02-07 36988
22 20201231T01-07 38166
23 20201231T00-07 40167
24 20201230T23-07 42624
25 20201230T22-07 44777
26 20201230T21-07 46205
27 20201230T20-07 47324
28 20201230T19-07 48011
29 20201230T18-07 46995
30 20201230T17-07 44902
31 20201230T16-07 44134
32 20201230T15-07 44228
33 20201230T14-07 44813
34 20201230T13-07 45187
35 20201230T12-07 45622
36 20201230T11-07 45831
37 20201230T10-07 45832
38 20201230T09-07 45476
39 20201230T08-07 44145
40 20201230T07-07 41650
I need to convert the time column into hourly data. I know that Python has some tools that can convert dates directly, is there one I could use here or will I need to do it manually?
Well, just to obtain a time string you could use str.replace (recent pandas versions require regex=True for a regex pattern):
df["time"] = df["0"].str.replace(r'^\d{8}T(\d{2})-(\d{2})$', r'\1:\2', regex=True)
Assuming the time column is currently a string you could convert it to a datetime using pd.to_datetime and then extract the hour.
If you want to calculate, say, the average demand for each hour you could then use groupby.
# assuming the columns have been renamed to 'time' and 'demand'
df['time'] = pd.to_datetime(df['time'], format="%Y%m%dT%H-%M").dt.hour
df_demand_by_hour = df.groupby('time').mean()
print(df_demand_by_hour)
demand
time
0 40167.0
1 38166.0
2 36988.0
3 36305.0
4 35880.0
5 36022.0
6 37099.0
7 40279.5
8 42709.0
9 43924.5
10 44415.5
11 44508.0
12 44181.0
13 43742.5
14 43303.0
15 42799.5
16 42794.0
17 43759.5
18 45957.5
19 46811.5
20 45865.5
21 44501.5
22 43082.0
23 41219.5
I don't know exactly what the -07 means (possibly a UTC offset), but you can turn the string into a datetime by doing:
import pandas as pd
df['0'] = pd.to_datetime(df['0'].str[:11], format='%Y%m%dT%H').dt.strftime('%H:%M:%S')
df
0 1
0 23:00:00 39815
1 22:00:00 41387
2 21:00:00 42798
3 20:00:00 44407
4 19:00:00 45612
...
I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y, m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
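A vectorized alternative for pt.1, if you prefer to skip apply: datetime Series expose days_in_month directly through the .dt accessor.

```python
import pandas as pd

# Same sample frame as above; .dt.days_in_month gives the month length per row.
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].dt.days_in_month
print(df.head(3))  # 31, 29 (2020 is a leap year), 31
```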
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
    df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
I have a pandas dataframe with id and date as its 2 columns; the date column has timestamps all the way down to seconds.
data = {'id':[17,17,17,17,17,18,18,18,18],'date':['2018-01-16','2018-01-26','2018-01-27','2018-02-11',
'2018-03-14','2018-01-28','2018-02-12','2018-02-25','2018-03-04'],
}
df1 = pd.DataFrame(data)
I would like to have a new column (tslt), 'time_since_last_transaction'. The first transaction for each unique user_id could be a number, say 1. Each subsequent transaction for that user should measure the difference between the first timestamp for that user and its current timestamp, giving a time difference in seconds.
I tried datetime and timedelta etc., but didn't have much luck. Any help would be appreciated.
You can try groupby().transform():
df1['date'] = pd.to_datetime(df1['date'])
df1['diff'] = df1['date'].sub(df1.groupby('id').date.transform('min')).dt.total_seconds()
Output:
id date diff
0 17 2018-01-16 0.0
1 17 2018-01-26 864000.0
2 17 2018-01-27 950400.0
3 17 2018-02-11 2246400.0
4 17 2018-03-14 4924800.0
5 18 2018-01-28 0.0
6 18 2018-02-12 1296000.0
7 18 2018-02-25 2419200.0
8 18 2018-03-04 3024000.0
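Note that the requested column name, 'time_since_last_transaction', suggests the gap to the previous transaction rather than the first. If that is what you want, a groupby().diff() sketch using the question's sample data would be:

```python
import pandas as pd

data = {'id': [17, 17, 17, 17, 17, 18, 18, 18, 18],
        'date': ['2018-01-16', '2018-01-26', '2018-01-27', '2018-02-11',
                 '2018-03-14', '2018-01-28', '2018-02-12', '2018-02-25', '2018-03-04']}
df1 = pd.DataFrame(data)
df1['date'] = pd.to_datetime(df1['date'])
# Seconds since the previous transaction per id; NaN marks each id's first row.
df1['tslt'] = df1.groupby('id')['date'].diff().dt.total_seconds()
print(df1)
```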
I'm trying to use Pandas to filter the dataframe. So in the dataset I have 1982-01 to 2019-11. I want to filter data based on year 2010 onwards ie. 2010-01 to 2019-11.
mydf = pd.read_csv('temperature_mean.csv')
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
I had set the index to month, and I'm able to get the mean temperature for the filtered data. However, I need the index values as my x labels for a line graph, and I'm not able to do it. I tried to use a regex to get data from 201x onwards, but there's still an error.
How do I get the labels for the months, i.e. 2010-01, 2010-02, ..., 2019-10, 2019-11, for the line graph?
Thanks!
mydf = pd.read_csv('temperature_mean.csv')
month mean_temp
______________________
0 1982-01-01 39
1 1985-04-01 29
2 1999-03-01 19
3 2010-01-01 59
4 2013-05-01 32
5 2015-04-01 34
6 2016-11-01 59
7 2017-08-01 14
8 2017-09-01 7
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
mean_temp
month
______________________
2010-01-01 59
2013-05-01 32
2015-04-01 34
2016-11-01 59
2017-08-01 14
2017-09-01 7
Drawing the line plot (default x & y arguments):
df.plot.line()
If, for some reason, you want to manually specify the column names:
df.reset_index().plot.line(x='month', y='mean_temp')
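If you want explicit YYYY-MM tick labels rather than the defaults, you can format them from the index. A minimal sketch, assuming sample data like the table above:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, just for this sketch
import matplotlib.pyplot as plt

# Hypothetical sample resembling the filtered dataframe above.
df = pd.DataFrame({'mean_temp': [59, 32, 34, 59, 14, 7]},
                  index=pd.to_datetime(['2010-01-01', '2013-05-01', '2015-04-01',
                                        '2016-11-01', '2017-08-01', '2017-09-01']))

labels = df.index.strftime('%Y-%m')  # '2010-01', '2013-05', ...
ax = df['mean_temp'].plot.line()
ax.set_xticks(df.index)
ax.set_xticklabels(labels, rotation=45)
plt.tight_layout()
```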
Thank you in advance for your assistance.
I want to set 'Counter' to 1 whenever there is a change in month, have it increment by 1 until the month changes again, and repeat. Like so:
A Month Counter
2015-10-30 -1.478066 10 21
2015-10-31 -1.562437 10 22
2015-11-01 -0.292285 11 1
2015-11-02 -1.581140 11 2
2015-11-03 0.603113 11 3
2015-11-04 -0.543563 11 4
In [1]: import pandas as pd
import numpy as np
In [2]: dates = pd.date_range('20151030',periods=6)
In [3]: df =pd.DataFrame(np.random.randn(6,1),index=dates,columns=list('A'))
In [4]: df
Out[4]: A
2015-10-30 -1.478066
2015-10-31 -1.562437
2015-11-01 -0.292285
2015-11-02 -1.581140
2015-11-03 0.603113
2015-11-04 -0.543563
I tried this, but it adds 1 to the actual month integer:
In [5]: df['Month'] = df.index.month
In [6]: df['Counter'] = np.where(df['Month'] != df['Month'], 1, df['Month'].shift() + 1)
In [7]: df
Out[7]: A Month Counter
2015-10-30 -1.478066 10 NaN
2015-10-31 -1.562437 10 11
2015-11-01 -0.292285 11 11
2015-11-02 -1.581140 11 12
2015-11-03 0.603113 11 12
2015-11-04 -0.543563 11 12
Tried datetime, getting closer:
In[8]: from datetime import timedelta
In[9]: df['Counter'] = df.index + timedelta(days=1)
Out[9]: A Month Counter
2015-10-30 -1.478066 10 2015-10-31
2015-10-31 -1.562437 10 2015-11-01
2015-11-01 -0.292285 11 2015-11-02
2015-11-02 -1.581140 11 2015-11-03
2015-11-03 0.603113 11 2015-11-04
2015-11-04 -0.543563 11 2015-11-05
The latter gives me the date, but not my counter. New to Python, so any help is appreciated. Thank you!
Edit, extending df to periods=300 to include over 12 months of data:
In[10]: dates = pd.date_range('19971002',periods=300)
In[11]: df=pd.DataFrame(np.random.randn(300,1),index=dates,columns=list('A'))
In[12]: df['Counter'] = df.groupby(df.index.month).cumcount()+1
In[13]: df.head()
Out[13] A Counter
1997-09-29 -0.875468 20
1997-09-30 1.498145 21
1997-10-02 0.141262 1
1997-10-03 0.581974 2
1997-10-04 0.581974 3
In[14]: df[250:]
Out[14] A Counter
1998-09-29 -0.875468 20
1998-09-30 1.498145 21
1998-10-01 0.141262 24
1998-10-02 0.581974 25
Desired results:
Out[13] A Counter
1997-09-29 -0.875468 20
1997-09-30 1.498145 21
1997-10-02 0.141262 1
1997-10-03 0.581974 2
1997-10-04 0.581974 3
The code works fine (Out[13] above), but it seems that once the data goes beyond 12 months the counter keeps incrementing by 1 instead of resetting to 1 (Out[14] above). Also, it gets tricky here: the random date generator includes weekends, but my data only has weekday data. Hope that helps you help me better. Thank you!
You could use groupby/cumcount to assign a cumulative count to each group:
import pandas as pd
import numpy as np
N = 300
dates = pd.date_range('19971002', periods=N, freq='B')
df = pd.DataFrame(np.random.randn(N, 1),index=dates,columns=list('A'))
df['Counter'] = df.groupby([df.index.year, df.index.month]).cumcount()+1
print(df.loc['1998-09-25':'1998-10-05'])
yields
A Counter
1998-09-25 -0.511721 19
1998-09-28 1.912757 20
1998-09-29 -0.988309 21
1998-09-30 1.277888 22
1998-10-01 -0.579450 1
1998-10-02 -2.486014 2
1998-10-05 0.728789 3
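An equivalent grouping key is a monthly Period via to_period('M'), which encodes year and month in a single value; a sketch:

```python
import pandas as pd
import numpy as np

N = 300
dates = pd.date_range('19971002', periods=N, freq='B')
df = pd.DataFrame(np.random.randn(N, 1), index=dates, columns=list('A'))

# to_period('M') keys on year AND month at once, so the counter resets at
# every new calendar month, even across year boundaries.
df['Counter'] = df.groupby(df.index.to_period('M')).cumcount() + 1
print(df.loc['1998-09-28':'1998-10-02'])  # counter resets to 1 on 1998-10-01
```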