My dataset has dates in the European format (dd/mm/yyyy), and I'm struggling to convert them correctly with pd.to_datetime: whenever the day is 12 or lower, the month and day get swapped.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force to_datetime to treat the input as dd/mm/yy?
Thanks for the help!
Edit: a sample of my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Pass an explicit format so the day is always read first:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
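To make the effect of the format argument concrete, here is a minimal sketch on made-up data (the sample values and the names raw, csv_text and from_csv are illustrative assumptions, not taken from the question). It also shows that dayfirst=True in read_csv only matters when the column is actually parsed via parse_dates.

```python
import io
import pandas as pd

# Ambiguous dd/mm/yyyy strings: the first two could be misread as mm/dd.
raw = pd.Series(['01/03/2018', '06/08/2018', '31/03/2018'])

# An explicit format removes the ambiguity: day first, then month.
parsed = pd.to_datetime(raw, format='%d/%m/%Y')
print(parsed)

# When reading a CSV, dayfirst only takes effect for columns listed in parse_dates.
csv_text = "Date,value\n01/03/2018,1\n30/04/2018,2\n"
from_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], dayfirst=True)
print(from_csv)
```

With an explicit format, a string that does not match raises an error instead of being silently reinterpreted, which is usually what you want for a column with a known layout.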
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
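The same idea can be sketched without apply, using the vectorized string accessor. This is an alternative under the assumption that every value is a well-formed dd/mm/yyyy string; it is not the answer's original code.

```python
import pandas as pd

df = pd.DataFrame(
    {'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)

# Split "dd/mm/yyyy" into three string columns, then cast them to integers.
date_parts = df['Date'].str.split('/', expand=True).astype(int)
date_parts.columns = ['day', 'month', 'year']

# to_datetime assembles dates from columns named year, month and day.
df['Date'] = pd.to_datetime(date_parts)
print(df)
```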
I have the following date column that I would like to transform to a pandas datetime object. Is it possible to do this with weekly data? For example, 1-2018 stands for week 1 in 2018 and so on. I tried the following conversion but I get an error message: Cannot use '%W' or '%U' without day and year
import pandas as pd
df1 = pd.DataFrame(columns=["date"])
df1['date'] = ["1-2018", "1-2018", "2-2018", "2-2018", "3-2018", "4-2018", "4-2018", "4-2018"]
df1["date"] = pd.to_datetime(df1["date"], format = "%W-%Y")
You need to add a day of the week to the datetime format, because a week number alone does not identify a date:
df1["date"] = pd.to_datetime('0' + df1["date"], format='%w%W-%Y')
print(df1)
Output
date
0 2018-01-07
1 2018-01-07
2 2018-01-14
3 2018-01-14
4 2018-01-21
5 2018-01-28
6 2018-01-28
7 2018-01-28
As the error message says, you need to specify the day of the week by adding %w:
df1["date"] = pd.to_datetime( '0'+df1.date, format='%w%W-%Y')
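If you would rather have each week labelled by its Monday than by its Sunday, a small variation is to append the weekday instead of prepending it. This is a sketch that assumes the week numbers follow the %W convention (weeks start on Monday, and week 1 begins on the first Monday of the year), which is not the same as ISO week numbering; the column name week_start is made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['1-2018', '1-2018', '2-2018', '3-2018', '4-2018']})

# Append "-1" (Monday under %w) so every week label carries a weekday.
df1['week_start'] = pd.to_datetime(df1['date'] + '-1', format='%W-%Y-%w')
print(df1)
```

Putting the weekday at the end, separated by a dash, also keeps the format string easy to read when week numbers have two digits.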
I want to create a dataframe with date from previous years. For example something like this -
df = pd.DataFrame({'Years': pd.date_range('2021-09-21', periods=-5, freq='Y')})
but negative period is not supported. How to achieve that?
Use the end parameter of date_range and then add a DateOffset:
d = pd.to_datetime('2021-09-21')
df = pd.DataFrame({'Years': pd.date_range(end=d, periods=5, freq='Y') +
pd.DateOffset(day=d.day, month=d.month)})
print (df)
Years
0 2016-09-21
1 2017-09-21
2 2018-09-21
3 2019-09-21
4 2020-09-21
Or, if the current year should also be included as the last value of the column, use YS (year start):
d = pd.to_datetime('2021-09-21')
df = pd.DataFrame({'Years': pd.date_range(end=d, periods=5, freq='YS') +
pd.DateOffset(day=d.day, month=d.month)})
print (df)
Years
0 2017-09-21
1 2018-09-21
2 2019-09-21
3 2020-09-21
4 2021-09-21
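A way to get the same dates without relying on yearly frequency aliases (whose spelling has changed between pandas versions, for example 'Y' versus 'YE' in recent releases) is to subtract year offsets from the anchor date directly. This is a sketch of an alternative, not the answer's method; the name years is introduced for illustration.

```python
import pandas as pd

d = pd.Timestamp('2021-09-21')

# Step back 5, 4, ..., 1 years from the anchor date.
years = [d - pd.DateOffset(years=n) for n in range(5, 0, -1)]
df = pd.DataFrame({'Years': years})
print(df)
```

Changing the range to range(4, -1, -1) would include the anchor year itself as the last row.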
I have a dataframe like the following:
df.head(4)
timestamp user_id category
0 2017-09-23 15:00:00+00:00 A Bar
1 2017-09-14 18:00:00+00:00 B Restaurant
2 2017-09-30 00:00:00+00:00 B Museum
3 2017-09-11 17:00:00+00:00 C Museum
I would like to count the number of visitors for each category in each hour, and end up with a dataframe like the following:
df
year month day hour category count
0 2017 9 11 0 Bar 2
1 2017 9 11 1 Bar 1
2 2017 9 11 2 Bar 0
3 2017 9 11 3 Bar 1
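Assuming you want to group by date and hour, you can use the following code, provided the timestamp column is a datetime column. Create the new columns with bracket assignment; assigning through an attribute such as df.year does not add a column, so the groupby would fail.
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour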
grouped_data = df.groupby(['year','month','day','hour','category']).count()
For getting the count of user_id per hour per category you can use groupby with your datetime:
df.timestamp = pd.to_datetime(df['timestamp'])
df_new = df.groupby([df.timestamp.dt.year,
df.timestamp.dt.month,
df.timestamp.dt.day,
df.timestamp.dt.hour,
'category']).count()['user_id']
df_new.index.names = ['year', 'month', 'day', 'hour', 'category']
df_new = df_new.reset_index()
When a dataframe column holds datetimes, you can use the dt accessor to reach the individual parts of each value, e.g. the year.
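Putting the pieces together, here is a minimal runnable sketch on a few made-up rows (the sample data and the names visits and counts are assumptions for illustration). It uses size() to count rows per group and names the result column count. Note that it only lists hour and category combinations that actually occur; rows with a count of 0, as in the desired output, would need an extra reindexing step.

```python
import pandas as pd

visits = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2017-09-23 15:00:00+00:00',
        '2017-09-23 15:30:00+00:00',
        '2017-09-14 18:00:00+00:00',
        '2017-09-30 00:00:00+00:00',
    ]),
    'user_id': ['A', 'B', 'B', 'C'],
    'category': ['Bar', 'Bar', 'Restaurant', 'Museum'],
})

ts = visits['timestamp']
counts = (
    # Add the date parts as real columns, then count rows per group.
    visits.assign(year=ts.dt.year, month=ts.dt.month, day=ts.dt.day, hour=ts.dt.hour)
    .groupby(['year', 'month', 'day', 'hour', 'category'])
    .size()
    .reset_index(name='count')
)
print(counts)
```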
I would like to get the number of days before the end of the month, from a string column representing a date.
I have the following pandas dataframe :
df = pd.DataFrame({'date':['2019-11-22','2019-11-08','2019-11-30']})
df
date
0 2019-11-22
1 2019-11-08
2 2019-11-30
I would like the following output :
df
date days_end_month
0 2019-11-22 8
1 2019-11-08 22
2 2019-11-30 0
The package pd.tseries.MonthEnd with rollforward seemed a good pick, but I can't figure out how to use it to transform a whole column.
Subtract the day of the month (Series.dt.day) from the number of days in that month (Series.dt.daysinmonth):
df['date'] = pd.to_datetime(df['date'])
df['days_end_month'] = df['date'].dt.daysinmonth - df['date'].dt.day
Or roll each date forward with offsets.MonthEnd(0), subtract the original date, and convert the resulting timedeltas to days with Series.dt.days:
df['days_end_month'] = (df['date'] + pd.offsets.MonthEnd(0) - df['date']).dt.days
print (df)
date days_end_month
0 2019-11-22 8
1 2019-11-08 22
2 2019-11-30 0
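As a quick sanity check, the two approaches can be compared side by side, including a leap-year February. The sample dates and the column names via_daysinmonth and via_monthend are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2019-11-22', '2019-11-30', '2020-02-10']})
df['date'] = pd.to_datetime(df['date'])

# Approach 1: days in the month minus the day of the month.
df['via_daysinmonth'] = df['date'].dt.daysinmonth - df['date'].dt.day

# Approach 2: roll forward to the month end (n=0 leaves month-end dates unchanged).
df['via_monthend'] = (df['date'] + pd.offsets.MonthEnd(0) - df['date']).dt.days

print(df)
```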
I have a pandas dataframe with a MultiIndex that looks like this:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
# multi-indexed dataframe
df = pd.DataFrame(np.random.randn(8760 * 3, 3))
df['concept'] = "some_value"
df['datetime'] = pd.date_range(start='2016', periods=len(df), freq='60Min')
df.set_index(['concept', 'datetime'], inplace=True)
df.sort_index(inplace=True)
Console output:
df.head()
Out[23]:
0 1 2
datetime
2016 0.458802 0.413004 0.091056
2016 -0.051840 -1.780310 -0.304122
2016 -1.119973 0.954591 0.279049
2016 -0.691850 -0.489335 0.554272
2016 -1.278834 -1.292012 -0.637931
df.head()
...: df.tail()
Out[24]:
0 1 2
datetime
2018 -1.872155 0.434520 -0.526520
2018 0.345213 0.989475 -0.892028
2018 -0.162491 0.908121 -0.993499
2018 -1.094727 0.307312 0.515041
2018 -0.880608 -1.065203 -1.438645
Now I want to create annual sums along the level 'datetime'.
My first try was the following but this doesn't work:
# sum along years
years = df.index.get_level_values('datetime').year.tolist()
df.index.set_levels([years], level=['datetime'], inplace=True)
df = df.groupby(level=['datetime']).sum()
And it also seems quite heavy handed to me as this task is probably pretty easy to realize.
So here's my question: How can I get annual sums for the level 'datetime'? Is there a simple way to realize this by applying a function to the DateTime level values?
You can group by the year of the second level of the MultiIndex:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
# multi-indexed dataframe
df = pd.DataFrame(np.random.randn(8760 * 3, 3))
df['concept'] = "some_value"
df['datetime'] = pd.date_range(start='2016', periods=len(df), freq='60Min')
df.set_index(['concept', 'datetime'], inplace=True)
df.sort_index(inplace=True)
print(df.head())
0 1 2
concept datetime
some_value 2016-01-01 00:00:00 1.973437 0.101535 -0.693360
2016-01-01 01:00:00 1.221657 -1.983806 -0.075609
2016-01-01 02:00:00 -0.208122 -2.203801 1.254084
2016-01-01 03:00:00 0.694332 -0.235864 0.538468
2016-01-01 04:00:00 -0.928815 -1.417445 1.534218
# sum along years
#years = df.index.get_level_values('datetime').year.tolist()
#df.index.set_levels([years], level=['datetime'], inplace=True)
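print(df.index.levels[1].year)
[2016 2016 2016 ..., 2018 2018 2018]
# levels[1] holds the unique level values, so this only lines up with the rows because every timestamp here is unique
df = df.groupby(df.index.levels[1].year).sum()
print(df.head())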
0 1 2
2016 -93.901914 -32.205514 -22.460965
2017 205.681817 67.701669 -33.960801
2018 67.438355 150.954614 -21.381809
Or you can use get_level_values and year:
df = df.groupby(df.index.get_level_values('datetime').year).sum()
print(df.head())
0 1 2
2016 -93.901914 -32.205514 -22.460965
2017 205.681817 67.701669 -33.960801
2018 67.438355 150.954614 -21.381809
Starting with your sample data:
df = pd.DataFrame(np.random.randn(8760 * 3, 3))
df['concept'] = "some_value"
df['datetime'] = pd.date_range(start='2016', periods=len(df), freq='60Min')
df.set_index(['concept', 'datetime'], inplace=True)
you can apply groupby to a level of your MultiIndex:
df.groupby(pd.TimeGrouper(level='datetime', freq='A')).sum()
to get:
0 1 2
datetime
2016-12-31 100.346135 -71.673222 42.816675
2017-12-31 -132.880909 -66.017010 -73.449358
2018-12-31 -71.449710 -15.774929 97.634349
pd.TimeGrouper was deprecated (as of pandas 0.23) and has since been removed from the library; use pd.Grouper(level='datetime', freq=...) instead.
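For reference, a sketch of the same annual sum written with pd.Grouper. It fills the frame with ones instead of random numbers so the totals are easy to check (each sum is simply the number of hours in that year), and it uses the year-start alias 'YS', so the groups are labelled with 1 January rather than 31 December. These choices are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Three years of hourly rows, all ones, indexed by (concept, datetime).
idx = pd.date_range(start='2016-01-01', periods=8760 * 3, freq='60min')
df = pd.DataFrame(np.ones((len(idx), 3)))
df['concept'] = 'some_value'
df['datetime'] = idx
df = df.set_index(['concept', 'datetime']).sort_index()

# Group on the 'datetime' level of the MultiIndex in yearly bins.
annual = df.groupby(pd.Grouper(level='datetime', freq='YS')).sum()
print(annual)
```

To keep the concept level in the result as well, pass a list such as ['concept', pd.Grouper(level='datetime', freq='YS')] to groupby.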