As per the discussion, the date, year, month and quarter can be extracted in pandas as below:
import pandas as pd
df = pd.DataFrame({'date_text': ['Jan 2020', 'May 2020', 'Jun 2020']})
df['date'] = pd.to_datetime(df.date_text).dt.date
df['year'], df['month'], df['qtr'] = df['date'].dt.year, df['date'].dt.month, df['date'].dt.quarter
However, this raises an error:
AttributeError: Can only use .dt accessor with datetimelike values
Where did I go wrong?
Fix it by removing the first .dt.date. It turns the column into plain Python date objects (object dtype), which the .dt accessor does not support:
df['date'] = pd.to_datetime(df.date_text)
df['year'], df['month'], df['qtr'] = df['date'].dt.year, df['date'].dt.month, df['date'].dt.quarter
df
Out[43]:
date_text date year month qtr
0 Jan 2020 2020-01-01 2020 1 1
1 May 2020 2020-05-01 2020 5 2
2 Jun 2020 2020-06-01 2020 6 2
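A minimal sketch of why the error appears, using the same sample data. The explicit format string '%b %Y' and the variable names parsed and as_dates are assumptions added for illustration; the point is only that .dt.date yields an object column while the parsed column stays datetime-like.

```python
import pandas as pd

df = pd.DataFrame({"date_text": ["Jan 2020", "May 2020", "Jun 2020"]})

# Parsed column: a real datetime64 Series, so .dt works on it.
parsed = pd.to_datetime(df["date_text"], format="%b %Y")

# .dt.date converts each value to a plain datetime.date object,
# which leaves an object-dtype Series that has no .dt accessor.
as_dates = parsed.dt.date

df["date"] = parsed
df = df.assign(
    year=parsed.dt.year,
    month=parsed.dt.month,
    qtr=parsed.dt.quarter,
)
print(parsed.dtype, as_dates.dtype)
print(df)
```

If a date-only column is still wanted for display, it can be derived last, after year, month and quarter have been extracted.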
I am trying to convert several date formats into YYYY-MM-DD and then merge them into one column, ignoring the nulls, but I end up with TypeError: cannot add DatetimeArray and DatetimeArray
import datetime
import pandas as pd
data = [[ 'Apr 2021'], ['Jan 1'], ['Fri'], [ 'Jan 18']]
df = pd.DataFrame(data, columns = ['date', ])
#convert Month date Jan 1
df['date1']=(pd.to_datetime('2021 '+ df['date'],errors='coerce',format='%Y %b %d'))
# convert Month Year Apr 2021
df['date2']=pd.to_datetime(df['date'], errors='coerce')
#convert fri to this friday
today = datetime.date.today()
friday = today + datetime.timedelta( (4-today.weekday()) % 7 )
this_friday = friday.strftime('%Y-%m-%d')
df['date3']=df['date'].map({'Fri':this_friday})
df['date3'] = pd.to_datetime(df['date3'])
df['dateFinal'] = df['date1'] + df['date2'] + df['date3']
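I checked the dtypes and they are all datetime, so I don't know why this fails. My approach is not efficient either, so feel free to suggest a better way.
IIUC:
Datetimes cannot be added to each other with +; what you want is the first non-null value in each row.
try via bfill() on axis=1: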
df['dateFinal'] = df[['date1','date2','date3']].bfill(axis=1).iloc[:,0]
OR
via ffill() on axis=1:
df['dateFinal'] = df[['date1','date2','date3']].ffill(axis=1).iloc[:,-1]
OR
via stack()+to_numpy() (this only works when every row has exactly one non-null value):
df['dateFinal'] = df[['date1','date2','date3']].stack().to_numpy()
output of df:
date date1 date2 date3 dateFinal
0 Apr 2021 NaT 2021-04-01 NaT 2021-04-01
1 Jan 1 2021-01-01 NaT NaT 2021-01-01
2 Fri NaT NaT 2021-08-13 2021-08-13
3 Jan 18 2021-01-18 NaT NaT 2021-01-18
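Another way to pick the first non-null date per row is Series.combine_first, which is not used in the answer above and is shown here only as an alternative sketch. The three columns are rebuilt by hand from the output table, so the literal dates are taken from that table rather than computed.

```python
import pandas as pd

# Columns rebuilt from the output table above; None becomes NaT.
df = pd.DataFrame({
    "date1": pd.to_datetime([None, "2021-01-01", None, "2021-01-18"]),
    "date2": pd.to_datetime(["2021-04-01", None, None, None]),
    "date3": pd.to_datetime([None, None, "2021-08-13", None]),
})

# combine_first keeps the caller's value and fills its gaps from the argument,
# so chaining it gives "date1, else date2, else date3".
df["dateFinal"] = (
    df["date1"].combine_first(df["date2"]).combine_first(df["date3"])
)
print(df)
```

Unlike stack(), this still works when a row has more than one value or none at all; a row with no value simply stays NaT.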
I have a dataframe like the following:
df.head(4)
timestamp user_id category
0 2017-09-23 15:00:00+00:00 A Bar
1 2017-09-14 18:00:00+00:00 B Restaurant
2 2017-09-30 00:00:00+00:00 B Museum
3 2017-09-11 17:00:00+00:00 C Museum
I would like to count, for each hour, the number of visitors in each category and get a dataframe like the following:
df
year month day hour category count
0 2017 9 11 0 Bar 2
1 2017 9 11 1 Bar 1
2 2017 9 11 2 Bar 0
3 2017 9 11 3 Bar 1
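Assuming you want to group by date and hour, you can use the following code if the timestamp column is a datetime column. Note that new columns must be created with bracket assignment; df.year = ... only sets an attribute and does not add a column:
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour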
grouped_data = df.groupby(['year','month','day','hour','category']).count()
For getting the count of user_id per hour per category you can use groupby with your datetime:
df.timestamp = pd.to_datetime(df['timestamp'])
df_new = df.groupby([df.timestamp.dt.year,
                     df.timestamp.dt.month,
                     df.timestamp.dt.day,
                     df.timestamp.dt.hour,
                     'category']).count()['user_id']
df_new.index.names = ['year', 'month', 'day', 'hour', 'category']
df_new = df_new.reset_index()
When you have a datetime column in a dataframe, you can use the dt accessor, which gives you access to the individual parts of the datetime, e.g. the year.
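A self-contained sketch of the same grouping, built on the four sample rows from the question. Renaming each extracted part before grouping and using reset_index(name="count") are choices made here for illustration, not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-09-23 15:00:00+00:00",
        "2017-09-14 18:00:00+00:00",
        "2017-09-30 00:00:00+00:00",
        "2017-09-11 17:00:00+00:00",
    ]),
    "user_id": ["A", "B", "B", "C"],
    "category": ["Bar", "Restaurant", "Museum", "Museum"],
})

ts = df["timestamp"].dt  # the accessor exposes year, month, day, hour, ...

# Group on the extracted parts plus the category column; renaming each
# part gives the resulting index levels readable names.
counts = (
    df.groupby([
        ts.year.rename("year"),
        ts.month.rename("month"),
        ts.day.rename("day"),
        ts.hour.rename("hour"),
        "category",
    ])["user_id"]
    .count()
    .reset_index(name="count")
)
print(counts)
```

This only lists hour and category combinations that actually occur; hours with zero visitors, as in the desired output, would need an extra reindexing step.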
I am trying to understand how I can edit the dataframe in python using pandas so I can drop everything but the year.
Example: if the date is 2014-01-01, I want it to show 2014 and drop both the month and the date. All the dates are in a single column.
Thanks in advance!
You can convert the numpy.datetime64 date value to datetime using pd.to_datetime() and then you can extract year or month or day from it.
import numpy as np
import pandas as pd
date = np.datetime64('2014-01-01')
type(date)
Output:
numpy.datetime64
Convert this date to pandas datetime using pd.to_datetime.
date = pd.to_datetime(date)
type(date)
Output:
pandas._libs.tslibs.timestamps.Timestamp
Then you can extract the year using .year
date.year
Output:
2014
So, if you have a df:
df = pd.DataFrame({'date': [np.datetime64('2014-01-01'), np.datetime64('2015-01-01'), np.datetime64('2016-01-01')]})
df['date'] = pd.DatetimeIndex(df['date']).year
df
Output:
date
0 2014
1 2015
2 2016
Alternatively, you can also do this (note that strftime returns the year as a string, not an integer):
df = pd.DataFrame({'date': [np.datetime64('2014-01-01'), np.datetime64('2015-01-01'), np.datetime64('2016-01-01')]})
df['date'] = df['date'].apply(lambda x: x.strftime('%Y'))
df
Output:
date
0 2014
1 2015
2 2016
EDIT 1
Group by using year when the column has date values
df = pd.DataFrame({'date': [np.datetime64('2014-01-01'), np.datetime64('2015-01-01'), np.datetime64('2016-01-01')]})
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df.groupby(df.index.year).size()
Output:
date
2014 1
2015 1
2016 1
You can still do the same even if you have removed the month and day from the date and only have year in your column
df = pd.DataFrame({'date': [np.datetime64('2014-01-01'), np.datetime64('2015-01-01'), np.datetime64('2016-01-01')]})
df['date'] = pd.DatetimeIndex(df['date']).year
df.groupby('date').size()
Output:
date
2014 1
2015 1
2016 1
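When the dates live in a regular column rather than in the index, the year can also be read through the .dt accessor. This is a small sketch under the assumption that the column is already datetime-typed; the new column name year and the variable per_year are hypothetical names chosen here so the original dates are kept.

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2014-01-01", "2015-01-01", "2016-01-01"])}
)

# .dt.year returns the year of each row as an integer.
df["year"] = df["date"].dt.year

# Grouping on the new column counts the rows per year.
per_year = df.groupby("year").size()
print(df)
print(per_year)
```

Writing the result to a separate column keeps the full date available in case the month or day is needed later.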
Split Date e.g. Aug 2018 --> 01-08-2018 ??
Here's my sample input
id year_pass
1 Aug 2018 - Nov 2018
2 Jul 2017
Here's my sample input 2
id year_pass
1 Jul 2018
2 Aug 2017 - Nov 2018
What I did:
I'm able to split the date on the '-' when the value is a range, e.g. (Aug 2018 - Nov 2018)
# splitting the date column on the '-'
year_start, year_end = df['year_pass'].str.split('-')
df.drop('year_pass', axis=1, inplace=True)
# assigning the split values to columns
df['year_start'] = year_start
df['year_end'] = year_end
# converting to datetime objects
df['year_start'] = pd.to_datetime(df['year_start'])
df['year_end'] = pd.to_datetime(df['year_end'])
But I couldn't figure out how to handle both cases (a single date and a range).
Output should be:
id year_start year_end
1 01-08-2018 01-11-2018
2 01-07-2017
This is one approach using dt.strftime("%d-%m-%Y").
Ex:
import pandas as pd
df = pd.DataFrame({"year_pass": ["Aug 2018 - Nov 2018", "Jul 2017"]})
df[["year_start", 'year_end']] = df["year_pass"].str.split(" - ", expand=True)
df["year_start"] = pd.to_datetime(df['year_start']).dt.strftime("%d-%m-%Y")
df["year_end"] = pd.to_datetime(df['year_end']).dt.strftime("%d-%m-%Y")
df.drop('year_pass', axis=1, inplace=True)
print(df)
Output:
year_start year_end
0 01-08-2018 01-11-2018
1 01-07-2017 NaT
Edit as per comment:
import pandas as pd
def replaceInitialSpace(val):
    if val.startswith(" "):
        return " - "+val.strip()
    return val
df = pd.DataFrame({"year_pass": [" Jul 2018", "Aug 2018 - Nov 2018", "Jul 2017 "]})
df["year_pass"] = df["year_pass"].apply(replaceInitialSpace)
df[["year_start", 'year_end']] = df["year_pass"].str.split(" - ", expand=True)
df["year_start"] = pd.to_datetime(df['year_start']).dt.strftime("%d-%m-%Y")
df["year_end"] = pd.to_datetime(df['year_end']).dt.strftime("%d-%m-%Y")
df.drop('year_pass', axis=1, inplace=True)
print(df)
Output:
year_start year_end
0 NaT 01-07-2018
1 01-08-2018 01-11-2018
2 01-07-2017 NaT
You could start by splitting the strings in the original dataframe:
# split the original dataframe
df = df.year_pass.str.split(' - ', expand=True)
0 1
id
1 Aug2018 Nov2018
2 Jul2017 None
And then apply pd.to_datetime to turn the strings to datetime objects and format them using strftime:
# rename the columns
df.columns = ['year_start','year_end']
df.apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%d-%m-%Y'), axis=0)
year_start year_end
id
1 01-08-2018 01-11-2018
2 01-07-2017 NaT
If you need datetimes in the output, the format has to be the default YYYY-MM-DD:
df1 = df.pop('year_pass').str.split(r'\s+-\s+', expand=True).apply(pd.to_datetime)
df[['year_start','year_end']] = df1
print (df)
id year_start year_end
0 1 2018-08-01 2018-11-01
1 2 2017-07-01 NaT
print (df.dtypes)
id int64
year_start datetime64[ns]
year_end datetime64[ns]
dtype: object
If you need a different format, you get strings, so datetime-like functions no longer work on these columns:
df1 = (df.pop('year_pass').str.split(r'\s+-\s+', expand=True)
.apply(lambda x: pd.to_datetime(x).dt.strftime('%d-%m-%Y'))
.replace('NaT',''))
df[['year_start','year_end']] = df1
print (df)
id year_start year_end
0 1 01-08-2018 01-11-2018
1 2 01-07-2017
print (df.dtypes)
id int64
year_start object
year_end object
dtype: object
print (type(df.loc[0, 'year_start']))
<class 'str'>
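The split-then-parse idea from the answers above can be sketched end to end as follows. The explicit format '%b %Y' and the names parts, dates and formatted are assumptions introduced for this example; the sample rows come from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2], "year_pass": ["Aug 2018 - Nov 2018", "Jul 2017"]}
)

# Split on a dash surrounded by whitespace; single dates get None on the right.
parts = df["year_pass"].str.split(r"\s+-\s+", expand=True)
parts.columns = ["year_start", "year_end"]

# Parse both columns with an explicit month-year format; None becomes NaT.
dates = parts.apply(pd.to_datetime, format="%b %Y")

# Format for display. Missing values come back as NaN (or the text 'NaT'
# on some older pandas versions), so both are replaced by an empty string.
formatted = (
    dates.apply(lambda col: col.dt.strftime("%d-%m-%Y"))
    .fillna("")
    .replace("NaT", "")
)
print(formatted)
```

Keeping dates as the datetime version and formatted as the display version avoids losing the .dt functionality once the values are turned into strings.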
My dataset has dates in the European format, and I'm struggling to get them parsed correctly by pd.to_datetime: for every day < 12, the month and day are switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add format, so that pandas does not have to guess which part is the day:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
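A short sketch of the explicit-format approach on a few day-first strings. The sample values and the names raw and parsed are made up for illustration; the second value is deliberately ambiguous (1 April versus 4 January).

```python
import pandas as pd

raw = pd.Series(["31/03/2018", "01/04/2018", "28/02/2018"])

# With an explicit format there is no guessing: the first field is always
# the day, the second the month, the third a four-digit year.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
print(parsed)
```

A value that does not match the format raises an error instead of being silently reinterpreted, which makes swapped days and months easier to spot.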
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30