Unable to extract date/year/quarter from Pandas - python

As per the discussion, I am extracting the date, year, month and quarter in Pandas as below:
df = pd.DataFrame({'date_text': ['Jan 2020', 'May 2020', 'Jun 2020']})
df ['date'] = pd.to_datetime ( df.date_text ).dt.date
df ['year'], df ['month'],df['qtr'] = df ['date'].dt.year, df ['date'].dt.month, df ['date'].dt.quarter
However, this raises an error:
AttributeError: Can only use .dt accessor with datetimelike values
Where did I go wrong?

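Fix it by removing the first .dt.date. It turns the column into plain Python date objects (object dtype), which the .dt accessor does not support: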
df ['date'] = pd.to_datetime ( df.date_text )
df ['year'], df ['month'], df['qtr'] = df ['date'].dt.year, df ['date'].dt.month, df ['date'].dt.quarter
df
Out[43]:
date_text date year month qtr
0 Jan 2020 2020-01-01 2020 1 1
1 May 2020 2020-05-01 2020 5 2
2 Jun 2020 2020-06-01 2020 6 2
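To see why the original code fails, it can help to compare the two columns directly. The following is a minimal sketch, assuming the same sample data as above; the variable names parsed, as_dates and quarters are only illustrative.

```python
import datetime
import pandas as pd

df = pd.DataFrame({'date_text': ['Jan 2020', 'May 2020', 'Jun 2020']})

# Parsing yields a datetime64 column, which supports the .dt accessor.
parsed = pd.to_datetime(df['date_text'], format='%b %Y')

# .dt.date converts every value to a plain datetime.date object, so the
# result is an object-dtype Series and .dt can no longer be used on it.
as_dates = parsed.dt.date

print(parsed.dtype)    # a datetime64 dtype
print(as_dates.dtype)  # object

# The accessor works on the datetime64 column.
quarters = parsed.dt.quarter.tolist()
```

If plain date objects are needed for display, one option is to keep the datetime64 column for calculations and create the .dt.date column separately at the end.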

Related

Python merge multiple date columns with null into 1 column Pandas

I am trying to convert dates in multiple formats into YYYY-MM-DD and then merge them into one column, ignoring the nulls, but I end up with TypeError: cannot add DatetimeArray and DatetimeArray
import datetime
import pandas as pd
data = [[ 'Apr 2021'], ['Jan 1'], ['Fri'], [ 'Jan 18']]
df = pd.DataFrame(data, columns = ['date', ])
#convert Month date Jan 1
df['date1']=(pd.to_datetime('2021 '+ df['date'],errors='coerce',format='%Y %b %d'))
# convert Month Year Apr 2021
df['date2']=pd.to_datetime(df['date'], errors='coerce')
#convert fri to this friday
today = datetime.date.today()
friday = today + datetime.timedelta( (4-today.weekday()) % 7 )
this_firday = friday.strftime('%Y-%m-%d')
df['date3']=df['date'].map({'Fri':this_firday})
df['date3'] = pd.to_datetime(df['date3'])
df['dateFinal'] = df['date1'] + df['date2'] + df['date3']
I checked the dtypes and they are all datetime, so I don't understand the error. My approach is probably not efficient, so feel free to suggest a better way.
IIUC:
try via bfill() on axis=1:
df['dateFinal'] = df[['date1','date2','date3']].bfill(axis=1).iloc[:,0]
OR
via ffill() on axis=1:
df['dateFinal'] = df[['date1','date2','date3']].ffill(axis=1).iloc[:,-1]
OR
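via stack()+to_numpy() (this relies on stack() dropping the NaT values and on every row having exactly one non-null date):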
df['dateFinal'] = df[['date1','date2','date3']].stack().to_numpy()
output of df:
date date1 date2 date3 dateFinal
0 Apr 2021 NaT 2021-04-01 NaT 2021-04-01
1 Jan 1 2021-01-01 NaT NaT 2021-01-01
2 Fri NaT NaT 2021-08-13 2021-08-13
3 Jan 18 2021-01-18 NaT NaT 2021-01-18
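Adding the columns fails because datetimes cannot be summed; what is wanted is the first non-null value in each row. Besides bfill/ffill, another way to express that is Series.combine_first. This is a sketch under the assumption that the three columns have already been parsed as in the question; the dates are hard-coded here so the example is self-contained.

```python
import pandas as pd

# Stand-in for the three parsed columns from the question (NaT where parsing failed).
df = pd.DataFrame({
    'date1': pd.to_datetime([None, '2021-01-01', None, '2021-01-18']),
    'date2': pd.to_datetime(['2021-04-01', None, None, None]),
    'date3': pd.to_datetime([None, None, '2021-08-13', None]),
})

# Take date1 where present, otherwise date2, otherwise date3.
df['dateFinal'] = (
    df['date1']
    .combine_first(df['date2'])
    .combine_first(df['date3'])
)
print(df)
```

Unlike the stack() variant, this still returns one value per row when a row has more than one non-null date (the leftmost column wins) or none at all (the result is NaT).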

Python: how to groupby a pandas dataframe to count by hour and day?

I have a dataframe like the following:
df.head(4)
timestamp user_id category
0 2017-09-23 15:00:00+00:00 A Bar
1 2017-09-14 18:00:00+00:00 B Restaurant
2 2017-09-30 00:00:00+00:00 B Museum
3 2017-09-11 17:00:00+00:00 C Museum
I would like to count, for each hour of each day, the number of visitors for each category and get a dataframe like the following:
df
year month day hour category count
0 2017 9 11 0 Bar 2
1 2017 9 11 1 Bar 1
2 2017 9 11 2 Bar 0
3 2017 9 11 3 Bar 1
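Assuming you want to group by date and hour, you can use the following code if the timestamp column is a datetime column:
# Use df['col'] = ... to create the columns; attribute assignment (df.year = ...) does not add a column.
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['hour'] = df['timestamp'].dt.hour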
grouped_data = df.groupby(['year','month','day','hour','category']).count()
For getting the count of user_id per hour per category you can use groupby with your datetime:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df_new = df.groupby([df.timestamp.dt.year,
                     df.timestamp.dt.month,
                     df.timestamp.dt.day,
                     df.timestamp.dt.hour,
                     'category']).count()['user_id']
df_new.index.names = ['year', 'month', 'day', 'hour', 'category']
df_new = df_new.reset_index()
When you have a datetime column in a dataframe, you can use the dt accessor, which lets you access the different parts of the datetime, e.g. the year.
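For reference, here is a self-contained sketch of the same idea on a small made-up sample (the rows below are illustrative, not the asker's full data). It uses size() instead of count() and names the grouping keys up front with rename(), which avoids setting the index names afterwards.

```python
import pandas as pd

# Hypothetical sample in the shape shown in the question.
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2017-09-23 15:00:00+00:00',
        '2017-09-23 15:30:00+00:00',
        '2017-09-14 18:00:00+00:00',
        '2017-09-30 00:00:00+00:00',
    ]),
    'user_id': ['A', 'C', 'B', 'B'],
    'category': ['Bar', 'Bar', 'Restaurant', 'Museum'],
})

ts = df['timestamp']
counts = (
    df.groupby([ts.dt.year.rename('year'),
                ts.dt.month.rename('month'),
                ts.dt.day.rename('day'),
                ts.dt.hour.rename('hour'),
                'category'])
      .size()                      # number of rows in each group
      .reset_index(name='count')   # back to flat columns
)
print(counts)
```

Note that this only returns the hour/category combinations that actually occur. The rows with a count of 0 shown in the desired output would need an extra step, such as reindexing against the full set of hours.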

Editing the date in pandas to show year only, by column

I am trying to understand how I can edit the dataframe in python using pandas so I can drop everything but the year.
Example: if the date is 2014-01-01, I want it to show 2014 and drop both the month and the day. All the dates are in a single column.
Thanks in advance!
You can convert the numpy.datetime64 date value to datetime using pd.to_datetime() and then you can extract year or month or day from it.
import numpy as np
import pandas as pd
date = np.datetime64('2014-01-01')
type(date)
Output:
numpy.datetime64
Convert this date to pandas datetime using pd.to_datetime.
date = pd.to_datetime(date)
type(date)
Output:
pandas._libs.tslibs.timestamps.Timestamp
Then you can extract the year using .year
date.year
Output:
2014
So, if you have a df:
df = pd.DataFrame({'date': [np.datetime64('2014-01-01'), np.datetime64('2015-01-01'), np.datetime64('2016-01-01')]})
df['date'] = pd.DatetimeIndex(df['date']).year
df
Output:
date
0 2014
1 2015
2 2016
Alternatively, you can also do this (note that strftime returns the year as a string rather than an integer):
df = pd.DataFrame({'date': [np.datetime64('2014-01-01'), np.datetime64('2015-01-01'), np.datetime64('2016-01-01')]})
df['date'] = df['date'].apply(lambda x: x.strftime('%Y'))
df
Output:
date
0 2014
1 2015
2 2016
EDIT 1
Group by using year when the column has date values
df = pd.DataFrame({'date': [np.datetime64('2014-01-01'), np.datetime64('2015-01-01'), np.datetime64('2016-01-01')]})
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df.groupby(df.index.year).size()
Output:
date
2014 1
2015 1
2016 1
You can still do the same even if you have removed the month and day from the date and only have year in your column
df = pd.DataFrame({'date': [np.datetime64('2014-01-01'), np.datetime64('2015-01-01'), np.datetime64('2016-01-01')]})
df['date'] = pd.DatetimeIndex(df['date']).year
df.groupby('date').size()
Output:
date
2014 1
2015 1
2016 1
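When the dates start out as strings in a column, the .dt accessor is a common alternative to pd.DatetimeIndex. A minimal sketch, assuming a column named date holding ISO-formatted strings (the sample values and the extra year column are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2014-01-01', '2015-06-15', '2016-12-31']})

# Parse the strings once, then read the year through the .dt accessor.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year

# The integer year column can be used directly as a grouping key.
per_year = df.groupby('year').size()
print(df)
print(per_year)
```

Keeping the original date column and adding a separate year column means the month and day are still available later if they turn out to be needed.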

Python Pandas: split and change the date format(one with eg:(aug 2018 - nov 2018)) and other with only one?

Split Date e.g. Aug 2018 --> 01-08-2018 ??
Here's my sample input
id year_pass
1 Aug 2018 - Nov 2018
2 Jul 2017
Here's my sample input 2
id year_pass
1 Jul 2018
2 Aug 2017 - Nov 2018
What I did:
I'm able to split the date when it is a range, e.g. (Aug 2018 - Nov 2018):
# splitting the date column on the '-'
year_start, year_end = df['year_pass'].str.split('-')
df.drop('year_pass', axis=1, inplace=True)
# assigning the split values to columns
df['year_start'] = year_start
df['year_end'] = year_end
# converting to datetime objects
df['year_start'] = pd.to_datetime(df['year_start'])
df['year_end'] = pd.to_datetime(df['year_end'])
But I couldn't figure out how to handle both cases (a range and a single date).
Output should be:
id year_start year_end
1 01-08-2018 01-11-2018
2 01-07-2017
This is one approach using dt.strftime("%d-%m-%Y").
Ex:
import pandas as pd
df = pd.DataFrame({"year_pass": ["Aug 2018 - Nov 2018", "Jul 2017"]})
df[["year_start", 'year_end']] = df["year_pass"].str.split(" - ", expand=True)
df["year_start"] = pd.to_datetime(df['year_start']).dt.strftime("%d-%m-%Y")
df["year_end"] = pd.to_datetime(df['year_end']).dt.strftime("%d-%m-%Y")
df.drop('year_pass', axis=1, inplace=True)
print(df)
Output:
year_start year_end
0 01-08-2018 01-11-2018
1 01-07-2017 NaT
Edit as per comment:
import pandas as pd
def replaceInitialSpace(val):
    if val.startswith(" "):
        return " - "+val.strip()
    return val
df = pd.DataFrame({"year_pass": [" Jul 2018", "Aug 2018 - Nov 2018", "Jul 2017 "]})
df["year_pass"] = df["year_pass"].apply(replaceInitialSpace)
df[["year_start", 'year_end']] = df["year_pass"].str.split(" - ", expand=True)
df["year_start"] = pd.to_datetime(df['year_start']).dt.strftime("%d-%m-%Y")
df["year_end"] = pd.to_datetime(df['year_end']).dt.strftime("%d-%m-%Y")
df.drop('year_pass', axis=1, inplace=True)
print(df)
Output:
year_start year_end
0 NaT 01-07-2018
1 01-08-2018 01-11-2018
2 01-07-2017 NaT
You could start by splitting the strings in the original dataframe:
# split the original dataframe
df = df.year_pass.str.split(' - ', expand=True)
0 1
id
1 Aug2018 Nov2018
2 Jul2017 None
And then apply pd.to_datetime to turn the strings to datetime objects and format them using strftime:
# rename the columns
df.columns = ['year_start','year_end']
df.apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%d-%m-%Y'), axis=0)
year_start year_end
id
1 01-08-2018 01-11-2018
2 01-07-2017 NaT
If you need real datetimes in the output, the format is necessarily different (YYYY-MM-DD):
df1 = df.pop('year_pass').str.split(r'\s+-\s+', expand=True).apply(pd.to_datetime)
df[['year_start','year_end']] = df1
print (df)
id year_start year_end
0 1 2018-08-01 2018-11-01
1 2 2017-07-01 NaT
print (df.dtypes)
id int64
year_start datetime64[ns]
year_end datetime64[ns]
dtype: object
If you need to change the format, you get strings, so datetime-like functions no longer work on them:
df1 = (df.pop('year_pass').str.split(r'\s+-\s+', expand=True)
         .apply(lambda x: pd.to_datetime(x).dt.strftime('%d-%m-%Y'))
         .replace('NaT',''))
df[['year_start','year_end']] = df1
print (df)
id year_start year_end
0 1 01-08-2018 01-11-2018
1 2 01-07-2017
print (df.dtypes)
id int64
year_start object
year_end object
dtype: object
print (type(df.loc[0, 'year_start']))
<class 'str'>
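The approaches above depend on str.split producing two columns, which only happens when at least one row contains a range. As a hedged alternative, str.partition always returns three columns (text before the separator, the separator, text after it), so single dates and ranges can go through the same code path. The sketch below assumes the "Mon YYYY" format from the question; the names parts, start and end are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'year_pass': ['Aug 2018 - Nov 2018', 'Jul 2017']})

# Columns 0 and 2 hold the text before and after the first '-';
# for a single date, column 2 is an empty string.
parts = df['year_pass'].str.partition('-')

start = pd.to_datetime(parts[0].str.strip(), format='%b %Y', errors='coerce')
end = pd.to_datetime(parts[2].str.strip(), format='%b %Y', errors='coerce')

# Format as dd-mm-YYYY strings and show a missing end date as an empty string.
df['year_start'] = start.dt.strftime('%d-%m-%Y').where(start.notna(), '')
df['year_end'] = end.dt.strftime('%d-%m-%Y').where(end.notna(), '')
df = df.drop(columns='year_pass')
print(df)
```

If the dates are needed for further calculations, keeping start and end as datetime columns and formatting them only for display avoids the loss of datetime functionality mentioned above.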

pd.to_datetime is getting half my dates with flipped day / months

My dataset has dates in the European format (dd/mm/yyyy), and I'm struggling to convert them correctly with pd.to_datetime: for every date whose day is 12 or less, the month and day get switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force to_datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Pass an explicit format, so that nothing is left to inference:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
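As a quick check of the explicit-format approach, the sketch below parses a few day-first strings, including an ambiguous one (01/04/2018), and confirms that day and month end up where expected. The sample values are made up for illustration.

```python
import pandas as pd

raw = pd.Series(['31/03/2018', '01/04/2018', '28/02/2018'])

# With an explicit format, the first field is always read as the day.
parsed = pd.to_datetime(raw, format='%d/%m/%Y')

print(parsed.dt.day.tolist())    # days taken from the first field
print(parsed.dt.month.tolist())  # months taken from the second field

# Formatting back with the same pattern reproduces the input strings.
round_trip = parsed.dt.strftime('%d/%m/%Y').tolist()
```

With an explicit format, a value that does not match raises an error instead of being silently reinterpreted; passing errors='coerce' turns such values into NaT if that is preferred.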
