My pandas dataframe has year, month and day in the first 3 columns. To convert them into a datetime type, I use a for loop that goes over each row, passing the contents of the first 3 columns as inputs to the datetime function. Is there any way I can avoid the for loop here and get the dates as datetimes?
I'm not sure there's a vectorized hook, but you can use apply, anyhow:
>>> df = pd.DataFrame({"year": [1992, 2003, 2014], "month": [2,3,4], "day": [10,20,30]})
>>> df
day month year
0 10 2 1992
1 20 3 2003
2 30 4 2014
>>> from datetime import datetime  # pd.datetime was removed in newer pandas versions
>>> df["Date"] = df.apply(lambda x: datetime(x['year'], x['month'], x['day']), axis=1)
>>> df
day month year Date
0 10 2 1992 1992-02-10 00:00:00
1 20 3 2003 2003-03-20 00:00:00
2 30 4 2014 2014-04-30 00:00:00
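For what it's worth, pd.to_datetime can also assemble datetimes straight from a DataFrame whose columns are named year, month and day, which avoids the row-wise apply. A minimal sketch using the same sample frame as above:

```python
import pandas as pd

df = pd.DataFrame({"year": [1992, 2003, 2014],
                   "month": [2, 3, 4],
                   "day": [10, 20, 30]})

# to_datetime builds one timestamp per row from the year/month/day columns
df["Date"] = pd.to_datetime(df[["year", "month", "day"]])
print(df)
```

The column names matter here: to_datetime looks for year, month and day, so rename the columns first if yours are called something else.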
a=pd.date_range(start='1/1/2021', end='01/01/2022', freq='h')
I have these hourly dates for the year, but when I try to save them to Excel I get just one ridiculously big number, and what I need is 4 columns (year; month; day; hour).
Thanks in advance!
One solution could be:
Use pd.DataFrame to construct a df with respective columns.
Use df.to_csv to write the df to a csv-file (also possible: directly to excel with df.to_excel).
import pandas as pd
a = pd.date_range(start='1/1/2021', end='01/01/2022', freq='h')
df = pd.DataFrame({'year': a.year,
'month': a.month,
'day': a.day,
'hour': a.hour})
print(df.head())
year month day hour
0 2021 1 1 0
1 2021 1 1 1
2 2021 1 1 2
3 2021 1 1 3
4 2021 1 1 4
df.to_csv('fname.csv')
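As a sanity check on the split, the four columns can be put back together with pd.to_datetime and compared with the original range. This is only a sketch; the name rebuilt is made up for the illustration:

```python
import pandas as pd

a = pd.date_range(start="2021-01-01", end="2022-01-01", freq="h")
df = pd.DataFrame({"year": a.year,
                   "month": a.month,
                   "day": a.day,
                   "hour": a.hour})

# Reassemble timestamps from the component columns (hour is optional for to_datetime)
rebuilt = pd.to_datetime(df[["year", "month", "day", "hour"]])

# True if every reassembled timestamp matches the original hourly range
print((rebuilt.to_numpy() == a.to_numpy()).all())
```

If the row index is not wanted in the file, passing index=False to to_csv leaves it out.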
I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
df = pd.read_csv("covid.csv", on_bad_lines="skip", sep=";")  # error_bad_lines=False on older pandas
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to set up df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
With this, the date 2021-01-01 gets year 2020, and the year switches to 2021 from 2021-01-04 onward.
This mirrors how you used dt.isocalendar().week to set up df["Week"]. Since both values come from the same (year, week, day) tuple returned by dt.isocalendar(), they always stay in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
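To show the effect on the weekly totals, here is a small sketch with made-up numbers (the Deceased values below are invented for the illustration and are not taken from the RIVM file):

```python
import pandas as pd

df = pd.DataFrame({
    "Date_of_publication": pd.to_datetime(
        ["2020-12-31", "2021-01-01", "2021-01-03", "2021-01-04"]),
    "Deceased": [2, 1, 3, 4],  # invented sample values
})

# Take both year and week from the same ISO calendar so they cannot disagree
iso = df["Date_of_publication"].dt.isocalendar()
df["Year"] = iso["year"].astype(int)
df["Week"] = iso["week"].astype(int)

weekly = df.groupby(["Year", "Week"], as_index=False)["Deceased"].sum()
print(weekly)
```

The first three rows all land in week 53 of 2020, and only 2021-01-04 counts toward week 1 of 2021.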
You can simply subtract the two dates and then divide the days attribute of the timedelta object by 7.
For example, this gives the time elapsed since the start of 2021 (assuming import datetime as dt):
time_delta = (dt.datetime.today() - dt.datetime(2021, 1, 1))
The output is a datetime.timedelta object:
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you'd do something like this:
time_delta = (df["Date_of_publication"] - pd.to_datetime(df["Year"].astype(str) + "-01-01")).dt.days // 7
The output is the number of whole weeks between 1 January of each row's Year and its Date_of_publication.
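A small sketch of the subtract-and-divide idea, using a fixed date instead of today() so the result is reproducible. Note that this counts whole weeks since 1 January, which is not the same thing as the ISO week number; the sample dates are made up:

```python
import datetime as dt
import pandas as pd

start = dt.datetime(2021, 1, 1)

# Plain datetime version: 75 days have elapsed, i.e. 10 whole weeks
delta = dt.datetime(2021, 3, 17) - start
weeks = delta.days // 7

# Vectorized version on a pandas Series of dates
dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-08", "2021-03-17"]))
weeks_elapsed = (dates - pd.Timestamp(start)).dt.days // 7
print(weeks, weeks_elapsed.tolist())
```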
So I selected 3 columns from my dataframe in order to create a time series that I could then plot:
booking_date = pd.DataFrame({'day': hotel_bookings_cleaned["arrival_date_day_of_month"],
'month': hotel_bookings_cleaned["arrival_date_month"],
'year': hotel_bookings_cleaned["arrival_date_year"]})
and the output looks like:
day month year
0 1 July 2015
1 1 July 2015
2 1 July 2015
3 1 July 2015
4 1 July 2015
I tried using
dates = pd.to_datetime(booking_date)
but got the error message
ValueError: Unable to parse string "July" at position 0
I'm assuming I need to convert the Month column to a numeric value before I can convert it to a datetime, but I haven't been able to make any parsers work.
Try this
dates = pd.to_datetime(booking_date.astype(str).agg('-'.join, axis=1), format='%d-%B-%Y')
Out[13]:
0 2015-07-01
1 2015-07-01
2 2015-07-01
3 2015-07-01
4 2015-07-01
dtype: datetime64[ns]
Not sure if this is more performant than the previous answer, but you can convert your string column to integers with a dictionary mapping, so that it fits the format pandas expects in to_datetime():
month_map = {
'January':1,
'February':2,
'March':3,
'April':4,
'May':5,
'June':6,
'July':7,
'August':8,
'September':9,
'October':10,
'November':11,
'December':12
}
dates = pd.DataFrame({
'day':booking_date.day,
'month':booking_date.month.apply(lambda x: month_map[x]),
'year':booking_date.year
})
ts = pd.to_datetime(dates)
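A compact variant of the same mapping idea, as a sketch: Series.map with the dictionary does the lookup without a lambda. The sample rows and the names month_names and numeric are made up for the illustration:

```python
import pandas as pd

booking_date = pd.DataFrame({"day": [1, 15],
                             "month": ["July", "August"],
                             "year": [2015, 2016]})

month_names = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
month_map = {name: number for number, name in enumerate(month_names, start=1)}

# Replace the month names with their numbers, then let to_datetime assemble the dates
numeric = booking_date.assign(month=booking_date["month"].map(month_map))
ts = pd.to_datetime(numeric[["year", "month", "day"]])
print(ts)
```

One thing to watch: map leaves NaN for any name missing from the dictionary (a misspelled or abbreviated month, say), so it is worth checking for those before converting.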
I am still quite new to Python, so please excuse my basic question.
After a reset of pandas grouped dataframe, I get the following:
year month pl
0 2010 1 27.4376
1 2010 2 29.2314
2 2010 3 33.5714
3 2010 4 37.2986
4 2010 5 36.6971
5 2010 6 35.9329
I would like to merge year and month to one column in pandas datetime format.
I am trying:
C3['date']=pandas.to_datetime(C3.year + C3.month, format='%Y-%m')
But it gives me a date like this:
year month pl date
0 2010 1 27.4376 1970-01-01 00:00:00.000002011
What is the correct way? Thank you.
You need to convert to str if necessary, then zfill the month col and pass this with a valid format to to_datetime:
In [303]:
df['date'] = pd.to_datetime(df['year'].astype(str) + df['month'].astype(str).str.zfill(2), format='%Y%m')
df
Out[303]:
year month pl date
0 2010 1 27.4376 2010-01-01
1 2010 2 29.2314 2010-02-01
2 2010 3 33.5714 2010-03-01
3 2010 4 37.2986 2010-04-01
4 2010 5 36.6971 2010-05-01
5 2010 6 35.9329 2010-06-01
If the columns are already strings, so that no conversion is needed, the following should work:
df['date'] = pd.to_datetime(df['year'] + df['month'].str.zfill(2), format='%Y%m')
Your attempt failed as it treated the value as epoch time:
In [305]:
pd.to_datetime(20101, format='%Y-%m')
Out[305]:
Timestamp('1970-01-01 00:00:00.000020101')
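As an alternative sketch that skips the string formatting altogether, to_datetime can assemble the date from the numeric columns if a day column is supplied. The frame below copies a few rows in the shape of the question's data (with one two-digit month added), and the names from_strings and from_columns are made up:

```python
import pandas as pd

df = pd.DataFrame({"year": [2010, 2010, 2010],
                   "month": [1, 2, 11],
                   "pl": [27.4376, 29.2314, 33.5714]})

# String route from the answer above: zero-pad the month and parse with %Y%m
from_strings = pd.to_datetime(
    df["year"].astype(str) + df["month"].astype(str).str.zfill(2), format="%Y%m")

# Column route: add a constant day and let to_datetime assemble the parts
from_columns = pd.to_datetime(df[["year", "month"]].assign(day=1))

df["date"] = from_columns
print(df)
```

Both routes should give the first day of each month, so which one to use is mostly a matter of taste.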
Here is a question about pandas. I want to fetch two columns from a csv file and manipulate the data before finally saving it.
The csv file looks like :
year month
2007 1
2007 2
2007 3
2007 4
2008 1
2008 3
this is my current code:
records = pd.read_csv(path)
frame = pd.DataFrame(records)
combined = datetime(frame['year'].astype(int), frame['month'].astype(int), 1)
The error is :
TypeError: cannot convert the series to "<type 'int'>"
any thoughts?
datetime won't operate on a pandas Series (column of a dataframe). You can use to_datetime or you could use datetime within apply. Something like the following should work:
In [9]: df
Out[9]:
year month
0 2007 1
1 2007 2
2 2007 3
3 2007 4
4 2008 1
5 2008 3
In [10]: pd.to_datetime(df['year'].astype(str) + '-'
+ df['month'].astype(str)
+ '-1')
Out[10]:
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
dtype: datetime64[ns]
Or use apply:
In [11]: df.apply(lambda x: datetime(x['year'],x['month'],1),axis=1)
Out[11]:
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
dtype: datetime64[ns]
Another Edit: You can also do most of the date parsing with read_csv, but then you need to adjust the day after you read it in (note, my data is in a string named 'data'; newer pandas releases deprecate combining columns through parse_dates, so treat this as version-dependent):
In [12]: df = pd.read_csv(StringIO(data), header=0,
parse_dates={'date':['year','month']})
In [13]: df['date'] = df['date'].values.astype('datetime64[M]')
In [14]: df
Out[14]:
date
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
I had a similar issue. This answer assumes that you have the Year, Month and Day in columns of your DataFrame:
df['Date'] = df[['Year', 'Month', 'Day']].apply(lambda s : datetime.datetime(*s),axis = 1)
The first part selects the Year, Month and Day columns from the DataFrame; the second applies the datetime constructor to each row (this needs import datetime).
If you do not have the day in your data, as appears to be the case here, just do:
df['Day'] = 1
to add the day as well. There should be a way to do that inline, but this is a quick workaround. You can always drop the Day column afterward if you don't want it.
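One possible way to do it inline, as a sketch: build a temporary frame with a constant day and pass it to pd.to_datetime, so no Day column is left behind in df. The column names Year and Month follow this answer, the sample values are made up, and the rename to lowercase is just a cautious step to match the names to_datetime documents:

```python
import pandas as pd

df = pd.DataFrame({"Year": [2007, 2007, 2008], "Month": [1, 2, 3]})

# Temporary frame with lowercase names and a constant day of 1
parts = df[["Year", "Month"]].rename(columns=str.lower).assign(day=1)
df["Date"] = pd.to_datetime(parts)
print(df)
```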