I am trying to convert a column with type Integer to Year. Here is my situation:
Original Column: June 13, 1980 (United States)
I split and slice it into
Year Column: 1980
Here, I tried to use:
df['Year'] = pd.to_datetime(df['Year'])
It changed the column so that the year no longer matches the original column. For example:
Original Year
1980 1970
2000 1970
2016 1970
I am looking forward to your help. Thank you in advance.
Best Regards,
Tu Le
df['Year'] = df['Original'].astype(str).astype('datetime64[ns]')  # pandas 2.x rejects the unit-less 'datetime64'
print(df)
Prints:
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
If you need datetimes built from a year, which means month=1 and day=1 are filled in, pass the format parameter, here %Y for YYYY:
df['Year'] = pd.to_datetime(df['Year'], format='%Y')
print (df)
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
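To make the difference concrete, here is a minimal sketch comparing the two calls. The three-row frame is made up for illustration, and the explicit astype(str) is a defensive assumption rather than something the answer requires. Without a format, integers are interpreted as nanoseconds since the epoch, which is why every row in the question collapsed to 1970.

```python
import pandas as pd

# Hypothetical sample data mirroring the question.
df = pd.DataFrame({"Original": [1980, 2000, 2016]})

# No format: each integer is read as nanoseconds since 1970-01-01.
as_epoch = pd.to_datetime(df["Original"])

# format="%Y": each value is parsed as a four-digit year.
as_year = pd.to_datetime(df["Original"].astype(str), format="%Y")

print(as_epoch.dt.year.tolist())
print(as_year.dt.year.tolist())
```

If only the year number is needed afterwards, as_year.dt.year gives it back as an integer column.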
Related
My pandas series contains year values. They're not formatted consistently. For example,
df['year']
1994-1996
circa 1990
1995-1998
circa 2010
I'd like to grab the year from the string.
df['Year'] = df['Year'].astype(str)
df['Year'] = df['Year'].str[:4]
This doesn't work for rows with circa.
I'd like to handle the rows with circa and grab only the year if it exists.
df['Year']
1994
1990
1995
2010
df['Year_Only'] = df['Year'].str.extract(r'(\d{4})', expand=False)
You can use str.extract and then convert the result to pd.Int16Dtype:
df['Year'] = df['Year'].str.extract(r'(\d{4})', expand=False).astype(pd.Int16Dtype())
print(df)
# Output
Year
0 1994
1 1990
2 1995
3 2010
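As a rough sketch of how this behaves when a row contains no year at all, the example below adds a hypothetical "unknown" entry to the sample values. Going through pd.to_numeric first is an assumption made to keep the conversion explicit; it is not part of the answer above.

```python
import pandas as pd

# Sample values from the question plus one made-up row without a year.
s = pd.Series(["1994-1996", "circa 1990", "1995-1998", "circa 2010", "unknown"])

# Take the first run of four digits; rows with no match become NaN.
first_year = s.str.extract(r"(\d{4})", expand=False)

# to_numeric yields floats with NaN; the nullable Int16 dtype then
# stores whole numbers and keeps the missing row as <NA>.
years = pd.to_numeric(first_year).astype("Int16")
print(years)
```

The nullable integer dtype is what allows the missing row to coexist with integer years instead of forcing the whole column to float.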
Python and Pandas beginner here.
I want to round off a pandas dataframe column to years. Dates before the 1st of July must be rounded off to the current year and dates after and on the 1st of July must be rounded up to the next year.
For example:
2011-04-05 must be rounded to 2011
2011-08-09 must be rounded to 2012
2011-06-30 must be rounded to 2011
2011-07-01 must be rounded to 2012
What I've tried:
pd.series.dt.round(freq='Y')
Gives the error: ValueError: <YearEnd: month=12> is a non-fixed frequency
The dataframe column has a wide variety of dates, starting from 1945 all the way up to 2021. Therefore a simple if df.date < 2011-07-01: df['Date']+ pd.offsets.YearBegin(-1) is not working.
I also tried the dt.to_period('Y') function, but then I can't give the before and after the 1st of July argument.
Any tips on how I can solve this issue?
Suppose you have this dataframe:
dates
0 2011-04-05
1 2011-08-09
2 2011-06-30
3 2011-07-01
4 1945-06-30
5 1945-07-01
Then:
# convert to datetime:
df["dates"] = pd.to_datetime(df["dates"])
df["year"] = np.where(
    (df["dates"].dt.month < 7), df["dates"].dt.year, df["dates"].dt.year + 1
)
print(df)
Prints:
dates year
0 2011-04-05 2011
1 2011-08-09 2012
2 2011-06-30 2011
3 2011-07-01 2012
4 1945-06-30 1945
5 1945-07-01 1946
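An alternative sketch, not taken from the answer above: shifting every date forward by six months moves 1 July onto 1 January of the following year, so the calendar year of the shifted date is the rounded year. The column name dates follows the example; the rest is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"dates": ["2011-04-05", "2011-08-09", "2011-06-30", "2011-07-01"]})
df["dates"] = pd.to_datetime(df["dates"])

# Shift by six months so that 1 July lands on 1 January of the next year,
# then read off the calendar year of the shifted date.
df["year"] = (df["dates"] + pd.DateOffset(months=6)).dt.year
print(df)
```

This avoids the explicit month comparison, at the cost of being less obvious to a reader than the np.where version.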
A somewhat roundabout way is to convert the date values to strings, split them, and then classify them in a loop, like so:
rounded_years = []  # one rounded year per row
for thisdate in df["Date"]:  # assuming the column's name is "Date"
    datesplit = str(thisdate).split("-")  # convert to string and split
    Yr = int(datesplit[0])  # get the year and convert it back to a number
    Mth = int(datesplit[1])  # get the month and convert it back to a number
    if Mth < 7:  # any date before July 1st
        rnd_Yr = Yr
    else:  # any date on or after July 1st
        rnd_Yr = Yr + 1
    rounded_years.append(rnd_Yr)
df["year"] = rounded_years  # store the results in a new column
So I selected 3 columns from my dataframe in order to create a time series that I could then plot:
booking_date = pd.DataFrame({'day': hotel_bookings_cleaned["arrival_date_day_of_month"],
                             'month': hotel_bookings_cleaned["arrival_date_month"],
                             'year': hotel_bookings_cleaned["arrival_date_year"]})
and the output looks like:
day month year
0 1 July 2015
1 1 July 2015
2 1 July 2015
3 1 July 2015
4 1 July 2015
I tried using
dates = pd.to_datetime(booking_date)
but got the error message
ValueError: Unable to parse string "July" at position 0
I'm assuming I need to convert the Month column to a numeric value before I can convert it to a datetime, but I haven't been able to make any parsers work.
Try this
dates = pd.to_datetime(booking_date.astype(str).agg('-'.join, axis=1), format='%d-%B-%Y')
Out[13]:
0 2015-07-01
1 2015-07-01
2 2015-07-01
3 2015-07-01
4 2015-07-01
dtype: datetime64[ns]
Not sure if this is more performant than the previous answer, but you can convert your string column to integers with a dictionary mapping to fit the format that pandas expects in to_datetime()
month_map = {
    'January':1,
    'February':2,
    'March':3,
    'April':4,
    'May':5,
    'June':6,
    'July':7,
    'August':8,
    'September':9,
    'October':10,
    'November':11,
    'December':12
}
dates = pd.DataFrame({
    'day':booking_date.day,
    'month':booking_date.month.apply(lambda x: month_map[x]),
    'year':booking_date.year
})
ts = pd.to_datetime(dates)
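The month mapping can also be built from the standard library instead of being typed out. This is only a sketch: the two-row booking_date frame is invented for illustration, and calendar.month_name follows the current locale, so it assumes English month names.

```python
import calendar
import pandas as pd

# Hypothetical sample with the same column names as the question.
booking_date = pd.DataFrame(
    {"day": [1, 15], "month": ["July", "August"], "year": [2015, 2016]}
)

# calendar.month_name[0] is an empty string, so it is skipped.
month_map = {name: num for num, name in enumerate(calendar.month_name) if name}

# Replace the month names with numbers, then let pandas assemble the dates
# from the year, month and day columns.
dates = booking_date.assign(month=booking_date["month"].map(month_map))
ts = pd.to_datetime(dates)
print(ts)
```

Using map rather than apply means an unknown month name turns into NaN instead of raising a KeyError, which may or may not be what you want.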
I am working with some historical data on fiscal transfers in Canada. The downloaded data is in the format of fiscal year i.e.
Year Quebec Alberta
1980-1981 2000 4000
1981-1982 3000 6000
I am using the pandas library. However, when I try to make any visualizations using either Matplotlib or seaborn, I get an error: either 'Year' is not recognized as a numerical value, or ('DataFrame' object has no attribute 'Year'). However, when I change the values in the csv to a single year i.e.
Year Quebec Alberta
1980 2000 4000
1981 3000 6000
it works perfectly fine. Is there a way for Python to treat fiscal year values like 1980-1981 the same as a normal year? Any advice would be much appreciated.
You can use two-year periods, but when you print the DataFrame column you cannot see the end year:
print (df)
Year Quebec Alberta
0 1980 2000 4000
1 1981 3000 6000
df['Year'] = df['Year'].apply(lambda x: pd.Period(x, freq='2A-DEC'))
print (df['Year'])
0 1980
1 1981
Name: Year, dtype: period[2A-DEC]
print (df['Year'].dt.to_timestamp('A', how='s'))
0 1980-12-31
1 1981-12-31
Name: Year, dtype: datetime64[ns]
print (df['Year'].dt.to_timestamp('A', how='e'))
0 1981-12-31 23:59:59.999999999
1 1982-12-31 23:59:59.999999999
Name: Year, dtype: datetime64[ns]
But I think the easiest approach is to create two columns for the start and end year:
print (df)
Year Quebec Alberta
0 1980-1981 2000 4000
1 1981-1982 3000 6000
df[['StartYear','EndYear']] = df['Year'].str.split('-', expand=True).astype(int)
print (df)
Year Quebec Alberta StartYear EndYear
0 1980-1981 2000 4000 1980 1981
1 1981-1982 3000 6000 1981 1982
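A variation on the split approach, sketched under the assumption that every label has the form YYYY-YYYY: a regular expression with named groups produces the two columns directly, and a label that does not match becomes missing rather than raising. The sample frame repeats the question's data.

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": ["1980-1981", "1981-1982"], "Quebec": [2000, 3000], "Alberta": [4000, 6000]}
)

# Named groups become the column names of the extracted frame.
years = df["Year"].str.extract(r"^(?P<StartYear>\d{4})-(?P<EndYear>\d{4})$")

# Convert the extracted strings to nullable integers so that a
# non-matching label would show up as <NA>.
df[["StartYear", "EndYear"]] = years.apply(pd.to_numeric).astype("Int64")
print(df)
```

Either StartYear or EndYear can then serve as the numeric x-axis when plotting.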
I am still quite new to Python, so please excuse my basic question.
After a reset of pandas grouped dataframe, I get the following:
year month pl
0 2010 1 27.4376
1 2010 2 29.2314
2 2010 3 33.5714
3 2010 4 37.2986
4 2010 5 36.6971
5 2010 6 35.9329
I would like to merge year and month to one column in pandas datetime format.
I am trying:
C3['date']=pandas.to_datetime(C3.year + C3.month, format='%Y-%m')
But it gives me a date like this:
year month pl date
0 2010 1 27.4376 1970-01-01 00:00:00.000002011
What is the correct way? Thank you.
You need to convert to str if necessary, then zfill the month col and pass this with a valid format to to_datetime:
In [303]:
df['date'] = pd.to_datetime(df['year'].astype(str) + df['month'].astype(str).str.zfill(2), format='%Y%m')
df
Out[303]:
year month pl date
0 2010 1 27.4376 2010-01-01
1 2010 2 29.2314 2010-02-01
2 2010 3 33.5714 2010-03-01
3 2010 4 37.2986 2010-04-01
4 2010 5 36.6971 2010-05-01
5 2010 6 35.9329 2010-06-01
If the columns are already strings, the conversion is unnecessary and the following should work:
df['date'] = pd.to_datetime(df['year'] + df['month'].str.zfill(2), format='%Y%m')
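Your attempt failed because year + month adds the two integers (2010 + 1 = 2011), and to_datetime then treated the result as epoch time, as it did for any bare integer: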
In [305]:
pd.to_datetime(20101, format='%Y-%m')
Out[305]:
Timestamp('1970-01-01 00:00:00.000020101')
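Another option, shown here as a sketch rather than as part of the answer above: to_datetime can assemble dates from a frame whose columns are named year, month and day, so no string concatenation is needed. The day is fixed to 1 as an assumption, since the data has no day column.

```python
import pandas as pd

# First rows of the question's data.
df = pd.DataFrame(
    {"year": [2010, 2010, 2010], "month": [1, 2, 3], "pl": [27.4376, 29.2314, 33.5714]}
)

# Build a year/month/day frame on the fly and let pandas combine it.
df["date"] = pd.to_datetime(df[["year", "month"]].assign(day=1))
print(df)
```

This keeps the year and month as integers throughout, which sidesteps the zero-padding step.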