My pandas series contains year values. They're not formatted consistently. For example,
df['Year']
1994-1996
circa 1990
1995-1998
circa 2010
I'd like to grab the year from the string.
df['Year'] = df['Year'].astype(str)
df['Year'] = df['Year'].str[:4]
This doesn't work for rows with circa.
I'd like to handle the rows with circa and grab only the year if it exists.
df['Year']
1994
1990
1995
2010
df['Year_Only'] = df['Year'].str.extract(r'(\d{4})', expand=False)
You can use str.extract, then convert the result to the nullable pd.Int16Dtype:
df['Year'] = df['Year'].str.extract(r'(\d{4})', expand=False).astype(pd.Int16Dtype())
print(df)
# Output
Year
0 1994
1 1990
2 1995
3 2010
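The question also asks to grab the year only "if it exists". As a minimal sketch of that case, the snippet below reuses the question's sample values and adds one hypothetical row ("unknown") that contains no year. The conversion step is swapped for pd.to_numeric with errors="coerce", which is an assumption of this sketch and not part of the answer above; the column name Year_Only is taken from the question.

```python
import pandas as pd

# Sample values from the question, plus a hypothetical row with no year at all.
df = pd.DataFrame(
    {"Year": ["1994-1996", "circa 1990", "1995-1998", "circa 2010", "unknown"]}
)

# Grab the first run of four digits; rows without a match become NaN.
extracted = df["Year"].astype(str).str.extract(r"(\d{4})", expand=False)

# Convert to numbers, then to the nullable Int16 dtype so missing years show as <NA>.
df["Year_Only"] = pd.to_numeric(extracted, errors="coerce").astype("Int16")
print(df)
```

Rows such as "1994-1996" keep only the first year, because str.extract returns the first match.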
I have quarterly Disney_Plus revenue data from Q1 2020 to Q4 2021.
The desired output, Disney_Plus_Revenue, should contain the yearly totals for 2020 and 2021. In addition, it should contain the years 2010 to 2019 with None/NaN as their values.
I first renamed the column Year to Quarter, then inserted a new column Year with the values 2020 and 2021 and used groupby('Year').agg({'Disney_Plus_Revenue': ['sum']}).
But when I try to append the 2010 to 2019 yearly revenues to this, it throws an error:
Solution I tried
Disney_plus_Revenue = pd.read_csv("Disney_plus_Revenue.csv")
Disney_plus_Revenue.rename(columns = {'Year':'Quarter'},inplace = True)
Disney_plus_Revenue.insert(0,"Year",["2020","2020","2020","2020","2021","2021","2021","2021"],True)
Disney_plus_Revenue.rename(columns = {'Revenue':'Disney_Plus_Revenue'},inplace = True)
Disney_plus_Revenue = Disney_plus_Revenue.groupby('Year').agg({'Disney_Plus_Revenue': ['sum']})
DS_new = pd.DataFrame(np.array([["2010",None],["2011",None],["2012",None],["2013",None],["2014",None],["2015",None],["2016",None],["2017",None],["2018",None],["2019",None]]), columns=['Year','Disney_Plus_Revenue']).append(Disney_plus_Revenue, ignore_index=True)
Error -
First of all, passing a dict of lists to .agg gives you a DataFrame with MultiIndex columns.
Since you're using only one aggregation function, you could group this way instead:
agg = Disney_plus_Revenue.groupby("Year")["Revenues"].sum()
This will give you a Series:
Year
2020 2.802
2021 5.200
Name: Revenues, dtype: float64
Then you can create another Series with None values for each of the earlier years:
indexes = np.arange(2010, 2020)
values = [None for x in indexes]
new_series = pd.Series(data=values, index=indexes, name="Revenues")
And finally, concat them:
pd.concat([new_series, agg])
2010 None
2011 None
2012 None
2013 None
2014 None
2015 None
2016 None
2017 None
2018 None
2019 None
2020 2.802
2021 5.2
Name: Revenues, dtype: object
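A different technique from the concat above, shown only as a sketch: Series.reindex can add the missing years in one step and keeps the float dtype, filling the gaps with NaN. The snippet assumes the years are integers (in the question they were inserted as strings) and rebuilds the aggregated Series by hand from the output shown earlier.

```python
import pandas as pd

# Yearly totals as produced by the groupby above (values copied from the answer's output).
# Assumption: years are stored as integers, not strings.
agg = pd.Series(
    [2.802, 5.200],
    index=pd.Index([2020, 2021], name="Year"),
    name="Revenues",
)

# Reindex over every year of interest; years missing from `agg` are filled with NaN.
full = agg.reindex(range(2010, 2022))
print(full)
```

Because the missing values are NaN and not None, the result stays float64, so later arithmetic on the column keeps working.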
I am trying to convert a column with type Integer to Year. Here is my situation:
Original Column: June 13, 1980 (United States)
I split and slice it into
Year Column: 1980
Here, I tried to use:
df['Year'] = pd.to_datetime(df['Year'])
This changed the Year column so that its year differs from the Original column. For example,
Original Year
1980 1970
2000 1970
2016 1970
I am looking forward to your help. Thank you in advance.
Best Regards,
Tu Le
df['Year'] = df['Original'].astype(str).astype('datetime64[ns]')
print(df)
Prints:
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
If you need datetimes from a year, which means month=1 and day=1 are added as well, pass the format parameter, here %Y for YYYY:
df['Year'] = pd.to_datetime(df['Year'], format='%Y')
print (df)
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
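The 1970 values in the question come from pd.to_datetime treating bare integers as nanoseconds since 1970-01-01 when no format is given. A short sketch that puts both behaviours side by side, using the question's three years (the variable names as_offsets and as_years are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Original": [1980, 2000, 2016]})

# Without a format, each integer is read as a count of nanoseconds after 1970-01-01.
as_offsets = pd.to_datetime(df["Original"])

# With format="%Y", each integer is parsed as a four-digit year.
as_years = pd.to_datetime(df["Original"], format="%Y")

print(as_offsets.dt.year.tolist())
print(as_years.dt.year.tolist())
```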
Python and Pandas beginner here.
I want to round off a pandas dataframe column to years. Dates before the 1st of July must be rounded off to the current year and dates after and on the 1st of July must be rounded up to the next year.
For example:
2011-04-05 must be rounded to 2011
2011-08-09 must be rounded to 2012
2011-06-30 must be rounded to 2011
2011-07-01 must be rounded to 2012
What I've tried:
pd.series.dt.round(freq='Y')
Gives the error: ValueError: <YearEnd: month=12> is a non-fixed frequency
The dataframe column has a wide variety of dates, starting from 1945 all the way up to 2021. Therefore a simple if df.date < 2011-07-01: df['Date']+ pd.offsets.YearBegin(-1) is not working.
I also tried the dt.to_period('Y') function, but then I can't give the before and after the 1st of July argument.
Any tips on how I can solve this issue?
Suppose you have this dataframe:
dates
0 2011-04-05
1 2011-08-09
2 2011-06-30
3 2011-07-01
4 1945-06-30
5 1945-07-01
Then:
# convert to datetime:
df["dates"] = pd.to_datetime(df["dates"])
df["year"] = np.where(
(df["dates"].dt.month < 7), df["dates"].dt.year, df["dates"].dt.year + 1
)
print(df)
Prints:
dates year
0 2011-04-05 2011
1 2011-08-09 2012
2 2011-06-30 2011
3 2011-07-01 2012
4 1945-06-30 1945
5 1945-07-01 1946
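As an alternative sketch of the same rule, not taken from the answer above: shifting every date forward by six months moves 1 July and later into the next calendar year, while dates up to 30 June stay in the current one, so the year of the shifted date is the rounded year. The frame is rebuilt here from the sample dates so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dates": pd.to_datetime(
            ["2011-04-05", "2011-08-09", "2011-06-30",
             "2011-07-01", "1945-06-30", "1945-07-01"]
        )
    }
)

# Adding six calendar months pushes July-December dates into the following year.
df["year"] = (df["dates"] + pd.DateOffset(months=6)).dt.year
print(df)
```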
A bit of a roundabout way is to convert the date values to strings, split them, and then classify them in a loop, like so:
rounded = []  # collects the rounded year for every row
for thisdate in df["Date"]:  # assuming the column's name is "Date"
    datesplit = str(thisdate).split("-")  # convert to string and split on "-"
    Yr = int(datesplit[0])  # year, converted back to a number
    Mth = int(datesplit[1])  # month, converted back to a number
    if Mth < 7:  # any date before July 1st
        rnd_Yr = Yr
    else:  # any date on or after July 1st
        rnd_Yr = Yr + 1
    rounded.append(rnd_Yr)
df["year"] = rounded  # store the result as a new column
I have a dataframe with a Date column as an index as DateTime type, and a value attached to each entry.
The dates are split into yyyy-mm-dd, with each row being the next day.
Example:
Date: x:
2012-01-01 44
2012-01-02 75
2012-01-03 62
How would I split the Date column into Year and Month columns, using those two as indexes while also summing the values of all the days in a month?
Example of expected output:
Year: Month: x:
2012 1 745
2 402
3 453
...
2013 1 4353
Use Series.dt.year and Series.dt.month, aggregate with GroupBy.sum, and use rename to set the new column names:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby([df['Date'].dt.year.rename('Year'),
df['Date'].dt.month.rename('Month')])['x'].sum().reset_index()
print (df1)
Year Month x
0 2012 1 181
Use groupby and sum:
(df.groupby([df.Date.dt.year.rename('Year'), df.Date.dt.month.rename('Month')])['x']
.sum())
Year Month
2012 1 181
Name: x, dtype: int64
Note that if "Date" isn't a datetime dtype column, convert it first:
df.Date = pd.to_datetime(df.Date, errors='coerce')
To get a flat DataFrame instead of a MultiIndex Series, add reset_index:
(df.groupby([df.Date.dt.year.rename('Year'), df.Date.dt.month.rename('Month')])['x']
.sum()
.reset_index())
Year Month x
0 2012 1 181
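The question states that Date is the index, not a regular column. A sketch for that case, under the assumption that the index is a DatetimeIndex: group directly on df.index.year and df.index.month. The four sample rows are hypothetical and chosen to span two months.

```python
import pandas as pd

# Hypothetical daily data with a DatetimeIndex named "Date", as described in the question.
idx = pd.date_range("2012-01-30", periods=4, freq="D", name="Date")
df = pd.DataFrame({"x": [44, 75, 62, 10]}, index=idx)

# Group on the index's year and month; the renamed keys become the MultiIndex levels.
monthly = df.groupby(
    [df.index.year.rename("Year"), df.index.month.rename("Month")]
)["x"].sum()
print(monthly)
```

The result is indexed by (Year, Month), which matches the layout of the expected output in the question.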
I have a 1000 x 6 data frame, and one of the column headers is "Date", where the date is presented in the format "JAN2014", "JUN2002", etc.
I would like to split this column in two separate columns: "Year" and "Month" so JAN will be in "Month" column, 2014 will be in "Year" column etc..
Could anyone please tell me how to do this in Python?
You can use the str accessor and indexing:
df['Month'] = df['Date'].str[:3]
df['Year'] = df['Date'].str[3:]
Example:
df = pd.DataFrame({'Date':['JAN2014','JUN2002']})
df['Month'] = df['Date'].str[:3]
df['Year'] = df['Date'].str[3:]
print(df)
Output:
Date Month Year
0 JAN2014 JAN 2014
1 JUN2002 JUN 2002
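Slicing by position leaves Year as a string. If a numeric year is wanted, one possible variation is str.extract with named groups, sketched below; the regular expression and the integer conversion are assumptions of this sketch and require every row to follow the three-letters-plus-four-digits pattern.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["JAN2014", "JUN2002"]})

# Named groups become the column names of the extracted frame.
parts = df["Date"].str.extract(r"^(?P<Month>[A-Za-z]{3})(?P<Year>\d{4})$")

df["Month"] = parts["Month"]
df["Year"] = parts["Year"].astype(int)  # numeric year; assumes every row matched
print(df)
```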