Trying to extract the year from a dataset in Python:
df["YYYY"] = pd.DatetimeIndex(df["Date"]).year
The year appears with a decimal point in the new column:
YYYY
2001.0
2002.0
2015.0
2022.0
How can I make just the year appear, with no decimal point?
You likely have null values in your input, resulting in NaNs and a float type for your column.
No missing values:
pd.DatetimeIndex(['2022-01-01']).year
Int64Index([2022], dtype='int64')
Missing values:
pd.DatetimeIndex(['2022-01-01', '']).year
Float64Index([2022.0, nan], dtype='float64')
I suggest using pandas.to_datetime combined with convert_dtypes:
pd.to_datetime(pd.Series(['2022-01-01', ''])).dt.year.convert_dtypes()
0 2022
1 <NA>
dtype: Int64
Or extract the year directly from the initial strings. But for that we would need a sample of the input.
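Since no sample was shared, here is a sketch of the string-extraction route, assuming ISO-style 'YYYY-MM-DD' strings with a possible empty entry (both the data and the column layout are made up):

```python
import pandas as pd

# Hypothetical input: the question didn't share the raw strings, so this
# assumes ISO-style 'YYYY-MM-DD' dates with a possible empty entry.
s = pd.Series(['2001-05-02', '2015-11-30', ''])

# Grab the leading 4-digit year without any datetime conversion; rows that
# don't match become <NA> instead of forcing the whole column to float.
years = pd.to_numeric(s.str.extract(r'^(\d{4})', expand=False)).astype('Int64')
```

The nullable Int64 dtype keeps the missing entry as <NA> while the valid rows stay plain integers.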
Sample program for your problem:
import pandas as pd
df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000']})
df['date'] = pd.to_datetime(df['date'])
df['year'] = pd.DatetimeIndex(df['date']).year
print(df['year'])
pandas takes care of parsing the date by itself.
If not, we can specify it directly as
df["date_field"] = pd.to_datetime(df["date_field"])
Hope it will make things clear to you.
If not, can you share a sample of your df?
Data import from csv:

Date        Item_1  Item_2
1990-01-01  34      78
1990-01-02  42      19
...
2020-12-31  41      23
df = pd.read_csv(r'Insert file directory')
df.index = pd.to_datetime(df.index)
gb= df.groupby([(df.index.year),(df.index.month)]).mean()
Issue:
So basically, the requirement is to group the data according to year and month before processing, and I thought that the groupby function would have grouped the data so that mean() calculates the average of all values grouped under Jan-1990, Feb-1990 and so on. However, I was wrong: the output is a separate average per column, e.g. of all values under Item_1.
My example is similar to the post below, but in my case it is calculating the mean. I am guessing it has to do with the way the data is arranged after groupby, or that some parameter of mean() has to be specified, but I have no idea which is the cause. Can someone enlighten me on how to correct the code?
Pandas groupby month and year
Update:
Hi all, I have created the sample .csv data file with 3 items and 3 months of data. I am wondering if the cause has to do with the conversion of the data into the df when it is imported from .csv, because I have noticed some weird time data in the leftmost column.
Link to sample file is:
https://www.mediafire.com/file/t81wh3zem6vf4c2/test.csv/file
import pandas as pd
df = pd.read_csv( 'test.csv', index_col = 'date' )
df.index = pd.to_datetime( df.index )
df.groupby([(df.index.year),(df.index.month)]).mean()
Seems to do the trick with the provided data.
IIUC, you want to calculate the mean of all elements. You can use numpy's mean function that operates on the flattened array by default:
import numpy as np

df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
gb = df.groupby([df.index.year, df.index.month]).apply(lambda d: np.mean(d.values))
output:
date date
1990 1 0.563678
2 0.489105
3 0.459131
4 0.755165
5 0.424466
6 0.523857
7 0.612977
8 0.396031
9 0.452538
10 0.527063
11 0.397951
12 0.600371
dtype: float64
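An equivalent way to get one mean per month across all item columns is to stack the columns into a single long Series first. A sketch with dummy data, since the real csv isn't reproduced here (the column names and values are made up):

```python
import pandas as pd
import numpy as np

# Dummy frame standing in for the csv: two item columns over daily dates
idx = pd.date_range('1990-01-01', '1990-03-31', freq='D')
df = pd.DataFrame({'Item_1': np.ones(len(idx)),
                   'Item_2': np.full(len(idx), 3.0)}, index=idx)

# Stack the item columns into one long Series, then a plain groupby-mean
# averages over all elements of each month rather than per column
flat = df.stack()
monthly = flat.groupby([flat.index.get_level_values(0).year,
                        flat.index.get_level_values(0).month]).mean()
```

Here every month mixes values 1.0 and 3.0 equally, so each monthly mean comes out as 2.0, confirming the average runs across both columns.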
I am trying to figure out how to calculate the mean values for each row in this Python Pandas Pivot table that I have created.
I also want to add the sum of each year at the bottom of the pivot table.
The last step I want to do is to take the average value for each month calculated above and divide it with the total average in order to get the average distribution per year.
import pandas as pd
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2017, 12, 31)
libor = web.DataReader('USD1MTD156N', 'fred', start, end) # Reading the data
libor = libor.dropna(axis=0, how= 'any') # Dropping the NAN values
libor = libor.resample('M').mean() # Calculating the mean value per date
libor['Month'] = pd.DatetimeIndex(libor.index).month # Adding a month column
libor['Year'] = pd.DatetimeIndex(libor.index).year # Adding a year column
pivot = libor.pivot(index='Month',columns='Year',values='USD1MTD156N')
print pivot
Any suggestions how to proceed?
Thank you in advance
I think this is what you want (this is on Python 3 - I think only the print command is different in this script):
# Mean of each row
ave_month = pivot.mean(1)
#sum of each year at the bottom of the pivot table.
sum_year = pivot.sum(0)
# average distribution per year.
ave_year = sum_year/sum_year.mean()
print(ave_month, '\n', sum_year, '\n', ave_year)
Month
1 0.324729
2 0.321348
3 0.342014
4 0.345907
5 0.345993
6 0.369418
7 0.382524
8 0.389976
9 0.392838
10 0.392425
11 0.406292
12 0.482017
dtype: float64
Year
2011 2.792864
2012 2.835645
2013 2.261839
2014 1.860015
2015 2.407864
2016 5.953718
2017 13.356432
dtype: float64
Year
2011 0.621260
2012 0.630777
2013 0.503136
2014 0.413752
2015 0.535619
2016 1.324378
2017 2.971079
dtype: float64
I would use pivot_table over pivot, and then use the aggfunc parameter.
pivot = libor.pivot(index='Month',columns='Year',values='USD1MTD156N')
would be
import numpy as np
pivot = libor.pivot_table(index='Month',columns='Year',values='USD1MTD156N', aggfunc=np.mean)
You should also be able to drop the resample statement, if I'm not mistaken.
A link to the docs:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
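A minimal sketch of that idea on dummy data rather than the FRED series (the daily values here are invented, only the column names follow the question):

```python
import pandas as pd
import numpy as np

# Dummy daily data standing in for the FRED series
idx = pd.date_range('2016-01-01', '2017-12-31', freq='D')
df = pd.DataFrame({'USD1MTD156N': np.arange(len(idx), dtype=float)}, index=idx)

df['Month'] = df.index.month
df['Year'] = df.index.year

# pivot_table aggregates duplicate (Month, Year) pairs itself, which is why
# the separate .resample('M').mean() step becomes unnecessary
pivot = df.pivot_table(index='Month', columns='Year',
                       values='USD1MTD156N', aggfunc='mean')
```

The result is the same 12-row, one-column-per-year table, with the monthly averaging folded into the pivot itself.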
I am trying to load some data from a csv file and find that the values for day and month get interchanged.
Given below is how the data looks:
id,date
1001,09/10/2018
1002,20/09/2018
1003,09/05/2018
All of the dates are from September, but as seen they are interchanged across different formats. I am using the below to convert to datetime:
df['date'] = pd.to_datetime(df['date']).dt.strftime('%d/%m/%Y')
I've figured out a neat little trick using str.extract and pd.to_datetime to do this quickly and efficiently:
m = df.date.str.extract(r'(?:(09)/(\d+))')[1].astype(int) > 31
df['date'] = [
pd.to_datetime(d, dayfirst=m) for d, m in zip(df.date, m)]
id date
0 1001 2018-09-10
1 1002 2018-09-20
2 1003 2018-09-05
Pandas has no issues dealing with your sample data because it clearly comes in the US notation apart from the case of '20/09/2018' where 20 cannot possibly be a month which pandas has no problem dealing with either.
However, the input may contain e.g. '10/09/2018' (as was mentioned in the comments), where it's impossible to tell day and month apart unless either the US notation is assumed or it is known beforehand that absolutely all dates are in September.
Since the latter seems to be the case, you can do
df['date'].map(lambda x: pd.Timestamp(x.year, x.day, x.month)
               if (x.month != 9) & (x.day == 9)
               else x)
0 2018-09-10
1 2018-09-20
2 2018-09-05
I am basically new to python and I have the below requirement
I have dates from Jan to Dec and average values for some items.
In the attached image there are 5 rows belonging to the May month and 6 rows belonging to the June month.
How can we iterate and calculate the average month-wise? I want to calculate the averages of Food, Drinks and wastage for the months of May and June (and month-wise for all 12 months of data I have).
I need output like
Month Food Drink wastage
May-17 2.0 3.0 2.0
June-17 2.5 2.5 3.0
First put your data into a pandas DataFrame - I made up my own dummy data; you need to figure out how to load your source (from csv or excel).
Initiate the frame
import pandas as pd
import datetime
df1 = pd.DataFrame({'Start_date': ['2018-01-01', '2018-01-02', '2018-01-03',
                                   '2018-02-01', '2018-03-10', '2018-02-05'],
                    'food': [2, 2.5, 3, 2.4, 5, 4],
                    'drinks': [1, 2, 3, 4, 5, 6],
                    'wastage': [6, 5, 4, 3, 2, 1]})
Make sure your date column has a datetime type - in this case my input was a string, so I needed to cast it (you may need a different format string here); see the bottom of the datetime documentation for the format codes: https://docs.python.org/2/library/datetime.html
df1.Start_date = pd.to_datetime(df1.Start_date, format ='%Y-%m-%d')
I would add a month column:
Edited with year:
df1["period"] = df1.Start_date.apply(lambda x: datetime.datetime.strftime(x, '%b-%y'))
df1['month'] = pd.DatetimeIndex(df1.Start_date).month
Apply group by and mean
df1.groupby(['month']).mean() # for only month groupings
df1.groupby(['period']).mean() # for output listed above
import calendar
df= pd.DataFrame({'date': ['6/8/2015','7/10/2018','6/5/2015'],'food':[1.5,2.5,3],'drinks':[2,2.4,3],'wastage':[2,2.5,3],})
df.date=pd.to_datetime(df.date,format="%m/%d/%Y")
df=pd.DataFrame(df.groupby(by=[df.date.dt.month.rename('month'),df.date.dt.year.rename('year')]).mean()).reset_index()
df['month'] = df['month'].apply(lambda x: calendar.month_abbr[x])
df['year']=df['year'].apply(str)
df['year']=df.year.str.replace("20","")
df['period'] = df[['month', 'year']].apply(lambda x: '-'.join(x), axis=1)
df=df.drop(['year','month'],axis=1)
df=df.rename(index=str, columns={"period": "month"})
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df[cols]
Output
month drinks food wastage
0 Jun-15 2.5 2.25 2.5
1 Jul-18 2.4 2.50 2.5
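The same result can be reached more directly by building the 'Jun-15' style label up front with dt.strftime, which replaces the month-abbreviation, year-trimming and join steps with a single groupby key. A sketch on the same sample data:

```python
import pandas as pd

# Same sample data as above; the 'Mon-yy' label is built in one strftime call
df = pd.DataFrame({'date': ['6/8/2015', '7/10/2018', '6/5/2015'],
                   'food': [1.5, 2.5, 3], 'drinks': [2, 2.4, 3],
                   'wastage': [2, 2.5, 3]})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# Group directly on the formatted label and average the value columns
out = (df.groupby(df['date'].dt.strftime('%b-%y'))[['food', 'drinks', 'wastage']]
         .mean()
         .reset_index()
         .rename(columns={'date': 'month'}))
```

This yields the same per-period averages (e.g. food 2.25 for Jun-15), though the rows come out in alphabetical label order rather than chronological order.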
Suppose I have a dataframe with the index as a monthly timestep. I know I can use dataframe.groupby(lambda x: x.year) to group monthly data into yearly and apply other operations. Is there some way I could quickly group them, let's say, by decade?
thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have something on which you can use .year, you could still do lambda x: (x.year//10)*10.
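A quick sketch of the lambda form on invented monthly data, where each value is 1 so the per-decade sums are just month counts:

```python
import pandas as pd
import numpy as np

# Monthly data spanning three decades; all ones, so decade sums count months
dates = pd.date_range('1995-01-31', periods=200, freq='M')
s = pd.Series(np.ones(len(dates)), index=dates)

# The lambda receives each index label (a Timestamp here) and maps it
# to the start year of its decade
by_decade = s.groupby(lambda x: (x.year // 10) * 10).sum()
```

Starting in 1995, the 200 months split as 60 in the 1990s, 120 in the 2000s and 20 in the 2010s.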
If your DataFrame has headers, say ['Population', 'Salary', 'vehicle count']:
Make Year your index: dataframe = dataframe.set_index('Year')
Use the code below to resample the data into decades of 10 years (the index must be datetime-like); it also gives you the sum of all the other columns within that decade:
dataframe = dataframe.resample('10AS').sum()
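A small sketch of that resample on made-up yearly data (25 year-start rows of ones, so the per-decade sums simply count the years in each bucket):

```python
import pandas as pd
import numpy as np

# Hypothetical yearly data: 25 year-start timestamps from 1990, all ones
idx = pd.date_range('1990-01-01', periods=25, freq='AS')
df = pd.DataFrame({'Population': np.ones(25)}, index=idx)

# '10AS' buckets the rows into 10-year bins anchored on year starts
decades = df.resample('10AS').sum()
```

With 25 years of data this produces three decade buckets whose sums add back up to 25.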
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group up:
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: the iloc[:, 0] here chooses the first column in your dataframe.
You can find the various offsets here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases