I have a DataFrame with one column that is a datetime series. Another column holds data associated with each date. The years range from 2005 to 2014. I want to group matching dates across years together, i.e., every 1st of January falling in 2005-2014 must be grouped together irrespective of the year, and similarly for all 365 days of the year. So I should have 365 days as the output. How can I do that?
Assuming your DataFrame has a column Date, you can make it the index of the DataFrame, use strftime to convert it to a format with only month and day (like "%m-%d"), and then apply groupby with the appropriate aggregation function (I used mean here):
df = df.set_index('Date')
df.index = df.index.strftime("%m-%d")
dfAggregated = df.groupby(level=0).mean()
Please note that the output will have 366 days, due to leap years. You might want to filter out the data associated with Feb 29th, or merge it into Feb 28th/March 1st, depending on the specific use case of your application.
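Putting those three lines together on a hypothetical ten-year frame (the column names Date and value are assumptions), including the leap-day filtering mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one value per day for 2005-2014
dates = pd.date_range('2005-01-01', '2014-12-31', freq='D')
df = pd.DataFrame({'Date': dates, 'value': np.arange(len(dates), dtype=float)})

df = df.set_index('Date')
df.index = df.index.strftime('%m-%d')      # keep only month and day
dfAggregated = df.groupby(level=0).mean()  # average each calendar day across the ten years

# Drop the leap day so exactly 365 groups remain
dfAggregated = dfAggregated.drop('02-29')
print(len(dfAggregated))
```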
Related
I have a DataFrame (df) whose columns are company names and whose entries are each company's cumulative sales volume for a certain day in a certain year, up until that point. The df is indexed by dates in datetime format. I would like to select the values belonging to 31 December of every year, and if no entry exists for that date, I want to select the closest previous date that has a value, returning a DataFrame with the companies and their entries for 31 December of every year (or the most recent date available). For example, if there is no entry for Amazon on 12-31-2015, but there is one for 12-30-2015, I want to retrieve that entry. Currently I have the following code to retrieve the entries belonging to every 31st of December:
end_of_year_sales = df.loc[(df.index.month==12) & (df.index.day==31) ]
However, while this correctly retrieves the companies' sales volumes for the 31st of December, I do not know how to fall back to the most recent available value when a company has no value on the 31st of December in a certain year.
This will only work for December values, but here is one trick:
df.loc[df.index.to_series().resample('y').last().values]
That is, just resample annually and get the last value. It will work regardless of the number of days you have in that month.
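A minimal sketch of that trick on a toy frame (the company column and the dates are made up), where 2015 has no 12-31 row:

```python
import pandas as pd

idx = pd.to_datetime(['2014-12-30', '2014-12-31', '2015-12-29', '2015-12-30'])
df = pd.DataFrame({'Amazon': [10, 12, 30, 31]}, index=idx)

# Last available date within each calendar year, then the matching rows
eoy_dates = df.index.to_series().resample('Y').last().values
print(df.loc[eoy_dates])
```

For 2015 this picks up 12-30, the closest available date to year end.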
You can do it this way: build a mask selecting the rows whose date is December 31st, then use head to inspect the most recent entries.
date_mask = (df['date'].dt.month == 12) & (df['date'].dt.day == 31)
df = df.loc[date_mask]
df.head(10)
Assuming that I have a series made of daily values:
dates = pd.date_range('1/1/2004', periods=365, freq="D")
ts = pd.Series(np.random.randint(0,101, 365), index=dates)
I need to use .groupby or .reduce with a fixed schema of dates.
Using ts.resample('8d') isn't an option, as the dates must not fluctuate within the month, and the last chunk of each month needs to be flexible to absorb the different month lengths and leap years.
A list of dates can be obtained through:
g = dates[dates.day.isin([1,8,16,24])]
How can I group or reduce my data to this specific schema so I can compute the sum, max and min in a more elegant and efficient way than:
for i in range(0, len(g) - 1):
    ts.loc[(g[i] < ts.index) & (ts.index < g[i + 1])]
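For reference, a runnable sketch of one possible vectorized alternative to that loop, using searchsorted to label each day with the most recent schema date at or before it (this assumes the schema dates in g cover the start of the series):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2004', periods=365, freq='D')
ts = pd.Series(np.random.randint(0, 101, 365), index=dates)

g = dates[dates.day.isin([1, 8, 16, 24])]

# Label each day with the latest schema date <= that day, then aggregate per label
labels = g[g.searchsorted(ts.index, side='right') - 1]
agg = ts.groupby(labels).agg(['sum', 'max', 'min'])
print(agg.head())
```

Unlike the strict inequalities in the loop, this assigns every day, including the schema dates themselves, to exactly one bin.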
Well, from a calendar point of view, you can group them into calendar weeks, day of week, months, and so on.
If that is something you would be interested in, you can do it easily with pandas, for example:
df['week'] = df['date'].dt.isocalendar().week  # create week column (dt.week is deprecated)
df.groupby(['week'])['values'].sum()  # sum values by week
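A self-contained run of those two lines on a hypothetical frame spanning exactly two ISO weeks (recent pandas deprecates Series.dt.week in favor of dt.isocalendar().week):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2004-01-05', periods=14, freq='D'),  # starts on a Monday
    'values': np.ones(14),
})

df['week'] = df['date'].dt.isocalendar().week  # ISO week number
weekly = df.groupby(['week'])['values'].sum()
print(weekly)
```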
I have a DataFrame, and these are the first 5 index entries; there are several rows with different data points for a given date before it moves on to the next day:
DatetimeIndex(['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-01',
'2014-01-01'],
dtype='datetime64[ns]', name='DayStartedOn', freq=None)
and this is the current column dtypes
country object
type object
name object
injection float64
withdrawal float64
cy_month period[M]
I wish to add a column with the calendar year and month, and two more columns with different fiscal years and months. It would be better to keep year and month in separate columns (calendar year, calendar month, fiscal year, fiscal month); the objective is to preserve these column values when I regroup or resample with other columns.
I achieved the cy_month above with
df['cy_month'] = df.index.to_period('M')
though I don't feel very comfortable with it, as I want the period, not the month end.
I tried to add these 2 columns
for calendar year:
pd.Period(df_storage_clean.index.year, freq='A-DEC')
for another fiscal year:
pd.Period(df_storage_clean.index.year, freq='A-SEP')
but had Traceback:
ValueError: Value must be Period, string, integer, or datetime
So I started NOT using pandas, looping row by row and appending to a list,
lst_period_cy = []
for y in lst_cy:
    period_cy = pd.Period(y, freq='A-DEC')
    lst_period_cy.append(period_cy)
then converting the list to a Series or DataFrame and adding it back to the df, but I suppose that's not efficient (150k rows of data), so I haven't continued.
Just in case you haven't found a solution yet ...
You could do the following:
df.reset_index(drop=False, inplace=True)
df['cal_year_month'] = df.DayStartedOn.dt.month
df['cal_year'] = df.DayStartedOn.dt.year
df['fisc_year'] = df.DayStartedOn.apply(pd.Period, freq='A-SEP')
df.set_index('DayStartedOn', drop=True, inplace=True)
My assumption is that, as in your example, the index is named DayStartedOn. If that's not the case then the code has to be adjusted accordingly.
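Since the motivation was to avoid a 150k-row Python loop: to_period is vectorized on a DatetimeIndex, so the fiscal column can also be built without apply. A minimal sketch on a made-up three-row frame (assuming, as above, an index named DayStartedOn):

```python
import pandas as pd

idx = pd.DatetimeIndex(['2014-01-01', '2014-10-01', '2015-09-30'],
                       name='DayStartedOn')
df = pd.DataFrame({'injection': [1.0, 2.0, 3.0]}, index=idx)

# Vectorized: whole index converted to fiscal-year periods (year ending September)
df['fisc_year'] = df.index.to_period('A-SEP')
print(df['fisc_year'])
```

With freq='A-SEP' a fiscal year is labeled by its ending year, so October 2014 already falls in fiscal 2015.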
I am reading an .xlsx spreadsheet into a Pandas DataFrame so that I can remove duplicate rows based on all columns and export the DataFrame into a .csv. One of the columns is a date column formatted as MM/DD/YY.
Here is a sample of the unaltered data
This spreadsheet contains abnormal pay-hours entries for a payroll that is paid every Friday, based on hours from one week previous to the current week. Rows are added each day there is an abnormal entry, with that day's data. I want to tell pandas to only find duplicates in rows whose date is less than or equal to the Friday one week previous to the current Friday (this script will only be run on Fridays). For example, if today is Friday 12/7/18, I want to set a cutoff date of the previous Friday, 11/30/18, and only look at rows whose dates are on or before 11/30/18. How can I trim the DataFrame in this way before executing drop_duplicates?
You can use date and timedelta: get today's date, store the date one week prior to today, and then filter your data (I'm not sure how you have it stored, but I used generic names):
from datetime import date, timedelta
today = date.today()
week_prior = today - timedelta(weeks=1)
df_last_week = df[df['date'] <= week_prior]
Note that using a fixed time window of 1 week (or 7 days) is fine if you are sure that your script will only ever be run on a Friday.
You can, of course, programmatically get the date of last Friday and filter your dataframe on that date:
from datetime import datetime, timedelta
last_friday = datetime.now().date() - timedelta(days=datetime.now().weekday()) + timedelta(days=4, weeks=-1)
print(df[df['date'] <= pd.Timestamp(last_friday)])
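To sanity-check that arithmetic against the dates from the question (a hypothetical helper wrapping the same expression):

```python
from datetime import date, timedelta

def last_friday(today: date) -> date:
    # Back to this week's Monday, forward to Friday, then back one week
    return today - timedelta(days=today.weekday()) + timedelta(days=4, weeks=-1)

print(last_friday(date(2018, 12, 7)))  # Friday 12/7/18 -> previous Friday 11/30/18
```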
I've got a DataFrame that looks like this:
It has two columns, one of them being a "from" datetime and one of them being a "to" datetime. I would like to change this DataFrame such that it has a single column or index for the date (e.g. 2015-07-06 00:00:00 in datetime form) with the variables of the other columns (like deep) split proportionately into each of the days. How might one approach this problem? I've meddled with groupby tricks and I'm not sure how to proceed.
So I don't have time to work through your specific problem at the moment, but the way to approach this is to use pandas.resample(). Here are the steps I would take: 1) resample your "to" date column by minute; 2) populate the other columns out over that resample; 3) add the date column back in as an index.
If this doesn't work, or is tricky to work with, I would create a date range from your earliest date to your latest date (at the smallest interval you want, so maybe hourly?) and then run some conditional statements over your other columns to fill in the data.
Here is roughly what your code might look like for the resample portion (replace day with hour or whatever):
drange = pd.date_range('01-01-1970', '01-20-2018', freq='D')
data = data.resample('D').ffill()
data.index.name = 'date'
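A minimal toy run of the forward-fill step, assuming a sparse frame with a deep column (names made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'deep': [1.0, 4.0]},
    index=pd.to_datetime(['2015-07-06', '2015-07-09']),
)

# Resample to daily rows and carry each value forward over the gap
filled = df.resample('D').ffill()
print(filled)
```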
Hope this helps!