Suppose an xarray dataset has a variable with a time dimension holding daily values over some multiyear span, say
2017-01-01 ... 2018-12-31. It is possible to group the data by month, or by the day of the year, using
.groupby("time.month") or .groupby("time.dayofyear")
Is there a way to efficiently group the data by the day of the month, for example if I wanted to calculate the mean value on the 21st of each month?
See the xarray docs on the DateTimeAccessor helper object. For more info, you can also check out the xarray docs on Working with Time Series Data: Datetime Components, which in turn refers to the pandas docs on date/time components.
You're looking for day. Unfortunately, both pandas and xarray simply describe .dt.day as referring to "the days of the datetime", which isn't particularly helpful. But if you take a look at Python's native datetime.date.day definition, you'll see the more specific:
date.day
Between 1 and the number of days in the given month of the given year.
So, simply
da.groupby("time.day")
should do the trick!
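For instance, a minimal sketch of the mean-on-the-21st case (toy data; the variable and coordinate names are assumptions):
import numpy as np
import pandas as pd
import xarray as xr

# Toy daily data over the two-year span from the question
time = pd.date_range("2017-01-01", "2018-12-31", freq="D")
da = xr.DataArray(np.random.rand(time.size), coords={"time": time}, dims="time")

# Mean for each day of the month (1..31), then pick out the 21st
by_day = da.groupby("time.day").mean()
print(by_day.sel(day=21))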
I'm not sure, but maybe you can do something like this:
import datetime

# Extract day / month / year strings from a datetime
x = datetime.datetime.now()
day = x.strftime("%d")
month = x.strftime("%m")
year = x.strftime("%Y")

and then group on the extracted component, e.g. .groupby(month) or .groupby(year).
Related
Is it possible to use .resample() to take the last observation in a month of a weekly time series, in order to create a monthly time series from the weekly one? I don't want to sum or average anything, just take the last observation of each month.
Thank you.
Based on what you want and what the documentation describes, you could try the following:
data[COLUMN].resample('M').last()
Try it out and update us!
References
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
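A minimal end-to-end sketch of that idea on toy weekly data (note that pandas >= 2.2 prefers the 'ME' month-end alias over 'M'):
import numpy as np
import pandas as pd

# Toy weekly series
idx = pd.date_range("2023-01-01", periods=12, freq="W")
weekly = pd.Series(np.arange(12), index=idx)

# Last observation of each month, with no summing or averaging
monthly = weekly.resample('M').last()
print(monthly)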
Is the 'week' field a week of year, a date, or something else?
If it's a datetime, use .dt.to_period('M') on your current date column to create a new 'month' column, then get the max date for each month to find the date to sample (if you only want the LAST date in each month?)
Like df.groupby('month')['MyDateField'].max()
Someone else is posting as I type this, so may have a better answer :)
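A quick self-contained sketch of that suggestion (the column names here are hypothetical):
import pandas as pd

# Toy weekly data; 'MyDateField' is a placeholder column name
df = pd.DataFrame({'MyDateField': pd.date_range('2023-01-01', periods=10, freq='W'),
                   'value': range(10)})

# Month period per row, then the last observed date in each month
df['month'] = df['MyDateField'].dt.to_period('M')
print(df.groupby('month')['MyDateField'].max())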
I have a table which contains information on the number of changes done on a particular day. I want to add a text field to it in the format YYYY-WW (e.g. 2022-01) which indicates the week number of the day. I need this information to determine in which week the total number of changes was the highest.
How can I determine the week number in Python?
Below is the code based on this answer:
# Python 3.9+: isocalendar() returns a named tuple with year, week and weekday fields
week_nr = day.isocalendar().week
year = day.isocalendar().year
week_nr_txt = "{:4d}-{:02d}".format(year, week_nr)
At first glance it seems to work, but I am not sure that week_nr_txt will contain the year-week tuple according to the ISO 8601 standard.
Will it?
If not how do I need to change my code in order to avoid any week-related errors (example see below)?
Example of a week-related error: In year y1 there are 53 weeks and the last week spills over into the year y1+1.
The correct year-week tuple is y1-53. But I am afraid that my code above will result in y2-53 (y2=y1+1) which is wrong.
Thanks, I'll try to give my answer. You can use Python's datetime module like this:
from datetime import datetime

date = datetime(year, month, day)
# Format with the ISO 8601 year (%G) and ISO week number (%V);
# '%Y-%U' would mix the calendar year with a non-ISO week count
date.strftime('%G-%V')
Then you will have the ISO year and the week in which the changes occurred.
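To check the year-boundary case from the question, a quick sanity check (Python 3.9+ for the named isocalendar fields):
from datetime import date

# 2021-01-01 belongs to ISO week 53 of ISO year 2020
d = date(2021, 1, 1)
print(d.isocalendar().year, d.isocalendar().week)  # 2020 53
print(d.strftime('%G-%V'))                         # 2020-53
print(d.strftime('%Y-%U'))                         # 2021-00 (not ISO)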
Assuming that I have a series made of daily values:
dates = pd.date_range('1/1/2004', periods=365, freq="D")
ts = pd.Series(np.random.randint(0,101, 365), index=dates)
I need to use .groupby or .reduce with a fixed schema of dates.
Using ts.resample('8d') isn't an option, as the dates must not fluctuate within the month, and the last chunk of each month needs to be flexible to accommodate the different month lengths and leap years.
A list of dates can be obtained through:
g = dates[dates.day.isin([1,8,16,24])]
How can I group or reduce my data to this specific schema, so I can compute the sum, max, and min in a more elegant and efficient way than:
for i in range(0, len(g) - 1):
    ts.loc[(g[i] < ts.index) & (ts.index < g[i + 1])]
Well, from a calendar point of view you can group them into calendar weeks, day of week, months, and so on.
If that is something you would be interested in, you can do it easily with pandas, for example:
import pandas as pd

df['week'] = df['date'].dt.isocalendar().week  # create a week column (.dt.week is deprecated)
df.groupby(['week'])['values'].sum()           # sum values by week
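If you need the exact 1/8/16/24 schema from the question rather than calendar weeks, here is a minimal sketch using pd.cut (the terminal edge is an assumption, added so the last chunk of December is kept):
import numpy as np
import pandas as pd

dates = pd.date_range('2004-01-01', periods=365, freq='D')
ts = pd.Series(np.random.randint(0, 101, 365), index=dates)

# Anchor dates: the 1st, 8th, 16th and 24th of each month
g = dates[dates.day.isin([1, 8, 16, 24])]

# Append a terminal edge, then bin each timestamp into [anchor, next anchor)
edges = g.append(pd.DatetimeIndex([dates[-1] + pd.Timedelta(days=1)]))
binned = pd.cut(ts.index, bins=edges, right=False)
print(ts.groupby(binned).agg(['sum', 'max', 'min']))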
I have a DataFrame with a timestamp column containing all the days of the year.
I would like to keep only the first day of the month; any idea how I should do this?
See this article for an example of how to ask a good question on Stack Overflow and provide a minimal reproducible example:
https://stackoverflow.com/help/how-to-ask
With that in mind, you can access the day attribute of a datetime object as follows:
from datetime import datetime

dt = datetime.today()
print(dt)      # e.g. 2021-07-11 09:37:23.122548
print(dt.day)  # 11
You could then use masking to select rows in your dataframe that have a value of 1 for day as below:
df = df[df['date_column'].dt.day == 1]
You'll just need to replace 'date_column' with whatever your date column is called.
We've got a tutorial for a complete introduction to pandas on our website, feel free to take a look if you'd like to learn more!
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
This seems like it would be fairly straightforward, but after nearly an entire day I have not found the solution. I've loaded my dataframe with read_csv and easily parsed, combined, and indexed a date and a time column into one column, but now I want to be able to reshape and perform calculations based on hour and minute groupings, similar to what you can do in an Excel pivot.
I know how to resample to hour or minute, but it maintains the date portion associated with each hour/minute, whereas I want to aggregate the data set ONLY by hour and minute, similar to grouping in Excel pivots and selecting "hour" and "minute" but nothing else.
Any help would be greatly appreciated.
Can't you do, where df is your DataFrame:
times = pd.to_datetime(df.timestamp_col)
df.groupby([times.dt.hour, times.dt.minute]).value_col.sum()
Wes' code didn't work for me. But the DatetimeIndex function (docs) did:
times = pd.DatetimeIndex(data.datetime_col)
grouped = df.groupby([times.hour, times.minute])
The DatetimeIndex object is a representation of times in pandas. The first line creates an array of the datetimes. The second line uses this array to get the hour and minute data for all of the rows, allowing the data to be grouped (docs) by these values.
Came across this when I was searching for this type of groupby. Wes' code above didn't work for me; not sure if it's because of changes in pandas over time.
In pandas 0.16.2, what I did in the end was:
grp = data.groupby(by=[data.datetime_col.map(lambda x : (x.hour, x.minute))])
grp.count()
You'd have (hour, minute) tuples as the grouped index. If you want a multi-index:
grp = data.groupby(by=[data.datetime_col.map(lambda x : x.hour),
data.datetime_col.map(lambda x : x.minute)])
I have an alternative to Wes' and Nix's answers above, with just one line of code. Assuming your column is already a datetime column, you don't need to get the hour and minute attributes separately:
df.groupby(df.timestamp_col.dt.time).value_col.sum()
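Note that .dt.time groups by the full time of day, seconds included. A minimal self-contained run, with hypothetical column names:
import pandas as pd

df = pd.DataFrame({
    'timestamp_col': pd.to_datetime(['2022-01-01 09:30:00',
                                     '2022-01-02 09:30:00',
                                     '2022-01-01 10:15:00']),
    'value_col': [1, 2, 3],
})

# Rows sharing the same time of day are combined, regardless of date
print(df.groupby(df.timestamp_col.dt.time).value_col.sum())
# 09:30:00    3
# 10:15:00    3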
This might be a little late, but I found quite a good solution for anyone who has the same problem.
I have a df like this:
datetime value
2022-06-28 13:28:08 15
2022-06-28 13:28:09 30
... ...
2022-06-28 14:29:11 20
2022-06-28 14:29:12 10
I want to convert those timestamps, which are at one-second intervals, to timestamps at one-minute intervals, summing the value column in the process.
There is a neat way of doing it:
df['datetime'] = pd.to_datetime(df['datetime'])  # if not already a datetime column
grouped = df.groupby(pd.Grouper(key='datetime', axis=0, freq='T')).sum()
print(grouped.head())
Result:
datetime value
2022-06-28 13:28:00 45
... ...
2022-06-28 14:29:00 30
freq='T' stands for minutes (in recent pandas versions 'min' is the preferred spelling). You could also group by hours or days; these frequency strings are called offset aliases.
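For example, continuing with the same df, grouping by hour is just a change of frequency (recent pandas prefers 'h' over 'H'):
# Same pattern at an hourly frequency
hourly = df.groupby(pd.Grouper(key='datetime', freq='H')).sum()
print(hourly.head())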