I have a DataFrame with a timestamp column containing every day of the year.
I would like to keep only the first day of each month. Any idea how I should do this?
See this article for an example of how to ask a good question on Stack Overflow and provide a minimal reproducible example:
https://stackoverflow.com/help/how-to-ask
With that in mind, you can access the day attribute of a datetime object as follows:
from datetime import datetime
dt = datetime.today()
print(dt)      # full timestamp
print(dt.day)  # day of the month only
Output:
2021-07-11 09:37:23.122548
11
You could then use masking to select rows in your dataframe that have a value of 1 for day as below:
df = df[df['date_column'].dt.day == 1]
You'll just need to replace 'date_column' with whatever your date column is called.
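As a minimal, self-contained sketch of the masking approach above: the frame below is made up for illustration (the column names 'date_column' and 'value' and the year 2021 are assumptions, not taken from the question).

```python
import pandas as pd

# Hypothetical frame: one row per calendar day of 2021.
df = pd.DataFrame({"date_column": pd.date_range("2021-01-01", "2021-12-31", freq="D")})
df["value"] = range(len(df))

# Boolean mask: True where the day-of-month component equals 1.
first_days = df[df["date_column"].dt.day == 1]

print(first_days.head())
```

Note that .dt only works once the column has a datetime dtype; if the dates are stored as strings, converting them with pd.to_datetime first is probably needed.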
We've got a tutorial for a complete introduction to pandas on our website, feel free to take a look if you'd like to learn more!
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
Related
Is it possible to use .resample() to take the last observation in a month of a weekly time series to create a monthly time series from the weekly time series? I don't want to sum or average anything, just take the last observation of each month
Thank you.
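Based on what you want and what the documentation describes, resample() only groups the data, so you still need to pick the last observation of each group, for example:
data[COLUMN].resample('M').last()  # newer pandas versions spell the month-end alias 'ME'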
Try it out and update us!
References
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
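For what it's worth, here is a small sketch of taking the last weekly observation per month. The weekly series is invented for illustration, and an offset object (pd.offsets.MonthEnd()) is passed instead of a string alias only because the alias spelling differs between pandas versions.

```python
import pandas as pd

# Hypothetical weekly series: ten Sundays starting 2021-01-03.
weekly = pd.Series(
    range(10),
    index=pd.date_range("2021-01-03", periods=10, freq="W"),
)

# Group into calendar months and keep the last observation of each month.
monthly = weekly.resample(pd.offsets.MonthEnd()).last()

print(monthly)
```

The resulting index is labelled with month-end dates rather than the dates of the original observations.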
Is the 'week' field a week of the year, a date, or something else?
If it's a datetime column, use .dt.to_period('M') on it to create a new 'month' column, then take the max date within each month to get the date to sample (if you only want the LAST date in each month).
For example, df.groupby('month')['MyDateField'].max()
Someone else is posting as I type this, so may have a better answer :)
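A rough sketch of that idea, assuming the date column is already a datetime. 'MyDateField' comes from the answer above; the 'value' column and the sample dates are made up.

```python
import pandas as pd

# Hypothetical weekly-ish data.
df = pd.DataFrame({
    "MyDateField": pd.to_datetime(
        ["2021-01-08", "2021-01-29", "2021-02-05", "2021-02-26", "2021-03-05"]
    ),
    "value": [10, 11, 12, 13, 14],
})

# Label every row with its calendar month.
df["month"] = df["MyDateField"].dt.to_period("M")

# Latest date observed within each month.
last_dates = df.groupby("month")["MyDateField"].max()

# Keep only the rows that fall on one of those dates.
monthly = df[df["MyDateField"].isin(last_dates)]

print(monthly)
```

Unlike resample, this keeps the original observation dates instead of relabelling them to month ends.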
Suppose a variable in an xarray dataset has a time dimension with daily values over some multiyear time span, e.g.
2017-01-01 ... 2018-12-31. It is then possible to group the data by month, or by the day of the year, using
.groupby("time.month") or .groupby("time.dayofyear")
Is there a way to efficiently group the data by the day of the month, for example if I wanted to calculate the mean value on the 21st of each month?
See the xarray docs on the DateTimeAccessor helper object. For more info, you can also check out the xarray docs on Working with Time Series Data: Datetime Components, which in turn refers to the pandas docs on date/time components.
You're looking for day. Unfortunately, both pandas and xarray simply describe .dt.day as referring to "the days of the datetime", which isn't particularly helpful. But if you take a look at Python's native datetime.date.day definition, you'll see the more specific:
date.day
Between 1 and the number of days in the given month of the given year.
So, simply
da.groupby("time.day")
Should do the trick!
I'm not sure, but maybe you can do it like this:
import datetime
x = datetime.datetime.now()
day = x.strftime("%d")
month = x.strftime("%m")
year = x.strftime("%Y")
.groupby(month) or .groupby(year)
I'm using pandas to analyze some data about the House Price Index of all states from quandl:
HPI_Data = quandl.get("FMAC/HPI_AK")
The data looks something like this:
HPI Alaska
Date
1975-01-31 35.105461
1975-02-28 35.465209
1975-03-31 35.843110
and so on.
I've got a second dataframe with some special dates in it:
Date
Name
David 1979-08
Allen 1980-08
Hugo 1989-09
The values for "Date" here are of "string" type and not "date".
I'd like to go 6 months back from each date in the special dataframe and see the values in the HPI dataframe.
I'd like to use .loc, but I have not been able to convert the first dataframe's index from "END OF MONTH" to "MONTH", even after resampling to "1D" and then back to "M".
I would appreciate any help, whether it solves the problem a different way or the janky data-deleting way I want :).
I'm not sure I understand correctly, so please clarify your question if this is not what you meant.
You can convert a string to a pandas datetime object using pd.to_datetime, and use the format parameter to specify how to parse the string:
import pandas as pd
# Creating a dummy Series
sr = pd.Series(['2012-10-21 09:30', '2019-7-18 12:30', '2008-02-2 10:30',
'2010-4-22 09:25', '2019-11-8 02:22'])
# Convert the underlying data to datetime
sr = pd.to_datetime(sr)
# Subtract 6 months from the datetime series
# (months=6 shifts by six months; month=6 would instead set the month to June)
sr - pd.DateOffset(months=6)
To reduce the datetime to just year and month, i.e. 2012-10-21 09:30 --> 2012-10, I would do this:
sr.dt.to_period('M')
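Putting the two pieces together for the question's use case, here is one possible sketch. The HPI values below are placeholders rather than real quandl data, and the column names 'HPI' and 'six_months_back' are assumptions. The idea is to turn both sides into monthly periods so that .loc can match "1979-08" style dates against a month-end index.

```python
import pandas as pd

# Placeholder month-end series standing in for the HPI frame (values are invented).
hpi = pd.DataFrame(
    {"HPI": [100.0 + i for i in range(24)]},
    index=pd.date_range("1979-01-31", periods=24, freq=pd.offsets.MonthEnd()),
)
# Month-end timestamps -> monthly periods (1979-01-31 becomes 1979-01).
hpi.index = hpi.index.to_period("M")

# Special dates stored as "YYYY-MM" strings, as in the question.
special = pd.DataFrame({"Date": ["1979-08", "1980-08"]}, index=["David", "Allen"])

# Parse the strings as monthly periods and step six months back.
six_back = pd.PeriodIndex(special["Date"].tolist(), freq="M") - 6

# Look up the HPI value for each shifted month.
special["six_months_back"] = hpi.loc[six_back, "HPI"].to_numpy()

print(special)
```

If a shifted month is missing from the HPI index, .loc raises a KeyError, so reindex may be a better fit for incomplete data.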
I have a little problem with the .loc function.
Here is the code:
date = df.loc[df['date'] == d].index[0]
d is a specific date (e.g. 21.11.2019)
The problem is that d can fall on a weekend. The date column of the dataframe has no values for weekend days (it contains working days only).
Is there any way to take the next available day if d falls on a weekend?
I would like something like index.get_loc with method='bfill'.
Does anyone know how to implement that for .loc?
IIUC you want to move dates of format: dd.mm.yyyy to nearest Monday, if they happen to fall during the weekend, or leave them as they are, in case they are workdays. The most efficient approach will be to just modify d before you pass it to pandas.loc[...] instead of looking for the nearest neighbour.
What I mean is:
import datetime
d="22.12.2019"
dt=datetime.datetime.strptime(d, "%d.%m.%Y")
if dt.weekday() in [5, 6]:
    # Saturday (5) or Sunday (6): move forward to the following Monday
    dt = dt + datetime.timedelta(days=7 - dt.weekday())
d=dt.strftime("%d.%m.%Y")
Output:
23.12.2019
Edit
To just take the first date on or after d that has an entry in your dataframe, try:
import datetime
import pandas as pd
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
dt = datetime.datetime.strptime(d, "%d.%m.%Y")
# earliest date in the dataframe that is on or after dt
d = df.loc[df['date'] >= dt, 'date'].min()
df.loc[df['date'] == d]...
...
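Since the question asked for something like a bfill lookup, here is a hedged sketch using Index.get_indexer with method="bfill". The sample dates and the 'close' column are made up, and the approach assumes the date column is sorted in ascending order.

```python
import pandas as pd

# Hypothetical working-day data: 21.12.2019 and 22.12.2019 (a weekend) are absent.
df = pd.DataFrame({
    "date": pd.to_datetime(["20.12.2019", "23.12.2019", "24.12.2019"], format="%d.%m.%Y"),
    "close": [1.0, 2.0, 3.0],
})

target = pd.to_datetime("21.12.2019", format="%d.%m.%Y")  # a Saturday

# get_indexer with method="bfill" returns the position of the first entry
# on or after the target, or -1 when no such entry exists.
dates = pd.DatetimeIndex(df["date"])
pos = int(dates.get_indexer([target], method="bfill")[0])

row = df.iloc[pos] if pos != -1 else None
print(row)
```

This also covers holidays and any other missing days, not only weekends.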
I have a dataframe which looks like this:
Year Birthday OnsetDate
5 2018/1/1
5 2018/2/2
Now I subtract the Day column from the OnsetDate column:
df['Birthday'] = df['OnsetDate'] - pd.to_timedelta(df['Day'], unit='Y')
but the resulting Birthday column has a time component mixed in, as shown below:
Birthday
2013/12/31 18:54:00
2013/1/30 18:54:00
The output above is just dummy data. My concern is that the time component makes the date inaccurate after the operation. How can I avoid the time being generated so that I get accurate dates?
Second question: I merge the above dataframe into another dataframe.
new.update(df)
and the Birthday column of the 'new' dataframe became like this:
Birthday
1164394440000000000
1165949640000000000
So what actually caused this, and what is the solution?
For the first question, you should know that pd.to_timedelta with unit='Y' does not represent a whole calendar year. If you print it, you can see that 1 year = 365 days 05:49:12.
print(pd.to_timedelta(1, unit='Y'))
365 days 05:49:12
If you want to avoid the time being generated, you can use DateOffset.
from pandas.tseries.offsets import DateOffset
df['Year'] = df['Year'].apply(lambda x: DateOffset(years=x))
df['Birthday'] = df['OnsetDate'] - df['Year']
Year OnsetDate Birthday
0 <DateOffset: years=5> 2018-01-01 2013-01-01
1 <DateOffset: years=5> 2018-02-02 2013-02-02
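As a variation on the same idea, the sketch below applies a per-row DateOffset without overwriting the Year column. The column names follow the question; the explicit date format is an assumption based on the sample values.

```python
import pandas as pd

df = pd.DataFrame({"Year": [5, 5], "OnsetDate": ["2018/1/1", "2018/2/2"]})
df["OnsetDate"] = pd.to_datetime(df["OnsetDate"], format="%Y/%m/%d")

# Subtract a calendar-aware offset row by row; the Year column stays numeric.
df["Birthday"] = [
    onset - pd.DateOffset(years=int(years))
    for onset, years in zip(df["OnsetDate"], df["Year"])
]

print(df)
```

Because DateOffset(years=...) shifts the calendar year rather than subtracting a fixed duration, no time-of-day component appears.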
As for the second question, it is caused by the column's type: the datetimes ended up stored as integer nanoseconds since the epoch. You can use pd.to_datetime to convert them back.
new['Birthday'] = pd.to_datetime(new['Birthday'])
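To illustrate with the two integers from the question, here is a short sketch; pd.to_datetime interprets plain integers as nanoseconds since 1970-01-01 by default.

```python
import pandas as pd

# The integer values shown in the question.
new = pd.DataFrame({"Birthday": [1164394440000000000, 1165949640000000000]})

# Integers are read as nanoseconds since the Unix epoch.
new["Birthday"] = pd.to_datetime(new["Birthday"])

print(new)
```

The 18:54:00 time of day is still present here because it was introduced by the earlier to_timedelta subtraction, not by the conversion.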