I have a dataframe with a Date column as an index as DateTime type, and a value attached to each entry.
The dates are split into yyyy-mm-dd, with each row being the next day.
Example:
Date: x:
2012-01-01 44
2012-01-02 75
2012-01-03 62
How would I split the Date column into Year and Month columns, using those two as indexes while also summing the values of all the days in a month?
Example of expected output:
Year: Month: x:
2012  1      745
      2      402
      3      453
...
2013  1      4353
Use Series.dt.year and Series.dt.month as the grouping keys, aggregate with GroupBy.sum, and use rename to set the new column names:
df['Date'] = pd.to_datetime(df['Date'])
df1 = (df.groupby([df['Date'].dt.year.rename('Year'),
                   df['Date'].dt.month.rename('Month')])['x']
         .sum()
         .reset_index())
print(df1)
Year Month x
0 2012 1 181
Use groupby and sum:
(df.groupby([df.Date.dt.year.rename('Year'), df.Date.dt.month.rename('Month')])['x']
   .sum())
Year Month
2012 1 181
Name: x, dtype: int64
Note that if "Date" isn't a datetime dtype column, convert it first with
df.Date = pd.to_datetime(df.Date, errors='coerce')
To get "Year" and "Month" back as regular columns instead of index levels, add reset_index:
(df.groupby([df.Date.dt.year.rename('Year'), df.Date.dt.month.rename('Month')])['x']
   .sum()
   .reset_index())
Year Month x
0 2012 1 181
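The question states that Date is the index rather than a column. A minimal sketch for that case, assuming the index is a DatetimeIndex (the sample values below are made up for illustration), groups on the index attributes directly:

```python
import pandas as pd

# Made-up sample: daily values indexed by date (not data from the question)
idx = pd.date_range("2012-01-30", periods=4, freq="D", name="Date")
df = pd.DataFrame({"x": [44, 75, 62, 10]}, index=idx)

# A DatetimeIndex exposes .year and .month directly, without the .dt accessor
monthly = (
    df.groupby([df.index.year, df.index.month])["x"]
      .sum()
      .rename_axis(["Year", "Month"])  # name the two index levels
)
print(monthly)
```

The result is a Series with a (Year, Month) MultiIndex; calling reset_index() on it would turn both levels into ordinary columns.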
I have a table like below containing values for multiple IDs:
ID  value  date
1   20     2022-01-01 12:20
2   25     2022-01-04 18:20
1   10     2022-01-04 11:20
1   150    2022-01-06 16:20
2   200    2022-01-08 13:20
3   40     2022-01-04 21:20
1   75     2022-01-09 08:20
I would like to calculate the week-wise sum of values for each ID:
The start date is given (for example, 01-01-2022).
Weeks are calculated based on this range:
every Saturday 00:00 to the next Friday 23:59 (i.e. Week 1 runs from 01-01-2022 00:00 to 07-01-2022 23:59)
ID  Week 1 sum  Week 2 sum  Week 3 sum  ...
1   180         75          --          --
2   25          200         --          --
3   40          --          --          --
There's a pandas class (pd.Grouper) that lets you specify a groupby instruction. [1] In this case, the specification is to "resample" date at a weekly frequency whose weeks end on Fridays. [2] Since you also need to group by ID, pass it to groupby alongside the grouper.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# pivot the dataframe
df1 = (
    df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
    .unstack(fill_value=0)
)
# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1]+1)]
df1 = df1.reset_index()
[1] What you actually need is a pivot_table result, but groupby + unstack is equivalent and more convenient here.
[2] Because each week runs from Saturday to Friday (Jan 1, 2022 is a Saturday), the anchor must be Friday.
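Since the question gives an explicit start date, another option is to number the weeks by counting whole 7-day blocks from that date. The sketch below is an illustration under that assumption, not part of the answer above; the names start, week and table are made up for the example.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "ID": [1, 2, 1, 1, 2, 3, 1],
    "value": [20, 25, 10, 150, 200, 40, 75],
    "date": ["2022-01-01 12:20", "2022-01-04 18:20", "2022-01-04 11:20",
             "2022-01-06 16:20", "2022-01-08 13:20", "2022-01-04 21:20",
             "2022-01-09 08:20"],
})
df["date"] = pd.to_datetime(df["date"])

# Assumed start date: Saturday 2022-01-01 00:00, as in the question
start = pd.Timestamp("2022-01-01")

# Whole days since the start, in blocks of 7, numbered from 1
df["week"] = (df["date"] - start).dt.days // 7 + 1

table = df.pivot_table(index="ID", columns="week", values="value",
                       aggfunc="sum", fill_value=0)
table.columns = [f"Week {w} sum" for w in table.columns]
print(table.reset_index())
```

Because the week number is derived from the start date itself, this does not depend on which weekday the start date falls on. Weeks that contain no rows at all simply do not appear as columns.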
You can compute a week column. If all your data falls within a single year (unlikely in real-world scenarios), extracting just the week number is enough. If the data spans multiple years, it is wiser to combine the year and the week number.
df['Year-Week'] = df['date'].dt.strftime('%Y-%U')
In the scenario you described, 2022-01-01 and 2022-01-04 18:20 should both fall into week 2022-01. Note that %U counts weeks from Sunday, so it labels 2022-01-01 (a Saturday) as 2022-00; adjust the week definition if that matters.
To calculate your pivot table, you can use pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['Year-Week'], aggfunc='sum')
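To see what the '%Y-%U' label produces in practice, here is a small sketch on two of the question's dates followed by the pivot. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2],
    "value": [20, 25],
    "date": pd.to_datetime(["2022-01-01 12:20", "2022-01-04 18:20"]),
})

# %U is the week of the year with Sunday as the first day of the week;
# days before the first Sunday of the year land in week 00
df["Year-Week"] = df["date"].dt.strftime("%Y-%U")
print(df["Year-Week"].tolist())

pivot = pd.pivot_table(df, values="value", index=["ID"],
                       columns=["Year-Week"], aggfunc="sum")
print(pivot)
```

Saturday 2022-01-01 precedes the first Sunday of 2022, so it is labelled separately from 2022-01-04. This labelling therefore does not reproduce a Saturday-to-Friday week on its own.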
Let's define a formatting helper.
def fmt(row):
    return f"{row.year}-{row.week:02d}"  # We ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
...                    dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
id value date iso
0 1 20 2022-01-01 12:20:00 2021-52
1 2 25 2022-01-04 18:20:00 2022-01
Then just group by the ISO week column.
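A minimal sketch of that final groupby step, assuming the same column names as above plus one extra row from the question so that a group has more than one member. The ISO label is built here with vectorized string operations instead of the fmt helper, which is a substitution made for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1],
    "value": [20, 25, 10],
    "date": pd.to_datetime(["2022-01-01 12:20", "2022-01-04 18:20",
                            "2022-01-04 11:20"]),
})

# isocalendar() returns a DataFrame with year, week and day columns
iso = df["date"].dt.isocalendar()
df["iso"] = (iso["year"].astype(int).astype(str) + "-"
             + iso["week"].astype(int).astype(str).str.zfill(2))

# Sum the values per id and ISO week
weekly = df.groupby(["id", "iso"])["value"].sum()
print(weekly)
```

ISO weeks run from Monday to Sunday, so 2022-01-01 belongs to week 52 of ISO year 2021. That differs from the Saturday-to-Friday weeks the question asks for.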
I have a pandas DataFrame with the columns Month, WeekOfMonth, DayOfWeek and Year.
I want to create a new column that gives the exact date derived from the information in all of those columns. The code should look something like this:
df['Date'] = pd.to_datetime(df['Month']+df['WeekOfMonth']+df['DayOfWeek']+df['Year'])
I was able to find a workaround for your case. You will need to define the dictionaries for the months and the days of the week.
month = {"Jan":"01", "Feb":"02", "March":"03", "Apr": "04", "May":"05", "Jun":"06", "Jul":"07", "Aug":"08", "Sep":"09", "Oct":"10", "Nov":"11", "Dec":"12"}
week = {"Monday":1,"Tuesday":2,"Wednesday":3,"Thursday":4,"Friday":5,"Saturday":6,"Sunday":7}
With these dictionaries, the transformation I used on a custom dataframe was:
rows = [["Dec", 5, "Wednesday", "1995"],
        ["Jan", 3, "Wednesday", "2013"]]
df = pd.DataFrame(rows, columns=["Month", "Week", "Weekday", "Year"])
# Day of month = 7 days per completed week + weekday number (Monday=1 ... Sunday=7)
day = df["Week"].apply(lambda x: (x - 1) * 7) + df["Weekday"].map(week).apply(int)
df['Date'] = (df["Year"] + "-" + df["Month"].map(month) + "-" + day.apply(str)).astype('datetime64[ns]')
However, you have to be careful. Some of the rows you posted as examples produce a day that exceeds the number of days in the month. For example, for
row = ["Oct",5,"Friday","2018"]
the resulting date is 2018-10-33, which is not a valid date. I recommend adding some logic to filter your data in order to avoid this kind of problem.
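One possible way to do that filtering, sketched here as an assumption rather than as part of the answer above, is to parse the assembled strings with errors="coerce" so that impossible dates become NaT and can be dropped or inspected. The sample strings are the ones the formula produces for the rows discussed.

```python
import pandas as pd

# Date strings as the formula above would build them
candidates = pd.Series(["1995-12-31", "2013-01-17", "2018-10-33"])

# Invalid dates become NaT instead of raising an error
parsed = pd.to_datetime(candidates, format="%Y-%m-%d", errors="coerce")

# Keep only the rows that parsed to a real date
valid = parsed.dropna()
print(parsed)
print(valid)
```

Rows flagged with NaT can then be reviewed separately instead of silently producing a wrong date.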
Let's approach it in 3 steps, as follows:
1. Get the date of the month start, Month_Start, from Year and Month
2. Calculate the date offset, DateOffset, relative to Month_Start from WeekOfMonth and DayOfWeek
3. Get the actual date, Date, from Month_Start and DateOffset
Here's the code:
import time
# Step 1: first day of the month
df['Month_Start'] = pd.to_datetime(df['Year'].astype(str) + df['Month'] + '01', format="%Y%b%d")
# Step 2: offset in days from the month start (tm_wday: Monday=0 ... Sunday=6)
weekday = df['DayOfWeek'].map(lambda x: time.strptime(x, '%A').tm_wday)
df['DateOffset'] = (df['WeekOfMonth'] - 1) * 7 + weekday - df['Month_Start'].dt.dayofweek
# Step 3: actual date
df['Date'] = df['Month_Start'] + pd.to_timedelta(df['DateOffset'], unit='D')
Output:
Month WeekOfMonth DayOfWeek Year Month_Start DateOffset Date
0 Dec 5 Wednesday 1995 1995-12-01 26 1995-12-27
1 Jan 3 Wednesday 2013 2013-01-01 15 2013-01-16
2 Oct 5 Friday 2018 2018-10-01 32 2018-11-02
3 Jun 2 Saturday 1980 1980-06-01 6 1980-06-07
4 Jan 5 Monday 1976 1976-01-01 25 1976-01-26
The Date column now contains the dates derived from the information from other columns.
You can remove the working interim columns, if you like, as follows:
df = df.drop(['Month_Start', 'DateOffset'], axis=1)
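Row 2 of the output above lands in November even though its Month is Oct, so it can be useful to flag rows whose computed date leaves the stated month. The sketch below reproduces the three steps on two of the rows and adds such a check. The InMonth column and the weekday_number dictionary are illustrative additions; the dictionary replaces time.strptime to avoid depending on the system locale.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Dec", "Oct"],
    "WeekOfMonth": [5, 5],
    "DayOfWeek": ["Wednesday", "Friday"],
    "Year": [1995, 2018],
})

# Monday=0 ... Sunday=6, matching Series.dt.dayofweek
weekday_number = {"Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
                  "Friday": 4, "Saturday": 5, "Sunday": 6}

# Step 1: first day of the month
month_start = pd.to_datetime(df["Year"].astype(str) + df["Month"] + "01",
                             format="%Y%b%d")
# Step 2: offset in days relative to the month start
offset = ((df["WeekOfMonth"] - 1) * 7
          + df["DayOfWeek"].map(weekday_number)
          - month_start.dt.dayofweek)
# Step 3: actual date
df["Date"] = month_start + pd.to_timedelta(offset, unit="D")

# Flag rows whose computed date spills into another month
df["InMonth"] = df["Date"].dt.month == month_start.dt.month
print(df)
```

Rows where InMonth is False correspond to a week/weekday combination that does not exist in that month under this week definition.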
I have a dataset between 2002 - 2018 which contains 1 value per month, 198 rows in total.
I want to know how I can average all the values from the same month (e.g. January/2003 + ... + January/2018)
# ISO dates (yyyy-mm-dd) are parsed directly; pd.datetime was removed in pandas 2.0
df = pd.read_csv('turbidez.csv', parse_dates=['date'], index_col='date')
data = df['x']
data.head()
date
2002-07-31 8.466111
2002-08-31 6.234259
2002-09-30 8.160763
2002-10-31 4.927685
2002-11-30 8.125012
After searching a bit I found this solution, but couldn't apply it properly to my data.
Thank you in advance for any assistance.
Use pandas.to_datetime and pandas.Series.dt.month:
# Sample data
date x
0 2002-07-31 8.466111
1 2003-07-31 6.234259
2 2002-09-30 8.160763
3 2003-09-30 4.927685
4 2002-11-30 8.125012
df["date"] = pd.to_datetime(df["date"])
# The question asks for the average, so aggregate with mean() rather than sum()
new_df = df.groupby(df["date"].dt.month)[["x"]].mean()
print(new_df)
Output:
x
date
7 7.350185
9 6.544224
11 8.125012
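In the question the dates are the index rather than a column. A minimal sketch for that layout, using made-up round values in place of the real data, groups on the month of the DatetimeIndex:

```python
import pandas as pd

# Made-up values; the index mimics the month-end dates in the question
data = pd.Series(
    [8.0, 6.0, 4.0, 2.0],
    index=pd.to_datetime(["2002-07-31", "2003-07-31",
                          "2002-09-30", "2003-09-30"]),
    name="x",
)
data.index.name = "date"

# Average all values that share the same calendar month, across years
monthly_mean = data.groupby(data.index.month).mean()
print(monthly_mean)
```

The result is indexed by month number (1 to 12), with one averaged value per month present in the data.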
I have 2 columns:
dt_year, dt_month
2014 3
I need a date column.
I tried something like:
pd.to_datetime((df.dt_year + df.dt_month +1).apply(str),format='%Y%m%d')
But I get an error:
ValueError: time data '2014' does not match format '%Y%m%d' (match)
Any ideas?
First, change the column names to the names pd.to_datetime expects ('year' and 'month'). Then add a 'day' column.
df.columns = df.columns.str.replace('dt_', '')
df['day'] = 1
df
year month day
0 2014 3 1
Then the magic happens
pd.to_datetime(df)
0 2014-03-01
dtype: datetime64[ns]
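If you would rather keep the original column names, the same idea can be applied to a temporary copy. This is a sketch assuming the dt_year and dt_month columns from the question; the second row and the parts variable are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"dt_year": [2014, 2015], "dt_month": [3, 11]})

# Build a temporary frame with the year/month/day names pd.to_datetime expects
parts = (df.rename(columns={"dt_year": "year", "dt_month": "month"})
           .assign(day=1))  # assume the first day of each month

df["date"] = pd.to_datetime(parts)
print(df)
```

The original dt_year and dt_month columns stay untouched, and only the new date column is added.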
My pandas dataframe has year, month and day in the first 3 columns. To convert them into a datetime type, I use a for loop that goes over each row and passes the content of the first 3 columns to the datetime function. Is there any way I can avoid the for loop here and get the dates as a datetime?
I'm not sure there's a vectorized hook, but you can use apply, anyhow:
>>> df = pd.DataFrame({"year": [1992, 2003, 2014], "month": [2,3,4], "day": [10,20,30]})
>>> df
day month year
0 10 2 1992
1 20 3 2003
2 30 4 2014
>>> from datetime import datetime  # pd.datetime was removed in pandas 2.0
>>> df["Date"] = df.apply(lambda x: datetime(x['year'], x['month'], x['day']), axis=1)
>>> df
day month year Date
0 10 2 1992 1992-02-10 00:00:00
1 20 3 2003 2003-03-20 00:00:00
2 30 4 2014 2014-04-30 00:00:00
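For comparison, pd.to_datetime can also assemble dates from a DataFrame whose columns are named year, month and day, which avoids the row-wise apply. A minimal sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"year": [1992, 2003, 2014],
                   "month": [2, 3, 4],
                   "day": [10, 20, 30]})

# Columns named year/month/day are combined into one datetime column
df["Date"] = pd.to_datetime(df[["year", "month", "day"]])
print(df)
```

This works on whole columns at once, so it tends to be faster than apply on large frames.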