I have a pandas DataFrame with a date column. It is not an index.
I want to make a pivot_table on the dataframe using counting aggregate per month for each location.
The data look like this:
['INDEX'] DATE LOCATION COUNT
0 2009-01-02 00:00:00 AAH 1
1 2009-01-03 00:00:00 ABH 1
2 2009-01-03 00:00:00 AAH 1
3 2009-01-03 00:00:00 ABH 1
4 2009-01-04 00:00:00 ACH 1
I used:
pivot_table(cdiff, values='COUNT', rows=['DATE','LOCATION'], aggfunc=np.sum)
to pivot the values. I need a way to convert cdiff.DATE to a month rather than a date.
I hope to end up with something like:
The data look like this:
MONTH LOCATION COUNT
January AAH 2
January ABH 2
January ACH 1
I tried all manner of strftime methods on cdiff.DATE with no success. It wants to apply the to strings, not series object.
I would suggest:
months = cdiff.DATE.map(lambda x: x.month)
pivot_table(cdiff, values='COUNT', rows=[months, 'LOCATION'],
aggfunc=np.sum)
To get a month name, pass a different function or use the built-in calendar.month_name. To get the data in the format you want, you should call reset_index on the result, or you could also do:
cdiff.groupby([months, 'LOCATION'], as_index=False).sum()
Related
I have a table like below containing values for multiple IDs:
ID
value
date
1
20
2022-01-01 12:20
2
25
2022-01-04 18:20
1
10
2022-01-04 11:20
1
150
2022-01-06 16:20
2
200
2022-01-08 13:20
3
40
2022-01-04 21:20
1
75
2022-01-09 08:20
I would like to calculate week wise sum of values for all IDs:
The start date is given (for example, 01-01-2022).
Weeks are calculated based on range:
every Saturday 00:00 to next Friday 23:59 (i.e. Week 1 is from 01-01-2022 00:00 to 07-01-2022 23:59)
ID
Week 1 sum
Week 2 sum
Week 3 sum
...
1
180
75
--
--
2
25
200
--
--
3
40
--
--
--
There's a pandas function (pd.Grouper) that allows you to specify a groupby instruction.1 In this case, that specification is to "resample" date by a weekly frequency that starts on Fridays.2 Since you also need to group by ID as well, add it to the grouper.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# pivot the dataframe
df1 = (
df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
.unstack(fill_value=0)
)
# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1]+1)]
df1 = df1.reset_index()
1 What you actually need is a pivot_table result but groupby + unstack is equivalent to pivot_table and groupby + unstack is more convenient here.
2 Because Jan 1, 2022 is a Saturday, you need to specify the anchor on Friday.
You can compute a week column. In case you've data for same year, you can extract just week number, which is less likely in real-time scenarios. In case you've data from multiple years, it might be wise to derive a combination of Year & week number.
df['Year-Week'] = df['Date'].dt.strftime('%Y-%U')
In your case the dates 2022-01-01 & 2022-01-04 18:2 should be convert to 2022-01 as per the scenario you considered.
To calculate your pivot table, you can use the pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['year_weeknumber'], aggfunc=np.sum)
Let's define a formatting helper.
def fmt(row):
return f"{row.year}-{row.week:02d}" # We ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
id value date iso
0 1 20 2022-01-01 12:20:00 2021-52
1 2 25 2022-01-04 18:20:00 2022-01
Just groupby
the ISO week.
I want to aggregate daily data to weekly (7-day sum) but with the last date as the 'origin'. Is it possible to do a group by from the end date using pd.Grouper? This is how the data looks like:
This code:
df.groupby(pd.Grouper(key='date', freq='7d'))['value'].sum()
results to
2020-01-01 5
2020-01-08 12
2020-01-15 4
but I was hoping for this:
2020-01-01 0
2020-01-03 7
2020-01-10 14
the method you have used can be shortened using resample method of pandas on df
but i think you problem is the order your dates are;
the result you expect is more day wise output;
hence what i will recommend is splitting the df and then again merging them
df.set_index(['date'],inplace=True)
df_below = df[3:].resample('W').sum()
df_up = df.iloc[0:3,:].sum()
# or you can give dates instead of 0:3 in iloc
the rows [0,1,2] you can take sum of those n then using hstack or concat or merge again make them one DataFrame
feel free for asking further queries....
I have a dataframe that contains a column with dates e.g. 24/07/15 etc
Is there a way to create a new column into the dataframe that displays all the days of the week corresponding to the already existing 'Date' column?
I want the output to appear as:
[Date][DayOfTheWeek]
This might work:
If you want day name:
In [1405]: df
Out[1405]:
dates
0 24/07/15
1 25/07/15
2 26/07/15
In [1406]: df['dates'] = pd.to_datetime(df['dates']) # You don't need to specify the format also.
In [1408]: df['dow'] = df['dates'].dt.day_name()
In [1409]: df
Out[1409]:
dates dow
0 2015-07-24 Friday
1 2015-07-25 Saturday
2 2015-07-26 Sunday
If you want day number:
In [1410]: df['dow'] = df['dates'].dt.day
In [1411]: df
Out[1411]:
dates dow
0 2015-07-24 24
1 2015-07-25 25
2 2015-07-26 26
I would try the apply function, so something like this:
def extractDayOfWeek(dateString):
...
df['DayOfWeek'] = df.apply(lambda x: extractDayOfWeek(x['Date'], axis=1)
The idea is that, you map over every row, extract the 'date' column, and then apply your own function to create a new row entry named 'Day'
Depending of the type of you column Date.
df['Date']=pd.to_datetime(df['Date'], format="d/%m/%y")
df['weekday'] = df['Date'].dt.dayofweek
I have a dataframe (df) with a column 'Date of birth' column:
Date of birth
0 1957-04-30 00:00:00
1 1966-11-10 00:00:00
2 1966-11-10 00:00:00
3 1962-03-28 00:00:00
4 1958-10-28 00:00:00
5 1958-06-04 00:00:00
How can I reformat the column to a date only format? After I reformat I'm going to work out age from a specific date:
Date of birth
0 1957-04-30
1 1966-11-10
2 1966-11-10
3 1962-03-28
4 1958-10-28
5 1958-06-04
I have tried using
df["Date of birth"] = pd.to_datetime(df['Date of birth'], format='%d%b%Y')
df["Date of birth"] = df["Date of birth"].dt.strftime('%m/%d/%Y')
but with no joy.
After the column becomes a date, use date accessor to access it.
df["Date of birth"] = pd.to_datetime(df['Date of birth']).dt.date
here is a question about the data from pandas. What I am looking is to fetch two column from a csv file, and manipulate these data before finally saving them.
The csv file looks like :
year month
2007 1
2007 2
2007 3
2007 4
2008 1
2008 3
this is my current code:
records = pd.read_csv(path)
frame = pd.DataFrame(records)
combined = datetime(frame['year'].astype(int), frame['month'].astype(int), 1)
The error is :
TypeError: cannot convert the series to "<type 'int'>"
any thoughts?
datetime won't operate on a pandas Series (column of a dataframe). You can use to_datetime or you could use datetime within apply. Something like the following should work:
In [9]: df
Out[9]:
year month
0 2007 1
1 2007 2
2 2007 3
3 2007 4
4 2008 1
5 2008 3
In [10]: pd.to_datetime(df['year'].astype(str) + '-'
+ df['month'].astype(str)
+ '-1')
Out[10]:
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
dtype: datetime64[ns]
Or use apply:
In [11]: df.apply(lambda x: datetime(x['year'],x['month'],1),axis=1)
Out[11]:
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
dtype: datetime64[ns]
Another Edit: You can also do most of the date parsing with read_csv but then you need to adjust the day after you read it in (note, my data is in a string named 'data'):
In [12]: df = pd.read_csv(StringIO(data),header=True,
parse_dates={'date':['year','month']})
In [13]: df['date'] = df['date'].values.astype('datetime64[M]')
In [14]: df
Out[14]:
date
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
Had similar issue the answer is assuming that you have the Year, Month and Day in columns of your DataFrame:
df['Date'] = df[['Year', 'Month', 'Day']].apply(lambda s : datetime.datetime(*s),axis = 1)
first part selects the columns with the Year, Month and Date form the Dateframe, second bit applies the datetime function element-wise on the data.
if you do not gave the day in your data asit looks like form your data, just do:
df['Day'] = 1
to place the day there as well. should be way to do that in code, but will be quick workaround. Can always drop the Day column afterward if you dont want it.