I'm trying to create biweekly periods from a pandas DataFrame. For instance, like this:
import numpy as np
import pandas as pd

date_range = pd.date_range("2022-04-01", "2022-04-30", freq="B")
test_data = pd.DataFrame(np.arange(len(date_range)), index=date_range)
I'd like to have a PeriodIndex with a two-week length. I assumed the pandas way to do it is the following:
test_data.resample("2W", kind="period").last()
However, the labels I'm getting are:
0
2022-03-28/2022-04-03 5
2022-04-11/2022-04-17 15
2022-04-25/2022-05-01 20
I'd expect to see something like this:
0
2022-03-21/2022-04-03 0
2022-04-04/2022-04-17 10
2022-04-18/2022-05-01 20
Another interesting point: changing to kind="timestamp" produces exactly the values I'd like to see at the end.
0
2022-04-03 0
2022-04-17 10
2022-05-01 20
Is there any native way to get a biweekly index from pandas, or is it better to do it manually?
You can try pandas.Grouper:
df = test_data.groupby(pd.Grouper(freq='2W')).last()
print(df)
0
2022-04-03 0
2022-04-17 10
2022-05-01 20
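If controlling where the two-week bins start is the goal, another option is to resample on a fixed 14-day frequency with an explicit origin. This is a sketch, assuming pandas >= 1.1 (when the origin argument was added); the labels are plain timestamps (the left bin edges), not periods:

import numpy as np
import pandas as pd

date_range = pd.date_range("2022-04-01", "2022-04-30", freq="B")
test_data = pd.DataFrame(np.arange(len(date_range)), index=date_range)

# Fixed 14-day bins anchored at Monday 2022-03-21 (needs pandas >= 1.1 for origin).
# Labels are the left bin edges: 2022-03-21, 2022-04-04, 2022-04-18,
# i.e. the start dates of the two-week spans in the expected output above.
result = test_data.resample("14D", origin=pd.Timestamp("2022-03-21")).last()
print(result)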
Bear with me as I'm self-learning.
Basically, I have this raw data with a date, an SLT Percent (which is a computed value), and a state.
What I want is to group the rows by year-month, count how many Made and Missed there are for each month as columns, and compute the mean of SLT Percent in a third column.
I've been trying Grouper, groupby, and unstack, and taking the mean on a groupby, but I always get incorrect data. I can do this easily with an Excel pivot table, but I'm having a hard time recreating it with a pandas DataFrame.
Raw Data:
ID   SLT Date     SLT Percent   SLT State
1    5/28/2018    1             Made
2    11/13/2018   0             Missed
11   3/6/2019     0             Missed
12   5/20/2019    1             Made
13   10/25/2021   1             Made
14   11/12/2019   1             Made
18   6/4/2020     1             Made
19   6/11/2020    1             Made
20   8/6/2020     1             Made
21   12/9/2021    0             Missed
22   5/16/2022    1             Made
23   3/22/2018    0             Missed
24   3/20/2018    0             Missed
25   5/11/2018    1             Made
26   12/20/2018   0             Missed
27   5/12/2022    1             Made
28   10/7/2021    1             Made
29   3/21/2019    1             Made
30   4/24/2019    0             Missed
Output Table:
Date     Made   Missed   Percent
2020-5   10     2        80%
2020-6   25     15       60%
2020-7   50     23       23%
The answer below is intentionally verbose for understandability.
Change the column names and the Percent formula (Made vs. Missed) accordingly.
import pandas as pd

df = pd.DataFrame({'date': ['2013-01-01', '2013-01-05', '2014-06-01', '2015-11-18', '2015-12-21'],
                   'state': ['Made', 'Missed', 'Missed', 'Made', 'Missed']})

df['date'] = pd.to_datetime(df['date'])  # Change date column to datetime
df['Year_Month'] = df['date'].dt.year.astype('str') + '_' + df['date'].dt.month.astype('str')  # Create Year_Month column for grouping
df = pd.pivot_table(df, index='Year_Month', columns='state', values='state', aggfunc='count').fillna(0).reset_index()  # Pivot table
df.rename_axis(None, axis=1, inplace=True)  # Remove index name
df[['Year', 'Month']] = df['Year_Month'].str.split('_', expand=True)  # Create Year & Month columns
df.drop('Year_Month', axis=1, inplace=True)  # Drop unnecessary column
df['Percent'] = df['Made'] / (df['Made'] + df['Missed'])  # Calculate the Made vs. Missed percentage
df[['Year', 'Month', 'Made', 'Missed', 'Percent']]  # Show columns in the desired order
Output
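A more compact alternative, as a sketch: assuming the original data sits in a DataFrame with the question's column names (SLT Date, SLT Percent, SLT State), pd.crosstab can count the states per month and a groupby mean gives the Percent column.

import pandas as pd

# Small sample standing in for the question's raw data
raw = pd.DataFrame({
    'SLT Date': ['5/28/2018', '11/13/2018', '3/6/2019', '5/20/2019'],
    'SLT Percent': [1, 0, 0, 1],
    'SLT State': ['Made', 'Missed', 'Missed', 'Made'],
})

raw['SLT Date'] = pd.to_datetime(raw['SLT Date'])
month = raw['SLT Date'].dt.to_period('M')                   # e.g. 2018-05

out = pd.crosstab(month, raw['SLT State'])                  # Made / Missed counts per month
out['Percent'] = raw.groupby(month)['SLT Percent'].mean()   # mean SLT Percent per month
print(out)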
I have a set of data with several different columns, containing daily data going back several years. The variable is the same for each column. I've calculated the daily, monthly, and yearly statistics for each column, and now want to do the same with all columns combined, to get one statistic for each day, month, and year rather than the several separate ones I calculated before.
I've been using Pandas group by so far, using something like this:
sum_daily_files = daily_files.groupby(daily_files.Date.dt.day).sum()
sum_monthly_files = daily_files.groupby(daily_files.Date.dt.month).sum()
sum_yearly_files = daily_files.groupby(daily_files.Date.dt.year).sum()
Any suggestions on how I might go about using Pandas - or any other package - to combine the statistics together? Thanks so much!
edit
Here's a snippet of my dataframe:
Date site1 site2 site3 site4 site5 site6
2010-01-01 00:00:00 2 0 1 1 0 1
2010-01-02 00:00:00 7 5 1 3 1 1
2010-01-03 00:00:00 3 3 2 2 2 1
2010-01-04 00:00:00 0 0 0 0 0 0
2010-01-05 00:00:00 0 0 0 0 0 1
I just had to type it in because I was having trouble copying it over, so my apologies. Basically, it's six different sites from 2010 to 2019 that detail how much snow (in inches) each site received on each day.
(Your problem needs to be clarified.)
Is this what you want?
all_sum_daily_files = sum_daily_files.sum(axis=1) # or daily_files.sum(axis=1)
all_sum_monthly_files = sum_monthly_files.sum(axis=1)
all_sum_yearly_files = sum_yearly_files.sum(axis=1)
If your data is already daily, there is no need to compute a daily sum first; you can use daily_files.sum(axis=1) directly.
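If the goal is one combined statistic per day, month, and year across all sites, here is a minimal sketch, assuming the Date column and site columns shown in the snippet above:

import pandas as pd

# Assumes daily_files has a 'Date' column plus one column per site, as in the snippet
daily_files['Date'] = pd.to_datetime(daily_files['Date'])
combined = daily_files.set_index('Date').sum(axis=1)           # total across all sites per day

monthly_total = combined.groupby(combined.index.month).sum()   # one value per calendar month
yearly_total = combined.groupby(combined.index.year).sum()     # one value per year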
I have the following df:
Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10
I would like to group every 3 weeks and sum up sales. I want to start with the bottom 3 weeks. If there are fewer than 3 weeks left at the top, like in this example, those weeks should be ignored. The desired output is this:
Week Sales
5-3 50
8-6 35
I tried this on my original df: df.reset_index(drop=True).groupby(by=lambda x: x/N, axis=0).sum()
but this solution does not start from the bottom rows.
Can anyone point me into the right direction here? Thanks!
You can try reversing the data with .iloc[::-1]:
import numpy as np

N = 3
(df.iloc[::-1].groupby(np.arange(len(df)) // N)
   .agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
         'Sales': 'sum'})
)
Output:
Week Sales
0 8-6 35
1 5-3 50
2 2-1 25
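The question asks for the incomplete group at the top (weeks 1-2 here) to be ignored. One way, sketched on top of the code above, is to truncate the reversed frame to a multiple of N before grouping:

import numpy as np

N = 3
rev = df.iloc[::-1]
rev = rev.iloc[:len(rev) // N * N]   # drop leftover rows that don't fill a full 3-week group
out = (rev.groupby(np.arange(len(rev)) // N)
          .agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
                'Sales': 'sum'}))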
When dealing with period aggregation, I usually use .resample, as it is flexible for binning data into different time periods:
import io
from datetime import timedelta
import pandas as pd
dataf = pd.read_csv(io.StringIO("""Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10"""), sep='\s+',).astype(int)
# reverse data and transform int weeks to actual date time
dataf = dataf.iloc[::-1]
dataf['Week'] = dataf['Week'].map(lambda x: timedelta(weeks=x))
# set date object to index for resampling
dataf = dataf.set_index('Week')
# now we resample
dataf.resample('21d').sum() # 21days
Note: the label is misleading, and setting kind='period' raises an error.
So I've got a pandas DataFrame that contains two water-use values at a one-second resolution. The values are "hotIn" and "hotOut". The hotIn can record down to a tenth of a gallon at a one-second resolution, while the hotOut records whole-number pulses representing a gallon, i.e. when a pulse occurs, one gallon has passed through the meter. The pulses occur roughly every 14-15 seconds.
Data looks roughly like this:
Index hotIn(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:00 4 0
2019-03-23T00:00:01 5 0
2019-03-23T00:00:02 4 0
2019-03-23T00:00:03 4 0
2019-03-23T00:00:04 3 0
2019-03-23T00:00:05 4 1
2019-03-23T00:00:06 4 0
2019-03-23T00:00:07 5 0
2019-03-23T00:00:08 3 0
2019-03-23T00:00:09 3 0
2019-03-23T00:00:10 4 0
2019-03-23T00:00:11 4 0
2019-03-23T00:00:12 5 0
2019-03-23T00:00:13 5 1
What I'm trying to do is resample or reindex the data frame based on the occurrence of pulses and sum the hotIn between the new timestamps.
For example, sum the hotIn between 00:00:00 - 00:00:05 and 00:00:06 - 00:00:13.
Results would ideally look like this:
Index hotIn sum(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 32 1
I've explored using a two-step for/elif loop that just checks whether hotOut == 1; it works, but it's painfully slow on large datasets. I'm positive the timestamp functionality of pandas will be superior, if this is possible.
I also can't simply resample at a set frequency, because the interval between pulses changes, so a general resample rule would not work. I've also run into problems with mismatched DataFrame lengths when pulling out the timestamps associated with pulses and applying them to the main frame as a new index.
IIUC, you can do:
s = df['hotOut(pulse=1gal)'].shift().ne(0).cumsum()
(df.groupby(s)
.agg({'Index':'last', 'hotIn(gpm)':'sum'})
.reset_index(drop=True)
)
Output:
Index hotIn(gpm)
0 2019-03-23T00:00:05 24
1 2019-03-23T00:00:13 33
You don't want to group on the Index. You want to group whenever 'hotOut(pulse=1gal)' changes.
s = df['hotOut(pulse=1gal)'].cumsum().shift().bfill()
(df.reset_index()
.groupby(s, as_index=False)
.agg({'Index': 'last', 'hotIn(gpm)': 'sum', 'hotOut(pulse=1gal)': 'last'})
.set_index('Index'))
hotIn(gpm) hotOut(pulse=1gal)
Index
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 33 1
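To see why the shifted cumulative sum works as a grouping key, here is a small sketch using only the pulse column from the example data; it prints the key next to the pulses, and each group ends on, and includes, a pulse row:

import pandas as pd

pulse = pd.Series([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
                  name='hotOut(pulse=1gal)')

# Running pulse count, shifted down one row so that the pulse row itself
# stays in the group it closes; the leading NaN is back-filled to 0.
key = pulse.cumsum().shift().bfill()
print(pd.concat([pulse, key.rename('group')], axis=1))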
I was just curious as to what's going on here. I have 13 dataframes that look something like this:
df1:
time val
00:00 1
00:01 2
00:02 5
00:03 8
df2:
time val
00:04 5
00:05 12
00:06 4
df3:
time val
00:07 8
00:08 24
00:09 3
and so on. As you can see, each dataframe continues the time exactly where the previous one left off, which means ideally I would like them in one dataframe for simplicity's sake. Note that the example ones I used are significantly smaller than my actual ones. However, upon using the following:
df = pd.concat([pd.read_csv(i, usecols=[0,1,2]) for i in sample_files])
where the 13 dataframes are produced through that list comprehension, I get a very strange result. It's as if I had set axis=1 inside the pd.concat() function. If I try to reference a column, say val:
df['val']
Pandas returns something that looks like this:
0 1
1 2
...
2 5
3 8
Name: val, Length: 4, dtype: float64
In this output it does not specify what happened to the other 11 val columns. If I then reference an index, as follows:
df['val'][0]
It returns:
0 1
0 5
0 8
Name: val, dtype: float64
which is the first index of each column. I am unsure as to why pandas is behaving like this, as I would imagine it just joins together columns with similar header names, but obviously this isn't the case.
If someone could explain this, that would be great.
I believe your issue is that you need to reset the index after concatenation, before selecting the data.
Try:
df = pd.concat([pd.read_csv(i, usecols=[0,1,2]) for i in sample_files])
df = df.reset_index(drop=True)
df['val'][0]
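Alternatively, pd.concat can build a fresh RangeIndex directly; a sketch using the same sample_files list:

import pandas as pd

# ignore_index=True discards each file's own 0..n index and assigns a new one
df = pd.concat([pd.read_csv(i, usecols=[0, 1, 2]) for i in sample_files],
               ignore_index=True)
print(df['val'][0])   # now a single scalar: the first row's val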