I have a fairly complicated question. I need to select rows from a data frame within a certain set of start and end dates, and then sum those values and put them in a new dataframe.
So I start off with a data frame, df:
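import random
import pandas as pd
dates = pd.date_range('20150101 020000',periods=1000)
# note: random.choice returns a single number, so every row gets the same _id and value
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2,60))
                   })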
and define some start and end dates:
import pandas as pd
start_date = ["2-13-16", "2-23-16", "3-17-16", "3-24-16", "3-26-16", "5-17-16", "5-25-16", "10-10-16", "10-18-16", "10-23-16", "10-31-16", "11-7-16", "11-14-16", "11-22-16", "1-23-17", "1-29-17", "2-06-17", "3-11-17", "3-23-17", "6-21-17", "6-28-17"]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)
Then I need to create a new data frame with a weekly_sum column, which sums the values in the value column of df that fall between each start_date and end_date.
So for example, the first row of the new data frame would return the sum of the values between 2-13-16 and 2-20-16. I imagine I'd use groupby.sum() or something similar.
It might look like this:
id start_date end_date weekly_sum
65 2016-02-13 2016-02-20 100
Any direction is greatly appreciated!
P.S. I know my use of random.choice is a little wonky so if you have a better way of generating random numbers, I'd love to see it!
You can use
import numpy as np
import pandas as pd
def get_dates(x):
    # Select the df values between start and ending datetime.
    n = df[(df['time_stamp']>x['start'])&(df['time_stamp']<x['end'])]
    # Return first id and sum of values
    return n['id'].values[0],n['value'].sum()
dates = pd.date_range('20150101 020000',periods=1000)
df = pd.DataFrame({'id': np.random.randint(0,1000,size=(1000,)),
                   'time_stamp': dates,
                   'value': np.random.randint(2,60,size=(1000,))
                   })
ndf = pd.DataFrame({'start':pd.to_datetime(start_date),'end':end_date})
#Unpack and assign values to id and value column
ndf[['id','value']] = ndf.apply(lambda x : get_dates(x),1).apply(pd.Series)
print(df.head(5))
id time_stamp value
0 770 2015-01-01 02:00:00 59
1 781 2015-01-02 02:00:00 32
2 761 2015-01-03 02:00:00 40
3 317 2015-01-04 02:00:00 16
4 538 2015-01-05 02:00:00 20
print(ndf.head(5))
end start id value
0 2016-02-20 2016-02-13 569 221
1 2016-03-01 2016-02-23 28 216
2 2016-03-24 2016-03-17 152 258
3 2016-03-31 2016-03-24 892 265
4 2016-04-02 2016-03-26 606 244
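The row-by-row lookup above can also be wrapped in a small helper that returns a table in the shape the question asks for. This is only a sketch: window_sums and its column names are made up for illustration, the toy frame has a value of 1 on every day so the sums are easy to check by hand, and each window is treated as including its start and excluding its end, which differs slightly from the strict comparisons used above.

```python
import pandas as pd

def window_sums(df, starts, days=7):
    """Hypothetical helper: sum df['value'] over each window [start, start + days)."""
    starts = pd.to_datetime(list(starts))
    ends = starts + pd.Timedelta(days=days)
    rows = []
    for start, end in zip(starts, ends):
        # Boolean mask for the rows that fall inside this window
        in_window = (df["time_stamp"] >= start) & (df["time_stamp"] < end)
        rows.append({"start_date": start,
                     "end_date": end,
                     "weekly_sum": df.loc[in_window, "value"].sum()})
    return pd.DataFrame(rows)

# Toy data: one row per day at 02:00, value 1 on every row
toy = pd.DataFrame({"time_stamp": pd.date_range("2016-02-10 02:00", periods=20),
                    "value": 1})
result = window_sums(toy, ["2016-02-13", "2016-02-27"])
print(result)
```

Because each window is evaluated on its own, overlapping windows (such as the 3-24-16 and 3-26-16 starts in the question) are not a problem.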
You can calculate a weekly summary with the following code. Weeks start on Monday.
import numpy as np
import pandas as pd
import random
dates = pd.date_range('20150101 020000',periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2,60))
                   })
# weekday_name was removed in pandas 1.0; day_name() is its replacement
df['day_of_week'] = df['time_stamp'].dt.day_name()
df['start'] = np.where(df["day_of_week"]=="Monday", 1, 0)
df['week'] = df["start"].cumsum()
# Each Monday starts a new week number.
df.head(20)
# Out[109]:
# _id time_stamp value day_of_week start week
# 0 396 2015-01-01 02:00:00 59 Thursday 0 0
# 1 396 2015-01-02 02:00:00 59 Friday 0 0
# 2 396 2015-01-03 02:00:00 59 Saturday 0 0
# 3 396 2015-01-04 02:00:00 59 Sunday 0 0
# 4 396 2015-01-05 02:00:00 59 Monday 1 1
# 5 396 2015-01-06 02:00:00 59 Tuesday 0 1
# 6 396 2015-01-07 02:00:00 59 Wednesday 0 1
# 7 396 2015-01-08 02:00:00 59 Thursday 0 1
# 8 396 2015-01-09 02:00:00 59 Friday 0 1
# 9 396 2015-01-10 02:00:00 59 Saturday 0 1
# 10 396 2015-01-11 02:00:00 59 Sunday 0 1
# 11 396 2015-01-12 02:00:00 59 Monday 1 2
# 12 396 2015-01-13 02:00:00 59 Tuesday 0 2
# 13 396 2015-01-14 02:00:00 59 Wednesday 0 2
# 14 396 2015-01-15 02:00:00 59 Thursday 0 2
# 15 396 2015-01-16 02:00:00 59 Friday 0 2
# 16 396 2015-01-17 02:00:00 59 Saturday 0 2
# 17 396 2015-01-18 02:00:00 59 Sunday 0 2
# 18 396 2015-01-19 02:00:00 59 Monday 1 3
# 19 396 2015-01-20 02:00:00 59 Tuesday 0 3
aggfunc = {"time_stamp": ["min", "max"], "value": ["sum"]}
df2 = df.groupby("week", as_index=False).agg(aggfunc)
df2.columns = ["week", "start_date", "end_date", "weekly_sum"]
df2.iloc[58:61]
# Out[110]:
# week start_date end_date weekly_sum
# 58 58 2016-02-08 02:00:00 2016-02-14 02:00:00 413
# 59 59 2016-02-15 02:00:00 2016-02-21 02:00:00 413
# 60 60 2016-02-22 02:00:00 2016-02-28 02:00:00 413
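The Monday-based week numbering above can probably also be expressed with resample. The following is a minimal sketch, not the answer's own method: it assumes a time_stamp column like the one above, uses a toy frame with a value of 1 per day, and the output names week_start and weekly_sum are made up here.

```python
import pandas as pd

# Toy data: 14 daily rows starting on Thursday 2015-01-01, value 1 on every row
toy = pd.DataFrame({"time_stamp": pd.date_range("2015-01-01 02:00", periods=14),
                    "value": 1})

# "W-MON" bins are anchored on Mondays; closed/label="left" makes each bin
# run from a Monday (inclusive) to the next Monday (exclusive), labelled by its start.
weekly = (
    toy.resample("W-MON", on="time_stamp", closed="left", label="left")["value"]
    .sum()
    .rename("weekly_sum")
    .reset_index()
    .rename(columns={"time_stamp": "week_start"})
)
print(weekly)
```

Unlike the cumsum approach, the first partial week is labelled with the Monday before the data starts rather than with week 0.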
I have a dataframe covering 3 years of data, with two columns of interest: remaining useful life and predicted remaining useful life.
I am aggregating rul and pred_rul over the 3 years for each machineID, keeping the maximum date each machine has. The original dataframe looks like this:
rul pred_diff machineID datetime
10476749 870 312.207825 408 2021-05-25 00:00:00
11452943 68 288.517578 447 2023-03-01 12:00:00
12693829 381 273.159698 493 2021-09-16 16:00:00
3413787 331 291.326416 133 2022-10-26 12:00:00
464093 77 341.506195 19 2023-10-10 16:00:00
... ... ... ... ...
11677555 537 310.586090 456 2022-04-07 00:00:00
2334804 551 289.307129 92 2021-09-04 20:00:00
5508311 35 293.721771 214 2023-01-06 04:00:00
12319704 348 322.199219 479 2021-11-11 20:00:00
4777501 87 278.089417 186 2021-06-29 12:00:00
1287421 rows × 4 columns
I aggregate it with this code:
y_test_grp = y_test.groupby('machineID').agg({'datetime':'max', 'rul':'mean', 'pred_diff':'mean'})[['datetime','rul', 'pred_diff']].reset_index()
which gives the following output:
machineID datetime rul pred_diff
0 1 2023-10-03 20:00:00 286.817681 266.419401
1 2 2023-11-14 00:00:00 225.561953 263.372531
2 3 2023-10-25 00:00:00 304.736237 256.933351
3 4 2023-01-13 12:00:00 204.084899 252.476066
4 5 2023-09-07 00:00:00 208.702431 252.487156
... ... ... ... ...
495 496 2023-10-11 00:00:00 302.445285 298.836798
496 497 2023-08-26 04:00:00 281.601613 263.479885
497 498 2023-11-28 04:00:00 292.593906 263.985034
498 499 2023-06-29 20:00:00 260.887529 263.494844
499 500 2023-11-08 20:00:00 160.223614 257.326034
500 rows × 4 columns
Since this is grouped on machineID alone, it gives just 500 rows, which is too few. I want to aggregate rul and pred_rul on a weekly basis, so that for each machineID I get 52 weeks * 3 years = 156 rows. I can't work out which function to use to take 7-day intervals and aggregate rul and pred_rul over them.
You can use Grouper:
y_test.groupby(['machineID', pd.Grouper(key='datetime', freq='7D')]).mean()
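To see what the Grouper call produces, here is a small sketch on made-up data; the four toy rows are assumptions for illustration, only the column names come from the question. Note that 7-day bins are anchored at the first day found in the data rather than on calendar weeks, and that a machine gets no row for a bin in which it has no data.

```python
import pandas as pd

# Toy stand-in for y_test
toy = pd.DataFrame({
    "machineID": [1, 1, 1, 2],
    "datetime": pd.to_datetime(["2021-01-01 00:00", "2021-01-03 12:00",
                                "2021-01-09 00:00", "2021-01-02 04:00"]),
    "rul": [10, 20, 30, 5],
    "pred_diff": [1.0, 3.0, 5.0, 7.0],
})

# One row per machine and 7-day bin, averaging both columns
weekly = (
    toy.groupby(["machineID", pd.Grouper(key="datetime", freq="7D")])[["rul", "pred_diff"]]
    .mean()
    .reset_index()
)
print(weekly)
```

If calendar weeks are wanted instead, a weekly alias such as freq='W' could be tried in place of '7D'.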
Our electricity provider seems to think it is fun to make the CSV files they provide difficult to read.
The file holds precise electricity consumption, every 30 min, but the dates and the times are in the SAME column. Example:
[EDIT: here is the raw version of the CSV file, my bad]
;
"Récapitulatif de mes puissances atteintes en W";
;
"Date et heure de relève par le distributeur";"Puissance atteinte (W)"
;
"19/11/2022";
"00:00:00";4494
"23:30:00";1174
"23:00:00";1130
[...]
"01:30:00";216
"01:00:00";2672
"00:30:00";2816
;
"18/11/2022";
"00:00:00";4494
"23:30:00";1174
"23:00:00";1130
[...]
"01:30:00";216
"01:00:00";2672
"00:30:00";2816
How on earth can I obtain a nicely formatted file like this:
2022-11-19 00:00:00 2098
2022-11-19 23:30:00 218
2022-11-19 23:00:00 606
etc.
Okay, I have an idiotic brute-force solution for you, so don't take it as a coding recommendation, just as something that gets the job done:
import itertools
# every dd/mm/2022 string, zero-padded so it matches the dates in the file
dList = [f"{d:02d}/{m:02d}/2022" for d, m in itertools.product(range(1, 32), range(1, 13))]
I assume you have a text file with that content, so I'm just going to use that:
file = 'yourfilename.txt'
# make sure you're running the program in the same directory as the .txt file
with open(file, "r") as f:
    # drop newlines and the quotes around each field
    lines = [line.strip().replace('"', '') for line in f]
curD = None
with open('output.txt', 'w') as g:
    for i in lines:
        if i.rstrip(';') in dList:
            # a date line: remember it for the rows that follow
            curD = i.rstrip(';')
        elif curD and ';' in i and ':' in i:
            # a "time;value" line: write it out with the current date
            t, v = i.split(';')[:2]
            g.write(f'{curD} {t} {v}\n')
Everything is written to a file called output.txt in the same directory (it is created if it does not exist).
Try:
import pandas as pd
current_date = None
all_data = []
with open("your_file.txt", "r") as f_in:
    # skip first 5 rows (header)
    for _ in range(5):
        next(f_in)
    for row in map(str.strip, f_in):
        # drop the quotes and any leading/trailing separator
        row = row.replace('"', "").strip(";")
        if row == "":
            continue
        if "/" in row:
            current_date = row
        else:
            all_data.append([current_date, *row.split(";")])
df = pd.DataFrame(all_data, columns=["Date", "Time", "Value"])
print(df)
Prints:
Date Time Value
0 19/11/2022 00:00:00 4494
1 19/11/2022 23:30:00 1174
2 19/11/2022 23:00:00 1130
3 19/11/2022 01:30:00 216
4 19/11/2022 01:00:00 2672
5 19/11/2022 00:30:00 2816
6 18/11/2022 00:00:00 4494
7 18/11/2022 23:30:00 1174
8 18/11/2022 23:00:00 1130
9 18/11/2022 01:30:00 216
10 18/11/2022 01:00:00 2672
11 18/11/2022 00:30:00 2816
Using pandas operations would be like the following:
data.csv
19/11/2022
00:00:00 2098
23:30:00 218
23:00:00 606
01:30:00 216
01:00:00 2672
00:30:00 2816
18/11/2022
00:00:00 1994
23:30:00 260
23:00:00 732
01:30:00 200
01:00:00 1378
00:30:00 2520
17/11/2022
00:00:00 1830
23:30:00 96
23:00:00 122
01:30:00 694
01:00:00 2950
00:30:00 3062
16/11/2022
00:00:00 2420
23:30:00 678
23:00:00 644
Implementation
import pandas as pd
df = pd.read_csv('data.csv', header=None)
df['amount'] = df[0].apply(lambda item:item.split(' ')[-1] if item.find(':')>0 else None)
df['time'] = df[0].apply(lambda item:item.split(' ')[0] if item.find(':')>0 else None)
df['date'] = df[0].apply(lambda item:item if item.find('/')>0 else None)
df['date'] = df['date'].ffill()
df = df.dropna(subset=['amount'], how='any')
df = df.drop(0, axis=1)
print(df)
output
amount time date
1 2098 00:00:00 19/11/2022
2 218 23:30:00 19/11/2022
3 606 23:00:00 19/11/2022
4 216 01:30:00 19/11/2022
5 2672 01:00:00 19/11/2022
6 2816 00:30:00 19/11/2022
8 1994 00:00:00 18/11/2022
9 260 23:30:00 18/11/2022
10 732 23:00:00 18/11/2022
11 200 01:30:00 18/11/2022
12 1378 01:00:00 18/11/2022
13 2520 00:30:00 18/11/2022
15 1830 00:00:00 17/11/2022
16 96 23:30:00 17/11/2022
17 122 23:00:00 17/11/2022
18 694 01:30:00 17/11/2022
19 2950 01:00:00 17/11/2022
20 3062 00:30:00 17/11/2022
22 2420 00:00:00 16/11/2022
23 678 23:30:00 16/11/2022
24 644 23:00:00 16/11/2022
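The answers above leave the date and the time in separate string columns, while the question asks for one combined timestamp such as 2022-11-19 00:00:00. A minimal sketch of that last step, working on the raw layout from the question, could look like this; parse_power_file, the column names timestamp and power_w, and the shortened sample are all assumptions made for illustration.

```python
import io
import pandas as pd

def parse_power_file(handle):
    """Hypothetical helper: read the provider's layout into one datetime column."""
    current_date = None
    records = []
    for line in handle:
        cells = [cell.strip('"') for cell in line.strip().split(";")]
        first = cells[0]
        if len(first) == 10 and first[2] == "/" and first[5] == "/":
            # A dd/mm/yyyy line: remember it for the readings that follow
            current_date = first
        elif (current_date and len(first) == 8 and first[2] == ":"
              and len(cells) > 1 and cells[1].isdigit()):
            # An hh:mm:ss;value line: attach the current date
            records.append((f"{current_date} {first}", int(cells[1])))
        # Anything else (titles, lone separators) is ignored
    df = pd.DataFrame(records, columns=["timestamp", "power_w"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y %H:%M:%S")
    return df

# Shortened copy of the raw file from the question
sample = io.StringIO(
    ';\n'
    '"Récapitulatif de mes puissances atteintes en W";\n'
    ';\n'
    '"Date et heure de relève par le distributeur";"Puissance atteinte (W)"\n'
    ';\n'
    '"19/11/2022";\n'
    '"00:00:00";4494\n'
    '"23:30:00";1174\n'
    ';\n'
    '"18/11/2022";\n'
    '"00:30:00";2816\n'
)
power = parse_power_file(sample)
print(power)
```

With a real file, the same function can be called as parse_power_file(open("your_file.txt", encoding="utf-8")); the rows are kept in file order, so a sort_values("timestamp") may be wanted afterwards.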
I have time series data for a full year, with one row per minute.
timestamp day hour min rainfall_rate
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 x
2010-01-01 00:02:00 1 0 2 x
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 x
... ...
2010-12-31 23:55:00 365 23 55
2010-12-31 23:56:00 365 23 56
2010-12-31 23:57:00 365 23 57
2010-12-31 23:58:00 365 23 58
2010-12-31 23:59:00 365 23 59
I want to combine the timestamps so that I get the combined rainfall_rate for every month, i.e. I want to use groupby to combine them based on the date, and also plot them with the timestamp on the axis for further analysis.
How can I do this using pandas?
I used:
daily_groups = rainfall_df.groupby(rainfall_df.index.date)
and then
daily_groups.get_group(pd.Timestamp(2010,1,1))['rainfall_rate'].sum()
but of course I could not plot them because they have different shapes.
Use pd.Grouper with freq="M":
print (df.groupby(pd.Grouper(freq="M"))["rainfall_rate"].count())
#
timestamp
2010-01-31 5
2010-02-28 0
2010-03-31 0
2010-04-30 0
2010-05-31 0
2010-06-30 0
2010-07-31 0
2010-08-31 0
2010-09-30 0
2010-10-31 0
2010-11-30 0
2010-12-31 0
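The output above counts rows per month; for the combined rainfall the question asks about, sum() would be used instead. Here is a small sketch under the assumption that the frame has a DatetimeIndex, as in the question's own attempt. The toy data (40 days of one reading per minute, all equal to 1.0) is made up, and the grouping goes through monthly periods rather than a freq string, which is a swapped-in variant of the Grouper call above.

```python
import numpy as np
import pandas as pd

# Toy data: 40 days of per-minute readings, rainfall_rate fixed at 1.0
index = pd.date_range("2010-01-01", periods=40 * 24 * 60, freq="min")
rainfall_df = pd.DataFrame({"rainfall_rate": np.ones(len(index))}, index=index)

# Group every row by the calendar month it falls in and add up the rates
monthly = rainfall_df.groupby(rainfall_df.index.to_period("M"))["rainfall_rate"].sum()
print(monthly)
```

The result is a single Series with one entry per month, so it can be plotted directly, for example with monthly.plot(kind="bar") when matplotlib is installed.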
I have a long dataframe with an index of a timeseries like this:
datetime number
2015-07-06 00:00:00 12
2015-07-06 00:10:00 55
2015-07-06 00:20:00 129
2015-07-06 00:30:00 5
2015-07-06 00:40:00 3017
2015-07-06 00:50:00 150
2015-07-06 01:00:00 347
2015-07-06 01:10:00 8
2015-07-06 01:20:00 19
... ...
I would like to transform/reshape this by splitting the column every n rows into a row in a 'new' table.
For example, n=3 would create:
datetime #0 #1 #2
2015-07-06 00:00:00 12 55 129
2015-07-06 00:30:00 5 3017 150
2015-07-06 01:00:00 347 8 19
... ... ... ...
I can think of doing this with a For-Loop, but I was wondering if there was a more efficient way native to Pandas.
You can use groupby and apply/agg with list:
u = df.groupby(pd.Grouper(key='datetime', freq='30min'))['number'].agg(list)
pd.DataFrame(u.tolist(), index=u.index)
0 1 2
datetime
2015-07-06 00:00:00 12 55 129
2015-07-06 00:30:00 5 3017 150
2015-07-06 01:00:00 347 8 19
Here is one solution
n = 3
new_df = df.groupby(df.index//n).agg({'datetime': 'first', 'number': lambda x: x.tolist()})
new_df.assign(**(new_df.number.apply(pd.Series).add_prefix('#')))
datetime number #0 #1 #2
0 2015-07-06 00:00:00 [12, 55, 129] 12 55 129
1 2015-07-06 00:30:00 [5, 3017, 150] 5 3017 150
2 2015-07-06 01:00:00 [347, 8, 19] 347 8 19
You can drop the number column.
Edit: As @coldspeed suggested, you can combine the last two steps.
new_df = df.groupby(df.index//n).agg({'datetime': 'first', 'number': lambda x: x.tolist()})
new_df.assign(**(new_df.pop('number').apply(pd.Series).add_prefix('#')))
datetime #0 #1 #2
0 2015-07-06 00:00:00 12 55 129
1 2015-07-06 00:30:00 5 3017 150
2 2015-07-06 01:00:00 347 8 19
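When the rows are evenly spaced and only the position matters, the same reshaping can be sketched with a plain NumPy reshape instead of a groupby. This is an alternative technique, not what the answers above do; reshape_every_n is a made-up name, and the sketch assumes that trailing rows which do not fill a complete group of n may be dropped.

```python
import pandas as pd

def reshape_every_n(df, n):
    """Hypothetical helper: put every n consecutive 'number' values on one row."""
    usable = len(df) - len(df) % n  # drop an incomplete trailing group
    values = df["number"].to_numpy()[:usable].reshape(-1, n)
    # Each output row is labelled with the datetime of its first input row
    index = pd.Index(df["datetime"].iloc[:usable:n], name="datetime")
    return pd.DataFrame(values, index=index, columns=[f"#{i}" for i in range(n)])

df = pd.DataFrame({
    "datetime": pd.date_range("2015-07-06", periods=9, freq="10min"),
    "number": [12, 55, 129, 5, 3017, 150, 347, 8, 19],
})
wide = reshape_every_n(df, 3)
print(wide)
```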
I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file using the following query :
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of my users and for this purpose:
I would like to group the queries by month for each user (there are thousands of them), summing whole_cost over the entire month. For example, if user_id=1 has a whole cost of 1790 on 02/10/2012 (with cost1 12) and a whole cost of 364 on 07/10/2012, then the new table should have an entry of 2154 (as the whole cost) on 31/10/2012. Every date in the transformed table will be a month end, representing the whole month to which it relates.
In 0.14 you'll be able to groupby monthly and another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
until 0.14 I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
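With TimeGrouper deprecated, you can replace it with Grouper to get the same results (this assumes date is a datetime column rather than the index):
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost': 'sum'})
# a different grouping, by day of the week instead of by month:
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost': 'sum'})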
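As a quick check of the month-end totals discussed in this thread, here is a sketch on four of the rows from the question. It rolls each date forward to its month end with pd.offsets.MonthEnd(0) and groups on that, which is a swapped-in variant rather than the Grouper call itself; the name month_end is made up for illustration, and the dates are assumed to be day-first.

```python
import pandas as pd

# Four rows taken from the question's table
df = pd.DataFrame({
    "date": pd.to_datetime(["02/10/2012", "07/10/2012", "30/01/2013", "02/10/2012"],
                           format="%d/%m/%Y"),
    "user_id": [1, 1, 1, 3],
    "whole_cost": [1790, 364, 280, 623],
})

# MonthEnd(0) moves a date to the end of its month (dates already there stay put)
month_end = (df["date"] + pd.offsets.MonthEnd(0)).rename("month_end")
monthly = df.groupby(["user_id", month_end])["whole_cost"].sum()
print(monthly)
```

User 1 ends up with 2154 for 2012-10-31, matching the example worked out in the question.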