Group rows in fixed-duration windows that satisfy multiple conditions - python

I have a df as below. Consider df to be indexed by timestamps with dtype='datetime64[ns]', e.g. 1970-01-01 00:00:27.603046999 (I am using dummy timestamps here).
Timestamp  Address  Type  Arrival_Time  Time_Delta
0.1        2        A     0.25          0.15
0.4        3        B     0.43          0.03
0.9        1        B     1.20          0.20
1.3        1        A     1.39          0.09
1.5        3        A     1.64          0.14
1.7        3        B     1.87          0.17
2.0        3        A     2.09          0.09
2.1        1        B     2.44          0.34
I have three unique "addresses" (1, 2, 3) and two unique "types" (A, B).
Now I am trying to do two things in a simple way (possibly using pd.Grouper and DataFrame.groupby in pandas):
Problem 1: Group rows into fixed 1-second bins (using the timestamp values). Then, in each 1-second bin, for each "Address", find the mean and sum of "Time_Delta", but only for rows where "Type" is A.
Problem 2: Group rows into the same fixed 1-second bins. Then, in each bin, for each "Address", find the mean and sum of the inter-arrival time (IAT), where
IAT = Arrival_Time(i) - Arrival_Time(i-1)
Note: if the timestamps span 100 seconds, the output dataframe should have exactly 100 rows and six columns, i.e. two columns (mean, sum) per address.
For Problem 1:
I tried the following code:
df = pd.DataFrame({'Timestamp': Timestamp, 'Address': Address,
                   'Type': Type, 'Arrival_Time': Arrival_time, 'Time_Delta': Time_delta})
# Convert the float seconds to nanoseconds and use them as a DatetimeIndex
index = pd.DatetimeIndex(df['Timestamp'] * 10**9)
df = df.set_index(index)  # Set timestamp as index
df_1 = df['Time_Delta'].groupby([pd.Grouper(freq='1S'), df['Address']]).mean().unstack(fill_value=0)
which gives results:
Address                 1     2      3
Timestamp
1970-01-01 00:00:00  0.20  0.15  0.030
1970-01-01 00:00:01  0.09  0.00  0.155
1970-01-01 00:00:02  0.34  0.00  0.090
As you can see, this gives the mean Time_Delta for each address in each 1S bin, but I still need to add the second condition, i.e. find the mean for each address only where Type = A. I hope Problem 1 is now clear.
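A minimal sketch of one way to fold in that condition (assuming df is indexed by the datetime timestamps as above): filter the frame to Type A rows first, then group and aggregate in one pass:

# Keep only Type A rows, then bin by 1-second windows per address.
df_a = df[df['Type'] == 'A']
df_1 = (df_a.groupby([pd.Grouper(freq='1S'), 'Address'])['Time_Delta']
            .agg(['mean', 'sum'])
            .unstack(fill_value=0))
# Note: 1-second bins containing no Type A rows are dropped by groupby;
# reindex against the full 1-second range if you need exactly one row per second.

This yields the six columns mentioned above: one (mean, sum) pair per address.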
For Problem 2:
It's a bit more complicated. I want to get the mean IAT for each address in the same format as above.
One possible way is to add an extra column df['IAT'] to the original df, along these lines:

for i in range(1, len(df)):
    df.loc[df.index[i], 'IAT'] = df['Arrival_Time'].iloc[i] - df['Arrival_Time'].iloc[i - 1]

Then apply the same code as above to find the mean IAT for each address where Type = A.
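For reference, a loop-free sketch of the same idea; whether the diff should run per address (as below) or across all rows, and whether the Type = A filter from Problem 1 applies here too, are assumptions on my part:

# Inter-arrival time per address: the first row of each address has no
# predecessor, so its IAT is NaN and is ignored by mean/sum automatically.
df['IAT'] = df.groupby('Address')['Arrival_Time'].diff()
df_2 = (df.groupby([pd.Grouper(freq='1S'), 'Address'])['IAT']
          .agg(['mean', 'sum'])
          .unstack(fill_value=0))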
Actual Data
Timestamp                      Address            Type           Time Delta  Arrival Time
1970-01-01 00:00:00.000000000  28:5a:ec:16:00:22  Control frame  0.000000    Nov 10, 2017 22:39:20.538561000
1970-01-01 00:00:00.000287000  28:5a:ec:16:00:23  Data frame     0.000287    Nov 10, 2017 22:39:20.548121000
1970-01-01 00:00:00.000896000  28:5a:ec:16:00:22  Control frame  0.000609    Nov 10, 2017 22:39:20.611256000
1970-01-01 00:00:00.001388000  28:5a:ec:16:00:21  Data frame     0.000492    Nov 10, 2017 22:39:20.321745000
...
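Note that in the actual data "Arrival Time" is a human-readable string, so it presumably has to be parsed before any time arithmetic; a sketch (pd.to_datetime can infer this format):

# Parse arrival times into datetime64[ns] so that .diff() yields proper timedeltas.
df['Arrival Time'] = pd.to_datetime(df['Arrival Time'])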

Related

Pandas - Find first occurrence of number closest to an input value

I have a dataframe like below.
time speed
0 1 0.20
1 2 0.40
2 3 2.00
3 4 3.00
4 5 0.40
5 6 0.43
6 7 6.00
I would like to find the first occurrence of a number (in the 'speed' column) that is closest to an input value I enter.
For example :
input value = 0.43
Expected Output :
Speed: 0.40 & corresponding Time: 2
The speed column should not be sorted for this problem.
I tried the below, but am not getting the expected output.
Any help on this would be appreciated.
absolute closest
You can compute the absolute difference to your reference and get the idxmin:
speed_input = 0.43
df.loc[abs(df['speed']-speed_input).idxmin()]
output:
time 6.00
speed 0.43
Name: 5, dtype: float64
first closest with threshold:
i = 0.43
thresh = 0.03
df.loc[abs(df['speed']-i).le(thresh).idxmax()]
output:
time 2.0
speed 0.4
Name: 1, dtype: float64
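One caveat with the threshold variant: idxmax on an all-False mask returns the very first index, so if no value is within thresh you silently get row 0. A guard could look like:

mask = abs(df['speed'] - i).le(thresh)
result = df.loc[mask.idxmax()] if mask.any() else None  # None: nothing within thresh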
One idea is to round both values:
df.loc[(df['speed'].round(1) - round(speed_input, 1)).abs().idxmin()]

How To Iterate Over A Timespan and Calculate some Values in a Dataframe using Python?

I have a dataset like below
data = {'ReportingDate':['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
'2013/6/28','2013/6/28',
'2013/6/28','2013/6/28','2013/6/28'],
'MarketCap':[' ',0.35,0.7,0.875,0.7,0.35,' ',1,1.5,0.75,1.25],
'AUM':[3.5,3.5,3.5,3.5,3.5,3.5,5,5,5,5,5],
'weight':[' ',0.1,0.2,0.25,0.2,0.1,' ',0.2,0.3,0.15,0.25]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)  # the column is named 'ReportingDate', without a space
df
This is just a sample of an 8000-row dataset.
ReportingDate runs from 2013/5/31 to 2015/10/30.
It includes data for every month in that period, but only for the last day of each month.
The first line of each month has two missing values. I know that:
the sum of weight for each month is equal to 1
weight * AUM is equal to MarketCap
I can use the lines below to get the answer I want, but only for one month:
a = 1 - df["2013-5"].iloc[1:]['weight'].sum()
b = a * df["2013-5"]['AUM'].iloc[0]
df.iloc[0, 0] = b  # MarketCap of the month's first row
df.iloc[0, 2] = a  # weight of the month's first row
How can I use a loop to get the data for the whole period? Thanks
One way using pandas.DataFrame.groupby:
# If the blanks really are whitespace strings, not NaN
df = df.replace(r"\s+", np.nan, regex=True)
df = df.apply(pd.to_numeric)  # the blanks left these as object columns
# If the index is not already a DatetimeIndex
df.index = pd.to_datetime(df.index)
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform("sum")
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])
The trick: since each month's weights sum to 1 and transform("sum") skips the NaN, s equals the missing weight exactly on the rows where weight is missing.
Note: this assumes that dates are always the last day of the month, so that grouping by date is equivalent to grouping by year-month. If not, try:
s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform("sum")
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.350 3.5 0.10
2013-05-31 0.525 3.5 0.15
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25

Extract values based on timestamps diff conditions in pandas

I have a dataset with timestamps and values for each ID. The number of rows for each ID is different, and I need a double for loop like this:

for ids in IDs:
    for index in Date:

Now, I would like to find the difference between timestamps, for each ID, in these ways:
values between 2 days
values between 7 days
In particular, for each ID:
if, from the first value, in the next 2 days there is an increment of at least 0.3 from the first value,
OR
if, from the first value, in the next 7 days there is a value equal to 1.5 * the first value,
I store that ID in one dataframe; otherwise I store it in another dataframe.
Now, my code is the following:
yesDf = pd.DataFrame()
noDf = pd.DataFrame()

for ids in IDs:
    for index in Date:
        if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 2):
            if (df.iloc[index]['Val'] - df.iloc[index - 1]['Val'] >= 0.3):
                yesDf += IDs['ID']
            noDf += IDs['ID']
        if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 7):
            if (df.iloc[Date - 1]['Val'] >= df.iloc[index]['Val'] * 1.5):
                yesDf += IDs['ID']
            noDf += IDs['ID']

print(yesDf)
print(noDf)
I get these errors:
TypeError: incompatible type for a datetime/timedelta operation [sub]
and
pandas.errors.NullFrequencyError: Cannot shift with no freq
How can I solve this problem?
Thank you
Edit: my dataframe
Val ID Date
2199 0.90 0000.0 2017-12-26 11:00:01
2201 1.35 0001.0 2017-12-26 11:00:01
63540 0.72 0001.0 2018-08-10 11:53:01
68425 0.86 0001.0 2018-10-14 08:33:01
42444 0.99 0002.0 2018-02-01 09:25:53
41474 1.05 0002.0 2018-04-01 08:00:04
42148 1.19 0002.0 2018-07-01 08:50:00
24291 1.01 0004.0 2017-01-01 08:12:02
For example: for ID 0001.0 the first value is 1.35; in the next 2 days there is no increment of at least 0.3 from the start value, and in the next 7 days there is no value equal to 1.5 times the first value, so it goes in the noDf dataframe.
Also the dtypes:
Val float64
ID object
Date datetime64[ns]
Surname object
Name object
dtype: object
Edit: after the modified code, the results are:
Val ID Date Date_diff_cumsum Val_diff
24719 2.08 0118.0 2017-01-15 08:16:05 1.0 0.36
24847 2.17 0118.0 2017-01-16 07:23:04 1.0 0.45
25233 2.45 0118.0 2017-01-17 08:21:03 2.0 0.73
24749 2.95 0118.0 2017-01-18 09:49:09 3.0 1.23
17042 1.78 0129.0 2018-02-05 22:48:17 0.0 0.35
And it is correct. Now I only need to add each matching ID into a dataframe.
This answer should work assuming you start from the first value of an ID, so the first timestamp.
First, I added the 'Date_diff_cumsum' column, which stores the difference in days between the first date for the ID and the row's date:
df['Date_diff_cumsum'] = df.groupby('ID').Date.diff().dt.days
df['Date_diff_cumsum'] = df.groupby('ID').Date_diff_cumsum.cumsum().fillna(0)
Then, I add the 'Value_diff' column, which is the difference between the first value for an ID and the row's value:
df['Val_diff'] = df.groupby('ID')['Val'].transform(lambda x:x-x.iloc[0])
Here is what I get after adding the columns for your sample DataFrame:
Val ID Date Date_diff_cumsum Val_diff
0 0.90 0.0 2017-12-26 11:00:01 0.0 0.00
1 1.35 1.0 2017-12-26 11:00:01 0.0 0.00
2 0.72 1.0 2018-08-10 11:53:01 227.0 -0.63
3 0.86 1.0 2018-10-14 08:33:01 291.0 -0.49
4 0.99 2.0 2018-02-01 09:25:53 0.0 0.00
5 1.05 2.0 2018-04-01 08:00:04 58.0 0.06
6 1.19 2.0 2018-07-01 08:50:00 149.0 0.20
7 1.01 4.0 2017-01-01 08:12:02 0.0 0.00
And finally, return the rows which satisfy the conditions in your question:
df[((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7))]
In this case, it will return no rows.
yesDf = df[((df['Val_diff'] >= 0.3) & (df['Date_diff_cumsum'] <= 2)) |
           ((df['Val'] >= 1.5 * (df['Val'] - df['Val_diff'])) & (df['Date_diff_cumsum'] <= 7))].ID.drop_duplicates().to_frame()
noDf = df[~(((df['Val_diff'] >= 0.3) & (df['Date_diff_cumsum'] <= 2)) |
            ((df['Val'] >= 1.5 * (df['Val'] - df['Val_diff'])) & (df['Date_diff_cumsum'] <= 7)))].ID.drop_duplicates().to_frame()
yesDf contains the IDs that satisfy the condition, and noDf the ones that don't (note the ~ must cover the whole disjunction).
I hope this answers your question!

Implemented groupby and want to insert the output of groupby in my .csv file

I have around 8781 rows in my dataset. I have grouped the different items by month and calculated the mean of a particular item for every month. Now I want to store the result of each month by inserting a new row after that month's rows.
Below is the code I have written to group the items and calculate the means.
Please, can anyone tell me how I can insert a new row after every month and store my groupby result in it?
a = pd.read_csv("data3.csv")
print (a)
df=pd.DataFrame(a,columns=['month','day','BedroomLights..kW.'])
print(df)
groupby_month=df['day'].groupby(df['month'])
print(groupby_month)
c=list(df['day'].groupby(df['month']))
print(c)
d=df['day'].groupby(df['month']).describe()
print (d)
#print(groupby_month.mean())
e=df['BedroomLights..kW.'].groupby(df['month']).mean()
print(e)
A sample of the csv file is:
Day Month Year lights Fan temperature windspeed
1 1 2016 0.003 0.12 39 8.95
2 1 2016 0.56 1.23 34 9.54
3 1 2016 1.43 0.32 32 10.32
4 1 2016 0.4 1.43 24 8.32
.................................................
1 12 2016 0.32 0.54 22 7.65
2 12 2016 1.32 0.43 21 6.54
The expected output I want adds a new row containing the mean of the items of each month, like:
Month lights ......
1 0.32
1 0.43
...............
mean as a new row
...............
12 0.32
12 0.43
mean .........
The output of the code I have shown is as follows:
month
1 0.006081
2 0.005993
3 0.005536
4 0.005729
5 0.005823
6 0.005587
7 0.006214
8 0.005509
9 0.005935
10 0.005821
11 0.006226
12 0.006056
Name: BedroomLights..kW., dtype: float64
If your indices are named 1mean, 2mean, 3mean, etc., sort_index should place them where you want.
e.index = [str(n) + 'mean' for n in range(1, 13)]
df = pd.concat([df, e.to_frame()])  # DataFrame.append was removed in pandas 2.0
df = df.sort_index()
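If the goal is literally one mean row inserted after each month's block, a sketch that builds the frame group by group may be easier to reason about (this assumes a numeric 'month' column as in the code above; the output file name is hypothetical):

import pandas as pd

pieces = []
for month, grp in df.groupby('month', sort=True):
    pieces.append(grp)                                    # the month's own rows
    mean_row = grp.mean(numeric_only=True).to_frame().T   # one-row frame of means
    mean_row.index = [str(month) + ' mean']
    pieces.append(mean_row)
out = pd.concat(pieces)
out.to_csv('data3_with_means.csv')  # hypothetical output file name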

Pandas rolling sum, variating length

I will try and explain the problem I am currently having concerning cumulative sums on DataFrames in Python, and hopefully you'll grasp it!
Given a pandas DataFrame df with a column returns as such:
returns
Date
2014-12-10 0.0000
2014-12-11 0.0200
2014-12-12 0.0500
2014-12-15 -0.0200
2014-12-16 0.0000
Applying a cumulative sum to this DataFrame is easy, e.g. df.cumsum(). But is it possible to apply a cumulative sum every X days (or data points), yielding only the cumulative sum of the last Y days (data points)?
Clarification: given daily data as above, how do I get the accumulated sum of the last Y days, re-evaluated (from zero) every X days?
Hope it's clear enough.
Thanks,
N
"Every X days" and "every X data points" are very different; the following assumes you really mean the first, since you mention it more frequently.
If the index is a DatetimeIndex, you can resample to a daily frequency, take a rolling_sum, and then select only the original dates:
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1).loc[df.index]
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-15 -0.02
2014-12-16 -0.02
or, step by step:
>>> df.resample("1d")
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.05
2014-12-13 NaN
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 0.00
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1)
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-13 0.05
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 -0.02
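For readers on current pandas: pd.rolling_sum and the DataFrame-returning resample are long gone, so the snippet above no longer runs as written. A rough modern equivalent (a sketch with the same daily upsampling and 2-day window):

daily = df.resample("1D").asfreq()  # upsample to daily; missing days become NaN
out = daily.rolling(2, min_periods=1).sum().loc[df.index]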
The way I would do it is with helper columns. It's a little kludgy but it should work:
# x = the re-evaluation period, y = the lookback length (the X and Y above)
numgroups = int(len(df) / (x - 1))
df['groupby'] = sorted(list(range(numgroups)) * x)[:len(df)]
df['mask'] = (([0] * (x - y) + [1] * y) * numgroups)[:len(df)]
df['masked'] = df.returns * df['mask']
df.groupby('groupby').masked.cumsum()
I am not sure if there is a built-in method, but it does not seem very difficult to write one.
For example, here is one for a pandas Series:
def cum(s, interval):
    sums = []
    quotient = len(s) // interval
    for i in range(quotient):
        sums.append(s[0:(i + 1) * interval].sum())
    return pd.Series(sums)

>>> s1 = pd.Series(range(20))
>>> print(cum(s1, 4))
0 6
1 28
2 66
3 120
4 190
dtype: int64
Thanks to @DSM I managed to come up with a variation of his solution that does pretty much what I was looking for:
import numpy as np
import pandas as pd

df.resample("1w", how={'A': np.sum})
Yields what I want for the example below:
rng = range(1,29)
dates = pd.date_range('1/1/2000', periods=len(rng))
r = pd.DataFrame(rng, index=dates, columns=['A'])
r2 = r.resample("1w", how={'A': np.sum})
Outputs:
>> print r
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
2000-01-04 4
2000-01-05 5
2000-01-06 6
2000-01-07 7
2000-01-08 8
2000-01-09 9
2000-01-10 10
2000-01-11 11
...
2000-01-25 25
2000-01-26 26
2000-01-27 27
2000-01-28 28
>> print r2
A
2000-01-02 3
2000-01-09 42
2000-01-16 91
2000-01-23 140
2000-01-30 130
Even though it doesn't start "one week in" in this case (resulting in a sum of 3 for the very first bin), it always gets the correct rolling sum, starting on the previous date with an initial value of zero.
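On current pandas the how= keyword is likewise gone; the same weekly sum can be spelled as (a sketch of the modern API):

r2 = r.resample("1W").agg({'A': 'sum'})   # or: r.resample("1W")['A'].sum()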
