Group rows in fixed-duration windows that satisfy multiple conditions - python

I have a df as below. Consider df to be indexed by timestamps with dtype='datetime64[ns]', e.g. 1970-01-01 00:00:27.603046999 (I am using dummy timestamps here).
Timestamp  Address  Type  Arrival_Time  Time_Delta
0.1        2        A     0.25          0.15
0.4        3        B     0.43          0.03
0.9        1        B     1.20          0.20
1.3        1        A     1.39          0.09
1.5        3        A     1.64          0.14
1.7        3        B     1.87          0.17
2.0        3        A     2.09          0.09
2.1        1        B     2.44          0.34
I have three unique "addresses" (1, 2, 3) and two unique "types" (A, B).
Now I am trying to do two things in a simple way (possibly using pd.Grouper and DataFrame.groupby in pandas):
Problem 1: Group rows into fixed 1-second bins (using the timestamp values). Then, in each 1-second bin, for each "Address", find the mean and sum of "Time_Delta", but only for rows where "Type" is A.
Problem 2: Group rows into the same fixed 1-second bins. Then, in each bin, for each "Address", find the mean and sum of the inter-arrival time (IAT), where
IAT = Arrival_Time(i) - Arrival_Time(i-1)
Note: if the timestamps span 100 seconds, the output dataframe should have exactly 100 rows and six columns, i.e. two columns (mean, sum) per address.
For Problem 1:
I tried the following code:
df = pd.DataFrame({'Timestamp': Timestamp, 'Address': Address,
                   'Type': Type, 'Arrival_Time': Arrival_time, 'Time_Delta': Time_delta})
# Convert the float seconds to nanoseconds and use them as a DatetimeIndex
index = pd.DatetimeIndex(df['Timestamp'] * 10**9)
df = df.set_index(index)  # Set timestamp as index
df_1 = df['Time_Delta'].groupby([pd.Grouper(freq='1S'), df['Address']]).mean().unstack(fill_value=0)
which gives results:
Address                 1     2      3
Timestamp
1970-01-01 00:00:00  0.20  0.15  0.030
1970-01-01 00:00:01  0.09  0.00  0.155
1970-01-01 00:00:02  0.34  0.00  0.090
As you can see, this gives the mean Time_Delta for each address in each 1S bin, but I still need to add the second condition, i.e. find the mean for each address only where Type = A. I hope Problem 1 is now clear.
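A minimal sketch of one way to fold in that condition (assuming df is indexed by the datetime timestamps as above): filter the frame to Type A rows first, then group and aggregate in one pass:

# Keep only Type A rows, then bin by 1-second windows per address.
df_a = df[df['Type'] == 'A']
df_1 = (df_a.groupby([pd.Grouper(freq='1S'), 'Address'])['Time_Delta']
            .agg(['mean', 'sum'])
            .unstack(fill_value=0))
# Note: 1-second bins containing no Type A rows are dropped by groupby;
# reindex against the full 1-second range if you need exactly one row per second.

This yields the six columns mentioned above: one (mean, sum) pair per address.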
For Problem 2:
It's a bit more complicated. I want to get the mean IAT for each address in the same format as above.
One possible way is to add an extra column df['IAT'] to the original df, along these lines:

for i in range(1, len(df)):
    df.loc[df.index[i], 'IAT'] = df['Arrival_Time'].iloc[i] - df['Arrival_Time'].iloc[i - 1]

Then apply the same code as above to find the mean IAT for each address where Type = A.
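For reference, a loop-free sketch of the same idea; whether the diff should run per address (as below) or across all rows, and whether the Type = A filter from Problem 1 applies here too, are assumptions on my part:

# Inter-arrival time per address: the first row of each address has no
# predecessor, so its IAT is NaN and is ignored by mean/sum automatically.
df['IAT'] = df.groupby('Address')['Arrival_Time'].diff()
df_2 = (df.groupby([pd.Grouper(freq='1S'), 'Address'])['IAT']
          .agg(['mean', 'sum'])
          .unstack(fill_value=0))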
Actual Data
Timestamp                      Address            Type           Time Delta  Arrival Time
1970-01-01 00:00:00.000000000  28:5a:ec:16:00:22  Control frame  0.000000    Nov 10, 2017 22:39:20.538561000
1970-01-01 00:00:00.000287000  28:5a:ec:16:00:23  Data frame     0.000287    Nov 10, 2017 22:39:20.548121000
1970-01-01 00:00:00.000896000  28:5a:ec:16:00:22  Control frame  0.000609    Nov 10, 2017 22:39:20.611256000
1970-01-01 00:00:00.001388000  28:5a:ec:16:00:21  Data frame     0.000492    Nov 10, 2017 22:39:20.321745000
...
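Note that in the actual data "Arrival Time" is a human-readable string, so it presumably has to be parsed before any time arithmetic; a sketch (pd.to_datetime can infer this format):

# Parse arrival times into datetime64[ns] so that .diff() yields proper timedeltas.
df['Arrival Time'] = pd.to_datetime(df['Arrival Time'])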

Related

Pandas - Find first occurrence of number closest to an input value

I have a dataframe like below.
time speed
0 1 0.20
1 2 0.40
2 3 2.00
3 4 3.00
4 5 0.40
5 6 0.43
6 7 6.00
I would like to find the first occurrence of a number (in the 'speed' column) that is closest to an input value I enter.
For example :
input value = 0.43
Expected Output :
Speed: 0.40 & corresponding Time: 2
The speed column should not be sorted for this problem.
I tried the below, but am not getting the expected output.
Any help on this would be appreciated.
absolute closest
You can compute the absolute difference to your reference and get the idxmin:
speed_input = 0.43
df.loc[abs(df['speed']-speed_input).idxmin()]
output:
time 6.00
speed 0.43
Name: 5, dtype: float64
first closest with threshold:
i = 0.43
thresh = 0.03
df.loc[abs(df['speed']-i).le(thresh).idxmax()]
output:
time 2.0
speed 0.4
Name: 1, dtype: float64
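One caveat with the threshold variant: idxmax on an all-False mask returns the very first index, so if no value is within thresh you silently get row 0. A guard could look like:

mask = abs(df['speed'] - i).le(thresh)
result = df.loc[mask.idxmax()] if mask.any() else None  # None: nothing within thresh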
One idea is to round both values:
df.loc[(df['speed'].round(1) - round(speed_input, 1)).abs().idxmin()]

How To Iterate Over A Timespan and Calculate some Values in a Dataframe using Python?

I have a dataset like below
data = {'ReportingDate':['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
'2013/6/28','2013/6/28',
'2013/6/28','2013/6/28','2013/6/28'],
'MarketCap':[' ',0.35,0.7,0.875,0.7,0.35,' ',1,1.5,0.75,1.25],
'AUM':[3.5,3.5,3.5,3.5,3.5,3.5,5,5,5,5,5],
'weight':[' ',0.1,0.2,0.25,0.2,0.1,' ',0.2,0.3,0.15,0.25]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)  # the column is named 'ReportingDate', without a space
df
This is just a sample of an 8000-row dataset.
ReportingDate runs from 2013/5/31 to 2015/10/30.
It includes data for every month in that period, but only for the last day of each month.
The first line of each month has two missing values. I know that:
the sum of weight for each month is equal to 1
weight * AUM is equal to MarketCap
I can use the lines below to get the answer I want, but only for one month:
a = 1 - df["2013-5"].iloc[1:]['weight'].sum()
b = a * df["2013-5"]['AUM'].iloc[0]
df.iloc[0, 0] = b  # MarketCap of the month's first row
df.iloc[0, 2] = a  # weight of the month's first row
How can I use a loop to get the data for the whole period? Thanks
One way using pandas.DataFrame.groupby:
# If the blanks really are whitespace strings, not NaN
df = df.replace(r"\s+", np.nan, regex=True)
df = df.apply(pd.to_numeric)  # the blanks left these as object columns
# If the index is not already a DatetimeIndex
df.index = pd.to_datetime(df.index)
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform("sum")
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])
The trick: since each month's weights sum to 1 and transform("sum") skips the NaN, s equals the missing weight exactly on the rows where weight is missing.
Note: this assumes that dates are always the last day of the month, so that grouping by date is equivalent to grouping by year-month. If not, try:
s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform("sum")
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.350 3.5 0.10
2013-05-31 0.525 3.5 0.15
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25

Extract values based on timestamps diff conditions in pandas

I have a dataset with timestamps and values for each ID. The number of rows for each ID is different, and I need a double for loop like this:

for ids in IDs:
    for index in Date:

Now, I would like to find the difference between timestamps, for each ID, in these ways:
values between 2 days
values between 7 days
In particular, for each ID:
if, from the first value, in the next 2 days there is an increment of at least 0.3 from the first value,
OR
if, from the first value, in the next 7 days there is a value equal to 1.5 * the first value,
I store that ID in one dataframe; otherwise I store it in another dataframe.
Now, my code is the following:
yesDf = pd.DataFrame()
noDf = pd.DataFrame()

for ids in IDs:
    for index in Date:
        if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 2):
            if (df.iloc[index]['Val'] - df.iloc[index - 1]['Val'] >= 0.3):
                yesDf += IDs['ID']
            noDf += IDs['ID']
        if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 7):
            if (df.iloc[Date - 1]['Val'] >= df.iloc[index]['Val'] * 1.5):
                yesDf += IDs['ID']
            noDf += IDs['ID']

print(yesDf)
print(noDf)
I get these errors:
TypeError: incompatible type for a datetime/timedelta operation [sub]
and
pandas.errors.NullFrequencyError: Cannot shift with no freq
How can I solve this problem?
Thank you
Edit: my dataframe
Val ID Date
2199 0.90 0000.0 2017-12-26 11:00:01
2201 1.35 0001.0 2017-12-26 11:00:01
63540 0.72 0001.0 2018-08-10 11:53:01
68425 0.86 0001.0 2018-10-14 08:33:01
42444 0.99 0002.0 2018-02-01 09:25:53
41474 1.05 0002.0 2018-04-01 08:00:04
42148 1.19 0002.0 2018-07-01 08:50:00
24291 1.01 0004.0 2017-01-01 08:12:02
For example: for ID 0001.0 the first value is 1.35; in the next 2 days there is no increment of at least 0.3 from the start value, and in the next 7 days there is no value equal to 1.5 times the first value, so it goes in the noDf dataframe.
Also the dtypes:
Val float64
ID object
Date datetime64[ns]
Surname object
Name object
dtype: object
Edit: after the modified code, the results are:
Val ID Date Date_diff_cumsum Val_diff
24719 2.08 0118.0 2017-01-15 08:16:05 1.0 0.36
24847 2.17 0118.0 2017-01-16 07:23:04 1.0 0.45
25233 2.45 0118.0 2017-01-17 08:21:03 2.0 0.73
24749 2.95 0118.0 2017-01-18 09:49:09 3.0 1.23
17042 1.78 0129.0 2018-02-05 22:48:17 0.0 0.35
And it is correct. Now I only need to add each matching ID into a dataframe.
This answer should work assuming you start from the first value of an ID, so the first timestamp.
First, I added the 'Date_diff_cumsum' column, which stores the difference in days between the first date for the ID and the row's date:
df['Date_diff_cumsum'] = df.groupby('ID').Date.diff().dt.days
df['Date_diff_cumsum'] = df.groupby('ID').Date_diff_cumsum.cumsum().fillna(0)
Then, I add the 'Value_diff' column, which is the difference between the first value for an ID and the row's value:
df['Val_diff'] = df.groupby('ID')['Val'].transform(lambda x:x-x.iloc[0])
Here is what I get after adding the columns for your sample DataFrame:
Val ID Date Date_diff_cumsum Val_diff
0 0.90 0.0 2017-12-26 11:00:01 0.0 0.00
1 1.35 1.0 2017-12-26 11:00:01 0.0 0.00
2 0.72 1.0 2018-08-10 11:53:01 227.0 -0.63
3 0.86 1.0 2018-10-14 08:33:01 291.0 -0.49
4 0.99 2.0 2018-02-01 09:25:53 0.0 0.00
5 1.05 2.0 2018-04-01 08:00:04 58.0 0.06
6 1.19 2.0 2018-07-01 08:50:00 149.0 0.20
7 1.01 4.0 2017-01-01 08:12:02 0.0 0.00
And finally, return the rows which satisfy the conditions in your question:
df[((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7))]
In this case, it will return no rows.
yesDf = df[((df['Val_diff'] >= 0.3) & (df['Date_diff_cumsum'] <= 2)) |
           ((df['Val'] >= 1.5 * (df['Val'] - df['Val_diff'])) & (df['Date_diff_cumsum'] <= 7))].ID.drop_duplicates().to_frame()
noDf = df[~(((df['Val_diff'] >= 0.3) & (df['Date_diff_cumsum'] <= 2)) |
            ((df['Val'] >= 1.5 * (df['Val'] - df['Val_diff'])) & (df['Date_diff_cumsum'] <= 7)))].ID.drop_duplicates().to_frame()
yesDf contains the IDs that satisfy the condition, and noDf the ones that don't (note the ~ must cover the whole disjunction).
I hope this answers your question!

Implemented groupby and want to insert the output of groupby in my .csv file

I have around 8781 rows in my dataset. I have grouped the different items by month and calculated the mean of a particular item for every month. Now I want to store the result of each month by inserting a new row after that month's rows.
Below is the code I have written to group the items and calculate the means.
Please, can anyone tell me how I can insert a new row after every month and store my groupby result in it?
a = pd.read_csv("data3.csv")
print (a)
df=pd.DataFrame(a,columns=['month','day','BedroomLights..kW.'])
print(df)
groupby_month=df['day'].groupby(df['month'])
print(groupby_month)
c=list(df['day'].groupby(df['month']))
print(c)
d=df['day'].groupby(df['month']).describe()
print (d)
#print(groupby_month.mean())
e=df['BedroomLights..kW.'].groupby(df['month']).mean()
print(e)
A sample of the csv file is:
Day Month Year lights Fan temperature windspeed
1 1 2016 0.003 0.12 39 8.95
2 1 2016 0.56 1.23 34 9.54
3 1 2016 1.43 0.32 32 10.32
4 1 2016 0.4 1.43 24 8.32
.................................................
1 12 2016 0.32 0.54 22 7.65
2 12 2016 1.32 0.43 21 6.54
The expected output I want adds a new row containing the mean of the items of each month, like:
Month lights ......
1 0.32
1 0.43
...............
mean as a new row
...............
12 0.32
12 0.43
mean .........
The output of the code I have shown is as follows:
month
1 0.006081
2 0.005993
3 0.005536
4 0.005729
5 0.005823
6 0.005587
7 0.006214
8 0.005509
9 0.005935
10 0.005821
11 0.006226
12 0.006056
Name: BedroomLights..kW., dtype: float64
If your indices are named 1mean, 2mean, 3mean, etc., sort_index should place them where you want.
e.index = [str(n) + 'mean' for n in range(1, 13)]
df = pd.concat([df, e.to_frame()])  # DataFrame.append was removed in pandas 2.0
df = df.sort_index()
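If the goal is literally one mean row inserted after each month's block, a sketch that builds the frame group by group may be easier to reason about (this assumes a numeric 'month' column as in the code above; the output file name is hypothetical):

import pandas as pd

pieces = []
for month, grp in df.groupby('month', sort=True):
    pieces.append(grp)                                    # the month's own rows
    mean_row = grp.mean(numeric_only=True).to_frame().T   # one-row frame of means
    mean_row.index = [str(month) + ' mean']
    pieces.append(mean_row)
out = pd.concat(pieces)
out.to_csv('data3_with_means.csv')  # hypothetical output file name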

Pandas rolling sum, variating length

I will try and explain the problem I am currently having concerning cumulative sums on DataFrames in Python, and hopefully you'll grasp it!
Given a pandas DataFrame df with a column returns as such:
returns
Date
2014-12-10 0.0000
2014-12-11 0.0200
2014-12-12 0.0500
2014-12-15 -0.0200
2014-12-16 0.0000
Applying a cumulative sum to this DataFrame is easy, e.g. df.cumsum(). But is it possible to apply a cumulative sum every X days (or data points), yielding only the cumulative sum of the last Y days (data points)?
Clarification: given daily data as above, how do I get the accumulated sum of the last Y days, re-evaluated (from zero) every X days?
Hope it's clear enough.
Thanks,
N
"Every X days" and "every X data points" are very different; the following assumes you really mean the first, since you mention it more frequently.
If the index is a DatetimeIndex, you can resample to a daily frequency, take a rolling_sum, and then select only the original dates:
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1).loc[df.index]
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-15 -0.02
2014-12-16 -0.02
or, step by step:
>>> df.resample("1d")
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.05
2014-12-13 NaN
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 0.00
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1)
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-13 0.05
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 -0.02
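For readers on current pandas: pd.rolling_sum and the DataFrame-returning resample are long gone, so the snippet above no longer runs as written. A rough modern equivalent (a sketch with the same daily upsampling and 2-day window):

daily = df.resample("1D").asfreq()  # upsample to daily; missing days become NaN
out = daily.rolling(2, min_periods=1).sum().loc[df.index]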
The way I would do it is with helper columns. It's a little kludgy but it should work:
# x = the re-evaluation period, y = the lookback length (the X and Y above)
numgroups = int(len(df) / (x - 1))
df['groupby'] = sorted(list(range(numgroups)) * x)[:len(df)]
df['mask'] = (([0] * (x - y) + [1] * y) * numgroups)[:len(df)]
df['masked'] = df.returns * df['mask']
df.groupby('groupby').masked.cumsum()
I am not sure if there is a built-in method, but it does not seem very difficult to write one.
For example, here is one for a pandas Series:
def cum(s, interval):
    sums = []
    quotient = len(s) // interval
    for i in range(quotient):
        sums.append(s[0:(i + 1) * interval].sum())
    return pd.Series(sums)

>>> s1 = pd.Series(range(20))
>>> print(cum(s1, 4))
0 6
1 28
2 66
3 120
4 190
dtype: int64
Thanks to @DSM I managed to come up with a variation of his solution that does pretty much what I was looking for:
import numpy as np
import pandas as pd

df.resample("1w", how={'A': np.sum})
Yields what I want for the example below:
rng = range(1,29)
dates = pd.date_range('1/1/2000', periods=len(rng))
r = pd.DataFrame(rng, index=dates, columns=['A'])
r2 = r.resample("1w", how={'A': np.sum})
Outputs:
>> print r
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
2000-01-04 4
2000-01-05 5
2000-01-06 6
2000-01-07 7
2000-01-08 8
2000-01-09 9
2000-01-10 10
2000-01-11 11
...
2000-01-25 25
2000-01-26 26
2000-01-27 27
2000-01-28 28
>> print r2
A
2000-01-02 3
2000-01-09 42
2000-01-16 91
2000-01-23 140
2000-01-30 130
Even though it doesn't start "one week in" in this case (resulting in a sum of 3 for the very first bin), it always gets the correct rolling sum, starting on the previous date with an initial value of zero.
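On current pandas the how= keyword is likewise gone; the same weekly sum can be spelled as (a sketch of the modern API):

r2 = r.resample("1W").agg({'A': 'sum'})   # or: r.resample("1W")['A'].sum()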
