Calculating the number of consecutive periods that match a condition - python

Given the data in the Date and Close columns, I'd like to calculate the values in the ConsecPeriodsUp column. This column gives the number of consecutive two-week periods that the Close value has increased.
Date Close UpThisPeriod ConsecPeriodsUp
23/12/2015 3 1 1
16/12/2015 2 0 0
09/12/2015 1 0 0
02/12/2015 3 1 1
25/11/2015 2 0 0
18/11/2015 1 0 0
11/11/2015 7 1 3
04/11/2015 6 1 3
28/10/2015 5 1 2
21/10/2015 4 1 2
14/10/2015 3 1 1
07/10/2015 2 NaN NaN
30/09/2015 1 NaN NaN
I've written the following code to give the UpThisPeriod column, but I can't see how I would aggregate that to get the ConsecPeriodsUp column, or whether there is a way to do it in a single calculation that I'm missing.
import pandas as pd

def up_over_period(s):
    return s[0] >= s[-1]

df = pd.read_csv("test_data.csv")
period = 3  # one more than the number of weeks

df['UpThisPeriod'] = pd.rolling_apply(
    df['Close'],
    window=period,
    func=up_over_period,
).shift(-period + 1)

This can be done by adapting the groupby, shift and cumsum trick described in the Pandas Cookbook under "Grouping like Python's itertools.groupby". The main change is dividing by period - 1 and then using the ceil function to round the result up to the next integer.
from math import ceil
...
s = df['UpThisPeriod'][::-1]
df['ConsecPeriodsUp'] = (s.groupby((s != s.shift()).cumsum()).cumsum() / (period - 1)).apply(ceil)
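Note that pd.rolling_apply was removed in later pandas releases. A minimal sketch of the same calculation with the newer Series.rolling API (my adaptation, assuming pandas >= 1.0) would be:

import pandas as pd

def up_over_period(s):
    # with raw=True, s is a numpy array; compare the first and last Close values in the window
    return float(s[0] >= s[-1])

df = pd.read_csv("test_data.csv")
period = 3  # one more than the number of weeks

df['UpThisPeriod'] = (
    df['Close']
    .rolling(window=period)
    .apply(up_over_period, raw=True)
    .shift(-period + 1)
)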

Related

Having trouble with calculation on Dataframe

I am having trouble creating a couple new calculated columns to my Dataframe. Here is what I'm looking for:
Original DF:
Col_IN Col_OUT
5 2
1 2
2 2
3 0
3 1
What I want to add is two columns. The first is a running end-of-day total that takes the net of the current day plus the total of the day before. The second is 'Available Units', which factors in the previous day's end total plus the incoming units. Desired result:
Desired DF:
Col_IN Available_Units Col_OUT End_Total
5 5 2 3
1 4 2 2
2 4 2 2
3 5 0 5
3 8 1 7
It's a weird one - anybody have an idea? Thanks.
For the End_Total you can use np.cumsum and for Available Units you can use shift.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col_IN': [5, 1, 2, 3, 3],
    'Col_OUT': [2, 2, 2, 0, 1]
})

df['End_Total'] = np.cumsum(df['Col_IN'] - df['Col_OUT'])
df['Available_Units'] = df['End_Total'].shift().fillna(0) + df['Col_IN']
print(df)
will output
Col_IN Col_OUT End_Total Available_Units
0 5 2 3 5.0
1 1 2 2 4.0
2 2 2 2 4.0
3 3 0 5 5.0
4 3 1 7 8.0
Running totals are also known as cumulative sums, for which pandas has the cumsum() function.
The end totals can be calculated through the cumulative sum of incoming minus the cumulative sum of outgoing:
df["End_Total"] = df["Col_IN"].cumsum() - df["Col_OUT"].cumsum()
The available units can be calculated in the same way, if you shift the outgoing column one down:
df["Available_Units"] = df["Col_IN"].cumsum() - df["Col_OUT"].shift(1).fillna(0).cumsum()

pandas: How to remove "duplicate" rows whose C1 column is within a tolerance, and whose C2 column is maximal?

following my previous question:
I have a dataframe:
load,timestamp,timestr
0,1576147339.49,124219
0,1576147339.502,124219
2,1576147339.637,124219
1,1576147339.641,124219
9,1576147339.662,124219
8,1576147339.663,124219
7,1576147339.663,124219
6,1576147339.663,124219
5,1576147339.663,124219
4,1576147339.663,124219
3,1576147339.663,124219
2,1576147339.663,124219
1,1576147339.663,124219
0,1576147339.663,124219
0,1576147339.673,124219
3,1576147341.567,124221
2,1576147341.568,124221
1,1576147341.569,124221
0,1576147341.57,124221
4,1576147341.581,124221
3,1576147341.581,124221
I want to remove all rows that are within some tolerance from one another, in the 'timestamp' column except the one that has the largest 'load' column.
In the above example, if tolerance=0.01, that would leave us with
load,timestamp,timestr
0,1576147339.49,124219
0,1576147339.502,124219
2,1576147339.637,124219
9,1576147339.662,124219
0,1576147339.673,124219
3,1576147341.567,124221
4,1576147341.581,124221
The maximal value of 'load' doesn't have to be the 1st one!
The idea is to scale the timestamps by 1/tolerance, so that values more than the tolerance apart differ by more than 1, round them, and pass the result to groupby, aggregating with max:
tolerance = 0.01
df = df.groupby(df['timestamp'].mul(1/tolerance).round()).max().reset_index(drop=True)
print(df)
load timestamp timestr
0 0 1.576147e+09 124219
1 0 1.576147e+09 124219
2 2 1.576147e+09 124219
3 9 1.576147e+09 124219
4 0 1.576147e+09 124219
5 3 1.576147e+09 124221
6 4 1.576147e+09 124221
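To see how the grouping keys come out, a quick sketch for a few of the timestamps above:

tolerance = 0.01
for t in (1576147339.662, 1576147339.663, 1576147339.673):
    print(round(t * (1 / tolerance)))  # .662 and .663 share a key; .673 starts a new group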
Rounding is susceptible to the problem that two rows can have fractional parts such as 0.494 and 0.502: the first is rounded to 0.49 and the second to 0.50, so they end up in different groups even though they are less than 0.01 apart. So my proposition is to do the job (compute the result DataFrame) by iteration:
result = pd.DataFrame(columns=df.columns)
wrk = df.sort_values('timestamp')
threshold = 0.01

while wrk.index.size > 0:
    tMin = wrk.iloc[0, 1]                         # min timestamp
    grp = wrk[wrk.timestamp <= tMin + threshold]  # all rows within the threshold of it
    row = grp.nlargest(1, 'load')                 # max load
    result = result.append(row)
    wrk.drop(grp.index, inplace=True)
To confirm my initial remark, change the fractional part of timestamp
in the first row to 0.494.
For readability, I also "shortened" the integer part.
My solution returns:
load timestamp timestr
0 0 7339.494 124219
2 2 7339.637 124219
4 9 7339.662 124219
14 0 7339.673 124219
15 3 7341.567 124221
19 4 7341.581 124221
whereas the other solution returns:
load timestamp timestr
0 0 7339.494 124219
1 0 7339.502 124219
2 2 7339.641 124219
3 9 7339.663 124219
4 0 7339.673 124219
5 3 7341.570 124221
6 4 7341.581 124221
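As a side note, DataFrame.append was removed in pandas 2.0; a sketch of the same loop that collects the selected rows in a list and concatenates them at the end (assuming pandas >= 2.0) could look like this:

import pandas as pd

threshold = 0.01
wrk = df.sort_values('timestamp')
kept = []

while len(wrk) > 0:
    t_min = wrk['timestamp'].iloc[0]                  # smallest remaining timestamp
    grp = wrk[wrk['timestamp'] <= t_min + threshold]  # rows within the tolerance of it
    kept.append(grp.nlargest(1, 'load'))              # keep the row with the largest load
    wrk = wrk.drop(grp.index)

result = pd.concat(kept)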

Multiple Condition Apply Function that iterates over itself

So I have a Dataframe that is the same thing 348 times, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows that are within 20 miles using a lat/lon column and geopy.
My frame is like this:
What I am looking to do is something like an apply function that takes all of the identifying dates that are equal to the column and then run this:
geopy.distance.vincenty(x, y).miles
X would be the location's lat/lon and y would be the iterative lat/lon. I'd want the count of locations in which the above is < 20. I'd then like to store this count as a column in the initial Dataframe.
I'm ok with Pandas, but this is just outside my comfort zone. Thanks.
I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):
df
Index Number la ID
0 0 1 [43.3948, -23.9483] 1/1/90
1 1 2 [22.8483, -34.3948] 1/1/90
2 2 3 [44.9584, -14.4938] 1/1/90
3 3 4 [22.39458, -55.34924] 1/1/90
4 4 5 [33.9383, -23.4938] 1/1/90
5 5 6 [22.849, -34.397] 1/1/90
Now I introduce an artificial column which is only there to help us get the Cartesian product of the rows:
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
The next step is to apply the vincenty function via .apply and store the result in an extra column:
from geopy import distance

df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, axis=1)
Now we have the Cartesian product of the original frame, which means we also compare each city with itself. We will account for that in the next step by subtracting 1. We group by Index_x and count all distances smaller than 20 miles.
df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum(x.distance < 20)) - 1
df.drop('join', axis=1)
Index Number la ID num_close_cities
0 0 1 [43.3948, -23.9483] 1/1/90 0
1 1 2 [22.8483, -34.3948] 1/1/90 1
2 2 3 [44.9584, -14.4938] 1/1/90 0
3 3 4 [22.39458, -55.34924] 1/1/90 0
4 4 5 [33.9383, -23.4938] 1/1/90 0
5 5 6 [22.849, -34.397] 1/1/90 1
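As an aside, vincenty has since been removed from geopy; a minimal variant using geopy.distance.geodesic (assuming the la column holds [lat, lon] pairs as above) would be:

from geopy import distance

# same Cartesian-product frame as above, but with the geodesic distance
df_c['distance'] = df_c.apply(
    lambda x: distance.geodesic(x.la_x, x.la_y).miles, axis=1
)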

Using a function to calculate the frequency of columns in a dataframe (pandas)

For the following data set:
Index ADR EF INF SS
1 1 1 0 0
2 1 0 1 1
3 0 1 0 0
4 0 0 1 1
5 1 0 1 1
I want to calculate the frequency of the values in each column. This is my code:
df.ADR.value_counts()
df.EF.value_counts()
df.INF.value_counts()
df.SS.value_counts()
How I can do it by writing a function, rather than repeating the code for each column? I tried this:
def frequency(df, *arg):
    count = df.arg.value_counts()
    return count
But it does not work.
Assuming you want to calculate the frequency of all columns, rather than selectively, I don't recommend a custom function.
Try using df.apply, passing pd.value_counts:
In [1048]: df.apply(pd.value_counts, axis=0)
Out[1048]:
ADR EF INF SS
0 2 3 2 2
1 3 2 3 3
If you want to calculate selectively, you may pass a list of columns to a function:
def foo(df, columns):
    return df[columns].apply(pd.value_counts, axis=0)

print(foo(df, ['ADR', 'EF']))
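Given the data above, that prints just those two columns of the earlier table:
ADR EF
0 2 3
1 3 2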
If you only have the values 0 and 1:
Freq = pd.concat([(df == 0).sum(), (df == 1).sum()], axis=1)
Out[62]:
0 1
Index 0 1
ADR 2 3
EF 3 2
INF 2 3
SS 2 3
This will do the job:
def frequency(df, col_name):
    count = df[col_name].value_counts()
    return count
In the above function, you should enter the column name as a string. For example:
frequency(df,'ADR')
If you want to find the counts of all the columns, then it is better to use df.apply as suggested in #cᴏʟᴅsᴘᴇᴇᴅ's answer.

Boolean Check in a Pandas DataFrame based on Criteria at different Index values

I would like to calculate the number of instances in which two criteria are fulfilled in a Pandas DataFrame at different index values. A snippet of the DataFrame is:
GDP USRECQ
DATE
1947-01-01 NaN 0
1947-04-01 NaN 0
1947-07-01 NaN 0
1947-10-01 NaN 0
1948-01-01 0.095023 0
1948-04-01 0.107998 0
1948-07-01 0.117553 0
1948-10-01 0.078371 0
1949-01-01 0.034560 1
1949-04-01 -0.004397 1
I would like to count the number of observations for which USRECQ[DATE+1] == 1 and GDP[DATE] > a, provided GDP[DATE] is not NaN.
By referring to DATE+1 and DATE I mean that the value of USRECQ should be checked at the date subsequent to the one at which the value of GDP is examined. Unfortunately, I do not know how to deal with the different time indices in my selection. Can someone kindly advise me on how to count the number of instances properly?
One way of achieving this is to create a new column to show what the next value of 'USRECQ' is:
>>> df['USRECQ NEXT'] = df['USRECQ'].shift(-1)
>>> df
DATE GDP USRECQ USRECQ NEXT
0 1947-01-01 NaN 0 0
1 1947-04-01 NaN 0 0
2 1947-07-01 NaN 0 0
3 1947-10-01 NaN 0 0
4 1948-01-01 0.095023 0 0
5 1948-04-01 0.107998 0 0
6 1948-07-01 0.117553 0 0
7 1948-10-01 0.078371 0 1
8 1949-01-01 0.034560 1 1
9 1949-04-01 -0.004397 1 NaN
Then you could filter your DataFrame according to your requirements as follows:
>>> a = 0.01
>>> df[(df['USRECQ NEXT'] == 1) & (df['GDP'] > a) & (pd.notnull(df['GDP']))]
DATE GDP USRECQ USRECQ NEXT
7 1948-10-01 0.078371 0 1
8 1949-01-01 0.034560 1 1
To count the number of rows in a DataFrame, you can just use the built-in function len.
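For example, applying len to the filter above, with the same threshold a = 0.01, gives the count directly:
>>> len(df[(df['USRECQ NEXT'] == 1) & (df['GDP'] > a) & (pd.notnull(df['GDP']))])
2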
I think the DataFrame.shift method is the key to what you seek in terms of looking at the next index.
And Numpy's logical expressions can come in really handy for these sorts of things.
So if df is your dataframe then I think what you're looking for is something like
import numpy as np

# rows where the next quarter's USRECQ is 1 and this quarter's GDP is above the
# example threshold (-0.1 here); wrap the expression in len(...) to get the count
count = df[np.logical_and(df.shift(-1)['USRECQ'] == 1, df.GDP > -0.1)]
The example I used to test this is on github.
