How to vectorize code with nested ifs and loops in Python?

I have a dataframe like the one given below:
df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
    'day': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    'PEEP': [7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]
})
df['fake_flag'] = ''
I am performing the operation shown in the code below. It works and produces the expected output, but it is too slow to use on my real dataset, which has more than a million records.
t1 = df['PEEP']
for i in t1.index:
    if i >= 2:
        print("current value is ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        if (t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]):
            r1_output = t1[i-2]  # the max of the two values; when they are constant (equal) either one gives the same result
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check for next value is same as current value", t1[i+1])
                if (t1[i] == t1[i+1]):
                    print("fake flag is being set")
                    df['fake_flag'][i] = 'fake_vac'
I am learning Python; can you help me understand how to vectorize this code? You can refer to this related post to understand the logic. Since I have the logic right, I created this post mainly to seek help in vectorizing and speeding up my code.
I expect my output to look like the below
subject_id = 1
subject_id = 2
Is there an efficient and elegant way to speed up this operation for a dataset with a million records?

Not sure what the story behind this is, but you can certainly vectorize the three if conditions independently and combine them:
t1 = df['PEEP']
con1 = t1.shift(2).ge(t1.shift(1))
con2 = t1.ge(t1.shift(2).add(3))
con3 = t1.eq(t1.shift(-1))
df['fake_flag'] = np.where(con1 & con2 & con3, 'fake VAC', '')
Edit (Groupby SubjectID)
con = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))
df['fake_flag'] = df.groupby('subject_id')['PEEP'].transform(con).map({True:'fake VAC',False:''})
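As a check, here is a minimal self-contained sketch of the grouped version on a small made-up dataset (the values below are illustrative, not the asker's data):

```python
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1] * 6 + [2] * 6,
    'PEEP': [5, 5, 9, 9, 6, 6, 7, 7, 8, 8, 10, 10],
})

# Per group: the n-2 value >= the n-1 value, the current value exceeds
# the n-2 value by at least 3, and the next value equals the current one.
con = lambda x: (x.shift(2).ge(x.shift(1))
                 & x.ge(x.shift(2).add(3))
                 & x.eq(x.shift(-1)))
df['fake_flag'] = (df.groupby('subject_id')['PEEP']
                     .transform(con)
                     .map({True: 'fake VAC', False: ''}))
```

Only row 2 of the first subject satisfies all three conditions here (5 >= 5, 9 >= 5 + 3, and the following value is also 9).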

Does this work?
df.groupby('subject_id')\
  .rolling(3)['PEEP'].apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True).fillna(0).astype(bool)
Output:
subject_id
1   0     False
    1     False
    2      True
    3     False
    4     False
    5     False
    6      True
    7     False
    8      True
    9     False
    10     True
    11    False
    12    False
    13    False
    14    False
    15    False
    16    False
    17    False
    18     True
    19    False
2   20    False
    21    False
    22    False
    23    False
    24    False
    25    False
    26     True
    27    False
    28    False
    29    False
    30    False
    31    False
    32     True
    33    False
    34     True
    35    False
    36    False
    37     True
    38    False
    39    False
Name: PEEP, dtype: bool
Details:
Use groupby to break the data up by 'subject_id'.
Apply rolling with a window size of three (n=3).
Look at the last value in each window using -1 indexing and subtract the maximum of the first two values in that window using index slicing.
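To make the rolling approach concrete, here is a small self-contained sketch on made-up data (the values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'PEEP':       [5, 5, 9, 9, 7, 7, 8, 8],
})

# Within each subject, check whether the last value of each 3-row window
# exceeds the max of the two preceding values by at least 3.
out = (df.groupby('subject_id')
         .rolling(3)['PEEP']
         .apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True)
         .fillna(0)
         .astype(bool))
```

The result is indexed by (subject_id, original row); only the window [5, 5, 9] of the first subject passes here, since 9 - 5 >= 3.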

Related

Mark True from conditions satisfy on two consecutive values till another two consecutive values

I have a float column in a dataframe, and I want to add another boolean column that is True from where a condition holds on two consecutive values until another condition holds on the next two consecutive values.
For Example I have a data-frame which look like this:
index  Values %
0      0
1      5
2      11
3      9
4      14
5      18
6      30
7      54
8      73
9      100
10     100
11     100
12     100
13     100
Now I want to mark True from where two consecutive values satisfy the condition df['Values %'] >= 10, until the next two consecutive values satisfy the next condition, i.e. df['Values %'] == 100.
So the final result will look something like this:
index  Values %  Flag
0      0         False
1      5         False
2      11        False
3      9         False
4      14        False
5      18        True
6      30        True
7      54        True
8      73        True
9      100       True
10     100       True
11     100       False
12     100       False
13     100       False
Not sure how exactly the second part of your question is supposed to work, but here is how to achieve the first part.
example data
s = pd.Series([0,5,11,9,14,18,2,14,16,18])
solution
# create true/false series for first condition and take cumulative sum
x = (s >= 10).cumsum()
# compare each element of x with 2 elements before. There will be a difference of 2 for elements which belong to streak of 2 or more True
condition = x - x.shift(2) == 2
condition looks like this
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
dtype: bool
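For the second part, one possible way to combine the two detectors is sketched below; it reuses the same cumulative-sum trick for both conditions and assumes a single on/off span, as in the question's sample data. The flag turns on at the start signal and off on the row after the stop signal:

```python
import pandas as pd

# The question's full example data.
s = pd.Series([0, 5, 11, 9, 14, 18, 30, 54, 73,
               100, 100, 100, 100, 100])

# Streak-of-two detector: the cumulative sum of a boolean mask rises by
# exactly 2 over two steps only inside a run of two or more Trues.
def two_in_a_row(mask):
    c = mask.cumsum()
    return c - c.shift(2) == 2

start = two_in_a_row(s >= 10)   # two consecutive values >= 10 seen
stop = two_in_a_row(s == 100)   # two consecutive values == 100 seen

# On from the first start signal, off again after the first stop signal.
flag = start.cummax() & ~stop.cummax().shift(fill_value=False)
```

On this data the flag is True for indices 5 through 10, matching the expected output in the question; a multi-span version would need a more careful pairing of starts and stops.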
I have a rather inefficient way of doing this. It's not vectorised, so not ideal, but it works:
# Convert the values column to a 1D NumPy array for ease of use.
values = df["Values %"].tolist()
values_np = np.array(values)

# Initialize a flags array of the same size as values_np, all zeros.
# Uses the int form of booleans, i.e. 0 = False and 1 = True.
flags = np.zeros((values_np.shape[0]), dtype=int)

# Iterate from the 1st (not 0th) row to the last row.
for i in range(1, values_np.shape[0]):
    # First set the flag to 1 (True) if consecutive values are both >= 10.
    if values_np[i] >= 10 and values_np[i-1] >= 10:
        flags[i] = 1
    # Then if consecutive values are both >= 100, set the flag back to 0 (False).
    if values_np[i] >= 100 and values_np[i-1] >= 100:
        flags[i] = 0

# Turn flags into boolean form (i.e. convert 0 and 1 to False and True).
flags = flags.astype(bool)
# Add flags as a new column in df.
df["Flags"] = flags
One thing -- my method gives False for row 10, because both row 9 and row 10 >= 100. If this is not what you wanted, let me know and I can change it so that the flag is True only if the previous two values and the current value (3 consecutive values) are all >= 100.
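For reference, a vectorized sketch of the same pairwise checks (same rules as the loop above: two consecutive values >= 10 set the flag, two consecutive values >= 100 clear it), run on the question's sample values:

```python
import pandas as pd

df = pd.DataFrame({"Values %": [0, 5, 11, 9, 14, 18, 30, 54, 73,
                                100, 100, 100, 100, 100]})
v = df["Values %"]

set_mask = (v >= 10) & (v.shift(1) >= 10)      # current and previous both >= 10
clear_mask = (v >= 100) & (v.shift(1) >= 100)  # current and previous both >= 100
df["Flags"] = set_mask & ~clear_mask
```

Like the loop, this gives False for row 10, because rows 9 and 10 are both >= 100.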

Is there a way to see how many cells / rows were selected when using pandas loc?

I am running a script that highlights specific cells in a dataframe when the conditions below are met, so that I can export it to an Excel spreadsheet. I want to know how many cells were actually selected and highlighted by the loc calls in the code below. Is there a way to do this, or would a for loop be the only way to count them? If a for loop is necessary, how would I write one that goes through all rows of the dataframe and updates both the highlighting and the count?
a = ['P1','P2']
b = ['P3','P4']
p1 = ((x['priority'].str.contains('P1')) | (x['priority'].str.contains('P2'))) & (x['days_in_status'] > 7)
p2 = ((x['priority'].str.contains('P3')) | (x['priority'].str.contains('P4'))) & (x['days_in_status'] > 14)
p3 = (x['days_since_creation'] > 60)
df.loc[p1,['priority','days_in_status']] = 'background-color:red'
df.loc[p2,['priority','days_in_status']] = 'background-color:orange'
df.loc[p3,'days_since_creation'] = 'background-color:yellow'
df['key'] = 'color:blue;text-decoration:underline'
p1, p2 and p3 are boolean Series, i.e. they contain True or False.
Since True is also interpreted as 1, you can do the following:
(p1+p2+p3).sum()
Here a short example:
import pandas as pd
df = pd.DataFrame({'test':range(1,21)})
p1 = df['test'] > 15
p2 = df['test'] < 5
p3 = df['test'] == 7
(p1+p2+p3)
# Out[181]:
# 0 True
# 1 True
# 2 True
# 3 True
# 4 False
# 5 False
# 6 True
# 7 False
# 8 False
# 9 False
# 10 False
# 11 False
# 12 False
# 13 False
# 14 False
# 15 True
# 16 True
# 17 True
# 18 True
# 19 True
# Name: test, dtype: bool
(p1+p2+p3).sum()
# Out[182]: 10
Alternatively, you can also just compare the dataframe before you made any changes with the dataframe after you made changes; see this example:
import pandas as pd
df = pd.DataFrame({'test':list(range(1,10))+[1,2,3]})
df1 = pd.DataFrame({'test':list(range(1,10))+[3,2,1]})
df.compare(df1)
# self other
# 9 1.0 3.0
# 11 3.0 1.0

Looking for on/off signals or value-pairs across adjacent rows in pandas dataframe

I have a df containing rows (sometimes thousands) of data, corresponding to a digital signal. I have added an extra column using:
df['On/Off'] = np.where(df[col] > value, 'On', 'Off')
to label the signal as being on or off (value is set depending on the signal source). The following code gives an example dataframe albeit without actual measurement data:
df = pd.DataFrame({"Time/s": np.arange(0, 100, 2),
                   "On/Off": "Off"})
df.loc[10:13, "On/Off"] = "On"
df.loc[40:43, "On/Off"] = "On"
df.loc[47:, "On/Off"] = "On"
I want to count how many times the signal registers as being on. For the above code, the result would be 2 (ideally with an index returned).
Given how the dataframe is organised, I think going down the rows and looking for pairs where the On/Off column reads 'On' at row n and 'Off' at row n+1 should be the approach, as in this pseudocode:
i = 0  # <--- number of on/off pairings
if row_n == 'On' and row_n+1 == 'Off':
    i += 1
My current plan came from an answer for this (Pandas iterate over DataFrame row pairs)
I think df.shift() offers a potential route, generating 2 dataframes, and then comparing rows for mismatches, but it feels there could be a simpler way, possibly using itertools, or pd.iterrows (etc.).
As usual, any help is greatly appreciated.
Use shift with eq (==) to compare values, chain both boolean masks with &, and finally count the Trues with sum:
out = (df['On/Off'].shift(-1).eq('Off') & df['On/Off'].eq('On')).sum()
Another solution:
out = (df['On/Off'].shift().eq('On') & df['On/Off'].eq('Off')).sum()
print(out)
2
Detail:
print ((df['On/Off'].shift().eq('On') & df['On/Off'].eq('Off')))
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 False
31 False
32 False
33 False
34 False
35 False
36 False
37 False
38 False
39 False
40 False
41 False
42 False
43 False
44 True
45 False
46 False
47 False
48 False
49 False
Name: On/Off, dtype: bool
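Putting the answer together as a runnable sketch (using .loc rather than .at for the slice assignments, and also returning the indices where each On block ends):

```python
import numpy as np
import pandas as pd

# Rebuild the example signal (.loc accepts label slices).
df = pd.DataFrame({"Time/s": np.arange(0, 100, 2),
                   "On/Off": "Off"})
df.loc[10:13, "On/Off"] = "On"
df.loc[40:43, "On/Off"] = "On"
df.loc[47:, "On/Off"] = "On"

# An Off row whose predecessor was On marks the end of a completed On block.
ends = df["On/Off"].shift().eq("On") & df["On/Off"].eq("Off")
print(ends.sum())               # 2 completed blocks
print(df.index[ends].tolist())  # [14, 44]
```

Note the final block starting at row 47 runs to the end of the frame and never switches off, so it is not counted.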

How to count trues and falses in a two column data frame?

Here is my code:
pizzarequests = pd.Series(open('pizza_requests.txt').read().splitlines())
line = "unix_timestamp_of_request_utc"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1]
print(lines)
dts = pd.to_datetime(lines, unit='s')
hours = dts.dt.hour
print(hours)
pizzarequests = pd.Series(open('pizza_requests.txt').read().splitlines())
line = "requester_received_pizza"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1]
data = pd.DataFrame({'houroftheday' : hours.values, 'successpizza' : lines})
print(data)
Which gives me:
houroftheday successpizza
23 18 true
67 2 true
105 14 true
166 23 true
258 20 true
297 1 true
340 2 true
385 22 true
...
304646 21 false
304686 12 false
304746 1 false
304783 3 false
304840 20 false
304907 17 false
304948 1 false
305023 4 false
How can I sum the hours that only correspond to the trues?
First filter all rows by Trues in column successpizza and then sum column houroftheday:
sum_hour = data.loc[data['successpizza'] == 'true', 'houroftheday'].sum()
print (sum_hour)
102
If you only need the number of Trues, use sum, since each True is counted as 1:
len_hour = (data['successpizza'] == 'true').sum()
print (len_hour)
8
Or if you need the count for each houroftheday:
mask = (data['successpizza'] == 'true').astype(int)
out = mask.groupby(data['houroftheday']).sum()
print (out)
houroftheday
1 1
2 2
3 0
12 0
14 1
18 1
20 1
21 0
22 1
23 1
Name: successpizza, dtype: int32
To remove trailing whitespace, use str.strip:
line = "requester_received_pizza"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1].str.strip()
I think you want a count of the occurrences of each hour where successpizza is true. If so, you will want to slice the data frame using successpizza, then group by the houroftheday column and aggregate with a count.
It also looks like you are reading in the true/false values from a file, so they are strings. You will need to convert them first.
data.successpizza = data.successpizza.apply(lambda x: x=='true')
data[data.successpizza].groupby('houroftheday').count()
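A self-contained sketch of that approach; the rows below are a hypothetical stand-in for the parsed file, since the actual pizza_requests.txt isn't available:

```python
import pandas as pd

# Hypothetical stand-in for the parsed pizza_requests.txt data.
data = pd.DataFrame({
    'houroftheday': [18, 2, 14, 23, 21, 12, 1, 4],
    'successpizza': ['true', 'true', 'true', 'true',
                     'false', 'false', 'false', 'false'],
})

# Convert the strings to booleans, then slice and count per hour.
data['successpizza'] = data['successpizza'].eq('true')
counts = data[data['successpizza']].groupby('houroftheday').size()
```

counts is then a Series indexed by hour, holding the number of successful requests in each hour.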

How to iterate through a dictionary pandas where values are tuples, and find first True and False values

I have a dictionary called weeks_adopted where, when I run iteritems() and print the values, I get the example below (values for 3 keys; each key is called app_id). The keys are of type <type 'str'> and the values are of <class 'pandas.core.series.Series'> with dtype bool, where the index is basically the week of the year referred to (weeks 0-13, in order):
Name: app_id_str, dtype: bool
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 True
10 False
11 False
12 True
13 False
Name: app_id_str, dtype: bool
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
Name: app_id_str, dtype: bool
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 True
12 True
13 True
What I want to do is calculate, for each key, the number of rows from the first True value through to the first subsequent False value, obviously accounting for cases like the third example, where the first True comes after earlier Falses. Basically this is to do with drop-out rates: when does a user first see something (True) and then give it up (False).
For the three examples above, the adoption result should be 1, 1 and 3 respectively.
Here is my current basic method:
for key, value in weeks_adopted.iteritems():
    start = value.index(True)
    end = value.index(False)
    adoption = end - start
    weeks_adopted[key] = adoption
However I get this error even with this method:
TypeError Traceback (most recent call last)
<ipython-input-32-608c4f533e54> in <module>()
19 for key,value in weeks_adopted.iteritems():
20 print value
---> 21 start= value.index(True)
22 end = value.index(False)
23 adoption=end-start
TypeError: 'Int64Index' object is not callable
In your answer, could you please help me with what other checks I need to do to find the first True and first False values? I presume this type of loop is a common one in many situations?
You can try this:
def calc_adoption(ts):
    true_index = ts[ts].index
    if len(true_index) == 0:
        return 0
    first_true_index = true_index[0]
    false_index = ts.index.difference(true_index)
    false_index = false_index[false_index > first_true_index]
    if len(false_index) == 0:
        return 14 - first_true_index
    return false_index[0] - first_true_index
adopted_weeks = {k: calc_adoption(v) for k, v in weeks_adopted.iteritems()}
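A usage sketch, repeating the function so the snippet is self-contained; the series reproduces the third example from the question (True at weeks 11-13, never False afterwards, so the span runs to the end of the 14 tracked weeks):

```python
import pandas as pd

# The answer's function, repeated here so the snippet runs on its own.
def calc_adoption(ts):
    true_index = ts[ts].index
    if len(true_index) == 0:
        return 0
    first_true_index = true_index[0]
    false_index = ts.index.difference(true_index)
    false_index = false_index[false_index > first_true_index]
    if len(false_index) == 0:
        return 14 - first_true_index
    return false_index[0] - first_true_index

# Third example: True at weeks 11-13, no later False.
s = pd.Series([False] * 11 + [True, True, True])
print(calc_adoption(s))  # 14 - 11 = 3
```

The second example (a single True at week 3 followed by a False at week 4) gives 1, matching the expected results in the question.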
