How to vectorize code with nested ifs and loops in Python?
I have a dataframe like the one below:
import pandas as pd

df = pd.DataFrame({
    'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
    'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]
})
df['fake_flag'] = ''
I am performing the operation shown in the code below. It works and produces the expected output on this sample.
t1 = df['PEEP']
for i in t1.index:
    if i >= 2:
        print("current value is ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        if t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]:
            # max of the two values; when they are equal it doesn't matter
            # which one we take (t1[i-2] and t1[i-1] have the same value)
            r1_output = t1[i-2]
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check if next value is same as current value", t1[i+1])
                if t1[i] == t1[i+1]:
                    print("fake flag is being set")
                    df['fake_flag'][i] = 'fake_vac'
However, I can't apply this to my real data, which has more than a million records. I am learning Python; can you help me understand how to vectorize this code?
You can refer to this related post to understand the logic. Since I have the logic right, I created this question mainly to get help vectorizing and speeding up my code.
I expect my output to look like the one shown below:
subject_id = 1
subject_id = 2
Is there an efficient and elegant way to speed up this operation for a dataset with a million records?
Not sure what the story is behind this, but you can certainly vectorize the three if conditions independently and combine them:
import numpy as np

t1 = df['PEEP']  # as in the question

con1 = t1.shift(2).ge(t1.shift(1))   # rule 1: value at n-2 >= value at n-1
con2 = t1.ge(t1.shift(2).add(3))     # rule 2: current >= (n-2) + 3
con3 = t1.eq(t1.shift(-1))           # next value equals the current one

df['fake_flag'] = np.where(con1 & con2 & con3, 'fake VAC', '')
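To see the three masks in action, here is a minimal, self-contained sketch on a short PEEP-like series (the values are illustrative, not the full frame from the question):

```python
import numpy as np
import pandas as pd

# Short PEEP-like series (illustrative values only)
t1 = pd.Series([7, 5, 10, 10, 11, 11, 14, 14])

con1 = t1.shift(2).ge(t1.shift(1))   # rule 1: value at n-2 >= value at n-1
con2 = t1.ge(t1.shift(2).add(3))     # rule 2: current >= (n-2) + 3
con3 = t1.eq(t1.shift(-1))           # next value equals the current one

flag = np.where(con1 & con2 & con3, 'fake VAC', '')
# flag -> ['', '', 'fake VAC', '', '', '', 'fake VAC', '']
```

Positions 2 and 6 are flagged: both jump by at least 3 over a non-increasing pair and are repeated by the next value.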
Edit (Groupby SubjectID)
con = lambda x: (x.shift(2).ge(x.shift(1))
                 & x.ge(x.shift(2).add(3))
                 & x.eq(x.shift(-1)))

df['fake_flag'] = (df.groupby('subject_id')['PEEP']
                     .transform(con)
                     .map({True: 'fake VAC', False: ''}))
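Why the groupby matters: a plain shift carries values across subject boundaries, while shifting inside the groups restarts at each subject. A small sketch with made-up values illustrates the difference:

```python
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2, 2],
    'PEEP':       [5, 6, 7, 8, 9, 10],
})

# A plain shift leaks subject 1's last value into subject 2's first row...
plain = df['PEEP'].shift(1)

# ...while a grouped shift restarts at each subject boundary (NaN again).
grouped = df.groupby('subject_id')['PEEP'].shift(1)
```

Here `plain` puts 7 (subject 1's last reading) at subject 2's first row, whereas `grouped` puts NaN there, so no spurious flags straddle two subjects.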
Does this work?
df.groupby('subject_id')\
  .rolling(3)['PEEP']\
  .apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True)\
  .fillna(0).astype(bool)
Output:
subject_id
1           0     False
            1     False
            2      True
            3     False
            4     False
            5     False
            6      True
            7     False
            8      True
            9     False
            10     True
            11    False
            12    False
            13    False
            14    False
            15    False
            16    False
            17    False
            18     True
            19    False
2           20    False
            21    False
            22    False
            23    False
            24    False
            25    False
            26     True
            27    False
            28    False
            29    False
            30    False
            31    False
            32     True
            33    False
            34     True
            35    False
            36    False
            37     True
            38    False
            39    False
Name: PEEP, dtype: bool
Details:
Use groupby to break the data up by 'subject_id'.
Apply rolling with a window size of three (n=3).
Look at the last value in each window (index -1) and subtract the maximum of the first two values in that window (index slicing x[:2]).
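Putting those steps together, here is a minimal sketch of the rolling-window check on a short illustrative series (values are made up):

```python
import pandas as pd

s = pd.Series([7, 5, 10, 10, 11])

# For each length-3 window, subtract the max of the first two values
# from the last value; raw=True hands the lambda a plain NumPy array.
jumps = (s.rolling(3)
          .apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True)
          .fillna(0)       # the first two windows are incomplete -> NaN
          .astype(bool))
```

Only the window [7, 5, 10] triggers (10 − 7 ≥ 3), so the result is True at position 2 and False elsewhere.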
Related
Mark True from conditions satisfy on two consecutive values till another two consecutive values
I have a float column in a dataframe, and I want to add a boolean column which will be True from where two consecutive values satisfy one condition until two consecutive values satisfy another condition. For example, I have a dataframe which looks like this:

index  Values %
0      0
1      5
2      11
3      9
4      14
5      18
6      30
7      54
8      73
9      100
10     100
11     100
12     100
13     100

Now I want to mark True from where two consecutive values satisfy the condition df['Values %'] >= 10 till the next two consecutive values satisfy the condition df['Values %'] == 100. So the final result will look something like this:

index  Values %  Flag
0      0         False
1      5         False
2      11        False
3      9         False
4      14        False
5      18        True
6      30        True
7      54        True
8      73        True
9      100       True
10     100       True
11     100       False
12     100       False
13     100       False
Not sure how exactly the second part of your question is supposed to work, but here is how to achieve the first part.

Example data:

s = pd.Series([0,5,11,9,14,18,2,14,16,18])

Solution:

# create a true/false series for the first condition and take the cumulative sum
x = (s >= 10).cumsum()

# compare each element of x with the element 2 positions before; the difference
# is exactly 2 for elements that belong to a streak of 2 or more Trues
condition = x - x.shift(2) == 2

condition looks like this:

0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8     True
9     True
dtype: bool
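The cumsum trick can be sketched end to end; this self-contained block rebuilds the answer's example series so it runs on its own:

```python
import pandas as pd

s = pd.Series([0, 5, 11, 9, 14, 18, 2, 14, 16, 18])

# Running count of rows meeting the condition so far
x = (s >= 10).cumsum()

# If the count grew by exactly 2 over the last two rows, both of those
# rows met the condition, i.e. this row closes a streak of >= 2 Trues.
streak_end = (x - x.shift(2)) == 2
```

The mask is True at positions 5, 8 and 9, matching the answer's printed output: positions 4-5 and 7-9 are where at least two consecutive values are >= 10.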
I have a rather inefficient way of doing this. It's not vectorised, so not ideal, but it works:

# Convert the values column to a 1D NumPy array for ease of use.
values = df["Values %"].tolist()
values_np = np.array(values)

# Initialize a flags array of the same size as values_np, initially all 0s.
# Uses the int form of booleans, i.e. 0 = False and 1 = True.
flags = np.zeros((values_np.shape[0]), dtype=int)

# Iterate from the 1st (not 0th) row to the last row.
for i in range(1, values_np.shape[0]):
    # First set the flag to 1 (True) if two consecutive values are both >= 10.
    if values_np[i] >= 10 and values_np[i-1] >= 10:
        flags[i] = 1
    # Then if two consecutive values are both >= 100, set the flag to 0 (False).
    if values_np[i] >= 100 and values_np[i-1] >= 100:
        flags[i] = 0

# Turn flags into boolean form (i.e. convert 0 and 1 to False and True).
flags = flags.astype(bool)

# Add flags as a new column in df.
df["Flags"] = flags

One thing -- my method gives False for row 10, because both row 9 and row 10 are >= 100. If this is not what you wanted, let me know and I can change it so that the flag is True only if the previous two values and the current value (3 consecutive values) are all >= 100.
Is there a way to see how many cells / rows were selected when using pandas loc?
I am running a script that needs to highlight specific cells in a dataframe if the conditions below are met, so that I can export it to an Excel spreadsheet, but I want to know how many of the cells were actually selected and highlighted using loc in the code below. Is there a way to do this, or would a for loop be the only way to count how many were selected? If a for loop is necessary, how would I write one that parses all rows of the dataframe and still updates the highlighting and the count at the same time?

a = ['P1','P2']
b = ['P3','P4']

p1 = ((x['priority'].str.contains('P1')) | (x['priority'].str.contains('P2'))) & (x['days_in_status'] > 7)
p2 = ((x['priority'].str.contains('P3')) | (x['priority'].str.contains('P4'))) & (x['days_in_status'] > 14)
p3 = (x['days_since_creation'] > 60)

df.loc[p1,['priority','days_in_status']] = 'background-color:red'
df.loc[p2,['priority','days_in_status']] = 'background-color:orange'
df.loc[p3,'days_since_creation'] = 'background-color:yellow'
df['key'] = 'color:blue;text-decoration:underline'
p1, p2 and p3 are Series containing booleans, i.e. True or False. Since True is also interpreted as 1, you can do the following:

(p1+p2+p3).sum()

Here is a short example:

import pandas as pd

df = pd.DataFrame({'test': range(1,21)})
p1 = df['test'] > 15
p2 = df['test'] < 5
p3 = df['test'] == 7

(p1+p2+p3)
# 0      True
# 1      True
# 2      True
# 3      True
# 4     False
# 5     False
# 6      True
# 7     False
# 8     False
# 9     False
# 10    False
# 11    False
# 12    False
# 13    False
# 14    False
# 15     True
# 16     True
# 17     True
# 18     True
# 19     True
# Name: test, dtype: bool

(p1+p2+p3).sum()
# 10

Alternatively, you can just compare the dataframe before you made any changes with the dataframe after the changes, see this example:

import pandas as pd

df = pd.DataFrame({'test': list(range(1,10)) + [1,2,3]})
df1 = pd.DataFrame({'test': list(range(1,10)) + [3,2,1]})
df.compare(df1)
#     self  other
# 9    1.0    3.0
# 11   3.0    1.0
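Since the question asks about cells as well as rows: one option is to count the matching rows and multiply by the number of highlighted columns. A minimal sketch with made-up data, where `isin` stands in for the chained str.contains checks from the question:

```python
import pandas as pd

# Hypothetical rows in the shape of the question's frame
df = pd.DataFrame({'priority': ['P1', 'P3', 'P2', 'P4'],
                   'days_in_status': [10, 20, 3, 30]})

p1 = df['priority'].isin(['P1', 'P2']) & (df['days_in_status'] > 7)
p2 = df['priority'].isin(['P3', 'P4']) & (df['days_in_status'] > 14)

rows_selected = int((p1 | p2).sum())
# Each matching row highlights the two columns, so:
cells_selected = rows_selected * 2
```

With these values, rows 0, 1 and 3 match, giving 3 highlighted rows and 6 highlighted cells.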
Looking for on/off signals or value-pairs across adjacent rows in pandas dataframe
I have a df containing rows (sometimes thousands) of data corresponding to a digital signal. I have added an extra column using:

df['On/Off'] = np.where(df[col] > value, 'On', 'Off')

to label the signal as being on or off (value is set depending on the signal source). The following code gives an example dataframe, albeit without actual measurement data:

df = pd.DataFrame({"Time/s" : np.arange(0,100,2), "On/Off" : ("Off")})
df.at[10:13,"On/Off"] = "On"
df.at[40:43,"On/Off"] = "On"
df.at[47:,"On/Off"] = "On"

I want to count how many times the signal registers as being on. For the above code, the result would be 2 (ideally with an index returned). Given how the dataframe is organised, I think going down the rows and looking for pairs of rows where the On/Off column reads 'On' at row n and 'Off' at row n+1 should be the approach, as in this pseudocode:

i = 0  # number of on/off pairings
if cycle = [row_n]='On'; [row_n+1]='Off':
    i += 1

My current plan came from an answer to this question (Pandas iterate over DataFrame row pairs). I think df.shift() offers a potential route, generating 2 dataframes and then comparing rows for mismatches, but it feels like there could be a simpler way, possibly using itertools or pd.iterrows, etc. As usual, any help is greatly appreciated.
Use shift with eq (==) to compare values, chain both boolean masks with &, and finally count the Trues with sum:

out = (df['On/Off'].shift(-1).eq('Off') & df['On/Off'].eq('On')).sum()

Another solution:

out = (df['On/Off'].shift().eq('On') & df['On/Off'].eq('Off')).sum()

print (out)
2

Detail:

print (df['On/Off'].shift().eq('On') & df['On/Off'].eq('Off'))
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44     True
45    False
46    False
47    False
48    False
49    False
Name: On/Off, dtype: bool
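The same mask can also return the indices the asker wanted. A self-contained sketch using the question's example frame (with .loc for the slice assignments, since .at is meant for single cells):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Time/s": np.arange(0, 100, 2), "On/Off": "Off"})
df.loc[10:13, "On/Off"] = "On"
df.loc[40:43, "On/Off"] = "On"
df.loc[47:, "On/Off"] = "On"

# An On row followed by an Off row marks the end of an on-period
ends = df["On/Off"].eq("On") & df["On/Off"].shift(-1).eq("Off")

n_on_periods = int(ends.sum())        # -> 2
on_end_indices = list(df.index[ends]) # -> [13, 43]
```

Note the final on-period (rows 47 to the end) never switches off, so counting On-to-Off transitions gives 2, matching the expected result in the question.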
How to count trues and falses in a two-column dataframe?
Here is my code:

pizzarequests = pd.Series(open('pizza_requests.txt').read().splitlines())
line = "unix_timestamp_of_request_utc"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1]
print(lines)
dts = pd.to_datetime(lines, unit='s')
hours = dts.dt.hour
print(hours)

pizzarequests = pd.Series(open('pizza_requests.txt').read().splitlines())
line = "requester_received_pizza"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1]

data = pd.DataFrame({'houroftheday' : hours.values, 'successpizza' : lines})
print(data)

Which gives me:

        houroftheday successpizza
23                18         true
67                 2         true
105               14         true
166               23         true
258               20         true
297                1         true
340                2         true
385               22         true
...
304646            21        false
304686            12        false
304746             1        false
304783             3        false
304840            20        false
304907            17        false
304948             1        false
305023             4        false

How can I sum the hours that only correspond to the trues?
First filter the rows by 'true' in column successpizza and then sum column houroftheday:

sum_hour = data.loc[data['successpizza'] == 'true', 'houroftheday'].sum()
print (sum_hour)
102

If you only need the number of Trues, use sum on the mask itself; Trues are processed like 1:

len_hour = (data['successpizza'] == 'true').sum()
print (len_hour)
8

Or if you need the count for each houroftheday:

mask = (data['successpizza'] == 'true').astype(int)
out = mask.groupby(data['houroftheday']).sum()
print (out)
houroftheday
1     1
2     2
3     0
12    0
14    1
18    1
20    1
21    0
22    1
23    1
Name: successpizza, dtype: int32

A solution for removing trailing whitespace is str.strip:

line = "requester_received_pizza"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1].str.strip()
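A compact way to combine both steps is to build one boolean mask up front (with str.strip to guard against stray whitespace). A sketch on the rows visible in the question's printout:

```python
import pandas as pd

# The rows visible in the question's printout (index simplified)
data = pd.DataFrame({
    'houroftheday': [18, 2, 14, 23, 20, 1, 2, 22, 21, 12],
    'successpizza': ['true', 'true', 'true', 'true', 'true',
                     'true', 'true', 'true', 'false', 'false'],
})

# One boolean mask up front; the strings stay strings, the mask is real bools
mask = data['successpizza'].str.strip().eq('true')

sum_hour = int(data.loc[mask, 'houroftheday'].sum())  # total of hours over the trues
n_true = int(mask.sum())                              # how many trues
```

On these rows this reproduces the answer's numbers: 102 for the hour total and 8 trues.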
I think you want a count of the occurrences of each hour where successpizza is true. If so, you will want to slice the dataframe using successpizza, then group by the houroftheday column and aggregate using a count.

It also looks like you are reading the true/false values in from a file, so they are strings. You will need to convert them first:

data.successpizza = data.successpizza.apply(lambda x: x == 'true')
data[data.successpizza].groupby('houroftheday').count()
How to iterate through a dictionary of pandas Series and find the first True and False values
I have a dictionary called weeks_adopted where the keys are of type <type 'str'> (each key is an app_id) and the values are of type <class 'pandas.core.series.Series'> with dtype bool. The indices of each Series are the weeks referred to (weeks 0-13 of the year, in order). When I run iteritems() and print the values, I get (example of the values for 3 keys):

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11    False
12     True
13    False
Name: app_id_str, dtype: bool

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
Name: app_id_str, dtype: bool

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
12     True
13     True
Name: app_id_str, dtype: bool

What I want to do is calculate, for each key, the number of rows from the first True value through to the first False value after it, obviously accounting for edge cases; for example, in the 3rd series the first True comes after the first False. Basically this is to do with drop-out rates: when does a user first see something (True) and then give it up (False)? For the three series above, the results should be 1, 1 and 3 in terms of adoption.

Here is my current basic method:

for key, value in weeks_adopted.iteritems():
    start = value.index(True)
    end = value.index(False)
    adoption = end - start
    weeks_adopted[key] = adoption

However, I get this error even with this method:

TypeError                                 Traceback (most recent call last)
<ipython-input-32-608c4f533e54> in <module>()
     19 for key,value in weeks_adopted.iteritems():
     20     print value
---> 21     start= value.index(True)
     22     end = value.index(False)
     23     adoption=end-start

TypeError: 'Int64Index' object is not callable

In your answer, could you please help me understand what other checks I need to be doing to find the first True and first False values?
I am presuming this type of loop is a common one for many situations?
You can try this:

def calc_adoption(ts):
    true_index = ts[ts].index
    if len(true_index) == 0:
        return 0
    first_true_index = true_index[0]
    false_index = ts.index.difference(true_index)
    false_index = false_index[false_index > first_true_index]
    if len(false_index) == 0:
        return 14 - first_true_index
    return false_index[0] - first_true_index

adopted_weeks = {k: calc_adoption(v) for k, v in weeks_adopted.iteritems()}
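A quick sanity check of this helper against the first and third series from the question, rebuilt here as plain boolean Series so the block runs on its own (on Python 3, use .items() instead of .iteritems() on the dict):

```python
import pandas as pd

def calc_adoption(ts):
    true_index = ts[ts].index                      # weeks where the flag is True
    if len(true_index) == 0:
        return 0
    first_true_index = true_index[0]
    false_index = ts.index.difference(true_index)  # weeks where the flag is False
    false_index = false_index[false_index > first_true_index]
    if len(false_index) == 0:                      # never dropped out in weeks 0-13
        return 14 - first_true_index
    return false_index[0] - first_true_index

# First series from the question: True at weeks 9 and 12
s1 = pd.Series([False] * 9 + [True, False, False, True, False])
a1 = calc_adoption(s1)   # first True at 9, first False after it at 10 -> 1

# Third series: True at weeks 11-13, no False afterwards -> 14 - 11 = 3
s2 = pd.Series([False] * 11 + [True] * 3)
a2 = calc_adoption(s2)
```

These reproduce the adoption values of 1 and 3 that the question expects for those two series.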