Dataframe count rows with condition - python

I have Python code that collects information from a dataframe (df1), like this:
for ind, data in enumerate(df1.Link):
    print(data)
    result = getInformation(driver, links)
    for i in result['information']:
        df1.loc[ind, "numOfWorkers"] = i["numOfWorkers"]
the output is saved to a dataframe like the one shown in the photo:
Is there any way to update my code so that, before it returns the dataframe, it applies this condition: once we have 2 links with numOfWorkers >= 30, the code breaks and returns the result?
Can someone help?

It would be better to put the logic in the code you already have. I would keep a count of the number of records matching the condition, then exit the loop using break (rather than a while loop):
...
workers_threshold = 30
records_matching_threshold = 0
max_records_for_matching_records = 2
for i in result['information']:
    df1.loc[ind, "numOfWorkers"] = i["numOfWorkers"]
    if i["numOfWorkers"] >= workers_threshold:
        records_matching_threshold += 1
        if records_matching_threshold >= max_records_for_matching_records:
            break
Note that the variable names above are purposefully long to make their purposes clear in my example.
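If you ultimately only need the count, a vectorized check on the finished column is simpler than tracking a counter by hand. A minimal sketch, assuming df1["numOfWorkers"] holds numbers once the loop has filled it:
matching = (df1["numOfWorkers"] >= 30).sum()  # boolean mask; each True counts as 1
print(matching)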

i = 2
while i != 0:
    if numOfWorkers >= 30:
        i -= 1


Python/R code is taking too long to extract pairwise information from dataset. How to optimize?

The code was initially in R, but as R does not handle large datasets well, I converted it to Python and ported it to Google Colab. Even on Google Colab it took very long, and I never actually saw it finish running even after 8 hours. I also added more break statements to avoid unnecessary runs.
The dataset has around 50,000 unique timestamps and 40,000 unique IDs. It is in the format ['time', 'id', 'x-coordinate', 'y-coordinate'], a very clear-cut passenger trajectory dataset.
What the code is trying to do is extract all the pairs of IDs that are 2 meters or less apart from each other in the same time frame.
Please let me know if there are ways to optimize this.
Here's a short overview of the data. [my_data.head(10)][1]
i = 0
y = pd.DataFrame(columns=['source', 'dest'])  # empty contact network df
infectedGrp = [824, 11648, 23468]
while i < my_data.shape[0]:
    row1 = my_data.iloc[i]
    id1 = row1[1]
    time1 = row1[0]
    x1 = row1[2]
    y1 = row1[3]
    infected1 = my_data.iloc[i, 4]
    infectious1 = my_data.iloc[i, 5]
    # print(row1)
    # print(time1)
    for j in range(i + 1, my_data.shape[0]):
        row2 = my_data.iloc[j]
        id2 = row2[1]
        time2 = row2[0]
        x2 = row2[2]
        y2 = row2[3]
        infected2 = my_data.iloc[j, 4]
        infectious2 = my_data.iloc[j, 5]
        print(time2)
        if time2 != time1:
            i = i + 1
            print("diff time...breaking")
            break
        if (x2 > x1 + 2) or (x1 > x2 + 2):
            i = i + 1
            print("x more than 2...breaking")
            break
        if (y2 > y1 + 2) or (y1 > y2 + 2):
            i = i + 1
            print("y more than 2...breaking")
            break
        probability = 0
        distance = round(math.sqrt(pow((x1 - x2), 2) + pow((y1 - y2), 2)), 2)
        print(distance)
        print(infected1)
        print(infected2)
        if distance <= R:
            if infectious1 and not infected2:  # if one person is infectious and the other is not infected
                probability = (1 - beta) * (1 / R) * math.sqrt(R**2 - distance**2)
                print(probability)
                print("here")
                infected2 = decision(probability)
                numid2 = int(id2)  # update all entries for id2
                if infected2:
                    my_data.loc[my_data['id'] == numid2, 'infected'] = True
                # my_data.iloc[j, 7] = probability
            elif infectious2 and not infected1:
                infected1 = decision(probability)
                numid1 = int(id1)  # update all entries for id1
                if infected1:
                    my_data.loc[my_data['id'] == numid1, 'infected'] = True
                # my_data.iloc[i, 7] = probability
            inf1 = 'F'
            inf2 = 'F'
            if infected1:
                inf1 = 'T'
            if infected2:
                inf2 = 'T'
            print('prob ' + str(probability) + ' at time ' + str(time1))
            new_row = {'source': id1.astype(str) + ' ' + inf1, 'dest': id2.astype(str) + ' ' + inf2}
            y = y.append(new_row, ignore_index=True)
    i = i + 1
[1]: https://i.stack.imgur.com/YVdmB.png
It's hard to tell for sure, but I think a good guess is that this line is your biggest "sin":
y = y.append(new_row, ignore_index=True)
You should not append rows to a dataframe in a loop.
You should aggregate them in a Python list and then create the DataFrame from them all after the loop:
y = []
while i < my_data.shape[0]:
    (...)
    y.append(new_row)
y = pd.DataFrame(y)
I also suggest using a line profiler to analyse which parts of the code are the bottlenecks.
You are using a nested loop to find rows with equal time values. You can get a huge improvement by doing a groupby operation instead and then iterating through the groups.
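A minimal sketch of that idea, assuming the column names described in the question ('time', 'id', 'x-coordinate', 'y-coordinate') and a 2-meter contact radius; the quadratic pair search then runs only inside each (usually small) timestamp group rather than over the whole dataset:
import itertools
import math

import pandas as pd

def close_pairs(my_data, radius=2.0):
    pairs = []
    # Only rows sharing a timestamp can be a contact, so group first.
    for time, group in my_data.groupby('time'):
        rows = group[['id', 'x-coordinate', 'y-coordinate']].to_numpy()
        for (id1, x1, y1), (id2, x2, y2) in itertools.combinations(rows, 2):
            if math.hypot(x1 - x2, y1 - y2) <= radius:
                pairs.append({'source': id1, 'dest': id2, 'time': time})
    return pd.DataFrame(pairs)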

How to not set value to slice of copy [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I am trying to replace string values in a column without creating a copy. I have looked at the docs provided in the warning and also at this question. I have also tried using .replace() with the same results. What am I not understanding?
Code:
import pandas as pd
from datetime import timedelta

# set csv file as constant
TRADER_READER = pd.read_csv('TastyTrades.csv')
TRADER_READER['Strategy'] = ''

def iron_condor():
    TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
    a = 0
    b = 1
    c = 2
    d = 3
    for row in TRADER_READER.index:
        start_time = TRADER_READER['Date'][a]
        end_time = start_time + timedelta(seconds=5)
        e = TRADER_READER.iloc[a]
        f = TRADER_READER.iloc[b]
        g = TRADER_READER.iloc[c]
        h = TRADER_READER.iloc[d]
        if start_time <= f['Date'] <= end_time and f['Underlying Symbol'] == e['Underlying Symbol']:
            if start_time <= g['Date'] <= end_time and g['Underlying Symbol'] == e['Underlying Symbol']:
                if start_time <= h['Date'] <= end_time and h['Underlying Symbol'] == e['Underlying Symbol']:
                    e.loc[e['Strategy']] = 'Iron Condor'
                    f.loc[f['Strategy']] = 'Iron Condor'
                    g.loc[g['Strategy']] = 'Iron Condor'
                    h.loc[h['Strategy']] = 'Iron Condor'
                    print(e, f, g, h)
        if (d + 1) > int(TRADER_READER.index[-1]):
            break
        else:
            a += 1
            b += 1
            c += 1
            d += 1

iron_condor()
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
Hopefully this provides the data needed to replicate:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL
Expected result:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT,Iron Condor
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL,Iron Condor
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT,Iron Condor
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL,Iron Condor
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL,
Let's start with some improvements in the initial part of your code:
The leftmost column of your input file is apparently the index column, so it should be read as the index. The consequence is a slightly different way to access rows (details later).
The Date column can be converted to datetime64 as early as at read time.
So the initial part of your code can be:
TRADER_READER = pd.read_csv('Input.csv', index_col=0, parse_dates=['Date'])
TRADER_READER['Strategy'] = ''
Then I decided to organize the loop another way:
indStart is the integer position within the index.
As you process your file in "overlapping" groups of 4 consecutive rows, a more natural way to organize the loop is to stop at the 4th row from the end. So the loop is over range(TRADER_READER.index.size - 3).
The indices of the 4 rows of interest can be read from the respective slice of the index, i.e. [indStart : indStart + 4].
The check of a particular row can be performed with a nested function.
To avoid your warning, setting values in the Strategy column should be performed using loc on the original DataFrame, with the row parameter for the respective row and the column parameter for Strategy.
The whole update (for the current group of 4 rows) can be performed in a single instruction, specifying the row parameter as a slice from a through d.
So the code can be something like below:
def iron_condor():
    def rowCheck(row):
        return start_time <= row.Date <= end_time and row['Underlying Symbol'] == undSymb

    for indStart in range(TRADER_READER.index.size - 3):
        a, b, c, d = TRADER_READER.index[indStart : indStart + 4]
        e = TRADER_READER.loc[a]
        undSymb = e['Underlying Symbol']
        start_time = e.Date
        end_time = start_time + pd.Timedelta('5S')
        if rowCheck(TRADER_READER.loc[b]) and rowCheck(TRADER_READER.loc[c]) and rowCheck(TRADER_READER.loc[d]):
            TRADER_READER.loc[a:d, 'Strategy'] = 'Iron Condor'
            print('New values:')
            print(TRADER_READER.loc[a:d])
There is no need to increment a, b, c and d, nor is a break needed.
Edit
If for some reason you have to do other updates on the rows in question, you can change my code accordingly.
But I don't understand "this csv file will make a new column" in your comment. For now, everything you do is performed on the DataFrame in memory. Only after that can you save the DataFrame back to the original file. But note that even your code changes the type of the Date column, so I assume you do this once and that from then on the type of this column is just datetime64.
So you should probably change the type of the Date column as a separate operation and then (possibly many times) update the DataFrame and save the updated content back to the source file.
Edit following the comment as of 21:22:46Z
re.search('.*TO_OPEN$', row['Action']) returns a re.Match object if a match has been found, otherwise None.
So you cannot compare this result with the string searched for. If you wanted to get the matched string, you would run e.g.:
mtch = re.search('.*TO_OPEN$', row['Action'])
textFound = None
if mtch:
    textFound = mtch.group(0)
But you actually don't need to do that. It is enough to check whether a match has been found, so the condition can be:
found = bool(re.search('.*TO_OPEN$', row['Action']))
(note that None cast to bool gives False, while a re.Match object gives True).
Yet another (probably simpler and quicker) solution is to run just:
row.Action.endswith('TO_OPEN')
without invoking any regex function.
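As a side note (my addition, assuming the TRADER_READER DataFrame from the question), the same check can be vectorized over the whole column with the pandas string accessor, avoiding a row-by-row test:
# Boolean Series: True where Action ends with 'TO_OPEN'; no regex involved.
opens = TRADER_READER['Action'].str.endswith('TO_OPEN')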
Here is a quite thorough post that not only answers your question but also explains in detail why things are the way they are:
Deal with SettingWithCopyWarning
In short, if you want to set values on the original df, either use .replace(inplace=True) or df.loc[condition, theColtoBeSet] = new_val
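A minimal illustration of the df.loc form on made-up data (the values here are hypothetical, just to show the pattern):
import pandas as pd

df = pd.DataFrame({'Action': ['BUY_TO_OPEN', 'SELL_TO_CLOSE'], 'Strategy': ['', '']})
# Writing through .loc on the original DataFrame sets values in place,
# so no SettingWithCopyWarning is raised.
df.loc[df['Action'] == 'BUY_TO_OPEN', 'Strategy'] = 'Iron Condor'
print(df)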

How do I strip the "$" symbol from a string in a pandas series?

In dire need of help. I am trying to conditionally iterate over the rows of a Google Play Store csv file. I keep running into the issue of pandas not recognizing the '>=' symbol during some of my loops for some reason. That said, the condition if price == "9.00" works just fine, but the other operations (i.e. '<=' and '>=') return error messages.
Additionally, I'm trying to count the number of apps with a price of $9.00 or greater. I want to strip the "$" symbol from the Price column and then iterate over it. I have attempted the str.lstrip function with no success. Any and all help is greatly appreciated.
df = pd.read_csv("googleplaystore.csv")
df['Rating'].fillna(value='0.0', inplace=True)

# Calculating how many apps have a price of $9.00 or greater
apps_morethan9 = 0
for i, row in df.iterrows():
    rating = float(row.Rating)
    price = float(row.Price)
    if price >= 9:
        apps_morethan9 += 1
print(apps_morethan9)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-103-66171ce7efb6> in <module>
5 for i, row in df.iterrows():
6 rating = float(row.Rating)
----> 7 price = float(row.Price)
8 if price >= 9:
9 apps_morethan9 += 1
ValueError: could not convert string to float: '$4.99'
You could use str.replace() like this (inside iterrows each row.Price is a plain Python string, so the built-in str.replace applies, not the pandas .str accessor):
for i, row in df.iterrows():
    rating = float(row.Rating)
    price = float(row.Price.replace('$', ''))
    if price >= 9:
        apps_morethan9 += 1
But your implementation could be improved in terms of speed and complexity (regex=False so the '$' is treated literally rather than as an end-of-string anchor):
print(df[df.Price.str.replace('$', '', regex=False).astype(float) >= 9].count().values[0])
You can apply str.replace to the whole Series (and assign the result back) before iterating over it, like so:
df["Price"] = df["Price"].str.replace("$", "", regex=False)
for i, row in df.iterrows():
    # rest of your routine
I suggest you use @gustavz's solution for better performance.
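Combining both suggestions, a fully vectorized count could look like this sketch (assuming the Price column from the question holds strings like '$4.99'):
# Strip the literal '$', convert to float, then count rows via a boolean mask.
prices = df['Price'].str.replace('$', '', regex=False).astype(float)
print((prices >= 9).sum())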

How to evaluate multiple boolean conditions across a whole DataFrame (row by row)?

I have the following DataFrame:
open high low close volume
0 62.8571 63.9285 62.7714 63.5642 82641944.0
1 63.6642 64.9285 63.5014 64.5114 88379522.0
2 61.7014 63.6857 61.4428 63.2757 112681030.0
3 62.5928 63.6399 62.0285 62.8085 113921367.0
4 63.4357 64.0499 62.6028 63.0505 110727309.0
.. .. .. .. .. ..
And currently I have the following code to generate a "bool" (0, 1, -1) Series depending on multiple conditions (selecting rows 2 by 2; in other cases I will need 3/4 rows in each iteration/calculation):
def check_pattern(data):
    engulfed_bar_range = data.iloc[-2]['close'] - data.iloc[-2]['open']
    if abs(engulfed_bar_range) >= params:
        if engulfed_bar_range > 0:
            return -1 * ((data.iloc[-1]['open'] > data.iloc[-2]['close']) and
                         (data.iloc[-1]['close'] < data.iloc[-2]['open']))
        else:
            return +1 * ((data.iloc[-1]['open'] < data.iloc[-2]['close']) and
                         (data.iloc[-1]['close'] > data.iloc[-2]['open']))
    return False

res = []
for index in range(1, len(all_data)):
    data = all_data.iloc[index-1:index+1]
    res.append(check_pattern(data))
s = pd.Series(res)
Is there any better / easier / faster-performing way of doing this? In some other cases similar to this one, where I only needed the data from one column of the DataFrame, I have used df.rolling(...), but in this case, where I need data from several columns, I don't know how to do it. Maybe there is some function in numpy that I can use? Or pd.eval? (I have tried, but I haven't been able to get what I want.)
Thanks so much in advance for your help.
Graphical explanation of what I'm looking for in the df:
I want a pd.Series with +1 where there is a Bullish Engulfing pattern, -1 where there is a Bearish Engulfing pattern, and 0 if there is no pattern at that index.
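One way to avoid the per-row Python loop is to compare each row with a shifted copy of itself, so all the two-row comparisons become column arithmetic. A sketch under the question's assumptions (all_data with open/close columns, params as the minimum body size):
import numpy as np
import pandas as pd

prev_open = all_data['open'].shift(1)
prev_close = all_data['close'].shift(1)
body = prev_close - prev_open  # the engulfed bar's range, as in check_pattern

big_enough = body.abs() >= params
bearish = (body > 0) & (all_data['open'] > prev_close) & (all_data['close'] < prev_open)
bullish = (body < 0) & (all_data['open'] < prev_close) & (all_data['close'] > prev_open)

# +1 bullish engulfing, -1 bearish engulfing, 0 otherwise
s = pd.Series(np.where(big_enough & bearish, -1,
                       np.where(big_enough & bullish, 1, 0)),
              index=all_data.index)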

What is the fastest way to generate a dynamic lag variable that updates with each iteration?

I have a few hundred thousand groups through which I want to iterate this particular lag operation. Below is a sample where Buy_Ord_No is the group-by variable:
I would like to generate Lag_Exec_Qty and Exec_Qty. What I am basically doing here is initially setting Exec_Qty equal to 0 when Buy_Act_Type = 1 or Buy_Act_Type = 4. Then, I take the lagged value of Exec_Qty as Lag_Exec_Qty. In the same row, I sum up Trd_Qty and Lag_Exec_Qty to get the updated Exec_Qty.
This is the code that I currently have:
for b in buy:
    temp = buy_sorted_file[buy_sorted_file["Buy_Ord_No"] == b]
    temp = temp.sort_values(["Buy_Ord_No", "Buy_Ord_Txn_Time"], ascending=[True, True]).reset_index(drop=True)
    for index in range(len(temp.index)):
        if int(temp["Buy_Act_Type"].iloc[index]) == 1 or int(temp["Buy_Act_Type"].iloc[index]) == 4:
            temp["Exec_Qty"].iloc[index] = 0
            temp["Lag_Exec_Qty"].iloc[index] = 0
        else:
            temp["Lag_Exec_Qty"].iloc[index] = temp["Exec_Qty"].iloc[index - 1]
            temp["Exec_Qty"].iloc[index] = temp["Trd_Qty"].iloc[index] + temp["Lag_Exec_Qty"].iloc[index]
    if len(buy_sorted_exec_file.index) == 0:
        buy_sorted_exec_file = temp.copy()
    else:
        buy_sorted_exec_file = pd.concat([temp, buy_sorted_exec_file]).reset_index(drop=True)
buy_sorted_file = buy_sorted_exec_file.sort_values(["Buy_Ord_Txn_Time", "Buy_Ord_Limit_Pr"], ascending=[True, True]).reset_index(drop=True)
The code takes a really long time to run. Is there any way I can speed this process up?
You should be able to do, without any loops:
temp['Lag_Exec_Qty'] = temp['Exec_Qty'].shift(1)
temp['Exec_Qty'] = temp['Trd_Qty'] + temp['Lag_Exec_Qty']
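The shift only removes the inner loop; the resets on Buy_Act_Type 1/4 and the running total still need handling. A hedged sketch of one fully vectorized approach, assuming the column names from the question and that Exec_Qty is a running sum of Trd_Qty that restarts at each reset row within an order:
import pandas as pd

df = buy_sorted_file.sort_values(["Buy_Ord_No", "Buy_Ord_Txn_Time"]).reset_index(drop=True)
reset = df["Buy_Act_Type"].isin([1, 4])

# Each reset row starts a new "run" within its order.
run = reset.groupby(df["Buy_Ord_No"]).cumsum()

# Reset rows contribute 0; other rows contribute their Trd_Qty.
qty = df["Trd_Qty"].where(~reset, 0)
df["Exec_Qty"] = qty.groupby([df["Buy_Ord_No"], run]).cumsum()

# The lag is the previous Exec_Qty within the order, forced to 0 on reset rows.
df["Lag_Exec_Qty"] = (df.groupby("Buy_Ord_No")["Exec_Qty"].shift(1)
                        .fillna(0).where(~reset, 0))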
