I am fairly new to Python. I wrote a script and I am surprised by how long it takes to go through one particular loop compared to the rest of my code.
Can someone tell me what is inefficient in the code I wrote, and maybe how to improve its speed?
Here is the loop in question (BT_Histos and Histos_Last_Rebal are dataframes with dates as index and columns of floats. Portfolio and Portfolio_Last_Rebal are dataframes with the same index as the two previous ones, which I am filling through the loop. weights is just a list):
Udl_Perf = BT_Histos / Histos_Last_Rebal - 1
for i in range(1, len(BT_Histos.index)):
    """tricky because isin doesn't work with timestamp"""
    test_date = pd.Series(Portfolio.index[i-1])
    if test_date.isin(Rebalancing_Dates)[0]:
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] = Portfolio.loc[Portfolio.index[i-1], 'PortSeries']
    else:
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] = Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i-1], 'PortSeries']
    Portfolio.loc[Portfolio.index[i], 'PortSeries'] = Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] * (1 + sum(Udl_Perf.iloc[i] * weights))
Thanks!
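For reference, a minimal sketch (not the original code; it assumes weights lines up with the columns of Udl_Perf and that both frames already hold their initial value in 'PortSeries') of the same recursion with the per-row .loc lookups and the pd.Series/isin workaround replaced by plain numpy arrays and a set of rebalancing dates, since most of the cost in the loop above comes from label-based indexing on every iteration:

import numpy as np
import pandas as pd

rebal_dates = set(pd.to_datetime(Rebalancing_Dates))            # O(1) membership test per row
perf = (Udl_Perf.to_numpy() * np.asarray(weights)).sum(axis=1)  # weighted performance of each row

dates = Portfolio.index
port = Portfolio['PortSeries'].to_numpy(dtype=float).copy()
last_rebal = Portfolio_Last_Rebal['PortSeries'].to_numpy(dtype=float).copy()

for i in range(1, len(dates)):
    # Same logic as the loop above, but on plain arrays instead of .loc on dataframes
    last_rebal[i] = port[i-1] if dates[i-1] in rebal_dates else last_rebal[i-1]
    port[i] = last_rebal[i] * (1 + perf[i])

# Write the results back in a single assignment per frame
Portfolio['PortSeries'] = port
Portfolio_Last_Rebal['PortSeries'] = last_rebal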
If you really want it to be fast, then first implement it with a while loop.
Second, for the length variable you will use, define the type in advance using the Mypy library; for this you need Python 3.5+ installed.
Also, if every iteration is independent, you can use multithreading with the threading library. There is an example in this git repo.
I have data (a pandas dataframe) with 10 million rows. This code uses a for loop over the data in Google Colab, but when I run it, it is very slow.
Is there a way to use a faster loop with these multiple conditions (something like np.where), or another solution?
I need help rewriting this code in another way (for example using np.where) to solve this problem.
The code is:
for i in range(0, len(data)):
    last = data.head(i)                                             # all rows before row i
    select_acc = last.loc[last['ACOUNTNO'] == data['ACOUNTNO'][i]]  # previous rows of the same account
    avr = select_acc[select_acc['average'] > 0]
    if len(avr) == 0:
        lastavrage = 0
    else:
        lastavrage = avr.average.mean()
    if (data["average"][i] < lastavrage) and (data['LASTREAD'][i] > 0):
        data["label"][i] = "abnormal"
        data["problem"][i] = "error"
Generally speaking, the worst thing to do is to iterate rows.
I can't see a totally iteration-free solution (by "iteration-free" I mean "without explicit iterations in Python". Of course, any solution would have iterations anyway, but some have them under the hood, in the internal code of pandas or numpy, which is way faster).
But you could at least iterate over account numbers rather than rows (there are certainly fewer account numbers than rows; otherwise you wouldn't need this computation anyway).
For example, you could compute the threshold for an "abnormal" average like this:
for no in data.ACCOUNTNO.unique():
    f = data.ACCOUNTNO == no       # True/False series marking the rows of this account
    cs = data[f].average.cumsum()  # cumulative sum of 'average' for this account
    num = f.cumsum()               # running count of rows for this account
    data.loc[f, 'lastavr'] = cs / num
After that, the column 'lastavr' contains what your variable lastavrage would hold in your code. Well, not exactly: your variable doesn't count the current row, while mine does. We could have computed (cs - data.average) / (num - 1) instead of cs / num to match your version. But what for? The only thing you do with it is compare it to the current data.average, and data.average > (cs - data.average) / (num - 1) iff data.average > cs / num. So it is simpler this way, and it avoids a special case for the first row of each account.
Then, once you have that new column (you could also just use a Series without adding it as a column, a bit like I did with cs and num, which are not columns of data), it is simply a matter of:
pb = (data.average<data.lastavr) & (data.LASTREAD>0)
data.loc[pb,'label']='abnormal'
data.loc[pb,'problem']='error'
Note that the fact that I don't see a way to avoid the iteration over ACCOUNTNO doesn't mean there isn't one. In fact, I am pretty sure that with lookup or some combination of join/merge/groupby there could be one. But it probably doesn't matter much, because you probably have far fewer ACCOUNTNO values than rows, so the remaining loop should be negligible.
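For what it's worth, a minimal sketch (not part of the original answer, and assuming the column really is named ACCOUNTNO) of how that remaining loop could probably be folded into a single groupby, computing the same running mean per account:

g = data.groupby('ACCOUNTNO')['average']
data['lastavr'] = g.cumsum() / (g.cumcount() + 1)   # running mean of 'average', current row included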
I have some simple code:
for x in range(df2.shape[0]):
    df1.loc[df1['df1_columnA'] == df2.iloc[x]['df2_columnB']]['df1_columnB']
This code goes through the cells located at (iloc[x], 'df2_columnB') of df2, and whenever that cell's value matches a value in df1['df1_columnA'], it accesses that row's value (.loc) in ['df1_columnB'].
My question is: where can I find out how this works internally (or would someone be willing to explain)? Before I knew about this way of comparing, I had a couple of nested for loops and other logic to find the values. I've searched through the GitHub repository and other online resources, but I can't find anything relevant. I'm simply curious how it compares to my own initial code and whether vectorization is used, etc.
Thanks
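As a rough illustration (a sketch with made-up toy data and the hypothetical column names from the question) of what that expression does under the hood: the == produces a boolean mask via a single vectorized comparison, and .loc then selects the rows where the mask is True:

import pandas as pd

df1 = pd.DataFrame({'df1_columnA': ['a', 'b', 'c'], 'df1_columnB': [1, 2, 3]})
df2 = pd.DataFrame({'df2_columnB': ['b', 'c']})

x = 0
value = df2.iloc[x]['df2_columnB']      # scalar: 'b'
mask = df1['df1_columnA'] == value      # vectorized comparison -> boolean Series [False, True, False]
matched = df1.loc[mask, 'df1_columnB']  # rows of df1 where the mask is True
print(matched)                          # index 1, value 2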
I'm new to coding and having a hard time expressing/searching for the correct terms to help me along with this task. In my work I get some pretty large Excel files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some of these tasks.
How can I use Python (pandas?) to find certain rows/columns based on a common name/ID but with a unique suffix, and aggregate/sum the results that belong together under that common name? As an example, in the table provided I need to get all the results from sub-localities, e.g. AA3_f, AA3_lf and AA3_s, expressed as the sum (total gulls for each species) of the subs in a new row for the main locality AA3.
Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on Python, numpy, pandas and also matplotlib, and I am still clueless on how to set this up.
Any help appreciated.
Thanks!
Update:
@Harsh Nagouda, thanks for your reply. I tried your example using the groupby function, but I'm having trouble dividing the data into the correct groups. The "Locality" column has only unique values/IDs because they all have a suffix (they are sub-categories).
I tried to solve this by slicing the strings:
eng.Locality.str.slice(0,4,1)
I managed to slice off the suffixes so that the remainders are AA3_, AA4_ and so on.
Then I tried to do this slicing inside the groupby function. That failed. Then I tried to slice using pandas.DataFrame.apply(). That failed as well.
eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()
Any more help out there? As you can see above, I need it :-)
In your case, the pd.groupby option seems to be a good fit for the problem. The groupby function does exactly what its name says: it groups the parts of the dataframe you want it to.
Since you mentioned a case based on grouping by localities and finding the sum of those values, this snippet should help you out:
sum = eng.groupby(["Locality"]).sum()
Additional commands and sorting styles can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:
Edited the Locality ID to remove the suffix: eng["Locality"] = eng["Locality"].str.slice(0,4,1)
Used the groupby function: sum = eng.groupby(["Locality"]).sum()
End result:
Table
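As a small aside (a sketch, assuming the locality prefix really is the first four characters, such as "AA3_"): the sliced Series can also be passed directly to groupby, which gives the same totals without overwriting the Locality column:

totals = eng.groupby(eng["Locality"].str.slice(0, 4)).sum(numeric_only=True)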
I have a mathematical task in which I am supposed to find some combinations, etc. That doesn't really matter; the problem is that I am trying to do it with the itertools module. It worked fine for smaller combinations (6 places), but now I want to do the same for a large combination (18 places), and here I run into a problem: I only have 8 GB of RAM, this list comes to around 5 GB, and with my system running it consumes all the RAM and the program raises MemoryError. So my question is: what would be a good alternative to the method I'm using (code below)?
poliedar_kom = list(itertools.combinations_with_replacement(range(0, 13), 18))
poliedar_len = len(poliedar_kom)
Once I have this list and its length, the rest of the program goes through every value in the list and checks a condition against values in another, smaller list. As I already said, the problem is that this list gets too big for my PC, but I'm probably doing something wrong.
Note: I am using the latest Python 3.8, 64-bit.
Summary: I have a too-big list of lists through which I have to loop to check values against conditions.
EDIT: I appreciate all the answers; I have to try them now. If you have any other possible solution to the problem, please post it.
EDIT 2: Thanks everyone, you helped me a lot. I marked the answer that pointed me to the YouTube video because it made me realize that my code is already a generator. Thanks everyone!
Use generators for large data ranges; that way the memory use of the code does not grow with the size of the data. Refer to the link for more details:
https://www.youtube.com/watch?v=bD05uGo_sVI
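A minimal sketch of what that looks like here (same parameters as in the question): iterate over itertools.combinations_with_replacement lazily instead of wrapping it in list(), so only one 18-tuple is in memory at a time; the total count can be obtained with math.comb without generating anything:

import itertools
import math

total = math.comb(13 + 18 - 1, 18)   # number of combinations, computed without generating them

count = 0
for combo in itertools.combinations_with_replacement(range(0, 13), 18):
    # ... check the current combination against the smaller list here ...
    count += 1                        # running count replaces len(poliedar_kom)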
For any application requiring more than, say, 1e4 items, you should refrain from using plain Python lists, which are very memory- and processor-intensive.
For such uses, I generally go to numpy arrays or pandas dataframes.
If you aren't comfortable with these, is there some way you could refactor your algorithm so that you don't hold every value in memory at once, for example with a generator?
In your case:
1) Store this amount of data not in RAM but in a file or something on your HDD/SSD (say, an SQL or NoSQL database).
2) Write a generator that processes each list (or a group of lists, for more efficiency) inside the whole collection, one after the other, until the end.
It would be good to use something like MongoDB or MySQL/MariaDB/PostgreSQL to store this amount of data.
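A minimal sketch of that idea (hypothetical file and table names): stream the combinations into an on-disk SQLite table in batches, so the full set never lives in RAM, and process it later with queries or by iterating a cursor:

import itertools
import sqlite3

conn = sqlite3.connect("combinations.db")
conn.execute("CREATE TABLE IF NOT EXISTS combos (c TEXT)")

batch = []
for combo in itertools.combinations_with_replacement(range(0, 13), 18):
    batch.append((",".join(map(str, combo)),))
    if len(batch) >= 100_000:                                   # insert in chunks, not row by row
        conn.executemany("INSERT INTO combos VALUES (?)", batch)
        batch.clear()
conn.executemany("INSERT INTO combos VALUES (?)", batch)        # flush the last partial batch
conn.commit()
conn.close()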
I would like to know if there's a technique to simply undo a change that was made using pandas.
For example, I did a string replacement on a few thousand rows of a pandas DataFrame, where every occurrence of "&" in a string was replaced with "and". However, after performing the replacement, I found out that I had made a mistake and would like to revert the DataFrame to its most recent state before that string replacement was done.
Is there a way to do this?
Yes, there is a way to do this. If you're using a recent version of Python and pandas, you could do it this way:
df.replace(to_replace='and', value='&', inplace=True)
This is the way I learned it!
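One caveat worth adding (a sketch, assuming the original change was a substring replacement inside longer strings): df.replace without regex only matches whole cell values, so reversing a substring edit probably needs regex=True, and it will also hit any "and" that was in the text before the original edit:

# Reverse the substring replacement in every string column (note: this also
# converts pre-existing occurrences of "and", so it is not a true undo)
df.replace('and', '&', regex=True, inplace=True)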
If your notebook cells are structured in steps, and the mess is because you ran a couple of cells that affected the dataset, you can restart the kernel and run all the cells from the beginning.