I have data (a pandas DataFrame) with 10 million rows. The code below loops over the data with a for loop in Google Colab, and it is very slow.
Is there a way to replace the loop with something faster (like np.where) that can handle these multiple conditions?
The code is:
for i in range(len(data)):
    last = data.head(i)
    select_acc = last.loc[last['ACOUNTNO'] == data['ACOUNTNO'][i]]
    avr = select_acc[select_acc['average'] > 0]
    if len(avr) == 0:
        lastavrage = 0
    else:
        lastavrage = avr.average.mean()
    if (data["average"][i] < lastavrage) and (data['LASTREAD'][i] > 0):
        data["label"][i] = "abnormal"
        data["problem"][i] = "error"
Generally speaking, the worst thing to do is to iterate over rows.
I can't see a totally iteration-free solution (by "iteration-free" I mean "without explicit iterations in Python"; of course, any solution iterates somewhere, but some iterate under the hood, in the internal code of pandas or numpy, which is far faster).
But you could at least iterate over account numbers rather than rows: there are certainly fewer account numbers than rows, otherwise you wouldn't need these computations anyway.
For example, you could compute the threshold of "abnormal" average like this:
for no in data.ACCOUNTNO.unique():
    f = data.ACCOUNTNO == no       # True/False series of rows matching this account
    cs = data[f].average.cumsum()  # cumulative sum of the 'average' column for this account
    num = f.cumsum()               # running count of rows for this account
    data.loc[f, 'lastavr'] = cs / num
After that, the column 'lastavr' contains what your variable lastavrage would be worth in your code. Well, not exactly: your variable doesn't count the current row, while mine does. We could have computed (cs-data.average)/(num-1) instead of cs/num to match your version. But what for? The only thing you do with it is compare it to the current data.average, and data.average > (cs-data.average)/(num-1) iff data.average > cs/num. So it is simpler this way, and it avoids a special case for the first row.
Then, once you have that new column (you could also just use a series without adding it as a column, a little like I did for cs and num, which are not columns of data), it is simply a matter of:
pb = (data.average<data.lastavr) & (data.LASTREAD>0)
data.loc[pb,'label']='abnormal'
data.loc[pb,'problem']='error'
Note that the fact that I don't see a way to avoid the iteration over ACCOUNTNO doesn't mean there isn't one. In fact, I am pretty sure that with lookup or some combination of join/merge/groupby there could be. But it probably doesn't matter much: you almost certainly have far fewer ACCOUNTNO values than rows, so the remaining loop is probably negligible.
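For what it's worth, here is a sketch of how groupby could remove that remaining loop, using an expanding mean per account. The five-row frame below is made up, and like the cumulative-sum version above it skips the question's average > 0 filter:

```python
import pandas as pd

# Hypothetical miniature of the data described in the question
data = pd.DataFrame({
    'ACCOUNTNO': [1, 1, 2, 1, 2],
    'average':   [5.0, 3.0, 2.0, 10.0, 1.0],
    'LASTREAD':  [1, 1, 0, 1, 1],
    'label':     ['normal'] * 5,
    'problem':   [''] * 5,
})

# Expanding (cumulative) mean of 'average' within each account, current row included
data['lastavr'] = (data.groupby('ACCOUNTNO')['average']
                       .expanding().mean()
                       .reset_index(level=0, drop=True))

pb = (data['average'] < data['lastavr']) & (data['LASTREAD'] > 0)
data.loc[pb, 'label'] = 'abnormal'
data.loc[pb, 'problem'] = 'error'
```

The reset_index call drops the group level that expanding() adds, so the result aligns with the original row index when assigned back.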
So I'm fairly new to coding only having relatively simple scripts here and there when I need them for work. I have a document that has an ID column formatted as:
"Number Word Number" and some values under a spec, lower tol, and upper tol column.
Sometimes the number under ID is an integer or a float, and the word can be one of, say, 30 different possibilities. Ultimately these need to be read and then organized, depending on the spec and lower/upper tol columns, into something like below:
I'm using Pandas to read the data and do the manipulations I need so my question isn't so much of a how to do it, but more of a how should it best be done.
The way my code is written is basically a series of if statements that handle each scenario I've come across so far, but based on other people's code I've seen, this is generally not done and, as I understand it, is considered poor practice. It's very basic if statements like:
if (the ID column has "Note" in it): it's a basic dimension
if (the ID column has "Roughness" in it): it's an Ra value
if (the ID column has "Position" in it): it's a true position, etc.
The problem is, I'm not really sure what the "correct" way to do it would be in terms of making it more efficient and simpler. I currently have a series of 30+ if statements handling the different situations I've run into so far. Virtually all my code is written in this overly specific, not very general style; even though it works, I find it overcomplicated, but I'm not sure which capabilities of Python/pandas I'm missing that would simplify it.
Since you need to test what the variable in ID is and act accordingly, you most probably can't avoid the if statements. What I suggest, since you have already written the code, is to reform the database: unless there is a very specific reason for the current structure, you should change it as soon as possible.
To be specific: add an (auto-increment) unique number to ID, and break the 3 data points of the ID column into 3 separate columns.
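As a sketch of both suggestions, assuming the ID really is "Number Word Number" separated by spaces (the column values and the category mapping below are invented): str.extract splits the ID into three columns, and a lookup dict replaces the chain of if statements:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1.5 Roughness 2', '3 Position 4', '2 Note 1']})

# Break "Number Word Number" into three separate columns
parts = df['ID'].str.extract(r'(?P<num1>\S+)\s+(?P<word>\S+)\s+(?P<num2>\S+)')
df = df.join(parts)

# A lookup table instead of 30+ if statements (categories are hypothetical)
kind_map = {'Note': 'basic dimension',
            'Roughness': 'Ra value',
            'Position': 'true position'}
df['kind'] = df['word'].map(kind_map)
```

Words not present in the dict come out as NaN, which also makes unhandled cases easy to find with df['kind'].isna().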
I am currently working with a very large dataset (>70mil rows, 10+ columns), it involves gap fills, forwards fills, reindexing, etc. But the step that takes the most time (over 50% of run time) is the simple code of replacing a column variable with the value of two columns combined as strings. Example code would be:
df["id_date"] = df['id'].astype(str) +"_"+ df["date"].astype(str)
Is there a way to improve the speed of this step? I am surprised it takes so much longer than steps I thought would be more complex.
Take a look at Series.str.cat:
df['id_date'] = df['id'].astype(str).str.cat(df['date'].astype(str), sep='_')
That being said, as with any redundant information, you are likely better off just not having this column, or at least only creating the data on demand instead of up front.
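For instance, if the combined key only exists so that rows can be grouped, grouping on both columns directly avoids materializing the string column at all (toy frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 2],
    'date':  ['2020-01-01', '2020-01-02', '2020-01-01'],
    'value': [10, 20, 30],
})

# Group on both key columns instead of building an 'id_date' string column
out = df.groupby(['id', 'date'])['value'].sum()
```

The result is indexed by the (id, date) pair, so the combined key is available without ever paying for the string concatenation.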
I need to do an apply on a dataframe using inputs from multiple rows. As a simple example, I can do the following if all the inputs are from a single row:
df['c'] = df[['a','b']].apply(lambda x: awesome stuff, axis=1)
# or
df['d'] = df[['b','c']].shift(1).apply(...) # to get the values from the previous row
However, if I need 'a' from the current row, and 'b' from the previous row, is there a way to do that with apply? I could add a new 'bshift' column and then just use df[['a','bshift']] but it seems there must be a more direct way.
Related but separate, when accessing a specific value in the df, is there a way to combine labeled indexing with integer-offset? E.g. I know the label of the current row but need the row before. Something like df.at['labelIknow'-1, 'a'] (which of course doesn't work). This is for when I'm forced to iterate through rows. Thanks in advance.
Edit: Some info on what I'm doing etc. I have a pandas store containing tables of OHLC bars (one table per security). When doing backtesting, currently I pull the full date range I need for a security into memory, and then resample it into a frequency that makes sense for the test at hand. Then I do some vectorized operations for things like trade entry signals etc. Finally I loop over the data from start to finish doing the actual backtest, e.g. checking for trade entry exit, drawdown etc - this looping part is the part I'm trying to speed up.
This should directly answer your question and let you use apply, although I'm not sure it's ultimately any better than a two-line solution. It does avoid creating extra variables at least.
df['c'] = pd.concat([df['a'], df['a'].shift()], axis=1).apply(np.mean, axis=1)
That will put the mean of 'a' values from the current and previous rows into 'c', for example.
This isn't as general, but for simpler cases you can do something like this (continuing the mean example):
df['c'] = ( df['a'] + df['a'].shift() ) / 2
That is about 10x faster than the concat() method on my tiny example dataset. I imagine that's as fast as you could do it, if you can code it in that style.
You could also look into reshaping the data with stack() and hierarchical indexing. That would be a way to get all your variables into the same row but I think it will likely be more complicated than the concat method or just creating intermediate variables via shift().
For the first part, I don't think such a thing is possible. If you update on what you actually want to achieve, I can update this answer.
Also, looking at the second part, your data structure seems to rely an awful lot on the order of rows. This is typically not how you want to manage your data. Again, if you tell us what your overall goal is, we may be able to hint you towards a solution (and potentially a better way to structure the data).
Anyhow, one way to get the row before, if you know a given index label, is to do:
df.loc[:'labelYouKnow'].iloc[-2]
Note that this is not optimal efficiency-wise, so you may want to improve your data structure to avoid needing such lookups.
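Another way to combine a known label with an integer offset, as the question asks, is Index.get_loc (toy frame, invented labels):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]}, index=['x', 'y', 'z'])

# Turn the known label into an integer position, then offset it
pos = df.index.get_loc('z')
prev_a = df.iloc[pos - 1]['a']  # value of 'a' in the row before label 'z'
```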
I need to apply two running filters on a large amount of data. I have read that creating variables on the fly is not a good idea, but I wonder if it still might be the best solution for me.
My question:
Can I create arrays in a loop with the help of a counter (array1, array2, ...) and then call them with the counter (something like 'array' + str(counter) or 'array' + str(counter - 1))?
Why I want to do it:
The data are 400x700 arrays for 15min time steps over a year (So I have 35000 400x700 arrays). Each time step is read into python individually. Now I need to apply one running filter that checks if the last four time steps are equal (element-wise) and if they are, then all four values are set to zero. The next filter uses the data after the first filter has run and checks if the sum of the last twelve time steps exceeds a certain value. When both filters are done I want to sum up the values, so that at the end of the year I have one 400x700 array with the filtered accumulated values.
I do not have enough memory to read in all the data at once. So I thought I could create a loop where for each time step a new variable for the 400x700 array is created and the two filters run. The older arrays that are filtered I could then add to the yearly sum and delete, so that I do not have more than 16 (4+12) time steps(arrays) in memory at all times.
I don't know if it's correct of me to ask such a question without any code to show, but I would really appreciate the help.
If your question is about the best data structure to keep a certain number of arrays in memory, I would suggest a three-dimensional array. Its shape would be (400, 700, 12), since twelve is how many arrays you need to look back at. The advantage is that your memory use stays constant as you load new arrays into the larger one. The disadvantage is that you need to shift all the arrays manually.
If you don't want to deal with the shifting yourself I'd suggest using a deque with a maxlen of 12.
"Can I create arrays in a loop with the help of a counter (array1, array2, ...) and then call them with the counter (something like 'array' + str(counter) or 'array' + str(counter - 1))?"
This is a very common question that I think a lot of programmers will face eventually. Two examples for Python on Stack Overflow:
generating variable names on fly in python
How do you create different variable names while in a loop? (Python)
The lesson to learn from this is to not use dynamic variable names, but instead put the pieces of data you want to work with in an encompassing data structure.
The data structure could e.g. be a list, dict, or NumPy array. The collections.deque proposed by @Midnighter also seems to be a good candidate for such a running filter.
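A minimal sketch of that rolling window with collections.deque, with the array sizes shrunk from 400x700 to 2x3 for illustration:

```python
from collections import deque

import numpy as np

window = deque(maxlen=4)  # holds at most the last 4 time steps
for step in range(6):
    # Small stand-in for one 400x700 time-step array
    window.append(np.full((2, 3), step))

# After 6 appends only steps 2..5 remain; older arrays were dropped automatically
stacked = np.stack(window)                        # shape (4, 2, 3)
last_four_equal = bool(np.all(stacked == stacked[0]))
```

The twelve-step sum filter works the same way with maxlen=12; np.stack(window).sum(axis=0) then gives the element-wise sum over the retained steps.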
My function gets a combinations of numbers, can be small or large depending on user input, and will loop through each combinations to perform some operations.
Below is a line profile I ran on my function; it takes 0.336 seconds to run. While this is fast, it is only a subset of a bigger framework in which I will need to run this function 50 to 20000 times, which multiplied by 0.336 gives 16.8 to 6720 seconds (I hope this is right). It previously took 0.996 seconds, but I managed to cut that in half by avoiding function calls.
The major contributor to the time is the two __getitem__ calls, which access a dictionary for information N times, depending on the number of combinations. My dictionary is a collection of data and looks something like this:
dic = {"array1": a,
       "array2": b,
       "array3": c,
       "listofarray": [[list of 5 arrays], [list of 5 arrays], [list of 5 2D arrays]],
      }
I was able to cut another ~0.01 seconds by moving the dictionary lookup outside of the loop:
x = dic['listofarray'][2]  # the list of 5 2D arrays
So when I loop to get access to the 5 different elements, I just use x[i].
Other than that I am lost in terms of where to add more performance boost.
Note: I apologize that I haven't provided any code; I'd love to show it, but it's proprietary. I just wanted some thoughts on whether I am looking in the right place for speed-ups.
I am willing to learn and apply new things, so if Cython or some other data structure can speed things up, I am all ears. Thanks so much.
PS: (the profile traces from inside my first and second __getitem__ are omitted here)
EDIT:
I am using itertools.product(xrange(10), repeat=len(food_choices)) and iterating over this. I convert everything into numpy arrays: np.array(i).astype(float).
The major contributor to time is the two __getitem__ which is accessing dictionary for information N times depending on the number of combinations.
No it isn't. Your two posted profile traces clearly show that they're NumPy/Pandas __getitem__ functions, not dict.__getitem__. So, you're trying to optimize the wrong place.
Which explains why moving all the dict stuff out of the loop made a difference of a small fraction of a percent.
Most likely the problem is that you're looping over some NumPy object, or using some fake-vectorized function (e.g., via vectorize), rather than performing some NumPy-optimized broadcasting operation. That's what you need to fix.
For example, if you compare these:
np.vectorize(lambda x: x*2)(a)
a * 2
… the second one will go at least 10x faster on any sizable array, and it's mostly because of all the time spent doing __getitem__, which includes boxing up numbers to be usable by your Python function. (There's also some additional cost in not being able to use CPU-vectorized operations, cacheable tight loops, etc., but even if you arrange things to be complicated enough that those don't enter into it, you're still going to get much faster code.)
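A quick, self-contained way to see the gap (absolute timings vary by machine, but the ordering should not):

```python
import timeit

import numpy as np

a = np.arange(1_000_000)

slow = np.vectorize(lambda x: x * 2)             # calls the Python lambda once per element
t_slow = timeit.timeit(lambda: slow(a), number=3)
t_fast = timeit.timeit(lambda: a * 2, number=3)  # one broadcast operation

same = np.array_equal(slow(a), a * 2)  # both produce identical results
```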
Meanwhile:
I am using itertools.product(xrange(10), repeat=len(food_choices)) and iterating over this. I convert everything into numpy arrays: np.array(i).astype(float).
So you're creating 10**n separate n-element arrays? That's not making sensible use of NumPy. Each array is tiny, and most likely you're spending as much time building and pulling apart the arrays as you are doing actual work. Do you have the memory to build a single giant array with an extra 10**n-long axis instead? Or, maybe, batch it up into groups of, say, 100K? Because then you could actually build and process the whole array in native NumPy-vectorized code.
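A sketch of that batching idea, assuming the per-combination work can be phrased as array operations (n here is a stand-in for len(food_choices)):

```python
from itertools import product

import numpy as np

n = 4  # stand-in for len(food_choices)

# Build all 10**n combinations once, as a single (10**n, n) float array,
# instead of converting each tuple to its own tiny array inside a loop
combos = np.array(list(product(range(10), repeat=n)), dtype=float)

# Any per-combination computation can then be one vectorized call,
# e.g. the sum of each combination's digits:
totals = combos.sum(axis=1)
```

For larger n you would slice this into batches (say 100K rows at a time) to stay within memory, as suggested above.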
However, the first thing you might want to try is just running your code in PyPy instead of CPython. Some NumPy code doesn't work right with PyPy/NumPyPy, but there are fewer problems with each version, so you should definitely try it.
If you're lucky (and there's a pretty good chance of that), PyPy will JIT the repeated __getitem__ calls inside the loop, and make it much faster, with no change in your code.
If that helps (or if NumPyPy won't work on your code), Cython may be able to help more. But only if you do all the appropriate static type declarations, etc. And often, PyPy already helps enough that you're done.