I have a live feed of logging data coming in over the network. I need to calculate live statistics, like the ones in my previous question. How would I design this module? I mean, it seems unrealistic (read: bad design) to keep applying a groupby function to the entire df every single time a message arrives. Can I just update one row and have its calculated column update automatically?
JFYI, I'd be running another thread that will read values from the df and print them to a webpage every 5 seconds or so.
Of course, I could run groupby-apply every 5 seconds instead of doing it in real time, but I thought it'd be better to keep the df and the calculation independent of the printing module.
Thoughts?
groupby is pretty damn fast, and if you preallocate slots for new items you can make it even faster. In other words, try it and measure it for a reasonable amount of fake data. If it's fast enough, use pandas and move on. You can always rewrite it later.
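As a concrete way to do that measurement, here is a minimal benchmark sketch; the "key"/"value" column names and the data sizes are made-up placeholders for your real schema:

import time

import numpy as np
import pandas as pd

# Fake data: 1 million messages spread over 1,000 groups (made-up shapes;
# "key" and "value" stand in for your real columns).
n_rows, n_groups = 1_000_000, 1_000
df = pd.DataFrame({
    "key": np.random.randint(0, n_groups, size=n_rows),
    "value": np.random.rand(n_rows),
})

start = time.perf_counter()
stats = df.groupby("key")["value"].agg(["mean", "count"])
print(f"groupby-agg over {n_rows:,} rows took {time.perf_counter() - start:.3f}s")

If that comes out well under your 5-second refresh interval, recomputing on every refresh is probably the simplest design.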
Related
I'm trying to find unique combinations of ~70,000 IDs.
I'm currently doing an itertools.combinations([list name], 2) to get unique 2 ID combinations but it's been running for more than 800 minutes.
Is there a faster way to do this?
I tried converting the IDs into a matrix where the IDs are both the index and the columns and populating the matrix using itertools.product.
I tried doing it the manual way with loops too.
But after more than a full day of letting them run, none of my methods have actually finished running.
For additional information, I'm storing these into a dataframe, to later run a function that compares each unique set of IDs.
(70_000 * 69_999) / 2 ≈ 2.45 billion - that is not so large a number that it can't be computed in a few hours (update: I did a dry run of itertools.combinations(range(70000), 2) and it took less than 70 seconds, on a 2017-era i7 @ 3 GHz, naively using a single core). But if you are trying to keep all of this data in memory at once, it won't fit - and if your system is configured to swap memory to disk before erroring out with a MemoryError, this can slow the program down by 2 or more orders of magnitude, and that is likely where your problem comes from.
itertools.combinations does the right thing in this respect, and there is no need to swap it for something else: it will yield one combination at a time. What you do with the results, however, does change things: if you are streaming the combinations to a file and not keeping them in memory, it should be fine, and then it is just computational time that you can't speed up anyway.
If, on the other hand, you are collecting the combinations into a list or other data structure: there is your problem - don't do it.
Now, going a step further than your question: since these combinations are checkable and predictable, maybe generating them up front is not the right approach at all - you don't give details on how they are to be used, but if they are consumed in a reactive or lazy fashion, you might get an essentially instantaneous workflow instead.
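To illustrate the streaming point, a minimal sketch, assuming the IDs live in a plain list and that a CSV file on disk is an acceptable sink for the pairs:

import csv
import itertools

ids = list(range(70_000))  # stand-in for the real list of ~70,000 IDs

# Stream each pair straight to disk; only one combination exists in memory
# at any moment, so RAM usage stays flat.
with open("pairs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for a, b in itertools.combinations(ids, 2):
        writer.writerow((a, b))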
Your RAM will fill up. You can counter this with gc.collect() or by emptying the results, but the results found so far have to be saved to disk in between.
You could try something similar to the code below. I would create individual file names or save the results into a database, since the result file will end up several GB in size. Additionally, the range of the second loop can probably be halved (by starting it at i + 1), since a pair and its mirror image are the same combination.
import gc

# Build the set of IDs (stand-in for the real ~70,000 IDs).
new_set = set()
for i in range(70000):
    new_set.add(i)
print(new_set)

combined_set = set()
for i in range(len(new_set)):
    print(i)
    # Every 300 outer iterations, flush the pairs collected so far to disk
    # and release the memory, so RAM usage stays bounded.
    if i % 300 == 0:
        with open("results", "a") as f:
            f.write(str(combined_set))
        combined_set = set()
        gc.collect()
    # Pair the current ID with every other ID; starting this inner loop at
    # i + 1 would skip mirrored pairs and halve the work.
    for b in range(len(new_set)):
        combined_set.add((i, b))
I have been turning a problem around for several days and I cannot reach a working solution with my Python code. I have a dataframe with sets of coordinates - let's say the whole data set of a Formula 1 Grand Prix! I defined functions that calculate some attributes, for example the distance, the speed, or the pit stop times, for all the cars. My functions work, but I cannot find the way to save the information for all the cars: I only ever end up with my first or my last result in a dataframe.
I think the question is very basic and I only need the correct way to save computed results into a dataframe when the function is run several times. Thanks for your kind responses!
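For what it's worth, one common pattern is to accumulate one result per car in a list and build the dataframe once at the end. A rough sketch only; the column names and the compute_attributes function here are made-up stand-ins for the real data and logic:

import pandas as pd

# Toy stand-in for the real data: a few position samples per car
# ("car" and "x" are assumed column names).
df = pd.DataFrame({
    "car": [1, 1, 2, 2, 2],
    "x":   [0.0, 1.5, 0.0, 2.0, 3.5],
})

def compute_attributes(car_df):
    # Hypothetical placeholder for the real per-car calculation.
    return {"distance": car_df["x"].diff().abs().sum(), "n_points": len(car_df)}

# Accumulate one result per car in a list, then build the final frame once,
# instead of overwriting a single result inside the loop.
results = []
for car_id, car_df in df.groupby("car"):
    row = compute_attributes(car_df)
    row["car"] = car_id
    results.append(row)

summary = pd.DataFrame(results).set_index("car")
print(summary)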
Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row-by-row basis, and the sheer number of rows makes this process take a couple of hours. Hence, I save the dataframe, which can also take a bit of time - 30 seconds at least.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am wasting a lot of time waiting for the save to finish at the termination of the script needlessly.
Is there a DataFrame attribute I can test to see if the DataFrame has been modified? Essentially, if this is False I can skip the save at the end of the script, but if it is True then a save is necessary. This simple one-line if will save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about its functionality in pandas' documentation. It does basically what you want, returning True only if both DataFrames are equal (same shape, same elements), and it is probably the fastest way to do this since it is implemented by pandas itself.
Notice that you need to use .copy() when assigning old_df, before making changes to your current df; otherwise old_df would just be another reference to the same object instead of a snapshot of its values.
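A sketch of how the whole pattern could look; the pickle file name and the load/save calls are assumptions, not your actual code:

import pandas as pd

df = pd.read_pickle("big_frame.pkl")   # assumed storage format and file name
snapshot = df.copy()                   # a copy of the values, not a reference

# ... row-by-row interpolation / plotting work that may or may not add columns ...

if not df.equals(snapshot):
    df.to_pickle("big_frame.pkl")      # only pay for the ~30 s save when something changed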
I am using three dataframes to analyze sequential numeric data - basically numeric data captured over time. There are 8 columns and 360k entries. I created three identical dataframes - one holds the raw data, the second is a "scratch pad" for analysis, and the third contains the analyzed outcome. This runs really slowly. I'm wondering if there are ways to make this analysis run faster? Would it be faster if, instead of three separate 8-column dataframes, I had one large 24-column dataframe?
Use cProfile and line_profiler to figure out where the time is being spent.
To get help from others, post your real code and your real profile results.
Optimization is an empirical process. The little tips people have are often counterproductive.
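For example, a minimal cProfile run might look like the sketch below, assuming the work lives in a function called run_analysis (a made-up name; point it at your real entry point):

import cProfile
import pstats

def run_analysis():
    # Placeholder for the real analysis over the three dataframes.
    pass

cProfile.run("run_analysis()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)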
Most probably it doesn't matter, because pandas stores each column separately anyway (a DataFrame is a collection of Series). But you might get better data locality (all data next to each other in memory) from using a single frame, so it's worth trying. Check this empirically.
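If you want to try the single-frame layout, one possible sketch, assuming the three frames share the same index, is to concatenate them side by side with a label per role:

import numpy as np
import pandas as pd

n = 360_000
cols = [f"c{i}" for i in range(8)]
raw = pd.DataFrame(np.random.rand(n, 8), columns=cols)
scratch = raw.copy()
result = raw.copy()

# One 24-column frame; the top level of the column index records the role.
combined = pd.concat({"raw": raw, "scratch": scratch, "result": result}, axis=1)
print(combined["scratch"].shape)   # (360000, 8) - each block is still addressable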
Rereading this post I realize I could have been clearer. I have been using write statements like:
dm.iloc[p,XCol] = dh.iloc[x,XCol]
to transfer individual cells of one dataframe (dh) to a different row of a second dataframe (dm). It ran very slowly, but I needed this specific file sorted and I just lived with the performance.
According to "Learning Pandas" by Michael Heydt (p. 146), ".iat" is faster than ".iloc" for extracting (or writing) scalar values from a dataframe. I tried it and it works. With my original 300k-row files, run time was 13 hours(!) using ".iloc"; the same data file using ".iat" ran in about 5 minutes.
Net - this is faster:
dm.iat[p,XCol] = dh.iat[x,XCol]
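If you want to reproduce the comparison on synthetic data, something like this sketch works (the frame sizes and number of writes are illustrative only):

import time

import numpy as np
import pandas as pd

dh = pd.DataFrame(np.random.rand(300_000, 8))
dm = dh.copy()
rows = np.random.permutation(len(dh))[:50_000]
XCol = 0   # column position, matching the snippet above

start = time.perf_counter()
for p, x in enumerate(rows):
    dm.iloc[p, XCol] = dh.iloc[x, XCol]
print("iloc:", time.perf_counter() - start)

start = time.perf_counter()
for p, x in enumerate(rows):
    dm.iat[p, XCol] = dh.iat[x, XCol]
print("iat: ", time.perf_counter() - start)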
I have a dictionary of pandas Series, each with its own index and all containing float numbers.
I need to create a pandas DataFrame with all these series, which works fine by just doing:
result = pd.DataFrame( dict_of_series )
Now, I actually have to do this a large number of times, along with some heavy calculation (we're in a Monte Carlo engine).
I noticed that this line is where my code spends the most time - obviously, once I sum up all the times it's been called.
I thought about caching the result, but unfortunately dict_of_series is different almost every time.
I guess what takes time is that the constructor has to build the union of all the indexes and fill the holes, and maybe there is simply no way around that, but I'm wondering if I'm missing something obvious that slows the process down, or if there is something smarter I could do to speed it up.
Has anybody had the same experience?
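For reference, a rough reproduction of that pattern which can be timed in isolation; the sizes and index overlap here are made up:

import time

import numpy as np
import pandas as pd

# Build a dict of Series with partially overlapping indexes, roughly the
# shape described above.
rng = np.random.default_rng(0)
dict_of_series = {
    f"s{i}": pd.Series(
        rng.random(1_000),
        index=rng.choice(5_000, size=1_000, replace=False),
    )
    for i in range(50)
}

start = time.perf_counter()
result = pd.DataFrame(dict_of_series)   # the line that dominates the profile
print(f"built {result.shape} frame in {time.perf_counter() - start:.4f}s")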