I have a dictionary of pandas Series, each with its own index and all containing float numbers.
I need to create a pandas DataFrame with all these series, which works fine by just doing:
result = pd.DataFrame(dict_of_series)
Now, I actually have to do this a large number of times, along with some heavy calculation (we're in a Monte-Carlo engine).
I noticed that this line is where my code spends most of its time, summed over all the times it is called.
I thought about caching the result, but unfortunately dict_of_series is different almost every time.
I guess the time goes into the constructor building the global index and filling the holes, and maybe there is simply no way around that, but I'm wondering whether I'm missing something obvious that slows the process down, or whether there is something smarter I could do to speed it up.
Has anybody had the same experience?
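For context, here is a minimal sketch of the construction in question next to an equivalent one built with pd.concat; both are standard pandas calls, and whether concat is actually faster on the real data is only an assumption to be measured:
import numpy as np
import pandas as pd

# Toy stand-in for the real input: Series with partially overlapping indices.
dict_of_series = {
    f"col{i}": pd.Series(np.random.rand(100), index=np.arange(i, i + 100))
    for i in range(5)
}

# Original construction: aligns every Series on the union of their indices.
result = pd.DataFrame(dict_of_series)

# Equivalent construction (outer join along axis=1), worth timing as an alternative.
result_concat = pd.concat(dict_of_series, axis=1)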
Related
At the moment I'm trying to do a time series analysis in Python. My data is stored in a pandas DataFrame, and I want to calculate e.g. the mean over smaller time windows (for the data in the columns) and save the new values directly into an array. If I only had a small dataset I would do it manually like this:
dataframe1.values  # to convert it into an array
array2 = np.array([dataframe1[0:1000].mean(), dataframe1[1001:2000].mean(), ...])
But I have a really large dataset, so it would take very long to do it this way by hand. I thought about solving the problem with a loop, but I don't really know how to do this when I want to save the new values directly into an array. Thanks in advance :)
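A minimal sketch of one loop-free way to get those chunk means into an array, grouping the rows into fixed-size blocks (the 1000-row chunk size mirrors the slices above; the variable names are illustrative):
import numpy as np
import pandas as pd

# Toy DataFrame standing in for the real data: 10,000 rows, 3 columns.
dataframe1 = pd.DataFrame(np.random.rand(10_000, 3), columns=["a", "b", "c"])

chunk_size = 1000
# Label each row with its chunk number (0, 0, ..., 1, 1, ...), then average per chunk.
chunk_labels = np.arange(len(dataframe1)) // chunk_size
array2 = dataframe1.groupby(chunk_labels).mean().to_numpy()

print(array2.shape)  # (number_of_chunks, number_of_columns)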
I am struggling to come up with an efficient way of solving what seems to be a typical use case of dask.dataframe groupby+apply and am wondering if I'm missing something obvious (various docs speak to this issue but I haven't been able to fully resolve it).
In short, I'm trying to load a medium-sized (say 10GB) dataframe, group by some column, train a machine learning model for each subset (a few seconds per model, ~100k subsets), and save that model to disk. My best attempt so far looks like:
from dask.distributed import Client
import dask.dataframe as dd

c = Client()
df = dd.read_parquet('data.parquet')
df = c.persist(df.set_index('key'))  # data is already sorted by key
result = c.compute(df.groupby(df.index).apply(train_and_save_model))
No matter how I try to repartition the data, I seem to spend an enormous amount of time on serialization/IO compared to actual computation. A silly naive workaround of writing 100k separate Parquet files up front and then passing filenames to the workers to load and train on seems to be much more efficient, and I'm struggling to see why the two approaches should perform so differently. Isn't the idea of setting the index and partitioning that each worker understands which parts of the file it should read from? I assume I'm missing something obvious here, so any guidance would be appreciated.
It looks like there's a big difference between data that is read in with index='key' and data that is re-indexed with set_index; I thought that if the new index was already sorted then there'd be no shuffling cost but I guess that was wrong.
Changing to dd.read_parquet('data.parquet', index='key') seems to give performance like what I was hoping for so I'm happy (though still curious why set_index seems to shuffle unnecessarily in this case).
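For reference, a minimal sketch of that variant, passing the index at read time so no set_index (and hence no shuffle) is needed; train_and_save_model stands in for the same user-defined function as above:
from dask.distributed import Client
import dask.dataframe as dd

def train_and_save_model(group):
    ...  # placeholder for the real per-key training + save logic

c = Client()
# Declaring the index at read time avoids the later set_index and its shuffle.
df = dd.read_parquet('data.parquet', index='key')
result = c.compute(df.groupby(df.index).apply(train_and_save_model))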
I am using pyspark, and I call getNumPartitions() to see if I need to repartition, and it is dramatically slowing down my code. The code is too large to post here. My code works like this:
I have a for loop that loops through a bunch of functions that will be applied to a DataFrame
Obviously these are applied lazily, so they don't get applied until the end of the for loop.
Many of them are withColumn functions, or pivot functions like this: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
At each iteration, I print out the number of partitions by getNumPartitions()
I was under the impression that this is not an expensive operation... am I misunderstanding, and is it actually expensive? Or is something else slowing down my code?
Looking at the source for getNumPartitions()...
def getNumPartitions(self):
    return self._jrdd.partitions().size()
it should not be that expensive. I suspect that something else is going on that's causing your slowdown.
Here's what I do know:
The list of partitions is cached, so only the first call to partitions() will cause the partitions to be calculated
Spark has to calculate the partitions for each RDD anyway, so it shouldn't add any more time for you to query the count
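To confirm where the time actually goes, it may be worth timing the call in isolation against a real action; a rough sketch, assuming an existing DataFrame named df:
import time

rdd = df.rdd  # DataFrame -> RDD; getNumPartitions() lives on the RDD API

start = time.perf_counter()
n = rdd.getNumPartitions()
print(f"getNumPartitions(): {n} partitions in {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
df.count()  # a real action, which forces the pending transformations to run
print(f"count(): {time.perf_counter() - start:.4f}s")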
Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row-by-row basis, and the sheer number of rows makes this process take a couple of hours. I then save the dataframe, which can also take a bit of time - 30 seconds at least.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am needlessly wasting a lot of time waiting for the save to finish when the script terminates.
Is there a DataFrame attribute I can test to see whether the DataFrame has been modified? Essentially, if it is False I can skip saving at the end of the script, but if it is True then a save is necessary. This simple one-line if will save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about its functionality in the pandas documentation. It does exactly what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do this since it's implemented by pandas itself.
Notice that you need to use .copy() when assigning old_df, before making changes to your current df; otherwise you would only store a reference to the same DataFrame rather than an independent snapshot of it.
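A minimal sketch of that pattern in a script, where the file name, format, and interpolation step are placeholders:
import pandas as pd

df = pd.read_pickle("data.pkl")  # placeholder for however the frame is loaded
old_df = df.copy()               # snapshot *before* any columns might be added

# ... interpolation code that may or may not add new columns to df ...

if not df.equals(old_df):
    # Only pay the ~30 s save (and the SSD writes) when something actually changed.
    df.to_pickle("data.pkl")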
I have a live feed of logging data coming in through the network. I need to calculate live statistics, like the ones in my previous question. How would I design this module? It seems unrealistic (read: bad design) to keep applying a groupby to the entire df every single time a message arrives. Can I just update one row and have its calculated column update automatically?
JFYI, I'd be running another thread that reads values from the df and prints them to a webpage every 5 seconds or so.
Of course, I could run groupby-apply every 5 seconds instead of doing it in real time, but I thought it'd be better to keep the df and the calculation independent of the printing module.
Thoughts?
groupby is pretty damn fast, and if you preallocate slots for new items you can make it even faster. In other words, try it and measure it for a reasonable amount of fake data. If it's fast enough, use pandas and move on. You can always rewrite it later.
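A rough benchmark sketch along those lines, with fake data shaped loosely like a log feed (sizes and column names are made up):
import time

import numpy as np
import pandas as pd

# Fake feed: one million (key, value) log records.
df = pd.DataFrame({
    "key": np.random.randint(0, 1000, size=1_000_000),
    "value": np.random.rand(1_000_000),
})

start = time.perf_counter()
stats = df.groupby("key")["value"].mean()
print(f"groupby-mean over {len(df):,} rows: {time.perf_counter() - start:.3f}s")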