First of all: I know this is a dangerous question. There are a lot of similar questions about storing and accessing nested data in pandas but I think my question is different (more general) so hang on. :)
I have a medium-sized dataset of workouts for 1 athlete. Each workout has a date and time, ~200 properties (e.g. average speed and heart rate) and some raw data (3-10 lists of e.g. speed and heart rate values per second). I have about 300 workouts and each workout contains on average ~4000 seconds.
So far I tried 3 solutions to store this data with pandas to be able to analyze it (a rough sketch of each option follows the list):
(1) I could use a MultiIndex and store all data in 1 DataFrame, but this DataFrame would get quite large (which doesn't have to be a problem, but visually inspecting it will be hard) and slicing the data is cumbersome.
(2) Another way would be to store the date and properties in a DataFrame df_1 and to store the raw data in a separate DataFrame df_2 that I would store in a separate column raw_data in df_1.
(3) Or (similar to (2)) I could store the raw data in separate DataFrames that I store in a dict with keys identical to the index of the DataFrame df_1.
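To make the three options concrete, here is a minimal sketch of how each might look; the workout data, column names and timestamps are all made up for illustration, and the object-column trick in (2) is just one way to nest DataFrames:

import numpy as np
import pandas as pd

# Hypothetical toy data: two workouts with per-second raw values.
workouts = {
    pd.Timestamp("2021-05-01 09:00"): pd.DataFrame(
        {"speed": [3.1, 3.3, 3.2], "heart_rate": [120, 125, 127]}),
    pd.Timestamp("2021-05-03 18:30"): pd.DataFrame(
        {"speed": [2.8, 2.9], "heart_rate": [110, 112]}),
}

# (1) One big frame with a (workout, second) MultiIndex.
df_all = pd.concat(workouts, names=["workout", "second"])

# (2) Summary frame df_1 with the per-workout raw DataFrames in a column.
df_1 = pd.DataFrame(
    {"avg_speed": [w["speed"].mean() for w in workouts.values()],
     "avg_heart_rate": [w["heart_rate"].mean() for w in workouts.values()]},
    index=list(workouts))
raw_column = np.empty(len(workouts), dtype=object)  # object array so each cell holds a DataFrame
for i, w in enumerate(workouts.values()):
    raw_column[i] = w
df_1["raw_data"] = raw_column

# (3) Summary frame df_1 plus a plain dict keyed by df_1's index.
raw = dict(workouts)

# Slicing one workout under each layout:
ts = pd.Timestamp("2021-05-01 09:00")
print(df_all.loc[ts])            # (1)
print(df_1.loc[ts, "raw_data"])  # (2)
print(raw[ts])                   # (3)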
All of these solutions work, and for this use case there are no major performance differences between them. To me (1) feels the most 'Pandorable' (really like that word :) ), but slicing the data is difficult and visual inspection of the DataFrame (printing it) is of no use. (2) feels a bit 'hackish' and in-place modifications can be unreliable, but this solution is very nice to work with. And (3) is ugly and a bit difficult to work with, but also the most Pythonic in my opinion.
Question: What would be the benefits of each method and what is the most Pandorable solution in your opinion?
By the way: Of course I am open to alternative solutions.
Related
I'm looking to apply the same function to multiple sub-combinations of a pandas DataFrame. Imagine the full DataFrame having 15 columns; if I want to draw from it every sub-frame containing 10 columns, I would have 3003 such sub-frames in total. My current approach is to use multiprocessing, which works well for a full DataFrame with about 20 columns (184,756 combinations); however, the real full frame has 50 columns, leading to more than 10 billion combinations, at which point it takes too long. Is there any library that would be suitable for this type of calculation? I have used dask before and it's incredibly powerful, but dask is only suitable for calculation on a single DataFrame, not different ones.
Thanks.
It's hard to answer this question without a MVCE. The best path forward depends on what you want to do with your 10 billion DataFrame combinations (write them to disk, train models, aggregations, etc.).
I'll provide some high level advice:
Using a columnar file format like Parquet allows for column pruning (grabbing certain columns rather than all of them), which can be memory efficient.
Persisting the entire DataFrame in memory with ddf.persist() may be a good way for you to handle this combinations problem so you're not constantly reloading it (a sketch of both ideas follows).
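As a loose sketch of what that could look like (the file name and column names are placeholders, and which columns you prune obviously depends on your combinations):

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster

# Column pruning: only the requested columns are read from the Parquet file.
ddf = dd.read_parquet("full_frame.parquet", columns=["col_a", "col_b", "col_c"])

# Keep the pruned frame in (distributed) memory so each of the many
# column combinations reuses it instead of re-reading from disk.
ddf = ddf.persist()

print(ddf.head())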
Feel free to add more detail about the problem and I can add a more detailed solution.
I am struggling to come up with an efficient way of solving what seems to be a typical use case of dask.dataframe groupby+apply and am wondering if I'm missing something obvious (various docs speak to this issue but I haven't been able to fully resolve it).
In short, I'm trying to load a medium-sized (say 10GB) dataframe, group by some column, train a machine learning model for each subset (a few seconds per model, ~100k subsets), and save that model to disk. My best attempt so far looks like:
from dask.distributed import Client
import dask.dataframe as dd

c = Client()
df = dd.read_parquet('data.parquet')
df = c.persist(df.set_index('key'))  # data is already sorted by key
result = c.compute(df.groupby(df.index).apply(train_and_save_model))
No matter how I try to repartition the data, I seem to spend an enormous amount of time on serialization/IO compared to on actual computation. A silly naive workaround of writing 100k separate Parquet files up front and then passing filenames to the workers to load/train on seems to be much more efficient; I'm struggling to see why that would perform any differently. Isn't the idea of setting the index and partitioning that each worker understands which parts of the file it should read from? I assume I'm missing something obvious here so any guidance would be appreciated.
It looks like there's a big difference between data that is read in with index='key' and data that is re-indexed with set_index; I thought that if the new index was already sorted then there'd be no shuffling cost but I guess that was wrong.
Changing to dd.read_parquet('data.parquet', index='key') seems to give performance like what I was hoping for so I'm happy (though still curious why set_index seems to shuffle unnecessarily in this case).
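For reference, a minimal sketch of the two variants discussed above (same file and key column as in the question; the speed-up presumably comes from read_parquet being able to use the existing sort order and Parquet metadata instead of shuffling):

import dask.dataframe as dd
from dask.distributed import Client

c = Client()

# Slow variant: calling set_index after reading can trigger a full shuffle,
# because dask does not know the data on disk is already sorted by 'key'.
df_slow = c.persist(dd.read_parquet('data.parquet').set_index('key'))

# Fast variant: let read_parquet set the index directly, so the existing
# sort order can be used and the shuffle is skipped.
df_fast = c.persist(dd.read_parquet('data.parquet', index='key'))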
I often need to construct a bi-temporal Pandas DataFrame, with [CURR_DATE, HIST_DATE] as the MultiIndex, and a dozen columns of numerical, categorical and text data. Note that for every CURR_DATE, HIST_DATE cycles through, say, 3 years of dates, and the rest of the row depends largely on HIST_DATE (it describes information about that HIST_DATE) and changes very infrequently (only when the information for that HIST_DATE gets updated on a certain CURR_DATE).
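A toy illustration of that layout (all names and sizes are mine, just to show the repetition):

import numpy as np
import pandas as pd

curr_dates = pd.date_range("2023-01-01", periods=5, freq="D")         # CURR_DATE
hist_dates = pd.date_range("2020-01-01", periods=3 * 365, freq="D")   # HIST_DATE

idx = pd.MultiIndex.from_product([curr_dates, hist_dates],
                                 names=["CURR_DATE", "HIST_DATE"])

# One numeric and one text column that depend (almost) only on HIST_DATE,
# so the same values get copied once per CURR_DATE.
per_hist_value = np.arange(len(hist_dates), dtype=float)
df = pd.DataFrame({
    "value": np.tile(per_hist_value, len(curr_dates)),
    "label": np.tile(hist_dates.strftime("%Y-%m"), len(curr_dates)),
}, index=idx)

print(df.memory_usage(deep=True).sum())  # grows linearly with the number of CURR_DATEs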
As you can see, this DataFrame has lots of repeated information. But it gets copied again and again, making the entire DataFrame very memory inefficient. (Comparatively, dict object would allow references hence pointing to the same underlying object to be highly efficient.)
Question: what would be a better way to construct the DataFrame that still allows bi-temporal processing (e.g. CURR_DATE needs to be joined with some other DataFrame, and HIST_DATE needs to be joined with a third DataFrame), while making the entire DataFrame a lot more memory efficient / giving it a much smaller memory footprint?
(Feel free to ask me to clarify if the question is not clear.)
I am using three dataframes to analyze sequential numeric data - basically numeric data captured over time. There are 8 columns and 360k entries. I created three identically structured dataframes: one holds the raw data, the second is a "scratch pad" for analysis, and the third contains the analyzed outcome. This runs really slowly. I'm wondering if there are ways to make this analysis run faster? Would it be faster if, instead of three separate 8-column dataframes, I had one large 24-column dataframe?
Use cProfile and line_profiler to figure out where the time is being spent.
To get help from others, post your real code and your real profile results.
Optimization is an empirical process. The little tips people have are often counterproductive.
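For example, one common way to do this with cProfile (script and output names are placeholders): profile the run from the shell, then inspect the stats in Python:

# Run the analysis under cProfile first:
#   python -m cProfile -o analysis.prof my_analysis.py
# Then see where the time went:
import pstats

stats = pstats.Stats("analysis.prof")
stats.sort_stats("cumulative").print_stats(20)  # top 20 call paths by cumulative time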
Most probably it doesn't matter because pandas stores each column separately anyway (DataFrame is a collection of Series). But you might get better data locality (all data next to each other in memory) by using a single frame, so it's worth trying. Check this empirically.
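A rough way to check that empirically (the shapes mimic the question, and the .sum() calls are only a stand-in for the real analysis):

import numpy as np
import pandas as pd
from timeit import timeit

n = 360_000
three = [pd.DataFrame(np.random.rand(n, 8)) for _ in range(3)]
one = pd.concat(three, axis=1, ignore_index=True)  # single 24-column frame

def work_three():
    return sum(df.sum().sum() for df in three)

def work_one():
    return one.sum().sum()

print("three frames:", timeit(work_three, number=20))
print("one frame:   ", timeit(work_one, number=20))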
Rereading this post, I realize I could have been clearer. I have been using a write statement like:
dm.iloc[p,XCol] = dh.iloc[x,XCol]
to transfer individual cells of one dataframe (dh) to a different row of a second dataframe (dm). It ran very slowly, but I needed this specific file sorted, so I just lived with the performance.
According to "Learning Pandas" by Michael Heydt (pg. 146), ".iat" is faster than ".iloc" for extracting (or writing) scalar values from a dataframe. I tried it and it works. With my original 300k-row files, the run time was 13 hours(!) using ".iloc"; the same data file using ".iat" ran in about 5 minutes.
Net - this is faster:
dm.iat[p,XCol] = dh.iat[x,XCol]
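A small, self-contained way to reproduce the difference (sizes shrunk so it finishes in seconds; column 3 stands in for XCol):

import numpy as np
import pandas as pd
from timeit import timeit

dh = pd.DataFrame(np.random.rand(10_000, 8))
dm = dh.copy()

def copy_with_iloc():
    for i in range(len(dm)):
        dm.iloc[i, 3] = dh.iloc[i, 3]

def copy_with_iat():
    for i in range(len(dm)):
        dm.iat[i, 3] = dh.iat[i, 3]

print("iloc:", timeit(copy_with_iloc, number=1))
print("iat: ", timeit(copy_with_iat, number=1))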
I was wondering how pandas handles memory usage in Python, and more specifically how memory is handled if I assign the result of a DataFrame query to a variable. Under the hood, would that variable just hold references into the original DataFrame object, or would I be cloning all of the data?
I'm afraid of memory ballooning out of control, but I have a dataframe with non-unique fields that I can't index it by. It's incredibly slow to query and plot data from it using commands like df[(df[''] == x) & (df[''] == y)].
(They're both integer values in the rows. They're also not unique, hence the fact it returns multiple results.)
I'm very new to pandas anyway, but any insights as to how to handle a situation where I'm looking for the arrays of values where two conditions match would be great too. Right now I'm using an O(n) algorithm to loop through and index it because even that runs faster than the search queries when I need to access the data quickly. Watching my system take twenty seconds on a dataset of only 6,000 rows is foreboding.
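To address the copy-vs-reference part of the question, here is a small experiment one could run (column names, values and sizes are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randint(0, 50, 6_000),
                   "b": np.random.randint(0, 50, 6_000),
                   "val": np.random.rand(6_000)})

result = df[(df["a"] == 3) & (df["b"] == 7)]

# Boolean-mask selection materialises a new frame rather than a view,
# so the result does not share its data buffer with the original.
print(np.shares_memory(result["val"].to_numpy(), df["val"].to_numpy()))  # False
print(result.memory_usage(deep=True).sum(), "bytes in the result")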