I often need to construct a bi-temporal Pandas DataFrame, with [CURR_DATE, HIST_DATE] as a MultiIndex, and a dozen columns of numerical, categorical and text data. Note that for every CURR_DATE, the HIST_DATE cycles through, say, 3 years, and the rest of the row largely depends on HIST_DATE, describing information about that HIST_DATE, with very infrequent changes (because the information for a given HIST_DATE only gets updated on certain CURR_DATEs).
As you can see, this DataFrame contains a lot of repeated information, yet that information gets copied again and again, making the entire DataFrame very memory inefficient. (By comparison, a dict would allow references, so repeated entries could all point to the same underlying object and be highly efficient.)
Question: what would be a better way to construct the DataFrame that still allows bi-temporal processing (e.g. CURR_DATE needs to be joined with some other DataFrame, and HIST_DATE needs to be joined with a third DataFrame), while making the entire DataFrame a lot more memory efficient / giving it a much smaller memory footprint?
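For concreteness, here is a toy version of how I construct the frame today (the column names and values are made up); every HIST_DATE's attributes get duplicated for each CURR_DATE:

import pandas as pd

curr_dates = pd.date_range("2020-01-01", periods=5, freq="D")        # made-up run dates
hist_dates = pd.date_range("2017-01-01", periods=3 * 365, freq="D")  # ~3 years of history

idx = pd.MultiIndex.from_product([curr_dates, hist_dates],
                                 names=["CURR_DATE", "HIST_DATE"])

# The per-HIST_DATE attributes (made-up columns) end up copied for every CURR_DATE.
hist_attrs = pd.DataFrame(
    {"category": "equity", "note": "some repeated text", "value": 1.0},
    index=hist_dates,
)
df = hist_attrs.reindex(idx.get_level_values("HIST_DATE"))
df.index = idx

print(df.memory_usage(deep=True).sum())  # this is where the repetition hurts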
(Feel free to ask me to clarify if the question is not clear.)
I'm looking to apply the same function to multiple sub-combinations of a pandas DataFrame. Imagine the full DataFrame has 15 columns, and I want to draw from it sub-frames containing 10 columns; I would have 3003 such sub-frames in total. My current approach is to use multiprocessing, which works well for a full DataFrame with about 20 columns (184,756 combinations); however, the real full frame has 50 columns, leading to more than 10 billion combinations, at which point it takes too long. Is there any library that would be suitable for this type of calculation? I have used dask before and it's incredibly powerful, but dask is only suitable for calculations on a single DataFrame, not on many different ones.
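For reference, here is roughly how I generate and process the sub-frames today (a toy 15-column frame; apply_my_function is a placeholder for the real per-sub-frame calculation):

from itertools import combinations
from multiprocessing import Pool

import numpy as np
import pandas as pd

# Toy stand-in for the real frame (15 columns here, 50 in reality).
full_frame = pd.DataFrame(np.random.rand(100, 15),
                          columns=[f"c{i}" for i in range(15)])

def apply_my_function(sub_frame):
    # Placeholder for the real per-sub-frame calculation.
    return sub_frame.to_numpy().sum()

def process_combo(cols):
    return cols, apply_my_function(full_frame[list(cols)])

if __name__ == "__main__":
    combos = list(combinations(full_frame.columns, 10))  # 3003 combinations of 10 out of 15
    with Pool() as pool:
        results = pool.map(process_combo, combos)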
Thanks.
It's hard to answer this question without an MCVE (minimal, complete, verifiable example). The best path forward depends on what you want to do with your 10 billion DataFrame combinations (write them to disk, train models, run aggregations, etc.).
I'll provide some high level advice:
- Using a columnar file format like Parquet allows for column pruning (grabbing only the columns you need rather than all of them), which can be memory efficient.
- Persisting the entire DataFrame in memory with ddf.persist() may be a good way for you to handle this combinations problem so you're not constantly reloading it (see the sketch below).
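A minimal sketch of both points, assuming the full frame lives in a Parquet file (the file and column names here are just placeholders):

import dask.dataframe as dd

# Column pruning: only read the columns a given combination needs,
# instead of loading all 50 from disk.
ddf = dd.read_parquet("full_frame.parquet", columns=["col_a", "col_b", "col_c"])

# Keep the pruned frame in memory across the many per-combination
# computations so it isn't re-read from disk every time.
ddf = ddf.persist()

result = ddf.mean().compute()  # stand-in for whatever you compute per sub-frame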
Feel free to add more detail about the problem and I can add a more detailed solution.
Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row-by-row basis, and the sheer number of rows makes this process take a couple of hours. Hence, I save the dataframe afterwards, which can also take a bit of time - 30 seconds at least.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am needlessly wasting a lot of time waiting for the save to finish at the end of the script.
Is there a DataFrame attribute I can test to see if the DataFrame has been modified? Essentially, if this is False I can skip saving at the end of the script, but if it is True then a save is necessary. This simple one-line if will save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about its functionality in the pandas documentation. It basically does what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do this since it's implemented in pandas itself.
Note that you need to take a .copy() when assigning old_df before making changes to your current df; otherwise old_df would just be a reference to the same object and the comparison would always return True.
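A minimal sketch of that pattern, assuming the frame is saved with pickle (the file name is just an example):

import pandas as pd

df = pd.read_pickle("large_frame.pkl")  # example file name
old_df = df.copy()                      # snapshot by value, not by reference

# ... interpolation code that may or may not add new columns ...

# Only pay the ~30 s save cost if something actually changed.
if not df.equals(old_df):
    df.to_pickle("large_frame.pkl")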
First of all: I know this is a dangerous question. There are a lot of similar questions about storing and accessing nested data in pandas but I think my question is different (more general) so hang on. :)
I have a medium-sized dataset of workouts for 1 athlete. Each workout has a date and time, ~200 properties (e.g. average speed and heart rate) and some raw data (3-10 lists of e.g. per-second speed and heart rate values). I have about 300 workouts and each workout contains on average ~4000 seconds of data.
So far I have tried 3 solutions to store this data with pandas so I can analyze it:
1. I could use a MultiIndex and store all data in 1 DataFrame, but this DataFrame would get quite large (which doesn't have to be a problem, but visually inspecting it will be hard) and slicing the data is cumbersome.
2. Another way would be to store the date and properties in a DataFrame df_1 and to store the raw data in a separate DataFrame df_2 that I would store in a separate column raw_data in df_1.
3. ...Or (similar to (2)) I could store the raw data in separate DataFrames that I store in a dict with keys identical to the index of the DataFrame df_1.
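To make these options concrete, here is a rough sketch of what I mean by (1) and (3); all names and numbers below are made up:

import numpy as np
import pandas as pd

workouts = pd.to_datetime(["2019-01-05 08:00", "2019-01-07 18:30"])  # made-up dates
n_seconds = 4000

# Option (1): one big MultiIndex frame, (workout, second) -> raw values.
raw_all = pd.DataFrame(
    {"speed": np.random.rand(len(workouts) * n_seconds),
     "heart_rate": np.random.randint(90, 180, len(workouts) * n_seconds)},
    index=pd.MultiIndex.from_product([workouts, range(n_seconds)],
                                     names=["workout", "second"]),
)

# Option (3): a summary frame plus a dict of per-workout raw frames keyed by its index.
df_1 = pd.DataFrame({"avg_speed": [7.2, 6.8], "avg_heart_rate": [140, 132]},
                    index=workouts)
raw_data = {
    ts: pd.DataFrame({"speed": np.random.rand(n_seconds),
                      "heart_rate": np.random.randint(90, 180, n_seconds)})
    for ts in workouts
}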
All three solutions work, and for this use case there are no major performance benefits to any of them. To me, (1) feels the most 'Pandorable' (really like that word :) ), but slicing the data is difficult and visual inspection of the DataFrame (printing it) is of no use. (2) feels a bit 'hackish' and in-place modifications can be unreliable, but this solution is very nice to work with. And (3) is ugly and a bit difficult to work with, but also the most Pythonic in my opinion.
Question: What would be the benefits of each method and what is the most Pandorable solution in your opinion?
By the way: Of course I am open to alternative solutions.
I was wondering how pandas handles memory usage in Python, and more specifically how memory is handled if I assign the result of a DataFrame query to a variable. Under the hood, would that variable just hold references into the original dataframe object, or would I be cloning all of the data?
I'm afraid of memory ballooning out of control, but I have a dataframe with non-unique fields that I can't index by. It's incredibly slow to query and plot data from it using commands like df[(df[''] == x) & (df[''] == y)].
(They're both integer values in the rows. They're also not unique, hence the fact it returns multiple results.)
I'm very new to pandas anyway, but any insights as to how to handle a situation where I'm looking for the arrays of values where two conditions match would be great too. Right now I'm using an O(n) algorithm to loop through and index it because even that runs faster than the search queries when I need to access the data quickly. Watching my system take twenty seconds on a dataset of only 6,000 rows is foreboding.
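For concreteness, this is roughly what I am doing now; x_col / y_col stand in for the two real (blanked-out) column names:

import numpy as np
import pandas as pd

# Toy stand-in: two non-unique integer columns, ~6,000 rows like my real data.
df = pd.DataFrame({"x_col": np.random.randint(0, 50, 6000),
                   "y_col": np.random.randint(0, 50, 6000),
                   "value": np.random.rand(6000)})
x, y = 3, 7

# The slow boolean-mask query:
subset = df[(df["x_col"] == x) & (df["y_col"] == y)]

# The O(n) pre-indexing pass I currently fall back to instead:
lookup = {}
for row in df.itertuples(index=False):
    lookup.setdefault((row.x_col, row.y_col), []).append(row)
rows_for_pair = lookup.get((x, y), [])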
I'm working on implementing a relatively large (5,000,000 rows and growing) set of time series data in an HDF5 table. I need a way to remove duplicates from it on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during retrieval than to ensure no duplicates go in.
What is the best way to remove duplicates from a pytable? All of my reading is pointing me towards importing the whole table into pandas, getting a unique-valued DataFrame, and writing it back to disk, recreating the table with each data run. This seems counter to the point of pytables, though, and over time I don't know that the whole data set will fit efficiently into memory. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...
See this related question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates; avoiding them is the user's responsibility.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates().
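If the table currently still fits in memory, the pandas round-trip that your reading points to looks roughly like this; the file name, table key, and the two key columns are just placeholders for yours:

import pandas as pd

# Read the whole table, drop rows that repeat the two key columns,
# then rewrite the table. All names below are placeholders.
with pd.HDFStore("timeseries.h5") as store:
    df = store.select("data")

deduped = df.drop_duplicates(subset=["timestamp", "series_id"], keep="last")

# Recreate the table on disk with the de-duplicated data.
deduped.to_hdf("timeseries.h5", key="data", mode="w", format="table")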