Pandas - retrieving HDF5 columns and memory usage - python

I have a simple question, and I cannot help but feel that I am missing something obvious.
I have read data from a source table (SQL Server) and have created an HDF5 file to store the data via the following:
output.to_hdf('h5name', 'df', format='table', data_columns=True, append=True, complib='blosc', min_itemsize = 10)
The dataset is ~50 million rows and 11 columns.
If I read the entire HDF5 file back into a dataframe (through HDFStore.select or read_hdf), it consumes about 24GB of RAM. If I pass specific columns to the read statement (e.g. selecting 2 or 3 columns), the dataframe only contains those columns, but the same amount of memory (24GB) is consumed.
This is running on Python 2.7 with Pandas 0.14.
Am I missing something obvious?
EDIT: I think I answered my own question. While I did a ton of searching before posting, obviously once posted I found a useful link: https://github.com/pydata/pandas/issues/6379
Any suggestions on how to optimize this process would be great. Due to memory limitations, I cannot reach the peak memory that is required before gc releases it.

HDFStore in table format is a row-oriented store. When you select, the query is indexed on the rows, but for each matching row you get every column. Selecting a subset of columns does a reindex at the end.
There are several ways to approach this:
use a column store, like bcolz; this is currently not implemented by PyTables so this would involve quite a bit of work
chunk through the table (see here) and concat at the end - this will use constant memory
store as a fixed format - this is a more efficient storage format so will use less memory (but cannot be appended)
create your own column-store-like layout by storing to multiple sub-tables and using select_as_multiple (see here)
Which option you choose depends on the nature of your data access.
Note: you may not want to have all of the columns as data_columns unless you are really going to select from them all (you can only query ON a data_column or an index).
This will make store/query faster.
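The chunking option above could look roughly like the sketch below. The helper names `iter_hdf_chunks` and `collect_columns` are made up for illustration, and the chunk size is an arbitrary assumption; `HDFStore.select(..., chunksize=...)` is the pandas call that does the actual chunked read (it needs PyTables installed). Only the wanted columns of each chunk are kept, so the full-width rows of a chunk can be released before the next chunk is read.

```python
import pandas as pd


def iter_hdf_chunks(path, key, chunksize=1_000_000):
    """Yield the table stored under `key` piece by piece (requires PyTables)."""
    with pd.HDFStore(path, mode="r") as store:
        # With chunksize set, select() returns an iterator of DataFrames.
        for chunk in store.select(key, chunksize=chunksize):
            yield chunk


def collect_columns(chunks, columns):
    """Keep only `columns` from each chunk, then concatenate once at the end."""
    parts = [chunk[columns] for chunk in chunks]
    return pd.concat(parts)


# Small in-memory stand-in for the HDF5 table, split into chunks of two rows.
demo = pd.DataFrame({"a": range(6), "b": list("uvwxyz"), "c": [0.5] * 6})
demo_chunks = (demo.iloc[i:i + 2] for i in range(0, len(demo), 2))
subset = collect_columns(demo_chunks, ["a", "c"])
print(subset.shape)
```

With the file from the question, the call would presumably be `collect_columns(iter_hdf_chunks('h5name', 'df'), [...])` with the two or three column names of interest.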

Related

How to reduce amount of ram used by pandas

I have a Raspberry Pi 3B+ with 1GB of RAM on which I am running a Telegram bot.
This bot uses a database I store in csv format, which contains about 100k rows and four columns:
First two are for searching
Third is a result
Those use about 20-30MB of RAM, which is acceptable.
The last column is really a problem: it shoots the RAM usage up to 180MB, which is impossible to manage on the RPi. This column is also for searching, but I only need it sometimes.
I started by loading the df with read_csv once at the start of the script and letting the script poll, but as the db grew I realized that this is too much for the RPi.
What do you think is the best way to do this? Thanks!
Setting the dtype according to the data can reduce memory usage a lot.
With read_csv you can directly set the dtype for each column:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
Example:
df = pd.read_csv(my_csv, dtype={0:'int8',1:'float32',2:'category'}, header=None)
See the next section on some dtype examples (with an existing df).
To change that on an existing dataframe you can use astype on the respective column.
Use df.info() to check the df memory usage before and after the change.
Some examples:
# Columns with integer values
# check first that the min & max of that column are within the respective int range so you don't lose info
df['column_with_integers'] = df['column_with_integers'].astype('int8')
# Columns with float values
# mind potential calculation precision = don't go too low
df['column_with_floats'] = df['column_with_floats'].astype('float32')
# Columns with categorical values (strings)
# e.g. when the rows repeatedly contain the same strings
# like 'TeamRed', 'TeamBlue', 'TeamYellow' spread over 10k rows
df['Team_Name'] = df['Team_Name'].astype('category')
# Change boolean-like string columns to actual booleans (values missing from the mapping become NaN)
df['Yes_or_No'] = df['Yes_or_No'].map({'yes':True, 'no':False})
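To see what these conversions buy you, here is a minimal sketch that measures memory before and after. The data is made up (10,000 rows, three columns named like the ones above), and `memory_usage(deep=True)` is used instead of `df.info()` so the numbers can be compared in code.

```python
import numpy as np
import pandas as pd

# Hypothetical data: small integers, floats and three repeated team names.
n = 10_000
teams = np.array(["TeamRed", "TeamBlue", "TeamYellow"])
df = pd.DataFrame({
    "column_with_integers": np.arange(n) % 100,  # values 0..99 fit in int8
    "column_with_floats": np.linspace(0.0, 1.0, n),
    "Team_Name": teams[np.arange(n) % 3],
})

before = df.memory_usage(deep=True).sum()

# Only downcast the integers if every value fits into the int8 range.
ints = df["column_with_integers"]
limits = np.iinfo("int8")
if ints.min() >= limits.min and ints.max() <= limits.max:
    df["column_with_integers"] = ints.astype("int8")

df["column_with_floats"] = df["column_with_floats"].astype("float32")
df["Team_Name"] = df["Team_Name"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

How much is saved depends on the data; the category conversion only pays off when the same strings repeat many times.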
Kudos to Medallion Data Science for his YouTube video Efficient Pandas Dataframes in Python - Make your code run fast with these tricks! where I learned those tips.
Kudos to Robert Haas for the additional link in the comments to Pandas Scaling to large datasets - Use efficient datatypes
Not sure if this is a good idea in this case but it might be worth trying.
The Dask package was designed to allow Pandas-like data analysis on dataframes that are too big to fit in memory (RAM) (as well as other things). It does this by only loading chunks of the complete dataframe into memory at a time.
However, not sure it was designed for machines like the Raspberry Pi (not even sure there is a distribution for it).
The good thing is Dask will slide seamlessly into your Pandas script so it might not be too much effort to try it out.
Here is a simple example I made before:
dask-examples.ipynb
If you try it let me know if it works, I'm also interested.
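If Dask turns out not to be available on the Pi, the same chunk-at-a-time idea can be sketched with plain pandas. This is not Dask, just `read_csv` with `usecols` and `chunksize`. The column names (`key1`, `key2`, `result`, `big_text`) and the tiny inline CSV are invented for the example; in the bot the source would be the path of the real CSV file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the bot's CSV file; the column names are made up.
csv_text = "key1,key2,result,big_text\n" + "".join(
    f"{i},k{i},r{i},long text {i}\n" for i in range(10)
)

# At start-up, load only the three light columns and keep them in memory.
light = pd.read_csv(io.StringIO(csv_text), usecols=["key1", "key2", "result"])


def search_heavy_column(source, needle, chunksize=4):
    """Stream the CSV in chunks and return the rows whose big_text contains needle."""
    hits = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        mask = chunk["big_text"].str.contains(needle, regex=False)
        if mask.any():
            hits.append(chunk[mask])
    if not hits:
        return pd.DataFrame()
    return pd.concat(hits, ignore_index=True)


# Only when the heavy column is needed: scan the file chunk by chunk.
found = search_heavy_column(io.StringIO(csv_text), "text 7")
print(found)
```

The trade-off is that every search on the heavy column re-reads the file, which is slower but keeps only one chunk of that column in RAM at a time.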

Large data with pivot table using Pandas

I’m currently using a Postgres database to store survey answers.
The problem I’m facing is that I need to generate a pivot table from the Postgres database.
When the dataset is small, it’s easy to just read the whole dataset and use Pandas to produce the pivot table.
However, my current database now has around 500k rows, and it’s growing by around 1000 rows per day. Reading the whole dataset is not efficient anymore.
My question is: do I need to use HDFS to store the data on disk and supply it to Pandas to do the pivoting?
My customers need to view the pivot table output in near real time. Is there any way to solve this?
My theory is that I’ll create the pivot table output for the 500k rows and store the output somewhere; then, when new data gets saved into the database, I’ll only need to merge the new data with the existing pivot table. I’m not quite sure whether Pandas supports this, or whether it needs the full dataset to do the pivoting.
Have you tried using pickle? I'm a data scientist and use this all the time with data sets of 1M+ rows and several hundred columns.
In your particular case I would recommend the following.
import pickle
# 'wb' stands for write bytes; the with block closes the file automatically
with open('path/file.pickle', 'wb') as save_data:
    pickle.dump(pd_data, save_data)
In the above code what you're doing is saving your data in a compact format that can quickly be loaded using:
# 'rb' stands for read bytes
with open('path/file.pickle', 'rb') as pickle_data:
    pd_data = pickle.load(pickle_data)
At which point you can append your data (pd_data) with the new 1,000 rows and save it again using pickle. If your data will continue to grow and you expect memory to become a problem I suggest identifying a way to append or concatenate the data rather than merge or join since the latter two can also result in memory issues.
You will find that this will cut out significant load time when reading something off your disk (I use Dropbox and it's still lightning fast). What I usually do in order to reduce that even further is segment my data sets into groups of rows & columns and then write methods that load the pickled data as need be (super useful for graphing).
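The idea from the question (keep the stored pivot and fold in only the new rows) can be sketched as follows. This assumes the pivot uses an additive aggregation such as a sum or a count; a mean cannot be updated this way unless the sum and the count are stored separately. The column names `question`, `answer` and `n` are hypothetical.

```python
import pandas as pd


def pivot_sum(df):
    """Pivot survey rows into a question x answer table of summed counts."""
    return df.pivot_table(
        index="question", columns="answer", values="n",
        aggfunc="sum", fill_value=0,
    )


# Rows that were already pivoted and stored earlier.
old_rows = pd.DataFrame({
    "question": ["q1", "q1", "q2"],
    "answer": ["yes", "no", "yes"],
    "n": [1, 1, 1],
})
# Rows that arrived since the last run.
new_rows = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "answer": ["yes", "no", "yes"],
    "n": [1, 1, 1],
})

stored = pivot_sum(old_rows)

# Pivot only the new rows and add them cell by cell; fill_value=0 treats a
# cell that exists in just one of the two tables as zero in the other.
updated = stored.add(pivot_sum(new_rows), fill_value=0).fillna(0).astype(int)

# For comparison: pivoting everything from scratch gives the same table.
full = pivot_sum(pd.concat([old_rows, new_rows]))
print(updated)
```

The stored pivot is small compared to the raw rows, so it can be saved with pickle as described above and updated on each run.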

Pandas memory usage?

I was wondering how pandas handles memory usage in Python, and more specifically how memory is handled if I assign the result of a pandas dataframe query to a variable. Under the hood, would it just be some memory addresses pointing to the original dataframe object, or would I be cloning all of the data?
I'm afraid of memory ballooning out of control but I have a dataframe that has non-unique fields I can't index it by. It's incredibly slow to query and plot data from it using commands like df[(df[''] == x) & (df[''] == y)].
(They're both integer values in the rows. They're also not unique, hence the fact it returns multiple results.)
I'm very new to pandas anyway, but any insights as to how to handle a situation where I'm looking for the arrays of values where two conditions match would be great too. Right now I'm using an O(n) algorithm to loop through and index it because even that runs faster than the search queries when I need to access the data quickly. Watching my system take twenty seconds on a dataset of only 6,000 rows is foreboding.
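A small sketch of both points, under the assumption that the two unnamed columns are called `a` and `b` (hypothetical names, as is the `value` column): selecting with a boolean mask gives a new object with its own data, and grouping once by the two columns avoids rescanning the whole frame for every lookup.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "a" and "b" stand in for the two non-unique integer columns.
df = pd.DataFrame({
    "a": [1, 1, 2, 2, 1],
    "b": [7, 8, 7, 7, 7],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Boolean-mask selection copies the matching rows instead of pointing
# back into the original frame.
subset = df[(df["a"] == 1) & (df["b"] == 7)]
shares = np.shares_memory(subset["value"].to_numpy(), df["value"].to_numpy())
print("shares memory with original:", shares)

# Group once by both columns; each later lookup reuses the grouping
# instead of comparing every row again.
groups = df.groupby(["a", "b"])
matches = groups.get_group((1, 7))["value"].tolist()
print(matches)
```

The copy made by the mask only contains the matching rows, so for small result sets the extra memory is small too.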

How can I ensure unique rows in a large HDF5

I'm working on implementing a relatively large (5,000,000 and growing) set of time series data in an HDF5 table. I need a way to remove duplicates on it, on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during the data retrieval process than ensure no dups go in.
What is the best way to remove dups from a pytable? All of my reading is pointing me towards importing the whole table into pandas, getting a unique-valued data frame, and writing it back to disk by recreating the table with each data run. This seems counter to the point of pytables, though, and in time I don't know that the whole data set will efficiently fit into memory. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...
See this related question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates. The user is responsible for this.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates().
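If the whole table does not fit into memory, one possible approach is to deduplicate chunk by chunk while remembering only the keys seen so far. The sketch below assumes the set of unique keys (two columns) fits in memory even when the full rows do not; `dedupe_chunks` and the column names `ts`, `sym` and `v` are made up. In real use the chunks could come from `store.select(key, chunksize=...)` and the cleaned chunks could be written to a new table with `store.append`.

```python
import numpy as np
import pandas as pd


def dedupe_chunks(chunks, key_cols):
    """Yield each chunk without rows whose key was already seen earlier."""
    seen = set()
    for chunk in chunks:
        # Remove duplicates inside the chunk first.
        chunk = chunk.drop_duplicates(subset=key_cols)
        keys = list(chunk[key_cols].itertuples(index=False, name=None))
        # Then remove rows whose key appeared in a previous chunk.
        mask = np.array([key not in seen for key in keys], dtype=bool)
        seen.update(keys)
        yield chunk[mask]


# Two small in-memory chunks standing in for pieces of the HDF5 table.
chunk1 = pd.DataFrame({"ts": [1, 1, 2], "sym": ["x", "x", "y"], "v": [1.0, 1.0, 2.0]})
chunk2 = pd.DataFrame({"ts": [2, 3], "sym": ["y", "z"], "v": [2.0, 3.0]})

clean = pd.concat(dedupe_chunks([chunk1, chunk2], ["ts", "sym"]), ignore_index=True)
print(clean)
```

This keeps the first occurrence of each key, in the order the chunks are read.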

Using Pandas to create, read, and update hdf5 file structure

We would like to be able to allow the HDF5 files themselves to define their columns, indexes, and column types instead of maintaining a separate file that defines structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time since it will likely contain several GB of data.
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas but by no means are really proficient. The part that is tripping us up is creating an empty data structure and reading that structure from a file that you will not want to fully open. In all of the Pandas examples there is data. The Pandas examples also only show two ways to retrieve data/structure which are to read the entire frame into memory or issue a where clause. In this case, we would like to be able to see the table structure without query operations if possible.
I know this is an odd case. Why the heck would you want an empty dataframe?? Well, we want to have a great deal of flexibility in moving data around and want to be able to define a target dataframe structure prior to data writing, which could take place much later (e.g. hours or days). Since the HDF5 specification maintains all that information it seems directionally incorrect to store the table structure information separately. Thus our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail as @jeff requested.
We would like to abstract some of the common Pandas functions like summing data or merging two frames. Thus we would like to be able to ask each frame what their columns are so we can present a view for the user to select the result frame columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. Knowing that the my_csv frame has certain columns and types is important because we want to enable a user to then select those columns for summing in a downstream operation. Let's say a user wants to sum column V by the values in columns A and B only and save the frame as my_sum. Since we can't ensure my_csv will always have data, we would like to ensure it at least contains the structure.
Open to other suggestions obviously. It is also possible to store the table structure info in the user_block. This, again, is not ideal because the structure is now being kept in two different areas but I guess it would be possible to always update the user_block on save using the latest column and index information for the frame, although I believe the to_* operations in Pandas will blow away the user_block so...blah. I feel like I'm talking myself into maintaining a peer structure definition but I REALLY would love some suggestions to not have to do that.
