How can I ensure unique rows in a large HDF5

How can I ensure unique rows in a large HDF5 - python

I'm working on implementing a relatively large (5,000,000 and growing) set of time series data in an HDF5 table. I need a way to remove duplicates on it, on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during the data retrieval process than ensure no dups go in.
What is the best way to remove dups from a pytable? All of my reading is pointing me towards importing the whole table into pandas, and getting a unique- valued data frame, and writing it back to disk by recreating the table with each data run. This seems counter to the point of pytables, though, and in time I don't know that the whole data set will efficiently fit into memory. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...

See this releated question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates. The user is responsible for this.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates().

Related

How to lively save pandas dataframe to file?

I have a Python program that is controlling some machines and stores some data. The data is produced at a rate of about 20 rows per second (and about 10 columns or so). The whole run of this program can be as long as one week, as a result there is a large dataframe.
What are safe and correct ways to store this data? With safe I mean that if something fails in the day 6, I will still have all the data from days 1→6. With correct I mean not re-writing the whole dataframe to a file in each loop.
My current solution is a CSV file, I just print each row manually. This solution is both safe and correct, but the problem is that CSV does not preserve data types and also occupies more memory. So I would like to know if there is a binary solution. I like the feather format as it is really fast, but it does not allow to append rows.

I can think of two easy options:
store chunks of data (e.g. every 30 seconds or whatever suits your use case) into separate files; you can then postprocess them back into a single dataframe.
store each row into an SQL database as it comes in. Sqlite will likely be a good start, but I'd maybe really go for PostgreSQL. That's what databases are meant for, after all.

Large data with pivot table using Pandas

I’m currently using Postgres database to store survey answers.
My problem I’m facing is that I need to generate pivot table from Postgres database.
When the dataset is small, it’s easy to just read whole data set and use Pandas to produce the pivot table.
However, my current database now has around 500k rows, and it’s increasing around 1000 rows per day. Reading whole dataset is not effective anymore.
My question is that do I need to use HDFS to store data on disk and supply it to Pandas to do pivoting?
My customers need to view pivot table output nearly real time. Do we have any way to solve it?
My theory is that I’ll create pivot table output of 500k rows and store the output somewhere, then when new data gets saved into database, I’ll only need to merge the new data with existing pivot table. I’m not quite sure if Pandas supports this way, or it needs a full dataset to do pivoting?

Have you tried using pickle. I'm a data scientist and use this all the time with data sets of 1M+ rows and several hundred columns.
In your particular case I would recommend the following.
import pickle
save_data = open('path/file.pickle', 'wb') #wb stands for write bytes
pickle.dump(pd_data, save_data)
save_data.close()
In the above code what you're doing is saving your data in a compact format that can quickly be loaded using:
pickle_data = open('path/file.pickle', 'rb') #rb stands for read bytes
pd_data = pickle.load(pickle_data)
pickle_data.close()
At which point you can append your data (pd_data) with the new 1,000 rows and save it again using pickle. If your data will continue to grow and you expect memory to become a problem I suggest identifying a way to append or concatenate the data rather than merge or join since the latter two can also result in memory issues.
You will find that this will cut out significant load time when reading something off your disk (I use Dropbox and its still lightning fast). What I usually do in order to reduce that even further is segment my data sets into groups of rows & columns and then write methods that load the pickled data as need be (super useful graphing).

Pandas - retrieving HDF5 columns and memory usage

I have a simple question, I cannot help but feel like I am missing something obvious.
I have read data from a source table (SQL Server) and have created an HDF5 file to store the data via the following:
output.to_hdf('h5name', 'df', format='table', data_columns=True, append=True, complib='blosc', min_itemsize = 10)
The dataset is ~50 million rows and 11 columns.
If I read the entire HDF5 back into a dataframe (through HDFStore.select or read_hdf), it consumes about ~24GB of RAM. If I parse specific columns into the read statements (e.g. selecting 2 or 3 columns), the dataframe now only returns those columns, however the same amount of memory is consumed (24GB).
This is running on Python 2.7 with Pandas 0.14.
Am I missing something obvious?
EDIT: I think I answered my own question. While I did a ton of searching before posting, obviously once posted I found a useful link: https://github.com/pydata/pandas/issues/6379
Any suggestions on how to optimize this process would be great, due to memory limitations I cannot hit peak memory required to release via gc.

HDFStore in table format is a row oriented store. When selecting the query indexes on the rows, but for each row you get every column. selecting a subset of columns does a reindex at the end.
There are several ways to approach this:
use a column store, like bcolz; this is currently not implemented by PyTables so this would involve quite a bit of work
chunk thru the table, see here and concat at the end - this will use constant memory
store as a fixed format - this is a more efficient storage format so will use less memory (but cannot be appended)
create your own column store-like by storing to multiple sub tables and use select_as_multiple see here
which options you choose depend on the nature of your data access
note: you may not want to have all of the columns as data_columns unless you are really going to select from the all (you can only query ON a data_column or an index)
this will make store/query faster

Using Pandas to create, read, and update hdf5 file structure

We would like to be able to allow the HDF5 files themselves to define their columns, indexes, and column types instead of maintaining a separate file that defines structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time since it will likely contain several GB of data.
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas but by no means are really proficient. The part that is tripping us up is creating an empty data structure and reading that structure from a file that you will not want to fully open. In all of the Pandas examples there is data. The Pandas examples also only show two ways to retrieve data/structure which are to read the entire frame into memory or issue a where clause. In this case, we would like to be able to see the table structure without query operations if possible.
I know this is an odd case. Why the heck would you want an empty dataframe?? Well, we want to have a great deal of flexility in moving data around and want to be able to define a target dataframe structure prior to data writing, which could take place much later (e.g. hours or days). Since the HDF5 specification maintains all that information it seems directionally incorrect to store the table structure information separately. Thus our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail as #jeff requested.
We would like to abstract some of the common Pandas functions like summing data or merging two frames. Thus we would like to be able to ask each frame what their columns are so we can present a view for the user to select the result frame columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. The reason knowing that the my_csv frame has certain columns and types is important because we want to enable a user to then select those columns for summing in a downstream operation. Lets say a user wants to sum column V by the values in columns A and B only and save the frame as my_sum. Since we can't ensure my_csv will always have data we would like to ensure it at least contains the structure.
Open to other suggestions obviously. It is also possible to store the table structure info in the user_block. This, again, is not ideal because the structure is now being kept in two different areas but I guess it would be possible to always update the user_block on save using the latest column and index information for the frame, although I believe the to_* operations in Pandas will blow away the user_block so...blah. I feel like I'm talking myself into maintaining a peer structure definition but I REALLY would love some suggestions to not have to do that.

Storing and reloading large multidimensional data sets in Python

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into in all three approaches, I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table, or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.

Why not using a big table for keep all the 500 millions of entries? If you use on-the-flight compression (Blosc compressor recommended here), most of the duplicated entries will be deduped, so the overhead in storage is kept under a minimum. I'd recommend give this a try; sometimes the simple solution works best ;-)

Is there a reason the basic 6 table approach doesn't apply?
i.e. Tables 1-5 would be single column tables defining the valid values for each of the fields, and then the final table would be a 5 column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.

I'm not entirely sure of what you're trying to do here, but it looks like you trying to create a (potentially) sparse multidimensional array. So I wont go into details for solving your specific problem, but the best package I know that deals with this is Numpy Numpy. Numpy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.
Hopefully you'll find something in it's very easy to read documentation:
Numpy Documentation with Examples

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.