Large data with pivot table using Pandas - python

I'm currently using a Postgres database to store survey answers.
The problem I'm facing is that I need to generate a pivot table from this Postgres database.
When the dataset is small, it's easy to just read the whole dataset and use Pandas to produce the pivot table.
However, my current database has around 500k rows, and it's growing by around 1,000 rows per day. Reading the whole dataset is no longer efficient.
My question is: do I need to use HDFS to store the data on disk and feed it to Pandas for pivoting?
My customers need to view the pivot table output in near real time. Is there any way to solve this?
My theory is that I'll create the pivot table output for the 500k rows and store it somewhere; then, when new data is saved into the database, I'll only need to merge the new data with the existing pivot table. I'm not sure whether Pandas supports this approach, or whether it needs the full dataset to pivot.

Have you tried using pickle? I'm a data scientist and use it all the time with datasets of 1M+ rows and several hundred columns.
In your particular case I would recommend the following:
import pickle
save_data = open('path/file.pickle', 'wb') #wb stands for write bytes
pickle.dump(pd_data, save_data)
save_data.close()
The above code saves your data in a compact format that can quickly be loaded using:
pickle_data = open('path/file.pickle', 'rb') #rb stands for read bytes
pd_data = pickle.load(pickle_data)
pickle_data.close()
At that point you can append the new 1,000 rows to your data (pd_data) and save it again using pickle. If your data will keep growing and you expect memory to become a problem, I suggest finding a way to append or concatenate the data rather than merge or join, since the latter two can also cause memory issues.
You will find that this cuts out significant load time when reading from disk (I use Dropbox and it's still lightning fast). What I usually do to reduce that even further is segment my datasets into groups of rows and columns, then write methods that load the pickled data as needed (super useful for graphing).
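On the original question of merging new rows into an existing pivot table: for additive aggregations such as count or sum, you can pivot only the newly arrived rows and add the result to the stored pivot, so the full 500k rows never need to be re-read. A minimal sketch, assuming the new rows are already in a DataFrame called new_rows and that the column names (question_id, answer, respondent) and paths are placeholders for your schema:
import pandas as pd

# load the pivot previously produced from the historical data (path is hypothetical)
existing_pivot = pd.read_pickle('path/pivot.pickle')

# pivot only the rows saved since the last update
new_pivot = new_rows.pivot_table(index='question_id',
                                 columns='answer',
                                 values='respondent',
                                 aggfunc='count',
                                 fill_value=0)

# add the two pivots cell by cell; cells missing on either side count as 0
updated_pivot = existing_pivot.add(new_pivot, fill_value=0)
updated_pivot.to_pickle('path/pivot.pickle')
This only works for aggregations that combine additively; something like a mean would need the underlying counts and sums stored separately.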

Related

How to lively save pandas dataframe to file?

I have a Python program that controls some machines and stores some data. The data is produced at a rate of about 20 rows per second (and about 10 columns or so). A run of this program can last as long as one week, so the resulting dataframe is large.
What are safe and correct ways to store this data? By safe I mean that if something fails on day 6, I will still have all the data from days 1 through 6. By correct I mean not re-writing the whole dataframe to a file on each loop iteration.
My current solution is a CSV file; I just print each row manually. This solution is both safe and correct, but the problem is that CSV does not preserve data types and the files take up more space. So I would like to know if there is a binary solution. I like the Feather format as it is really fast, but it does not allow appending rows.
I can think of two easy options:
store chunks of data (e.g. every 30 seconds or whatever suits your use case) into separate files; you can then postprocess them back into a single dataframe.
store each row into an SQL database as it comes in. SQLite will likely be a good start, but I'd probably go for PostgreSQL eventually; that's what databases are meant for, after all. A minimal sketch of the SQLite variant follows below.
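Here is a rough sketch of the SQLite option, assuming the rows arrive as small DataFrames; the database file name and table name are placeholders:
import sqlite3
import pandas as pd

conn = sqlite3.connect('machine_data.db')

def store_rows(df):
    # appends to the table, creating it on the first call
    df.to_sql('measurements', conn, if_exists='append', index=False)

# e.g. called from the acquisition loop with a one-row DataFrame
# store_rows(pd.DataFrame([{'timestamp': t, 'sensor_1': v1, 'sensor_2': v2}]))
Reading everything back into a single dataframe afterwards is then just pd.read_sql('SELECT * FROM measurements', conn).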

Dask dataframe saving to_csv for incremental data - Efficient writing to CSV

I have existing code that reads streaming data and stores it in a pandas DataFrame (new data comes in every 5 minutes). I then capture this data category-wise (~350 categories).
Next, I write all the new data (as it is to be stored incrementally) using to_csv in a loop.
The pseudocode is given below:
for row in parentdf.itertuples():  # insert into <tbl>
    mycat = row.category  # this is the ONLY parameter passed to the key function below
    try:
        df = FnforExtractingNParsingData(mycat, NumericParam1, NumericParam1)
        df.insert(0, 'NewCol', sym)
        df = df.assign(calculatedCol=functions1(params))
        df = df.assign(calculatedCol1=functions2(params, 20))
        df = df.assign(calculatedCol3=functions3(more_params, 20))
        df[20:].to_csv(outfile, mode='a', header=False, index=False)
    except Exception:
        pass  # exception handling elided here (see note below)
The category-wise reading and writing to CSV takes about 2 minutes per cycle, which works out to roughly 0.34 seconds per category across the 350 categories written incrementally.
I am wondering whether I can make the above process faster and more efficient by using Dask dataframes.
I looked at dask.org and its use cases, but didn't find a clear answer.
Additional details: I am using Python 3.7 and Pandas 0.25.
The code above doesn't raise any errors; we have already added a good amount of exception handling around it.
My key function, FnforExtractingNParsingData, is fairly resilient and has been working as desired for a long time.
Sounds like you're reading data into a Pandas DataFrame every 5 minutes and then writing it to disk. The question doesn't mention some key facts:
how much data is ingested every 5 minutes (10MB or 10TB)?
where is the code being executed (AWS Lambda or a big cluster of machines)?
what data operations does FnforExtractingNParsingData perform?
Dask DataFrames can be written to disk as multiple CSV files in parallel, which can be a lot faster than writing a single file with Pandas, but it depends. Dask is overkill for a tiny dataset. Dask can leverage all the CPUs of a single machine, so it can scale up on a single machine better than most people realize. For large datasets, Dask will help a lot. Feel free to provide more details in your question and I can give more specific suggestions.
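For reference, a minimal sketch of the parallel-write pattern described above, assuming the accumulated data is in a pandas DataFrame called parentdf and that 8 partitions suit the machine (both are assumptions):
import dask.dataframe as dd

# convert the pandas DataFrame to a Dask DataFrame split into partitions
ddf = dd.from_pandas(parentdf, npartitions=8)

# each partition is written as its own file in parallel,
# e.g. incremental-0.csv, incremental-1.csv, ...
ddf.to_csv('incremental-*.csv', index=False)
Whether this beats the per-category pandas loop depends on the data volume per cycle, which is why the questions above matter.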

Pandas - retrieving HDF5 columns and memory usage

I have a simple question; I cannot help but feel like I am missing something obvious.
I have read data from a source table (SQL Server) and have created an HDF5 file to store the data via the following:
output.to_hdf('h5name', 'df', format='table', data_columns=True, append=True, complib='blosc', min_itemsize = 10)
The dataset is ~50 million rows and 11 columns.
If I read the entire HDF5 back into a dataframe (through HDFStore.select or read_hdf), it consumes about ~24GB of RAM. If I instead pass specific columns to the read call (e.g. selecting 2 or 3 columns), the dataframe now only returns those columns, yet the same amount of memory is consumed (24GB).
This is running on Python 2.7 with Pandas 0.14.
Am I missing something obvious?
EDIT: I think I answered my own question. While I did a ton of searching before posting, of course once I posted I found a useful link: https://github.com/pydata/pandas/issues/6379
Any suggestions on how to optimize this process would be great; because of memory limitations I cannot afford to hit the peak memory that is required before it can be released via gc.
HDFStore in table format is a row-oriented store. When selecting, the query indexes on the rows, but for each row you get every column; selecting a subset of columns does a reindex at the end.
There are several ways to approach this:
use a column store, like bcolz; this is currently not implemented by PyTables so this would involve quite a bit of work
chunk through the table, see here, and concat at the end - this will use constant memory (a sketch is given at the end of this answer)
store as a fixed format - this is a more efficient storage format so will use less memory (but cannot be appended)
create your own column-store-like layout by storing to multiple sub-tables and using select_as_multiple, see here
Which option you choose depends on the nature of your data access.
Note: you may not want to have all of the columns as data_columns unless you are really going to select on all of them (you can only query ON a data_column or an index);
this will make storing/querying faster.
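A minimal sketch of the chunked approach (the second option above), assuming the store key is 'df' and using placeholder column names:
import pandas as pd

store = pd.HDFStore('h5name')

# iterate over the table in fixed-size row chunks, keeping only the two
# columns of interest, so only one chunk is fully materialized at a time
chunks = []
for chunk in store.select('df', columns=['col_a', 'col_b'], chunksize=500000):
    chunks.append(chunk)

store.close()
result = pd.concat(chunks)
The peak memory is then roughly the size of the narrowed result plus one chunk, rather than the full 11-column frame.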

How can I ensure unique rows in a large HDF5

I'm working on implementing a relatively large (5,000,000 rows and growing) set of time series data in an HDF5 table. I need a way to remove duplicates from it, on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during retrieval than to ensure no dups go in.
What is the best way to remove dups from a pytable? All of my reading is pointing me towards importing the whole table into pandas, getting a unique-valued data frame, and writing it back to disk by recreating the table on each data run. This seems counter to the point of pytables, though, and in time I don't know that the whole dataset will fit efficiently into memory. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...
See this related question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates. The user is responsible for this.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates().
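If importing the whole table is not feasible, a chunked variant is possible that only keeps the unique keys in memory rather than the full rows. A rough sketch, assuming the table key is 'data' and that the two columns defining a unique record are 'timestamp' and 'series_id' (placeholders for your actual columns):
import pandas as pd

store = pd.HDFStore('timeseries.h5')

seen = set()
for chunk in store.select('data', chunksize=500000):
    # drop duplicates within the chunk first, then against earlier chunks
    chunk = chunk.drop_duplicates(['timestamp', 'series_id'])
    keys = list(zip(chunk['timestamp'], chunk['series_id']))
    mask = [k not in seen for k in keys]
    seen.update(keys)
    store.append('data_unique', chunk[mask], data_columns=True)

store.close()
The de-duplicated table is written under a new key, so the original is only replaced once the run completes.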

Using Pandas to create, read, and update hdf5 file structure

We would like to allow the HDF5 files themselves to define their columns, indexes, and column types instead of maintaining a separate file that defines the structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 file is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time, since it will likely contain several GB of data?
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas but are by no means truly proficient. The part that is tripping us up is creating an empty data structure and reading that structure back from a file that you do not want to open fully. In all of the Pandas examples there is data. The Pandas examples also only show two ways to retrieve data/structure: read the entire frame into memory or issue a where clause. In this case, we would like to be able to see the table structure without query operations, if possible.
I know this is an odd case. Why the heck would you want an empty dataframe? Well, we want a great deal of flexibility in moving data around and want to be able to define a target dataframe structure prior to writing the data, which could take place much later (e.g. hours or days). Since the HDF5 specification maintains all that information, it seems directionally incorrect to store the table structure information separately. Hence our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail as @Jeff requested:
We would like to abstract some of the common Pandas functions, like summing data or merging two frames. Thus we would like to be able to ask each frame what its columns are so we can present a view for the user to select the result frame's columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf, then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. Knowing that the my_csv frame has certain columns and types matters because we want to let a user select those columns for summing in a downstream operation. Let's say a user wants to sum column V by the values in columns A and B only and save the frame as my_sum. Since we can't ensure my_csv will always have data, we would like to ensure it at least contains the structure.
Open to other suggestions, obviously. It would also be possible to store the table structure info in the user_block. Again, this is not ideal because the structure is then kept in two different places, but I suppose we could always update the user_block on save with the latest column and index information for the frame, although I believe the to_* operations in Pandas will blow away the user_block, so...blah. I feel like I'm talking myself into maintaining a peer structure definition, but I would REALLY love some suggestions so I don't have to do that.
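For what it's worth, here is a minimal sketch of one way to inspect a table-format store's columns without reading the data back, assuming the frame was saved with format='table' under the key 'my_csv'; the last two lines go through the underlying PyTables table object, so the exact attributes may vary between versions:
import pandas as pd

store = pd.HDFStore('my_csv.hdf')

storer = store.get_storer('my_csv')
print(storer.nrows)             # row count, read from metadata only
print(storer.table.colnames)    # names of the stored columns (and index)
print(storer.table.coltypes)    # PyTables type of each stored column

store.close()
This touches only metadata, so it stays cheap even when the file holds several GB of data; it does not solve the empty-structure problem on its own, but it does mean the structure can always be recovered from the file itself.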
