python pandas HDFStore append non-constant size data - python

I am using Python 2.7 with pandas and HDFStore.
I am trying to process a big data set that fits on disk but not in memory.
I store a large data set in a .h5 file, and the size of the data in each column is not constant; for instance, one column may have a string of 5 chars in one row and a string of 20 chars in another.
So I had issues writing the data to the file in iterations, when the first iteration contained small values and the following batches contained larger ones.
I found that the issue was that min_itemsize was not set properly, so the larger data did not fit into the columns. I used the following code to cache the database into the .h5 file without error:
colsLen = {}
for col in dbCols:
    curs.execute('SELECT MAX(CHAR_LENGTH(%s)) FROM table' % col)
    for a in curs:
        colsLen.update({col: a[0]})

# get the first row to create the hdfstore
rx = dbConAndQuery.dbTableToDf(con, table, limit=1, offset=0)  # this is my utility that is querying the db
hdf.put("table", rx, format="table", data_columns=True, min_itemsize=colsLen)

for i in range(rxRowCount / batchSize + 1):
    rx = dbConAndQuery.dbTableToDf(con, table, limit=batchSize, offset=i * batchSize + 1)
    hdf.append("table", rx, format="table", data_columns=True, min_itemsize=colsLen)

hdf.close()
The issue is: how can I use HDFStore in cases where I can't query for the maximum size of each column's data in advance? e.g. getting or creating the data in iterations due to memory constraints.
I found that I can process the data using dask with on-disk dataframes, but it lacks some functionality that I need in pandas, so the main idea is to process the data in batches, appending it to the existing HDFStore file.
Thanks!

I found that the issue was HDFStore optimizing the data storage by fixing each string column's item size to the size of the largest value it has seen when the table is created.
I found two ways to solve this:
1. Pre-query the database to get the maximum character length of each column's data and pass it as min_itemsize, as in the code above.
2. Insert each batch under a new key in the file; then it works, because each batch is stored with its own largest value as the column's item size. A sketch of this second approach is shown below.
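A minimal sketch of the second approach, assuming batches is some iterable that yields each chunk as a DataFrame (the variable and the key naming scheme are illustrative, not from the original code):

import pandas as pd

hdf = pd.HDFStore('out.h5')
for i, batch in enumerate(batches):
    # each batch lives under its own key, so its string columns are
    # sized by that batch's largest values only
    hdf.put('table_%d' % i, batch, format='table', data_columns=True)
hdf.close()

# later, process the batches one at a time without loading everything
with pd.HDFStore('out.h5') as hdf:
    for key in hdf.keys():
        df = hdf[key]
        # ... process df ...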

Related

How to calculate Pandas Dataframe size on disk before writing as parquet?

Using Python 3.9 with Pandas 1.4.3 and PyArrow 8.0.0.
I have a couple of parquet files (all with the same schema) which I would like to merge up to a certain threshold (not a fixed size, but not higher than the threshold).
I have a directory, let's call it input, that contains parquet files.
Now, if I use os.path.getsize(path) I get the size on disk, but merging 2 files and taking the sum of their sizes (i.e. os.path.getsize(path1) + os.path.getsize(path2)) naturally won't yield a good result due to the metadata and other things.
I've tried the following to see if I can get some sort of indication of the file size before writing it to parquet:
print(df.info())
print(df.memory_usage().sum())
print(df.memory_usage(deep=True).sum())
print(sys.getsizeof(df))
print(df.values.nbytes + df.index.nbytes + df.columns.nbytes)
I'm aware that the size heavily depends on compression, engine, schema, etc., so for that I would like to simply have a factor.
Simply put, if I want a threshold of 1 MB per file, I'll have a 4 MB actual threshold, since I assume that the compression will compress the data by 75% (4 MB -> 1 MB).
So in total I'll have something like
compressed_threshold_in_mb = 1
compression_factor = 4
and the condition to keep appending data into a merged dataframe would be to check the product of the two, i.e.:
if total_accumulated_size > compressed_threshold_in_mb * compression_factor:
assuming total_accumulated_size is the accumulator of how much the dataframe will weigh on disk.
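For illustration, a rough sketch of that heuristic, assuming the in-memory estimate comes from memory_usage(deep=True) and that parquet_files is the list of input paths (both the names and the output naming are assumptions, not from the post):

import pandas as pd

compressed_threshold_in_mb = 1
compression_factor = 4
size_limit = compressed_threshold_in_mb * compression_factor * 1024 ** 2  # bytes

merged, total_accumulated_size, part = [], 0, 0
for path in parquet_files:
    df = pd.read_parquet(path)
    merged.append(df)
    total_accumulated_size += df.memory_usage(deep=True).sum()
    if total_accumulated_size > size_limit:
        pd.concat(merged).to_parquet('merged_%d.parquet' % part)
        merged, total_accumulated_size, part = [], 0, part + 1
if merged:
    pd.concat(merged).to_parquet('merged_%d.parquet' % part)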
You can save the data frame to parquet in memory to have an exact idea of how much data it is going to use:
import io
import pandas as pd

def get_parquet_size(df: pd.DataFrame) -> int:
    with io.BytesIO() as buffer:
        df.to_parquet(buffer)
        return buffer.tell()
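With that helper, the merge condition can compare against the actual serialized size instead of a compression-factor guess; a small usage sketch (candidate_files and the 1 MB threshold are illustrative):

merged = pd.concat([pd.read_parquet(p) for p in candidate_files])
if get_parquet_size(merged) <= 1 * 1024 ** 2:   # would be at most ~1 MB on disk
    merged.to_parquet('merged.parquet')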

Updating Parquet Partition files with new values/rows

For this example say I have a dataframe of 5000 rows and 5 columns. I have used the following line to create a parquet file partitioned on 'partition_key' (I have ~30 unique PKs):
dataframe.to_parquet(filename, engine='pyarrow', compression='snappy',
                     partition_cols=['partition_key'])
I am reading the parquet file into my Jupyter notebook and successfully getting a subset dataframe (say the result is 100 rows, 5 cols) using the filters parameter, via:
new_df = pd.read_parquet(filename, engine = 'pyarrow', filters = one of the partition keys)
What I need to do is update values in the new_df dataframe and then save it back, and/or replace the exact 100 entries/rows in the original parquet file. Is there a way to either update the original parquet file, or perhaps delete the partition folder that I grabbed via the filters parameter, do my edits to new_df, and append it back to the original parquet file?
Any suggestions are appreciated!
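A hedged sketch of the delete-and-rewrite idea from the question, assuming filename is the root directory of the partitioned dataset and that pyarrow's default partition_key=value folder layout is in place (the key value and the cast are illustrative/precautionary):

import shutil
import pandas as pd

key_value = 'some_key'   # the partition being updated (illustrative)

# read just that partition
new_df = pd.read_parquet(filename, engine='pyarrow',
                         filters=[('partition_key', '==', key_value)])

# ... edit values in new_df ...

# the partition column comes back as categorical; cast it to plain strings
# before re-partitioning (precautionary)
new_df['partition_key'] = new_df['partition_key'].astype(str)

# drop the old partition folder, then write the edited rows back
shutil.rmtree(f'{filename}/partition_key={key_value}')
new_df.to_parquet(filename, engine='pyarrow', compression='snappy',
                  partition_cols=['partition_key'])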

Pandas / Dask - group by & aggregating a large CSV blows the memory and/or takes quite a while

I'm trying a small POC to group by & aggregate in order to reduce data from a large CSV in pandas and Dask, and I'm observing high memory usage and/or slower-than-expected processing times... does anyone have any tips for a python/pandas/dask noob to improve this?
Background
I have a request to build a file ingestion tool that would:
Be able to take in files of a few GBs where each row contains user id and some other info
do some transformations
reduce the data to { user -> [collection of info]}
send batches of this data to our web services
Based on my research, since the files are only a few GBs, I found that Spark etc. would be overkill and that Pandas/Dask may be a good fit, hence the POC.
Problem
Processing a 1GB csv takes ~1 min for both pandas and Dask, and consumes 1.5GB RAM for pandas and 9GB RAM for Dask (!!!)
Processing a 2GB csv takes ~3 mins and 2.8GB RAM for pandas; Dask crashes!
What am I doing wrong here?
for pandas, since I'm processing the CSV in small chunks, I did not expect the RAM usage to be so high
for Dask, everything I read online suggested that Dask processes the CSV in blocks indicated by blocksize, and as such the RAM usage should be roughly the blocksize times the number of blocks being processed, but I wouldn't expect the total to be 9GB when the block size is only 6.4MB. I don't know why on earth its RAM usage skyrockets to 9GB for a 1GB csv input.
(Note: if I don't set the block size, Dask crashes even on the 1GB input)
My code
I can't share the CSV, but it has 1 integer column followed by 8 text columns. Both user_id and order_id columns referenced below are text columns.
1GB csv has 14000001 lines
2GB csv has 28000001 lines
5GB csv has 70000001 lines
I generated these csvs with random data, and the user_id column I randomly picked from 10 pre-randomly-generated values, so I'd expect the final output to be 10 user ids each with a collection of who knows how many order ids.
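For reference, a hedged sketch of how a test csv like this could be generated (the generator wasn't shown in the post, and all column names other than user_id and order_id are made up):

import csv
import random
import string

def rand_text(n=12):
    return ''.join(random.choices(string.ascii_lowercase, k=n))

user_ids = [rand_text() for _ in range(10)]   # 10 pre-randomly-generated values

with open('1gb.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    # 1 integer column followed by 8 text columns
    writer.writerow(['id', 'user_id', 'order_id'] + ['col%d' % i for i in range(6)])
    for i in range(14_000_000):   # slow, but only needed once to build the test file
        writer.writerow([i, random.choice(user_ids), rand_text()]
                        + [rand_text() for _ in range(6)])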
Pandas
#!/usr/bin/env python3
from pandas import DataFrame, read_csv
import pandas as pd
import sys
test_csv_location = '1gb.csv'
chunk_size = 100000
pieces = list()
for chunk in pd.read_csv(test_csv_location, chunksize=chunk_size, delimiter='|', iterator=True):
    df = chunk.groupby('user_id')['order_id'].agg(size=len, list=lambda x: list(x))
    pieces.append(df)

final = pd.concat(pieces).groupby('user_id')['list'].agg(size=len, list=sum)
final.to_csv('pandastest.csv', index=False)
Dask
#!/usr/bin/env python3
from dask.distributed import Client
import dask.dataframe as ddf
import sys
test_csv_location = '1gb.csv'
df = ddf.read_csv(test_csv_location, blocksize=6400000, delimiter='|')
# For each user, reduce to a list of order ids
grouped = df.groupby('user_id')
collection = grouped['order_id'].apply(list, meta=('order_id', 'f8'))
collection.to_csv('./dasktest.csv', single_file=True)
The groupby operation is expensive because dask will try to shuffle data across workers to check who has which user_id values. If user_id has a lot of unique values (sounds like it), there are a lot of cross-checks to be done across workers/partitions.
There are at least two ways out of it:
set user_id as index. This will be expensive during the indexing stage, but subsequent operations will be faster because now dask doesn't have to check each partition for the values of user_id it has.
df = df.set_index('user_id')
collection = df.groupby('user_id')['order_id'].apply(list, meta=('order_id', 'f8'))
collection.to_csv('./dasktest.csv', single_file=True)
if your files have a structure that you know about, e.g. as an extreme example, if user_id values are somewhat sorted, so that the csv file first contains only user_id values that start with 1 (or A, or whatever other symbols are used), then with 2, etc., then you could use that information to form partitions in 'blocks' (loose term), in a way that the groupby would only be needed within those 'blocks'. A sketch of that idea is shown below.
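A minimal sketch of that second idea, assuming the csv really is already sorted by user_id; sorted=True is what tells dask it can build partition boundaries without a shuffle (this is an assumption about the input data, not something taken from the answer's code):

import dask.dataframe as ddf

df = ddf.read_csv('1gb.csv', blocksize=64_000_000, delimiter='|')
# the file is assumed to be pre-sorted by user_id, so no shuffle is needed
df = df.set_index('user_id', sorted=True)
collection = df.groupby('user_id')['order_id'].apply(list, meta=('order_id', 'object'))
collection.to_csv('./dasktest.csv', single_file=True)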

Reading the last batch of data added to hdfs file using Python

I have a program that will add a variable number of rows of data to an hdf5 file as shown below.
data_without_cosmic.to_hdf(new_file,key='s', append=True, mode='r+', format='table')
New_file is the file name and data_without_cosmic is a pandas data frame with 'x' , 'y', 'z', and 'i' columns representing positional data and a scalar quantity. I may add several data frames of this form to the file each time I run the full program. For each data frame I add, the 'z' values are a constant value.
The next time I use the program, I would need to access the last batch of rows that was added to the data in order to perform some operations. I wondered if there was a fast way to retrieve just the last data frame that was added to the file or if I could group the data in some way as I add it in order to be able to do so.
The only other way I can think of to achieve my goal is reading the entire file and then checking the 'z' values from the bottom up until they change, but that seems a little excessive. Any ideas?
P.S. I am very inexperienced with working with hdf5 files, but I read that they are efficient to work with.
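One way to make "the last batch" retrievable, sketched under the assumption that the row count can be recorded just before each append; get_storer(...).nrows and the start argument of read_hdf are real pandas features for table-format stores, but the bookkeeping around them is illustrative:

import pandas as pd

# before appending, remember how many rows the table already holds
with pd.HDFStore(new_file) as store:
    prev_nrows = store.get_storer('s').nrows if 's' in store else 0
# prev_nrows would have to be persisted somewhere (e.g. a small side file)
# so that the next run of the program can use it

data_without_cosmic.to_hdf(new_file, key='s', append=True, mode='r+', format='table')

# next run: read only the rows that the last batch appended
last_batch = pd.read_hdf(new_file, key='s', start=prev_nrows)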

Efficiently rewriting lines in a large text file with Python

I'm trying to generate a large data file (in the GBs) by iterating over thousands of database records. At the top of the file is a line for each "feature" that appears later in the file. They look like:
#attribute 'Diameter' numeric
#attribute 'Length' real
#attribute 'Qty' integer
Lines containing data using these attributes look like:
{0 0.86, 1 0.98, 2 7}
However, since my data is sparse data, each record from my database may not have each attribute, and I don't know what the complete feature set is in advance. I could, in theory, iterate over my database records twice, the first time accumulating the feature set, and then the second time to output my records, but I'm trying to find a more efficient method.
I'd like to try a method like the following pseudo-code:
fout = open('output.dat', 'w')
known_features = set()
for record in records:
    if record has unknown features:
        jump to top of file
        delete existing "#attribute" lines and write new lines
        jump to bottom of file
    fout.write(record)
It's the jump-to/write/jump-back part I'm not sure how to pull off. How would you do this in Python?
I tried something like:
fout.seek(0)
for new_attribute in new_attributes:
    fout.write(new_attribute)
fout.seek(0, 2)
but this overwrites both the attribute lines and the data lines at the top of the file, rather than simply inserting new lines starting at the seek position I specify.
How do you obtain a word-processor's "insert" functionality in Python without loading the entire document into memory? The final file is larger than all my available memory.
Why don't you get a list of all the features and their data types, and list them first? If a feature is missing, replace it with a known value - NULL seems appropriate.
This way your records will be complete (in length), and you don't have to hop around the file.
The other approach is to write two files: one contains all your features, the other all your rows. Once both files are generated, append the feature file to the top of the data file (see the sketch below).
FWIW, word processors load files in memory for editing; and then they write the entire file out. This is why you can't load a file larger than the addressable/available memory in a word processor; or any other program that is not implemented as a stream reader.
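A small sketch of that two-file approach, reusing the record loop from the question's pseudo-code; get_features and format_record are illustrative placeholders, and the type handling of the #attribute lines is simplified to 'numeric':

import shutil

known_features = set()

# pass over the records once, writing data rows to a temporary file
# while collecting the full feature set
with open('data.tmp', 'w') as data_out:
    for record in records:                             # records: the database rows
        known_features.update(get_features(record))    # illustrative helper
        data_out.write(format_record(record) + '\n')   # illustrative helper

# now that all features are known, write the header followed by the data
with open('output.dat', 'w') as fout:
    for feature in sorted(known_features):
        fout.write("#attribute '%s' numeric\n" % feature)
    with open('data.tmp') as data_in:
        shutil.copyfileobj(data_in, fout)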
Why don't you build the output in memory first (e.g. as a dict) and write it to a file after all data is known?
