I am learning about Parquet files using Python and pyarrow. Parquet is great at compression and minimizing disk space. My dataset is a 190MB CSV file, which ends up as a single 3MB file when saved as a snappy-compressed Parquet file.
However, when I save my dataset as partitioned files, their combined size is much larger (61MB).
Here is an example of the dataset that I am trying to save:
listing_id | date | gender | price
-------------------------------------------
a | 2019-01-01 | M | 100
b | 2019-01-02 | M | 100
c | 2019-01-03 | F | 200
d | 2019-01-04 | F | 200
When I partition by date (300+ unique values), the partitioned files come to 61MB combined. Each file is 168.2kB.
When I partition by gender (2 unique values), the partitioned files come to just 3MB combined.
I am wondering whether there is a minimum file size for Parquet, such that many small files combined consume more disk space?
My env:
- OS: Ubuntu 18.04
- Language: Python
- Library: pyarrow, pandas
My dataset source:
https://www.kaggle.com/brittabettendorf/berlin-airbnb-data
# I am using calendar_summary.csv as my data from a group of datasets in that link above
My code to save as parquet file:
# write to dataset using parquet
import os
import pandas as pd
import pyarrow
import pyarrow.parquet  # the parquet submodule has to be imported explicitly

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table=table, where='./calendar_summary_write_table.parquet')
# parquet filesize
parquet_method1_filesize = os.path.getsize('./calendar_summary_write_table.parquet') / 1000
print('parquet_method1_filesize: %i kB' % parquet_method1_filesize)
My code to save as partitioned parquet file:
# write to dataset using parquet (partitioned)
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(
    table=table,
    root_path='./calendar_summary/',
    partition_cols=['date'])
# parquet filesize
print(os.popen('du -sh ./calendar_summary/').read())
There is no minimum file size, but each file carries the overhead of its own footer, and splitting the data wastes opportunities for optimization via encoding and compression. The various encodings and compressions build on the idea that the data has some amount of self-similarity, which can be exploited by referring back to earlier similar occurrences. When you split the data into multiple files, each of them needs a separate "initial data point" that the successive ones can refer back to, so disk usage goes up. (Please note that this wording is a huge oversimplification, to avoid having to go through the various space-saving techniques one by one, but see this answer for a few examples.)
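To get a feel for the effect, here is a minimal sketch that uses zlib from the Python standard library as a stand-in for Parquet's codecs (an assumption made purely for illustration: Parquet's encodings and snappy behave differently in detail). The rows are synthetic, and the helper names are made up for this example. It compresses the same rows once as a single stream and once as 300 independent streams:

```python
import random
import zlib


def compressed_size_single(rows):
    """Compress all rows as one stream and return the compressed size in bytes."""
    return len(zlib.compress("\n".join(rows).encode("utf-8")))


def compressed_size_split(rows, n_parts):
    """Compress the rows as n_parts independent streams and sum their sizes."""
    part_len = -(-len(rows) // n_parts)  # ceiling division
    total = 0
    for start in range(0, len(rows), part_len):
        part = rows[start:start + part_len]
        total += len(zlib.compress("\n".join(part).encode("utf-8")))
    return total


# Synthetic rows loosely shaped like the example table (hypothetical data).
random.seed(0)
rows = [
    f"{random.choice('abcd')},{random.choice('MF')},{random.choice([100, 200])}"
    for _ in range(30_000)
]

single = compressed_size_single(rows)
split = compressed_size_split(rows, 300)
print(f"single stream: {single} bytes, 300 streams: {split} bytes")
```

Every independent stream pays its own header and has to relearn the patterns in the data from scratch, so the combined size of the split version comes out larger.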
Another thing that can have a huge impact on the size of Parquet files is the order in which data is inserted. A sorted column can be stored a lot more efficiently than a randomly ordered one. It is possible that by partitioning the data you inadvertently alter its sort order. Another possibility is that you partition the data by the very attribute it was ordered by: that ordering allowed a huge space saving in a single file, and the opportunity is lost once the data is split into multiple files. Finally, keep in mind that Parquet is not optimized for storing a few kilobytes of data but for several megabytes or gigabytes (in a single file) or several petabytes (in multiple files).
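Run-length encoding is one of the techniques Parquet relies on, and it shows why sort order matters. The following is only a toy sketch in plain Python, not Parquet's actual encoder, and the column is synthetic. It counts how many (value, run length) pairs are needed for the same column before and after sorting:

```python
import random
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# A hypothetical low-cardinality column, similar to "gender" in the question.
random.seed(0)
column = [random.choice(["M", "F"]) for _ in range(10_000)]

runs_unsorted = run_length_encode(column)
runs_sorted = run_length_encode(sorted(column))
print(len(runs_unsorted), "runs unsorted vs", len(runs_sorted), "runs sorted")
```

The sorted column collapses into one run per distinct value, while the randomly ordered one needs a new run every time the value changes.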
If you would like to inspect how your data is stored in your Parquet files, the Java implementation of Parquet includes the parquet-tools utility, providing several commands. See its documentation page for building and getting started. The more detailed descriptions of the individual commands are printed by parquet-tools itself. The commands most interesting to you are probably meta and dump.
Related
Using python 3.9 with Pandas 1.4.3 and PyArrow 8.0.0.
I have a couple of Parquet files (all with the same schema) which I would like to merge up to a certain threshold (not a fixed size, but not higher than the threshold).
I have a directory, let's call it input, that contains Parquet files.
Now, if I use os.path.getsize(path) I get the size on disk, but merging 2 files and taking the sum of those sizes (i.e. os.path.getsize(path1) + os.path.getsize(path2)) naturally won't yield a good result, due to the metadata and other things.
I've tried the following to see if I can get some sort of indication of the file size before writing it to Parquet.
print(df.info())
print(df.memory_usage().sum())
print(df.memory_usage(deep=True).sum())
print(sys.getsizeof(df))
print(df.values.nbytes + df.index.nbytes + df.columns.nbytes)
I'm aware that the size depends heavily on compression, engine, schema, etc., so for that I would like to simply have a factor.
Simply put, if I want a threshold of 1MB per file, I'll have a 4MB actual threshold, since I assume that the compression will shrink the data by 75% (4MB -> 1MB).
So in total I'll have something like
compressed_threshold_in_mb = 1
compression_factor = 4
and the condition to keep appending data into a merged dataframe would be by checking the multiplication of the two, i.e:
if total_accumulated_size > compressed_threshold_in_mb * compression_factor:
assuming total_accumulated_size is the accumulator of how much the dataframe will weigh on disk
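A minimal sketch of the accumulation logic described above. The function name plan_batches and the (name, size_in_mb) input format are hypothetical; the sizes are assumed to be uncompressed estimates, for example from df.memory_usage(deep=True).sum():

```python
def plan_batches(file_sizes, compressed_threshold_in_mb=1, compression_factor=4):
    """Group (name, size_in_mb) pairs into batches whose summed size stays at
    or below compressed_threshold_in_mb * compression_factor."""
    limit = compressed_threshold_in_mb * compression_factor
    batches, current, total_accumulated_size = [], [], 0.0
    for name, size_in_mb in file_sizes:
        # Close the current batch if adding this file would exceed the limit.
        if current and total_accumulated_size + size_in_mb > limit:
            batches.append(current)
            current, total_accumulated_size = [], 0.0
        current.append(name)
        total_accumulated_size += size_in_mb
    if current:
        batches.append(current)
    return batches


example = [("a", 1.5), ("b", 2.0), ("c", 1.0), ("d", 3.5)]
print(plan_batches(example))
```

Note that in this sketch a single file larger than the limit still ends up in a batch of its own rather than being split.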
You can save the data frame to parquet in memory to have an exact idea of how much data it is going to use:
import io
import pandas as pd

def get_parquet_size(df: pd.DataFrame) -> int:
    with io.BytesIO() as buffer:
        df.to_parquet(buffer)
        return buffer.tell()
Is there a memory-efficient way to merge many large TSV files, only keeping select columns of each TSV file?
Background:
I have many (~100) large TSV files (~800MB) of the form:
filename: 1.txt
ID Char1 Char2 ... P
1 Green Red . 0.1
2 Red Orange . 0.2
3 Purple Blue . 0.13
(Format A)
The ID column is the same between all TSV files. I only care about the ID and P columns for each and would like to condense them into the following form to make them take up less memory (column names are pulled from the Format A file name):
ID 1 2 ... N
1 P_11 P_12 . P_1N
2 P_21 P_22 . P_2N
3 P_31 P_32 . P_3N
(Format B)
I need to do this in a memory-efficient manner as I will be processing as many of these blocks of 100 files in parallel. Without parallelization, the runtime would be extremely long as there are 170 blocks of 100 files to process. My workflow is as follows:
Step 1: Generate M groups of these 100 files in format A
Step 2: Process these 100 files into the one file of format B
Step 3: Repeat the above two steps until all 170 groups of 100 files are processed
I've tried many different approaches, and none is both time and memory efficient.
My first approach involved storing everything in memory. I kept one file's ID and P columns in a list of lists. Then, I iterated through every other file and appended its P value to the end of the appropriate array. This approach had a satisfactory runtime (~30 minutes per block of 100 files). However, it failed because my VM does not have enough memory to hold all of this data at once when running in parallel.
I've gone through a number of iterations over the last few days. In my current iteration, I use one file and only keep its ID and P columns. Then, I loop through each file and use fileinput to append the file's P column in place to the end of each line. After testing, this method passes on the memory/parallelization side but is too slow in terms of runtime. I've also thought of using pandas to merge, but this runs into both time and memory constraints.
I have scoured every corner of my brain and the internet, but have not produced a satisfactory solution.
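One possible direction, sketched here only under assumptions that the question does not confirm: every file lists the IDs in the same order, and the header row names the columns ID and P. The function name merge_p_columns is made up for this example. The idea is to read all files of a block in lockstep, so that only one row per file is in memory at any time:

```python
import csv
from contextlib import ExitStack
from pathlib import Path


def merge_p_columns(input_paths, output_path):
    """Stream the ID and P columns of several TSV files in lockstep and write
    one wide TSV, holding only one row per input file in memory."""
    with ExitStack() as stack:
        readers = [
            csv.DictReader(stack.enter_context(open(p, newline="")), delimiter="\t")
            for p in input_paths
        ]
        out = stack.enter_context(open(output_path, "w", newline=""))
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        # Column names are taken from the file names, e.g. "1.txt" -> "1".
        writer.writerow(["ID"] + [Path(p).stem for p in input_paths])
        for rows in zip(*readers):
            ids = {row["ID"] for row in rows}
            if len(ids) != 1:
                # Assumption violated: the files are not aligned row by row.
                raise ValueError(f"ID mismatch across files: {sorted(ids)}")
            writer.writerow([rows[0]["ID"]] + [row["P"] for row in rows])
```

It would be called as merge_p_columns(['1.txt', '2.txt'], 'merged.tsv'). Each input is read exactly once and sequentially, so memory use stays flat no matter how large the files are; zip stops at the shortest file, so inputs of unequal length are silently truncated in this sketch.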
I have ~ 4000 parquet files that are each 3mb. I would like to read all of the files from an S3 bucket, do some aggregations, combine the files into one dataframe, and do some more aggregations. The parquet dataframes all have the same schema. The files are not all in the same folder in the S3 bucket but rather are spread across 5 different folders. I need some files in each folder but not all. I am working on my local machine which has 16gb RAM and 8 processors that I hope to leverage to make this faster. Here is my code:
from functools import reduce
from pyspark.sql import DataFrame, functions as f

ts_dfs = []
# For loop to read parquet files and append to empty list
for id in ids:
    # Make the path to the timeseries data
    timeseries_path = f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    ts_data_df = spark.read.parquet(timeseries_path).select('bldg_id',
                                                            '`out.natural_gas.heating.energy_consumption_intensity`',
                                                            '`out.electricity.heating.energy_consumption_intensity`',
                                                            'timestamp')
    # Aggregate
    ts_data_df = ts_data_df \
        .groupBy(f.month('timestamp').alias('month'), 'bldg_id') \
        .agg(f.sum('`out.electricity.heating.energy_consumption_intensity`').alias('eui_elec'),
             f.sum('`out.natural_gas.heating.energy_consumption_intensity`').alias('eui_gas'))
    # Append
    ts_dfs.append(ts_data_df)
# Combine all of the dfs into one
ts_sdf = reduce(DataFrame.unionAll, ts_dfs)
# Merge with ids_df
ts = ts_sdf.join(ids_df, on = ['bldg_id'])
# Mean and Standard Deviation by month
# (grouping the joined frame ts: in.hvac_heating_type_and_fuel is not among the
# columns selected into ts_sdf, so it presumably comes from ids_df)
stats_df = ts.groupBy('month', '`in.hvac_heating_type_and_fuel`') \
    .agg(f.mean('eui_elec').alias('mean_eui_elec'),
         f.stddev('eui_elec').alias('std_eui_elec'),
         f.mean('eui_gas').alias('mean_eui_gas'),
         f.stddev('eui_gas').alias('std_eui_gas'))
Reading all of the files through a for loop does not leverage the multiple cores, defeating the purpose of using Spark. As such, this process takes 90 minutes on my own machine (though that may be more a function of my internet connection). Additionally, the next step, ts_sdf = reduce(DataFrame.unionAll, ts_dfs), which combines the dataframes using unionAll, is taking even longer. I think my questions are:
How do I avoid having to read the files in a for loop?
Why is the unionAll taking so long? Does PySpark not parallelize this process somehow?
In general, how do I do this better? Perhaps PySpark is not the ideal tool for this?
I'm working on a Dataflow pipeline that reads 1000 files (50 MB each) from GCS and performs some computations on the rows across all files. Each file is a CSV with the same structure, just with different numbers in it, and I'm computing the average of each cell over all files.
The pipeline looks like this (python):
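import apache_beam as beam
from apache_beam.io.fileio import MatchFiles, ReadMatches

additional_side_inputs = {'key1': 'value1', 'key2': 'value2'} # etc.
# The parentheses let the chained transforms span several lines
(p | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
   | 'Read files' >> ReadMatches()
   | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
   | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))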
The FileToRowsFn class looks like this (see below, some details omitted). The row_id is the 1st column and is a unique key of each row; it exists exactly once in each file, so that I can compute the average over all the files. There's some additional value provided as side inputs to the transformer, which is not shown inside the method body below, but is still used by the real implementation. This value is a dictionary that is created outside of the pipeline. I mention it here in case this might be a reason for the lack of parallelization.
import csv
from io import TextIOWrapper

class FileToRowsFn(beam.DoFn):
    def process(self, file_element, additional_side_inputs):
        with file_element.open() as csv_file:
            # The first column is the row key; the remaining cells are the values
            for row_id, *values in csv.reader(TextIOWrapper(csv_file, encoding='utf-8')):
                yield row_id, values
The AverageCalculatorFn is a typical beam.CombineFn with an accumulator, that performs the average of each cell of a given row over all rows with the same row_id across all files.
All this works fine, but there's a problem with performance and throughput: it takes more than 60 hours to execute this pipeline. From the monitoring console, I notice that the files are read sequentially (1 file every 2 minutes). I understand that reading a file may take 2 minutes (each file is 50 MB), but I don't understand why Dataflow doesn't assign more workers to read multiple files in parallel. The CPU remains at ~2-3% because most of the time is spent in file IO, and the number of workers doesn't exceed 2 (although no limit is set).
The output of ReadMatches is 1000 file records, so why doesn't dataflow create lots of FileToRowsFn instances and dispatch them to new workers, each one handling a single file?
Is there a way to enforce such a behavior?
This is probably because all your steps get fused into a single step by the Dataflow runner.
For such a fused bundle to parallelize, the first step needs to be parallelizable. In your case this is a glob expansion which is not parallelizable.
To make your pipeline parallelizable, you can try to break fusion. This can be done by adding a Reshuffle transform as the consumer of one of the steps that produce many elements.
For example,
import apache_beam as beam
from apache_beam import Reshuffle
from apache_beam.io.fileio import MatchFiles, ReadMatches

additional_side_inputs = {'key1': 'value1', 'key2': 'value2'} # etc.
(p | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
   | 'Read files' >> ReadMatches()
   | 'Reshuffle' >> Reshuffle()  # breaks fusion so the files can be spread across workers
   | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
   | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))
You should not have to do this if you use one of the standard sources available in Beam, such as textio.ReadFromText(), to read the data. (Unfortunately, we do not have a CSV source, but ReadFromText supports skipping header lines.)
See here for more information regarding the fusion optimization and preventing fusion.
I'm trying a small POC to try to group by & aggregate to reduce data from a large CSV in pandas and Dask, and I'm observing high memory usage and/or slower than I would expect processing times... does anyone have any tips for a python/pandas/dask noob to improve this?
Background
I have a request to build a file ingestion tool that would:
Be able to take in files of a few GBs where each row contains user id and some other info
do some transformations
reduce the data to { user -> [collection of info]}
send batches of this data to our web services
Based on my research, since files are only few GBs, I found that Spark, etc would be overkill, and Pandas/Dask may be a good fit, hence the POC.
Problem
Processing a 1GB csv takes ~1 min for both pandas and Dask, and consumes 1.5GB ram for pandas and 9GB ram for dask (!!!)
Processing a 2GB csv takes ~3 mins and 2.8GB ram for pandas, Dask crashes!
What am I doing wrong here?
For pandas, since I'm processing the CSV in small chunks, I did not expect the RAM usage to be so high.
For Dask, everything I read online suggested that Dask processes the CSV in blocks whose size is given by blocksize, so I would expect the RAM usage to scale with the block size. I wouldn't expect the total to be 9GB when the block size is only 6.4MB, and I don't know why on earth its RAM usage skyrockets to 9GB for a 1GB CSV input.
(Note: if I don't set the block size, Dask crashes even on the 1GB input.)
My code
I can't share the CSV, but it has 1 integer column followed by 8 text columns. Both user_id and order_id columns referenced below are text columns.
1GB csv has 14000001 lines
2GB csv has 28000001 lines
5GB csv has 70000001 lines
I generated these csvs with random data, and the user_id column I randomly picked from 10 pre-randomly-generated values, so I'd expect the final output to be 10 user ids each with a collection of who knows how many order ids.
Pandas
#!/usr/bin/env python3
import pandas as pd

test_csv_location = '1gb.csv'
chunk_size = 100000
pieces = list()
# Aggregate each chunk on its own so only one chunk of raw rows is loaded at a time
for chunk in pd.read_csv(test_csv_location, chunksize=chunk_size, delimiter='|', iterator=True):
    df = chunk.groupby('user_id')['order_id'].agg(size=len, list=lambda x: list(x))
    pieces.append(df)
# Combine the per-chunk results; sum concatenates the per-chunk lists for each user
final = pd.concat(pieces).groupby('user_id')['list'].agg(size=len, list=sum)
final.to_csv('pandastest.csv', index=False)
Dask
#!/usr/bin/env python3
from dask.distributed import Client
import dask.dataframe as ddf
import sys
test_csv_location = '1gb.csv'
df = ddf.read_csv(test_csv_location, blocksize=6400000, delimiter='|')
# For each user, reduce to a list of order ids
grouped = df.groupby('user_id')
# The result holds Python lists, so the meta dtype is object rather than 'f8'
collection = grouped['order_id'].apply(list, meta=('order_id', 'object'))
collection.to_csv('./dasktest.csv', single_file=True)
The groupby operation is expensive because Dask will try to shuffle data across workers to check who has which user_id values. If user_id has a lot of unique values (sounds like it), there are a lot of cross-checks to be done across workers/partitions.
There are at least two ways out of it:
Set user_id as the index. This will be expensive during the indexing stage, but subsequent operations will be faster because Dask no longer has to check each partition for the values of user_id it holds.
df = df.set_index('user_id')
# The result holds Python lists, so the meta dtype is object rather than 'f8'
collection = df.groupby('user_id')['order_id'].apply(list, meta=('order_id', 'object'))
collection.to_csv('./dasktest.csv', single_file=True)
Use any structure you know your files have. As an extreme example, if the user_id values are somewhat sorted, so that the CSV file first contains only user_id values that start with 1 (or A, or whatever other symbols are used), then those that start with 2, etc., you could use that information to form partitions in 'blocks' (loose term) in such a way that the groupby is only needed within those 'blocks'.
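As a rough illustration of the second option, here is a sketch in plain Python rather than Dask. It assumes the stricter case where the rows are fully sorted by user_id; the function name collect_orders_sorted is made up for this example. With sorted input, each group can be finished and emitted as soon as the key changes, with no shuffle and only one group in memory at a time:

```python
from itertools import groupby
from operator import itemgetter


def collect_orders_sorted(rows):
    """Yield (user_id, [order_id, ...]) from (user_id, order_id) rows that are
    already sorted by user_id."""
    for user_id, group in groupby(rows, key=itemgetter(0)):
        yield user_id, [order_id for _, order_id in group]


sorted_rows = [("u1", "o1"), ("u1", "o2"), ("u2", "o3")]
print(list(collect_orders_sorted(sorted_rows)))
```

If the input is not sorted, the same user_id simply shows up in several separate groups, which is exactly the cross-partition work that the shuffle has to do.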