Let's say I've got some data in an existing dataset that has no column partitioning but is still 200+ files. I want to rewrite that data with Hive-style partitioning. The dataset is too big to open in its entirety into memory. In this case the dataset came from a CETAS in Azure Synapse, but I don't think that matters.
I do:
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(datadir, format="parquet", filesystem=abfs)
new_part = ds.partitioning(pa.schema([("nodename", pa.string())]), flavor="hive")
scanner = dataset.scanner()
ds.write_dataset(scanner, newdatarootdir,
                 format="parquet", partitioning=new_part,
                 existing_data_behavior="overwrite_or_ignore",
                 max_partitions=2467)
In this case there are 2467 unique nodenames, and so I want there to be 2467 directories, each with 1 file. However, what I get is 2467 directories, each with 100+ files of about 10 KB each. How do I get just 1 file per partition?
I could do a second step:
import os
import pyarrow.dataset as ds
import pyarrow.parquet as pq

for node in nodes:
    fullpath = f"{datadir}\\{node}"
    curnodeds = ds.dataset(fullpath, format="parquet")
    curtable = curnodeds.to_table()
    os.makedirs(f"{newbase}\\{node}")
    pq.write_table(curtable, f"{newbase}\\{node}\\part-0.parquet",
                   version="2.6", flavor="spark", data_page_version="2.0")
Is there a way to incorporate the second step into the first step?
Edit:
This is with pyarrow 7.0.0
Edit(2):
Using max_open_files=3000 did result in one file per partition. Comparing the metadata of the two, the two-step approach has (for one partition)...
<pyarrow._parquet.FileMetaData object at 0x000001C7FA414A40>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 8
num_rows: 24840
num_row_groups: 1
format_version: 2.6
serialized_size: 1673
size on disk: 278kb
and the one step...
<pyarrow._parquet.FileMetaData object at 0x000001C7FA414A90>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 8
num_rows: 24840
num_row_groups: 188
format_version: 1.0
serialized_size: 148313
size on disk: 1.04MB
Obviously in the two-step version I'm explicitly setting version to 2.6, so that explains that difference. The total size of my data after the two-step process is 655 MB vs 2.6 GB for the one-step. The time difference was pretty significant too: the two-step took something like 20 minutes for the first step and 40 minutes for the second, while the one-step took about 5 minutes for the whole thing.
The remaining question is: how do I set version="2.6" and data_page_version="2.0" in write_dataset? I still wonder why num_row_groups differs so much, and whether it still will once those parameters are set, but I'll defer that question.
The dataset writer was changed significantly in 7.0.0. Previously, it would always create 1 file per partition. Now there are a couple of settings that could cause it to write multiple files. Furthermore, it looks like you are ending up with a lot of small row groups which is not ideal and probably the reason the one-step process is both slower and larger.
The first significant setting is max_open_files. Some systems limit how many file descriptors can be open at one time. Linux defaults to 1024, so pyarrow defaults to ~900 (on the assumption that some file descriptors will be open for scanning, etc.). When this limit is exceeded, pyarrow will close the least recently used file. For some datasets this works well. However, if each batch has data for every file, it doesn't work well at all. In that case you probably want to increase max_open_files to be greater than your number of partitions (with some wiggle room, because you will also have some files open for reading). You may need to adjust OS-specific settings to allow this (generally these OS limits are pretty conservative, and raising them is fairly harmless).
In addition, the 7.0.0 release adds min_rows_per_group, max_rows_per_group and max_rows_per_file parameters to the write_dataset call. Setting min_rows_per_group to something like 1 million will cause the writer to buffer rows in memory until it has enough to write. This will allow you to create files with 1 row group instead of 188 row groups. This should bring down your file size and fix your performance issues.
However, there is a memory cost associated with this which is going to be min_rows_per_group * num_files * size_of_row_in_bytes.
The remaining question is, how to set version="2.6" and data_page_version="2.0" in write_dataset?
The write_dataset call works on several different formats (e.g. csv, ipc, orc) and so the call only has generic options that apply regardless of format.
Format-specific settings can instead be set using the file_options parameter:
# It does not appear to be documented but make_write_options
# should accept most of the kwargs that write_table does
file_options = ds.ParquetFileFormat().make_write_options(version='2.6', data_page_version='2.0')
ds.write_dataset(..., file_options=file_options)
Related
I have read numerous threads on similar topics on the forum. However, I believe what I am asking here is not a duplicate.
I am reading a very large CSV dataset (22 GB, 350 million rows). I am trying to read the dataset in chunks, based on the solution provided in that link.
My current code is as following.
import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()

chunk_size = 9000000
chunk_skip = 1

transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv')

for i in range(0, 38):
    chunk_skip += chunk_size
    transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
    Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv', mode='a', header=False)
There is no issue with the code; it runs fine. But the groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum() only runs over the 9000000 rows declared as chunk_size, whereas I need that statement to run over the entire dataset, not chunk by chunk.
The reason is that when it runs chunk by chunk, only one chunk gets processed at a time, while rows belonging to the same group are scattered all over the dataset and end up in other chunks.
A possible solution is to run the code again on the newly generated "Group_ID_Company.csv". That way the code would go through the new dataset once more and sum() the required columns. However, I am thinking there may be another (better) way of achieving this.
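For illustration, that re-aggregation idea can be done in memory: per-chunk groupby sums are themselves summable, so concatenating the partial results and grouping again yields the global sums. The tiny inline CSV below stands in for transactions.csv:

```python
import io

import pandas as pd

# Stand-in for transactions.csv; note rows for the same (id, company)
# are deliberately scattered across chunks
csv_data = io.StringIO(
    "id,company,purchasequantity,purchaseamount\n"
    "1,A,2,10\n"
    "2,B,1,5\n"
    "1,A,3,20\n"
    "2,B,4,15\n"
)

# Pass 1: aggregate each chunk independently
partials = []
for chunk in pd.read_csv(csv_data, chunksize=2):  # tiny chunks for the demo
    partials.append(
        chunk.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
    )

# Pass 2: re-aggregate the partial sums into global sums
result = pd.concat(partials).groupby(level=['id', 'company']).sum()
print(result)
```

The same two-pass shape works at scale: pass 1 streams the 22 GB file with read_csv(chunksize=...), and pass 2 only ever touches the (much smaller) partial results.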
The solution for your problem is probably Dask. You may watch the introductory video, read examples and try them online in a live session (in JupyterLab).
The answer from MarianD worked perfectly; I am answering to share the solution code here.
Moreover, Dask is able to utilize all cores equally, whereas pandas was maxing out a single core at 100%. That's another benefit of Dask over pandas that I have noticed.
import dask.dataframe as dd
transactions_dataset_DF = dd.read_csv('transactions.csv')
Group_ID_Company_DF = transactions_dataset_DF.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum().compute()
Group_ID_Company_DF.to_csv('Group_ID_Company.csv')
# to clear the memory
transactions_dataset_DF = None
Group_ID_Company_DF = None
Dask was able to read all 350 million rows (20 GB of data) at once, which I had not achieved with pandas: I had to create 37 chunks to process the entire dataset, and the processing took almost 2 hours. With Dask, it only took around 20 minutes to read, process, and save the new dataset in one go.
I'm trying to import a large tab-separated text file (3 GB) into Python using pandas pd.read_csv("file.txt", sep="\t"). The file was originally a ".tab" file whose extension I changed to ".txt" to import it with read_csv(). It has 305 columns and roughly 1,000,000 rows.
When I execute the code, after some time Python returns a MemoryError. I searched for some information and this basically means that there is not enough RAM available. When I specify nrows = 20 in read_csv() it works fine.
The computer I'm using has 46 GB of RAM, of which roughly 20 GB was available to Python.
My question: how is it possible that a 3 GB file needs more than 20 GB of RAM to be imported into Python using pandas read_csv()? Am I doing anything wrong?
EDIT: When executing df.dtypes the types are a mix of object, float64, and int64
UPDATE: I used the following code to overcome the problem and perform my calculations:
import pandas as pd

summed_cols = pd.DataFrame(columns=["sample", "read sum"])
x = 0
while x < 352:
    x = x + 1
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    summed_cols = summed_cols.append(pd.DataFrame({"sample": [sample_col.columns[0]],
                                                   "read sum": [sum(sample_col[sample_col.columns[0]])]}))
    del sample_col
It now selects a column, performs the calculation, stores the result in a dataframe, deletes the current column, and moves on to the next one.
Pandas is cutting up the file and storing each field as an individual Python object. I don't know the data types, so I'll assume the worst: strings.
In Python (on my machine), an empty string needs 49 bytes, with an additional byte for each character if ASCII (or 74 bytes plus 2 extra bytes per character if Unicode). That's roughly 15 KB for a row of 305 empty fields. A million and a half such rows would take roughly 22 GB in memory, while they take about 437 MB in the CSV file.
Pandas/numpy are good with numbers, as they can represent a numerical series very compactly (like C program would). As soon as you step away from C-compatible datatypes, it uses memory as Python does, which is... not very frugal.
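The per-object overhead is easy to check with sys.getsizeof; the exact byte counts below assume a 64-bit CPython 3 build:

```python
import sys

import numpy as np

# Per-object overhead of Python strings: a large fixed header,
# plus one byte per character for ASCII text
empty = sys.getsizeof('')
one_char = sys.getsizeof('a')
print(empty, one_char)   # e.g. 49 and 50 on a 64-bit CPython 3 build

# A numpy float64 column, by contrast, costs a flat 8 bytes per element
arr = np.zeros(1000)
print(arr.nbytes)
```

That gap (tens of bytes of overhead per string vs. 8 bytes per float) is why specifying numeric dtypes, or reading only the columns you need, shrinks the memory footprint so dramatically.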
When inserting a huge pandas dataframe into sqlite via sqlalchemy and pandas to_sql with a specified chunksize, I would get memory errors.
At first I thought it was an issue with to_sql, but I tried a workaround where instead of using chunksize I used for i in range(100): df.iloc[i * 100000:(i + 1) * 100000].to_sql(...), and that still resulted in an error.
It seems that, under certain conditions, there is a memory leak with repeated insertions to sqlite via sqlalchemy.
I had a hard time replicating, through a minimal example, the memory leak that occurred when converting my data, but this gets pretty close.
import string
import numpy as np
import pandas as pd
from random import randint
import random

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    dfAll = pd.DataFrame(data)
    return dfAll

dfAll = alt(randint(1000, 2000), 10000)

for i in range(330):
    print('step ', i)
    data = alt(randint(1000, 2000), 10000)
    df = pd.DataFrame(data)
    dfAll = pd.concat([df, dfAll])

import sqlalchemy
from sqlalchemy import create_engine

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

for i in range(500):
    print('step', i)
    dfAll.iloc[(i%330)*10000:((i%330)+1)*10000].to_sql('test_table22', engine, index=False, if_exists='append')
This was run in a Google Colab CPU environment.
The database itself isn't causing the memory leak, because I can restart my environment and the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping to_sql, or one to_sql call with chunksize specified.
Is there a way that this code could be run without causing an eventual increase in memory usage?
Edit:
To fully reproduce the error, run this notebook
https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS
The notebook requires you to import this folder into the main directory of your Google Drive
https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8
The notebook will also mount your Google drive, you need to give it authorization to access your Google drive. Since the data is hosted on my Google drive, importing the data should not take up any of your allocated data.
The Google Colab instance starts with about 12.72GB of RAM available.
After creating the DataFrame, theBigList, about 9.99GB of RAM have been used.
Already this is a rather uncomfortable situation to be in, since it is not unusual for
Pandas operations to require as much additional space as the DataFrame it is operating on.
So we should strive to avoid using even this much RAM if possible, and fortunately there is an easy way to do this: simply load each .npy file and store its data in the sqlite database one at a time without ever creating theBigList (see below).
However, if we use the code you posted, we can see that the RAM usage slowly increases as chunks of theBigList are stored in the database iteratively.
theBigList DataFrame stores the strings in a NumPy array. But in the process
of transferring the strings to the sqlite database, the NumPy strings are
converted into Python strings. This takes additional memory.
Per this Theano tutorial, which discusses Python's internal memory management:
To speed-up memory allocation (and reuse) Python uses a number of lists for
small objects. Each list will contain objects of similar size: there will be a
list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object
needs to be created, either we reuse a free block in the list, or we allocate a
new one.
... The important point is that those lists never shrink.
Indeed: if an item (of size x) is deallocated (freed by lack of reference) its
location is not returned to Python’s global memory pool (and even less to the
system), but merely marked as free and added to the free list of items of size
x. The dead object’s location will be reused if another object of compatible
size is needed. If there are no dead objects available, new ones are created.
If small objects memory is never freed, then the inescapable conclusion is that,
like goldfishes, these small object lists only keep growing, never shrinking,
and that the memory footprint of your application is dominated by the largest
number of small objects allocated at any given point.
I believe this accurately describes the behavior you are seeing as this loop executes:
for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')
Even though many dead objects' locations are being reused for new strings, it is
not implausible with essentially random strings such as those in theBigList that extra space will occasionally be
needed and so the memory footprint keeps growing.
The process eventually hits Google Colab's 12.72GB RAM limit and the kernel is killed with a memory error.
In this case, the easiest way to avoid large memory usage is to never instantiate the entire DataFrame -- instead, just load and process small chunks of the DataFrame one at a time:
import numpy as np
import pandas as pd
import matplotlib.cbook as mc
import sqlalchemy as SA

def load_and_store(dbpath):
    engine = SA.create_engine("sqlite:///{}".format(dbpath))
    for i in range(0, 47):
        print('step {}: {}'.format(i, mc.report_memory()))
        for letter in list('ABCDEF'):
            path = '/content/gdrive/My Drive/SummarizationTempData/CS2Part{}{:02}.npy'.format(letter, i)
            comb = np.load(path, allow_pickle=True)
            toPD = pd.DataFrame(comb).drop([0, 2, 3], 1).astype(str)
            toPD.columns = ['title', 'abstract']
            toPD = toPD.loc[toPD['abstract'] != '']
            toPD.to_sql('CS_table', engine, index=False, if_exists='append')

dbpath = '/content/gdrive/My Drive/dbfile/CSSummaries.db'
load_and_store(dbpath)
which prints
step 0: 132545
step 1: 176983
step 2: 178967
step 3: 181527
...
step 43: 190551
step 44: 190423
step 45: 190103
step 46: 190551
The last number on each line is the amount of memory consumed by the process as reported by
matplotlib.cbook.report_memory. There are a number of different measures of memory usage. On Linux, mc.report_memory() is reporting
the size of the physical pages of the core image of the process (including text, data, and stack space).
By the way, another basic trick you can use to manage memory is to use functions.
Local variables inside the function are deallocated when the function terminates.
This relieves you of the burden of manually calling del and gc.collect().
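That scoping effect can be demonstrated with a weak reference (a sketch; the Big class is just a stand-in for a large intermediate object):

```python
import gc
import weakref

class Big:
    # Stand-in for a large temporary object such as an intermediate DataFrame
    pass

def process():
    local = Big()              # large intermediate lives only inside the function
    return weakref.ref(local)  # a weak reference does not keep it alive

ref = process()
gc.collect()  # not strictly needed under CPython's refcounting, but explicit
print(ref() is None)  # True: the local was reclaimed when the function returned
```

Had Big been created at module level instead, it would stay alive until explicitly deleted or the interpreter exits.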
I have a few questions wrapped up into this issue. I realize this might be a convoluted post and can provide extra details.
A code package I use can produce large .h5 files (source.h5, 100+ GB), where almost all of this data resides in one dataset (group2/D). I want to make a new .h5 file (dest.h5) using Python that contains all datasets of source.h5 except group2/D, without needing to copy the entire file. I will then condense group2/D after some postprocessing and write a new, much smaller group2/D into dest.h5. However, I need to keep source.h5 because this postprocessing may need to be performed multiple times into multiple destination files.
source.h5 is always structured the same and cannot be changed in either source.h5 or dest.h5, where each letter is a dataset:
group1/A
group1/B
group2/C
group2/D
I thus want to initially make a file with this format:
group1/A
group1/B
group2/C
and again, fill in group2/D later. Simply copying source.h5 multiple times is always possible, but I'd like to avoid having to copy a huge file a bunch of times because disk space is limited and this is something that isn't a 1 off case.
I searched and found this question (How to partially copy using python an Hdf5 file into a new one keeping the same structure?) and tested if dest.h5 would be the same as source.h5:
fs = h5py.File('source.h5', 'r')
fd = h5py.File('dest.h5', 'w')
fs.copy('group1', fd)
fd.create_group('group2')
fs.copy('group2/C', fd['/group2'])
fs.copy('group2/D', fd['/group2'])
fs.close()
fd.close()
but the code package I use couldn't read the file I created (which I need to work), implying there was some critical data loss in this operation (the file sizes also differ by 7 KB). I'm assuming the problem occurred when I created group2 manually, because I checked with numpy that the values in the group1 datasets exactly match in both source.h5 and dest.h5. Before digging into what data is missing, I wanted to get a few things out of the way:
Question 1: Is there .h5 file metadata that accompanies each group or dataset? If so, how can I see it so I can create a group2 in dest.h5 that exactly matches the one in source.h5? Is there a way to see if 2 groups (not datasets) exactly match each other?
Question 2: Alternatively, is it possible to simply copy the data structure of a .h5 file (i.e. groups and datasets with empty lists as a skeleton file) so that fields can be populated later? Or, as a subset of this question, is there a way to copy a blank dataset to another file such that any metadata is retained (assuming there is some)?
Question 3: Finally, to avoid all this, is it possible to just copy a subset of source.h5 to dest.h5? With something like:
fs.copy(['group1','group2/C'], fd)
Thanks for your time. I appreciate you reading this far
I am using pyspark to process 50 GB of data using AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result to save in one csv file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
To my knowledge, I have to perform the following to force the pyspark.sql.DataFrame object to save to 1 CSV file (approx. 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It took about 70 minutes to finish this process, and I am wondering if I can make it faster by using the collect() method, like this?
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general but with expected output size around 1MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough then most likely the issue here is configuration (the good place to start could be adjusting spark.sql.shuffle.partitions if you don't do that already).
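To make the shape of that single aggregation concrete without spinning up a Spark session, here is the same loop -> filter -> union collapse sketched with pandas (the column names mirror the question; the sample rows are invented):

```python
import pandas as pd

# Invented sample data in the question's schema
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2024-01-01 00:10", "2024-01-01 00:40",
        "2024-01-01 01:05", "2024-01-01 01:50",
    ]),
    "Animal": ["cat", "cat", "dog", "dog"],
    "weights": [2.0, 4.0, 10.0, 20.0],
    "is_male": [1, 0, 1, 1],
})

# One grouped aggregation over (hour, Animal) replaces 24 filter+union passes
daily = (
    df.groupby([df["Time"].dt.hour.rename("hour"), "Animal"])
      .agg(mean_weight=("weights", "mean"), males=("is_male", "sum"))
)
print(daily)
```

The pyspark version is structurally identical: groupby(hour("Time"), col("Animal")) lets Spark compute all 24 hourly aggregates in one shuffle instead of 24 separate jobs.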
To save as a single file, these are the options:
Option 1:
coalesce(1) (minimum shuffling of data over the network) or repartition(1) or collect may work for small datasets, but for large datasets it may not perform as expected, since all the data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available than the driver.
Option 2:
Another option would be FileUtil.copyMerge() - merging the outputs into a single file, as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
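For part files that have already landed on a local filesystem, a plain-Python analogue of copyMerge is straightforward (a sketch; the file names imitate Spark's part-* output and the temp directory is illustrative):

```python
import glob
import os
import shutil
import tempfile

src_dir = tempfile.mkdtemp()

# Stand-ins for part files produced by a Spark write
for i, rows in enumerate(["a,1\n", "b,2\n", "c,3\n"]):
    with open(os.path.join(src_dir, f"part-{i:05d}.csv"), "w") as f:
        f.write(rows)

# Concatenate the parts in sorted order into a single output file
merged = os.path.join(src_dir, "merged.csv")
with open(merged, "wb") as out:
    for part in sorted(glob.glob(os.path.join(src_dir, "part-*.csv"))):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)
```

Sorting by file name preserves Spark's partition ordering; for CSVs with headers you would additionally skip the header line on every part after the first.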
Option 3:
After getting the part files, you can use the HDFS getmerge command like this:
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide, based on your requirements, which one is safer/faster.
Also, have a look at: Dataframe save after join is creating numerous part files