I am trying to load a JSON file into Google BigQuery using the script at
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/api/load_data_by_post.py with very little modification.
I added `,chunksize=10*1024*1024, resumable=True))` to MediaFileUpload.
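For reference, the modified call presumably ends up looking something like this (the file name and mimetype below are just placeholders, not the sample's actual values):

from googleapiclient.http import MediaFileUpload

# Resumable upload in 10 MB chunks; 'data.json' and the mimetype are placeholders.
media_body = MediaFileUpload('data.json',
                             mimetype='application/octet-stream',
                             chunksize=10 * 1024 * 1024,
                             resumable=True)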
The script works fine for a sample file with a few million records. The actual file is about 140 GB with approx 200,000,000 records. insert_request.execute() always fails with
socket.error: `[Errno 32] Broken pipe`
after half an hour or so. How can this be fixed? Each row is less than 1 KB, so it shouldn't be a quota issue.
When handling large files, don't use streaming; use batch loading instead. Streaming will easily handle up to 100,000 rows per second, which is great for streaming, but not for loading large files.
The linked sample code is doing the right thing (batch instead of streaming), so what we are seeing is a different problem: the sample tries to load all of this data straight into BigQuery, and it is the upload through POST that fails.
Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
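A minimal sketch of that approach, using the current google-cloud-storage and google-cloud-bigquery client libraries rather than the older API client from the sample (the bucket, dataset, and table names here are placeholders):

from google.cloud import bigquery, storage

# 1. Stage the file in GCS (bucket and object names are placeholders).
storage_client = storage.Client()
bucket = storage_client.bucket('my-staging-bucket')
bucket.blob('data.json').upload_from_filename('data.json')

# 2. Tell BigQuery to load the staged file from GCS.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
load_job = bq_client.load_table_from_uri(
    'gs://my-staging-bucket/data.json',
    'my_dataset.my_table',
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish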
Update: After talking to the engineering team, POST should work if you try a smaller chunksize.
Related
I have a Python process which takes a file containing streamed data and converts it into a format ready to load into a database. I have just migrated this process from one Linux GCP VM to another running exactly the same code, but the final output file is nearly 4 times as big: 500 MB vs 2 GB.
When I download the files and manually inspect them, they look exactly the same to the eye.
Any ideas what could be causing this?
Edit: Thanks for the feedback; I traced it back to the input file, which is slightly different (as my stream-recording process has also been migrated).
I am now trying to work out why a marginally different input file creates such a different output file once it's been processed.
I have created a File Dataset using the Azure ML Python API. The data in question is a bunch of parquet files (~10K parquet files, each 330 KB in size) residing in Azure Data Lake Gen 2, spread across multiple partitions. Then, I tried to mount the dataset in an AML compute instance. During this mounting process, I observed that each parquet file gets downloaded twice under the /tmp directory of the compute instance, with the following message printed in the console logs:
Downloaded path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<blob_path>/20211203.parquet is different from target path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<container_name>/<blob_path>/20211203.parquet
This log message gets printed for each parquet file which is part of the dataset.
Also, the process of mounting the dataset is very slow: 44 minutes for ~10K parquet files, each 330 KB in size.
The "%%time" command in JupyterLab shows that most of the time was spent on I/O:
CPU times: user 4min 22s, sys: 51.5 s, total: 5min 13s
Wall time: 44min 15s
Note: Both the Data Lake Gen 2 and Azure ML compute instance are under the same virtual network.
Here are my questions:
How to avoid downloading the parquet file twice?
How to make the mounting process faster?
I have gone through this thread, but the discussion there didn't reach a conclusion.
The Python code I have used is as follows:
import pandas as pd
from azureml.core import Dataset

# list_of_blobs, ws, dataset_name, create_new_version and path_to_mount
# are defined earlier in the script.
data = Dataset.File.from_files(path=list_of_blobs, validate=True)
dataset = data.register(workspace=ws, name=dataset_name, create_new_version=create_new_version)

mount_context = None
try:
    mount_context = dataset.mount(path_to_mount)
    # Mount the file stream
    mount_context.start()
except Exception as ex:
    raise ex

df = pd.read_parquet(path_to_mount)
The robust option is to download directly from the AzureBlobDatastore. You need to know the datastore and relative path, which you can get by printing the dataset description. Namely:
import tempfile

import pandas as pd
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
dstore = ws.datastores.get(dstore_name)  # dstore_name/dstore_path come from the dataset description
target = (dstore, dstore_path)

with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df = pd.read_parquet(tmpdir)
The convenient option is to stream tabular datasets. Note that you don't control how the file is read (Microsoft converters may occasionally not work as you expect). Here is the template:
ds = Dataset.Tabular.from_parquet_files(target)
df = ds.to_pandas_dataframe()
I have executed a bunch of tests to compare the performance of FileDataset.mount() and FileDataset.download(). In my environment, download() is much faster than mount().
download() works well when the disk size of the compute is large enough to fit all the files. However, in a multi-node environment, the same data (in my case parquet files) gets downloaded to each of the nodes (multiple copies). As per the documentation:
If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode will avoid the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode.
Downloading data in a multi-node environment could trigger performance issues (link). In such a case, mount() might be preferred.
I have tried TabularDataset as well. As Maciej S has mentioned, in the case of TabularDataset the user doesn't need to decide how data is read from the datastore (i.e. the user doesn't need to choose mount or download as the access mode). But, with the current implementation (azureml-core 1.38.0) of TabularDataset, the compute needs more memory (RAM) than FileDataset.download() does for an identical set of parquet files. It looks like the current implementation first reads each individual parquet file into a pandas DataFrame (held in memory/RAM) and then appends them into a single DataFrame (the one the API user accesses). More memory might be needed because of this "eager" nature of the API.
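Conceptually, the "eager" behaviour described above is similar to this sketch (an illustration only, not the actual azureml-core implementation; parquet_files stands for the list of files backing the dataset):

import pandas as pd

# Every file becomes its own in-memory DataFrame before they are combined,
# so peak RAM scales with the size of the whole dataset.
frames = [pd.read_parquet(f) for f in parquet_files]
df = pd.concat(frames, ignore_index=True)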
Let's assume I request a large object from a server and want to write it to disk without using a lot of memory (the object should not be fully loaded into memory; for example, my object is about 5 GB but my memory limit is 100 MB).
Here is some example Python code:
import requests

response = requests.get('https://example.com/verylargefile.txt', stream=True)
chunk_size = 1024 * 1024 * 15  # 15 MB chunks, for example

with open('some_file_name.txt', 'wb') as file:
    for block in response.iter_content(chunk_size):
        file.write(block)
In this code, we write every 15 MB we receive to disk, but in reality, will the file data pile up in memory? That is, could more data arrive than I am able to write to disk in a given time?
How does the streaming work? (I am familiar with HTTP streaming.) After each 15 MB I handle, do I ask the server for the next 15 MB? (I'm sure it's not like that, but how does it actually work?)
I have this Python server that connects to an SFTP server and pulls CSV files (there is a for loop running in a Node.js server, and each time a different connection comes in).
In that Python server, I'm reading the CSV file with pandas, like this:
file = sftp.open(latestfile)
check = pd.read_csv(file).to_csv()
At the end, I return check with the CSV file data inside, and then I parse it in the Node.js server.
This process went really well and I managed to pull a lot of data this way, but my Python server crashed when it tried to read a big CSV file (22 MB).
I searched online and tried to solve it with chunks, with the modin library, and with dask.dataframe, but every time I tried one of these methods I couldn't read the file content properly (the .to_csv part).
I'm really lost right now because I can't get it to work (and there can be larger files than that).
Here is a way to process your large CSV files. It allows you to process one chunk at a time, and you can modify it based on your requirements (such as reading through SFTP, as sketched after the minimal example below).
Minimal example
import pandas as pd

chunksize = 10 ** 4
for chunk in pd.read_csv(latestfile, chunksize=chunksize):
    process(chunk.to_csv())
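And a sketch of the SFTP variant mentioned above, assuming sftp.open() (e.g. from paramiko) returns a file-like object, and that latestfile and process() are the same placeholders as in the minimal example:

import pandas as pd

chunksize = 10 ** 4
with sftp.open(latestfile) as remote_file:
    # pandas reads the remote file-like object in chunks instead of all at once
    for chunk in pd.read_csv(remote_file, chunksize=chunksize):
        process(chunk.to_csv())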
I have three files, each containing close to 300k records. I have written a Python script to process those files with some business logic and am able to create the output file successfully. This process completes in 5 minutes.
I am using the same script to process files with a much higher volume of data (all three input files contain about 30 million records). Now the processing takes hours and keeps running for a very long time.
So I am thinking of breaking the file into 100 small chunks based on the last two digits of the unique ID and having them processed in parallel. Are there any data pipeline packages I could use for this?
BTW, I am running this process on my VDI machine.
I am not aware of a specific API for this, but you can try multiprocessing or multithreading to process large volumes of data.
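A minimal multiprocessing sketch of the bucketing approach described in the question (process_bucket() and the output file names are hypothetical placeholders for your business logic):

from multiprocessing import Pool

def process_bucket(bucket_id):
    # Apply the business logic to all records whose unique ID ends in
    # bucket_id (00-99) and write a partial output file.
    ...
    return f"output_{bucket_id:02d}.csv"

if __name__ == "__main__":
    # One task per two-digit suffix; tune the pool size to the VDI's CPU count.
    with Pool(processes=8) as pool:
        partial_outputs = pool.map(process_bucket, range(100))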