I have created a File Dataset using the Azure ML Python API. The data in question is a bunch of parquet files (~10K files, each ~330 KB in size) residing in Azure Data Lake Gen 2, spread across multiple partitions. I then tried to mount the dataset on an AML compute instance. During the mounting process, I observed that each parquet file was downloaded twice under the /tmp directory of the compute instance, with the following message printed to the console logs:
Downloaded path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<blob_path>/20211203.parquet is different from target path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<container_name>/<blob_path>/20211203.parquet
This log message gets printed for each parquet file that is part of the dataset.
Also, the mounting process is very slow: 44 minutes for ~10K parquet files, each ~330 KB in size.
"%%time" command in the Jupyter Lab shows most of the time has been used for IO process?
CPU times: user 4min 22s, sys: 51.5 s, total: 5min 13s
Wall time: 44min 15s
Note: Both the Data Lake Gen 2 and Azure ML compute instance are under the same virtual network.
Here are my questions:
How can I avoid downloading each parquet file twice?
How can I make the mounting process faster?
I have gone through this thread, but the discussion there didn't reach a conclusion.
The Python code I used is as follows:
import pandas as pd
from azureml.core import Dataset

# Create a FileDataset from the list of blob paths and register it
data = Dataset.File.from_files(path=list_of_blobs, validate=True)
dataset = data.register(workspace=ws, name=dataset_name, create_new_version=create_new_version)

mount_context = None
try:
    # Mount the file stream onto the compute instance
    mount_context = dataset.mount(path_to_mount)
    mount_context.start()
except Exception as ex:
    raise ex

df = pd.read_parquet(path_to_mount)
The robust option is to download directly from the AzureBlobDatastore. You need to know the datastore and relative path, which you can get by printing the dataset description. Namely:
import tempfile
import pandas as pd
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
dstore = ws.datastores.get(dstore_name)
target = (dstore, dstore_path)

with tempfile.TemporaryDirectory() as tmpdir:
    # Download the files locally, then read them with pandas
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df = pd.read_parquet(tmpdir)
The convenient option is to stream tabular datasets. Note that you don't control how the file is read (Microsoft converters may occasionally not work as you expect). Here is the template:
ds = Dataset.Tabular.from_parquet_files(target)
df = ds.to_pandas_dataframe()
I have executed a bunch of tests to compare the performance of FileDataset.mount() and FileDataset.download(). In my environment, download() is much faster than mount().
download() works well when the disk size of the compute is large enough to fit all the files. However, in a multi-node environment, the same data (in my case parquet files) gets downloaded to each of the nodes (multiple copies). As per the documentation:
If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode will avoid the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode.
Downloading data in a multi-node environment could trigger performance issues (link). In such a case, mount() might be preferred.
I have tried with TabularDataset as well. As Maciej S has mentioned, with TabularDataset the user doesn't need to decide how data is read from the datastore (i.e. the user doesn't need to choose mount or download as the access mode). But with the current implementation (azureml-core 1.38.0) of TabularDataset, the compute needs more memory (RAM) than FileDataset.download() for an identical set of parquet files. It looks like the current implementation first reads each individual parquet file into a pandas DataFrame (held in memory/RAM) and then appends them into a single DataFrame (the one accessed by the API user). Higher memory might be needed because of this "eager" nature of the API.
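Roughly, the eager pattern described above amounts to something like the sketch below (a simplification for illustration, not the actual azureml-core source; the local directory is a placeholder):

import glob
import pandas as pd

# Simplified illustration of the eager behavior: every parquet file becomes its
# own in-memory DataFrame before one concatenated DataFrame is returned, so peak
# RAM is roughly the whole dataset plus the per-file intermediate copies.
downloaded_dir = "/tmp/downloaded_parquet"  # hypothetical local copy of the files
parts = [pd.read_parquet(p)
         for p in glob.glob(f"{downloaded_dir}/**/*.parquet", recursive=True)]
df = pd.concat(parts, ignore_index=True)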
Related
I am trying to read data from several text files in my Google Drive into a Google Colab notebook with the following Python code.
import os
import glob
import pandas as pd

# Load the Drive helper and mount; this will prompt for authorization.
from google.colab import drive
drive.mount('/content/drive')

os.chdir("/content/drive/MyDrive/AMI_2000_customers")

# Collect the names of all .txt files whose name contains 2020 or 2021
extension = 'txt'
all_filenames = pd.Series([i for i in glob.glob('*.{}'.format(extension))])
searchfor = ['2020', '2021']
result = list(all_filenames[all_filenames.str.contains('|'.join(searchfor))])
After that, I try to combine them by running the code below. Each raw data file contains one month of customer data, so preserving the continuity of the time series matters for the data preparation in the next step.
data = pd.concat([pd.read_csv(f , sep='\t',header=None) for f in result])
The raw data files that meet the "searchfor" condition number around 24 files, roughly 11.7 GB in total, and look like this in the Google Drive directory.
I face a high RAM consumption problem (almost reaching the maximum available RAM) when I execute the above program, and I then do not have enough RAM left for the next processing steps in Google Colab (I subscribed to Google Colab Pro and can use the Python 3 Google Compute Engine backend with both GPU and TPU, which provides up to 35 GB of memory).
Is there an appropriate way to complete my task with reasonable RAM usage and computation time, so that I can avoid running out of available RAM?
Might something like this help? yield?
Lazy Method for Reading Big File in Python?
Here is a reference to yield: https://realpython.com/introduction-to-python-generators/
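To make the yield idea concrete, here is a hedged sketch (the combined.csv output path and the chunk size are assumptions) that reads each monthly file in fixed-size chunks with pandas and appends them to a single file on disk, so only one chunk sits in RAM at a time:

import pandas as pd

def read_in_chunks(files, chunksize=500_000):
    # Yield one chunk of rows at a time instead of loading a whole file
    for f in files:
        for chunk in pd.read_csv(f, sep='\t', header=None, chunksize=chunksize):
            yield chunk

first = True
for chunk in read_in_chunks(result):
    # Append each chunk to one combined file on disk
    chunk.to_csv('combined.csv', sep='\t', mode='w' if first else 'a',
                 header=False, index=False)
    first = False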
I am running Spark in Kubernetes with the standalone Spark cluster manager and two Spark workers. I use Jupyter to set up Spark applications. The deploy mode is set to "client", so the driver process runs in the pod where Jupyter runs. We read a CSV file from an Amazon S3 proxy with requests.get() and transform it into an RDD and afterwards into a Spark DataFrame. For reading the CSV file from S3 we are not using the spark.read method but requests.get(). The whole process, from reading to the Spark DataFrame, happens in a function which returns the DataFrame.
import requests

S3PROXY = "<url-to-proxy>"  # URL of the S3 proxy

def loadFromS3intoSparkDataframe(s3PathNameCsv):
    # Fetch the CSV through the proxy, split it into lines, parallelize, and split columns
    s3_rdd = spark2.sparkContext.parallelize(
        requests.get(S3PROXY + "/object", params="key={0}".format(s3PathNameCsv)).content.decode("UTF-8").split('\n'), 24
    ).map(lambda x: x.split(','))
    # Use the first row as the header and drop it from the data rows
    header = s3_rdd.first()
    return s3_rdd.filter(lambda row: row != header).toDF(header)
The RAM consumption for keeping this Spark DataFrame in memory is 5 GB, while the source CSV file is only 1 GB in size. The 5 GB of RAM stays in the driver process. Some of my co-workers say there should be an option to permanently move the in-memory storage to the Spark worker nodes, i.e. to the Spark executors. As far as I understand, this is only possible as a copy with persist() or cache().
So my questions are: is my understanding correct that, by default, the RDDs and DataFrames are stored in the driver process memory? And if so, is it possible to move the variables to the executors for the whole lifetime of the Spark application? And is it uncommon for the data to grow from 1 GB to 5 GB in such a transformation?
Solution:
We did not use the right approach to load the CSV. If you do not want to store the data in the driver's memory, you have to use the spark.read.csv() function.
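For example, a minimal sketch of that approach (assuming spark2 is the SparkSession from the snippet above, the s3a path is a placeholder, and the cluster's Hadoop configuration can reach the S3 endpoint) lets the executors read the file instead of the driver:

df = (
    spark2.read
    .option("header", "true")       # first CSV line holds the column names
    .option("inferSchema", "true")  # let Spark derive the column types
    .csv("s3a://my-bucket/my_file.csv")
)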
I wrote an MR job in Python, run via the streaming jar package. I want to know how to use bulk load to put the data into HBase.
I know that there are 2 ways to get data into HBase by bulk loading:
generate the HFiles in the MR job, and use CompleteBulkLoad to load the data into HBase;
use the ImportTsv option and then use CompleteBulkLoad to load the data.
I don't know how to use Python to generate HFiles that fit into HBase, so I tried the ImportTsv utility instead, but it failed. I followed the instructions in this [example](http://hbase.apache.org/book.html#importtsv), but I got this exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/Filter...
Now I want to ask 3 questions:
Can Python be used to generate HFiles via the streaming jar or not?
How do I use ImportTsv?
Can bulk load be used to update a table in HBase? I get a file bigger than 10 GB every day; could bulk load be used to push that file into HBase?
The hadoop version is: Hadoop 2.8.0
The hbase version is: HBase 1.2.6
Both running in standalone mode.
Thanks for any answer.
--- update ---
ImportTsv works correctly.
But I still want to know how to generate the HFiles in an MR job via the streaming jar in Python.
You could try HappyBase.
import happybase

# Connect via the Thrift server (host is an example)
connection = happybase.Connection('localhost')
table = connection.table("mytable")

with table.batch(batch_size=1000) as b:
    for i in range(1200):
        b.put(b'row-%04d' % i, {
            b'cf1:col1': b'v1',
            b'cf1:col2': b'v2',
        })
As you may have imagined already, a Batch keeps all mutations in memory until the batch is sent, either by calling Batch.send() explicitly, or when the with block ends. This doesn’t work for applications that need to store huge amounts of data, since it may result in batches that are too big to send in one round-trip, or in batches that use too much memory. For these cases, the batch_size argument can be specified. The batch_size acts as a threshold: a Batch instance automatically sends all pending mutations when there are more than batch_size pending operations.
This needs a Thrift server running in front of HBase. Just a suggestion.
I am trying to load a JSON file into Google BigQuery using the script at
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/api/load_data_by_post.py with very little modification.
I added
,chunksize=10*1024*1024, resumable=True))
to MediaFileUpload.
The script works fine for a sample file with a few million records. The actual file is about 140 GB with approx 200,000,000 records. insert_request.execute() always fails with
socket.error: `[Errno 32] Broken pipe`
after half an hour or so. How can this be fixed? Each row is less than 1 KB, so it shouldn't be a quota issue.
When handling large files don't use streaming, but batch load: Streaming will easily handle up to 100,000 rows per second. That's pretty good for streaming, but not for loading large files.
The sample code linked is doing the right thing (batch instead of streaming), so what we see is a different problem: This sample code is trying to load all this data straight into BigQuery, but the uploading through POST part fails.
Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
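As a rough sketch of that pattern (using the newer google-cloud-bigquery client rather than the apiclient-based sample; the bucket, dataset and table names are placeholders), once the file is staged in GCS the load becomes a single batch job:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/my_data.json",     # file already uploaded to GCS
    "my_project.my_dataset.my_table",  # destination table
    job_config=job_config,
)
load_job.result()  # wait for the batch load to finish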
Update: Talking to the engineering team, POST should work if you try a smaller chunksize.
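If you want to try that, a hedged sketch of the change (the file name is a placeholder; parameters follow googleapiclient.http.MediaFileUpload) is simply a smaller resumable chunk size:

from googleapiclient.http import MediaFileUpload

media = MediaFileUpload(
    'data.json',                       # hypothetical local file
    mimetype='application/octet-stream',
    chunksize=1 * 1024 * 1024,         # 1 MB per POST instead of 10 MB
    resumable=True,
)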
Spark jobs (I think) create a file for each partition so that they can handle failures, etc., so at the end of the job you are left with a folder that can contain a lot of files. These are being automatically loaded to S3, so is there a way to merge them into a single compressed file that is ready for loading into Redshift?
Instead of the following, which will write one uncompressed file per partition in "my_rdd"...
my_rdd.saveAsTextFile(destination)
One could do...
my_rdd.repartition(1).saveAsTextFile(destination, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
This sends the data in all partitions to one particular worker node in the cluster to be combined into one massive partition, which will then be written out into a single gzip compressed file.
However, I don't believe this is a desirable solution to the problem. Just one thread writes out and compresses the single result file; if that file is huge, that could take "forever", while every core in the cluster but one sits idle. Redshift doesn't need everything to be in a single file: it can easily handle loading a set of files. Use COPY with a "manifest file" or a "prefix": Using the COPY Command to Load from S3.
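For instance, a hedged sketch of that alternative (the destination prefix, table name, and credentials are placeholders): write one compressed file per partition so every core works in parallel, and let a single COPY pick up the whole prefix.

# Write gzip-compressed part files, one per partition, in parallel
my_rdd.saveAsTextFile(
    "s3://my-bucket/output/",  # hypothetical destination prefix
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)

# Then, in Redshift (not Spark), load every part file under the prefix:
#   COPY my_table
#   FROM 's3://my-bucket/output/part-'
#   CREDENTIALS '...'
#   GZIP DELIMITER '\t';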