dask export dataframe to remote storage (S3) - python

When using
dask_df.to_csv('s3://mybucket/mycsv.csv')
I get an error saying that I should install s3fs.
I did install it on the workers (with client.run()) and still got the error.
So I installed s3fs locally on my machine, and then it does work.
But does that mean the data is first sent to my machine and only then exported to S3, instead of being processed entirely in the cluster?
I also get KilledWorker errors. The exported dataframe is built from two dask dataframes combined with dd.concat().

But does that mean the data is first sent to my machine and only then exported to S3, instead of being processed entirely in the cluster?
No, it just means that your client process needs to also talk to S3 in order to set things up.
In general, the software environment on your workers and your client process should be the same.
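For concreteness, here is a minimal sketch of the setup described above, assuming a dask.distributed scheduler at a placeholder address and placeholder bucket paths, with s3fs importable on the client and on every worker:

import dask.dataframe as dd
from dask.distributed import Client

# Connect to the existing cluster (placeholder address).
client = Client("tcp://scheduler-address:8786")

# Build a lazy dataframe; nothing is read yet.
ddf = dd.read_csv("s3://mybucket/input-*.csv")

# The client only constructs the task graph and talks to S3 to set things up;
# each worker writes its own partitions directly to the bucket.
ddf.to_csv("s3://mybucket/output/part-*.csv", index=False)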

Related

How can I transfer files from Google AI platform training job to my another compute instance or local machine?

Any ideas how to automatically send some files (mainly TensorFlow models) after training on Google AI Platform to another compute instance or to my local machine? For instance, I would like to run something like os.system('scp -r ./file1 user@host:/path/to/folder') in my trainer. Of course I don't need to use scp; it's just an example. Is there such a possibility in Google? There is no problem transferring files from the job to Google Cloud Storage, e.g. os.system('gsutil cp ./example_file gs://my_bucket/path/'). However, when I try something like os.system('gcloud compute scp ./example_file my_instance:/path/') to transfer data from my AI Platform job to another instance, I get Your platform does not support SSH. Any ideas how I can do this?
UPDATE
Maybe there is a way to automatically download all the files from a chosen folder in Google Cloud Storage? Then I would, for instance, upload data from my job instance to that Cloud Storage folder, and my other instance would automatically detect the changes and download all the new files.
UPDATE2
I found gsutil rsync, but I am not sure whether it can run constantly in the background. At this point the only solution that comes to mind is to set up a cron job on the receiving instance and run gsutil rsync, for example, every 10 minutes, but that doesn't seem like an optimal solution. Maybe there is a built-in tool or a better idea?
The gsutil rsync command makes the contents under destination the same as the contents under source, by copying any missing files/objects (or those whose data has changed) and, if the -d option is specified, deleting any extra files/objects. source must specify a directory, bucket, or bucket subdirectory. However, the command does not run in the background.
Remember that the notebook you're using is in fact a VM running JupyterLab. Based on that, you could run rsync once TensorFlow has finished creating the files and sync them with a directory on another instance by trying something like:
import os
os.system("rsync -avrz Tensorflow/outputs/filename root#ip:Tensorflow/otputs/file")
I suggest you take a look at the rsync documentation to learn about all the options available for that command.
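If you do end up with the polling approach from UPDATE2, here is a minimal sketch of what the receiving instance could run, assuming gsutil is on the PATH and that the bucket path and local directory are placeholders:

import subprocess
import time

while True:
    # Mirror the bucket folder into the local directory; -r recurses into subfolders.
    subprocess.run(
        ["gsutil", "rsync", "-r", "gs://my_bucket/path/", "/home/user/models"],
        check=True,
    )
    time.sleep(600)  # re-check every 10 minutes, as suggested in the question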

Script that can automatically download new data from the server to my local backup

I have an application running on a Linux server and I need to create a local backup of its data.
However, new data is added to the application every hour, and I want to keep my local backup in sync with the server's data.
I want to write a script (shell or Python) that can automatically download the newly added data from the Linux server to my local backup. But I am a newbie to the Linux environment and don't know how to write a shell script to achieve this.
What is the best way of achieving this, and what would the script look like?
rsync -r fits your use case, and it's a single-line command:
rsync -r source destination
plus whatever options your specific case needs.
So you don't need a Python script for this, but you can still write one and have it invoke the command above.
Moreover, if you want the Python script to run automatically, you may check the event scheduler module (sched); a sketch follows below.
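For example, a minimal sketch of that idea, assuming rsync is installed locally and that user@server:/app/data/ and ./backup/ are placeholder locations:

import sched
import subprocess
import time

scheduler = sched.scheduler(time.time, time.sleep)

def backup():
    # Pull only new or changed files from the server.
    subprocess.run(["rsync", "-r", "user@server:/app/data/", "./backup/"], check=True)
    # Schedule the next run in one hour.
    scheduler.enter(3600, 1, backup)

scheduler.enter(0, 1, backup)
scheduler.run()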
This depends on where and how your data is stored on the Linux server, but you could write a network application which pushes the data to a client and the client saves the data on the local machine. You can use sockets for that.
If the data is available via an HTTP server and you know how to write RESTful APIs, you could use that as well and run a task on your local machine every hour that calls the REST API and handles its (JSON) data. Keep in mind that you need to secure the API if the server is reachable from the internet and not only from within the same LAN.
You could also write a small application which downloads the files every hour from the server over FTP (if you want to back up files stored on the filesystem); a sketch follows below. You will need to know the exact path of the file(s) to do this, though.
All of the solutions above are meant for Python. Using a shell script is possible, but a little more complicated. I would use Python for this kind of task, as you have a lot of network-related libraries available (FTP, sockets, HTTP clients, simple HTTP servers, WSGI libraries, etc.).
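As an illustration of the FTP option, a minimal sketch using the standard library, assuming an FTP server runs on the Linux box and that the host, credentials, and file name are placeholders:

from ftplib import FTP

def download_backup():
    with FTP("server.example.com") as ftp:
        ftp.login(user="backup_user", passwd="secret")
        ftp.cwd("/app/data")
        with open("data_backup.csv", "wb") as fh:
            # RETR streams the remote file into the local file handle.
            ftp.retrbinary("RETR data_backup.csv", fh.write)

download_backup()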

What does an Apache Beam Dataflow job do locally?

I'm having some issues with a Dataflow pipeline defined with the Apache Beam Python SDK. If I step through my code it reaches the pipeline.run() step, which I assume means the execution graph was successfully defined. However, the job never registers in the Dataflow monitoring tool, which makes me think it never reaches the pipeline validation step.
I'd like to know more about what happens between these two steps to help in debugging the issue. I see output indicating that the packages in my requirements.txt and apache-beam are being pip installed, and it seems like something is getting pickled before being sent to Google's servers. Why is that? If I already have apache-beam downloaded, why download it again? What exactly is getting pickled?
I'm not looking for a solution to my problem here, just trying to understand the process better.
During graph construction, Dataflow checks for errors and any illegal operations in the pipeline. Once the checks pass, the execution graph is translated into JSON and transmitted to the Dataflow service, where the JSON graph is validated and becomes a job.
However, if the pipeline is executed locally, the graph is not translated to JSON or transmitted to the Dataflow service. So the graph would not show up as a job in the monitoring tool; it will run on the local machine [1]. You can follow the documentation to configure local execution [2].
[1] https://cloud.google.com/dataflow/service/dataflow-service-desc#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
[2] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#configuring-pipelineoptions-for-local-execution
The packages from requirements.txt are downloaded with pip download and staged to the staging location. Dataflow uses this staging location as a cache: when pip install -r requirements.txt is called on the Dataflow workers, packages are looked up there first, which reduces calls to PyPI.
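A minimal sketch of how that staging is typically wired up on the submission side, assuming the project, region, and bucket names are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/temp",
    "--staging_location=gs://my-bucket/staging",   # where downloaded packages are staged
    "--requirements_file=requirements.txt",        # pip download'ed locally, then staged
])

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(["hello"]) | beam.Map(print)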

Where is RDD or Spark SQL dataframe stored or persisted in client deploy mode on a Spark 2.1 Standalone cluster?

I am deploying a Jupyter notebook (using a Python 2.7 kernel) on the client side, which accesses remote data and does the processing in a remote Spark standalone cluster (using the pyspark library). I am running the Spark application in client deploy mode. The client machine does not have any Spark worker nodes.
The client does not have much memory (RAM). I wanted to know: if I perform a Spark action on a dataframe, such as df.count(), on the client machine, will the dataframe be stored in the client's RAM or in the Spark workers' memory?
If I understand correctly, what you will get on the client side is an int. At least it should be, if things are set up correctly. So the answer is no, the DataFrame is not going to hit your local RAM.
You are interacting with the cluster via a SparkSession (a SparkContext in earlier versions). Even though you are developing (i.e. writing code) on the client machine, the actual computation of Spark operations (i.e. running the pyspark code) will not be performed on your local machine.
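A minimal sketch of what that looks like in practice, assuming an existing SparkSession named spark connected to the standalone cluster and a placeholder HDFS path:

# The dataframe definition is lazy; no data moves yet.
df = spark.read.parquet("hdfs:///data/events")

# count() is computed by the executors on the worker nodes; only a single
# integer is sent back to the client (driver) process.
n = df.count()
print(type(n), n)   # <class 'int'>

# By contrast, collect() would pull every row into the client's RAM,
# which is exactly what you want to avoid on a memory-constrained client:
# rows = df.collect()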

Deleting a csv file which is created using numpy.savetxt in pyspark

I am new to pyspark and python.
After saving a file on the local filesystem with numpy.savetxt("test.csv", file, delimiter=','), I am using os to delete that file with os.remove("test.csv"). I am getting the error java.io.FileNotFoundException: File file:/someDir/test.csv does not exist. numpy.savetxt() creates the file with read-only permission. How can I save it with read and write permission?
I am using Spark version 2.1.
It looks like your Spark workers are not able to access the file. You are probably running the master and the workers on different servers. When you work with files while the workers are set up across different machines, make sure those workers can access the file. You could keep an identical copy of the files on all workers at the exact same path, but it is generally advisable to use a distributed filesystem such as HDFS, e.g. "hdfs://path/file"; then the workers can access the files.
More details on:
Spark: how to use SparkContext.textFile for local file system
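A minimal sketch of that workflow, assuming a SparkSession named spark, a NumPy array arr, and placeholder paths; the HDFS copy step is shown as a comment because it depends on your cluster setup:

import os
import numpy as np

# Write on the driver's local filesystem.
np.savetxt("/tmp/test.csv", arr, delimiter=",")

# Reading it back through Spark only works if every worker can see this path;
# otherwise copy it to a shared filesystem first, e.g.:
#   hdfs dfs -put /tmp/test.csv /someDir/test.csv
df = spark.read.csv("hdfs:///someDir/test.csv")

# The driver-local copy can then be removed without affecting the workers.
os.remove("/tmp/test.csv")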
