I am new to PySpark and Python.
After saving a file on the local system with numpy.savetxt("test.csv", file, delimiter=','),
I delete it with os.remove("test.csv"). I then get the error java.io.FileNotFoundException: File file:/someDir/test.csv does not exist. It seems numpy.savetxt() creates the file with read-only permission. How can I save it with both read and write permission?
Using Spark version 2.1
Looks like your Spark workers are not able to access the file. You are probably running the master and the workers on different servers. When you work with files and your workers are spread across different machines, make sure every worker can access the file. You could keep an identical copy of the file on each worker at exactly the same path, but it is generally advisable to use a distributed file system such as HDFS and refer to the file as "hdfs://path/file". That way all the workers can access it.
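For illustration, a minimal sketch of both options under Spark 2.1 (the app name and exact paths are placeholders, not taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-example").getOrCreate()
sc = spark.sparkContext

# Option 1: the file exists at the same local path on the driver and on every worker
rdd_local = sc.textFile("file:///someDir/test.csv")

# Option 2 (preferred): the file lives on a distributed file system such as HDFS,
# which every worker can reach
rdd_hdfs = sc.textFile("hdfs:///someDir/test.csv")

print(rdd_hdfs.count())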
More details on:
Spark: how to use SparkContext.textFile for local file system
In AWS Glue, I have a small pandas job that reads data from an XLSX file and writes it to CSV. As per the Glue Python instructions, I zipped the required libraries and provided them as packages to the Glue job at execution time.
Question: What do the following logs convey?
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/fsspec.zip
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/jmespath.zip
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/s3fs.zip
....
Could you please elaborate with an example?
In Python shell jobs, you should add external libraries as an egg file, not a zip file. Zip files are for Spark jobs.
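As a rough illustration (the package name, the Python version in the egg file name, and the S3 location are placeholders), the egg can be built with setuptools and then referenced as the job's Python library path:

# setup.py, placed next to the library code you want to package
from setuptools import setup, find_packages

setup(
    name="mylibs",        # placeholder package name
    version="0.1",
    packages=find_packages(),
)

# Build it with:  python setup.py bdist_egg
# Then upload dist/mylibs-0.1-py3.6.egg to S3 and set it as the job's
# "Python library path" (the --extra-py-files default argument) instead of the .zip files.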
I also wrote a small shell script that deploys a Python shell job without the manual steps of creating the egg file, uploading it to S3, and deploying via CloudFormation; the script does all of this automatically. You can find the code at https://github.com/fatangare/aws-python-shell-deploy. The script takes a CSV file and converts it into an Excel file using the pandas and xlsxwriter libraries.
When using
dask_df.to_csv('s3://mybucket/mycsv.csv')
I get an error saying that I should install s3fs.
I installed it on the workers (with client.run()) and still got the error.
So I installed s3fs locally on my machine as well, and then it works.
But does that mean the data is first sent to my machine and only then exported to S3, instead of being processed entirely in the cluster?
I also get KilledWorker errors. The export is made from two dask dataframes combined with dd.concat().
But does that mean the data is first sent to my machine and only then exported to S3, instead of being processed entirely in the cluster?
No, it just means that your client process needs to also talk to S3 in order to set things up.
In general, the software environment on your workers and your client process should be the same.
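To make that concrete, here is a minimal sketch (the scheduler address and bucket are placeholders; it assumes s3fs is installed in both the client environment and on the workers):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-address:8786")   # placeholder scheduler address

# Optional sanity check: reports packages whose versions differ between the
# client, the scheduler and the workers (e.g. s3fs missing on one of them)
client.get_versions(check=True)

# Stand-ins for the two dataframes from the question
part_a = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
part_b = dd.from_pandas(pd.DataFrame({"x": range(10, 20)}), npartitions=2)
df = dd.concat([part_a, part_b])

# The workers write their partitions directly to S3; the client only needs s3fs
# so it can resolve the path and coordinate the write.
df.to_csv("s3://mybucket/output/part-*.csv",
          storage_options={"anon": False})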
I want to implement a machine learning algorithm that can operate on homomorphically encrypted data using the PySEAL library. PySEAL is released as a Docker container with an 'examples.py' file that shows some homomorphic encryption examples. I want to edit 'examples.py' to implement the ML algorithm. I am trying to import a CSV file like this:
dataset = pd.read_csv('Dataset.csv')
I have imported the pandas library successfully, but I have tried many approaches to import the CSV file and they all failed. How can I import it?
I am new to Docker, so a detailed procedure would be really helpful.
You can either do it via the Docker build process (assuming you are the one creating the image) or through a volume mapping that would be accessed by the container during runtime.
Building the image with Dataset.csv inside
For access through the build, you can use a Docker COPY instruction to place the file inside the container's workspace:
FROM python:3.7
COPY Dataset.csv /app/Dataset.csv
...
Then you can access the file directly at /app/Dataset.csv from inside the container using the pandas.read_csv() function, like:
data = pandas.read_csv('/app/Dataset.csv')
Mapping a volume share for Dataset.csv
If you don't have direct control over building the image, or you do not want the dataset packaged inside the container (which may be the better practice depending on the use case), you can share it through a volume mapping when starting the container.
Assuming your Dataset.csv is in /my/user/dir/Dataset.csv, from the CLI:
docker run -v /my/user/dir:/app my-python-container
Then, inside the container, read it the same way:
dataset = pd.read_csv('/app/Dataset.csv')
The benefit of the latter approach is that you can keep editing Dataset.csv on your host, and the container will see changes made either by you or by the Python process, should that occur.
What is the best method to grab files from a Windows shared folder on the same network?
Typically, I am extracting data from SFTPs, SalesForce, or database tables, but there are a few cases where end-users need to upload a file to a shared folder that I have to retrieve. My process up to now has been to have a script running on a Windows machine which just grabs any new/changed files and loads them to an SFTP, but that is not ideal. I can't monitor it in my Airflow UI, I need to change my password on that machine physically, mapped network drives seem to break, etc.
Is there a better method? I'd rather the ETL server handle all of this stuff.
Airflow is installed on a remote Linux server (same network).
The Windows folders are just standard UNC paths where people have access based on their NT ID. These users are saving files which I need to retrieve. They are non-technical and did not want WinSCP installed so the data could be shared through an SFTP instead, or even a SharePoint site (where I could use Shareplum, I think).
I would like to avoid mounting these folders and instead use Python scripts to simply copy the files I need on an Airflow schedule.
Ideally I could save my NT ID and password in an Airflow connection and access it with a conn_id.
If I'm understanding the question correctly, you have a shared folder mounted on your local machine rather than on the server where your Airflow installation is running. Is it possible to access the shared folder from that server instead?
I think a file sensor would work for your use case.
If you could auto-sync the shared folder to a cloud file store like S3, you could then use the commonly used S3KeySensor and S3PrefixSensor. I think this would simplify your solution, since you wouldn't have to be concerned with whether the machines the tasks run on have access to the folder.
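For illustration, a rough sketch of such a sensor in a DAG (the bucket, key pattern and connection id are placeholders, and the import path differs between Airflow versions):

from datetime import datetime
from airflow import DAG
from airflow.sensors.s3_key_sensor import S3KeySensor   # Airflow 1.10.x import path

with DAG(dag_id="wait_for_shared_upload",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily") as dag:

    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="synced-shared-folder",   # placeholder bucket
        bucket_key="uploads/*.csv",           # placeholder key pattern
        wildcard_match=True,
        aws_conn_id="aws_default",
        poke_interval=300,                    # check every 5 minutes
        timeout=6 * 60 * 60,                  # give up after 6 hours
    )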
Here are two examples of software that syncs a local folder on Windows to S3. Note that I haven't used either of them personally.
https://www.cloudberrylab.com/blog/how-to-sync-local-folder-with-amazon-s3-bucket-with-cloudberry-s3-explorer/
https://s3browser.com/amazon-s3-folder-sync.aspx
That said, I do think using FTPHook.retrieve_file is a reasonable solution if you can't have your files in cloud storage.
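A minimal sketch of that fallback (the connection id and paths are placeholders; the connection would hold the host plus your NT ID and password, and newer Airflow versions import the hook from airflow.providers.ftp):

from airflow.contrib.hooks.ftp_hook import FTPHook   # Airflow 1.10.x import path

def fetch_upload():
    # "shared_folder_ftp" is a placeholder Airflow connection id
    hook = FTPHook(ftp_conn_id="shared_folder_ftp")
    hook.retrieve_file(remote_full_path="/uploads/report.csv",
                       local_full_path="/tmp/report.csv")

The callable can then be wired into a PythonOperator on whatever schedule you need, so the transfer is visible in the Airflow UI.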
I currently have a Python program which reads a local file (containing a pickled database object) and saves to that file when it's done. I'd like to branch out and use this program on multiple computers accessing the same database, but I don't want to worry about synchronizing the local database files with each other, so I've been considering cloud storage options. Does anyone know how I might store a single data file in the cloud and interact with it using Python?
I've considered something like Google Cloud Platform and similar services, but those seem to be more server-oriented whereas I just need to access a single file on my own machines.
You could install gsutil and the boto library and use that.
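For example, a minimal sketch of the round trip using the newer google-cloud-storage client (an alternative to scripting raw boto; the bucket and object names are placeholders):

import pickle
from google.cloud import storage

client = storage.Client()   # authenticates via Application Default Credentials
blob = client.bucket("my-db-bucket").blob("database.pkl")   # placeholder names

# Pull the shared file down, work on it locally, then push it back
blob.download_to_filename("database.pkl")
with open("database.pkl", "rb") as f:
    db = pickle.load(f)

# ... modify db ...

with open("database.pkl", "wb") as f:
    pickle.dump(db, f)
blob.upload_from_filename("database.pkl")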