I currently have a Python program which reads a local file (containing a pickled database object) and saves to that file when it's done. I'd like to branch out and use this program on multiple computers accessing the same database, but I don't want to worry about synchronizing the local database files with each other, so I've been considering cloud storage options. Does anyone know how I might store a single data file in the cloud and interact with it using Python?
I've considered something like Google Cloud Platform and similar services, but those seem to be more server-oriented whereas I just need to access a single file on my own machines.
You could install gsutil and the boto library and use that.
Related
I have a python program that ultimately writes a csv file using pandas:
df.to_csv(r'path\file.csv)
I was able to upload the files to the server via FileZilla and was also able to run the program on the EC2 server normally. However, I would now like to export a csv file to my local machine, but I don't know how to.
Do I have to write the csv file directly to a cloud drive (e.g. google drive via Pydrive)? What would be the easiest way?
You probably do not want to expose your computer to the dangers of the Internet. Therefore, it is better for your computer to 'pull' the data down, rather than allowing something on the Internet to 'push' something to it.
You could send the data to Amazon S3 or, if you are using a cloud-based storage service like Google Drive or Dropbox, use their SDK to upload the file to their storage.
I have a bunch of files in a Google Cloud Storage bucket, including some Python scripts and text files. I want to run the Python scripts on the text files. What would be the best way to go about doing this (App Engine, Compute Engine, Jupyter)? Thanks!
I recommend using Google Cloud Function, that can be triggered automatically each time you upload new file to the Cloud Storage to process it. You can see workflow for this in Cloud Function Storage Tutorial
You will need to at least download the python scripts onto an environment first (be it GCE or GAE). To access the GCS text files, you can use https://pypi.org/project/google-cloud-storage/ library. I don't think you can execute python scripts from the object bucket itself.
If it is troublesome to change the python codes for reading the text files from GCS, you will have to download everything into your environment (e.g. using gsutil)
What is the best method to grab files from a Windows shared folder on the same network?
Typically, I am extracting data from SFTPs, SalesForce, or database tables, but there are a few cases where end-users need to upload a file to a shared folder that I have to retrieve. My process up to now has been to have a script running on a Windows machine which just grabs any new/changed files and loads them to an SFTP, but that is not ideal. I can't monitor it in my Airflow UI, I need to change my password on that machine physically, mapped network drives seem to break, etc.
Is there a better method? I'd rather the ETL server handle all of this stuff.
Airflow is installed on remote Linux server (same network)
Windows folders are just standard UNC paths where people have access based on their NT ID. These users are saving files which I need to retrieve. These users are non-technical and did not want WinSCP installed to share the data through an SFTP instead or even a Sharepoint (where I could use Shareplum, I think).
I would like to avoid mounting these folders and instead use Python scripts to simply copy the files I need as per an Airflow schedule
Best if I can save my NT ID and password within an Airflow connection to access it with a conn_id
If I'm understanding the question correctly, you have a shared folder mounted on your local machine — not the Windows server where your Airflow install is running. Is it possible to access the shared folder on the server instead?
I think a file sensor would work your use case.
If you could auto sync the shared folder to a cloud file store like S3, then you could use the normal S3KeySensor and S3PrefixSensor that are commonly used . I think this would simplify your solution as you wouldn't have to be concerned with whether the machine(s) the tasks are running on has access to the folder.
Here are two examples of software that syncs a local folder on Windows to S3. Note that I haven't used either of them personally.
https://www.cloudberrylab.com/blog/how-to-sync-local-folder-with-amazon-s3-bucket-with-cloudberry-s3-explorer/
https://s3browser.com/amazon-s3-folder-sync.aspx
That said, I do think using FTPHook.retrieve_file is a reasonable solution if you can't have your files in cloud storage.
I am writing an python application which reads/parses a file of this this kind.
myalerts.ini,
value1=1
value2=3
value3=10
value4=15
Currently I store this file in local filesystem. If I need to change this file I need to have physical access to this computer.
I want to move this file to cloud so that I can change this file anywhere (another computer or from phone).
If this application is running on some machine I should be able to change this file on the cloud and the application which is running on another machine which I don't have physical access to will be able to read updated file.
Notes,
I am new to both python and aws.
I am currently running it on my local mac/linux and planning on deploying on aws.
There are many options!
Amazon S3: This is the simplest option. Each computer could download the file at regular intervals or just before they run a process. If the file is big, the app could instead check whether the file has changed before downloading.
Amazon Elastic File System (EFS): If your applications are running on multiple Amazon EC2 instances, EFS provides a shared file system that can be mounted on each instance.
Amazon DynamoDB: A NoSQL database instead of a file. Much faster than parsing a file, but less convenient for updating values — you'd need to write a program to update values, eg from the command-line.
AWS Systems Manager Parameter Store: A managed service for storing parameters. Applications (anywhere on the Internet) can request and update parameters. A great way to configure cloud-based application!
If you are looking for minimal change and you want it accessible from anywhere on the Internet, Amazon S3 is the easiest choice.
Whichever way you go, you'll use the boto3 AWS SDK for Python.
Is it possible to create a new excel spreadsheet file and save it to an Amazon S3 bucket without first saving to a local filesystem?
For example, I have a Ruby on Rails web application which now generates Excel spreadsheets using the write_xlsx gem and saving it to the server's local file system. Internally, it looks like the gem is using Ruby's IO.copy_stream when it saves the spreadsheet. I'm not sure this will work if moving to Heroku and S3.
Has anyone done this before using Ruby or even Python?
I found this earlier question, Heroku + ephemeral filesystem + AWS S3. So, it would seem this is not possible using Heroku. Theoretically, it would be possible using a service which allows adding an Amazon EBS.
You have dedicated Ruby Gem to help you moving file to Amazon S3:
https://rubygems.org/gems/aws-s3
If you want more details about the implementation, here is the git repository. The documentation on the page is very complete, and explain how to move file to S3. Hope it helps.
Once your xls file is created, the library helps you create a S3Objects and store it into a Bucket (which you can also create with the library).
S3Object.store('keyOfYourData', open('nameOfExcelFile.xls'), 'bucketName')
If you want more choice, Amazon also delivered an official Gem for this purpose: https://rubygems.org/gems/aws-sdk