I am running an Apache Beam ETL job on Dataflow from a template I created. I want to add an additional step where I upload the processed data to BigQuery and create the table if needed, so I need to declare the table schema.
The schema is actually pretty large, so maybe the best option is to write it in a .json file and load it with a simple:
import json

with open('filename.json', 'r') as f:
    schema = json.load(f)
However, how can I declare this external file in my template so that when the job gets deployed and run on the workers they can find the file? Can I set any option in my setup.py so that the file gets packaged with the template?
Other alternatives I have thought of are:
Saving the file to a GCS bucket, reading it with the Cloud Storage client and passing it to my functions as a side input
Writing the schema manually in my pipeline file, even if it gets messy
What would be the best way to proceed?
For big files I would store the schema in a GCS bucket (Alternative 1). If this is not applicable, you can create a Flex Template for your pipeline and add the schema file to the container image.
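A minimal sketch of Alternative 1, assuming the schema JSON lives in a GCS bucket the workers can read and that it already has the {"fields": [...]} layout BigQuery expects; the bucket, object and table names below are placeholders:

import json

from google.cloud import storage


def load_schema_from_gcs(bucket_name, blob_name):
    """Download the schema file from GCS and parse it as JSON."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return json.loads(blob.download_as_text())


schema = load_schema_from_gcs("my-bucket", "schemas/table_schema.json")

# The parsed dict can then be passed to the BigQuery sink, letting it create
# the table if it does not exist yet, for example:
# processed | beam.io.WriteToBigQuery(
#     "my-project:my_dataset.my_table",
#     schema=schema,
#     create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
#     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)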
I have a question about Airflow. I get a dataset from a SQL statement through the Snowflake Hook. I need to save this data to a CSV file in a certain local folder. Is there any way to do that?
Another question: is it necessary to store the files in a cloud storage service?
Thanks for your help.
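A minimal sketch of one way to do this, assuming the SnowflakeHook from the Snowflake provider package and a configured connection; the connection id, query and output path are placeholders. A local path works as long as the worker that runs the task can reach it; a cloud bucket is only needed if other systems must read the file afterwards.

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

hook = SnowflakeHook(snowflake_conn_id="snowflake_default")

# Run the SQL statement and get the result set back as a pandas DataFrame.
df = hook.get_pandas_df("SELECT * FROM my_table")

# Write it to a local CSV; the folder must exist on the machine running the task.
df.to_csv("/tmp/output/my_table.csv", index=False)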
I would like to set up a daily ingestion job that takes a CSV file from blob storage and integrates it into a PostgreSQL database. I have the constraint of using Python. Which solution would you recommend for building/hosting my ETL solution?
Have a nice day :)
Additional information:
The CSV file is 1.35 GB, with shape (1292532, 54).
I will push only 12 of the 54 columns to the database.
You can try to use Azure Data Factory to achieve this. Create a new Copy Data activity, with your CSV as the source and the PostgreSQL database as the sink. In the Mapping settings, just select the columns you need. Finally, create a schedule trigger to run it.
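If you prefer to stay in plain Python instead of Data Factory, here is a minimal sketch (not part of the answer above) using pandas and SQLAlchemy, assuming the CSV has been downloaded locally or is reachable through a URL/SAS token; the column names, connection string, table name and chunk size are placeholders:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")
needed_columns = ["col_a", "col_b", "col_c"]  # the 12 columns you actually keep

# Stream the 1.35 GB file in chunks so it never has to fit in memory at once,
# reading only the columns that will be pushed to the database.
for chunk in pd.read_csv("data.csv", usecols=needed_columns, chunksize=100_000):
    chunk.to_sql("target_table", engine, if_exists="append", index=False)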
Currently I have a simple Python script which writes the elements of a list to a file. How can I do the same, but write to a file in Google Cloud Storage?
Current code:
with open('/home/nitin/temp.txt', 'w') as f:
    for item in ["Nitin", "Agarwal"]:
        f.write(item + '\n')  # write each element on its own line
I tend to find the blob's upload_from_string method preferable when dumping some in-memory data to GCS, instead of dumping to a local file and uploading it to GCS.
Would this work for you?
from google.cloud import storage
from google.cloud.storage import Blob

data = ["Nitin", "Agarwal"]

client = storage.Client(project="my-project")
bucket = client.get_bucket("my-bucket")
blob = Blob("data", bucket)
blob.upload_from_string("\n".join(data))
https://googleapis.dev/python/storage/latest/blobs.html#google.cloud.storage.blob.Blob.upload_from_string
Google Cloud Storage (GCS) is also known as blob storage as opposed to file storage. Don't think of GCS as holding files, think of it as holding blobs of data. This means that you can't use file system APIs. Instead, think of building the blob of data that you want to write into GCS locally and then writing that complete blob into GCS as a unit. When you write data into GCS, it is immutable ... this means that once written, you can't change it (you can delete it and rewrite a new copy). This means that you can't append to a blob.
Here is a good example:
Using Cloud Storage with Python
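A small sketch of the rewrite-instead-of-append point above, assuming a recent google-cloud-storage client; the project, bucket and object names are placeholders:

from google.cloud import storage

client = storage.Client(project="my-project")
blob = client.get_bucket("my-bucket").blob("data")

# GCS objects cannot be appended to, so read the current contents (if any),
# extend them in memory, and overwrite the object as a whole.
existing = blob.download_as_text() if blob.exists() else ""
blob.upload_from_string(existing + "new line\n")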
Does anyone know how to get a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, my issue is that new files will be arriving each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run the crawler if files land in the same place. For example, if your data folder is s3://bucket/data/<files> then you can add new files to it and run the ETL job - new files will be picked up automatically.
However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files> then you need either to run a crawler or execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register new partitions in Glue Catalog before starting Glue ETL job.
When the data is loaded into a DynamicFrame or Spark DataFrame you can apply filters to keep only the data you need. If you still want to work with file names, you can add the file name as a column using the input_file_name Spark function and then filter on it:
from pyspark.sql.functions import col, input_file_name

df = (df.withColumn("filename", input_file_name())
        .where(col("filename") == "your-filename"))
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, i.e. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from using pushdown predicates in AWS Glue.
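A hedged sketch of such a pushdown predicate, assuming date-based partition keys (year/month/day) are registered in the Glue Catalog; the database and table names are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only the partitions matching the predicate are read from S3, instead of
# scanning the whole table location.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    push_down_predicate="year == '2020' and month == '06' and day == '15'",
)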
I want to save some data fetched from the web to Blobstore, but the Google docs say that
Deprecated: The Files API feature used here to write files to Blobstore is going to be removed at some time in the future, in favor of writing files to Google Cloud Storage and using Blobstore to serve them.
The code in Python is as follows:
from __future__ import with_statement
from google.appengine.api import files
# Create the file
file_name = files.blobstore.create(mime_type='application/octet-stream')
# Open the file and write to it
with files.open(file_name, 'a') as f:
    f.write('data')
# Finalize the file. Do this before attempting to read it.
files.finalize(file_name)
# Get the file's blob key
blob_key = files.blobstore.get_blob_key(file_name)
I am wondering if there is another way to write to blobstore instead of the official upload method.
If you want to use a file-like API, you have to go with GCS.
Blobstore is for uploading more-or-less static images and serving them.
If you want to write using a file-like API and then serve from Blobstore, you can write to GCS and get a BlobKey to the file.
https://cloud.google.com/appengine/docs/python/blobstore/#Python_Using_the_Blobstore_API_with_Google_Cloud_Storage
But writing to Blobstore the way you want is deprecated. Stop trying to do it that way.
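A minimal sketch of the write-to-GCS-then-get-a-BlobKey approach described above, based on the linked docs and assuming the legacy first-generation App Engine runtime with the cloudstorage client library bundled in the app; the bucket and object names are placeholders:

import cloudstorage
from google.appengine.ext import blobstore

filename = '/my-bucket/my-object'

# Write the data with the file-like GCS API instead of the deprecated Files API.
with cloudstorage.open(filename, 'w', content_type='application/octet-stream') as f:
    f.write('data')

# Get a blob key so Blobstore can serve the GCS object.
blob_key = blobstore.create_gs_key('/gs' + filename)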
An option may be to put the data in the Datastore using a TextProperty.