Google Cloud Dataflow (Python): function to join multiple files

I am new to Google Cloud and know enough Python to write a few scripts; I am currently learning Cloud Functions and BigQuery.
My question:
I need to join a large CSV file with multiple lookup files and replace values in it with values from the lookup files.
I have learnt that Dataflow can be used to do ETL, but I don't know how to write the code in Python.
Can you please share your insights?
I appreciate your help.

Rather than joining the data in Python, I suggest you separately extract and load the CSV and the lookup data into BigQuery. Then run a BigQuery query that joins the data and writes the result to a permanent table. You can then delete the separately imported data.
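To make that concrete, here is a minimal sketch (not from the original answer) of what the join step could look like with the google-cloud-bigquery Python client, assuming the CSV and one lookup file have already been loaded into tables with the hypothetical names mydataset.main_csv and mydataset.lookup1:

from google.cloud import bigquery

client = bigquery.Client()

# Join the main table with a lookup table and write the result to a
# permanent table. Table and column names here are hypothetical.
sql = """
    SELECT m.* EXCEPT(code), l.description AS code
    FROM `myproject.mydataset.main_csv` AS m
    LEFT JOIN `myproject.mydataset.lookup1` AS l
    ON m.code = l.code
"""
job_config = bigquery.QueryJobConfig(
    destination="myproject.mydataset.joined_result",
    write_disposition="WRITE_TRUNCATE",
)
client.query(sql, job_config=job_config).result()

# The separately imported tables can then be deleted.
client.delete_table("myproject.mydataset.main_csv", not_found_ok=True)
client.delete_table("myproject.mydataset.lookup1", not_found_ok=True)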

Related

How to export a huge table from BigQuery into a Google cloud bucket as one file

I am trying to export a huge table (2,000,000,000 rows, roughly 600 GB in size) from BigQuery into a Google Cloud Storage bucket as a single file. All tools suggested in Google's documentation are limited in export size and will create multiple files.
Is there a Pythonic way to do it without needing to hold the entire table in memory?
While there may be other ways to script this, the recommended solution is to merge the exported files using the Google Cloud Storage compose action.
What you have to do is:
export the table in CSV format; this produces many files
run the compose action in batches of up to 32 files at a time, repeating until everything is merged into one big file
All of this can be combined in a Cloud Workflow; there is a tutorial here.
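For reference, a rough sketch of the compose step with the google-cloud-storage Python client might look like the following (bucket and object names are placeholders; compose accepts at most 32 source objects per call, so the merge is done in rounds):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-export-bucket")  # placeholder bucket name

# All shards produced by the BigQuery export, e.g. table-000000000000.csv, ...
# Note: if each shard has a CSV header row, export without headers or strip
# them first, otherwise the headers are concatenated into the merged file.
blobs = list(client.list_blobs(bucket, prefix="export/table-"))

round_num = 0
while len(blobs) > 1:
    merged = []
    # Compose in batches of up to 32 objects per call (GCS limit).
    for i in range(0, len(blobs), 32):
        batch = blobs[i:i + 32]
        target = bucket.blob(f"export/merge-{round_num}-{i // 32}.csv")
        target.compose(batch)
        merged.append(target)
    blobs = merged
    round_num += 1

print("Final merged object:", blobs[0].name)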

How to Perform Spark Streaming with Google Spreadsheets?

I want to build an application that runs locally and supports real-time data processing, and it needs to be built using Python.
The input needs to be provided in real time and comes in the form of Google Spreadsheets (multiple users are providing their data at the same time).
The application also needs to write its real-time output back to the spreadsheet, in the adjacent column.
Please help me with this.
Thanks
You can use the spark-google-spreadsheets library to read and write to Google Sheets from Spark, as described here.
Here's an example of how you can read data from a Google Sheet into a DataFrame:
val df = sqlContext.read
  .format("com.github.potix2.spark.google.spreadsheets")
  .load("<spreadsheetId>/worksheet1")
Incremental updates will be tough. You might want to just try doing full refreshes.
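Since the question asks for Python, a rough PySpark equivalent of the read above would presumably look like this, assuming the spark-google-spreadsheets connector is on the classpath and configured with credentials (untested sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sheets-example").getOrCreate()

# Read a worksheet into a DataFrame via the third-party connector.
# "<spreadsheetId>/worksheet1" follows the same path convention as the Scala example.
df = (
    spark.read
    .format("com.github.potix2.spark.google.spreadsheets")
    .load("<spreadsheetId>/worksheet1")
)
df.show()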

Writing a Python validation script to apply different rules to data files from different on-premises data sources

I am new to Python and I want to move data files from different on-premises data sources to Azure Data Lake Storage. Before they get moved by an Azure Data Factory pipeline and Azure Databricks, I want to validate these data files against different validation rules.
I am aware of how to create a custom Python copy activity in Databricks for running a Python script.
I need help with writing the script for the validation.
You might want to have a look at Data Factory to move files between different storage services. It's pretty straightforward and scalable, and you can even create Data Flow pipelines based on a Spark cluster without having to write a single line of code.
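If you still want to run the validation yourself from Databricks, a minimal PySpark sketch of a rule check might look like the following; the input path, column names and rules are all hypothetical and would need to be adapted to your files:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path mounted from the data lake.
df = spark.read.option("header", "true").csv("/mnt/datalake/incoming/customers.csv")

errors = []

# Rule 1: required columns must be present.
required_cols = {"customer_id", "email", "created_at"}
missing = required_cols - set(df.columns)
if missing:
    errors.append(f"missing columns: {missing}")

# Rule 2: the key column must not contain nulls.
if "customer_id" in df.columns:
    null_keys = df.filter(F.col("customer_id").isNull()).count()
    if null_keys:
        errors.append(f"{null_keys} rows with null customer_id")

if errors:
    # Raise so the Databricks activity fails and the Data Factory pipeline can react.
    raise ValueError("validation failed: " + "; ".join(errors))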

Get a massive csv file from GCS to BQ

I have a very large CSV file (let's say 1 TB) that I need to get from GCS into BQ. While BQ does have a CSV loader, the CSV files that I have are pretty non-standard and don't load properly into BQ without being formatted first.
Normally I would download the CSV file onto a server to 'process it' and save it either directly to BQ or to an Avro file that can be ingested easily by BQ. However, the file(s) are quite large and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts would be great.
I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
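As a starting point, a bare-bones Apache Beam pipeline in Python along those lines (not the pipeline from the linked repo; project, bucket, field names, the parsing logic and the table schema are all placeholders) might look like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Replace with whatever cleanup your non-standard CSV actually needs.
    fields = line.split(",")
    return {"id": fields[0], "name": fields[1], "value": fields[2]}

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/tmp",   # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(parse_line)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="id:STRING,name:STRING,value:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )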
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
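As a rough illustration of that trick (the delimiter, table names and regexes below are made up), you could load each line as a single string column and parse it afterwards with SQL:

from google.cloud import bigquery

client = bigquery.Client()

# Load every line as one STRING column by using a delimiter that is
# unlikely to appear in the data (here the 'þ' character).
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="þ",
    schema=[bigquery.SchemaField("raw_line", "STRING")],
)
client.load_table_from_uri(
    "gs://my-bucket/weird.csv",            # placeholder
    "my_project.my_dataset.raw_lines",     # placeholder
    job_config=load_config,
).result()

# Then parse the raw lines with SQL; the regexes are just examples.
client.query("""
    CREATE OR REPLACE TABLE `my_project.my_dataset.parsed` AS
    SELECT
      REGEXP_EXTRACT(raw_line, r'^([^;]*)') AS first_field,
      REGEXP_EXTRACT(raw_line, r'^[^;]*;([^;]*)') AS second_field
    FROM `my_project.my_dataset.raw_lines`
""").result()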
I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.
You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:int,name:string,..) or a path to a JSON schema file (available locally).
As per the BQ documentation, it tries to parallelize large CSV loads into tables. Of course, there is an upper bound involved: the maximum size of an uncompressed CSV file to be loaded from GCS to BQ is 5 TB, which is way above your requirements. I think you should be good with this.
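If you prefer to keep everything in a script, a minimal Python equivalent of the bq command above (schema and names are placeholders) could be:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition="WRITE_TRUNCATE",   # corresponds to --replace
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("name", "STRING"),
    ],
    skip_leading_rows=1,  # if the file has a header row
)

client.load_table_from_uri(
    "gs://bucket/file.csv",
    "my_project.dataset.table",
    job_config=job_config,
).result()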

Choice of technology for loading large CSV files to Oracle tables

I have come across a problem and am not sure which technology would be best suited to implement it. I would be obliged if you can suggest something based on your experience.
I want to load data from 10-15 CSV files, each of them fairly large, 5-10 GB. By load data I mean convert each CSV file to XML and then populate around 6-7 staging tables in Oracle using this XML.
The data needs to be populated such that the elements of the XML, and eventually the rows of the tables, come from multiple CSV files. So, for example, an element A would have sub-elements whose data comes from CSV file 1, file 2, file 3, etc.
I have a framework built on top of Apache Camel and JBoss on Linux. Oracle 10g is the database server.
Options I am considering:
Smooks - however, the problem is that Smooks serializes one CSV at a time, and I can't afford to hold on to the half-baked Java beans until the other CSV files are read, since I run the risk of running out of memory given the sheer number of beans I would need to create and hold on to before they are fully populated and written to disk as XML.
SQL*Loader - I could skip the XML creation altogether and load the CSVs directly into the staging tables using SQL*Loader. But I am not sure whether I can (a) load multiple CSV files into the same tables with SQL*Loader, updating the records after the first file, and (b) apply some translation rules while loading the staging tables.
A Python script to convert the CSVs to XML.
SQL*Loader to load a different set of staging tables corresponding to the CSV data, and then a stored procedure to load the actual staging tables from this new set of staging tables (a path I want to avoid given the amount of change it would require to my existing framework).
Thanks in advance. If someone can point me in the right direction or give me some insights from their personal experience, it will help me make an informed decision.
regards,
-v-
PS: The CSV files are fairly simple with around 40 columns each. The depth of objects or relationship between the files would be around 2 to 3.
Unless you can use some full-blown ETL tool (e.g. Informatica PowerCenter, Pentaho Data Integration), I suggest the 4th solution - it is straightforward and the performance should be good, since Oracle will handle the most complicated part of the task.
In Informatica PowerCenter you can import/export XML files larger than 5 GB. As Marek's response suggests, try it, because it works pretty fast. Here is a brief introduction if you are unfamiliar with this tool.
Create a process/script that will call a procedure to load the CSV files into an external Oracle table, and another script to load them into the destination table.
You can also add cron jobs to call these scripts; they will keep track of incoming CSV files in the directory, process them, and move each CSV file to an output/processed folder.
Exceptions can also be handled accordingly by logging them or sending out an email. Good luck.
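A skeleton of such a driver script in Python, using cx_Oracle, might look like the following; the directory paths, connection string and stored-procedure name are all placeholders for whatever your environment actually uses:

import shutil
from pathlib import Path

import cx_Oracle

INCOMING = Path("/data/incoming")     # placeholder directories
PROCESSED = Path("/data/processed")

def process_pending_files():
    # Placeholder connect string: user/password@host:port/service.
    conn = cx_Oracle.connect("user/password@dbhost:1521/ORCL")
    try:
        cur = conn.cursor()
        for csv_file in sorted(INCOMING.glob("*.csv")):
            # Hypothetical stored procedure that reads the external table
            # pointing at this file and loads the destination table.
            cur.callproc("staging_pkg.load_csv", [csv_file.name])
            conn.commit()
            # Move the processed file out of the incoming directory.
            shutil.move(str(csv_file), str(PROCESSED / csv_file.name))
    finally:
        conn.close()

if __name__ == "__main__":
    process_pending_files()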
