My use case goes like this:
Read one or more dataframes in a spark-scala app and register them as tables.
Get a python callable which would run pyspark-based transformations on these dataframes.
Register the transformed dataframes as tables in the spark session from the pyspark callable.
Read these transformed dataframes from the scala-spark app and do optional post processing on them.
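To make steps 3 and 4 concrete, here is a rough sketch of what such a python callable might look like (table and column names are made up; it assumes the callable is handed the same SparkSession the scala-spark app uses):

# Hypothetical pyspark callable: reads a table registered by the scala side,
# transforms it, and registers the result back into the shared SparkSession.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def transform(spark: SparkSession) -> None:
    # "input_events" is assumed to have been registered by the scala-spark app
    df = spark.table("input_events")

    transformed = (
        df.filter(F.col("status") == "active")
          .groupBy("user_id")
          .agg(F.count("*").alias("event_count"))
    )

    # Register the result so the scala side can pick it up afterwards
    transformed.createOrReplaceTempView("transformed_events")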
Can someone help me achieve this kind of seamless scala-pyspark integration? The challenge is to be able to run python-based transformations on dataframes from inside the scala-spark app.
A working example would be very much appreciated.
Best Regards
Related
I am currently writing a very simple ETL(T) pipeline:
look at the FTP server to see if new CSV files exist
if yes, then download them
Some initial transformations
bulk insert the individual CSVs into an MS SQL DB
Some additional transformations
There can be a lot of CSV files. The script runs OK for the moment, but I have no concept of how to actually create a "management" layer around this. Currently my pipeline runs linearly. I have a list of the filenames that need to be loaded, and (in a loop) I load them into the DB.
If something fails, the whole pipeline has to rerun. I do not manage the state of the pipeline (i.e. has a specific file already been downloaded and transformed/changed?).
There is no way to start from an intermediate point. How could I break this down into individual tasks that need to be performed?
I roughly know of tools like Airflow, but I feel that this is only part of the necessary tooling, and frankly I am too uneducated in this area to even ask the right questions.
It would be really nice if somebody could point me in the right direction as to what I am missing and what tools are available.
Thanks in advance
I'm actually using Airflow to run ETL pipelines with steps similar to the ones you describe.
The whole workflow can be partitioned into single tasks. For almost every task Airflow provides an operator.
For
look at the FTP server to see if new CSV files exist
you could use a file sensor with an underlying FTP connection:
FileSensor
For
if yes, then download them
you could use the BranchPythonOperator.
BranchPythonOperator
All succeeding tasks could be wrapped in Python functions and then executed via the PythonOperator.
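A minimal DAG sketch putting those pieces together could look like the following (connection IDs, paths, the schedule and the task bodies are placeholders you would have to adapt; for a real FTP source an SFTP/FTP sensor or hook may fit better than the plain FileSensor):

# Sketch of an Airflow 2.x DAG: sensor -> branch -> download -> transform/load.
# All IDs, paths and callables below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.sensors.filesystem import FileSensor


def decide_branch(**context):
    new_files = True  # placeholder: check the FTP listing / a state store here
    return "download_files" if new_files else "nothing_to_do"


def download_files():
    ...  # fetch the new CSVs from the FTP server


def transform_and_load():
    ...  # initial transformations + bulk insert into the MS SQL DB


with DAG(
    dag_id="csv_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    wait_for_csv = FileSensor(
        task_id="wait_for_csv",
        fs_conn_id="my_fs_conn",      # connection pointing at the incoming directory
        filepath="incoming/*.csv",
        poke_interval=300,
    )
    branch = BranchPythonOperator(task_id="branch", python_callable=decide_branch)
    download = PythonOperator(task_id="download_files", python_callable=download_files)
    load = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)
    nothing = PythonOperator(task_id="nothing_to_do", python_callable=lambda: None)

    wait_for_csv >> branch >> [download, nothing]
    download >> load

Because each step is its own task, a failed run can be retried or resumed from that task instead of rerunning the whole pipeline.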
Would definitely recommend using Airflow, but if you are looking for alternatives, there are plenty:
airflow-alternatives
I am trying to import multiple dataframes using Dask; however, it seems that unlike pandas, Dask doesn't have any commands to do so. Or at least I haven't been able to find a way to do it within its documentation/examples.
I know that I could transform the databases to CSV or chunk them, but I'd like to rule out this alternative due to the complexity of the database.
Is there any command or plugin that allows me to parallelize the import with dask?
Dask dataframe doesn't support pandas.read_spss. There was a user with a similar problem here (reading sas format with dask): https://github.com/dask/dask/issues/1233
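A common workaround (similar to what is suggested in that issue for other formats pandas can read but dask can't) is to wrap the pandas reader in dask.delayed, one file per partition, and build the dask dataframe from those. A rough sketch, assuming each individual file fits into a worker's memory (paths are placeholders; pandas.read_spss also needs the pyreadstat package installed):

# Build a dask dataframe from many SPSS files by delaying pandas.read_spss.
import glob

import dask
import dask.dataframe as dd
import pandas as pd

files = sorted(glob.glob("data/*.sav"))  # placeholder path

# One lazy pandas DataFrame per file; nothing is read until compute time.
parts = [dask.delayed(pd.read_spss)(f) for f in files]

ddf = dd.from_delayed(parts)
print(ddf.head())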
I want to build an application which will run locally and support real-time data processing, and it needs to be built using Python.
The input needs to be provided in real time and is in the form of Google Spreadsheets (multiple users are providing their data at the same time).
The code also needs to write its real-time output back to the spreadsheet, in the adjacent column.
Please help me with this.
Thanks
You can use the spark-google-spreadsheets library to read and write to Google Sheets from Spark, as described here.
Here's an example of how you can read data from a Google Sheet into a DataFrame:
val df = sqlContext.read.
  format("com.github.potix2.spark.google.spreadsheets").
  load("<spreadsheetId>/worksheet1")
Incremental updates will be tough. You might want to just try doing full refreshes.
I am new to Google Cloud and know enough Python to write a few scripts; I am currently learning Cloud Functions and BigQuery.
My question:
I need to join a large CSV file with multiple lookup files and replace values from lookup files.
I learnt that Dataflow can be used to do ETL, but I don't know how to write the code in Python.
Can you please share your insights?
Appreciate your help.
Rather than joining data in Python, I suggest you separately extract and load the CSV and lookup data. Then run a BigQuery query that joins the data and writes the result to a permanent table. You can then delete the separately imported data.
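A rough sketch of that flow with the google-cloud-bigquery client (project, dataset, table and bucket names are all placeholders, as is the join condition):

# Load the CSV and a lookup file into staging tables, join them in BigQuery,
# write the result to a permanent table, then drop the staging tables.
from google.cloud import bigquery

client = bigquery.Client()

load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

client.load_table_from_uri(
    "gs://my-bucket/main.csv", "my_project.staging.main", job_config=load_config
).result()
client.load_table_from_uri(
    "gs://my-bucket/lookup.csv", "my_project.staging.lookup", job_config=load_config
).result()

query_config = bigquery.QueryJobConfig(
    destination="my_project.reporting.enriched",
    write_disposition="WRITE_TRUNCATE",
)
sql = """
    SELECT m.*, l.replacement_value
    FROM `my_project.staging.main` AS m
    LEFT JOIN `my_project.staging.lookup` AS l ON m.code = l.code
"""
client.query(sql, job_config=query_config).result()

client.delete_table("my_project.staging.main", not_found_ok=True)
client.delete_table("my_project.staging.lookup", not_found_ok=True)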
I have a requirement wherein I have to pull data from Neo4j and create Spark RDDs out of that data. I'm using Python in my project. There is this connector for the same purpose, but it's written in Scala. So I can think of the following workarounds for now:
Query data from Neo4j in small chunks/batches, and convert each chunk to a Spark RDD using the parallelize() method. Finally, merge/combine all the RDDs using the union() method to get a single RDD. Then I can do transformations & actions on them.
Another approach is to read data from Neo4j and create a Kafka producer out of it. Then use Kafka as a data source for Spark. e.g.
Neo4j -> Kafka -> Spark
I want to know which one is more efficient for large amounts of data. And if there is a better approach for solving this problem, please help me out with that.
Note: I did try to extend the pyspark API in order to create a custom RDD in Python. The pyspark API is very different compared to Spark's Scala/Java API. In the Scala API, a custom RDD can be created by extending the RDD class and overriding the compute() and getPartitions() methods. But in the pyspark API, I couldn't find compute() under the RDD class in rdd.py.
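To make workaround 1 concrete, this is roughly what I have in mind, using the official neo4j Python driver (connection details, the query and the chunk size are placeholders):

# Page through Neo4j results with SKIP/LIMIT, parallelize each chunk,
# and union the chunks into a single RDD.
from neo4j import GraphDatabase
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("neo4j-to-rdd").getOrCreate()
sc = spark.sparkContext

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

chunk_size = 10000
rdds = []
with driver.session() as session:
    skip = 0
    while True:
        records = session.run(
            "MATCH (p:Person) RETURN p.name AS name, p.age AS age "
            "SKIP $skip LIMIT $limit",
            skip=skip, limit=chunk_size,
        ).data()  # list of dicts for this chunk
        if not records:
            break
        rdds.append(sc.parallelize(records))
        skip += chunk_size

full_rdd = sc.union(rdds) if rdds else sc.emptyRDD()
print(full_rdd.count())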
This blog post by Michael Hunger talks about using Spark to get CSV data into Neo4j, but maybe some of the Spark code will help you out. Also, there is Mazerunner, which is a Spark/Neo4j/GraphX integrated tool for using Spark to pass subgraphs to Neo4j.