I have a requirement where I have to pull data from Neo4j and create Spark RDDs out of that data. I'm using Python in my project. There is a connector for this purpose, but it's written in Scala. So I can think of the following workarounds for now:
Query data from Neo4j in small chunks/batches and convert each chunk to a Spark RDD using the parallelize() method. Finally, merge/combine all the RDDs using the union() method to get a single RDD, on which I can then run transformations and actions.
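A minimal sketch of this chunked approach, assuming the official neo4j Python driver and PySpark; the Bolt URI, credentials, Cypher query, and batch size are all hypothetical:

```python
# A minimal sketch of the chunked approach, assuming the official neo4j Python
# driver and PySpark; the Bolt URI, credentials, Cypher query, and batch size
# are all hypothetical.
from neo4j import GraphDatabase
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("neo4j-to-rdd").getOrCreate()
sc = spark.sparkContext

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
BATCH_SIZE = 10_000

chunk_rdds = []
with driver.session() as session:
    skip = 0
    while True:
        records = session.run(
            "MATCH (n:Person) RETURN n.name AS name, n.age AS age "
            "SKIP $skip LIMIT $limit",
            skip=skip,
            limit=BATCH_SIZE,
        ).data()
        if not records:
            break
        # Each chunk becomes a small RDD of dicts.
        chunk_rdds.append(sc.parallelize(records))
        skip += BATCH_SIZE

# Combine all chunk RDDs into a single RDD; sc.union(list) avoids building a
# long chain of pairwise unions.
combined = sc.union(chunk_rdds) if chunk_rdds else sc.emptyRDD()
print(combined.count())
```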
Another approach is to read data from Neo4j and push it into Kafka with a producer, then use Kafka as a data source for Spark, e.g.
Neo4j -> Kafka -> Spark
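A minimal sketch of the producer side of this pipeline, assuming kafka-python and the official neo4j Python driver; the broker address, topic name, credentials, and Cypher query are all hypothetical:

```python
# A minimal sketch of the producer side, assuming kafka-python and the official
# neo4j Python driver; the broker address, topic name, credentials, and Cypher
# query are all hypothetical.
import json

from kafka import KafkaProducer
from neo4j import GraphDatabase

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # One Kafka message per Neo4j row.
    for record in session.run("MATCH (n:Person) RETURN n.name AS name, n.age AS age"):
        producer.send("neo4j-people", record.data())

producer.flush()

# Spark then consumes the "neo4j-people" topic (e.g. via the Spark Streaming
# Kafka integration) and builds RDDs/DStreams from the messages.
```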
I want to know which one is more efficient for large volumes of data, and if there is a better approach for solving this problem, please help me out with that.
Note: I did try to extend the pyspark API in order to create a custom RDD in Python, but the pyspark API is very different from Spark's Scala/Java API. With the Scala API, a custom RDD can be created by extending the RDD class and overriding the compute() and getPartitions() methods, but in the pyspark API I couldn't find compute() under the RDD class in rdd.py.
This blog post by Michael Hunger talks about using Spark to get CSV data into Neo4j, but maybe some of the Spark code will help you out. There is also Mazerunner, an integrated Spark/Neo4j/GraphX tool for using Spark to pass subgraphs to Neo4j.
Related
My use case goes like this:
Read one or more dataframes in a scala-spark app and register them as tables.
Get a Python callable that runs pyspark-based transformations on these dataframes.
Register the transformed dataframes as tables in the Spark session from the pyspark callable.
Read these transformed dataframes from the scala-spark app and do optional post-processing on them.
Can someone help me achieve this kind of seamless Scala-PySpark integration? The challenge is being able to run Python-based transformations on dataframes from inside the scala-spark app (see the sketch below).
A working example would be very much appreciated.
Best Regards
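A minimal sketch of the PySpark side of such an integration, assuming the Scala app and the Python callable share the same metastore so that tables saved by one side are visible to the other; all table and column names are hypothetical:

```python
# A minimal sketch of the PySpark callable, assuming both sides share the same
# metastore so that tables saved by one are visible to the other; the table and
# column names ("staging.input_table", "amount", "customer_id", ...) are hypothetical.
from pyspark.sql import SparkSession, functions as F


def run_transformations() -> None:
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Read a table that the Scala app registered/saved earlier.
    df = spark.table("staging.input_table")

    # Example PySpark-based transformation.
    out = (
        df.filter(F.col("amount") > 0)
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount"))
    )

    # Persist the result where the scala-spark app can read it back for
    # optional post-processing.
    out.write.mode("overwrite").saveAsTable("staging.transformed_table")


if __name__ == "__main__":
    run_transformations()
```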
I am new to Google Cloud and know enough Python to write a few scripts; I am currently learning Cloud Functions and BigQuery.
My question:
I need to join a large CSV file with multiple lookup files and replace values from the lookup files.
I learnt that Dataflow can be used to do ETL, but I don't know how to write the code in Python.
Can you please share your insights?
Appreciate your help.
Rather than joining the data in Python, I suggest you separately extract and load the CSV and lookup data into BigQuery. Then run a BigQuery query that joins the data and writes the result to a permanent table. You can then delete the separately imported data.
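A minimal sketch of that flow, assuming the google-cloud-bigquery client library; the bucket, dataset, table, and join column names are hypothetical:

```python
# A minimal sketch of the load-then-join flow, assuming the google-cloud-bigquery
# client library; bucket, dataset, table, and join column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Load the large CSV and the lookup CSV into staging tables.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://my-bucket/main.csv", "my_dataset.main_staging", job_config=load_config
).result()
client.load_table_from_uri(
    "gs://my-bucket/lookup.csv", "my_dataset.lookup_staging", job_config=load_config
).result()

# 2. Join them in SQL and write the result to a permanent table.
query_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.my_dataset.joined"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = """
    SELECT m.*, l.replacement_value
    FROM my_dataset.main_staging AS m
    LEFT JOIN my_dataset.lookup_staging AS l ON m.code = l.code
"""
client.query(sql, job_config=query_config).result()

# 3. The staging tables can then be deleted.
client.delete_table("my_dataset.main_staging")
client.delete_table("my_dataset.lookup_staging")
```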
I would like to write bulk data to BQ using a software API.
My restrictions are:
I am going to use the max size of BQ: 10,000 columns and ~35,000 rows (this can be bigger)
Schema autodetect is required
If possible, I would like to use some kind of parallelism to write many tables at the same time asynchronously (Apache Beam & Dataflow might be the solution for that)
When using the pandas library for BQ, there is a limit on the size of the dataframe that can be written; this requires partitioning of the data
What would be the best way to do so?
Many thanks for any advice / comment,
eilalan
Apache Beam would be the right component, as it supports huge-volume data processing in both batch and streaming mode.
I don't think Beam has schema auto-detect. But you can use the BigQuery API to fetch the schema if the table already exists.
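For example, a minimal sketch of fetching an existing table's schema with the google-cloud-bigquery client library; the project, dataset, and table names are hypothetical:

```python
# A minimal sketch, assuming the google-cloud-bigquery client library; the
# project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # existing table

# Convert the fetched schema into the {"fields": [...]} dict form that
# Beam's WriteToBigQuery can consume.
schema = {
    "fields": [
        {"name": f.name, "type": f.field_type, "mode": f.mode}
        for f in table.schema
    ]
}
print(schema)
```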
More specifically, can I use some bridge for this, e.g. first copying the data from MongoDB to Excel, and then importing that Excel sheet's data into MySQL with some scripts, such as in Python?
MongoDB does not offer any direct tool to do this, but you have many options to achieve it.
You can:
Write your own tool in your favorite language that connects to MongoDB & MySQL and copies the data (see the sketch after this list)
Use mongoexport to create files and mysqlimport to reimport them into MySQL
Use an ETL (Extract, Transform, Load) tool that connects to MongoDB, allows you to transform the data, and pushes it into MySQL. You can, for example, use Talend, which has a connector for MongoDB, but you have many other options.
Note: Keep in mind that a simple document can contain complex structures such as arrays/lists, sub-documents, and even an array of sub-documents. These structures cannot be imported directly into a single table record, which is why most of the time you need a small transformation/mapping layer.
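A minimal sketch of the first option (a hand-written copy tool), assuming pymongo and mysql-connector-python; the database, collection, table, and column names are hypothetical, and nested fields are flattened by hand as described in the note above:

```python
# A minimal sketch of a hand-written copy tool, assuming pymongo and
# mysql-connector-python; database, collection, table, and column names are
# hypothetical, and nested fields are flattened by hand.
import mysql.connector
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
mysql_conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="target_db"
)
cursor = mysql_conn.cursor()

for doc in mongo["source_db"]["customers"].find():
    # Map/flatten each document onto flat table columns.
    cursor.execute(
        "INSERT INTO customers (mongo_id, name, city) VALUES (%s, %s, %s)",
        (str(doc["_id"]), doc.get("name"), doc.get("address", {}).get("city")),
    )

mysql_conn.commit()
cursor.close()
mysql_conn.close()
```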
I am interested in, but ignorant of, the best method of quickly and efficiently extracting quantitative data that I have inserted into MongoDB.
I will explain my process. I used MongoDB to hold a variety of quantitative data that I inserted from multiple .log files.
Now that the information is inserted, I would like to extract certain data through queries, format it into an array, and display it in the form of a GUI (matplotlib).
I am confused about the best method of extracting the data. Thank you.
There are some good tutorials on using Python with MongoDB, such as this one: http://api.mongodb.org/python/current/tutorial.html
There is also more information on SO about matplotlib with MongoDB, such as this one: Mongodb data statistics visualization using matplotlib
It's probably better to start trying some things and then asking specific questions on SO when you get stuck.
To extract and aggregate the data you can:
use the standard query API
use Map-Reduce
use the new MongoDB aggregation framework (see the sketch below)
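A minimal sketch of the aggregation-framework option, assuming pymongo and matplotlib; the database, collection, and field names are hypothetical:

```python
# A minimal sketch of the aggregation-framework option, assuming pymongo and
# matplotlib; the database, collection, and field names are hypothetical.
import matplotlib.pyplot as plt
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["logs_db"]["measurements"]

# Aggregate the average value per host inside MongoDB itself.
pipeline = [
    {"$match": {"metric": "response_time"}},
    {"$group": {"_id": "$host", "avg_value": {"$avg": "$value"}}},
    {"$sort": {"_id": 1}},
]
results = list(collection.aggregate(pipeline))

hosts = [r["_id"] for r in results]
averages = [r["avg_value"] for r in results]

# Display the aggregated array with matplotlib.
plt.bar(hosts, averages)
plt.xlabel("host")
plt.ylabel("average response time")
plt.show()
```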