I have a very large CSV file, nearly 64 GB, in blob storage. I need to do some processing on every row and push the data to a DB.
What is the best way to do this efficiently?
A 64 GB file shouldn't worry you, as PySpark is capable of processing even 1 TB of data.
You can use Azure Databricks to write the code in PySpark and run it on a cluster.
A Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.
You can refer to this third-party tutorial: Read a CSV file stored in blob container using python in DataBricks.
You shouldn't face any issues while processing the file, but if required you can consider Boost Query Performance with Databricks and Spark.
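As a rough sketch of that approach (the storage account, container, paths, secret scope, and JDBC target below are placeholders, not values from your environment), reading the CSV from blob storage, transforming the rows, and writing out over JDBC might look like this in a Databricks notebook:
from pyspark.sql import functions as F

# Placeholder storage account and container; Databricks notebooks provide `spark` and `dbutils`.
storage_account = "mystorageaccount"
container = "mycontainer"

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key")  # placeholder secret
)

# Read the large CSV straight from blob storage; Spark splits the work across the cluster.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(f"wasbs://{container}@{storage_account}.blob.core.windows.net/path/to/large.csv"))

# Example row-level processing: trim a column and add a load timestamp.
processed = (df
             .withColumn("some_column", F.trim(F.col("some_column")))
             .withColumn("load_ts", F.current_timestamp()))

# Write to the target database in parallel over JDBC (connection details are placeholders).
(processed.write
 .format("jdbc")
 .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
 .option("dbtable", "dbo.target_table")
 .option("user", "db_user")
 .option("password", dbutils.secrets.get(scope="my-scope", key="db-password"))
 .mode("append")
 .save())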
After creating a Hadoop cluster that provides data to a Cassandra database, I would like to integrate into the Hadoop architecture some machine learning algorithms that I have coded in Python using the scikit-learn library, so that I can schedule these algorithms to run automatically against the data stored in Cassandra.
Does anyone know how to proceed or any bibliography that could help me?
I have tried to search for information, but I have only found that I can use Mahout; however, the algorithms I want to apply are the ones I wrote in Python.
For starters, Cassandra isn't part of Hadoop, and Hadoop isn't required for it.
scikit-learn is fine for small datasets, but to scale an algorithm onto Hadoop specifically, your dataset will be distributed and therefore cannot be loaded into scikit-learn directly.
You would need to use PySpark with its pandas integration as a starting point; Spark MLlib has several algorithms of its own, and you can optionally deploy that code onto Hadoop YARN.
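As a hedged illustration of that starting point (the model path, feature columns, Cassandra keyspace/table, and connector host below are placeholders), you could wrap an existing scikit-learn model in a pandas UDF so it runs over a distributed Spark DataFrame:
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Requires the spark-cassandra-connector package on the classpath; the host is a placeholder.
spark = (SparkSession.builder
         .appName("scikit-on-spark")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Read the data from Cassandra (keyspace and table are placeholders).
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())

# Load the pre-trained scikit-learn model once and broadcast it to the workers.
model = joblib.load("/path/to/model.pkl")  # placeholder path
broadcast_model = spark.sparkContext.broadcast(model)

@pandas_udf(DoubleType())
def predict(feature1: pd.Series, feature2: pd.Series) -> pd.Series:
    # Each batch arrives as pandas Series, so scikit-learn runs unchanged on the workers.
    features = pd.concat([feature1, feature2], axis=1)
    return pd.Series(broadcast_model.value.predict(features))

scored = df.withColumn("prediction", predict("feature1", "feature2"))
scored.show()
Scheduling then becomes a matter of submitting that script to the cluster (e.g. with spark-submit on YARN) from whatever scheduler you already use, such as cron, Oozie, or Airflow.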
I have been tasked with building an ETL job that takes financial CSV data from an asset management program, transforms it, and delivers it to our PeopleSoft Financial system.
I am using Talend and also writing some Python scripts. This program will run once a week. The PeopleSoft team insists on using this "Excel to CI" xlsm file, which is an Excel workbook with macros and VBA code. It is a nightmare to work with and isn't supported by Talend, nor is it fully compatible with the Python openpyxl package.
Is there a better way to push (CSV) data into a PeopleSoft database while still executing this supposed business logic?
PeopleTools Integration Broker allows you to create web services that can invoke a CI. Then you could invoke the service using Python.
https://docs.oracle.com/cd/E41633_01/pt853pbh1/eng/pt/tibr/concept_UnderstandingCreatingComponentInterface-BasedServices-076354.html
Another alternative is to develop an App Engine program to read in the csv file and invoke the CI that way using PeopleCode.
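Purely as an illustrative sketch (the service URL, field names, and credentials are hypothetical and depend entirely on how the CI-based service operation is defined in Integration Broker), invoking such a service from Python could look like this:
import csv
import requests

# Hypothetical REST listening-connector URL for a CI-based service operation.
SERVICE_URL = "https://peoplesoft.example.com/PSIGW/RESTListeningConnector/PSFT_EP/MY_CI_SERVICE.v1/"

with open("asset_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Field names must match the CI definition; these are placeholders.
        payload = {
            "ASSET_ID": row["asset_id"],
            "COST": row["cost"],
            "TRANS_DT": row["transaction_date"],
        }
        resp = requests.post(
            SERVICE_URL,
            json=payload,
            auth=("integration_user", "password"),  # placeholder credentials
            timeout=30,
        )
        resp.raise_for_status()  # the CI's business logic runs on the PeopleSoft side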
I am deploying a Jupyter notebook (Python 2.7 kernel) on the client side which accesses remote data and does the processing on a remote Spark standalone cluster (using the pyspark library). I am deploying the Spark application in client mode. The client machine does not have any Spark worker nodes.
The client does not have much memory (RAM). I wanted to know: if I perform a Spark action on a DataFrame, such as df.count(), on the client machine, will the DataFrame be stored in the client's RAM or in the Spark workers' memory?
If I understand correctly, what you will get on the client side is an int. At least it should be, if everything is set up correctly. So the answer is no, the DataFrame is not going to hit your local RAM.
You are interacting with the cluster via a SparkSession (a SparkContext in earlier versions). Even though you are developing (i.e. writing code) on the client machine, the actual computation of Spark operations (i.e. running the PySpark code) will not be performed on your local machine.
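A small sketch of the distinction (the master URL and data path are placeholders): actions that return summaries send only tiny results back to the client, while collect-style calls are the ones that would fill the client's RAM.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://spark-master:7077")  # placeholder standalone master URL
         .appName("client-mode-demo")
         .getOrCreate())

df = spark.read.parquet("hdfs:///data/large_table")  # placeholder remote dataset

# Safe on a low-memory client: only an int comes back; the rows stay on the workers.
print(df.count())

# Also safe: the aggregation runs on the cluster and only a small result returns.
df.groupBy("category").count().show()

# These are the calls to avoid on a low-memory client: they materialize the
# whole DataFrame in the driver's (client's) RAM.
# rows = df.collect()
# pdf = df.toPandas()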
How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's the experience I have so far. This is generally to be expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also exits the Spark Context it was using.
So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:
conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf = conf)
I started Pyspark. In a separate script in Visual Studio, I used this code for the SparkContext. I loaded a text file into an RDD named RDDFromFilename, but I couldn't access that RDD in the Pyspark shell once the script had run.
How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts? Particularly in Windows?
There is no solution in Spark. You may consider:
To keep persistent RDDs:
Apache Ignite
To keep persistent shared context:
spark-jobserver
livy - https://github.com/cloudera/livy (see the sketch below)
mist - https://github.com/Hydrospheredata/mist
To share context with notebooks:
Apache Zeppelin
I think that out of these only Zeppelin officially supports Windows.
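For example, a rough sketch of the Livy approach (assuming a Livy server on localhost:8998; the polling is simplified): the session owns the long-lived SparkContext, and any script can submit code into it over REST.
import time
import requests

LIVY = "http://localhost:8998"  # placeholder Livy endpoint

# 1) Create a persistent PySpark session (do this once).
session = requests.post(LIVY + "/sessions", json={"kind": "pyspark"}).json()
session_url = "{}/sessions/{}".format(LIVY, session["id"])

# Wait until the session is idle and ready to accept statements.
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)

# 2) From any script, submit code that runs inside that same remote SparkContext.
code = "rdd = sc.textFile('data.txt'); rdd.count()"
stmt = requests.post(session_url + "/statements", json={"code": code}).json()
stmt_url = "{}/statements/{}".format(session_url, stmt["id"])

# 3) Poll for the result; the RDD itself stays alive in the remote session,
#    so a later script can submit more statements against it.
while True:
    result = requests.get(stmt_url).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(2)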
For those who may follow: I've recently discovered SnappyData.
SnappyData is still fairly young and there's a bit of a learning curve, but what it promises to do is make a persistent mutable SQL collection that can be shared between multiple Spark jobs and can be accessed natively as RDDs and DataFrames. It has a job server that you can dump concurrent jobs onto.
It's essentially a combination of a GemFire in-memory database with Spark clusters that are local in the same JVM, so (when I get decent at managing it) I can do large tasks without single-machine bottlenecks to pipe data in and out of Spark, or I can even do live data manipulation while another Spark program is running on the same data.
I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.
I'm looking for a visualisation and analytical notebook engine for BigQuery and am interested in Apache/Zeppelin.
We have internal capability in Python and R and want to use this with our BigQuery back end.
All the installation scenarios I've seen so far (e.g. https://cloud.google.com/blog/big-data/2016/09/analyzing-bigquery-datasets-using-bigquery-interpreter-for-apache-zeppelin) seem to require the installation of a fairly hefty Scala/Spark cluster, which I don't see the need for (and which would cost a lot).
Is it possible to install Zeppelin without the cluster in Google Cloud?
Starting with version 0.6.1, there is a native BigQuery Interpreter for Apache Zeppelin available.
It allows you to process and analyze datasets stored in Google BigQuery by running SQL against them directly from within an Apache Zeppelin notebook.
So you no longer need to query BigQuery through Apache Spark, which was previously the only way.
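Since you mention in-house Python capability: as an alternative sketch (assuming the google-cloud-bigquery client library is installed and credentials are configured; the project ID is a placeholder), you can also query BigQuery straight from a Python paragraph in Zeppelin with no Spark cluster at all:
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# Runs entirely in BigQuery; only the small result set comes back to the notebook.
for row in client.query(query).result():
    print(row["name"], row["total"])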