I launched a Spark/Cassandra cluster with DataStax DSE in the AWS cloud. My dataset is stored in S3, but I don't know how to transfer the data from S3 to my Cassandra cluster. Please help me.
The details depend on your file format and C* data model but it might look something like this:
Read the file from s3 into an RDD
val rdd = sc.textFile("s3n://mybucket/path/filename.txt.gz")
Manipulate the RDD (for example, parse each line into a (key, value) pair that matches the table's columns)
Write the rdd to a cassandra table:
rdd.saveToCassandra("test", "kv", SomeColumns("key", "value"))
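If you prefer PySpark, the same three steps might look roughly like the sketch below. It assumes a tab-separated file and the test.kv table from the snippet above; parse_line and load_s3_file_into_cassandra are made-up helper names, and the write goes through the Spark Cassandra connector's DataFrame source, which has to be on the classpath (it is in DSE).

```python
def parse_line(line, sep="\t"):
    """Split one text line into a (key, value) pair.

    The tab separator is an assumption about the file format.
    """
    key, value = line.rstrip("\n").split(sep, 1)
    return key, value


def load_s3_file_into_cassandra(spark, path, keyspace="test", table="kv"):
    """Read a text file from S3 and append its rows to a Cassandra table."""
    # 1. Read the file from S3 into an RDD of lines.
    lines = spark.sparkContext.textFile(path)
    # 2. Manipulate the RDD: turn every line into a (key, value) tuple.
    pairs = lines.map(parse_line)
    # 3. Write to Cassandra through the connector's DataFrame source.
    df = spark.createDataFrame(pairs, ["key", "value"])
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace=keyspace, table=table)
       .mode("append")
       .save())
```

With an existing SparkSession named spark, the call would be load_s3_file_into_cassandra(spark, "s3n://mybucket/path/filename.txt.gz").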
What @phact described uses the Spark API that comes with DataStax Enterprise, which can be very useful if there's ETL work that needs to be done along with the loading.
For loading only, you can use the sstableloader bulk loading capability. Here's a tutorial to get you started.
I would like to export a 90 TB BigQuery table to Google Cloud Storage. According to the documentation, BigQuery Storage API (beta) should be the way to go due to export size quotas (e.g., ExtractBytesPerDay) associated with other methods.
The table is date-partitioned, with each partition occupying ~300 GB. I have a Python AI Notebook running on GCP, which runs partitions (in parallel) through this script adapted from the docs.
from google.cloud import bigquery_storage_v1
client = bigquery_storage_v1.BigQueryReadClient()
table = "projects/{}/datasets/{}/tables/{}".format(
    "bigquery-public-data", "usa_names", "usa_1910_current"
)  # I am using my private table instead of this one.
requested_session = bigquery_storage_v1.types.ReadSession()
requested_session.table = table
requested_session.data_format = bigquery_storage_v1.enums.DataFormat.AVRO
parent = "projects/{}".format(project_id)
session = client.create_read_session(
    parent,
    requested_session,
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)
# The read stream contains blocks of Avro-encoded bytes. The rows() method
# uses the fastavro library to parse these blocks as an iterable of Python
# dictionaries.
rows = reader.rows(session)
Is it possible to save data from the stream directly to Google Cloud Storage?
I tried saving tables as AVRO files to my AI instance using fastavro and later uploading them to GCS using Blob.upload_from_filename(), but this process is very slow. I was hoping it would be possible to point the stream at my GCS bucket. I experimented with Blob.upload_from_file, but couldn't figure it out.
I cannot decode the whole stream to memory and use Blob.upload_from_string because I don't have over ~300 GB of RAM.
I spent the last two days parsing GCP documentation, but couldn't find anything, so I would appreciate your help, preferably with a code snippet, if at all possible. (If working with another file format is easier, I am all for it.)
Thank you!
Is it possible to save data from the stream directly to Google Cloud Storage?
By itself, the BigQuery Storage API is not capable of writing directly to GCS; you'll need to pair the API with code to parse the data, write it to local storage, and subsequently upload to GCS. This could be code that you write manually, or code from a framework of some kind.
It looks like the code snippet that you've shared processes each partition in a single-threaded fashion, which caps your throughput at the throughput of a single read stream. The storage API is designed to achieve high throughput through parallelism, so it's meant to be used with a parallel processing framework such as Google Cloud Dataflow or Apache Spark. If you'd like to use Dataflow, there's a Google-provided template you can start from; for Spark, you can use the code snippets that David has already shared.
An easy way to do that would be to use Spark with the spark-bigquery-connector. It uses the BigQuery Storage API to read the table directly into a Spark DataFrame. You can create a Spark cluster on Dataproc, which is located in the same data centers as BigQuery and GCS, making the read and write speeds much faster.
A code example will look like this:
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .load()
df.write.format("avro").save("gs://bucket/path")
You can also filter the data and work on each partition separately:
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .option("filter", "the_date='2020-05-12'") \
    .load()
# OR, in case you don't need to give the partition at load
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .load()
df.where("the_date='2020-05-12'").write....
Please note that in order to read large amounts of data you would need a sufficiently large cluster.
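To export the table one date partition at a time, the filtered read above can be wrapped in a loop. This is only a sketch: it reuses the the_date column and the gs://bucket/path prefix from the snippets above, and partition_exports / export_partitions are made-up helper names. The per-partition output directory layout is my own choice, not something the connector requires.

```python
from datetime import date, timedelta


def partition_exports(start, end, column, output_prefix):
    """Yield (filter, output_path) pairs, one per day from start to end inclusive."""
    day = start
    while day <= end:
        stamp = day.isoformat()
        yield f"{column}='{stamp}'", f"{output_prefix}/{column}={stamp}"
        day += timedelta(days=1)


def export_partitions(spark, table, start, end, column, output_prefix):
    """Write each daily partition of a BigQuery table to its own Avro directory."""
    for row_filter, path in partition_exports(start, end, column, output_prefix):
        (spark.read.format("bigquery")
              .option("table", table)
              .option("filter", row_filter)  # same filter option as shown above
              .load()
              .write.format("avro")
              .save(path))
```

Writing each day to its own directory means a failed run can be resumed from the last missing date instead of starting the whole 90 TB over.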
I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage or Cloud SQL, I have to dump the data and partition it by last date. This approach is easy, but it isn't worth it because it takes a lot of time. Is it possible to do ETL with Airflow from MySQL/MongoDB to BigQuery without Google Cloud Storage or Cloud SQL?
With Airflow or not, the easiest and most efficient way is to:
Extract the data from the data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free)
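The four steps above could be sketched like this with the google-cloud-storage and google-cloud-bigquery client libraries. The function names, the newline-delimited JSON format, and schema autodetection are my assumptions; rows is whatever iterable of dicts your MySQL or Mongo cursor gives you.

```python
import json


def write_ndjson(rows, path):
    """Write an iterable of dicts to a newline-delimited JSON file; return the row count."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            # default=str turns dates, Decimals, etc. into strings.
            fh.write(json.dumps(row, default=str) + "\n")
            count += 1
    return count


def upload_and_load(local_path, bucket_name, blob_name, table_id):
    """Upload the file to Cloud Storage, then run a BigQuery load job on it."""
    # Imported here so write_ndjson stays usable without the Google libraries.
    from google.cloud import bigquery, storage

    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # assumption: let BigQuery infer the schema
    )
    job = bigquery.Client().load_table_from_uri(
        f"gs://{bucket_name}/{blob_name}", table_id, job_config=job_config
    )
    job.result()  # blocks until the load job finishes; raises if it failed
```

In Airflow, each of the two functions would fit naturally into its own task, so a failed upload or load can be retried without re-extracting.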
If you want to avoid creating a file and dropping it into Cloud Storage, another way is possible, though it is much more complex: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream-write the result into BigQuery (streaming is not free on BigQuery)
Described like this, it does not seem very complex, but:
You have to maintain the connections to the source and to the destination during the whole process
You have to handle errors (read and write) and be able to restart from the last point of failure
You have to perform bulk stream writes into BigQuery to optimize performance. The chunk size has to be chosen wisely.
Airflow bonus: you have to define and write your own custom operator to do this.
For these reasons, I strongly recommend following the first solution.
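For completeness, the chunked streaming write could look something like the sketch below. It relies on Client.insert_rows_json from google-cloud-bigquery, which returns a list of per-row errors (empty on success). The helper names and the chunk size of 500 are assumptions, and retrying or resuming from the failed chunk is left to the caller.

```python
def chunked(rows, size):
    """Group an iterable into lists of at most `size` items, lazily."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def stream_rows(client, table_id, rows, chunk_size=500):
    """Stream rows to BigQuery chunk by chunk.

    Returns None on success, or (chunk_index, errors) for the first chunk
    that BigQuery rejected so the caller knows where to restart.
    """
    for index, batch in enumerate(chunked(rows, chunk_size)):
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            return index, errors
    return None
```

Here client would be a google.cloud.bigquery.Client() and rows the dicts fetched from the MySQL or Mongo cursor.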
Additional tip: BigQuery can now query Cloud SQL databases directly. If you still need your MySQL database (to keep some reference data in it), you can migrate it to Cloud SQL and join your BigQuery data warehouse with your Cloud SQL reference data.
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure the connections used by your Airflow DAG are properly authenticated.
Also, make sure to define which columns from MySQL you would like to pull and load into BigQuery. You also want to choose the method of loading your data: incrementally or fully? Be sure to also come up with a technique for eliminating duplicate copies of data (de-duplication).
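As one possible de-duplication technique for incremental loads, you can keep only the most recent version of each row before loading. A minimal sketch, assuming every row has a primary key column and a last-modified column (the names id and updated_at are hypothetical):

```python
def dedupe_latest(rows, key="id", version="updated_at"):
    """Keep one row per key: the one with the greatest version value.

    On ties, the row seen later wins. Output order follows the first
    appearance of each key.
    """
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return list(latest.values())
```

The same idea can also be applied after loading, inside BigQuery, instead of in Python.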
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your BigQuery account and authentication:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at stitchdata.com (https://www.stitchdata.com/integrations/mysql/google-bigquery/)
The Stitch MySQL integration will ETL your MySQL to Google BigQuery in minutes and keep it up to date without having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won’t be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow
I am trying to export data from Athena (AWS) to Python. Or is there a way to connect Python to Athena, just like there is a way to connect Python to MySQL?
I have around 15 GB of data in Athena and would like to export it and perform further analysis. There needs to be some way to export such a large dataset.
Detailed steps would be appreciated!
Thanks in advance!
Data is not actually stored in Amazon Athena. Rather, Amazon Athena looks at data that is stored in Amazon S3 and runs queries across it.
Therefore, if you just want the raw data, simply copy the files directly from S3. Simple!
However, if you wish to run a query in Amazon Athena and export/manipulate the results, you can call Athena from Python with Boto 3 (see the Athena section of the Boto 3 docs).
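A rough sketch of the Boto 3 route: start the query, poll until it finishes, and get back the S3 location of the result CSV. The function name and polling interval are my own; the client is expected to be boto3.client("athena"), and output_location is an S3 prefix you own where Athena may write results.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=2.0):
    """Run a query in Athena and return the S3 URI of its result file."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state is reached.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            return execution["ResultConfiguration"]["OutputLocation"]
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
```

For a result as large as 15 GB it is probably better to download the returned CSV from S3 (or read it in chunks) than to page through the rows with the Athena API.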
I have been working with Glue since January, and have worked on multiple POCs and production data lakes using AWS Glue / Databricks / EMR, etc. I have used AWS Glue to read data from S3 and perform ETL before loading to Redshift, Aurora, etc.
I now need to read data from a source table in SQL Server and write it to an S3 bucket as a CSV file with a custom (user-defined) name, say employee.csv.
I am looking for some pointers on how to do this, please.
Thanks
You can connect over JDBC by specifying connection_type="sqlserver" to get a dynamic frame from SQL Server. See here for the GlueContext docs
dynF = glueContext.getSource(connection_type="sqlserver", url=..., dbtable=..., user=..., password=...)
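On the custom file name: Spark-based writers such as Glue name their output files themselves, so one common workaround is to write to a temporary prefix and then rename the single output object with Boto 3. The sketch below assumes the job produced exactly one data file under the prefix (for example after repartitioning to one partition); rename_single_output is a made-up helper and s3 is expected to be boto3.client("s3").

```python
def rename_single_output(s3, bucket, prefix, target_key):
    """Copy the only data object under `prefix` to `target_key`, then delete the original."""
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [
        obj["Key"]
        for obj in listing.get("Contents", [])
        # Skip folder placeholders and Spark's _SUCCESS marker.
        if not obj["Key"].endswith(("/", "_SUCCESS"))
    ]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one output file under {prefix!r}, found {len(keys)}")

    # S3 has no rename operation: copy, then delete.
    # copy_object only handles objects up to 5 GB.
    s3.copy_object(Bucket=bucket, Key=target_key, CopySource={"Bucket": bucket, "Key": keys[0]})
    s3.delete_object(Bucket=bucket, Key=keys[0])
    return target_key
```

For example, rename_single_output(s3, "my-bucket", "tmp/employee/", "exports/employee.csv"), where the bucket and prefixes are placeholders.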
This task fits the AWS DMS (Database Migration Service) use case. DMS is designed either to migrate data from one data store to another or to keep them in sync. It can keep your source (i.e., MSSQL) in sync with your target (i.e., S3) and transform the data along the way.
There is one non-negligible constraint in your case, though: ongoing sync with an MSSQL source only works if your license is the Enterprise or Developer Edition, and only for versions 2016-2019.
I want to load data into Spark (on Databricks) from Google BigQuery. I notice that Databricks offers a lot of support for Amazon S3 but not for Google.
What is the best way to load data into Spark (on Databricks) from Google BigQuery? Would the BigQuery connector allow me to do this, or is it only valid for files hosted on Google Cloud Storage?
The BigQuery Connector is a client side library that uses the public BigQuery API: it runs BigQuery export jobs to Google Cloud Storage, and takes advantage of file creation ordering to start Hadoop processing early to increase overall throughput.
This code should work wherever you happen to locate your Hadoop cluster.
That said, if you are running over large data, then you might find network bandwidth throughput to be a problem (how good is your network connection to Google?), and since you are reading data out of Google's network, then GCS network egress costs will apply.
Databricks has now documented how to use Google BigQuery via Spark here
Set the Spark config in the cluster settings:
credentials <base64-keys>
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
In pyspark use:
df = spark.read.format("bigquery") \
    .option("table", table) \
    .option("project", <project-id>) \
    .option("parentProject", <parent-project-id>) \
    .load()
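In case it helps, the <base64-keys> value in the cluster config is the service account's JSON key file encoded as base64. A small sketch for producing it (the function name and the key file path are placeholders):

```python
import base64
import json


def encode_service_account_key(key_path):
    """Return the contents of a service-account JSON key file as a base64 string."""
    with open(key_path, "rb") as fh:
        raw = fh.read()
    json.loads(raw)  # fail early if the file is not valid JSON
    return base64.b64encode(raw).decode("ascii")
```

Calling encode_service_account_key("/path/to/key.json") gives the string to paste after credentials in the Spark config.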