I would like to write bulk data to BigQuery (BQ) using a software API.
My constraints are:
I am going to use the maximum table width in BQ: 10,000 columns and ~35,000 rows (this can grow).
Schema autodetect is required.
If possible, I would like to use some kind of parallelism to write many tables at the same time asynchronously (Apache Beam and Dataflow might be the solution for that).
When using the pandas library for BQ, there is a limit on the size of the DataFrame that can be written, so the data has to be partitioned.
What would be the best way to do this?
Many thanks for any advice / comment,
eilalan
Apache Beam would be the right component, as it supports processing huge volumes of data in both batch and streaming mode.
I don't think Beam has schema auto-detect. However, you can use the BigQuery API to fetch the schema if the table already exists.
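Since Beam has no auto-detect, one workaround is to infer the schema yourself from a sample of rows and pass it to beam.io.WriteToBigQuery, which accepts a schema string of the form 'field1:TYPE,field2:TYPE'. The following is only a minimal sketch under that assumption; the helper names infer_type and infer_schema and the widening rules are hypothetical, not part of Beam or BigQuery.

```python
from datetime import date, datetime

# Python type -> BigQuery type name. Order matters: bool is a subclass of int
# and datetime is a subclass of date, so the narrower type is checked first.
_TYPE_NAMES = [
    (bool, "BOOLEAN"),
    (int, "INTEGER"),
    (float, "FLOAT"),
    (datetime, "TIMESTAMP"),
    (date, "DATE"),
]


def infer_type(value):
    """Return a BigQuery type name for one Python value (STRING as fallback)."""
    for py_type, bq_type in _TYPE_NAMES:
        if isinstance(value, py_type):
            return bq_type
    return "STRING"


def infer_schema(rows):
    """Build a 'name:TYPE,name:TYPE' schema string from a list of dict rows.

    Hypothetical widening rules: INTEGER mixed with FLOAT becomes FLOAT,
    any other conflict becomes STRING, all-None columns become STRING.
    """
    schema = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                schema.setdefault(name, None)  # remember the column, type unknown
                continue
            bq_type = infer_type(value)
            previous = schema.get(name)
            if previous is None:
                schema[name] = bq_type
            elif previous != bq_type:
                numeric = {previous, bq_type} == {"INTEGER", "FLOAT"}
                schema[name] = "FLOAT" if numeric else "STRING"
    return ",".join(f"{name}:{bq_type or 'STRING'}" for name, bq_type in schema.items())


sample_rows = [
    {"id": 1, "score": 2, "name": "a", "ok": True},
    {"id": 2, "score": 2.5, "name": None, "ok": False},
]
schema_string = infer_schema(sample_rows)
```

The resulting string could then be passed as the schema argument when writing each table, so the inference runs once per table on a small sample rather than on the full data.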
I have two stages in my data pipeline. The first stage reads data from the source and dumps it to an intermediate bucket, and the next stage reads data from this intermediate bucket. I have Athena set up on the intermediate stage, and we are planning to read the partitioned data through Athena rather than reading a file (the reason for using Athena: we might have scenarios where we need to read from different partitions, based on some condition, in a single read).
Should we go ahead with this approach, given that Athena has some limitations when reading data into a pandas DataFrame, such as returning only 1,000 records at once?
Is there a better solution for this use case? We are using pandas.
We have decided to use awsdatawrangler for our purposes, since it is more reliable and is meant for the same purpose that we are trying to achieve.
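For context, the 1,000-record limit appears to be a page size rather than a hard cap: the boto3 Athena client's get_query_results returns up to 1,000 rows per call plus a NextToken for the following page. Below is a minimal sketch of following that token by hand; the function name fetch_all_rows is hypothetical, and the code assumes the first row of the first page holds the column names, which is how SELECT results are usually returned.

```python
def fetch_all_rows(client, query_execution_id, page_size=1000):
    """Collect every row of a finished Athena query as a list of dicts.

    `client` is expected to behave like boto3.client("athena"); only its
    get_query_results method is used.
    """
    rows = []
    header = None
    kwargs = {"QueryExecutionId": query_execution_id, "MaxResults": page_size}
    while True:
        page = client.get_query_results(**kwargs)
        for raw in page["ResultSet"]["Rows"]:
            values = [cell.get("VarCharValue") for cell in raw["Data"]]
            if header is None:
                header = values  # assumed: first row carries the column names
            else:
                rows.append(dict(zip(header, values)))
        token = page.get("NextToken")
        if not token:
            return rows
        kwargs["NextToken"] = token  # ask for the next page
```

All values come back as strings, so type conversion is still needed before building a DataFrame; a wrapper library that handles paging and typing saves that work.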
Using beam.io.WriteToBigQuery and beam.io.BigQuerySource
How large is the "very large dataset" that Apache Beam can't handle without partitioning?
The official documentation says:
If you are using the Beam SDK for Python, you might have import size quota issues if you write a very large dataset source
which is really confusing! I have 100,000 rows of data in one BigQuery table, and I don't think that is very large.
But I am facing very high latency when I read the data and write it back to another table in BigQuery.
I am new to Google BigQuery so I'm trying to understand how to best accomplish my use case.
I have daily data of customer visits stored in BigQuery that I wish to analyse using some algorithms that I have written in Python. Since there are multiple scripts that use subsets of the daily data, I was wondering what would be the best way to fetch and temporarily store the data. Additionally, the scripts run sequentially. Each script modifies some columns of the data, and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back to BigQuery.
Some approaches I had in mind are:
Export the BigQuery table into a GAE (Google App Engine) instance as a db file and query the relevant data for each script from the db file using the sqlite3 Python package. Once all the scripts have run, store the modified table back to BigQuery and then remove the db file from the GAE instance.
Query data from BigQuery every time I want to run a script, using the google-cloud Python client library or the pandas gbq package. Modify the BigQuery table after running each script.
Does anybody know which of these would be the better way to accomplish this (in terms of efficiency/cost), or can anybody suggest alternatives?
Thanks!
The answer to your question mostly depends on your use case and the size of the data that you will be processing, so there is not an absolute and correct answer for it.
However, there are some points that you may want to take into account regarding the usage of BigQuery and how some of its features can be interesting for you in the scenario you described.
Let me quickly go over the main topics you should have a look at:
Pricing: leaving aside storage billing and focusing on the cost of the queries themselves (which is more relevant to your use case), BigQuery bills by the number of bytes processed by each query. There is a free quota of 1 TB per month; beyond that, the cost is $5 per TB of processed data, with a minimum billable unit of 10 MB.
Cache: when BigQuery returns results, they are stored in a temporary cached table (or a permanent one if you wish). Cached tables are kept for approximately 24 hours, with some exceptions described in the documentation (caching is also best-effort, so earlier deletion may happen). Results returned from a cached table are not billed, as long as you run the exact same query, because billing is based on the number of bytes processed and reading a cached table involves no processing. I think this feature is worth a look: judging from your sentence "Since there are multiple scripts that use subsets of the daily data", it may (I am only guessing here) fit your use case to run a single query once and then retrieve the results multiple times from the cache, without storing them anywhere else.
Partitions: BigQuery offers partitioned tables, which are individual tables divided into smaller segments by date. This will make it easier to query the data daily, as you require.
Speed: BigQuery offers a real-time analytics platform, so you will be able to perform fast queries retrieving the information you need, applying some initial processing that you can later use in your custom Python algorithms.
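To make the pricing point above concrete, here is a rough cost estimate using only the figures quoted in this answer ($5 per TB, 1 TB free per month, 10 MB minimum per query, cached results free). It is a sketch, not a billing calculator: the function names are hypothetical, treating 1 TB as 1024^4 bytes is an assumption, and actual prices may differ from the ones quoted here.

```python
MB = 1024 ** 2
TB = 1024 ** 4  # assumption: binary terabytes

PRICE_PER_TB_USD = 5.0          # figure quoted in the answer
FREE_BYTES_PER_MONTH = 1 * TB   # figure quoted in the answer
MIN_BYTES_PER_QUERY = 10 * MB   # figure quoted in the answer


def billed_bytes(bytes_processed):
    """Bytes charged for one query; a cache hit processes 0 bytes and is free."""
    if bytes_processed == 0:
        return 0
    return max(bytes_processed, MIN_BYTES_PER_QUERY)


def monthly_cost_usd(query_sizes):
    """Estimated monthly cost for a list of per-query bytes processed."""
    total = sum(billed_bytes(size) for size in query_sizes)
    billable = max(0, total - FREE_BYTES_PER_MONTH)
    return billable / TB * PRICE_PER_TB_USD


# Example: 30 daily queries of 100 GB each is about 2.93 TB in the month.
example_cost = monthly_cost_usd([100 * 1024 ** 3] * 30)
```

Running the same query a second time within the cache window would count as 0 bytes in this model, which is why reusing one query across several scripts can be cheaper than it first looks.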
So, in general, I would say that there is no need for you to keep any other database with partial results apart from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data locally without having to deal with huge expenses or delays in data retrieval. However, again, this will ultimately depend on your use case and on the amount of data you are storing and need to process simultaneously; in general terms, I would just go with BigQuery on its own.
I am working on querying data and then building a visualization on top of it. Currently my whole pipeline works but it can take upwards of 10 minutes sometimes to return the results of my query and I am very sure I am missing some optimization or another crucial step that is causing this slow speed.
Details:
I have about 500 GB in 3,500 CSV files. I store these in an Azure Blob Storage account and run a Spark cluster on Azure HDInsight. I am using Spark 2.1.
Here is the script (PySpark3 in an Azure Jupyter Notebook) I use to ingest the data:
csv_df = spark.read.csv('wasb://containername@storageaccountname.blob.core.windows.net/folder/*.csv', header=True, inferSchema=True)  # Read the CSV files
csv_df.write.parquet('wasb://containername@storageaccountname.blob.core.windows.net/folder/parquet_folder/csvdfdata.parquet')  # Write Parquet
parquet_df = spark.read.parquet('wasb://containername@storageaccountname.blob.core.windows.net/folder/parquet_folder/csvdfdata.parquet')  # Read Parquet
parquet_df.createOrReplaceTempView('temp_table')  # Create a temporary view
spark.sql("create table permanent_table as select * from temp_table")  # Create a permanent table
I then use the ODBC driver and the code below to pull data. I understand that ODBC can slow things down a little, but I believe 10 minutes is far more than expected.
https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql/blob/master/Program.cs
My code to pull data is similar to this ^
The problem is that the pipeline works but it is way too slow for it to be of any use. The visualizations I create need to pull data in a few seconds at best.
Other details:
A good number of the queries filter on DateId, which stores dates in int format, e.g. 20170629 (29 June 2017).
Sample Query = select DateId, count(PageId) as total from permanent_table where (DateId >= 20170623) and (DateId <= 20170629) group by DateId order by DateId asc
Any help would be greatly appreciated! Thanks in advance!
Thank You!
First, one point of clarification: what queries are you running over the ODBC connection? Are they table creation queries? Those would take a long time. Make sure you run only read queries over ODBC, against a pre-created Hive table.
Now, assuming you do the above, here are a few things you can do to make queries run in a few seconds.
1. The Thrift server on HDI uses dynamic resource allocation, so the first query will take extra time while resources are allocated. After that it should be faster. You can check in Ambari -> Yarn UI -> Thrift application how many resources it uses; it should use all the cores of your cluster.
2. 3,500 files is too many. When you create the Parquet table, coalesce(num_partitions) (or repartition) it into a smaller number of partitions. Adjust the number so there is about 100 MB per partition or, if there is not enough data, at least one partition per core of your cluster.
3. In your data generation script you can skip one step: instead of creating a temp table, create the Hive table in Parquet format directly. Replace csv_df.write.parquet with csv_df.write.mode("overwrite").format("parquet").saveAsTable("tablename") (in PySpark; SaveMode.Overwrite is the Scala spelling).
4. For date queries you can partition your data by year, month and day columns (you will need to extract them first). If you do this you won't need to worry about #2. You may end up with too many files; if so, reduce the partitioning to year and month only.
5. Size of your cluster. For 500 GB of text files you should be fine with a few D14v2 nodes (maybe 2-4), but this depends on the complexity of your queries.
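The partition sizing rule and the date extraction from the list above can be worked out in plain Python before touching the cluster. This is only a sketch: the helper names split_date_id and target_partitions are hypothetical, and the 64-core figure in the example is an assumed cluster size, not something stated in the question.

```python
import math


def split_date_id(date_id):
    """Split an int date such as 20170629 into (year, month, day)."""
    year, rest = divmod(date_id, 10000)
    month, day = divmod(rest, 100)
    return year, month, day


def target_partitions(total_bytes, cores, bytes_per_partition=100 * 1024 ** 2):
    """About 100 MB per partition, but never fewer partitions than cores."""
    by_size = math.ceil(total_bytes / bytes_per_partition)
    return max(by_size, cores)


# 500 GB of data on an assumed 64-core cluster
num_partitions = target_partitions(500 * 1024 ** 3, cores=64)
year, month, day = split_date_id(20170629)
```

In the Spark job, the first number would be passed to repartition(num_partitions), while the year, month and day values would become columns given to write.partitionBy("year", "month", "day") so that a filter on a date range reads only the matching directories.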
I have a few million documents. What I am trying to do is simple: process the documents to extract the information I need and load it into a database. I am doing it in Python, using SQLAlchemy. I am also using multiprocessing to make use of all the cores on my machine. The documents are XML with huge chunks of text. The database is MySQL with a custom relational schema.
However, it runs very slow and loads only about 50k documents in 6-7 hours.
Is there any way that I can speed this task up?
Sometimes an RDBMS is not the answer. One sign of this is data that has no relations between records, for example if every document stands by itself.
If you'd like to make some unstructured data searchable, consider building a search index using pylucene,
or maybe put the data in a non-relational database such as MongoDB.
In any case, try to identify which part of your system is slowing down the process. My guess would be the database or the file system; if it is MySQL, about all you can do is throw more hardware at it.
Another way to optimize a system that uses IO extensively is to switch to async programming with a library like Twisted, but it has a learning curve, so make 100% sure it is needed first.
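One cheap way to identify the slow part is to time each stage separately on a small sample of documents. The sketch below does that for a parse stage and a load stage. It is deliberately simplified: sqlite3 in memory stands in for MySQL, the <title>/<body> document layout is invented for illustration, and the function names are hypothetical.

```python
import sqlite3
import time
import xml.etree.ElementTree as ET


def parse(xml_text):
    """Extract the fields of interest from one XML document (assumed layout)."""
    root = ET.fromstring(xml_text)
    return root.findtext("title", default=""), root.findtext("body", default="")


def run_pipeline(documents):
    """Parse and load documents, returning (row_count, seconds per stage)."""
    timings = {"parse": 0.0, "load": 0.0}
    conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL database
    conn.execute("CREATE TABLE docs (title TEXT, body TEXT)")
    for xml_text in documents:
        start = time.perf_counter()
        record = parse(xml_text)
        timings["parse"] += time.perf_counter() - start

        start = time.perf_counter()
        conn.execute("INSERT INTO docs VALUES (?, ?)", record)
        conn.commit()  # one commit per document, like a naive loader
        timings["load"] += time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
    conn.close()
    return count, timings


sample = ["<doc><title>t</title><body>some text</body></doc>"] * 100
row_count, stage_seconds = run_pipeline(sample)
```

If the load stage dominates on the real database, the bottleneck is on the database side; if the parse stage dominates, more cores or a faster parser would help more than new hardware for MySQL.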