I have a 32 GB table in BigQuery that I need to adjust in a Jupyter Notebook (using Pandas) and then export to Cloud Storage as a .txt file.
How can I do this?
It seems you can only export about 1 GB per file, so your best bet is probably to query the first 1 GB worth of rows, then the next 1 GB, and so on, and save each chunk individually. That can all be scripted with the bq tool once you know approximately how much storage each row takes up. Does that make sense?
You can use the Google Cloud Platform Console to do that.
Go to BigQuery in the Cloud Console.
Select the table you want to export, click "Export", and then choose "Export to GCS".
As your table is bigger than 1 GB, be sure to put a wildcard in the filename so BigQuery exports the data in chunks of roughly 1 GB (e.g. export-*.csv).
You cannot export nested and repeated data in CSV format; nested and repeated data are supported for Avro, JSON, and Parquet (Preview) exports. In any case, the form will tell you which formats are available when you try to select one.
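If you prefer to trigger the same export from your notebook, a rough equivalent with the BigQuery Python client library looks like the sketch below; the project, dataset, table, and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# A wildcard in the destination URI lets BigQuery split the export into
# multiple files, which is required because the table is larger than 1 GB.
destination_uri = "gs://my-export-bucket/export-*.csv"

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",   # placeholder table ID
    destination_uri,
    location="US",                      # must match the dataset's location
)
extract_job.result()                    # wait for the export to finish

The default export format is CSV; if you really need .txt objects, you can rename them afterwards or change the delimiter via ExtractJobConfig.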
I am trying to automate a job where files are written to GCS using data queried from BigQuery.
I have a BigQuery table and I need to export files to GCS, named according to a particular field.
field1    file_name
w         filea
x         fileb
y         filec
z         filed
So in this case, I need to produce four CSV files: filea.csv, fileb.csv, filec.csv, and filed.csv.
Is there a way to automate this with Python, so that if a new value (fileE) shows up in the BigQuery table, the job exports it to GCS with the proper name fileE.csv?
Thank you!
I tried exporting the files one by one using the BigQuery data export and it worked, but I was looking for a Python solution.
Thanks
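One possible way to automate this from Python (just a sketch, untested; the table and bucket names are placeholders) is to list the distinct file_name values and run one EXPORT DATA statement per value:

from google.cloud import bigquery

client = bigquery.Client()
table = "my-project.my_dataset.my_table"   # placeholder
bucket = "my-export-bucket"                # placeholder

# Pick up every file_name currently in the table, so a new value such as
# fileE is exported automatically on the next run.
names = client.query(f"SELECT DISTINCT file_name FROM `{table}`").result()

for row in names:
    name = row.file_name
    # EXPORT DATA requires a single '*' wildcard in the URI, so the objects are
    # named like filea-000000000000.csv rather than exactly filea.csv.
    export_sql = f"""
        EXPORT DATA OPTIONS(
            uri='gs://{bucket}/{name}-*.csv',
            format='CSV',
            overwrite=true,
            header=true
        ) AS
        SELECT * FROM `{table}` WHERE file_name = '{name}'
    """
    client.query(export_sql).result()

Any scheduler you already have can then run this script periodically so that new values are picked up.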
We use the google-cloud-bigquery Python library to query BigQuery and process the results in our Python script. The processing portion transforms and enriches the data and in the end creates JSON objects.
This is how we use the BQ library in our script (simplified):
from google.cloud import bigquery
client = bigquery.Client()
query = "SELECT col1,col2,... FROM <table>"
queryjob = client.query(query)
result_set = queryjob.result(page_size=50000)
for page in result_set.pages:
    transform_records(page)  # transform/enrich the rows of the current page
In general, and for moderately sized tables, this works just fine. However, we run into a performance issue when querying a table that returns 11 million records (~3.5 GB in total). Even if we leave out the processing, just fetching the pages takes ~80 minutes (we did not observe any significant difference between running it locally and running it in a VM/cluster in the same region as the BigQuery dataset).
Any ideas on how to reduce the loading time?
What we tried:
Varying the page size: the obvious assumption that larger page sizes, and hence fewer pages, reduce HTTP overhead holds true. However, we noticed that setting the page size above 8,500 did not have any effect (the maximum number of records returned by the API per page was ~8,500). Even so, this only accounts for an improvement of a few percent of the loading time.
Iterating over the result set records instead of pages: gave us roughly the same performance.
Separating data loading from processing by putting the loading portion into a background thread and using a multiprocessing queue to share the data with the processing workers: obviously no impact on the pure time spent receiving the data from BQ.
Trying to fetch multiple pages in parallel: we think this could reduce the loading time drastically, but we did not manage to get it working.
What we did not try:
Using the BQ Storage API, or rather a method that fetches data from BQ using this API (i.e. result_set.to_arrow_iterable / to_dataframe_iterable): we would like to avoid the mess of having to deal with data type conversions, as the output of the processing part will be a JSON object.
Using the BQ REST API directly, without the comfort that the bigquery library offers, in order to fetch multiple pages of the result set simultaneously: this seems somewhat complicated, and we are not even sure whether the API itself allows simultaneous access to pages.
Exporting the data to GCS first using the client.extract_table method: we used this approach in other use cases and are aware that fetching data from GCS is much faster. However, as we get acceptable performance for most of our source tables, we'd rather avoid this extra step of exporting to GCS.
Given the size of the data, the approach you have mentioned should be avoided.
One of the following approaches can be applied:
Transform the table data in BigQuery using built-in functions, UDFs, or Remote Functions and save the transformed data to another table.
Export the transformed table data to Cloud Storage as one or more CSV or JSON files (see the sketch after this list).
Load the CSV / JSON files into the non-GCP system using a compute service.
If the transformation is not feasible in BigQuery, then:
Export the raw table data to Cloud Storage as one or more CSV or JSON files.
Load each CSV / JSON file on a compute service, transform the data, and load the transformed data into the non-GCP system.
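A rough sketch of the first approach with the Python client (untested; the project, dataset, table, and bucket names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# 1. Transform in BigQuery and write the result to another table.
transform_sql = """
    SELECT col1, col2        -- built-in functions / UDFs / Remote Functions go here
    FROM `my-project.my_dataset.source_table`
"""
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.transformed_table",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(transform_sql, job_config=job_config).result()

# 2. Export the transformed table to Cloud Storage as newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
client.extract_table(
    "my-project.my_dataset.transformed_table",
    "gs://my-export-bucket/transformed-*.json",   # wildcard => one or more files
    job_config=extract_config,
).result()

The exported files can then be downloaded and processed by whatever compute service performs the final load.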
Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.
I currently use ExtractJobConfig from the bigquery Python client with the wildcard operator to create multiple files. But I need to create a folder for every nomenclature value (it is a column of the BigQuery table), and then create the multiple files inside it.
The table is pretty huge and won't fit in RAM (it could, but that's not the idea); it is 1+ TB. I cannot naively loop over it with Python.
I read quite a lot of documentation and went through the parameters, but I can't find a clean solution. Did I miss something, or is there no Google solution?
My plan B is to use Apache Beam and Dataflow, but I don't have the skills yet, and I would like to avoid that solution as much as possible for simplicity and maintenance.
You have two solutions:
Create one export query per nomenclature value. If you have 100 nomenclature values, query the table 100 times and export the data to the target directory (a sketch is shown below). The issue is the cost: you will pay for processing the table 100 times.
You can use Apache Beam to extract the data and sort it. Then, with a dynamic destination, you will be able to create all the GCS paths that you want. The issue is that it requires Apache Beam skills to achieve.
There is an extra solution, similar to the second one, but using Spark, and especially serverless Spark. If you have more skill in Spark than in Apache Beam, it could be more efficient.
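For the first option, here is a compact sketch (untested; the table, bucket, and column names are placeholders, and the column is assumed to be literally called nomenclature) that writes one folder per nomenclature value:

from google.cloud import bigquery

client = bigquery.Client()
table = "my-project.my_dataset.huge_table"   # placeholder
bucket = "my-export-bucket"                  # placeholder

values = client.query(f"SELECT DISTINCT nomenclature FROM `{table}`").result()

for row in values:
    value = row.nomenclature
    # One folder per nomenclature value. Note that each EXPORT DATA statement
    # scans the table again, which is the cost issue mentioned above.
    client.query(f"""
        EXPORT DATA OPTIONS(
            uri='gs://{bucket}/{value}/export-*.csv',
            format='CSV',
            overwrite=true,
            header=true
        ) AS
        SELECT * FROM `{table}` WHERE nomenclature = '{value}'
    """).result()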
I would like to run a daily ingestion job that takes a CSV file from blob storage and integrates it into a PostgreSQL database. I have the constraint of using Python. Which solution would you recommend for building/hosting my ETL solution?
Have a nice day :)
Additional information:
The CSV file is 1.35 GB, with shape (1292532, 54).
I will push only 12 of the 54 columns to the database.
You can try Azure Data Factory to achieve this. Create a new Copy Data activity with your CSV as the source and the PostgreSQL database as the sink. In the Mapping setting, just select the columns you need. Finally, create a schedule trigger to run it.
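If you need to keep everything in plain Python instead, a minimal sketch with pandas and SQLAlchemy (untested; the file path, column names, table name, and connection string are placeholders) could look like this:

import pandas as pd
from sqlalchemy import create_engine

CSV_PATH = "daily_extract.csv"                 # the file pulled from blob storage
WANTED_COLS = ["col_a", "col_b", "col_c"]      # the 12 columns you actually need
engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")

# Read only the needed columns, in chunks, so the 1.35 GB file never has to
# sit in memory all at once.
for chunk in pd.read_csv(CSV_PATH, usecols=WANTED_COLS, chunksize=100_000):
    chunk.to_sql("target_table", engine, if_exists="append", index=False)

Whatever scheduler you already have can then run this script once a day.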
I have a program that downloads time series (ts) data from a remote database and saves the data as CSV files. New ts data is appended to old ts data, so my local folder keeps growing as more data is downloaded. After downloading and saving new ts data, I want to upload it to a Google BigQuery table. What is the best way to do this?
My current workflow is to download all of the data to CSV files, convert the CSV files to gzip files on my local machine, and then use gsutil to upload those gzip files to Google Cloud Storage. Next, I delete any existing table in Google BigQuery and manually create a new one by loading the data from Google Cloud Storage. I feel like there is room for significant automation/improvement, but I am a Google Cloud newbie.
Edit: Just to clarify, the data I am downloading can be thought of as time series data from Yahoo Finance. Each new day there is fresh data that I download and save to my local machine. I have to upload all of the data I have to Google BigQuery so that I can do SQL analysis on it.
Consider breaking up your data into daily tables (or partitions). Then you only need to upload the CSVs from the current day.
The script you have currently defined otherwise seems reasonable.
Extract the new day of CSVs from your source of time series data.
Gzip them for fast transfer.
Copy them to GCS.
Load the new CSVs into the current daily table/partition (see the sketch below).
This avoids the need to delete existing tables and reduces the amount of data and processing that you need to do. As a bonus, it is easier to backfill a single day if there is an error in processing.
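A sketch of the daily load step with the Python client (untested; the bucket path, project, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                                       # header row
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,                                           # or supply an explicit schema
)

# Gzipped CSVs in GCS are decompressed by BigQuery automatically. With a table
# partitioned on a DATE column, the appended rows land in that day's partition;
# a partition decorator such as my_table$20240131 can also target one day explicitly.
load_job = client.load_table_from_uri(
    "gs://my-bucket/ts/2024-01-31/*.csv.gz",    # placeholder path for one day's files
    "my-project.my_dataset.my_table",           # placeholder date-partitioned table
    job_config=job_config,
)
load_job.result()                               # wait for the load to finish

This replaces the delete-and-recreate step: each daily run only appends that day's rows.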