I am trying to automate a job that writes files to GCS using data queried from BigQuery.
I have a BigQuery table and I need to export files to GCS, named according to a particular field.
field1 file_name
w filea
x fileb
y filec
z filed
So in this case, I need to produce 4 CSV files: filea.csv, fileb.csv, filec.csv, filed.csv.
Is there a way I can automate this with Python, so that if a new value (fileE) shows up in the BQ table, the job exports it to GCS with the proper name fileE.csv?
Thank you!
I tried exporting the files one by one using BQ data export and it worked, but I was looking for a Python solution.
Thanks
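One way to script this is with the google-cloud-bigquery client and an EXPORT DATA statement: query the distinct file_name values, then run one export per value. This is only a rough sketch; the project, dataset, table, and bucket names below are placeholders.

# Hypothetical names: replace the project, dataset, table, and bucket with your own.
from google.cloud import bigquery

client = bigquery.Client()
table = "my-project.my_dataset.my_table"   # placeholder
bucket = "my-bucket"                       # placeholder

# 1. Find every distinct file_name currently in the table.
names = client.query(f"SELECT DISTINCT file_name FROM `{table}`").result()

# 2. Export the matching rows for each file_name to its own CSV in GCS.
for row in names:
    file_name = row["file_name"]
    export_sql = f"""
        EXPORT DATA OPTIONS (
            uri = 'gs://{bucket}/{file_name}-*.csv',
            format = 'CSV',
            overwrite = true,
            header = true
        ) AS
        SELECT * FROM `{table}` WHERE file_name = '{file_name}'
    """
    client.query(export_sql).result()  # wait for each export job to finish

Because the loop re-reads the distinct file_name values on every run, a new value such as fileE automatically gets its own fileE-*.csv export; the wildcard lets BigQuery shard large results, so the actual object names carry a shard suffix.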
I have a 32 GB table in BigQuery that I need to adjust in a Jupyter Notebook (using Pandas) and then export to Cloud Storage as a .txt file.
How can I do this?
It seems you can only export 1 GB at a time, so your best bet is probably to query the first 1 GB worth of rows, then the next 1 GB, and so on, and save each chunk individually. That can all be scripted with the bq tool once you know approximately how much storage each row takes up. Does that make sense?
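If you would rather do that chunking from the notebook instead of the bq tool, here is a rough sketch with the BigQuery Python client; the table name and chunk size are placeholders and would need tuning to your row size.

# Hypothetical table name and chunk size; adjust to your data.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # placeholder

chunk_rows = 1_000_000  # tune so each chunk stays well under 1 GB
for i, start in enumerate(range(0, table.num_rows, chunk_rows)):
    rows = client.list_rows(table, start_index=start, max_results=chunk_rows)
    df = rows.to_dataframe()
    # ... apply your Pandas adjustments here ...
    df.to_csv(f"chunk-{i:04d}.txt", sep="\t", index=False)

Each chunk file could then be uploaded to Cloud Storage with the google-cloud-storage client.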
You can use the Google Cloud Platform Console to do that.
Go to BigQuery in the Cloud Console.
Select the table you want to export, choose "Export", and then Export to GCS.
As your table is bigger than 1 GB, be sure to put a wildcard in the filename so BigQuery exports the data in chunks of approximately 1 GB (e.g. export-*.csv).
Note that you cannot export nested and repeated data in CSV format; nested and repeated data are supported for Avro, JSON, and Parquet (Preview) exports. In any case, the export form will tell you which formats are available when you try to select the file format.
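If you end up wanting to script the same export rather than click through the Console, a minimal sketch with the BigQuery Python client looks like this (project, dataset, table, and bucket names are placeholders):

# Hypothetical project, dataset, table, and bucket names.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    field_delimiter=",",
)

# The wildcard makes BigQuery shard the output into multiple files.
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/export-*.csv",
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish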
I have some large (500+ MB) .CSV files that I need to import into a PostgreSQL database.
I am looking for a script or tool that helps me to:
Generate the table columns SQL CREATE code, ideally taking into account the data in the .CSV file in order to create the optimal data types for each column.
Use the header of the .CSV as the names of the columns.
It would be perfect if such functionality existed in PostgreSQL or could be added as an add-on.
Thank you very much
You can use an open-source tool called pgfutter to create the table from your CSV file.
GitHub link
PostgreSQL also has COPY functionality; however, COPY expects the table to already exist.
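If you would rather stay in Python, pandas plus SQLAlchemy can create the table from the CSV header and infer reasonable column types from the data. This is only a sketch; the connection string, file name, and table name are placeholders:

# Placeholder connection string, file path, and table name.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Read the CSV in chunks so a 500+ MB file does not have to fit in memory at once;
# pandas uses the header row for column names and infers int/float/text types.
reader = pd.read_csv("large_file.csv", chunksize=100_000)
for i, chunk in enumerate(reader):
    chunk.to_sql(
        "my_table",
        engine,
        if_exists="replace" if i == 0 else "append",
        index=False,
    )

Once the table exists you can still switch to COPY for faster bulk loads, but for a one-off import to_sql alone is often good enough.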
I am new to the world of Python and am having some problems loading data from a CSV file into PostgreSQL.
I have successfully connected to my database from Python and created my table. But when I try to load the data from the CSV file into the created table in PostgreSQL, I get nothing in the table when I use either the insert function or the copy function and commit.
cur.execute('''COPY my_schema.sheet(id, length, width, change_d, change_h,change_t, change_a, name)
FROM '/private/tmp/data.csv' DELIMITER ',' CSV HEADER;''')
dbase.commit()
I am not sure what I am missing. Can anyone please help with this or advise a better way to load CSV data using a Python script?
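One common gotcha: COPY ... FROM '/path/file.csv' reads the file on the database server and usually requires superuser rights, so from a Python client it is often easier to stream the local file through psycopg2's copy_expert and then commit. A rough sketch, with placeholder connection details:

# Placeholder connection parameters; adjust to your environment.
import psycopg2

dbase = psycopg2.connect(dbname="mydb", user="me", password="secret", host="localhost")
cur = dbase.cursor()

copy_sql = """
    COPY my_schema.sheet(id, length, width, change_d, change_h, change_t, change_a, name)
    FROM STDIN WITH (FORMAT csv, HEADER true)
"""

with open("/private/tmp/data.csv", "r") as f:
    cur.copy_expert(copy_sql, f)   # streams the local file to the server

dbase.commit()  # without this, the loaded rows are rolled back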
Does anyone know how to pick up a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, my issue is that new files arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run the crawler if the files land in the same place. For example, if your data folder is s3://bucket/data/<files>, then you can add new files to it and run the ETL job; new files will be picked up automatically.
However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files>, then you need either to run a crawler or to execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register the new partitions in the Glue Catalog before starting the Glue ETL job.
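If you want to trigger that repair from Python rather than the Athena console, a minimal boto3 sketch would look like this (database, table, and output location are placeholders):

# Placeholder database, table, and S3 output location.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_catalog_table",
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)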
Once the data is loaded into a DynamicFrame or Spark DataFrame, you can apply filters to use only the data you need. If you still want to work with file names, you can add the file name as a column using Spark's input_file_name function and then apply filtering:
from pyspark.sql.functions import col, input_file_name

df = (df.withColumn("filename", input_file_name())
        .where(col("filename") == "your-filename"))
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, e.g. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from pushdown predicates in AWS Glue, as in the sketch below.
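With that layout in place, a pushdown predicate lets the job read only the partitions it needs. A sketch assuming a catalog table partitioned by year/month/day (database, table, and partition values are placeholders):

# Placeholder database, table, and partition values.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only the files under the matching partition are read from S3.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_glue_database",
    table_name="my_catalog_table",
    push_down_predicate="year == '2023' and month == '01' and day == '15'",
)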
I have a partitioned table in BigQuery that gets refreshed weekly with CSV files from Google Cloud Storage. During the refresh process the entire table is deleted and rewritten.
A new column was added to the CSV files and the table structure was updated accordingly, but now during the refresh I get a column mismatch error because the old CSV files don't contain the new column.
Is there a way to use gsutil or the bq command line to smoothly add the column to the old CSV files during the import into the partitioned table, without having to download each file to my local computer and alter the CSVs one by one?
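One option that avoids touching the old files at all: if the new column is nullable and sits at the end of the schema, loading with allow_jagged_rows makes BigQuery fill the missing trailing column with NULLs. A hedged sketch with the BigQuery Python client (table, bucket, and write disposition are placeholders); the bq load equivalent is the --allow_jagged_rows flag:

# Placeholder table and bucket names; assumes the new column is nullable
# and sits at the end of the table schema.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    allow_jagged_rows=True,   # old files missing the trailing column load it as NULL
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/refresh/*.csv",
    "my-project.my_dataset.my_partitioned_table",
    job_config=job_config,
)
load_job.result()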