I am trying to automate a job that writes files to GCS using data queried from BigQuery.
I have a BigQuery table and I need to export files to GCS, named according to a particular field.
field1 file_name
w filea
x fileb
y filec
z filed
So in this case, I need to produce 4 CSV files: filea.csv, fileb.csv, filec.csv, filed.csv.
Is there a way I can automate this with Python, so that if a new value (fileE) shows up in the BQ table, the job exports it to GCS with the proper name fileE.csv?
Thank you!
I tried exporting the files one by one using BQ data export and it worked, but I was looking for a Python solution.
Thanks
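One way to script this is with the google-cloud-bigquery client and an EXPORT DATA statement: query the distinct file_name values, then run one export per value. This is only a rough sketch; the project, dataset, table, and bucket names below are placeholders.

# Hypothetical names: replace the project, dataset, table, and bucket with your own.
from google.cloud import bigquery

client = bigquery.Client()
table = "my-project.my_dataset.my_table"   # placeholder
bucket = "my-bucket"                       # placeholder

# 1. Find every distinct file_name currently in the table.
names = client.query(f"SELECT DISTINCT file_name FROM `{table}`").result()

# 2. Export the matching rows for each file_name to its own CSV in GCS.
for row in names:
    file_name = row["file_name"]
    export_sql = f"""
        EXPORT DATA OPTIONS (
            uri = 'gs://{bucket}/{file_name}-*.csv',
            format = 'CSV',
            overwrite = true,
            header = true
        ) AS
        SELECT * FROM `{table}` WHERE file_name = '{file_name}'
    """
    client.query(export_sql).result()  # wait for each export job to finish

Because the loop re-reads the distinct file_name values on every run, a new value such as fileE automatically gets its own fileE-*.csv export; the wildcard lets BigQuery shard large results, so the actual object names carry a shard suffix.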
I have a 32 GB table in BigQuery that I need to adjust in a Jupyter Notebook (using Pandas) and then export to Cloud Storage as a .txt file.
How can I do this?
It seems you can only export 1 GB at a time, so your best bet is probably to query the first 1 GB worth of rows, then the next 1 GB, and so on, and save each chunk individually. That can all be scripted with the bq tool once you know approximately how much storage each row takes up. Does that make sense?
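If you would rather do that chunking from the notebook instead of the bq tool, here is a rough sketch with the BigQuery Python client; the table name and chunk size are placeholders and would need tuning to your row size.

# Hypothetical table name and chunk size; adjust to your data.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # placeholder

chunk_rows = 1_000_000  # tune so each chunk stays well under 1 GB
for i, start in enumerate(range(0, table.num_rows, chunk_rows)):
    rows = client.list_rows(table, start_index=start, max_results=chunk_rows)
    df = rows.to_dataframe()
    # ... apply your Pandas adjustments here ...
    df.to_csv(f"chunk-{i:04d}.txt", sep="\t", index=False)

Each chunk file could then be uploaded to Cloud Storage with the google-cloud-storage client.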
You can use the Google Cloud Platform Console to do that.
Go to BigQuery in the Cloud Console.
Select the table you want to export, choose "Export", and then Export to GCS.
As your table is bigger than 1 GB, be sure to put a wildcard in the filename so BigQuery exports the data in chunks of approximately 1 GB (e.g. export-*.csv).
Note that you cannot export nested and repeated data in CSV format; nested and repeated data are supported for Avro, JSON, and Parquet (Preview) exports. In any case, the export form will tell you which formats are available when you try to select the file format.
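If you end up wanting to script the same export rather than click through the Console, a minimal sketch with the BigQuery Python client looks like this (project, dataset, table, and bucket names are placeholders):

# Hypothetical project, dataset, table, and bucket names.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    field_delimiter=",",
)

# The wildcard makes BigQuery shard the output into multiple files.
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/export-*.csv",
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish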
I have some large (500+ MB) .CSV files that I need to import into a PostgreSQL database.
I am looking for a script or tool that helps me to:
Generate the table columns SQL CREATE code, ideally taking into account the data in the .CSV file in order to create the optimal data types for each column.
Use the header of the .CSV as the names of the columns.
It would be perfect if such functionality existed in PostgreSQL or could be added as an add-on.
Thank you very much
You can use an open-source tool called pgfutter to create the table from your CSV file.
GitHub link
PostgreSQL also has COPY functionality; however, COPY expects the table to already exist.
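If you would rather stay in Python, pandas plus SQLAlchemy can create the table from the CSV header and infer reasonable column types from the data. This is only a sketch; the connection string, file name, and table name are placeholders:

# Placeholder connection string, file path, and table name.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Read the CSV in chunks so a 500+ MB file does not have to fit in memory at once;
# pandas uses the header row for column names and infers int/float/text types.
reader = pd.read_csv("large_file.csv", chunksize=100_000)
for i, chunk in enumerate(reader):
    chunk.to_sql(
        "my_table",
        engine,
        if_exists="replace" if i == 0 else "append",
        index=False,
    )

Once the table exists you can still switch to COPY for faster bulk loads, but for a one-off import to_sql alone is often good enough.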
I am new to the world of Python and am having some problems loading data from a CSV file into PostgreSQL.
I have successfully connected to my database from Python and created my table. But when I try to load the data from the CSV file into the created table in PostgreSQL, I get nothing in the table when I use either the insert function or the copy function and commit.
cur.execute('''COPY my_schema.sheet(id, length, width, change_d, change_h,change_t, change_a, name)
FROM '/private/tmp/data.csv' DELIMITER ',' CSV HEADER;''')
dbase.commit()
I am not sure what I am missing. Can anyone please help with this or advise a better way to load CSV data using a Python script?
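One common gotcha: COPY ... FROM '/path/file.csv' reads the file on the database server and usually requires superuser rights, so from a Python client it is often easier to stream the local file through psycopg2's copy_expert and then commit. A rough sketch, with placeholder connection details:

# Placeholder connection parameters; adjust to your environment.
import psycopg2

dbase = psycopg2.connect(dbname="mydb", user="me", password="secret", host="localhost")
cur = dbase.cursor()

copy_sql = """
    COPY my_schema.sheet(id, length, width, change_d, change_h, change_t, change_a, name)
    FROM STDIN WITH (FORMAT csv, HEADER true)
"""

with open("/private/tmp/data.csv", "r") as f:
    cur.copy_expert(copy_sql, f)   # streams the local file to the server

dbase.commit()  # without this, the loaded rows are rolled back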
Does anyone know how to pick up a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, my issue is that new files arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run the crawler if the files land in the same place. For example, if your data folder is s3://bucket/data/<files>, then you can add new files to it and run the ETL job; new files will be picked up automatically.
However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files>, then you need either to run a crawler or to execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register the new partitions in the Glue Catalog before starting the Glue ETL job.
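If you want to trigger that repair from Python rather than the Athena console, a minimal boto3 sketch would look like this (database, table, and output location are placeholders):

# Placeholder database, table, and S3 output location.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_catalog_table",
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)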
Once the data is loaded into a DynamicFrame or Spark DataFrame, you can apply filters to use only the data you need. If you still want to work with file names, you can add the file name as a column using Spark's input_file_name function and then apply filtering:
from pyspark.sql.functions import col, input_file_name

df = (df.withColumn("filename", input_file_name())
        .where(col("filename") == "your-filename"))
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, e.g. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from pushdown predicates in AWS Glue, as in the sketch below.
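With that layout in place, a pushdown predicate lets the job read only the partitions it needs. A sketch assuming a catalog table partitioned by year/month/day (database, table, and partition values are placeholders):

# Placeholder database, table, and partition values.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only the files under the matching partition are read from S3.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_glue_database",
    table_name="my_catalog_table",
    push_down_predicate="year == '2023' and month == '01' and day == '15'",
)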
I have a partitioned table in BigQuery that gets refreshed weekly with CSV files from Google Cloud Storage. During the refresh process the entire table is deleted and rewritten.
A new column was added to the CSV files and the table structure was updated accordingly, but now during the refresh I get a column mismatch error because the old CSV files don't contain the new column.
Is there a way to use gsutil or the bq command line to smoothly add the column to the old CSV files during the import into the partitioned table, without having to download each file to my local computer and alter the CSVs one by one?
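One option that avoids touching the old files at all: if the new column is nullable and sits at the end of the schema, loading with allow_jagged_rows makes BigQuery fill the missing trailing column with NULLs. A hedged sketch with the BigQuery Python client (table, bucket, and write disposition are placeholders); the bq load equivalent is the --allow_jagged_rows flag:

# Placeholder table and bucket names; assumes the new column is nullable
# and sits at the end of the table schema.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    allow_jagged_rows=True,   # old files missing the trailing column load it as NULL
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/refresh/*.csv",
    "my-project.my_dataset.my_partitioned_table",
    job_config=job_config,
)
load_job.result()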