WriteToBigQuery dynamic table destinations return wrong tableId - Python

I am trying to write to BigQuery with different table destinations, and I would like to create the tables dynamically if they don't already exist.
bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(lambda e: compute_table_name(e),
schema=compute_table_schema,
additional_bq_parameters=additional_bq_parameters,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)
The function compute_table_name is actually quite simple; I am just trying to get it to work.
def compute_table_name(element):
    if element['table'] == 'table_id':
        del element['table']
        return "project_id:dataset.table_id"
The schema is detected correctly and the table IS created and populated with records. The problem is, the table ID I get is something along the lines of:
datasetId: 'dataset'
projectId: 'project_id'
tableId: 'beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP...
I have also tried returning a bigquery.TableReference object in my compute_table_name function to no avail.
EDIT: I am using apache-beam 2.34.0 and I have opened an issue on JIRA here

Your pipeline code is fine. However, you can just pass compute_table_name as the callable directly:
bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(compute_table_name,
schema=compute_table_schema,
additional_bq_parameters=additional_bq_parameters,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)
The 'beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP' table name in BigQuery probably means that the load job either has not finished yet or that it finished with errors; you should check the "Personal history" or "Project history" tabs in the BigQuery UI to see the status of the job.
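If you prefer checking from code instead of the UI tabs, here is a minimal sketch using the google-cloud-bigquery client to inspect recent load jobs; the project name is a placeholder, not from the original answer:

from google.cloud import bigquery

client = bigquery.Client(project="project_id")  # placeholder project

# List recent jobs and print the status of load jobs only.
for job in client.list_jobs(max_results=20):
    if job.job_type == "load":
        print(job.job_id, job.state, job.error_result)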

I have found the solution to my problem by following this answer. It feels like a workaround because I'm not passing a callable to WriteToBigQuery(). Testing a bunch of approaches, I found that passing a string/TableReference to the method directly worked, but passing it a callable did not.
I process ~50 gigs of data every 15 minutes spread across 6 tables and it works decently.
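For reference, a minimal sketch of the workaround described above, passing a static table spec string or a TableReference instead of a callable (all names are the placeholders from the question):

from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition
from apache_beam.io.gcp.internal.clients import bigquery

# Static destination built once, instead of a per-element callable.
table_ref = bigquery.TableReference(
    projectId='project_id',
    datasetId='dataset',
    tableId='table_id')

bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(
    table_ref,  # or simply the string "project_id:dataset.table_id"
    schema=compute_table_schema,
    write_disposition=BigQueryDisposition.WRITE_APPEND,
    create_disposition=BigQueryDisposition.CREATE_IF_NEEDED)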

Related

Dataflow with beam.dataframe.io.read_fwf: missing PTransforms

I'm facing a problem and I have no clue how to fix it.
I have a pipeline in batch mode. I read a file using the method beam.dataframe.io.read_fwf; however, all of the following PTransforms are ignored, and I wonder why.
As you can see, my pipeline ended up having one step:
But my code has the following pipeline:
#LOAD FILE RNS
elements = p | 'Load File' >> beam.dataframe.io.read_fwf('gs://my_bucket/files/202009.txt', header=None, colspec=col_definition, dtype=str, keep_default_na=False, encoding='ISO-8859-1')
#PREPARE VALUES (BULK INSERT)
Script_Values = elements | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())
#GROUP ALL VALUES
Grouped_Values = Script_Values | 'Grouping values' >> beam.GroupByKey()
#BULK INSERT INTO POSTGRESQL GROUPING BASED
Inserted = Grouped_Values | 'Insert PostgreSQL' >> beam.Map(functionMapInsert)
Do you know what I am doing wrong?
Kindly,
Juliano
I think the problem has to do with the fact that, as you can see in the Apache Beam documentation, beam.dataframe.io.read_fwf returns a deferred Beam dataframe representing the contents of the file, not a PCollection.
You can embed DataFrames in a pipeline and convert between them and PCollections with the help of the functions defined in the apache_beam.dataframe.convert module.
The SDK documentation provides an example of this setup, fully described in Github.
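Along those lines, a minimal sketch of converting the deferred DataFrame from read_fwf into a PCollection; the column layout and names here are made-up placeholders, and Prepare_Bulk_Insert is the DoFn from the question:

import apache_beam as beam
from apache_beam.dataframe.io import read_fwf
from apache_beam.dataframe.convert import to_pcollection

col_definition = [(0, 10), (10, 20)]   # hypothetical fixed-width layout
col_names = ['field_a', 'field_b']     # hypothetical column names

with beam.Pipeline() as p:
    # read_fwf returns a deferred Beam DataFrame, not a PCollection.
    df = p | 'Load File' >> read_fwf(
        'gs://my_bucket/files/202009.txt',
        header=None,
        names=col_names,
        colspecs=col_definition,
        dtype=str,
        keep_default_na=False,
        encoding='ISO-8859-1')

    # Convert the deferred DataFrame into a PCollection of row objects
    # so that regular PTransforms (ParDo, GroupByKey, ...) can follow.
    rows = to_pcollection(df)

    # Downstream PTransforms now apply normally, e.g. the question's DoFn:
    prepared = rows | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())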
I think it is also worth trying DataframeTransform; perhaps it is more suitable for being integrated into the pipeline with the help of a schema definition.
In relation to this last suggestion, please consider reviewing this related SO question, especially the answer from @robertwb and the excellent linked Google Slides document; I think it can be helpful.
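A hedged sketch of that DataframeTransform route, assuming the data has been given a schema first (the field names here are made up):

import apache_beam as beam
from apache_beam.dataframe.transforms import DataframeTransform

with beam.Pipeline() as p:
    rows = p | beam.Create([
        beam.Row(name='a', value=1),
        beam.Row(name='b', value=0),
    ])

    # DataframeTransform lifts a schema-aware PCollection into a deferred
    # DataFrame, applies the pandas-style function, and converts back.
    filtered = rows | DataframeTransform(lambda df: df[df.value > 0])

    filtered | beam.Map(print)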

Apache Beam To BigQuery

I am building a process in Google Cloud Dataflow that will consume messages from Pub/Sub and, based on the value of one key, will either write them to BQ or to GCS. I am able to split the messages, but I am not sure how to write the data to BigQuery. I've tried using beam.io.gcp.bigquery.WriteToBigQuery, but no luck.
My full code is here: https://pastebin.com/4W9Vu4Km
Basically, my issue is that I don't know how to specify in WriteBatchesToBQ (line 73) that the variable element should be written into BQ.
I've also tried using beam.io.gcp.bigquery.WriteToBigQuery directly in the pipeline (line 128), but then I got the error AttributeError: 'list' object has no attribute 'items' [while running 'Write to BQ/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)']. This is probably because I am not feeding it a dictionary, but a list of dictionaries (I would like to use 1-minute windows).
Any ideas please? (also if there is something too stupid in the code, let me know - I am playing with apache beam just for a short time and I might be overlooking some obvious issues).
A sample format for WriteToBigQuery is given below:
project_id = "proj1"
dataset_id = 'dataset1'
table_id = 'table1'
table_schema = ('id:STRING, reqid:STRING')
| 'Write-CH' >> beam.io.WriteToBigQuery(
table=table_id,
dataset=dataset_id,
project=project_id,
schema=table_schema,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
))
You can refer to this case; it will give you a brief understanding of a Beam data pipeline.
The second approach is the solution to this issue: you need to use the WriteToBigQuery function directly in the pipeline. However, a beam.FlatMap step needs to be included so that WriteToBigQuery can process the list of dictionaries correctly.
Hence the complete pipeline that splits the data, groups it by time, and writes it into BQ is defined like this:
accepted_messages = (
    tagged_lines_result[Split.OUTPUT_TAG_BQ]
    | "Window into BQ" >> GroupWindowsIntoBatches(window_size)
    | "FlatMap" >> beam.FlatMap(lambda elements: elements)
    | "Write to BQ" >> beam.io.gcp.bigquery.WriteToBigQuery(
        table=output_table_bq,
        schema=output_table_bq_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
The complete working code is here: https://pastebin.com/WFwBvPcU

Writing to BigQuery dynamic table name Python SDK

I'm working on an ETL which is pulling data from a database, doing minor transformations, and outputting to BigQuery. I have written my pipeline in Apache Beam 2.26.0 using the Python SDK. I'm loading a dozen or so tables, and I'm passing their names as arguments to beam.io.WriteToBigQuery.
Now, the documentation says that (https://beam.apache.org/documentation/io/built-in/google-bigquery):
When writing to BigQuery, you must supply a table schema for the destination table that you want to write to, unless you specify a create disposition of CREATE_NEVER.
I believe this is not exactly true. In my tests I saw that this is the case only when passing a static table name.
If you have a bunch of tables and want to pass a table name as an argument then it throws an error:
ErrorProto message: 'No schema specified on job or table.'
My code:
bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
table=lambda row: bg_config[row['table_name']],
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)
bq_data is a dict per row of a pandas DataFrame, where I have a column table_name.
bq_config is a dictionary where key = row['table_name'] and the value is of the format:
[project_id]:[dataset_id].[table_id]
Anyone have some thoughts on this?
Have a look at this thread; I addressed it there. In short, I used Python's built-in time/date functions to render the variable before executing the BigQuery API request.
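Another option worth mentioning: when table is a callable, WriteToBigQuery also accepts a callable for schema, so each dynamic destination can resolve its own schema and the job no longer runs without one. A hedged sketch, where schema_map is a hypothetical dict keyed by the rendered table spec:

bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
    table=lambda row: bq_config[row['table_name']],
    schema=lambda table: schema_map[table],  # hypothetical per-table schemas
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)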

How to format postgreSQL queries in a python script for better readability?

I have another question that is related to a project I am working on for school. I have created a PostgreSQL database, with 5 tables and a bunch of rows. I have created a script that allows a user to search for information in the database using a menu, as well as adding and removing content from one of the tables.
When displaying a table in the PostgreSQL CLI itself, it looks pretty clean; however, when my script displays even a simple table with no user input, it looks really messy.
I have tried a variety of potential solutions that I have seen online, even a few from Stack Overflow, but none of them work. Whenever I try to use any of the methods I have seen and somewhat understand, I always get the error:
TypeError: 'int' object is not subscriptable
I added a bunch of print statements to my code to try to figure out why it refuses to typecast. Knowing me, it is probably a simple typo that I can't see. I'm not even sure this solution will work; it's just one of the examples I saw online.
try:
    connection = psycopg2.connect(database='Blockbuster36', user='dbadmin')
    cursor = connection.cursor()
except psycopg2.DatabaseError:
    print("No connection to database.")
    sys.exit(1)

cursor.execute("select * from Customer;")
tuple = cursor.fetchone()
List_Tuple = list(tuple)

print("Customer_ID | First_Name | Last_Name | Postal_Code | Phone_Num | Member_Date")
print(List_Tuple)
print()

for item in List_Tuple:
    print(item[0], " "*(11-len(str(item[0]))), "|")
    print(item)
    print(type(item))
    print()
    num = str(item[0])
    print(num)
    print(type(num))
    print(str(item[0]))
    print(type(str(item[0])))

cursor.close()
connection.close()
I uploaded the difference between the output I get through a basic python script and in the PostgreSQL CLI. I have blocked out names in the tables for privacy reasons. https://temporysite.weebly.com/
It doesn't have to look exactly like PostgreSQL, but anything that looks better than the current mess would be great.
Use string formatting to do that. You can also set it to pad right or left.
As for the dates, use datetime.strftime.
The following would set the padding to 10 places:
print("{:10}|{:10}".format(item[0], item[1]))
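A fuller sketch along those lines, using the cursor from the question, fetching all rows and padding each column (the column widths and date format are assumptions):

import datetime

cursor.execute("select * from Customer;")
rows = cursor.fetchall()

header = ("Customer_ID", "First_Name", "Last_Name",
          "Postal_Code", "Phone_Num", "Member_Date")
print(" | ".join("{:<12}".format(h) for h in header))

for row in rows:
    cells = []
    for value in row:
        if isinstance(value, datetime.date):
            # Format dates/timestamps; adjust the format string as needed.
            cells.append("{:<12}".format(value.strftime("%Y-%m-%d")))
        else:
            cells.append("{:<12}".format(str(value)))
    print(" | ".join(cells))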

Updating records in MongoDB through pymongo leads to deletion of most of them

I'm working with a remote MongoDB database in my Python code. The code accessing the database and the database itself are on two different machines. The pymongo module version I'm using is 1.9+.
The script consists of the following code:
for s in coll.find({"somefield.a_date": {"$exists": False},
                    "somefield.b_date": {"$exists": False}}):
    original = s['details']['c_date']
    utc = from_tz.localize(original).astimezone(pytz.utc)
    s['details']['c_date'] = utc

    if str(type(s['somefield'])) != "<type 'dict'>":
        s['somefield'] = {}

    s['somefield']['b_date'] = datetime.utcnow()
    coll.update({'_id': s['_id']}, s)
After running this code, a strange thing happened. There were millions of records in the collection initially, and after running the script just 29% of the total records remained; the rest were automatically deleted. Is there any known issue with PyMongo driver version 1.9+?
What could have been other reasons for this, and are there any ways I can find out what exactly happened?
What could have been other reasons for this, and are there any ways I can find out what exactly happened?
The first thing to check is: "Were there any exceptions?"
In coll.update(), you are not setting the safe parameter. If there is an error on the update, it will not be raised.
In your code you do not catch exceptions (which is suggested), and your update does not check for errors, so you have no way of knowing what's going on.
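A minimal sketch of what that could look like with the legacy 1.x API used in the question; safe=True makes the server acknowledge the write, so failures raise instead of passing silently:

from pymongo.errors import OperationFailure

try:
    coll.update({'_id': s['_id']}, s, safe=True)
except OperationFailure as e:
    print("update failed for", s['_id'], ":", e)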
The second thing to check is: "How are you counting?"
The update command can "blank out" data, but it cannot delete data (or change an _id).
Do you have a copy of the original data? Can you run your code on a small subset, say 10 or 100 of those documents, and see what happens?
What you describe is not normal with any of the MongoDB drivers. We definitely need more data to resolve this issue.
