Get text of geographical binary in PySpark - python

I'm reading a parquet file generated from a MySQL table using AWS DMS service. This table has a field with the type Point (WKB). When I read this parquet file, Spark recognizes it as binary type, as per the code below:
file_dataframe = sparkSession.read.format('parquet')\
    .option('encoding', 'UTF-8')\
    .option('mode', 'FAILFAST')\
    .load(file_path)
file_dataframe.schema
And this is the result:
StructType(List(StructField(id,LongType,true),...,StructField(coordinate,BinaryType,true),...))
I tried casting the column to string, but this is what I get:
file_dataframe = file_dataframe.withColumn('coordinates_str', file_dataframe.coordinate.astype("string"))
file_dataframe.select('coordinates_str').show()
+--------------------+
|     coordinates_str|
+--------------------+
|            U�...|
|            U�...|
|            U�...|
|            U�...|
|         #G U�...|
|         #G U�...|
|         #G U�...|
|          G U�...|
|          G U�...|
|          G U�...|
|          G U�...|
+--------------------+
This is what the field looks like in MySQL; if I right-click the BLOB, I can see its value in the pop-up window.
What I'm interested in is getting the POINT (-84.1370849609375 9.982019424438477) that I see in the MySQL viewer as a string column in a Spark DataFrame. Is this possible? I've been Googling about it, but haven't been able to find anything that gets me on the right track.

Try this:
from pyspark.sql.functions import col, decode

file_dataframe.withColumn('coordinates_str', decode(col('coordinate'), 'US-ASCII'))

After Googling some more and reading the MySQL documentation on the Point datatype, I found a documentation page that mentions that points are stored as WKB or WKT, which are standards for storing geographical locations.
Many engines support these formats, and a specific set of functions makes it easy to get the text representation as well as perform many other geographical operations.
Nevertheless, after some research I found that Spark does not have these functions built in; it seems to require some further configuration to expose them, so I ended up using AWS Athena, which does have built-in support. The Athena geospatial documentation can be found here.
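If staying in Spark is a requirement, one possible approach (not from the original answers) is a Python UDF built on the shapely library. This is only a sketch: it assumes shapely is installed on the executors and that the exported bytes are MySQL's internal geometry format, i.e. a 4-byte SRID prefix followed by standard WKB.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from shapely import wkb

@udf(returnType=StringType())
def wkb_to_wkt(value):
    if value is None:
        return None
    raw = bytes(value)
    # MySQL's internal geometry format is a 4-byte SRID followed by standard WKB;
    # strip the prefix if it is present, otherwise parse the bytes directly.
    try:
        return wkb.loads(raw[4:]).wkt
    except Exception:
        return wkb.loads(raw).wkt

file_dataframe = file_dataframe.withColumn('coordinates_str', wkb_to_wkt('coordinate'))

If those assumptions hold, coordinates_str should contain text like POINT (-84.1370849609375 9.982019424438477).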

Related

WriteToBigQuery Dynamic table destinations returns wrong tableId

I am trying to write to BigQuery with different table destinations, and I would like to create the tables dynamically if they don't already exist.
bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(lambda e: compute_table_name(e),
schema=compute_table_schema,
additional_bq_parameters=additional_bq_parameters,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)
The function compute_table_name is quite simple actually; I am just trying to get it to work.
def compute_table_name(element):
    if element['table'] == 'table_id':
        del element['table']
    return "project_id:dataset.table_id"
The schema is detected correctly and the table IS created and populated with records. The problem is, the table ID I get is something along the lines of:
datasetId: 'dataset'
projectId: 'project_id'
tableId: 'beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP...
I have also tried returning a bigquery.TableReference object in my compute_table_name function to no avail.
EDIT: I am using apache-beam 2.34.0 and I have opened an issue on JIRA here
Your pipeline code is fine. However, you can just pass the callable compute_table_name directly:
bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(compute_table_name,
schema=compute_table_schema,
additional_bq_parameters=additional_bq_parameters,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)
The 'beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP...' table name in BigQuery probably means that the load job either has not finished yet or has errors; you should check the "Personal history" or "Project history" tabs in the BigQuery UI to see the status of the job.
I found the solution by following this answer. It feels like a workaround because I'm not passing a callable to WriteToBigQuery(). Testing a bunch of approaches, I found that giving a string/TableReference to the method directly worked, but giving it a callable did not.
I process ~50 GB of data every 15 minutes spread across 6 tables and it works decently.
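For reference, a minimal sketch of that workaround, with a rendered table spec passed straight to WriteToBigQuery instead of a callable; the spec value is a placeholder taken from the question, and the import line is an assumption about the surrounding setup.

from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition

# rendered up front instead of computed per element
table_spec = "project_id:dataset.table_id"

bigquery_rows | "Writing to Bigquery" >> WriteToBigQuery(table_spec,
    schema=compute_table_schema,
    additional_bq_parameters=additional_bq_parameters,
    write_disposition=BigQueryDisposition.WRITE_APPEND,
    create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
)

The trade-off is that a single static spec cannot route different elements to different tables, which is what the callable was meant to do.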

Dataflow with beam.dataframe.io.read_fwf: missing PTransforms

I'm facing a problem. I have no clue how to fix it.
I have a pipeline in batch mode. I read a file using the method beam.dataframe.io.read_fwf; however, all the following PTransforms are ignored, and I wonder why.
As you can see, my pipeline ended up having only one step, but my code defines the following pipeline:
#LOAD FILE RNS
elements = p | 'Load File' >> beam.dataframe.io.read_fwf('gs://my_bucket/files/202009.txt', header=None, colspec=col_definition, dtype=str, keep_default_na=False, encoding='ISO-8859-1')
#PREPARE VALUES (BULK INSERT)
Script_Values = elements | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())
#GROUP ALL VALUES
Grouped_Values = Script_Values | 'Grouping values' >> beam.GroupByKey()
#BULK INSERT INTO POSTGRESQL GROUPING BASED
Inserted = Grouped_Values | 'Insert PostgreSQL' >> beam.Map(functionMapInsert)
Do you know what I am doing wrong?
Kindly,
Juliano
I think the problem has to do with the fact that, as you can see in the Apache Beam documentation, beam.dataframe.io.read_fwf returns a deferred Beam dataframe representing the contents of the file, not a PCollection.
You can embed DataFrames in a pipeline and convert between them and PCollections with the help of the functions defined in the apache_beam.dataframe.convert module.
The SDK documentation provides an example of this setup, fully described in Github.
I think it is also worth trying DataframeTransform; perhaps it is more suitable for being integrated in the pipeline with the help of a schema definition.
In relation to this last suggestion, please consider reviewing this related SO question, especially the answer from robertwb and the excellent linked Google Slides document; I think it can be helpful.
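A minimal sketch of that conversion for the pipeline above, assuming the rest of the code stays as posted; note that to_pcollection yields schema'd (named-tuple-like) rows, so Prepare_Bulk_Insert may need to read fields as attributes rather than as dataframe rows.

from apache_beam.dataframe.convert import to_pcollection

# Convert the deferred dataframe back into a PCollection so the downstream
# transforms actually become part of the pipeline graph.
elements_pc = to_pcollection(elements)

Script_Values = elements_pc | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())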

Writing to BigQuery dynamic table name Python SDK

I'm working on an ETL which pulls data from a database, does minor transformations, and outputs to BigQuery. I have written my pipeline in Apache Beam 2.26.0 using the Python SDK. I'm loading a dozen or so tables, and I'm passing their names as arguments to beam.io.WriteToBigQuery.
Now, the documentation says that (https://beam.apache.org/documentation/io/built-in/google-bigquery):
When writing to BigQuery, you must supply a table schema for the destination table that you want to write to, unless you specify a create disposition of CREATE_NEVER.
I believe this is not exactly true. In my tests I saw that this is the case only when passing a static table name.
If you have a bunch of tables and want to pass the table name as an argument, then it throws an error:
ErrorProto message: 'No schema specified on job or table.'
My code:
bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
table=lambda row: bg_config[row['table_name']],
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)
bq_data contains dicts built from the rows of a pandas DataFrame, each with a table_name column.
bg_config is a dictionary where the key is row['table_name'] and the value is of the format:
[project_id]:[dataset_id].[table_id]
Anyone have some thoughts on this?
Have a look at this thread; I addressed it there. In short, I used Python's built-in time/date functions to render a variable before executing the BigQuery API request.
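Roughly, the approach from that thread looks like the following sketch; the project, dataset, and date-suffixed table names are illustrative assumptions, not values from the question.

from datetime import datetime

# render the destination up front with Python's date functions instead of a callable
table_suffix = datetime.utcnow().strftime("%Y%m%d")
table_spec = "project_id:dataset_id.table_" + table_suffix  # hypothetical names

bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
    table=table_spec,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)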

How to partially overwrite blob in an sqlite3 db in python via SQLAlchemy?

I have a db that contains a blob column with the binary representation as follows.
The value I'm interested in is encoded as a little-endian unsigned long long (8-byte) value in the marked area. Reading this value works fine like this:
from struct import unpack

p = session.query(Properties).filter((Properties.object_id==1817012) & (Properties.name.like("%OwnerUniqueID"))).one()
id = unpack("<Q", p.value[-8:])[0]
id in the above example is 1657266.
Now what I would like to do is the reverse: I have the row object p and a number in decimal format (using the same 1657266 for testing purposes), and I want to write that number in little-endian format to those same 8 bytes.
I've been trying to do so via SQL statement
UPDATE properties SET value = (SELECT substr(value, 1, length(value)-8) || x'b249190000000000' FROM properties WHERE object_id=1817012 AND name LIKE '%OwnerUniqueID%') WHERE object_id=1817012 AND name LIKE '%OwnerUniqueID%'
But when I do it like that, I can't read it anymore, at least not with SQLAlchemy. When I try the same code as above, I get the error message Could not decode to UTF-8 column 'properties_value' with text '☻', so it looks like it's written in a different format.
Interestingly using a normal select statement in DB Browser still works fine and the blob is still displayed exactly as in the screenshot above.
Now ideally I'd like to be able to write just those 8 bytes using the SQLAlchemy ORM but I'd settle for a raw SQL statement if that's what it takes.
I managed to get it to work with SQLAlchemy by basically reversing the process that I used to read it. In hindsight, using + to concatenate and [:-8] to slice the correct part seems pretty obvious.
from struct import pack
p = session.query(Properties).filter((Properties.object_id==1817012) & (Properties.name.like("%OwnerUniqueID"))).one()
p.value = p.value[:-8] + pack("<Q", 1657266)
session.commit()  # persist the patched blob
By turning on ECHO for SQLAlchemy I got the following raw SQL statement:
UPDATE properties SET value=? WHERE properties.object_id = ? AND properties.name = ?
(<memory at 0x000001B93A266A00>, 1817012, 'BP_ThrallComponent_C.OwnerUniqueID')
This is not particularly helpful if you want to do the same thing manually, I suppose.
It's worth noting that the raw SQL statement in my question not only works as far as reading it with DB Browser is concerned, but also with the game client that uses the db in question. It's only SQLAlchemy that seems to have trouble, apparently trying to decode it as UTF-8.
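For completeness, a sketch of doing the same patch manually with raw SQL and a bound parameter (so the new bytes are stored as a BLOB rather than re-encoded as text); the table and filter values are copied from the question, and the struct usage mirrors the accepted approach.

from struct import pack
from sqlalchemy import text

# read the current blob, patch the last 8 bytes in Python, then write it back
row = session.execute(
    text("SELECT value FROM properties WHERE object_id = :oid AND name LIKE :name"),
    {"oid": 1817012, "name": "%OwnerUniqueID%"},
).fetchone()
new_value = row[0][:-8] + pack("<Q", 1657266)
session.execute(
    text("UPDATE properties SET value = :val WHERE object_id = :oid AND name LIKE :name"),
    {"val": new_value, "oid": 1817012, "name": "%OwnerUniqueID%"},
)
session.commit()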

How to use a SELECT query inside a Python UDF for Redshift?

I tried uploading modules to Redshift through S3, but it always says no module found. Please help.
CREATE OR REPLACE FUNCTION olus_layer(subs_no varchar)
RETURNS varchar VOLATILE AS
$$
import plpydbapi

dbconn = plpydbapi.connect()
cursor = dbconn.cursor()
cursor.execute("SELECT count(*) from busobj_group.olus_usage_detail")
d = cursor.fetchall()
dbconn.close()
return d
$$
LANGUAGE plpythonu;
You cannot do this in Redshift, so you will need to find another approach.
1) See here for UDF constraints: http://docs.aws.amazon.com/redshift/latest/dg/udf-constraints.html
2) See here: http://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html
especially this part:
"Important: Amazon Redshift blocks all network access and write access to the file system through UDFs."
This means that even if you try to get around the restriction, it won't work!
If you don't know an alternative way to get what you need, you should ask a new question specifying exactly what your challenge is and what you have tried (leave this question and answer here for future reference by others).
You can't connect to the database from inside a UDF; Python functions are scalar in Redshift, meaning they take one or more values and return only one output value.
However, if you want to execute a function against a set of rows, try using the LISTAGG function to build an array of values or objects (if you need multiple properties) into a large string (beware of the string size limitation), pass it to the UDF as a parameter, and parse/loop inside the function.
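As an illustration of that suggestion, a hypothetical UDF body that parses a LISTAGG-built string instead of querying the database; the comma delimiter and the surrounding CREATE FUNCTION wrapper are assumptions.

def parse_listagg(listagg_value):
    # listagg_value is assumed to be a comma-delimited string built by LISTAGG
    # on the SQL side and passed in as a single varchar argument
    if not listagg_value:
        return 0
    values = listagg_value.split(',')
    # loop over the values here instead of connecting to the database
    return len(values)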
Amazon has recently announced the support for Stored Procedures in Redshift. Unlike a user-defined function (UDF), a stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. Along with that, it also supports looping and conditional expressions, to control logical flow.
https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-overview.html
