Fixing 413 Request Entity Too Large Errors - Python

When I tried to save data to a Cloudant database from Watson Studio using a Spark & Python environment, I got this error: "HTTPError: 413 Client Error: Request Entity Too Large document_too_large". How can I fix error 413?
My code:
from cloudant.client import Cloudant

def db_data_writing(credential: dict, db_name: str, files: dict):
    """
    DB data writing

    This function reaches the online database and writes the input files as a new document.

    Parameters
    ----------
    credential: dict
        credentials to access the online Cloudant database instance
    db_name: str
        name of the database to write the data to
    files: dict
        json containing the data to write to the db

    Returns
    -------
    bool
        exit status
    """
    # Feed credentials and establish a connection
    client = Cloudant.iam(credential['username'], credential['apikey'], connect=True)
    # Select the database instance to write the results to
    db = client[db_name]
    # Write and save the document
    db.create_document(files).save()
    # Shut down the connection
    client.disconnect()
    return True
credential = {xxxxxxxxxxxxxxxxxx}
db_name = "xxxxxxxxx"

for k in range(len(dataPreparation.index)):  # dataPreparation is a DataFrame
    print(dataPreparation.loc[k])
    db_data_writing(credential, db_name, dataPreparation.loc[k])
Thanks

Cloudant imposes limits on document sizes and also on the total request body size. A document must not be larger than 1MB, and a request body must be no larger than 11MB. If you exceed either of those, you will get a 413 error back.
These limits are there "for your own protection". Even using documents approaching 1MB is somewhat suboptimal -- you really should try to keep documents within a few kB to hit the performance sweet spot when using Cloudant.
https://cloud.ibm.com/docs/Cloudant?topic=Cloudant-documents
So to work around this you need to revisit your data model and ensure your documents are smaller than 1MB.
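One practical check before each write is to measure the serialized size of the row and refuse (or split) anything close to the limit. Below is a minimal sketch, assuming the DataFrame row converts cleanly with to_dict(); the threshold constant and helper name are illustrative, not part of your code:

import json

MAX_DOC_BYTES = 1_000_000  # Cloudant's 1 MB per-document limit

def is_within_limit(row) -> bool:
    # Serialize the row the same way it would be stored and measure the result
    doc = row.to_dict()
    return len(json.dumps(doc, default=str).encode("utf-8")) <= MAX_DOC_BYTES

for k in range(len(dataPreparation.index)):
    row = dataPreparation.loc[k]
    if is_within_limit(row):
        db_data_writing(credential, db_name, row.to_dict())
    else:
        # Too large for a single document: rework the data model, e.g. split
        # the row into several smaller documents or move bulky fields elsewhere
        print('Row %d exceeds the 1 MB document limit; split it before writing' % k)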

Related

I am trying to build a Snowflake connector for Kafka. Is there a way the connector can handle events like update, delete and create?

I am using the Python connector for the transformations and for pushing data to Snowflake, but insertions into Snowflake are taking a long time, since I have to handle the transformations message by message to preserve the original sequence. The existing Kafka Snowflake connector can flatten the JSON messages but cannot handle the other events.
I am looking for faster ways to transfer and transform Kafka JSON messages into Snowflake tables.
You have two options in this case IMO.
Write some consumer code that will transform the data from the original kafka topic (input topic) and then write it to a new topic (output topic). Then you can use the Kafka snowflake connector to write to snowflake.
or
Write the consumer code to do the transform and then write the data directly to snowflake in that consumer.
Option 1 has an extra step and requires using the kafka connector, so something extra to manage. Option 2 sounds like what you are currently trying. Using option 1 would allow you to leverage the community maintained snowflake connector, which is probably quite efficient. You will need to use partitions in your kafka topic to get higher throughput.
Regardless of whether you choose option 1 or 2, it sounds like you need to write a consumer for your Kafka topic to transform the data first. In that case I would recommend using a stream processor so you do not have to manage the complexities of state, recovery and parallelism. If you are set on using Python, the options are Bytewax or Faust. Below is some code using Bytewax to transform Kafka topic data.
import json

from bytewax.dataflow import Dataflow
from bytewax.execution import spawn_cluster
from bytewax.inputs import KafkaInputConfig
from bytewax.outputs import KafkaOutputConfig


def deserialize(key_bytes__payload_bytes):
    key_bytes, payload_bytes = key_bytes__payload_bytes
    key = json.loads(key_bytes) if key_bytes else None
    payload = json.loads(payload_bytes) if payload_bytes else None
    return key, payload


def my_transformations(data):
    ### WRITE YOUR PYTHON TRANSFORMATION CODE HERE ###
    return data


def serialize_with_key(key_payload):
    key, payload = key_payload
    new_key_bytes = key if key else json.dumps("my_key").encode("utf-8")
    return new_key_bytes, json.dumps(payload).encode("utf-8")


flow = Dataflow()
flow.input("inp", KafkaInputConfig(
    brokers=["localhost:9092"],
    topic="input_topic",
))
flow.map(deserialize)
flow.map(my_transformations)
flow.map(serialize_with_key)
flow.capture(KafkaOutputConfig(
    brokers=["localhost:9092"],
    topic="output_topic",
))

if __name__ == "__main__":
    spawn_cluster(flow, proc_count=2, worker_count_per_proc=1)

Spark Repartition Issue

Good day everyone,
I'm working on a project where I'm running an ETL process over millions of data records with the aid of Spark (2.4.4) and PySpark.
We're fetching huge compressed CSV files from an S3 bucket in AWS, converting them into Spark DataFrames, applying the repartition() method and converting each piece into parquet data to lighten and speed up the process:
for file in files:
    if not self.__exists_parquet_in_s3(self.config['aws.output.folder'] + '/' + file, '.parquet'):
        # Run the parquet converter
        print('**** Processing %s ****' % file)
        # TODO: make the number of repartitions a variable
        df = SparkUtils.get_df_from_s3(self.spark_session, file, self.config['aws.bucket']).repartition(94)
        s3folderpath = 's3a://' + self.config['aws.bucket'] + \
                       '/' + self.config['aws.output.folder'] + \
                       '/%s' % file + '/'
        print('Writing down process')
        df.write.format('parquet').mode('append').save(
            '%s' % s3folderpath)
        print('**** Saving %s completed ****' % file)
        df.unpersist()
    else:
        print('Parquet files already exist!')
So as a first step this piece of code checks whether these parquet files already exist in the S3 bucket; if not, it enters the for loop and runs all the transformations.
Now, let's get to the point. This pipeline works fine with every CSV file except one, which is identical to the others except for being much heavier, even after the repartition and conversion to parquet (29 MB x 94 parts vs 900 kB x 32 parts).
This causes a bottleneck after some time during the process (which is divided into identical cycles, where the number of cycles is equal to the number of repartitions), raising a Java heap space error after several warnings:
WARN TaskSetManager: Stage X contains a task of very large size (x KB). The maximum recommended size is 100 KB.
The most logical solution would be to further increase the repartition parameter to lower the weight of each parquet file, BUT it does not let me create more than 94 partitions; after some time during the for loop mentioned above it raises this error:
ERROR FileFormatWriter: Aborting job 8fc9c89f-dccd-400c-af6f-dfb312df0c72.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: HGC6JTRN5VT5ERRR, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: 7VBu4mUEmiAWkjLoRPruTiCY3IhpK40t+lg77HDNC0lTtc8h2Zi1K1XGSnJhjwnbLagN/kS+TpQ=
Or also, in other runs, a second type of issue shows up, flagged by a warning in the logs.
What I noticed is that I can partition below the original value: using 16 as the parameter instead of 94 runs fine, but if I increase it above 94, the original value, it won't work.
Remember that this pipeline works perfectly until the end with the other (lighter) CSV files; the only variable here seems to be the input file (its size in particular), which seems to make it stop after some time. If you need any other details please let me know, I'll be extremely glad if you can help me with this. Thank you everyone in advance.
I'm not sure what the logic in your SparkUtils is, but based on the code and log you provided it doesn't look related to your resources or partitioning; it may be caused by the connection between your Spark application and S3:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403
403 means your credentials don't have access to the bucket/file you are trying to read or write. The Hadoop documentation on authentication lists several cases that can cause this error: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#Authentication_Failure. Since you see this error during the loop but not at the beginning of the job, check how long your Spark job runs, and also the IAM and session authentication, as it may be caused by session expiration (the default is 1 hour, depending on how your DevOps team set it up); for details see: https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html.
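If the cause does turn out to be expiring session credentials, one workaround is to hand the temporary credentials (including the session token) to the S3A connector explicitly and refresh them before they lapse. A rough sketch, assuming you have already obtained access_key, secret_key and session_token from your own STS call, and that spark_session is the session used in your job (all of these names are placeholders):

# Configure S3A to use temporary (session) credentials
hadoop_conf = spark_session.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.aws.credentials.provider',
                'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
hadoop_conf.set('fs.s3a.access.key', access_key)
hadoop_conf.set('fs.s3a.secret.key', secret_key)
hadoop_conf.set('fs.s3a.session.token', session_token)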

BigQuery: too many table dml insert operations for this table

I'm trying to import more than 200M records from different computers (n=20) into my BigQuery table via the Python client. Each computer runs a job (with multiple rows) every 10 seconds.
from google.cloud import bigquery
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.getcwd() + '/api.json'

print('Import job started, total rows:' + str(len(queries)))
client = bigquery.Client()
for q in queries:
    results = client.query(q)
    for err in results:
        print(err)
But I'm getting the following error:
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too
many table dml insert operations for this table. For more information,
see https://cloud.google.com/bigquery/troubleshooting-errors
The data are generated at run time, so I have to import them at run time. I'm also not sure if BigQuery is a good fit for that; Spanner seems better but would cost me too much.
How can I avoid this error?
Thank you very much.
There are 4 major ways to insert data into BigQuery tables.
Batch load a set of data records.
Stream individual records or batches of records.
Use queries to generate new data and append or overwrite the results to a table.
Use a third-party application or service.
I think you are using the 3rd option, which is a DML INSERT. It's not designed for large-scale, high-frequency data loading use cases.
In your use case, it seems the 2nd option, streaming data, could be a good fit.
Example
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of table to append to.
# table_id = "your-project.your_dataset.your_table"

rows_to_insert = [
    {u"full_name": u"Phred Phlyntstone", u"age": 32},
    {u"full_name": u"Wylma Phlyntstone", u"age": 29},
]

errors = client.insert_rows_json(table_id, rows_to_insert)  # Make an API request.
if errors == []:
    print("New rows have been added.")
else:
    print("Encountered errors while inserting rows: {}".format(errors))
You could see more details here.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
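If streaming inserts do not fit your quota or cost constraints, the 1st option (batch loading) also avoids the DML rate limit, because load jobs have much more generous per-table limits than DML statements. A hedged sketch using the client library's JSON load helper (table_id is a placeholder, as above):

from google.cloud import bigquery

client = bigquery.Client()
# table_id = "your-project.your_dataset.your_table"

rows = [
    {"full_name": "Phred Phlyntstone", "age": 32},
    {"full_name": "Wylma Phlyntstone", "age": 29},
]

# One load job per batch of rows instead of one DML INSERT per query
job = client.load_table_from_json(rows, table_id)
job.result()  # wait for the load job to complete
print("Loaded {} rows into {}".format(job.output_rows, table_id))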

AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files from S3 to an RDS instance. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of inserting multiple records, I want Glue to try to update that record if it notices a field has changed; each record has a unique id. Is this possible?
I followed a similar approach to the one suggested as the 2nd option by Yuriy: get the existing data as well as the new data, do some processing to merge the two, and write with overwrite mode. The following code should give you an idea of how to solve this problem.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# Get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# Now merge the two data frames (add a dropDuplicates() on the unique id if
# the same record can appear in both)
merged_df = dst_df.union(src_df)

# Finally save the data to the destination with OVERWRITE mode
merged_df.write.format('jdbc').options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl
).mode("overwrite").save()
Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift you could use postactions to implement a Redshift merge operation. However, that's not possible for other JDBC sinks (AFAIK).
Alternatively, in your ETL script you can load the existing data from the database and filter out existing records before saving, for example with a left anti join as sketched below. However, if your DB table is big, the job may take a while to process it.
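A rough sketch of that filtering step, reusing src_df, dst_df and the JDBC options from the snippet above; the column name "id" is an assumption, not taken from the question:

# Keep only incoming rows whose id is not already present in the destination
new_rows_df = src_df.join(dst_df.select('id'), on='id', how='left_anti')

new_rows_df.write.format('jdbc').options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl
).mode('append').save()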
Another approach is to write into a staging table with mode 'overwrite' first (replacing the existing staging data) and then make a call to the DB via its API to copy only the new records into the final table.
I have used INSERT INTO table ... ON DUPLICATE KEY ... for UPSERTs into an Aurora RDS instance running the MySQL engine. Maybe this can be a reference for your use case. We cannot do it through JDBC, since only the APPEND, OVERWRITE and ERROR modes are currently supported.
I am not sure which RDS database engine you are using; the following is an example of MySQL UPSERTs.
Please see this reference, where I have posted a solution using INSERT INTO TABLE ... ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
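For reference, an upsert of that shape against a MySQL-compatible RDS/Aurora instance could be issued from plain Python with pymysql, as in the sketch below; the table, columns and connection details are made up for illustration:

import pymysql

# Placeholder connection details and schema -- adjust to your RDS instance
conn = pymysql.connect(host='my-rds-host', user='admin',
                       password='secret', database='mydb')

upsert_sql = """
    INSERT INTO target_table (id, name, value)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), value = VALUES(value)
"""

rows = [(1, 'a', 10), (2, 'b', 20)]  # e.g. collected from your Glue DataFrame

with conn.cursor() as cur:
    cur.executemany(upsert_sql, rows)
conn.commit()
conn.close()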

Redis get data using python

I have done the following to get JSON file data into Redis using this Python script:
import json
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=1)

with open('products.json') as data_file:
    test_data = json.load(data_file)

r.set('test_json', test_data)
When I use the get command from redis-cli (get test_json) I get nil back.
I must be using the wrong command?
Please help me understand what is going on here.
You should use hmset instead of set and hgetall instead of get to store multiple keys; your code should look like:
r.hmset('test_json', test_data) #to set multiple index data
r.hgetall('test_json') #to get multiple index data
I deleted my previous answer. I didn't notice that the problem there is that you specified 'db=1' in the Redis constructor, so you are saving the data in db 1. Type 'select 1' in the redis client, or remove that argument from the constructor (by default, redis-cli connects to database 0).
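If you would rather keep the whole document as a single value instead of a hash, another common pattern is to serialize it yourself and deserialize on read; just remember the script writes to db 1, so run 'select 1' in redis-cli before 'get test_json'. A small sketch with the same connection settings as above:

import json
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=1)

with open('products.json') as data_file:
    test_data = json.load(data_file)

# Store the whole JSON document as one string value...
r.set('test_json', json.dumps(test_data))

# ...and parse it back when reading
restored = json.loads(r.get('test_json'))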
