I have done the following to load JSON file data into Redis using this Python script:
import json
import redis
r = redis.StrictRedis(host='127.0.0.1', port=6379, db=1)
with open('products.json') as data_file:
    test_data = json.load(data_file)
r.set('test_json', test_data)
When I use the GET command from redis-cli (GET test_json), I get (nil) back.
Am I using the wrong command?
Please help me understand what is going on here.
To store multiple fields under one key, use hmset instead of set and hgetall instead of get. Your code should look like:
r.hmset('test_json', test_data) #to set multiple index data
r.hgetall('test_json') #to get multiple index data
I deleted my previous answer. I hadn't noticed that the problem here is that you specified db=1 in the Redis constructor, so you are saving the data in database 1. Type SELECT 1 in redis-cli, or remove that argument from the constructor (by default, redis-cli connects to database 0).
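As a side note, redis-py's set expects a string or bytes value (recent versions raise a DataError when given a dict), so it is worth serializing the data first. A minimal sketch, assuming the same products.json and database 1:

import json
import redis

# Connect to the same database the original script used (db=1).
r = redis.StrictRedis(host='127.0.0.1', port=6379, db=1)

with open('products.json') as data_file:
    test_data = json.load(data_file)

# Serialize the parsed JSON back to a string so SET accepts it.
r.set('test_json', json.dumps(test_data))

# Reading it back returns bytes; parse to recover the structure.
restored = json.loads(r.get('test_json'))

After that, running SELECT 1 followed by GET test_json in redis-cli shows the stored JSON string.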
I created an API using FastAPI that returns JSON. At first I turned the DataFrame into JSON using the pandas .to_json() method, which let me choose the correct "orient" parameter. This saved a .json file, which I then opened and returned from FastAPI as follows:
DATA2.to_json("json_records.json", orient="records")
with open('json_records.json', 'r') as f:
    data = json.load(f)
return(data)
This worked perfectly, but I was told that my script shouldn't save any files since it will be running on my company's server, so I have to turn the DataFrame into JSON directly and return it. I tried this:
data = DATA2.to_json(orient="records")
return(data)
But now the API's output is a JSON string full of "\" escapes. I guess it is being serialized a second time somewhere, but I can't really find the proper way to do it.
The output now looks like this:
"[{\"ExtraccionHora\":\"12:53:00\",\"MiembroCompensadorCodigo\":117,\"MiembroCompensadorDescripcion\":\"OMEGA CAPITAL S.A.\",\"CuentaCompensacionCodigo\":\"1143517\",\"CuentaNeteoCodigo\":\"160234117\",\"CuentaNeteoDescripcion\":\"UNION FERRO SRA A\",\"ActivoDescripcion\":\"X17F3\",\"ActivoID\":8,\"FinalidadID\":2,\"FinalidadDescripcion\":\"Margenes\",\"Cantidad\":11441952,\"Monto\":-16924935.3999999985,\"Saldo\":-11379200.0,\"IngresosVerificados\":11538288.0,\"IngresosNoVerificado\":0.0,\"MargenDelDia\":0.0,\"SaldoConsolidadoFinal\":-16765847.3999999985,\"CuentaCompensacionCodigoPropia\":\"80500\",\"SaldoCuentaPropia\":-7411284.3200000003,\"Resultado\":\"0\",\"MiembroCompensadorID\":859,\"CuentaCompensacionID\":15161,\"CuentaNeteoID\":7315285}.....
What would be a proper way of turning my dataframe into a JSON using the "records" orient, and then returning it as the FastAPI output?
Thanks!
Update: I changed the to_json() method to to_dict() with the same parameters and it seems to work... I don't know if it's correct.
data = DATA2.to_dict(orient="records")
return(data)
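The to_dict(orient="records") version is reasonable: FastAPI serializes the list of dicts once, so the backslashes disappear. If you would rather keep to_json(), one sketch (the endpoint name here is made up, and DATA2 is assumed to be the same DataFrame as in the question) is to wrap the string in a Response so it is not JSON-encoded a second time:

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/records")
def get_records():
    # DATA2.to_json() already returns a JSON string; returning it inside a
    # Response stops FastAPI from serializing it again (the cause of the "\" escapes).
    json_str = DATA2.to_json(orient="records")
    return Response(content=json_str, media_type="application/json")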
I am using the Python connector for the transformations and for pushing data to Snowflake, but the inserts into Snowflake are slow, since I have to handle the transformations message by message to preserve the original ordering. The existing Kafka Snowflake connector can flatten the JSON messages but cannot handle the other events.
I am looking for faster ways to transfer and transform Kafka JSON messages into Snowflake tables.
You have two options in this case IMO.
Write some consumer code that will transform the data from the original Kafka topic (input topic) and then write it to a new topic (output topic). Then you can use the Kafka Snowflake connector to write to Snowflake.
or
Write the consumer code to do the transform and then write the data directly to snowflake in that consumer.
Option 1 has an extra step and requires using the Kafka connector, so it is something extra to manage. Option 2 sounds like what you are currently trying. Using option 1 would let you leverage the community-maintained Snowflake connector, which is probably quite efficient. You will need to use partitions in your Kafka topic to get higher throughput.
Regardless of whether you choose option 1 or 2, it sounds like you need to write a consumer for your Kafka topic to transform the data first. In that case I would recommend using a stream processor so you do not need to manage the complexities of state, recovery, and parallelism. If you are set on using Python, the options are Bytewax or Faust. Below is some code using Bytewax to transform Kafka topic data.
import json

from bytewax.dataflow import Dataflow
from bytewax.execution import spawn_cluster
from bytewax.inputs import KafkaInputConfig
from bytewax.outputs import KafkaOutputConfig


def deserialize(key_bytes__payload_bytes):
    key_bytes, payload_bytes = key_bytes__payload_bytes
    key = json.loads(key_bytes) if key_bytes else None
    payload = json.loads(payload_bytes) if payload_bytes else None
    return key, payload


def my_transformations(data):
    ### WRITE YOUR PYTHON TRANSFORMATION CODE HERE ###
    return data


def serialize_with_key(key_payload):
    key, payload = key_payload
    new_key_bytes = key if key else json.dumps("my_key").encode("utf-8")
    return new_key_bytes, json.dumps(payload).encode("utf-8")


flow = Dataflow()
flow.input("inp", KafkaInputConfig(
        brokers=["localhost:9092"],
        topic="input_topic",
    ),
)
flow.map(deserialize)
flow.map(my_transformations)
flow.map(serialize_with_key)
flow.capture(KafkaOutputConfig(
        brokers=["localhost:9092"],
        topic="output_topic",
    )
)

if __name__ == "__main__":
    spawn_cluster(flow, proc_count=2, worker_count_per_proc=1)
I'm using AWS Glue to move multiple files from S3 to an RDS instance. Each day I get a new file in S3 which may contain new data, but may also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of inserting multiple records, I want Glue to update the existing record if it notices a field has changed; each record has a unique id. Is this possible?
I followed the approach suggested as the 2nd option by Yuriy: get the existing data as well as the new data, do some processing to merge the two, and write with overwrite mode. The following code should give you an idea of how to solve this problem.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# Get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# Now merge the two data frames and drop exact duplicates
merged_df = dst_df.union(src_df).dropDuplicates()

# Finally, save data to the destination with OVERWRITE mode
merged_df.write.format('jdbc').options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl
).mode("overwrite").save()
Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift you could use postactions to implement a Redshift merge operation. However, that's not possible for other JDBC sinks (AFAIK).
Alternatively, in your ETL script you can load the existing data from the database to filter out existing records before saving. However, if your DB table is big, the job may take a while to process it.
Another approach is to write into a staging table with mode 'overwrite' first (replacing the existing staging data) and then make a call to the DB to copy only the new records into the final table.
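A rough sketch of that staging-table idea, assuming a MySQL-compatible RDS target and made-up names (staging_table, final_table, my-rds-host, field_a/field_b; pymysql for the follow-up call; src_df and the dest_* variables reuse the names from the answer above):

import pymysql

# 1. Replace the staging table's contents with the freshly extracted data.
src_df.write.format('jdbc').options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable='staging_table'
).mode("overwrite").save()

# 2. Merge staging into the final table; the unique key decides insert vs. update.
#    field_a / field_b are placeholders for the columns that can change.
conn = pymysql.connect(host='my-rds-host', user=dest_user_name,
                       password=dest_password, database='my_db')
with conn.cursor() as cur:
    cur.execute("""
        INSERT INTO final_table
        SELECT * FROM staging_table AS s
        ON DUPLICATE KEY UPDATE
            field_a = s.field_a,
            field_b = s.field_b
    """)
conn.commit()
conn.close()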
I have used INSERT INTO table ... ON DUPLICATE KEY for UPSERTs into an Aurora RDS instance running the MySQL engine. Maybe this can serve as a reference for your use case. We cannot use a plain JDBC write since only the APPEND, OVERWRITE, and ERROR modes are currently supported.
I am not sure which RDS database engine you are using; the following is an example of MySQL UPSERTs.
Please see this reference, where I have posted a solution using INSERT INTO table ... ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
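For reference, a minimal sketch of that row-wise UPSERT from a Glue/Spark job, again with made-up names (my_table, id, field_a, field_b, my-rds-host) and assuming a MySQL-compatible target with a unique key on id:

import pymysql

# Collect the transformed rows from the Spark DataFrame (fine for modest volumes).
rows = [(r['id'], r['field_a'], r['field_b']) for r in merged_df.collect()]

upsert_sql = """
    INSERT INTO my_table (id, field_a, field_b)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE
        field_a = VALUES(field_a),
        field_b = VALUES(field_b)
"""

conn = pymysql.connect(host='my-rds-host', user=dest_user_name,
                       password=dest_password, database='my_db')
with conn.cursor() as cur:
    cur.executemany(upsert_sql, rows)  # apply the upsert to every collected row
conn.commit()
conn.close()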
I am using redis-py, and whenever I store a list or a dict in the cache, the get function returns a string. How do I get back the original data type?
cache = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)
cache.set("posts",[["bob","My first post"],["mary","My second post"]])
cache.get("post")
>>>"[["bob","My first post"],["mary","My second post"]]"
Is this something that I have to do manually?
The list of lists is your issue, as Redis doesn't like nested structures.
Try converting to JSON before storing and converting back on access.
Your issue is quite similar to how to store a complex object in redis (using redis-py)
In the 3rd answer (from CivFan) an example is given that translates pretty directly to what you appear to be trying to do. For reference, here is the code snippet provided in that answer:
import json
import redis
r = redis.StrictRedis(host='localhost', port=6379, db=0)
images = [
    {'type':'big', 'url':'....'},
    {'type':'big', 'url':'....'},
    {'type':'big', 'url':'....'},
]
json_images = json.dumps(images)
r.set('images', json_images)
unpacked_images = json.loads(r.get('images'))
images == unpacked_images
There are some additional points to consider worth reading in the linked question as well.
If you are storing standard JSON-compatible data, you should use the json module to serialize it before putting it into Redis.
import json
val = json.dumps([["bob","My first post"],["mary","My second post"]])
redis_cache.set("posts",val)
str_val = redis_cache.get("posts")
obj = json.loads(str_val)
If you want to store arbitrary Python objects, use pickle to serialize:
import pickle
val = pickle.dumps([["bob","My first post"],["mary","My second post"]])
redis_cache.set("posts",val)
str_val = redis_cache.get("posts")
obj = pickle.loads(str_val)
I am working with Azure Functions to create triggers that aggregate data on an hourly basis. The triggers get data from blob storage, and to avoid aggregating the same data twice I want to add a condition that only processes blobs modified within the last hour.
I am using the SDK, and my code for doing this looks like this:
import json
from datetime import datetime, timedelta
from pytz import utc  # assuming the original 'utc' comes from pytz
from azure.storage.blob import BlockBlobService

# Timestamp variables for t.now and t-1
timestamp = datetime.now(tz=utc)
timestamp_negative1hr = timestamp + timedelta(hours=1)

# Read data from input environment
data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('directory')
dataloaded = []
for blob in generator:
    loader = data.get_blob_to_text('collection', blob.name, if_modified_since=timestamp_negative1hr)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        dataloaded.append(json.loads(trackerstatusobject))
When I run this, the error I get is azure.common.AzureHttpError: The condition specified using HTTP conditional header(s) is not met. It also says it is due to a timeout. The blobs are receiving data when I run it, so in any case it is not the correct return message. If I add .strftime("%Y-%m-%d %H:%M:%S:%z") to the end of my timestamp, I get another error: AttributeError: 'str' object has no attribute 'tzinfo'. This must mean that Azure expects a datetime object, but for some reason it is not working for me.
Any ideas on how to solve it? Thanks
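Not a definitive fix, but a sketch of one thing worth checking, assuming the goal is "blobs modified during the last hour": keep the cutoff as a timezone-aware datetime one hour in the past (now minus timedelta(hours=1), not plus), and since if_modified_since makes the service answer "condition not met" for unchanged blobs, it may be simpler to filter on each blob's last_modified before downloading (names reused from the code above):

from datetime import datetime, timedelta
from pytz import utc

# Cutoff one hour in the past, kept as a datetime (not a strftime string),
# since the conditional header expects a timezone-aware datetime.
cutoff = datetime.now(tz=utc) - timedelta(hours=1)

dataloaded = []
for blob in generator:
    # Skip blobs that have not changed since the cutoff instead of letting
    # the service reply with "condition not met".
    if blob.properties.last_modified < cutoff:
        continue
    loader = data.get_blob_to_text('collection', blob.name)
    for trackerstatusobject in loader.content.split('\n'):
        dataloaded.append(json.loads(trackerstatusobject))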