This seems like it should be straightforward, but alas:
I have the following SQLAlchemy query object:
all = db.session.query(label('sid', distinct(Clinical.patient_sid))).all()
I want to serialize the output as [{'sid': 1}, {'sid': 2}, ...].
To do this, I am trying to use the following simple Marshmallow schema:
class TestSchema(Schema):
    sid = fields.Int()
However, when I do
schema = TestSchema()
result = schema.dump(record)
print result
pprint(result.data)
I get:
MarshalResult(data={}, errors={})
{}
for my output.
However, when I select only one row from my query, e.g.,
one_record = db.session.query(label('sid', distinct(Clinical.patient_sid))).first()
I get the desired results:
MarshalResult(data={u'sid': 1}, errors={})
{u'sid': 1}
I DO know the query with .all() is returning data, since when I print it I get a list of tuples:
[(1L,), (2L,), (3L,), ...]
I am assuming Marshmallow can handle a list of tuples since, in the documentation for the serialize method in marshalling.py, it says:
"Takes raw data (a dict, list, or other object) and a dict of..." However, it may be incorrect to assume that a list of tuples counts as either a "list" or an "other object."
I like Marshmallow otherwise, and was hoping to use it as an optimization over serializing my SQLAlchemy output using an iterative method, like:
all = db.session.query(label('sid', distinct(Clinical.patient_sid)))
out = []
for result in all:
    data = {'sid': result.sid}
    out.append(data)
Which, for large record sets, can take a while to process.
EDIT
Even if Marshmallow were able to serialize the entire record set as output by SQLAlchemy, I am not sure I would get any increase in speed, since it looks like it too iterates over the data.
Any suggestions for optimized serialization for the SQLAlchemy output, short of modifying the class definition for Clinical?
The solution to optimize my code was to go directly from my SQLAlchemy query object to a pandas data frame (I forgot to mention that I am doing some heavy lifting in pandas after I get my queried record set).
I thus was able to skip this step
out = []
for result in all:
    data = {'sid': result.sid}
    out.append(data)
by using the read_sql function of pandas as follows:
import pandas as pd
pd.read_sql(all.statement, all.session.bind)
and then doing all my data manipulations and gyrations, thereby shaving off several seconds of processing time.
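A slightly fuller sketch of that flow, using the same query object as above (the final to_dict('records') call is only needed if the list-of-dicts form from the original question is still wanted):
import pandas as pd

query = db.session.query(label('sid', distinct(Clinical.patient_sid)))
df = pd.read_sql(query.statement, query.session.bind)   # DataFrame with one 'sid' column
records = df.to_dict('records')                         # [{'sid': 1}, {'sid': 2}, ...]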
I am using boto3 to query my DynamoDB table using PartiQL,
dynamodb = boto3.client(
'dynamodb',
aws_access_key_id='<aws_access_key_id>',
aws_secret_access_key='<aws_secret_access_key>',
region_name='<region_name>'
)
resp = dynamodb.execute_statement(Statement='SELECT * FROM TryDaxTable')
Now, the response you get contains a list of dictionaries that looks something like this,
{'some_data': {'S': 'XXX'},
'sort_key': {'N': '1'},
'partition_key': {'N': '7'}
}
Along with the attribute name (e.g. partition_key), you get the data type of the value (e.g. 'N') and the actual value (e.g. '7'). Note that the value does not actually come back in the specified data type either (e.g. partition_key is supposed to be a number (N), but its value is returned as a string).
Is there some way I can get my results as a list of dictionaries without the type annotations, and with the values converted to the appropriate Python types?
That would mean something like this,
{'some_data': 'XXX',
'sort_key': 1,
'partition_key': 7
}
Notice that in addition to removing the data types, the values have also been converted to the correct type.
This is a simple record, but more complex ones can have lists and nested dictionaries. More information is available here,
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.execute_statement
Is there some way that I can get the data in the format I desire?
Or has somebody already written a function to parse the data?
I know there are several questions regarding this posted already, but most of them relate to SDKs in other languages. For instance,
AWS DynamoDB data with and/or without types?
I did not find one that has addressed this issue in Python.
Note: I want to continue to use PartiQL to query my table.
If you can use .resource instead of .client, you'll live at a higher abstraction layer.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#service-resource
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('name')
You want to use PartiQL which returns the lower-level data format, so you'll probably have to follow the advice at How to convert a boto3 Dynamo DB item to a regular dictionary in Python?
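For reference, a minimal sketch of that approach with boto3's TypeDeserializer, applied to the execute_statement response shown in the question (the deserializer handles nested lists and maps recursively, and returns numbers as Decimal rather than int):
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def plain_item(item):
    # turn a low-level item ({'partition_key': {'N': '7'}, ...})
    # into a plain dict ({'partition_key': Decimal('7'), ...})
    return {key: deserializer.deserialize(value) for key, value in item.items()}

items = [plain_item(item) for item in resp['Items']]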
I'm writing a Flask-SQLAlchemy application.
I don't understand this, nor can I find a solution. When I write a query, it returns the row twice...
The Class table does indeed have two rows, but they are different.
data = db.session.query(Class).filter_by(Class.id==1).first()
print(data)
<Class>
<Class>
Check whether the data variable is already defined somewhere earlier in the code, resulting in the two printed rows.
The query, as you have written it, should actually result in a TypeError (wrong use of filter_by method).
It should be either:
data = db.session.query(Class).filter_by(id=1).first()
or
data = db.session.query(Class).filter(Class.id==1).first()
I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a dataframe. However, I get an error.
MemoryError:
My code looks like this:
from pandas import DataFrame

cursor = tweets.find()  # where tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)
I've tried the methods in the following answers, which at some point create a list of all the elements of the database before loading it.
https://stackoverflow.com/a/17805626/2297475
https://stackoverflow.com/a/16255680/2297475
However, in another answer which talks about list(), the person said that it's good for small data sets, because everything is loaded into memory.
https://stackoverflow.com/a/13215411/2297475
In my case, I think it's the source of the error. It's too much data to be loaded into memory. What other method can I use?
I've modified my code to the following:
cursor = tweets.find(fields=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)
By adding the fields parameter to the find() function, I restricted the output, so only the selected fields (rather than every field) are loaded into the DataFrame. Everything works fine now.
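(As an aside, the fields argument comes from older PyMongo releases; in PyMongo 3 and later the same restriction is expressed with the projection argument. A rough equivalent of the snippet above:)
from pandas import DataFrame

cursor = tweets.find({}, projection={'id': 1, '_id': 0})
result = DataFrame(list(cursor), columns=['id'])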
The fastest, and likely most memory-efficient, way to create a DataFrame from a MongoDB query, as in your case, would be to use Monary.
This post has a nice and concise explanation.
An elegant way of doing it would be as follows:
import pandas as pd
def my_transform_logic(x):
    # placeholder for whatever per-value processing you need
    if x:
        result = do_something(x)
        return result

def process(cursor):
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(
        lambda value: my_transform_logic(value))
    # write the processed rows back as a list of dictionaries
    db.collection_name.insert_many(df.to_dict('records'))
    # or update existing documents instead, one upsert per record
    # (assuming each record carries its _id):
    # for record in df.to_dict('records'):
    #     db.collection_name.update_one({'_id': record['_id']},
    #                                   {'$set': record}, upsert=True)

# make a list of cursors; see the parallel_scan API of pymongo
cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)
I tried the above process on a MongoDB collection with 2.6 million records, using Joblib to run the code above in parallel. It didn't throw any memory errors, and the processing finished in 2 hours.
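A minimal sketch of what that Joblib wiring could look like, reusing process and mongo_collection from the snippet above (this is an assumption about the setup, not the exact code used; prefer='threads' avoids trying to pickle the PyMongo cursors):
from joblib import Parallel, delayed

cursors = mongo_collection.parallel_scan(6)
Parallel(n_jobs=6, prefer='threads')(delayed(process)(cursor) for cursor in cursors)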
The from_records classmethod is probably the best way to do it:
import pandas as pd
import pymongo

client = pymongo.MongoClient()
data = client.mydb.mycollection.find()  # or client.mydb.mycollection.aggregate(pipeline)
df = pd.DataFrame.from_records(data)
Right now, I am trying to figure out a way to use an SQLite database to create the following dictionary automatically.
produce = {1:[2,4], 2:[1,2,3], 3:[2,3,4]}
The key is a factory; the values attached to it are the products this factory is able to produce. I.e. 1: [2, 4] --> factory 1 is able to produce products 2 and 4.
In my SQLite database I have a table with three fields: idfactory [INT], idproduct [INT], production [BOOL]. I am then using the following code to get the relevant data set:
cur.execute('SELECT idfactory,idproduct FROM production WHERE production=1')
result = cur.fetchall()
My idea would then be to use a loop to fill my dictionary, with something like this:
for idfactory,idproduct in result:
    p[idfactory] = idproduct
However, this code produces an error, and would also be problematic, because there are no unique KEYs in my database.
I am not quite sure if my explanation is sufficient, but any help would be very much appreciated!
The setdefault method of dict may help.
Try this:
p = {}
for idfactory, idproduct in result:
    p.setdefault(idfactory, []).append(idproduct)
defaultdict is faster than using setdefault, and in my opinion, a much cleaner way of performing the same action. Consider using this if you want to assume an empty list for all lookups to unknown keys.
from collections import defaultdict
p = defaultdict(list)
for idfactory, idproduct in result:
    p[idfactory].append(idproduct)
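With the example data from the question, either version ends up with the desired mapping:
print(dict(p))   # {1: [2, 4], 2: [1, 2, 3], 3: [2, 3, 4]}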
I am new to Python and Pyramid. In a test application that I am using to learn more about Pyramid, I want to query a database, create a dictionary based on the results of a SQLAlchemy query object, and finally send the dictionary to the Chameleon template.
So far I have the following code (which works fine), but I wanted to know if there is a better way to create my dictionary.
...
index = 0
clients = {}
q = self.request.params['q']
for client in DBSession.query(Client).filter(Client.name.like('%%%s%%' % q)).all():
    clients[index] = {"id": client.id, "name": client.name}
    index += 1
output = { "clients": clients }
return output
While learning Python, I found a nice way to create a list in a for loop statement like the following:
myvar = [user.name for user in users]
So, the other question I had: is there a similar one-line way, like the above, to create a dictionary from a SQLAlchemy query object?
Thanks in advance.
Well, yes, we can tighten this up a bit.
First, this pattern:
index = 0
for item in seq:
    frobnicate(index, item)
    index += 1
is common enough that there's a builtin function that does it automatically, enumerate(), used like this:
for index, item in enumerate(seq):
    frobnicate(index, item)
But I'm not sure you need it. Associating things with an integer index starting from zero is exactly what a list does; you don't really need a dict for that. Unless you want to have holes, or need some of the other special features of dicts, just do:
stuff = []
stuff.extend(seq)
When you're only interested in a small subset of the attributes of a database entity, it's a good idea to tell SQLAlchemy to emit a query that returns only those columns:
query = DBSession.query(Client.id, Client.name) \
                 .filter(Client.name.contains(q))
In the above I've also shortened the .name.like('%%%s%%' % q) into Client.name.contains(q), since they mean the same thing (SQLAlchemy expands contains() into the corresponding LIKE expression for you).
Queries constructed in this way return a special thing that looks like a tuple, and can be easily turned into a dict by calling _asdict() on it:
So, to put it all together:
output = [row._asdict() for row in DBSession.query(Client.id, Client.name)
                                            .filter(Client.name.contains(q))]
or, if you really desperately need it to be a dict, you can use a dict comprehension:
output = {index: row._asdict()
          for index, row
          in enumerate(DBSession.query(Client.id, Client.name)
                                .filter(Client.name.contains(q)))}
@TokenMacGuy gave a nice and detailed answer to your question. However, I have a feeling you've asked the wrong question :)
You don't need to convert SQLAlchemy objects to dictionaries before passing them to the template - that would be quite inconvenient. You can pass the result of a query as-is and use the SQLAlchemy mapped objects directly in your template:
q = self.request.params['q']
clients = DBSession.query(Client).filter(Client.name.contains(q)).all()
return {'clients': clients}
If you want to turn a SQLAlchemy object into a dict, you can use this code:
from sqlalchemy import orm as sqlalchemy_orm

def obj_to_dict(obj):
    return dict((col.name, getattr(obj, col.name))
                for col in sqlalchemy_orm.class_mapper(obj.__class__).mapped_table.c)
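Used on one of the Client instances from the earlier query, this gives something like the following (the values shown are hypothetical):
client = DBSession.query(Client).first()
obj_to_dict(client)   # e.g. {'id': 1, 'name': 'Acme'}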
There is another attribute of the mapped table that has the relationships in it, but the code gets dicey.
You don't need to cast an object into a dict for any of the template libraries, but if you decide to persist the data (memcached, session, pickle, etc.) you'll either need to use dicts or write some code to 'merge' the persisted data back into the session.
A quick side note: if you render any of this data through JSON, you'll either need a custom JSON renderer that can handle datetime objects, or you'll need to convert those values in a function.