I have two tables, Document and Picture. The relationship is that one document can have several pictures. What should happen is that once a document is uploaded to the PostgreSQL database, the document should be downloaded, transformed into a JPEG, and then uploaded to the Picture table.
I'm using sqlalchemy and flask in my application.
So far I tried using events to trigger a method after insert. Unfortunately, I am receiving the error sqlalchemy.exc.ResourceClosedError: This transaction is closed when I commit.
The code:
from app.models.ex_model import Document, Picture
from pdf2image import convert_from_bytes
from sqlalchemy import event
import io
import ipdb
from app.core.app_setup import db


@event.listens_for(Document, 'after_insert')
def receive_after_insert(mapper, connection, target):
    document = target.document
    images = convert_from_bytes(document, fmt='jpeg')
    images_bytes = map(lambda img: convert_img_to_byte(img), images)
    map(lambda img_byte: upload_picture(img_byte, target.id), images_bytes)
    db.session.commit()


def convert_img_to_byte(img):
    img_byte = io.BytesIO()
    img.save(img_byte, format='jpeg')
    img_byte = img_byte.getvalue()
    return img_byte


def upload_picture(img_byte, document_id):
    picture = Picture(picture=img_byte, document_id=document_id)
    db.session.add(picture)
As the documentation of the Session.add method states:

Its state will be persisted to the database on the next flush operation. Repeated calls to add() will be ignored.

So your add() calls should be followed by a session.flush() call:
...
def upload_picture(img_byte, document_id):
    picture = Picture(picture=img_byte, document_id=document_id)
    db.session.add(picture)
    db.session.flush()
Furthermore, I would pay attention to the performance of inserting the records. There is a good article on this point in the official docs: https://docs.sqlalchemy.org/en/13/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow
The current approach is not the fastest one, so I would choose either the sqlalchemy_orm_bulk_insert or the sqlalchemy_core approach from that comparison.
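As an illustration only, here is a minimal sketch of the ORM bulk-insert variant, assuming images_bytes is a list of JPEG byte strings and document_id is the parent row's id (both names are placeholders mirroring the question's code):

# Sketch: bulk-insert the Picture rows instead of adding them one by one.
# `images_bytes` and `document_id` are assumed to exist as described above.
picture_rows = [
    {"picture": img_byte, "document_id": document_id}
    for img_byte in images_bytes
]
db.session.bulk_insert_mappings(Picture, picture_rows)
db.session.flush()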
Related
I have a Django app which creates collections in MongoDB automatically. But when I tried to integrate delete functionality, the collections I try to drop with it are not deleted, while the automatically created collections are edited successfully. The method below is called from another file, with all the parameters.
An interesting thing to note is that when I manually tried to delete via the Python shell, it worked. The goal is to delete the collections that are not required anymore.
import pymongo
from .databaseconnection import retrndb  # credentials from another file, all admin rights are given

mydb = retrndb()


class Delete():
    def DeleteData(postid, name):
        PostID = postid
        tbl = name + 'Database'
        liketbl = PostID + 'Likes'
        likecol = mydb[liketbl]
        pcol = mydb[tbl]
        col = mydb['fpost']
        post = {"post_id": PostID}
        ppost = {"PostID": PostID}
        result1 = mydb.commentcol.drop()  # this doesn't work
        result2 = mydb.likecol.drop()     # this doesn't work
        print(result1, '\n', result2)     # returns None for both
        try:
            col.delete_one(post)    # this works
            pcol.delete_one(ppost)  # this works
            return False
        except Exception as e:
            return e
Any solutions? I have been trying to solve this for a week.
Should I change the database engine, since Django doesn't support NoSQL natively? I have already written whole custom scripts that do CRUD using pymongo.
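For reference, a minimal sketch of how pymongo drops a collection whose name is held in a variable; note that attribute access such as mydb.likecol refers to a collection literally named "likecol", not to the Python variable likecol. The connection string, database name and collection name below are placeholders:

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder connection
mydb = client["mydatabase"]                                 # placeholder database name

collection_name = "42Likes"       # e.g. a dynamically built name like liketbl above
mydb[collection_name].drop()      # drops the collection named in the variable
# equivalent:
mydb.drop_collection(collection_name)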
I just started playing around with BigQuery, and I am trying to pass the dataset ID to the Python client. It should be a pretty basic operation, but I can't find it on other threads.
In practice I would like to take the following example
# import packages
import os
from google.cloud import bigquery

# set current work directory to the one with this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))

# initialize client object using the bigquery key I generated from Google Cloud
google_credentials_path = 'bigquery-stackoverflow-DC-fdb49371cf87.json'
client = bigquery.Client.from_service_account_json(google_credentials_path)

# create simple query
query_job = client.query(
    """
    SELECT
      CONCAT(
        'https://stackoverflow.com/questions/',
        CAST(id as STRING)) as url,
      view_count
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE tags like '%google-bigquery%'
    ORDER BY view_count DESC
    LIMIT 10"""
)

# store results in dataframe
dataframe_query = query_job.result().to_dataframe()
and make it look something like
# import packages
import os
from google.cloud import bigquery

# set current work directory to the one with this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))

# initialize client object using the bigquery key I generated from Google Cloud
google_credentials_path = 'bigquery-stackoverflow-DC-fdb49371cf87.json'
client = bigquery.Client.from_service_account_json(google_credentials_path)\
    .A_function_to_specify_id(bigquery-public-data.stackoverflow)

# create simple query
query_job = client.query(
    """
    SELECT
      CONCAT(
        'https://stackoverflow.com/questions/',
        CAST(id as STRING)) as url,
      view_count
    FROM `posts_questions`  -- No dataset ID here anymore
    WHERE tags like 'google-bigquery'
    ORDER BY view_count DESC
    LIMIT 10"""
)

# store results in dataframe
dataframe_query = query_job.result().to_dataframe()
The documentation eludes me, so any help would be appreciated.
The closest thing to what you're asking for is the default_dataset (reference) property of the query job config. The query job config is an optional object that can be passed into the query() method of the instantiated BigQuery client.
You don't set a default dataset as part of instantiating a client, as not all resources are dataset scoped. You're implicitly working with a query job in your example, which is a project-scoped resource.
So, to adapt your sample a bit, it might look something like this:
# skip the irrelevant bits like imports and client construction
job_config = bigquery.QueryJobConfig(default_dataset="bigquery-public-data.stackoverflow")
sql = "SELECT COUNT(1) FROM posts_questions WHERE tags like 'google-bigquery'"
dataframe = client.query(sql, job_config=job_config).to_dataframe()
If you're issuing multiple queries against this same dataset you could certainly reuse the same job config object with multiple query invocations.
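A small hedged sketch of that reuse (the queries themselves are illustrative and assume the same client and public dataset as above):

# Reuse one QueryJobConfig, carrying the default dataset, across several queries.
job_config = bigquery.QueryJobConfig(
    default_dataset="bigquery-public-data.stackoverflow"
)

top_questions = client.query(
    "SELECT title, view_count FROM posts_questions ORDER BY view_count DESC LIMIT 10",
    job_config=job_config,
).to_dataframe()

tagged_count = client.query(
    "SELECT COUNT(1) AS n FROM posts_questions WHERE tags LIKE '%google-bigquery%'",
    job_config=job_config,
).to_dataframe()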
I need to test a function with different parameters, and the most proper way for this seems to be using the with self.subTest(...) context manager.
However, the function writes something to the db, and it ends up in an inconsistent state. I can delete the things I write, but it would be cleaner if I could recreate the whole db completely. Is there a way to do that?
Not sure how to recreate the database in self.subTest() but I have another technique I am currently using which might be of interest to you. You can use fixtures to create a "snapshot" of your database which will basically be copied in a second database used only for testing purposes. I currently use this method to test code on a big project I'm working on at work.
I'll post some example code to give you an idea of what this will look like in practice, but you might have to do some extra research to tailor the code to your needs (I've added links to guide you).
The process is rather straightforward. You would be creating a copy of your database with only the data needed by using fixtures, which will be stored in a .yaml file and accessed only by your test unit.
Here is what the process would look like:
List the items you want to copy to your test database to populate it using fixtures. This will only create a db with the needed data instead of copying the entire db. The result is stored in a .yaml file.
generate.py

import sys

import django
from django.core.management import call_command

# FixtureAnonymiser is a custom stdout wrapper from the original project
# (not shown here); it anonymises the dumped data before writing it to the file.
# Import or define it in your own code, and make sure DJANGO_SETTINGS_MODULE is set.

django.setup()

stdout = sys.stdout
conf = [
    {
        'file': 'myfile.yaml',
        'models': [
            dict(model='your.model', pks='your, primary, keys'),
            dict(model='your.model', pks='your, primary, keys'),
        ]
    }
]

for fixture in conf:
    print('Processing: %s' % fixture['file'])
    with open(fixture['file'], 'w') as f:
        sys.stdout = FixtureAnonymiser(f)
        for model in fixture['models']:
            call_command('dumpdata', model.pop('model'), format='yaml', indent=4, **model)
        sys.stdout.flush()
    sys.stdout = stdout
In your test unit, import your generated .yaml file as a fixture, and your tests will automatically use the data from the fixture to carry out the tests, keeping your main database untouched.
test_class.py

from django.test import TestCase


class classTest(TestCase):

    fixtures = ('myfile.yaml',)

    def setUp(self):
        """setup tests cases"""
        # create the object you want to test here, which will use data from the fixtures

    def test_function(self):
        self.assertEqual(True, True)
        # write your test here
You can read up more here:
Django
YAML
If you have any questions because things are unclear just ask, I'd be happy to help you out.
Maybe my solution will help someone
I used transactions to roll back to the database state that I had at the start of the test.
I use Eric Cousineau's decorator function to parametrize the tests.
More about database transactions can be found on the Django documentation page.
import functools

from django.db import transaction
from django.test import TransactionTestCase
from django.contrib.auth import get_user_model

User = get_user_model()


def sub_test(param_list):
    """Decorates a test case to run it as a set of subtests."""
    def decorator(f):
        @functools.wraps(f)
        def wrapped(self):
            for param in param_list:
                with self.subTest(**param):
                    f(self, **param)
        return wrapped
    return decorator


class MyTestCase(TransactionTestCase):

    @sub_test([
        dict(email="new@user.com", password='12345678'),
        dict(email="new@user.com", password='password'),
    ])
    def test_passwords(self, email, password):
        # open a transaction
        with transaction.atomic():
            # Creates a new savepoint. Returns the savepoint ID (sid).
            sid = transaction.savepoint()
            # create user and check if there is only one with this email in the DB
            user = User.objects.create(email=email, password=password)
            self.assertEqual(User.objects.filter(email=user.email).count(), 1)
            # Rolls back the transaction to savepoint sid.
            transaction.savepoint_rollback(sid)
I have a flask application, which I want to use to do some data visualization of a very large dataset that I have in a postgresql database.
I have sqlalchemy, plotly, google maps set up, but my current problem is that I want to fetch a rather large dataset from my database and pass that to a template that will take care of doing some different visualizations. I only want to load the data in once, when the site is started, and then avoid running it later on when navigating the other pages.
Which methods can I use to achieve this? And how can I access the data in the other pages (templates)?
I read somewhere about sessions and tried to implement that, but then I got a "not serializable" error, which I fixed by converting my objects to JSON, only to get a "too large header" error instead. I feel like I'm running out of possibilities.
Just to show some code, here is what I'm currently messing around with.
Models.py:
from sqlalchemy import BigInteger, Column, JSON, Text
from app import db


class Table910(db.Model):
    __tablename__ = 'table_910'

    id = db.Column(BigInteger, primary_key=True)
    did = db.Column(Text)
    timestamp = db.Column(BigInteger)
    data = db.Column(JSON)
    db_timestamp = db.Column(BigInteger)

    def items(self):
        return {'id': self.id, 'did': self.did, 'timestamp': self.timestamp,
                'data': self.data, 'db_timestamp': self.db_timestamp}

    def __repr__(self):
        return '<1_Data %r, %r>' % (self.did, self.id)
Views.py
from flask import render_template, session
from app import app, db, models
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import json
import plotly


@app.route('/')
@app.route('/index')
def index():
    # fetching from the database (takes ~50 seconds to run)
    rows = models.Table910.query \
        .filter_by(did='357139052424715') \
        .filter((models.Table910.timestamp > 1466846920000) & (models.Table910.timestamp < 1467017760000)) \
        .order_by(models.Table910.timestamp.asc())

    #
    # managing and drawing data on plots
    #

    return render_template('index.html',
                           ids=ids,
                           graphJSON=graphJSON,
                           rows=rows,
                           routeCoordinates=routeCoordinates)
EDIT:
I might have explained my problem a little too vaguely, so to elaborate: I want to use one specific page to load the data, and then have another page take care of drawing the graphs. The best procedure I can think of right now would be to press a button to get the data and, once the data is loaded, press another button to draw the graphs. How can I achieve this?
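A rough sketch of that two-step flow, purely as an illustration (the /load and /graphs routes, the graphs.html template and the module-level cache are made-up choices, and an in-process cache like this only works for a single-process server):

from flask import render_template, jsonify
from app import app, models

# Module-level cache: filled once by /load, read by /graphs.
_cached_rows = None


@app.route('/load')
def load_data():
    global _cached_rows
    if _cached_rows is None:
        query = models.Table910.query \
            .filter_by(did='357139052424715') \
            .order_by(models.Table910.timestamp.asc())
        _cached_rows = [row.items() for row in query]
    return jsonify(count=len(_cached_rows))


@app.route('/graphs')
def graphs():
    # Assumes /load has been hit first; otherwise there is nothing to draw yet.
    return render_template('graphs.html', rows=_cached_rows or [])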
I've read the mongoengine documentation about switching the collection to save a document to. I tested this code and it worked successfully:
from mongoengine import Document, StringField
from mongoengine.context_managers import switch_collection


class Group(Document):
    name = StringField()


Group(name="test").save()  # Saves in the default collection

with switch_collection(Group, 'group2000') as Group:
    Group(name="hello Group 2000 collection!").save()  # Saves in group2000 collection
But the problem is that when I want to find the saved document in the switched collection, switch_collection doesn't work at all.
with switch_collection(Group, 'group2000') as GroupT:
    GroupT.objects.get(name="hello Group 2000 collection!")  # Finds in group2000 collection
As of mongoengine==0.10.0, mongoengine.context_managers.switch_collection(cls, collection_name),
used as "with switch_collection(Group, 'group1') as Group:" in the example,
doesn't work inside functions. It gives an UnboundLocalError. A simple workaround with the existing resources is:
To get:
new_group = Group.switch_collection(Group(),'group1')
from mongoengine.queryset import QuerySet
new_objects = QuerySet(Group,new_group._get_collection())
Use new_objects.all() to get all objects etc.
To save:
group_obj = Group()
group_obj.switch_collection('group2')
group_obj.save()
Although Prachetos Sadhukhan's answer works for me, I prefer to get the collection directly, not relying on the private _get_collection method:
from mongoengine import connection
new_group_collection = connection.get_db()['group1']
from mongoengine.queryset import QuerySet
new_objects = QuerySet(Group, new_group_collection)