This is how I get all of the topicid field values from the Topics table:
all_topicid = [i.topicid for i in session.query(Topics)]
But when the Topics table has lots of rows, the VPS kills this process. So is there a good method to resolve this?
Thanks everyone. I have edited my code again; it is below:
last = session.query(Topics).order_by('-topicid')[0].topicid
all_topicid = [i.topicid for i in session.query(Topics.topicid)]
all_id = range(1, last+1)
diff = list(set(all_id).difference(set(all_topicid)))
I want to get diff. It is now faster than before. Are there other methods to improve this code?
You could try changing your query to return a list of ids with something like:
all_topic_id = session.query(Topics.topicid).all()
If the table contains duplicate topicids, you could add distinct to the above to return unique values:
from sqlalchemy import distinct
all_topic_id = session.query(distinct(Topics.topicid)).all()
If this still causes an issue, I would probably go for writing a stored procedure that returns the list of topicids and have SQLAlchemy call it.
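A minimal sketch of that last approach, assuming a hypothetical procedure named get_topic_ids (both the name and the MySQL-flavoured CALL syntax are made up here, not from the original post):

from sqlalchemy import text

# hypothetical stored procedure; adjust the name and the CALL
# syntax to whatever your database actually provides
result = session.execute(text("CALL get_topic_ids()"))
all_topicid = [row[0] for row in result]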
For the second part I would do something like the below.
from sqlalchemy import distinct, func

# gets all ids; .all() returns 1-tuples, so flatten them
all_topic_ids = [row[0] for row in session.query(distinct(Topics.topicid)).all()]
# gets the highest id as a plain value
max_id = session.query(func.max(Topics.topicid)).scalar()
# every id that should exist
all_ids = range(1, max_id + 1)
# ids with no matching row
missing_ids = list(set(all_ids) - set(all_topic_ids))
I'm getting data from an API and storing it in a Python dictionary (and then in a list of dictionaries).
I need to do calculations (max, sum, divisions...) on the dictionary data to create extra data to add to the same dictionary/list.
My current code looks like this:
stream = whatever(whatever, whatever)
keywords = []
for batch in stream:
    for row in batch.results:
        max_clicks = max(data_keywords["keywords_clicks"])
        weighted_clicks = sum(data_keywords["keywords_weighted"]) / sum(data_keywords["keywords_clicks"])
        data_keywords = {}
        data_keywords["keywords_text"] = row.ad_group_criterion.keyword.text
        data_keywords["keywords_clicks"] = row.metrics.clicks
        data_keywords["keywords_conversion_rate"] = row.metrics.conversions_from_interactions_rate
        data_keywords["keywords_weighted"] = row.metrics.clicks * row.metrics.conversions_from_interactions_rate
        data_keywords["etv"] = (data_keywords["keywords_clicks"] / max_clicks * data_keywords["keywords_conversion_rate"]) + ((1 - data_keywords["keywords_clicks"] / max_clicks) * weighted_clicks)
        keywords.append(data_keywords)
This doesn't work; it raises UnboundLocalError (local variable 'data_keywords' referenced before assignment). I've tried different options and got different errors.
data_keywords["etv"] is what I want to calculate ("max_clicks", "weighted_clicks" and data_keywords["keywords_weighted"] are intermediate calculations for that)
The main problem is that I need to calculate max and sum for all values inside the dictionary, then do a calculation using that max and sum for each value and then store the results in the dictionary itself.
So I don't know where to put the code to do the calculations (before the dictionary, inside the dictionary, after the dictionary or a mix)
I guess it should be possible, but I'm a Python/programming newbie and can't figure this out.
It's probably not relevant, but in case you are wondering, I'm trying to create a weighted sort (https://moz.com/blog/build-your-own-weighted-sort). And I can't use models/database to store data.
Thanks!
EDIT: Some extra info, in case it helps you understand better what I need: without the calculations, the keywords list gives results like this:
[{'keywords_text': 'whatever', 'keywords_clicks': 5, 'keywords_conversion_rate': 6.3}, {'keywords_text': 'whatever2', 'keywords_clicks': 50, 'keywords_conversion_rate': 2.3}, {'keywords_text': 'whatever3', 'keywords_clicks': 20, 'keywords_conversion_rate': 2.0}]
Basically, I want to add to this keywords list a new key/value of 'etv': 8.5 or whatever for each keyword. That etv should come from the formula that I put in my code (data_keywords["etv"] = ...), but maybe it needs changes to work in Python.
The info in this "original" keywords list comes directly from the API (I don't have that data stored anywhere), and it works perfectly if I just request the info and store it in that list. The problems come when I introduce the calculations (especially using sum and max inside a loop, I guess).
The UnboundLocalError is because you are trying to access data_keywords["keywords_clicks"] before you have declared data_keywords or set the value for "keywords_clicks".
Also, I think you need to be clearer about what data structure you are trying to create. You mention "a list of dictionaries" which I don't see. Maybe you are trying to create a dictionary of lists, but it looks like you overwrite the dictionary values each time you go through your loop.
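As a minimal illustration of how this error arises (a hypothetical snippet, not from the question's code):

def f():
    print(x)  # raises UnboundLocalError: because x is assigned below,
    x = 1     # Python treats x as local for the *whole* function body,
              # so the print reads an unbound local, not a global

f()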
Adding my response as an answer, as I do not have enough reputation to comment.
To get rid of the assignment error, just move the line data_keywords = {} above max_clicks = max(data_keywords["keywords_clicks"]).
Here you are trying to access a local variable before its declaration. The code in this case is trying to access a global variable which doesn't seem to exist.
stream = whatever(whatever, whatever)
keywords = []
for batch in stream:
    for row in batch.results:
        data_keywords = {}
        max_clicks = max(data_keywords["keywords_clicks"])
        weighted_clicks = sum(data_keywords["keywords_weighted"]) / sum(data_keywords["keywords_clicks"])
        data_keywords["keywords_text"] = row.ad_group_criterion.keyword.text
        data_keywords["keywords_clicks"] = row.metrics.clicks
        data_keywords["keywords_conversion_rate"] = row.metrics.conversions_from_interactions_rate
        data_keywords["keywords_weighted"] = row.metrics.clicks * row.metrics.conversions_from_interactions_rate
        data_keywords["etv"] = (data_keywords["keywords_clicks"] / max_clicks * data_keywords["keywords_conversion_rate"]) + ((1 - data_keywords["keywords_clicks"] / max_clicks) * weighted_clicks)
        keywords.append(data_keywords)
More on that here
You can't refer to elements of the dictionary before you create it. Move those variable assignments down to after you assign the dictionary elements.
for batch in stream:
    for row in batch.results:
        data_keywords = {}
        data_keywords["keywords_text"] = row.ad_group_criterion.keyword.text
        data_keywords["keywords_clicks"] = row.metrics.clicks
        data_keywords["keywords_conversion_rate"] = row.metrics.conversions_from_interactions_rate
        data_keywords["keywords_weighted"] = row.metrics.clicks * row.metrics.conversions_from_interactions_rate
        max_clicks = max(data_keywords["keywords_clicks"])
        weighted_clicks = sum(data_keywords["keywords_weighted"]) / sum(data_keywords["keywords_clicks"])
        data_keywords["etv"] = (data_keywords["keywords_clicks"] / max_clicks * data_keywords["keywords_conversion_rate"]) + ((1 - data_keywords["keywords_clicks"] / max_clicks) * weighted_clicks)
        keywords.append(data_keywords)
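Note that even with the assignments reordered, max() and sum() are still applied to single numbers from one row, which will raise a TypeError. Since those aggregates need to see every row, a two-pass version may be what is actually needed; here is a sketch reusing the question's field names (an outline under those assumptions, not a tested implementation):

keywords = []
# first pass: collect the raw metrics for every row
for batch in stream:
    for row in batch.results:
        keywords.append({
            "keywords_text": row.ad_group_criterion.keyword.text,
            "keywords_clicks": row.metrics.clicks,
            "keywords_conversion_rate": row.metrics.conversions_from_interactions_rate,
            "keywords_weighted": row.metrics.clicks * row.metrics.conversions_from_interactions_rate,
        })

# aggregates over the whole list (assumes at least one click overall)
max_clicks = max(kw["keywords_clicks"] for kw in keywords)
weighted_clicks = (sum(kw["keywords_weighted"] for kw in keywords)
                   / sum(kw["keywords_clicks"] for kw in keywords))

# second pass: derive etv for each keyword from the aggregates
for kw in keywords:
    share = kw["keywords_clicks"] / max_clicks
    kw["etv"] = share * kw["keywords_conversion_rate"] + (1 - share) * weighted_clicks

This keeps the API loop itself unchanged and defers every calculation that depends on the whole list until the list is complete.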
So, I'm building a little tool to save errors and their solutions as a knowledge base. It is stored in a SQL database (I'm using pyodbc). The users don't have access to the database, just the GUI.
The GUI has three buttons: one to add a new ErrorID, one to search for an ErrorID (if it exists in the database), and one to delete.
It also has a text panel where it should show the solution for the searched error.
So, I need to extract the columns and rows of my DB and put them in a dictionary, then I need to run through that dict in search of the error and show its solution on the text panel.
My issue is that the dict that I get has this form: {('Error', 1): ('Solution', 'one')} and so on, so I cannot seem to run successfully through it and show ONLY the word "one" on the text panel.
In other words, when I search "1", it should print "one" on the text panel.
My question is: how can I transform this {('Error', 1): ('Solution', 'one')} INTO this {"1": "one"}?
Edit:
Sorry, I forgot to add some parts of my code.
This part is what appends every row to a dict:
readsql = sql_conn.sql()
readsql.sqlread()
columns = [column[0] for column in readsql.cursorv.description]
results = []
for row in readsql.cursorv.fetchall():
    results.append(zip(columns, row))
results = dict(results)
I tried storing the part of the dict that I know should match in a string named, well, string, and then comparing it to k in the for loop, but it doesn't work.
string = "('Error', " + str(error_number) + ")"
for k in results.keys():
    if k == string:
        post = readsql.cursorv.execute('SELECT * FROM master.dbo.Errors WHERE Error = (?)', (error_number))
        text_area.WriteText(post)
        break
Here is the sql class:
class sql():
    def __init__(self):
        self.conn = pyodbc.connect('Driver={SQL Server};'
                                   'Server=.\SQLEXPRESS;'
                                   'Database=master;'
                                   'Trusted_Connection=yes;')
        # cursor variable
        self.cursorv = self.conn.cursor()

    def sqlread(self):
        self.cursorv.execute('SELECT * FROM master.dbo.Errors')
Your problem comes from the following code unnecessarily zipping the column headers into the resulting dict:
for row in readsql.cursorv.fetchall():
    results.append(zip(columns, row))
results = dict(results)
You can instead construct the desired dict directly from the sequence of tuples returned by the fetchall method:
results = dict(readsql.cursorv.fetchall())
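A short sketch of the resulting lookup, assuming the Errors table has exactly two columns (Error, Solution):

readsql = sql_conn.sql()
readsql.sqlread()

# each row comes back as a 2-item row like (1, 'one'), so dict()
# maps error numbers straight to solutions: {1: 'one', 2: 'two', ...}
results = dict(readsql.cursorv.fetchall())

# look up the searched error number and show only the solution text
solution = results.get(error_number)
if solution is not None:
    text_area.WriteText(solution)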
I am using the Django ORM to query two models. Each model returns a large amount of data (~1500 records per model), and I then iterate through the records and store them in a Python dictionary. This makes the view take a very long time to execute, and the user is left waiting for the page to load while all this processing happens in the view. Is there any way to make this process fast?
meter = ostk_vm_tenant_quota_stats.objects.filter(cluster=site, collected_at=time.strftime("%Y-%m-%d"))
records = []
for record in meter:
    record_dict = {}
    record_dict['cluster'] = record.cluster
    record_dict['tenant'] = record.tenant
    record_dict['instances_limit'] = record.instances_limit
    record_dict['instances_used'] = record.instances_used
    record_dict['vcpu_limit'] = record.vcpu_limit
    record_dict['vcpu_used'] = record.vcpu_used
    record_dict['memory_limit'] = record.memory_limit
    record_dict['memory_used'] = record.memory_used
    record_dict['disk_limit'] = record.disk_limit
    record_dict['disk_used'] = record.disk_used
    records.append(record_dict)

return render_to_response('tabs1.html', {'data': records})
I do the same thing for the other model. meter has a huge number of records which I iterate through to store in the dictionary. Can I make this process faster?
Since (as you mentioned in the comments) record.cluster is a foreign key, every time you access it Django performs another DB lookup. This means you are hitting the DB on every iteration of the loop. Check the docs on select_related for more info.
You can either use select_related to prefetch the related clusters, or use record.cluster_id if you only need the primary key value.
# using select related
meter = ostk_vm_tenant_quota_stats.objects \
.select_related('cluster') \
.filter(cluster=site, collected_at=time.strftime("%Y-%m-%d"))
# or only use the pk value in the result dict
record_dict['cluster'] = record.cluster_id
meter = ostk_vm_tenant_quota_stats.objects.filter(cluster=site,
collected_at=time.strftime("%Y-%m-%d")).values()
will give you a ValuesQuerySet, which is basically a Python list of dictionaries. Each dictionary will contain all of the fields for each item that matches your query; basically exactly the same as what you build above.
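With that, the whole loop in the question could collapse to something like this (field names copied from the question; using cluster_id instead of the cluster object avoids the extra foreign-key lookup, so the template would read cluster_id rather than cluster):

records = list(
    ostk_vm_tenant_quota_stats.objects
    .filter(cluster=site, collected_at=time.strftime("%Y-%m-%d"))
    .values('cluster_id', 'tenant', 'instances_limit', 'instances_used',
            'vcpu_limit', 'vcpu_used', 'memory_limit', 'memory_used',
            'disk_limit', 'disk_used')
)
return render_to_response('tabs1.html', {'data': records})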
I am trying to get a random object from a model A
For now, it is working well with this code:
random_idx = random.randint(0, A.objects.count() - 1)
random_object = A.objects.all()[random_idx]
But I feel this code is better:
random_object = A.objects.order_by('?')[0]
Which one is best? Is there a possible problem with deleted objects when using the first snippet? For example, I can have 10 objects, but the object with id 10 may not exist anymore. Have I misunderstood something in A.objects.all()[random_idx]?
Just been looking at this. The line:
random_object = A.objects.order_by('?')[0]
has reportedly brought down many servers.
Unfortunately, Erwan's code causes an error when accessing non-sequential ids.
There is another short way to do this:
import random
items = list(Product.objects.all())
# change 3 to how many random items you want
random_items = random.sample(items, 3)
# if you want only a single random item
random_item = random.choice(items)
The good thing about this is that it handles non-sequential ids without error.
Improving on all of the above:
from random import choice
pks = A.objects.values_list('pk', flat=True)
random_pk = choice(pks)
random_obj = A.objects.get(pk=random_pk)
We first get a list of potential primary keys without loading any Django object, then we randomly choose one primary key, and then we load the chosen object only.
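Note that values_list returns a lazy QuerySet here; random.choice still works because it only needs len() and indexing, but wrapping the pks in list() makes the single database hit explicit:

from random import choice

pks = list(A.objects.values_list('pk', flat=True))  # one query, pks only
random_obj = A.objects.get(pk=choice(pks))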
The second bit of code is correct, but can be slower, because in SQL it generates an ORDER BY RANDOM() clause that shuffles the entire set of results and then takes a LIMIT based on that.
The first bit of code still has to evaluate the entire set of results. E.g., what if your random_idx is near the last possible index?
A better approach is to pick a random ID from your database and choose that (which is a primary key lookup, so it's fast). We can't assume that every id between 1 and MAX(id) is available, in case you've deleted something. So the following is an approximation that works out well:
import random

# grab the max id in the database
max_id = A.objects.order_by('-id')[0].id
# grab a random candidate id; it may not exist in the database
# (randint is inclusive on both ends, so use max_id, not max_id + 1)
random_id = random.randint(1, max_id)
# return the first object with an id greater than or equal to that one;
# this is a fast lookup, because your primary key probably has a RANGE index
random_object = A.objects.filter(id__gte=random_id)[0]
How about calculating the maximal primary key and getting a random pk?
The book ‘Django ORM Cookbook’ compares the execution time of the following functions to get a random object from a given model.
import random

from django.db.models import Max
from myapp.models import Category


def get_random():
    return Category.objects.order_by("?").first()


def get_random3():
    max_id = Category.objects.all().aggregate(max_id=Max("id"))['max_id']
    while True:
        pk = random.randint(1, max_id)
        category = Category.objects.filter(pk=pk).first()
        if category:
            return category
The test was made on a million DB entries:
In [14]: timeit.timeit(get_random3, number=100)
Out[14]: 0.20055226399563253
In [15]: timeit.timeit(get_random, number=100)
Out[15]: 56.92513192095794
See source.
After seeing those results I started using the following snippet:
from django.db.models import Max
import random


def get_random_obj_from_queryset(queryset):
    max_pk = queryset.aggregate(max_pk=Max("pk"))['max_pk']
    while True:
        obj = queryset.filter(pk=random.randint(1, max_pk)).first()
        if obj:
            return obj
So far it has done the job, as long as there is an id.
Notice that the get_random3 (get_random_obj_from_queryset) function won't work if you replace the model id with a uuid or something else. Also, if too many instances have been deleted, the while loop will slow the process down.
Yet another way:
from random import randint

pks = A.objects.values_list('pk', flat=True)
random_idx = randint(0, len(pks) - 1)
random_obj = A.objects.get(pk=pks[random_idx])
Works even if there are larger gaps in the pks, for example if you want to filter the queryset before picking one of the remaining objects at random.
EDIT: fixed the call of randint (thanks to @Quique). The stop arg is inclusive.
https://docs.python.org/3/library/random.html#random.randint
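For instance, with a filtered queryset (the rating__gte filter is a made-up example; any queryset works, since only the pks are pulled before the final get()):

from random import randint

pks = A.objects.filter(rating__gte=4).values_list('pk', flat=True)
random_obj = A.objects.get(pk=pks[randint(0, len(pks) - 1)])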
I'm sharing my latest test result with Django 2.1.7, PostgreSQL 10.
import random

students = Student.objects.all()

for i in range(500):
    student = random.choice(students)
    print(student)
# 0.021996498107910156 seconds

for i in range(500):
    student = Student.objects.order_by('?')[0]
    print(student)
# 0.41299867630004883 seconds
It seems that random fetching with random.choice() is about 20x faster in this test (0.022 s vs 0.413 s for 500 picks). Note that students is evaluated and cached the first time choice() touches it, so the remaining picks hit the cached results, while order_by('?')[0] sends a new query every time.
In Python, to get a random member of a sequence like a list or tuple, you can use the random module.
The random module has a method named choice; this method takes a sequence and returns one of its members at random.
So, because random.choice only needs something it can index into, you can use this method with a queryset in Django.
First, import the random module:
import random
Then create a list:
my_iterable_object = [1, 2, 3, 4, 5, 6]
or create a queryset like this:
my_iterable_object = mymodel.objects.filter(name='django')
And to get a random member of your iterable object, use the choice method:
random_member = random.choice(my_iterable_object)
print(random_member)  # my_iterable_object is [1, 2, 3, 4, 5, 6]
# prints, for example: 3
full code:
import random

my_list = [1, 2, 3, 4, 5, 6]
print(random.choice(my_list))  # prints, for example: 2
import random


def get_random_obj(model, length=-1):
    if length == -1:
        length = model.objects.count()
    return model.objects.all()[random.randint(0, length - 1)]


# to use this function
random_obj = get_random_obj(A)
I have written the following function:
def auto_update_ratings(amounts, assessment_entries_qs, lowest_rating=-1):
    start = 0
    rating = lowest_rating
    ids = assessment_entries_qs.values_list('id', flat=True)

    for i in ids:  # I have absolutely no idea why this seems to be required:
        pass       # without this loop, the last AssessmentEntries fail to update
                   # in the following for loop.

    for amount in amounts:
        end_mark = start + amount
        entries = ids[start:end_mark]
        a = assessment_entries_qs.filter(id__in=entries).update(rating=rating)
        start = end_mark
        rating += 1
It does what it is supposed to do (i.e. update the relevant number of entries in assessment_entries_qs with each rating (starting at lowest_rating) as specified in amounts). Here is a simple example:
>>> assessment_entries = AssessmentEntry.objects.all()
>>> print [ae.rating for ae in assessment_entries]
[None, None, None, None, None, None, None, None, None, None]
>>>
>>> auto_update_ratings((2,4,3,1), assessment_entries, 1)
>>> print [ae.rating for ae in assessment_entries]
[1, 1, 2, 2, 2, 2, 3, 3, 3, 4]
However, if I do not iterate through ids before iterating through amounts, the function only updates a subset of the queryset: with my current test data (approximately 250 AssessmentEntries in the queryset), it always results in exactly 84 AssessmentEntries not being updated.
Interestingly, it is always the last iteration of the second for loop that does not result in any updates (although the rest of the code in that iteration does execute properly), as well as a portion of the previous iteration. The querysets are ordered by '?' prior to being passed to this function, and the intended results are achieved if I simply add the previous 'empty' for loop, so it does not appear to be an issue with my data.
A few more details, just in case they prove to be relevant:
AssessmentEntry.rating is a standard IntegerField(null=True, blank=True).
I am using this function purely for testing purposes, so I have only been executing it from iPython.
Test database is SQLite.
Question: Can someone please explain why I appear to need to iterate through ids, despite not actually touching the data in any way, and why without doing so the function still (sort of) executes correctly, but always fails to update the last few items in the queryset despite apparently still iterating through them?
QuerySets and QuerySet slicing are evaluated lazily. Iterating ids executes the query and makes ids behave like a static list instead of a QuerySet. So when you loop through ids, it causes entries later on to be a fixed set of values; but if you don't loop through ids, then entries is just a subquery with a LIMIT clause added to represent the slicing you do.
Here is what is happening in detail:
def auto_update_ratings(amounts, assessment_entries_qs, lowest_rating=-1):
    # assessment_entries_qs is an unevaluated QuerySet
    # from your calling code, it would probably generate a query like this:
    # SELECT * FROM assessments ORDER BY RANDOM()
    start = 0
    rating = lowest_rating

    ids = assessment_entries_qs.values_list('id', flat=True)
    # ids is a ValuesQuerySet that adds "SELECT id"
    # to the query that assessment_entries_qs would generate.
    # So ids is now something like:
    # SELECT id FROM assessments ORDER BY RANDOM()

    # we omit the loop

    for amount in amounts:
        end_mark = start + amount
        entries = ids[start:end_mark]
        # entries is now another QuerySet with a LIMIT clause added:
        # SELECT id FROM assessments ORDER BY RANDOM() LIMIT start,(start+end_mark)

        # When filter() gets a QuerySet, it adds a subquery
        a = assessment_entries_qs.filter(id__in=entries).update(rating=rating)
        # FINALLY, we now actually EXECUTE a query, which is something like this:
        # UPDATE assessments SET rating=? WHERE id IN
        #   (SELECT id FROM assessments ORDER BY RANDOM() LIMIT start,(start+end_mark))

        start = end_mark
        rating += 1
Since the subquery in entries is executed anew on every update and has a random order, the slicing you do is meaningless! This function does not have deterministic behavior.
However, when you iterate ids, you actually execute the query, so your slicing has deterministic behavior again and the code does what you expect.
Let's see what happens when you use a loop instead:
ids = assessment_entries_qs.values_list('id', flat=True)

# Iterating ids causes the query to actually be executed.
# This query was sent to the DB:
# SELECT id FROM assessments ORDER BY RANDOM()
for id in ids:
    pass
# ids has now been "realized" and contains the *results* of the query,
# e.g. [5, 1, 2, 3, 4].
# Iterating again (or slicing) will now return values rather than modify the query.

for amount in amounts:
    end_mark = start + amount
    entries = ids[start:end_mark]
    # because ids was executed, entries contains definite values

    # When filter() gets actual values, it adds a simple condition
    a = assessment_entries_qs.filter(id__in=entries).update(rating=rating)
    # The query executed is something like this:
    # UPDATE assessments SET rating=? WHERE id IN (5, 1)
    # "(5, 1)" will change on each iteration, but it will always be a set of
    # scalar values rather than a subquery.

    start = end_mark
    rating += 1
If you ever need to eagerly evaluate a QuerySet to get all its values at a moment in time, rather than performing a do-nothing iteration, just convert it to a list:
ids = list(assessment_entries_qs.values_list('id', flat=True))
Also the Django docs go into detail about when exactly a QuerySet is evaluated.