I'm trying to understand how to create a search function using DynamoDB. This answer helped me better understand the use of Global Secondary Indexes, but I still have some questions. Suppose we have a structure like this and a GSI called last_name_index:
+------+-----------+----------+---------------+
| User | FirstName | LastName | Email |
+------+-----------+----------+---------------+
| 1001 | Test | Test | test#mail.com |
| 1002 | Jonh | Doe | jdoe#mail.com |
| 1003 | Another | Test | mail#mail.com |
+------+-----------+----------+---------------+
Using boto3 I can search now for a user if I know the last name:
table.query(
IndexName = "last_name_index",
KeyConditionExpression=Key('LastName').eq(name)
)
But what if I want to search for users and I only know part of the last name? I know there is a contains function in boto3, but it only works with non-key attributes. Do I need to change the GSI? Or is there something I'm missing? I want to be able to do something like:
table.query(
IndexName = "last_name_index",
KeyConditionExpression=Key('LastName').contains(name) # part of the name
)
I am new to Python and am currently trying to create a web form to edit customer data. The user selects a customer and gets all DSL products linked to that customer. What I am now trying to do is get the maximum downstream possible for a customer. So when the customer has DSL1, DSL3 and DSL4, his MaxDownstream is 550. Sorry for my poor English skills.
Here is the structure of my tables..
Customer_has_product:
Customer_idCustomer | Product_idProduct
----------------------------
1 | 1
1 | 3
1 | 4
2 | 5
3 | 3
Customer:
idCustomer | MaxDownstream
----------------------------
1 |
2 |
3 |
Product:
idProduct | Name | downstream
-------------------------------------------------
1 | DSL1 | 50
2 | DSL2 | 100
3 | DSL3 | 550
4 | DSL4 | 400
5 | DSL5 | 1000
And the code I've got so far:
db_session = Session(db_engine)
customer_object = db_session.query(Customer).filter_by(
    idCustomer=productform.Customer.data.idCustomer
).first()
productlist = request.form.getlist("DSLPRODUCTS_PRIVATE")
oldproducts = db_session.query(Customer_has_product.Product_idProduct).filter_by(
    Customer_idCustomer=customer_object.idCustomer)
id_list_delete = list(set([r for r, in oldproducts]) - set(productlist))
for delid in id_list_delete:
    db_session.query(Customer_has_product).filter_by(Customer_idCustomer=customer_object.idCustomer,
                                                     Product_idProduct=delid).delete()
db_session.commit()
for product in productlist:
    if db_session.query(Customer_has_product).filter_by(
        Customer_idCustomer=customer_object.idCustomer,
        Product_idProduct=product
    ).first() is not None:
        continue
    else:
        product_link_to_add = Customer_has_product(
            Customer_idCustomer=productform.Customer.data.idCustomer,
            Product_idProduct=product
        )
        db_session.add(product_link_to_add)
db_session.commit()
What you want to do is JOIN the tables onto each other. All relational database engines support joins, as does SQLAlchemy.
So how do you do that in SQLAlchemy?
You have two options, really. One is to use the Query builder of SQLAlchemy's ORM, the other is to use SQLAlchemy Core (upon which the ORM is built) directly. I really prefer the latter, because it maps more directly to SELECT statements, but I'm going to show both.
Using SQLAlchemy Core
How to do a join in Core is documented here. First argument is the table to JOIN to, second argument is the JOIN-condition.
from sqlalchemy import select, func
# SQLAlchemy 1.x style: the columns are passed as a list.
# From 1.4 on you can pass them directly, select(col1, col2); 2.0 requires that form.
query = select(
    [
        Customer.idCustomer,
        func.max(Product.downstream),
    ]
).select_from(
    Customer.__table__
    .join(Customer_has_product.__table__,
          Customer_has_product.Customer_idCustomer ==
          Customer.idCustomer)
    .join(Product.__table__,
          Product.idProduct == Customer_has_product.Product_idProduct)
).group_by(
    Customer.idCustomer
)
# Now we can execute the built query on the database.
result = db_session.execute(query).fetchall()
print(result)  # Should now give you the correct result.
Using SQLAlchemy ORM
To simplify this it's best to declare some relationships on your models. join is documented here. The first argument to join is the model to join onto and the second argument is the JOIN condition again.
Without the relationships you'll have to do it like this.
result = (db_session
          .query(Customer.idCustomer, func.max(Product.downstream))
          .join(Customer_has_product,
                Customer_has_product.Customer_idCustomer ==
                Customer.idCustomer)
          .join(Product,
                Product.idProduct == Customer_has_product.Product_idProduct)
          .group_by(Customer.idCustomer)
          ).all()
print(result)
This should be enough to get the idea on how to do this.
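To see what both variants boil down to, here is a minimal sketch that runs the equivalent SQL with the standard-library sqlite3 module. The table and column names are taken from the question; the in-memory database and the ORDER BY (added only to make the output stable) are my own assumptions, and your real database engine may differ.

```python
import sqlite3

# In-memory database filled with the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (idCustomer INTEGER PRIMARY KEY, MaxDownstream INTEGER);
    CREATE TABLE Product (idProduct INTEGER PRIMARY KEY, Name TEXT, downstream INTEGER);
    CREATE TABLE Customer_has_product (Customer_idCustomer INTEGER, Product_idProduct INTEGER);
    INSERT INTO Customer (idCustomer) VALUES (1), (2), (3);
    INSERT INTO Product VALUES
        (1, 'DSL1', 50), (2, 'DSL2', 100), (3, 'DSL3', 550),
        (4, 'DSL4', 400), (5, 'DSL5', 1000);
    INSERT INTO Customer_has_product VALUES (1, 1), (1, 3), (1, 4), (2, 5), (3, 3);
""")

# Two joins plus MAX ... GROUP BY, the same shape as the SQLAlchemy queries.
rows = conn.execute("""
    SELECT c.idCustomer, MAX(p.downstream)
    FROM Customer AS c
    JOIN Customer_has_product AS chp ON chp.Customer_idCustomer = c.idCustomer
    JOIN Product AS p ON p.idProduct = chp.Product_idProduct
    GROUP BY c.idCustomer
    ORDER BY c.idCustomer
""").fetchall()
print(rows)  # [(1, 550), (2, 1000), (3, 550)]
```

Each row is (idCustomer, highest downstream among that customer's products), which is the value you would then store in MaxDownstream.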
I have a table that holds some data about users. There are two fields in it, like and smile. I need to get data from the table, grouped by user_id, that shows whether a user has any likes or smiles. The query I would write in SQL looks like:
select sum(smile) > 0 as has_smile,
       sum(like) > 0 as has_like,
       user_id
from ratings
group by user_id;
This would provide output like:
| has_smile | has_like | user_id |
+-----------+----------+---------+
| 1 | 0 | 1 |
| 1 | 1 | 2 |
Is there any chance this query can be translated to SQLAlchemy (Flask-SQLAlchemy to be precise)? I know there is db.func.sum but I don't know how to add comparison there, and to have label. What I did for now is:
cls.query.with_entities("user_id").group_by(user_id).\
add_columns(db.func.sum(cls.smile).label("has_smile"),
db.func.sum(cls.like).label("has_like")).all()
but that will return exact number of smiles/likes instead of just 1/0 if there is or there is not smile/like.
Thanks to operator overloading you can write the comparison the same way you would anywhere else in Python:
db.func.sum(cls.smile) > 0
which produces an SQL expression object that you can then give a label to:
(db.func.sum(cls.smile) > 0).label('has_smile')
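If it is unclear why a comparison can return something other than True or False, the following toy sketch shows the mechanism. Expr and sql_sum are hypothetical stand-ins I made up for illustration; they are not SQLAlchemy's real classes, which are far more elaborate, but the idea of overloading > to build an expression is the same.

```python
class Expr:
    """Toy SQL expression (hypothetical stand-in, not SQLAlchemy's class)."""

    def __init__(self, sql):
        self.sql = sql

    def __gt__(self, other):
        # The comparison builds a new expression instead of returning a bool.
        return Expr(f"{self.sql} > {other!r}")

    def label(self, name):
        # Wrap the expression and give it a column alias.
        return Expr(f"({self.sql}) AS {name}")

    def __str__(self):
        return self.sql


def sql_sum(column):
    """Toy counterpart of db.func.sum (hypothetical helper)."""
    return Expr(f"sum({column})")


has_smile = (sql_sum("smile") > 0).label("has_smile")
print(has_smile)  # (sum(smile) > 0) AS has_smile
```

The parentheses around the comparison matter in the real code as well: without them, .label would be called on the integer 0 rather than on the comparison.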
I believe that I am simply failing to search correctly, so please redirect me to the appropriate question if this is the case.
I have a list of orders for an ecommerce platform. I then have two tables called checkout_orderproduct and catalog_product structured as:
|______________checkout_orderproduct_____________|
| id | order_id | product_id | qty | total_price |
--------------------------------------------------
|_____catalog_product_____|
| id | name | description |
---------------------------
I am trying to get all of the products associated with an order. My thought is something along the lines of:
for order in orders:
    OrderProduct.objects.filter(order_id=order.id, IM_STUCK_HERE)
What should the second part of the query be so that I get back a list of products such as
["Fruit", "Bagels", "Coffee"]
products = (OrderProduct.objects
.filter(order_id=order.id)
.values('product_id'))
Product.objects.filter(id__in=products)
Or use id__in=list(products) to force the inner query to run first; see the "Performance considerations" note in the Django documentation on the in lookup.
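Passing one queryset to id__in of another makes Django issue a nested query. As a rough sketch of that kind of SQL, here it is run with the standard-library sqlite3 module. The table names come from the question; the sample rows, the order ids and the ORDER BY are assumptions made up for the demonstration, and the SQL Django actually generates will look different in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog_product (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE TABLE checkout_orderproduct (
        id INTEGER PRIMARY KEY, order_id INTEGER, product_id INTEGER,
        qty INTEGER, total_price REAL);
    -- Made-up sample data: order 10 holds three products, order 11 holds one.
    INSERT INTO catalog_product VALUES
        (1, 'Fruit', ''), (2, 'Bagels', ''), (3, 'Coffee', ''), (4, 'Tea', '');
    INSERT INTO checkout_orderproduct VALUES
        (1, 10, 1, 1, 2.0), (2, 10, 2, 6, 4.5), (3, 10, 3, 1, 3.0), (4, 11, 4, 1, 2.5);
""")

# Outer query on products, inner query on the order lines of one order.
rows = conn.execute("""
    SELECT name FROM catalog_product
    WHERE id IN (SELECT product_id FROM checkout_orderproduct WHERE order_id = ?)
    ORDER BY id
""", (10,)).fetchall()
product_names = [name for (name,) in rows]
print(product_names)  # ['Fruit', 'Bagels', 'Coffee']
```

In Django itself, appending .values_list('name', flat=True) to the Product query gives you the plain list of names in the same way.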
I have the following type of data:
The data is segmented into "frames" and each frame has a start and stop "gpstime". Within each frame are a bunch of points with a "gpstime" value.
There is a frames model that has a frame_name,start_gps,stop_gps,...
Let's say I have a list of gpstime values and want to find the corresponding frame_name for each.
I could just do a loop...
framenames = [frames.objects.filter(start_gps__lte=gpstime[idx],stop_gps__gte=gpstime[idx]).values_list('frame_name',flat=True) for idx in range(len(gpstime))]
This will give me a list of 'frame_name', one for each gpstime. This is what I want. However this is very slow.
What I want to know: is there a more efficient way to perform this lookup and get a frame_name for each gpstime than iterating over the list? This list could get fairly large.
Thanks!
EDIT: Frames model
class frames(models.Model):
    frame_id = models.AutoField(primary_key=True)
    frame_name = models.CharField(max_length=20)
    start_gps = models.FloatField()
    stop_gps = models.FloatField()
    def __unicode__(self):
        return "%s"%(self.frame_name)
If I understand correctly, gpstime is a list of the times, and you want to produce a list of framenames with one for each gpstime. Your current way of doing this is indeed very slow because it makes a db query for each timestamp. You need to minimize the number of db hits.
The answer that first comes to mind uses numpy. Note that I'm not making any extra assumptions here. If your gpstime list can be sorted, i.e. the ordering does not matter, then it could be done much faster.
Try something like this:
from numpy import array
# One query for all frames; the model in the question is called frames
rows = list(frames.objects.values_list('start_gps', 'stop_gps', 'frame_name'))
frame_start_times = array([r[0] for r in rows])
frame_end_times = array([r[1] for r in rows])
frame_names = array([r[2] for r in rows])
frame_names_for_times = []
for time in gpstime:
    # Boolean mask of the frames whose interval contains this time
    in_frame = (frame_start_times <= time) & (frame_end_times >= time)
    frame_names_for_times.append(frame_names[in_frame].tolist())
EDIT:
Since the list is sorted, you can use .searchsorted():
from numpy import array as a
gpstimes=a([151,152,153,190,649,652,920,996])
starts=a([100,600,900,1000])
ends=a([180,650,950,1000])
names=a(['a','b','c','d',])
names_for_times=[]
for time in gpstimes:
    start_pos=starts.searchsorted(time)
    end_pos=ends.searchsorted(time)
    if start_pos-1 == end_pos:
        names_for_times.append(names[end_pos])
        print(time, names[end_pos])
    else:
        names_for_times.append(None)
        print(str(time) + ' was not within any frame')
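If you would rather not depend on numpy, the same sorted lookup can be sketched with the standard-library bisect module. The frame tuples and the frame_for helper below are made up for illustration, and the sketch assumes the frames are sorted by start time and do not overlap except at shared boundaries.

```python
from bisect import bisect_right

# Hypothetical frames as (start_gps, stop_gps, frame_name), sorted by start.
frames = [
    (0.0, 2.0, "Frame 1"),
    (2.0, 4.0, "Frame 2"),
    (4.0, 7.0, "Frame 3"),
]
starts = [start for start, _stop, _name in frames]


def frame_for(gpstime):
    """Return the name of the frame containing gpstime, or None."""
    # Index of the last frame whose start is <= gpstime (binary search).
    i = bisect_right(starts, gpstime) - 1
    if i >= 0 and frames[i][0] <= gpstime <= frames[i][1]:
        return frames[i][2]
    return None


names = [frame_for(t) for t in [2.5, 4.5, 7.5, -1.0]]
print(names)  # ['Frame 2', 'Frame 3', None, None]
```

A time that falls exactly on a shared boundary is assigned to the later frame here; swap bisect_right for bisect_left if you want the earlier one.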
The best way to speed things up is to add indexes to those fields:
start_gps = models.FloatField(db_index=True)
stop_gps = models.FloatField(db_index=True)
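and then apply the change to the database. On Django 1.7 and later that means running manage.py makemigrations followed by manage.py migrate; on older versions manage.py syncdb only creates missing tables, so an index on an existing table has to be added by hand or with a migration tool such as South.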
The frames table is very large, but I have another value that lowers
the frames searched in this case to under 50. There is not really a
pattern, each frame starts at the same gpstime the previous stops.
I don't quite understand how you lowered the number of searched frames to 50, but if you're searching for, say, 10,000 gpstime values in only 50 frames, then it's probably easiest to load those 50 frames into RAM, and do the search in Python, using something similar to foobarbecue's answer.
However, if you're searching for, say, 10 gpstime values in the entire table which has, say, 10,000,000 frames, then you may not want to load all 10,000,000 frames into RAM.
You can get the DB to do something similar by adding the following index...
ALTER TABLE myapp_frames ADD UNIQUE KEY my_key (start_gps, stop_gps, frame_name);
...then using a query like this...
(SELECT frame_name FROM myapp_frames
WHERE 2.5 BETWEEN start_gps AND stop_gps LIMIT 1)
UNION ALL
(SELECT frame_name FROM myapp_frames
WHERE 4.5 BETWEEN start_gps AND stop_gps LIMIT 1)
UNION ALL
(SELECT frame_name FROM myapp_frames
WHERE 7.5 BETWEEN start_gps AND stop_gps LIMIT 1);
...which returns...
+------------+
| frame_name |
+------------+
| Frame 2 |
| Frame 4 |
| Frame 7 |
+------------+
...and for which an EXPLAIN shows...
+----+--------------+--------------+-------+---------------+--------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+--------------+-------+---------------+--------+---------+------+------+--------------------------+
| 1 | PRIMARY | myapp_frames | range | my_key | my_key | 8 | NULL | 3 | Using where; Using index |
| 2 | UNION | myapp_frames | range | my_key | my_key | 8 | NULL | 5 | Using where; Using index |
| 3 | UNION | myapp_frames | range | my_key | my_key | 8 | NULL | 8 | Using where; Using index |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+--------------+-------+---------------+--------+---------+------+------+--------------------------+
...so you can do all the lookups in one query which hits that index, and the index should be cached in RAM.
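Writing that query by hand for more than a few values gets tedious, so here is a minimal sketch of generating it. build_frame_lookup is a hypothetical helper name; the %s placeholders assume a MySQL-style DB-API driver (the style Django's connection.cursor() also uses), and the table name is only interpolated because it is a trusted constant, never user input.

```python
def build_frame_lookup(gpstimes, table="myapp_frames"):
    """Return (sql, params) for one UNION ALL query covering every gpstime."""
    one_lookup = (
        f"(SELECT frame_name FROM {table} "
        "WHERE %s BETWEEN start_gps AND stop_gps LIMIT 1)"
    )
    # One parenthesised SELECT per value, glued together with UNION ALL.
    sql = "\nUNION ALL\n".join(one_lookup for _ in gpstimes)
    return sql, list(gpstimes)


sql, params = build_frame_lookup([2.5, 4.5, 7.5])
print(sql)
print(params)
```

You would then run cursor.execute(sql, params) and read the rows back. Keep in mind that a gpstime which falls in no frame contributes no row at all, so the result only lines up with the input list when every value matches a frame.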