Create a search function using DynamoDB and boto3 - Python

I'm trying to understand how to create a search function using DynamoDB. This answer helped me better understand the use of Global Secondary Indexes, but I still have some questions. Suppose we have a structure like this and a GSI called last_name_index:
+------+-----------+----------+---------------+
| User | FirstName | LastName | Email         |
+------+-----------+----------+---------------+
| 1001 | Test      | Test     | test@mail.com |
| 1002 | John      | Doe      | jdoe@mail.com |
| 1003 | Another   | Test     | mail@mail.com |
+------+-----------+----------+---------------+
Using boto3 I can search now for a user if I know the last name:
table.query(
    IndexName="last_name_index",
    KeyConditionExpression=Key('LastName').eq(name)
)
But what if I want to search for users and I only know part of the last name? I know there is a contains function in boto3, but it only works on non-key attributes. Do I need to change the GSI? Or is there something I'm missing? I want to be able to do something like:
table.query(
    IndexName="last_name_index",
    KeyConditionExpression=Key('LastName').contains(name)  # part of the name
)
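As far as I can tell, the only fallback is a scan with a FilterExpression, sketched below using boto3's Attr helper, though it reads every item in the table (which is what I was hoping to avoid):
from boto3.dynamodb.conditions import Attr

# Scan fallback: contains() is allowed in a FilterExpression, but a
# scan reads the whole table and filters afterwards, so it gets
# expensive as the table grows.
response = table.scan(FilterExpression=Attr('LastName').contains(name))
users = response['Items']
# DynamoDB paginates scan results; keep going while there are more pages.
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('LastName').contains(name),
        ExclusiveStartKey=response['LastEvaluatedKey'],
    )
    users.extend(response['Items'])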

Related

How do I modify this Python code as a SQL query in Redshift

I am trying to see if there's any way I can implement this piece of code using only SQL in Redshift:
import pandas as pd  # conn is an existing database connection

a = '''
SELECT to_char(DATE '2022-01-01'
+ (interval '1 day' * generate_series(0,365)), 'YYYY_MM_DD') AS ym
'''
dfa = pd.read_sql(a, conn)
b = f'''
select account_no, {','.join('"' + str(x) + '"' for x in dfa.ym)}
from loan__balance_table
where account_no =
'''
dfb = pd.read_sql(b, conn)
The first query will yield something like this:
| ym         |
| ---------- |
| 2022_01_01 |
| 2022_01_02 |
| ...        |
| 2022_12_31 |
Then I used string concatenation to combine the dates and use them in the second query to select all the columns in ym. The result of the second query should be something like this:
| account_no | 2022_01_01 | 2022_01_02 | ...
| ---------- | ---------- | ---------- | ...
| 1234       | 234,987.09 | 233,989.19 | ...
I just want to know if there's a way I can combine both queries into one in SQL, without using Python to concatenate the column names.
I tried using a CTE but I can't seem to get it right; I don't even know if this is the right approach. The database is Redshift.

Django: list model entries with multiple references

I have the following models which represent songs and the plays of each song:
from django.db import models

class Play(models.Model):
    play_day = models.PositiveIntegerField()
    source = models.CharField(
        'source',
        max_length=20,
        choices=(('radio', 'Radio'), ('streaming', 'Streaming'),),
    )
    song = models.ForeignKey('Song', verbose_name='song')

class Song(models.Model):
    name = models.CharField('Name', max_length=100)
Imagine I have the following entries:
Songs:
| ID | name                |
|----|---------------------|
| 1  | Stairway to Heaven  |
| 2  | Riders on the Storm |
Plays:
| ID | play_day | source    | song_id |
|----|----------|-----------|---------|
| 1  | 20181030 | radio     | 1       |
| 2  | 20181030 | streaming | 1       |
| 3  | 20181030 | streaming | 2       |
I would like to list all the tracks as follows:
| Name | Day | Sources |
|---------------------|------------|------------------|
| Stairway to Heaven | 2018-10-30 | Radio, Streaming |
| Riders on the Storm | 2018-10-30 | Streaming |
I am using Django==1.9.2, django_tables2==1.1.6 and django-filter==0.13.0 with PostgreSQL.
Problem:
I'm using Song as the model of the table and the filter, so the queryset starts with a select FROM song. However, when joining the Play table, I get two entries in the case of "Stairway to Heaven" (I know, even one is too much: https://www.youtube.com/watch?v=RD1KqbDdmuE).
What I tried:
I tried adding distinct to the Song queryset, though this raises the problem that I cannot sort on columns other than Song.id (supposing I do distinct on that column).
Aggregate: this yields a final state, actually a dictionary, which cannot be used with django_tables2.
I found this solution for PostgreSQL, Selecting rows ordered by some column and distinct on another, though I don't know how to do this with Django.
Question:
What would be the right approach to show one track per line "aggregating" information from references using Django's ORM?
I think that the proper way to do it is to use the array_agg postgresql function (http://postgresql.org/docs/9.5/static/functions-aggregate.html and http://lorenstewart.me/2017/12/03/postgresqls-array_agg-function).
Django seems to actually support this (in v. 2.1 at least: http://docs.djangoproject.com/en/2.1/ref/contrib/postgres/aggregates/) thus that seems like the way to go.
Unfortunately I don't have time to test it right now so I can't provide a thorough answer; however try something like: Song.objects.all().annotate(ArrayAgg(...))
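For example, a minimal sketch, assuming django.contrib.postgres is in INSTALLED_APPS (ArrayAgg exists since Django 1.9, which matches your version) and the default reverse name play for the Play.song foreign key:
from django.contrib.postgres.aggregates import ArrayAgg

# Untested sketch: requires a PostgreSQL database.
rows = (
    Song.objects
    .annotate(sources=ArrayAgg('play__source', distinct=True))
    .values('name', 'sources')
)
# e.g. {'name': 'Stairway to Heaven', 'sources': ['radio', 'streaming']}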

Django multipart ORM query including JOINs

I believe that I am simply failing to search correctly, so please redirect me to the appropriate question if this is the case.
I have a list of orders for an ecommerce platform. I then have two tables called checkout_orderproduct and catalog_product structured as:
|______________checkout_orderproduct_____________|
| id | order_id | product_id | qty | total_price |
--------------------------------------------------
|_____catalog_product_____|
| id | name | description |
---------------------------
I am trying to get all of the products associated with an order. My thought is something along the lines of:
for order in orders:
    OrderProduct.objects.filter(order_id=order.id, IM_STUCK_HERE)
What should the second part of the query be so that I get back a list of products such as
["Fruit", "Bagels", "Coffee"]
products = (OrderProduct.objects
            .filter(order_id=order.id)
            .values('product_id'))

Product.objects.filter(id__in=products)
Or id__in=list(products): see the "Performance considerations" note at that link.
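If you only need the names, the two steps can also collapse into one queryset. A sketch (it assumes a name field on Product, matching the table above):
# Nested queryset: Django executes this as one query with a subselect.
names = list(
    Product.objects
    .filter(id__in=OrderProduct.objects
                   .filter(order_id=order.id)
                   .values('product_id'))
    .values_list('name', flat=True)
)
# e.g. ["Fruit", "Bagels", "Coffee"]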

Having trouble with a PostgreSQL query

I've got the following two tables:
User
userid | email               | phone
1      | some@email.com      | 555-555-5555
2      | some@otheremail.com | 555-444-3333
3      | one@moreemail.com   | 333-444-1111
4      | last@one.com        | 123-333-2123
UserTag
id | user_id | tag
1  | 1       | tag1
2  | 1       | tag2
3  | 1       | cool_tag
4  | 1       | some_tag
5  | 2       | new_tag
6  | 2       | foo
7  | 4       | tag1
I want to run a query in SQLAlchemy to join those two tables and return all users who do NOT have the tags "tag1" or "tag2". In this case, the query should return the users with userid 2 and 3. Any help would be greatly appreciated.
I need the opposite of this query:
users.join(UserTag, User.userid == UserTag.user_id) \
    .filter(
        or_(
            UserTag.tag.like('tag1'),
            UserTag.tag.like('tag2'),
        )
    )
I have been going at this for hours but always end up with the wrong users or sometimes all of them. An SQL query which achieves this would also be helpful. I'll try to convert that to SQLAlchemy.
Not sure how this would look in SQLAlchemy, but hopefully an explanation of why the query is the way it is will help you get there.
This is an outer join: you want all the records from one table (User) even if there are no matching records in the other table (UserTag); with User listed first it is a LEFT JOIN. Beyond that, you want only the records that don't have a match in UserTag for a specific filter.
SELECT user.user_id, email, phone
FROM user LEFT JOIN usertag
ON usertag.user_id = user.user_id
AND usertag.tag IN ('tag1', 'tag2')
WHERE usertag.user_id IS NULL;
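In SQLAlchemy, a translation of that LEFT JOIN might look something like this sketch (assuming the User and UserTag models from the question and an open session):
from sqlalchemy import and_

# LEFT JOIN with the tag filter in the ON clause, then keep only the
# rows where no UserTag row matched.
query = (
    session.query(User)
    .outerjoin(UserTag, and_(
        UserTag.user_id == User.userid,
        UserTag.tag.in_(['tag1', 'tag2']),
    ))
    .filter(UserTag.user_id.is_(None))
)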
The SQL would go like this; note it uses a subquery rather than a plain join, because a join with not in would still return users who merely have some other tag:
select u.*
from "user" u
where u.userid not in (select user_id from usertag where tag in ('tag1', 'tag2'));
I have not used SQLAlchemy, so you will need to convert this to the equivalent SQLAlchemy query. Hope it helps.
Assuming your model defines a relationship as below:
class User(Base):
    ...

class UserTag(Base):
    ...
    user = relationship("User", backref="tags")
the query follows:
qry = session.query(User).filter(~User.tags.any(UserTag.tag.in_(tags))).order_by(User.id)
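For completeness, a self-contained sketch of that approach; the column names follow the tables in the question, the table is named users here to sidestep PostgreSQL's reserved word user, and the connection string is hypothetical:
from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import relationship, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    userid = Column(Integer, primary_key=True)
    email = Column(String)
    phone = Column(String)

class UserTag(Base):
    __tablename__ = 'usertag'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.userid'))
    tag = Column(String)
    user = relationship("User", backref="tags")

session = sessionmaker(bind=create_engine('postgresql:///example'))()  # hypothetical DSN

# ~User.tags.any(...) compiles to NOT EXISTS, so users with no tags at
# all (userid 3 above) are returned too, unlike a plain inner join.
tags = ['tag1', 'tag2']
qry = (session.query(User)
       .filter(~User.tags.any(UserTag.tag.in_(tags)))
       .order_by(User.userid))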

Django lte/gte query on a list

I have the following type of data:
The data is segmented into "frames" and each frame has a start and stop "gpstime". Within each frame are a bunch of points with a "gpstime" value.
There is a frames model that has frame_name, start_gps, stop_gps, ...
Let's say I have a list of gpstime values and want to find the corresponding frame_name for each.
I could just do a loop...
framenames = [
    frames.objects.filter(
        start_gps__lte=gpstime[idx],
        stop_gps__gte=gpstime[idx],
    ).values_list('frame_name', flat=True)
    for idx in range(len(gpstime))
]
This gives me a list of 'frame_name' values, one for each gpstime, which is what I want. However, it is very slow.
What I want to know: is there a better way to perform this lookup, one that gets a frame_name for each gpstime more efficiently than iterating over the list? The list could get fairly large.
Thanks!
EDIT: Frames model
class frames(models.Model):
    frame_id = models.AutoField(primary_key=True)
    frame_name = models.CharField(max_length=20)
    start_gps = models.FloatField()
    stop_gps = models.FloatField()

    def __unicode__(self):
        return "%s" % self.frame_name
If I understand correctly, gpstime is a list of the times, and you want to produce a list of framenames with one for each gpstime. Your current way of doing this is indeed very slow because it makes a db query for each timestamp. You need to minimize the number of db hits.
The answer that comes first to my head uses numpy. Note that I'm not making any extra assumptions here. If your gpstime list can be sorted, i.e. the ordering does not matter, then it could be done much faster.
Try something like this:
from numpy import array

# flat=True gives 1-D arrays of values instead of arrays of 1-tuples;
# the field names match the frames model above (start_gps/stop_gps).
frame_start_times = array(frames.objects.values_list('start_gps', flat=True))
frame_end_times = array(frames.objects.values_list('stop_gps', flat=True))
frame_names = array(frames.objects.values_list('frame_name', flat=True))

frame_names_for_times = []
for time in gpstime:
    # boolean mask of the frames whose interval contains this time
    mask = (frame_start_times < time) & (frame_end_times > time)
    frame_names_for_times.append(frame_names[mask].tolist())
EDIT:
Since the list is sorted, you can use .searchsorted():
from numpy import array as a

gpstimes = a([151, 152, 153, 190, 649, 652, 920, 996])
starts = a([100, 600, 900, 1000])
ends = a([180, 650, 950, 1000])
names = a(['a', 'b', 'c', 'd'])

for time in gpstimes:
    start_pos = starts.searchsorted(time)
    end_pos = ends.searchsorted(time)
    if start_pos - 1 == end_pos:
        print time, names[end_pos]
    else:
        print str(time) + ' was not within any frame'
The best way to speed things up is to add indexes to those fields:
start_gps = models.FloatField(db_index=True)
stop_gps = models.FloatField(db_index=True)
and then update the database schema (on current Django, manage.py makemigrations followed by manage.py migrate; syncdb alone will not add indexes to an existing table).
The frames table is very large, but I have another value that lowers the frames searched in this case to under 50. There is not really a pattern; each frame starts at the same gpstime the previous one stops.
I don't quite understand how you lowered the number of searched frames to 50, but if you're searching for, say, 10,000 gpstime values in only 50 frames, then it's probably easiest to load those 50 frames into RAM, and do the search in Python, using something similar to foobarbecue's answer.
However, if you're searching for, say, 10 gpstime values in the entire table which has, say, 10,000,000 frames, then you may not want to load all 10,000,000 frames into RAM.
You can get the DB to do something similar by adding the following index...
ALTER TABLE myapp_frames ADD UNIQUE KEY my_key (start_gps, stop_gps, frame_name);
...then using a query like this...
(SELECT frame_name FROM myapp_frames
WHERE 2.5 BETWEEN start_gps AND stop_gps LIMIT 1)
UNION ALL
(SELECT frame_name FROM myapp_frames
WHERE 4.5 BETWEEN start_gps AND stop_gps LIMIT 1)
UNION ALL
(SELECT frame_name FROM myapp_frames
WHERE 7.5 BETWEEN start_gps AND stop_gps LIMIT 1);
...which returns...
+------------+
| frame_name |
+------------+
| Frame 2 |
| Frame 4 |
| Frame 7 |
+------------+
...and for which an EXPLAIN shows...
+----+--------------+--------------+-------+---------------+--------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+--------------+-------+---------------+--------+---------+------+------+--------------------------+
| 1 | PRIMARY | myapp_frames | range | my_key | my_key | 8 | NULL | 3 | Using where; Using index |
| 2 | UNION | myapp_frames | range | my_key | my_key | 8 | NULL | 5 | Using where; Using index |
| 3 | UNION | myapp_frames | range | my_key | my_key | 8 | NULL | 8 | Using where; Using index |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+--------------+-------+---------------+--------+---------+------+------+--------------------------+
...so you can do all the lookups in one query which hits that index, and the index should be cached in RAM.
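If you want to build that UNION ALL from Python rather than by hand, a sketch (it assumes Django's default database connection and the gpstime list from the question; the %s placeholders keep the values parameterized):
from django.db import connection

# One SELECT per gpstime value, glued together with UNION ALL so the
# whole lookup is a single round trip that can hit the covering index.
sql = " UNION ALL ".join(
    "(SELECT frame_name FROM myapp_frames"
    " WHERE %s BETWEEN start_gps AND stop_gps LIMIT 1)"
    for _ in gpstime
)
with connection.cursor() as cursor:
    cursor.execute(sql, [float(t) for t in gpstime])
    framenames = [row[0] for row in cursor.fetchall()]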
