Query and group nested documents with mongoengine - python

I have kind of a dual question which has kept me from proceeding for a while now. I have read lots of articles, checked Stack Overflow numerous times, and re-read the mongoengine docs, but I cannot find the answer that works for me. I am using MongoDB to store the data of a Flask web app, and mongoengine to query the DB. Now suppose my User model looks like this:
Users
name: Superman
kudos:
  0: (date, category A)
  1: (date, category B)
name: Superman
kudos:
  0: (date, category A)
  1: (date, category A)
  2: (date, category B)
The kudos are nested documents which get created whenever a user receives a kudo. I store them in a db.ListField(date=now). This is working perfectly fine.
In a relational DB I would have a separate kudo schema. In MongoDB I assumed it would be the better solution to create nested documents within the User collection. Otherwise you are still creating all kinds of separate schemas with relations to each other.
So here are my two main questions:
Am I correct in that my architecture is true to how mongoengine should be used?
How can I get a list (dict actually) of kudos per category? So I would like to query and get Category - Count.
Result should be:
kudos=[(category A, 3), (category B, 2)]
If I already had something even remotely working I would provide it, but I am completely stuck. That's why I even started doubting whether I should store the kudos in a separate collection, but I feel like I would then be getting off track in correctly using a NoSQL DB.

Assuming you have the following schema and data:
import datetime as dt
from mongoengine import *

connect(host='mongodb://localhost:27017/testdb')

class Kudo(EmbeddedDocument):
    date = DateTimeField(default=dt.datetime.utcnow)
    category = StringField()

class User(Document):
    name = StringField(required=True)
    kudos = EmbeddedDocumentListField(Kudo)

superman = User(name='superman', kudos=[Kudo(category='A')]).save()
batman = User(name='batman', kudos=[Kudo(category='A'), Kudo(category='B')]).save()
This isn't the most efficient approach, but you can get the distribution with the following simple snippet:
import itertools
from collections import Counter

raw_kudos = User.objects.scalar('kudos')
# raw_kudos is a list of lists, so flatten it before counting
categories_counter = Counter(k.category for k in itertools.chain.from_iterable(raw_kudos))
print(categories_counter)  # a dict subclass --> Counter({u'A': 2, u'B': 1})
And if you need higher performance, you'll need to use an aggregation pipeline.
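For reference, a minimal sketch of what such a pipeline could look like, unwinding the embedded list and grouping by category (depending on your mongoengine version, the pipeline is passed as a list or unpacked with *):
pipeline = [
    {"$unwind": "$kudos"},
    {"$group": {"_id": "$kudos.category", "count": {"$sum": 1}}},
]
# with the sample data above this prints docs like {'_id': 'A', 'count': 2}
for doc in User.objects.aggregate(pipeline):
    print(doc)
This pushes the counting to the server instead of pulling every kudo into Python.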

Related

Using the SQLAlchemy ORM for Python in my REST API, how can I aggregate resources by the hour and by the day?

I have a MySQL db table that looks like:
time_slot | sales
2022-08-26T01:00:00 | 100
2022-08-26T01:06:40 | 103
...
I am serving the data via an API to a client. The FE engineer wants the data aggregated by hour for each day within the query period (at the moment it's a week). So he gives from and to, and wants the sum of sales within each hour of each day as a nested array. Because it's a week, it's a 7-element array, where each element is an array containing all the hourly slots for which we have data.
[
[
"07:00": 567,
"08:00": 657,
....
],
[], [], ...
]
The API is built in Python. There is an ORM model (SQLAlchemy) for the data that looks like:
class HourlyData(Base):
    hour = Column(DateTime)
    sales = Column(Float)
I can query the hourly data and then aggregate it into a list of lists in Python memory. But to save compute time (and conceptual complexity), I would like to run the aggregation through ORM queries.
What is the sqlalchemy syntax to achieve this?
The below should get you started, where the solution is a mix of SQL and Python using existing tools, and it should work with any RDBMS.
Assumed model definition, and imports:
from itertools import groupby
import json

from sqlalchemy import Column, DateTime, Float, Integer, func

class TimelyData(Base):
    __tablename__ = "timely_data"
    id = Column(Integer, primary_key=True)
    time_slot = Column(DateTime)
    sales = Column(Float)
We get the data from the DB aggregated enough for us to group properly
# below works for PostgreSQL (tested); MySQL has no date_trunc,
# so an equivalent expression is needed there (see the sketch further down)
# see: https://mode.com/blog/date-trunc-sql-timestamp-function-count-on
col_hour = func.date_trunc("hour", TimelyData.time_slot)
q = (
    session.query(
        col_hour.label("hour"),
        func.sum(TimelyData.sales).label("total_sales"),
    )
    .group_by(col_hour)
    .order_by(col_hour)  # this is important for the `groupby` call later on
)
Group the results by date again using Python's groupby:
groups = groupby(q.all(), key=lambda row: row.hour.date())

# truncate and format the final list as required
data = [
    [(f"{row.hour:%H}:00", int(row.total_sales)) for row in rows]
    for _, rows in groups
]
Example result (the output of json.dumps(data), which is why json is imported):
[[["01:00", 201], ["02:00", 102]], [["01:00", 103]], [["08:00", 104]]]
I am not familiar with MySQL, but with PostgreSQL one could implement all of this at the DB level thanks to its extensive JSON support. However, I would argue that the readability of that implementation would not improve, and neither would the speed, assuming we fetch at most 168 rows = 7 days x 24 hours from the database.
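Since the question targets MySQL, here is a hedged, untested sketch of the same query with DATE_FORMAT standing in for date_trunc; the truncated column comes back as a string, so the grouping keys change accordingly:
# MySQL variant: DATE_FORMAT truncates to the hour but yields a string
col_hour = func.date_format(TimelyData.time_slot, "%Y-%m-%d %H:00:00")
q = (
    session.query(col_hour.label("hour"), func.sum(TimelyData.sales).label("total_sales"))
    .group_by(col_hour)
    .order_by(col_hour)
)
groups = groupby(q.all(), key=lambda row: row.hour[:10])  # group on the "YYYY-MM-DD" prefix
data = [
    [(row.hour[11:16], int(row.total_sales)) for row in rows]  # "HH:00" slice
    for _, rows in groups
]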

How can I call columns by name in db.session.query with 2 tables in Flask Python?

I am developing a web application with Flask, Python, SQLAlchemy, and MySQL.
I have 2 tables:
TaskUser:
- id
- id_task (foreign key of id column of table Task)
- message
Task
- id
- id_type_task
I need to extract all the taskusers (from TaskUser) where the id_task is in a specific list of Task ids.
For example, all the taskusers where id_task is in (1,2,3,4,5)
Once I get the result, I do some stuff and use some conditions.
When I make this request:
all_tasksuser = TaskUser.query.filter(TaskUser.id_task == Task.id) \
    .filter(TaskUser.id_task.in_(list_all_tasks), Task.id_type_task).all()

for item in all_tasksuser:
    item.message = "something"
    if item.id_type_task == 2:
        #do some stuff
    if item.id_task == 7 or item.id_task == 7:
        #do some stuff
I get this output error:
if item.id_type_task == 2:
AttributeError: 'TaskUser' object has no attribute 'id_type_task'
That is normal, as my SQLAlchemy request queries only one table. I can't access the columns of table Task.
BUT I CAN call the columns of TaskUser by their names (see item.id_task).
So I changed my SQLAlchemy query to this:
all_tasksuser = db_mysql.session.query(TaskUser, Task.id, Task.id_type_task) \
    .filter(TaskUser.id_task == Task.id) \
    .filter(TaskUser.id_task.in_(list_all_tasks), Task.id_type_task).all()
This time I include the table Task in my query BUT I CAN'T call the columns by their names. I would have to use the [index] of each column.
I get this kind of error message:
AttributeError: 'result' object has no attribute 'message'
The problem is I have many more columns (around 40) on both tables. It is too complicated to handle data with index numbers of columns.
I need to have a list of rows with data from 2 different tables and I need to be able to call the data by column name in a loop.
Do you think it is possible?
The key point leading to the confusion is the fact that when you perform a query for a mapped class like TaskUser, SQLAlchemy will return instances of that class. For example:
q1 = TaskUser.query.filter(...).all() # returns a list of [TaskUser]
q2 = db_mysql.session.query(TaskUser).filter(...).all() # ditto
However, if you specify only specific columns, you will receive just a (special) list of tuples:
q3 = db_mysql.session.query(TaskUser.col1, TaskUser.col2, ...)...
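Those "special" tuples are keyed, though, so named access is often still possible; a hedged sketch, assuming the join condition from the question:
row = db_mysql.session.query(TaskUser, Task.id_type_task) \
    .filter(TaskUser.id_task == Task.id).first()
row.TaskUser.message  # the full entity is reachable under its class name
row.id_type_task      # a plain column is reachable under the column name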
If you switch your mindset to completely use the ORM paradigm, you will work mostly with objects. In your specific example, the workflow could be similar to below, assuming you have relationships defined on your models:
# model
class Task(...):
    id = ...
    id_type_task = ...

class TaskUser(...):
    id = ...
    id_task = Column(ForeignKey(Task.id))
    message = ...
    task = relationship(Task, backref="task_users")

# query
all_tasksuser = TaskUser.query ...

# work
for item in all_tasksuser:
    item.message = "something"
    if item.task.id_type_task == 2:  # <- CHANGED to navigate the relationship
        #do some stuff
    if item.id_task == 7 or item.id_task == 7:  # unchanged: id_task lives on TaskUser itself
        #do some stuff
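One caveat with navigating item.task inside the loop: each access may lazy-load the related Task with an extra SELECT. A hedged sketch of eager loading to avoid that, assuming the relationship defined above:
from sqlalchemy.orm import joinedload

all_tasksuser = (
    TaskUser.query
    .options(joinedload(TaskUser.task))  # fetch the Task rows in the same query
    .filter(TaskUser.id_task.in_(list_all_tasks))
    .all()
)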
The first error message comes from the fact that query plus filter without a join (or any other join) cannot give you columns from both tables. You need to either put both tables into the session query, or join the two tables in order to gather column values from both. So this code:
all_tasksuser = TaskUser.query.filter(TaskUser.id_task == Task.id) \
    .filter(TaskUser.id_task.in_(list_all_tasks), Task.id_type_task).all()
needs to look more like this:
all_tasksuser = TaskUser.query.join(Task) \
    .filter(TaskUser.id_task.in_(list_all_tasks), Task.id_type_task).all()
or like this:
all_tasksuser = session.query(TaskUser, Task).filter(TaskUser.id_task == Task.id) \
    .filter(TaskUser.id_task.in_(list_all_tasks), Task.id_type_task).all()
Another thing is that the data will be structured differently. In the first example you still get TaskUser instances, and since task is a many-to-one relationship (not a collection), you navigate it directly:
for taskuser in all_tasksuser:
    # to reference id_type_task: taskuser.task.id_type_task
and in the second example each result is a tuple, so the for loop should look like this:
for taskuser, task in all_tasksuser:
    # to reference id_type_task: task.id_type_task
NOTE: I haven't checked all these examples, so there may be errors, but the concepts are there. For more info, please refer to this page:
https://www.tutorialspoint.com/sqlalchemy/sqlalchemy_orm_working_with_joins.htm

List/Dict structure issue

I'm confused about how to structure a list/dict I need. I have scraped three pieces of info off ESPN: Conference, Team, and link to the team homepage for future stat scraping.
When the program first runs, I'd like to build a dictionary/list so that one can type in a school and it would print the conference the school is in, OR one could select an entire conference and it would print the corresponding list of schools. The end user doesn't need to know about the link associated with each school, but it is important that the correct link is associated with the correct school so that future stats for that specific school can be scraped.
For example, the info scraped is:
SEC, UGA, www.linka.com
ACC, FSU, www.linkb.com
etc...
I know I could create a list of dictionaries like:
sec_list=[{UGA: www.linka.com, Alabama: www.linkc.com, etc...}]
acc_list=[{FSU: www.linkb.com, etc...}]
The problem is I'd have to create about 26 lists here to hold every conference, which sounds excessive. Is there a way to lump everything into one list but still have the ability to extract the schools of a specific conference, or search for a school and have the correct conference returned? Of course, the link must also correspond to the correct school.
Python ships with sqlite3 to handle database problems and it has an :memory: mode for in-memory databases. I think it will solve your problem directly and with clear code.
import sqlite3
from pprint import pprint

# Load the data from a list of tuples in the form [(conf, school, link), ...],
# e.g. data = [('SEC', 'UGA', 'www.linka.com'), ('ACC', 'FSU', 'www.linkb.com'), ...]
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE Espn (conf text, school text, link text)')
c.execute('CREATE INDEX Cndx ON Espn (conf)')
c.execute('CREATE INDEX Sndx ON Espn (school)')
c.executemany('INSERT INTO Espn VALUES (?, ?, ?)', data)
conn.commit()

# Run queries
pprint(c.execute('SELECT * FROM Espn WHERE conf = "Big10"').fetchall())
pprint(c.execute('SELECT * FROM Espn WHERE school = "Alabama"').fetchall())
In-memory databases are so easy to create and query that they are often the easiest solution to the problem of having multiple lookup keys and doing analytics on relational data. Trying to use dicts and lists for this kind of work just makes the problem unnecessarily complicated.
It's true you can do this with a list of dictionaries, but you might find it easier to be able to look up information with named fields. In that case, I'd recommend storing your scraped data in a Pandas DataFrame.
You want it so that "one can type in a school and it would print the conference the school is in OR one could select an entire conference and it would print the corresponding list of schools".
Here's an example of what that would look like, using Pandas and a couple of convenience functions.
First, some example data:
confs = ['ACC', 'Big10', 'BigEast', 'BigSouth', 'SEC',
         'ACC', 'Big10', 'BigEast', 'BigSouth', 'SEC']
teams = ['school{}'.format(x) for x in range(10)]
links = ['www.{}.com'.format(x) for x in range(10)]
scrape = list(zip(confs, teams, links))  # materialize: zip is lazy on Python 3
[('ACC', 'school0', 'www.0.com'),
('Big10', 'school1', 'www.1.com'),
('BigEast', 'school2', 'www.2.com'),
('BigSouth', 'school3', 'www.3.com'),
('SEC', 'school4', 'www.4.com'),
('ACC', 'school5', 'www.5.com'),
('Big10', 'school6', 'www.6.com'),
('BigEast', 'school7', 'www.7.com'),
('BigSouth', 'school8', 'www.8.com'),
('SEC', 'school9', 'www.9.com')]
Now convert to DataFrame:
import pandas as pd
df = pd.DataFrame.from_records(scrape, columns=['conf','school','link'])
conf school link
0 ACC school0 www.0.com
1 Big10 school1 www.1.com
2 BigEast school2 www.2.com
3 BigSouth school3 www.3.com
4 SEC school4 www.4.com
5 ACC school5 www.5.com
6 Big10 school6 www.6.com
7 BigEast school7 www.7.com
8 BigSouth school8 www.8.com
9 SEC school9 www.9.com
Type in school, get conference:
def get_conf(df, school):
    return df.loc[df.school == school, 'conf'].values
get_conf(df, school = 'school1')
['Big10']
Type in conference, get schools:
def get_schools(df, conf):
    return df.loc[df.conf == conf, 'school'].values
get_schools(df, conf = 'Big10')
['school1' 'school6']
It's unclear from your question whether you also want the links associated with schools returned when searching by conference. If so, just update get_schools() to:
def get_schools(df, conf):
    return df.loc[df.conf == conf, ['school', 'link']].values
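And since the links are what drive the future stat scraping, a matching helper (hypothetical name get_link, same pattern as above) can fetch the link for a given school:
def get_link(df, school):
    return df.loc[df.school == school, 'link'].values

get_link(df, school='school1')
# ['www.1.com']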

Populate Unique ID field after Sorting, Python

I am trying to create a new unique ID field in an access table. I already have one field called SITE_ID_FD, but it is historical. The format of the unique values in that field isn't our current format, so I am creating a new field with the new format.
Old Format = M001, M002, K003, K004, S005, M006, etc
New format = 12001, 12002, 12003, 12004, 12005, 12006, etc
I wrote the following script:
fc = r"Z:\test.gdb\testfc"
x = 12001
cursor = arcpy.UpdateCursor(fc)
for row in cursor:
row.setValue("SITE_ID", x)
cursor.updateRow(row)
x+= 1
This works fine, but it populates the new ID field based on the default sorting by ObjectID. I need to sort by 2 fields first and then populate the new ID field based on that sorting (I want to sort by a field called SITE and then by the old ID field SITE_ID_FD).
I tried manually sorting the 2 fields in hopes that Python would honor the sort, but it doesn't. I'm not sure how to do this in Python. Can anyone suggest a method?
A possible solution presents itself when you are creating your update cursor: you can specify the fields by which you wish it to be sorted (sorry for my english..). They explain this in the documentation: http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//000v0000003m000000
so it goes like this:
UpdateCursor(dataset, {where_clause}, {spatial_reference}, {fields}, {sort_fields})
and you are interested only in the sort_fields, so assuming that your code will work well on a sorted table and that you want the table ordered ascending, the second part of your code should look like this:
fc = r"Z:\test.gdb\testfc"
x = 12001

cursor = arcpy.UpdateCursor(fc, "", "", "", "SITE A; SITE_ID_FD A")
# if you want to sort descending you need to write it with a D
# >> cursor = arcpy.UpdateCursor(fc, "", "", "", "SITE D; SITE_ID_FD D")
for row in cursor:
    row.setValue("SITE_ID", x)
    cursor.updateRow(row)
    x += 1
I hope this helps.
Added a link to the arcpy docs in a comment, but from what I can tell, this will create a new, sorted dataset--
import arcpy
from arcpy import env

env.workspace = r"z:\test.gdb"
arcpy.Sort_management("testfc", "testfc_sort", [["SITE", "ASCENDING"],
                                                ["SITE_ID_FD", "ASCENDING"]])
And this will, on the sorted dataset, do what you want:
fc = r"Z:\test.gdb\testfc_sort"
x = 12001
cursor = arcpy.UpdateCursor(fc)
for row in cursor:
row.setValue("SITE_ID", x)
cursor.updateRow(row)
x+= 1
I'm assuming there's some way to just copy the sorted/modified dataset back over the original, so it's all good?
I'll admit, I don't use arcpy, and the docs could be a lot more explicit.
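On the open question of copying the sorted dataset back over the original, a hedged, untested sketch using the standard management tools (assuming nothing else holds a lock on testfc):
# replace the original feature class with the sorted copy
arcpy.Delete_management("testfc")
arcpy.Rename_management("testfc_sort", "testfc")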

Storing a List into Python Sqlite3

I am trying to scrape form field IDs using Beautiful Soup like this
for link in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if link.has_key('id'):
        print link['id']
Let us assume that it returns something like
username
email
password
passwordagain
terms
button_register
I would like to write this into Sqlite3 DB.
What I will be doing down the line in my application is: use these form fields' IDs and try to do a POST, maybe. The problem is there are plenty of sites like this whose form field IDs I have scraped. So the relation is like this...
Domain1 - First list of Form Fields for this Domain1
Domain2 - Second list of Form Fields for this Domain2
.. and so on
What I am unsure about here is: how should I design my columns for this purpose? Will it be OK if I just create a table with two columns, say
COL 1 - Domain URL (as TEXT)
COL 2 - List of Form Field IDs (as TEXT)
One thing to be remembered is... Down the line in my application I will need to do something like this...
Pseudocode
If Domain is "http://somedomain.com":
    For every item in COL2 (which is a list of form field ids):
        Assign some set of values to each of the form fields & then make a POST request
Can anyone guide me, please?
EDITed on 22/07/2011 - Is My Below Database Design Correct?
I have decided to have a solution like this. What do you guys think?
I will be having three tables like below
Table 1
Key Column (Auto Generated Integer) - Primary Key
Domain as TEXT
Sample Data would be something like:
1 http://url1.com
2 http://url2.com
3 http://url3.com
Table 2
Domain (Here I will be using the Key Number from Table 1)
RegLink - This will have the registration link (as TEXT)
Form Fields (as Text)
Sample Data would be something like:
1 http://url1.com/register field1
1 http://url1.com/register field2
1 http://url1.com/register field3
2 http://url2.com/register field1
2 http://url2.com/register field2
2 http://url2.com/register field3
3 http://url3.com/register field1
3 http://url3.com/register field2
3 http://url3.com/register field3
Table 3
Domain (Here I will be using the Key Number from Table 1)
Status (as TEXT)
User (as TEXT)
Pass (as TEXT)
Sample Data would be something like:
1 Pass user1 pass1
2 Fail user2 pass2
3 Pass user3 pass3
Do you think this table design is good? Or are there any improvements that can be made?
There is a normalization problem in your table.
Using 2 tables with
TABLE domains
    int id primary key
    text name

TABLE field_ids
    int id primary key
    int domain_id foreign key ref domains
    text value

is a better solution.
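As a concrete sketch of that normalized layout (SQLite syntax; table and column names taken from the answer above, sample rows hypothetical):
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT)')
c.execute('''CREATE TABLE field_ids
             (id INTEGER PRIMARY KEY,
              domain_id INTEGER REFERENCES domains(id),
              value TEXT)''')
# one row per domain, one row per (domain, field id) pair
c.execute("INSERT INTO domains VALUES (1, 'http://url1.com')")
c.executemany('INSERT INTO field_ids (domain_id, value) VALUES (?, ?)',
              [(1, 'field1'), (1, 'field2'), (1, 'field3')])
conn.commit()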
Proper database design would suggest you have a table of URLs and a table of fields, each referencing a URL record. But depending on what you want to do with them, you could pack lists into a single column. See the docs for how to go about that.
Is sqlite a requirement? It might not be the best way to store the data. E.g. if you need random-access lookups by URL, the shelve module might be a better bet. If you just need to record them and iterate over the sites, it might be simpler to store as CSV.
Try this to get the ids:
ids = (link['id'] for link in
       BeautifulSoup(content, parseOnlyThese=SoupStrainer('input'))
       if link.has_key('id'))
And this should show you how to save them, load them, and do something to each. This uses a single table and just inserts one row for each field for each domain. It's the simplest solution, and perfectly adequate for a relatively small number of rows of data.
from itertools import izip, repeat
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table domains
             (domain text, linkid text)''')

domain_to_insert = 'domain_name'
ids = ['id1', 'id2']
c.executemany("""insert into domains
                 values (?, ?)""", izip(repeat(domain_to_insert), ids))
conn.commit()

domain_to_select = 'domain_name'
c.execute("""select * from domains where domain=?""", (domain_to_select,))

# this is just an example
def some_function_of_row(row):
    return row[1] + ' value'

fields = dict((row[1], some_function_of_row(row)) for row in c)
print fields
c.close()
