Sum the same column in different ways depending on another column - Django ORM - Python

I have an assignment table with the following structure:
| employee | product | process | qty |
|----------|---------|---------|-----|
| Swati    | PROD1   | issue   | 60  |
| Rohit    | PROD1   | issue   | 30  |
| Rohit    | PROD2   | issue   | 40  |
| Swati    | PROD1   | receive | 40  |
| Swati    | PROD2   | issue   | 70  |
I want the final table to look like this for each employee (say for employee = 'Swati'):
| product | sum_issued | sum_received |
|---------|------------|--------------|
| PROD1   | 60         | 40           |
| PROD2   | 70         | 0            |
The SQL query which does this is:
select product
     , sum(case when process = 'issue' then qty else 0 end) as sum_issued
     , sum(case when process = 'receive' then qty else 0 end) as sum_received
from assignment
where employee = 'Swati'
group by product;
What should the corresponding Django query be?

I guess your model is named 'Assignment'. You can use the query below:
from django.db.models import Case, IntegerField, Sum, Value, When

result = Assignment.objects.filter(employee="Swati").values('product').annotate(
    sum_issued=Sum(Case(When(process='issue', then='qty'),
                        default=Value(0), output_field=IntegerField())),
    sum_received=Sum(Case(When(process='receive', then='qty'),
                          default=Value(0), output_field=IntegerField())),
)
If you print the generated SQL (print(result.query)), the result is:
SELECT "product", SUM(CASE WHEN "process" = issue THEN "qty" ELSE 0 END) AS "sum_issued", SUM(CASE WHEN "process" = receive THEN "qty" ELSE 0 END) AS "sum_received" FROM "assignment" WHERE "employee" = 'Swati' GROUP BY "product"

Related

How can I retrieve the last date a product was introduced to my store inventory data?

+---------+-------------+---------+
| Product | Date | On Hand |
+---------+-------------+---------+
| Item_1 | 11-Nov-2020 | 1 |
| Item_1 | 14-Nov-2020 | 0 |
| Item_1 | 18-Nov-2020 | 0 |
| Item_1 | 25-Nov-2020 | 1 | <--- for Item_1
| Item_1 | 28-Nov-2020 | 1 |
| item_2 | 11-Nov-2020 | 1 | <--- for Item_2
| item_2 | 14-Nov-2020 | 1 |
| item_2 | 18-Nov-2020 | 1 |
| item_2 | 25-Nov-2020 | 1 |
| item_2 | 28-Nov-2020 | 1 |
| item_3 | 11-Nov-2020 | 1 |
| item_3 | 14-Nov-2020 | 0 |
| item_3 | 18-Nov-2020 | 1 |
| item_3 | 25-Nov-2020 | 0 |
| item_3 | 28-Nov-2020 | 0 | <-- Out of stock
+---------+-------------+---------+
I have a data frame like the one above and I would like to get a new data frame with the last date the product was introduced to the store. Something like this:
+---------+--------------+
| Product | Last Entry |
+---------+--------------+
| Item_1 | 25-Nov-2020 |
| Item_2 | 11-Nov-2020 |
| Item_3 | Out of stock |
+---------+--------------+
I would like to get a solution either for Python or SQL.
Hmmmm . . . One method is to do a reverse cumulative sum of the rows where on_hand = 0; this sum is 0 for the trailing in-stock streak. Then you want the earliest date in that streak for each product:
select product, min(date)
from (select t.*,
             sum(1 - on_hand) over (partition by product order by date desc) as grp
      from t
     ) t
where grp = 0 and on_hand = 1
group by product;
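Since the question also asks for a Python option, here is a pandas sketch of the same reverse-cumulative-sum idea (my own illustration; the frame below is a hypothetical stand-in for the real data):

import pandas as pd

df = pd.DataFrame({
    "product": ["Item_1"] * 5 + ["item_2"] * 5 + ["item_3"] * 5,
    "date": pd.to_datetime(
        ["11-Nov-2020", "14-Nov-2020", "18-Nov-2020",
         "25-Nov-2020", "28-Nov-2020"] * 3,
        format="%d-%b-%Y"),
    "on_hand": [1, 0, 0, 1, 1,
                1, 1, 1, 1, 1,
                1, 0, 1, 0, 0],
})

# Newest first, then a running count of out-of-stock rows per product:
# rows in the trailing in-stock streak have a running count of 0.
df = df.sort_values(["product", "date"], ascending=[True, False])
df["grp"] = (1 - df["on_hand"]).groupby(df["product"]).cumsum()

last_entry = (
    df[(df["grp"] == 0) & (df["on_hand"] == 1)]
    .groupby("product")["date"]
    .min()
)
print(last_entry)  # item_3 is absent: it is currently out of stock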
Even though you can do things like @GordonLinoff said, or something like this:
with helper as (
    select t.product, max(t.date) as date
    from test t
    group by t.product
)
select
    h.product,
    case when t.on_hand = 0 then 'out of stock' else h.date::text end
from helper h
join test t using (date, product);
I think you should keep your table's integrity, which means you should avoid mixing text and date types in one column. Maybe you could do this with a smaller and better query, but just to show my point:
with helper as (
    select t.product, max(t.date) as date
    from test t
    group by t.product
)
select
    h.*, t.on_hand
from helper h
join test t using (date, product);
This gives you a result like this:
product | date | on_hand
---------+------------+---------
item_1 | 2020-10-03 | 1
item_2 | 2020-10-03 | 0
So you can work with this data in a more appropriate way.

SQLAlchemy: Insert or Update when column value is a duplicate

I have a table A with the following columns:
id UUID
str_identifier TEXT
num FLOAT
and a table B with similar columns:
str_identifier TEXT
num FLOAT
entry_date TIMESTAMP
I want to construct a sqlalchemy query that does the following:
- finds entries in table B that do not yet exist in table A, and inserts them
- finds entries in table B that do exist in table A but have a different value for the num column, and updates them
The catch is that table B has the entry_date column, and as a result can have multiple entries with the same str_identifier but different entry dates. So I always want to perform this insert/update query using the latest entry for a given str_identifier (if it has multiple entries in table B).
For example, if before the query runs tables A and B are:
[A]
| id | str_identifier | num |
|-----|-----------------|-------|
| 1 | str_id_1 | 25 |
[B]
| str_identifier | num | entry_date |
|----------------|-----|------------|
| str_id_1 | 89 | 2020-07-20 |
| str_id_1 | 25 | 2020-06-20 |
| str_id_1 | 50 | 2020-05-20 |
| str_id_2 | 45 | 2020-05-20 |
After the update query, table A should look like:
[A]
| id | str_identifier | num |
|-----|-----------------|-----|
| 1 | str_id_1 | 89 |
| 2 | str_id_2 | 45 |
The query I've constructed so far should detect differences, but will adding order_by(B.entry_date.desc()) ensure I only do the exists-comparisons against the latest entry for each str_identifier?
My Current Query
from sqlalchemy import and_, exists, join, select

query = (
    select([B.str_identifier, B.num])
    .select_from(
        join(B, A, onclause=B.str_identifier == A.str_identifier, isouter=True)
    )
    .where(
        ~exists().where(
            and_(
                B.str_identifier == A.str_identifier,
                B.num == A.num,
                B.num.isnot(None),
            )
        )
    )
)
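For what it's worth, here is a sketch (my own, not from the post) of one way to make the "latest entry wins" part explicit on PostgreSQL: rank B's rows per str_identifier by entry_date with a window function so only the newest row competes, then feed that into an INSERT ... ON CONFLICT DO UPDATE. The session name, the unique constraint on A.str_identifier, and a server-side default for A.id are all assumptions:

from sqlalchemy import select, func
from sqlalchemy.dialects.postgresql import insert

a_table = A.__table__  # the Core table behind the ORM model

# Rank B's rows per str_identifier, newest entry_date first.
ranked = select([
    B.str_identifier,
    B.num,
    func.row_number().over(
        partition_by=B.str_identifier,
        order_by=B.entry_date.desc(),
    ).label("rn"),
]).alias("ranked")

# Only the newest row per str_identifier takes part in the upsert.
latest = select([ranked.c.str_identifier, ranked.c.num]).where(ranked.c.rn == 1)

stmt = insert(a_table).from_select(["str_identifier", "num"], latest)
stmt = stmt.on_conflict_do_update(
    index_elements=["str_identifier"],           # assumes a unique constraint here
    set_={"num": stmt.excluded.num},             # take num from the incoming row
    where=(a_table.c.num != stmt.excluded.num),  # only update when it changed
)
session.execute(stmt)  # hypothetical session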

How to identify the first occurrence of a condition in a Python data frame and perform a calculation on it?

I am really struggling to come up with the logic for this. I have a dataset with a column called Col, shown below; I am using Python and pandas.
I want to add a new column called "STATUS". The logic is:
a. When Col == 0, I Buy. But a Buy happens only when Col == 0 is the first value in the dataset or comes after a Sell in the Status column. There cannot be two Buy values without a Sell in between.
b. When Col <= -8, I Sell. But a Sell happens only if a Buy precedes it in the Status column. There cannot be two Sells without a Buy in between.
I have provided an example of the output I want below. Any help is really appreciated.
The raw data is in the Col column, and the output I want is in Status:
+-------+--------+
| Col | Status |
+-------+--------+
| 0 | Buy |
| -1.41 | 0 |
| 0 | 0 |
| -7.37 | 0 |
| -8.78 | Sell |
| -11.6 | 0 |
| 0 | Buy |
| -5 | 0 |
| -6.1 | 0 |
| -8 | Sell |
| -11 | 0 |
| 0 | Buy |
| 0 | 0 |
| -9 | Sell |
+-------+--------+
Took me some time.
It relies on the following property: the last candidate order you can see looking back from any row, even if you chose not to send it, is always the last decision you took. (Otherwise it would have been sent.)
# +1 marks a Buy candidate (Col == 0), -1 a Sell candidate (Col <= -8).
df['order'] = (df['Col'] == 0).astype(int) - (df['Col'] <= -8).astype(int)
# Keep only the candidates, then accept a candidate only when it differs
# from the previous accepted candidate, so Buys and Sells must alternate.
orders_no_filter = df.loc[df['order'] != 0, 'order']
possible = (orders_no_filter != orders_no_filter.shift(1))
df['order'] = df['order'] * possible.reindex(df.index, fill_value=0)
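To get the literal Buy/Sell strings shown in the question's Status column, the +1/-1/0 signal can be mapped afterwards (my addition, not part of the original answer):

# Map the numeric signal to the labels from the expected output.
df['Status'] = df['order'].map({1: 'Buy', -1: 'Sell', 0: 0})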

How do I find the top X by count from MySQL in Python?

I have a csv file like this:
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@gmail.com, 01-05-2014
I am reading the above CSV file, extracting the domain name, and counting email addresses by domain name and date. All of this I am able to insert into a MySQL table called domains successfully.
Problem statement: Now I need to use the same table to report the top 50 domains by count, sorted by the percentage growth of the last 30 days compared to the total. This is what I do not understand how to do.
Below is the code with which I can successfully insert into the MySQL database; it is the reporting task above that I don't know how to achieve.
#!/usr/bin/python
import csv
import os
import MySQLdb
from collections import defaultdict, Counter

domain_counts = defaultdict(Counter)

# ======================== Defined Functions ======================
def get_file_path(filename):
    # get current working directory path
    currentdirpath = os.getcwd()
    filepath = os.path.join(currentdirpath, filename)
    return filepath

# =================================================================
def read_CSV(filepath):
    with open(filepath) as f:
        reader = csv.reader(f)
        for row in reader:
            domain_counts[row[0].split('@')[1].strip()][row[1].strip()] += 1

    db = MySQLdb.connect(host="localhost",     # your host, usually localhost
                         user="root",          # your username
                         passwd="abcdef1234",  # your password
                         db="test")            # name of the database
    cur = db.cursor()
    q = """INSERT INTO domains(domain_name, cnt, date_of_entry)
           VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""
    for domain, data in domain_counts.iteritems():
        for email_date, email_count in data.iteritems():
            cur.execute(q, (domain, email_count, email_date))
    db.commit()

# ======================= main program ============================
path = get_file_path('emails.csv')
read_CSV(path)  # read the input file
What is the right way to do the reporting task using the domains table?
Update:
Here is my domains table:
mysql> describe domains;
+----------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| domain_name | varchar(20) | NO | | NULL | |
| cnt | int(11) | YES | | NULL | |
| date_of_entry | date | NO | | NULL | |
+----------------+-------------+------+-----+---------+----------------+
And here is data I have in them:
mysql> select * from domains;
+----+---------------+-----+---------------+
| id | domain_name   | cnt | date_of_entry |
+----+---------------+-----+---------------+
|  1 | wawa.com      |   2 | 2014-04-30    |
|  2 | wawa.com      |   2 | 2014-05-01    |
|  3 | wawa.com      |   3 | 2014-05-31    |
|  4 | uwaterloo.ca  |   4 | 2014-04-30    |
|  5 | uwaterloo.ca  |   3 | 2014-05-01    |
|  6 | uwaterloo.ca  |   1 | 2014-05-31    |
|  7 | anonymous.com |   2 | 2014-04-30    |
|  8 | anonymous.com |   4 | 2014-05-01    |
|  9 | anonymous.com |   8 | 2014-05-31    |
| 10 | hotmail.com   |   4 | 2014-04-30    |
| 11 | hotmail.com   |   1 | 2014-05-01    |
| 12 | hotmail.com   |   3 | 2014-05-31    |
| 13 | gmail.com     |   6 | 2014-04-30    |
| 14 | gmail.com     |   4 | 2014-05-01    |
| 15 | gmail.com     |   8 | 2014-05-31    |
+----+---------------+-----+---------------+
The report you need can be built in SQL on the MySQL side; Python can then be used to run the query, fetch the result set, and print the rows.
Consider the following aggregate query with a subquery and derived table, which follows the percentage growth formula:
((this month's domain total cnt) - (last month's domain total cnt))
    / (last month's all-domains total cnt)
SQL
SELECT domain_name, pct_growth
FROM (
SELECT t1.domain_name,
# SUM OF SPECIFIC DOMAIN'S CNT BETWEEN TODAY AND 30 DAYS AGO
(Sum(CASE WHEN t1.date_of_entry >= (CURRENT_DATE - INTERVAL 30 DAY)
THEN t1.cnt ELSE 0 END)
-
# SUM OF SPECIFIC DOMAIN'S CNT AS OF 30 DAYS AGO
Sum(CASE WHEN t1.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY)
THEN t1.cnt ELSE 0 END)
) /
# SUM OF ALL DOMAINS' CNT AS OF 30 DAYS AGO
(SELECT SUM(t2.cnt) FROM domains t2
WHERE t2.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY))
As pct_growth
FROM domains t1
GROUP BY t1.domain_name
) As derivedTable
ORDER BY pct_growth DESC
LIMIT 50;
Python
cur = db.cursor()
sql = "SELECT * FROM ..."  # SEE ABOVE
cur.execute(sql)
for row in cur.fetchall():
    print(row)
If I understand correctly, you just need the ratio of the past thirty days to the total count. You can get this using conditional aggregation. So, assuming that cnt is always greater than 0:
select d.domain_name,
sum(cnt) as CntTotal,
sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) as Cnt30Days,
(sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) / sum(cnt)) as Ratio30Days
from domains d
group by d.domain_name
order by Ratio30Days desc;

Why is Django's order_by so slow in a many-to-many query?

I have a ManyToMany field. Like this:
class Tag(models.Model):
    books = models.ManyToManyField('book.Book', related_name='vtags', through=TagBook)

class Book(models.Model):
    nump = models.IntegerField(default=0, db_index=True)
I have around 450,000 books, and some tags relate to around 60,000 books. When I run a query like:
tag.books.order_by('nump')[1:11]
it gets extremely slow, taking 3-4 minutes. But if I remove the order_by, the query runs normally.
The raw sql for the order_by version looks like this:
'SELECT `book_book`.`id`, ... `book_book`.`price`, `book_book`.`nump`,
FROM `book_book` INNER JOIN `book_tagbook` ON (`book_book`.`id` =
`book_tagbook`.`book_id`) WHERE `book_tagbook`.`tag_id` = 1 ORDER BY
`book_book`.`nump` ASC LIMIT 11 OFFSET 1'
Do you have any idea why this happens? How could I fix it? Thanks.
---EDIT---
I checked the previous raw query in MySQL, as @bouke suggested:
SELECT `book_book`.`id`, `book_book`.`title`, ... `book_book`.`nump`,
`book_book`.`raw_data` FROM `book_book` INNER JOIN `book_tagbook` ON
(`book_book`.`id` = `book_tagbook`.`book_id`) WHERE `book_tagbook`.`tag_id` = 1
ORDER BY `book_book`.`nump` ASC LIMIT 11 OFFSET 1;
11 rows in set (4 min 2.79 sec)
Then I used EXPLAIN to find out why:
+----+-------------+--------------+--------+---------------------------------------------+-----------------------+---------+-----------------------------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+---------------------------------------------+-----------------------+---------+-----------------------------+--------+---------------------------------+
| 1 | SIMPLE | book_tagbook | ref | book_tagbook_3747b463,book_tagbook_752eb95b | book_tagbook_3747b463 | 4 | const | 116394 | Using temporary; Using filesort |
| 1 | SIMPLE | book_book | eq_ref | PRIMARY | PRIMARY | 4 | legend.book_tagbook.book_id | 1 | |
+----+-------------+--------------+--------+---------------------------------------------+-----------------------+---------+-----------------------------+--------+---------------------------------+
2 rows in set (0.10 sec)
And for the table book_book:
mysql> explain book_book;
+----------------+----------------+------+-----+-----------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+----------------+------+-----+-----------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(200) | YES | | NULL | |
| href | varchar(200) | NO | UNI | NULL | |
..... skip some part.............
| nump | int(11) | NO | MUL | 0 | |
| raw_data | varchar(10000) | YES | | NULL | |
+----------------+----------------+------+-----+-----------+----------------+
24 rows in set (0.00 sec)
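No resolution is recorded here, but the "Using temporary; Using filesort" over ~116k joined rows in the EXPLAIN output is the bottleneck: MySQL materializes every matching book before it can sort by nump. One hedged workaround (my sketch, not from the thread) is a two-step query: pull only (id, nump) pairs through the join with no ORDER BY, sort that small list in Python, and refetch the ten full rows by primary key:

# Two-step workaround sketch: avoid the filesort by sorting in Python.
pairs = tag.books.values_list('id', 'nump')   # no ORDER BY -> no filesort
wanted_ids = [pk for pk, _ in sorted(pairs, key=lambda p: p[1])[1:11]]
books = list(Book.objects.filter(id__in=wanted_ids).order_by('nump'))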
