Calculate lag difference in group - python

I am trying to solve a problem using just SQL (I am able to do this when combining SQL and Python).
Basically, I want to calculate score changes per candidate, where a candidate's score is computed by joining a score lookup table and then summing the individual event scores. If a candidate fails, they are required to retake the events. Here is an example output:
| brandi_id | retest | total_score |
|-----------|--------|-------------|
| 1 | true | 128 |
| 1 | false | 234 |
| 2 | true | 200 |
| 2 | false | 230 |
| 3 | false | 265 |
What I want is to calculate a score change only for those candidates who took a retest, where the score change is just the total_score for retest = true minus the total_score for retest = false:
| brandi_id | difference |
|-----------|------------|
| 1 | 106 |
| 2 | 30 |
This is the SQL I am currently using (with this I still need Python to compute the difference):
select e.brandi_id, e.retest, sum(sl.scaled_score) as total_score
from event as e
left join apf_score_lookup as sl
on sl.asmnt_code = e.asmnt_code
and sl.raw_score = e.score
where e.asmnt_code in ('APFPS','APFSU','APF2M')
group by e.brandi_id, e.retest
order by e.brandi_id;
I think the solution involves using LAG and PARTITION but I cannot get it. Thanks!

If someone does the retest only once, then you can use a join:
select tc.*, tr.score, (tc.score - tr.score) as diff
from t tc join
t tr
on tc.brandi_id = tr.brandi_id and
tc.retest = 'true' and tr.retest = 'false';
You don't describe your table layout. If the results are from the query in your question, you can just plug that in as a CTE.
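If you want the whole thing in a single statement, here is a hedged sketch that plugs the question's aggregation query in as a CTE and applies the join above; conn is assumed to be any DB-API connection to the database holding event and apf_score_lookup:

# Hedged sketch: the question's aggregation query becomes the CTE "t",
# then the self-join from the answer computes the per-candidate difference.
diff_sql = """
with t as (
    select e.brandi_id, e.retest, sum(sl.scaled_score) as total_score
    from event as e
    left join apf_score_lookup as sl
      on sl.asmnt_code = e.asmnt_code
     and sl.raw_score = e.score
    where e.asmnt_code in ('APFPS', 'APFSU', 'APF2M')
    group by e.brandi_id, e.retest
)
select tc.brandi_id,
       tc.total_score - tr.total_score as difference  -- swap the operands if you want false minus true
from t tc
join t tr
  on tc.brandi_id = tr.brandi_id
 and tc.retest = 'true'
 and tr.retest = 'false';
"""

cur = conn.cursor()
cur.execute(diff_sql)
for brandi_id, difference in cur.fetchall():
    print(brandi_id, difference)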

Related

Is there a way to improve a MERGE query?

I using this query to insert new entries to my table
MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount
It seems that I do more inserts than updates, is there a way to improve my current performance?
I am using cx_Oracle with Python 3 and OracleDB 19c.
If you were having massive problems with your approach, you would most probably be missing an index on the column clean.id, which is required when the MERGE uses dual as a source for each row.
That is less likely here, since you say the id is a primary key.
So basically you are doing the right thing, and you will see an execution plan similar to the one below:
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | | | 2 (100)| |
| 1 | MERGE | CLEAN | | | | |
| 2 | VIEW | | | | | |
| 3 | NESTED LOOPS OUTER | | 1 | 40 | 2 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL | DUAL | 1 | 2 | 2 (0)| 00:00:01 |
| 5 | VIEW | VW_LAT_A18161FF | 1 | 38 | 0 (0)| |
| 6 | TABLE ACCESS BY INDEX ROWID| CLEAN | 1 | 38 | 0 (0)| |
|* 7 | INDEX UNIQUE SCAN | CLEAN_UX1 | 1 | | 0 (0)| |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("CLEAN"."ID"=:ID)
So the execution plan is fine and works effectively, but it has one problem.
Remember: whenever you use an index, you will be happy while processing a few rows, but it will not scale.
If you are processing millions of records, you may fall back to two-step processing:
- insert all rows into a temporary table
- perform a single MERGE statement using the temporary table
The big advantage is that Oracle can use a hash join and get rid of the index access for each of the millions of rows.
Here is an example test where the clean table is initialized with 1M ids (not shown) and 1M inserts and 1M updates are performed:
n = 1000000
data2 = [{"id" : i, "xcount" :1} for i in range(2*n)]
sql3 = """
insert into tmp (id,count)
values (:id,:xcount)"""
sql4 = """MERGE into clean USING tmp on (clean.id = tmp.id)
when not matched then insert (id, count) values (tmp.id, tmp.count)
when matched then update set clean.count= clean.count + tmp.count"""
cursor.executemany(sql3, data2)
cursor.execute(sql4)
The test runs in approx. 10 seconds, which is less than half of your approach with MERGE using dual.
If this is still not enough, you'll have to use the parallel option.
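For completeness, a hedged sketch of what the parallel option could look like on top of the temporary-table approach above (it assumes an Oracle edition where parallel DML is available, and reuses the tmp/clean tables and cursor from the test):

# Hedged sketch only: enable parallel DML for the session and hint the MERGE.
cursor.execute("alter session enable parallel dml")
cursor.execute("""
    merge /*+ parallel(8) */ into clean
    using tmp
    on (clean.id = tmp.id)
    when not matched then insert (id, count) values (tmp.id, tmp.count)
    when matched then update set clean.count = clean.count + tmp.count""")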
MERGE is quite fast. Inserts are faster than updates, I'd say (usually).
So, if you're asking how to make inserts faster, then it depends.
If you're inserting one row at a time, there shouldn't be any bottleneck.
If you're inserting millions of rows, see whether there are triggers enabled on the table which fire for each row and do something (slowing the process down).
As for updates, is there an index on the clean.id column? If not, it would probably help.
Otherwise, see what explain plan says; collect statistics regularly.
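If that index really is missing, a minimal sketch of adding it (the index name is made up; skip this if id is already the primary key, since the constraint creates its own unique index):

# Hypothetical index name; not needed if clean.id already has a PK/unique constraint.
cursor.execute("create unique index clean_id_ix on clean (id)")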

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id | text |
| --- | --- |
| 0 | Hello |
| 1 | Adam |
| 2 | Hello |
| 3 | Max |
| 4 | foo |
| 5 | bar |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id | text |
| --- | --- |
| 0 | Hello |
| 2 | Hello |
but I want the rows
| id | text |
| --- | --- |
| 1 | Adam |
| 3 | Max |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that in fact I want to filter those words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html)
SELECT
    id,
    name
FROM (
    SELECT
        *,
        lag(name) OVER (ORDER BY id) as prev_name
    FROM test
) s
WHERE prev_name = 'Hello'
The lag function adds a column with the text of the previous row. So you can filter by this text in a subquery.
demo:db<>fiddle
I am not really into Django, but according to the documentation, window function support was added in version 2.0.
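A hedged sketch of what that could look like with Django's window-function support, using the question's Word model (note that filtering directly on a window annotation is only supported in recent Django releases, so older versions may still need raw SQL):

from django.db.models import F, Window
from django.db.models.functions import Lag

# Annotate every Word with the text of the row immediately before it by id.
words = Word.objects.annotate(
    prev_text=Window(expression=Lag('text'), order_by=F('id').asc())
)

# Older Django versions raise NotSupportedError when filtering on a window
# annotation; in that case fall back to the raw SQL above.
followers = words.filter(prev_text='Hello')   # -> "Adam" and "Max" in the example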
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from words w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution. This seems like a direct interpretation of your question. If you have gaps (and the intention is + 1), then lag() is a bit trickier.
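Since the question prefers plain QuerySets, here is a hedged sketch of the same "id + 1" subquery expressed with Django's ORM (it assumes a contiguous integer id, as above):

from django.db.models import F

# Inner queryset: the ids immediately following a 'Hello'; outer: fetch those rows.
next_ids = Word.objects.filter(text='Hello').annotate(next_id=F('id') + 1).values('next_id')
words_after_hello = Word.objects.filter(id__in=next_ids)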

Pandas groupby count with conditions

Example Data
Given the following data frame:
| feature | gene | target | pos |
|---------|------|--------|-----|
| 1_1_1 | NRAS | AATTGG | 60 |
| 1_1_1 | NRAS | TTGGCC | 6 |
| 1_1_1 | NRAS | AATTGG | 20 |
| 1_1_1 | KRAS | GGGGTT | 0 |
| 1_1_1 | KRAS | GGGGTT | 0 |
| 1_1_1 | KRAS | GGGGTT | 0 |
| 1_1_2 | NRAS | CCTTAA | 2 |
| 1_1_2 | NRAS | GGAATT | 8 |
| 1_1_2 | NRAS | AATTGG | 60 |
The problem
For each feature, I would like to count how many targets appear in each gene, with the following rules:
- If a target appears in only one position (pos column) for a gene, it gets a count of 1 every time it is seen.
- If the same target appears in multiple positions for a gene, each occurrence gets a count of (count at position / total positions found).
- Summarize the total counts of each gene per feature.
What I've done so far
matches.groupby(["FeatureID", "gene"]).size().reset_index()
matches['multi_mapped'] = np.where(matches.groupby(["FeatureID", "gene", "target"]).pos.transform('nunique') > 1, "T", '')
Which gives me a dataframe where targets that appear at more than one position are flagged as true. Now I just need to figure out how to normalize the counts.
Desired output
| feature | gene | count |
|---------|------|-------|
| 1_1_1 | NRAS | 2 |
| 1_1_1 | KRAS | 1 |
| 1_1_2 | NRAS | 3 |
So in the example above for 1_1_1 NRAS, where AATTGG is found at both position 60 and position 20, each occurrence gets a count of .5. TTGGCC was found once at one position, so it gets a count of 1. This makes a total count of 2.
If for 1_1_1 NRAS TTGGCC was found 3 times at the same position, each of those would get a count of 1, for a total of 3 + .5 + .5 = 4.
The solution needs to check for the same target appearing at different positions and then adjust the counts accordingly, and that is the part I'm having a difficult time with. My ultimate goal is to choose the gene with the highest count per group.
It's not really clear to me why the count on the first row should be 2. Could you try playing around with this:
import pandas as pd

feature = ["1_1_1"]*6 + ["1_1_2"]*3
gene = ["NRAS"]*3 + ["KRAS"]*3 + ["NRAS"]*3
target = ["AATTGG", "TTGGCC", "AATTGG"] + ["GGGGTT"]*3 + ["CCTTAA", "GGAATT", "AATTGG"]
pos = [60, 6, 20, 0, 0, 0, 2, 8, 60]

df = pd.DataFrame({"feature": feature,
                   "gene": gene,
                   "target": target,
                   "pos": pos})

df.groupby(["feature", "gene"])\
  .apply(lambda x: len(x.drop_duplicates(["target", "pos"])))
Okay, I figured it out. If there is a more efficient way to do this, I'm all ears!
# flag targets that are multi-mapped and add the flag as a new column
matches['multi_mapped'] = np.where(matches.groupby(["FeatureID", "gene", "target"]).pos.transform('nunique') > 1, "T", '')

# separate multi and non multi-mapped reads using the flag
non = matches[matches["multi_mapped"] != "T"]\
    .drop("multi_mapped", axis=1)
multi = matches[matches["multi_mapped"] == "T"]\
    .drop("multi_mapped", axis=1)

# add counts to non multi-mapped reads
non = non.groupby(["FeatureID", "gene", "target"])\
    .count().reset_index().rename(columns={"pos": "count"})

# add counts to multi-mapped reads with normalization
multi["count"] = multi.groupby(["FeatureID", "gene", "target"])\
    .transform(lambda x: 1/x.count())
multi.drop("pos", axis=1, inplace=True)

# join the multi and non multi-mapped reads back together
counts = pd.concat([multi, non], axis=0)
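For what it's worth, here is a hedged, more compact sketch of the same logic (assuming the same matches DataFrame with columns FeatureID, gene, target and pos): rows whose target maps to several distinct positions are weighted 1/n, every other row counts as 1, and the weights are then summed per feature and gene.

import numpy as np

# Weight each row: 1/n for targets seen at more than one position, else 1.
grp = matches.groupby(["FeatureID", "gene", "target"])["pos"]
weights = np.where(grp.transform("nunique") > 1, 1 / grp.transform("count"), 1.0)

# Sum the weights per feature and gene.
counts = (
    matches.assign(count=weights)
           .groupby(["FeatureID", "gene"])["count"]
           .sum()
           .reset_index()
)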

Get network edges from SQL tables for networkX in python

I am trying to get the number of network edges from a normalised SQLite database which has been normalised as follows:
Authors:
| authorID | name | etc |
|----------|------|-----|
| 1 | .... | ... |
| 2 | .... | ... |
| 3 | .... | ... |
| 4 | .... | ... |
| 5 | .... | ... |
| . | .... | ... |
| 120,000 | .... | ... |

Paper:
| paperID | title | etc |
|---------|-------|-----|
| 1 | ..... | ... |
| 2 | ..... | ... |
| . | ..... | ... |
| 60,000 | ..... | ... |

Paper_Authors:
| paperID | authorID |
|---------|----------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 4 |
| 2 | 5 |
| . | . |
| 60,000 | 120,000 |
There are somewhere in the region of 120,000 authors and 60,000 papers, and the Paper_Authors link table has around 250,000 rows.
I am trying to get this into networkX to do some connectivity analysis, inputting the nodes is simple:
conn = sqlite3.connect('../input/database.sqlite')
c = conn.cursor()
g = nx.Graph()
c.execute('SELECT authorID FROM Authors;')
authors = c.fetchall()
g.add_nodes_from(authors)
The problem I am having arises from trying to determine the edges to feed to networkX, which requires a tuple of the two nodes to connect for each edge. Using the data above as an example, the edge list
[(1,1),(1,2),(1,3),(2,3),(1,4),(1,5),(4,5)]
would describe the dataset.
I have the following code, which works, but is inelegant:
def coauthors(pID):
    c.execute('SELECT authorID \
               FROM Paper_Authors \
               WHERE paperID IS ?;', (pID,))
    out = c.fetchall()
    g.add_edges_from(itertools.product(out, out))

c.execute('SELECT COUNT() FROM Papers;')
papers = c.fetchall()

for i in range(1, papers[0][0]+1):
    if i % 1000 == 0:
        print('On record:', str(i))
    coauthors(i)
This works by looping through each of the papers in the database, returning the list of authors for each one, building the list of author combination tuples, and adding them to the network piecemeal. It works, but it took 30-45 minutes:
print(nx.info(g))
Name:
Type: Graph
Number of nodes: 120670
Number of edges: 697389
Average degree: 11.5586
So my question is: is there a more elegant way to come to the same result, ideally with the paperID as the edge label, to make it easier to navigate the network outside of networkX?
You can get all combinations of authors for each paper with a self join:
SELECT paperID,
a1.authorID AS author1,
a2.authorID AS author2
FROM Paper_Authors AS a1
JOIN Paper_Authors AS a2 USING (paperID)
WHERE a1.authorID < a2.authorID; -- prevent duplicate edges
This will be horribly inefficient unless you have an index on paperID, or better, a covering index on both paperID and authorID, or better, a WITHOUT ROWID table.
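Building on that query, here is a hedged sketch of feeding the self-join results straight into networkX, with paperID kept as an edge attribute as the question asks (it assumes the same database file and table names as the question):

import sqlite3
import networkx as nx

conn = sqlite3.connect('../input/database.sqlite')
c = conn.cursor()

g = nx.Graph()
# Nodes are added here as plain integer ids (the question's fetchall() produces 1-tuples).
g.add_nodes_from(row[0] for row in c.execute('SELECT authorID FROM Authors;'))

# One row per co-author pair per paper; duplicate edges are avoided by the authorID ordering.
edge_sql = """
SELECT paperID, a1.authorID, a2.authorID
FROM Paper_Authors AS a1
JOIN Paper_Authors AS a2 USING (paperID)
WHERE a1.authorID < a2.authorID;
"""
# Note: in a plain Graph a repeated author pair keeps only the last paperID seen;
# use nx.MultiGraph if you need one edge per paper.
g.add_edges_from((a1, a2, {'paperID': pid}) for pid, a1, a2 in c.execute(edge_sql))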

How to efficiently generate a special co-author network in python pandas?

I'm trying to generate a network graph of individual authors given a table of articles. The table I start with is of articles with a single column for the "lead author" and a single column for "co-author". Since each article can have up to 5 authors, article rows may repeat as such:
| paper_ID | project_name | lead_id | co_lead_id | published |
|----------+--------------+---------+------------+-----------|
| 1234 | "fubar" | 999 | 555 | yes |
| 1234 | "fubar" | 999 | 234 | yes |
| 1234 | "fubar" | 999 | 115 | yes |
| 2513 | "fubar2" | 765 | 369 | no |
| 2513 | "fubar2" | 765 | 372 | no |
| 5198 | "fubar3" | 369 | 325 | yes |
My end goal is to have a nodes table, where each row is a unique author, and an edge table, where each row contains source and target author_id columns. The edges table is trivial, as I can merely create a dataframe using the requisite columns of the article table.
For example, for the above table I would have the following node table:
| author_id | is_published |
|-----------+--------------|
| 999 | yes |
| 555 | yes |
| 234 | yes |
| 115 | yes |
| 765 | no |
| 369 | yes |
| 372 | no |
| 325 | yes |
Notice how the "is_published" shows if the author was ever a lead or co-author on at least one published paper. This is where I'm running into trouble creating a nodes table efficiently. Currently I iterate through every row in the article table and run checks on if an author exists yet in the nodes table and whether to turn on the "is_published" flag. See the following code snippet as an example:
articles = pd.read_excel('excel_file_with_articles_table')

nodes = pd.DataFrame(columns=['is_published'])
nodes.index.name = 'author_id'

for row in articles.itertuples():
    if row.lead_id not in nodes.index:
        author = pd.Series([False], index=["is_published"], name=row.lead_id)
        nodes = nodes.append(author)
    if row.co_lead_id not in nodes.index:
        investigator = pd.Series([False], index=["is_published"], name=row.co_lead_id)
        nodes = nodes.append(investigator)
    if row.published == "yes":
        nodes.at[row.lead_id, "is_published"] = True
        nodes.at[row.co_lead_id, "is_published"] = True
For my data set (with tens of thousands of rows), this is somewhat slow, and I understand that loops should be avoided when possible when using pandas dataframes. I feel like the pandas apply function may be able to do what I need, but I'm at a loss as to how to implement it.
With df as your first DataFrame, you should be able to:
nodes = pd.concat([
    df.loc[:, ['lead_id', 'is_published']].rename(columns={'lead_id': 'author_id'}),
    df.loc[:, ['co_lead_id', 'is_published']].rename(columns={'co_lead_id': 'author_id'}),
]).drop_duplicates()
for a unique list of lead_id and co_lead_id values (as author_id) with their respective is_published information.
To keep only the is_published=True row when an author has both True and False entries:
nodes = nodes.sort_values('is_published', ascending=False).drop_duplicates(subset=['author_id'])
.sort_values() will sort True (==1) before False, and .drop_duplicates() by default keeps the first occurrence (see docs). With this addition I guess you don't really need the first .drop_duplicates() anymore.
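A hedged usage sketch on the question's sample columns (the original table has a published column with "yes"/"no", so it is mapped to a boolean is_published first; articles is the DataFrame read from the Excel file):

import pandas as pd

# Map the "yes"/"no" column to a boolean before stacking lead and co-lead ids.
df = articles.assign(is_published=articles['published'].eq('yes'))

nodes = pd.concat([
    df[['lead_id', 'is_published']].rename(columns={'lead_id': 'author_id'}),
    df[['co_lead_id', 'is_published']].rename(columns={'co_lead_id': 'author_id'}),
])

# Keep the True row for authors that appear with both True and False.
nodes = (nodes.sort_values('is_published', ascending=False)
              .drop_duplicates(subset=['author_id'])
              .reset_index(drop=True))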
