Weighted database query with randomisation - python

I have a database model with a positive integer column named 'weight'. There are other columns too, but they're not important for this problem. The weight column describes how 'important' a row is: the higher the value, the more important. Weight only ranges from 0 to 3, and the default is 0 (least important).
I'd like to perform a query that selects 50 rows ordered by the weight column, but slightly randomised, so that it also includes rows with weights lower than those that would otherwise make the cut.
For example, the first 50 rows ordered by weight may all have a weight of 3 or 2. The query should mostly return these, but also include some rows with a weight of 1 or 0. The results need to be slightly randomised as well, so the same query won't always return the same rows. Also, even though the results are limited to 50, the limit needs to be applied last; otherwise the same 50 rows would be returned, just in a different order.
This will be integrated into a Django project, but the DB is MySQL, so raw SQL is OK.
Performance is critical because this will happen on a landing page of a high traffic website.
Any ideas would be appreciated.
Thanks

You can use the RAND() function combined with your weight column:
select * from YOUR_TABLE order by weight * rand() desc
Note that this means a row with weight 3 is more likely to appear near the beginning than one with weight 2.
A weight of 0 will always end up at the end, because 0 multiplied by any number is still 0. If you don't want that, you can add 1 to the weight and transform the query to
select * from YOUR_TABLE order by (weight + 1) * rand() desc
If you only need the first 50 rows, you can add a LIMIT clause to the query.
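As a concrete illustration (not part of the original answer), here is a rough sketch of running that query from Django with a raw cursor; the table name myapp_item is a hypothetical stand-in for the real table.
from django.db import connection

def weighted_random_rows(limit=50):
    # (weight + 1) * RAND() keeps weight-0 rows in play while still favouring higher weights
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT * FROM myapp_item "
            "ORDER BY (weight + 1) * RAND() DESC "
            "LIMIT %s",
            [limit],
        )
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
Bear in mind that ORDER BY ... RAND() makes MySQL evaluate the expression for every candidate row and sort the whole set, so on a high-traffic landing page it may be worth caching the selection for a short period.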

Related

Conflicting results when grouping observations in Stata vs Python

I have a longitudinal dataset and I am trying to create two variables that correspond to two time periods based on specific date ranges (period_1 and period_2) to be able to analyze the effect of each of those time periods on my outcome.
My Stata code for grouping variables by ID is
gen period_1 = date_eval < mdy(5,4,2020)
preserve
collapse period_1=period_1
count if period_1
and it gives me a number of individuals during that period.
However, I get a different number if I use this SQL query in Python:
evals_period_1 = ps.sqldf('SELECT id, COUNT(date_eval) FROM df WHERE strftime(date_eval) < strftime("%m/%d/%Y",{}) GROUP BY id'.format('5/4/2020'))
Am I grouping by ID differently in these two codes? Please let me know what you think.
I agree with Nick that a reproducible example would have been useful, or at least a description of the results and how they differ from what you expected. However, I can still say something about your Stata code. See the reproducible example below, and note how your code always results in a count of 1, even though the example randomizes the data differently on each run.
* Create a data set with 50 rows where period_1 is dummy (0,1) randomized
* differently each run
clear
set obs 50
gen period_1 = (runiform() < .5)
* List the first 5 rows
list in 1/5
* This collapses all rows and what you are left with is one row where the value
* is the average of all rows
collapse period_1=period_1
* List the one remaining observation
list
* Here Stata's syntax is probably not doing what you expect. period_1 is
* evaluated using its value in the first (and only remaining) row, i.e. the
* random mean around .5. (This is my understanding, assuming it follows what
* "display period_1" would do.)
count if period_1
* That is effectively identical to count if .5, and Stata evaluates any
* nonzero number as "true", so the count of observations where the statement
* is true is 1. This will always be the case in this code, unless the random
* number generator hits the corner case where every row is 0.
count if .5
You probably want to drop the collapse line and change the last line to count if period_1 == 1. Whether that solves your original question, though, depends on how your data is formatted.
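For comparison, here is a minimal pandas sketch of the count the question seems to be after, using a tiny made-up dataset (the column names id and date_eval mirror the question; everything else is illustrative):
import pandas as pd

# Tiny illustrative dataset; ids and dates are made up.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "date_eval": pd.to_datetime(
        ["2020-03-01", "2020-06-10", "2020-04-15", "2020-05-20", "2020-02-02"]
    ),
})

cutoff = pd.Timestamp(2020, 5, 4)
df["period_1"] = df["date_eval"] < cutoff

# Number of distinct individuals with at least one evaluation before the cutoff,
# i.e. the per-ID analogue of "count if period_1 == 1".
n_period_1 = df.loc[df["period_1"], "id"].nunique()
print(n_period_1)  # 3 for this toy data
Comparing dates as datetimes (rather than as "%m/%d/%Y" strings, which compare lexicographically) also removes one likely source of the mismatch with the sqldf query.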

Python Texttable - Total width for each row

Is there a method to determine the column widths for the table after the rows have been added?
For example, even if I set the initialization parameter below:
text_table = Texttable(max_width=160)
the table may still render at a smaller size if the rows' total width is less than that number, which is a sensible rendering.
However, I would like to know the actual width of the entire row for tables that do not hit the max_width limit.
The suggested text_table.width returns a list of numbers, but their sum does not equal the rendered table width.
The following returns the actual width accurately:
table_width = max(len(x) for x in table_as_str.split('\n'))
where table_as_str is the result of text_table.draw().
Thanks
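As a self-contained illustration of that measurement (assuming the third-party texttable package is installed; the table contents below are made up):
from texttable import Texttable

text_table = Texttable(max_width=160)
text_table.add_rows([
    ["name", "value"],   # first row is treated as the header
    ["alpha", 1],
    ["beta", 2],
])

table_as_str = text_table.draw()
# The widest rendered line is the actual total table width, borders included.
table_width = max(len(line) for line in table_as_str.split("\n"))
print(table_width)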

Is there a query to retrieve data from Sybase table on record count condition

I have a situation where I need to select records from a Sybase table based on a certain condition.
The records need to be extracted in batches: if the total count is 2000, I need to extract 500 in the first batch, then 500 in the next, and so on until all 2000 records have been retrieved.
I tried a LIMIT clause, but it gives an incorrect-syntax error:
select top 2 *
from CERD_CORPORATE..BOOK
where id_bo_book in('5330')
limit(2,3)
You can't specify a range in the LIMIT clause, but you can use the OFFSET keyword for this:
SELECT top 2 * FROM CERD_CORPORATE.BOOK
WHERE id_bo_book in('5330')
LIMIT 2 OFFSET 1;
On ASE 12.5.1 and onwards this can be done with a "SQL Derived Table" (also known as an "Inline View"). The query requires that each row has a unique key, so the table can be joined with itself and a count of the rows whose key value is less than or equal to that of the row being joined can be returned. This gives a monotonically increasing number with which to specify the limit and offset.
The equivalents of limit and offset are the values compared against x.rowcounter.
select
    x.rowcounter,
    x.error,
    x.severity
from
    (
        select
            t1.error,
            t1.severity,
            t1.description,
            count(t2.error) as rowcounter
        from
            master..sysmessages t1,
            master..sysmessages t2
        where
            t1.error >= t2.error
        group by
            t1.error,
            t1.severity,
            t1.description
    ) x
where
    x.rowcounter >= 50
    and x.rowcounter < 100
SQL Derived Tables are available as far back as Sybase ASE 12.5.1; see the "SQL Derived Tables" section of the ASE documentation.
The use of master..sysmessages in the example provides a reasonable (10,000 rows) data set with which to experiment.
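If the batching can be driven from the client instead of inside the SQL, a DB-API cursor's fetchmany() splits the result into 500-row chunks for you. The sketch below uses Python's built-in sqlite3 purely as a stand-in for the Sybase driver; with a real connection the SELECT would be the one from the question.
import sqlite3

# In-memory stand-in table so the sketch is runnable without a Sybase server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BOOK (id_bo_book TEXT, payload TEXT)")
conn.executemany("INSERT INTO BOOK VALUES (?, ?)",
                 [("5330", "row %d" % i) for i in range(2000)])

BATCH_SIZE = 500
cursor = conn.cursor()
cursor.execute("SELECT * FROM BOOK WHERE id_bo_book IN ('5330')")

while True:
    batch = cursor.fetchmany(BATCH_SIZE)
    if not batch:
        break
    print("processing", len(batch), "rows")  # replace with real batch handling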

How to down weight the average in SQLITE3 and Python

So I'm knee-deep in my first Python project.
I wondered if someone could help me down-weight the average returned from an SQLite3 query?
I'm currently using ORDER BY ... DESC with LIMIT to get three averages over the data set and then averaging those three results, but this is tedious and inaccurate.
Is there a simple way to reduce the importance of older data by rowid? I've searched everywhere but I'm struggling to find a straightforward answer.
My current query in Python:
c.execute("SELECT "
"AVG(HomeGoals),AVG(AwayGoals)"
"FROM (SELECT HomeGoals,AwayGoals FROM '%s' WHERE AwayId='%s' AND Year<='%s' AND rowid<'%s' AND Year>'%s' ORDER BY rowid DESC LIMIT '%s')" %
(League,each['AwayID'],current_season_year,each['rowid'],previous_year,desc_limit))
Example Data:
rowid HomeGoals
1 3
2 1
3 5
4 6
5 2
So, for example, AVG(HomeGoals) with the above data would return the average of the HomeGoals column, which is 3.4.
What I want to do is give less weight to older results, so that, for example, HomeGoals on rowid 5 would have less significance than HomeGoals on rowid 1. The hope is to return an adjusted average that gives older results less significance.
I hope this makes sense. I've looked at exponential down-weighting but I have no idea how to implement it.
Thank You
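One way to do this, sketched below with the example data from the question: fetch the rows newest-first (the same ORDER BY rowid DESC as in the current query) and apply an exponential weight in Python. The decay factor is an assumed tuning value, not something SQLite provides.
import sqlite3

# Recreate the example data from the question in an in-memory database.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE results (HomeGoals INTEGER)")
c.executemany("INSERT INTO results (HomeGoals) VALUES (?)",
              [(3,), (1,), (5,), (6,), (2,)])

decay = 0.8  # assumed tuning value: each step away from the newest row counts 80% as much
rows = c.execute("SELECT HomeGoals FROM results ORDER BY rowid DESC").fetchall()

weights = [decay ** i for i in range(len(rows))]  # first-fetched (newest) row gets weight 1.0
weighted_avg = sum(w * hg for w, (hg,) in zip(weights, rows)) / sum(weights)
print(round(weighted_avg, 3))
The plain average of the example data is 3.4; the down-weighted version shifts towards whichever rows the ORDER BY puts first.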

How can I make this loop more efficient?

I have a historical collection of ~500k loans, some of which have defaulted and some of which have not. My dataframe is lcd_temp. It has information on the loan size (loan_amnt), whether the loan has defaulted (Total_Defaults), the annual loan rate (clean_rate), the term of the loan (clean_term), and the months from origination to default (mos_to_default); mos_to_default equals clean_term if there is no default.
I would like to calculate the Cumulative Cashflow [cum_cf] for each loan as the sum of all coupons paid until default plus (1 - severity) if the loan defaults, and simply the loan_amnt if it pays back on time.
Here's my code, which takes an awfully long time to run:
severity = 1
for i in range(0, len(lcd_temp['Total_Defaults']) - 1):
    if lcd_temp.loc[i, 'Total_Defaults'] == 1:
        # Default: pay coupon only until time of default, plus (1 - severity)
        lcd_temp.loc[i, 'cum_cf'] = ((lcd_temp.loc[i, 'mos_to_default'] / 12) * lcd_temp.loc[i, 'clean_rate']) + (1 - severity) * lcd_temp.loc[i, 'loan_amnt']
    else:
        # Total cf is sum of coupons (non-compounded) + principal
        lcd_temp.loc[i, 'cum_cf'] = (1 + lcd_temp.loc[i, 'clean_term'] / 12 * lcd_temp.loc[i, 'clean_rate']) * lcd_temp.loc[i, 'loan_amnt']
Any thoughts or suggestions on improving the speed (which takes over an hour so far) welcomed!
Assuming you are using Pandas/NumPy, the standard way to replace an if-then construction such as the one you are using is to use np.where(mask, A, B). The mask is an array of boolean values. When True, the corresponding value from A is returned. When False, the corresponding value from B is returned. The result is an array of the same shape as mask with values from A and/or B.
import numpy as np

severity = 1
mask = (lcd_temp['Total_Defaults'] == 1)
A = ((lcd_temp['mos_to_default'] / 12) * lcd_temp['clean_rate']
     + (1 - severity) * lcd_temp['loan_amnt'])
B = (1 + lcd_temp['clean_term'] / 12 * lcd_temp['clean_rate']) * lcd_temp['loan_amnt']
lcd_temp['cum_cf'] = np.where(mask, A, B)
Notice that this performs the calculation on whole columns instead of row-by-row. This improves performance greatly because it gives Pandas/NumPy the opportunity to pass larger arrays of values to fast underlying C/Fortran functions (in this case, to perform the arithmetic). When you work row-by-row, you are performing scalar arithmetic inside a Python loop, which gives NumPy zero chance to shine.
If you had to compute row-by-row, you would be just as well (and maybe better) off using plain Python.
Even though A and B compute values for the entire column, and some of those values are not used in the final result returned by np.where, this is still faster than computing row by row, assuming there are more than a trivial number of rows.
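To make the snippet above runnable in isolation, here is a toy version with a made-up three-row lcd_temp; the numbers are purely illustrative.
import numpy as np
import pandas as pd

# Made-up stand-in for lcd_temp; values are illustrative only.
lcd_temp = pd.DataFrame({
    "Total_Defaults": [1, 0, 1],
    "mos_to_default": [6, 36, 12],
    "clean_term": [36, 36, 60],
    "clean_rate": [0.10, 0.08, 0.12],
    "loan_amnt": [1000.0, 2000.0, 1500.0],
})

severity = 1
mask = lcd_temp["Total_Defaults"] == 1
A = ((lcd_temp["mos_to_default"] / 12) * lcd_temp["clean_rate"]
     + (1 - severity) * lcd_temp["loan_amnt"])
B = (1 + lcd_temp["clean_term"] / 12 * lcd_temp["clean_rate"]) * lcd_temp["loan_amnt"]
lcd_temp["cum_cf"] = np.where(mask, A, B)
print(lcd_temp[["Total_Defaults", "cum_cf"]])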
