So I'm knee deep in my first Python project.
I'm wondering if someone can help me downweight the average returned by an SQLite3 query?
I'm currently using DESC LIMIT to get three averages of a data set and then averaging the three results, but this is tedious and inaccurate.
Is there a simple function to reduce the importance of older data by rowid? I've searched everywhere but I'm struggling to find a straightforward answer.
My current query in Python:
c.execute("SELECT "
"AVG(HomeGoals),AVG(AwayGoals)"
"FROM (SELECT HomeGoals,AwayGoals FROM '%s' WHERE AwayId='%s' AND Year<='%s' AND rowid<'%s' AND Year>'%s' ORDER BY rowid DESC LIMIT '%s')" %
(League,each['AwayID'],current_season_year,each['rowid'],previous_year,desc_limit))
Example Data:
rowid HomeGoals
1 3
2 1
3 5
4 6
5 2
So, for example, AVG(HomeGoals) with the above data would return the average of the HomeGoals column, which is 3.4.
What I want to do is give older results less weighting, so for example HomeGoals on rowid 1 (the oldest row) would have less significance than HomeGoals on rowid 5 (the newest). The hope is to return an adjusted average by giving older results less significance.
I hope this makes sense. I've looked at exponential downweighting, but I have no idea how to implement it.
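For illustration, this is roughly the kind of thing I'm imagining (a minimal sketch only; the database file, the table name "League" and the 0.8 decay factor are placeholders standing in for the variables in my query above):

import sqlite3

# Rough sketch of an exponentially downweighted average, computed in Python.
# "results.db", the table name "League" and the 0.8 decay are placeholders.
decay = 0.8

conn = sqlite3.connect("results.db")
c = conn.cursor()
c.execute("SELECT rowid, HomeGoals FROM League ORDER BY rowid DESC LIMIT 5")
rows = c.fetchall()  # newest rows first because of ORDER BY rowid DESC

weighted_sum = 0.0
weight_total = 0.0
for i, (rowid, home_goals) in enumerate(rows):
    w = decay ** i            # newest row gets weight 1, each older row 0.8x less
    weighted_sum += w * home_goals
    weight_total += w

print(weighted_sum / weight_total)

With the example data above (newest to oldest: 2, 6, 5, 1, 3) and a decay of 0.8, this gives roughly 3.49 instead of the plain 3.4, because the older 3 and 1 count for less.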
Thank You
Related
I have a longitudinal dataset and I am trying to create two variables that correspond to two time periods based on specific date ranges (period_1 and period_2) to be able to analyze the effect of each of those time periods on my outcome.
My Stata code for grouping variables by ID is
gen period_1 = date_eval < mdy(5,4,2020)
preserve
collapse period_1=period_1
count if period_1
and it gives me the number of individuals during that period.
However, I get a different number when I use this SQL query in Python:
evals_period_1 = ps.sqldf('SELECT id, COUNT(date_eval) FROM df WHERE strftime(date_eval) < strftime("%m/%d/%Y",{}) GROUP BY id'.format('5/4/2020'))
Am I grouping by ID differently in these two pieces of code? Please let me know what you think.
Agree with Nick that a reproducible example would have been useful, or at least a description of the results and how they differ from what you expected. However, I can still say something about your Stata code. See the reproducible example below, and note how your code always results in the count 1, even though the example randomizes the data differently on each run.
* Create a data set with 50 rows where period_1 is dummy (0,1) randomized
* differently each run
clear
set obs 50
gen period_1 = (runiform() < .5)
* List the first 5 rows
list in 1/5
* This collapses all rows and what you are left with is one row where the value
* is the average of all rows
collapse period_1=period_1
* List the one remaining observation
list
* Here the Stata syntax is probably not what you are expecting. period_1 is
* evaluated as its value in the single remaining row, i.e. the random mean
* around .5. (This is my understanding, assuming it follows what
* "display period_1" would do)
count if period_1
* That is identical to count if .5, and Stata evaluates
* any number >0 as "true", meaning the count where
* this statement is true is 1. This will always be the case in this code
* unless the random number generator creates the corner case where all rows are 0
count if .5
You probably want to drop the collapse line and change the last line to count if period_1 == 1. But how your data is formatted is relevant to whether this is the solution to your original question.
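As a side note, comparing dates as %m/%d/%Y strings in the sqldf call will not compare them chronologically, which is another possible source of the mismatch. Here is a minimal pandas sketch of the count I think you are after (the toy data is only there to make the example self-contained; in your case df is the frame you pass to sqldf, and "id"/"date_eval" are assumed to be its column names):

import pandas as pd

# Toy frame; "id" and "date_eval" are assumed to match the columns in your df
df = pd.DataFrame({
    "id":        [1, 1, 2, 3, 3],
    "date_eval": ["2020-03-01", "2020-06-10", "2020-04-20", "2020-05-30", "2020-02-14"],
})
df["date_eval"] = pd.to_datetime(df["date_eval"])  # compare real dates, not strings

cutoff = pd.Timestamp(2020, 5, 4)

# Number of distinct individuals with at least one evaluation before the cutoff
n_period_1 = df.loc[df["date_eval"] < cutoff, "id"].nunique()
print(n_period_1)  # 3 here: ids 1, 2 and 3 all have a date before 2020-05-04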
I am working with a large database of deposits (~600,000 rows), and my task is to slot the deposits into tenor 'buckets', i.e. a 'bucket' has lower and upper limits in days (e.g. 0-30 days, 31-60 days, etc.). The simplified raw data is as follows ('LCY_CURR_BALANCE' is the value of the deposit, 'RM' is the tenor):
As the NaNs in the 'RM' column signify non-maturity deposits, I fill them with 0 and then change the column type from float to integer, using these lines:
MIS035A['KHCL']=MIS035A['KHCL'].fillna(0)#Replace NA with 0
MIS035A['KHCL']=MIS035A['KHCL'].astype(int)
The result is as follows:
However, when I start to sum the 'LCY_CURR_BALANCE' column based on a condition on 'RM', the following problem happens: if the condition is ==0, the process takes ridiculously long to complete (about 3 hours); any other condition takes less than 30 seconds. The code I use for the conditional summation is as follows:
sumif_0=MIS035A[(MIS035A["KHCL"]==0)].sum()["ACY_CURR_BALANCE"]#condition ==0
sumif_1=MIS035A[(MIS035A["KHCL"]==1)].sum()["ACY_CURR_BALANCE"]#condition ==any other number
I would truly appreciate it if someone could explain, or help me solve, why this issue happens. I suspect it may be because of my filling of the NaNs with 0. However, I have not found any further explanation of the issue on the internet.
Thank you very much!
How I would tackle the problem
data_0 = MIS035A[MIS035A["KHCL"]==0]
data_1 = MIS035A[MIS035A["KHCL"]==1]
sum_0 = data_0["LCY_CURR_BALANCE"].sum()
sum_1 = data_1["LCY_CURR_BALANCE"].sum()
Since you have a huge dataset, it is better to subset it so your calculations run faster.
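If you need one sum per KHCL value anyway, a single groupby over just the balance column is another option (a sketch using the column names from your post). Selecting the one numeric column before summing also avoids summing every other column of the frame, which may be part of why the ==0 case is so slow:

# Sketch, assuming MIS035A and the column names from the question.
# One pass over the frame, summing only the balance column per KHCL value.
sums_by_bucket = MIS035A.groupby("KHCL")["LCY_CURR_BALANCE"].sum()

sum_0 = sums_by_bucket.get(0, 0.0)  # the filled NaNs (non-maturity deposits)
sum_1 = sums_by_bucket.get(1, 0.0)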
Dataset
I have a movie dataset with over half a million rows, and it looks like the following (with made-up numbers):
MovieName Date Rating Revenue
A 2019-01-15 3 3.4 million
B 2019-02-03 3 1.2 million
... ... ... ...
Objective
Select movies that are released "close enough" to each other in terms of date (for example, the release date difference between movie A and movie B is less than a month) and see how, when the rating is the same, the revenue can differ.
Question
I know I could write a double loop to achieve this goal. However, I doubt this is the right/efficient way to do it, because:
Some posts (see the comment by cs95 on the question) suggest that iterating over a dataframe is an "anti-pattern" and therefore not advisable.
The dataset has over half a million rows, so I am not sure a double loop would be efficient.
Could someone provide pointers on the question I have? Thank you in advance.
In general, it is true that you should try to avoid loops when working with pandas. My idea is not ideal, but it might point you in the right direction:
Retrieve the month and year from the date column in every row to create the new columns "month" and "year". You can see how to do it here.
Afterwards, group your dataframe by month and year (grouped_df = df.groupby(by=["month", "year"])); the resulting groups are dataframes with movies from the same month and year. Now it's up to you what further analysis you want to perform, for example the mean (grouped_df = df.groupby(by=["month", "year"]).mean()), the standard deviation, or something fancier with the apply() function.
You can also extract weeks if you want a period shorter than a month.
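A minimal sketch of that idea (the toy frame below just stands in for the real movie data; including Rating in the groupby is my addition, so that movies are only compared when the rating is the same):

import pandas as pd

# Toy frame standing in for the real movie data
df = pd.DataFrame({
    "MovieName": ["A", "B", "C"],
    "Date": ["2019-01-15", "2019-02-03", "2019-01-28"],
    "Rating": [3, 3, 4],
    "Revenue": [3.4e6, 1.2e6, 2.0e6],
})

df["Date"] = pd.to_datetime(df["Date"])
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
# df["week"] = df["Date"].dt.isocalendar().week  # if you want a shorter period

# Movies released in the same month and year, with the same rating, end up in
# the same group; then look at how the revenue varies within each group.
grouped = df.groupby(["year", "month", "Rating"])["Revenue"]
print(grouped.mean())
print(grouped.std())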
The same question applies to how to do it in SQL, with views or a stored procedure.
I have a sales table simplified to 4 columns, namely id, product_id, sale_date and quantity. I would like to build a query returning:
1. for each product_id, the total sales by date in one row
2. on the same row, the total sales over the last 7, 15 and 30 days
For now, I use multiple WITH views, one for each column:
days7 = session.query(Sales.product_id, func.sum(Sales.quantity).label('count')).\
    filter(Sales.sale_date > now() - timedelta(days=7)).\
    group_by(Sales.product_id).cte('days7')
...
req = session.query(Sales.product_id,
                    days7.c.count.label('days7'),
                    days15.c.count.label('days15'),
                    days30.c.count.label('days30')).\
    outerjoin(days7, days7.c.product_id == Sales.product_id).\
    outerjoin(days15, days15.c.product_id == Sales.product_id).\
    outerjoin(days30, days30.c.product_id == Sales.product_id).\
    all()
It works pretty well, but I'm not sure this is the best way of doing it. Moreover, if I want to add the count for each of the 30 (or 360) previous days, it becomes crazy. The idea could be to use a simple for loop:
viewSumByDay = []
for day in range(180):
    date = now() - timedelta(days=day)
    viewSumByDay.append(session.query(...).cte(str(date.date())))
which is OK for creating the views. And although the left joins should also be OK with a req = req.outerjoin(viewSumByDay[day], ...), I'm now stuck on how to use the loop to add the columns to the main query.
Do you see another nice solution?
Thanks a lot for your help!
OK, sorry, a simple req = req.add_column(...) is described in the documentation.
However, if there is a prettier way, I would like to know it.
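For the record, a sketch of how the loop version could accumulate the columns and joins (it reuses Sales, session, func, now and timedelta from the snippets above, mirrors the days7/days15/days30 pattern, and uses add_columns, the plural form of the add_column call mentioned above):

# Sketch only: build one CTE per period and accumulate columns/joins in a loop.
periods = (7, 15, 30)  # could just as well be range(1, 181) for daily columns

views = []
for days in periods:
    view = session.query(Sales.product_id,
                         func.sum(Sales.quantity).label('count')).\
        filter(Sales.sale_date > now() - timedelta(days=days)).\
        group_by(Sales.product_id).cte('days%d' % days)
    views.append(view)

req = session.query(Sales.product_id)
for days, view in zip(periods, views):
    req = req.add_columns(view.c.count.label('days%d' % days))
    req = req.outerjoin(view, view.c.product_id == Sales.product_id)

rows = req.all()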
I have a database model with a positive integer column named 'weight'. There are also other columns, but they're not important for this problem. The weight column basically describes how 'important' the row is: the higher the value of weight, the more important. The weight will only range from 0 to 3. The default is 0 (least important).
I'd like to perform a query which selects 50 rows ordered by the weight column, but has been slightly randomised and includes rows with weights lower than what's in the results.
For example, the first 50 rows ordered by weight may all have a weight of 3 and 2. The query needs to include mostly these results, but also include some with a weight of 1 and 0. They need to be slightly randomised as well so the same query won't always return the same results. Also, even though it's limiting the results to 50, it needs to do this last, otherwise the same 50 results will be returned just in a different order.
This will be integrated in a Django project, but the DB is MySQL, so raw SQL is OK.
Performance is critical because this will happen on a landing page of a high traffic website.
Any ideas would be appreciated.
Thanks
You can use the rand() function combined with your weight column:
select * from YOUR_TABLE order by weight * rand() desc
Note that this means a weight of 3 is more likely to appear near the beginning than a weight of 2.
A weight of 0 always appears at the end, because 0 * any number is always 0. If you don't want that, you can add 1 to the weight and change the query to
select * from YOUR_TABLE order by (weight + 1) * rand() desc
Obviously, if you only need the first 50 values, you can add a LIMIT clause to the query.
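For example, from Django this could be issued as a raw query (a sketch; YourModel is a placeholder for the model backed by YOUR_TABLE, and the raw SQL simply appends the LIMIT):

# Sketch: run the weighted-random ordering as raw SQL from Django.
# YourModel stands in for the actual model mapped to YOUR_TABLE.
rows = YourModel.objects.raw(
    "SELECT * FROM YOUR_TABLE "
    "ORDER BY (weight + 1) * RAND() DESC "
    "LIMIT 50"
)
for row in rows:
    print(row.weight)  # other columns are available as attributes too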