The same question applies to plain SQL: how would you do it there, with views or a stored procedure?
I have a sales table simplified to 4 columns: id, product_id, sale_date and quantity. I would like to build a query returning:
1. for each product_id, the total sales by date in one row
2. on the same row, for each id, the total sales over 7, 15 and 30 days
For now, I use multiple WITH views, one for each column:
days7 = session.query(Sales.product_id, func.sum(Sales.quantity).label('count')).\
    filter(Sales.sale_date > now() - timedelta(days=7)).\
    group_by(Sales.product_id).cte('days7')
...
req = session.query(Sales.product_id,
                    days7.c.count.label('days7'),
                    days15.c.count.label('days15'),
                    days30.c.count.label('days30')).\
    outerjoin(days7, days7.c.product_id == Sales.product_id).\
    outerjoin(days15, days15.c.product_id == Sales.product_id).\
    outerjoin(days30, days30.c.product_id == Sales.product_id).\
    all()
It works pretty well, but I'm not sure this is the best way of doing it. Moreover, if I want to add the count for each of the 30 (or 360) previous days, it becomes crazy. The idea could have been to use a simple for loop:
viewSumByDay = []
for day in range(180):
    date = now() - timedelta(days=day)
    viewSumByDay.append(session.query(...).cte(str(date.date())))
which is OK for creating the views. And although the left joins should also be fine with req = req.outerjoin(viewSumByDay[day], ...), I'm now stuck on how to use the loop to add the columns to the main query.
Do you see another nice solution?
Thanks a lot for your help!
OK, sorry: a simple req = req.add_column(...) is described in the documentation.
However if there is a prettier way, I would like to know it.
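For reference, a minimal sketch of what that loop-built query might look like (assuming the Sales model and session from the question; the 7/15/30-day windows are just illustrative), using add_columns() to bolt one join and one column per CTE onto the main query:
from datetime import datetime, timedelta
from sqlalchemy import func

windows = [7, 15, 30]
ctes = {}
for days in windows:
    # One CTE per rolling window: total quantity per product over the last N days.
    ctes[days] = (
        session.query(Sales.product_id,
                      func.sum(Sales.quantity).label('count'))
        .filter(Sales.sale_date > datetime.now() - timedelta(days=days))
        .group_by(Sales.product_id)
        .cte('days%d' % days)
    )

# Start from the bare product list, then add one outer join and one column per window.
req = session.query(Sales.product_id).distinct()
for days, cte in ctes.items():
    req = (req.outerjoin(cte, cte.c.product_id == Sales.product_id)
              .add_columns(cte.c.count.label('days%d' % days)))

rows = req.all()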
G'day All,
I'm trying to create a running balance of all negative transactions with a date less than or equal to the current transaction object's transaction date. However, if I use __lte=transaction_date I get multiple rows; while this is correct, I want to sum those multiple rows. How would I do that and annotate the result onto my queryset?
Current Attempt:
# Getting transaction sums and the negative balances to take away from the running balance
totals = queryset.values('transaction_date', 'award_id', 'transaction')\
    .annotate(transaction_amount_sum=Sum("transaction_amount"))\
    .annotate(negative_balance=Coalesce(Sum("transaction_amount", filter=Q(transaction__in=[6, 11, 7, 8, 9, 10, 12, 13])), 0))

# Adding it all to the queryset
queryset = queryset\
    .annotate(transaction_amount_sum=SubquerySum(
        totals.filter(award_id=OuterRef('award_id'), transaction_date=OuterRef('transaction_date'), transaction=OuterRef('transaction'))
              .values('transaction_amount_sum')))\
    .annotate(negative_balance=SubquerySum(
        totals.filter(award_id=OuterRef('award_id'), transaction_date=OuterRef('transaction_date'), transaction=OuterRef('transaction'))
              .values('negative_balance')))\
    .annotate(total_awarded=SubquerySum("award__total_awarded"))\
    .annotate(running_balance=F('total_awarded') - F('negative_balance'))  # This doesn't work correctly: transaction_date needs to be less than or equal, not just equal.

# Filtering on distinct; we only want one of each record, doesn't matter which one. :)
distinct_pk = queryset.distinct('transaction_date', 'award_id', 'transaction').values_list('pk', flat=True)
queryset = queryset.filter(pk__in=distinct_pk)
What needs to be fixed:
.annotate(negative_balance=SubquerySum(
    totals.filter(award_id=OuterRef('award_id'), transaction_date=OuterRef('transaction_date'), transaction=OuterRef('transaction'))\
          .values('negative_balance')
The above should really be:
.annotate(negative_balance=SubquerySum(
    totals.filter(award_id=OuterRef('award_id'), transaction_date__lte=OuterRef('transaction_date'))\
          .values('negative_balance')
If I do this it returns multiple rows; what I want is to sum those multiple rows on negative_balance.
Hope the above makes sense.
Any help will be greatly appreciated.
Thanks,
Thomas Lewin
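A minimal sketch of the __lte fix described in the question above, using a plain Subquery aggregate instead of SubquerySum; the model name Transaction is hypothetical and the field names follow the question:
from django.db.models import DecimalField, F, OuterRef, Subquery, Sum, Value
from django.db.models.functions import Coalesce

# Sum the negative transactions up to and including each row's transaction_date
# in a correlated subquery, then attach that single value to every row.
negative_upto_date = (
    Transaction.objects
    .filter(award_id=OuterRef('award_id'),
            transaction_date__lte=OuterRef('transaction_date'),
            transaction__in=[6, 7, 8, 9, 10, 11, 12, 13])
    .order_by()
    .values('award_id')                         # collapse to one group per award
    .annotate(total=Sum('transaction_amount'))  # this sums the multiple rows
    .values('total')[:1]
)

queryset = queryset.annotate(
    negative_balance=Coalesce(Subquery(negative_upto_date),
                              Value(0), output_field=DecimalField()),
).annotate(
    running_balance=F('total_awarded') - F('negative_balance'),  # assumes total_awarded is already annotated
)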
Dataset
I have a movie dataset with over half a million rows, and it looks like the following (with made-up numbers)
MovieName Date Rating Revenue
A 2019-01-15 3 3.4 million
B 2019-02-03 3 1.2 million
... ... ... ...
Objective
Select movies that are released "close enough" in terms of date (for example, the release date difference between movie A and movie B is less than a month) and see, when the rating is the same, how the revenue differs.
Question
I know I could write a double loop to achieve this goal. However, I doubt this is the right/efficient way to do it, because:
Some posts (see the comment by @cs95 on the question) suggest that iterating over a DataFrame is an "anti-pattern" and therefore not advisable.
The dataset has over half a million rows, so I am not sure a double loop would be efficient.
Could someone provide pointers to the question I have? Thank you in advance.
In general, it is true that you should try to avoid loops when working with pandas. My idea is not ideal, but it might point you in the right direction:
Retrieve the month and year from the date column in every row to create new columns "month" and "year". You can see how to do it here.
Afterwards, group your dataframe by month and year (grouped_df = df.groupby(by=["month", "year"])) and the resulting groups are DataFrames with movies from the same month and year. Now it's up to you what further analysis you want to perform: for example the mean (grouped_df = df.groupby(by=["month", "year"]).mean()), the standard deviation, or something more fancy with the apply() function.
You can also extract weeks if you want a period shorter than a month.
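A minimal sketch of that idea, assuming df carries the MovieName, Date, Rating and Revenue columns from the question and that Revenue is already numeric:
import pandas as pd

# Derive year/month columns from the release date.
df['Date'] = pd.to_datetime(df['Date'])
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month

# Movies released in the same month/year with the same rating:
# look at how Revenue spreads within each group.
summary = (df.groupby(['year', 'month', 'Rating'])['Revenue']
             .agg(['count', 'mean', 'std', 'min', 'max']))
print(summary.head())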
This is my first post here; I hope you will understand what troubles me.
So, I have a DataFrame that contains prices for some 1200 companies for each day, beginning in 2010. Now I want to calculate the total return for each one. My DataFrame is indexed by date. I could use the
df.iloc[-1]/df.iloc[0] method, but some companies started trading publicly at a later date, so I can't get the results for those companies, as they are divided by a NaN value. I've tried creating a list that contains the first valid index for every stock (column), but when I try to calculate the total returns, I get the wrong result!
I've tried a classic for loop:
for l in list:
    returns = df.iloc[-1] / df.iloc[l]
For instance, the last price of one stock was around $16 and the first price I have is $1.5, which would be more than a 10x return, yet my result is only about 1.1! I would also like to add that the aforementioned list includes the first valid index for the Date column as well, in the first position.
Can somebody please help me? Thank you very much
There are many ways you can go about this, actually. But I do recommend you brush up on your Python skills with basic examples before you get into more complicated ones.
If you want to do it your way, you can do it like this:
returns = {}
for stock_name in df.columns:
    returns[stock_name] = df[stock_name].dropna().iloc[-1] / df[stock_name].dropna().iloc[0]
A more pythonic way would be to do it in a vectorized form, like this:
returns = ((1 + df.ffill().pct_change())
           .cumprod()
           .iloc[-1])
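An equivalent per-column one-liner, under the same assumption that df holds one price column per company, is to divide each column's last valid price by its first valid price:
# For each column, drop the NaNs before the stock started trading,
# then take last price / first price.
returns = df.apply(lambda s: s.dropna().iloc[-1] / s.dropna().iloc[0])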
For example, if I want to get only the latest date in the "date" column for each unique userId from the "userId" column (so only the latest date in the data frame for each user, 1:1), and list by userId, how would I go about that in the most efficient way possible? Is there a way to do this?
I'm having a difficult time with this since there are multiple dates listed for each user in the data frame, but I only want the latest date for each user. For example, even if userId 9 had multiple dates from 01/01/2019 to 11/30/2019, and userId 8 had multiple dates in the df from 03/15/2019 to 10/31/2019, is there a way to pull a response such as:
userId Date
8 10/31/2019
9 11/30/2019
Use the "better-than" SQL query principle:
You look for something, specifying something that is "better than" what you're looking for. Then you make sure the "better than" is null, meaning there isn't anything better, and hence you have the best.
select best.userId, best.Date
from theTable as best
left join theTable as better
  on best.userId = better.userId and better.Date > best.Date
where better.userId is null
group by best.userId, best.Date;
This is a pretty standard application of group by (in the SQL sense), which slices the data set into groups and applies your desired function (max of date, in this particular case). Pandas is pretty rich in this kind of operation.
So your solution should look like:
df.groupby(['userId'])['Date'].max()
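A small worked example of that, assuming the columns are named userId and Date as in the question:
import pandas as pd

df = pd.DataFrame({
    'userId': [8, 8, 9, 9, 9],
    'Date': pd.to_datetime(['2019-03-15', '2019-10-31',
                            '2019-01-01', '2019-06-15', '2019-11-30']),
})

# as_index=False keeps userId as a column, giving the two-column result shown above.
latest = df.groupby('userId', as_index=False)['Date'].max()
print(latest)
#    userId       Date
# 0       8 2019-10-31
# 1       9 2019-11-30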
A fairly beginner-level pandas question here.
I have a DataFrame of transactions:
Customer Date Amount
Angus 2009-07-18 $76.46
Bruno 2009-07-21 $68.66
Danno 2009-07-25 $73.52
Chapp 2009-07-11 $56.04
Chapp 2009-07-21 $11.30
Frank 2009-07-07 $52.86
Chapp 2009-07-09 $97.82
Danno 2009-07-11 $84.98
(etc. for thousands of lines)
I'd like to create four DataFrames from this data:
For each customer, the customer's name, how many transactions they've done, and the sum of the Amounts of those transactions
For each customer, the date and amount of their most recent transaction.
For each customer, the date and amount of their first transaction.
For each customer, the date and amount of their largest (amount-wise) transaction.
Can you advise me on the appropriate code?
(Answers along the lines of "Why are you using DataFrames? You should be using ThnargLopes for this!" will be warmly received.)
I think a DataFrame is a great structure for your data. Whenever you're setting up for a "split-apply-combine" set of analysis steps, Pandas excels. You can write a function that assumes you only have one customer and returns a Series like you're looking for.
import pandas as pd
def trans_count(DF):
    return pd.Series({'count': len(DF),
                      'total': sum(DF['Amount'])})
Then use groupby and apply:
yourDF.groupby('Customer').apply(trans_count)
However, since each of your new DataFrames is a summary of a single customer, I would suggest writing one function that can return all of your desired results in a single Series.
untested from my phone!
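For what it's worth, a minimal sketch of that single-function idea (column names as in the question, and assuming Amount has already been converted to a number):
import pandas as pd

def summarize(group):
    # One row of results per customer: count, total, first, last and largest transaction.
    group = group.sort_values('Date')
    return pd.Series({
        'count': len(group),
        'total': group['Amount'].sum(),
        'first_date': group['Date'].iloc[0],
        'first_amount': group['Amount'].iloc[0],
        'last_date': group['Date'].iloc[-1],
        'last_amount': group['Amount'].iloc[-1],
        'max_amount': group['Amount'].max(),
    })

summary = df.groupby('Customer').apply(summarize)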
OK, I've figured this out. First, we make a transaction field of ones to sum:
df["Trans"] = len(df)*[1]
We group by Customer:
cust_gp = df.groupby("Customer")
The first one's easiest:
cust_gp.sum()
Four is also not hard:
cust_gp.max()
2 and 3 were tricky... I found a solution that seemed to work with my test data. Sort the data by Customer and Date, then aggregate by taking the first for each Customer:
df.sort(["Customer","Date"]).groupby("Customer").first()
df.sort(["Customer","Date"]).groupby("Customer").last()
...but when I ran it on my big data set, I was told that some of my Most Recent Transactions took place before the Last Transactions. Which makes no sense.
It turned out that the date field was being imported as text! So, complete solution:
df.Date = pd.to_datetime(df.Date)  # Date field should be a date, not text
df = df.sort_values(["Customer", "Date"])
cust_gp = df.groupby("Customer")
total_df = cust_gp.sum()      # 1
largest_df = cust_gp.max()    # 4
first_df = cust_gp.first()    # 3
last_df = cust_gp.last()      # 2
I'm pleased with this, except for the "Gifts" column, which I'm sure isn't implemented in the most elegant way.
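One caveat worth noting on number 4: cust_gp.max() takes the maximum of each column independently, so the Date it reports is the customer's latest date rather than the date of their largest transaction. A hedged sketch that keeps the row together (again assuming Amount is numeric):
# Select the whole row where each customer's Amount is largest,
# so Date stays paired with that transaction.
largest_df = df.loc[df.groupby("Customer")["Amount"].idxmax()].set_index("Customer")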