G'day All,
I'm trying to create a running balance of all negative transactions whose date is less than or equal to the current transaction object's transaction date. However, if I use __lte=transaction_date the subquery returns multiple rows. That is correct, but I want to sum those rows. How would I do that and annotate the result onto my queryset?
Current Attempt:
# Get the transaction sum and the negative balances to take away from the running balance
totals = queryset.values('transaction_date', 'award_id', 'transaction')\
    .annotate(transaction_amount_sum=Sum("transaction_amount"))\
    .annotate(negative_balance=Coalesce(Sum("transaction_amount", filter=Q(transaction__in=[6, 11, 7, 8, 9, 10, 12, 13])), 0))
# Add it all to the queryset
queryset = queryset\
    .annotate(transaction_amount_sum=SubquerySum(
        totals.filter(award_id=OuterRef('award_id'), transaction_date=OuterRef('transaction_date'), transaction=OuterRef('transaction'))
        .values('transaction_amount_sum')
    ))\
    .annotate(negative_balance=SubquerySum(
        totals.filter(award_id=OuterRef('award_id'), transaction_date=OuterRef('transaction_date'), transaction=OuterRef('transaction'))
        .values('negative_balance')
    ))\
    .annotate(total_awarded=SubquerySum("award__total_awarded"))\
    .annotate(running_balance=F('total_awarded') - F('negative_balance'))  # This doesn't work correctly: we need transaction_date to be less than or equal, not just equal, to the transaction date.
# Filter on distinct: we only want one of each record, it doesn't matter which one. :)
distinct_pk = queryset.distinct('transaction_date', 'award_id', 'transaction').values_list('pk', flat=True)
queryset = queryset.filter(pk__in=distinct_pk)
What needs to be fixed:
.annotate(negative_balance=SubquerySum(
totals.filter(award_id=OuterRef('award_id'),transaction_date=OuterRef('transaction_date'),transaction=OuterRef('transaction'))\
.values('negative_balance')
The above should really be:
.annotate(negative_balance=SubquerySum(
totals.filter(award_id=OuterRef('award_id'),transaction_date__lte=OuterRef('transaction_date'))\
.values('negative_balance')
This returns multiple rows if I do it; what I want is to sum negative_balance across those rows.
Hope the above makes sense.
Any help will be greatly appreciated.
Thanks,
Thomas Lewin
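One way to see what the corrected subquery has to do is to look at the SQL shape: a correlated subquery that sums every negative transaction for the same award dated on or before the outer row's date. Because the SUM happens inside the subquery, it yields a single value per outer row even though many inner rows match. The sketch below uses an in-memory SQLite table rather than the Django ORM; the table name txn, its columns, and the sample amounts are made up for illustration (and assume the amounts are stored as positive numbers), while the transaction type codes are the ones listed in the question.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn (
        id INTEGER PRIMARY KEY,
        award_id INTEGER,
        transaction_date TEXT,
        transaction_type INTEGER,
        transaction_amount REAL,
        total_awarded REAL
    );
    INSERT INTO txn (award_id, transaction_date, transaction_type,
                     transaction_amount, total_awarded) VALUES
        (1, '2021-01-01', 1, 1000, 1000),
        (1, '2021-01-05', 6,  200, 1000),
        (1, '2021-01-09', 7,  300, 1000),
        (2, '2021-01-03', 6,   50,  500);
""")

# Type codes treated as "negative" transactions (taken from the question).
NEGATIVE_TYPES = (6, 11, 7, 8, 9, 10, 12, 13)
placeholders = ",".join("?" * len(NEGATIVE_TYPES))

# The inner SELECT aggregates all matching rows, so it returns one value
# per outer row: the sum of negative transactions up to and including
# the outer row's date, for the same award.
sql = f"""
    SELECT o.award_id,
           o.transaction_date,
           o.total_awarded - COALESCE((
               SELECT SUM(i.transaction_amount)
               FROM txn AS i
               WHERE i.award_id = o.award_id
                 AND i.transaction_date <= o.transaction_date
                 AND i.transaction_type IN ({placeholders})
           ), 0) AS running_balance
    FROM txn AS o
    ORDER BY o.award_id, o.transaction_date
"""
rows = conn.execute(sql, NEGATIVE_TYPES).fetchall()
for row in rows:
    print(row)
```

In ORM terms the point is the same: the aggregation has to happen inside the subquery, after the transaction_date__lte filter, so that the subquery hands back one summed value instead of one row per earlier transaction.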
So I have the following sample dataset:
Column A: Name
Column B: Email
Column C: Products
Column D: Transaction Date
I have two objectives:
1. To determine bulk customers (customers who purchase, let's say, 5 products or more in a single transaction), where each row represents a unique transaction with a unique timestamp.
2. To determine which of the recurring customers (customers frequently making different transactions) are also bulk customers.
Now, I've already determined the list of recurring customers as follows:
n = 15
custmost1 = Order_Details['Name'].value_counts().index.tolist()[:n]
custmost2 = Order_Details['Name'].value_counts().values.tolist()[:n]
custmost = np.column_stack((custmost1,custmost2))
Here custmost is a 2-D array that pairs the n most frequent customers with their transaction counts, and Order_Details is the dataframe I created from the dataset.
Now I'm at my wits' end trying to work out how to keep a count of the different products purchased in a single transaction (with a unique timestamp) and, if possible, add it as a separate column in the dataframe.
I don't know whether either is feasible, but two approaches came to mind:
Count the number of commas, so that the number of commas + 1 is the number of products.
Split each product onto a separate row (which I already did, by the way, to maintain a total count for a different insight) and check, by timestamp, the number of products sold at a given timestamp.
I'd segregated the Products as follows:
reshaped = (
    Order_Details.set_index(Order_Details.columns.drop('Product').tolist())
    .Product.str.split(',', expand=True)
    .stack()
    .reset_index()
    .rename(columns={0: 'Product'})
    .loc[:, Order_Details.columns]
)
So I would like some guidance here, as the approaches above feel rather messy.
Assuming you already have a proper DataFrame:
>>> df.applymap(lambda e: type(e).__name__).value_counts()
name  email  product  date
str   str    list     Timestamp    29
dtype: int64
(i.e., with columns: ['name', 'email', 'product', 'date'], where the 'product' column contains list objects, and date contains Timestamp),
Then you could do this:
bulk_customers = set(df.loc[df['product'].apply(len) >= 5, 'name'])
s = df.groupby('name').size() > 1
recur_customers = set(s[s].index)
>>> bulk_customers
{'PERSON_108', 'PERSON_123'}
>>> recur_customers
{'PERSON_105'}
Notes
I changed the row of PERSON_125 to be PERSON_105, so that there would be one repeat customer. Likewise, I used a threshold of n_visits > 1 as the criterion for "recurring", but you may want something different.
You'd be well inspired to assign a unique ID to each of your customers. This could be based on email or perhaps you already have a customer ID. In any case, using name is prone to collisions, plus sometimes customers change name (e.g. through marriage) while keeping the same email or customer ID.
You didn't mention over what period of time a customer needs to visit again in order to be considered "frequent". If that is to be considered, you have to be specific whether it is e.g. "within a calendar month", or "over the past 30 days", etc., as each leads to slightly different expressions.
Ok, so after a bit of extensive brainstorming, I've concocted the following way to do this:
In the original dataset's dataframe (Order_Details), I counted the commas in each row of the Product column and added 1, which gives the number of products purchased in a single transaction. The code for that is:
Order_Details['Number Of Products'] = Order_Details['Product'].str.count(",")+1
To list the customers in order of the number of products per transaction, I applied sort_values() to a copy of the dataframe:
Dup_Order_Details = Order_Details.copy()
Dup_Order_Details.sort_values(["Number Of Products", "Name"], axis=0, ascending=False, inplace=True, na_position='first')
Finally, a filter for those buying at least N products (here I took N=10, as I wanted this insight; you can take N as input if you want):
Dup_Order_Details = Dup_Order_Details[Dup_Order_Details["Number Of Products"] >= 10]
Then a simple direct display can be done as per your need or you can convert it into a list or something, in case any visualization is needed (which I did).
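To tie the comma-counting idea back to the second objective (which recurring customers are also bulk customers), here is a small plain-Python sketch of the same logic without pandas. The sample rows, the thresholds, and the helper name count_products are all made up for illustration. It counts non-empty items rather than commas + 1, since a plain comma count reports 1 product for an empty string.

```python
from collections import Counter

# Hypothetical rows: (name, comma-separated products, timestamp).
orders = [
    ("PERSON_1", "apple, pear, fig, kiwi, plum", "2021-03-01 10:00"),
    ("PERSON_2", "apple", "2021-03-01 11:00"),
    ("PERSON_1", "pear", "2021-03-02 09:30"),
    ("PERSON_3", "fig,kiwi", "2021-03-02 12:00"),
    ("PERSON_2", "", "2021-03-03 08:15"),
]

BULK_THRESHOLD = 5   # products in a single transaction (assumed)
RECUR_THRESHOLD = 2  # transactions per customer (assumed)


def count_products(cell):
    """Count non-empty comma-separated items, so '' gives 0 rather than 1."""
    return sum(1 for item in cell.split(",") if item.strip())


# Customers with at least one transaction of BULK_THRESHOLD or more products.
bulk = {name for name, products, _ in orders
        if count_products(products) >= BULK_THRESHOLD}

# Customers with at least RECUR_THRESHOLD transactions.
visits = Counter(name for name, _, _ in orders)
recurring = {name for name, n in visits.items() if n >= RECUR_THRESHOLD}

# Objective 2: recurring customers who are also bulk customers.
bulk_and_recurring = bulk & recurring
print(sorted(bulk_and_recurring))
```

Keeping the two groups as sets makes the second objective a plain intersection, whatever thresholds are chosen.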
I have a Flask application that tracks shift information and the orders logged during each shift. The models are set up like this:
class Shift(db.Model):
    # Columns
    orders = db.relationship('Orders', lazy='dynamic')
class Orders(db.Model):
    pay = db.Column(db.Integer)
    dash_id = db.Column(db.Integer, db.ForeignKey('dash.id'))
While the user is in the middle of a shift I want to display the total pay they have made so far, and I also will commit it into the Shift table later as well. To get the total pay of all the related orders I tried to query something like
current_shift = Shift.query.filter_by(id=session['shiftID']).first()
orders = current_shift.orders
total_pay = func.sum(orders.pay)
But it always returns that 'AppenderBaseQuery' object has no attribute 'pay'
I know that I can loop through like this
total_pay = 0
for order in orders:
    total_pay += order.pay
but that can't be as quick, efficient, or certainly readable as an aggregate function in a query.
My question is this: what is the correct way to sum the Orders.pay columns (or perform aggregate functions on any column) of the related orders?
You don't need to go through the shifts table, because you already have all the information that you need in the orders table.
To get the result for a single shift you can do
pay = db_session.query(func.sum(Orders.pay)).filter(Orders.shifts_id == shift_id).one()
or for multiple shifts
pays = (
    db_session.query(Orders.shifts_id, func.sum(Orders.pay))
    .filter(Orders.shifts_id.in_(list_of_shift_ids))
    .group_by(Orders.shifts_id)
    .all()
)
Note that both queries return rows as tuples: the first gives a single row such as (50,), and the second gives one (shifts_id, total) pair per shift, such as [(1, 25), (2, 50)].
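One detail worth checking with these aggregate queries is what comes back when a shift has no orders yet: SQL's SUM over zero rows is NULL, which arrives in Python as None rather than 0. The sketch below shows the row shapes and the COALESCE workaround using the standard-library sqlite3 module instead of SQLAlchemy; the table layout, the sample values, and the helper name total_pay are assumptions for illustration.

```python
import sqlite3

# Hypothetical orders table with three sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, shifts_id INTEGER, pay INTEGER);
    INSERT INTO orders (shifts_id, pay) VALUES (1, 10), (1, 15), (2, 50);
""")


def total_pay(shift_id):
    """Total pay for one shift, returning 0 when the shift has no orders."""
    row = conn.execute(
        "SELECT COALESCE(SUM(pay), 0) FROM orders WHERE shifts_id = ?",
        (shift_id,),
    ).fetchone()
    return row[0]  # unpack the one-element row


# Without COALESCE, a shift with no orders gives a row containing None.
raw_empty = conn.execute(
    "SELECT SUM(pay) FROM orders WHERE shifts_id = ?", (99,)
).fetchone()

# One (shifts_id, total) pair per shift.
per_shift = conn.execute(
    "SELECT shifts_id, SUM(pay) FROM orders GROUP BY shifts_id ORDER BY shifts_id"
).fetchall()

print(total_pay(1), total_pay(99), raw_empty, per_shift)
```

In SQLAlchemy the same wrapping can be expressed as func.coalesce(func.sum(Orders.pay), 0), which is useful if the total is displayed in the middle of a shift before any order has been logged.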
This is my first post here; I hope you will understand what troubles me.
So, I have a DataFrame that contains prices for some 1200 companies for each day, beginning in 2010. Now I want to calculate the total return for each one. My DataFrame is indexed by date. I could use df.iloc[-1]/df.iloc[0], but some companies started trading publicly at a later date, so I can't get results for those companies, as the division involves a NaN value. I tried creating a list that contains the first valid index for every stock (column), but when I then try to calculate the total returns, I get the wrong result!
I've tried a classic for loop:
for l in list:
    returns = df.iloc[-1]/df.iloc[l]
For instance, the last price of one stock was around $16 and the first price I have is $1.5, which would be a return of more than 10 times, yet my result is only about 1.1! I should add that the aforementioned list also includes the first valid index for Date, in the first position.
Can somebody please help me? Thank you very much
Many ways you can go about this actually. But I do recommend you brush up on your python skills with basic examples before you get into more complicated examples.
If you want to do it your way, you can do it like this:
returns = {}
for stock_name in df.columns:
    prices = df[stock_name].dropna()  # drop missing prices, including the days before the stock started trading
    returns[stock_name] = prices.iloc[-1] / prices.iloc[0]
A more idiomatic pandas way would be to do it in a vectorized form, like this:
returns = ((1 + df.ffill().pct_change())
    .cumprod()
    .iloc[-1])
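For what it's worth, the loop in the question goes wrong for two reasons that are visible in the code itself: df.iloc[l] selects a whole row (every stock at position l) rather than one stock's own first valid price, and returns is overwritten on every iteration, so only the last entry of the list survives. The per-stock rule the answers rely on (last valid price divided by first valid price) can be sketched in plain Python as follows; the stock names and prices are made up, with None standing in for NaN and the 1.5 and 16 figures borrowed from the question.

```python
def total_return(prices):
    """Last valid price divided by first valid price; None if no valid price."""
    valid = [p for p in prices if p is not None]
    if not valid:
        return None
    return valid[-1] / valid[0]


# Hypothetical price history; None marks days before a stock was listed.
prices_by_stock = {
    "OLD": [10.0, 11.0, 12.0, 15.0],
    "NEW": [None, None, 1.5, 16.0],
    "EMPTY": [None, None, None, None],
}

# Each stock is divided by its own first valid price, not by a shared row.
returns = {name: total_return(series)
           for name, series in prices_by_stock.items()}
print(returns)
```

The important part is that the starting price is looked up per column, which is what dropna() does in the loop version above.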
The same question on how to do it in SQL, with views or a stored procedure?
I have a sales table, simplified to 4 columns, namely id, product_id, sale_date and quantity. I would like to build a query returning:
1. for each product_id, the total sales by date in one row
2. on the same row, for each product_id, the total sales over the last 7, 15 and 30 days
For now, I use multiple WITH views (CTEs), one for each column:
days7 = session.query(Sales.product_id, func.sum(Sales.quantity).label('count')).\
    filter(Sales.sale_date > now() - timedelta(days=7)).\
    group_by(Sales.product_id).cte('days7')
...
req = session.query(Sales.product_id, days7.c.count.label('days7'),
                    days15.c.count.label('days15'),
                    days30.c.count.label('days30')).\
    outerjoin(days7, days7.c.product_id == Sales.product_id).\
    outerjoin(days15, days15.c.product_id == Sales.product_id).\
    outerjoin(days30, days30.c.product_id == Sales.product_id).\
    all()
It works pretty well, but I'm not sure this is the best way of doing it. Moreover, if I want to add the count for each of the previous 30 (or 360) days, it becomes unmanageable. The idea could have been to use a simple for loop:
viewSumByDay = []
for day in range(180):
    date = now() - timedelta(days=day)
    viewSumByDay.append(session.query(...).cte(str(date.date())))
This is fine for creating the views. And although the left joins should also work with req = req.outerjoin(viewSumByDay[day], ...), I'm now stuck on how to use the loop to add the columns to the main query.
Do you see another nice solution?
Thanks a lot for your help!
Ok, sorry, a simple req=req.add_column(...) is described in the documentation.
However if there is a prettier way, I would like to know it.
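One possibly prettier alternative, offered as a sketch rather than a definitive answer, is conditional aggregation: a single GROUP BY query with one SUM(CASE WHEN ... END) column per time window, so no CTEs or joins are needed and the columns can be generated in a loop. The example below uses the standard-library sqlite3 module instead of SQLAlchemy; the sample rows and the fixed TODAY date are assumptions made so the result is reproducible.

```python
import sqlite3
from datetime import date, timedelta

TODAY = date(2024, 1, 31)  # fixed reference date (assumption, for reproducibility)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product_id INTEGER, "
    "sale_date TEXT, quantity INTEGER)"
)
# Hypothetical sales: product 1 sold 2, 10 and 20 days ago, product 2 sold 40 days ago.
conn.executemany(
    "INSERT INTO sales (product_id, sale_date, quantity) VALUES (?, ?, ?)",
    [
        (1, (TODAY - timedelta(days=2)).isoformat(), 3),
        (1, (TODAY - timedelta(days=10)).isoformat(), 4),
        (1, (TODAY - timedelta(days=20)).isoformat(), 5),
        (2, (TODAY - timedelta(days=40)).isoformat(), 7),
    ],
)

windows = [7, 15, 30]

# One conditional SUM per window; rows outside the window contribute nothing.
columns = ", ".join(
    f"COALESCE(SUM(CASE WHEN sale_date > ? THEN quantity END), 0) AS days{n}"
    for n in windows
)
# One cutoff date per window, in the same order as the columns.
cutoffs = [(TODAY - timedelta(days=n)).isoformat() for n in windows]

sql = f"SELECT product_id, {columns} FROM sales GROUP BY product_id ORDER BY product_id"
rows = conn.execute(sql, cutoffs).fetchall()
print(rows)
```

In SQLAlchemy the same shape can be built by wrapping a case(...) expression in func.sum(...) for each window and appending the resulting labelled columns in the loop, which keeps the query to a single scan of the sales table.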
I have 2 models that are something like this:
class Foo(models.Model):
    name = models.CharField(max_length=30)
class FooStatement(models.Model):
    foo = models.ForeignKey(Foo, on_delete=models.CASCADE)
    revenue = models.DecimalField(decimal_places=2)
    date = models.DateField()
What I want to know is what the average is of the FooStatement revenue for the first 6 dates for each Foo object. Is there any way to achieve this?
I was thinking of slicing the first 6 entries (after ordering them), but I cannot seem to get this to work. The statement months all start at different dates, so I can't just say that I want all dates that are earlier than 'x'.
I'm almost positive the answer lies somewhere in clever annotation, but I just cannot find it.
Edit: Django version 1.11.6 with a postgres database. There are upwards of 4000 Foo objects and they will keep growing.
What I ended up using is F expressions. Though not an ideal solution, it covers roughly 90% of my statements. I annotate the first month and then select all foostatements that are before a certain date and aggregate the average:
qs = Foo.objects.all().annotate(f_month=Min(F('foostatement__date')))
value = qs.filter(foostatement__date__lt=F('f_month')+datetime.timedelta(6*365/12)).aggregate(Avg('foostatement__revenue'))
A problem still persists with foostatements that skip a month, but for my purposes I can live with that.
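For the skipped-month case, one alternative is to pick statements by position instead of by date window: keep a statement only if fewer than 6 earlier statements exist for the same Foo, then average per Foo. The sketch below shows that idea as raw SQL through the standard-library sqlite3 module rather than the Django ORM; the table layout and sample data are made up, and it assumes each Foo has at most one statement per date (ties on the date would let extra rows through).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foostatement (id INTEGER PRIMARY KEY, foo_id INTEGER, "
    "revenue REAL, date TEXT)"
)

sample = []
# Foo 1: eight consecutive monthly statements with revenues 10, 20, ..., 80.
for month in range(1, 9):
    sample.append((1, 10.0 * month, f"2017-{month:02d}-01"))
# Foo 2: skips months and has only three statements.
for month, revenue in [(2, 100.0), (5, 200.0), (11, 300.0)]:
    sample.append((2, revenue, f"2017-{month:02d}-01"))
conn.executemany(
    "INSERT INTO foostatement (foo_id, revenue, date) VALUES (?, ?, ?)", sample
)

FIRST_N = 6

# A statement is among the first N of its Foo when fewer than N statements
# of the same Foo have an earlier date.
sql = """
    SELECT s.foo_id, AVG(s.revenue)
    FROM foostatement AS s
    WHERE (
        SELECT COUNT(*)
        FROM foostatement AS earlier
        WHERE earlier.foo_id = s.foo_id
          AND earlier.date < s.date
    ) < ?
    GROUP BY s.foo_id
    ORDER BY s.foo_id
"""
averages = conn.execute(sql, (FIRST_N,)).fetchall()
print(averages)
```

Because the selection counts rows rather than measuring elapsed time, a Foo that skips months still contributes its first six statements, and a Foo with fewer than six contributes all of them. On PostgreSQL the same ranking is often written with the ROW_NUMBER() window function instead of a counting subquery.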