Django annotation and filtering - Python

Hopefully this result set is self-explanatory:
title          text           total_score  already_voted
-------------  -------------  -----------  -------------
BP Oil spi...  Recently i...            5              0
J-Lo back ...  Celebrity ...            7              1
Don't Stop...  If there w...            9              0
Australian...  The electi...            2              1
My models file describes article (author, text, title) and vote (caster, date, score). I can get the first three columns just fine with the following:
articles = Article.objects.all().annotate(total_score=Sum('vote__score'))
but calculating the 4th column, a boolean value describing whether the currently logged-in user has placed any of the votes in column 3, is a bit beyond me at the moment! Hopefully there's something that doesn't require raw SQL for this one.
Cheers,
Dave
--Trindaz on Fedang #django

I cannot think of a way to include the boolean condition. Perhaps others can answer that better.
How about thinking a bit differently? If you don't mind executing two queries, you can filter your articles based on whether the currently logged-in user has voted on them or not. Something like this:
all_articles = Article.objects.all()
articles_user_has_voted_on = all_articles.filter(
    vote__caster=request.user).annotate(total_score=Sum('vote__score'))
other_articles = all_articles.exclude(
    vote__caster=request.user).annotate(total_score=Sum('vote__score'))
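If a single combined result set is still wanted, the fourth column can be stitched on in plain Python after the two queries. A minimal sketch with hypothetical in-memory rows (voted_ids would be derived from the user's votes; the names are illustrative, not part of the question's models):

```python
# Hypothetical rows, shaped like the annotated queryset's output.
articles = [
    {'title': 'BP Oil spi...', 'total_score': 5},
    {'title': 'J-Lo back ...', 'total_score': 7},
]
voted_ids = {'J-Lo back ...'}   # articles the current user voted on

# Attach the boolean column in Python instead of in SQL.
for article in articles:
    article['already_voted'] = 1 if article['title'] in voted_ids else 0

print([a['already_voted'] for a in articles])  # [0, 1]
```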
Update
After some experiments I was able to figure out how to add a boolean condition for a column in the same model (Article in this case) but not for a column in another table (Vote.caster).
If Article had a caster column (using select_params rather than string formatting, to keep the user id out of the raw SQL string):
Article.objects.all().extra(select={'already_voted': "caster_id = %s"}, select_params=(request.user.id,))
In the present state this can be applied to the Vote model:
Vote.objects.all().extra(select={'already_voted': "caster_id = %s"}, select_params=(request.user.id,))


Value Error - Recommender System with ALS Model

I have a database that I got online for movies. The database has an ID (just an interaction with the movie; the ID does not mean anything), a User, and a MovieID. Each separate row represents a given user watching a given movie, so I am trying to write a movie recommendation system for each user: you give me a user and I output the list of movies that they might like.
Here is the database (almost 90,000 rows and a lot of different movies):
          ID     User  MovieID
0      17556     2591    88879
1      17557     3101    88879
2      17598     3101    88879
3      17598     3101    88879
4      17604  9459937    88879
...      ...      ...      ...
88085  73266  9468430     9948
88086  73310  9467397   112749
88087  73371  9468018   109281
88088  73371  9468018   109281
88089  73381  9468360   109508
So I used internet and found the following code:
import implicit
import pandas as pd
from scipy.sparse import coo_matrix

# Drop any duplicate rows from the DataFrame
df = df.drop_duplicates(subset=["User", "MovieID"])

# Sort the DataFrame by the User column
df = df.sort_values("User")

# Create a pivot table with the User as the index, the MovieID as the
# columns, and the ID as the values
bookings = df.pivot_table(index="User", columns="MovieID", values="ID",
                          aggfunc=len, fill_value=0)

# Convert the pivot table to a sparse matrix in the COOrdinate format
M = coo_matrix(bookings)

# Convert the sparse matrix to the CSR format
M = M.tocsr()

# Create an ALS model
model = implicit.als.AlternatingLeastSquares(factors=10)

# Fit the model to the data
model.fit(M)

def recommend_movies(user):
    # Make sure the user is in the index of the pivot table
    if user not in bookings.index:
        return []
    # Get the user index in the matrix
    user_index = list(bookings.index).index(user)
    # Get the recommendations for the user
    recommendations = model.recommend(user_index, M, N=10)
    # Get the movie IDs of the recommended movies
    recommended_movies = [bookings.columns[index] for index, _ in recommendations]
    return recommended_movies

# Example usage:
recommendations = recommend_movies(3101)

# Print the recommendations
print(recommendations)
But this error kept coming up on this line:
recommendations = recommend_movies(3101)
     47     user_count = 1 if np.isscalar(userid) else len(userid)
     48     if user_items.shape[0] != user_count:
---> 49         raise ValueError("user_items must contain 1 row for every user in userids")
     50
     51     user = self._user_factor(userid, user_items, recalculate_user)

ValueError: user_items must contain 1 row for every user in userids
I tried using ChatGPT, but it was not able to give me the solution, and I also looked online and was not able to find anything. There are some duplicate User and MovieID values in the dataset, as can be seen, because users watch multiple movies and sometimes rewatch the same ones.
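This particular ValueError usually indicates an implicit version of 0.5 or newer, where recommend() expects only the calling user's row(s) of the interaction matrix (for example M[user_index]) rather than the whole matrix, and returns a pair of arrays instead of a list of tuples. A minimal sketch of the matrix-building side with toy data (no implicit install needed; the names mirror the question's code but the data is made up):

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Toy interaction log shaped like the question's data.
df = pd.DataFrame({
    "ID":      [0, 1, 2, 3],
    "User":    [10, 10, 20, 20],
    "MovieID": [100, 200, 100, 200],
})
df = df.drop_duplicates(subset=["User", "MovieID"]).sort_values("User")

# One row per user, one column per movie, counts as values.
bookings = df.pivot_table(index="User", columns="MovieID",
                          values="ID", aggfunc=len, fill_value=0)
M = coo_matrix(bookings).tocsr()

user_index = list(bookings.index).index(10)
row = M[user_index]          # the single row newer implicit expects
print(M.shape, row.shape)    # (2, 2) (1, 2)
```

With a newer implicit, the call inside recommend_movies would then look roughly like ids, scores = model.recommend(user_index, M[user_index], N=10), with the movie IDs recovered via bookings.columns[ids].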

How to iterate over a data frame

I have a dataset of users, books and ratings, and I want to find the users who rated a particular book highly, and then find what other books those users liked too.
My data looks like:
df.sample(5)
        User-ID        ISBN  Book-Rating
49064    102967  0449244741            8
60600    251150  0452264464            9
376698    52853  0373710720            7
454056   224764  0590416413            7
54148     25409  0312421273            9
I did so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line failed with:
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7, and for those users further find books they liked too from the matrix.
Help would be appreciated. Thanks.
In your like_lotr dataframe, 'User-ID' is the name of the index; you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError: it is not a column.
Moreover, ix is deprecated; better to use loc in your case. And don't put quotes around the value: it needs to be an integer, since the ISBN values were read in as integers (at least judging from your sample).
Try like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the ids you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines should look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back in the columns.
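Putting the fix together on the sample data (integer ISBNs are assumed here for simplicity; the real column may be strings with leading zeros):

```python
import pandas as pd

# The sample ratings from the question.
df = pd.DataFrame({
    "User-ID":     [102967, 251150, 52853, 224764, 25409],
    "ISBN":        [449244741, 452264464, 373710720, 590416413, 312421273],
    "Book-Rating": [8, 9, 7, 7, 9],
})

df_p = df.pivot_table(index='ISBN', columns='User-ID',
                      values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464]           # stand-in for the LOTR ISBN
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
print(users)  # [251150]
```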

Python3, Pandas - New Column Value based on Column To Left Data (Dynamic)

I have a spreadsheet with several columns containing survey responses. This spreadsheet will be merged with others, and I will then have duplicate rows similar to the ones below. I will then need to take all questions with the same text and calculate the percentages of the answers across the entire merged document.
Example Excel Data
**Poll Question**                                                         **Poll Responses**
The content was clear and effectively delivered                           37 Total Votes
Strongly Agree                                                            24.30%
Agree                                                                     70.30%
Neutral                                                                   2.70%
Disagree                                                                  2.70%
Strongly Disagree                                                         0.00%
The Instructor(s) were engaging and motivating                            37 Total Votes
Strongly Agree                                                            21.60%
Agree                                                                     73.00%
Neutral                                                                   2.70%
Disagree                                                                  2.70%
Strongly Disagree                                                         0.00%
I would attend another training session delivered by this Instructor(s)   37 Total Votes
Strongly Agree                                                            21.60%
Agree                                                                     73.00%
Neutral                                                                   5.40%
Disagree                                                                  0.00%
Strongly Disagree                                                         0.00%
This was a good format for my training                                    37 Total Votes
Strongly Agree                                                            24.30%
Agree                                                                     62.20%
Neutral                                                                   8.10%
Disagree                                                                  2.70%
Strongly Disagree                                                         2.70%
Any comments/suggestions about this training course?                      5 Total Votes
My method for calculating a non-percent number of votes will be to convert the percentages to a number, e.g. find and extract 37 from "37 Total Votes", then use the following formula to get the number of users that voted for that particular answer: percent * total / 100.
So 24.30 * 37 / 100 = 8.99, which rounds to 9, meaning 9 out of 37 people voted for "Strongly Agree".
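That extract-then-convert step can be sketched directly (the regex and variable names are illustrative):

```python
import re

# Pull the vote count out of the "x Total Votes" cell.
response = "37 Total Votes"
total = int(re.search(r"(\d+)\s+Total\s+Votes", response).group(1))

# Convert a percentage answer back into a vote count.
percent = 24.30
votes = round(percent * total / 100)   # 8.991 rounds to 9
print(total, votes)  # 37 9
```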
Here's an example spreadsheet of what I'd like to be able to do:
**Poll Question**  **Poll Responses**  **non-percent**  **subtotal**
...                37 Total Votes                    0            37
...                24.30%                            9            37
...                70.30%                           26            37
...                2.70%                             1            37
...                2.70%                             1            37
...                0.00%                             0            37
(note: non-percent and subtotal would be newly created columns)
Currently I take a folder full of .xls files, loop through it, and save each file to another folder in .xlsx format. Inside that loop I've added a comment block containing my # NEW test CODE, where I'm trying to put the logic to do this.
As you can see, I'm trying to target the cell and get its value, then apply a regex to extract the number from it, then add it to the subtotal column in that row. I then want to keep adding it until I see a new row containing x Total Votes.
Here's my current code:
import re
import numpy as np
import pandas as pd

files = get_files('/excels/', '.xls')
df_array = []
for i, f in enumerate(files, start=1):
    sheet = pd.read_html(f, attrs={'class': 'reportData'}, flavor='bs4')
    event_id = get_event_id(pd.read_html(f, attrs={'id': 'eventSummary'}))
    event_title = get_event_title(pd.read_html(f, attrs={'id': 'eventSummary'}))
    filename = event_id + '.xlsx'
    rel_path = 'xlsx/' + filename
    writer = pd.ExcelWriter(rel_path)
    for df in sheet:
        # NEW test CODE
        q_total = 0
        df.columns = df.columns.str.strip()
        if df[df['Poll Responses'].str.contains("Total Votes")]:
            # if df['Poll Responses'].str.contains("Total Votes"):
            q_total = re.findall(r'.+?(?=\sTotal\sVotes)', df['Poll Responses'].str.contains("Total Votes"))[0]
            print(q_total)
        # df['Question Total'] = np.where(df['Poll Responses'].str.contains("Total Votes"), 'yes', 'no')
        # END NEW test Code
        df.insert(0, 'Event ID', event_id)
        df.insert(1, 'Event Title', event_title)
        df.to_excel(writer, 'sheet')
    writer.save()
    # progress of entire list
    if i <= len(files):
        print('\r{:*^10}{:.0f}%'.format('Converting: ', i/len(files)*100), end='')
print('\n')
TL;DR
This seems very convoluted, but if I can get the two new columns that contain the total votes for a question and the number (not percentage) of votes for an answer, then I can do some VLOOKUP magic on the merged document. Any help or methodology suggestions would be greatly appreciated. Thanks!
I solved this, I'll post the pseudo code below:
I loop through each sheet. Inside that loop, I loop through each row using for n, row in enumerate(df.itertuples(), 1):.
I get the value of the field that might contain "Poll Responses": poll_response = str(row[3]).
Using an if/else I check whether poll_response contains the text "Total Votes". If it does, it must be a question; otherwise it must be a row with an answer.
In the if branch (a question) I get the cells that contain the data I need. I then have a function that compares the question text with the question text of every object in the array. If it's a match, I simply update the fields of that object; otherwise I create a new question object.
In the else branch the row is an answer row, and I use the question text to find the object in the array and update/add the answers or data.
This process loops through all the rows in each spreadsheet, and at the end my array is full of unique question objects.
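A minimal runnable sketch of that pseudo code, using a dict in place of the question-object array and toy rows (all names and data are illustrative, not the actual solution code):

```python
import re
import pandas as pd

# Toy sheet shaped like the merged survey export.
df = pd.DataFrame({
    "Poll Question":  ["The content was clear", "", ""],
    "Poll Responses": ["37 Total Votes", "24.30%", "70.30%"],
})

questions = {}   # question text -> {"total": int, "answers": [votes]}
current = None
for row in df.itertuples(index=False):
    question, response = row
    match = re.match(r"(\d+)\s+Total\s+Votes", response)
    if match:                       # a question row: start a new group
        current = question
        questions.setdefault(current, {"total": int(match.group(1)),
                                       "answers": []})
    else:                           # an answer row: convert % to votes
        percent = float(response.rstrip("%"))
        votes = round(percent * questions[current]["total"] / 100)
        questions[current]["answers"].append(votes)

print(questions)
```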

Python .split() function

I am using split to separate the M/D/Y values from one field into their own respective fields. My script is bombing out on the NULL values in the original date field when calculating the Day field.
10/27/1990 ----> M:10 D:27 Y:1990
# Process: Calculate Field Month
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Month",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[0]""",expression_type="PYTHON_9.3",code_block="#")
# Process: Calculate Field Day
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Day",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[1]""",expression_type="PYTHON_9.3",code_block="#")
# Process: Calculate Field Year
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Year",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[-1]""",expression_type="PYTHON_9.3",code_block="#")
I am unsure how I should fix this issue; any suggestions would be greatly appreciated!
Something like this should work (to calculate the year where possible):
in_table = "Assess_Template"
field = "Assess_Template.Year"
expression = "get_year(!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!)"
codeblock = """def get_year(date):
    try:
        return date.split("/")[-1]
    except AttributeError:  # NULL dates come through as None
        return date"""
arcpy.CalculateField_management(in_table, field, expression, "PYTHON_9.3", codeblock)
Good luck!
Tom
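Outside of arcpy, the guarded split in the code block can be exercised with plain Python; get_part here is a hypothetical stand-in for get_year, generalized to any of the three fields:

```python
def get_part(date, index):
    # Guarded split: returns the requested M/D/Y component, or the
    # original value unchanged when the field is NULL (None).
    try:
        return date.split("/")[index]
    except AttributeError:
        return date

print(get_part("10/27/1990", 0))   # '10'
print(get_part("10/27/1990", 1))   # '27'
print(get_part("10/27/1990", -1))  # '1990'
print(get_part(None, 1))           # None
```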

Iterating through form data

I have a QueryDict object in Django as follows:
{'ratingname': ['Beginner', 'Professional'], 'sportname': ['2', '3']}
where the mapping is such:
2 Beginner
3 Professional
and 2, 3 are the primary key values of the sport table in models.py:
class Sport(models.Model):
    name = models.CharField(unique=True, max_length=255)

class SomeTable(models.Model):
    sport = models.ForeignKey(Sport)
    rating = models.CharField(max_length=255, null=True)
My question here is, how do I iterate through ratingname such that I can save it as
st = SomeTable(sport=sportValue, rating=ratingValue)
st.save()
I have tried the following:
ratings = dict['ratingname']
sports = dict['sportname']
for s, i in enumerate(sports):
    sport = Sport.objects.get(pk=sports[int(s[1])])
    rate = SomeTable(sport=sport, rating=ratings[int(s)])
    rate.save()
However, this creates a wrong entry in the tables. For example, with the above given values it creates the following object in my table:
id: 1
sport: 2
rating: 'g'
How do I solve this issue, or is there a better way to do this?
There are a couple of problems here. The main one is that QueryDicts return only the last value when accessed with ['sportname'] or the like. To get the list of values, use getlist('sportname'), as documented here:
https://docs.djangoproject.com/en/1.7/ref/request-response/#django.http.QueryDict.getlist
Your enumerate is off, too - enumerate yields the index first, which your code assigns to s. So s[1] will throw an exception. There's a better way to iterate through two sequences in step, though - zip.
ratings = query_dict.getlist('ratingname')  # don't reuse built-in names like dict
sports = query_dict.getlist('sportname')
for rating, sport_pk in zip(ratings, sports):
    sport = Sport.objects.get(pk=int(sport_pk))
    rate = SomeTable(sport=sport, rating=rating)
    rate.save()
You could also look into using a ModelForm based on your SomeTable model.
You may use zip:
ratings = dict['ratingname']
sports = dict['sportname']
for rating, sport_id in zip(ratings, sports):
    sport = Sport.objects.get(pk=int(sport_id))
    rate = SomeTable(sport=sport, rating=rating)
    rate.save()
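The zip() pairing itself can be checked without Django, using the values from the QueryDict above:

```python
ratings = ['Beginner', 'Professional']
sports = ['2', '3']

# Pair each rating with its sport primary key, converting the pk to int
# as Sport.objects.get(pk=...) would expect.
pairs = [(int(sport_pk), rating) for rating, sport_pk in zip(ratings, sports)]
print(pairs)  # [(2, 'Beginner'), (3, 'Professional')]
```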
