ValueError - Recommender System with ALS Model - Python

I have a database that I found online for movies. It has an ID (just an interaction with the movie; the ID itself does not mean anything), a User, and a MovieID. Each separate row represents a given user watching a given movie, so I am trying to write a movie recommendation system: you give me a user and I output a list of movies that they might like.
Here is the database (almost 90,000 rows and a lot of different movies):
          ID     User  MovieID
0      17556     2591    88879
1      17557     3101    88879
2      17598     3101    88879
3      17598     3101    88879
4      17604  9459937    88879
...      ...      ...      ...
88085  73266  9468430     9948
88086  73310  9467397   112749
88087  73371  9468018   109281
88088  73371  9468018   109281
88089  73381  9468360   109508
So I searched the internet and found the following code:
import implicit
from scipy.sparse import coo_matrix
# Drop any duplicate rows from the DataFrame
df = df.drop_duplicates(subset=["User", "MovieID"])
# Sort the DataFrame by the User column
df = df.sort_values("User")
# Create a pivot table with the User as the index, the MovieID as the columns, and the ID as the values
bookings = df.pivot_table(index="User", columns="MovieID", values="ID", aggfunc=len, fill_value=0)
# Convert the pivot table to a sparse matrix in the COOrdinate format
M = coo_matrix(bookings)
# Convert the sparse matrix to the CSR format
M = M.tocsr()
# Create an ALS model
model = implicit.als.AlternatingLeastSquares(factors=10)
# Fit the model to the data
model.fit(M)
def recommend_movies(user):
    # Make sure the user is in the index of the pivot table
    if user not in bookings.index:
        return []
    # Get the user index in the matrix
    user_index = list(bookings.index).index(user)
    # Get the recommendations for the user
    recommendations = model.recommend(user_index, M, N=10)
    # Get the movie IDs of the recommended movies
    recommended_movies = [bookings.columns[index] for index, _ in recommendations]
    return recommended_movies
# Example usage:
recommendations = recommend_movies(3101)
# Print the recommendations
print(recommendations)
But this error kept coming up on this line:
recommendations = recommend_movies(3101)
47 user_count = 1 if np.isscalar(userid) else len(userid)
48 if user_items.shape[0] != user_count:
---> 49 raise ValueError("user_items must contain 1 row for every user in userids")
50
51 user = self._user_factor(userid, user_items, recalculate_user)
ValueError: user_items must contain 1 row for every user in userids
I tried using ChatGPT but it was not able to give me the solution, and I also looked online and was not able to find anything. There are some duplicate User and MovieID values in the dataset, as can be seen, because users watch multiple movies and sometimes rewatch the same movie.
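For what it's worth, this exact ValueError comes from the newer implicit API (0.5+), where model.recommend expects user_items to contain one row per user you pass in, not the full matrix. A minimal sketch of the fix under that assumption (note that recommend then also returns a pair of arrays rather than a list of tuples):
def recommend_movies(user):
    # Make sure the user is in the index of the pivot table
    if user not in bookings.index:
        return []
    user_index = list(bookings.index).index(user)
    # Pass only this user's row; implicit 0.5+ requires one row per userid
    ids, scores = model.recommend(user_index, M[user_index], N=10)
    # Map item positions back to MovieIDs
    return [bookings.columns[i] for i in ids]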

Related

Looking for efficient way to build matrix from yelp review dataset in python

Currently I'm looking for an efficient way to build a rating matrix for a recommendation system in Python.
The matrix should look like this:
4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|
Specifically, the columns are business_id and the rows are user_id
|bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|
Currently I'm using this Yelp review data set stored in MongoDB:
_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"
My approach is to build a list of unique business_id and user_id values from the review table and then query those values in the review table again.
I've included my code here; as you can see, because of the brute-force approach, it took a long time just to build a small matrix like the one I included earlier.
Here's some snippet of my code:
def makeBisnisArray(cityNameParam):
    arrayBisnis = []
    # Append business ids filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        # If the business id is not in the array yet, append it
        if bisnis["_id"] not in arrayBisnis:
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis

def makeUserArray(bisnisName):
    global arrayUser
    # Find reviews filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        # If the user id is not already in the array, append it
        if review['user_id'] not in arrayUser:
            arrayUser.append(review['user_id'])

def writeRatingMatrix(arrayBisnis, arrayUser):
    f = open("file.txt", "w")
    for user in arrayUser:
        for bisnis in arrayBisnis:
            # Find one review from the database by business_id and user_id
            x = colReview.find_one({"business_id": bisnis, "user_id": user})
            # If there is none, write the rating as 0
            if x is None:
                f.write('0|')
            # If found, write the star value
            else:
                f.write(str(x['stars']) + "|")
        print()
        f.write('\n')
    f.close()

def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser)

arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)
Can anyone suggest more efficient way to build the rating matrix for me?
There are several general approaches you can take to speed this up.
Use sets or dictionaries to establish a unique set of businesses and users respectively; set/dict lookups are much faster than list searches.
Process the yelp file one entry at a time, once.
Use something like numpy or pandas to build your matrix.
Something like this:
import numpy

users = {}
businesses = {}
ratings = {}
for entry in yelp_entries:
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    ratings.append((
        users[entry['user_id']],
        businesses[entry['business_id']],
        entry['stars']
    ))
matrix = numpy.tile(0, (len(users), len(businesses)))
for r in ratings:
    matrix[r[0]][r[1]] = r[2]
I modified @sirlark's code to match my needs, but for some reason I could not use append on ratings and iterate over it with for r in ratings, so I had to change the code like this:
users = {}
businesses = {}
ratings = {}
# Index users and businesses as they are encountered, then record each rating
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))
matrix = numpy.tile(0, (len(users), len(businesses)))
for ind in range(0, len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
Later I found out that, other than using the tile method, we can also use a SciPy coo_matrix, which is slightly faster than the method above, but we need to modify the code a bit:
from scipy.sparse import coo_matrix

users = {}
businesses = {}
row = []
col = []
data = []
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))
matrix = coo_matrix((data, (row, col))).toarray()
Note: later I found out that the reason I can't .append() or .add() to the ratings variable is that ratings = {} declares a dict. To declare a set you should use this instead:
ratings = set()
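For completeness, the same users-by-businesses matrix can also be built in a couple of lines with pandas, which the answer above hinted at. A sketch, assuming yelp_entries is a list of review dicts (e.g. from list(colReview.find(...))):
import pandas as pd

reviews = pd.DataFrame(yelp_entries, columns=['user_id', 'business_id', 'stars'])
# pivot_table also resolves duplicate (user, business) pairs, here by taking the max
matrix = reviews.pivot_table(index='user_id', columns='business_id',
                             values='stars', aggfunc='max', fill_value=0)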

How to iterate over a data frame

I have a dataset of users, books and ratings, and I want to find users who rated a particular book highly, and then find the other books those users liked too.
My data looks like:
df.sample(5)
User-ID ISBN Book-Rating
49064 102967 0449244741 8
60600 251150 0452264464 9
376698 52853 0373710720 7
454056 224764 0590416413 7
54148 25409 0312421273 9
Here is what I have done so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line failed with:
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7 and then, for those users, find the other books they liked from the matrix.
Help would be appreciated. Thanks.
In your like_lotr dataframe, 'User-ID' is the name of the index; you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError: it is not a column.
Moreover, ix is deprecated; better to use loc in your case. And don't put quotes around the key: it needs to be an integer, since the column was originally a column of integers (at least from your sample).
Try like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the ids you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The two last lines should look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back into the columns.
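From there, the original goal (what else did those users like?) is easier to answer on the raw dataframe than on the pivot table. A sketch, reusing the integer-ISBN convention from above (the starting ISBN is just the test value used earlier):
liked = df[df['User-ID'].isin(users) & (df['Book-Rating'] > 7)]
liked = liked[liked['ISBN'] != 452264464]  # drop the book we started from
print(liked['ISBN'].value_counts().head(10))  # most commonly liked other books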

Match one to many columns in Pandas dataframe

I have 2 datasets as CSV files; using pandas, each file is converted into a separate dataframe.
I want to find similar companies based on their url. I'm able to find similar companies based on one field (Rule 1), but I want to compare more thoroughly, as follows:
Dataset 1
uuid, company_name, website
YAHOO,Yahoo,yahoo.com
CSCO,Cisco,cisco.com
APPL,Apple,
Dataset 2
company_name, company_website, support_website, privacy_website
Yahoo,,yahoo.com,yahoo.com
Google,google.com,,
Cisco,,,cisco.com
Result Dataset
company_name, company_website, support_website, privacy_website, uuid
Yahoo,,yahoo.com,yahoo.com,YAHOO
Google,google.com,,,
Cisco,,,cisco.com,CSCO
Dataset1 contains ~50K records.
Dataset2 contains ~4M records.
Rules
If field website in dataset 1 is the same as field company_website in dataset 2, extract the identifier.
If there is no match, check whether field website in dataset 1 is the same as field support_website in dataset 2, and extract the identifier.
If there is still no match, check whether field website in dataset 1 is the same as field privacy_website in dataset 2, and extract the identifier.
If there is still no match, check whether field company_name in dataset 1 is the same as field company_name in dataset 2, and extract the identifier.
If nothing matches, return the record with an empty identifier field (UUID).
Here is my current function:
def MatchCompanies(
    companies: pandas.DataFrame,
    competitor_companies: pandas.DataFrame) -> Optional[Sequence[str]]:
  """Find Competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with Competitor information from CSV file.

  Returns:
    A sequence of matched companies and their UUID.

  Raises:
    ValueError: No companies found.
  """
  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields. Use extra space to avoid matching on empty TLD.
  companies.fillna({'website': ' '}, inplace=True)
  competitor_companies = competitor_companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  # Rename column to TLD to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Rename column to TLD since Competitor already contains TLD in company_website.
  competitor_companies.rename(columns={'company_website': 'tld'}, inplace=True)
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')
  # Extracts UUID for company name matches.
  competitor_companies = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
      merge_tld['uuid'])
  match_companies = len(
      competitor_companies[competitor_companies['uuid'].notnull()])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.rename(columns={'tld': 'company_website'}, inplace=True)
  return competitor_companies
Looking for advice on which function to use.
Use Series.map with combine_first, but one requirement is necessary: there must always be unique values in df1['website'] and df1['company_name']:
df1 = df1.dropna()
s1 = df1.set_index('website')['uuid']
s2 = df1.set_index('company_name')['uuid']
w1 = df2['company_website'].map(s1)
w2 = df2['support_website'].map(s1)
w3 = df2['privacy_website'].map(s1)
c = df2['company_name'].map(s2)
df2['uuid'] = w1.combine_first(w2).combine_first(w3).combine_first(c)
print (df2)
company_name company_website support_website privacy_website uuid
0 Yahoo NaN yahoo.com yahoo.com YAHOO
1 Google google.com NaN NaN NaN
2 Cisco NaN NaN cisco.com CSCO
Take a look at DataFrame.merge. Rename the third column in A to company_website, and then something like
A.merge(B, on='company_website', indicator=True)
should at least take care of the first rule.
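That merge idea can be stretched to cover all three website rules in a single pass by melting dataset 2's website columns into one. A sketch, assuming df1/df2 are the two dataframes and empty cells are NaN:
import pandas as pd

# Stack the three website columns into one, remembering the source row
melted = df2.reset_index().melt(
    id_vars='index',
    value_vars=['company_website', 'support_website', 'privacy_website'],
    value_name='website').dropna(subset=['website'])
# One merge against dataset 1; inner merge preserves left-hand row order,
# and melt emits company_website rows before support/privacy ones
matched = melted.merge(df1[['website', 'uuid']], on='website', how='inner')
uuid_by_row = matched.drop_duplicates('index').set_index('index')['uuid']
df2['uuid'] = df2.index.map(uuid_by_row)
# Rule 4: fall back to company_name where no website matched
df2['uuid'] = df2['uuid'].fillna(
    df2['company_name'].map(df1.set_index('company_name')['uuid']))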

Pandas Dataframe: Accessing via composite index created by groupby operation

I want to calculate a group specific ratio gathered from two datasets.
The two Dataframes are read from a database with
leases = pd.read_sql_query(sql, connection)
sales = pd.read_sql_query(sql, connection)
one for real estate offered for sale, the other for rented objects.
Then I group both of them by their city and the category I'm interested in:
leasegroups = leases.groupby(['IDconjugate', "city"])
salegroups = sales.groupby(['IDconjugate', "city"])
Now I want to know the ratio between the cheapest rental object per category and city and the most expensively sold object to obtain a lower bound for possible return:
minlease = leasegroups['price'].min()
maxsale = salegroups['price'].max()
ratios = minlease*12/maxsale
I get output like: Category - City: Ratio, but I cannot access the ratio object by city or category. I tried creating a new dataframe with:
newframe = pd.DataFrame({"Minleases" : minlease,"Maxsales" : maxsale,"Ratios" : ratios})
newframe = newframe.loc[newframe['Ratios'].notnull()]
which gives me the correct rows, and newframe.index returns the groups.
index.names gives ['IDconjugate', 'city'], but indexing results in a KeyError. How can I make an index out of the different groups: ID0+city1, ID0+city2, etc.?
EDIT:
The output looks like this:
Maxsales Minleases Ratios
IDconjugate city
1 argeles gazost 59500 337 0.067966
chelles 129000 519 0.048279
enghien-les-bains 143000 696 0.058406
esbly 117990 495 0.050343
foix 58000 350 0.072414
The goal was to select the top ratios and plot them with bokeh, which takes a dataframe object and plots a column versus an index, as I understand it:
topselect = ratio.loc[ratio["Ratios"] > ratio["Ratios"].quantile(quant)]
dots = Dot(topselect, values='Ratios', label=topselect.index, tools=[hover,],
title="{}% best minimal Lease/Sale Ratios per City and Group".format(topperc*100), width=600)
I really only needed the index as a list in the original order, so the following worked:
ids = []
cities = []
for l in topselect.index:
    ids.append(str(int(l[0])))
    cities.append(l[1])
newind = [i + "_" + j for i, j in zip(ids, cities)]
topselect.index = newind
Now the plot shows 1_city1 ... 1_city2 ... n_cityX on the x-axis. But I figure there must be some obvious way inside the pandas framework that I'm missing.
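For the record, pandas can do both things directly: a row of a MultiIndex frame is selected with a tuple, and flattening the index into ID_city labels is a one-liner. A sketch against the frames above:
# Select a single group by its (IDconjugate, city) tuple
print(newframe.loc[(1, 'chelles'), 'Ratios'])
# Flatten the MultiIndex into 'ID_city' strings in one pass
topselect.index = ['{}_{}'.format(int(i), c) for i, c in topselect.index]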

django annotation and filtering

Hopefully this result set is explanatory enough:
title text total_score already_voted
------------- ------------ ----------- -------------
BP Oil spi... Recently i... 5 0
J-Lo back ... Celebrity ... 7 1
Don't Stop... If there w... 9 0
Australian... The electi... 2 1
My models file describes article (author, text, title) and vote (caster, date, score). I can get the first three columns just fine with the following:
articles = Article.objects.all().annotate(total_score=Sum('vote__score'))
but calculating the 4th column, a boolean describing whether the currently logged-in user cast any of the votes counted in column 3, is a bit beyond me at the moment! Hopefully there's something that doesn't require raw SQL for this one.
Cheers,
Dave
--Trindaz on Fedang #django
I cannot think of a way to include the boolean condition. Perhaps others can answer that better.
How about thinking a bit differently? If you don't mind executing two queries, you can filter your articles based on whether the currently logged-in user has voted on them or not. Something like this:
all_articles = Article.objects.all()
articles_user_has_voted_on = all_articles.filter(
    vote__caster=request.user).annotate(total_score=Sum('vote__score'))
other_articles = all_articles.exclude(
    vote__caster=request.user).annotate(total_score=Sum('vote__score'))
Update
After some experiments I was able to figure out how to add a boolean condition for a column in the same model (Article in this case) but not for a column in another table (Vote.caster).
If Article had a caster column:
Article.objects.all().extra(select = {'already_voted': "caster_id = %s" % request.user.id})
In the present state this can be applied for the Vote model:
Vote.objects.all().extra(select = {'already_voted': "caster_id = %s" % request.user.id})
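On newer Django versions (1.11+), the boolean column can be annotated directly with Exists and OuterRef. A sketch, assuming the Vote model has a ForeignKey to Article named article and a caster field:
from django.db.models import Sum, Exists, OuterRef

articles = Article.objects.annotate(
    total_score=Sum('vote__score'),
    already_voted=Exists(
        Vote.objects.filter(article=OuterRef('pk'), caster=request.user)
    ),
)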
