I have a dataset of users, books and ratings, and I want to find the users who rated a particular book highly, and then find which other books those users also liked.
My data looks like:
df.sample(5)
User-ID ISBN Book-Rating
49064 102967 0449244741 8
60600 251150 0452264464 9
376698 52853 0373710720 7
454056 224764 0590416413 7
54148 25409 0312421273 9
What I have so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line failed with:
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7, and then, from the matrix, find the other books those users also liked.
Help would be appreciated. Thanks.
In your like_lotr dataframe, 'User-ID' is the name of the index, so you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError: it is not a column.
Moreover, ix is deprecated; use loc instead in your case. Also, don't put quotes around the value: it needs to be an integer, since 'User-ID' was originally a column of integers (at least judging from your sample).
Try like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the ids you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines would then look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back into the columns.
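A minimal, self-contained illustration of the index-versus-column distinction, using toy numbers taken from the sample above:

```python
import pandas as pd

# Toy Series shaped like lotr[lotr > 7]: ratings above 7, indexed by User-ID
lotr_fans = pd.Series({251150: 9, 25409: 9}, name="Book-Rating")
lotr_fans.index.name = "User-ID"

frame = lotr_fans.to_frame()
# "User-ID" is the index name here, not a column,
# so frame["User-ID"] would raise a KeyError
assert "User-ID" not in frame.columns

# Option 1: read the ids straight from the index
users = frame.index.tolist()

# Option 2: move the index back into the columns first, then select
users_col = frame.reset_index()["User-ID"].tolist()

assert users == users_col == [251150, 25409]
```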
Related
I have a database of movie interactions that I found online. The database has an ID (just an interaction identifier; the ID itself does not mean anything), a User, and a MovieID. Each separate row represents a given user watching a given movie, and I am trying to write a movie recommendation system for each user: you give me a user, and I output the list of movies they might like.
Here is the database (Almost 90,000 rows and a lot of different movies)
ID User MovieID
0 17556 2591 88879
1 17557 3101 88879
2 17598 3101 88879
3 17598 3101 88879
4 17604 9459937 88879
... ... ... ...
88085 73266 9468430 9948
88086 73310 9467397 112749
88087 73371 9468018 109281
88088 73371 9468018 109281
88089 73381 9468360 109508
So I searched the internet and found the following code:
import implicit
from scipy.sparse import coo_matrix
# Drop any duplicate rows from the DataFrame
df = df.drop_duplicates(subset=["User", "MovieID"])
# Sort the DataFrame by the User column
df = df.sort_values("User")
# Create a pivot table with the User as the index, the MovieID as the columns, and the ID as the values
bookings = df.pivot_table(index="User", columns="MovieID", values="ID", aggfunc=len, fill_value=0)
# Convert the pivot table to a sparse matrix in the COOrdinate format
M = coo_matrix(bookings)
# Convert the sparse matrix to the CSR format
M = M.tocsr()
# Create an ALS model
model = implicit.als.AlternatingLeastSquares(factors=10)
# Fit the model to the data
model.fit(M)
def recommend_movies(user):
    # Make sure the user is in the index of the pivot table
    if user not in bookings.index:
        return []
    # Get the user index in the matrix
    user_index = list(bookings.index).index(user)
    # Get the recommendations for the user
    recommendations = model.recommend(user_index, M, N=10)
    # Get the movie IDs of the recommended movies
    recommended_movies = [bookings.columns[index] for index, _ in recommendations]
    return recommended_movies
# Example usage:
recommendations = recommend_movies(3101)
# Print the recommendations
print(recommendations)
But this error kept coming up on this line:
recommendations = recommend_movies(3101)
47 user_count = 1 if np.isscalar(userid) else len(userid)
48 if user_items.shape[0] != user_count:
---> 49 raise ValueError("user_items must contain 1 row for every user in userids")
50
51 user = self._user_factor(userid, user_items, recalculate_user)
ValueError: user_items must contain 1 row for every user in userids
I tried using ChatGPT, but it was not able to give me the solution, and I also looked online without finding anything. There are some duplicate User and MovieID values in the dataset, as can be seen, because users watch multiple movies and sometimes rewatch the same ones.
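For what it's worth, this ValueError typically comes from implicit >= 0.5, whose `model.recommend(userid, user_items)` expects `user_items` to contain one row per queried user rather than the whole matrix; passing `M[user_index]` instead of `M` is the usual fix. A minimal sketch of the slicing (the `recommend` call itself is left as a comment, since it assumes a fitted implicit model):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy user-item interaction matrix: 3 users x 4 items
M = csr_matrix(np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
]))

user_index = 1
# Slicing one row gives shape (1, n_items): one row for the one queried user
user_row = M[user_index]
assert user_row.shape == (1, 4)

# With implicit >= 0.5 the call would then look like (fitted model assumed):
# ids, scores = model.recommend(user_index, M[user_index], N=10)
```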
Hey guys, this is my first post. I am planning on building an anime recommendation engine using Python. I ran into a problem: I made a list called genre_list which stores the genres I want to filter from the huge data spreadsheet I was given. I am using the pandas library, and its isin() function is supposed to check whether the values of a list are included in the datasheet and filter on them. I am using the function, but it is not able to detect "Action" from the datasheet although it is there. I have a feeling something is wrong with the data types and I probably have to work around it somehow, but I'm not sure how.
I downloaded my csv file from this link, for anyone interested:
https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?resource=download
import pandas as pd
df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
print(genre_list)
df_genre = df[df["genre"].isin(genre_list)]
# df_genre = df["genre"]
print(df_genre)
Output (screenshot): https://i.stack.imgur.com/XZzcc.png
You want to check if ANY value in your user input list is in each of the list values in the "genre" column. The isin function checks whether your input in its entirety equals a cell value, which is not what you want here. Change that line to this:
df_genre = df[df['genre'].apply(lambda x: any([i in x for i in genre_list]))]
Let me know if you need any more help.
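A tiny reproducible contrast between the two checks, using made-up rows shaped like the CSV's genre column (the titles here are placeholders):

```python
import pandas as pd

# Made-up rows shaped like animes.csv: "genre" holds list-like strings
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["['Action', 'Comedy']", "['Drama']", "['Action']"],
})
genre_list = ["Action"]

# isin compares whole cell values, so no cell equals "Action" exactly
assert df[df["genre"].isin(genre_list)].empty

# the any(...) check matches cells whose string contains a wanted genre
mask = df["genre"].apply(lambda x: any(i in x for i in genre_list))
assert df[mask]["title"].tolist() == ["A", "C"]
```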
import pandas as pd
df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
# List of all cells and their genre put into a list
col_list = df["genre"].values.tolist()
temp_list = []
# Each val in the list is compared with the genre_list to see if there is a match
for index, val in enumerate(col_list):
    if all(x in val for x in genre_list):
        # If there is a match, the UID of that cell is added to a temp_list
        temp_list.append(df['uid'].iloc[index])
print(temp_list)
# This checks if the UID is contained in the temp_list of UIDs that have these genres
df_genre = df["uid"].isin(temp_list)
new_df = df.loc[df_genre, "title"]
# Prints all Anime with the specified genres
print(new_df)
This is another approach I took, and it works as well. Thanks for all the help :D
To make a selection from a dataframe, you can write this:
df_genre = df.loc[df['genre'].isin(genre_list)]
I've downloaded the file animes.csv from Kaggle and read it into a dataframe. What I found is that the column genre actually contains strings (of lists), not lists. So one way to get what you want would be:
...
m = df["genre"].str.contains(r"'(?:" + "|".join(genre_list) + r")'")
df_genre = df[m]
Result for genre_list = ["Action"]:
uid ... link
3 5114 ... https://myanimelist.net/anime/5114/Fullmetal_A...
4 31758 ... https://myanimelist.net/anime/31758/Kizumonoga...
5 37510 ... https://myanimelist.net/anime/37510/Mob_Psycho...
7 38000 ... https://myanimelist.net/anime/38000/Kimetsu_no...
9 2904 ... https://myanimelist.net/anime/2904/Code_Geass_...
... ... ... ...
19301 10350 ... https://myanimelist.net/anime/10350/Hakuouki_S...
19303 1293 ... https://myanimelist.net/anime/1293/Urusei_Yatsura
19304 150 ... https://myanimelist.net/anime/150/Blood_
19305 4177 ... https://myanimelist.net/anime/4177/Bounen_no_X...
19309 450 ... https://myanimelist.net/anime/450/InuYasha_Mov...
[4215 rows x 12 columns]
If you want to transform the values of the genre column for some reason into lists, then you could do either
df["genre"] = df["genre"].str[1:-1].str.replace("'", "").str.split(r",\s*")
or
df["genre"] = df["genre"].map(eval)
Afterwards
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
would give you the filtered dataframe.
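Both variants can be sketched on a couple of made-up rows shaped like the CSV's string-encoded genre column (the titles are placeholders):

```python
import pandas as pd

# Made-up rows: "genre" holds strings of lists, as in animes.csv
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["['Action', 'Comedy']", "['Drama']", "['Action', 'Drama']"],
})
genre_list = ["Action"]

# Variant 1: regex against the quoted genre names inside the string
m = df["genre"].str.contains(r"'(?:" + "|".join(genre_list) + r")'")
assert df[m]["title"].tolist() == ["A", "C"]

# Variant 2: convert the strings to real lists, then keep rows whose
# genres overlap genre_list (isdisjoint is False when they overlap)
df["genre"] = df["genre"].map(eval)
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
assert df_genre["title"].tolist() == ["A", "C"]
```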
So there are 2 csv files I'm working with:
file 1:
City KWR1 KWR2 KWR3
Killeen
Killeen
Houston
Whatever
file2:
location link reviews
Killeen www.example.com 300
Killeen www.differentexample.com 200
Killeen www.example3.com 100
Killeen www.extraexample.com 20
Here's what I'm trying to make this code do:
Look at the 'City' in file 1, take the top 3 links in file 2 (you can go ahead and assume the cities won't get mixed up), and then put these top 3 into the KWR1, KWR2 and KWR3 columns for all rows with the same 'City' value.
So it gets the top 3 and then just copies them to the right of all rows with the same 'City' value.
Even asking this question correctly is difficult for me; I hope I've provided enough information.
I know how to read the files in with pandas and all that, I just can't code this exact situation.
It is a slightly unusual requirement, but I think you need three steps:
1. Keep only the first three values you actually need.
df = df.sort_values(by='reviews',ascending=False).groupby('location').head(3).reset_index()
Hopefully this keeps only the top three from every city.
Then you somehow need to label your data. There might be better ways to do this, but here is one way: you assign a new column with numbers and create a user-defined function.
import numpy as np
df['nums'] = np.arange(len(df))
Now you have a column full of numbers (kind of like line numbers)
You then create your function that will label your data...
def my_func(index):
    if index % 3 == 0:
        x = 'KWR' + str(1)
    elif index % 3 == 1:
        x = 'KWR' + str(2)
    elif index % 3 == 2:
        x = 'KWR' + str(3)
    return x
You can then create the labels you need:
df['labels'] = df.nums.apply(my_func)
Then you can do:
my_df = pd.pivot_table(df, values='reviews', index=['location'], columns='labels', aggfunc='max').reset_index()
This literally pulls out the labels (pivots) and puts the values into the right places.
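The three steps can be sketched end to end on toy data. Two small deviations from the snippets above, both assumptions on my part: `groupby().cumcount()` replaces the manual modulo labeling (it numbers rows within each city, so the labels don't depend on cities being contiguous after the sort), and the pivot uses `values='link'` with `aggfunc='first'` so the links themselves, rather than the review counts, land in the KWR columns:

```python
import pandas as pd

# Toy version of file2: links with review counts per city
df = pd.DataFrame({
    "location": ["Killeen"] * 4 + ["Houston"] * 2,
    "link": ["a.com", "b.com", "c.com", "d.com", "e.com", "f.com"],
    "reviews": [300, 200, 100, 20, 50, 40],
})

# 1. keep only the three highest-review rows per city
top3 = df.sort_values(by="reviews", ascending=False).groupby("location").head(3).copy()

# 2. label rows KWR1..KWR3 within each city
top3["labels"] = "KWR" + (top3.groupby("location").cumcount() + 1).astype(str)

# 3. pivot the links into the KWR columns, one row per city
out = top3.pivot_table(values="link", index="location",
                       columns="labels", aggfunc="first").reset_index()

killeen = out.loc[out["location"] == "Killeen"].iloc[0]
assert [killeen["KWR1"], killeen["KWR2"], killeen["KWR3"]] == ["a.com", "b.com", "c.com"]
```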
Currently I'm looking for an efficient way to build a rating matrix for a recommendation system in Python.
The matrix should look like this:
4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|
Specifically, the columns are business_id and the rows are user_id
|bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|
Currently I'm using this Yelp review data set stored in MongoDB:
_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"
My approach is to build a list of unique business_id and user_id values from the review table and then query those values in the review table again.
I've included my code here; as you can see, because of the brute-force approach, it took a long time just to build a small matrix like the one I included earlier.
Here's some snippet of my code:
def makeBisnisArray(cityNameParam):
    arrayBisnis = []
    #Append business id filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        #if the business id is not in array, then append it to the array
        if(not(bisnis in arrayBisnis)):
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis

def makeUserArray(bisnisName):
    global arrayUser
    #find review filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        #if the user id is not already in array, append it to the array
        if(not(review['user_id'] in arrayUser)):
            arrayUser.append(review['user_id'])

def writeRatingMatrix(arrayBisnis, arrayUser):
    f = open("file.txt", "w")
    for user in arrayUser:
        for bisnis in arrayBisnis:
            #find one instance from the database by business_id and user_id
            x = colReview.find_one({"business_id": bisnis, "user_id": user})
            #if there's none, then just write the rating as 0
            if x is None:
                f.write('0|')
            #if found, write the star value
            else:
                f.write(str(x['stars']) + "|")
        print()
        f.write('\n')

def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser)

arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)
Can anyone suggest more efficient way to build the rating matrix for me?
There are several general approaches you can take to speed this up.
Use sets or dictionaries to establish a unique set of businesses and users respectively; Set/Dict lookups are much faster than list searches.
Process the yelp file one entry at a time, once
Use something like numpy or pandas to build your matrix
Something like this (note that numpy needs to be imported):
import numpy

users = {}
businesses = {}
ratings = {}
for entry in yelp_entries:
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    ratings.append((
        users[entry['user_id']],
        businesses[entry['business_id']],
        entry['stars']
    ))
matrix = numpy.tile(0, (len(users), len(businesses)))
for r in ratings:
    matrix[r[0]][r[1]] = r[2]
I modified @sirlark's code to match my needs, but for some reason I could not use append on ratings and iterate over it with for r in ratings, so I had to change the code like this:
users = {}
businesses = {}
ratings = {}
#Query the yelp_entries for all reviews matching business_id and store it in businesses first
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))
matrix = numpy.tile(0, (len(users), len(businesses)))
for ind in range(0, len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
Later I found out that, other than using the tile method, we can also use a SciPy coo_matrix, which is slightly faster than the method above, but we need to modify the code a bit:
from scipy.sparse import coo_matrix

users = {}
businesses = {}
row = []
col = []
data = []
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))
matrix = coo_matrix((data, (row, col))).toarray()
Note: later I found out that the reason I couldn't .append() or .add() to the ratings variable is that
ratings = {}
counts as the dict data type; to declare a set data type you should use this instead:
ratings = set()
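For completeness, here is the coo_matrix version as a self-contained run on a few made-up entries (the names mirror the snippet above; `setdefault` is an equivalent shorthand for the `if ... not in` checks):

```python
from scipy.sparse import coo_matrix

# Fake yelp-style entries standing in for the MongoDB cursor
yelp_entries = [
    {"user_id": "u1", "business_id": "b1", "stars": "4"},
    {"user_id": "u2", "business_id": "b1", "stars": "5"},
    {"user_id": "u2", "business_id": "b2", "stars": "2"},
]

users, businesses = {}, {}
row, col, data = [], [], []
for entry in yelp_entries:
    # assign each previously unseen id the next dense index
    users.setdefault(entry["user_id"], len(users))
    businesses.setdefault(entry["business_id"], len(businesses))
    row.append(users[entry["user_id"]])
    col.append(businesses[entry["business_id"]])
    data.append(int(entry["stars"]))

# duplicate (row, col) pairs would be summed by coo_matrix; the dense
# result has one row per user and one column per business
matrix = coo_matrix((data, (row, col))).toarray()
assert matrix.tolist() == [[4, 0], [5, 2]]
```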
Hopefully this result set is explanatory enough:
title text total_score already_voted
------------- ------------ ----------- -------------
BP Oil spi... Recently i... 5 0
J-Lo back ... Celebrity ... 7 1
Don't Stop... If there w... 9 0
Australian... The electi... 2 1
My models file describes article (author, text, title) and vote (caster, date, score). I can get the first three columns just fine with the following:
articles = Article.objects.all().annotate(total_score=Sum('vote__score'))
but calculating the 4th column, which is a boolean value describing whether the currently logged-in user has cast any of the votes counted in column 3, is a bit beyond me at the moment! Hopefully there's something that doesn't require raw SQL for this one.
Cheers,
Dave
--Trindaz on Fedang #django
I cannot think of a way to include the boolean condition. Perhaps others can answer that better.
How about thinking a bit differently? If you don't mind executing two queries you can filter your articles based on whether the currently logged in user has voted on them or not. Something like this:
all_articles = Article.objects.all()
articles_user_has_voted_on = all_articles.filter(vote__caster=request.user).annotate(total_score=Sum('vote__score'))
other_articles = all_articles.exclude(vote__caster=request.user).annotate(total_score=Sum('vote__score'))
Update
After some experiments I was able to figure out how to add a boolean condition for a column in the same model (Article in this case) but not for a column in another table (Vote.caster).
If Article had a caster column:
Article.objects.all().extra(select = {'already_voted': "caster_id = %s" % request.user.id})
In the present state this can be applied for the Vote model:
Vote.objects.all().extra(select = {'already_voted': "caster_id = %s" % request.user.id})