Problem generating a list with a numeric qualifier - python

I am working on a course with low code requirements, and have one step where I am stuck.
I have this code that creates a list of restaurants and the number of reviews each has:
# Filter the rated restaurants
df_rated = df[df['rating'] != 'Not given'].copy()
df_rated['rating'] = df_rated['rating'].astype('int')
df_rating_count = df_rated.groupby(['restaurant_name'])['rating'].count().sort_values(ascending = False).reset_index()
df_rating_count.head()
From there I am supposed to create a list limited to those above 50 reviews, starting from this base:
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count['______________']['restaurant_name']
# Filter to get the data of restaurants that have rating count more than 50
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
# Group the restaurant names with their ratings and find the mean rating of each restaurant
df_mean_4.groupby(['_______'])['_______'].mean().sort_values(ascending = False).reset_index().dropna() ## Complete the code to find the mean rating
Where I am stuck is on the first step.
rest_names = df_rating_count['______________']['restaurant_name']
I am pretty confident in the other 2 steps.
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
df_mean_4.groupby(['restaurant_name'])['rating'].mean().sort_values(ascending = False).reset_index().dropna()
I have frankly tried so many different things I don't even know where to start.
Does anyone have any hints to at least point me in the right direction?

You can filter rows by putting a boolean condition inside [].
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count[df_rating_count['rating'] > 50]['restaurant_name']
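To see what the filter does step by step, here is a minimal sketch on a toy frame. The restaurant names and counts are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Toy stand-in for df_rating_count (made-up names and counts).
df_rating_count = pd.DataFrame({
    "restaurant_name": ["A", "B", "C", "D"],
    "rating": [120, 51, 50, 7],
})

# Step 1: a boolean Series, True where the review count exceeds 50.
mask = df_rating_count["rating"] > 50

# Step 2: index the frame with the mask, then pick the name column.
rest_names = df_rating_count[mask]["restaurant_name"]

print(rest_names.tolist())
```

The comparison is strict, so a restaurant with exactly 50 reviews is left out. df_rating_count.loc[mask, "restaurant_name"] should give the same result in a single indexing step.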

# Function to determine the revenue
def compute_rev(x):
    if x > 20:
        return x*0.25
    elif x > 5:
        return x*0.15
    else:
        return x*0
## Write the appropriate column name to compute the revenue
df['Revenue'] = df['________'].apply(compute_rev)
df.head()

Related

Looking for efficient way to build matrix from yelp review dataset in python

I'm looking for an efficient way to build a rating matrix for a recommendation system in Python.
The matrix should look like this:
4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|
Specifically, the columns are business_id and the rows are user_id
      |bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|
Currently I'm using this Yelp review data set stored in MongoDB:
_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"
My approach is to build lists of unique business_id and user_id values from the review table, then query the review table again for each pair.
Because of this brute-force approach, it takes a long time to build even a small matrix like the one shown earlier.
Here's a snippet of my code:
def makeBisnisArray(cityNameParam):
    arrayBisnis = []
    # Append business ids filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        # If the business id is not in the array, append it
        if bisnis["_id"] not in arrayBisnis:
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis
def makeUserArray(bisnisName):
    global arrayUser
    # Find reviews filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        # If the user id is not already in the array, append it
        if review['user_id'] not in arrayUser:
            arrayUser.append(review['user_id'])
def writeRatingMatrix(arrayBisnis, arrayUser):
    with open("file.txt", "w") as f:
        for user in arrayUser:
            for bisnis in arrayBisnis:
                # Find one instance in the database by business_id and user_id
                x = colReview.find_one({"business_id": bisnis, "user_id": user})
                # If there is none, write the rating as 0
                if x is None:
                    f.write('0|')
                # If found, write the star value
                else:
                    f.write(str(x['stars']) + "|")
            print()
            f.write('\n')
def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser)
arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)
Can anyone suggest a more efficient way to build the rating matrix?
There are several general approaches you can take to speed this up.
Use sets or dictionaries to establish the unique businesses and users; set and dict lookups are much faster than list searches.
Process the Yelp data one entry at a time, in a single pass.
Use something like numpy or pandas to build your matrix.
Something like this:
import numpy
users = {}
businesses = {}
ratings = []  # a list, so append() works ({} would create a dict)
for entry in yelp_entries:
    # Give each new user and business the next free row/column index
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    ratings.append((
        users[entry['user_id']],
        businesses[entry['business_id']],
        entry['stars']
    ))
matrix = numpy.tile(0, (len(users), len(businesses)))
for r in ratings:
    matrix[r[0]][r[1]] = r[2]
I modified @sirlark's code to match my needs. With ratings declared as {}, I could not call append on it or iterate over it with for r in ratings, so I changed the code like this:
import numpy
users = {}
businesses = {}
ratings = {}
#Query the yelp_entries for all reviews matching business_id and store it in businesses first
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    # ratings is a dict here, keyed by insertion order
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))
matrix = numpy.tile(0, (len(users), len(businesses)))
for ind in range(0, len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
Later I found out that, instead of the tile method, we can also use SciPy's coo_matrix, which is slightly faster than the method above, but the code needs a small modification:
from scipy.sparse import coo_matrix
users = {}
businesses = {}
row = []
col = []
data = []
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    # One (row, col, value) triple per review
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))
matrix = coo_matrix((data, (row, col))).toarray()
Note: I later found out why I couldn't call .append() or .add() on the ratings variable:
ratings = {}
creates a dict. To declare a set (which supports .add()), use this instead:
ratings = set()
For .append(), declare a list with ratings = [].
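For reference, the single-pass idea from the answers above can be sketched with the standard library only. This swaps numpy and SciPy for plain nested lists so it runs anywhere. The review records are made up, and build_rating_matrix is a hypothetical helper name, not something from the thread.

```python
# Hypothetical review records; only the three fields used here are included.
reviews = [
    {"user_id": "u1", "business_id": "b1", "stars": 4},
    {"user_id": "u2", "business_id": "b1", "stars": 5},
    {"user_id": "u2", "business_id": "b2", "stars": 2},
    {"user_id": "u3", "business_id": "b3", "stars": 3},
]

def build_rating_matrix(entries):
    """Return (matrix, user_index, business_index) from one pass over entries."""
    user_index = {}
    business_index = {}
    ratings = []
    for entry in entries:
        # setdefault assigns the next free index the first time an id is seen.
        row = user_index.setdefault(entry["user_id"], len(user_index))
        col = business_index.setdefault(entry["business_id"], len(business_index))
        ratings.append((row, col, int(entry["stars"])))
    # Rows are users, columns are businesses; unrated cells stay 0.
    matrix = [[0] * len(business_index) for _ in user_index]
    for row, col, stars in ratings:
        matrix[row][col] = stars
    return matrix, user_index, business_index

matrix, user_index, business_index = build_rating_matrix(reviews)
for line in matrix:
    print("|".join(str(v) for v in line) + "|")
```

With MongoDB, the entries would presumably come from a single colReview.find(...) cursor instead of the one find_one call per user/business pair, which is where most of the time in the original code goes.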

How to iterate over a data frame

I have a dataset of users, books and ratings. I want to find the users who rated a particular book highly, and then find which other books those users liked too.
My data looks like:
df.sample(5)
        User-ID        ISBN  Book-Rating
49064    102967  0449244741            8
60600    251150  0452264464            9
376698    52853  0373710720            7
454056   224764  0590416413            7
54148     25409  0312421273            9
I did so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line fails with:
KeyError: 'User-ID'
I want to obtain the users who rated LOTR above 7, and then use the matrix to find the other books those users liked too.
Help would be appreciated. Thanks.
In your like_lotr dataframe 'User-ID' is the name of the index, you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError. It is not a column.
Moreover, ix is deprecated; use loc instead. Also make sure the key matches the dtype of the ISBN index: drop the quotes if ISBN was read in as integers, which is what happened when I tested with your sample.
Try like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the ids you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines should then look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index moves the index back into the columns.
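For the second half of the question (which other books those users liked), one possible sketch skips the pivot table and filters the long-format frame directly. The ratings below are made up, and the ISBNs "A", "B" and "C" are placeholders. Treating ISBN as a string and reusing 7 as the "liked" threshold are assumptions taken from the question.

```python
import pandas as pd

# Made-up ratings in the same three-column layout as the question.
df = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3, 3],
    "ISBN": ["0345339703", "A", "0345339703", "B", "0345339703", "C"],
    "Book-Rating": [9, 8, 10, 5, 6, 9],
})

target = "0345339703"
threshold = 7

# Users who rated the target book above the threshold.
fans = df.loc[(df["ISBN"] == target) & (df["Book-Rating"] > threshold), "User-ID"]

# Other books those same users also rated above the threshold.
liked = df[
    df["User-ID"].isin(fans)
    & (df["ISBN"] != target)
    & (df["Book-Rating"] > threshold)
]

print(sorted(fans))
print(liked.to_string(index=False))
```

Working on the long format avoids building the large ISBN by User-ID matrix, which may matter if the full dataset is big.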

Looping with list comprehensions

To simplify what I am trying to do:
I have 50 employees, each with a 40 task capacity.
I have a dataframe that I am reading in from a SQL table that I want to filter down to tasks with a score equal to 10 and then assign them to each employee so they have a full "basket" or workload. I want to assign one task to each employee and then iterate until finished.
My final output would be a list of lists, where the position of each inner list denotes the employee number and its contents are the tasks assigned to that employee.
final_basket = [[task1, task2,...] , [task8, task11], ...[task45,task4]]
Each of the lists within the final basket corresponds to an employee, for example:
final_basket[0] = [task1, task2,...] would be all the tasks for the first employee.
I can assign one task to each employee fine, but I get stuck when looping over all the employees again to fill their capacity.
def basket_builder(i):
    agent_basket = [[ ] for basket in range(40)] #define empty basket for all 40 agents
    score_10 = base_data_1_mo[base_data_1_mo.case_score == 10] #filter data to score 10 only
    score_10 = score_10[['investigation_id']] #select only investigation id df
    score_10 = score_10.sort_index() ##sort by index asc
    for i in range(40):
        investigation_id = score_10.iloc[0]['investigation_id']
        agent_basket[i].append(investigation_id)
        index_drop_v2 = score_10[score_10.investigation_id == investigation_id].index[0]
        score_10 = score_10.drop([index_drop_v2])
    return final_basket
for i in range(40):
    final_basket = []
    final_basket = [[basket_builder(i) for agent in agent_basket[i]]
final_basket
Since I modified this to use a function and tried to loop over it, I now have trouble even printing final_basket.
Could you do it with something like this:
employee_task = {}
for n, task in enumerate(tasks):
    # Deal tasks out in turn: employee 0, 1, ..., 49, then back to 0
    employee_number = n % 50
    if employee_number not in employee_task:
        employee_task[employee_number] = []
    employee_task[employee_number].append(task)
and check afterwards that no employee has more than 40 tasks.
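A variation on the same round-robin idea enforces the capacity while assigning rather than checking afterwards, and returns the list-of-lists shape asked for in the question. This is only a sketch: assign_round_robin and its parameters are hypothetical names, and the task labels are made up.

```python
def assign_round_robin(tasks, n_employees=50, capacity=40):
    """Deal tasks out one per employee in turn, up to `capacity` each.

    Returns (baskets, leftover): baskets[i] holds employee i's tasks and
    leftover holds any tasks that did not fit.
    """
    baskets = [[] for _ in range(n_employees)]
    leftover = []
    for n, task in enumerate(tasks):
        basket = baskets[n % n_employees]
        if len(basket) < capacity:
            basket.append(task)
        else:
            leftover.append(task)
    return baskets, leftover

# Small demo: 7 tasks, 3 employees, capacity 2 each.
tasks = ["task%d" % i for i in range(1, 8)]
final_basket, leftover = assign_round_robin(tasks, n_employees=3, capacity=2)
print(final_basket)
print(leftover)
```

In the question's setting, tasks would presumably be something like the investigation_id values of the rows with case_score equal to 10, converted to a list.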

I want to create a list from values in a dataset based on a specific condition

I am working with a Dataset that contains the information of every March Madness game since 1985. I want to know which teams have won it all and how many times each.
I masked the main dataset and created a new one containing only information about the championship game. Now I am trying to write a loop that compares the scores of the two teams that played in the championship game, detects the winner and adds that team to a list. This is what the dataset looks like: https://imgur.com/tXhPYSm
tourney = pd.read_csv('ncaa.csv')
champions = tourney.loc[tourney['Region Name'] == "Championship", ['Year','Seed','Score','Team','Team.1','Score.1','Seed.1']]
list_champs = []
for i in champions:
    if champions['Score'] > champions['Score.1']:
        list_champs.append(i['Team'])
    else:
        list_champs.append(i['Team.1'])
Why do you need to loop through the DataFrame?
Basic filtering should work well. Something like this:
champs1 = champions.loc[champions['Score'] > champions['Score.1'], 'Team']
champs2 = champions.loc[champions['Score'] < champions['Score.1'], 'Team.1']
list_champs = list(champs1) + list(champs2)
A minimalist change (not the most efficient) to get your code working:
tourney = pd.read_csv('ncaa.csv')
champions = tourney.loc[tourney['Region Name'] == "Championship", ['Year','Seed','Score','Team','Team.1','Score.1','Seed.1']]
list_champs = []
for _, row in champions.iterrows():  # iterrows yields (index, row) pairs
    if row['Score'] > row['Score.1']:
        list_champs.append(row['Team'])
    else:
        list_champs.append(row['Team.1'])
Otherwise, you could simply do:
champions.apply(lambda row: row['Team'] if row['Score'] > row['Score.1'] else row['Team.1'], axis=1).values
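Since the stated goal is to know how many times each team has won, a loop-free sketch could pick the winner per row with Series.where and then count with value_counts. The years, teams and scores below are made up; only the column names follow the question.

```python
import pandas as pd

# Made-up championship games using the question's column names.
champions = pd.DataFrame({
    "Year": [2001, 2002, 2003],
    "Team": ["X", "Y", "X"],
    "Score": [80, 60, 75],
    "Team.1": ["Y", "Z", "Z"],
    "Score.1": [70, 65, 70],
})

# True where the team in the 'Team' column won the game.
team_one_won = champions["Score"] > champions["Score.1"]

# Keep 'Team' where it won, otherwise take 'Team.1'.
winners = champions["Team"].where(team_one_won, champions["Team.1"])

# Number of titles per team.
title_counts = winners.value_counts()

print(winners.tolist())
print(title_counts.to_dict())
```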

How to compare these data sets from a csv? Python 2.7

I have a project where I'm trying to create a program that will take a csv data set from www.transtats.gov which is a data set for airline flights in the US. My goal is to find the flight from one airport to another that had the worst delays overall, meaning it is the "worst flight". So far I have this:
import csv
with open('826766072_T_ONTIME.csv') as csv_infile: #import and open CSV
    reader = csv.DictReader(csv_infile)
    total_delay = 0
    flight_count = 0
    flight_numbers = []
    delay_totals = []
    dest_list = [] #create empty list of destinations
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            if row['FL_NUM'] not in flight_numbers:
                flight_numbers.append(row['FL_NUM'])
            if row['DEST'] not in dest_list: #if the dest is not already in the list
                dest_list.append(row['DEST']) #append the dest to dest_list
    for number in flight_numbers:
        for row in reader:
            if row['ORIGIN'] == 'BOS': #for flights leaving BOS
                if row['FL_NUM'] == number:
                    if float(row['CANCELLED']) < 1: #if the flight is not cancelled
                        if float(row['DEP_DELAY']) >= 0: #and the delay is greater or equal to 0 (some flights had negative delay?)
                            total_delay += float(row['DEP_DELAY']) #add time of delay to total delay
                            flight_count += 1 #add the flight to total flight count
    for row in reader:
        for number in flight_numbers:
            delay_totals.append(sum(row['DEP_DELAY']))
I was thinking that I could create a list of flight numbers and a list of the total delays from those flight numbers and compare the two and see which flight had the highest delay total. What is the best way to go about comparing the two lists?
I'm not sure if I understand you correctly, but I think you should use dict for this purpose, where key is a 'FL_NUM' and value is total delay.
In general I want to eliminate loops in Python code. For files that aren't massive I'll typically read through a data file once and build up some dicts that I can analyze at the end. The below code isn't tested because I don't have the original data but follows the general pattern I would use.
Since a flight is identified by the origin, destination, and flight number I would capture them as a tuple and use that as the key in my dict.
from collections import defaultdict
flight_delays = defaultdict(list) # look this up if you aren't familiar
for row in reader:
    if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
        if float(row['CANCELLED']) < 1: #skip cancelled flights
            flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
            flight_delays[flight].append(float(row['DEP_DELAY']))
# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
    average_delay = sum(delays) / len(delays)
    if average_delay > worst_delay:
        worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
        worst_delay = average_delay
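Here is a self-contained version of that pattern that can be run without the original file. It is a sketch under a few assumptions: it targets Python 3 rather than 2.7, the rows in SAMPLE are made up (only the column names come from the question), and worst_flight is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using the column names from the question.
SAMPLE = """ORIGIN,DEST,FL_NUM,CANCELLED,DEP_DELAY
BOS,JFK,100,0.00,30.00
BOS,JFK,100,0.00,10.00
BOS,ORD,200,0.00,5.00
BOS,ORD,200,1.00,
LAX,BOS,300,0.00,90.00
"""

def worst_flight(csv_file, origin="BOS"):
    """Return ((origin, dest, flight number), average delay) for the worst flight."""
    delays = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if row["ORIGIN"] != origin:
            continue
        # Skip cancelled flights and rows with no recorded delay.
        if float(row["CANCELLED"]) >= 1 or not row["DEP_DELAY"]:
            continue
        key = (row["ORIGIN"], row["DEST"], row["FL_NUM"])
        delays[key].append(float(row["DEP_DELAY"]))
    averages = {k: sum(v) / len(v) for k, v in delays.items()}
    # max() raises ValueError if no flight matched the origin.
    worst = max(averages, key=averages.get)
    return worst, averages[worst]

flight, avg = worst_flight(io.StringIO(SAMPLE))
print(flight, avg)
```

With the real data, the open file object from the question's with block could be passed in place of io.StringIO(SAMPLE).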
A very simple solution would be to add two new variables:
max_delay = 0
delay_flight = 0
# Replace the check `if float(row['DEP_DELAY']) >= 0:` with:
if float(row['DEP_DELAY']) > max_delay:
    max_delay = float(row['DEP_DELAY'])
    delay_flight = row['FL_NUM']  # save the flight number for reference
