Pandas - Dense crosstab with n most frequent from column1 and column2 - python

In setup for a collaborative filtering model on the MovieLens 100k dataset in a Jupyter notebook, I'd like to show a dense crosstab of users vs. movies. I figure the best way to do this is to crosstab the n most frequent users against the m most frequent movies.
If you'd like to run it in a notebook, you should be able to copy/paste this after installing the fastai2 dependencies (its imports bring in pandas, among other libraries):
from fastai2.collab import *
from fastai2.tabular.all import *
path = untar_data(URLs.ML_100k)
# load the ratings from csv
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
# show a sample of the format
ratings.head(10)
# slice the most frequent n=20 users and movies
most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])
denser_ratings = ratings[ratings.user.isin(most_frequent_users)]
denser_movies = ratings[ratings.movie.isin(most_rated_movies)]
# crosstab the most frequent users and movies, showing the ratings
pd.crosstab(denser_ratings.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')
Results: (screenshot of the resulting sparse crosstab omitted)
Expected: (screenshot of the desired, denser crosstab omitted)
The desired output is much denser than what I've produced. My example seems to be a little better than random, but not by much. I have two hypotheses for why it isn't as dense as I want:
The most frequent users might not always rate the most rated movies.
My code has a bug which makes it index into the dataframe incorrectly for what I think I'm doing
Please let me know if you see an error in how I'm selecting the most frequent users and movies, or grabbing the matches with isin.
If that is correct (or really, regardless), I'd like to see how to build a denser set of users and movies to crosstab. The next approach I've thought of is to grab the most frequent movies, and then select the most frequent users from that subset rather than from the global dataset. But I'm unsure how I'd do that: whether to search for the most frequent users across all of the top m movies, or to somehow more generally find the set of n*m most-linked users and movies.
I will post my code if I solve it before better answers arrive.

My code has a bug which makes it index into the dataframe incorrectly for what I think I'm doing
True, there is a bug.
most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])
actually grabs the value counts themselves. So if users 1, 2, and 3 made 100 reviews each, the code above would return [100, 100, 100] when we really wanted the ids [1, 2, 3]. To get the ids of the most frequent entries instead of the tallies, add .index to value_counts:
most_frequent_users = list(ratings.user.value_counts().index[:20])
most_rated_movies = list(ratings.movie.value_counts().index[:20])
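As a quick sanity check (toy data, not from MovieLens), value_counts() holds the tallies while its .index holds the ids we actually want:
s = pd.Series([1, 1, 1, 2, 2, 3])
s.value_counts()        # tallies, sorted descending: [3, 2, 1]
s.value_counts().index  # the ids themselves: [1, 2, 3]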
This alone improves the density to almost what the final result below shows. What I was doing before was effectively a random sample (I was erroneously using the rating tallies as if they were user and movie ids).
Furthermore, the approach I mentioned at the end of the post is a more robust general solution when the goal is the densest possible crosstab: find the most frequent X, and within that specific subset, find the most frequent Y. This will work well even on sparse datasets.
n_users = 10
n_movies = 20
# list the ids of the most frequent users (those who rated the most movies)
most_frequent_users = list(ratings.user.value_counts().index[:n_users])
# grab all the ratings made by these most frequent users
denser_users = ratings[ratings.user.isin(most_frequent_users)]
# list the ids of the most frequent movies within this group of users
dense_users_most_rated = list(denser_users.movie.value_counts().index[:n_movies])
# grab all the most frequent movies rated by the most frequent users
denser_movies = ratings[ratings.movie.isin(dense_users_most_rated)]
# plot the crosstab
pd.crosstab(denser_users.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')
This is exactly what I was looking for.
The only remaining questions are: how standard is this approach, and why are some values floats?

Related

Compare columns (per row) of two DataFrames in Python

First of all, I'm quite new to programming overall (< 2 months), so I'm sorry if this is a 'simple, no need to ask for help, try it yourself until you get it done' kind of problem.
I have two DataFrames with partially overlapping content: a general overview of mobile numbers including their cost centers in the company, and monthly invoices with the affected mobile numbers and their invoice amounts.
I'd like to compare the 'mobile-numbers' column of the monthly invoices DF to the 'mobile-numbers' column of the general overview DF and, where they match, assign the respective cost center to the mobile number in the monthly invoices DF.
I'd love to share my code with you, but unfortunately I have absolutely zero clue how to solve that problem in any way.
Thanks
Edit: I'm from Germany and tried my best to explain the problem in English. If there is anything I messed up (so you don't get it), just tell me :)
Example of desired result
This program meets your needs. In the second dataframe I put the value '40' to demonstrate that cells which are already filled will not be zeroed; the replacement only occurs when there is a matching value between the dataframes. If you want a better explanation of the program, comment below, and don't forget to vote and mark as solved. I also put in some prints for a better view, but in general they are not necessary.
import pandas as pd
general_df = pd.DataFrame({"mobile_number": [1234, 3456, 6545, 4534, 9874],
                           "cost_center": ['23F', '67F', '32W', '42W', '98W']})
invoice_df = pd.DataFrame({"mobile_number": [4534, 5567, 1234, 4871, 1298],
                           "invoice_amount": ['19,99E', '19,99E', '19,99E', '19,99E', '19,99E'],
                           "cost_center": ['', '', '', '', '40']})
print(f"""GENERAL OVERVIEW DF
{general_df}
________________________________________
INVOICE DF
{invoice_df}
_________________________________________
INVOICE RESULT
""")
def func(line):
    # look up the row in general_df that matches this invoice's mobile number
    match = general_df.loc[general_df['mobile_number'] == line['mobile_number']]
    if match.empty:
        # no match: keep whatever cost_center the invoice row already had
        return line['cost_center']
    # match found: return the cost_center from the general overview
    return match['cost_center'].values[0]

invoice_df['cost_center'] = invoice_df.apply(func, axis=1)
print(invoice_df)
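For reference, the same fill can be done without apply; this is a sketch (not part of the answer above) that builds a mobile_number -> cost_center lookup from general_df and maps it onto invoice_df, keeping existing values where no match is found:
lookup = general_df.set_index('mobile_number')['cost_center']
mapped = invoice_df['mobile_number'].map(lookup)
# fall back to the existing cost_center (e.g. the '40') where the map found no match
invoice_df['cost_center'] = mapped.fillna(invoice_df['cost_center'])
print(invoice_df)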

Calculating total returns from a DataFrame

This is my first post here; I hope you will understand what troubles me.
So, I have a DataFrame that contains prices for some 1,200 companies for each day, beginning in 2010, and I want to calculate the total return for each one. The DataFrame is indexed by date. I could use the df.iloc[-1] / df.iloc[0] method, but some companies started trading publicly at a later date, so I can't get results for those: they end up divided by a NaN value. I've tried creating a list which contains the first valid index for every stock (column), but when I use it to calculate the total returns, I get the wrong result!
I've tried a classic for loop:
for l in list:
    returns = df.iloc[-1] / df.iloc[l]
For instance, the last price of one stock was around $16 and the first price I have is $1.5, which would be over a 10x return, yet my result is only about 1.1! I would also like to add that the aforementioned list includes the first valid index for the Date column as well, and it is in the first position.
Can somebody please help me? Thank you very much
There are many ways you can go about this. But I do recommend you brush up on your Python skills with basic examples before you get into more complicated ones.
If you want to do it your way, you can do it like this:
returns = {}
for stock_name in df.columns:
    returns[stock_name] = df[stock_name].dropna().iloc[-1] / df[stock_name].dropna().iloc[0]
A more pythonic way would be to do it in a vectorized form, like this:
returns = ((1 + df.ffill().pct_change())
           .cumprod()
           .iloc[-1])
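To illustrate with made-up prices (a sketch, not the asker's data), both approaches give the same totals even when a stock has leading NaNs:
import pandas as pd

# 'BBB' starts trading later, so its first price is NaN
df = pd.DataFrame({'AAA': [10.0, 11.0, 12.0, 16.0],
                   'BBB': [None, 1.5, 2.0, 16.5]},
                  index=pd.date_range('2010-01-01', periods=4))

# loop version: first valid price per column, not one fixed row for every column
returns = {col: df[col].dropna().iloc[-1] / df[col].dropna().iloc[0] for col in df.columns}
print(returns)  # {'AAA': 1.6, 'BBB': 11.0}

# vectorized version produces the same totals
print((1 + df.ffill().pct_change()).cumprod().iloc[-1])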

How to remove rows from a categorical variable whose value counts do not satisfy a condition?

I am new to ML and Data Science (I recently graduated from a Master's in Business Analytics) and am learning as much as I can by myself while looking for positions in Data Science / Business Analytics.
I am working on a practice dataset with a goal of predicting which customers are likely to miss their scheduled appointment. One of the columns in my dataset is "Neighbourhood", which contains names of over 30 different neighborhoods. My dataset has 10,000 observations, and some neighborhood names appear fewer than 50 times. I think that neighborhoods appearing fewer than 50 times in the dataset are too rare to be analyzed properly by machine learning models. Therefore, I want to remove the rows whose value in the "Neighbourhood" column appears fewer than 50 times in that column.
I have been trying to write code for this for several hours, but struggle to get it right. So far, I have gotten to the version below:
my_df = my_df.drop(my_df["Neighbourhood"].value_counts() < 50, axis = 0)
I have also tried other versions of code to get rid of the rows in that categorical column, but I keep getting a similar error:
KeyError: '[False False ... True True] not found in axis'
I appreciate your help in advance, and thank you for sharing your knowledge and insights with me!
Try the code below - it uses the .loc operator to select rows that satisfy a condition (here, rows whose neighborhood has a high enough count):
counts = my_df['Neighbourhood'].value_counts()
new_df = my_df.loc[my_df['Neighbourhood'].isin(counts.index[counts >= 50])]
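A groupby/transform version (same assumption about keeping counts of at least 50) does the counting in place and avoids the separate value_counts lookup:
sizes = my_df.groupby('Neighbourhood')['Neighbourhood'].transform('size')
new_df = my_df[sizes >= 50]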

More efficient way to classify

normal = []
nine_plus = []
tw_plus = []
for i in df['SubjectID'].unique():
    x = df.loc[df['SubjectID'] == i]
    if len(x['Year Term ID'].unique()) <= 8:
        normal.append(i)
    elif len(x['Year Term ID'].unique()) >= 9 and len(x['Year Term ID'].unique()) < 13:
        nine_plus.append(i)
    elif len(x['Year Term ID'].unique()) >= 13:
        tw_plus.append(i)
Hello, I am dealing with a dataset that has 10 million rows. The dataset is about student records, and I am trying to classify the students into three groups according to how many semesters they have attended. I feel like I am using a very crude method right now, and there could be a more efficient way of categorizing. Any suggestions?
You go through a lot of repeated iterations that likely make this slower than even a simple Python list. Use the data frame's organization in your favor:
Group your rows by SubjectID, then Year Term ID.
Extract the count of rows in each sub-group -- which you currently compute as len(x['Year Term ID'].unique()).
Make a function, lambda, or extra column that represents the classification; calling that count load, the expression is:
0 if load <= 8 else 1 if load <= 12 else 3
Use that expression to re-group your students into the three desired classifications.
Do not iterate through the rows of the data frame: this is a "code smell" that you're missing a vectorized capability.
Does that get you moving?
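A minimal sketch of those steps, assuming the column names from the question ('SubjectID', 'Year Term ID'):
# count distinct terms per student in a single grouped pass
term_counts = df.groupby('SubjectID')['Year Term ID'].nunique()

# 0 = normal (<= 8), 1 = nine_plus (9-12), 3 = tw_plus (13+), as in the expression above
labels = term_counts.apply(lambda load: 0 if load <= 8 else 1 if load <= 12 else 3)

normal = labels[labels == 0].index.tolist()
nine_plus = labels[labels == 1].index.tolist()
tw_plus = labels[labels == 3].index.tolist()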

What is an efficient way of creating a "network" from identifier data in pandas

I am a newbie in Python, and after browsing several answers to various questions concerning loops in python/pandas, I remain confused about how to solve my problem concerning water-management data. I am trying to categorise and aggregate data based on its position in a sequence of connected nodes. The "network" is formed by each node containing the ID of the node that is downstream of it.
The original data contains roughly 53,000 items, which I converted to a pandas dataframe that looks something like this:
subwatershedsID = pd.DataFrame({
    'ID': ['649208-127140', '649252-127305', '650556-126105', '687315-128898'],
    'ID_DOWN': ['582500-113890', '649208-127140', '649252-127305', '574050-114780'],
    'OUTLET_ID': ['582500-113890', '582500-113890', '582500-113890', '574050-114780'],
    'CATCH_ID': [217, 217, 217, 213]})
My naive approach to deal with the data closest to the coast illustrates what I am trying to achieve.
sbwtrshdNextToStretch = subwatershedsID.loc[subwatershedsID['ID_DOWN'] == subwatershedsID['OUTLET_ID']]
sbwtrshdNextToStretchID = sbwtrshdNextToStretch[['ID']]
sbwtrshdStepFurther = pd.merge(sbwtrshdNextToStretchID, subwatershedsID, how='inner', left_on='ID', right_on='ID_DOWN')
sbwtrshdStepFurther.rename(columns={'ID_y': 'ID'}, inplace=True)
sbwtrshdStepFurtherID = sbwtrshdStepFurther[['ID']]
sbwtrshdTwoStepsFurther = pd.merge(sbwtrshdStepFurtherID, subwatershedsID, how='inner', left_on='ID', right_on='ID_DOWN')
sbwtrshdTwoStepsFurther.rename(columns={'ID_y': 'ID'}, inplace=True)
sbwtrshdTwoStepsFurtherID = sbwtrshdTwoStepsFurther[['ID']]
subwatershedsAll = [sbwtrshdNextToStretchID, sbwtrshdStepFurtherID, sbwtrshdTwoStepsFurtherID]
subwatershedWithDistances = pd.concat(subwatershedsAll, keys=['d0', 'd1', 'd2'])
So this gives each node an identifier for how many nodes away it is from the first one. It feels like there should be a simpler way to achieve this, and obviously something that works better for the whole dataset, which can have a large number of consecutive connections. My thoughts keep returning to writing a loop within a loop, but all the advice seems to recommend avoiding them, which also discourages me from learning how to write the loop correctly. Furthermore, the comments on poor loop performance leave me with further doubts, since I am not sure how fast a solution over 53,000 rows would be. So what would be a good, Python-style solution?
If I understand correctly you have two stages:
Categorise each node based on its position in the network
Perform calculations on the data to work out things like volumes of water, number of nodes a certain distance from the outlet, etc.
If so...
1) Use NetworkX to perform the calculations on relative position in the network
NetworkX is a great network analysis library that comes with ready-made methods to achieve this kind of thing.
Here's an example using dummy data:
import networkx as nx

G = nx.Graph()
G.add_nodes_from([1,2,3,4])
G.add_edges_from([(1,2),(2,3),(3,4)])
# In this example, the shortest path is all the way down the stream
nx.shortest_path(G,1,4)
> [1,2,3,4]
len(nx.shortest_path(G,1,4))
> 4
# I've shortened the path by adding a new 'edge' (connection) between 1 and 4
G.add_edges_from([(1,2),(2,3),(3,4),(1,4)])
# Result is a much shorter path of only two nodes - the source and target
nx.shortest_path(G,1,4)
> [1,4]
len(nx.shortest_path(G,1,4))
> 2
2) Annotate the dataframe for later calculations
Once you have this data in a network format, you can iterate through the data and add that as metadata to the DataFrame.
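Roughly, that could look like the sketch below: it builds a directed graph straight from subwatershedsID (edges pointing downstream) and writes each node's distance to its outlet back into the dataframe. This is only an illustration of the idea, not something tested on the full 53,000-row dataset:
import networkx as nx

# one directed edge per row, from each subwatershed to the one directly below it
G = nx.DiGraph()
G.add_edges_from(zip(subwatershedsID['ID'], subwatershedsID['ID_DOWN']))

# distance from every node to its outlet: reverse the graph and walk outwards from each outlet
reversed_G = G.reverse()
distances = {}
for outlet in subwatershedsID['OUTLET_ID'].unique():
    distances.update(nx.single_source_shortest_path_length(reversed_G, outlet))

# subtract 1 so nodes draining directly into their outlet get 0, matching d0/d1/d2 above
subwatershedsID['steps_from_outlet'] = subwatershedsID['ID'].map(distances) - 1
print(subwatershedsID)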
