Mean center Pandas - python

I have a movie dataset containing ratings from users. For every user I need the mean of all the ratings from that user. I then need to subtract this mean from every single rating to get the mean-centered rating.
# mean rating per user; the result has userId as the index and a 'rating' column
mean_user_rating = df_ratings.groupby(['userId'])['rating'].mean().to_frame()
# rename() returns a new frame, so assign the result; rename the column to something descriptive
mean_user_rating = mean_user_rating.rename(columns={"rating": "mean_rating"})
# ratings sorted from highest to lowest (sort_values already returns a DataFrame)
film_user = df_ratings.sort_values('rating', ascending=False)
Example data:
But now I need to subtract the mean rating from each individual rating, and this needs to be done per userId.
The question and additional info:
Users have a bias: they do not review in the same way, so an absolute rating is less interesting. To make a better system we can look at the average rating of a user and how a given rating deviates from this average. This way you know whether a user finds a film relatively better or worse. In other words, we mean-center the rating: we take the mean as the midpoint and rewrite the rating as the deviation from this midpoint.
Once you know the deviation from the center point, you can use this to compare films. For example, you can take the average of the deviation of each rating of a film and use this as a new rating for a film.
Question 25: A better top 5
It is now up to you to draw up a new top 5, this time based not on the average rating but on the average mean-centered rating. Again, use df_ratings_filtered as the dataset here. Store the top 5 as a Series in a variable called top_5_mean_centered, with movieId as the index.
Note: this command is slightly longer. Think carefully in advance about which steps you have to go through.
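For what it's worth, a minimal sketch of one way to do this with groupby().transform(), assuming df_ratings_filtered has the columns userId, movieId and rating (the variable names other than top_5_mean_centered are my own):
# Subtract each user's mean rating from their individual ratings
user_mean = df_ratings_filtered.groupby('userId')['rating'].transform('mean')
df_ratings_filtered['rating_centered'] = df_ratings_filtered['rating'] - user_mean
# Average mean-centered rating per movie, top 5 as a Series indexed by movieId
top_5_mean_centered = (df_ratings_filtered
                       .groupby('movieId')['rating_centered']
                       .mean()
                       .nlargest(5))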

Related

Is the runtime of df.groupby().transform() linear in the number of groups in the groupby object?

BACKGROUND
I am calculating racial segregation statistics between and within firms using the Theil Index. The data structure is a multi-indexed pandas dataframe. The calculation involves a lot of df.groupby()['foo'].transform(), where the transformation is the entropy function from scipy.stats. I have to calculate entropy on smaller and smaller groups within this structure, which means calling entropy more and more times on the groupby objects. I get the impression that this is O(n), but I wonder whether there is an optimization that I am missing.
EXAMPLE
The key part of this dataframe comprises five variables: county, firm, race, occ, and size. The units of observation are counts: each row tells you the SIZE of the workforce of a given RACE in a given OCCupation in a FIRM in a specific COUNTY. Hence the multiindex:
df = df.set_index(['county', 'firm', 'occ', 'race']).sort_index()
The Theil Index is the size-weighted sum of sub-units' entropy deviations from the unit's entropy. To calculate segregation between counties, for example, you can do this:
from scipy.stats import entropy
from numpy import where
# Helper to calculate the actual components of the Theil statistic
def Hcmp(w_j, w, e_j, e):
    return where(e == 0, 0, (w_j / w) * ((e - e_j) / e))
df['size_county'] = df.groupby(['county', 'race'])['size'].transform('sum')
df['size_total'] = df['size'].sum()
# Create a dataframe with observations aggregated over county/race tuples
counties = df.groupby(['county', 'race'])[['size_county', 'size_total']].first()
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4) # <--
# The base for entropy is 4 because there are four recorded racial categories.
# Assume that counties['entropy_total'] has already been calculated.
counties['seg_cmpnt'] = Hcmp(counties['size_county'], counties['size_total'],
                             counties['entropy_county'], counties['entropy_total'])
county_segregation = counties['seg_cmpnt'].sum()
Focus on this line:
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4)
The starting dataframe has 3,130,416 rows. When grouped by county, though, the resulting groupby object has just 2,267 groups. This runs quickly enough. When I calculate segregation within counties and between firms, the corresponding line is this:
firms['entropy_firm'] = firms.groupby('firm')['size_firm'].transform(entropy, base=4)
Here, the groupby object has 86,956 groups (the number of firms in the data). This takes about 40 times as long as the previous calculation, which looks suspiciously like O(n) in the number of groups. And when I try to calculate segregation within firms, between occupations...
# Grouping by firm and occupation because occupations are not nested within firms
occs['entropy_occ'] = occs.groupby(['firm', 'occ'])['size_occ'].transform(entropy, base=4)
...There are 782,604 groups. Eagle-eyed viewers will notice that this is exactly 1/4th the size of the raw dataset, because I have one observation for each firm/race/occupation tuple, and four racial categories. It is also nine times the number of groups in the by-firm groupby object, because the data break employment out into nine occupational categories.
This calculation takes about nine times as long: four or five minutes. When the underlying research project involves 40-50 years of data, this part of the process can take three or four hours.
THE PROBLEM, RESTATED
I think the issue is that, even though scipy.stats.entropy() is being applied in a smart, vectorized way, the necessity of calculating it over a very large number of small groups--and thus calling it many, many times--is swamping the performance benefits of vectorized calculations.
I could pre-calculate the necessary logarithms that entropy requires, for example with numpy.log(). If I did that, though, I'd still have to group the data to first get each firm/occupation/race's share within the firm/occupation. I would also lose any advantage of readable code that looks similar at different levels of analysis.
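For concreteness, a rough sketch of that pre-calculation at the county level (assuming the counties frame built above, with a 'county' index level and a size_county column; this is my own illustration, not a claim that it keeps the code as readable): instead of calling entropy once per group, compute each row's share within its group with a single transform('sum') and then sum the -p*log(p) terms per group.
import numpy as np
import pandas as pd
# Share of each county/race cell within its county
group_sums = counties.groupby(level='county')['size_county'].transform('sum')
share = counties['size_county'] / group_sums
# Base-4 entropy terms, guarding against zero shares
plogp = np.where(share > 0, -share * np.log(share), 0.0) / np.log(4)
# Sum the terms within each county; equivalent to entropy(..., base=4) per group
counties['entropy_county'] = (pd.Series(plogp, index=counties.index)
                                .groupby(level='county').transform('sum'))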
Thus my question, stated as clearly as I can: is there a more computationally efficient way to call something like this entropy function, when calculating it over a very large number of relatively small groups in a large dataset?

groupby and pandas in python

I have a file (score.csv) like this:
I need to solve this in Python using pandas and groupby. I have one CSV file whose data I need to group according to the following points:
1. Group the data by series_id.
2. Find the highest score and the percentage.
3. Find the user_ids closest to 50% (the middle cohort), relative to point 2, for the first test_id.
4. Find the scores of these users for the rest of the test_ids.
5. Normalise those scores with the topper's score.
The idea is to find the performance of students in each test.
Structure:
Test Series (series_id)-> having multiple tests(test_id)->mapped with users(user_id)-> scores
1. For each series_id, find the first test (the lowest test_id for that series_id) and the users with scores between 40 and 60 in that first test only.
(The analysis will then be done on the users found in point 1 for the other tests. In other words, I have found users who score around 50 marks and now I will track their journey in the other tests.)
2. Pick the user_ids from point 1 and find their scores for the other tests as well. Along with this, find the highest score in each test to compute the ratio marks_obtained / highest_marks for that test. Basically, we want to normalise the scores with respect to the highest scorer in order to understand the journey of these users.
You can chain the groupby call with other aggregation methods:
df.groupby('series_id')['marks_gained'].max()
df.groupby('series_id')['marks_gained'].mean()
You can then find the median and define your 50% cohort by equal distance to it:
df.groupby('series_id')['marks_gained'].median()
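Putting those pieces together, here is a hedged sketch of the full flow (the names score.csv, series_id, test_id, user_id and marks_gained are taken from the question; adjust them to your actual columns):
import pandas as pd

df = pd.read_csv('score.csv')
# First test per series = the lowest test_id within that series
first_test = df.groupby('series_id')['test_id'].transform('min')
first_df = df[df['test_id'] == first_test]
# Middle cohort: users scoring between 40 and 60 on the first test
cohort = first_df.loc[first_df['marks_gained'].between(40, 60), 'user_id'].unique()
# Their scores on all tests, normalised by each test's top score
cohort_df = df[df['user_id'].isin(cohort)].copy()
top_per_test = df.groupby('test_id')['marks_gained'].max()
cohort_df['normalised'] = cohort_df['marks_gained'] / cohort_df['test_id'].map(top_per_test)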
It's usually better to provide data in a format that is easily reproducible, and to be completely explicit about each of your requests (for example, one has to guess what you mean by percentage - the percentiles of marks?).

Pandas - Dense crosstab with n most frequent from column1 and column2

In setup for a collaborative filtering model on the MovieLens100k dataset in a Jupyter notebook, I'd like to show a dense crosstab of users vs movies. I figure the best way to do this is to show the most frequent n user against the most frequent m movie.
If you'd like to run it in a notebook, you should be able to copy/paste this after installing the fastai2 dependencies (it exports pandas among other internal libraries)
from fastai2.collab import *
from fastai2.tabular.all import *
path = untar_data(URLs.ML_100k)
# load the ratings from csv
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
names=['user','movie','rating','timestamp'])
# show a sample of the format
ratings.head(10)
# slice the most frequent n=20 users and movies
most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])
denser_ratings = ratings[ratings.user.isin(most_frequent_users)]
denser_movies = ratings[ratings.movie.isin(most_rated_movies)]
# crosstab the most frequent users and movies, showing the ratings
pd.crosstab(denser_ratings.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')
Results:
Expected:
The desired output is much denser than what I've done. My example seems to be a little better than random, but not by much. I have two hypotheses as to why it's not as dense as I want:
The most frequent users might not always rate the most rated movies.
My code has a bug which is making it index into the dataframe incorrectly for what I think I'm doing
Please let me know if you see an error in how I'm selecting the most frequent users and movies, or grabbing the matches with isin.
If that is correct (or really, regardless) - I'd like to see how I would make a denser set of users and movies to crosstab. The next approach I've thought of is to grab the most frequent movies, and select the most frequent users from that dataframe instead of the global dataset. But I'm unsure how I'd do that- between searching for the most frequent user across all the top m movies, or somehow more generally finding the set of n*m most-linked users and movies.
I will post my code if I solve it before better answers arrive.
My code has a bug which is making it index into the dataframe incorrectly for what I think I'm doing
True, there is a bug.
most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])
is actually grabbing the value counts. So if users 1, 2, and 3 had made 100 reviews each, the code above would return [100, 100, 100] when really we wanted the ids [1, 2, 3]. To get the ids of the most frequent entries instead of the tallies, add .index to value_counts:
most_frequent_users = list(ratings.user.value_counts().index[:20])
most_rated_movies = list(ratings.movie.value_counts().index[:20])
This alone improves the density to almost what the final result is below. What I was doing before was effectively a random sample (erroneously using the value counts as the lookup for movie ids).
Furthermore, the approach I mentioned at the end of the post is a more robust general solution for crosstabbing with highest density as the goal. Find the most frequent X, and within that specific set, find the most frequent Y. This will work well even in sparse datasets.
n_users = 10
n_movies = 20
# list the ids of the most frequent users (those who rated the most movies)
most_frequent_users = list(ratings.user.value_counts().index[:n_users])
# grab all the ratings made by these most frequent users
denser_users = ratings[ratings.user.isin(most_frequent_users)]
# list the ids of the most frequent movies within this group of users
dense_users_most_rated = list(denser_users.movie.value_counts().index[:n_movies])
# grab all the most frequent movies rated by the most frequent users
denser_movies = ratings[ratings.movie.isin(dense_users_most_rated)]
# plot the crosstab
pd.crosstab(denser_users.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')
This is exactly what I was looking for.
The only remaining questions are how standard this approach is, and why some of the values are floats.

Generalized Data Quality Checks on Datasets

I am pulling in a handful of different datasets daily, performing a few simple data quality checks, and then shooting off emails if a dataset fails the checks.
My checks are as simple as checking for duplicates in the dataset, as well as checking whether the number of rows and columns in a dataset has changed -- see below.
assert df.shape == (1016545, 8)
assert len(df) - len(df.drop_duplicates()) == 0
Since these datasets are updated daily and may change the number of rows, is there a better way to check instead of hardcoding the specific number?
For instance, one dataset might have only 400 rows, and another might have 2 million.
Could I check whether the count is within, say, one standard deviation of yesterday's number of rows? But in that case I would need to start collecting previous days' counts in a separate table, and that could get ugly.
Right now, for tables that change daily, I'm doing the following rudimentary check:
assert df.shape[0] <= 1016545 + 100
assert df.shape[0] >= 1016545 - 100
But obviously this is not sustainable.
Any suggestions are much appreciated.
Yes, you would need to store some previous information, but since you don't seem to need something perfectly statistically accurate, I think you can cheat a little. If you keep the average number of records from the previous samples, the previously calculated deviation, and the number of samples taken, you can get reasonably close to what you are looking for by taking the weighted average of the previous deviation and the current deviation.
For example:
Suppose the average count has been 1016545 with a deviation of 85, captured over 10 samples, and today's count is 1016612. The difference from the mean is 1016612 - 1016545 = 67, so the weighted average of the previous deviation and the current one is (85*10 + 67)/11 ≈ 83.
This way you are only storing a handful of variables for each dataset instead of all the record counts back in time, but it also means it's not actually a standard deviation.
As for storage, you could store your data in a database or a json file or any number of other locations -- I won't go into detail for that since it's not clear what environment you are working in or what resources you have available.
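As a rough sketch of that idea (the file name, keys, and tolerance are placeholders, and the 'deviation' here is the running mean absolute difference described above, not a true standard deviation):
import json

def check_row_count(df, stats_path='row_stats.json', tolerance=3):
    # Load the stored running statistics: {'mean': ..., 'deviation': ..., 'n': ...}
    with open(stats_path) as f:
        stats = json.load(f)

    count = len(df)
    diff = abs(count - stats['mean'])
    assert diff <= tolerance * stats['deviation'], (
        f"Row count {count} is outside the expected range around {stats['mean']:.0f}")

    # Fold today's sample into the running statistics and save them
    stats['deviation'] = (stats['deviation'] * stats['n'] + diff) / (stats['n'] + 1)
    stats['mean'] = (stats['mean'] * stats['n'] + count) / (stats['n'] + 1)
    stats['n'] += 1
    with open(stats_path, 'w') as f:
        json.dump(stats, f)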
Hope that helps!

Probability of occurrence based on historical data

The dataset records occurrences of a particular insect at a location for a given year and month, and is available for about 30 years. Given a random location and a future year and month, I want the probability of finding that insect in that place based on the historical data.
I tried to treat it as a classification problem by labelling all available data as 1, and then wanted to check the probability of a new data point being labelled 1. But an error was thrown because there must be at least two classes to train on.
The data looks like this (x and y are longitude and latitude):
x      y      year  month
17.01  22.87  2013  01
42.32  33.09  2015  12
Think about the problem as a map. You'll need a map for each time period you're interested in, so sum all the occurrences per month and year for each location. Unless the locations are already binned, you'll need to bin them, because otherwise the counts are pretty meaningless: round the values in x and y to a reasonable precision, or use numpy to bin the data. Then you can build a map of the counts, or use a Markov model to predict the occurrence.
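A minimal sketch of that binning and counting step, assuming a dataframe df with the columns x, y, year and month shown above (the 1-degree grid size is an arbitrary choice):
import numpy as np
import pandas as pd

# Round coordinates to a 1-degree grid; pick a precision that suits your data
df['x_bin'] = np.round(df['x'])
df['y_bin'] = np.round(df['y'])
# Count sightings per grid cell and calendar month to build a map per period
counts = (df.groupby(['x_bin', 'y_bin', 'month'])
            .size()
            .rename('sightings')
            .reset_index())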
The reason you're not getting anywhere at the moment is that the chance of finding an insect at any random point is virtually 0.
