groupby and pandas in python

I have a file (score.csv) like this:
I need to solve this problem in Python using pandas and groupby.
I have a CSV file whose data I need to group based on a few points:
1. Group the data by series_id.
2. Find the highest score and the percentage for each group.
3. Find the user_ids closest to the 50% mark (the middle cohort), in comparison to point 2, for the first test_id.
4. Find the scores of these users for the rest of the test_ids.
5. Normalise those scores against the topper's score.
The idea is to find the performance of students in each test.
Structure:
Test series (series_id) -> multiple tests (test_id) -> mapped to users (user_id) -> scores
For each series_id, find the first test (the lowest test_id for that series_id) and the users with scores between 40 and 60 in that first test only.
(The analysis will then be done, for the other tests, on the users found in the step above. In other words, I have found the users who score around 50 marks, and now I will track their journey across the other tests.)
Pick the user_ids from above and find their scores for the other tests as well. Along with this, I will have to find the highest score in each test in order to compute the ratio marks_obtained / highest_marks for that test. Basically, we want to normalise the scores with respect to the highest scorer to understand the journey of these users.

You can chain the groupby call with other aggregation methods:
df.groupby('series_id')['marks_gained'].max()
df.groupby('series_id')['marks_gained'].mean()
You can then find the median and define your 50% cohort by equal distance to it:
df.groupby('series_id')['marks_gained'].median()
It's usually better to provide data in a format that is easily reproducible, and to be completely explicit about each of your requests (for example, one has to guess what you mean by percentage - the percentiles of marks?).
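A minimal sketch of the whole pipeline, assuming the CSV has columns named series_id, test_id, user_id and marks_gained (adjust the names to your actual file):

import pandas as pd

df = pd.read_csv('score.csv')

# normalise every score by the topper's score in that test
df['top_score'] = df.groupby(['series_id', 'test_id'])['marks_gained'].transform('max')
df['normalised'] = df['marks_gained'] / df['top_score']

pieces = []
for series_id, series_df in df.groupby('series_id'):
    # first test = lowest test_id for this series
    first_test = series_df[series_df['test_id'] == series_df['test_id'].min()]
    # middle cohort: users scoring between 40 and 60 in the first test
    cohort = first_test.loc[first_test['marks_gained'].between(40, 60), 'user_id']
    # track those users across every test in the series
    pieces.append(series_df[series_df['user_id'].isin(cohort)])

journey = pd.concat(pieces)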

Related

Mean center Pandas

I have a movie dataset. The dataset contains ratings from users. For every user I need the mean of all the ratings from that user. I then need to subtract this mean from every single rating to get the mean-centered rating.
mean_user_rating = df_ratings.groupby('userId')['rating'].mean().to_frame()
mean_user_rating = mean_user_rating.rename(columns={'rating': 'mean_rating'}).reset_index()
film_user = df_ratings.sort_values('rating', ascending=False)
example data
But now I need to subtract the mean ratings from the normal ratings, and this needs to be done per userId.
The question and additional info:
Users have a bias: they do not review in the same way, and therefore an absolute rating is less interesting. To make a better system we can look at the average rating of a user and how a given rating deviates from this average. This way you know whether a user finds a film relatively better or worse. In other words, we mean-center the rating: we take the mean as the midpoint and rewrite the rating as the deviation from this midpoint.
Once you know the deviation from the center point, you can use this to compare films. For example, you can take the average of the deviation of each rating of a film and use this as a new rating for a film.
Question 25: A better top 5
It is now up to you to draw up a new top 5, this time not based on the average rating but on the average mean-centered rating. Again, use df_ratings_filtered as the dataset here. Store the top 5 as a Series in a variable called top_5_mean_centered, with movieId as the index.
Note that this command is slightly larger. Think carefully in advance about which steps you have to go through.
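A minimal sketch of the mean-centering step, assuming df_ratings has the columns userId, movieId and rating (the exercise itself asks for df_ratings_filtered, which would be used the same way):

# per-user mean rating, broadcast back onto every row of that user
user_mean = df_ratings.groupby('userId')['rating'].transform('mean')

# mean-centered rating: how far each rating deviates from the user's own average
df_ratings['rating_centered'] = df_ratings['rating'] - user_mean

# a possible "better top 5": average mean-centered rating per movie
top_5_mean_centered = (df_ratings.groupby('movieId')['rating_centered']
                       .mean()
                       .sort_values(ascending=False)
                       .head(5))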

Generalized Data Quality Checks on Datasets

I am pulling in a handful of different datasets daily, performing a few simple data quality checks, and then shooting off emails if a dataset fails the checks.
My checks are as plain as checking for duplicates in the dataset, as well as checking that the number of rows and columns in a dataset hasn't changed -- see below.
assert df.shape == (1016545, 8)
assert len(df) - len(df.drop_duplicates()) == 0
Since these datasets are updated daily and may change the number of rows, is there a better way to check instead of hardcoding the specific number?
For instance, one dataset might have only 400 rows, and another might have 2 million.
Could I say to check within 'one standard deviation' of the number of rows from yesterday? But in that case, I would need to start collecting previous days' counts in a separate table, and that could get ugly.
Right now, for tables that change daily, I'm doing the following rudimentary check:
assert df.shape[0] <= 1016545 + 100
assert df.shape[0] >= 1016545 - 100
But obviously this is not sustainable.
Any suggestions are much appreciated.
Yes, you would need to store some previous information, but since you don't seem to need it to be perfectly statistically accurate, I think you can cheat a little. If you keep the average number of records based on the previous samples, the previous deviation you calculated, and the number of samples you took, you can get reasonably close to what you are looking for by taking the weighted average of the previous deviation and the current deviation.
For example:
Suppose the average count has been 1016545 with a deviation of 85, captured over 10 samples, and today's count is 1016612. The difference from the mean is 1016612 - 1016545 = 67, so the weighted average of the previous deviation and the current deviation is (85*10 + 67)/11 ≈ 83.
This way you only store a handful of variables for each data set instead of every historical record count, but it also means it's not actually a standard deviation.
As for storage, you could store your data in a database or a json file or any number of other locations -- I won't go into detail for that since it's not clear what environment you are working in or what resources you have available.
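A rough sketch of that running check, assuming the mean, deviation and sample count for each dataset are kept in a small JSON file (the file name, layout and tolerance here are made up):

import json

def check_row_count(name, todays_count, state_path='row_count_state.json', tolerance=3):
    # state file layout (assumed): {"my_dataset": {"mean": 1016545, "dev": 85, "n": 10}}
    with open(state_path) as f:
        state = json.load(f)
    s = state[name]

    diff = abs(todays_count - s['mean'])
    # flag the dataset if today's count is far outside the typical deviation
    ok = diff <= tolerance * max(s['dev'], 1)

    # update the running statistics (an approximation, not a true standard deviation)
    s['dev'] = (s['dev'] * s['n'] + diff) / (s['n'] + 1)
    s['mean'] = (s['mean'] * s['n'] + todays_count) / (s['n'] + 1)
    s['n'] += 1
    with open(state_path, 'w') as f:
        json.dump(state, f)

    return ok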
Hope that helps!

How do I average historical statistics from a dataset with sports results?

I have a tennis dataset and this is the head:
Now I want to average FS_1 for a given ID1. In other words, I want to get every player's average first serve percentage from the data in this dataset, and every player occurs several times.
I know I can do this to get the average value of a field:
def mean(arr):
    return sum(arr) / len(arr)
mean(dataset['FS_1'])
but how do I get a specific players average?
Pandas groupby method should do the trick:
df.groupby('ID1')['FS_1'].mean()
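If you want one specific player's average, you can look it up in that result (the id below is just a placeholder for a real ID1 value):

per_player_avg = df.groupby('ID1')['FS_1'].mean()
some_player_id = 'Federer R.'          # placeholder: use a real value from your ID1 column
per_player_avg.loc[some_player_id]     # that player's average first serve percentage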

Performing multiple calculations on a Python Pandas group from CSV data

I have daily CSVs that are automatically created for work; they average about 1000 rows and have exactly 630 columns. I've been trying to use pandas to create a summary report that I can write to a new txt file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns (-2000 to 300000) are profit/loss data based on time (milliseconds). The file is usually between 700-1000 lines and I usually don't use any data past the column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.groupby('provider')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format for writing all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (e.g. total_tickets) and applying it to the next calculation (e.g. 1s Loss).
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby; there are more examples in the pandas documentation.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simply call grouped = df.groupby('provider'). Note that this does no calculations; it just tells pandas how to find the groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., total tickets), you can just assign them to variables: total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations, as in the sketch below.
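Putting those first pieces together might look like this (column names are taken from the question, and each row is assumed to be one ticket with filled being 1 or 0):

grouped = df.groupby('provider')

total_filled = df['filled'].sum()                       # all filled tickets in the file
tickets_by_provider = grouped['filled'].sum()           # total tickets per provider
share_pct = tickets_by_provider / total_filled * 100    # share %
fill_rate = grouped['filled'].mean()                    # filled / (filled + rejected)
size = grouped['fill_size'].sum()                       # Size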
Update:
For one second loss (and for the other losses), you need two things:
The number of times df['1000'] < 0 for each provider
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: (x['1000'] < 0).sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
The same approach applies to the 10s Loss question.
The last question, about the average over a range of columns, relies on pandas' understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column; mean() has a default argument axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
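Continuing that sketch, the loss and average metrics might be assembled into a per-provider summary like this (the cols construction and the output file name are assumptions; adjust them to your real column labels):

import pandas as pd

grouped = df.groupby('provider')
total_tickets = grouped['filled'].sum()

# columns '1000' ... '10000' hold the 1s-10s profit/loss values (string labels assumed)
cols = [c for c in df.columns if str(c).isdigit() and 1000 <= int(c) <= 10000]

one_s_loss = grouped.apply(lambda g: (g['1000'] < 0).sum()) / total_tickets
one_s_avg = grouped['1000'].mean()

ten_s_loss = grouped.apply(lambda g: (g[cols].min(axis=1) < 0).sum()) / total_tickets
ten_s_avg = grouped[cols].mean().mean(axis=1)   # one reading of "average of the range"

summary = pd.DataFrame({'total_tickets': total_tickets,
                        '1s_loss': one_s_loss,
                        '1s_avg': one_s_avg,
                        '10s_loss': ten_s_loss,
                        '10s_avg': ten_s_avg})
summary.to_csv('provider_summary.txt', sep='\t')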

pandas: smallest X for a defined probability

The data is financial data, with OHLC values in column, e.g.
Open High Low Close
Date
2013-10-20 1.36825 1.38315 1.36502 1.38029
2013-10-27 1.38072 1.38167 1.34793 1.34858
2013-11-03 1.34874 1.35466 1.32941 1.33664
2013-11-10 1.33549 1.35045 1.33439 1.34950
....
I am looking for the answer to the following question:
What is the smallest number X for which (at least) N% of the numbers in a large data set are equal to or bigger than X?
And for our data with N=60 using the High column, the question would be: What is the smallest number X for which (at least) 60% of High column items are equal or bigger than X?
I know how to calculate the standard deviation, mean and the rest with pandas, but my understanding of statistics is rather poor, which keeps me from proceeding further. Please also point me to theoretical papers/tutorials if you know of any.
Thank you.
For the sake of completeness, even though the question was essentially resolved in the comment by @haki above: suppose your data is in the DataFrame data. If you were looking for the high price for which 25% of observed high prices are lower, you would use
data['High'].quantile(q=0.25)
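For the original question with N=60, the smallest X for which roughly 60% of the High values are equal to or bigger than X is the 40th percentile, since about 40% of the values lie below it:

data['High'].quantile(q=0.40)   # ~60% of High values are >= this value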
