Pandas filter based on aggregate values - python

I am using the data found here: Kaggle NFL Data. I am attempting to filter the data based on the number of pass attempts per player.
I read all of the data into the variable all_nfl_data. I then would like to do this:
all_pass_plays = all_nfl_data[all_nfl_data.PlayType == 'Pass']
passers_under_100 = all_pass_plays.groupby('Passer').transform('size') <= 100
I cannot figure out how to correctly filter based on the above logic. I am trying to filter for players who have fewer than 100 pass attempts in total. The goal is to filter the full dataframe based on this number, not just return the player names themselves. Appreciate the help :)

You can do it with isin (PS: fixing your code):
all_pass_plays = all_nfl_data[all_nfl_data.PlayType == 'Pass']
passers_under_100 = all_pass_plays.groupby('Passer').size() <= 100
afterfilterdf = all_nfl_data[all_nfl_data['Passer'].isin(passers_under_100[passers_under_100].index)]

Alternative solution in one line
passers_under_100 = all_pass_plays.groupby('Passer').filter(lambda x: x['Passer'].size <= 100)
Corresponding documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html
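For what it's worth, the transform approach from the question is also workable; here is a minimal sketch (the variable names are illustrative, and note that this filters only the pass plays, not the full all_nfl_data frame):
# transform('size') returns the per-group size aligned with the original
# index, so it can be used directly as a boolean mask on the same frame
per_passer_size = all_pass_plays.groupby('Passer')['Passer'].transform('size')
low_volume_pass_plays = all_pass_plays[per_passer_size <= 100]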

Related

Add a column to the dataset which contains the total number of cast members in that particular movie or TV show

I have just started working with pandas. Currently I'm working on a NETFLIX dataset.
In this dataset I want to add a new column which contains the total number of cast members for each particular movie or TV show. I can calculate the cast count individually, but I want to calculate it for all of them. Can someone help me write this code?
NETFLIX.CSV FILE
Here is what I'm trying to do:
def set_cast(val):
    for i in df['cast']:
        if val == "None":
            return 0
        else:
            return len(df.loc[i, 'cast'].split(', '))

df['num_of_cast'] = df['cast'].apply(set_cast)
That's how I'm trying to add the number of cast members in a new column, but it's not working... The dataset contains 8807 rows, so adding each of them individually is not possible for me.
I need a solution for this. Thanks.
You are almost there.
When you apply a function to a pd.Series, it is applied to each individual element of the series.
So try this:
def set_cast(val):
    if val is None:
        return 0
    if val == 'None':
        return 0
    return len(val.split(', '))

df['num_of_cast'] = df['cast'].apply(set_cast)
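As a side note, pandas string methods can do the same job without apply; a minimal vectorized sketch, assuming missing casts appear as NaN or the literal string 'None' (mask and counts are illustrative names):
# Blank out missing entries, then count the comma-separated names in one pass
mask = df['cast'].isna() | (df['cast'] == 'None')
counts = df['cast'].where(~mask).str.split(', ').str.len()
df['num_of_cast'] = counts.fillna(0).astype(int)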

Best way to loop through a filtered pandas DataFrame

I need to loop through a pandas DataFrame, but first I have to filter it. I need to look at how many "old_id"s are attached to each new ID.
I wrote this code and it is working fine, but it doesn't scale really well.
d = dict()
for new_id in new_id_list:
    d[new_id] = df[df['new_id_col'] == new_id]['old_id'].nunique()
How can I make this more efficient?
Looks like you're looking for groupby + nunique. This fetches the number of unique "old_id"s per "new_id_col":
out = df.groupby('new_id_col')['old_id'].nunique().to_dict()
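A quick check on toy data (the values below are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'new_id_col': ['a', 'a', 'b', 'b', 'b'],
    'old_id': [1, 2, 3, 3, 4],
})
out = df.groupby('new_id_col')['old_id'].nunique().to_dict()
print(out)  # {'a': 2, 'b': 2}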

How To Apply Conditions In A DataFrame To Give A Particular Set Of Value Or Values?

I have two questions here that I can't find a solution to. Please help.
This is a Python pandas DataFrame, not a MySQL table. For the first question, here is the code that I tried:
for x in stock["Price"]:
    if x < 150:
        print(stock[["ItemNo", "ItemName"]])
    else:
        print("Error")
"stock" is the name of the DataFrame that I used.
Try:
# Display item number and name where the price is less than 150
less_than_150 = stock.loc[stock["Price"] < 150, ["ItemNo", "ItemName"]]
print(less_than_150)

# Display details of the different types of Tea available in the shop
# (a DataFrame has no .unique(); drop_duplicates() de-duplicates whole rows)
different_teas = stock[stock["ItemName"].str.contains("Tea", regex=False)].drop_duplicates()
print(different_teas)
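A quick run on toy data (the rows below are made up; the column names come from the question):
import pandas as pd

stock = pd.DataFrame({
    'ItemNo': [1, 2, 3],
    'ItemName': ['Green Tea', 'Black Tea', 'Coffee'],
    'Price': [120, 180, 90],
})
print(stock.loc[stock['Price'] < 150, ['ItemNo', 'ItemName']])  # Green Tea and Coffee rows
print(stock[stock['ItemName'].str.contains('Tea', regex=False)].drop_duplicates())  # both Tea rows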

Efficiently filtering the oldest record in a dataframe according to each id

I have a dataframe with the following details:
For each q_id there might be multiple ph_id with different ph_date.
I want to make a new dataframe out of it, such that for each q_id there is just one ph_id: the oldest one (the one with the minimum date).
I tried the following code, but I think it is computationally slow:
def oldest_ph(q_id):
    return a.loc[a.ph_date == a[a['q_id'] == q_id].ph_date.min(), 'ph_date']

b['oldest_date'] = a['q_id'].apply(oldest_ph)
Is there any better way to do this?
First, let's try extracting the oldest ph_id for each q_id; then you can use map:
s = df.sort_values('ph_date').drop_duplicates('q_id').set_index('q_id')
df['ph_id'] = df['q_id'].map(s['ph_id'])
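Another common idiom for the same task, assuming ph_date is a proper datetime (or otherwise comparable) column, is groupby(...).idxmin(), which keeps the entire oldest row per q_id (the variable name is illustrative):
# One row per q_id: the row whose ph_date is the group minimum
oldest_per_q_id = df.loc[df.groupby('q_id')['ph_date'].idxmin()]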

How to access values in groupby dataframe with multiple labels?

I am trying to find the values inside a dataframe that has been grouped.
I grouped the payment data by the time the person borrowed the money and the number of months it took them to pay, and summed the amount they paid. My goal is to find the list of months it took for people to pay back.
For example, how can I know the list of 'month_taken' when start_yyyymm is 201807?
payment_sum_monthly = payment_data.groupby(['start_yyyymm', 'month_taken'])[['amount']].sum()
If I use R and put the payment data in data.table form, I can find out the list of month_taken by
payment_sum_monthly[start_yyyymm == '201807',month_taken]
How can I do this in Python? Thanks.
is_date = payment_data['start_yyyymm'] == "201807"
This should give you all the rows where start_yyyymm is 201807. Then, to work with those rows, you can do the following:
date_set = payment_data[is_date].copy()
payment_sum_monthly = date_set.groupby('month_taken').agg('sum')
payment_sum_monthly
And if you need one more condition, you can do the following:
condition2 = payment_data['column name'] == condition
payment_data[is_date & condition2]
I hope I got your question right and that it helps.
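For completeness: the MultiIndex produced by the groupby in the question can also be sliced directly, which is the closest analogue of the R data.table one-liner (this assumes start_yyyymm is stored as a string):
# Selecting the outer index level leaves month_taken as the remaining index
months_taken = payment_sum_monthly.loc['201807'].index.tolist()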
