Drop rows from a groupby result - python

I want to get this result:
df3=df.groupby(['Region']).apply(lambda x: x[x['Region'].isin(["North", "East"])]['Sales'].sum()).reset_index(name='sum')
Region sum
0 East 455.0
1 North 665.0
2 South 0.0
3 West 0.0
I want to drop rows with value = 0 (or matching other conditions):
Region sum
0 East 455.0
1 North 665.0

You can use df.loc:
df[1] != 0          # -> True/False filter
df.loc[df[1] != 0]  # apply the filter
import pandas as pd

df = pd.DataFrame([['East', 455.0],
                   ['North', 665.0],
                   ['South', 0.0],
                   ['West', 0.0]])
df
Out[11]:
0 1
0 East 455.0
1 North 665.0
2 South 0.0
3 West 0.0
df.loc[df[1]!=0]
Out[12]:
0 1
0 East 455.0
1 North 665.0
Answer to the comment (I am not sure I understood it; do you mean this?):
df.rename(columns={0: 'region', 1: 'sum'}).assign(**{'sum': lambda p: [q if q != 0 else pd.NA for q in p['sum']]}).dropna()

Using df.loc is the easiest method that comes to mind:
filtered_df = df3.loc[df3["sum"] != 0]
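If you need to combine several conditions, here is a minimal sketch (the 100 threshold is an invented example, not from the question) joining boolean masks with & and parentheses:

import pandas as pd

df3 = pd.DataFrame({'Region': ['East', 'North', 'South', 'West'],
                    'sum': [455.0, 665.0, 0.0, 0.0]})

# keep rows whose sum is non-zero AND above an (assumed) threshold of 100
filtered_df = df3.loc[(df3['sum'] != 0) & (df3['sum'] > 100)]
print(filtered_df)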

Conditionals with NaN in python [duplicate]

I have a simple DataFrame like the following:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1996 326
5 San Franciso 49ers 1950 1003
I want to select all values from the 'First Season' column and replace those that are over 1990 with 1. In this example, only Baltimore Ravens would have the 1996 replaced by 1 (keeping the rest of the data intact).
I have used the following:
df.loc[(df['First Season'] > 1990)] = 1
But, it replaces all the values in that row by 1, and not just the values in the 'First Season' column.
How can I replace just the values from that column?
You need to select that column:
In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df
Out[41]:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 1950 1003
So the syntax here is:
df.loc[<mask>, <optional column(s)>]   (here the mask generates the labels to index)
You can check the docs, and also the '10 Minutes to pandas' guide, which shows the semantics.
EDIT
If you want to generate a boolean indicator, you can use the boolean condition to generate a boolean Series and cast the dtype to int; this will convert True and False to 1 and 0 respectively:
In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df
Out[43]:
Team First Season Total Games
0 Dallas Cowboys 0 894
1 Chicago Bears 0 1357
2 Green Bay Packers 0 1339
3 Miami Dolphins 0 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 0 1003
A bit late to the party, but still: I prefer using numpy.where:
import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])
df.loc[df['First Season'] > 1990, 'First Season'] = 1
Explanation:
df.loc takes two arguments, a row indexer and a column indexer. We check whether each row's value in the 'First Season' column is greater than 1990, and then replace it with 1.
df['First Season'].loc[(df['First Season'] > 1990)] = 1
Strange that nobody has this answer; the only part missing from your code is the ['First Season'] right after df (and you can remove the extra parentheses inside). Note that this chained form can raise a SettingWithCopyWarning; the df.loc[mask, 'First Season'] form above avoids that.
For a single condition, i.e. (df['employrate'] > 70):
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
Therefore, the syntax here is:
df.loc[<mask>, <optional column(s)>]   (here the mask generates the labels to index)
For multiple conditions, i.e. (df['employrate'] <= 55) & (df['employrate'] > 50),
use this:
df['employrate'] = np.where(
    (df['employrate'] <= 55) & (df['employrate'] > 50), 11, df['employrate']
)
out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
Therefore, the syntax here is:
df['<column_name>'] = np.where((<filter 1>) & (<filter 2>), <new value>, df['<column_name>'])
Another option is to use a list comprehension:
df['First Season'] = [1 if year > 1990 else year for year in df['First Season']]
You can also use mask, which replaces the values where the condition is met. Note that mask returns a new Series, so assign the result back:
df['First Season'] = df['First Season'].mask(lambda col: col > 1990, 1)
We can update the First Season column in df with the following syntax:
df['First Season'] = expression_for_new_values
To map the values in First Season we can use pandas' .map() method with the below syntax:
data_frame['column'].map({'initial_value_1': 'updated_value_1', 'initial_value_2': 'updated_value_2'})
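For the threshold replacement in this question a dict does not fit well, but .map() also accepts a callable; a minimal sketch, assuming df holds the 'First Season' column from above:

# replace seasons after 1990 with 1, leave everything else intact
df['First Season'] = df['First Season'].map(lambda year: 1 if year > 1990 else year)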

Pandas calculate points per game when number of matches per team is NOT a column

I am working with a dataset of football results; a text version is included at the end of the question.
I want to analyze the most effective teams from this dataset so I have decided to calculate points based on their results and then calculate points per game. For reference, a win is 3 points, a draw is 1 point and a loss is 0 points. So to calculate points I decided to add two new columns that say how many points the home team and away team got. I did this by:
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] > 0, 'points_home_team'] = 3
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] == 0, 'points_home_team'] = 1
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] < 0, 'points_home_team'] = 0
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] > 0, 'points_away_team'] = 0
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] == 0, 'points_away_team'] = 1
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] < 0, 'points_away_team'] = 3
This gives me a SettingWithCopyWarning, but the code seems to work fine: the two new points columns are filled in as expected (they appear in the text version of the dataset below).
To get total points I did this:
home_points = (since_2018.groupby(['home_team'])['points_home_team'].sum() + since_2018.groupby(['away_team'])['points_away_team'].sum())
home_points.sort_values(ascending=False)
Now I want to calculate points per game for each team in order to see which teams have been the most effective and I think I managed to get a number for games played by each team through this code:
matches = since_2018.groupby("home_team").count() + since_2018.groupby('away_team').count()
So now from there I am stuck as to how to actually use those number of games and points to get points per game. Any help is appreciated, thanks!
For reference, here is a text version of the dataset:
home_team away_team home_score away_score tournament city country neutral total_goals points_home_team points_away_team
date
2018-01-02 Iraq United Arab Emirates 0 0 Gulf Cup Kuwait City Kuwait True 0 1.0 1.0
2018-01-02 Oman Bahrain 1 0 Gulf Cup Kuwait City Kuwait True 1 3.0 0.0
2018-01-05 Oman United Arab Emirates 0 0 Gulf Cup Kuwait City Kuwait True 0 1.0 1.0
2018-01-07 Estonia Sweden 1 1 Friendly Abu Dhabi United Arab Emirates True 2 1.0 1.0
2018-01-11 Denmark Sweden 0 1 Friendly Abu Dhabi United Arab Emirates True 1 0.0 3.0
Try this to set the home/away points:
net_goals = since_2018.home_score - since_2018.away_score
home_points = [3 if ng>0 else 1 if ng==0 else 0 for ng in net_goals]
And the inverse for away_points
Now change
matches = since_2018.groupby("home_team").count() + since_2018.groupby('away_team').count()
To
... .groupby('home_team').home_points.agg(['sum', 'count']) ...
and you are almost done
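Putting that outline together, a sketch under the assumption that since_2018 has the home_score/away_score and team columns shown above:

import pandas as pd

since_2018 = since_2018.copy()  # avoids the SettingWithCopyWarning if this is a slice

net_goals = since_2018.home_score - since_2018.away_score
since_2018['home_points'] = [3 if ng > 0 else 1 if ng == 0 else 0 for ng in net_goals]
since_2018['away_points'] = [0 if ng > 0 else 1 if ng == 0 else 3 for ng in net_goals]

# points and match counts per team, home and away separately
home = since_2018.groupby('home_team')['home_points'].agg(['sum', 'count'])
away = since_2018.groupby('away_team')['away_points'].agg(['sum', 'count'])

# add them; fill_value=0 covers teams that only appear on one side
totals = home.add(away, fill_value=0)
totals['ppg'] = totals['sum'] / totals['count']

Sorting with totals.sort_values('ppg', ascending=False) then ranks the most effective teams.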
Create a dedicated total points data frame and add a column for the number of matches to it. Combine the home and away teams to calculate the points per game.
df_h = pd.concat([since_2018.groupby('home_team')['points_home_team'].sum(),
                  since_2018.groupby('home_team').size()], axis=1)
df_h.columns = ['points', 'matches']
df_a = pd.concat([since_2018.groupby('away_team')['points_away_team'].sum(),
                  since_2018.groupby('away_team').size()], axis=1)
df_a.columns = ['points', 'matches']
result = pd.concat([df_h, df_a], axis=0)
result['rate'] = result['points'] / result['matches']
result
points matches rate
Denmark 0.0 1 0.0
Estonia 1.0 1 1.0
Iraq 1.0 1 1.0
Oman 4.0 2 2.0
Bahrain 0.0 1 0.0
Sweden 4.0 2 2.0
United Arab Emirates 2.0 2 1.0
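One caveat with this approach: pd.concat along axis=0 keeps separate home and away rows for any team that appears on both sides (none do in this small sample). A follow-up sketch that collapses such duplicates before computing the rate:

result = pd.concat([df_h, df_a], axis=0)
result = result.groupby(level=0).sum()  # merge the home and away rows per team
result['rate'] = result['points'] / result['matches']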

Create two columns based on a function with apply()

I have a dataset containing football data of the premier league as such:
HomeTeam AwayTeam FTHG FTAG
0 Liverpool Norwich 4 1
1 West Ham Man City 0 5
2 Bournemouth Sheffield United 1 1
3 Burnley Southampton 3 0
... ... ... ... ...
where "FTHG" and "FTAG" are full-time home team goals and away team goals.
I need to write a function that calculates the final Premier League table given the results (in the form of a data frame). What I wrote is this function:
def calcScore(row):
    if PL_df.iloc[row]['FTHG'] > PL_df.iloc[row]['FTAG']:
        x = 3
        y = 0
    elif PL_df.iloc[row]['FTHG'] < PL_df.iloc[row]['FTAG']:
        x = 0
        y = 3
    elif PL_df.iloc[row]['FTHG'] == PL_df.iloc[row]['FTAG']:
        x = 1
        y = 1
    return x, y
This works; for example, for the first row it gives this output:
in[1]: calcScore(0)
out[1]: (3,0)
Now I need to create two columns, HP and AP, that contain the points awarded to the home and away teams respectively, using apply(). But I can't think of a way to do that.
I hope I was clear enough. Thank you in advance.
No need for a function (and also faster than apply):
import numpy as np

win_or_draws = df['FTHG'] > df['FTAG'], df['FTHG'] == df['FTAG']
df['HP'] = np.select(win_or_draws, (3, 1), 0)
df['AP'] = np.select(win_or_draws, (0, 1), 3)
Output:
HomeTeam AwayTeam FTHG FTAG HP AP
0 Liverpool Norwich 4 1 3 0
1 West Ham Man City 0 5 0 3
2 Bournemouth Sheffield United 1 1 1 1
3 Burnley Southampton 3 0 3 0
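If you do want the apply() route asked about in the question, here is a small sketch; the function is rewritten to take a row rather than a positional index, since that is what apply passes with axis=1:

def calc_score(row):
    # returns (home points, away points) for one match
    if row['FTHG'] > row['FTAG']:
        return 3, 0
    elif row['FTHG'] < row['FTAG']:
        return 0, 3
    return 1, 1

scores = df.apply(calc_score, axis=1, result_type='expand')
df['HP'] = scores[0]
df['AP'] = scores[1]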

Replace Matrix elements with 1

I have an empty matrix and I want to replace the matrix elements with 1 if country (index) belongs to Region (column).
I tried to create a double loop, but I got stuck when I needed to write the conditional ([152 rows x 6 columns]). Thanks so much.
west europe east europe latin america
Norway 0 0 0
Denmark 0 0 0
Iceland 0 0 0
Switzerland 0 0 0
Finland 0 0 0
Netherlands 0 0 0
Sweden 0 0 0
Austria 0 0 0
Ireland 0 0 0
Germany 0 0 0
Belgium 0 0 0
I was thinking of something like:
matrix = pd.DataFrame(np.zeros((152, 6), dtype=int),
                      index=[...],  # enumerate all the countries here
                      columns=['west europe', 'east europe', 'latin america',
                               'north america', 'africa', 'asia'])
print(matrix)

for i in range(len(matrix)):
    for j in range(len(matrix.columns)):
        if data.loc[i, 'Region'] == 'Africa' and data.loc[i, 'Country'] in [...]:  # enumerate the African countries here
            matrix.iloc[i, j] = 1
        elif ...:
            matrix.iloc[i, j] = 1
        else:
            matrix.iloc[i, j] = 0
print(matrix)
Sample data frame with countries and region:
Country Happiness Rank Happiness Score Economy Family Health Freedom Generosity Corruption Dystopia Job Satisfaction Region
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 Western Europe
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 Western Europe
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 Western Europe
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 Western Europe
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 Western Europe
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 Western Europe
If your input variable data is a DataFrame then, as @Alollz mentioned, you can use the pandas pd.get_dummies function.
Something like this: pd.get_dummies(data, columns=['Region'])
And the output would look like:
Country HappinessRank HappinessScore Economy Family Health Freedom Generosity Corruption Dystopia JobSatisfaction Region_WesternEurope
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 1
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 1
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 1
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 1
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 1
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 1
It will take the Region category column and make it into indicator columns. In this case it uses the column name as the prefix but you can play around with that.
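If the end goal is the country-by-region 0/1 matrix from the question, a short sketch (assuming data has the 'Country' and 'Region' columns shown in the sample) using pd.crosstab, which counts Country/Region pairs:

import pandas as pd

# one row per country, one column per region; clip guards against
# a country appearing more than once in the data
matrix = pd.crosstab(data['Country'], data['Region']).clip(upper=1)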

Combining dataframes with different indices in Pandas

I've generated a dataframe of probabilities from a scikit-learn classifier like this:
def preprocess_category_series(series, key):
    if series.dtype != 'category':
        return series
    if series.cat.ordered:
        s = pd.Series(series.cat.codes, name=key)
        mode = s.mode()[0]
        s[s < 0] = mode
        return s
    else:
        return pd.get_dummies(series, drop_first=True, prefix=key)

data = df[df.year == 2012]
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1)
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)])
I now want to append these probabilities back to my original dataframe. However, the predictions dataframe generated above, while preserving the order of items in data, has lost data's index. I assumed I'd be able to do
pd.concat([data, predictions], axis=1, ignore_index=True)
but this generates an error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I've seen that this comes up sometimes when column names are duplicated, but in this case none are. What is that error about? What's the best way to stitch these dataframes back together?
data.head():
year serial hwtfinl region statefip \
cpsid
20121000000100 2012 1 3796.85 East South Central Division Alabama
20121000000100 2012 1 3796.85 East South Central Division Alabama
20121000000100 2012 1 3796.85 East South Central Division Alabama
20120800000500 2012 6 2814.24 East South Central Division Alabama
20120800000600 2012 7 2828.42 East South Central Division Alabama
county month pernum cpsidp wtsupp ... \
cpsid ...
20121000000100 0 11 1 20121000000101 3208.1213 ...
20121000000100 0 11 2 20121000000102 3796.8506 ...
20121000000100 0 11 3 20121000000103 3386.4305 ...
20120800000500 0 11 1 20120800000501 2814.2417 ...
20120800000600 1097 11 1 20120800000601 2828.4193 ...
race hispan educ votereg \
cpsid
20121000000100 White Not Hispanic 111 Voted
20121000000100 White Not Hispanic 111 Did not register
20121000000100 White Not Hispanic 111 Voted
20120800000500 White Not Hispanic 92 Voted
20120800000600 White Not Hispanic 73 Did not register
educ_parsed age4 educ4 \
cpsid
20121000000100 Bachelor's degree 65+ College grad
20121000000100 Bachelor's degree 65+ College grad
20121000000100 Bachelor's degree Under 30 College grad
20120800000500 Associate's degree, academic program 45-64 College grad
20120800000600 High school diploma or equivalent 65+ HS or less
race4 region4 gender
cpsid
20121000000100 White South Male
20121000000100 White South Female
20121000000100 White South Female
20120800000500 White South Female
20120800000600 White South Female
predictions.head():
a b c d e f
0 0.119534 0.336761 0.188023 0.136651 0.095342 0.123689
1 0.148409 0.346429 0.134852 0.169661 0.087556 0.113093
2 0.389586 0.195802 0.101738 0.085705 0.114612 0.112557
3 0.277783 0.262079 0.180037 0.102030 0.071171 0.106900
4 0.158404 0.396487 0.088064 0.079058 0.171540 0.106447
Just for fun, I've tried this specifically with only the head rows:
pd.concat([data_2012.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True)
Same error comes up.
I'm on 0.18.0 too. This is what I tried and it worked. Is this what you are doing?
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

clf = GaussianNB()
clf.fit(X, Y)

data = pd.DataFrame(X)
data['y'] = Y
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(X)])
pd.concat([data, predictions], axis=1, ignore_index=True)
0 1 2 3 4
0 -1 -1 1 1.000000e+00 1.522998e-08
1 -2 -1 1 1.000000e+00 3.775135e-11
2 -3 -2 1 1.000000e+00 5.749523e-19
3 1 1 2 1.522998e-08 1.000000e+00
4 2 1 2 3.775135e-11 1.000000e+00
5 3 2 2 5.749523e-19 1.000000e+00
It turns out there is one relatively straightforward solution:
predictions.index = data.index
pd.concat([data, predictions], axis=1)
Now it works perfectly. No clue why it wouldn't work the way I originally tried. (A likely explanation: data's cpsid index contains duplicate values, and with axis=1 concat still aligns rows on the index; ignore_index=True only renumbers the concatenation axis, which here is the columns, so the non-unique index triggers the InvalidIndexError.)
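Equivalently, a small sketch (assuming the same clf, factors and data as above) that attaches the index when the predictions frame is built, so no separate assignment is needed:

predictions = pd.DataFrame(
    clf.predict_proba(factors),  # array of shape (n_rows, n_classes)
    columns=clf.classes_,
    index=data.index,            # reuse data's index so concat aligns rows
)
result = pd.concat([data, predictions], axis=1)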
