I have an empty matrix ([152 rows x 6 columns]) and I want to replace an element with 1 if its country (the index) belongs to the region (the column). I tried to write a double loop, but I got stuck when I needed to write the conditional. Thanks so much.
             west europe  east europe  latin america
Norway                 0            0              0
Denmark                0            0              0
Iceland                0            0              0
Switzerland            0            0              0
Finland                0            0              0
Netherlands            0            0              0
Sweden                 0            0              0
Austria                0            0              0
Ireland                0            0              0
Germany                0            0              0
Belgium                0            0              0
I was thinking something like:
import numpy as np
import pandas as pd

matrix = pd.DataFrame(np.zeros((152, 6), dtype=int),
                      index=[...],  # enumerate all the countries here
                      columns=['west europe', 'east europe', 'latin america',
                               'north america', 'africa', 'asia'])
print(matrix)
for i in range(len(matrix)):
    for j in range(len(matrix.columns)):
        if data.loc[i, 'Region'] == 'Africa' and data.loc[i, 'Country'] in [...]:  # enumerate all African countries here
            matrix.iloc[i, j] = 1
        elif ...:  # the same test for the other regions
            matrix.iloc[i, j] = 1
        else:
            matrix.iloc[i, j] = 0
print(matrix)
Sample data frame with countries and region:
Country Happiness Rank Happiness Score Economy Family Health Freedom Generosity Corruption Dystopia Job Satisfaction Region
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 Western Europe
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 Western Europe
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 Western Europe
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 Western Europe
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 Western Europe
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 Western Europe
If your input variable data is a DataFrame, then, as @Alollz mentioned, you can use the pandas pd.get_dummies function.
Something like this: pd.get_dummies(data, columns=['Region'])
And the output would look like:
Country HappinessRank HappinessScore Economy Family Health Freedom Generosity Corruption Dystopia JobSatisfaction Region_WesternEurope
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 1
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 1
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 1
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 1
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 1
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 1
It will take the Region category column and make it into indicator columns. In this case it uses the column name as the prefix but you can play around with that.
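If you want exactly the country-by-region 0/1 matrix from the question, a minimal sketch (assuming data holds the sample frame above with its Country and Region columns) is to index by country first and let get_dummies expand Region:

import pandas as pd

# assuming `data` has 'Country' and 'Region' columns as in the sample
matrix = pd.get_dummies(data.set_index('Country')['Region']).astype(int)
print(matrix.head())

Because get_dummies is applied to a Series here, the indicator columns are named after the region values themselves (e.g. Western Europe) rather than prefixed; .astype(int) guards against newer pandas versions returning booleans instead of 0/1.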
What I'm trying to achieve is to combine Name into one comma-delimited value whenever the Country column is duplicated, and to sum the values in the Salary column.
Current input :
df = pd.DataFrame({'Name': {0: 'John', 1: 'Steven', 2: 'Ibrahim', 3: 'George', 4: 'Nancy', 5: 'Mo', 6: 'Khalil'},
                   'Country': {0: 'USA', 1: 'UK', 2: 'UK', 3: 'France', 4: 'Ireland', 5: 'Ireland', 6: 'Ireland'},
                   'Salary': {0: 100, 1: 200, 2: 200, 3: 100, 4: 50, 5: 100, 6: 10}})
Name Country Salary
0 John USA 100
1 Steven UK 200
2 Ibrahim UK 200
3 George France 100
4 Nancy Ireland 50
5 Mo Ireland 100
6 Khalil Ireland 10
Expected output :
Rows 1 & 2 (in the input) get grouped into one since the Country column is duplicated, and the Salary column is summed up.
The same goes for rows 4, 5 & 6.
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
What I have tried (though I'm not sure how to combine the text in the Name column):
df.groupby(['Country'],as_index=False)['Salary'].sum()
[Out:]
Country Salary
0 France 100
1 Ireland 160
2 UK 400
3 USA 100
Use groupby() and agg():
out = df.groupby('Country', as_index=False).agg({'Name': ', '.join, 'Salary': 'sum'})
If you need unique values of the 'Name' column, then use:
out = (df.groupby('Country', as_index=False)
         .agg({'Name': lambda x: ', '.join(set(x)), 'Salary': 'sum'}))
Note: use pd.unique() in place of set() if order of unique values is important
Output of out:
Country Name Salary
0 France George 100
1 Ireland Nancy, Mo, Khalil 160
2 UK Steven, Ibrahim 400
3 USA John 100
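For example, the order-preserving variant from that note would be (a sketch against the same df):

import pandas as pd

# pd.unique keeps first-appearance order, unlike set()
out = (df.groupby('Country', as_index=False)
         .agg({'Name': lambda x: ', '.join(pd.unique(x)), 'Salary': 'sum'}))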
Use agg:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})
And to restore the original column order you can append [df.columns] to the chain:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})[df.columns]
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
How can I count, for a country that appears in multiple rows, how many times it has passed or failed?
Like this:
ID unique Countries Test
1 Spain, Netherlands Fail
2 Italy Pass
3 France, Netherlands Pass
4 Belgium, France, Bulgaria Fail
5 Belgium, United Kingdom Pass
6 Netherlands, France Pass
7 France, Netherlands, Belgium Pass
and the result should be like this:
Pass Fail
Spain 0 1
Italy 1 0
France 3 1
Netherlands 3 1
Belgium 2 1
United Kingdom 1 0
That's because Netherlands appears in 4 rows, with 3 passes and one fail.
Use Series.str.split with DataFrame.explode, then call crosstab:
df1 = df.assign(Countries=df.Countries.str.split(', ')).explode('Countries')
df2 = pd.crosstab(df1['Countries'], df1['Test'])
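A self-contained sketch with the sample data (assuming the table is a DataFrame named df; note that crosstab will also count Bulgaria, which the expected table above leaves out):

import pandas as pd

df = pd.DataFrame({
    'Countries': ['Spain, Netherlands', 'Italy', 'France, Netherlands',
                  'Belgium, France, Bulgaria', 'Belgium, United Kingdom',
                  'Netherlands, France', 'France, Netherlands, Belgium'],
    'Test': ['Fail', 'Pass', 'Pass', 'Fail', 'Pass', 'Pass', 'Pass'],
})

# one row per (country, test) pair, then a Pass/Fail count per country
df1 = df.assign(Countries=df.Countries.str.split(', ')).explode('Countries')
df2 = pd.crosstab(df1['Countries'], df1['Test'])
print(df2)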
I am working with a dataset of match results (a text version is shown further below).
I want to analyze the most effective teams from this dataset so I have decided to calculate points based on their results and then calculate points per game. For reference, a win is 3 points, a draw is 1 point and a loss is 0 points. So to calculate points I decided to add two new columns that say how many points the home team and away team got. I did this by:
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] > 0, 'points_home_team'] = 3
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] == 0, 'points_home_team'] = 1
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] < 0, 'points_home_team'] = 0
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] > 0, 'points_away_team'] = 0
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] == 0, 'points_away_team'] = 1
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] < 0, 'points_away_team'] = 3
This is giving me a SettingWithCopyWarning, but the code seems to be working fine.
To get total points I did this:
home_points = (since_2018.groupby(['home_team'])['points_home_team'].sum() + since_2018.groupby(['away_team'])['points_away_team'].sum())
home_points.sort_values(ascending=False)
Now I want to calculate points per game for each team in order to see which teams have been the most effective and I think I managed to get a number for games played by each team through this code:
matches = since_2018.groupby("home_team").count() + since_2018.groupby('away_team').count()
So now from there I am stuck as to how to actually use those numbers of games and points to get points per game. Any help is appreciated, thanks!
For reference, here is a text version of the dataset:
home_team away_team home_score away_score tournament city country neutral total_goals points_home_team points_away_team
date
2018-01-02 Iraq United Arab Emirates 0 0 Gulf Cup Kuwait City Kuwait True 0 1.0 1.0
2018-01-02 Oman Bahrain 1 0 Gulf Cup Kuwait City Kuwait True 1 3.0 0.0
2018-01-05 Oman United Arab Emirates 0 0 Gulf Cup Kuwait City Kuwait True 0 1.0 1.0
2018-01-07 Estonia Sweden 1 1 Friendly Abu Dhabi United Arab Emirates True 2 1.0 1.0
2018-01-11 Denmark Sweden 0 1 Friendly Abu Dhabi United Arab Emirates True 1 0.0 3.0
Try this to set the home/away scores:
net_goals = since_2018.home_score - since_2018.away_score
home_points = [3 if ng>0 else 1 if ng==0 else 0 for ng in net_goals]
And the inverse for away_points.
Now change
matches = since_2018.groupby("home_team").count() + since_2018.groupby('away_team').count()
to
... .groupby('home_team').home_points.agg(['sum', 'count']) ...
and you are almost done.
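Putting that together, one sketch of the full points-per-game calculation (using the points_home_team / points_away_team columns the question already built, rather than the list comprehension above):

# per-team points total and match count, split by home and away
home = since_2018.groupby('home_team')['points_home_team'].agg(['sum', 'count'])
away = since_2018.groupby('away_team')['points_away_team'].agg(['sum', 'count'])

# add the two sides; fill_value=0 covers teams that only played home or only away
totals = home.add(away, fill_value=0)
totals['ppg'] = totals['sum'] / totals['count']  # points per game
print(totals.sort_values('ppg', ascending=False))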
Create a dedicated total-points data frame and add a column with the number of matches to it, then combine the home and away sides to calculate the points per game.
df_h = pd.concat([since_2018.groupby('home_team')['points_home_team'].sum(),
                  since_2018.groupby('home_team').size()], axis=1)
df_h.columns = ['points', 'matches']
df_a = pd.concat([since_2018.groupby('away_team')['points_away_team'].sum(),
                  since_2018.groupby('away_team').size()], axis=1)
df_a.columns = ['points', 'matches']
result = pd.concat([df_h, df_a], axis=0)
result['rate'] = result['points'] / result['matches']
result
                      points  matches  rate
Denmark                  0.0        1   0.0
Estonia                  1.0        1   1.0
Iraq                     1.0        1   1.0
Oman                     4.0        2   2.0
Bahrain                  0.0        1   0.0
Sweden                   4.0        2   2.0
United Arab Emirates     2.0        2   1.0
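Note that in a larger dataset a team will usually appear both home and away, so this plain concat leaves two rows per team in result. A sketch that collapses them before computing the rate:

result = pd.concat([df_h, df_a]).groupby(level=0).sum()
result['rate'] = result['points'] / result['matches']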
I have data from an API that is in this format:
PublicAssistanceFundedProjectsSummaries
0 [{'disasterNumber': 1239, 'declarationDate': '...
I want it to be in this format:
disasterNumber declarationDate etc. etc.
1239 11/21/2001 XYZ. XYZ
How would I go about this? My current code looks like:
import requests
from pandas.io.json import json_normalize

g = requests.get('https://www.fema.gov/api/open/v1/PublicAssistanceFundedProjectsSummaries?get=PublicAssistanceFundedProjectsSummaries').json()
df_g = json_normalize(g)
df_g2 = df_g.drop(df_g.columns[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]], axis=1)
Sorry, I'm very new to coding.
You are close; you only need to specify the PublicAssistanceFundedProjectsSummaries key:
df = json_normalize(g, 'PublicAssistanceFundedProjectsSummaries')
Or:
df = json_normalize(g['PublicAssistanceFundedProjectsSummaries'])
print (df.head())
disasterNumber declarationDate incidentType state \
0 1239 1998-08-26T00:00:00.000Z Severe Storm(s) Texas
1 1240 1998-08-27T00:00:00.000Z Hurricane North Carolina
2 1240 1998-08-27T00:00:00.000Z Hurricane North Carolina
3 1240 1998-08-27T00:00:00.000Z Hurricane North Carolina
4 1240 1998-08-27T00:00:00.000Z Hurricane North Carolina
county applicantName educationApplicant \
0 Val Verde UNITED MEDICAL CENTERS False
1 NaN ATLANTIC TELEPHONE MEMBERSHIP CORP. False
2 NaN CARTERET COUNTY SCHOOLS True
3 NaN CEDAR POINT, TOWN OF False
4 NaN AURORA, TOWN OF False
numberOfProjects federalObligatedAmount hash \
0 1 12028.63 5d10528e4d343d96061da8816465e64b
1 1 30956.35 4aad896727265f6e5948a76e7977f57e
2 4 1288255.25 eb55e580a33cd4d97fa8fd1f0d71294d
3 1 4125.00 e8ed9d023142825fa658ef7b5ae4729a
4 6 22810.25 23684ce3b4d27bc7561c39b024c3a05d
lastRefresh id
0 2019-10-18T14:06:56.673Z 5da9c7005164620fcb0213ca
1 2019-10-18T14:06:56.676Z 5da9c7005164620fcb0213dd
2 2019-10-18T14:06:56.680Z 5da9c7005164620fcb0213f4
3 2019-10-18T14:06:56.680Z 5da9c7005164620fcb0213f8
4 2019-10-18T14:06:56.676Z 5da9c7005164620fcb0213da
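As a side note, on pandas 1.0 and later json_normalize is exposed at the top level, so an equivalent call (a sketch with the same g) is:

import pandas as pd

df = pd.json_normalize(g['PublicAssistanceFundedProjectsSummaries'])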
I've generated a dataframe of probabilities from a scikit-learn classifier like this:
import pandas as pd

def preprocess_category_series(series, key):
    if series.dtype != 'category':
        return series
    if series.cat.ordered:
        s = pd.Series(series.cat.codes, name=key)
        mode = s.mode()[0]
        s[s < 0] = mode
        return s
    else:
        return pd.get_dummies(series, drop_first=True, prefix=key)
data = df[df.year == 2012]
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1)
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)])
I now want to append these probabilities back to my original dataframe. However, the predictions dataframe generated above, while preserving the order of items in data, has lost data's index. I assumed I'd be able to do
pd.concat([data, predictions], axis=1, ignore_index=True)
but this generates an error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I've seen that this comes up sometimes if column names are duplicated, but in this case none are. What is that error about? What's the best way to stitch these dataframes back together?
data.head():
year serial hwtfinl region statefip \
cpsid
20121000000100 2012 1 3796.85 East South Central Division Alabama
20121000000100 2012 1 3796.85 East South Central Division Alabama
20121000000100 2012 1 3796.85 East South Central Division Alabama
20120800000500 2012 6 2814.24 East South Central Division Alabama
20120800000600 2012 7 2828.42 East South Central Division Alabama
county month pernum cpsidp wtsupp ... \
cpsid ...
20121000000100 0 11 1 20121000000101 3208.1213 ...
20121000000100 0 11 2 20121000000102 3796.8506 ...
20121000000100 0 11 3 20121000000103 3386.4305 ...
20120800000500 0 11 1 20120800000501 2814.2417 ...
20120800000600 1097 11 1 20120800000601 2828.4193 ...
race hispan educ votereg \
cpsid
20121000000100 White Not Hispanic 111 Voted
20121000000100 White Not Hispanic 111 Did not register
20121000000100 White Not Hispanic 111 Voted
20120800000500 White Not Hispanic 92 Voted
20120800000600 White Not Hispanic 73 Did not register
educ_parsed age4 educ4 \
cpsid
20121000000100 Bachelor's degree 65+ College grad
20121000000100 Bachelor's degree 65+ College grad
20121000000100 Bachelor's degree Under 30 College grad
20120800000500 Associate's degree, academic program 45-64 College grad
20120800000600 High school diploma or equivalent 65+ HS or less
race4 region4 gender
cpsid
20121000000100 White South Male
20121000000100 White South Female
20121000000100 White South Female
20120800000500 White South Female
20120800000600 White South Female
predictions.head():
a b c d e f
0 0.119534 0.336761 0.188023 0.136651 0.095342 0.123689
1 0.148409 0.346429 0.134852 0.169661 0.087556 0.113093
2 0.389586 0.195802 0.101738 0.085705 0.114612 0.112557
3 0.277783 0.262079 0.180037 0.102030 0.071171 0.106900
4 0.158404 0.396487 0.088064 0.079058 0.171540 0.106447
Just for fun, I've tried this specifically with only the head rows:
pd.concat([data_2012.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True)
Same error comes up.
I'm on 0.18.0 too. This is what I tried and it worked. Is this what you are doing?
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X,Y)
import pandas as pd
data = pd.DataFrame(X)
data['y']=Y
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(X)])
pd.concat([data, predictions], axis=1, ignore_index=True)
0 1 2 3 4
0 -1 -1 1 1.000000e+00 1.522998e-08
1 -2 -1 1 1.000000e+00 3.775135e-11
2 -3 -2 1 1.000000e+00 5.749523e-19
3 1 1 2 1.522998e-08 1.000000e+00
4 2 1 2 3.775135e-11 1.000000e+00
5 3 2 2 5.749523e-19 1.000000e+00
It turns out there is one relatively straightforward solution:
predictions.index = data.index
pd.concat([data, predictions], axis=1)
Now it works perfectly. In hindsight, the original attempt failed because data's cpsid index contains duplicate values: pd.concat(axis=1) aligns the inputs on their indexes before concatenating, and ignore_index=True only relabels the result afterwards, so the non-unique index still raises the InvalidIndexError.
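An alternative sketch (my variant, not from the original answer): reset data to a plain RangeIndex so the axis=1 alignment is trivial, then restore cpsid afterwards:

# data.reset_index() moves the duplicated cpsid index into a column;
# predictions already has a 0..n-1 RangeIndex, so the concat lines up row by row
out = pd.concat([data.reset_index(), predictions], axis=1).set_index('cpsid')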