Here's the data after the preliminary data cleaning.
year  country  employees
2001  US       9
2001  Canada   81
2001  France   22
2001  Japan    31
2001  Chile    7
2001  Mexico   15
2001  Total    165
2002  US       5
2002  Canada   80
2002  France   20
2002  Japan    30
2002  Egypt    35
2002  Total    170
...   ...      ...
2010  US       32
...   ...      ...
What I want to get is the table below, which sums all countries other than US, Canada, France, and Japan into 'Others'. The list of countries varies every year from 2001 to 2010, so I want to use a for loop with an if condition to loop over every year.
year  country  employees
2001  US       9
2001  Canada   81
2001  France   22
2001  Japan    31
2001  Others   22
2001  Total    165
2002  US       5
2002  Canada   80
2002  France   20
2002  Japan    30
2002  Others   35
2002  Total    170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map away values that you don't recognize:
country = df["country"].where(df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others")
df.groupby([df["year"], country]).sum(numeric_only=True)
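If you want the result back as a flat table like the expected output above, a minimal follow-up (assuming df and the country Series from the snippet above) is to aggregate just the employees column and reset the index:
out = df.groupby([df["year"], country])["employees"].sum().reset_index()
print(out)
Note that within each year the countries come back in alphabetical order rather than in the order they appear in the raw data.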
I have a sample dataframe:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
How can I replicate the above dataframe several times over without changing the order of the rows?
Expected outcome:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
How about:
pd.concat([df]*3, ignore_index=True)
Output:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
You can use pd.concat, where x is the number of copies you want:
result = pd.concat([df] * x).reset_index(drop=True)
print(result)
Output (for x=3):
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
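For what it's worth, ignore_index=True and reset_index(drop=True) do the same job here; a quick sanity check, rebuilding the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'city': ['Los Angles', 'New York', 'Texas', 'Washington DC']})
a = pd.concat([df] * 3, ignore_index=True)   # first answer
b = pd.concat([df] * 3).reset_index(drop=True)   # second answer
print(a.equals(b))  # True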
My data is in a CSV like this:
ID  Date   Year  Home Team  Away Team  HP  AP
1   09/02  1966  Miami      Oakland    14  23
2   09/03  1966  Houston    Denver     45   7
3   09/10  1966  Oakland    Houston    31   0
4   09/27  1966  Houston    Oakland    18  10
5   10/20  1966  Oakland    Houston    21  18
On each row I want to sum the previously accumulated home and away points for both the home team and the away team.
I have used pandas groupby to get the home points for the home team and the away points for the away team, similar to below:
df1['HT_HP'] = df1.groupby('Home Team')['HP'].apply(lambda x: x.shift().cumsum())
But I can't work out how to get the previously scored away points for the home team, or the previously scored home points for the away team.
So, for example, for the first Oakland vs Houston game there would be a column with Oakland's 23 previous away points and a separate column with Houston's 45 previous home points.
Expected outcome:
ID  Date   Year  Home Team  Away Team  HP  AP  HT_AP  AT_HP
1   09/02  1966  Miami      Oakland    14  23  NaN    NaN
2   09/03  1966  Houston    Denver     45   7  NaN    NaN
3   09/10  1966  Oakland    Houston    31   0  23     45
4   09/27  1966  Houston    Oakland    18  10  0      31
5   10/20  1966  Oakland    Houston    21  18  33     63
I've tried this:
df1['HT_AGS'] = df1.where(df1['AwayTeam']==df1['HomeTeam']).groupby('HomeTeam')['FTHG'].apply(lambda x : x.shift().cumsum())
but it returns a full column of NaN values.
In Excel it would be something like SUMIFS(F1:F3, D1:D3, E4).
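One way to get HT_AP and AT_HP (a sketch, assuming the dataframe is df1 with the column names from the sample above) is to walk the rows in order and keep running home/away totals per team, recording each total as it stood before the game:
import numpy as np

home_pts = {}   # cumulative HP per team from its previous home games
away_pts = {}   # cumulative AP per team from its previous away games
ht_ap, at_hp = [], []

for _, row in df1.iterrows():
    ht, at = row['Home Team'], row['Away Team']
    # totals accumulated before this game (NaN if no prior games in that role, as in the expected output)
    ht_ap.append(away_pts.get(ht, np.nan))
    at_hp.append(home_pts.get(at, np.nan))
    # update the running totals with this game's points
    home_pts[ht] = home_pts.get(ht, 0) + row['HP']
    away_pts[at] = away_pts.get(at, 0) + row['AP']

df1['HT_AP'] = ht_ap
df1['AT_HP'] = at_hp
On the sample above this reproduces the expected values (23/45, 0/31 and 33/63 for the three Oakland/Houston matchups).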
I have a dataframe and want the output to be formatted to save paper for printing.
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8
London 40
France 2 20
France 2 22
France 3
France 3
France 3
USA 10
Is there a way to format the dataframe to look like this:
GameA GameB
Country
London 5 London 20
London 5 London 10
London 3 London 5
London 3 London 6
London London 8
London London 40
GameA GameB
France 2 France 20
France 2 France 22
France 3
France 3
France 3
GameA
USA 10
The formatting is a bit off because of how the text results copied and pasted (due to the missing values), but this should work with your actual data.
countries = df.index.unique()
for country in countries:
    print(df.loc[df.index == country])
    print(' ')
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8 NaN
London 40 NaN
GameA GameB
Country
France 2 20
France 2 22
France 3 NaN
France 3 NaN
France 3 NaN
GameA GameB
Country
USA 10 NaN
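An equivalent sketch using groupby, which avoids the repeated boolean masks and keeps the countries in their original order (assuming Country is the index, as in the question):
for country, block in df.groupby(level=0, sort=False):
    print(block)
    print(' ')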
I have the dataframe:
df = pd.DataFrame({'Continent':['North America','North America','North America','Europe','Europe','Europe','Europe'],
'Country': ['US','Canada','Mexico','France','Germany','Spain','Italy'],
'Status': ['Member','Non-Member','Non-Member','Member','Non-Member','Member','Non-Member'],
'Units': [27,5,4,10,15,8,8]})
print(df)
  Continent      Country  Status      Units
0 North America  US       Member      27
1 North America  Canada   Non-Member  5
2 North America  Mexico   Non-Member  4
3 Europe         France   Member      10
4 Europe         Germany  Non-Member  15
5 Europe         Spain    Member      8
6 Europe         Italy    Non-Member  8
I need to add 2 columns of summary statistics about the Continents: one with the sum of Units for Member countries, and one for Non-Member countries, so that the final output would look like:
  Continent      Member Units  Non-Member Units  Country  Status      Units
0 North America  27            9                 US       Member      27
1 North America  27            9                 Canada   Non-Member  5
2 North America  27            9                 Mexico   Non-Member  4
3 Europe         18            23                France   Member      10
4 Europe         18            23                Germany  Non-Member  15
5 Europe         18            23                Spain    Member      8
6 Europe         18            23                Italy    Non-Member  8
It seems like I need to use groupby but I can't figure out how to take the groupby values and re-insert them into the dataframe as new columns.
summary_stats = df.groupby(['Continent','Status'])['Units'].sum()
print(summary_stats)
Continent      Status
Europe         Member        18
               Non-Member    23
North America  Member        27
               Non-Member     9
Name: Units, dtype: int64
I also tried not using groupby with these:
df['Member Units'] = df['Units'][df['Status'] == 'Member'].sum()
df['Non-Member Units'] = df['Units'][df['Status'] == 'Non-Member'].sum()
But that doesn't differentiate by Continent, so it just adds up all the Members and all the Non-Members.
Any help is greatly appreciated!
I think you first need groupby with transform('sum') to create a new Series all_sum. Then it's better to use numpy.where: if the row is a Member, take the value from the Series, otherwise 0. Similarly for non-members:
import numpy as np

all_sum = df.groupby(['Continent','Status'])['Units'].transform('sum')
print(all_sum)
0 27
1 9
2 9
3 18
4 23
5 18
6 23
dtype: int64
df['Member Units'] = np.where(df['Status'] == 'Member', all_sum, 0)
df['Non-Member Units'] = np.where(df['Status'] != 'Member', all_sum, 0)
print(df)
  Continent      Country  Status      Units  Member Units  Non-Member Units
0 North America  US       Member      27     27            0
1 North America  Canada   Non-Member  5      0             9
2 North America  Mexico   Non-Member  4      0             9
3 Europe         France   Member      10     18            0
4 Europe         Germany  Non-Member  15     0             23
5 Europe         Spain    Member      8      18            0
6 Europe         Italy    Non-Member  8      0             23
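If you want both columns filled on every row, as in the desired output in the question, a sketch building on the same idea is to zero out the other status first and then take a per-Continent transform:
member = df['Units'].where(df['Status'] == 'Member', 0)
non_member = df['Units'].where(df['Status'] != 'Member', 0)
df['Member Units'] = member.groupby(df['Continent']).transform('sum')
df['Non-Member Units'] = non_member.groupby(df['Continent']).transform('sum')
This gives 27/9 on every North America row and 18/23 on every Europe row, matching the table in the question.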
Once you have summary_stats, I think you can do something like:
df['Member Units'] = summary_stats[list(zip(df['Continent'].values, df['Status'].values))].values
You need to zip the raw values because df['Continent'] on its own is a Series carrying its own index; zipping gives plain (Continent, Status) tuples to look up in summary_stats, and the trailing .values avoids index alignment problems when assigning the result back into df.
Since you have summary_stats, you can use merge() after reshaping it:
summary = summary_stats.reset_index().pivot(index='Continent', columns='Status', values='Units')
summary['Continent'] = summary.index
df = df.merge(summary, on='Continent')
Then just rename the columns as you want.
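For example, assuming the pivot produced columns named Member and Non-Member, something like:
# hypothetical rename; adjust to the actual pivoted column names
df = df.rename(columns={'Member': 'Member Units', 'Non-Member': 'Non-Member Units'})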