Given three data frames containing the number of gold, silver, and bronze Olympic medals won by some countries, determine the total number of medals won by each country.
Note: The three data frames do not all contain the same countries. Also, sort the final dataframe by total medal count in descending order.
This is my code below, but I am not getting the desired output. Can someone please suggest what is wrong?
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
#gold.set_index('Country',inplace = True)
#silver.set_index('Country',inplace = True)
#bronze.set_index('Country',inplace = True)
Total = gold.add(silver,fill_value = 0).add(bronze,fill_value = 0)
Total.sort_values('Medals',ascending = True)
Two things go wrong in your code: with 'Country' still a regular column, add aligns rows on the default integer index instead of on country (and tries to add the Country columns too), and your final sort_values uses ascending=True and its result is never assigned. You can try:
pd.concat([gold, silver, bronze]).groupby('Country').sum().\
sort_values('Medals', ascending=False).reset_index()
This concatenates the three dataframes into one, groups the rows by country, and sums the medals for each country. At the end we sort in descending order and reset the index.
Output:
Country Medals
0 USA 72
1 France 53
2 UK 27
3 Russia 25
4 Germany 20
You can also do it this way:
gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)
#print(gold)
#print(silver)
#print(bronze)
Total = gold.add(silver, fill_value=0).add(bronze, fill_value=0).sort_values('Medals', ascending=False)
Output:
Medals
Country
USA 72.0
France 53.0
UK 27.0
Russia 25.0
Germany 20.0
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
# Set the index of the dataframes to 'Country' so that you can get the countrywise
# medal count
gold.set_index('Country', inplace = True)
silver.set_index('Country', inplace = True)
bronze.set_index('Country', inplace = True)
# Add the three dataframes and set the fill_value argument to zero to avoid getting
# NaN values
total = gold.add(silver, fill_value = 0).add(bronze, fill_value = 0)
# Sort the resultant dataframe in a descending order
total = total.sort_values(by = 'Medals', ascending = False).astype("int64")
# Print the sorted dataframe
print(total)
astype("int64") converts the float values back into integers; the 64 refers to a 64-bit integer type.
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)
total = gold.add(silver, fill_value=0).add(bronze, fill_value=0)
total = total.sort_values(by='Medals', ascending=False)
total = total.astype(int)
print(total)
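Running this prints the same totals as before, now as integers:
         Medals
Country
USA          72
France       53
UK           27
Russia       25
Germany      20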
I have a problem. I want to create a world map with the help of folium. I am using value_counts because I want to show, like a heatmap, which countries occur most often.
But unfortunately I do not know how I could get the headings that folium.Choropleth(..., columns=['Country','Total'], ...).add_to(m) expects. How could I generate a map from value_counts?
The point is that value_counts does not give any column headings. Is there any option to get headings such that columns=['Country', 'Total'] works?
Dataframe
id country
0 1 DE
1 2 DE
2 3 CN
3 4 BG
4 3 CN
5 4 BG
6 5 BG
import pandas as pd
d = {'id': [1, 2, 3, 4, 3, 4, 5], 'country': ['DE', 'DE', 'CN', 'BG', 'CN', 'BG', 'BG']}
df = pd.DataFrame(data=d)
# print(df)
count_country = df['country'].value_counts()
[OUT]
BG 3
DE 2
CN 2
import folium
#Creating a base map
m = folium.Map()
folium.Choropleth(
    data=count_country,
    columns=['Country', 'Total'],
    fill_color='PuRd',
    nan_fill_color='white'
).add_to(m)
Check out this tutorial; it is helpful: https://towardsdatascience.com/creating-a-simple-folium-map-covid-19-worldwide-total-case-a0a1429c6e7c
Apparently, there are two key points that you are missing:
1- Setting up the world country data. This is done through a URL that you pass to the folium.Choropleth geo_data parameter.
From the tutorial:
#Setting up the world countries data URL
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
2- In your dataframe you need to have the names of the countries to match the data from the URL so you need to replace the names in your dataframe to these names.
For example in the tutorial they had to change the following names like this (the name of the dataframe with countries and data was called df_covid):
From the tutorial:
#Replacing the country name
df_covid.replace('USA', "United States of America", inplace = True)
df_covid.replace('Tanzania', "United Republic of Tanzania", inplace = True)
df_covid.replace('Democratic Republic of Congo', "Democratic Republic of the Congo", inplace = True)
df_covid.replace('Congo', "Republic of the Congo", inplace = True)
df_covid.replace('Lao', "Laos", inplace = True)
df_covid.replace('Syrian Arab Republic', "Syria", inplace = True)
df_covid.replace('Serbia', "Republic of Serbia", inplace = True)
df_covid.replace('Czechia', "Czech Republic", inplace = True)
df_covid.replace('UAE', "United Arab Emirates", inplace = True)
3- Finally, create your map. Pass the URL to geo_data, the dataframe to data, and the names of the columns that hold the country and the counts to columns.
From the tutorial:
folium.Choropleth(
    geo_data=country_shapes,
    name='choropleth COVID-19',
    data=df_covid,
    columns=['Country', 'Total Case'],
    key_on='feature.properties.name',
    fill_color='PuRd',
    nan_fill_color='white'
).add_to(m)
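Then render the map: in a Jupyter notebook, evaluating m displays it inline, or you can write it to a file with m.save('map.html').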
Edit:
To get a data frame from the counts you could do something like this:
df_input = pd.DataFrame()
df_input['Country'] = count_country.index
df_input['Counts'] = count_country.values
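As a more compact alternative (a sketch using only pandas built-ins): value_counts already returns a Series indexed by country, so naming the index and resetting it produces the two headed columns that Choropleth needs:
# name the index, then turn the Series into a two-column dataframe
df_input = count_country.rename_axis('Country').reset_index(name='Total')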
I have a set of restaurant sales data whose structure comprises two dimensions, a location dimension and a food-type dimension, as well as a fact table that contains some measures. I am having trouble manipulating the table to perform analysis. In the end, it will likely be displayed in Excel.
Here is a toy dataset:
tuples = [('California', 'SF'), ('California', 'SD'),
          ('New York', 'Roch'), ('New York', 'NY'),
          ('Texas', 'Houst'), ('Texas', 'SA')]
measure1 = [5, 10, 30, 60, 10, 30]
measure2 = [50, 10, 30, 6, 1, 30]
tier1 = ['Burger', 'Burger', 'Burger', 'Pizza', 'Pizza', 'Burger']
tier2 = ['Beef', 'Chicken', 'Beef', 'Pep', 'Cheese', 'Beef']
index = pd.MultiIndex.from_tuples(tuples, names=['State', 'City'])
revenue = pd.Series(measure1, index=index)
rev_df = pd.DataFrame({'Category': tier1,
                       'Subcategory': tier2,
                       'Revenue': revenue,
                       'NumOfOrders': [3, 5, 1, 3, 10, 20]})
rev_df
This code produces the combined dataframe.
I want to do two things:
(1) Place Category and Subcategory as MultiIndex column headers, and calculate NumOfOrders-weighted revenue by food subcategory and category, with subtotals and grand totals.
(2) Place the cities dimension on the y-axis and move the Category/Subcategory-by-measure breakdown to the x-axis.
For example-
(1)
Burger Total Burger Pizza....
Beef Chicken
California SF 4 5 9
SD 5 5 10
Total Califor 9 10 19
(2)
California California Total
SF SD
Total Burger WgtRev 9 10 19
Beef WgtRev 4 5 10
Chickn WgtRev 5 5 10
Total Pizza...
To start, my first attempt was to use a pivot_table
pivoted = rev_df.pivot_table(index=['State', 'City'],
                             columns=['Category', 'Subcategory'],
                             aggfunc='sum',    # How to make it a weighted average?
                             margins=True,     # How to subtotal and grand total?
                             margins_name='All',
                             fill_value=0)
KeyError: "['State' 'City'] not in index"
As you can see, I get an error. What is the most pythonic way to manipulate this snowflake-esque data model?
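One possible approach (a sketch, not a definitive answer): the KeyError arises because State and City live in rev_df's index rather than its columns, so call reset_index() first. And since pivot_table has no weighted-average aggfunc, precompute Revenue * NumOfOrders and divide two summed tables:
flat = rev_df.reset_index()  # State/City become regular columns, fixing the KeyError
flat['RevOrders'] = flat['Revenue'] * flat['NumOfOrders']
pivoted = flat.pivot_table(index=['State', 'City'],
                           columns=['Category', 'Subcategory'],
                           values=['RevOrders', 'NumOfOrders'],
                           aggfunc='sum',
                           margins=True, margins_name='All')
# weighted revenue = sum(Revenue * Orders) / sum(Orders), including the 'All' margins
weighted = pivoted['RevOrders'] / pivoted['NumOfOrders']
Note that margins only adds grand totals; per-state subtotals would have to be computed separately and concatenated in, and layout (2) is then essentially a transpose of the same table.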
Using the Plotly go.Table() function and Pandas, I'm attempting to create a table to summarize some data. My data is as follows:
import pandas as pd
test_df = pd.DataFrame({'Manufacturer': ['BMW', 'Chrysler', 'Chrysler', 'Chrysler', 'Brokertec', 'DWAS', 'Ford', 'Buick'],
                        'Metric': ['Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator'],
                        'Dimension': ['Short', 'Short', 'Short', 'Long', 'Short', 'Short', 'Long', 'Long'],
                        'User': ['USA', 'USA', 'USA', 'USA', 'USA', 'New USA', 'USA', 'Los USA'],
                        'Value': [50, 3, 3, 2, 5, 7, 10, 5]
                        })
My desired output is as follows (summing Value by Manufacturer for each Dimension):
Manufacturer Short Long
Chrysler 6 2
Buick 5 5
Mercedes 7 0
Ford 0 10
I need to shape the Pandas data frame a bit (and this is where I'm running into trouble). My code was as follows:
table_columns = ['Manufacturer', 'Longs', 'Shorts']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (test_df[test_df['Manufacturer'].isin(manufacturers)]
          .set_index(['Manufacturer', 'Dimension'])
          ['Value'].unstack()
          .reset_index()[table_columns]
          )
Then, create the table using the Plotly go.Table() function:
import plotly.graph_objects as go
direction_table = go.Figure(go.Table(
    header=dict(
        values=table_columns,
        font=dict(size=12),
        line_color='darkslategray',
        fill_color='lightskyblue',
        align='center'
    ),
    cells=dict(
        values=df_new.T,  # transpose so each dataframe column becomes a column of cells
        line_color='darkslategray',
        fill_color='lightcyan',
        align='center')
))
direction_table
The error I'm seeing is:
ValueError: Index contains duplicate entries, cannot reshape
What is the best way to work around this?
Thanks in advance!
You need to use pivot_table with aggfunc='sum' instead of set_index + unstack:
table_columns = ['Manufacturer', 'Long', 'Short']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (test_df[test_df['Manufacturer'].isin(manufacturers)]
          .pivot_table(index='Manufacturer', columns='Dimension',
                       values='Value', aggfunc='sum', fill_value=0)
          .reset_index()
          .rename_axis(columns=None)[table_columns]
          )
print (df_new)
Manufacturer Long Short
0 Buick 5 0
1 Chrysler 2 6
2 Ford 10 0
Note this is not the same as your expected output, but I don't think your input can produce that output (for example, there is no Mercedes row in test_df).
Or you can get the same result with groupby.sum and unstack:
(test_df[test_df['Manufacturer'].isin(manufacturers)]
.groupby(['Manufacturer', 'Dimension'])
['Value'].sum()
.unstack(fill_value=0)
.reset_index()
.rename_axis(columns=None)[table_columns]
)
I would like to filter a dataframe. The resulting dataframe should contain all the rows where any of a number of columns contains any of the words in a list.
I started to use for loops, but there should be a better pythonic/pandonic way.
Example:
# importing pandas
import pandas as pd
# Creating the dataframe with dict of lists
df = pd.DataFrame({'Name': ['Geeks', 'Peter', 'James', 'Jack', 'Lisa'],
                   'Team': ['Boston', 'Boston', 'Boston', 'Chele', 'Barse'],
                   'Position': ['PG', 'PG', 'UG', 'PG', 'UG'],
                   'Number': [3, 4, 7, 11, 5],
                   'Age': [33, 25, 34, 35, 28],
                   'Height': ['6-2', '6-4', '5-9', '6-1', '5-8'],
                   'Weight': [89, 79, 113, 78, 84],
                   'College': ['MIT', 'MIT', 'MIT', 'Stanford', 'Stanford'],
                   'Salary': [99999, 99994, 89999, 78889, 87779]},
                  index=['ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
df1 = df[df['Team'].str.contains("Boston") | df['College'].str.contains('MIT')]
print(df1)
So it is clear how to filter individual columns for rows that contain a particular word.
It is also clear how to filter rows where a single column contains any of the strings in a list:
df[df.Name.str.contains('|'.join(search_values ))]
Where search_values contains a list of words or strings.
search_values = ['boston','mike','whatever']
I am looking for a short way to code
#pseudocode
give me a subframe of df where any of the columns 'Name','Position','Team' contains any of the words in search_values
I know I can do
df[df['Name'].str.contains('|'.join(search_values)) | df['Position'].str.contains('|'.join(search_values)) | df['Team'].str.contains('|'.join(search_values))]
but if I had, say, 20 columns, that would be a mess of a line of code.
Any suggestions?
EDIT Bonus: When looking in a list of columns, i.e. 'Name', 'Position', 'Team', how can I also include the index? Passing ['index', 'Name', 'Position', 'Team'] does not work.
Thanks.
I had a look at these:
https://www.geeksforgeeks.org/get-all-rows-in-a-pandas-dataframe-containing-given-substring/
https://kanoki.org/2019/03/27/pandas-select-rows-by-condition-and-string-operations/
Filter out rows based on list of strings in Pandas
You can also stack with any on level=0 (note that any(level=0) was removed in pandas 2.0, where .groupby(level=0).any() does the same):
cols_list = ['Name','Team'] # add column names
df[df[cols_list].stack().str.contains('|'.join(search_values), case=False, na=False)
   .any(level=0)]
Name Team Position Number Age Height Weight College Salary
ind1 Geeks Boston PG 3 33 6-2 89 MIT 99999
ind2 Peter Boston PG 4 25 6-4 79 MIT 99994
ind3 James Boston UG 7 34 5-9 113 MIT 89999
Or do apply with any; wrapping the resulting mask in df[...] gives the filtered frame:
df[df[[c1, c2, ...]].apply(lambda x: x.str.contains('|'.join(search_values)), axis=1).any(axis=1)]
You can simply apply in this case:
cols_to_filter = ['Name', 'Position', 'Team']
search_values = ['word1', 'word2']
patt = '|'.join(search_values)
mask = df[cols_to_filter].apply(lambda x: x.str.contains(patt)).any(axis=1)
df[mask]
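For the EDIT bonus: the index is not a column, so its name will never match anything in cols_to_filter. A sketch of one workaround (my own, assuming the index holds strings): build a second mask from df.index and combine the two:
patt = '|'.join(search_values)
col_mask = df[['Name', 'Position', 'Team']].apply(lambda s: s.str.contains(patt, case=False)).any(axis=1)
idx_mask = df.index.str.contains(patt, case=False)  # Index.str works because the index holds strings
df[col_mask | idx_mask]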
I have this dataframe from which I wish to calculate the mean and sum of the column 'Score'. I do not want to use the .groupby().agg() method.
df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
    'Score': [8, 4, 35, 50],
    'Continent': ['Europe', 'Europe', 'North America', 'North America']},
    columns=['Country', 'Score', 'Continent'])
print (df)
Dataframe becomes:
Country Score Continent
0 Germany 8 Europe
1 Germany 4 Europe
2 Canada 35 North America
3 Canada 50 North America
The easiest method I have found is:
new_df = df.groupby('Continent')['Score'].agg({'sum': np.sum, 'avg': np.average})
               sum   avg
Continent
Europe          12   6.0
North America   85  42.5
I now have two series, the total and the average. How do I make them into a new dataframe using the index from .groupby('Continent')?
I'm trying to use the (group, frame) loop approach here:
for group, frame in df.groupby('Continent'):
    avg = np.average(frame['Score'])
    total = np.sum(frame['Score'])
    df['avg'] = avg
    df['sum'] = total
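The problem with this loop is that df['avg'] = avg assigns the same scalar to every row of df, and each iteration overwrites the previous group's values. A sketch of a fix (my own suggestion, not using .agg): collect one summary row per group and build the frame afterwards, which also picks up the group labels as the index:
rows = {}
for group, frame in df.groupby('Continent'):
    # one summary row per group instead of overwriting whole columns of df
    rows[group] = {'sum': np.sum(frame['Score']), 'avg': np.average(frame['Score'])}
new_df = pd.DataFrame.from_dict(rows, orient='index').rename_axis('Continent')
print(new_df)
This gives the same sum/avg table as the agg call above, with Continent as the index.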