Reshaping a Pandas data frame with duplicate values - python

Using the Plotly go.Table() function and Pandas, I'm attempting to create a table to summarize some data. My data is as follows:
import pandas as pd
test_df = pd.DataFrame({'Manufacturer': ['BMW', 'Chrysler', 'Chrysler', 'Chrysler', 'Brokertec', 'DWAS', 'Ford', 'Buick'],
                        'Metric': ['Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator'],
                        'Dimension': ['Short', 'Short', 'Short', 'Long', 'Short', 'Short', 'Long', 'Long'],
                        'User': ['USA', 'USA', 'USA', 'USA', 'USA', 'New USA', 'USA', 'Los USA'],
                        'Value': [50, 3, 3, 2, 5, 7, 10, 5]
                        })
My desired output is as follows (summing Value by Manufacturer for each Dimension):
Manufacturer  Short  Long
Chrysler          6     2
Buick             5     5
Mercedes          7     0
Ford              0    10
I need to shape the Pandas data frame a bit (and this is where I'm running into trouble). My code was as follows:
table_columns = ['Manufacturer', 'Longs', 'Shorts']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (df[df['Manufacturer'].isin(manufacturers)]
          .set_index(['Manufacturer', 'Dimension'])
          ['Value'].unstack()
          .reset_index()[table_columns]
          )
Then, create the table using the Plotly go.Table() function:
import plotly.graph_objects as go
direction_table = go.Figure(go.Table(
    header=dict(
        values=table_columns,
        font=dict(size=12),
        line_color='darkslategray',
        fill_color='lightskyblue',
        align='center'
    ),
    cells=dict(
        values=df_new.T,  # using Transpose here
        line_color='darkslategray',
        fill_color='lightcyan',
        align='center')
    )
)
direction_table
The error I'm seeing is:
ValueError: Index contains duplicate entries, cannot reshape
What is the best way to work around this?
Thanks in advance!

You need to use pivot_table with aggfunc='sum' instead of set_index + unstack; unstack cannot reshape an index with duplicate (Manufacturer, Dimension) pairs, which is exactly what the repeated Chrysler rows create.
table_columns = ['Manufacturer', 'Long', 'Short']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (test_df[test_df['Manufacturer'].isin(manufacturers)]
          .pivot_table(index='Manufacturer', columns='Dimension',
                       values='Value', aggfunc='sum', fill_value=0)
          .reset_index()
          .rename_axis(columns=None)[table_columns]
          )
print(df_new)
  Manufacturer  Long  Short
0        Buick     5      0
1     Chrysler     2      6
2         Ford    10      0
Note this is not the same as your expected output, but I don't think your input data can produce that output - for example, Mercedes never appears in the sample data.
Or the same result with groupby.sum and unstack
(test_df[test_df['Manufacturer'].isin(manufacturers)]
 .groupby(['Manufacturer', 'Dimension'])
 ['Value'].sum()
 .unstack(fill_value=0)
 .reset_index()
 .rename_axis(columns=None)[table_columns]
 )
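As a quick, self-contained check (trimming the sample data down to just the relevant columns), both approaches aggregate the duplicate Chrysler rows the same way:

```python
import pandas as pd

# Trimmed-down version of the sample data: duplicate (Chrysler, Short) rows
test_df = pd.DataFrame({
    'Manufacturer': ['Chrysler', 'Chrysler', 'Chrysler', 'Ford', 'Buick'],
    'Dimension': ['Short', 'Short', 'Long', 'Long', 'Long'],
    'Value': [3, 3, 2, 10, 5],
})

# pivot_table sums the duplicates instead of raising ValueError
pivoted = (test_df.pivot_table(index='Manufacturer', columns='Dimension',
                               values='Value', aggfunc='sum', fill_value=0)
           .reset_index().rename_axis(columns=None))

# groupby.sum + unstack builds the same frame
grouped = (test_df.groupby(['Manufacturer', 'Dimension'])['Value'].sum()
           .unstack(fill_value=0)
           .reset_index().rename_axis(columns=None))
```

Either frame can then be fed into go.Table as in the question.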

Related

Replace certain values of one column with different values from a different df, pandas

I have a df,
for example -
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'age': [21, 23, 24, 28],
                   'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
                   'knowledge': ['python', 'medical', 'sql', 'c++'],
                   })
and another df -
df2 = pd.DataFrame({'occupation': ['data scientist', 'data analyst'],
                    'knowledge': ['5', '4'],
                    })
I want to replace the knowledge values of the first df with the knowledge values of the second, but only for the rows where the occupation matches, making the first df look like this:
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'age': [21, 23, 24, 28],
                   'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
                   'knowledge': ['5', 'medical', '4', 'c++'],
                   })
I tried to do stuff with replace, but it didn't work...
You may try this:
occ_know_dict = df2.set_index('occupation').to_dict()['knowledge']
df['knowledge'] = df[['knowledge', 'occupation']].apply(
    lambda row: occ_know_dict[row['occupation']] if row['occupation'] in occ_know_dict else row['knowledge'],
    axis=1)
You can map the knowledge column of df2 onto the occupation column it shares with df, then update the values in df.
df['knowledge'].update(df['occupation'].map(df2.set_index('occupation')['knowledge']))
Note that update happens inplace.
print(df)
    name  age      occupation knowledge
0  name1   21  data scientist         5
1  name2   23          doctor   medical
2  name3   24    data analyst         4
3  name4   28        engineer       c++
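If you would rather avoid the in-place update, the same map can be combined with fillna, which keeps the original value wherever df2 has no matching occupation (a sketch using the sample frames from the question):

```python
import pandas as pd

df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'age': [21, 23, 24, 28],
                   'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
                   'knowledge': ['python', 'medical', 'sql', 'c++']})
df2 = pd.DataFrame({'occupation': ['data scientist', 'data analyst'],
                    'knowledge': ['5', '4']})

# Map occupation -> replacement knowledge; rows with no match become NaN,
# then fillna restores the original knowledge for those rows.
replacement = df['occupation'].map(df2.set_index('occupation')['knowledge'])
df['knowledge'] = replacement.fillna(df['knowledge'])
```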

Generate a world map with folium and value_counts()

I have a problem. I want to create a world map with folium, using value_counts because I want to show, like a heatmap, which countries occur most often.
But unfortunately I do not know how to get the 'headings' that folium.Choropleth(...,columns=['Country','Total'],...).add_to(m) expects. How can I generate a map from value_counts?
The point is that value_counts does not produce any headings. Is there an option to get the headings columns=['Country', 'Total']?
Dataframe
   id country
0   1      DE
1   2      DE
2   3      CN
3   4      BG
4   3      CN
5   4      BG
6   5      BG
import pandas as pd
d = {'id': [1, 2, 3, 4, 3, 4, 5], 'country': ['DE', 'DE', 'CN', 'BG', 'CN', 'BG', 'BG']}
df = pd.DataFrame(data=d)
# print(df)
count_country = df['country'].value_counts()
[OUT]
BG    3
DE    2
CN    2
import folium
# Creating a base map
m = folium.Map()
folium.Choropleth(
    data=count_country,
    columns=['Country', 'Total'],
    fill_color='PuRd',
    nan_fill_color='white'
).add_to(m)
Check out this tutorial; it is helpful: https://towardsdatascience.com/creating-a-simple-folium-map-covid-19-worldwide-total-case-a0a1429c6e7c
Apparently, there are a few key points that you are missing:
1- Setting up the world country data. This is done through a URL that you pass to the folium.Choropleth geo_data parameter.
From the tutorial:
# Setting up the world countries data URL
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
2- In your dataframe you need the country names to match the data from the URL, so you need to replace the names in your dataframe with those names.
For example, in the tutorial they had to change the following names (the dataframe with countries and data was called df_covid):
From the tutorial:
# Replacing the country name
df_covid.replace('USA', "United States of America", inplace=True)
df_covid.replace('Tanzania', "United Republic of Tanzania", inplace=True)
df_covid.replace('Democratic Republic of Congo', "Democratic Republic of the Congo", inplace=True)
df_covid.replace('Congo', "Republic of the Congo", inplace=True)
df_covid.replace('Lao', "Laos", inplace=True)
df_covid.replace('Syrian Arab Republic', "Syria", inplace=True)
df_covid.replace('Serbia', "Republic of Serbia", inplace=True)
df_covid.replace('Czechia', "Czech Republic", inplace=True)
df_covid.replace('UAE', "United Arab Emirates", inplace=True)
3- Finally, create your map. Pass the URL to geo_data, the dataframe to data, and the column names that hold the country and the counts to columns.
From the tutorial:
folium.Choropleth(
    geo_data=country_shapes,
    name='choropleth COVID-19',
    data=df_covid,
    columns=['Country', 'Total Case'],
    key_on='feature.properties.name',
    fill_color='PuRd',
    nan_fill_color='white'
).add_to(m)
Edit:
To get a data frame from the counts you could do something like this:
import numpy as np
df_input = pd.DataFrame()
df_input['Country'] = count_country.index
df_input['Counts'] = np.array(count_country)
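As a more direct alternative (a sketch using the sample data above), rename_axis plus reset_index turns the value_counts Series into a two-column frame with exactly the headings that Choropleth's columns argument expects:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 3, 4, 5],
     'country': ['DE', 'DE', 'CN', 'BG', 'CN', 'BG', 'BG']}
df = pd.DataFrame(data=d)

# Name the index 'Country' and the counts column 'Total' in one chain
df_input = (df['country'].value_counts()
            .rename_axis('Country')
            .reset_index(name='Total'))
```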

Using the add() function to merge multiple dataframes in Pandas

Given three data frames containing the number of gold, silver, and bronze Olympic medals won by some countries, determine the total number of medals won by each country.
Note: The three data frames don't all have the same countries. Also, sort the final dataframe by total medal count in descending order.
This is my code below, but I am not getting the desired output. Can someone please suggest what is wrong?
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]}
                    )
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]}
                      )
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]}
                      )
# gold.set_index('Country', inplace=True)
# silver.set_index('Country', inplace=True)
# bronze.set_index('Country', inplace=True)
Total = gold.add(silver, fill_value=0).add(bronze, fill_value=0)
Total.sort_values('Medals', ascending=True)
You can try:
pd.concat([gold, silver, bronze]).groupby('Country').sum().\
    sort_values('Medals', ascending=False).reset_index()
This puts the three dataframes into one, groups by country, and takes the sum of medals for each. At the end we sort in descending order and reset the index.
Output:
   Country  Medals
0      USA      72
1   France      53
2       UK      27
3   Russia      25
4  Germany      20
You can also do it the following way:
gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)
# print(gold)
# print(silver)
# print(bronze)
Total = gold.add(silver, fill_value=0).add(bronze, fill_value=0).sort_values('Medals', ascending=False)
Output:
         Medals
Country
USA        72.0
France     53.0
UK         27.0
Russia     25.0
Germany    20.0
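The set_index calls are the crux here. Without them, add() aligns the frames on their default 0..n row labels, so medals from different countries get mixed and the Country strings are even concatenated element-wise. A minimal sketch of the failure mode, using two hypothetical two-row frames:

```python
import pandas as pd

gold = pd.DataFrame({'Country': ['USA', 'France'], 'Medals': [15, 13]})
silver = pd.DataFrame({'Country': ['USA', 'Germany'], 'Medals': [29, 20]})

# Aligned on row position: row 1 adds France's and Germany's medals,
# and the string column is concatenated element-wise.
bad = gold + silver
# bad['Country'] is ['USAUSA', 'FranceGermany']

# Aligned on country name: each country keeps its own total.
good = gold.set_index('Country').add(silver.set_index('Country'), fill_value=0)
```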
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]}
                    )
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]}
                      )
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]}
                      )
# Set the index of the dataframes to 'Country' so that you can get the countrywise
# medal count
gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)
# Add the three dataframes and set the fill_value argument to zero to avoid getting
# NaN values
total = gold.add(silver, fill_value=0).add(bronze, fill_value=0)
# Sort the resultant dataframe in a descending order
total = total.sort_values(by='Medals', ascending=False).astype("int64")
# Print the sorted dataframe
print(total)
astype("int64") converts the float values (introduced by fill_value) back to integers; the 64 refers to the 64-bit integer dtype.
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]}
                    )
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]}
                      )
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]}
                      )
gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)
total = gold.add(silver, fill_value=0).add(bronze, fill_value=0)
total = total.sort_values(by='Medals', ascending=False)
total = total.astype(int)
print(total)

Manipulating a Hierarchical Dataset with Pandas and Python

I have a set of restaurant sales data whose structure comprises two dimensions, a location dimension and a food-type dimension, as well as a fact table that contains some measures. I am having trouble manipulating the table to perform analysis. In the end, it will likely be displayed in Excel.
Here is a toy dataset:
import pandas as pd

tuples = [('California', 'SF'), ('California', 'SD'),
          ('New York', 'Roch'), ('New York', 'NY'),
          ('Texas', 'Houst'), ('Texas', 'SA')]
measure1 = [5, 10,
            30, 60,
            10, 30]
measure2 = [50, 10,
            30, 6,
            1, 30]
tier1 = ['Burger',
         'Burger',
         'Burger',
         'Pizza',
         'Pizza',
         'Burger']
tier2 = ['Beef',
         'Chicken',
         'Beef',
         'Pep',
         'Cheese',
         'Beef']
index = pd.MultiIndex.from_tuples(tuples, names=['State', 'City'])
revenue = pd.Series(measure1, index=index)
revenue = revenue.reindex(index)
rev_df = pd.DataFrame({'Category': tier1,
                       'Subcategory': tier2,
                       'Revenue': revenue,
                       'NumOfOrders': [3, 5, 1, 3, 10, 20]})
rev_df
This code produces a dataframe with a (State, City) MultiIndex and the columns Category, Subcategory, Revenue, and NumOfOrders.
I want to do two things:
Place the Category and Subcategory as MultiIndex column headers and calculate NumOrder weighted revenue by food subcategory and category with subtotals and grand totals
Place the cities dimension on the Y-axis and move the Category and Subcategory by measure to the x-axis.
For example-
(1)
Burger Total Burger Pizza....
Beef Chicken
California SF 4 5 9
SD 5 5 10
Total Califor 9 10 19
(2)
California California Total
SF SD
Total Burger WgtRev 9 10 19
Beef WgtRev 4 5 10
Chickn WgtRev 5 5 10
Total Pizza...
To start, my first attempt was to use a pivot_table
pivoted = rev_df.pivot_table(index = ['State','City'],
columns = ['Category','Subcategory'],
aggfunc = 'sum', #How to make it a weighted average?
margins = True, #How to subtotal and grandtotal?
margins_name = 'All',
fill_value = 0)
KeyError: "['State' 'City'] not in index"
As you can see, I get an error. What is the most Pythonic way to manipulate this snowflake-esque data model?
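No answer is recorded for this one, but the immediate KeyError happens because State and City live in the row index rather than in columns, so pivot_table cannot find them. A sketch of one possible fix - reset_index before pivoting, shown here for Revenue only - which still leaves the weighted-average and layout questions open:

```python
import pandas as pd

tuples = [('California', 'SF'), ('California', 'SD'),
          ('New York', 'Roch'), ('New York', 'NY'),
          ('Texas', 'Houst'), ('Texas', 'SA')]
index = pd.MultiIndex.from_tuples(tuples, names=['State', 'City'])
rev_df = pd.DataFrame({'Category': ['Burger', 'Burger', 'Burger', 'Pizza', 'Pizza', 'Burger'],
                       'Subcategory': ['Beef', 'Chicken', 'Beef', 'Pep', 'Cheese', 'Beef'],
                       'Revenue': [5, 10, 30, 60, 10, 30],
                       'NumOfOrders': [3, 5, 1, 3, 10, 20]},
                      index=index)

# Move State and City out of the index so pivot_table can see them
pivoted = (rev_df.reset_index()
           .pivot_table(index=['State', 'City'],
                        columns=['Category', 'Subcategory'],
                        values='Revenue',
                        aggfunc='sum',
                        margins=True, margins_name='All',
                        fill_value=0))
```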

how to fix 'size', 'occurred at index City' error

I am trying to follow the example in "Use Python & Pandas to replace NaN in the 'size' column with a specific value, depending on the City". In the example below I am trying to assign a value of 18 if the City is St. Louis.
I have used a lambda function to do it, since the original dataframe has many rows with repeated City names and only a few of them have NaN values.
When I run the code I am getting an error - KeyError: ('size', 'occurred at index City')
Below is a snippet of the code -
raw_data = {'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
            'size': [24, 36, 'NaN', 'NaN', 22],
            'Type': ['Pie', 'Hallo', 'Zombi', 'Dru', 'Zoro']
            }
df = pd.DataFrame(raw_data)
df
df['size'] = df.apply(lambda x : x['size'].fillna(value = 18 if x['City' == 'St Louis'] else x['size'], axis = 1, inplace = True))
df
Expected - 18 to be populated in size column for St. Louis
Actual - KeyError: ('size', 'occurred at index City')
If all you're trying to do is set the size of St. Louis, you can run:
df.loc[df['City'] == 'St Louis', 'size'] = 18
However, if you instead want to set all values of NaN to 18, you could likewise run:
df.loc[df['size'] == 'NaN', 'size'] = 18
And if you'd just like to set the size of the St. Louis entries where the size is NaN, you could do:
df.loc[(df['City'] == 'St Louis') & (df['size'] == 'NaN'), 'size'] = 18
(note the & and the parentheses - the Python and keyword raises an error on Series).
There is a simple solution with the fillna method:
df['size'] = df['size'].fillna(18)
EDITED
What I failed to notice is that you populate the cells with the string 'NaN', not with real NaN values.
If you change your input data to use real NaN values:
import numpy as np
raw_data = {'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
            'size': [24, 36, np.nan, np.nan, 22],
            'Type': ['Pie', 'Hallo', 'Zombi', 'Dru', 'Zoro']
            }
then the following will let you re-populate the size column's cells by city name:
df = pd.DataFrame(raw_data)
df[['City', 'size']] = df.set_index('City')['size'].fillna({'St Louis': 18, 'SFO': 20}).reset_index()
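An equivalent sketch that avoids the set_index round trip: map each city to a default size (the 18 and 20 defaults follow the edit above, with real NaN values in the data), then fill only the missing sizes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
                   'size': [24, 36, np.nan, np.nan, 22],
                   'Type': ['Pie', 'Hallo', 'Zombi', 'Dru', 'Zoro']})

# Build a per-row default from the city name, then fill only the NaN sizes;
# cities without a default (Dallas, Chicago) are left untouched.
defaults = df['City'].map({'St Louis': 18, 'SFO': 20})
df['size'] = df['size'].fillna(defaults)
```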