Creating a multiindex heatmap from two dicts and a pandas dataframe - python

I have two dicts whose keys are the same, but not necessarily in the same order.
DictA = {"Asia": ["Japan", "China", "Laos"], "Europe": ["England", "Sweden"]}
DictB = {"Europe": ["Denmark", "Hungary", "Spain", "Moldova"], "Asia": ["Mongolia", "Thailand"]}
These keys and values correspond to the columns and rows of a pandas dataframe whose values need to be turned into a heatmap.
Df =
Country Asia Europe
Japan 3 1
Sweden 2 2
England 1 4
China 5 9
Laos 1 9
Denmark 3 1
Mongolia 1 7
Thailand 7 4
Hungary 7 3
Spain 2 9
Moldova 1 5
What I need to figure out is how to use these dicts to cross-reference the pandas dataframe and build a heatmap colored by the values. The country names should sit on one axis of the heatmap (which axis is not important), and beyond them, further from the heatmap, should be the name of the continent each country belongs to. So if the country names are on the left side of the heatmap, the continent names should be even further to the left, showing which group each country belongs to.
I haven't got the slightest clue how to do this. Any help is greatly appreciated!

So, you have to build a matrix with the data and pass it to the DataFrame constructor.
To build the matrix, first collect all continents and countries, saving each into a list; the lists provide both the labels and each row/column's index in the matrix.
Then build a matrix of size countries x continents filled with zeros and, for each dict, count the occurrences of countries in continents.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

DictA = {"Asia": ["Japan", "China", "Laos"], "Europe": ["England", "Sweden"]}
DictB = {"Europe": ["Denmark", "Hungary", "Spain", "Moldova"], "Asia": ["Mongolia", "Thailand"]}

def createDataFrame(dicts):
    continents = []
    countries = []
    for aDict in dicts:
        for continent in aDict:
            if continent not in continents:
                continents.append(continent)
            for country in aDict[continent]:
                if country not in countries:
                    countries.append(country)
    continentsLength = len(continents)
    countriesLength = len(countries)
    # countries rows x continents cols
    matrix = np.zeros((countriesLength, continentsLength))
    for aDict in dicts:
        for continent in aDict:
            continentIndex = continents.index(continent)
            for country in aDict[continent]:
                countryIndex = countries.index(country)
                matrix[countryIndex][continentIndex] += 1
    return pd.DataFrame(matrix, columns=continents, index=countries)

df = createDataFrame([DictA, DictB])
plt.imshow(df, cmap="YlGnBu")
plt.colorbar()
plt.xticks(range(len(df.columns)), df.columns, rotation=20)
plt.yticks(range(len(df.index)), df.index)
plt.show()
And you have the output:
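If you also want the continents displayed next to the country names (the outer grouping the question asks about), one lightweight workaround (a sketch of my own suggestion, not a true MultiIndex axis) is to fold each country's continent into its tick label:

```python
DictA = {"Asia": ["Japan", "China", "Laos"], "Europe": ["England", "Sweden"]}
DictB = {"Europe": ["Denmark", "Hungary", "Spain", "Moldova"], "Asia": ["Mongolia", "Thailand"]}

# Map every country to its continent across both dicts
country_to_continent = {country: continent
                        for d in (DictA, DictB)
                        for continent, countries in d.items()
                        for country in countries}

# Tick labels like "Japan (Asia)", usable with plt.yticks(...) instead of df.index
labels = ["{} ({})".format(c, cont) for c, cont in country_to_continent.items()]
```

Passing these labels to plt.yticks keeps each country's continent visible without any extra axis machinery.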

Related

Linear interpolation

I am working with a dataset containing columns for GDP and GDP per capita for a number of countries. These columns contain missing values. Due to the nature of the data I was hoping to play around with linear interpolation in order to fill in the missing values without losing the general shape of the data.
My code looks as follows:
grouped_df = df.groupby("Country")
# Iterate over the groups
for country, group in grouped_df:
    # Select the rows that contain missing values
    missing_values = group[group["GDP percapita"].isnull()]
    if not missing_values.empty:
        # Interpolate to fill the missing values
        filled_values = missing_values["GDP percapita"].interpolate(method="linear")
        # Update the original dataframe
        df.update(filled_values)
When I run this, however, the missing values are still present in my dataset, and I can't find the issue with my code.
Replace
    filled_values = missing_values["GDP percapita"].interpolate(method="linear")
with
    filled_values = group["GDP percapita"].interpolate(method="linear")
Interpolating only the rows that are already missing gives the interpolation nothing to work from; it needs the whole group, valid values included.
Here is a little working sample:
d = {"country": ["Brazil", "Brazil", "India", "India"],
     "cities": ["Brasilia", "Rio", "New Dehli", "Bombay"],
     "population": [200.4, None, 100.10, None]}
p = pd.DataFrame(d)
print(p)  # should give you
  country     cities  population
0  Brazil   Brasilia       200.4
1  Brazil        Rio         NaN
2   India  New Dehli       100.1
3   India     Bombay         NaN
# to interpolate the entire df
p.interpolate(method='linear')
  country     cities  population
0  Brazil   Brasilia      200.40
1  Brazil        Rio      150.25
2   India  New Dehli      100.10
3   India     Bombay      100.10
# to interpolate group-wise
grouped_df = p.groupby("country")
# Iterate over the groups
for country, group in grouped_df:
    filled_values = group["population"].interpolate(method="linear")
    p.update(filled_values)
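A more compact alternative to the explicit loop (a sketch using groupby().transform, not part of the original answer) interpolates each group in one shot:

```python
import pandas as pd

p = pd.DataFrame({"country": ["Brazil", "Brazil", "India", "India"],
                  "cities": ["Brasilia", "Rio", "New Dehli", "Bombay"],
                  "population": [200.4, None, 100.10, None]})

# Interpolate within each country group; a trailing NaN is filled
# from the group's last valid value rather than from another country
p["population"] = p.groupby("country")["population"].transform(
    lambda s: s.interpolate(method="linear"))
```

This keeps interpolation inside each group, so Brazil's NaN never borrows from India's values the way a whole-frame interpolate does.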

Categorize column according to lists and aggregate with result

Let's say I have a dataframe as follows:
d = {'name': ['spain', 'greece','belgium','germany','italy'], 'davalue': [3, 4, 6, 9, 3]}
df = pd.DataFrame(data=d)
index name davalue
0 spain 3
1 greece 4
2 belgium 6
3 germany 9
4 italy 3
I would like to aggregate and sum based on a list of strings in the name column. So for example, I may have: southern=['spain', 'greece', 'italy'] and northern=['belgium','germany'].
My goal is to aggregate by using sum, and obtain:
index name davalue
0 southern 10
1 northern 15
where 10=3+4+3 and 15=6+9
I imagined something like:
df.groupby(by=[['spain','greece','italy'],['belgium','germany']])
could exist. The docs say
A label or list of labels may be passed to group by the columns in self
but I'm not sure I understand what that means in terms of syntax.
I would build a dictionary and map:
d = {v:'southern' for v in southern}
d.update({v:'northern' for v in northern})
df['davalue'].groupby(df['name'].map(d)).sum()
Output:
name
northern 15
southern 10
Name: davalue, dtype: int64
One way could be using np.select and using the result as a grouper:
import numpy as np

southern = ['spain', 'greece', 'italy']
northern = ['belgium', 'germany']

g = np.select([df.name.isin(southern), df.name.isin(northern)],
              ['southern', 'northern'],
              'others')
df.groupby(g).sum()
davalue
northern 15
southern 10
df["regional_group"] = df.apply(lambda x: "north" if x["name"] in ['belgium', 'germany'] else "south", axis=1)
You create a new column by which you later groupby:
df.groupby("regional_group")["davalue"].sum()

How to create a column from another df according to matching columns?

I have a df named population with a column named countries. I want to merge rows so they reflect regions (africa, west hem, asia, europe, mideast). I have another df named regionref, from Kaggle, that has all countries of the world and the region each is associated with.
How do I create a new column in the population df that has the corresponding region for each country in the country column, using the region column from the Kaggle dataset?
so essentially this is the population dataframe
CountryName 1960 1950 ...
US
Zambia
India
And this is the regionref dataset
Country Region GDP...
US West Hem
Zambia Africa
India Asia
And I want the population df to look like
CountryName Region 1960 1950 ...
US West Hem
Zambia Africa
India Asia
EDIT: I tried the concatenation, but for some reason the two columns do not recognize the same values:
population['Country Name'].isin(regionref['Country']).value_counts()
This returned False for all values, as if there are no values in common.
And this is the output; as you can see, there are values in common:
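One frequent cause of isin() finding no matches even though the values look identical (not confirmed from the screenshot, just a common culprit) is stray whitespace or case differences; normalizing both columns before comparing can reveal it. The sample data here is hypothetical:

```python
import pandas as pd

# Hypothetical frames illustrating a whitespace/case mismatch
population = pd.DataFrame({"Country Name": ["US ", "zambia", "India"]})
regionref = pd.DataFrame({"Country": ["US", "Zambia", "India"]})

def normalize(s):
    # Strip surrounding whitespace and fold case before comparing
    return s.str.strip().str.casefold()

mask = normalize(population["Country Name"]).isin(normalize(regionref["Country"]))
```

If the normalized comparison matches while the raw one does not, cleaning the key columns the same way before merging fixes the join.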
You just need join functionality, or, in pandas terms, concatenation.
Given two DataFrames pop, region:
pop = pd.DataFrame([['US', 1000, 2000], ['CN', 2000, 3000]], columns=['CountryName', 1950, 1960])
CountryName 1950 1960
0 US 1000 2000
1 CN 2000 3000
region = pd.DataFrame([['US', 'AMER', '5'], ['CN', 'ASIA', '4']], columns = ['Country', 'Region', 'GDP'])
Country Region GDP
0 US AMER 5
1 CN ASIA 4
You can do:
pd.concat([region.set_index('Country'), pop.set_index('CountryName')], axis=1) \
  .drop('GDP', axis=1)
Region 1950 1960
US AMER 1000 2000
CN ASIA 2000 3000
axis=1 concatenates horizontally. You have to set the country column as the index on each frame so they are joined on it correctly.
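An alternative that avoids setting indexes (a sketch using DataFrame.merge, with the same example frames as above) joins on the two country columns directly:

```python
import pandas as pd

pop = pd.DataFrame([['US', 1000, 2000], ['CN', 2000, 3000]],
                   columns=['CountryName', 1950, 1960])
region = pd.DataFrame([['US', 'AMER', '5'], ['CN', 'ASIA', '4']],
                      columns=['Country', 'Region', 'GDP'])

# Left join keeps every population row, even countries missing from regionref
merged = pop.merge(region[['Country', 'Region']],
                   left_on='CountryName', right_on='Country',
                   how='left').drop(columns='Country')
```

how='left' means a country absent from regionref gets NaN in Region instead of being dropped, which is usually what you want when annotating an existing table.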

Create a two-way table from dictionary of combinations

I'm writing a simple code to have a two-way table of distances between various cities.
Basically, I have a list of cities (say just 3: Paris, Berlin, London), and I created a combination between them with itertools (so I have Paris-Berlin, Paris-London, Berlin-London). I parsed the distances from a website and saved them in a dictionary (so I have: {Paris: {Berlin : 878.36, London : 343.67}, Berlin : {London : 932.14}}).
Now I want to create a two way table, so that I can look up for a pair of cities in Excel (I need it in Excel unfortunately, otherwise with Python all of this would be unnecessary!), and have the distance back. The table has to be complete (ie not triangular, so that I can look for London-Paris, or Paris-London, and the value must be there on both row/column pair). Is something like this possible easily? I was thinking probably I need to fill in my dictionary (ie create something like { Paris : {Berlin : 878.36, London 343.67}, Berlin : {Paris : 878.36, London : 932.14}, London : {Paris : 343.67, Berlin : 932.14}), and then feed it to Pandas, but not sure it's the fastest way. Thank you!
I think this does something like what you need:
import pandas as pd
data = {'Paris': {'Berlin': 878.36, 'London': 343.67}, 'Berlin': {'London': 932.14}}
# Create data frame from dict
df = pd.DataFrame(data)
# Rename index
df.index.name = 'From'
# Make index into a column
df = df.reset_index()
# Turn destination columns into rows
df = df.melt(id_vars='From', var_name='To', value_name='Distance')
# Drop missing values (distance to oneself)
df = df.dropna()
# Concatenate with itself but swapping the order of cities
df = pd.concat([df, df.rename(columns={'From' : 'To', 'To': 'From'})], sort=False)
# Reset index
df = df.reset_index(drop=True)
print(df)
Output:
From To Distance
0 Berlin Paris 878.36
1 London Paris 343.67
2 London Berlin 932.14
3 Paris Berlin 878.36
4 Paris London 343.67
5 Berlin London 932.14
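Since the end goal is an Excel lookup table, the long frame above can be pivoted back into a square two-way table (an extra step, not in the original answer; the frame is rebuilt here so the snippet stands alone) and written out with to_excel:

```python
import pandas as pd

data = {'Paris': {'Berlin': 878.36, 'London': 343.67}, 'Berlin': {'London': 932.14}}
df = pd.DataFrame(data)
df.index.name = 'From'
df = df.reset_index().melt(id_vars='From', var_name='To', value_name='Distance').dropna()
# Mirror each pair so both orderings exist
df = pd.concat([df, df.rename(columns={'From': 'To', 'To': 'From'})], sort=False)

# Square table: rows are origins, columns are destinations
table = df.pivot(index='From', columns='To', values='Distance')
# table.to_excel('distances.xlsx')  # uncomment to write the Excel file
```

In Excel, a pair can then be looked up with INDEX/MATCH against the row and column labels, in either order.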

Find max number in .CSV file in Python

I have a .csv file that when opened in Excel looks like this:
My code:
myfile = open("/Users/it/Desktop/Python/In-Class Programs/countries.csv", "rb")

countries = []
for item in myfile:
    a = item.split(",")
    countries.append(a)

hdi_list = []
for acountry in countries:
    hdi = acountry[3]
    try:
        hdi_list.append(float(hdi))
    except:
        pass

average = round(sum(hdi_list)/len(hdi_list), 2)
maxNumber = round(max(hdi_list), 2)
minNumber = round(min(hdi_list), 2)
This code works well; however, when I find the max, min, or average, I also need to grab the corresponding country name and print it.
How can I change my code to grab the country name for the min, max, and average as well?
Instead of putting the values straight in the list, use tuples instead, like this:
hdi_list.append((float(hdi), acountry[1]))
Then you can use this instead:
maxTuple = max(hdi_list)
maxNumber = round(maxTuple[0], 2)
maxCountry = maxTuple[1]
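A minimal self-contained version of that idea (the sample rows are hypothetical, merely mirroring the CSV layout with the country in column 1 and the HDI in column 3):

```python
# Hypothetical parsed CSV rows: [rank, country, ..., hdi]
rows = [["1", "Norway", "x", "0.944"],
        ["2", "Australia", "x", "0.935"],
        ["3", "Ireland", "x", "0.916"]]

# Store (value, name) tuples so max()/min() compare by value first
hdi_list = [(float(r[3]), r[1]) for r in rows]

max_hdi, max_country = max(hdi_list)
min_hdi, min_country = min(hdi_list)
```

Because tuples compare element by element, putting the numeric value first makes max() and min() do the right thing with no key function.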
Using the pandas module, [4], [5], and [6] below should show the max, min, and average respectively. Note that the data below doesn't match yours save for country.
In [1]: import pandas as pd
In [2]: df = pd.read_csv("hdi.csv")
In [3]: df
Out[3]:
Country HDI
0 Norway 83.27
1 Australia 80.77
2 Netherlands 87.00
3 United States 87.43
4 New Zealand 87.43
5 Canada 87.66
6 Ireland 75.47
7 Liechtenstein 88.97
8 Germany 86.31
9 Sweden 80.54
In [4]: df.loc[df["HDI"].idxmax()]
Out[4]:
Country Liechtenstein
HDI 88.97
Name: 7, dtype: object
In [5]: df.loc[df["HDI"].idxmin()]
Out[5]:
Country Ireland
HDI 75.47
Name: 6, dtype: object
In [6]: df["HDI"].mean()
Out[6]: 84.484999999999985
Assuming both Liechtenstein and Germany have max values:
In [15]: df
Out[15]:
Country HDI
0 Norway 83.27
1 Australia 80.77
2 Netherlands 87.00
3 United States 87.43
4 New Zealand 87.43
5 Canada 87.66
6 Ireland 75.47
7 Liechtenstein 88.97
8 Germany 88.97
9 Sweden 80.54
In [16]: df[df["HDI"] == df["HDI"].max()]
Out[16]:
Country HDI
7 Liechtenstein 88.97
8 Germany 88.97
The same logic can be applied for the minimum value.
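For completeness, the same filter with min() (a one-line sketch on a sample frame like the one above, rebuilt here so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Norway", "Ireland", "Liechtenstein"],
                   "HDI": [83.27, 75.47, 88.97]})

# All rows tied for the minimum HDI
min_rows = df[df["HDI"] == df["HDI"].min()]
```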
The following approach is close enough to your implementation that I think it might be useful. However, if you start working with larger or more complicated csv files, you should look into the standard library's csv module or pandas (as previously mentioned); they are more robust and efficient for complex .csv data. You could also work through Excel with the xlrd package.
In my opinion, the simplest way to keep country names tied to their respective values is to combine your for loops. Instead of looping through your data twice (in two separate for loops) and creating two separate lists, use a single for loop and build a dictionary with the relevant data (i.e. "country name", "hdi"). You could also create a tuple (as previously mentioned), but I think dictionaries are more explicit.
myfile = open("/Users/it/Desktop/Python/In-Class Programs/countries.csv", "r")

countries = []
for line in myfile:
    country_name = line.split(",")[1]
    value_of_interest = float(line.split(",")[3])
    countries.append(
        {"Country Name": country_name,
         "Value of Interest": value_of_interest})

ave_value = sum([country["Value of Interest"] for country in countries]) / len(countries)
max_value = max([country["Value of Interest"] for country in countries])
min_value = min([country["Value of Interest"] for country in countries])

print("Country Average ==", ave_value)
for country in countries:
    if country["Value of Interest"] == max_value:
        print("Max == {country}:{value}".format(country=country["Country Name"], value=country["Value of Interest"]))
    if country["Value of Interest"] == min_value:
        print("Min == {country}:{value}".format(country=country["Country Name"], value=country["Value of Interest"]))
Note that this method returns multiple countries if they have equal min/max values.
If you are dead-set on creating separate lists (like your current implementation), you might consider zip() to connect your lists by index, where
zip(countries, hdi_list) == [(countries[0], hdi_list[0]), (countries[1], hdi_list[1]), ...]
For example:
for country in zip(countries, hdi_list):
    if country[1] == max_value:
        print(country[0], country[1])
with similar logic applied to the min and average. This method works but is less explicit and more difficult to maintain.
