Linear interpolation - python

I am working with a dataset containing columns for GDP and GDP per capita for a number of countries. These columns contain missing values. Due to the nature of the data I was hoping to play around with linear interpolation in order to fill in the missing values without losing the general shape of the data.
My code looks as follows:
grouped_df = df.groupby("Country")

# Iterate over the groups
for country, group in grouped_df:
    # Select the rows that contain missing values
    missing_values = group[group["GDP percapita"].isnull()]
    if not missing_values.empty:
        # Interpolate to fill the missing values
        filled_values = missing_values["GDP percapita"].interpolate(method="linear")
        # Update the original dataframe
        df.update(filled_values)
When I run this, however, the missing values are still present in my dataset, and I can't find the issue with my code.

Replace

filled_values = missing_values["GDP percapita"].interpolate(method="linear")

with

filled_values = group["GDP percapita"].interpolate(method="linear")
Here is a little working sample:

d = {"country": ["Brazil", "Brazil", "India", "India"],
     "cities": ["Brasilia", "Rio", "New Delhi", "Bombay"],
     "population": [200.4, None, 100.10, None]}
p = pd.DataFrame(d)
print(p)  # should give you

  country     cities  population
0  Brazil   Brasilia       200.4
1  Brazil        Rio         NaN
2   India  New Delhi       100.1
3   India     Bombay         NaN

# to interpolate the entire df
p.interpolate(method='linear')

  country     cities  population
0  Brazil   Brasilia      200.40
1  Brazil        Rio      150.25
2   India  New Delhi      100.10
3   India     Bombay      100.10

# to interpolate group-wise
grouped_df = p.groupby("country")

# Iterate over the groups
for country, group in grouped_df:
    filled_values = group["population"].interpolate(method="linear")
    p.update(filled_values)
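For the original GDP question, the loop can also be avoided entirely. A minimal sketch using groupby with transform, assuming the same "Country" and "GDP percapita" column names as in the question:

# interpolate within each country and write the result back in place
df["GDP percapita"] = (df.groupby("Country")["GDP percapita"]
                         .transform(lambda s: s.interpolate(method="linear")))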


Creating a multiindex heatmap from two dicts and a pandas dataframe

I have two dicts where the keys are the same for both, but not necessarily in the same order.
DictA = {"Asia": ["Japan", "China", "Laos"], "Europe": ["England", "Sweden"]}
DictB = {"Europe": ["Denmark", "Hungary", "Spain", "Moldova"], "Asia": ["Mongolia", "Thailand"]}
These keys and values point to the columns and rows of a pandas dataframe with values that need to be made into a heatmap.
Df =
Country Asia Europe
Japan 3 1
Sweden 2 2
England 1 4
China 5 9
Laos 1 9
Denmark 3 1
Mongolia 1 7
Thailand 7 4
Hungary 7 3
Spain 2 9
Moldova 1 5
What I need to figure out is how to use these dicts to cross-reference the pandas dataframe and build a heatmap colored by the values. The country names should be on the lower axis of the heatmap (but this is not important), and beneath the country names (further from the heatmap) should be the continent each country is in. So if the country names are on the left side of the heatmap, the continent names should sit even further to the left to show where each country belongs.
I haven't got the slightest clue how to do this. Any help is greatly appreciated!
So, you have to build a matrix with the data and pass it to the DataFrame constructor.
To build the matrix, first collect all continents and all countries into lists; these give you the axis labels and let you look up each one's index in the matrix.
Then build a matrix of size countries x continents and fill it with zeros. Finally, for each dict, count the occurrences of countries in continents.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

DictA = {"Asia": ["Japan", "China", "Laos"], "Europe": ["England", "Sweden"]}
DictB = {"Europe": ["Denmark", "Hungary", "Spain", "Moldova"], "Asia": ["Mongolia", "Thailand"]}

def createDataFrame(dicts):
    continents = []
    countries = []
    for aDict in dicts:
        for continent in aDict:
            if continent not in continents:
                continents.append(continent)
            for country in aDict[continent]:
                if country not in countries:
                    countries.append(country)
    continentsLength = len(continents)
    countriesLength = len(countries)
    # countries rows x continents cols
    matrix = np.zeros((countriesLength, continentsLength))
    for aDict in dicts:
        for continent in aDict:
            continentIndex = continents.index(continent)
            for country in aDict[continent]:
                countryIndex = countries.index(country)
                matrix[countryIndex][continentIndex] += 1
    return pd.DataFrame(matrix, columns=continents, index=countries)

df = createDataFrame([DictA, DictB])
plt.imshow(df, cmap="YlGnBu")
plt.colorbar()
plt.xticks(range(len(df.columns)), df.columns, rotation=20)
plt.yticks(range(len(df.index)), df.index)
plt.show()
And you have the output:
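A shorter sketch of the same counting step, assuming the DictA/DictB above: flatten the dicts into (country, continent) pairs and let pd.crosstab do the counting (note it sorts the labels alphabetically rather than keeping insertion order).

pairs = [(country, continent)
         for aDict in (DictA, DictB)
         for continent, countries in aDict.items()
         for country in countries]
pairs_df = pd.DataFrame(pairs, columns=["country", "continent"])
heat_df = pd.crosstab(pairs_df["country"], pairs_df["continent"])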

Pivoting COVID-19 JH Data to Time Series Rows

I am trying to pivot the Johns Hopkins Data so that date columns are rows and the rest of the information stays the same. The first seven columns should stay columns, but the remaining columns (date columns) should be rows. Any help would be appreciated.
Load and Filter data
import pandas as pd
import numpy as np
deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
dea = pd.read_csv(deaths_url)
con = pd.read_csv(confirmed_url)
dea = dea[(dea['Province_State'] == 'Texas')]
con = con[(con['Province_State'] == 'Texas')]
View recency of data and pivot
# get the most recent date in the data
mostRecentDate = con.columns[-1]  # the last column of the frame is the latest date
# show the data frame
con.sort_values(by=mostRecentDate, ascending = False).head(10)
# save this index variable to save the order.
index = data.columns.drop(['Province_State'])
# The pivot_table method will eliminate duplicate entries from Countries with more than one city
data.pivot_table(index = 'Admin2', aggfunc = sum)
# formatting using a variety of methods to process and sort data
finalFrame = data.transpose().reindex(index).transpose().set_index('Admin2').sort_values(by=mostRecentDate, ascending=False).transpose()
The resulting data frame looks like this; however, it did not preserve any of the dates.
I have also tried:
date_columns = con.iloc[:, 7:].columns
con.pivot(index = date_columns, columns = 'Admin2', values = con.iloc[:, 7:])
ValueError: Must pass DataFrame with boolean values only
Edit:
As per guidance I tried the melt command listed in the first answer, and it does not create rows of dates; it just removed all the other non-date values.
date_columns = con.iloc[:, 7:].columns
con.melt(id_vars=date_columns)
The end result should look like this:
Date iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ Combined_Key
1/22/2020 US USA 840 48001 Anderson Texas US 31.81534745 -95.65354823 Anderson, Texas, US
1/22/2020 US USA 840 48003 Andrews Texas US 32.30468633 -102.6376548 Andrews, Texas, US
1/22/2020 US USA 840 48005 Angelina Texas US 31.25457347 -94.60901487 Angelina, Texas, US
1/22/2020 US USA 840 48007 Aransas Texas US 28.10556197 -96.9995047 Aransas, Texas, US
Use pandas melt. Great example here.
Example:
In [41]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
....: 'last': ['Doe', 'Bo'],
....: 'height': [5.5, 6.0],
....: 'weight': [130, 150]})
....:
In [42]: cheese
Out[42]:
first last height weight
0 John Doe 5.5 130
1 Mary Bo 6.0 150
In [43]: cheese.melt(id_vars=['first', 'last'])
Out[43]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [44]: cheese.melt(id_vars=['first', 'last'], var_name='quantity')
Out[44]:
first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In your case, you need to be operating on a dataframe (i.e. con or finalframe or wherever your date column is). For example:
con.melt(id_vars=date_columns)
See specific example here.
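Note that melt keeps the id_vars columns fixed and turns everything else into rows, so to get dates as rows the id_vars should be the metadata columns rather than the date columns. A sketch for this case, assuming (as the question states) that the first seven columns are the identifiers and that con holds confirmed cases:

id_cols = list(con.columns[:7])  # adjust the slice if the file has more metadata columns
long_df = con.melt(id_vars=id_cols, var_name='Date', value_name='Confirmed')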

How to create a column from another df according to matching columns?

I have a df named population with a column named countries. I want to merge rows so they reflect regions = (Africa, West Hem, Asia, Europe, Mideast). I have another df named regionref from Kaggle that has all the countries of the world and the region each is associated with.
How do I create a new column in the population df that has the corresponding region for each country in the country column, using the region column from the Kaggle dataset?
So essentially this is the population dataframe:
CountryName 1960 1950 ...
US
Zambia
India
And this is the regionref dataset
Country Region GDP...
US West Hem
Zambia Africa
India Asia
And I want the population df to look like
CountryName Region 1960 1950 ...
US West Hem
Zambia Africa
India Asia
EDIT: I tried the concatenation, but for some reason the two columns do not recognize the same values.
population['Country Name'].isin(regionref['Country']).value_counts()
This returned False for all values, as in there are no values in common.
And this is the output; as you can see, there are values in common.
You just need join functionality or, to put it in pandas terms, a concatenate.
Given two DataFrames pop, region:
pop = pd.DataFrame([['US', 1000, 2000], ['CN', 2000, 3000]], columns=['CountryName', 1950, 1960])
CountryName 1950 1960
0 US 1000 2000
1 CN 2000 3000
region = pd.DataFrame([['US', 'AMER', '5'], ['CN', 'ASIA', '4']], columns = ['Country', 'Region', 'GDP'])
Country Region GDP
0 US AMER 5
1 CN ASIA 4
You can do:
pd.concat([region.set_index('Country'), pop.set_index('CountryName')], axis=1)\
  .drop('GDP', axis=1)
Region 1950 1960
US AMER 1000 2000
CN ASIA 2000 3000
The axis=1 is for concatenating horizontally. You have to set the index to the country column on both frames so the join aligns correctly.
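An alternative sketch using merge, which keeps the population frame intact and only pulls in the Region column (column names taken from the question). If the isin check returns False everywhere, stray whitespace or differing spellings of the country names are a common cause, so stripping both key columns first is worth trying:

# optional: strip whitespace, a common reason why isin() finds no matches
population['Country Name'] = population['Country Name'].str.strip()
regionref['Country'] = regionref['Country'].str.strip()

with_region = population.merge(regionref[['Country', 'Region']],
                               left_on='Country Name', right_on='Country',
                               how='left').drop(columns='Country')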

How to modify the Pandas DataFrame and insert new columns

I have some data with information provided below.
df.info() is below,
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6662 entries, 0 to 6661
Data columns (total 2 columns):
value 6662 non-null float64
country 6478 non-null object
dtypes: float64(1), object(1)
memory usage: 156.1+ KB
None
list of the columns,
[u'value' 'country']
the df is below,
value country
0 550.00 USA
1 118.65 CHINA
2 120.82 CHINA
3 86.82 CHINA
4 112.14 CHINA
5 113.59 CHINA
6 114.31 CHINA
7 111.42 CHINA
8 117.21 CHINA
9 111.42 CHINA
--------------------
--------------------
6655 500.00 USA
6656 500.00 USA
6657 390.00 USA
6658 450.00 USA
6659 420.00 USA
6660 420.00 USA
6661 450.00 USA
I need to add another column, namely outlier, and put 1 if the data point is an outlier for its respective country; otherwise, I need to put 0. I emphasize that the outliers need to be computed for each respective country and NOT for all countries together.
I found some formulas for calculating the outliers which may be of help, for example:
# keep only the ones that are within +3 to -3 standard deviations
def exclude_the_outliers(df):
    df = df[np.abs(df.col - df.col.mean()) <= (3 * df.col.std())]
    return df

def exclude_the_outliers_extra(df):
    LOWER_LIMIT = .35
    HIGHER_LIMIT = .70
    filt_df = df.loc[:, df.columns == 'value']
    # Then, computing percentiles.
    quant_df = filt_df.quantile([LOWER_LIMIT, HIGHER_LIMIT])
    # Next, filtering values based on computed percentiles. To do that I use
    # an apply by columns and that's it!
    filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[LOWER_LIMIT, x.name]) &
                                        (x < quant_df.loc[HIGHER_LIMIT, x.name])], axis=0)
    filt_df = pd.concat([df.loc[:, df.columns != 'value'], filt_df], axis=1)
    filt_df.dropna(inplace=True)
    return df
I was not able to use those formulas properly for this purpose, but I have provided them as a suggestion.
Finally, I will need to count the percentage of outliers for the USA and CHINA present in the data.
How to achieve that?
Note: adding the outlier column with all zeros is easy in pandas and should look like this:

df['outlier'] = 0

However, the issue is still to find the outliers and overwrite the zeros with 1 for the respective country.
You can slice the dataframe by each country, calculate the quantiles for the slice, and set the value of outlier at the index of the country.
There might be a way to do it without iteration, but it is beyond me.
# using True/False for the outlier, it is the same as 1/0
df['outlier'] = False

# set the quantile limits
low_q = 0.35
high_q = 0.7

# iterate over each country
for c in df.country.unique():
    # subset the dataframe where the country = c, get the quantiles
    q = df.value[df.country == c].quantile([low_q, high_q])
    # at the row index where the country column equals `c` and the column is `outlier`,
    # set the value to True or False based on whether the `value` column falls
    # outside the quantiles
    df.loc[df.index[df.country == c], 'outlier'] = (df.value[df.country == c]
                                                    .apply(lambda x: x < q[low_q] or x > q[high_q]))
Edit: To get the percentage of outliers per country, you can groupby the country column and aggregate using the mean.
gb = df[['country','outlier']].groupby('country').mean()
for row in gb.itertuples():
print('Percentage of outliers for {: <12}: {:.1f}%'.format(row[0], 100*row[1]))
# output:
# Percentage of outliers for China : 54.0%
# Percentage of outliers for USA : 56.0%
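A vectorized sketch of the same idea using groupby/transform, with the quantile limits defined above and a 0/1 column as the question asks:

# per-row quantile bounds for each country's 'value' column
low = df.groupby('country')['value'].transform(lambda s: s.quantile(low_q))
high = df.groupby('country')['value'].transform(lambda s: s.quantile(high_q))
df['outlier'] = ((df['value'] < low) | (df['value'] > high)).astype(int)

# percentage of outliers per country
print(df.groupby('country')['outlier'].mean() * 100)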

From tuples to multiple columns in pandas

How do I convert this dataframe
location value
0 (Richmond, Virginia, nan, USA) 100
1 (New York City, New York, nan, USA) 200
to this:
city state region country value
0 Richmond Virginia nan USA 100
1 New York City New York nan USA 200
Note that the location column in the first dataframe contains tuples. I want to create four columns out of the location column.
new_col_list = ['city', 'state', 'regions', 'country']
for n, col in enumerate(new_col_list):
    df[col] = df['location'].apply(lambda location: location[n])
df = df.drop('location', axis=1)
If you return a Series of the (split) location, you can merge (join to merge on index) the resulting DF directly with your value column.
addr = ['city', 'state', 'region', 'country']
df[['value']].join(df.location.apply(lambda loc: pd.Series(loc, index=addr)))
value city state region country
0 100 Richmond Virginia NaN USA
1 200 New York City New York NaN USA
I haven't timed this, but I would suggest this option:
df.loc[:,'city']=df.location.map(lambda x:x[0])
df.loc[:,'state']=df.location.map(lambda x:x[1])
df.loc[:,'regions']=df.location.map(lambda x:x[2])
df.loc[:,'country']=df.location.map(lambda x:x[3])
I'm guessing that avoiding an explicit for loop might lend itself to SIMD instructions (certainly NumPy looks for that, but perhaps not other libraries).
I prefer to use pd.DataFrame.from_records to expand the tuples into columns. Then this can be joined to the previous dataset as described by meloncholy.
df = pd.DataFrame({"location": [("Richmond", "Virginia", pd.NA, "USA"),
                                ("New York City", "New York", pd.NA, "USA")],
                   "value": [100, 200]})
loc = pd.DataFrame.from_records(df.location, columns=['city', 'state', 'regions', 'country'])
df.drop("location", axis=1).join(loc)
from_records does assume a sequential index. If this is not the case you should pass the index to the new DataFrame:
loc = pd.DataFrame.from_records(df.location.reset_index(drop=True),
                                columns=['city', 'state', 'regions', 'country'],
                                index=df.index)
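Yet another common sketch is to expand the tuples with the plain DataFrame constructor (same column names as above; this assumes every tuple has exactly four elements):

cols = ['city', 'state', 'regions', 'country']
df[cols] = pd.DataFrame(df['location'].tolist(), index=df.index)
df = df.drop(columns='location')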
