Prevent negative values in df.interpolate() - python

I'm having trouble avoiding negative values in interpolation. I have the following data in a DataFrame:
current_country =
idx Country Region Rank Score GDP capita Family Life Expect. Freedom Trust Gov. Generosity Residual Year
289 South Sudan Sub-Saharan Africa 143 3.83200 0.393940 0.185190 0.157810 0.196620 0.130150 0.258990 2.509300 2016
449 South Sudan Sub-Saharan Africa 147 3.59100 0.397249 0.601323 0.163486 0.147062 0.116794 0.285671 1.879416 2017
610 South Sudan Sub-Saharan Africa 154 3.25400 0.337000 0.608000 0.177000 0.112000 0.106000 0.224000 1.690000 2018
765 South Sudan Sub-Saharan Africa 156 2.85300 0.306000 0.575000 0.295000 0.010000 0.091000 0.202000 1.374000 2019
And I want to fill in the missing year (2015) - shown below - using pandas' df.interpolate()
new_row =
idx Country Region Rank Score GDP capita Family Life Expect. Freedom Trust Gov. Generosity Residual Year
593 South Sudan Sub-Saharan Africa 0 np.nan np.nan np.nan np.nan np.nan np.nan np.nan np.nan 2015
I create a row containing NaN in all the columns to be interpolated (as above) and append it to the original dataframe, then call interpolate to fill in those cells.
interpol_subset = current_country.append(new_row)
interpol_subset = interpol_subset.interpolate(method="pchip", order=2)
This produces the following df
idx Country Region Rank Score GDP capita Family Life Expect. Freedom Trust Gov. Generosity Residual Year
289 South Sudan Sub-Saharan Africa 143 3.83200 0.393940 0.185190 0.157810 0.196620 0.130150 0.258990 2.509300 2016
449 South Sudan Sub-Saharan Africa 147 3.59100 0.397249 0.601323 0.163486 0.147062 0.116794 0.285671 1.879416 2017
610 South Sudan Sub-Saharan Africa 154 3.25400 0.337000 0.608000 0.177000 0.112000 0.106000 0.224000 1.690000 2018
765 South Sudan Sub-Saharan Africa 156 2.85300 0.306000 0.575000 0.295000 0.010000 0.091000 0.202000 1.374000 2019
4 South Sudan Sub-Saharan Africa 0 2.39355 0.313624 0.528646 0.434473 -0.126247 0.072480 0.238480 0.963119 2015
The issue: in the last row, the value in "Freedom" is negative. Is there a way to parameterize the df.interpolate function so that it doesn't produce negative values? I can't find anything in the documentation. I'm fine with the estimates besides that negative value (although they're a bit skewed).
I considered simply flipping the negative to a positive, but the "Score" value is a sum of all the other continuous features and I would like to keep it that way. What can I do here?
Here's a link to the actual code snippet. Thanks for reading.

I doubt this is an issue with the interpolation itself. The main reason is the method you are using: 'pchip' will return a negative value for "Freedom" regardless. If we take the values from your dataframe:
import numpy as np
import scipy.interpolate

# the "Freedom" values for 2016-2019
y = np.array([0.196620, 0.147062, 0.112000, 0.010000])
x = np.array([0, 1, 2, 3])

pchip_obj = scipy.interpolate.PchipInterpolator(x, y)
# the appended row sits one step past the known points
print(pchip_obj(4))
The result is -0.126. If you want a guaranteed positive result, you should change the method you are using.
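interpolate() has no flag to forbid negative outputs, so one workaround (not from the original answer, just a sketch) is to run the same pchip fit on the logarithms of each component, which keeps the back-transformed estimates strictly positive, then rebuild "Score" as the sum of the components as the question requires. The column names below are taken from the question; extrapolate_positive is a hypothetical helper:

import numpy as np
import scipy.interpolate

# component columns from the question; "Score" is assumed to be their sum
components = ["GDP capita", "Family", "Life Expect.", "Freedom",
              "Trust Gov.", "Generosity", "Residual"]

def extrapolate_positive(values):
    # hypothetical helper: pchip-extrapolate one step ahead in log space,
    # so exp() of the result is always > 0
    x = np.arange(len(values))
    log_fit = scipy.interpolate.PchipInterpolator(x, np.log(values))
    return float(np.exp(log_fit(len(values))))

last = interpol_subset.index[-1]  # the appended all-NaN row
for col in components:
    interpol_subset.loc[last, col] = extrapolate_positive(
        interpol_subset[col].iloc[:-1].to_numpy())
interpol_subset.loc[last, "Score"] = interpol_subset.loc[last, components].sum()

With this, the estimate for "Freedom" still falls, but it approaches zero instead of crossing it, and "Score" stays consistent with the component sum.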

Related

Create multiple new pandas columns based on other columns in a loop

Assuming I have the following toy dataframe, df:
Country Population Region HDI
China 100 Asia High
Canada 15 NAmerica V.High
Mexico 25 NAmerica Medium
Ethiopia 30 Africa Low
I would like to create new columns based on the population, region, and HDI of Ethiopia in a loop. I tried the following method, but it is time-consuming when a lot of columns are involved.
df['Population_2'] = df['Population'][df['Country'] == "Ethiopia"]
df['Region_2'] = df['Region'][df['Country'] == "Ethiopia"]
df['Population_2'].fillna(method='ffill')
My final DataFrame df should look like:
Country Population Region HDI Population_2 Region_2 HDI_2
China 100 Asia High 30 Africa Low
Canada 15 NAmerica V.High 30 Africa Low
Mexico 25 NAmerica Medium 30 Africa Low
Ethiopia 30 Africa Low 30 Africa Low
How about this?
for col in ['Population', 'Region', 'HDI']:
    df[col + '_2'] = df.loc[df.Country == 'Ethiopia', col].iat[0]
I don't quite understand the broader point of what you're trying to do, and if Ethiopia could have multiple values the solution might be different. But this works for the problem as you presented it.
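If Ethiopia could in fact match several rows, here is a hedged variant of the same loop that makes the tie-breaking rule explicit (the first-occurrence rule is an assumption, not from the answer):

# take the Ethiopia rows once, then pick one row explicitly
ethiopia = df.loc[df.Country == 'Ethiopia']
# assumption: keep the first occurrence; swap in another rule if needed
row = ethiopia.iloc[0]
for col in ['Population', 'Region', 'HDI']:
    df[col + '_2'] = row[col]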
You can use:
# select the Ethiopia row and add the suffix "_2" to the columns (except Country)
s = (df.drop(columns='Country')
       .loc[df['Country'].eq('Ethiopia')]
       .add_suffix('_2')
       .squeeze())
# broadcast as new columns
df[s.index] = s
output:
Country Population Region HDI Population_2 Region_2 HDI_2
0 China 100 Asia High 30 Africa Low
1 Canada 15 NAmerica V.High 30 Africa Low
2 Mexico 25 NAmerica Medium 30 Africa Low
3 Ethiopia 30 Africa Low 30 Africa Low
You can use assign, assuming that you have only one row corresponding to Ethiopia:
d = dict(zip(df.columns.drop('Country').map('{}_2'.format),
             df.set_index('Country').loc['Ethiopia']))
df = df.assign(**d)
print(df):
Country Population Region HDI Population_2 Region_2 HDI_2
0 China 100 Asia High 30 Africa Low
1 Canada 15 NAmerica V.High 30 Africa Low
2 Mexico 25 NAmerica Medium 30 Africa Low
3 Ethiopia 30 Africa Low 30 Africa Low

python graph not showing up

So I tried graphing a data frame using pandas, and when I ran it, a blank image showed up with no errors or anything. I was hoping someone knows what the problem could be and how I can solve it.
I was wondering if this is a backend issue or what. Thank you!
For faster answers, please post the code as text along with sample data we can use to reproduce the problem. Since I don't have your code or data, the cause of the blank graph is a guess, but I suspect the country names are not being retrieved from the dictionary. I applied the example from the official reference to your case: I extracted the top 10 countries by population from the sample data, then drew a graph from the rows of the original data frame matching those country names. The looping is driven by a dictionary of country names mapped to arbitrary colors.
import plotly.express as px
df1 = px.data.gapminder().query('year==2007').sort_values('pop', ascending=False).head(10)
df1
      country        continent  year  lifeExp         pop  gdpPercap  iso_alpha  iso_num
299   China          Asia       2007   72.961  1318683096    4959.11        CHN      156
707   India          Asia       2007   64.698  1110396331    2452.21        IND      356
1619  United States  Americas   2007   78.242   301139947    42951.7        USA      840
719   Indonesia      Asia       2007    70.65   223547000    3540.65        IDN      360
179   Brazil         Americas   2007    72.39   190010647     9065.8        BRA       76
1175  Pakistan       Asia       2007   65.483   169270617    2605.95        PAK      586
107   Bangladesh     Asia       2007   64.062   150448339    1391.25        BGD       50
1139  Nigeria        Africa     2007   46.859   135031164    2013.98        NGA      566
803   Japan          Asia       2007   82.603   127467972    31656.1        JPN      392
995   Mexico         Americas   2007   76.195   108700891    11977.6        MEX      484
# map each country to a color
colors = px.colors.sequential.Plasma
color = {k: v for k, v in zip(df1.country, colors)}
{'China': '#0d0887',
'India': '#46039f',
'United States': '#7201a8',
'Indonesia': '#9c179e',
'Brazil': '#bd3786',
'Pakistan': '#d8576b',
'Bangladesh': '#ed7953',
'Nigeria': '#fb9f3a',
'Japan': '#fdca26',
'Mexico': '#f0f921'}
# all years of data for the top-10 countries
df1_top10 = px.data.gapminder().query('country in @df1.country')
import plotly.graph_objects as go

fig = go.Figure()
for k, v in color.items():
    fig.add_trace(go.Scatter(
        x=df1_top10[df1_top10['country'] == k]['year'],
        y=df1_top10[df1_top10['country'] == k]['lifeExp'],
        name=k,
        mode='markers+text+lines',
        marker_color='black',
        marker_size=3,
        line=dict(color=v),
        yaxis='y1'))
fig.update_layout(
    title="Top 10 Country wise Life Ladder trend",
    xaxis_title="Year",
    yaxis_title="Life Ladder",
    template='ggplot2',
    font=dict(size=16,
              color="Black",
              family="Garamond"),
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True)
)
fig.show()
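As for the original symptom of a blank figure: if the code runs without errors, the problem may be the display backend rather than the data. A sketch, assuming the code runs as a plain script outside a notebook:

import plotly.io as pio

# force an explicit renderer; 'browser' opens the figure in the
# default web browser and sidesteps notebook/IDE display issues
pio.renderers.default = 'browser'
fig.show()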

Find largest 2 values for each year in the returned pandas groupby object after sorting each group

My dataframe has 3 columns: Year, Leading Cause, Deaths. I want to find the total number of deaths by leading cause in each year. I did the following:
totalDeaths_Cause = df.groupby(["Year", "Leading Cause"])["Deaths"].sum()
which results in:
The total number of deaths for :
Year Leading Cause
2009 Hypertension 26
2010 All Other Causes 2140
2011 Cerebrovascular Disease 281
Immunodeficiency 70
Parkinson Disease 180
2012 Cerebrovascular Disease 102
Disease1 183
Diseases of Heart 76
2013 Cerebrovascular Disease 386
Parkinson Disease 372
Self-Harm 17
Name: Deaths, dtype: int64
Now I want to get the largest 2 values (for Deaths) in each year, along with the leading cause, such that:
The total number of deaths for :
Year Leading Cause
2009 Hypertension 26
2010 All Other Causes 2140
2011 Cerebrovascular Disease 281
Parkinson Disease 180
2012 Disease1 183
Cerebrovascular Disease 102
2013 Cerebrovascular Disease 386
Parkinson Disease 372
Thanks in advance for your help!
Let us do:
totalDeaths_Cause = totalDeaths_Cause.sort_values().groupby(level=0).tail(2)
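A sketch of an equivalent spelling uses groupby + nlargest on the summed series; nlargest prepends a duplicate Year level to the index, which is dropped afterwards:

top2 = (totalDeaths_Cause
        .groupby(level=0)   # group by Year
        .nlargest(2)        # two largest totals per year
        .droplevel(0))      # remove the duplicated Year level
print(top2)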

Filter and drop rows by proportion python

I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows in column 'country' that are less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows whose country's proportion is less than 0.01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can use something like this (this assumes you have first stored each row's proportion in a column, here called proportion):
df = df[df.proportion >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
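If you'd rather not build a separate proportions column, here is a sketch of the same filter applied row-wise (assuming the DataFrame is named wine as in the question):

# map each row's country to that country's share of the dataset,
# then keep rows whose country's share is at least 1%
share = wine['country'].map(wine['country'].value_counts(normalize=True))
wine = wine[share >= 0.01]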
Figured it out:
# countries whose share of the rows exceeds 1%
country_filter = wine.country.value_counts(normalize=True) > 0.01
country_index = country_filter[country_filter].index
wine = wine[wine.country.isin(list(country_index))]

Compare two lists - if item from one list is present in another do something

UPDATE
I have two lists of countries:
one from https://www.countries-ofthe-world.com/world-currencies.html and the second one from http://www.nationsonline.org/oneworld/country_code_list.htm. Some country names differ between those lists. I need to merge them to get each country's name, ISO3 code and ISO-4217 code. To have the full list I need to rename some of the countries, so I'm trying to work out a routine that finds a country and then renames it to the value found in the second DataFrame.
I have two lists (of countries) in the following format:
Names I need
Country_ISO['Country_or_territory'].tail(10)
Out[57]:
237 Russian Federation
238 Vanuatu
239 Venezuela
240 Viet Nam
241 Virgin Islands, US
242 Wallis and Futuna Islands
243 Western Sahara
244 Yemen
245 Zambia
246 Zimbabwe
Name: Country_or_territory, dtype: object
and names that are different
NotIn.Country_or_territory.tail(10)
Out[61]:
131 Macau
132 Macedonia
148 Pitcairn Islands
153 Svalbard and Jan Mayen
163 Russia
172 South Korea
177 Syria
178 Taiwan
180 Tanzania
193 Vietnam
Name: Country_or_territory, dtype: object
I need to find items in the first list (Country_ISO['Country_or_territory'].tail(10)) that correspond to items in the second list (NotIn.Country_or_territory.tail(10)) and do something with those names (rename them).
I'm trying to do this using nested for loops:
import re

for itemNotIn in NotIn.Country_or_territory.tail(10):
    for item in Country_ISO['Country_or_territory'].tail(10):
        Tr = itemNotIn[:3]  # compare by the first 3 characters
        t = re.sub(Tr + r'\w+', '*****NOT_IN*****', item)
        print(t)
But when I run it, I get the whole list repeated len(NotIn.Country_or_territory.tail(10)) times.
And I just can't find a way to make it work.
Ideally I will have a list like:
*****NOT_IN*****
Vanuatu
Venezuela
*****NOT_IN***** Nam
Virgin Islands, US
Wallis and Futuna Islands
Western Sahara
Yemen
Zambia
Zimbabwe
As the commenters have suggested, using sets you can get the latter part of your question (i.e. the differences between the lists) using this code:
list1 = ['Russia', 'Vanuatu', 'Venezuela', 'Viet Nam', 'Virgin Islands, US',
         'Wallis and Futuna Islands', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']
list2 = ['Macau', 'Macedonia', 'Pitcairn Islands', 'Svalbard and Jan Mayen',
         'Russia', 'South Korea', 'Syria', 'Taiwan', 'Tanzania', 'Vietnam']

temp_set1 = set(list1).difference(list2)
print("Not in list2", temp_set1)
temp_set2 = set(list2).difference(list1)
print("Not in list1", temp_set2)
Now, the first part of your question actually suggests you wish to find similarities and change the similar items. In that case, you could do
common = list(set(list1).intersection(list2))
In [18]: list(set(list1).intersection(list2))
Out[18]: ['Russia']
def fun(common):
    # do something with common
    ...
Finally, if you still wish to compare using only the first three characters, then do something like this:
set([x[:3] for x in list1]).difference([x[:3] for x in list2])
In [19]: set([x[:3] for x in list1]).difference([x[:3] for x in list2])
Out[19]: {'US', 'Van', 'Ven', 'Vir', 'Wal', 'Wes', 'Yem', 'Zam', 'Zim'}
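For the renaming itself, a dict-based mapping is usually simpler than regex substitution. A minimal sketch, assuming the NotIn frame from the question; the mapping covers only the two mismatches visible in the excerpts and would need to be extended:

# map nationsonline-style names to countries-ofthe-world-style names;
# only the pairs visible in the question's excerpts are included
rename_map = {
    'Russia': 'Russian Federation',
    'Vietnam': 'Viet Nam',
}
NotIn['Country_or_territory'] = NotIn['Country_or_territory'].replace(rename_map)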
