DataFrame subtraction and assignment gives back NaNs - python

Let's suppose that I have a dataset (df_data) such as the following:
Time Geography Population
2016 England and Wales 58381200
2017 England and Wales 58744600
2016 Northern Ireland 1862100
2017 Northern Ireland 1870800
2016 Scotland 5404700
2017 Scotland 5424800
2016 Wales 3113200
2017 Wales 3125200
If I do the following:
df_nireland = df_data[df_data['Geography']=='Northern Ireland']
df_wales = df_data[df_data['Geography']=='Wales']
df_scotland = df_data[df_data['Geography']=='Scotland']
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales']
df_england = df_engl_n_wales
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population']
then df_england has NaN values in the Population column.
How can I fix this?
By the way, I have read the relevant posts, but nothing there (.loc, .copy, etc.) worked exactly for me.

This is really an organization problem. If you pivot, you can do the subtractions easily and ensure alignment on Time:
df_pop = df.pivot(index='Time', columns='Geography', values='Population')
df_pop['England'] = df_pop['England and Wales'] - df_pop['Wales']
Output df_pop:
Geography England and Wales Northern Ireland Scotland Wales England
Time
2016 58381200 1862100 5404700 3113200 55268000
2017 58744600 1870800 5424800 3125200 55619400
If you need to get back to your original format, then you can do:
df_pop.stack().to_frame('Population').reset_index()
# Time Geography Population
#0 2016 England and Wales 58381200
#1 2016 Northern Ireland 1862100
#2 2016 Scotland 5404700
#3 2016 Wales 3113200
#4 2016 England 55268000
#5 2017 England and Wales 58744600
#6 2017 Northern Ireland 1870800
#7 2017 Scotland 5424800
#8 2017 Wales 3125200
#9 2017 England 55619400

I simply had to do the following:
df_nireland = df_data[df_data['Geography']=='Northern Ireland'].reset_index(drop=True)
df_wales = df_data[df_data['Geography']=='Wales'].reset_index(drop=True)
df_scotland = df_data[df_data['Geography']=='Scotland'].reset_index(drop=True)
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales'].reset_index(drop=True)
df_england = df_engl_n_wales.copy()  # copy, so the original slice is not modified
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population']
Or, a better way in principle, since it retains the indices of the initial dataframe, is the following:
df_nireland = df_data[df_data['Geography']=='Northern Ireland']
df_wales = df_data[df_data['Geography']=='Wales']
df_scotland = df_data[df_data['Geography']=='Scotland']
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales']
df_england = df_engl_n_wales.copy()  # copy, so the original slice is not modified
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population'].values
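For context on why the NaNs appear in the first place: pandas aligns on index labels when subtracting two Series, and the 'England and Wales' rows and the 'Wales' rows occupy different index labels in df_data, so every aligned pair has a missing side. A minimal sketch with toy Series (hypothetical values, not the census data above) illustrates the failure and the positional fix:
import pandas as pd

# Series subtraction aligns on index labels, not on position:
a = pd.Series([10, 20], index=[0, 1])  # stands in for the 'England and Wales' rows
b = pd.Series([1, 2], index=[6, 7])    # stands in for the 'Wales' rows

print(a - b)         # NaN at 0, 1, 6, 7 -- no label is shared
print(a - b.values)  # 9, 18 -- positional subtraction, alignment bypassed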


How to pull value from a column when several columns match in two data frames?

I am trying to write a script which will search a database similar to that in Table 1, based on a product/region/year specification outlined in Table 2. The plan is to find a match in Table 1 for each specification in Table 2 and then pull the observation value, as seen in 'Table 2 - with results'.
I need this code to run several loops in which the year criterion is relaxed. For example, loop 1 would search for a match on Product_L1, Geography_L1 and Year, loop 2 on Product_L1, Geography_L1 and Year-1, and so on.
Table 1
Product level 1  Product level 2  Region level 1  Region level 2  Year  Obs. value
Portland cement  Cement           Peru            South America   2021  1
Portland cement  Cement           Switzerland     Europe          2021  2
Portland cement  Cement           USA             North America   2021  3
Portland cement  Cement           Brazil          South America   2021  4
Portland cement  Cement           South Africa    Africa          2021  5
Portland cement  Cement           India           Asia            2021  6
Portland cement  Cement           Brazil          South America   2020  7
Table 2
Product level 1  Product level 2  Region level 1  Region level 2  Year
Portland cement  Cement           Brazil          South America   2021
Portland cement  Cement           Switzerland     Europe          2021
Table 2 - with results
Product level 1  Product level 2  Region level 1  Region level 2  Year  Loop 1  Loop 2  x
Portland cement  Cement           Brazil          South America   2021  4       7
I have tried using the following code, but it comes up with the error 'Can only compare identically-labeled Series objects'. Does anyone have any suggestions on how to prevent this error?
Table_2['Loop_1'] = np.where((Table_1.Product_L1 == Table_2.Product_L1)
                             & (Table_1.Geography_L1 == Table_2.Geography_L1)
                             & (Table_1.Year == Table_2.Year),
                             Table_1['obs_value'], '')
The error arises because comparing two Series element-wise requires identically-labeled indices, and Table_1 and Table_2 have different lengths and indices. Instead, you can perform a merge operation and provide a list of the columns that you want from Table_1.
import pandas as pd

Table_1 = pd.DataFrame({
    "Product_L1": ["Portland cement", "Portland cement", "Portland cement", "Portland cement", "Portland cement", "Portland cement", "Portland cement"],
    "Product_L2": ["Cement", "Cement", "Cement", "Cement", "Cement", "Cement", "Cement"],
    "Geography_L1": ["Peru", "Switzerland", "USA", "Brazil", "South Africa", "India", "Brazil"],
    "Geography_L2": ["South America", "Europe", "North America", "South America", "Africa", "Asia", "South America"],
    "Year": [2021, 2021, 2021, 2021, 2021, 2021, 2020],
    "obs_value": [1, 2, 3, 4, 5, 6, 7]
})
Table_2 = pd.DataFrame({
    "Product_L1": ["Portland cement", "Portland cement"],
    "Product_L2": ["Cement", "Cement"],
    "Geography_L1": ["Brazil", "Switzerland"],
    "Geography_L2": ["South America", "Europe"],
    "Year": [2021, 2021]
})
columns_list = ['Product_L1','Product_L2','Geography_L1','Geography_L2','Year','obs_value']
result = pd.merge(Table_2, Table_1[columns_list], how='left')
result is a new dataframe:
Product_L1 Product_L2 Geography_L1 Geography_L2 Year obs_value
0 Portland cement Cement Brazil South America 2021 4
1 Portland cement Cement Switzerland Europe 2021 2
EDIT: Based upon the update to the question, I think what you are trying to do is achievable using set_index and unstack. This will create a new dataframe with the observed values listed in columns 'Year_2020', 'Year_2021' etc.
index_columns = ['Product_L1','Product_L2','Geography_L1','Geography_L2', 'Year']
edit_df = Table_1.set_index(index_columns)['obs_value'].unstack().add_prefix('Year_').reset_index()
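If you then need the relaxed-year lookups from the question, a follow-up merge of Table_2 against edit_df on the non-year keys pulls every Year_XXXX column at once. This is a sketch under the column names defined above, where each Year_XXXX column plays the role of one loop:
# Merge on the non-year keys; Table_2's own Year column is kept alongside
key_columns = ['Product_L1', 'Product_L2', 'Geography_L1', 'Geography_L2']
looped = pd.merge(Table_2, edit_df, on=key_columns, how='left')
# looped now holds Year_2020, Year_2021, ... next to each Table_2 row
# (e.g. Brazil gets both 4 and 7; a missing year simply shows NaN)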

Adding column of values to pandas DataFrame

I'm doing a simple sentiment analysis and am stuck on something that I feel is very simple. I'm trying to add a new column with a set of values, in this example compound values, but after the for loop iterates it adds the same value for all the rows rather than a value for each iteration. The compound values are the last column in the DataFrame. There should be a quick fix. Thanks!
for i, row in real.iterrows():
    real['compound'] = sid.polarity_scores(real['title'][i])['compound']
title text subject date compound
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017 0.2263
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017 0.2263
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017 0.2263
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017 0.2263
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017 0.2263
IIUC:
real['compound'] = real.apply(lambda row: sid.polarity_scores(row['title'])['compound'], axis=1)
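The reason the loop writes the same number everywhere: real['compound'] = <scalar> assigns that scalar to the entire column, so each iteration overwrites all rows and only the last title's score survives. The apply above computes one score per row instead; an equivalent list-comprehension sketch (same assumed sid scorer as in the question):
# One score per title, assigned to the column in a single pass
real['compound'] = [sid.polarity_scores(t)['compound'] for t in real['title']]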

how to find the full name of athlete in this case?

Let's say this is my data frame:
country Edition sports Athletes Medal Firstname Score
Germany 1990 Aquatics HAJOS, Alfred gold Alfred 3
Germany 1990 Aquatics HIRSCHMANN, Otto silver Otto 2
Germany 1990 Aquatics DRIVAS, Dimitrios silver Dimitrios 2
US 2008 Athletics MALOKINIS, Ioannis gold Ioannis 1
US 2008 Athletics HAJOS, Alfred silver Alfred 2
US 2009 Athletics CHASAPIS, Spiridon gold Spiridon 3
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios 3
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios 3
France 2010 golf HAJOS, Alfred Bronze Alfred 1
France 2011 golf ANDREOU, Joannis silver Joannis 2
Spain 2011 golf BURKE, Thomas gold Thomas 3
I am trying to find out which athlete's first name has the largest sum of scores.
I have tried the following:
df.groupby('Firstname')['Score'].sum().idxmax()
This returns the first name of the athlete, but I want to display the full name. Can anyone help me with this?
For example: I am getting 'Otto' as output, but I want to display 'HIRSCHMANN, Otto' as the output!
Note: I have noticed that in my original data set, when I groupby('Athlete'), the answer is different.
idxmax will only give you the index of the first row with the maximal value. If multiple Firstname values share the max score, it will fail to find them.
Try this instead:
sum_score = df.groupby('Firstname')['Score'].sum()
max_score = sum_score.max()
names = sum_score[sum_score == max_score].index
df[df['Firstname'].isin(names)]
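To display just the full names rather than the whole rows, you could select the Athletes column from that filtered frame (assuming the column is named 'Athletes' as in the sample data):
# Full name(s) of the top-scoring first name(s)
df.loc[df['Firstname'].isin(names), 'Athletes'].unique()
# returns an array of the matching full names, e.g. 'HIRSCHMANN, Otto'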

Filter and drop rows by proportion python

I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows whose 'country' value accounts for less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows with proportions less than 0.01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can use something like this:
df = df[df.proportion >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
Figured it out:
country_filter = wine.country.value_counts(normalize=True) > 0.01
country_index = country_filter[country_filter.values == True].index
wine = wine[wine.country.isin(list(country_index))]
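For reference, a more compact variant of the same idea maps each row's country to its proportion and filters directly (same wine frame assumed):
# Map each row's country to its overall proportion, then filter
props = wine['country'].value_counts(normalize=True)
wine = wine[wine['country'].map(props) >= 0.01]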

using numpy to calculate mean

I am trying to calculate the mean of GNP for each country from 2006 to 2015, but when I apply the aggregation with the mean function, it does not calculate the mean across 2006 to 2015; instead, it just displays the values for each year. Please tell me what went wrong. I am able to sort by country, but the mean just won't work on the data.
wb_indicator = 'NY.GNP.ATLS.CD'
start_year = 2006
end_year = 2015

df_ex = wb.download(indicator=wb_indicator,
                    country=['all'],
                    start=start_year,
                    end=end_year)
df_ex1 = df_ex.reset_index()
df_ex1.groupby(['country']).agg({'NY.GNP.ATLS.CD': [np.mean]})
df_ex1.head(20)
Output:
                   country  year  NY.GNP.ATLS.CD
0               Arab World  2015    2.767920e+12
1               Arab World  2014    2.897113e+12
2               Arab World  2013    2.832769e+12
3               Arab World  2012    2.590610e+12
4               Arab World  2011    2.190786e+12
5               Arab World  2010    2.055967e+12
6               Arab World  2009    1.932056e+12
7               Arab World  2008    1.858270e+12
8               Arab World  2007    1.547924e+12
9               Arab World  2006    1.312967e+12
10  Caribbean small states  2015    6.680302e+10
11  Caribbean small states  2014    6.664219e+10
This should work:
import pandas as pd
import wbdata as wb
import datetime
wb_indicator = 'NY.GNP.ATLS.CD'
data_date = (datetime.datetime(2006, 1, 1), datetime.datetime(2015, 1, 1))
data = wb.get_data(wb_indicator, data_date=data_date, pandas=True)
gnp_means = data.reset_index().groupby('country').mean()
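For what it's worth, the likely culprit in the original snippet is that the groupby result was never kept: df_ex1.groupby(...).agg(...) returns a new frame, so df_ex1.head(20) then shows the untouched per-year data. Keeping the asker's own download, a minimal fix would be:
# Assign the aggregation instead of discarding it
gnp_means = df_ex1.groupby('country')['NY.GNP.ATLS.CD'].mean()
print(gnp_means.head(20))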
