Add rows in Pandas based on condition (grouping) - python

I've googled quite a bit about this and could not find an answer that applies to my problem. The issue I'm having is that I've got a dataframe where each row has a variable, and I want to insert rows with variable C whose value is the sum of variables A and B. Example:
TOWN YEAR Var Value
Amsterdam 2019 A 1
Amsterdam 2019 B 2
Amsterdam 2020 A 1
Amsterdam 2020 B 3
Rotterdam 2019 A 4
Rotterdam 2019 B 4
Rotterdam 2020 A 5
Rotterdam 2020 B 2
The desired output would insert a row that sums A and B for each group of rows that are identical in the other columns. My attempt backfired: I used groupby and sum, converted the result to a list, and then tried to append it as a separate column (var_C). It backfired because I had to duplicate each value to match the length of the original dataset, and in the end the length of the list still did not match the length of the original dataset.
# keep only the A and B rows
data_current = data[data['var'].isin(['A', 'B'])]
# sum A + B per TOWN/year group
data_var_c = data_current.groupby(['TOWN', 'year'])['value'].sum()
values = data_var_c.tolist()
# duplicate every sum to try to match the length of the original dataset
values_dup = [val for val in values for _ in (0, 1)]
len(values_dup)
Any feedback would be appreciated!

You can use groupby and pd.concat:
result = (
    pd.concat([
        df,
        df.groupby(['TOWN', 'YEAR'], as_index=False)
          .agg(sum)
          .assign(Var='C')
    ])
)
result = result.sort_values(['TOWN', 'YEAR', 'Var'])
OUTPUT:
TOWN YEAR Var Value
0 Amsterdam 2019 A 1
1 Amsterdam 2019 B 2
0 Amsterdam 2019 C 3
2 Amsterdam 2020 A 1
3 Amsterdam 2020 B 3
1 Amsterdam 2020 C 4
4 Rotterdam 2019 A 4
5 Rotterdam 2019 B 4
2 Rotterdam 2019 C 8
6 Rotterdam 2020 A 5
7 Rotterdam 2020 B 2
3 Rotterdam 2020 C 7

A pivot stack option:
import pandas as pd
df = pd.DataFrame({
    'TOWN': {0: 'Amsterdam', 1: 'Amsterdam', 2: 'Amsterdam', 3: 'Amsterdam',
             4: 'Rotterdam', 5: 'Rotterdam', 6: 'Rotterdam', 7: 'Rotterdam'},
    'YEAR': {0: 2019, 1: 2019, 2: 2020, 3: 2020, 4: 2019, 5: 2019, 6: 2020,
             7: 2020},
    'Var': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A', 7: 'B'},
    'Value': {0: 1, 1: 2, 2: 1, 3: 3, 4: 4, 5: 4, 6: 5, 7: 2}
})
new_df = df.pivot(index=['TOWN', 'YEAR'], columns='Var')['Value'] \
    .assign(C=lambda x: x.agg('sum', axis=1)) \
    .stack() \
    .rename('Value') \
    .reset_index()
print(new_df)
new_df:
TOWN YEAR Var Value
0 Amsterdam 2019 A 1
1 Amsterdam 2019 B 2
2 Amsterdam 2019 C 3
3 Amsterdam 2020 A 1
4 Amsterdam 2020 B 3
5 Amsterdam 2020 C 4
6 Rotterdam 2019 A 4
7 Rotterdam 2019 B 4
8 Rotterdam 2019 C 8
9 Rotterdam 2020 A 5
10 Rotterdam 2020 B 2
11 Rotterdam 2020 C 7

I overcomplicated it; it is as simple as grouping by TOWN and year, taking the value column and applying a sum transform to get the per-group sum:
data['c'] = data_current.groupby(['TOWN', 'year'])['value'].transform('sum')
This, however, is not quite the desired output, as it adds the summation as another column, whereas Nk03's answer adds the summation as a separate row.
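For completeness, a small sketch (my own, not from either answer above) of how those same group sums could be appended as separate 'C' rows instead of a column, assuming the lower-case column names used in the snippets above:
# sum A + B once per (TOWN, year) group, label the result as var 'C',
# and append those rows to the original A/B rows
sums = (data_current.groupby(['TOWN', 'year'], as_index=False)['value']
        .sum()
        .assign(var='C'))
data_with_c = (pd.concat([data_current, sums], ignore_index=True)
               .sort_values(['TOWN', 'year', 'var']))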

Related

How to do cumulative division in groupby level [0,1] in pandas dataframe based on conditions?

I have a dataframe to which I want to append rows computed with a groupby plus some additional conditions. I'm looking for a for loop or any other solution that works.
Or, if it's easier:
first melt the df, then add a new ratio (%) column, then unmelt.
As the calculations are customised, I think a for loop can get there, with or without groupby.
---Rows 6, 7, 8 of the desired output are my requirement.---
0-14 = child and unemployed
14-50 = young and working
50+ = old and unemployed
# ref rows 6, 7, 8 = showing which rows to add (+) and divide (/)
Currently I want to produce the 3 conditions in output rows 6, 7, 8:
d = {'year': [2019, 2019, 2019, 2020, 2020, 2020],
     'age group': ['(0-14)', '(14-50)', '(50+)', '(0-14)', '(14-50)', '(50+)'],
     'con': ['UK', 'UK', 'UK', 'US', 'US', 'US'],
     'population': [10, 20, 300, 400, 1000, 2000]}
df = pd.DataFrame(data=d)
df2 = df.copy()
df
year age group con population
0 2019 (0-14) UK 10
1 2019 (14-50) UK 20
2 2019 (50+) UK 300
3 2020 (0-14) US 400
4 2020 (14-50) US 1000
5 2020 (50+) US 2000
output required:
year age group con population
0 2019 (0-14) UK 10.0
1 2019 (14-50) UK 20.0
2 2019 (50+) UK 300.0
3 2020 (0-14) US 400.0
4 2020 (14-50) US 1000.0
5 2020 (50+) US 2000.0
6 2019 young vs child UK-young vs child 2.0 # 20/10
7 2019 old vs young UK-old vs young 15.0 #300/20
8 2019 unemployed vs working UK-unemployed vs working. 15.5 #(300+10)/20
My trials so far:
df2 = df.copy()
criteria = [df2['con'].str.contains('0-14'),
            df2['con'].str.contains('14-50'),
            df2['con'].str.contains('50+')]
# conditions should be according to requirements
values = ['young vs child', 'old vs young', 'unemployed vs working']
df2['con'] = df2['con'] + '_' + np.select(criteria, values, 0)
df2['age group'] = df2['age group'] + '_' + np.select(criteria, values, 0)
df.groupby(['year', 'age group', 'con']).sum().groupby(level=[0, 1]).cumdiv()
pd.concat([df, df2])
# ----errors: cumdiv() not found and missing conditions criteria-------
I also tried:
df['population'].div(df.groupby('con')['population'].shift(1))
#but looking for customisations into this
#so it can first sum rows and then divide
#according to unemployed condition-- row 8 reference.
CLOSEST TRIAL:
n_df_2 = df.copy()
con_list = [x for x in df.con]
year_list = [x for x in df.year]
age_list = [x for x in df['age group']]
new_list = ['young vs child', 'old vs young', 'unemployed vs working']
for country in con_list:
    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]
    bev_child.loc[:, 'population'] = bev_work.loc[:, 'population'].max() / bev_child.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[0]
    bev_child.loc[:, 'age group'] = new_list[0]
    s = n_df_2.append(bev_child, ignore_index=True)
    bev_child.loc[:, 'population'] = bev_child.loc[:, 'population'].max() + bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[2]
    bev_child.loc[:, 'age group'] = new_list[2]
    s = s.append(bev_child, ignore_index=True)
    bev_child.loc[:, 'population'] = bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[1]
    bev_child.loc[:, 'age group'] = new_list[1]
    s = s.append(bev_child, ignore_index=True)
s
year age group con population
0 2019 (0-14) UK 10.0
1 2019 (14-50) UK 20.0
2 2019 (50+) UK 300.0
3 2020 (0-14) US 400.0
4 2020 (14-50) US 1000.0
5 2020 (50+) US 2000.0
6 2020 young vs child US-young vs child 2.5
7 2020 unemployed vs working US-unemployed vs working 4.5
8 2020 old vs young US-old vs young 2.0
Also, please suggest the easiest possible way to solve this.
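No answer is reproduced here, but as a minimal sketch (my own, assuming the column names from the example above) the three ratio rows could be built with a plain groupby loop and appended with pd.concat:
import pandas as pd

rows = []
for (year, con), grp in df.groupby(['year', 'con']):
    # look up the three age bands inside this year/country group
    child = grp.loc[grp['age group'] == '(0-14)', 'population'].iloc[0]
    young = grp.loc[grp['age group'] == '(14-50)', 'population'].iloc[0]
    old = grp.loc[grp['age group'] == '(50+)', 'population'].iloc[0]
    rows.append({'year': year, 'age group': 'young vs child',
                 'con': f'{con}-young vs child', 'population': young / child})
    rows.append({'year': year, 'age group': 'old vs young',
                 'con': f'{con}-old vs young', 'population': old / young})
    rows.append({'year': year, 'age group': 'unemployed vs working',
                 'con': f'{con}-unemployed vs working',
                 'population': (old + child) / young})
result = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
print(result)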

How can I compare two dataframes row by row in pandas, case-insensitively?

Here is my data:
d = {'ID': [14, 14, 14, 14, 14, 14, 15, 15, 14], 'NAME': ['KWI', 'NED', 'RICK', 'NICH', 'DIONIC', 'RICHARD', 'ROCKY', 'CARLOS', 'SIDARTH'], 'ID_COUNTRY':[1, 2, 3,4,5,6,7,8,9], 'COUNTRY':['MEXICO', 'ITALY', 'CANADA', 'ENGLAND', 'GERMANY', 'UNITED STATES', 'JAPAN', 'SPAIN', 'BRAZIL'], 'ID_CITY':[10, 20, 21, 31, 18, 27, 36, 86, 28], 'CITY':['MX', 'IT', 'CA', 'ENG', 'GE', 'US', 'JP', 'SP', 'BZ'], 'STATUS': ['OK', 'OK', 'OK', 'OK', 'OK', 'NOT', 'OK', 'NOT', 'OK']}
df = pd.DataFrame(data=d)
df:
ID NAME ID_COUNTRY COUNTRY ID_CITY CITY STATUS
0 14 KWI 1 MEXICO 10 MX OK
1 14 NED 2 ITALY 20 IT OK
2 14 RICK 3 CANADA 21 CA OK
3 14 NICH 4 ENGLAND 31 ENG OK
4 14 DIONIC 5 GERMANY 18 GE OK
5 14 RICHARD 6 UNITED STATES 27 US NOT
6 14 ROCKY 7 JAPAN 36 JP OK
7 15 CARLOS 8 SPAIN 86 SP NOT
8 15 SIDHART 9 BRAZIL 28 BZ OK
The df is the base data. The data that I need to compare with df is df1:
d1 = {'ID': [14, 10, 14, 11, 14], 'NAME': ['Kwi', 'NED', 'riCK', 'nich', 'DIONIC'], 'ID_COUNTRY':[1, 2, 3, 6, 5], 'COUNTRY':['MXICO', 'itaLY', 'CANADA', 'ENGLAND', 'GERMANY'], 'ID_CITY':[10, 22, 21, 31, 18], 'CITY':['MX', 'AT', 'CA', 'ENG', 'EG'], 'STATUS': ['', 'OK', '', 'OK', '']}
df1 = pd.DataFrame(data=d1)
df1:
ID NAME ID_COUNTRY COUNTRY ID_CITY CITY STATUS
0 14 Kwi 1 MXICO 10 MX
1 10 NED 2 itaLY 22 AT OK
2 14 riCK 3 CANADA 21 CA
3 11 nich 6 ENGLAND 31 ENG OK
4 14 DIONIC 5 GERMANY 18 EG
Desired output 1 (the values that do not match must appear highlighted):
The data in df1 that does not match df is:
ID NAME ID_COUNTRY COUNTRY ID_CITY CITY STATUS
0 14 Kwi 1 *MXICO* 10 MX **
1 *10* NED 2 itaLY *22* AT OK
2 14 riCK 3 CANADA 21 CA **
3 *11* nich 6 ENGLAND 31 ENG OK
4 14 DIONIC 5 GERMANY 18 *EG* **
*TWO ROWS ARE MISSING*
Note: in this output the row-by-row comparison needs to be case-insensitive, so values such as itaLY, Kwi, riCK and nich are OK because they are the same.
Desired output 2:
The data in df1 that does not match df is in:
COUNTRY, STATUS with ID 14, NAME Kwi, ID_COUNTRY 1.
ID, ID_CITY, CITY with ID 10, NAME NED, ID_COUNTRY 2.
STATUS with ID 14, NAME riCK, ID_COUNTRY 3.
ID, ID_COUNTRY with ID 11, NAME nich, ID_COUNTRY 6.
CITY, STATUS with ID 14, NAME DIONIC, ID_COUNTRY 5.
TWO ROWS ARE MISSING.
The result only needs to compare over the length of df1, but there is also the possibility that rows mismatch on the ID from df: as I show here with ID 14, the rows with ID 15 are not considered. I think the second output is more specific and efficient; the first one would be slow to scan visually if there is a lot of data to compare.
I hope everyone understands the point of this issue and can find an answer. I have been struggling with this for some time and did not get the solution I want, which is why I came here. Thanks for reading, and I hope this contributes to the platform.
When one wants a case-insensitive comparison between strings in Python, one typically converts both strings to upper or lower case and then performs a traditional == or != comparison.
When using pandas, this can be achieved with the .str Series accessor, which exposes string methods such as .upper() and .lower(). In your case, a possible solution would be:
df, df1 = df.astype(str), df1.astype(str)
_df = df1.copy()
for i in df1.index:
    comparison = df.loc[i].str.upper() != df1.loc[i].str.upper()
    _df.loc[i, comparison] = '*' + df1.loc[i, comparison].astype(str) + '*'
If we print the resulting dataframe _df we get your desired output 1:
ID NAME ID_COUNTRY COUNTRY ID_CITY CITY STATUS
0 14 Kwi 1 *MXICO* 10 MX **
1 *10* NED 2 itaLY *22* *AT* OK
2 14 riCK 3 CANADA 21 CA **
3 *11* nich *6* ENGLAND 31 ENG OK
4 14 DIONIC 5 GERMANY 18 *EG* **
In this case I'm assuming that corresponding rows have the same index across both dataframes.
For your second desired output, you can just iterate over each row again:
print("Data in df1 that does't match df:")
for i in _df.index:
not_matching_cols = _df.loc[i].str.endswith('*')
if not_matching_cols.any():
print(','.join(_df.loc[i, not_matching_cols].index), end=' ')
print('with', 'NAME', df1.loc[i, 'NAME'], 'ID_COUNTRY', df1.loc[i, 'ID_COUNTRY'])
If you also want to print the numbers of rows missing on df1 you can just add
print(df.shape[0] - df1.shape[0], 'ROWS ARE MISSING')
The output of this last part should be:
Data in df1 that doesn't match df:
COUNTRY,STATUS with NAME Kwi ID_COUNTRY 1
ID,ID_CITY,CITY with NAME NED ID_COUNTRY 2
STATUS with NAME riCK ID_COUNTRY 3
ID,ID_COUNTRY with NAME nich ID_COUNTRY 6
CITY,STATUS with NAME DIONIC ID_COUNTRY 5
4 ROWS ARE MISSING
I am not sure what code you have been using to compare it row by row or what your conditions are, but one thing you can try is converting all the string rows to lowercase strings...
df.update(df.select_dtypes('object').applymap(str.lower))
# 'object' is used to refer to strings in pandas dtypes
Or if you want to preserve the original columns, you could try making new, temporary columns...
df['name_lower'] = df['NAME'].apply(str.lower)
df1['name_lower'] = df1['NAME'].apply(str.lower)
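A possible follow-up (my own sketch, not part of the answer) showing how such temporary columns could then be used for a case-insensitive row-wise check, assuming both frames keep their original row labels:
# rows present in both frames, compared on the lowercase helper column
common = df.index.intersection(df1.index)
name_matches = df.loc[common, 'name_lower'] == df1.loc[common, 'name_lower']
print(name_matches)  # True where NAME matches regardless of case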

Missing first row when constructing a Series from a DataFrame

I have a dictionary I call 'test_dict'
test_dict = {'OBJECTID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'Country': {0: 'Vietnam',
1: 'Vietnam',
2: 'Vietnam',
3: 'Vietnam',
4: 'Vietnam'},
'Location': {0: 'Nha Trang',
1: 'Hue',
2: 'Phu Quoc',
3: 'Chu Lai',
4: 'Lao Bao'},
'Lat': {0: 12.250000000000057,
1: 16.401000000000067,
2: 10.227000000000032,
3: 15.406000000000063,
4: 16.627300000000048},
'Long': {0: 109.18333300000006,
1: 107.70300000000009,
2: 103.96700000000004,
3: 108.70600000000007,
4: 106.59970000000004}}
That I convert to a DataFrame
test_df = pd.DataFrame(test_dict)
and I get this:
OBJECTID Country Location Lat Long
0 1 Vietnam Nha Trang 12.2500 109.183333
1 2 Vietnam Hue 16.4010 107.703000
2 3 Vietnam Phu Quoc 10.2270 103.967000
3 4 Vietnam Chu Lai 15.4060 108.706000
4 5 Vietnam Lao Bao 16.6273 106.599700
I want to construct a Series with the location names, and I would like the column "OBJECTID" to be the index. When I try it, I lose the first row.
pd.Series(test_df.Location, index=test_df.OBJECTID)
I get this:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object
What I was hoping to get was this:
OBJECTID
1 Nha Trang
2 Hue
3 Phu Quoc
4 Chu Lai
5 Lao Bao
What am I doing wrong here? Why is the process of converting into a Series losing the first row?
You can fix your code via
pd.Series(test_df.Location.values, index=test_df.OBJECTID)
because the problem is that test_df.Location already has an index of its own starting at 0, and pandas aligns on that existing index rather than repositioning the values; passing .values strips the old index.
Edit - my preferred alternative:
test_df.set_index('OBJECTID')['Location']
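A small self-contained demonstration of the alignment behaviour described above:
import pandas as pd

s = pd.Series(['a', 'b', 'c'])               # default integer index 0, 1, 2
print(pd.Series(s, index=[1, 2, 3]))         # b, c, NaN -> aligned on the old index
print(pd.Series(s.values, index=[1, 2, 3]))  # a, b, c   -> values simply relabelled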
You can use:
pd.Series(test_df.Location).reindex(test_df.OBJECTID)
Result:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object

I have country, start and end year for all baseball players. I need to know how many players per country played each year

I have a dataset with 20,000 players. Columns are birthCountry, debut_year and final_year.
birthCountry debut_year final_year
0 USA 2004 2015
1 USA 1954 1976
2 USA 1962 1971
3 USA 1977 1990
4 USA 2001 2006
I need to get a table as follows:
1980 1981 1982
USA 50 49 48
CANADA XX XX XX
MEXICO XX XX XX
...
Where each cell represents the number of players that were born in a particular country, that played during that year.
I created a nested list, containing all years that each player played. The length of this list is the same as the length of the df. In the df, I created one additional column per year and I tried to add 1 for each player/year combination.
The idea was to use this to create a groupby or pivot_table
# create a list of years
years = list(range(min(df['debut_year'].values), max(df['final_year'].values) + 1))
# create a list of countries
countries = df.birthCountry.unique()
# add columns for years
for n in range(1841, 2019):  # years are from 1841 to 2018
    df[n] = ''
# now I have one additional column for every year. A lot of new empty columns
# temporary lists
templist = list(range(0, len(df)))
# every element of the following list contains all the years each player played
templist2 = []
for i in templist:
    templist2.append(list(range(int(df.iloc[i, 1]), int(df.iloc[i, 2]))))
# add 1 if the player played that year
for i in range(len(df)):
    for j in templist2[i]:
        df.iloc[i][j] = 1
It ran for some time and then nothing changed in the original dataframe.
You can probably find a better, more elegant solution.
To limit the size of the example, I created the following source DataFrame:
df = pd.DataFrame(data=[[ 1, 'USA', 1974, 1978 ], [ 2, 'USA', 1976, 1981 ],
[ 3, 'USA', 1975, 1979 ], [ 4, 'USA', 1977, 1980 ],
[ 5, 'Mex', 1976, 1979 ], [ 6, 'Mex', 1978, 1980 ]],
columns=['Id', 'birthCountry', 'debut_year', 'final_year'])
The first step of the actual computation is to create a Series containing the years in which each player was active:
years = df.apply(lambda row: pd.Series(range(row.debut_year, row.final_year + 1)),
                 axis=1).stack().astype(int).rename('year')
The second step is to create an auxiliary DataFrame - a join of
df.birthCountry and years:
df2 = df[['birthCountry']].join(years.reset_index(level=1, drop=True))
And the last step is to produce the actual result:
df2.groupby(['birthCountry', 'year']).size().rename('Count')\
.unstack().fillna(0, downcast='infer')
For the above test data, the result is:
year 1974 1975 1976 1977 1978 1979 1980 1981
birthCountry
Mex 0 0 1 1 2 2 1 0
USA 1 2 3 4 4 3 2 1
I think my solution is more "pandasonic" than the one proposed earlier by Remy.
I was able to come up with the following solution, if I understand the structure of your df variable correctly. I made a list of dictionaries (using a smaller range of years) with the same structure for my example:
df = [{'birthCountry': 'USA', 'debut_year': 2012, 'final_year': 2016},
      {'birthCountry': 'CANADA', 'debut_year': 2010, 'final_year': 2016},
      {'birthCountry': 'USA', 'debut_year': 2012, 'final_year': 2017},
      {'birthCountry': 'CANADA', 'debut_year': 2012, 'final_year': 2017},
      {'birthCountry': 'MEXICO', 'debut_year': 2012, 'final_year': 2016}]
countries = {}
for field in df:
    if field['birthCountry'] not in countries.keys():
        countries[field['birthCountry']] = {year: 0 for year in range(2010, 2019)}
    for year in range(field['debut_year'], field['final_year'] + 1):  # include the final year
        countries[field['birthCountry']][year] += 1
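If the nested dictionary should end up as the requested country-by-year table, it can be turned into a DataFrame as a follow-up (a sketch, not part of the original answer):
import pandas as pd

# countries is {country: {year: count}}; building a DataFrame gives years as the
# index and countries as columns, so transpose to put countries on the rows
counts = pd.DataFrame(countries).T
print(counts)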

iterate through df column and return value in dataframe based on row index, column reference

My goal is to compare each value from the column "year" against the matching year column (i.e. 1999, 2000, ...) and return the corresponding value from that column. For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains Afghanistan.
Here is the table. For reference, this table is the result of a SQL join between educational attainment in a single defined year and a table of GDP per country for the years 1999-2010. My ultimate goal is to return the GDP from the year the educational data is from.
country year men_ed_yrs women_ed_yrs total_ed_yrs 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Afghanistan 2004 11 5 8 NaN NaN 2461666315 4128818042 4583648922 5285461999 6.275076e+09 7.057598e+09 9.843842e+09 1.019053e+10 1.248694e+10 1.593680e+10
1 Albania 2004 11 11 11 3414760915 3632043908 4060758804 4435078648 5746945913 7314865176 8.158549e+09 8.992642e+09 1.070101e+10 1.288135e+10 1.204421e+10 1.192695e+10
2 Algeria 2005 13 13 13 48640611686 54790060513 54744714110 56760288396 67863829705 85324998959 1.030000e+11 1.170000e+11 1.350000e+11 1.710000e+11 1.370000e+11 1.610000e+11
3 Andorra 2008 11 12 11 1239840270 1401694156 1484004617 1717563533 2373836214 2916913449 3.248135e+09 3.536452e+09 4.010785e+09 4.001349e+09 3.649863e+09 3.346317e+09
4 Anguilla 2008 11 11 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
gdp_ed_list = []
for value in df_combined_column_named['year']:  # loops through each year in year column
    if value in df_combined_column_named.columns:  # compares year to column names
        idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()  # supposed to get the index associated with value
        gdp_ed = df_combined_column_named.get_value(idx, value)  # get the value of the cell found at idx, value
        gdp_ed_list.append(gdp_ed)  # append to a list
Currently, my code is getting stuck at the .index.tolist() line. It is returning the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-85-361acb97edd4> in <module>()
2 for value in df_combined_column_named['year']: #loops through each year in year column
3 if value in df_combined_column_named.columns: #compares year to column names
----> 4 idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()
5 gdp_ed = df_combined_column_named.get_value(idx, value)
6 gdp_ed_list.append(gdp_ed)
KeyError: u'2004'
Any thoughts?
It looks like you are trying to match the value in the year column to column labels and then extract the value from the corresponding cells. You could do that by looping through the rows (see below), but I think it would not be the fastest way.
Instead, you could use pd.melt to coalesce the columns with year-like labels into a single column, say, year_col:
In [38]: melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs', 'total_ed_yrs'], var_name='year_col')
In [39]: melted
Out[39]:
country year men_ed_yrs women_ed_yrs total_ed_yrs year_col value
0 Afghanistan 2004 11 5 8 1999 NaN
1 Albania 2004 11 11 11 1999 3.414761e+09
2 Algeria 2005 13 13 13 1999 4.864061e+10
3 Andorra 2008 11 12 11 1999 1.239840e+09
4 Anguilla 2008 11 11 11 1999 NaN
5 Afghanistan 2004 11 5 8 2000 NaN
...
The benefit of "melting" the DataFrame in this way is that
now you would have both year and year_col columns. The values you are looking for are in the rows where year equals year_col. And that is easy to obtain by using .loc:
In [41]: melted.loc[melted['year'] == melted['year_col']]
Out[41]:
country year men_ed_yrs women_ed_yrs total_ed_yrs year_col \
25 Afghanistan 2004 11 5 8 2004
26 Albania 2004 11 11 11 2004
32 Algeria 2005 13 13 13 2005
48 Andorra 2008 11 12 11 2008
49 Anguilla 2008 11 11 11 2008
value
25 5.285462e+09
26 7.314865e+09
32 1.030000e+11
48 4.001349e+09
49 NaN
Thus, you could use
import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame({'1999': [nan, 3414760915.0, 48640611686.0, 1239840270.0, nan],
'2000': [nan, 3632043908.0, 54790060513.0, 1401694156.0, nan],
'2001': [2461666315.0, 4060758804.0, 54744714110.0, 1484004617.0, nan],
'2002': [4128818042.0, 4435078648.0, 56760288396.0, 1717563533.0, nan],
'2003': [4583648922.0, 5746945913.0, 67863829705.0, 2373836214.0, nan],
'2004': [5285461999.0, 7314865176.0, 85324998959.0, 2916913449.0, nan],
'2005': [6275076000.0, 8158549000.0, 103000000000.0, 3248135000.0, nan],
'2006': [7057598000.0, 8992642000.0, 117000000000.0, 3536452000.0, nan],
'2007': [9843842000.0, 10701010000.0, 135000000000.0, 4010785000.0, nan],
'2008': [10190530000.0, 12881350000.0, 171000000000.0, 4001349000.0, nan],
'2009': [12486940000.0, 12044210000.0, 137000000000.0, 3649863000.0, nan],
'2010': [15936800000.0, 11926950000.0, 161000000000.0, 3346317000.0, nan],
'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Anguilla'],
'men_ed_yrs': [11, 11, 13, 11, 11],
'total_ed_yrs': [8, 11, 13, 11, 11],
'women_ed_yrs': [5, 11, 13, 12, 11],
'year': ['2004', '2004', '2005', '2008', '2008']})
melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs',
'total_ed_yrs'], var_name='year_col')
result = melted.loc[melted['year'] == melted['year_col']]
print(result)
Why was a KeyError raised:
The KeyError is being raised by df_combined_column_named['year'][value]. Suppose value is '2004'. Then df_combined_column_named['year'] is a Series containing string representations of years and indexed by integers (like 0, 1, 2, ...). df_combined_column_named['year'][value] fails because it attempts to index this Series with the string '2004' which is not in the integer index.
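A minimal illustration of that failure mode (my own example, not from the question):
import pandas as pd

year_col = pd.Series(['2004', '2004', '2005'])  # default integer index 0, 1, 2
year_col['2004']  # raises KeyError: '2004' -- that label is not in the integer index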
Alternatively, here is another way to achieve the goal by looping through the rows using iterrows. This is perhaps simpler to understand, but in general using iterrows is slow compared to other column-based Pandas-centric methods:
data = []
for idx, row in df.iterrows():
    data.append((row['country'], row['year'], row[row['year']]))
result = pd.DataFrame(data, columns=['country', 'year', 'value'])
print(result)
prints
country year value
0 Afghanistan 2004 5.285462e+09
1 Albania 2004 7.314865e+09
2 Algeria 2005 1.030000e+11
3 Andorra 2008 4.001349e+09
4 Anguilla 2008 NaN
