I am using the networkx package in Python and I have a dataframe.
(Sample dataframe)
from to count
v0 v1 0.1
v0 v2 0.15
v0 v3 0.15
v0 v4 0.25
v0 v5 0.15
and so on..
Sample picture (weighted directed graph)
That is my dataframe.
{'grad': {0: 'CUHK', 1: 'CUHK', 2: 'CUHK', 3: 'CUHK', 4: 'CUHK', 5: 'CityU', 6: 'CityU', 7: 'CityU', 8: 'CityU', 9: 'HKU', 10: 'HKU', 11: 'HKU', 12: 'HKUST', 13: 'HKUST', 14: 'HKUST', 15: 'HKUST', 16: 'HKUST', 17: 'HKUST', 18: 'Low Frequency', 19: 'Low Frequency', 20: 'Low Frequency', 21: 'Low Frequency', 22: 'Low Frequency', 23: 'Low Frequency', 24: 'PolyU', 25: 'PolyU', 26: 'PolyU', 27: 'PolyU'}, 'to': {0: 'CUHK', 1: 'CityU', 2: 'HKU', 3: 'LingU', 4: 'PolyU', 5: 'CityU', 6: 'HKU', 7: 'LingU', 8: 'PolyU', 9: 'CityU', 10: 'HKU', 11: 'PolyU', 12: 'CUHK', 13: 'CityU', 14: 'HKU', 15: 'HKUST', 16: 'LingU', 17: 'PolyU', 18: 'CUHK', 19: 'CityU', 20: 'HKU', 21: 'HKUST', 22: 'LingU', 23: 'PolyU', 24: 'CityU', 25: 'HKU', 26: 'LingU', 27: 'PolyU'}, 'count': {0: 9, 1: 5, 2: 3, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 3, 9: 3, 10: 9, 11: 4, 12: 2, 13: 1, 14: 2, 15: 1, 16: 4, 17: 4, 18: 49, 19: 34, 20: 29, 21: 34, 22: 3, 23: 36, 24: 1, 25: 1, 26: 1, 27: 11}}
The ranking principle is: when the weight of Vx -> Vy is bigger than that of Vy -> Vx, Vx has a higher rank than Vy.
e.g. V0 -> V5 = 0.2 and V5 -> V0 = 0.5, so V5 has a higher rank.
Now I am using a brute-force method, which loops over and checks all the relationships. When the condition is met, I change their order in a new list -> {V0,V1,V2,V3,V4,V5,V6,V7}
I want an elegant solution to rank these nodes. Maybe I can get some partial orders like V5>V0 and V0>V1 and use them to form a global order V5>V0>V1, but I don't know how to achieve it. Is there any method better than brute force? Is this related to any famous problem?
One way of doing this would be the following:
import networkx as nx
import pandas as pd
data = {'grad': {0: 'CUHK', 1: 'CUHK', 2: 'CUHK', 3: 'CUHK', 4: 'CUHK', 5: 'CityU', 6: 'CityU', 7: 'CityU', 8: 'CityU', 9: 'HKU', 10: 'HKU', 11: 'HKU', 12: 'HKUST', 13: 'HKUST', 14: 'HKUST', 15: 'HKUST', 16: 'HKUST', 17: 'HKUST', 18: 'Low Frequency', 19: 'Low Frequency', 20: 'Low Frequency', 21: 'Low Frequency', 22: 'Low Frequency', 23: 'Low Frequency', 24: 'PolyU', 25: 'PolyU', 26: 'PolyU', 27: 'PolyU'},
'to': {0: 'CUHK', 1: 'CityU', 2: 'HKU', 3: 'LingU', 4: 'PolyU', 5: 'CityU', 6: 'HKU', 7: 'LingU', 8: 'PolyU', 9: 'CityU', 10: 'HKU', 11: 'PolyU', 12: 'CUHK', 13: 'CityU', 14: 'HKU', 15: 'HKUST', 16: 'LingU', 17: 'PolyU', 18: 'CUHK', 19: 'CityU', 20: 'HKU', 21: 'HKUST', 22: 'LingU', 23: 'PolyU', 24: 'CityU', 25: 'HKU', 26: 'LingU', 27: 'PolyU'},
'count': {0: 9, 1: 5, 2: 3, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 3, 9: 3, 10: 9, 11: 4, 12: 2, 13: 1, 14: 2, 15: 1, 16: 4, 17: 4, 18: 49, 19: 34, 20: 29, 21: 34, 22: 3, 23: 36, 24: 1, 25: 1, 26: 1, 27: 11}}
df = pd.DataFrame(data)
# Build a weighted directed graph from the edge list
G = nx.from_pandas_edgelist(df, 'grad', 'to', edge_attr='count', create_using=nx.DiGraph())
# Compute weighted PageRank and sort the nodes by score, highest first
pagerank = nx.pagerank(G, weight='count')
sorted_pagerank = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)
This returns a list of tuples with the node and its PageRank score, sorted in descending order of the PageRank score.
[('PolyU', 0.4113039270586079),
('HKU', 0.1945013448661985),
('CityU', 0.14888513201115303),
('LingU', 0.09978025157613143),
('CUHK', 0.07069262490080512),
('HKUST', 0.041291981078138223),
('Low Frequency', 0.03354473850896578)]
If you want to visualize this on the graph:
import matplotlib.pyplot as plt
import networkx as nx
G = nx.from_pandas_edgelist(df, 'grad', 'to', edge_attr='count', create_using=nx.DiGraph())
pagerank = nx.pagerank(G, weight='count')
sorted_pagerank = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)
pos = nx.spring_layout(G)
# Scale the PageRank scores (which sum to 1) up so the node sizes are visible
nx.draw(G, pos, with_labels=True, node_color='skyblue', node_size=[v * 10000 for v in pagerank.values()])
labels = nx.get_edge_attributes(G, 'count')
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
plt.show()
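On the "famous problem" part of the question: turning consistent pairwise comparisons into a global order is topological sorting; when the comparisons contain cycles, no total order exists, which is why a score-based method like the PageRank above is useful. A minimal sketch, using hypothetical weights (not the asker's real data):

```python
import networkx as nx

# Hypothetical pairwise weights for illustration: w[(x, y)] is the
# weight of the edge x -> y.
w = {('V0', 'V5'): 0.2, ('V5', 'V0'): 0.5,
     ('V0', 'V1'): 0.4, ('V1', 'V0'): 0.1}

# Dominance graph: draw x -> y whenever x outranks y, i.e. w(x,y) > w(y,x)
D = nx.DiGraph()
for (x, y), wxy in w.items():
    if wxy > w.get((y, x), 0):
        D.add_edge(x, y)

# If the dominance relation is acyclic, a topological sort gives a global
# order consistent with every pairwise comparison
order = list(nx.topological_sort(D))
print(order)  # → ['V5', 'V0', 'V1']
```

If `topological_sort` raises `NetworkXUnfeasible`, the pairwise preferences are cyclic and only an approximate ranking (e.g. PageRank) is possible.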
I am trying to create boxplots for 24 hours, each hour already having the maxValue, quartile75, median, quartile25 and minValue. Those values are stored in a dataframe; I put them into a dict:
{'hour': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20,
21: 21,
22: 22,
23: 23},
'minValue': {0: -491.69,
1: -669.49,
2: -551.22,
3: -514.2,
4: -506.94,
5: -665.7,
6: -484.89,
7: -488.99,
8: -524.22,
9: -851.9,
10: -610.0,
11: -998.8,
12: -580.57,
13: -737.22,
14: -895.2,
15: -500.0,
16: -852.0,
17: -610.0,
18: -500.0,
19: -610.0,
20: -1000.0,
21: -674.0,
22: -1005.0,
23: -499.33},
'quartile25': {0: 114.94,
1: 119.29,
2: 128.8,
3: 139.8,
4: 151.48,
5: 146.75,
6: 139.1,
7: 125.02,
8: 110.0,
9: 105.0,
10: 94.9,
11: 92.81,
12: 107.62,
13: 134.5,
14: 150.8,
15: 168.51,
16: 175.71,
17: 163.0,
18: 142.57,
19: 139.3,
20: 139.45,
21: 120.68,
22: 116.89,
23: 112.84},
'median': {0: 188.53,
1: 193.2,
2: 206.6,
3: 222.2,
4: 234.58,
5: 227.68,
6: 218.32,
7: 200.93,
8: 190.92,
9: 182.6,
10: 175.01,
11: 176.87,
12: 192.33,
13: 210.38,
14: 227.0,
15: 243.87,
16: 252.1,
17: 245.45,
18: 226.86,
19: 219.6,
20: 209.09,
21: 192.32,
22: 187.4,
23: 184.94},
'quartile75': {0: 292.1,
1: 295.33,
2: 316.62,
3: 340.8,
4: 357.0,
5: 345.3,
6: 330.4,
7: 305.28,
8: 290.4,
9: 280.1,
10: 268.23,
11: 270.99,
12: 301.84,
13: 321.04,
14: 345.61,
15: 373.84,
16: 393.39,
17: 382.79,
18: 359.89,
19: 341.55,
20: 325.5,
21: 292.1,
22: 287.2,
23: 285.96},
'maxValue': {0: 2420.3,
1: 1450.0,
2: 2852.0,
3: 7300.0,
4: 3967.0,
5: 3412.1,
6: 6999.99,
7: 2999.99,
8: 6000.0,
9: 3000.0,
10: 8885.9,
11: 9999.0,
12: 6254.0,
13: 2300.0,
14: 2057.58,
15: 2860.0,
16: 5000.0,
17: 4151.01,
18: 7000.0,
19: 3000.0,
20: 6000.0,
21: 3000.5,
22: 2000.0,
23: 2500.0}}
When I used a normal time series data set I plotted like this:
import numpy as np
import plotly.graph_objects as go

N = 24
c = ['hsl(' + str(h) + ',50%,50%)' for h in np.linspace(0, 360, N)]
fig = go.Figure(data=[go.Box(
    x=hour_dataframes[i]['hour'],
    y=hour_dataframes[i]['priceNum'],
    marker_color=c[i]
) for i in range(N)])
fig.update_layout(
    xaxis=dict(showgrid=True, zeroline=True, showticklabels=True),
    yaxis=dict(zeroline=True, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
    autosize=False,
    width=1500,
    height=1000,
)
fig.show()
It worked fine, but the data set became too big and JupyterLab started crashing, so I pulled aggregated data. Now I don't know how to plot multiple boxes (like the code above does) using the exact box-plot values.
I have a dataframe with the fields "nome", "acesspoint", "dia", "momento", "latitude" and "longitude". The field "momento" is a 15-minute interval.
I need to count the number of users I have in each location according to the "dia" and "momento".
Example: On 03/06/2022 at 08:00 at DCF I have 2 users (bruno and Thiago). On 03/06/2022 at 08:15 at DCF I have 2 users. On 03/06/2022 at 08:00 at DCC I have 1 user (Maria).
Output for the above example:
print(weight_list_access)
[-21.22604, -44.97349, 2], [-21.22604, -44.97349, 2], [-21.22780, -44.97850, 1]
#[latitude, longitude, counter]
Dict:
dicionario = {'nome': {0: ' bruno', 1: ' bruno', 2: ' bruno', 3: ' bruno', 4: ' bruno', 5: ' bruno', 6: ' bruno', 7: ' bruno', 8: ' Thiago', 9: ' Thiago', 10: ' Thiago', 11: ' Thiago', 12: ' Thiago', 13: ' Thiago', 14: ' Thiago', 15: ' Thiago', 16: ' Maria', 17: ' Maria', 18: ' Maria', 19: ' Maria', 20: ' Maria', 21: ' Maria', 22: ' Maria', 23: ' Maria', 24: ' Thiago', 25: ' Thiago', 26: ' Thiago', 27: ' Thiago', 28: ' Thiago', 29: ' Thiago', 30: ' Thiago', 31: ' Thiago'}, 'acesspoint': {0: 'DCF', 1: 'DCF', 2: 'DCF', 3: 'DCF', 4: 'DCF', 5: 'DCF', 6: 'DCF', 7: 'DCF', 8: 'DCF', 9: 'DCF', 10: 'DCF', 11: 'DCF', 12: 'DCF', 13: 'DCF', 14: 'DCF', 15: 'DCF', 16: 'DCC', 17: 'DCC', 18: 'DCC', 19: 'DCC', 20: 'DCC', 21: 'DCC', 22: 'DCC', 23: 'DCC', 24: 'DEX', 25: 'DEX', 26: 'DEX', 27: 'DEX', 28: 'DEX', 29: 'DEX', 30: 'DEX', 31: 'DEX'}, 'dia': {0: '03/06/2022', 1: '03/06/2022', 2: '03/06/2022', 3: '03/06/2022', 4: '03/06/2022', 5: '03/06/2022', 6: '03/06/2022', 7: '03/06/2022', 8: '03/06/2022', 9: '03/06/2022', 10: '03/06/2022', 11: '03/06/2022', 12: '03/06/2022', 13: '03/06/2022', 14: '03/06/2022', 15: '03/06/2022', 16: '03/06/2022', 17: '03/06/2022', 18: '03/06/2022', 19: '03/06/2022', 20: '03/06/2022', 21: '03/06/2022', 22: '03/06/2022', 23: '03/06/2022', 24: '04/06/2022', 25: '04/06/2022', 26: '04/06/2022', 27: '04/06/2022', 28: '04/06/2022', 29: '04/06/2022', 30: '04/06/2022', 31: '04/06/2022'}, 'momento': {0: '08:00', 1: '08:30', 2: '08:45', 3: '09:00', 4: '09:15', 5: '09:30', 6: '09:45', 7: '10:00', 8: '08:00', 9: '08:30', 10: '08:45', 11: '09:00', 12: '09:15', 13: '09:30', 14: '09:45', 15: '10:00', 16: '08:00', 17: '08:30', 18: '08:45', 19: '09:00', 20: '09:15', 21: '09:30', 22: '09:45', 23: '10:00', 24: '08:00', 25: '08:30', 26: '08:45', 27: '09:00', 28: '09:15', 29: '09:30', 30: '09:45', 31: '10:00'}, 'Latitude': {0: -21.22604, 1: -21.22604, 2: -21.22604, 3: -21.22604, 4: -21.22604, 5: -21.22604, 6: -21.22604, 7: -21.22604, 8: -21.22604, 9: -21.22604, 10: -21.22604, 
11: -21.22604, 12: -21.22604, 13: -21.22604, 14: -21.22604, 15: -21.22604, 16: -21.2278, 17: -21.2278, 18: -21.2278, 19: -21.2278, 20: -21.2278, 21: -21.2278, 22: -21.2278, 23: -21.2278, 24: -21.22707, 25: -21.22707, 26: -21.22707, 27: -21.22707, 28: -21.22707, 29: -21.22707, 30: -21.22707, 31: -21.22707}, 'Longitude': {0: -44.97349, 1: -44.97349, 2: -44.97349, 3: -44.97349, 4: -44.97349, 5: -44.97349, 6: -44.97349, 7: -44.97349, 8: -44.97349, 9: -44.97349, 10: -44.97349, 11: -44.97349, 12: -44.97349, 13: -44.97349, 14: -44.97349, 15: -44.97349, 16: -44.9785, 17: -44.9785, 18: -44.9785, 19: -44.9785, 20: -44.9785, 21: -44.9785, 22: -44.9785, 23: -44.9785, 24: -44.97849, 25: -44.97849, 26: -44.97849, 27: -44.97849, 28: -44.97849, 29: -44.97849, 30: -44.97849, 31: -44.97849}}
I used the following code:
df_acesso = pd.DataFrame(dicionario)
weight_list_access = []
df_acesso['counter'] = 1
for x in df_acesso['dia'].sort_values().unique():
    weight_list_access.append(
        df_acesso.loc[df_acesso['dia'] == x, ['Latitude', 'Longitude', 'counter']]
                 .groupby(['Latitude', 'Longitude'])
                 .sum()
                 .reset_index()
                 .values.tolist()
    )
With this code I am counting all the connections of the day ("dia") without considering the "momento" field (time interval). I tried doing it with nested for loops over the "momento" field, but it did not work.
How can I do this?
Does this do what you need?
df = pd.DataFrame(dicionario)
df["coords"] = pd.Series(zip(df["Latitude"], df["Longitude"]))
df.groupby(["dia", "momento"])["coords"].value_counts().rename("count").reset_index()
Result
dia momento coords count
0 03/06/2022 08:00 (-21.22604, -44.97349) 2
1 03/06/2022 08:00 (-21.2278, -44.9785) 1
2 03/06/2022 08:30 (-21.22604, -44.97349) 2
3 03/06/2022 08:30 (-21.2278, -44.9785) 1
4 03/06/2022 08:45 (-21.22604, -44.97349) 2
5 03/06/2022 08:45 (-21.2278, -44.9785) 1
6 03/06/2022 09:00 (-21.22604, -44.97349) 2
...
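If the [latitude, longitude, counter] list-of-lists from the question is still needed, one per ("dia", "momento") slot, a possible follow-up, sketched here on a hypothetical three-user subset of the data:

```python
import pandas as pd

# Hypothetical three-user subset of the data above, enough to show the shape
df = pd.DataFrame({
    'nome':      ['bruno', 'Thiago', 'Maria', 'bruno', 'Thiago'],
    'dia':       ['03/06/2022'] * 5,
    'momento':   ['08:00', '08:00', '08:00', '08:30', '08:30'],
    'Latitude':  [-21.22604, -21.22604, -21.2278, -21.22604, -21.22604],
    'Longitude': [-44.97349, -44.97349, -44.9785, -44.97349, -44.97349],
})

# Count users per location within each (dia, momento) slot
counts = (df.groupby(['dia', 'momento', 'Latitude', 'Longitude'])
            .size()
            .rename('counter')
            .reset_index())

# One [latitude, longitude, counter] list per (dia, momento) slot
weight_list_access = [
    grp[['Latitude', 'Longitude', 'counter']].values.tolist()
    for _, grp in counts.groupby(['dia', 'momento'])
]
print(weight_list_access)
```

Note that .values promotes the integer counter to float, as in the original code; cast back with int() if the distinction matters.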
I am currently working on a problem that involves iterating over a large dataset of measurements spread out over distance (each measurement has a given lat/lon). I'm looking for a more Pythonic/efficient way to solve this than what I have now, since Jupyter Notebook doesn't finish running after around 40000 iterations (I need around 300000).
My current solution is the following code, where the window size is 100 m:
from math import radians, sin, cos, atan2, sqrt
import math

import pandas as pd
from scipy import stats
from scipy.optimize import curve_fit

for m in range(6):
    co_means = list(dfs[m]['co'])
    dates = list(pd.to_datetime(dfs[m]['gps_time']))
    dfs[m]['co'] = dfs[m]['co'] * 1000
    R = 6373.0  # Earth radius in km
    for i in range(len(co_means) - 3):
        current_list_co2 = []
        current_list_co = []
        k = i
        lat1 = radians(dfs[m]['lat'][i])
        lon1 = radians(dfs[m]['lon'][i])
        if dates[k] != dates[-1]:
            distance = 0
            # gather all points within 100 m of point i
            while distance < 100:
                lat2 = radians(dfs[m]['lat'][k])
                lon2 = radians(dfs[m]['lon'][k])
                dlon = lon2 - lon1
                dlat = lat2 - lat1
                a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
                c = 2 * atan2(sqrt(a), sqrt(1 - a))
                distance = R * c * 1000  # haversine distance in metres
                if distance < 100:
                    current_list_co2.append(dfs[m]['co2d'][k])
                    current_list_co.append(dfs[m]['co'][k])
                    k += 1
                    if dates[k] == dates[-1]:
                        break
        # only do calculations for windows that aren't empty
        if len(current_list_co2) != 0 and len(current_list_co) != 0:
            results.write("%s" % (dates[i]))
            # for the ratio of co:co2
            a_fit, cov = curve_fit(linear_function, current_list_co2, current_list_co)
            y_int = a_fit[0]
            slope = a_fit[1]
            err_yint = math.sqrt(cov[0][0])
            err_slope = math.sqrt(cov[1][1])
            # to find r2:
            z = stats.linregress(current_list_co2, current_list_co)
            r2 = z[2]**2
            x_list = list(range(1, len(current_list_co2) + 1))
            # for linregress parameters for co2 and co individually
            a_fit_co2, cov_co2 = curve_fit(linear_function, x_list, current_list_co2)
            a_fit_co, cov_co = curve_fit(linear_function, x_list, current_list_co)
Any help would be greatly appreciated!
Edit: Sample of the dataset, as a dict:
{'co': {0: 425.07144266999995, 1: 425.06915346999995, 2: 425.06915346999995, 3: 433.21636567, 4: 433.21636567,
5: 433.21803501999995, 6: 433.21803501999995, 7: 411.10666247, 8: 411.10666247, 9: 411.38779539999996,
10: 411.38779539999996, 11: 420.62025938000005, 12: 420.62025938000005, 13: 421.1036325, 14: 421.1036325,
15: 413.96486982000005, 16: 413.96486982000005, 17: 413.44999135, 18: 413.44999135, 19: 408.73726959},
'gps_time': {0: '2019-11-18 14:37:51.000000', 1: '2019-11-18 14:37:51.000000', 2: '2019-11-18 14:37:52.000000',
3: '2019-11-18 14:37:53.000000', 4: '2019-11-18 14:37:54.000000', 5: '2019-11-18 14:37:54.000000',
6: '2019-11-18 14:37:55.000000', 7: '2019-11-18 14:37:56.000000', 8: '2019-11-18 14:37:56.000000',
9: '2019-11-18 14:37:57.000000', 10: '2019-11-18 14:37:57.000000', 11: '2019-11-18 14:37:58.000000',
12: '2019-11-18 14:37:59.000000', 13: '2019-11-18 14:38:00.000000', 14: '2019-11-18 14:38:00.000000',
15: '2019-11-18 14:38:01.000000', 16: '2019-11-18 14:38:02.000000', 17: '2019-11-18 14:38:02.000000',
18: '2019-11-18 14:38:03.000000', 19: '2019-11-18 14:38:04.000000'},
'lat': {0: 45.5052230462, 1: 45.5052230462, 2: 45.5052236012, 3: 45.5052241548, 4: 45.5052247083, 5: 45.5052247083,
6: 45.505224740900005, 7: 45.505224193000004, 8: 45.505224193000004, 9: 45.5052236451,
10: 45.5052236451, 11: 45.5052230897, 12: 45.5052225243, 13: 45.505221958999996, 14: 45.505221958999996,
15: 45.5052211427, 16: 45.505220058, 17: 45.505220058, 18: 45.505218973299996, 19: 45.5052183333},
'lon': {0: -73.5761855743, 1: -73.5761855743, 2: -73.576185, 3: -73.576185, 4: -73.576185, 5: -73.576185,
6: -73.5761855183, 7: -73.5761866141, 8: -73.5761866141, 9: -73.5761877098, 10: -73.5761877098,
11: -73.576188577, 12: -73.5761891424, 13: -73.57618970770001, 14: -73.57618970770001,
15: -73.5761894761, 16: -73.57618839140001, 17: -73.57618839140001, 18: -73.5761873066, 19: -73.5761857775},
'co2d': {0: 380.58647938, 1: 381.44674445, 2: 451.67041972, 3: 451.67041972, 4: 451.66555392, 5: 451.66555392,
6: 456.29788806, 7: 456.29788806, 8: 456.29412627, 9: 456.29412627, 10: 520.61774288, 11: 520.61774288,
12: 520.62904898, 13: 520.62904898, 14: 630.97037738, 15: 630.97037738, 16: 630.9919346, 17: 630.9919346,
18: 512.76133406, 19: 512.76133406}}
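One direction worth trying (a sketch under the same haversine formula, not the asker's exact code): compute all distances from an anchor point with NumPy in one vectorized call, so the inner while loop collapses into a boolean mask, and the curve fits are then applied to the masked values. haversine_m is a hypothetical helper name:

```python
import numpy as np

def haversine_m(lat1, lon1, lats, lons, R=6373.0):
    """Distances in metres from one point to arrays of points, vectorized."""
    lat1, lon1 = np.radians(lat1), np.radians(lon1)
    lats = np.radians(np.asarray(lats))
    lons = np.radians(np.asarray(lons))
    dlat = lats - lat1
    dlon = lons - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lats) * np.sin(dlon / 2)**2
    return R * 1000 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Three sample points: the first two from the dict above, the third far away
lats = [45.5052230462, 45.5052236012, 45.6]
lons = [-73.5761855743, -73.576185, -73.6]

# For each anchor point, the 100 m window is now a single boolean mask,
# replacing the inner while loop entirely
mask = haversine_m(lats[0], lons[0], lats, lons) < 100
print(mask)  # → [ True  True False]
```

With the lat/lon columns pulled out once as NumPy arrays, this turns the per-point inner loop into O(1) vectorized array operations, which is usually the difference between minutes and hours at 300000 rows.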
All,
I have the Pandas dataframe below, and I am trying to filter it so that the output displays the country name along with the year-1989 column for rows whose value is > 1000000. For this I am using the code below, but it is returning the error shown.
{'Country': {0: 'Austria', 1: 'Belgium', 2: 'Denmark', 3: 'Finland', 4: 'France', 5: 'Germany', 6: 'Iceland', 7: 'Ireland', 8: 'Italy', 9: 'Luxemburg', 10: 'Netherland', 11: 'Norway', 12: 'Portugal', 13: 'Spain', 14: 'Sweden', 15: 'Switzerland', 16: 'United Kingdom'}, 'y1989': {0: 7602431, 1: 9927600, 2: 5129800, 3: 4954359, 4: 56269800, 5: 61715000, 6: 253500, 7: 3526600, 8: 57504700, 9: 374900, 10: 14805240, 11: 4226901, 12: 10304700, 13: 38851900, 14: 8458890, 15: 6619973, 16: 57236200}, 'y1990': {0: 7660345.0, 1: 9947800.0, 2: 5135400.0, 3: 4974383.0, 4: 0.0, 5: 62678000.0, 6: 255708.0, 7: 3505500.0, 8: 57576400.0, 9: 379300.0, 10: 14892574.0, 11: 4241473.0, 12: 0.0, 13: 38924500.0, 14: 8527040.0, 15: 6673850.0, 16: 57410600.0}, 'y1991': {0: 7790957, 1: 9987000, 2: 5146500, 3: 4998478, 4: 56893000, 5: 79753000, 6: 259577, 7: 3519000, 8: 57746200, 9: 384400, 10: 15010445, 11: 4261930, 12: 9858500, 13: 38993800, 14: 8590630, 15: 6750693, 16: 57649200}, 'y1992': {0: 7860800, 1: 10068319, 2: 5162100, 3: 5029300, 4: 57217500, 5: 80238000, 6: 262193, 7: 3542000, 8: 57788200, 9: 389800, 10: 15129200, 11: 4273634, 12: 9846000, 13: 39055900, 14: 8644100, 15: 6831900, 16: 58888800}, 'y1993': {0: 7909575, 1: 10100631, 2: 5180614, 3: 5054982, 4: 57529577, 5: 81338000, 6: 264922, 7: 3559985, 8: 57114161, 9: 395200, 10: 15354000, 11: 4324577, 12: 9987500, 13: 39790955, 14: 8700000, 15: 6871500, 16: 58191230}, 'y1994': {0: 7943652, 1: 10130574, 2: 5191000, 3: 5098754, 4: 57847000, 5: 81353000, 6: 266783, 7: 3570700, 8: 57201800, 9: 400000, 10: 15341553, 11: 4348410, 12: 9776000, 13: 39177400, 14: 8749000, 15: 7021200, 16: 58380000}, 'y1995': {0: 8054800, 1: 10143047, 2: 5251027, 3: 5116800, 4: 58265400, 5: 81845000, 6: 267806, 7: 3591200, 8: 57268578, 9: 412800, 10: 15492800, 11: 4370000, 12: 9920800, 13: 39241900, 14: 8837000, 15: 7060400, 16: 58684000}}
My code:
df[(df.Country) & (df.y1989 > 1000000)]
Error:
TypeError: unsupported operand type(s) for &: 'str' and 'bool'
I am not sure what the reason could be. Being a newbie to Python, I would greatly appreciate an explanation of the error.
Thanks in advance,
'Country' doesn't form part of your filtering criteria, so don't use it in your Boolean indexer. Instead, use the loc accessor to give a Boolean condition and specify the necessary columns separately:
res = df.loc[df['y1989'] > 1000000, ['Country','y1989']]
Avoid chained indexing, e.g. via df[df['y1989'] > 1000000][['Country', 'y1989']], as this is ambiguous and explicitly discouraged in the docs.
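For example, on a small subset of the data above:

```python
import pandas as pd

# Small subset of the data from the question
df = pd.DataFrame({
    'Country': ['Austria', 'Iceland', 'Luxemburg'],
    'y1989':   [7602431, 253500, 374900],
})

# The Boolean condition selects rows; the column list selects columns
res = df.loc[df['y1989'] > 1000000, ['Country', 'y1989']]
print(res)  # only Austria exceeds 1000000
```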