How to Use Melt to Tidy Dataframe in Pandas? - python

dt = {'Ind': {0: 'Ind1',
1: 'Ind2',
2: 'Ind3',
3: 'Ind4',
4: 'Ind5',
5: 'Ind6',
6: 'Ind7',
7: 'Ind8',
8: 'Ind9',
9: 'Ind10',
10: 'Ind1',
11: 'Ind2',
12: 'Ind3',
13: 'Ind4',
14: 'Ind5',
15: 'Ind6',
16: 'Ind7',
17: 'Ind8',
18: 'Ind9',
19: 'Ind10'},
'Treatment': {0: 'Treat',
1: 'Treat',
2: 'Treat',
3: 'Treat',
4: 'Treat',
5: 'Treat',
6: 'Treat',
7: 'Treat',
8: 'Treat',
9: 'Treat',
10: 'Cont',
11: 'Cont',
12: 'Cont',
13: 'Cont',
14: 'Cont',
15: 'Cont',
16: 'Cont',
17: 'Cont',
18: 'Cont',
19: 'Cont'},
'value': {0: 4.5,
1: 8.3,
2: 6.2,
3: 4.2,
4: 7.1,
5: 7.5,
6: 7.9,
7: 5.1,
8: 5.8,
9: 6.0,
10: 11.3,
11: 11.6,
12: 13.3,
13: 12.2,
14: 13.4,
15: 11.7,
16: 12.1,
17: 12.0,
18: 14.0,
19: 13.8}}
mydt = pd.DataFrame(dt, columns=['Ind', 'Treatment', 'value'])
How can I tidy up my dataframe to make it look like this?
Desired Output

You can use DataFrame.from_dict
pd.DataFrame.from_dict(data, orient='index')
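The desired output is not shown here, so the following is only a sketch under an assumption: that the goal is one row per individual and one column per treatment. It uses a shortened stand-in for the question's data, reshapes it from long to wide with pivot, and then uses melt to go back to the long layout.

```python
import pandas as pd

# Shortened stand-in for the question's data (two individuals only).
mydt = pd.DataFrame({
    "Ind": ["Ind1", "Ind2", "Ind1", "Ind2"],
    "Treatment": ["Treat", "Treat", "Cont", "Cont"],
    "value": [4.5, 8.3, 11.3, 11.6],
})

# Long -> wide: one row per individual, one column per treatment.
# Assumption: this is the layout the question is after.
wide = mydt.pivot(index="Ind", columns="Treatment", values="value")

# Wide -> long: melt restores the three-column layout.
long_again = wide.reset_index().melt(
    id_vars="Ind", var_name="Treatment", value_name="value"
)

print(wide)
print(long_again)
```

pivot requires each (Ind, Treatment) pair to appear only once; if the real data has duplicates, pivot_table with an aggregation function would be the alternative.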

Related

Boolean mask unexpected behavior when applying style

I am processing data where values may be of the format '<x' I want to return 'x/2'. So <5 would be returned as '2.5'. I have columns of mixed numbers and text. The problem is that I want to style the values that have been changed. Dummy data and code:
dummy={'Location': {0: 'Perth', 1: 'Perth', 2: 'Perth', 3: 'Perth', 4: 'Perth', 5: 'Perth', 6: 'Perth', 7: 'Perth', 8: 'Perth', 9: 'Perth', 10: 'Perth', 11: 'Perth', 12: 'Perth', 13: 'Perth', 14: 'Perth', 15: 'Perth', 16: 'Perth', 17: 'Perth'}, 'Date': {0: '11/01/2012 0:00', 1: '11/01/2012 0:00', 2: '20/03/2012 0:00', 3: '6/06/2012 0:00', 4: '14/09/2012 0:00', 5: '17/12/2013 0:00', 6: '1/02/2014 0:00', 7: '1/02/2014 0:00', 8: '1/02/2014 0:00', 9: '1/02/2014 0:00', 10: '1/02/2014 0:00', 11: '1/02/2014 0:00', 12: '1/02/2014 0:00', 13: '1/02/2014 0:00', 14: '1/02/2014 0:00', 15: '1/02/2014 0:00', 16: '1/02/2014 0:00', 17: '1/02/2014 0:00'}, 'As µg/L': {0: '9630', 1: '9630', 2: '8580', 3: '4990', 4: '6100', 5: '282', 6: '21', 7: '<1', 8: '<1', 9: '<1', 10: '<1', 11: '<1', 12: '<1', 13: '<1', 14: '<1', 15: '<1', 16: '<1', 17: '<1'}, 'As': {0: '9.63', 1: '9.63', 2: '8.58', 3: '4.99', 4: '6.1', 5: '0.282', 6: '0.021', 7: '<1', 8: '<1', 9: '<1', 10: '<1', 11: '<1', 12: '<1', 13: '<1', 14: '<1', 15: '<1', 16: '<1', 17: '10'}, 'Ba': {0: 1000.0, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'HCO3': {0: '10.00', 1: '0.50', 2: '0.50', 3: '<22', 4: '0.50', 5: '0.50', 6: '0.50', 7: np.nan, 8: np.nan, 9: np.nan, 10: '0.50', 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Cd': {0: 0.0094, 1: 0.0094, 2: 0.011, 3: 0.0035, 4: 0.004, 5: 0.002, 6: 0.0019, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Ca': {0: 248.0, 1: 248.0, 2: 232.0, 3: 108.0, 4: 150.0, 5: 396.0, 6: 472.0, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: 472.0, 15: np.nan, 16: np.nan, 17: np.nan}, 'CO3': {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5, 7: np.nan, 8: np.nan, 9: 0.5, 10: np.nan, 11: np.nan, 
12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Cl': {0: 2.0, 1: 2.0, 2: 2.0, 3: 2.0, 4: 0.5, 5: 2.0, 6: 5.0, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: 5.0, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}}
import pandas as pd
import numpy as np  # the dummy dict above uses np.nan, so run these imports first
df = pd.DataFrame(dummy)
mask = df.applymap(lambda x: (isinstance(x, str) and x.startswith('<')))
def remove_less_thans(x):
    if type(x) is int:
        return x
    elif type(x) is float:
        return x
    elif type(x) is str and x[0]=="<":
        try:
            return float(x[1:])/2
        except:
            return x
    elif type(x) is str and len(x)<10:
        try:
            return float(x)
        except:
            return x
    else:
        return x
def colour_mask(val):
    colour='color: red; font-weight: bold' if val in df.values[mask] else ''
    return colour
#perform remove less-thans and divide the remainder by two
df=df.applymap(remove_less_thans)
styled_df= df.style.applymap(colour_mask)
styled_df
The mask looks correct and the remove-less-thans function works, but some values are styled when they shouldn't be. In the dummy data, the 0.5 values in the HCO3 column are styled even though they do not start with < and do not appear as True in the mask. I know that they are numbers stored as text, but that is how the real data might appear, and given that the mask is constructed as expected (i.e. the one True is there and the rest of the values in the column are False), I don't know why they are being styled. The same happens in column CO3: all the non-NaN values are styled when none should be. Why is this happening and how do I fix it?
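The check val in df.values[mask] tests whether a cell's value equals any of the converted values, not whether that cell is flagged in the mask. Since '<1' becomes 0.5, every cell equal to 0.5 is styled. The idea is to pass the mask itself to Styler.apply with numpy.where:
def colour_mask(x):
    arr = np.where(mask, 'color: red; font-weight: bold', '')
    return pd.DataFrame(arr, index=x.index, columns=x.columns)
styled_df = df.style.apply(colour_mask, axis=None)
Or:
def colour_mask(x, props=''):
    return np.where(mask, props, '')
styled_df = df.style.apply(colour_mask, props='color: red; font-weight: bold', axis=None)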
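To make the position-based approach concrete, here is a minimal sketch on a three-row stand-in for the dummy data. The names raw, cleaned, is_less_than, halve_less_than and style_from_mask are illustrative, not from the question. The mask is built before conversion, and the CSS table is built from the mask alone, so a genuine 0.5 is left unstyled.

```python
import numpy as np
import pandas as pd

PROPS = "color: red; font-weight: bold"

# Small stand-in for the question's data: text numbers, '<x' values and NaN.
raw = pd.DataFrame({
    "As": ["9.63", "<1", "0.5"],
    "CO3": [0.5, np.nan, 0.5],
})

def is_less_than(x):
    """True only for strings such as '<1'."""
    return isinstance(x, str) and x.startswith("<")

def halve_less_than(x):
    """Turn '<x' into x/2 and numeric text into floats; leave the rest alone."""
    if is_less_than(x):
        return float(x[1:]) / 2
    if isinstance(x, str):
        try:
            return float(x)
        except ValueError:
            return x
    return x

# Build the mask from the raw data, before any value is converted.
mask = raw.apply(lambda col: col.map(is_less_than))
cleaned = raw.apply(lambda col: col.map(halve_less_than))

def style_from_mask(frame, mask, props):
    """Return a same-shaped table of CSS strings, set only where mask is True."""
    css = np.where(mask, props, "")
    return pd.DataFrame(css, index=frame.index, columns=frame.columns)

styles = style_from_mask(cleaned, mask, PROPS)
print(styles)

# With Styler available (it needs jinja2), the same function plugs in as:
# styled = cleaned.style.apply(lambda f: style_from_mask(f, mask, PROPS), axis=None)
```

Only the cell that was originally '<1' receives the style; the 0.5 values in As and CO3 do not, because the decision depends on the cell's position and not on its value.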

Is there a way of creating boxplots using the exact boxplot values?

I am trying to create boxplots for 24 hours, each hour already having the maxValue, quartile75, median, quartile25 and minValue. Those values are stored in a dataframe, which I have put into a dict:
{'hour': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20,
21: 21,
22: 22,
23: 23},
'minValue': {0: -491.69,
1: -669.49,
2: -551.22,
3: -514.2,
4: -506.94,
5: -665.7,
6: -484.89,
7: -488.99,
8: -524.22,
9: -851.9,
10: -610.0,
11: -998.8,
12: -580.57,
13: -737.22,
14: -895.2,
15: -500.0,
16: -852.0,
17: -610.0,
18: -500.0,
19: -610.0,
20: -1000.0,
21: -674.0,
22: -1005.0,
23: -499.33},
'quartile25': {0: 114.94,
1: 119.29,
2: 128.8,
3: 139.8,
4: 151.48,
5: 146.75,
6: 139.1,
7: 125.02,
8: 110.0,
9: 105.0,
10: 94.9,
11: 92.81,
12: 107.62,
13: 134.5,
14: 150.8,
15: 168.51,
16: 175.71,
17: 163.0,
18: 142.57,
19: 139.3,
20: 139.45,
21: 120.68,
22: 116.89,
23: 112.84},
'median': {0: 188.53,
1: 193.2,
2: 206.6,
3: 222.2,
4: 234.58,
5: 227.68,
6: 218.32,
7: 200.93,
8: 190.92,
9: 182.6,
10: 175.01,
11: 176.87,
12: 192.33,
13: 210.38,
14: 227.0,
15: 243.87,
16: 252.1,
17: 245.45,
18: 226.86,
19: 219.6,
20: 209.09,
21: 192.32,
22: 187.4,
23: 184.94},
'quartile75': {0: 292.1,
1: 295.33,
2: 316.62,
3: 340.8,
4: 357.0,
5: 345.3,
6: 330.4,
7: 305.28,
8: 290.4,
9: 280.1,
10: 268.23,
11: 270.99,
12: 301.84,
13: 321.04,
14: 345.61,
15: 373.84,
16: 393.39,
17: 382.79,
18: 359.89,
19: 341.55,
20: 325.5,
21: 292.1,
22: 287.2,
23: 285.96},
'maxValue': {0: 2420.3,
1: 1450.0,
2: 2852.0,
3: 7300.0,
4: 3967.0,
5: 3412.1,
6: 6999.99,
7: 2999.99,
8: 6000.0,
9: 3000.0,
10: 8885.9,
11: 9999.0,
12: 6254.0,
13: 2300.0,
14: 2057.58,
15: 2860.0,
16: 5000.0,
17: 4151.01,
18: 7000.0,
19: 3000.0,
20: 6000.0,
21: 3000.5,
22: 2000.0,
23: 2500.0}}
When I used a normal time series data set I plotted like this:
import numpy as np
import plotly.graph_objects as go

N=24
# one HSL colour per hour
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, N)]
# one box trace per hourly dataframe
fig = go.Figure(data=[go.Box(
    x=hour_dataframes[i]['hour'],
    y=hour_dataframes[i]['priceNum'],
    marker_color=c[i]
    ) for i in range(int(N))])
fig.update_layout(
    xaxis=dict(showgrid=True, zeroline=True, showticklabels=True),
    yaxis=dict(zeroline=True, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
    autosize=False,
    width=1500,
    height=1000,
)
fig.show()
It worked fine, but the data set became too big and JupyterLab started crashing, so I pulled aggregated data instead. Now I don't know how to plot multiple boxes (as the code above does) using the exact box plot values.
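One possible approach, sketched here rather than taken from an answer, is Plotly's support for precomputed statistics: go.Box accepts q1, median, q3, lowerfence and upperfence arrays in place of raw y values, and draws one box per array element. The sketch below uses the first three hours of the dict above. The helper name box_kwargs is illustrative, and mapping minValue and maxValue to the fences is an assumption, which makes the whiskers span the full range instead of the usual 1.5 × IQR.

```python
import pandas as pd

try:
    import plotly.graph_objects as go
except ImportError:  # the reshaping below still runs without Plotly
    go = None

# First three hours of the aggregated data from the question.
stats = pd.DataFrame({
    "hour": [0, 1, 2],
    "minValue": [-491.69, -669.49, -551.22],
    "quartile25": [114.94, 119.29, 128.8],
    "median": [188.53, 193.2, 206.6],
    "quartile75": [292.1, 295.33, 316.62],
    "maxValue": [2420.3, 1450.0, 2852.0],
})

def box_kwargs(stats):
    """Map the aggregated columns onto go.Box's precomputed-statistics arguments."""
    return dict(
        x=stats["hour"].tolist(),
        lowerfence=stats["minValue"].tolist(),  # assumption: whisker at the minimum
        q1=stats["quartile25"].tolist(),
        median=stats["median"].tolist(),
        q3=stats["quartile75"].tolist(),
        upperfence=stats["maxValue"].tolist(),  # assumption: whisker at the maximum
    )

kwargs = box_kwargs(stats)

# A single trace holds every box, which keeps the figure small.
fig = go.Figure(go.Box(**kwargs)) if go is not None else None
# fig.show()
```

Because only five numbers per hour reach the browser, this should avoid the memory problem caused by plotting the raw series.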

Python Filter Dataframe with Dynamic arguments

Hi, I want to filter a dataframe dynamically from command-line arguments.
This is my idea so far:
tr=pd.read_csv("sales.csv")
def filtr(*arg2):
    fltr = tr.loc[(tr[arg2[0]] arg2[1] arg2[2]) arg2[3] ....]
    print(fltr)
filtr(*sys.argv[1:])
## python test.py "Unit Cost" "==" 4 & .......
I had the idea of treating (tr[arg2[0]] arg2[1] arg2[2]) as a body and iterating over it, but I don't know how.
Edit: data example:
{'Region': {0: 'Sub-Saharan Africa', 1: 'Europe', 2: 'Middle East and North Africa', 3: 'Sub-Saharan Africa', 4: 'Europe', 5: 'Sub-Saharan Africa', 6: 'Asia', 7: 'Asia', 8: 'Sub-Saharan Africa', 9: 'Central America and the Caribbean', 10: 'Sub-Saharan Africa', 11: 'Europe', 12: 'Europe', 13: 'Asia', 14: 'Middle East and North Africa', 15: 'Australia and Oceania', 16: 'Central America and the Caribbean', 17: 'Europe', 18: 'Middle East and North Africa', 19: 'Europe'}, 'Country': {0: 'Chad', 1: 'Latvia', 2: 'Pakistan', 3: 'Democratic Republic of the Congo', 4: 'Czech Republic', 5: 'South Africa', 6: 'Laos', 7: 'China', 8: 'Eritrea', 9: 'Haiti', 10: 'Zambia', 11: 'Bosnia and Herzegovina', 12: 'Germany', 13: 'India', 14: 'Algeria', 15: 'Palau', 16: 'Cuba', 17: 'Vatican City', 18: 'Lebanon', 19: 'Lithuania'}, 'Item Type': {0: 'Office Supplies', 1: 'Beverages', 2: 'Vegetables', 3: 'Household', 4: 'Beverages', 5: 'Beverages', 6: 'Vegetables', 7: 'Baby Food', 8: 'Meat', 9: 'Office Supplies', 10: 'Cereal', 11: 'Baby Food', 12: 'Office Supplies', 13: 'Household', 14: 'Clothes', 15: 'Snacks', 16: 'Beverages', 17: 'Beverages', 18: 'Personal Care', 19: 'Snacks'}, 'Sales Channel': {0: 'Online', 1: 'Online', 2: 'Offline', 3: 'Online', 4: 'Online', 5: 'Offline', 6: 'Online', 7: 'Online', 8: 'Online', 9: 'Online', 10: 'Offline', 11: 'Offline', 12: 'Online', 13: 'Online', 14: 'Offline', 15: 'Offline', 16: 'Online', 17: 'Online', 18: 'Offline', 19: 'Offline'}, 'Order Priority': {0: 'L', 1: 'C', 2: 'C', 3: 'C', 4: 'C', 5: 'H', 6: 'L', 7: 'C', 8: 'L', 9: 'C', 10: 'M', 11: 'M', 12: 'C', 13: 'C', 14: 'C', 15: 'L', 16: 'H', 17: 'L', 18: 'H', 19: 'H'}, 'Order Date': {0: '1/27/2011', 1: '12/28/2015', 2: '1/13/2011', 3: '9/11/2012', 4: '10/27/2015', 5: '7/10/2012', 6: '2/20/2011', 7: '4/10/2017', 8: '11/21/2014', 9: '7/4/2015', 10: '7/26/2016', 11: '10/20/2012', 12: '2/22/2015', 13: '8/27/2016', 14: '6/21/2011', 15: '9/19/2013', 16: '11/15/2015', 17: '4/6/2015', 18: '4/12/2010', 19: 
'9/26/2011'}, 'Order ID': {0: 292494523, 1: 361825549, 2: 141515767, 3: 500364005, 4: 127481591, 5: 482292354, 6: 844532620, 7: 564251220, 8: 411809480, 9: 327881228, 10: 773452794, 11: 479823005, 12: 498603188, 13: 151717174, 14: 181401288, 15: 500204360, 16: 640987718, 17: 206925189, 18: 221503102, 19: 878520286}, 'Ship Date': {0: '2/12/2011', 1: '1/23/2016', 2: '2/1/2011', 3: '10/6/2012', 4: '12/5/2015', 5: '8/21/2012', 6: '3/20/2011', 7: '5/12/2017', 8: '1/10/2015', 9: '7/20/2015', 10: '8/24/2016', 11: '11/15/2012', 12: '2/27/2015', 13: '9/2/2016', 14: '7/21/2011', 15: '10/4/2013', 16: '11/30/2015', 17: '4/27/2015', 18: '5/19/2010', 19: '10/2/2011'}, 'Units Sold': {0: 4484, 1: 1075, 2: 6515, 3: 7683, 4: 3491, 5: 9880, 6: 4825, 7: 3330, 8: 2431, 9: 6197, 10: 724, 11: 9145, 12: 6618, 13: 5338, 14: 9527, 15: 441, 16: 1365, 17: 2617, 18: 6545, 19: 2530}, 'Unit Price': {0: 651.21, 1: 47.45, 2: 154.06, 3: 668.27, 4: 47.45, 5: 47.45, 6: 154.06, 7: 255.28, 8: 421.89, 9: 651.21, 10: 205.7, 11: 255.28, 12: 651.21, 13: 668.27, 14: 109.28, 15: 152.58, 16: 47.45, 17: 47.45, 18: 81.73, 19: 152.58}, 'Unit Cost': {0: 524.96, 1: 31.79, 2: 90.93, 3: 502.54, 4: 31.79, 5: 31.79, 6: 90.93, 7: 159.42, 8: 364.69, 9: 524.96, 10: 117.11, 11: 159.42, 12: 524.96, 13: 502.54, 14: 35.84, 15: 97.44, 16: 31.79, 17: 31.79, 18: 56.67, 19: 97.44}, 'Total Revenue': {0: 2920025.64, 1: 51008.75, 2: 1003700.9, 3: 5134318.41, 4: 165647.95, 5: 468806.0, 6: 743339.5, 7: 850082.4, 8: 1025614.59, 9: 4035548.37, 10: 148926.8, 11: 2334535.6, 12: 4309707.78, 13: 3567225.26, 14: 1041110.56, 15: 67287.78, 16: 64769.25, 17: 124176.65, 18: 534922.85, 19: 386027.4}, 'Total Cost': {0: 2353920.64, 1: 34174.25, 2: 592408.95, 3: 3861014.82, 4: 110978.89, 5: 314085.2, 6: 438737.25, 7: 530868.6, 8: 886561.39, 9: 3253177.12, 10: 84787.64, 11: 1457895.9, 12: 3474185.28, 13: 2682558.52, 14: 341447.68, 15: 42971.04, 16: 43393.35, 17: 83194.43, 18: 370905.15, 19: 246523.2}, 'Total Profit': {0: 566105.0, 1: 16834.5, 2: 
411291.95, 3: 1273303.59, 4: 54669.06, 5: 154720.8, 6: 304602.25, 7: 319213.8, 8: 139053.2, 9: 782371.25, 10: 64139.16, 11: 876639.7, 12: 835522.5, 13: 884666.74, 14: 699662.88, 15: 24316.74, 16: 21375.9, 17: 40982.22, 18: 164017.7, 19: 139504.2}}
You can use eval(); here is the code:
import pandas as pd
def filter_df(df, args_list):
    constraints = []
    for a in args_list:
        col = a[0]
        symbol = a[1]
        value = a[2]
        # build one "(df.COL op VALUE)" expression per condition
        constraint = "(df.{}{}{})".format(col, symbol, value)
        constraints.append(constraint)
    filter_str = "&".join(constraints)
    return df[eval(filter_str)]
data = {
    "COL_A": [1,2,3,2,4,6],
    "COL_B": [1,10,100,20,20,40],
    "COL_C": ["aaa", "bbb", "zzz", "xxx", "xxx", "xxx"]
}
df = pd.DataFrame(data)
args_list = [["COL_A", "<=", "4"], ["COL_C", "==", "'xxx'"]]
df2 = filter_df(df, args_list)
This is df:
After filter COL_A <= 4 & COL_C == 'xxx', this is df2:
How about this?
def filter(df, **args):
    conditions = args["args"]
    for key , value in conditions.items():
        df = df[df[key] > value]
    return df
Invoke it using:
df = filter(df, args={"Unit Cost": 500, "Unit Price": 500})
Result:
print(df.shape)
(5,14)
Note: this approach works only when every condition is a > comparison. If you need to mix several operators, you will need a different approach.
def filter_df(arg2):
    if arg2[1]==">":
        return tr.loc[(tr[arg2[0]] > int(arg2[2]))]
    elif arg2[1]=="<":
        return tr.loc[(tr[arg2[0]] < int(arg2[2]))]
    elif arg2[1]=="=":
        return tr.loc[(tr[arg2[0]] == int(arg2[2]))]
    else:
        raise ValueError("invalid comparison: %s"%arg2[1])
filter_df(arg2)
Now if, for example, arg2 = ('Unit Cost', '>', '500'), the function will return only the rows with Unit Cost > 500.
Passing multiple conditions is more complicated; my suggestion is to apply them separately, one step at a time.
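The answers above can be combined into one sketch that handles several conditions and several operators without eval(). It looks each operator symbol up in a dict of functions from the standard operator module and ANDs the resulting masks together. The names OPS, filter_df and argv_tail are illustrative; the sketch assumes the arguments arrive as flat (column, operator, value) triples and that conditions are always joined with "and".

```python
import operator

import pandas as pd

# Supported comparison symbols and the functions that implement them.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

def filter_df(df, conditions):
    """Keep the rows that satisfy every (column, symbol, value) triple."""
    mask = pd.Series(True, index=df.index)
    for column, symbol, value in conditions:
        if symbol not in OPS:
            raise ValueError("invalid comparison: %s" % symbol)
        series = df[column]
        # Command-line values are strings; convert them for numeric columns.
        if pd.api.types.is_numeric_dtype(series):
            value = float(value)
        mask &= OPS[symbol](series, value)
    return df[mask]

# Stand-in for sys.argv[1:], e.g.
#   python test.py "Unit Cost" ">" 500 "Sales Channel" "==" Online
argv_tail = ["Unit Cost", ">", "500", "Sales Channel", "==", "Online"]
conditions = [argv_tail[i:i + 3] for i in range(0, len(argv_tail), 3)]

# First four rows of two columns from the question's sample data.
sales = pd.DataFrame({
    "Sales Channel": ["Online", "Online", "Offline", "Online"],
    "Unit Cost": [524.96, 31.79, 90.93, 502.54],
})

result = filter_df(sales, conditions)
print(result)
```

Because columns are accessed as df[column], names containing spaces such as "Unit Cost" work, and since no string is evaluated as code, untrusted command-line input cannot run arbitrary expressions.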

Transform DataFrame into multidimensional TimeSeries?

I have the following pandas DataFrame with "periodic" values over the column 'county' as well as repeating values in 'reporting_period' and 'date':
data = pd.DataFrame({'county': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'A', 10: 'B', 11: 'C', 12: 'D', 13: 'E', 14: 'F', 15: 'G', 16: 'H', 17: 'I'}, 'new_covid_19_cases_per_100k': {0: 9.89857311398793, 1: 8.96808587445497, 2: 10.4018656786281, 3: 5.44259755461725, 4: 8.47402557487262, 5: 8.23708135804402, 6: 21.1781816000959, 7: 6.34201242466493, 8: 11.9630512616746, 9: 14.0, 10: 16.3, 11: 13.1, 12: 9.3, 13: 11.0, 14: 12.6, 15: 20.9, 16: 8.2, 17: 13.6}, 'new_covid_19_hospitalizations': {0: 0.735745284982339, 1: 0.681120446161137, 2: 1.07219230841243, 3: 0.118317338143853, 4: 0.526882419163064, 5: 0.599666185823225, 6: 1.07095735019448, 7: 0.141985352791006, 8: 0.854503661548189, 9: 0.9, 10: 0.8, 11: 1.5, 12: 0.2, 13: 0.5, 14: 0.8, 15: 0.9, 16: 0.1, 17: 0.7}, 'reporting_period': {0: '10/04/2020 - 10/17/2020', 1: '10/04/2020 - 10/17/2020', 2: '10/04/2020 - 10/17/2020', 3: '10/04/2020 - 10/17/2020', 4: '10/04/2020 - 10/17/2020', 5: '10/04/2020 - 10/17/2020', 6: '10/04/2020 - 10/17/2020', 7: '10/04/2020 - 10/17/2020', 8: '10/04/2020 - 10/17/2020', 9: '10/11/2020 - 10/24/2020', 10: '10/11/2020 - 10/24/2020', 11: '10/11/2020 - 10/24/2020', 12: '10/11/2020 - 10/24/2020', 13: '10/11/2020 - 10/24/2020', 14: '10/11/2020 - 10/24/2020', 15: '10/11/2020 - 10/24/2020', 16: '10/11/2020 - 10/24/2020', 17: '10/11/2020 - 10/24/2020'}, 'date': {0: '2020-10-22T00:00:00', 1: '2020-10-22T00:00:00', 2: '2020-10-22T00:00:00', 3: '2020-10-22T00:00:00', 4: '2020-10-22T00:00:00', 5: '2020-10-22T00:00:00', 6: '2020-10-22T00:00:00', 7: '2020-10-22T00:00:00', 8: '2020-10-22T00:00:00', 9: '2020-10-29T00:00:00', 10: '2020-10-29T00:00:00', 11: '2020-10-29T00:00:00', 12: '2020-10-29T00:00:00', 13: '2020-10-29T00:00:00', 14: '2020-10-29T00:00:00', 15: '2020-10-29T00:00:00', 16: '2020-10-29T00:00:00', 17: '2020-10-29T00:00:00'}})
My goal is to transform this DataFrame into some sort of multidimensional time series, but I don't know what the best approach is or whether this is even possible.
My first idea was to use groupby and pivot_table, but I'm not sure whether this is useful.
The easiest way to view the time-series data with a MultiIndex is to use set_index.
reporting_period can also be converted to a period type, but that depends on the requirement.
To apply an aggregation, reduction or any other transformation, we will have to use groupby or pivot.
data['date'] = pd.to_datetime(data.date)
data = data.set_index(['reporting_period', 'date'])
data
Sample Output
data.head(2)
                                   county  new_covid_19_cases_per_100k  new_covid_19_hospitalizations
reporting_period        date
10/04/2020 - 10/17/2020 2020-10-22      A                     9.898573                       0.735745
                        2020-10-22      B                     8.968086                       0.681120
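As a sketch of the pivot route mentioned above, the snippet below reshapes a rounded two-county stand-in for the data so that each date is a row and each (metric, county) pair is a column. Whether this wide layout is what "multidimensional time series" means here is an assumption.

```python
import pandas as pd

# Rounded stand-in for the question's data: two counties, two report dates.
data = pd.DataFrame({
    "county": ["A", "B", "A", "B"],
    "new_covid_19_cases_per_100k": [9.9, 9.0, 14.0, 16.3],
    "new_covid_19_hospitalizations": [0.7, 0.7, 0.9, 0.8],
    "date": [
        "2020-10-22T00:00:00", "2020-10-22T00:00:00",
        "2020-10-29T00:00:00", "2020-10-29T00:00:00",
    ],
})
data["date"] = pd.to_datetime(data["date"])

# One row per date; columns form a (metric, county) MultiIndex.
wide = data.pivot(
    index="date",
    columns="county",
    values=["new_covid_19_cases_per_100k", "new_covid_19_hospitalizations"],
)

# Selecting one metric gives an ordinary date-by-county time series.
cases = wide["new_covid_19_cases_per_100k"]
print(cases)
```

pivot needs each (date, county) pair to be unique; with duplicates, pivot_table with an explicit aggregation would be needed instead.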

TypeError: unsupported operand type(s) for &: 'str' and 'bool'

All,
I have the pandas dataframe below, and I am trying to filter it so that the output shows the country name along with the y1989 column for rows whose value is > 1000000. I am using the code below, but it returns the error shown.
{'Country': {0: 'Austria', 1: 'Belgium', 2: 'Denmark', 3: 'Finland', 4: 'France', 5: 'Germany', 6: 'Iceland', 7: 'Ireland', 8: 'Italy', 9: 'Luxemburg', 10: 'Netherland', 11: 'Norway', 12: 'Portugal', 13: 'Spain', 14: 'Sweden', 15: 'Switzerland', 16: 'United Kingdom'}, 'y1989': {0: 7602431, 1: 9927600, 2: 5129800, 3: 4954359, 4: 56269800, 5: 61715000, 6: 253500, 7: 3526600, 8: 57504700, 9: 374900, 10: 14805240, 11: 4226901, 12: 10304700, 13: 38851900, 14: 8458890, 15: 6619973, 16: 57236200}, 'y1990': {0: 7660345.0, 1: 9947800.0, 2: 5135400.0, 3: 4974383.0, 4: 0.0, 5: 62678000.0, 6: 255708.0, 7: 3505500.0, 8: 57576400.0, 9: 379300.0, 10: 14892574.0, 11: 4241473.0, 12: 0.0, 13: 38924500.0, 14: 8527040.0, 15: 6673850.0, 16: 57410600.0}, 'y1991': {0: 7790957, 1: 9987000, 2: 5146500, 3: 4998478, 4: 56893000, 5: 79753000, 6: 259577, 7: 3519000, 8: 57746200, 9: 384400, 10: 15010445, 11: 4261930, 12: 9858500, 13: 38993800, 14: 8590630, 15: 6750693, 16: 57649200}, 'y1992': {0: 7860800, 1: 10068319, 2: 5162100, 3: 5029300, 4: 57217500, 5: 80238000, 6: 262193, 7: 3542000, 8: 57788200, 9: 389800, 10: 15129200, 11: 4273634, 12: 9846000, 13: 39055900, 14: 8644100, 15: 6831900, 16: 58888800}, 'y1993': {0: 7909575, 1: 10100631, 2: 5180614, 3: 5054982, 4: 57529577, 5: 81338000, 6: 264922, 7: 3559985, 8: 57114161, 9: 395200, 10: 15354000, 11: 4324577, 12: 9987500, 13: 39790955, 14: 8700000, 15: 6871500, 16: 58191230}, 'y1994': {0: 7943652, 1: 10130574, 2: 5191000, 3: 5098754, 4: 57847000, 5: 81353000, 6: 266783, 7: 3570700, 8: 57201800, 9: 400000, 10: 15341553, 11: 4348410, 12: 9776000, 13: 39177400, 14: 8749000, 15: 7021200, 16: 58380000}, 'y1995': {0: 8054800, 1: 10143047, 2: 5251027, 3: 5116800, 4: 58265400, 5: 81845000, 6: 267806, 7: 3591200, 8: 57268578, 9: 412800, 10: 15492800, 11: 4370000, 12: 9920800, 13: 39241900, 14: 8837000, 15: 7060400, 16: 58684000}}
My code
df[(df.Country)& (df.y1989>1000000)]
Error:
TypeError: unsupported operand type(s) for &: 'str' and 'bool'
I am not sure what the reason could be. As a newbie to Python, I would greatly appreciate an explanation of the error.
Thanks in advance,
The error arises because df.Country is a column of strings, and & cannot combine a string with a Boolean element by element. 'Country' doesn't form part of your filtering criteria, so don't use it to form your Boolean indexer. Instead, use the loc accessor to give a Boolean condition and specify the necessary columns separately:
res = df.loc[df['y1989'] > 1000000, ['Country','y1989']]
Under no circumstances use chained indexing, e.g. via df[df['y1989']>1000000][['Country','y1989']], as this is ambiguous and explicitly discouraged in the docs.
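A short runnable version of this, using four countries from the question's data, shows the row condition and the column selection as the two separate arguments of loc. The variable name condition is illustrative.

```python
import pandas as pd

# Four rows taken from the question's data.
df = pd.DataFrame({
    "Country": ["Austria", "Iceland", "Luxemburg", "France"],
    "y1989": [7602431, 253500, 374900, 56269800],
})

# The Boolean condition uses only the numeric column.
condition = df["y1989"] > 1000000

# Rows are chosen by the condition, columns by the list.
res = df.loc[condition, ["Country", "y1989"]]
print(res)
```

The first argument of loc answers "which rows", the second "which columns", so the string column never takes part in the & or comparison logic.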
