Comparing data from 2 nested dictionaries and producing a box-plot - python

I am trying to produce a box plot using matplotlib with data from nested dictionaries. Below is a rough outline of the structure of the dictionary in question.
m_data = {scenario: {variable: {'model_name': value, 'model_name': value, ...}}}
One issue is that I want to look at the change in the models' output between the two different scenarios (scenario 1 [VAR1] - scenario 2 [VAR2]) and then plot this difference in a box plot.
I have managed to do this. However, I also want to label the outliers with the model name. My current method separates the keys from the values, so an outlier data point no longer has a name associated with it.
import matplotlib.pyplot as plt

# BOXPLOT
# set up blank lists
future_rain = []
past_rain = []
future_temp = []
past_temp = []
# single out the values for each model from the nested dictionaries
for key, val in m_data[FUTURE_SCENARIO][VAR1].items():
    future_rain.append(val)
for key, val in m_data[FUTURE_SCENARIO][VAR2].items():
    future_temp.append(val)
for key, val in m_data['historical'][VAR1].items():
    past_rain.append(val)
for key, val in m_data['historical'][VAR2].items():
    past_temp.append(val)
# blanks for final data
bx_plt_rain = []
bx_plt_temp = []
# allow for the subtraction of two lists
zip_object = zip(future_temp, past_temp)
for future_temp_i, past_temp_i in zip_object:
    bx_plt_temp.append(future_temp_i - past_temp_i)
zip_object = zip(future_rain, past_rain)
for future_rain_i, past_rain_i in zip_object:
    bx_plt_rain.append(future_rain_i - past_rain_i)
# colour outliers red
c = 'red'
outlier_col = {'flierprops': dict(color=c, markeredgecolor=c)}
# plot
bp = plt.boxplot(bx_plt_rain, patch_artist=True, showmeans=True, vert=False, meanline=True, **outlier_col)
bp['boxes'][0].set(facecolor='lightgrey')
plt.show()
If anyone knows of a workaround for this I would be extremely grateful.

As a bit of a hack you could create a function that looks through the dict for the outlier value and returns the key.
def outlier_name(outlier_val, inner_dict):
    for key, value in inner_dict.items():
        if value == outlier_val:
            return key
This could be pretty intensive if your data sets are large.
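An alternative is to keep the differences in a dictionary keyed by model name, so the name never gets separated from the value. The following is only a minimal sketch under a few assumptions: the helper names `diff_by_model` and `named_outliers` are made up for illustration, the sample numbers are invented, and the outlier rule is the 1.5 x IQR rule with linearly interpolated quartiles, which as far as I know is what `plt.boxplot` uses by default (`whis=1.5`).

```python
import statistics


def diff_by_model(future, past):
    """Subtract past from future per model name (only models present in both)."""
    return {name: future[name] - past[name] for name in future if name in past}


def named_outliers(diffs, whis=1.5):
    """Return {model_name: value} for points outside the whis * IQR fences.

    Assumption: quartiles use linear interpolation ('inclusive' method).
    """
    q1, _, q3 = statistics.quantiles(diffs.values(), n=4, method='inclusive')
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return {name: val for name, val in diffs.items() if val < low or val > high}


# Invented sample data standing in for m_data[FUTURE_SCENARIO][VAR1]
# and m_data['historical'][VAR1].
future = {'m1': 3.0, 'm2': 3.1, 'm3': 2.9, 'm4': 3.05, 'm5': 7.0}
past = {'m1': 2.0, 'm2': 2.0, 'm3': 2.0, 'm4': 2.0, 'm5': 2.0}

diffs = diff_by_model(future, past)
outliers = named_outliers(diffs)  # {'m5': 5.0}
```

With the horizontal plot from the question (`vert=False`, a single box at position 1), each entry could then be labelled with something like `plt.annotate(name, (value, 1))`. Subtracting by key also avoids relying on both dictionaries listing the models in the same order, which `zip` on two separate lists silently assumes.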

Related

Return a unique dataframe name from function

I would like to return several dataframes from a function, using unique names based on variables. My code is as follows:
def plots_without_outliers(parameter):
    """
    The function removes outliers from dataframe variables and plots boxplots and histograms
    """
    Q1 = df[parameter].quantile(0.25)
    Q3 = df[parameter].quantile(0.75)
    IQR = Q3 - Q1
    df_without_outliers = df[(df[parameter] > (Q1 - 1.5 * IQR)) & (df[parameter] < (Q3 + 1.5 * IQR))]
    g = sns.FacetGrid(df_without_outliers, col='tariff', height=5)
    g.map(sns.boxplot, parameter, order=['ultra', 'smart'], color='#fec44f', showmeans=True)
    g = sns.FacetGrid(df_without_outliers, col='tariff', height=5)
    g.map(plt.hist, parameter, bins=12, color='#41ab5d')
    return df_without_outliers
Then I pass a number of variables:
plots_without_outliers('total_minutes_spent_per_month')
plots_without_outliers('number_sms_spent_per_month')
In addition to graphs I want to have dataframes returned with unique names to use them later on. For example:
df_without_outliers_total_minutes_spent_per_month
and
df_without_outliers_number_sms_spent_per_month
What would be the best way to deal with this issue? Thank you very much for your help.
A common way to deal with this is by using a dictionary, which you can make a global variable outside of the function and then update with the returned dataframe and the corresponding name as dictionary key.
dict_of_dfs = dict()
def plots_without_outliers(parameter):
    # your function statements
    return df_without_outliers
for col in ['total_minutes_spent_per_month', 'number_sms_spent_per_month']:
    dict_of_dfs['df_without_outliers_' + col] = (
        plots_without_outliers(col)
    )
You can then get each dataframe from the dictionary with, e.g., dict_of_dfs['df_without_outliers_total_minutes_spent_per_month'].

Box plot a data from dictionaries key in python

I have two dictionaries, and both contain a number of keys. I want to plot their data side by side. For example, both dictionaries have the key 1, so I want to plot the data for key 1 from both dictionaries side by side.
dict_a = {1: [10.60626299560636,9.808507783184758, 9.80184985166152, 9.820483229791137,9.822087257017674],
2: [10.60626299560636, 9.808507783184758, 9.80184985166152, 9.820483229791137, 9.822087257017674]}
dict_b = {1: [14.420548834522766,13.886147271592971,14.522980401561725,14.876615652026173,13.379224382776899],
2: [14.650926514851816,13.984378530820885,14.566825972585173, 16.434690726796628,15.24108978696146]}
After a search I came across the following code, but both snippets only draw one dictionary at a time.
fig, ax = plt.subplots()
ax.boxplot(dict_a.values())
ax.set_xticklabels(dict_a.keys())
Another snippet I found is the following, but it still does not give me what I want.
labels, data = dict_a.keys(), dict_a.values()
plt.boxplot(data)
plt.xticks(range(1, len(labels) + 1), labels)
plt.show()
Is there a way to do what I want?
Try this:
key = 1
values = [data[key] for data in [dict_a, dict_b]]
fig, ax = plt.subplots()
ax.boxplot(values)
ax.set_xticklabels(['dict_a', 'dict_b'])
ax.set_title('value: %s' % key)
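If you want every shared key on one figure rather than one key per figure, the boxes can be shifted slightly left and right of each tick with the `positions` argument of `ax.boxplot`. This is only a sketch: the function name `plot_side_by_side`, the colours, and the default width are my own choices, and it assumes the keys the two dictionaries share can be sorted.

```python
import matplotlib.pyplot as plt


def plot_side_by_side(dict_a, dict_b, labels=('dict_a', 'dict_b'), width=0.35):
    """Draw one pair of boxes per shared key, dict_a on the left, dict_b on the right."""
    keys = sorted(set(dict_a) & set(dict_b))
    centers = list(range(1, len(keys) + 1))
    fig, ax = plt.subplots()
    handles = []
    groups = zip((dict_a, dict_b), (-width / 2, width / 2), ('lightblue', 'lightgreen'))
    for data, offset, color in groups:
        bp = ax.boxplot(
            [data[k] for k in keys],
            positions=[c + offset for c in centers],
            widths=width * 0.9,
            patch_artist=True,
            manage_ticks=False,  # ticks are set once below, at the group centres
        )
        for box in bp['boxes']:
            box.set_facecolor(color)
        handles.append(bp['boxes'][0])
    ax.set_xticks(centers)
    ax.set_xticklabels([str(k) for k in keys])
    ax.legend(handles, labels)
    return fig, ax
```

Calling `plot_side_by_side(dict_a, dict_b)` followed by `plt.show()` should give two boxes above tick 1 and two above tick 2.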

Specify function

I have the following function. It calculates the Euclidean distances between companies from some financial figures and gives me the closest company. Unfortunately, sometimes the closest company is the same company. Does anyone know how I can adjust the function so that it does not return the same company?
import math
import pandas as pd

# Calculating the closest distances
records = df_ipos.to_dict('records')  # converting dataframe to a list of dictionaries
def return_closest(df, inp_record):
    """returns the closest euclidean distanced record"""
    filtered_records = df.to_dict('records')  # converting dataframe to a list of dictionaries
    for record in filtered_records:  # iterating through dictionaries
        params = ['z_SA', 'z_LEV', 'z_AT', 'z_PM', 'z_RG']  # parameters to calculate euclidean distance
        distance = []
        for param in params:
            d1, d2 = record.get(param, 0), inp_record.get(param, 0)  # fetching value of these parameters. default is 0 if not found
            if d1 != d1:  # checking isNan
                d1 = 0
            if d2 != d2:
                d2 = 0
            distance.append((d1 - d2)**2)
        euclidean = math.sqrt(sum(distance))
        record['Euclidean distance'] = round(euclidean, 6)  # assigning to a new key
    distance_records = sorted(filtered_records, key=lambda x: x['Euclidean distance'])  # sorting in increasing order
    return next(filter(lambda x: x['Euclidean distance'], distance_records), None)  # returning the lowest value which is not zero. Default None
for record in records:
    ipo_year = record.get('IPO Year')
    sic_code = record.get('SIC-Code')
    df = df_fundamentals[df_fundamentals['Year'] == ipo_year]
    df = df[df['SIC-Code'] == sic_code]  # filtering dataframe
    closest_record = return_closest(df, record)
    if closest_record:
        record['Closest Company'] = closest_record.get('Name')  # adding new columns
        record['Actual Distance'] = closest_record.get('Euclidean distance')
df_dist = pd.DataFrame(records)  # changing list of dictionaries back to dataframe
Thanks in advance!
Based on your question, it is not exactly clear to me what your inputs are.
But as a simple fix, I would suggest you check before your function's for loop, whether the record you are comparing is identical to the one which you check against, i.e., add:
    ...
    filtered_records = [rec for rec in filtered_records if rec['Name'] != inp_record['Name']]
    for record in filtered_records:  # iterating through dictionaries
        ...
This only applies if 'Name' really contains the company name. Note also that your function already skips records whose distance is exactly zero, so whenever the same company is returned, its distance to itself must be greater than zero. I am not sure if this is intended; maybe you are looking at data from different years? I cannot really tell, due to the limited amount of information.
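To show the idea on plain dictionaries, here is a small self-contained sketch. The function name `closest_other` and the `name_key` parameter are hypothetical, and it assumes (as above) that 'Name' identifies the company. Unlike the original function, it keeps other companies that happen to sit at distance zero instead of skipping them.

```python
import math

PARAMS = ['z_SA', 'z_LEV', 'z_AT', 'z_PM', 'z_RG']


def _clean(value):
    """Treat None and NaN as 0 (NaN is the only value not equal to itself)."""
    return 0 if value is None or value != value else value


def closest_other(candidates, inp_record, params=PARAMS, name_key='Name'):
    """Return the closest candidate that is not inp_record itself, or None."""
    others = [rec for rec in candidates if rec.get(name_key) != inp_record.get(name_key)]
    if not others:
        return None

    def distance(rec):
        return math.sqrt(sum(
            (_clean(rec.get(p, 0)) - _clean(inp_record.get(p, 0))) ** 2
            for p in params
        ))

    best = min(others, key=distance)
    # Return a copy so the candidate records are not modified in place.
    return {**best, 'Euclidean distance': round(distance(best), 6)}
```

In the loop from the question this would be called as `closest_other(df.to_dict('records'), record)`.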

Creating column names from multiple lists using for loop

Say I have multiple lists:
names1 = [name11, name12, etc]
names2 = [name21, name22, etc]
names3 = [name31, name32, etc]
How do I create a for loop that combines the components of the lists in order ('name11name21name31', 'name11name21name32' and so on)?
I want to use this to name columns as I add them to a data frame. I tried like this:
Results['{}' .format(model_names[j]) + '{}' .format(Data_names[i])] = proba.tolist()
I am trying to take some results that I obtain as an array and introduce them one by one in a data frame and giving the columns names as I go on. It is for a machine learning model I am trying to make.
This is the whole code, I am sure it is messy because I am a beginner.
Train = [X_train_F, X_train_M, X_train_R, X_train_SM]
Test = [X_test_F, X_test_M, X_test_R, X_test_SM]
models_to_run = [knn, svc, forest, dtc]
model_names = ['knn', 'svc' ,'forest', 'dtc']
Data_names = ['F', 'M', 'R', 'SM']
Results = pd.DataFrame()
for T, t in zip(Train, Test):
    for j, model in enumerate(models_to_run):
        model.fit(T, y_train.values.ravel())
        proba = model.predict_proba(t)
        proba = pd.DataFrame(proba.max(axis=1))
        proba = proba.to_numpy()
        proba = proba.flatten()
        Results['{}'.format(model_names[j]) + '{}'.format(Data_names[i])] = proba.tolist()
I don't know how to integrate 'i' in the loop so that I can use it to go through the list Data_names and add it to the column name. I am sure there is a cleaner way to do this. Please be gentle.
Edit: It currently gives me a data frame with 4 columns instead of 16 as it should, and it just adds the whole Data_names list to the column name.
How about:
Results = {}
for T, t, dname in zip(Train, Test, Data_names):
    for mname, model in zip(model_names, models_to_run):
        ...
        Results[(dname, mname)] = proba.tolist()
Results = pd.DataFrame(Results.values(), index=Results.keys()).T
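For the first part of the question (combining the components of several lists in order), `itertools.product` from the standard library produces every combination, with the last list varying fastest. A minimal sketch, where `combined_names` is just an illustrative helper name:

```python
from itertools import product

model_names = ['knn', 'svc', 'forest', 'dtc']
data_names = ['F', 'M', 'R', 'SM']


def combined_names(*name_lists, sep=''):
    """Join one element from each list, in order, for every combination."""
    return [sep.join(parts) for parts in product(*name_lists)]


column_names = combined_names(model_names, data_names)
# 16 names: 'knnF', 'knnM', 'knnR', 'knnSM', 'svcF', ...
```

Passing `sep='_'` would give names such as 'knn_F' if a separator is easier to read.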

Sort PowerPoint chart data with Python?

I am sourcing chart data from an Excel spreadsheet using openpyxl.
I need to be able to sort this data largest to smallest without it getting jumbled. Sometimes there are multiple series in the charts.
Is there a method by which this can be accomplished?
If it cannot be sorted before being dropped into the plot, is there a means by which it can be sorted afterward? I imagine it would need to be located, then sorted. This would need to happen for every chart in the document.
Here's what I did to solve this problem, and it seems to be working currently. If anyone has suggestions to make this process more streamlined, I'm all ears!
data_sort = []
indexcounter = 0
for cat in data_collect['categories']:  # creates new lists by category
    new_list = []
    for key in data_collect:
        if key != 'categories':
            new_list.append(data_collect[key][indexcounter])
        else:
            pass
    new_list.insert(1, cat)
    indexcounter += 1
    data_sort.append(new_list)
data_sort.sort()  # sorts list by first column
cat_list_sorted = []
for lst in data_sort:  # removes sorted category list and drops it into the chart
    cat_list_sorted.append(lst[1])
    del lst[1]
chart_data.categories = cat_list_sorted
indexcounter = 0  # pulls data back apart and creates new series lists. Drops each series into ChartData()
while indexcounter < len(series_name_list):
    series_sorted = []
    for lst in data_sort:
        series_sorted.append(lst[indexcounter])
    series_name = series_name_list[indexcounter]
    chart_data.add_series(series_name, series_sorted, '0%')
    indexcounter += 1
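A possibly more streamlined variant is to compute the sort order once and apply it to the categories and to every series, so nothing has to be packed into rows and pulled apart again. This is only a sketch: `sort_chart_data` is a made-up helper name, and it assumes the series are held in a dictionary of equal-length lists and that the first series decides the order.

```python
def sort_chart_data(categories, series, reverse=True):
    """Reorder categories and all series by the values of the first series.

    series: dict mapping series name -> list of values (same length as categories).
    reverse=True sorts largest to smallest.
    """
    first = next(iter(series.values()))
    order = sorted(range(len(categories)), key=lambda i: first[i], reverse=reverse)
    sorted_categories = [categories[i] for i in order]
    sorted_series = {name: [values[i] for i in order] for name, values in series.items()}
    return sorted_categories, sorted_series
```

The sorted categories can then be assigned to `chart_data.categories`, and each entry of the returned dictionary passed to `chart_data.add_series(...)` as in the loop above.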
