I'm trying to write a function that iteratively loads and returns several machine learning models (pickles), based on an accuracy cutoff I specify.
My issue is that I'm trying to load the pickles with eval, since their names correspond to the numbers in sdf['number']. The eval call is not loading my pickles, and beyond that, I want the function to load and return them. I tested this by running data through each model directly after loading it, before moving on to the next one, but I get errors like "learn0 not defined".
Any thoughts on how to do this iteratively?
Variables explained:
jar = a list of the different variable names (learner names) that I expected it to load; for example, learn0, learn1, etc.
cutoff = accuracy cutoff
sdf_temp = temporary study DataFrame
def piklJar(sdf, cutoff):
    sdf_temp = sdf[sdf['value'] <= cutoff]
    jar = []
    i = 0
    for pklNum in sdf_temp['number']:
        eval('"learn{} = load_learner({}/Models/Pkl {}.pkl)".format(i,datapath,pklNum)')
        jar.append('learn{}'.format(i))
        i += 1
    return jar
eval isn't needed. Your example wasn't working code, but this is approximately the same thing:
def piklJar(sdf, cutoff):
    sdf_temp = sdf[sdf['value'] <= cutoff]
    return [load_learner(f'{datapath}/Models/Pkl {pklNum}.pkl') for pklNum in sdf_temp['number']]
After calling jar = piklJar(...), jar[0] is equivalent to learn0, jar[1] to learn1, etc. The results of the various load_learner calls are stored in a list built by the list comprehension.
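You can then run data through each loaded model by index instead of by a dynamically built name. A minimal sketch, assuming these are fastai learners (so each has a predict method) and a hypothetical sample_row input:

jar = piklJar(sdf, cutoff)

# jar[i] plays the role of the dynamically named learn0, learn1, ...
for i, learner in enumerate(jar):
    prediction = learner.predict(sample_row)  # sample_row: one input example
    print(f'model {i}: {prediction}')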
How do I pass the following commands into the LaTeX environment?
\centering (I need landscape tables to be centered)
and
\caption* (I need to skip the table numbering for a panel)
In addition, I need to add parentheses and asterisks to the t-statistics, which means row-specific formatting on the DataFrames.
For example:
Current:

variable             value
const                2.439628
t stat               13.921319
FamFirm              0.114914
t stat               0.351283
founder              0.154914
t stat               2.351283
Adjusted R Square    0.291328
I want this:

variable             value
const                2.439628
t stat               (13.921319)***
FamFirm              0.114914
t stat               (0.351283)
founder              0.154914
t stat               (1.651283)**
Adjusted R Square    0.291328
I'm writing my research papers in DataSpell. All empirical work is in Python, and I then use LaTeX (TeXiFy) to create the PDF within DataSpell. Because of this workflow, I can't hand-edit the tables in LaTeX code, since they get overwritten every time I run the Jupyter notebook.
In case it helps, here's an example of how I pass a table to the latex environment:
# drop index to column
panel_a.reset_index(inplace=True)

# write LaTeX index and cut names to appropriate length
ind_list = [
    "ageFirm",
    "meanAgeF",
    "lnAssets",
    "bsVol",
    "roa",
    "fndrCeo",
    "lnQ",
    "sic",
    "hightech",
    "nonFndrFam",
]

# assign the list of values to the column
panel_a["index"] = ind_list

# format column names
header = ["", "count", "mean", "std", "min", "25%", "50%", "75%", "max"]
panel_a.columns = header

with open(
    os.path.join(r"/.../tables/panel_a.tex"), "w"
) as tf:
    tf.write(
        panel_a
        .style
        .format(precision=3)
        .format_index(escape="latex", axis=1)
        .hide(level=0, axis=0)
        .to_latex(
            caption="Panel A: Summary Statistics for the Full Sample",
            label="tab:table_label",
            hrules=True,
        ))
You're asking three questions in one. I think I can do you two out of three (I hear that "ain't bad").
How to pass \centering to the LaTeX env using Styler.to_latex?
Use the position_float parameter. Simplified:
df.style.to_latex(position_float='centering')
How to pass \caption*?
This one I don't know. Perhaps useful: Why is caption not working.
How to apply row-specific formatting?
This one's a little tricky. Let me give an example of how I would normally do this:
df = pd.DataFrame({'a': ['some_var', 't stat'], 'b': [1.01235, 2.01235]})
df.style.format({'a': str,
                 'b': lambda x: '{:.3f}'.format(x) if x < 2 else '({:.3f})***'.format(x)})
Result (the rendered table): 1.012 in the first row, (2.012)*** in the second.
You can see from this example that style.format accepts a callable (here nested inside a dict, but you could also do .format(func, subset='value')). So this works well if each value itself is what gets evaluated (x < 2).
The problem in your case is that the condition depends on some other value, namely a (not supplied) p-value, combined with panel_a['variable'] == 't stat'. Now, assuming you have those p-values in a different column, I suggest you create a for loop to populate a list that ends up like this:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
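As a sketch of that loop, assuming a hypothetical panel_a['pval'] column holding the p-values (not shown in the question) and the usual significance cutoffs:

# Build one format string per row: plain floats for coefficient rows,
# parentheses plus significance stars for 't stat' rows.
fmt_list = []
for var, p in zip(panel_a['variable'], panel_a['pval']):
    if var != 't stat':
        fmt_list.append('{:.3f}')
    elif p < 0.01:
        fmt_list.append('({:.3f})***')
    elif p < 0.05:
        fmt_list.append('({:.3f})**')
    elif p < 0.1:
        fmt_list.append('({:.3f})*')
    else:
        fmt_list.append('({:.3f})')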
Now, we can apply a function to df.style.format, and pop/select from the list like so:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

panel_a.style.format({'variable': str, 'value': func})
Result: the value column now renders as in the desired table above.
This solution is admittedly a bit "hacky", since modifying a globally declared list inside a function is far from good practice; e.g. if you modify the list again before calling func, you're unlikely to get the expected behaviour, or worse, you may get an error that is difficult to track down. I'm not sure how to remedy this other than simply turning all the floats in panel_a.value into strings in place. In that case you of course no longer need .format, but it alters your df, which is also not ideal. You could make a copy first (df2 = df.copy()), but that costs memory.
Anyway, hope this helps. So, in full you add this as follows to your code:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

with open(fname, "w") as tf:
    tf.write(
        panel_a
        .style
        .format({'variable': str, 'value': func})
        ...
        .to_latex(
            ...
            position_float='centering',
        ))
I want to create multiple variables through a for loop, to use and compare later in the program.
Here is the code:
for i in range(0, len(Header_list)):
    (f'len_{i} = {len(Header_list[i]) + 2}')
    print(len_0); print(f'len{i}')
    for company in Donar_list:
        print(company[i], f"len_{i}")
        if len(str(company[i])) > len((f"len_{i}")):
            (f'len_{i}') = len(str(company[i]))
            print(f"len_{i}")
But here is what's happening: although I managed to create the variables len_0, len_1, len_2, ... in line 2, and in line 3 I can print len_0 by writing print(len_0), I can't print their values with print(f'len_{i}').
In line 5 I also can't do the comparison I intend. I want to create the variables and compare them later as needed, in this case inside the for loop. What should I do? I am a beginner; I could do it with if statements, but that wouldn't be efficient, and my intention is not to create any data structure for this.
I don't know whether I've managed to convey what I'm trying to say. In short, I just want to create different variables using a suffix and also compare them in a for loop, in THIS scenario.
Instead of dynamically creating variables, I would HIGHLY recommend checking out dictionaries.
Dictionaries allow you to store values under an associated key, like so:
variable_dict = dict()

for i in range(0, len(Header_list)):
    variable_dict[f'len_{i}'] = len(Header_list[i]) + 2
    print(variable_dict['len_0'])
    print(variable_dict[f'len_{i}'])
    for company in Donar_list:
        print(company[i], variable_dict[f"len_{i}"])
        if len(str(company[i])) > variable_dict[f"len_{i}"]:
            variable_dict[f'len_{i}'] = len(str(company[i]))
            print(variable_dict[f"len_{i}"])
This allows you to access the values using the same key:
len_of_4 = variable_dict['len_4']
If you REALLY REALLY need to dynamically create variables, you could use the exec function to run strings as Python code. It's important to note that exec is not safe: it will run any potentially malicious code it is given:
for i in range(0, len(Header_list)):
    exec(f'len_{i} = {len(Header_list[i]) + 2}')
    print(len_0); print(f'len{i}')
    for company in Donar_list:
        print(company[i], f"len_{i}")
        # exec() always returns None, so the condition itself needs eval()
        if eval(f'len(str(company[i])) > len_{i}'):
            exec(f'len_{i} = len(str(company[i]))')
            print(f"len_{i}")
In Python everything is an object, so you can treat the current module as an object and set attributes on it, like this:
import sys

module = sys.modules[__name__]

Header_list = [0, 1, 2, 3, 4]
len_ = len(Header_list)
for i in range(len_):
    setattr(module, f"len_{i}", Header_list[i] + 2)

print(len_0)
print(len_1)
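To read such a variable back by its constructed name inside a loop (which print(f'len_{i}') in the question cannot do, since it only prints the name itself), the matching getattr call works the same way:

# read the dynamically created attributes back by name
for i in range(len_):
    print(getattr(module, f"len_{i}"))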
Header_list = [0, 1, 2, 3, 4]
for i in range(0, 5):
    exec(f'len_{i} = {Header_list[i] + 2}')
    print(f'len{i}')
output:
len0
len1
len2
len3
len4
Hi, I've hit another roadblock in the TensorFlow crash course, at the representation programming exercises on this page:
https://developers.google.com/…/repres…/programming-exercise
I'm at Task 2: Make Better Use of Latitude.
It seems I've narrowed the issue down to where I convert the raw latitude data into "buckets" or ranges, which are represented as 1 or 0 in my feature. The actual code and the issue I have are in the pastebin. Any advice would be great, thanks!
https://pastebin.com/xvV2A9Ac
This is to convert the raw latitude data in my pandas DataFrame into "buckets" or ranges, as Google calls them:
LATITUDE_RANGES = zip(xrange(32, 44), xrange(33, 45))
In the above code I replaced xrange with plain range, since xrange was removed in Python 3.
Could this be the problem, using range instead of xrange? See below for my conundrum.
def select_and_transform_features(source_df):
    selected_examples = pd.DataFrame()
    selected_examples["median_income"] = source_df["median_income"]
    for r in LATITUDE_RANGES:
        selected_examples["latitude_%d_to_%d" % r] = source_df["latitude"].apply(
            lambda l: 1.0 if l >= r[0] and l < r[1] else 0.0)
    return selected_examples
The next two lines run the above function and convert my existing training and validation data sets into ranges or buckets for latitude:
selected_training_examples = select_and_transform_features(training_examples)
selected_validation_examples = select_and_transform_features(validation_examples)
This is the training model:
_ = train_model(
    learning_rate=0.01,
    steps=500,
    batch_size=5,
    training_examples=selected_training_examples,
    training_targets=training_targets,
    validation_examples=selected_validation_examples,
    validation_targets=validation_targets)
THE PROBLEM:
OK, so here is how I understand the problem. When I run the training model, it throws this error:
ValueError: Feature latitude_32_to_33 is not in features dictionary.
So I inspected selected_training_examples and selected_validation_examples, and here's what I found. If I run
selected_training_examples = select_and_transform_features(training_examples)
then I get the proper data set when I call selected_training_examples, which yields all the feature "buckets", including latitude_32_to_33.
But when I run the next call,
selected_validation_examples = select_and_transform_features(validation_examples)
it yields no buckets or ranges, resulting in the
ValueError: Feature latitude_32_to_33 is not in features dictionary.
So I next tried commenting out the first call,
selected_training_examples = select_and_transform_features(training_examples)
and ran only the second one,
selected_validation_examples = select_and_transform_features(validation_examples)
If I do this, I get the desired dataset for selected_validation_examples, but now running the first call no longer gives me the "buckets", and I'm back where I began. I guess my question is: how are the two calls affecting each other and preventing each other from giving me the datasets I need, when I run them together?
Thanks in advance!
A Python developer gave me the solution, so I just wanted to share it. LATITUDE_RANGES = zip(xrange(32, 44), xrange(33, 45)) can only be consumed once the way it was written, so I placed it inside the succeeding select_and_transform_features(source_df) function, which solved the issue. Thanks again everyone.
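For context: in Python 3, zip returns a single-use iterator, so the first call to select_and_transform_features exhausts LATITUDE_RANGES and the second call sees an empty iterator. A minimal sketch of the fix described above, rebuilding the ranges on every call:

def select_and_transform_features(source_df):
    # recreate the iterator on each call; a module-level zip() would be
    # exhausted after the first pass in Python 3
    latitude_ranges = zip(range(32, 44), range(33, 45))
    selected_examples = pd.DataFrame()
    selected_examples["median_income"] = source_df["median_income"]
    for r in latitude_ranges:
        selected_examples["latitude_%d_to_%d" % r] = source_df["latitude"].apply(
            lambda l: 1.0 if l >= r[0] and l < r[1] else 0.0)
    return selected_examples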
I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide it into.
def Freedman_Diaconis(column_values):
    # sort the list first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    # bin width h = 2 * IQR * n^(-1/3); force float division, since under
    # Python 2 the exponent -1/3 is integer division and equals -1
    n_to_pow = len(column_values[1]) ** (-1.0 / 3.0)
    h = 2 * iqr * n_to_pow
    # number of bins = (max - min) / h
    retval = (column_values[1][-1] - column_values[1][0]) / h
    test = int(retval + 1)
    return test
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix + str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1]) + 'x' + str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
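As a quick sanity check of that behaviour (hypothetical inputs; column uses the same (name, values) tuple shape as the other helpers):

column = ('x14', [1, 2])
print(string_transform('x14', column, 14))  # ['14x1', '14x2']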
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    if not os.path.exists('baskets'):
        os.makedirs('baskets')
    down_length = len(list_of_lists[0])
    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)

print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules with my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is:
Why do I get multiple association rules that imply there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift numbers above 2, as opposed to the 1 you should expect if everything were random as the notes state (lift(A -> B) = P(A and B) / (P(A) * P(B)), so independent items give a lift of 1).
Supp  Conf  Rule
 0.3   0.7  6x0 -> trt1
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables I have: with only a handful of rows, a support threshold like 0.3 can be cleared by just a few chance co-occurrences, and with 42 variables there are thousands of candidate itemsets, so some will pass support and confidence by luck alone. I would need a much larger sample to really use the method I was using. In fact, the method I tried was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.
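To make that concrete, here is a rough sketch (made-up sample size, not the question's data) showing that purely random binary variables easily produce "rules" clearing a 0.3 support / 0.3 confidence threshold when the sample is small:

import random

random.seed(0)
n_rows, n_vars = 50, 42
data = [[random.randint(0, 1) for _ in range(n_vars)] for _ in range(n_rows)]

# count variable pairs a -> b that pass both thresholds by chance alone
spurious = 0
for a in range(n_vars):
    for b in range(n_vars):
        if a == b:
            continue
        supp_a = sum(1 for row in data if row[a])
        both = sum(1 for row in data if row[a] and row[b])
        if supp_a and both / float(n_rows) >= 0.3 and both / float(supp_a) >= 0.3:
            spurious += 1

print(spurious)  # typically hundreds of "rules" from pure noise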
I'm developing a QGIS plugin (under version 2.8.1) for traffic assignment, where I want to show the results of my simulation at each time step. Right now I'm using the Time Manager plugin, but it gets very slow when my layer has hundreds of thousands of attributes. In my case I know exactly which feature IDs I want to show at each time step, so I thought it would be easy to make it faster.
Here is what I tried (sorry for my way of Python programming, I'm quite new to this language): at each time step of my loop, I set the ordered list of indexes of attributes to show (they are always ordered in my case).
# TEST 1 -----------------------------------
for step in time_steps:
    index_start = my_list_of_indexes_start[step]
    index_end = my_list_of_indexes_end[step]
    expression = 'fid >= ' + str(index_start) + ' AND fid <= ' + str(index_end)
    # Or for optimization tests
    # expression = '"FIELD_TIME"' + "=" + str(step)
    layer_dynamic.setSubsetString(expression)
    self.iface.mapCanvas().refresh()
    time.sleep(0.2)
# TEST 2 ------------------------------------
for step in time_steps:
    index_start = my_list_of_indexes_start[step]
    index_end = my_list_of_indexes_end[step]
    indexes = list(j for j in range(index_start, index_end))
    request = QgsFeatureRequest().setFilterFids(indexes)
    layer_dynamic.getFeatures(request)
    self.iface.mapCanvas().refresh()
    time.sleep(0.2)
Solution 1, with
layer_dynamic.setSubsetString(expression)
works: it refreshes the view with the correct filtered features displayed on the canvas at each time step, but it is even slower than using a SQL expression based not on the indexes but on attribute values (as shown in the comment in the TEST 1 loop).
Solution 2, with
layer_dynamic.getFeatures(request)
is fast, but the display of the layer doesn't change.
Any idea why?
The method
bool QgsVectorLayer.setSubsetString(self, QString subset)
filters the layer (more details in setSubsetString), so only the features that match the filter (provided as a SQL statement or other definition string in the "subset" QString) "will belong to the layer" after it is filtered. Thus, when you call refresh, only the filtered features are displayed.
On the other hand, the method
QgsFeatureIterator QgsVectorLayer.getFeatures(self, QgsFeatureRequest request=QgsFeatureRequest())
returns an iterator over the features matching your request (more details in getFeatures). It doesn't filter the layer; using the iterator, you just iterate over the features matching the request.
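A minimal sketch of the difference, reusing the names from the question (QGIS 2.x API):

# getFeatures() only yields the matching features; it does not change
# what the canvas draws
request = QgsFeatureRequest().setFilterFids(indexes)
for feature in layer_dynamic.getFeatures(request):
    print(feature.id())  # inspect features here; nothing is rendered

# setSubsetString() changes the layer itself, so a subsequent refresh
# draws only the filtered features
layer_dynamic.setSubsetString('fid >= 10 AND fid <= 20')
self.iface.mapCanvas().refresh()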