I am running a model evaluation protocol for Modeller. It evaluates every model and writes its result to a separate file. However I have to run it for every model and write to a single file.
This is the original code:
from modeller import *
from modeller.scripts import complete_pdb
log.verbose() # request verbose output
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib') # read topology
env.libs.parameters.read(file='$(LIB)/par.lib') # read parameters
# read model file
mdl = complete_pdb(env, 'TvLDH.B99990001.pdb')
# Assess all atoms with DOPE:
s = selection(mdl)
s.assess_dope(output='ENERGY_PROFILE NO_REPORT', file='TvLDH.profile',
normalize_profile=True, smoothing_window=15)
I added a loop to evaluate every model in a single run, however I am creating several files (one for each model) and I want is to print all evaluations in a single file
from modeller import *
from modeller.scripts import complete_pdb
log.verbose() # request verbose output
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib') # read topology
env.libs.parameters.read(file='$(LIB)/par.lib') # read parameters
#My loop starts here
for i in range (1,1001):
number=str(i)
if i<10:
name='000'+number
else:
if i<100:
name='00'+number
else:
if i<1000:
name='0'+number
else:
name='1000'
# read model file
mdl = complete_pdb(env, 'TcP5CDH.B9999'+name+'.pdb')
# Assess all atoms with DOPE: this is the assesment that i want to print in the same file
s = selection(mdl)
savename='TcP5CDH.B9999'+name+'.profile'
s.assess_dope(output='ENERGY_PROFILE NO_REPORT',
file=savename,
normalize_profile=True, smoothing_window=15)
As I am new to programming, any help will be very helpful!
Welcome :-) Looks like you're very close. Let's introduce you to using a python function and the .format() statement.
Your original has a comment line # read model file, which looks like it could be a function, so let's try that. It could look something like this.
from modeller import *
from modeller.scripts import complete_pdb
log.verbose() # request verbose output
# I'm assuming this can be done just once
# and re-used for all your model files...
# (if not, the env stuff should go inside the
# read_model_file() function.
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib') # read topology
env.libs.parameters.read(file='$(LIB)/par.lib') # read parameters
def read_model_file(file_name):
print('--- read_model_file(file_name='+file_name+') ---')
mdl = complete_pdb(env, file_name)
# Assess all atoms with DOPE:
s = selection(mdl)
output_file = file_name+'.profile'
s.assess_dope(
output='ENERGY_PROFILE NO_REPORT',
file=output_file,
normalize_profile=True,
smoothing_window=15)
for i in range(1,1001):
file_name = 'TcP5CDH.B9999{:04d}pdb'.format(i)
read_model_file(file_name)
Using .format() we can get ride of the multiple if-statement checks for 10? 100? 1000?
Basically .format() replaces {} curly braces with the argument(s).
It can be pretty complex but you don't need to digetst all of it.
Example:
'Hello {}!'.format('world') yields Hello world!. The {:04d} stuff uses formatting, basically that says "Please make a 4-character wide digit-substring and zero-fill it, so you should get '0001', ..., '0999', '1000'.
Just {:4d} (no leading zero) would give you space padded results (e.g. ' 1', ..., ' 999', '1000'.
Here's a little more on the zero-fill: Display number with leading zeros
Related
I have a rule that calculates new variables based on a set of variables, these variables are separated into different files. I have another rule for calculating the averages of all the different variables in the whole database. My problem is that snakemake tries to find my derived variables in the original database, which of course are not there.
Is there a way to have constrained the averaging rule such that it will calculate the average for all variables except for a list of the variables that are derived
Psudo code of how the rule look like
rule calc_average:
input:
pi_clim_var = lambda w: get_control_path(w, w.variable),
output:
outpath = outdir+'{experiment}/{variable}/{variable}_{experiment}_{model}_{freq}.nc'
log:
"logs/calc_average/{variable}_{model}_{experiment}_{freq}.log"
wildcard_constraints:
variable= '!calculated1!calculated2' # "orrvar1|orrvar2"....
notebook:
"../notebooks/calc_clim.py.ipynb"
I can of make a list of all the variables that I would like to have in the database and
then do:
wildcard_constraints:
variable="|".join(list_of_vars)
But I was wondering if it is possible to do it the other way round? E.g:
wildcard_constraints:
variable="!".join(negate_list_of_vars) # don't match these wildcards
EDIT:
The get_control_path(w, w.variable) constructs the path to the input file based on a lookup table that uses the wildcards as a keys.
def get_control_path(w, variable, grid_label=None):
if grid_label == None:
grid_label = config['default_grid_label']
try:
paths = get_paths(w, variable,'piClim-control', grid_label,activity='RFMIP', control=True)
except KeyError:
paths = get_paths(w, variable,'piClim-control', grid_label,activity='AerChemMIP', control=True)
return paths
def get_paths(w, variable,experiment, grid_label=None, activity=None, control=False):
"""
Get CMIP6 model paths in database based on the lookup tables.
Parameters:
-----------
w : snake.wildcards
a named tuple that contains the snakemake wildcards
"""
if w.model in ["NorESM2-LM", "NorESM2-MM"]:
root_path = f'{ROOT_PATH_NORESM}/{CMIP_VER}'
look_fnames = LOOK_FNAMES_NORESM
else:
root_path = f'{ROOT_PATH}/{CMIP_VER}'
look_fnames = LOOK_FNAMES
if activity:
activity= activity
else:
activity = LOOK_EXP[experiment]
model = w.model
if control:
variant=config['model_specific_variant']['control'].get(model, config['variant_default'])
else:
variant = config['model_specific_variant']['experiment'].get(model, config['variant_default'])
table_id = TABLE_IDS.get(variable,DEFAULT_TABLE_ID)
institution = LOOK_INSTITU[model]
try:
file_endings = look_fnames[activity][model][experiment][variant][table_id]['fn']
except:
raise KeyError(f"File ending is not defined for this combination of {activity}, {model}, {experiment}, {variant} and {table_id} " +
"please update config/lookup_file_endings.yaml accordingly")
if grid_label == None:
grid_label = look_fnames[activity][model][experiment][variant][table_id]['gl'][0]
check_path = f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}'
if os.path.exists(check_path)==False:
grid_labels = ['gr','gn', 'gl','grz', 'gr1']
i = 0
while os.path.exists(check_path)==False and i < len(grid_labels):
grid_label = grid_labels[i]
check_path = f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}'
i += 1
if control:
version = config['version']['version_control'].get(w.model, 'latest')
else:
version = config['version']['version_exp'].get(w.model, 'latest')
fname = f'{variable}_{table_id}_{model}_{experiment}_{variant}_{grid_label}'
paths = expand(
f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}/{version}/{fname}_{{file_endings}}'
,file_endings=file_endings)
# Sometimes the verisons are just messed up... try one more time with latest
if not os.path.exists(paths[0]):
paths=expand(
f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}/latest/{fname}_{{file_endings}}'
,file_endings=file_endings)
# Sometimes the file ending are different depending on varialbe
if not os.path.exists(paths[0]) and len(paths) >= 2:
paths = [paths[1]]
return paths
Your description is a little too abstract for me to fully grasp your intent. You may be able to use regex's to solve this, but depending on the number of variables to consider, it could be a very slow calculation to match. Here are some other ideas, if they don't seem right please update your question with a little more context (the rule requesting calc_average and the get_control_path function).
Place derived and original files in different subdirectories. Then you can restrict the average rule to just be the original files.
Incorporate the logic into an input function/expand. Say the requesting rule is doing something like
rule average:
input: expand('/path/to/{input}', input=[input for input in inputs if input not in negate_list_of_vars])
output: 'path/to/average'
What can i do to prevent same file being read more then twice?
For the background, i have below detail
Im trying to read list of file in a folder, transform it, output it into a file, and check the gap before and after transformation
first for the reading part
def load_file(file):
df = pd.read_excel(file)
return df
file_list = glob2.glob("folder path here")
future_list = [delayed(load_file)(file) for file in file_list]
read_result_dd = dd.from_delayed(future_list)
After that , i will do some transformation to the data:
def transform(df):
# do something to df
return df
transformation_result = read_result_dd.map_partitions(lambda df: transform(df))
i would like to achieve 2 things:
first to get the transformation output:
Outputfile = transformation_result.compute()
Outputfile.to_csv("path and param here")
second to get the comparation
read_result_comp = read_result_dd.groupby("groupby param here")["result param here"].sum().reset_index()
transformation_result_comp = transformation_result_dd.groupby("groupby param here")["result param here"].sum().reset_index()
Checker = read_result_dd.merge(transformation_result, on=['header_list'], how='outer').compute()
Checker.to_csv("path and param here")
The problem is if i call Outputfile and Checker in sequence, i.e.:
Outputfile = transformation_result.compute()
Checker = read_result_dd.merge(transformation_result, on=['header_list'], how='outer').compute()
Outputfile.to_csv("path and param here")
Checker.to_csv("path and param here")
it will read the entire file twice (for each of the compute)
Is there any way to have the read result done only once?
Also are there any way to have both compute() to run in a sequence? (if i run it in two lines, from the dask dashboard i could see that it will run the first, clear the dasboard, and run the second one instead of running both in single sequence)
I cannot run .compute() for the result file because my ram can't contain it, the resulting dataframe is too big. both the checker and the output file is significantly smaller compared to the original data.
Thanks
You can call the dask.compute function on multiple Dask collections
a, b = dask.compute(a, b)
https://docs.dask.org/en/latest/api.html#dask.compute
In the future, I recommend producing an MCVE
I altered some CLIPS/CLIPSpy code to look for when the Variable column in a CSV is the word Oil Temp and when the duration of that column is above 600 or above. The rule should fire twice according to the CSV I'm using:
I'm receiving the following error.
Here is my code currently. I think it's failing on the variable check or the logical and check.
import sys
from tempfile import mkstemp
import os
import clips
CLIPS_CONSTRUCTS = """
(defglobal ?*oil-too-hot-times* = 0)
(deftemplate oil-is-too-hot-too-long
(slot Variable (type STRING))
(slot Duration (type INTEGER)))
(defrule check-for-hot-oil-too-long-warning
(oil-is-too-hot-too-long (Variable ?variable) (Duration ?duration))
(test (?variable Oil Temp))
(and (>= ?duration 600))
=>
(printout t "Warning! Check engine light on!" tab ?*oil-too-hot-times* crlf))
"""
def main():
environment = clips.Environment()
# use environment.load() to load constructs from a file
constructs_file, constructs_file_name = mkstemp()
file = open(constructs_file, 'wb')
file.write(CLIPS_CONSTRUCTS.encode())
file.close()
environment.load(constructs_file_name)
os.remove(constructs_file_name)
# enable fact duplication as data has duplicates
environment.eval("(set-fact-duplication TRUE)")
# Template facts can be built from their deftemplate
oil_too_hot_too_long_template = environment.find_template("oil-is-too-hot-too-long")
for variable, duration in get_data_frames(sys.argv[1]):
new_fact = oil_too_hot_too_long_template.new_fact()
# Template facts are represented as dictionaries
new_fact["Variable"] = variable
new_fact["Duration"] = int(duration)
# Add the fact into the environment Knowledge Base
new_fact.assertit()
# Fire all the rules which got activated
environment.run()
def get_data_frames(file_path):
"""Parse a CSV file returning the dataframes."""
with open(file_path) as data_file:
return [l.strip().split(",") for i, l in enumerate(data_file) if i > 1]
if __name__ == "__main__":
main()
CLIPS adopts Polish/Prefix notation. Therefore, your rule should be written as follows.
(defrule check-for-hot-oil-too-long-warning
(oil-is-too-hot-too-long (Variable ?variable) (Duration ?duration))
(test (and (eq ?variable "Oil Temp")
(>= ?duration 600)))
=>
(printout t "Warning! Check engine light on!" tab ?*oil-too-hot-times* crlf))
Also notice how the type STRING requires double quotes ".
Yet I'd suggest you to leverage the alpha network matching of the engine which is more concise and efficient.
(defrule check-for-hot-oil-too-long-warning
(oil-is-too-hot-too-long (Variable "Oil Temp") (Duration ?duration))
(test (>= ?duration 600))
=>
(printout t "Warning! Check engine light on!" tab ?*oil-too-hot-times* crlf))
The engine can immediately see that your Variable slot is a constant and can optimize the matching logic accordingly. I am not sure it can make the same assumption within the joint test.
I am trying to create my own tool to use in ArcMap but keep running into a problem. I want to create a buffer (which I can do) and then clip the points that fall within the buffer. The problem I run into is that I cannot figure out how to use the buffer as the input feature for the clip section of my tool.
import arcpy
import os
from arcpyimmport env
env.workspace = "C:/LabData"
arcpy.env.overwriteOutput = True
In_lake = arcpy.GetParameterAsText(0)
Out_Buff = arcpy.GetParameterAsText(1)
Buffer_Distance = arcpy.GetParameterAstext(2)
in_cities = arcpy.GetParameterAsText(3)
cliped_cities = GetParameterAsText(4)
New_Table = arcpy.GetParameterAsText(5)
Join_Input = arcpy.GetParameteAsText(6)
# step 1 create a buffer around the lakes
arcpy.Buffer_analysis(In_Lake, Out_Buff, Buffer_Distance)
# Step 2 Clip all cities that fall within the buffer
arcpy.Clip_analysis( in_cities,out_Buff, clipped_cities)
# Step 3
arcpy.Statistics_analysis(clipped_cities, New_Table, statistics_fields,\
'Population SUM', 'CNTRY_NAME')
# Step 5
arcpy.AddField_management (New_Table, 'Country', 'TEXT')
]1
Check carefully that your variable names match -- Python and ArcPy are case sensitive.
In_Lake = arcpy.GetParameterAsText(0) ## was In_lake
Out_Buff = arcpy.GetParameterAsText(1)
Buffer_Distance = arcpy.GetParameterAstext(2)
in_cities = arcpy.GetParameterAsText(3)
clipped_cities = GetParameterAsText(4) ## was cliped_cities
New_Table = arcpy.GetParameterAsText(5)
Join_Input = arcpy.GetParameteAsText(6)
# step 1 create a buffer around the lakes
arcpy.Buffer_analysis(In_Lake, Out_Buff, Buffer_Distance)
# Step 2 Clip all cities that fall within the buffer
arcpy.Clip_analysis(in_cities, Out_Buff, clipped_cities) ## was out_Buff
Unless you want to keep the lake buffer, it doesn't necessarily need to be an input parameter specified by the user. Consider instead using the in_memory workspace -- just be aware any data in it will be deleted once the tool execution is completed.
Out_Buff = r'in_memory\lakeBuffer'
A similar strategy can be used for any intermediate feature class or table that you don't really care about. However, it's sometimes useful to have those intermediate results around to verify that your tool is doing what you expect at every step.
I am trying to read a log file and compare certain values against preset thresholds. My code manages to log the raw data from with the first for loop in my function.
I have added print statements to try and figure out what was going on and I've managed to deduce that my second for loop never "happens".
This is my code:
def smartTest(log, passed_file):
# Threshold values based on averages, subject to change if need be
RRER = 5
SER = 5
OU = 5
UDMA = 5
MZER = 5
datafile = passed_file
# Log the raw data
log.write('=== LOGGING RAW DATA FROM SMART TEST===\r\n')
for line in datafile:
log.write(line)
log.write('=== END OF RAW DATA===\r\n')
print 'Checking SMART parameters...',
log.write('=== VERIFYING SMART PARAMETERS ===\r\n')
for line in datafile:
if 'Raw_Read_Error_Rate' in line:
line = line.split()
if int(line[9]) < RRER and datafile == 'diskOne.txt':
log.write("Raw_Read_Error_Rate SMART parameter is: %s. Value under threshold. DISK ONE OK!\r\n" %int(line[9]))
elif int(line[9]) < RRER and datafile == 'diskTwo.txt':
log.write("Raw_Read_Error_Rate SMART parameter is: %s. Value under threshold. DISK TWO OK!\r\n" %int(line[9]))
else:
print 'FAILED'
log.write("WARNING: Raw_Read_Error_Rate SMART parameter is: %s. Value over threshold!\r\n" %int(line[9]))
rcode = mbox(u'Attention!', u'One or more hardrives may need replacement.', 0x30)
This is how I am calling this function:
dataOne = diskOne()
smartTest(log, dataOne)
print 'Disk One Done'
diskOne() looks like this:
def diskOne():
if os.path.exists(r"C:\Dejero\HDD Guardian 0.6.1\Smartctl"):
os.chdir(r"C:\Dejero\HDD Guardian 0.6.1\Smartctl")
os.system("Smartctl -a /dev/csmi0,0 > C:\Dejero\Installation-Scripts\diskOne.txt")
# Store file in variable
os.chdir(r"C:\Dejero\Installation-Scripts")
datafile = open('diskOne.txt', 'rb')
return datafile
else:
log.write('Smart utility not found.\r\n')
I have tried googling similar issues to mine and have found none. I tried moving my first for loop into diskOne() but the same issue occurs. There is no syntax error and I am just not able to see the issue at this point.
It is not skipping your second loop. You need to seek the position back. This is because after reading the file, the file offset will be placed at the end of the file, so you will need to put it back at the start. This can be done easily by adding a line
datafile.seek(0);
Before the second loop.
Ref: Documentation