This question is a follow-up to the question I asked here, which in summary was:
"In python how do I read in parameters from the text file params.txt, creating the variables and assigning them the values that are in the file? The contents of the file are (please ignore the auto syntax highlighting, params.txt is actually a plain text file):
Lx = 512 Ly = 512
g = 400
================ Dissipation =====================
nupower = 8 nu = 0
...[etc]
and I want my python script to read the file so that I have Lx, Ly, g, nupower, nu etc available as variables (not keys in a dictionary) with the appropriate values given in params.txt. By the way I'm a python novice."
With help, I have come up with the following solution that uses exec():
with open('params.txt', 'r') as infile:
    for line in infile:
        splitline = line.strip().split(' ')
        for i, word in enumerate(splitline):
            if word == '=':
                exec(splitline[i-1] + splitline[i] + splitline[i+1])
This works, e.g. print(Lx) returns 512 as expected.
My questions are:
(1) Is this approach safe? Most questions mentioning the exec() function have answers that contain dire warnings about its use, and imply that you shouldn't use it unless you really know what you're doing. As mentioned, I'm a novice so I really don't know what I'm doing, so I want to check that I won't be making problems for myself with this solution. The rest of the script does some basic analysis and plotting using the variables read in from this file, and data from other files.
(2) If I want to wrap up the code above in a function, e.g. read_params(), is it just a matter of changing the last line to exec(splitline[i-1] + splitline[i] + splitline[i+1], globals())? I understand that this causes exec() to make the assignments in the global namespace. What I don't understand is whether this is safe, and if not why not. (See above about being a novice!)
(1) Is this approach safe?
No, it is not safe. If someone can edit/control/replace params.txt, they can craft it in such a way to allow arbitrary code execution on the machine running the script.
It really depends where and who will run your Python script, and whether they can modify params.txt. If it's just a script run directly on a normal computer by a user, then there's not much to worry about, because they already have access to the machine and can do whatever malicious things they want, without having to do it using your Python script.
(2) If I want to wrap up the code above in a function, e.g. read_params(), is it just a matter of changing the last line to exec(splitline[i-1] + splitline[i] + splitline[i+1], globals())?
Correct. It doesn't change the fact you can execute arbitrary code.
Suppose this is params.txt:
Lx = 512 Ly = 512
g = 400
_ = print("""Holy\u0020calamity,\u0020scream\u0020insanity\nAll\u0020you\u0020ever\u0020gonna\u0020be's\nAnother\u0020great\u0020fan\u0020of\u0020me,\u0020break\n""")
_ = exec(f"import\u0020ctypes")
_ = ctypes.windll.user32.MessageBoxW(None,"Releasing\u0020your\u0020uranium\u0020hexaflouride\u0020in\u00203...\u00202...\u00201...","Warning!",0)
================ Dissipation =====================
nupower = 8 nu = 0
And this is your script:
def read_params():
    with open('params.txt', 'r') as infile:
        for line in infile:
            splitline = line.strip().split(' ')
            for i, word in enumerate(splitline):
                if word == '=':
                    exec(splitline[i-1] + splitline[i] + splitline[i+1], globals())

read_params()
As you can see, it has correctly assigned your variables, but it has also called print, imported the ctypes library, and has then presented you with a dialog box letting you know that your little backyard enrichment facility has been thwarted.
As martineau suggested, you can use configparser. You'd have to modify params.txt so there is only one variable per line.
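For example, a minimal sketch of what that could look like (the [params] section header is my assumption; configparser needs at least one section, and the file would have to be reworked to one assignment per line):

import configparser

# Assumed layout of the reworked params.txt:
#
#   [params]
#   Lx = 512
#   Ly = 512
#   g = 400
#   nupower = 8
#   nu = 0

config = configparser.ConfigParser()
config.read('params.txt')

Lx = config.getint('params', 'Lx')
Ly = config.getint('params', 'Ly')
g = config.getint('params', 'g')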
tl;dr: Using exec is unsafe, and not best practice, but that doesn't matter if your Python script will only be run on a normal computer by users you trust. They can already do malicious things, simply by having access to the computer as a normal user.
Is there an alternative to configparser?
I'm not sure. With your use-case, I don't think you have much to worry about. Just roll your own.
This is similar to some of the answers to your other question, but it uses literal_eval and updates the globals dictionary, so you can use the variables directly as you want to.
params.txt:
Lx = 512 Ly = 512
g = 400
================ Dissipation =====================
nupower = 8 nu = 0
alphapower = -0 alpha = 0
================ Timestepping =========================
SOMEFLAG = 1
SOMEOTHERFLAG = 4
dt = 2e-05
some_dict = {"key":[1,2,3]}
print = "builtins_can't_be_rebound"
Script:
import ast
def read_params():
    '''Reads the params file and updates the globals dict.'''
    _globals = globals()
    reserved = dir(_globals['__builtins__'])
    with open('params.txt', 'r') as infile:
        for line in infile:
            tokens = line.strip().split(' ')
            zipped_tokens = zip(tokens, tokens[1:], tokens[2:])
            for prev_token, curr_token, next_token in zipped_tokens:
                if curr_token == '=' and prev_token not in reserved:
                    #print(prev_token, curr_token, next_token)
                    try:
                        _globals[prev_token] = ast.literal_eval(next_token)
                    except (SyntaxError, ValueError) as e:
                        print(f'Cannot eval "{next_token}". {e}. Continuing...')

read_params()
# We can now use the variables as expected
Lx += Ly
print(Lx, Ly, SOMEFLAG, some_dict)
Output:
1024 512 1 {'key': [1, 2, 3]}
Related
I am using python to pre-process a LaTeX file before running it through the LaTeX compiler. I have a python script which reads a .def file. The .def file contains some python code at the top which helps to initialize problems with randomization. Below the python code I have LaTeX code for the problem. For each variable in the LaTeX code, I use the symbol # to signify that it should be randomized and replaced before compiling. For example, I may write #a to be a random integer between 1 and 10.
There may be larger issues with what I'm trying to do, but so far it's working mostly as I need it to. Here is a sample .def file:
a = random.choice(range(-3,2))
b = random.choice(range(-6,1))
x1 = random.choice(range(a-3,a))
x2 = x1+3
m = 2*x1 - 2*a + 3
y1 = (x1-a)**2+b
y2 = (x2-a)**2+b
xmin = a - 5
xmax = a + 5
ymin = b-1
ymax = b+10
varNames = [["#a", str(a)],["#b", str(b)], ["#x1",str(x1)], ["#x2",str(x2)], ["#m", str(m)], ["#y1",str(y1)], ["#y2",str(y2)], ["#xmin", str(xmin)], ["#xmax", str(xmax)], ["#ymin", str(ymin)], ["#ymax", str(ymax)]]
#####
\question The graph of $f(x) = (x-#a)^2+#b$ is shown below along with a secant line between the points $(#x1,#y1)$ and $(#x2,#y2)$.
\begin{center}
\begin{wc_graph}[xmin=#xmin, xmax=#xmax, ymin=#ymin, ymax=#ymax, scale=.75]
\draw[domain=#a-3.16:#a + 3.16, smooth] plot ({\x}, {(\x-#a)^2+#b});
\draw (#x1,#y1) to (#x2,#y2);
\pic at (#x1,#y1) {closed};
\pic at (#x2,#y2) {closed};
\end{wc_graph}
\end{center}
\begin{parts}
\part What does the slope of the secant line represent?
\vfill
\part Compute the slope of the secant line.
\vfill
\end{parts}
\newpage
As you can see, removing the #a and replacing it with the actual value of a is starting to become tedious. In my python script, I just replace all of the #ed things in my latexCode string.
for x in varNames:
    latexCode = latexCode.replace(x[0],x[1])
which seems to work okay. However, it seems obnoxious and ugly.
My Question: Is there a better way of working between the variable identifier and the string version of the identifier? It would be great if I could simply make a python list of variable names I'm using in the .def file, and then run a simple command to update the LaTeX code. What I've done is cumbersome and painful. Thanks in advance!
Yes, either eval(name) or getattr(obj, name) or globals()[name]. In this case I'd say globals()[name].
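For example, a rough sketch of that idea applied to your script (the var_names list here is just illustrative):

# Build the replacement table from plain variable names instead of
# writing out each ["#a", str(a)] pair by hand.
var_names = ["a", "b", "x1", "x2", "m", "y1", "y2",
             "xmin", "xmax", "ymin", "ymax"]

for name in var_names:
    latexCode = latexCode.replace("#" + name, str(globals()[name]))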
Also vars would work:
https://docs.python.org/3/library/functions.html#vars
In the following fragment it is used, for example, to make objects know their own name:
def _setPublicElementNames (self):
    for var in vars (self):
        if not var.startswith ('_'):
            getattr (self, var) ._name = var
Yet another, unnecessarily complicated, solution would be to generate a .py file with the right replace statements from your .def file.
I have been working with some code that exports layers individually filled with important data into a folder. The next thing I want to do is bring each one of those layers into a different program so that I can combine them and do some different tests. The current way that I know how to do it is by importing them one by one (as seen below).
fn0 = 'layer0'
f0 = np.genfromtxt(fn0 + '.csv', delimiter=",")
fn1 = 'layer1'
f1 = np.genfromtxt(fn1 + '.csv', delimiter=",")
The issue with continuing this way is that I may have to deal with up to 100 layers at a time, and it would be very inconvenient to have to import each layer individually.
Is there a way I can change my code to do this iteratively, so that I end up with code similar to this:
N = 100
for i in range(N)
    fn(i) = 'layer(i)'
    f(i) = np.genfromtxt(fn(i) + '.csv', delimiter=",")
Please let me know if you know of any ways!
You can use string formatting as follows:
N = 100
f = []  # create an empty list
for i in range(N):
    fn_i = 'layer(%d)' % i  # parentheses!
    f.append(np.genfromtxt(fn_i + '.csv', delimiter=","))  # add to f
What I mean by parentheses! is that they are special characters: they indicate function calls and tuples, so you shouldn't use them in variable names (ever!).
The answer of Mohammad Athar is correct. However, you should not use %-formatting any longer. According to PEP 3101 (https://www.python.org/dev/peps/pep-3101/), it is supposed to be replaced by format(). Moreover, as you have more than 100 files, a zero-padded name like layer_007.csv is probably preferable.
Try something like:
dataDict=dict()
for counter in range(214):
    fileName = 'layer_{number:03d}.csv'.format(number=counter)
    dataDict[fileName] = np.genfromtxt( fileName, delimiter="," )
When using a dictionary, as here, you can directly access your data later by file name; it is unordered, though, so you might prefer the list version from Mohammad Athar's answer.
I think this code takes too long to execute, so maybe there are better ways to do this. I'm not looking for an answer related to parallelising the for loops, or using more than one processor.
What I'm trying to do is read values from "file" using np.genfromtxt(file). I have 209*500*16 of these files. I want to extract the minimum of the highest 1000 values accumulated over the 209 loop, and put these 500 values into 16 different files. If a file is missing or the data doesn't have the adequate size, the info is written to the "missing_all" file.
The questions are:
Is this the best method to open a file?
Is this the best method to write to files?
How can I make this code faster?
Code:
import numpy as np
import os.path

output_filename2 = '/home/missing_all.txt'
target2 = open(output_filename2, 'w')

for w in range(16):
    group = 1200 + 50*w
    output_filename = '/home/veto_%s.txt' %(group)
    target = open(output_filename, 'w')
    for z in range(1,501):
        sig_b = np.zeros((209*300))
        y = 0
        for index in range(1,210):
            file = '/home/BandNo_%s_%s/%s_209.dat' %(group,z,index)
            if not os.path.isfile(file):
                sig_b[y:y+300] = 0
                y = y + 300
                target2.write('%s %s %s\n' % (group,z,index))
                continue
            data = np.genfromtxt(file)
            if (data.shape[0] < 300):
                sig_b[y:y+300] = 0
                y = y + 300
                target2.write('%s %s %s\n' % (group,z,index))
                continue
            sig_b[y:y+300] = np.sort(data[:,4])[::-1][0:300]
            y = y + 300
        sig_b = np.sort(sig_b[:])[::-1][0:1000]
        target.write('%s\n' % (sig_b[-1]))
Profiler
You can use a profiler to figure out what parts of your script take the most time. This way you know exactly what takes the most time and can optimize those lines instead of blindly trying to optimize your code. The time invested to figure out how the profiler works will pay for itself easily later on.
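For example (just a sketch using the standard library's cProfile; your_script.py is a placeholder name):

# Easiest: run the whole script under the profiler from the command line,
# sorted by cumulative time:
#
#   python -m cProfile -s cumulative your_script.py
#
# Or, from inside the script, wrap the work in a function and profile it:
import cProfile

def process_all_files():
    ...  # the file-reading/sorting code from above would go here

cProfile.run('process_all_files()', sort='cumulative')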
Some possible slow-downs
Here are some guesses, but they really are only guesses.
You open() only 17 files, so it probably doesn't matter how exactly you do this.
I don't know much about writing-performance. Using file.write() seems fine to me.
genfromtxt probably takes quite a while (depending on your input files); is loadtxt an alternative for you? The docs state you can use it for data without holes.
Using a binary file format instead of text could speed up reading the file.
You sort your array on every iteration. Is there a way to sort it only at the end?
Usually asking the file system something is not very fast, i.e. os.path.isfile(file) is potentially slow. You could try creating a dict of all the children of the parent directory and use that cached version.
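A rough sketch of that idea, reusing the directory layout from your code (the `existing` set is hypothetical):

import os

# Cache the directory contents once per (group, z) iteration instead of
# calling os.path.isfile() 209 times.
dir_name = '/home/BandNo_%s_%s' % (group, z)
try:
    existing = set(os.listdir(dir_name))
except OSError:
    existing = set()

for index in range(1, 210):
    if '%s_209.dat' % index not in existing:
        # handle the missing file as before
        continue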
Similarly, if most of your files exist, using exceptions can be faster:
try:
    data = np.genfromtxt(file)
except FileNotFoundError:  # not sure if this is the correct exception
    sig_b[y:y+300] = 0
    y += 300
    target2.write('%s %s %s\n' % (group,z,index))
    continue
I didn't try to understand your code in detail. Maybe you can reduce the necessary work by using a smarter algorithm?
PS: I like that you try to put all equal signs on the same column. Unfortunately here it makes it harder to read your code.
I've written a script which reads different files and searches for molecule IDs in big sdf databases (about 4.0 GB each).
The idea of this script is to copy every molecule from a list of IDs (about 287212 molecules) from my original databases to a new one, so that there is only a single copy of each molecule (in this case, the first copy encountered).
I've written this script:
import re
import sys
import os

def sdf_grep (molname,files):
    filin = open(files, 'r')
    filine= filin.readlines()
    for i in range(0,len(filine)):
        if filine[i][0:-1] == molname and filine[i][0:-1] not in past_mol:
            past_mol.append(filine[i][0:-1])
            iterate = 1
            while iterate == 1:
                if filine[i] == "$$$$\n":
                    filout.write(filine[i])
                    iterate = 0
                    break
                else:
                    filout.write(filine[i])
                    i = i+1
        else:
            continue
    filin.close()

mol_dock = os.listdir("test")
listmol = []
past_mol = []
imp_listmol = open("consensus_sorted_surflex.txt", 'r')
filout = open('test_ini.sdf','wa')

for line in imp_listmol:
    listmol.append(line.split('\t')[0])

print 'list ready... reading files'
imp_listmol.close()

for f in mol_dock:
    print 'reading '+f
    for molecule in listmol:
        if molecule not in past_mol:
            sdf_grep(molecule , 'test/'+f)
    print len(past_mol)

filout.close()
It works perfectly, but it is very slow... too slow for what I need. Is there a way to rewrite this script so that it takes less computation time?
Thank you very much.
The main problem is that you have three nested loops: molecule documents, molecules, and file parsing in the inner loop. That smells like trouble - I mean, quadratic complexity. You should move the parsing of the huge files out of the inner loop and use a set or dictionary for the molecules.
Something like this:
For each sdf file
    For each line, if it is a molecule definition
        Check the dictionary of unfound molecules
        If present, process it and remove it from the dictionary of unfound molecules
This way, you will parse each sdf file exactly once, and with each found molecule, speed will further increase.
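A rough sketch of that idea (the function and variable names here are made up, and the bookkeeping is simplified):

import os

def copy_wanted_molecules(sdf_path, wanted, out):
    """Scan one sdf file once, writing each still-wanted molecule block."""
    writing = False
    with open(sdf_path, 'r') as fin:
        for line in fin:
            name = line.rstrip('\n')
            if not writing and name in wanted:
                wanted.discard(name)   # keep only the first copy encountered
                writing = True
            if writing:
                out.write(line)
                if line == "$$$$\n":
                    writing = False

wanted = set()
with open("consensus_sorted_surflex.txt") as listing:
    for line in listing:
        wanted.add(line.split('\t')[0])

with open('test_ini.sdf', 'w') as out:
    for fname in os.listdir("test"):
        copy_wanted_molecules(os.path.join("test", fname), wanted, out)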
Let past_mol be a set, rather than a list. That will speed up
filine[i][0:-1] not in past_mol
since checking membership in a set is O(1), while checking membership in a list is O(n).
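If you want to see the difference yourself, here is a quick check with the timeit module (a sketch; the exact numbers depend on your machine):

import timeit

setup = "big_list = list(range(10**5)); big_set = set(big_list)"
# membership test that scans the whole list vs. a hash lookup in the set
print(timeit.timeit("99999 in big_list", setup=setup, number=1000))
print(timeit.timeit("99999 in big_set", setup=setup, number=1000))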
Try not to write to a file one line at a time. Instead, save up lines in a list, join them into a single string, and then write it out with one call to filout.write.
It is generally better not to allow functions to modify global variables. sdf_grep modifies the global variable past_mol.
By adding past_mol to the arguments of sdf_grep you make it explicit that sdf_grep depends on the existence of past_mol (otherwise, sdf_grep is not really a standalone function).
If you pass past_mol in as a third argument to sdf_grep, then Python will make a new local variable named past_mol which will point to the same object as the global variable past_mol. Since that object is a set and a set is a mutable object, past_mol.add(sline) will affect the global variable past_mol as well.
As an added bonus, Python looks up local variables faster than global variables:
def using_local():
    x = set()
    for i in range(10**6):
        x

y = set
def using_global():
    for i in range(10**6):
        y
In [5]: %timeit using_local()
10 loops, best of 3: 33.1 ms per loop
In [6]: %timeit using_global()
10 loops, best of 3: 41 ms per loop
sdf_grep can be simplified greatly if you use a variable (let's call it found) which keeps track of whether or not we are inside one of the chunks of lines we want to keep. (By "chunk of lines" I mean one that begins with molname and ends with "$$$$"):
import re
import sys
import os

def sdf_grep(molname, files, past_mol):
    chunk = []
    found = False
    with open(files, 'r') as filin:
        for line in filin:
            sline = line.rstrip()
            if sline == molname and sline not in past_mol:
                found = True
                past_mol.add(sline)
            elif sline == '$$$$':
                if found:
                    # close the current molecule block with its delimiter
                    chunk.append(line)
                found = False
            if found:
                chunk.append(line)
    # the collected lines still end in '\n', so join them without a separator
    return ''.join(chunk)
def main():
    past_mol = set()
    with open("consensus_sorted_surflex.txt", 'r') as imp_listmol:
        listmol = [line.split('\t')[0] for line in imp_listmol]
        print 'list ready... reading files'

    with open('test_ini.sdf', 'wa') as filout:
        for f in os.listdir("test"):
            print 'reading ' + f
            for molecule in listmol:
                if molecule not in past_mol:
                    filout.write(sdf_grep(molecule, os.path.join('test/', f), past_mol))
            print len(past_mol)

if __name__ == '__main__':
    main()
Is there a memory limit for Python? I've been using a Python script to calculate the average values from a file which is a minimum of 150 MB big.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to Python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count +=1

    for k in range(0,count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4
    for o in range(0,length):
        total = 0
        for p in range(0,count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter +=1
    file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second—hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years—most not too major. This is so that if folks use it as a template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
    from itertools import izip_longest
except ImportError:    # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))

        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]
        for print_counter, average in enumerate(averages):
            print('  {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')
        file_write.write('\n')

file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better iterate over each line:
for current_line in u:
    do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.
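For example, a small sketch (the file name is taken from your code; the rest is illustrative):

import csv

with open("A1_B1_100000.txt") as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        # each row is a list of field strings with the line ending already handled;
        # skip any empty trailing field left by a tab at the end of the line
        numbers = [float(value) for value in row if value.strip()]
        # ... update running totals with `numbers` here ...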
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On jython 2.5 it crashes earlier:
239000 [MiB]
probably I can configure Jython to use more memory (it uses limits from JVM)
Test app:
import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read them line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used

files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]

for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n') # why?
        table.append(values)

    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
    file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
for row_count, aline in enumerate(afile, start=1):
values = aline.split('\t')
values.remove('\n') # why?
fvalues = map(float, values)
if row_count == 1:
row0length = len(fvalues)
column_index_range = range(row0length)
column_totals = fvalues
else:
assert len(fvalues) == row0length
for column_index in column_index_range:
column_totals[column_index] += fvalues[column_index]
print_counter = 4
for column_index in column_index_range:
column_average = column_totals[column_index] / row_count
print column_average
if print_counter == 4:
file_write.write(str(column_average)+'\n')
print_counter = 0
print_counter +=1