I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains some python code which names a sequence of lists. They look something like this:
#PHASE = 0
x = np.array(1,2,...)
y = np.array(3,4,...)
z = np.array(5,6,...)
#PHASE = 30
x = np.array(1,4,...)
y = np.array(2,5,...)
z = np.array(3,6,...)
#PHASE = 40
...
And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into its own file which can then be read by ascii.read() as a Table object for manipulation in a different section of code.
My current method is extremely inefficient, both in terms of resources and time/energy required to assemble. It goes something like this: Start with a function
def makeTable(a, b, c):
    output = Table()
    output['x'] = a
    output['y'] = b
    output['z'] = c
    return output
Then for each phase, I have manually copy-pasted the relevant part of the text file into a cell and appended a line of code
fileName_phase = makeTable(a,b,c)
Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.
Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.
This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?
If efficiency and code reuse instead of copy-pasting is the goal, I think that classes might provide a good way. I'm going to sleep now, but I'll edit later. Here are my thoughts: create a class called FileWithArrays and use a parser to read the lines and store them inside the FileWithArrays object you create from the class. Once that's done, you can add a method that transforms the object into a table.
P.S. A good idea for the parser is to store all the lines in a list and parse them one by one, using list.pop() to shrink the list as you go. Hope it helps; tomorrow I'll look into it more if this doesn't help a lot. Try to rewrite/reformat the question if I misunderstood anything, it's not very easy to read.
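To make the idea a bit more concrete, here is a rough, untested sketch of what such a class could look like (assuming astropy's Table and the #PHASE header format shown in the question):
import re
import numpy as np
from astropy.table import Table

class FileWithArrays:
    def __init__(self, filename):
        with open(filename) as f:
            content = f.read()
        # one chunk of "x = ...", "y = ...", "z = ..." lines per phase
        self.phases = [p for p in re.split(r"#PHASE = .*\n", content) if p.strip()]

    def table_for_phase(self, index):
        table = Table()
        for line in self.phases[index].strip().splitlines():
            name, _, expr = line.partition('=')
            # keep only what sits between the parentheses, e.g. "1,2,3"
            values = re.search(r"\((.*)\)", expr).group(1)
            table[name.strip()] = np.array([float(v) for v in values.strip('[]').split(',')])
        return table
Something like FileWithArrays("some_file.txt").table_for_phase(0) would then give the first phase as a Table, ready for ascii.write.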
I will suggest a way which will be scorned by many but will get your work done.
So apologies to everyone.
The prerequisite for this method is that you absolutely trust the correctness of the input files. Which I guess you do. (After all, he is your collaborator.)
So the key point here is that the text in the file is code which means it can be executed.
So you can do something like this
import re
import numpy as np # this is for the actual code in the files. You might have to install numpy library for this to work.
file = open("xyz.txt")
content = file.read()
Now that you have all the content, you have to separate it by phase.
For this we will use the re.split function.
phase_data = re.split("#PHASE = .*\n", content)
Now we have the content of each phase in an array.
Now comes for the part of executing it.
for phase in phase_data:
    if len(phase.strip()) == 0:
        continue
    exec(phase)
    table = makeTable(x, y, z)  # the x, y and z are defined by the exec
    # do whatever you want with the table
I will reiterate that you have to absolutely trust the contents of the file, since you are executing it as code.
But your work seems like a scripting one and I believe this will get your work done.
PS: The other, "safer" alternative to exec is to use a sandboxing library which takes the string and executes it without affecting the parent scope.
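Putting the pieces together for all the files, a rough, untested sketch might look like the following (the glob pattern, output file names and the use of astropy's ascii.write are assumptions; makeTable is the function from the question):
import re
import glob
import os
import numpy as np
from astropy.io import ascii

for path in glob.glob("*.txt"):                       # assumed naming of the 12 input files
    with open(path) as f:
        chunks = re.split(r"#PHASE = .*\n", f.read())
    phases = [c for c in chunks if c.strip()]         # drop the empty piece before the first header
    for i, phase in enumerate(phases):
        ns = {"np": np}                               # run the collaborator's lines in their own namespace
        exec(phase, ns)
        table = makeTable(ns["x"], ns["y"], ns["z"])  # makeTable as defined in the question
        out = "{}_phase{}.dat".format(os.path.splitext(path)[0], i)
        ascii.write(table, out, overwrite=True)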
To avoid the safety issue of using exec as suggested by @Ajay Brahmakshatriya, but keeping his first processing step, you can create your own minimal 'phase parser', something like:
VARS = 'xyz'

def makeTable(phase):
    assert len(phase) >= 3
    output = Table()
    for i in range(3):
        line = [s.strip() for s in phase[i].split('=')]
        assert len(line) == 2
        var, arr = line
        assert var == VARS[i]
        assert arr[:10] == 'np.array([' and arr[-2:] == '])'
        output[var] = np.fromstring(arr[10:-2], sep=',')
    return output
and then call
table = makeTable(phase)
instead of
exec(phase)
table = makeTable(x, y, z)
You could also skip all these assert statements without compromising safety; if the file is corrupted or not formatted as expected, the error that will be thrown might just be harder to understand.
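A short usage sketch (untested), reusing the re.split step from the other answer to feed makeTable one phase at a time; the imports and the ascii.write call are assumptions based on the question:
import re
import numpy as np
from astropy.table import Table
from astropy.io import ascii

with open("xyz.txt") as f:
    content = f.read()

for i, chunk in enumerate(c for c in re.split(r"#PHASE = .*\n", content) if c.strip()):
    phase = chunk.strip().splitlines()        # the "x = ...", "y = ...", "z = ..." lines
    table = makeTable(phase)
    ascii.write(table, "xyz_phase{}.dat".format(i), overwrite=True)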
I've found some pieces of answers here and there but I can't figure out the exact way to build what I want. Thank you in advance if you can help.
I have multiple text files, all built the same way but with different information in each one of them. I'd like to loop over each file and return the info in it line by line. On the other hand, I have some booleans which define whether a specific line in the file has to be skipped or not. For example: "if boolean1 is true and lineInTheCorrespondingFile = 40, then skip that line; else, read it but skip lines 36 and 37 instead".
The thing is, I don't know how to write the function so that it knows which file is opened, which line is being read, and whether that line has to be skipped or not. I also need each line to be returned independently at the end of the function.
And here is my code so far:
def locatorsDatas (self):
    preset = cmds.optionMenu ("presetMenu", q = 1, v = 1)
    rawFile = presetsDir + preset.lower() + ".txt"
    with open(rawFile) as file:
        file.seek (0)
        for lineNum, line in enumerate(file, start = 1):
            if lineNum > 8 : # Skip header
                locator = eval (line)
                locName = locator[0]
                xVal = locator[1]
                yVal = locator[2]
                zVal = locator[3]
                locScale = locator[4]
                locColor = locator[5]
                if locator == "":
                    break
    return (locName, xVal, yVal, zVal, locScale, locColor)
I don't know what values I should pass into the function to make it skip the lines I want, knowing that I can't write it directly into it since each file doesn't break at the same lines.
Oh, and it only returns one line of the file instead of each line separately.
Hope it's clear and you can help me, thanks again.
I see a number of issues with your code.
To start with, you're always returning the data from line 8 and never any other data. If you have many values you want to extract from the file, you might want to make your function a generator by using a yield statement rather than a return. Then the calling code can access the data with a for loop or pass the generator to list or another function that accepts any iterable.
def locatorsDatas(self):
    # ...
    for lineNum, line in enumerate(file, start=1):
        # ...
        yield results
If you can't use a generator but need your function to return successive lines, you'll need to save the file iterator (or perhaps the enumerate iterator wrapped around it) somewhere outside the function's scope. That means you won't need to reopen the file every time the function is called. You could do something like:
def __init__(self):
    preset = cmds.optionMenu ("presetMenu", q = 1, v = 1)
    rawFile = presetsDir + preset.lower() + ".txt"
    self.preset_file_enumerator = enumerate(open(rawFile))  # save the iterator on self

def locatorsDatas(self):
    try:
        lineNum, line = next(self.preset_file_enumerator)  # get a value from the iterator
        # do your processing
        return results
    except StopIteration:
        # do whatever is appropriate when there's no more data in the file here
        raise ValueError("no more data")  # such as raising an exception
The next issue I see is how you're processing each line to get the separate pieces of data. You're using eval, which is a very bad idea if the data you're processing is even slightly untrusted. That's because eval interprets its argument as Python code. It can do anything, including deleting files from your hard drive! A safer version is available as ast.literal_eval, which only allows the string to contain Python literals (including lists, dictionaries and sets, but not variable lookups, function calls or other more complicated code).
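For example, swapping in ast.literal_eval is a one-line change (assuming each line really is a plain Python literal such as a list or tuple):
from ast import literal_eval

locator = literal_eval(line)  # parses literals only; never executes code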
You also have an error check that I don't think will do what you intend. The if locator == "" test is probably placed too late to avoid errors from the earlier lines extracting data from the eval'd line. And the break statement you run will cause the function to exit without returning anything more. If you just want to skip blank lines, you should put the check at the top of the loop and use continue rather than break.
Now finally we can get to the issue you're asking about in the title of the question. If you want to skip certain lines based on various flags, you just need to check those flags as you're looping and do a continue to skip past the lines you don't want to read. I don't entirely understand what you were asking about regarding how the flags are passed, but assuming you can give them as arguments, here's a sketch of how the code could look:
def locatorsDatas(self, skip_40=False, skip_50=True):
    # open file, ...
    for lineNum, line in enumerate(file, start=1):
        if (not line or
                lineNum < 8 or
                skip_40 and lineNum == 40 or
                skip_50 and lineNum == 50):
            continue
        # parse the line
        yield result
Obviously you should use your own flag names and logic rather than the ones I've made up for my example code. If your logic is more complicated, you might prefer using separate if statements for each flag, rather than packing them all into one long conditional like I did.
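Calling code could then just loop over the generator, for example (tool here is only a placeholder name for whatever instance holds the method):
for locName, xVal, yVal, zVal, locScale, locColor in tool.locatorsDatas(skip_40=True):
    print("%s at (%s, %s, %s)" % (locName, xVal, yVal, zVal))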
Put briefly: I need to check whether 3 conditions are verified, and if not, execute something depending on which condition failed. I know I can iterate through the 3 conditions with multiple if/else statements, but I was wondering if there is a simpler and more concise way to do it.
In a more generic way:
if condition1 and condition2 and condition3: pass
else: print which condition has failed
For an applied case:
if file_exist("1.txt") and file_exist("2.txt") and file_exist("3.txt"):
    pass
else:
    # find which condition has failed
    # for the file of the failed condition
    create_file(...)
I am not looking to solve the example above! My question is about a way of finding which condition is not verified in a series of conditions on a single if/else statement!
Regards
I know I can iterate through the 3 files with multiple if/else statements
Every time you notice a repetition like this in a programming problem, it's a pretty good sign you can use a loop:
for filename in ("1.txt", "2.txt", "3.txt"):
    if not file_exist(filename):
        create_file(...)
You could also use a list comprehension:
[create_file(filename) for filename in ("1.txt", "2.txt", "3.txt") if not file_exist(filename)]
This is closer to the way you read it in english, but some people will frown upon it, because you're using a list comprehension to cause side-effects, instead of actually creating a list.
No, it's not possible. The if statement has only one condition, in this case that condition is condition1 and condition2 and condition3. It doesn't "remember" the results of sub-expressions in that expression, so unless they have side-effects you're out of luck.
Also be aware that if condition1 is false then it doesn't evaluate condition2 at all. So if you wanted to know which conditions (plural) failed, then and would be entirely the wrong tool for the job. You could instead do something like:
results = (condition1, condition2, condition3)
if all(results):
    pass
else:
    # look at the individual values
In practice, though, if you're going to "do something" for each false value when you look at the individual values, then you don't need to special-case them all being true; just loop over the values and do nothing for the ones that are true.
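For instance, a small sketch of that "look at the individual values" branch could be:
results = (condition1, condition2, condition3)
failed = [i for i, ok in enumerate(results, start=1) if not ok]
if failed:
    print("conditions that failed: %s" % failed)   # e.g. [2, 3]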
I suppose that just to prove a point, you could do something peculiar to record the first failure:
def countem(result):
    if result:
        countem.count += 1
    return result

countem.count = 0

if countem(condition1) and countem(condition2) and countem(condition3):
    pass
else:
    print countem.count
Or get rid of the if to be a tiny bit more concise:
import itertools

conditions = (lambda: condition1, lambda: condition2, lambda: condition3)
first_failed = sum(1 for _ in itertools.takewhile(lambda f: f(), conditions))
Of course this is not sensible code for your example, but as far as it goes, it handles the general case.
Instead of just using an if/else solution:
for file_name in ('1.txt', '2.txt', '3.txt'):
    try:
        with open(file_name):  # default mode 'r' to read the file
            pass  # do something... or not
    except IOError:
        # modes can be 'w', 'a', 'w+', 'a+' for writing, appending,
        # write/read, append/read respectively. There are others...
        with open(file_name, 'w') as f:
            pass  # something to do...
There is also:
import os.path

file_path = '/this path if all files/ have the same path.../'
for file_name in ('1.txt', '2.txt', '3.txt'):
    # note: os.path.exists will also return True for directories
    if os.path.exists(os.path.join(file_path, file_name)):
        continue
    else:
        pass  # create the file
I was just going to say that it looks like a for loop checking certain conditions.
Try putting in print statements to see whether each condition is being met... simple but effective.
I have a number of files where I want to replace all instances of a specific string with another one.
I currently have this code:
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# Open file for substitution
replaceFile = open('file', 'r+')
# read in all the lines
lines = replaceFile.readlines()
# seek to the start of the file and truncate
# (this is because I want to do an "in-place" replace)
replaceFile.seek(0)
replaceFile.truncate()
# Loop through each line from file
for line in lines:
    # Loop through each key in the mappings dict
    for i in mappings.keys():
        # if the key appears in the line
        if i in line:
            # do the replacement
            line = line.replace(i, mappings[i])
    # Write the line to the file and move to the next line
    replaceFile.write(line)
This works ok, but it is very slow for the size of the mappings and the size of the files I am dealing with.
For instance, in the "mappings" dict there are 60728 key value pairs.
I need to process up to 50 files and replace all instances of "key" with the corresponding value, and each of the 50 files is approximately 250000 lines.
There are also multiple instances where several keys need to be replaced on the one line, hence I can't just find the first match and then move on.
So my question is:
Is there a faster way to do the above?
I have thought about using a regex, but I am not sure how to craft one that will do multiple in-line replaces using key/value pairs from a dict.
If you need more info, let me know.
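For reference, one common way to craft such a regex is to join the escaped keys into a single alternation and give re.sub a callback; a rough, untested sketch:
import re

# build one alternation out of all the keys, longest first, so overlapping
# keys match the longest candidate (Python's alternation is leftmost-first)
pattern = re.compile("|".join(re.escape(k) for k in sorted(mappings, key=len, reverse=True)))

def replace_all(text):
    # each match is looked up in the dict and swapped in a single pass
    return pattern.sub(lambda m: mappings[m.group(0)], text)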
If this performance is slow, you'll have to find something fancy. It's just about all running at C-level:
for filename in filenames:
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)
Note that you can run multiple processes where each process tackles a portion of the total list of files. That should make the whole job a lot faster. Nothing fancy, just run multiple instances off the shell, each with a different file list.
Apparently str.replace is faster than regex.sub.
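If launching separate shell instances is awkward, roughly the same effect can be had from within Python; a minimal sketch with multiprocessing.Pool (replace_in_file is a made-up helper name, and mappings/filenames are assumed to be defined at module level as above):
import multiprocessing as mp

def replace_in_file(filename):
    # same per-file logic as the loop above, run by one worker per file
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)

if __name__ == '__main__':
    pool = mp.Pool()                      # defaults to one worker per CPU
    pool.map(replace_in_file, filenames)  # one file per task
    pool.close()
    pool.join()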
So I got to thinking about this a bit more: suppose you have a really huge mappings dict, so much so that the likelihood of any one key in mappings being detected in your files is very low. In this scenario, all the time will be spent doing the searching (as pointed out by @abarnert).
Before resorting to exotic algorithms, it seems plausible that multiprocessing could at least be used to do the searching in parallel, and thereafter do the replacements in one process (you can't do replacements in multiple processes for obvious reasons: how would you combine the result?).
So I decided to finally get a basic understanding of multiprocessing, and the code below looks like it could plausibly work:
import multiprocessing as mp

def split_seq(seq, num_pieces):
    # Splits a list into pieces
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

def detect_active_keys(keys, data, queue):
    # This function MUST be at the top level, or
    # it can't be pickled (multiprocessing uses pickling)
    queue.put([k for k in keys if k in data])

def mass_replace(data, mappings):
    manager = mp.Manager()
    queue = mp.Queue()

    # Data will be SHARED (not duplicated for each process)
    d = manager.list(data)

    # Split the MAPPINGS KEYS up into multiple LISTS,
    # same number as CPUs
    key_batches = split_seq(mappings.keys(), mp.cpu_count())

    # Start the key detections
    processes = []
    for i, keys in enumerate(key_batches):
        p = mp.Process(target=detect_active_keys, args=(keys, d, queue))
        # This is non-blocking
        p.start()
        processes.append(p)

    # Consume the output from the queues
    active_keys = []
    for p in processes:
        # Each process puts exactly one list of keys on the queue
        # (this call blocks)
        active_keys.extend(queue.get())

    # Wait for the processes to finish
    for p in processes:
        # Note that you MUST only call join() after
        # calling queue.get()
        p.join()

    # Same as the original submission, now with MUCH fewer keys
    for key in active_keys:
        data = data.replace(key, mappings[key])

    return data

if __name__ == '__main__':
    # You MUST call the mass_replace function from
    # here, due to how multiprocessing works
    filenames = <...obtain filenames...>
    mappings = <...obtain mappings...>
    for filename in filenames:
        with open(filename, 'r+') as f:
            data = mass_replace(f.read(), mappings)
            f.seek(0)
            f.truncate()
            f.write(data)
Some notes:
I have not executed this code yet! I hope to test it out sometime but it takes time to create the test files and so on. Please consider it as somewhere between pseudocode and valid python. It should not be difficult to get it to run.
Conceivably, it should be pretty easy to use multiple physical machines, i.e. a cluster with the same code. The docs for multiprocessing show how to work with machines on a network.
This code is still pretty simple. I would love to know whether it improves your speed at all.
There seem to be a lot of hackish caveats with using multiprocessing, which I tried to point out in the comments. Since I haven't been able to test the code yet, it may be the case that I haven't used multiprocessing correctly anyway.
According to http://pravin.paratey.com/posts/super-quick-find-replace, regex is the fastest way to go for Python. (Building a Trie data structure would be fastest for C++) :
import re

class Regex:
    # Regex implementation of find/replace for a massive word list.
    def __init__(self, mappings):
        self._mappings = mappings

    def replace_func(self, matchObj):
        key = matchObj.group(0)
        if self._mappings.has_key(key):
            return self._mappings[key]
        else:
            return key

    def replace_all(self, filename):
        with open(filename, 'r+') as fp:
            text = fp.read()
        text = re.sub("[a-zA-Z]+", self.replace_func, text)
        with open(filename, "w") as fp:
            fp.write(text)

# mapping dictionary of (find, replace) pairs
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# initialize the Regex class with the mapping dictionary
r = Regex(mappings)

# replace file
r.replace_all('file')
The slow part of this is the searching, not the replacing. (Even if I'm wrong, you can easily speed up the replacing part by first searching for all the indices, then splitting and replacing from the end; it's only the searching part that needs to be clever.)
Any naive mass string search algorithm is obviously going to be O(NM) for an N-length string and M substrings (and maybe even worse, if the substrings are long enough to matter). An algorithm that searched M times at each position, instead of M times over the whole string, might offer some cache/paging benefits, but it'll be a lot more complicated for probably only a small benefit.
So, you're not going to do much better than cjrh's implementation if you stick with a naive algorithm. (You could try compiling it as Cython or running it in PyPy to see if it helps, but I doubt it'll help much—as he explains, all the inner loops are already in C.)
The way to speed it up is to somehow look for many substrings at a time. The standard way to do that is to build a prefix tree (or suffix tree), so that, e.g., "original-1" and "original-2" are both branches off the same subtree "original-" and don't need to be handled separately until the very last character.
The standard implementation of a prefix tree is a trie. However, as Efficient String Matching: An Aid to Bibliographic Search and the Wikipedia article Aho-Corasick string matching algorithm explain, you can optimize further for this use case by using a custom data structure with extra links for fallbacks. (IIRC, this improves the average case by logM.)
Aho and Corasick further optimize things by compiling a finite state machine out of the fallback trie, which isn't appropriate to every problem, but sounds like it would be for yours. (You're reusing the same mappings dict 50 times.)
There are a number of variant algorithms with additional benefits, so it might be worth a bit of further research. (Common use cases are things like virus scanners and package filters, which might help your search.) But I think Aho-Corasick, or even just a plain trie, is probably good enough.
Building any of these structures in pure Python might add so much overhead that, at M~60000, the extra cost will defeat the M/logM algorithmic improvement. But fortunately, you don't have to. There are many C-optimized trie implementations, and at least one Aho-Corasick implementation, on PyPI. It also might be worth looking at something like SuffixTree instead of using a generic trie library upside-down if you think suffix matching will work better with your data.
Unfortunately, without your data set, it's hard for anyone else to do a useful performance test. If you want, I can write test code that uses a few different modules, which you can then run against your data. But here's a simple example using ahocorasick for the search and a dumb replace-from-the-end implementation for the replace:
import ahocorasick

tree = ahocorasick.KeywordTree()
for key in mappings:
    tree.add(key)
tree.make()

for start, end in reversed(list(tree.findall(target))):
    target = target[:start] + mappings[target[start:end]] + target[end:]
This uses a with block to prevent leaking file descriptors. The string replace calls will ensure all instances of each key get replaced within the text.
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open the file for substitution
with open('file', 'r+') as fd:
    # read in all the data
    text = fd.read()
    # seek to the start of the file and truncate so the file will be edited in place
    fd.seek(0)
    fd.truncate()
    for key in mappings.keys():
        text = text.replace(key, mappings[key])
    fd.write(text)
This is my text file:
TestCases-2
Input-5
Output-1,1,2,3,5
Input-7
Ouput-1,1,2,3,5,8,13
What I want is this:
A variable test_no = 2 (No. of testcases)
A list testCaseInput = [5,7]
A list testCaseOutput = [[1,1,2,3,5],[1,1,2,3,5,8,13]]
I've tried doing it in this way:
testInput = testCase.readline(-10)
for i in range(0, int(testInput)):
    testCaseInput = testCase.readline(-6)
    testCaseOutput = testCase.readline(-7)
The next step would be to strip the numbers on the basis of (','), and then put them in a list.
Weirdly, the readline(-6) is not giving desired results.
Is there a better way to do this, which I'm obviously missing out on?
I don't mind using serialization here, but I want to make it very simple for someone to write a text file like the one I have shown and then take the data out of it. How can I do that?
A negative argument to the readline method specifies the number of bytes to read. I don't think this is what you want to be doing.
Instead, it is simpler to pull everything into a list all at once with readlines():
with open('data.txt') as f:
    full_lines = f.readlines()

# parse the full lines to get the text to the right of "-"
lines = [line.partition('-')[2].rstrip() for line in full_lines]

numcases = int(lines[0])
for i in range(1, len(lines), 2):
    caseinput = lines[i]
    caseoutput = lines[i+1]
    ...
The idea here is to separate concerns (the source of the data, the parsing of '-', and the business logic of what to do with the cases). That is better than having a readline() and redundant parsing logic at every step.
I'm not sure if I follow exactly what you're trying to do, but I guess I'd try something like this:
testCaseIn = []
testCaseOut = []

for line in testInput:
    if line.startswith("Input"):
        testCaseIn.append(giveMeAList(line.split("-")[1]))
    elif line.startswith("Output"):
        testCaseOut.append(giveMeAList(line.split("-")[1]))
where giveMeAList() is a function that takes a comma-separated string of numbers and turns it into a list.
I didn't test this code, but I've written stuff that uses this kind of structure when I've wanted to make configuration files in the past.
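The giveMeAList() helper is left undefined above; a minimal version might be something like:
def giveMeAList(csv_text):
    # "1,1,2,3,5" -> [1, 1, 2, 3, 5]
    return [int(n) for n in csv_text.split(",")]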
You can use regex for this and it makes it much easier. See question: python: multiline regular expression
For your case, try this:
import re
s = open("input.txt","r").read()
(inputs,outputs) = zip(*re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s))
and then split(",") each output element as required
If you do it this way you get the benefit that you don't need the first line in your input file so you don't need to specify how many entries you have in advance.
You can also take away the unzip (that's the zip(*...) ) from the code above, and then you can deal with each input and output a pair at a time. My guess is that is in fact exactly what you are trying to do.
EDIT Wanted to give you the full example of what I meant just then. I'm assuming this is for a testing script so I would say use the power of the pattern matching iterator to help keep your code shorter and simpler:
for (input, output) in re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n", s):
    expectedResults = output.split(",")
    testResults = runTest(input)
    # compare testResults and expectedResults ...
This line in your file has an error:
Ouput-1,1,2,3,5,8,13
It should be 'Output', not 'Ouput'.
This should work:
testCase = open('in.txt', 'r')
testInput = int(testCase.readline().replace("TestCases-", ""))
for i in range(0, testInput):
    testCaseInput = testCase.readline().replace("Input-", "").strip()
    testCaseOutput = testCase.readline().replace("Output-", "").strip().split(",")
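If you want to end up with exactly the variables described in the question (test_no, testCaseInput, testCaseOutput), the same idea can accumulate into lists; a small untested extension, assuming the 'Ouput' typo in the file has been fixed:
testCaseInput = []
testCaseOutput = []
with open('in.txt') as testCase:
    test_no = int(testCase.readline().replace("TestCases-", ""))
    for _ in range(test_no):
        testCaseInput.append(int(testCase.readline().replace("Input-", "")))
        testCaseOutput.append([int(n) for n in
                               testCase.readline().replace("Output-", "").split(",")])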