I'm writing a script that logs errors from another program and restarts the program where it left off when it encounters an error. For whatever reasons, the developers of this program didn't feel it necessary to put this functionality into their program by default.
Anyways, the program takes an input file, parses it, and creates an output file. The input file is in a specific format:
UI - 26474845
TI - the title (can be any number of lines)
AB - the abstract (can also be any number of lines)
When the program throws an error, it gives you the reference information you need to track the error - namely, the UI, which section (title or abstract), and the line number relative to the beginning of the title or abstract. I want to log the offending sentences from the input file with a function that takes the reference number and the file, finds the sentence, and logs it. The best way I could think of doing it involves moving forward through the file a specific number of times (namely, n times, where n is the line number relative to the beginning of the section). The way that seemed to make sense to do this is:
i = 1
while i <= lineNumber:
    print original.readline()
    i += 1
I don't see how this would make me lose data, but Python thinks it would, and says ValueError: Mixing iteration and read methods would lose data. Does anyone know how to do this properly?
You get the ValueError because your code probably has for line in original: in addition to original.readline(). An easy solution which fixes the problem without making your program slower or consume more memory is changing
for line in original:
    ...
to
while True:
    line = original.readline()
    if not line: break
    ...
Use for and enumerate.
Example:
for line_num, line in enumerate(file):
    if line_num < cut_off:
        print line
NOTE: This assumes you are already cleaning up your file handles, etc.
Also, the takewhile function could prove useful if you prefer a more functional flavor.
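A minimal sketch of the takewhile idea, assuming the goal is simply "the first cut_off lines". The helper name first_lines is made up for illustration. Note that takewhile has to read one extra line to discover that the condition has become false, so that line is consumed from the underlying iterator.

```python
from itertools import takewhile

def first_lines(lines, cut_off):
    """Return the first `cut_off` items of `lines` (hypothetical helper)."""
    # enumerate() pairs each line with its 0-based index;
    # takewhile() stops at the first pair whose index reaches cut_off.
    numbered = enumerate(lines)
    return [line for _, line in takewhile(lambda pair: pair[0] < cut_off, numbered)]

print(first_lines(["a\n", "b\n", "c\n"], 2))
```

The same function accepts an open file object in place of the list.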
Assuming you need only one line, this could be of help
import itertools

def getline(fobj, line_no):
    "Return a (1-based) line from a file object"
    # next() works on Python 2.6+ and 3; it raises StopIteration if the file is too short
    return next(itertools.islice(fobj, line_no - 1, line_no)) # 1-based!

>>> getline(open("/etc/passwd", "r"), 4)
'adm:x:3:4:adm:/var/adm:/bin/false\n'
You might want to catch StopIteration errors (if the file has fewer lines).
Here's a version without the ugly while True pattern and without other modules:
for line in iter(original.readline, ''):
    if …: # to the beginning of the title or abstract
        for i in range(lineNumber):
            print original.readline(),
        break
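Putting the pieces together, the lookup described in the question could be sketched as below. This is only an illustration under stated assumptions: the helper name find_section_line is made up, records are assumed to start with "UI - ", sections with "TI - " or "AB - " as in the sample format, and line numbers are assumed to be 1-based with the section's header line counting as line 1. It uses a single iterator throughout, so iteration and readline() are never mixed. It does not check whether the requested line runs past the end of the section.

```python
import itertools

def find_section_line(lines, ui, section, line_number):
    """Return line `line_number` (1-based) of `section` in record `ui`, or None."""
    it = iter(lines)
    # Step 1: skip ahead to the requested record.
    for line in it:
        if line.startswith("UI - ") and line[5:].strip() == ui:
            break
    else:
        return None  # record not found
    # Step 2: skip ahead to the section header inside that record.
    for line in it:
        if line.startswith("UI - "):
            return None  # reached the next record without seeing the section
        if line.startswith(section + " - "):
            break
    else:
        return None  # section not found
    # Step 3: the header itself is line 1; otherwise advance the same iterator.
    if line_number == 1:
        return line
    return next(itertools.islice(it, line_number - 2, line_number - 1), None)

sample = ["UI - 1\n", "TI - t1\n", "t2\n", "AB - a1\n", "a2\n", "a3\n"]
print(find_section_line(sample, "1", "AB", 3))
```

An open file object can be passed as `lines` instead of the list.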
I'm currently working through an introductory Python book called "Think Python". In one of the exercises, I'm supposed to write a program that takes a string of characters and counts how many words in a file called "words.txt" (http://greenteapress.com/thinkpython2/code/words.txt) do not have letters from that string of characters.
My code is here:
fin = open('words.txt')

def avoids(word,forbidden):
    avoided=True
    for i in forbidden:
        if i in word:
            avoided=False
            break #break out of for loop
    if avoided==True:
        return avoided

def number_avoids(forbidden):
    """Finds number of words excluded by character"""
    avoided=0
    for line in fin:
        if avoids(line,forbidden):
            avoided+=1
    return avoided

print(number_avoids("a"))
print(number_avoids("a"))
What I'm confused about though, is why I got two different answers for the same code. For the first print(number_avoids("a")), the result was 57196. For the second one, the program printed out 0. Could someone explain to me why the same code will give out two different answers?
Thanks.
Problem
When you open a file, there's a cursor that points to the current position in the file. On the first function call, this cursor is at the start of the file, so the loop reads all the contents and your program works as expected.
But when you call the function a second time, the cursor is already at the end of the file, so there is nothing left to read. You can verify this by adding a print(line) statement inside the loop of the number_avoids function.
Solution
There's a builtin function to move the file cursor. You can use it to move your cursor to initial position:
...
print(number_avoids("a"))
fin.seek(0)
print(number_avoids("a"))
It will move your cursor to the start of file. So, all of the file contents will be read and evaluated again.
Note: I have tried to make this answer as basic as I can so that it can be understood by anyone without the knowledge of file handling. Feel free to ask for any clarifications in comments.
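An alternative to rewinding with seek(0) is to open the file inside the function, so that every call starts with a fresh cursor. The sketch below is one way to do that; the name count_avoiding and the temporary word list are made up for the demonstration and are not part of the exercise.

```python
import os
import tempfile

def count_avoiding(path, forbidden):
    """Count the words in the file at `path` that contain none of the `forbidden` letters."""
    # Opening the file here gives every call its own cursor at position 0,
    # and the with block closes the file again when the count is done.
    with open(path) as fin:
        return sum(
            1 for line in fin
            if not any(letter in line for letter in forbidden)
        )

# Demonstration on a small temporary word list (made-up contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("apple\nberry\ncherry\nbanana\n")

first = count_avoiding(tmp.name, "a")
second = count_avoiding(tmp.name, "a")
os.remove(tmp.name)
print(first, second)  # both calls give the same count
```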
I'm very new to programming and am working on some code to extract data from a bunch of text files. I've been able to do this however the data is not useful to me in Excel. Therefore, I would like to print it all on a single line and separate it by a special character, which I can then delimit in Excel.
Here is my code:
import os
data=['Find me','find you', 'find us']
with open('C:\\Users\\Documents\\File.txt', 'r') as inF:
    for line in inF:
        for a in data:
            string=a
            if string in line:
                print (line,end='*') #print on same line
inF.close()
So basically what I'm doing is finding if a keyword is on that line and then printing that line if it is.
Even though I have print(,end='*'), I don't get the print on a single line. It outputs:
Find me
*find you
*find us
Where is the problem? (I'm using Python 3.5.1)
Your immediate problem is that you're not removing the newline characters from your lines before printing them. The usual way to do this is with strip(), eg:
print(line.strip(), end='*')
You'll also print multiple copies of the line if more than one of your special phrases appear in the line. To avoid that, add a break statement after your print, or (better, but a more advanced construct that might not make sense until you're used to generator expressions) use if any(keyword in line for keyword in data):
You also don't need to explicitly close the input file - the point of the with open(...) as ...: context manager is that it closes the file when exiting it.
And I would avoid using string as a variable name - it doesn't tell anyone anything about what the variable is used for, and it can cause confusion if you end up using the built-in string module for anything. It's not as bad as shadowing a built-in constructor like list, but it's worth avoiding. Especially since it does nothing for you here, you can just use if a in line: here if you don't want to use the any() version above.
In addition to all that, if your data is not extremely large (and I hope it's not if you're trying to fit it all on one line) you'll get tidier code and avoid the trailing delimiter by using the .join() method on strings, eg something like:
data=['Find me','find you', 'find us']
with open('C:\\Users\\Documents\\File.txt', 'r') as inF:
    print("*".join(line.strip() for line in inF if any(keyword in line for keyword in data)))
It is very similar to this:
How to tell if a string contains valid Python code
The only difference being instead of the entire program being given altogether, I am interested in a single line of code at a time.
Formally, we say a line of python is "syntactically valid" if there exists any syntactically valid python program that uses that particular line.
For instance, I would like to identify these as syntactically valid lines:
for i in range(10):
x = 1
Because one can use these lines in some syntactically valid python programs.
I would like to identify these lines as syntactically invalid lines:
for j in range(10 in range(10(
x =++-+ 1+-
Because no syntactically correct python programs could ever use these lines
The check does not need to be too strict; it just needs to be good enough to filter out obviously bogus statements (like the ones shown above). The line is given as a string, of course.
This uses codeop.compile_command to attempt to compile the code. This is the same logic that the code module uses to determine whether to ask for another line or immediately fail with a syntax error.
import codeop
def is_valid_code(line):
    try:
        codeop.compile_command(line)
    except SyntaxError:
        return False
    else:
        return True
It can be used as follows:
>>> is_valid_code('for i in range(10):')
True
>>> is_valid_code('')
True
>>> is_valid_code('x = 1')
True
>>> is_valid_code('for j in range(10 in range(10(')
True
>>> is_valid_code('x = ++-+ 1+-')
False
I'm sure at this point, you're saying "what gives? for j in range(10 in range(10( was supposed to be invalid!" The problem with this line is that 10() is technically syntactically valid, at least according to the Python interpreter. In the REPL, you get this:
>>> 10()
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    10()
TypeError: 'int' object is not callable
Notice how this is a TypeError, not a SyntaxError. ast.parse says it is valid as well, and just treats it as a call with the function being an ast.Num.
These kinds of things can't easily be caught until they actually run.
If some kind of monster managed to modify the value of the cached 10 value (which would technically be possible), you might be able to do 10(). It's still allowed by the syntax.
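To see what the parser makes of 10(), the tree returned by ast.parse can be inspected directly. This is a small illustration rather than part of the solution. On Python 3.8 and later the literal is reported as ast.Constant instead of ast.Num.

```python
import ast

tree = ast.parse("10()")
call = tree.body[0].value         # the expression wrapped by the Expr statement
print(type(call).__name__)        # the node type of the whole expression
print(type(call.func).__name__)   # the node type of the thing being "called"
```

Parsing succeeds without complaint; the TypeError only appears once the code is actually executed.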
What about the unbalanced parentheses? This fits the same bill as for i in range(10):. This line is invalid on its own, but may be the first line in a multi-line expression. For example, see the following:
>>> is_valid_code('if x ==')
False
>>> is_valid_code('if (x ==')
True
The second line is True because the expression could continue like this:
if (x ==
        3):
    print('x is 3!')
and the expression would be complete. In fact, codeop.compile_command distinguishes between these different situations by returning a code object if it's a valid self-contained line, None if the line is expected to continue for a full expression, and throwing a SyntaxError on an invalid line.
However, you can also get into a much more complicated problem than initially stated. For example, consider the line ). If it's the start of the module, or the previous line is {, then it's invalid. However, if the previous line is (1,2,, it's completely valid.
The solution given here will work if you only work forward, and append previous lines as context, which is what the code module does for an interactive session. Creating something that can always accurately identify whether a single line could possibly exist in a Python file without considering surrounding lines is going to be extremely difficult, as the Python grammar interacts with newlines in non-trivial ways. This answer responds with whether a given line could be at the beginning of a module and continue on to the next line without failing.
It would be better to identify what the purpose of recognizing single lines is and solve that problem in a different way than trying to solve this for every case.
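The three outcomes of codeop.compile_command described above (a code object, None, or an exception) can be surfaced explicitly instead of being collapsed into True and False. A minimal sketch follows; the name classify_line and the three labels are made up for illustration.

```python
import codeop

def classify_line(line):
    """Classify a line as 'complete', 'incomplete', or 'invalid' (hypothetical helper)."""
    try:
        code = codeop.compile_command(line)
    except (SyntaxError, ValueError, OverflowError):
        # compile_command is documented to raise these for bad input.
        return "invalid"
    # None means "valid so far, but more input is expected".
    return "incomplete" if code is None else "complete"

for candidate in ["x = 1", "for i in range(10):", "x = )"]:
    print(repr(candidate), classify_line(candidate))
```

Whether "incomplete" should count as acceptable depends on the purpose of the check, which is the point made in the paragraph above.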
I am just suggesting, and I'm not sure if it's going to work... But maybe something with exec and try-except?
code_line += "\n" + ("\t" if code_line[-1] == ":" else "") + "pass"
try:
    exec(code_line)
except SyntaxError:
    print("Oops! Wrong syntax...")
except Exception:
    # Any other error means the line at least parsed
    print("Syntax all right")
else:
    print("Syntax all right")
Simple lines should cause an appropriate answer
I’m familiar with text wrapping, however I was wondering if there was a way to prevent text from wrapping. If you have a long print statement, when it reaches the end of the line it automatically wraps around and starts a new line. Is there a way I can force it to print the entire statement to one line even if it doesn’t all fit? I would rather the text cut off when it reaches the end of the window rather than it wrapping to the next line.
More precisely: I’m trying to list the contents of a directory on one line, and only one line, because the next line will list the contents of a different directory. It’s only meant to give a preview of what a directory contains, so I don’t care if the program doesn’t output all the contents if they can’t all fit on one line. However, I want it to take advantage of as much horizontal space as it’s given. Rather than making the code factor in the width of the window (even if the user resizes it) and the length of each filename to determine how many filenames fit on a single line, I was curious whether it would be easier and more efficient to just cut the text off at the end of the line, especially since none of the directories are going to have more than 15 files and often have fewer (but sometimes the contents can’t all fit on one line). Here's a rough example of what I'm trying:
import os
while 1:
    wd = input("Input full path for directory: ")
    try:
        os.listdir(wd)
    except:
        print("invalid input...")
        continue
    break
list = os.listdir(wd)
print(wd, ": ", end=" ")
try:
    print(os.listdir(wd)) # THIS IS WHERE I WANT TO FORCE THE OUTPUT TO A SINGLE LINE
except:
    print()
For any given width:
print_width = 79
Just print a slice of a string of the directory contents joined by spaces:
print(' '.join(os.listdir(wd))[:print_width])
If you don't know the width but can print unicode, you can attempt to replace the spaces with non-breaking spaces, and it may work if your frame doesn't force wrapping:
print(u"\u00A0".join(os.listdir(wd)))
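If the width is not known in advance, it can be queried at print time with shutil.get_terminal_size (available since Python 3.3), which also picks up a resized window on the next call. The sketch below combines that with the slicing approach; the helper name one_line_preview is made up for illustration. Some terminals still wrap when a line fills the last column exactly, so subtracting one from the width may be needed.

```python
import shutil

def one_line_preview(names, width=None):
    """Join `names` with spaces and cut the result to `width` characters."""
    if width is None:
        # Falls back to 80 columns when the size cannot be determined.
        width = shutil.get_terminal_size().columns
    return " ".join(names)[:width]

print(one_line_preview(["a.txt", "b.txt", "notes.md"], width=10))
print(one_line_preview(["a.txt", "b.txt", "notes.md"]))
```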
I have this text file and I need certain parts of it to be inserted into a list.
The file looks like:
blah blah
.........
item: A,B,C.....AA,BB,CC....
Other: ....
....
I only need to rip out the A,B,C.....AA,BB,CC..... parts and put them into a list. That is, everything after "Item:" and before "Other:"
This can be easily done with small input, but the problem is that it may contain a large number of items and text file may be pretty huge. Would using rfind and strip be as efficient for huge input as for small input, algorithmically speaking?
What would be an efficient way to do it?
I can see no need for rfind() nor strip().
It looks like you're simply trying to do:
start = 'item: '
end = 'Other: '
should_append = False
the_list = []
with open('file') as f:
    # Iterate over the file object directly; readlines() would load the whole file
    for line in f:
        if line.startswith(start):
            data = line[len(start):]
            the_list.append(data)
            should_append = True
        elif line.startswith(end):
            should_append = False
            break
        elif should_append:
            the_list.append(line)
print(the_list)
This doesn't hold the whole file in memory, just the current line and the list of lines found between the start and the end patterns.
To answer the question about efficiency specifically, reading in the file and comparing it line by line will net O(n) average case performance.
Example by Code:
pattern = "item:"
with open("file.txt", 'r') as f:
    for line in f:
        if line.startswith(pattern):
            # You can do what you like with it; split it along whitespace or a character, then put it into a list.
            # For example, splitting on commas:
            items = line[len(pattern):].split(",")
You're searching the entire file sequentially, and you have to compare some number of elements in the file before you come across the element you're looking for.
You have the option of building a search tree instead. While it costs O(n) to build, it would cost O(log_k n) time to search (resulting in O(n) time overall, again), where k is the number of starting characters you'd have in your list.
Though I usually jump at the chance to employ regular expressions, I feel like for a single occurrence in a large file, it would be much more work and too computationally expensive to use regex. So perhaps the straightforward answer (in python) would be most appropriate:
s = 'item:'
yourlist = next(line[len(s)+1:].split(',') for line in open(r"c:\zzz.txt") if line.startswith(s))
This, of course, assumes that 'item:' doesn't exist on any other lines that are NOT followed by 'other:', but in the event 'item:' exists only once and at the start of the line, this simple generator should work for your purposes.
This problem is simple enough that it really only has two states, so you could just use a Boolean variable to keep track of what you are doing. But the general case for problems like this is to write a state machine that transitions from one state to the next until it has worked its way through the problem.
I like to use enums for states; unfortunately Python versions before 3.4 don't have a built-in enum (the enum module was added to the standard library in 3.4). So I am using a class with some class variables to store the enums.
Using the standard Python idiom for line in f (where f is the open file object) you get one line at a time from the text file. This is an efficient way to process files in Python; your initial lines, which you are skipping, are simply discarded. Then when you collect items, you just keep the ones you want.
This answer is written to assume that "item:" and "Other:" never occur on the same line. If this can ever happen, you need to write code to handle that case.
EDIT: I made the start_code and stop_code into arguments to the function, instead of hard-coding the values from the example.
import sys

class States:
    pass

States.looking_for_item = 1
States.collecting_input = 2

def get_list_from_file(fname, start_code, stop_code):
    lst = []
    state = States.looking_for_item
    with open(fname, "rt") as f:
        for line in f:
            l = line.lstrip()
            # Don't collect anything until after we find "item:"
            if state == States.looking_for_item:
                if not l.startswith(start_code):
                    # Discard input line; stay in same state
                    continue
                else:
                    # Found item! Advance state and start collecting stuff.
                    state = States.collecting_input
                    # chop out start_code
                    l = l[len(start_code):]
                    # Collect everything after "item":
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
            elif state == States.collecting_input:
                if not l.startswith(stop_code):
                    # Continue collecting input; stay in same state
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
                else:
                    # We found our terminating condition! Don't bother to
                    # update the state variable, just return lst and we
                    # are done.
                    return lst
            else:
                print("invalid state reached somehow! state: " + str(state))
                sys.exit(1)
    # End of file reached without seeing stop_code; return whatever was collected.
    return lst

lst = get_list_from_file(sys.argv[1], "item:", "Other:")
# do something with lst; for now, just print
print(lst)
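On Python 3.4 or later, the standard enum module could stand in for the hand-rolled States class. The following is a condensed sketch of the same two-state machine, not a drop-in replacement: the names State and collect_items are made up, it takes any iterable of lines rather than a filename, and unlike the code above it drops empty strings produced by trailing commas.

```python
from enum import Enum

class State(Enum):
    LOOKING_FOR_ITEM = 1
    COLLECTING_INPUT = 2

def collect_items(lines, start_code="item:", stop_code="Other:"):
    """Collect comma-separated items between start_code and stop_code."""
    items = []
    state = State.LOOKING_FOR_ITEM
    for line in lines:
        text = line.lstrip()
        if state is State.LOOKING_FOR_ITEM:
            if not text.startswith(start_code):
                continue  # discard the line; stay in the same state
            state = State.COLLECTING_INPUT
            text = text[len(start_code):]  # chop out start_code
        elif text.startswith(stop_code):
            break  # terminating condition reached
        # Split on commas, strip whitespace, skip empty pieces.
        items += [s.strip() for s in text.split(",") if s.strip()]
    return items

sample = ["blah blah\n", "item: A,B,C\n", "AA,BB\n", "Other: x\n", "Z\n"]
print(collect_items(sample))
```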
I wrote an answer that assumes that the start code and stop code must occur at the start of a line. This answer also assumes that the lines in the file are reasonably short.
You could, instead, read the file in chunks, and check to see if the start code exists in the chunk. For this simple check, you could use if code in chunk (in other words, use the Python in operator to check for a string being contained within another string).
So, read a chunk, check for start code; if not present discard the chunk. If start code present, begin collecting chunks while searching for the stop code. In a recent Python version you can concatenate the blocks one at a time with reasonable performance. (In an old version of Python you should store the chunks in a list, then use the .join() method to join the chunks together.)
Once you have built a string that holds data from the start code to the end code, you can use .find() and .rfind() to find the start code and end code, and then cut out just the data you want.
If the start code and stop code can occur more than once in the file, wrap all of the above in a loop and loop until end of file is reached.
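The chunked approach described above could be sketched roughly as follows, for a single start/stop pair. The function name extract_between and the default chunk size are assumptions made for illustration. Two details go slightly beyond the description: a short tail of each discarded chunk is kept so that a start code split across two chunks is still found, and str.find with a start offset is used so that already-scanned text is not searched again.

```python
import io

def extract_between(fobj, start_code, stop_code, chunk_size=65536):
    """Return the text between the first start_code and the next stop_code.

    Returns None if start_code never appears. If stop_code never appears,
    everything after start_code is returned.
    """
    keep = len(start_code) - 1  # tail to keep in case start_code straddles two chunks
    buf = ""
    # Phase 1: discard chunks until start_code shows up.
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return None
        buf += chunk
        pos = buf.find(start_code)
        if pos != -1:
            buf = buf[pos + len(start_code):]
            break
        buf = buf[-keep:] if keep else ""
    # Phase 2: keep collecting chunks until stop_code shows up.
    searched = 0
    while True:
        pos = buf.find(stop_code, searched)
        if pos != -1:
            return buf[:pos]
        # Only the last len(stop_code) - 1 characters can be part of a split stop_code.
        searched = max(0, len(buf) - len(stop_code) + 1)
        chunk = fobj.read(chunk_size)
        if not chunk:
            return buf
        buf += chunk

# Tiny chunk size to exercise the boundary handling; real code would use the default.
text = "blah\nitem: A,B,C\nAA,BB\nOther: x\n"
result = extract_between(io.StringIO(text), "item:", "Other:", chunk_size=3)
print(repr(result))
```

The returned string can then be split on commas to build the list.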