I have a quick and dirty build script that needs to update a couple of lines in a small xml config file. Since the file is so small, I'm using an admittedly inefficient process to update the file in place just to keep things simple:
def update_xml(property, value):
    for line in fileinput.input(os.path.join(app_dir, 'my.xml'), inplace=True):
        if property is 'version':
            line = re.sub(r'(<version>).*?(</version>)', '\1%s\2' % value, line, flags=re.IGNORECASE)
        elif property is 'brand':
            line = re.sub(r'(<property name="brand" type="string">).*?(</property>)', '\1%s\2' % value, line, flags=re.IGNORECASE)
        elif property is 'env':
            line = re.sub(r'(<property name="env" type="string">).*?(</property>)', '\1%s\2' % value, line, flags=re.IGNORECASE)
        print line
I have 2 problems:
The back references aren't capturing what I expect. Instead of getting <version>a.b.c</version>, for example, I get the version value surrounded by control characters. I've tried doubling up the backslash, removing the formatted print and a couple of other things, but can't get it quite right.
When I write the line back to the file (print line), I get several extra line breaks.
What am I borking up here?
Try replacing "\1%s\2" with "\g<1>%s\g<2>"; that might be the problem.
As for the newlines, print is probably adding a second newline on top of the one already at the end of each line.
You can try print line, (with a trailing comma) to suppress the extra newline.
Use a raw string to avoid \1 and \2 becoming control chars: r'\1%s\2'
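To make the two suggestions above concrete, here is a minimal sketch, written for Python 3. The helper name set_version is made up for illustration. It builds the replacement from raw strings so the backslashes reach re.sub intact, and it uses \g<1> rather than \1 so that a value beginning with a digit is not read as part of the group number. It assumes value itself contains no backslashes.

```python
import re

def set_version(line, value):
    """Return line with the text inside <version>...</version> replaced by value."""
    # In a plain (non-raw) string, '\1' is the control character chr(1).
    # r'\g<1>' stays unambiguous even when value starts with a digit, whereas
    # r'\1' + '1.2.3' would be parsed as a reference to group 11.
    replacement = r'\g<1>' + value + r'\g<2>'
    return re.sub(r'(<version>).*?(</version>)', replacement, line,
                  flags=re.IGNORECASE)

print(set_version('<version>0.0.1</version>\n', '1.2.3'), end='')
```

Inside a fileinput loop with inplace=True, writing the result with sys.stdout.write(line) is another way to avoid the extra newline that print adds.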
I want to compare two text files in Python, and return the lines that are different. My attempt uses difflib, but I'm open to other suggestions. I need to get the lines that are different, as well as the lines that appear in one file but not the other. Order is somewhat important, but if a good solution exists that doesn't take order into consideration, I can let go of that.
The problem is that one file has lines that have multiple trailing characters \t and \n, while the other doesn't; I don't want to consider that as a difference. For other files, the first file has only \n and the other files has \t characters at the end. The lines contain elements that are separated by tabs or spaces, so those are important; I just don't care for the trailing characters \t and \n.
My solution:
from difflib import Differ
with open(file_path) as actual:
    with open(test_file_path) as test:
        differ = Differ()
        for line in differ.compare(actual.readlines(), test.readlines()):
            if line.startswith('-'):
                log.error('EXPECTED: {}'.format(line[2:]))
            if line.startswith('+'):
                log.error('TEST FILE: {}'.format(line[2:]))
I expect the output to show EXPECTED and TEST FILE lines when there's a difference, and just EXPECTED or just TEST FILE when one contains a line the other doesn't. Right now, I'm seeing a lot of the following types of errors:
00:02:40: ERROR EXPECTED: Issuer Type OBal Net WAC OTerm WAM Age GrossCpn HighRemTerm Grp
00:02:40: ERROR TEST FILE: Issuer Type OBal Net WAC OTerm WAM Age GrossCpn HighRemTerm Grp
As you can see (if you highlight it), the first line contains a number of spaces after 'Grp' and the other line doesn't. I want to consider these two lines the same.
I've tried to explicitly specify the tabs and line breaks:
actual_file = actual.readlines()
expected_file = []
for line in actual_file:
    if line[-1] == '\n':
        expected_file.append(line.rstrip('\n').rstrip('\t') + '\n')
    else:
        expected_file.append(line.rstrip('\t'))
However, it (a) slows the process down quite a bit, and (b) is required for every file type in a different way, since some files have trailing tabs followed by line breaks, some have just line breaks, and some have nothing at all. If there's no better way, I can strip every line of every trailing tab and linebreak, but it seems like a lot of processing power (I have to run a lot of files) for something that seems fairly easy to resolve.
Take a look at string.rstrip() here: https://docs.python.org/2/library/string.html#string.rstrip
string.rstrip() should do exactly what you need by stripping whitespace off the end of a string, while leaving \t and \n characters before the end alone.
Check it out:
>>> import string
>>> s = "This \t is \t a \t line \t\t\t\n\n\n"
>>> print(s)
This is a line
>>>
>>> s = string.rstrip(s)
>>> s
'This \t is \t a \t line'
>>> print(s)
This is a line
>>>
Hope this helps!
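Applied to the diff in the question, the idea could look like the sketch below: strip trailing whitespace from every line of both files before handing them to Differ. The function name and the returned (label, text) pairs are illustrative choices, not part of the original code, and it is written for Python 3.

```python
from difflib import Differ

def diff_ignoring_trailing_whitespace(expected_lines, test_lines):
    """Return (label, text) pairs for differing lines, ignoring trailing whitespace."""
    # rstrip() with no argument removes trailing spaces, tabs and newlines,
    # but leaves tabs and spaces inside the line untouched.
    expected = [line.rstrip() for line in expected_lines]
    test = [line.rstrip() for line in test_lines]
    differences = []
    for line in Differ().compare(expected, test):
        if line.startswith('- '):
            differences.append(('EXPECTED', line[2:]))
        elif line.startswith('+ '):
            differences.append(('TEST FILE', line[2:]))
    return differences

expected = ["Issuer\tType\tGrp   \t\n", "only in expected\n"]
test = ["Issuer\tType\tGrp\n"]
print(diff_ignoring_trailing_whitespace(expected, test))
```

The stripping is a single pass over each file, so it should add little on top of the cost of the diff itself.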
I have a large textfile on my computer (location: /home/Seth/documents/bruteforce/passwords.txt) and I'm trying to find a specific string in the file. The list has one word per line and 215,000 lines/words. Does anyone know of simple Python script I can use to find a specific string?
Here's the code I have so far,
f = open("home/seth/documents/bruteforce/passwords.txt", "r")
for line in f.readlines():
    line = str(line.lower())
    print str(line)
    if str(line) == "abe":
        print "success!"
    else:
        print str(line)
I keep running the script, but it never finds the word in the file (and I know for sure the word is in the file).
Is there something wrong with my code? Is there a simpler method than the one I'm trying to use?
Your help is greatly appreciated.
Ps: I'm using Python 2.7 on a Debian Linux laptop.
I'd rather use the in keyword to look for a string in a line. Here I'm looking for the keyword 'KHANNA' in a csv file and for any such existence the code returns true.
In [121]: with open('data.csv') as f:
   .....:     print any('KHANNA' in line for line in f)
   .....:
True
It's just because you forgot to strip the new line char at the end of each line.
line = line.strip().lower()
would help.
Usually, when you read lines out of a file, they have a newline character at the end. Thus, they will technically not be equal to the same string without the newline character. You can get rid of this character by adding the line line=line.strip() before the test for equality to your target string. By default, the strip() method removes all white space (such as newlines) from the string it is called on.
What do you want to do? Just test whether the word is in the file? Here:
print 'abe' in open("passwords.txt").read().split()
Or:
print 'abe' in map(str.strip, open("passwords.txt"))
Or if it doesn't have to be Python:
egrep '^abe$' passwords.txt
EDIT: Oh, I forgot the lower. Probably because passwords are usually case sensitive. But if it really does make sense in your case:
print 'abe' in open("passwords.txt").read().lower().split()
or
print 'abe' in (line.strip().lower() for line in open("passwords.txt"))
or
print 'abe' in map(str.lower, map(str.strip, open("passwords.txt")))
Your script doesn't find the line because you didn't check for the newline characters:
Your file is made of many "lines". Each "line" ends with a character that you didn't account for - the newline character ('\n') [1]. This is the character that creates a new line - it is what gets written to the file when you hit enter. This is how the next line is created.
So, when you read the lines out of your file, the string contained in each line actually ends with a newline character. This is why your equality test fails. You should instead, test equality against the line, after it has been stripped of this newline character:
with open("home/seth/documents/bruteforce/passwords.txt") as infile:
    for line in infile:
        line = line.rstrip('\n')
        if line == "abe":
            print 'success!'
[1] Note that on some machines, the newline is in fact two characters: the carriage return (CR) and the line feed (LF). This terminology comes from the days of typewriters, which had to advance the paper by one line (line feed) and move the carriage holding the paper back to its starting position (carriage return). In a file, this pair appears as '\r\n'.
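Pulling the answers above together, a small helper could look like this sketch (Python 3; the name contains_word and the ignore_case parameter are assumptions for illustration). It strips both '\n' and '\r\n' endings before comparing, and it accepts any iterable of lines, including an open file.

```python
def contains_word(lines, word, ignore_case=False):
    """Return True if any line, minus its line ending, equals word."""
    target = word.lower() if ignore_case else word
    for line in lines:
        # Remove a trailing '\n' or '\r\n' but keep any other whitespace.
        candidate = line.rstrip('\r\n')
        if ignore_case:
            candidate = candidate.lower()
        if candidate == target:
            return True
    return False

print(contains_word(["Abe\r\n", "bob\n"], "abe", ignore_case=True))
```

With a real file this would be called as contains_word(infile, "abe") inside a with open(...) block, so the file is read one line at a time rather than loaded whole.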
What I am trying to do is go through a document line by line, find each blank line, keep traversing until I hit the next line of text, and pop that line.
So for example, what I want to do is this:
Paragraph 1
This is a line.
This is another line.

Here is a line after a space, which I want to pop!
Here is the next line, which I want to keep.

Here is another line I want to pop.
So it will go through each number of blank lines until it hits the next sentence, and pops that sentence only, then continues on. I am thinking I should use re.split('\n') , but I am not sure.
I am sorry I have no code to post but I really don't know where to start
any help would be much appreciated, thank you!
This is part of a larger program, which I've worked on for days and have figured out up to this point, so I have done the bulk of the work.
If you do for line in filehandle: it will iterate over each line. If you have a flag that is true when the previous line is blank you can skip the next line then reset the flag.
The easiest novice solution by far is probably the way Steve suggested: Just iterate the lines, and use a flag to keep track of whether the last line was a blank line.
But if you want a higher-level solution, you need to rethink the problem at a higher level. What you're actually trying to specify is the first line of every paragraph but the first, where "paragraphs" are things divided by empty lines. Right?
So, how could you do that? Well, you can split on '\n\n' just as easily as on \n. So:
paragraphs = document.split('\n\n')
first_lines = [paragraph.partition('\n')[0] for paragraph in paragraphs]
popped_lines = first_lines[1:]
(I used partition instead of split here both because it splits only at the first '\n', leaving the rest alone, and because it handles one-line paragraphs right—which paragraph.split('\n', 1) would not.)
But you don't want a list of the popped lines, you want a list of everything but the popped lines, right?
paragraphs = document.split('\n\n')
first, rest = paragraphs[0], paragraphs[1:]
rest_edited = [paragraph.partition('\n')[2] for paragraph in rest]
And if you want to turn that back into a document:
all_edited = [first] + rest_edited
document_edited = '\n\n'.join(all_edited)
You can shorten that a bit by using slice assignment, although I'm not sure it's quite as readable:
paragraphs = document.split('\n\n')
paragraphs[1:] = [paragraph.partition('\n')[2] for paragraph in paragraphs[1:]]
document_edited = '\n\n'.join(paragraphs)
As J.F. Sebastian points out, the question is a little ambiguous. Does "blank lines" mean "empty lines", or "lines with nothing but whitespace in them"? If it's the latter, things are a bit more complicated, and the easiest solution probably is a simple regex (r'\n\s*\n') for the splitting into paragraphs.
Meanwhile, if what you have is a sequence of lines (and note that a file is a sequence of lines!) rather than one big string, you can do this without split at all, in a few different ways.
For example, paragraphs are groups of non-blank lines, right? So you can use the groupby function to get them:
groups = itertools.groupby(lines, lambda line: not line)
Or, if "blank" doesn't mean "empty":
groups = itertools.groupby(lines, lambda line: not line.strip())
Note that this gives you (False, <sequence of lines>) for each paragraph, and (True, <sequence of blank lines>) for each blank line. If you want to preserve blank lines as-is, you can—but if you're happy just replacing each run of blank lines with a single empty line (which you obviously are if "blank" does mean "empty"), it's probably easier to throw away the blank paragraphs:
paragraphs = (group for (key, group) in groups if not key)
Then you can remove the first element from all but the first group, and finally chain the groups back together into one big sequence:
first = next(paragraphs)
edited_paragraphs = (itertools.islice(paragraph, 1, None) for paragraph in paragraphs)
edited_document = itertools.chain(first, *edited_paragraphs)
Finally, what if you have runs of multiple blank lines in a row? Well, first you have to decide how to deal with them. If you have two blank lines, do you remove the second? If so, do you remove the first line of the next paragraph (because it was originally after a blank line), or not (because the blank line it was after was already removed)? What if you have three in a row? Splitting on '\n\n' will do one thing, splitting on '\n\s*\n' a different thing, and groupby yet another… but until you know what you want, it's impossible to say which is "right" or how to "fix" the others, of course.
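One practical detail with groupby is that each group is a lazy iterator that stops being valid once the groupby object moves on, so the sketch below copies every group into a list before using it. It also picks one possible answer to the questions above: a run of blank lines is kept as-is, and only the first line of the paragraph that follows is dropped. The function name is made up for illustration, and "blank" is taken to mean empty or whitespace-only.

```python
import itertools

def drop_first_line_after_blanks(lines):
    """Return lines with the first line after each run of blank lines removed."""
    result = []
    after_blank = False
    for is_blank, group in itertools.groupby(lines, lambda line: not line.strip()):
        group = list(group)  # materialize before groupby advances
        if is_blank:
            result.extend(group)          # keep the blank run unchanged
            after_blank = True
        else:
            # Drop the paragraph's first line only if blank lines preceded it.
            result.extend(group[1:] if after_blank else group)
            after_blank = False
    return result

doc = ["Paragraph 1", "This is a line.", "", "pop me", "keep me"]
print(drop_first_line_after_blanks(doc))
```

Because the key uses strip(), the same function works on lines read straight from a file, where each line still carries its trailing newline.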
I assume the original poster (OP) wants to remove those lines in place, meaning removing them from the file itself. Here is a revised solution (my previous solution was off the mark; thank you, J.F. Sebastian, for telling me).
import fileinput
def remove_line_after_blank(filename, in_place_edit=False):
    previous_line = ''
    for line in fileinput.input(filename, inplace=in_place_edit):
        if not (previous_line == '\n' and line != '\n'):
            print line.rstrip()
        previous_line = line
if __name__ == '__main__':
    remove_line_after_blank('data.txt', in_place_edit=True)
Discussion
If you do not want to modify the original data file, remove , in_place_edit=True.
Use re.findall to match all occurrences in a string:
>>> text = """Paragraph 1
This is a line.
This is another line.

Here is a line after a space, which I want to pop!
Here is the next line, which I want to keep.

Here is another line I want to pop."""
>>> re.findall("\n\n+(.+)", text)
['Here is a line after a space, which I want to pop!', 'Here is another line I want to pop.']
>>> re.findall("\n\n+(.+)$", text, re.MULTILINE)
['Here is a line after a space, which I want to pop!', 'Here is another line I want to pop.']
The easiest way would be to split the text on newlines:
lines = your_string.split("\n")
That would break it up into an array (stored in lines), where each element of the array is a separate line of text. (As noted in the comments, if you have a file object already, you can just loop through that.)
Then you could go through each line of lines, checking for an empty string (which is what a blank line becomes after the split). If you find one, you could "pop" out the next one. (I don't know what you mean by pop, so I just have the code printing out the lines you want.)
print_next_line = False  # set when the previous line was blank
for line in lines:
    if print_next_line:
        print(line)
        print_next_line = False
    if line == "":
        print_next_line = True
I would like to replace every line that starts with a certain expression (example: <Output>) with what I want the output path to be. I have found and got to work a python script that replaces one string with another, in every occurrence in a file - something like:
text = open( path ).read()
if output_pattern in text:
    open( path, 'w' ).write( text.replace( pattern, replace ) )
However I would like to replace the text.replace( pattern, replace ) with something that replaces the entire line that contains pattern with replace. I have tried some things and failed miserably.
Note: I can read but not quite write python...
One of my failures did replace the pattern with the line. Actually, it replaced the entire file with only the replace pattern, as many times as it was needed... Yeah, not funny since I was doing a recursive search (and the previous attempt, to replace one string with another, worked perfectly, so I was brave and set my target directory as the root of what I want to work with)
There are other great examples that read line by line and write to an output file, and then copy the output file to the input file, but I got an error doing that.
I don't really want to use regex because the patterns that I might want to search for (and especially what I want to replace) (may) contain many special characters including backslashes, but these could be escaped if needed.
To replace lines with replace if they start with pattern:
text = open(path).read()
new_text = '\n'.join(replace if line.startswith(pattern) else line
                     for line in text.splitlines())
open(path, 'w').write(new_text)
Or optimized for memory usage, and using the with statement, which is a bit more idiomatic:
with open(input_path) as text, open(output_path, 'w') as new_text:
    new_text.write(''.join(replace if line.startswith(pattern) else line
                           for line in text))
You'll want to make sure replace has a newline character (\n) in it for the latter example to work as you'd expect.
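If you would rather not have to remember the newline, one option is to reuse whatever line ending the replaced line had. The following is only a sketch of that idea (Python 3; the function name replace_lines is hypothetical), working on a string so it is easy to try out before pointing it at real files.

```python
def replace_lines(text, pattern, replace):
    """Replace every line of text that starts with pattern by replace."""
    out = []
    for line in text.splitlines(True):  # True keeps the line endings
        if line.startswith(pattern):
            # Carry over the original ending ('\n', '\r\n' or nothing).
            ending = line[len(line.rstrip('\r\n')):]
            out.append(replace + ending)
        else:
            out.append(line)
    return ''.join(out)

sample = "<Output>old\\path\nkeep this\n<Output>other"
print(replace_lines(sample, "<Output>", "<Output>C:\\new\\path"))
```

No regular expressions are involved, so backslashes and other special characters in pattern or replace need no escaping.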
In Python, I have just read a line from a text file and I'd like to know how to write code that ignores comments with a hash # at the beginning of the line.
I think it should be something like this:
for
    if line !contain #
        then ...process line
    else end for loop
But I'm new to Python and I don't know the syntax
You can use startswith(), e.g.:
for line in open("file"):
    li = line.strip()
    if not li.startswith("#"):
        print line.rstrip()
I recommend you don't ignore the whole line when you see a # character; just ignore the rest of the line. You can do that easily with a string method function called partition:
with open("filename") as f:
    for line in f:
        line = line.partition('#')[0]
        line = line.rstrip()
        # ... do something with line ...
partition returns a tuple: everything before the partition string, the partition string, and everything after the partition string. So, by indexing with [0] we take just the part before the partition string.
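A short sketch of that behaviour, wrapped in a hypothetical helper named strip_inline_comment (Python 3). Note what happens when there is no '#': partition returns the whole string as the first element, so the line passes through unchanged apart from the rstrip.

```python
def strip_inline_comment(line):
    """Return the part of line before the first '#', without trailing whitespace."""
    before, marker, after = line.partition('#')
    # marker is '#' when a comment was found and '' otherwise;
    # either way, 'before' is the text we want to keep.
    return before.rstrip()

for sample in ["value = 1  # a note\n", "no comment here\n", "# whole line\n"]:
    print(repr(strip_inline_comment(sample)))
```

A line that is only a comment comes back as an empty string, which the caller can skip with a simple truth test.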
EDIT:
If you are using a version of Python that doesn't have partition(), here is code you could use:
with open("filename") as f:
    for line in f:
        line = line.split('#', 1)[0]
        line = line.rstrip()
        # ... do something with line ...
This splits the string on a '#' character, then keeps everything before the split. The 1 argument makes the .split() method stop after one split; since we are just grabbing the 0th substring (by indexing with [0]) you would get the same answer without the 1 argument, but this might be a little bit faster. (Simplified from my original code thanks to a comment from @gnr. My original code was messier for no good reason; thanks, @gnr.)
You could also just write your own version of partition(). Here is one called part():
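def part(s, s_part):
    i0 = s.find(s_part)
    if i0 == -1:
        # separator not found: mimic partition() and return the whole string
        return (s, '', '')
    i1 = i0 + len(s_part)
    return (s[:i0], s[i0:i1], s[i1:])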
@dalle noted that '#' can appear inside a string. It's not that easy to handle this case correctly, so I just ignored it, but I should have said something.
If your input file has simple enough rules for quoted strings, this isn't hard. It would be hard if you accepted any legal Python quoted string, because there are single-quoted, double-quoted, multiline quotes with a backslash escaping the end-of-line, triple quoted strings (using either single or double quotes), and even raw strings! The only possible way to correctly handle all that would be a complicated state machine.
But if we limit ourselves to just a simple quoted string, we can handle it with a simple state machine. We can even allow a backslash-quoted double quote inside the string.
c_backslash = '\\'
c_dquote = '"'
c_comment = '#'
def chop_comment(line):
    # a little state machine with two state variables:
    in_quote = False  # whether we are in a quoted string right now
    backslash_escape = False  # true if we just saw a backslash
    for i, ch in enumerate(line):
        if not in_quote and ch == c_comment:
            # not in a quote, saw a '#', it's a comment. Chop it and return!
            return line[:i]
        elif backslash_escape:
            # we must have just seen a backslash; reset that flag and continue
            backslash_escape = False
        elif in_quote and ch == c_backslash:
            # we are in a quote and we see a backslash; escape next char
            backslash_escape = True
        elif ch == c_dquote:
            in_quote = not in_quote
    return line
I didn't really want to get this complicated in a question tagged "beginner" but this state machine is reasonably simple, and I hope it will be interesting.
I'm coming at this late, but the problem of handling shell style (or python style) # comments is a very common one.
I've been using the following code almost every time I read a text file.
Problem is that it doesn't handle quoted or escaped comments properly. But it works for simple cases and is easy.
for line in whatever:
    line = line.split('#',1)[0].strip()
    if not line:
        continue
    # process line
A more robust solution is to use shlex:
import shlex
for line in instream:
    lex = shlex.shlex(line)
    lex.whitespace = ''  # if you want to strip newlines, use '\n'
    line = ''.join(list(lex))
    if not line:
        continue
    # process decommented line
This shlex approach not only handles quotes and escapes properly, it adds a lot of cool functionality (like the ability to have files source other files if you want). I haven't tested it for speed on large files, but it is zippy enough on small stuff.
The common case when you're also splitting each input line into fields (on whitespace) is even simpler:
import shlex
for line in instream:
    fields = shlex.split(line, comments=True)
    if not fields:
        continue
    # process list of fields
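To see what that loop produces, here is a small self-contained sketch (Python 3). The wrapper name parse_fields and the sample lines are made up for illustration; the only library call is shlex.split with comments=True, as above.

```python
import shlex

def parse_fields(lines):
    """Return one list of fields per line that has content after comment removal."""
    rows = []
    for line in lines:
        # comments=True makes shlex treat an unquoted '#' as a comment start;
        # a '#' inside quotes stays part of the field.
        fields = shlex.split(line, comments=True)
        if fields:
            rows.append(fields)
    return rows

sample = ['# full-line comment\n', 'host02 "a # b" # trailing comment\n', '\n']
print(parse_fields(sample))
```

Quoted fields lose their quotes in the result, which is usually what you want when the quotes were only there to protect spaces or a '#'.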
This is the shortest possible form:
for line in open(filename):
    if line.startswith('#'):
        continue
    # PROCESS LINE HERE
The startswith() method on a string returns True if the string you call it on starts with the string you passed in.
While this is okay in some circumstances like shell scripts, it has two weaknesses. First, it doesn't state how the file is opened. The default mode is 'r', which means 'read the file in text mode', so 'rt' is equivalent and only spells out the intent. Since you're expecting a text file, being explicit does no harm. In Python 2 the text/binary distinction is irrelevant on UNIX-like operating systems, but it's important on Windows (and on pre-OS X Macs).
The second problem is the open file handle. The open() function returns a file object, and it's considered good practice to close files when you're done with them. To do that, call the close() method on the object. Now, Python will probably do this for you, eventually: in CPython, objects are reference-counted, and when an object's reference count drops to zero its destructor (a special method called __del__) runs and the object is freed. Note that I said probably: __del__ is not guaranteed to run for objects that still exist when the program finishes, and other Python implementations do not use reference counting at all.
For short-lived programs like shell scripts, and particularly for file objects, this doesn't matter. Your operating system will automatically clean up any file handles left open when the program finishes. But if you opened the file, read the contents, then started a long computation without explicitly closing the file handle first, Python is likely to leave the file handle open during your computation. And that's bad practice.
This version will work in any 2.x version of Python, and fixes both the problems I discussed above:
f = open(filename, 'rt')
for line in f:
    if line.startswith('#'):
        continue
    # PROCESS LINE HERE
f.close()
This is the best general form for older versions of Python.
As suggested by steveha, using the "with" statement is now considered best practice. If you're using 2.6 or above you should write it this way:
with open(filename, 'rt') as f:
    for line in f:
        if line.startswith('#'):
            continue
        # PROCESS LINE HERE
The "with" statement will clean up the file handle for you.
In your question you said "lines that start with #", so that's what I've shown you here. If you want to filter out lines that start with optional whitespace and then a '#', you should strip the whitespace before looking for the '#'. In that case, you should change this:
if line.startswith('#'):
to this:
if line.lstrip().startswith('#'):
In Python, strings are immutable, so this doesn't change the value of line. The lstrip() method returns a copy of the string with all its leading whitespace removed.
I've found recently that a generator function does a great job of this. I've used similar functions to skip comment lines, blank lines, etc.
I define my function as
def skip_comments(file):
    for line in file:
        if not line.strip().startswith('#'):
            yield line
That way, I can just do
f = open('testfile')
for line in skip_comments(f):
    print line
This is reusable across all my code, and I can add any additional handling/logging/etc. that I need.
I know that this is an old thread, but this is a generator function that I
use for my own purposes. It strips comments no matter where they
appear in the line, as well as stripping leading/trailing whitespace and
blank lines. The following source text:
# Comment line 1
# Comment line 2
# host01 # This host commented out.
host02 # This host not commented out.
host03
  host04 # Oops! Included leading whitespace in error!
will yield:
host02
host03
host04
Here is documented code, which includes a demo:
def strip_comments(item, *, token='#'):
    """Generator. Strips comments and whitespace from input lines.

    This generator strips comments, leading/trailing whitespace, and
    blank lines from its input.

    Arguments:
        item (obj): Object to strip comments from.
        token (str, optional): Comment delimiter. Defaults to ``#``.

    Yields:
        str: Next uncommented non-blank line from ``item`` with
            comments and leading/trailing whitespace stripped.
    """
    for line in item:
        s = line.split(token, 1)[0].strip()
        if s:
            yield s
if __name__ == '__main__':
    HOSTS = """# Comment line 1
# Comment line 2
# host01 # This host commented out.
host02 # This host not commented out.
host03
  host04 # Oops! Included leading whitespace in error!""".split('\n')
    hosts = strip_comments(HOSTS)
    print('\n'.join(h for h in hosts))
The normal use case will be to strip the comments from a file (i.e., a hosts file, as in my example above). If this is the case, then the tail end of the above code would be modified to:
if __name__ == '__main__':
    with open('aa.txt', 'r') as f:
        hosts = strip_comments(f)
        for host in hosts:
            print('\'%s\'' % host)
A more compact version of a filtering expression can also look like this:
for line in (l for l in open(filename) if not l.startswith('#')):
    # do something with line
(l for ... ) is called a "generator expression". Here it acts as a wrapping iterator that filters out the unneeded lines while iterating over the file. Don't confuse it with the same thing in square brackets, [l for ... ], which is a "list comprehension": that first reads all the lines from the file into memory and only then starts iterating over them.
Sometimes you might want to have it less one-liney and more readable:
lines = open(filename)
lines = (l for l in lines if ... )
# more filters and mappings you might want
for line in lines:
    # do something with line
All the filters will be executed on the fly in one iteration.
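As a concrete instance of such a chain, the sketch below stacks three generator expressions (Python 3; the function name meaningful_lines and the sample data are illustrative). Nothing is read or filtered until the result is iterated, and each line then flows through all three stages in turn.

```python
def meaningful_lines(lines):
    """Lazily yield stripped lines that are neither blank nor comments."""
    stripped = (line.strip() for line in lines)            # mapping stage
    non_blank = (line for line in stripped if line)        # drop empty lines
    return (line for line in non_blank if not line.startswith('#'))

sample = ['# comment\n', '\n', '  host02\n', 'host03\n']
for line in meaningful_lines(sample):
    print(line)
```

Passing an open file instead of the sample list works the same way and keeps memory use independent of the file size.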
Use a regex such as re.compile(r"^\s*(?:#|$)") with match() to skip blank lines and comment lines.
I tend to use
for line in lines:
    if '#' not in line:
        #do something
This will ignore the whole line, though the answer which uses partition has my upvote, as it keeps any information from before the #.
A handy way to get rid of comments, whether they are inline or on a line of their own:
def clear_coments(f):
    new_text = ''
    for line in f.readlines():
        if "#" in line:
            # keep the text before the '#' and restore the newline that split() dropped
            line = line.split("#")[0] + "\n"
        new_text += line
    return new_text