Python - Ignore first character (a tab) of every line when reading

This is a continuation of my previous questions (check them if you are curious).
I can already see the light at the end of the tunnel, but I have one last problem.
For some reason, every line starts with a TAB character.
How can I ignore that first character ("tab" (\t) in my case)?
filename = "terem.txt"
OraRend = collections.namedtuple('OraRend', 'Nap, OraKezdese, OraBefejezese, Azonosito, Terem, OraNeve, Emelet')
csv.list_dialects()
for line in csv.reader(open(filename, "rb"), delimiter='\t', lineterminator='\t\t', doublequote=False, skipinitialspace=True):
print line
orar = OraRend._make(line) # Here comes the trouble!
The text file:
http://pastebin.com/UYg4P4J1
(Can't really paste it here with all the tabs.)
I have found lstrip, strip and other methods, but they would eat all of those characters, so the filling of the tuple would fail.

You could do line = line[1:] to just strip the first character. But if you do this, you should add an assertion that the first character is indeed a tab, to avoid mangling lines that have no leading tab.
There is an easier alternative that also handles several other cases and doesn't break if the characters to be removed aren't there: you can strip all leading and trailing whitespace with line = line.strip(). Alternatively, use .lstrip() to strip only leading whitespace, and pass '\t' as the argument to either method if you want to leave other whitespace in place and just remove tabs.
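To illustrate with a made-up line: the field-separating tabs inside the line are left alone, only the edges are touched:
line = "\tfield1\tfield2\tfield3\n"

print(repr(line.strip()))       # 'field1\tfield2\tfield3'   -- leading/trailing whitespace removed
print(repr(line.lstrip()))      # 'field1\tfield2\tfield3\n' -- only leading whitespace removed
print(repr(line.lstrip('\t')))  # 'field1\tfield2\tfield3\n' -- only leading tabs removed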

To remove the first character from a string:
>>> s = "Hello"
>>> s
'Hello'
>>> s[1:]
'ello'

From the docs:
str.lstrip([chars])
Return a copy of the string with leading characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix; rather, all combinations of its values are stripped.
If you want to only remove the tab at the beginning of a line, use
str.lstrip("\t")
This has the benefit that you don't have to check that the first character is, in fact, a tab. However, if there are cases where there is more than one leading tab and you want to keep everything from the second tab on, you're going to have to use str[1:].
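A quick illustration of the difference, using a made-up string with two leading tabs:
s = "\t\tvalue"

print(repr(s.lstrip("\t")))  # 'value'   -- every leading tab removed
print(repr(s[1:]))           # '\tvalue' -- only the first character removed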

Consider this: you don't need to pass a file to csv.reader. Any file-like object, or simply an iterable of strings, works nicely.
filename = "terem.txt"
OraRend = collections.namedtuple('OraRend', 'Nap, OraKezdese, OraBefejezese, Azonosito, Terem, OraNeve, Emelet')
with open(filename, "rb") as source:
cleaned = ( line.lstrip() for line in source )
rdr= csv.reader( cleaned, delimiter='\t', lineterminator='\t\t', doublequote=False, skipinitialspace=True)
for line in rdr
print line
orar = OraRend._make(line) # Here comes the trouble!

Related

regex to highlight newline characters at the beginning and end

I am trying to figure out how to write a simple regex that would highlight newline characters only if they appear at the beginning or end of some data, while preserving the newline.
In the example below, line 1 and line 14 are both newlines. Those are the only two lines I am trying to highlight, as they appear at the beginning and end of the data.
import regex as re
from colorama import Fore, Back

def red(s):
    return Back.RED + s + Back.RESET

with open('/tmp/1.py', 'r') as f:
    data = f.read()

print(
    re.sub(r'(^\n|\n$)', red(r'\1'), data)
)
In the open expression, data has the same content as the example posted above.
When I run this, the red highlight is missing on line 1 and spans all the way across line 14. What I would like is for the color to appear only once per newline character.
You can actually use your regex, but without the "multiline" flag. Then it will treat the whole string as one line, and you will match your desired output.
^\n|\n$
Here you can see that there are two matches, and if you delete the newlines at the front or at the end, the matches disappear. On regex101 the multiline flag is set or disabled at the end of the regex line; you can do the same in your language.
https://regex101.com/r/pSRHPU/2
After reading all the comments, and suggestions, and combining a subset of them all, I finally have a working version. For anyone that is interested:
One issue I cannot overcome without writing an OS-specific check is an extra newline being added on Windows.
A couple of highlights that were picked up:
You cannot color a \n, so replace it with a space plus a newline.
I have not tested this, but by getting rid of the group replacement it may be possible to apply this to bytes as well.
Windows support can be attained with init() from colorama.
import regex as re
from colorama import Back, init

init()  # for windows

def red(s):
    return Back.RED + s + Back.RESET

with open('/tmp/1.py', 'r') as f:
    data = f.read()

first_line = re.sub(r'\A\n', red(' ') + '\n', data)
last_line = re.sub(r'\n\Z', '\n' + red(' '), first_line)
print(last_line)
Output on OSX/Linux and on Windows (screenshots omitted).
I found a way that seems to allow you to match the start/end of the whole string. See the "Permanent Start of String and End of String Anchors" part from https://www.regular-expressions.info/anchors.html
\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string.
I created a demo here https://regex101.com/r/n2DAWh/1
Regex is: (\A\n|\n\Z)
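A small sketch of how these anchors behave with Python's re module (the sample string is my own):
import re

data = "\nfirst line\nmiddle\nlast line\n"

# \A matches only at the very start of the string and \Z only at the very end,
# regardless of the MULTILINE flag, so only the outermost newlines are found.
print(re.findall(r'\A\n|\n\Z', data))  # ['\n', '\n']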

Most efficient way to delete needless newlines in Python

I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map, like this:
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?
The nearest equivalent to the tcl string map would be str.translate, but unfortunately it can only map single characters. So it would be necessary to use a regexp to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r's have to be replaced first:
import re
oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
\tThis would keep paragraphs separated.
\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
output:
This would keep paragraphs separated. This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
When, in the course of human events, it becomes necessary for one people
I doubt whether this is as efficient as the tcl code, though.
UPDATE:
I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my tcl script:
set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
puts $newtext
and my python equivalent:
import re

with open('gutenberg.txt') as stream:
    oldtext = stream.read()

newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
print(newtext)
Crude performance test:
$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30
So, as expected, the tcl version is more efficient. However, the output from the python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
You can use a regular expression with a look-ahead search:
import re
text = """
...
"""
newtext = re.sub(r"\n(?=[^\n\t])", " ", text)
That will replace any new line that is not followed by a newline or a tab with a space.
I use the following script when I want to do this:
import sys
import os

filename, extension = os.path.splitext(sys.argv[1])

with open(filename + extension, encoding='utf-8-sig') as (file
), open(filename + "_unwrapped" + extension, 'w', encoding='utf-8-sig') as (output
):
    *lines, last = list(file)
    for line in lines:
        if line == "\n":
            line = "\n\n"
        elif line[0] == "\t":
            line = "\n" + line[:-1] + " "
        else:
            line = line[:-1] + " "
        output.write(line)
    output.write(last)
A "blank" line, with only a linefeed, turns into two linefeeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two linefeeds.
A line beginning with a tab gets a leading linefeed (to replace the one removed from the previous line) and gets its trailing linefeed replaced with a space. This handles files that separate paragraphs with a tab character.
A line that is neither blank nor beginning with a tab gets its trailing linefeed replaced with a space.
The last line in the file may not have a trailing linefeed and therefore gets copied directly.

python convert multiline to single line

I want to convert a Python multiline string to a single line. If I open the string in Vim, I can see ^M at the start of each line. How do I process the string to make it all one single line, with a tab separating what used to be separate lines? Example: in Vim it looks like:
Serialnumber
^MName Rick
^MAddress 902, A.street, Elsewhere
I would like it to be something like:
Serialnumber \t Name \t Rick \t Address \t 902, A.street,......
where each string is in one line. I tried
somestring.replace(r'\r','\t')
But it doesn't work. Also, once the string is in a single line if I wanted a newline(UNIX newline?) at the end of the string how would I do that?
Deleted my previous answer because I realized it was wrong and I needed to test this solution.
Assuming that you are reading this from the file, you can do the following:
f = open('test.txt', 'r')
lines = f.readlines()
mystr = '\t'.join([line.strip() for line in lines])
As ep0 said, the ^M represents '\r', which is the carriage-return character on Windows. It is surprising that you would have ^M at the beginning of each line, since the Windows newline sequence is \r\n; having ^M at the beginning of a line suggests that your file contains \n\r instead.
Regardless, the code above uses a list comprehension to loop over each of the lines read from test.txt. For each line in lines, we call str.strip() to remove whitespace (including \r and \n) from both ends of the line. Finally, we call '\t'.join() on the resulting list to insert tabs.
You can replace "\r" characters by "\t".
my_string.replace("\r", "\t")
I use splitlines() to detect all types of lines, and then join everything together. This way you don't have to guess to replace \r or \n etc.
"".join(somestring.splitlines())
This is hard-coded, but it works.
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
lst = list(poem)
str = ''
for i in lst:
    str += i
print(str)

lst1 = str.split("\n")
str1 = ""
for i in lst1:
    str1 += i + " "

str2 = str1[:-2]
print(str2)
This happens because of how Vim displays CR (carriage return), which Windows uses as part of its line delimiter. You should stick to just one editor (I personally prefer Vim). Read this: VIM ^M
This trick can also be useful: write "\n" as a raw string, like:
my_string = my_string.replace(r"\n", "\t")
this should do the work:
def flatten(multiline):
    lst = multiline.split('\n')
    flat = ''
    for line in lst:
        flat += line.replace(' ', '') + ' '
    return flat
This should do the job:
string = """Name Rick
Address 902, A.street, Elsewhere"""
single_line = string.replace("\n", "\t")

How to read a CSV line with "?

A trivial CSV line could be split using the string split function. But some lines could have ", e.g.
"good,morning", 100, 300, "1998,5,3"
thus directly using string split would not solve the problem.
My solution is to first split the line on commas, and then re-combine the strings that have a " at the beginning or end.
What's the best practice for this problem?
I am interested if there's a Python or F# code snippet for this.
EDIT: I am more interested in the implementation detail, rather than using a library.
There's a csv module in Python, which handles this.
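For the exact line from the question, a quick sketch (csv.reader accepts any iterable of strings, so a one-element list works):
import csv

line = '"good,morning", 100, 300, "1998,5,3"'
row = next(csv.reader([line], skipinitialspace=True))
print(row)  # ['good,morning', '100', '300', '1998,5,3']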
Edit: This task falls into the "build a lexer" category. The standard way to do such tasks is to build a state machine (or use a lexer library/framework that will do it for you).
The state machine for this task would probably only need two states:
The initial state, where it reads every character except comma and newline as part of the field (exception: leading and trailing spaces), reads a comma as the field separator, and a newline as the record separator. When it encounters an opening quote it goes into the
read-quoted-field state, where every character (including comma and newline) except the quote is treated as part of the field; a quote not followed by another quote ends the quoted field (back to the initial state), while a quote followed by a quote is treated as a single quote (an escaped quote). A sketch of this machine follows below.
By the way, your concatenating solution will break on "Field1","Field2" or "Field1"",""Field2".
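A rough sketch of the two-state machine described above (function and variable names are my own; simplified and not exhaustively tested):
def parse_csv(text, sep=',', quote='"'):
    """Minimal two-state CSV parser: 'initial' vs. 'inside a quoted field'."""
    records, row, field = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(text) and text[i + 1] == quote:
                    field.append(quote)  # doubled quote = escaped quote
                    i += 1
                else:
                    in_quotes = False    # closing quote: back to the initial state
            else:
                field.append(ch)         # commas and newlines are kept literally
        else:
            if ch == quote:
                in_quotes = True
            elif ch == sep:
                row.append(''.join(field).strip())
                field = []
            elif ch == '\n':
                row.append(''.join(field).strip())
                records.append(row)
                row, field = [], []
            else:
                field.append(ch)
        i += 1
    if field or row:                     # flush a final record with no trailing newline
        row.append(''.join(field).strip())
        records.append(row)
    return records

print(parse_csv('"good,morning", 100, 300, "1998,5,3"\n'))
# [['good,morning', '100', '300', '1998,5,3']]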
From python's CSV module:
reading a normal CSV file:
import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    print row
Reading a file with an alternate format:
import csv
reader = csv.reader(open("passwd", "rb"), delimiter=':', quoting=csv.QUOTE_NONE)
for row in reader:
    print row
There are some nice usage examples in LinuxJournal.com.
If you're interested with the details, read "split string at commas respecting quotes when string not in csv format" showing some nice regexen related to this problem, or simply read the csv module source.
Chapter 4 of The Practice of Programming gave both C and C++ implementations of the CSV parser.
The generic implementation detail would be something like this (untested):
SEPARATOR = ','

def csvline2fields(line):
    fields = []
    while line.strip():
        line = line.strip()
        if line[0] in ("'", '"'):
            # Find the matching closing quote:
            end = line.find(line[0], 1)
            fields.append(line[1:end])
            # Find the beginning of the next field
            next = line.find(SEPARATOR, end)
            if next == -1:
                break
            line = line[next+1:]
            continue
        # find the next separator:
        next = line.find(SEPARATOR)
        if next == -1:
            fields.append(line)
            break
        fields.append(line[0:next])
        line = line[next+1:]
    return fields

Python: How to ignore #comment lines when reading in a file

In Python, I have just read a line from a text file and I'd like to know how to code it so that it ignores comments with a hash # at the beginning of the line.
I think it should be something like this:
for
if line !contain #
then ...process line
else end for loop
But I'm new to Python and I don't know the syntax
You can use startswith(), e.g.
for line in open("file"):
    li = line.strip()
    if not li.startswith("#"):
        print line.rstrip()
I recommend you don't ignore the whole line when you see a # character; just ignore the rest of the line. You can do that easily with a string method called partition:
with open("filename") as f:
    for line in f:
        line = line.partition('#')[0]
        line = line.rstrip()
        # ... do something with line ...
partition returns a tuple: everything before the partition string, the partition string, and everything after the partition string. So, by indexing with [0] we take just the part before the partition string.
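For instance, with a sample line of my own:
line = "value1 value2  # trailing comment"

print(line.partition('#'))
# ('value1 value2  ', '#', ' trailing comment')
print(line.partition('#')[0].rstrip())
# 'value1 value2'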
EDIT:
If you are using a version of Python that doesn't have partition(), here is code you could use:
with open("filename") as f:
for line in f:
line = line.split('#', 1)[0]
line = line.rstrip()
# ... do something with line ...
This splits the string on a '#' character, then keeps everything before the split. The 1 argument makes the .split() method stop after a one split; since we are just grabbing the 0th substring (by indexing with [0]) you would get the same answer without the 1 argument, but this might be a little bit faster. (Simplified from my original code thanks to a comment from #gnr. My original code was messier for no good reason; thanks, #gnr.)
You could also just write your own version of partition(). Here is one called part():
def part(s, s_part):
    i0 = s.find(s_part)
    i1 = i0 + len(s_part)
    return (s[:i0], s[i0:i1], s[i1:])
@dalle noted that '#' can appear inside a string. It's not that easy to handle this case correctly, so I just ignored it, but I should have said something.
If your input file has simple enough rules for quoted strings, this isn't hard. It would be hard if you accepted any legal Python quoted string, because there are single-quoted, double-quoted, multiline quotes with a backslash escaping the end-of-line, triple quoted strings (using either single or double quotes), and even raw strings! The only possible way to correctly handle all that would be a complicated state machine.
But if we limit ourselves to just a simple quoted string, we can handle it with a simple state machine. We can even allow a backslash-quoted double quote inside the string.
c_backslash = '\\'
c_dquote = '"'
c_comment = '#'

def chop_comment(line):
    # a little state machine with two state variables:
    in_quote = False          # whether we are in a quoted string right now
    backslash_escape = False  # true if we just saw a backslash

    for i, ch in enumerate(line):
        if not in_quote and ch == c_comment:
            # not in a quote, saw a '#', it's a comment.  Chop it and return!
            return line[:i]
        elif backslash_escape:
            # we must have just seen a backslash; reset that flag and continue
            backslash_escape = False
        elif in_quote and ch == c_backslash:
            # we are in a quote and we see a backslash; escape next char
            backslash_escape = True
        elif ch == c_dquote:
            in_quote = not in_quote

    return line
I didn't really want to get this complicated in a question tagged "beginner" but this state machine is reasonably simple, and I hope it will be interesting.
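A couple of illustrative calls (the test strings are my own), showing that a '#' inside double quotes is preserved:
print(repr(chop_comment('x = 1  # set x')))
# 'x = 1  '
print(repr(chop_comment('name = "not # a comment"  # real comment')))
# 'name = "not # a comment"  '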
I'm coming at this late, but the problem of handling shell style (or python style) # comments is a very common one.
I've been using some code almost every time I read a text file.
Problem is that it doesn't handle quoted or escaped comments properly. But it works for simple cases and is easy.
for line in whatever:
    line = line.split('#', 1)[0].strip()
    if not line:
        continue
    # process line
A more robust solution is to use shlex:
import shlex

for line in instream:
    lex = shlex.shlex(line)
    lex.whitespace = ''  # if you want to strip newlines, use '\n'
    line = ''.join(list(lex))
    if not line:
        continue
    # process decommented line
This shlex approach not only handles quotes and escapes properly, it adds a lot of cool functionality (like the ability to have files source other files, if you want). I haven't tested it for speed on large files, but it is zippy enough on small stuff.
The common case when you're also splitting each input line into fields (on whitespace) is even simpler:
import shlex

for line in instream:
    fields = shlex.split(line, comments=True)
    if not fields:
        continue
    # process list of fields
This is the shortest possible form:
for line in open(filename):
    if line.startswith('#'):
        continue
    # PROCESS LINE HERE
The startswith() method on a string returns True if the string you call it on starts with the string you passed in.
While this is okay in some circumstances like shell scripts, it has two problems. First, it doesn't specify how to open the file. The default mode for opening a file is 'r', which leaves the text-mode intent implicit; since you're expecting a text file, it is clearer to open it with 'rt'. Although the text/binary distinction is irrelevant on UNIX-like operating systems, it's important on Windows (and on pre-OS X Macs).
The second problem is the open file handle. The open() function returns a file object, and it's considered good practice to close files when you're done with them. To do that, call the close() method on the object. Now, Python will probably do this for you, eventually; in Python objects are reference-counted, and when an object's reference count goes to zero it gets freed, and at some point after an object is freed Python will call its destructor (a special method called __del__). Note that I said probably: Python has a bad habit of not actually calling the destructor on objects whose reference count drops to zero shortly before the program finishes. I guess it's in a hurry!
For short-lived programs like shell scripts, and particularly for file objects, this doesn't matter. Your operating system will automatically clean up any file handles left open when the program finishes. But if you opened the file, read the contents, then started a long computation without explicitly closing the file handle first, Python is likely to leave the file handle open during your computation. And that's bad practice.
This version will work in any 2.x version of Python, and fixes both the problems I discussed above:
f = open(file, 'rt')
for line in f:
    if line.startswith('#'):
        continue
    # PROCESS LINE HERE
f.close()
This is the best general form for older versions of Python.
As suggested by steveha, using the "with" statement is now considered best practice. If you're using 2.6 or above you should write it this way:
with open(filename, 'rt') as f:
    for line in f:
        if line.startswith('#'):
            continue
        # PROCESS LINE HERE
The "with" statement will clean up the file handle for you.
In your question you said "lines that start with #", so that's what I've shown you here. If you want to filter out lines that start with optional whitespace and then a '#', you should strip the whitespace before looking for the '#'. In that case, you should change this:
if line.startswith('#'):
to this:
if line.lstrip().startswith('#'):
In Python, strings are immutable, so this doesn't change the value of line. The lstrip() method returns a copy of the string with all its leading whitespace removed.
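For example, with a couple of sample strings:
print("   # indented comment".lstrip().startswith('#'))  # True
print("#comment".startswith('#'))                         # True
print("value = 1".lstrip().startswith('#'))               # False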
I've found recently that a generator function does a great job of this. I've used similar functions to skip comment lines, blank lines, etc.
I define my function as
def skip_comments(file):
    for line in file:
        if not line.strip().startswith('#'):
            yield line
That way, I can just do
f = open('testfile')
for line in skip_comments(f):
    print line
This is reusable across all my code, and I can add any additional handling/logging/etc. that I need.
I know that this is an old thread, but this is a generator function that I
use for my own purposes. It strips comments no matter where they
appear in the line, as well as stripping leading/trailing whitespace and
blank lines. The following source text:
# Comment line 1
# Comment line 2
# host01 # This host commented out.
host02 # This host not commented out.
host03
    host04 # Oops! Included leading whitespace in error!
will yield:
host02
host03
host04
Here is documented code, which includes a demo:
def strip_comments(item, *, token='#'):
    """Generator. Strips comments and whitespace from input lines.

    This generator strips comments, leading/trailing whitespace, and
    blank lines from its input.

    Arguments:
        item (obj): Object to strip comments from.
        token (str, optional): Comment delimiter. Defaults to ``#``.

    Yields:
        str: Next uncommented non-blank line from ``item`` with
            comments and leading/trailing whitespace stripped.

    """
    for line in item:
        s = line.split(token, 1)[0].strip()
        if s:
            yield s

if __name__ == '__main__':
    HOSTS = """# Comment line 1
# Comment line 2
# host01 # This host commented out.
host02 # This host not commented out.
host03
    host04 # Oops! Included leading whitespace in error!""".split('\n')

    hosts = strip_comments(HOSTS)
    print('\n'.join(h for h in hosts))
The normal use case will be to strip the comments from a file (i.e., a hosts file, as in my example above). If this is the case, then the tail end of the above code would be modified to:
if __name__ == '__main__':
    with open('aa.txt', 'r') as f:
        hosts = strip_comments(f)
        for host in hosts:
            print('\'%s\'' % host)
A more compact version of a filtering expression can also look like this:
for line in (l for l in open(filename) if not l.startswith('#')):
    # do something with line
(l for ... ) is called a "generator expression", which acts here as a wrapping iterator that will filter out all unneeded lines from the file while iterating over it. Don't confuse it with the same thing in square brackets [l for ... ], which is a "list comprehension" that will first read all the lines from the file into memory and only then start iterating over it.
Sometimes you might want to have it less one-liney and more readable:
lines = open(filename)
lines = (l for l in lines if ... )
# more filters and mappings you might want
for line in lines:
    # do something with line
All the filters will be executed on the fly in one iteration.
Use a regex such as re.compile(r"^\s*(#|$)") to skip blank lines and comment lines.
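A short usage sketch of that idea (the file name here is made up):
import re

comment_or_blank = re.compile(r'^\s*(#|$)')

with open('settings.txt') as f:   # hypothetical input file
    for line in f:
        if comment_or_blank.match(line):
            continue
        # process line here
        print(line.rstrip())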
I tend to use
for line in lines:
    if '#' not in line:
        # do something
This will ignore the whole line, though the answer that uses partition has my upvote, as it can include any information from before the #.
A good way to get rid of comments that works both for inline comments and for whole comment lines:
def clear_comments(f):
    new_text = ''
    for line in f.readlines():
        if "#" in line:
            line = line.split("#")[0]
        new_text += line
    return new_text
