Removing \n from readlines with rstrip("\n") - python

im reading 2 files per lines that contains strings
and grouping per other one here in my code im grooping 1 line from proxyfile with 3 lines from websitefiles
major problem when i try to print the lines from 2nd file i cant remove the \n
i have tried to use rstrip("\n") but i got this exception
AttributeError: 'tuple' object has no attribute 'rstrip'
this is my code :
proxy = open("proxy.txt",'r')
readproxy = proxy.readlines()
sites = open("websites.txt",'r')
readsites = sites.readlines()
for i,j in zip(readproxy, zip(*(iter(readsites),) * 3)):
try:
i = i.rstrip("\n")
#j = j.rstrip("\n")
print(i,j)
except ValueError:
print 'whatever'
i removed the \n from "i" successfully , but coudent remove it from "j"
this screenshot can explain all i think

A tuple doesn't have a method that magically propagates to all string elements.
To do that you have to apply rstrip for all items in a list comprehension to create a list.
j = [x.rstrip("\n") for x in j]
that rebuilds a list with items stripped. Convert to tuple if really needed:
j = tuple(x.rstrip("\n") for x in j)
note that you could avoid this post processing by doing:
readsites = sites.read().splitlines()
instead of sites.readlines() since splitlines splits into lines while removing the termination character if no argument is specified. The only drawback is in the case when the file is huge. In that case, you could encounter a memory issue (memory needed by read + split is doubled). In which case, you could do:
with open("websites.txt",'r') as sites:
readsites = [line.rstrip("\n") for line in sites]

Related

What causes this return() to create a SyntaxError?

I need this program to create a sheet as a list of strings of ' ' chars and distribute text strings (from a list) into it. I have already coded return statements in python 3 but this one keeps giving
return(riplns)
^
SyntaxError: invalid syntax
It's the return(riplns) on line 39. I want the function to create a number of random numbers (randint) inside a range built around another randint, coming from the function ripimg() that calls this one.
I see clearly where the program declares the list I want this return() to give me. I know its type. I see where I feed variables (of the int type) to it, through .append(). I know from internet research that SyntaxErrors on python's return() functions usually come from mistype but it doesn't seem the case.
#loads the asciified image ("/home/userX/Documents/Programmazione/Python projects/imgascii/myascify/ascimg4")
#creates a sheet "foglio1", same number of lines as the asciified image, and distributes text on it on a randomised line
#create the sheet foglio1
def create():
ref = open("/home/userX/Documents/Programmazione/Python projects/imgascii/myascify/ascimg4")
charcount = ""
field = []
for line in ref:
for c in line:
if c != '\n':
charcount += ' '
if c == '\n':
charcount += '*' #<--- YOU GONNA NEED TO MAKE THIS A SPACE IN A FOLLOWING FUNCTION IN THE WRITER.PY PROGRAM
for i in range(50):#<------- VALUE ADJUSTMENT FROM WRITER.PY GOES HERE(default : 50):
charcount += ' '
charcount += '\n'
break
for line in ref:
field.append(charcount)
return(field)
#turn text in a list of lines and trasforms the lines in a list of strings
def poemln():
txt = open("/home/gcg/Documents/Programmazione/Python projects/imgascii/writer/poem")
arrays = []
for line in txt:
arrays.append(line)
txt.close()
return(arrays)
#rander is to be called in ripimg()
def rander(rando, fldepth):
riplns = []
for i in range(fldepth):
riplns.append(randint((rando)-1,(rando)+1)
return(riplns) #<---- THIS RETURN GIVES SyntaxError upon execution
#opens a rip on the side of the image.
def ripimg():
upmost = randint(160, 168)
positions = []
fldepth = 52 #<-----value is manually input as in DISTRIB function.
positions = rander(upmost,fldepth)
return(positions)
I omitted the rest of the program, I believe these functions are enough to get the idea, please tell me if I need to add more.
You have incomplete set of previous line's parenthesis .
In this line:-
riplns.append(randint((rando)-1,(rando)+1)
You have to add one more brace at the end. This was causing error because python was reading things continuously and thought return statement to be a part of previous uncompleted line.

Replace space-char by newline-char within max-line-length in python

I am trying to write a python script,
which breaks a continuous string into lines,
when the max_line_length has been exceeded.
It shall not break words,
and searches therefore the last occurrence of a whitespace-char,
which will be replaced by a newline-char.
For some reason it does not break within the specified limit.
E.g. when defining the max_line_length = 80,
the text sometimes breaks at 82 or 83, etc.
Since quite some time I am trying to fix the problem,
however it feels like i am having the tunnel vision
and don't see the problem here:
#!/usr/bin/python
import sys
if len(sys.argv) < 3:
print('usage: $ python3 breaktext.py <max_line_length> <file>')
print('example: $ python3 breaktext.py 80 infile.txt')
exit()
filename = str(sys.argv[2])
with open(filename, 'r') as file:
text_str = file.read().replace('\n', '')
m = int(sys.argv[1]) # max_line_length
text_list = list(text_str) # convert string to list
l = 0; # line_number
i = m+1 # line_character_index
index = m+1 # total_list_index
while index < len(text_list):
while text_list[l * m + i] != ' ':
i -= 1
pass
text_list[l * m + i] = '\n'
l += 1
i = m+1
index += m+1
pass
text_str = ''.join(text_list)
print(text_str)
I guess we'll take this from the top.
text_str = file.read().replace('\n', '')
Here's one assumption about the input data I don't know if it's true. You're replacing all the newline characters with nothing; if there weren't spaces next to them, this means the code below will never break the lines in the same places.
text_list = list(text_str) # convert string to list
This splits the input file into single character strings. I guess you might have done so to make it mutable, such that you can replace individual characters, but it's a very expensive operation and loses all the features of a string. Python is a high level language that would allow you to split into e.g. words instead.
index = m+1 # total_list_index
while index < len(text_list):
#...
index += m+1
Let's consider what this means. We're not entering into the loop if index exceeds the text_list length. But index is advancing in steps of m+1. So we're splitting math.floor(len(text)/(max_line_length+1)) times. Unless every line is exactly max_line_length characters, not counting its space we replace with a newline, that's too few times. Too few times means too long lines, at least at the end.
l = 0; # line_number
i = m+1 # line_character_index
#loop:
while text_list[l * m + i] != ' ':
i -= 1
text_list[l * m + i] = '\n'
l += 1
i = m+1
This is making things difficult with index math. Quite clearly the one index we ever use is l * m + i. This moves in a quite odd way; it searches backwards for a space, then leaps forward as l increments and i resets. Whatever position it had reversed to is lost as all the leaps are in steps of m.
Let's apply m=5 to the string "Fee fie faw fum who did you see now". For the first iteration, 0 * 5 + 5+1 hits the second word, and i seeks back to the first space. The first line then is "Fee", as expected. The second search starts at 1*5 + 5+1, which is a space, and the second line becomes "fie faw", which already exceeds our limit of 5! The reason is that l * m isn't the beginning of the line; it's actually in the middle of "fie", a discrepancy which can only grow as you continue through the file. It grows whenever you split off a line that is shorter than m.
The solution involves remembering where you did your split. That could be as simple as replacing l * m with index, and updating it by index += i instead of m+1.
Another odd effect happens if you ever encounter a word that exceeds the maximum line length. Beyond meaning a line is longer than the limit, i will still search backwards until it finds a space; that space could then be in an earlier line altogether, producing extra short lines as well as too long ones. That's a result of handling the entire text as one array and not limiting which section we're looking at.
Personally I'd much rather use Python's built in methods, such as str.rindex, which can find a particular character in a given region within a string:
s = "Fee fie faw fum who did you see now"
maxlen = 5
start = 8
end = s.rindex(' ', start, start+maxlen)
print(s[start:end])
start = end + 1
We also, as PaulMcG pointed out, can go full "batteries included" and use the standard library textwrap module for the entire task.

Regex vs readline for text processing

I have a text to process (router output) and generate useful data structure (dictionary having keys as iface name and values as packet counts) from it. I have two approaches to do the same task. I would like to know which one should I use for efficiency and which one looks more prone to fail for bigger data samples.
Readline1 gets a list from readline and processes output and writes into the dictionary with key as interface name and values as next three items.
Readline2 uses re module and match the groups and from groups it writes to dictionary keys and values.
input self.output to these functions will be something like this:
message =
"""
Interface 1/1\n\t
input : 1234\n\t
output : 3456\n\t
dropped : 12\n
\n
Interface 1/2\n\t
input : 7123\n\t
output : 2345\n\t
dropped : 31\n\t
"""
def ReadLine1(self):
lines = self.output.splitlines()
for index, line in enumerate(lines):
if "Interface" in line:
valuelist = []
for i in [1,2,3]:
valuelist.append((lines[index+i].split(':'))[1].strip())
self.IFlist[line.split()[1]] = valuelist
return self.IFlist
def Readline2(self):
#print repr(self.output)
n = re.compile(r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)",re.MULTILINE|re.DOTALL)
blocks = self.output.split('\n\n')
for block in blocks:
m_object = re.match(n, block)
self.IFlist[m_object.group(1)] = [m_object.group(i) for i in (2,3,4)]
Both of your methods use specific aspects of the format to achieve the parsing you are trying to do, and if that format was changed / broken one of the methods could also break...
For example if you added a space in the empty line between the two entries (which you cannot see) then the blocks = self.output.split('\n\n') would fail to find two consecutive newline characters and the regex version would miss out on the second entry:
{'1/1': ['1234', '3456', '13']}
Or if you added an extra newline between input and output like this:
Interface 1/2
input : 7123
output : 2345
dropped : 31
The regex \s* would deal with the extra space fine but the non-regex parsing would assume that lines[index+i].split(':') has an indice [1] so it would raise an IndexError with that data
Or if you added some extra space at the end of any line then the regex would fail to see the newline right after the content and re.match(n, lock) would return None so the next line would raise an AttributeError: 'NoneType' object has no attribute 'group'
Or if you changed Interface to interface for one of the entries (no longer capital I) then the regex would raise the same error as above but the non-regex would simply ignore that entry.
While I was testing it I found that the regex was easier to mess up with small edits to the sample message, but I also found that the version I made using a generator expression and str.partition was significantly more robust then both of them:
def readline3():
gen_lines = (line for line in self.output.splitlines()
if line and not line.isspace())
try:
while True: #ended when next() throws a StopIteration
start,_,key = next(gen_lines).partition(" ")
if start == "Interface":
IFlist[key] = [next(gen_lines).rpartition(" : ")[2]
for _ in "123"]
except StopIteration: # reached end of output
return self.IFlist
This succeeded in every case mentioned above and a few more, and since the only method this is relying on is str.partition which alway returns a 3 item tuple there is nothing to raise any unexpected errors unless self.output is something other then a string.
Also running a benchmark using timeit your readline1 consistently was faster then readline2 and my readline3 was usually slightly more then readline1:
#using the default 1000000 loops using 'message'
<function readline1 at 0x100756f28>
11.225649802014232
<function readline2 at 0x1057e3950>
14.838601427007234
<function readline3 at 0x1057e39d8>
11.693351223017089

Python: split line by comma, then by space

I'm using Python 3 and I need to parse a line like this
-1 0 1 0 , -1 0 0 1
I want to split this into two lists using Fraction so that I can also parse entries like
1/2 17/12 , 1 0 1 1
My program uses a structure like this
from sys import stdin
...
functions'n'stuff
...
for line in stdin:
and I'm trying to do
for line in stdin:
X = [str(elem) for elem in line.split(" , ")]
num = [Fraction(elem) for elem in X[0].split()]
den = [Fraction(elem) for elem in X[1].split()]
but all I get is a list index out of range error: den = [Fraction(elem) for elem in X[1].split()]
IndexError: list index out of range
I don't get it. I get a string from line. I split that string into two strings at " , " and should get one list X containing two strings. These I split at the whitespace into two separate lists while converting each element into Fraction. What am I missing?
I also tried adding X[-1] = X[-1].strip() to get rid of \n that I get from ending the line.
The problem is that your file has a line without a " , " in it, so the split doesn't return 2 elements.
I'd use split(',') instead, and then use strip to remove the leading and trailing blanks. Note that str(...) is redundant, split already returns strings.
X = [elem.strip() for elem in line.split(",")]
You might also have a blank line at the end of the file, which would still only produce one result for split, so you should have a way to handle that case.
With valid input, your code actually works.
You probably get an invalid line, with too much space or even an empty line or so. So first thing inside the loop, print line. Then you know what's going on, you can see right above the error message what the problematic line was.
Or maybe you're not using stdin right. Write the input lines in a file, make sure you only have valid lines (especially no empty lines). Then feed it into your script:
python myscript.py < test.txt
How about this one:
pairs = [line.split(",") for line in stdin]
num = [fraction(elem[0]) for elem in pairs if len(elem) == 2]
den = [fraction(elem[1]) for elem in pairs if len(elem) == 2]

Python/IPython strange non reproducible list index out of range error

I have recently been learning some Python and how to apply it to my work. I have written a couple of scripts successfully, but I am having an issue I just cannot figure out.
I am opening a file with ~4000 lines, two tab separated columns per line. When reading the input file, I get an index error saying that the list index is out of range. However, while I get the error every time, it doesn't happen on the same line every time (as in, it will throw the error on different lines everytime!). So, for some reason, it works generally but then (seemingly) randomly fails.
As I literally only started learning Python last week, I am stumped. I have looked around for the same problem, but not found anything similar. Furthermore I don't know if this is a problem that is language specific or IPython specific. Any help would be greatly appreciated!
input = open("count.txt", "r")
changelist = []
listtosort = []
second = str()
output = open("output.txt", "w")
for each in input:
splits = each.split("\t")
changelist = list(splits[0])
second = int(splits[1])
print second
if changelist[7] == ";":
changelist.insert(6, "000")
va = "".join(changelist)
var = va + ("\t") + str(second)
listtosort.append(var)
output.write(var)
elif changelist[8] == ";":
changelist.insert(6, "00")
va = "".join(changelist)
var = va + ("\t") + str(second)
listtosort.append(var)
output.write(var)
elif changelist[9] == ";":
changelist.insert(6, "0")
va = "".join(changelist)
var = va + ("\t") + str(second)
listtosort.append(var)
output.write(var)
else:
#output.write(str("".join(changelist)))
va = "".join(changelist)
var = va + ("\t") + str(second)
listtosort.append(var)
output.write(var)
output.close()
The error
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/home/a/Desktop/sharedfolder/ipytest/individ.ins.count.test/<ipython-input-87-32f9b0a1951b> in <module>()
57 splits = each.split("\t")
58 changelist = list(splits[0])
---> 59 second = int(splits[1])
60
61 print second
IndexError: list index out of range
Input:
ID=cds0;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
Desired output:
ID=cds0000;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
The reason you're getting the IndexError is that your input-file is apparently not entirely tab delimited. That's why there is nothing at splits[1] when you attempt to access it.
Your code could use some refactoring. First of all you're repeating yourself with the if-checks, it's unnecessary. This just pads the cds0 to 7 characters which is probably not what you want. I threw the following together to demonstrate how you could refactor your code to be a little more pythonic and dry. I can't guarantee it'll work with your dataset, but I'm hoping it might help you understand how to do things differently.
to_sort = []
# We can open two files using the with statement. This will also handle
# closing the files for us, when we exit the block.
with open("count.txt", "r") as inp, open("output.txt", "w") as out:
for each in inp:
# Split at ';'... So you won't have to worry about whether or not
# the file is tab delimited
changed = each.split(";")
# Get the value you want. This is called unpacking.
# The value before '=' will always be 'ID', so we don't really care about it.
# _ is generally used as a variable name when the value is discarded.
_, value = changed[0].split("=")
# 0-pad the desired value to 7 characters. Python string formatting
# makes this very easy. This will replace the current value in the list.
changed[0] = "ID={:0<7}".format(value)
# Join the changed-list with the original separator and
# and append it to the sort list.
to_sort.append(";".join(changed))
# Write the results to the file all at once. Your test data already
# provided the newlines, you can just write it out as it is.
output.writelines(to_sort)
# Do what else you need to do. Maybe to_list.sort()?
You'll notice that this code is reduces your code down to 8 lines but achieves the exact same thing, does not repeat itself and is pretty easy to understand.
Please read the PEP8, the Zen of python, and go through the official tutorial.
This happens when there is a line in count.txt which doesn't contain the tab character. So when you split by tab character there will not be any splits[1]. Hence the error "Index out of range".
To know which line is causing the error, just add a print(each) after splits in line 57. The line printed before the error message is your culprit. If your input file keeps changing, then you will get different locations. Change your script to handle such malformed lines.

Categories