Python Textwrap - forcing 'hard' breaks - python

I am trying to use textwrap to format an import file that is quite particular in how it is formatted. Basically, it is as follows (line length shortened for simplicity):
abcdef <- Ok line
abcdef
ghijk <- Note leading space to indicate wrapped line
lm
Now, I have got code to work as follows:
wrapper = TextWrapper(width=80, subsequent_indent=' ', break_long_words=True, break_on_hyphens=False)
for l in lines:
wrapline=wrapper.wrap(l)
This works nearly perfectly, however, the text wrapping code doesn't do a hard break at the 80 character mark, it tries to be smart and break on a space (at approx 20 chars in).
I have got round this by replacing all spaces in the string list with a unique character (#), wrapping them and then removing the character, but surely there must be a cleaner way?
N.B Any possible answers need to work on Python 2.4 - sorry!

A generator-based version might be a better solution for you, since it wouldn't need to load the entire string in memory at once:
def hard_wrap(input, width, indent=' '):
for line in input:
indent_width = width - len(indent)
yield line[:width]
line = line[width:]
while line:
yield '\n' + indent + line[:indent_width]
line = line[indent_width:]
Use it like this:
from StringIO import StringIO # Makes strings look like files
s = """abcdefg
abcdefghijklmnopqrstuvwxyz"""
for line in hard_wrap(StringIO(s), 12):
print line,
Which prints:
abcdefg
abcdefghijkl
mnopqrstuvw
xyz

It sounds like you are disabling most of the functionality of TextWrapper, and then trying to add a little of your own. I think you'd be better off writing your own function or class. If I understand you right, you're simply looking for lines longer than 80 chars, and breaking them at the 80-char mark, and indenting the remainder by one space.
For example, this:
s = """\
This line is fine.
This line is very long and should wrap, It'll end up on a few lines.
A short line.
"""
def hard_wrap(s, n, indent):
wrapped = ""
n_next = n - len(indent)
for l in s.split('\n'):
first, rest = l[:n], l[n:]
wrapped += first + "\n"
while rest:
next, rest = rest[:n_next], rest[n_next:]
wrapped += indent + next + "\n"
return wrapped
print hard_wrap(s, 20, " ")
produces:
This line is fine.
This line is very lo
ng and should wrap,
It'll end up on a
few lines.
A short line.

Related

Python String Wrap

I want to wrap the string at 30,700 in this script. What is the best way of doing this, I have tried using textWrap but it does not seem to work. This is my code:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
canvas = canvas.Canvas("Forensic Report.pdf", pagesize=letter)
canvas.setLineWidth(.3)
canvas.setFont('Helvetica', 12)
canvas.drawString(30,750,'LYIT MOBILE FORENSICS DIVISION')
canvas.drawString(500,750,"Date: 12/02/2018")
canvas.line(500,747,595,747)
canvas.drawString(500,725,'Case Number:')
canvas.drawString(580,725,"10")
canvas.line(500,723,595,723)
canvas.drawString(30, 700, 'This forensic report has been compiled by the forensic examiner in conclusion to the investigation into the RTA case which occured on 23/01/2018')
canvas.save()
print("Forensic Report Generated")
Perhaps you want to use the drawText?
Doing so, your code will be
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
canvas = canvas.Canvas("Forensic Report.pdf", pagesize=letter)
canvas.setLineWidth(.3)
canvas.setFont('Helvetica', 12)
canvas.drawString(30,750,'LYIT MOBILE FORENSICS DIVISION')
canvas.drawString(500,750,"Date: 12/02/2018")
canvas.line(500,747,595,747)
canvas.drawString(500,725,'Case Number:')
canvas.drawString(580,725,"10")
canvas.line(500,723,595,723)
line1 = 'This forensic report has been compiled by the forensic'
line2 = 'examiner in conclusion to the investigation into the RTA'
line3 = 'case which occured on 23/01/2018'
textobject = canvas.beginText(30, 700)
lines = [line1, line2, line3]
for line in lines:
textobject.textLine(line)
canvas.drawText(textobject)
canvas.save()
This is also the solution suggested here. Unfortunately, I do not see it as a valid solution for automatic text wrapping in new lines, i.e. you should manage how to split the string yourself.
You should wrap the string itself using a function and draw what's returned. To get a properly wrapped string you could specify a line length like so:
def wrap(string, length):
if len(string) < length:
return string
return string[:length:] + '\n' + wrap(string[length::], length)
The wrap function first checks if the string length is less than the specified length and immediately returns it if so. If it's longer then it appends a newline character to the end of a substring up to length, plus sends the remainder of the string to the wrap() function to check the rest.
Running:
string = "This is a really super duper long string from some forensic report and should take up a lot of space..."
length = 20
print(wrap(string, length))
Will print out:
This is a really sup
er duper long string
from some forensic
report and should ta
ke up a lot of space
...
Because you PROBABLY don't want every single line to be truncated at the line width, we can fix this by adding another recursive function to check for the most recent whitespace character like so:
def seek(string, index):
if string[index-1] == ' ':
return index
return seek(string, index-1)
The seek() will return the index of the last whitespace character of any string (substring in this case).
Note: seek() HAS to check the previous character string[index-1] otherwise will give you the index of the space character and wrap() will append it to each new line.
You can then modify the wrap() function like so:
def wrap(string, length):
if len(string) < length:
return string
pos = seek(string, length)
return string[:pos:] + '\n' + wrap(string[pos::], length)
Running:
string = "This is a really super duper long string from some forensic report and should take up a lot of space..."
length = 20
print(wrap(string, length))
Prints out:
This is a really
super duper long
string from some
forensic report and
should take up a
lot of space...

Python - How to make sure that a line being read from a file contain only a given string and nothing else

In order to make sure I start and stop reading a text file exactly where I want to, I am providing 'start1'<->'end1', 'start2'<->'end2' as tags in between the text file and providing that to my python script. In my script I read it as:
start_end = ['start1','end1']
line_num = []
with open(file_path) as fp1:
for num, line in enumerate(fp1, 1):
for i in start_end:
if i in line:
line_num.append(num)
fp1.close()
print '\nLine number: ', line_num
fp2 = open(file_path)
for k, line2 in enumerate(fp2):
for x in range(line_num[0], line_num[1] - 1):
if k == x:
header.append(line2)
fp2.close()
This works well until I reach start10 <-> end10 and further. Eg. it checks if I have "start2" in the line and also reads the text that has "start21" and similarly for end tag as well. so providing "start1, end1" as input also reads "start10, end10". If I replace the line:
if i in line:
with
if i == line:
it throws an error.
How can I make sure that the script reads the line that contains ONLY "start1" and not "start10"?
import re
prog = re.compile('start1$')
if prog.match(line):
print line
That should return None if there is no match and return a regex match object if the line matches the compiled regex. The '$' at the end of the regex says that's the end of the line, so 'start1' works but 'start10' doesn't.
or another way..
def test(line):
import re
prog = re.compile('start1$')
return prog.match(line) != None
> test('start1')
True
> test('start10')
False
Since your markers are always at the end of the line, change:
start_end = ['start1','end1']
to:
start_end = ['start1\n','end1\n']
You probably want to look into regular expressions. The Python re library has some good regex tools. It would let you define a string to compare your line to and it has the ability to check for start and end of lines.
If you can control the input file, consider adding an underscore (or any non-number character) to the end of each tag.
'start1_'<->'end1_'
'start10_'<->'end10_'
The regular expression solution presented in other answers is more elegant, but requires using regular expressions.
You can do this with find():
for num, line in enumerate(fp1, 1):
for i in start_end:
if i in line:
# make sure the next char isn't '0'
if line[line.find(i)+len(i)] != '0':
line_num.append(num)

Returning every instance of whatever's between two strings in a file [Python 3]

What I'm trying to do is open a file, then find every instance of '[\x06I"' and '\x06;', then return whatever is between the two.
Since this is not a standard text file (it's map data from RPG maker) readline() will not work for my purposes, as the file is not at all formatted in such a way that the data I want is always neatly within one line by itself.
What I'm doing right now is loading the file into a list with read(), then simply deleting characters from the very beginning until I hit the string '[\x06I'. Then I scan ahead to find '\x06;', store what's between them as a string, append said string to a list, then resume at the character after the semicolon I found.
It works, and I ended up with pretty much exactly what I wanted, but I feel like that's the worst possible way to go about it. Is there a more efficient way?
My relevant code:
while eofget == 0:
savor = 0
while savor == 0 or eofget == 0:
if line[0:4] == '[\x06I"':
x = 4
spork = 0
while spork == 0:
x += 1
if line[x] == '\x06':
if line[x+1] == ';':
spork = x
savor = line[5:spork] + "\n"
line = line[x+1:]
linefinal[lineinc] = savor
lineinc += 1
elif line[x:x+7] == '#widthi':
print("eof reached")
spork = 1
eofget = 1
savor = 0
elif line[x:x+7] == '#widthi':
print("finished map " + mapname)
eofget = 1
savor = 0
break
else:
line = line[1:]
You can just ignore the variable names. I just name things the first thing that comes to mind when I'm doing one-offs like this. And yes, I am aware a few things in there don't make any sense, but I'm saving cleanup for when I finalize the code.
When eofget gets flipped on this subroutine terminates and the next map is loaded. Then it repeats. The '#widthi' check is basically there to save time, since it's present in every map and indicates the beginning of the map data, AKA data I don't care about.
I feel this is a natural case to use regular expressions. Using the findall method:
>>> s = 'testing[\x06I"text in between 1\x06;filler text[\x06I"text in between 2\x06;more filler[\x06I"text in between \n with some line breaks \n included in the text\x06;ending'
>>> import re
>>> p = re.compile('\[\x06I"(.+?)\x06;', re.DOTALL)
>>> print(p.findall(s))
['text in between 1', 'text in between 2', 'text in between \n with some line breaks \n included in the text']
The regex string '\[\x06I"(.+?)\x06;'can be interpreted as follows:
Match as little as possible (denoted by ?) of an undetermined number of unspecified characters (denoted by .+) surrounded by '[\x06I"' and '\x06;', and only return the enclosed text (denoted by the parentheses around .+?)
Adding re.DOTALL in the compile makes the .? match line breaks as well, allowing multi-line text to be captured.
I would use split():
fulltext = 'adsfasgaseg[\x06I"thisiswhatyouneed\x06;sdfaesgaegegaadsf[\x06I"this is the second what you need \x06;asdfeagaeef'
parts = fulltext.split('[\x06I"') # split by first label
results = []
for part in parts:
if '\x06;' in part: # if second label exists in part
results.append(part.split('\x06;')[0]) # get the part until the second label
print results

Python: Read large file in chunks

Hey there, I have a rather large file that I want to process using Python and I'm kind of stuck as to how to do it.
The format of my file is like this:
0 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I basically want to read in the chunk up from 0-1, do my processing on it, then move on to the chunk between 1 and 2.
So far I've tried using a regex to match the number and then keep iterating, but I'm sure there has to be a better way of going about this. Any suggestion/info would be greatly appreciated.
If they are all within the same line, that is there are no line breaks between "1." and "2." then you can iterate over the lines of the file like this:
for line in open("myfile.txt"):
#do stuff
The line will be disposed of and overwritten at each iteration meaning you can handle large file sizes with ease. If they're not on the same line:
for line in open("myfile.txt"):
if #regex to match start of new string
parsed_line = line
else:
parsed_line += line
and the rest of your code.
Why don't you just read the file char by char using file.read(1)?
Then, you could - in each iteration - check whether you arrived at the char 1. Then you have to make sure that storing the string is fast.
If the "N " can only start a line, then why not use use the "simple" solution? (It sounds like this already being done, I am trying to reinforce/support it ;-))
That is, just reading a line at a time, and build up the data representing the current N object. After say N=0, and N=1 are loaded, process them together, then move onto the next pair (N=2, N=3). The only thing that is even remotely tricky is making sure not to throw out a read line. (The line read that determined the end condition -- e.g. "N " -- also contain the data for the next N).
Unless seeking is required (or IO caching is disabled or there is an absurd amount of data per item), there is really no reason not to use readline AFAIK.
Happy coding.
Here is some off-the-cuff code, which likely contains multiple errors. In any case, it shows the general idea using a minimized side-effect approach.
# given an input and previous item data, return either
# [item_number, data, next_overflow] if another item is read
# or None if there are no more items
def read_item (inp, overflow):
data = overflow or ""
# this can be replaced with any method to "read the header"
# the regex is just "the easiest". the contract is just:
# given "N ....", return N. given anything else, return None
def get_num(d):
m = re.match(r"(\d+) ", d)
return int(m.groups(1)) if m else None
for line in inp:
if data and get_num(line) ne None:
# already in an item (have data); current line "overflows".
# item number is still at start of current data
return [get_num(data), data, line]
# not in item, or new item not found yet
data += line
# and end of input, with data. only returns above
# if a "new" item was encountered; this covers case of
# no more items (or no items at all)
if data:
return [get_num(data), data, None]
else
return None
And usage might be akin to the following, where f represents an open file:
# check for error conditions (e.g. None returned)
# note feed-through of "overflow"
num1, data1, overflow = read_item(f, None)
num2, data2, overflow = read_item(f, overflow)
If the format is fixed, why not just read 3 lines at a time with readline()
If the file is small, you could read the whole file in and split() on number digits (might want to use strip() to get rid of whitespace and newlines), then fold over the list to process each string in the list. You'll probably have to check that the resultant string you are processing on is not initially empty in case two digits were next to each other.
If the file's content can be loaded in memory, and that's what you answered, then the following code (needs to have filename defined) may be a solution.
import re
regx = re.compile('^((\d+).*?)(?=^\d|\Z)',re.DOTALL|re.MULTILINE)
with open(filename) as f:
text = f.read()
def treat(inp,regx=regx):
m1 = regx.search(inp)
numb,chunk = m1.group(2,1)
li = [chunk]
for mat in regx.finditer(inp,m1.end()):
n,ch = mat.group(2,1)
if int(n) == int(numb) + 1:
yield ''.join(li)
numb = n
li = []
li.append(ch)
chunk = ch
yield ''.join(li)
for y in treat(text):
print repr(y)
This code, run on a file containing :
1 mountain
orange 2
apple
produce
2 gas
solemn
enlightment
protectorate
3 grimace
song
4 snow
wheat
51 guludururu
kelemekinonoto
52asabi dabada
5 yellow
6 pink
music
air
7 guitar
blank 8
8 Canada
9 Rimini
produces:
'1 mountain\norange 2\napple\nproduce\n'
'2 gas\nsolemn\nenlightment\nprotectorate\n'
'3 grimace\nsong\n'
'4 snow\nwheat\n51 guludururu\nkelemekinonoto\n52asabi dabada\n'
'5 yellow\n'
'6 pink \nmusic\nair\n'
'7 guitar\nblank 8\n'
'8 Canada\n'
'9 Rimini'

str.startswith() not working as I intended

I'm trying to test for a /t or a space character and I can't understand why this bit of code won't work. What I am doing is reading in a file, counting the loc for the file, and then recording the names of each function present within the file along with their individual lines of code. The bit of code below is where I attempt to count the loc for the functions.
import re
...
else:
loc += 1
for line in infile:
line_t = line.lstrip()
if len(line_t) > 0 \
and not line_t.startswith('#') \
and not line_t.startswith('"""'):
if not line.startswith('\s'):
print ('line = ' + repr(line))
loc += 1
return (loc, name)
else:
loc += 1
elif line_t.startswith('"""'):
while True:
if line_t.rstrip().endswith('"""'):
break
line_t = infile.readline().rstrip()
return(loc,name)
Output:
Enter the file name: test.txt
line = '\tloc = 0\n'
There were 19 lines of code in "test.txt"
Function names:
count_loc -- 2 lines of code
As you can see, my test print for the line shows a /t, but the if statement explicitly says (or so I thought) that it should only execute with no whitespace characters present.
Here is my full test file I have been using:
def count_loc(infile):
""" Receives a file and then returns the amount
of actual lines of code by not counting commented
or blank lines """
loc = 0
for line in infile:
line = line.strip()
if len(line) > 0 \
and not line.startswith('//') \
and not line.startswith('/*'):
loc += 1
func_loc, func_name = checkForFunction(line);
elif line.startswith('/*'):
while True:
if line.endswith('*/'):
break
line = infile.readline().rstrip()
return loc
if __name__ == "__main__":
print ("Hi")
Function LOC = 15
File LOC = 19
\s is only whitespace to the re package when doing pattern matching.
For startswith, an ordinary method of ordinary strings, \s is nothing special. Not a pattern, just characters.
Your question has already been answered and this is slightly off-topic, but...
If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTRL is a well-known one with Python bindings.
As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:
import tokenize
ignored_tokens = [tokenize.NEWLINE,tokenize.COMMENT,tokenize.N_TOKENS
,tokenize.STRING,tokenize.ENDMARKER,tokenize.INDENT
,tokenize.DEDENT,tokenize.NL]
with open('test.py', 'r') as f:
g = tokenize.generate_tokens(f.readline)
line_num = 0
for a_token in g:
if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
line_num = a_token[2][0]
print(a_token)
As a_token above is already parsed, you can easily check for function definition, too. You can also keep track where the function ends by looking at the current column start a_token[2][1]. If you want to do more complex things, you should use ast.
You string literals aren't what you think they are.
You can specify a space or TAB like so:
space = ' '
tab = '\t'

Categories