Python not splitting CRLF correctly

I'm writing a script in Python to convert very simple function documentation to XML. The format I'm using would convert:
date_time_of(date) Returns the time part of the indicated date-time value, setting the date part to 0.
to:
<item name="date_time_of">
<arg>(date)</arg>
<help> Returns the time part of the indicated date-time value, setting the date part to 0.</help>
</item>
So far it works well (the XML above was generated by the program), but it should handle several lines of documentation pasted at once, and it only processes the first line pasted into the application. I checked the pasted documentation in Notepad++ and the lines did indeed end with CRLF, so what is the problem?
Here is my code:
mainText = input("Enter your text to convert:\r\n")
try:
    for line in mainText.split('\r\n'):
        name = line.split("(")[0]
        arg = line.split("(")[1]
        arg = arg.split(")")[0]
        hlp = line.split(")",1)[1]
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
    print("Error!")
Any idea of what the issue is here?
Thanks.

input() only reads one line.
Try this. Enter a blank line to stop collecting lines.
lines = []
while True:
    line = input('line: ')
    if line:
        lines.append(line)
    else:
        break
print(lines)

The best way to handle reading lines from standard input (the console) is to iterate over the sys.stdin object. Rewritten to do this, your code would look something like this:
from sys import stdin
try:
    for line in stdin:
        name = line.split("(")[0]
        arg = line.split("(")[1]
        arg = arg.split(")")[0]
        hlp = line.split(")",1)[1]
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
    print("Error!")
That said, it's worth noting that your parsing code could be significantly simplified with a little help from regular expressions. Here's an example:
import re, sys
for line in sys.stdin:
    # group 1: name, group 2: text inside the parentheses, group 3: help text
    result = re.match(r"(.*?)\((.*?)\)(.*)", line)
    if result:
        name = result.group(1)
        # keep the raw argument string; splitting on "," would print a list repr
        arg = result.group(2)
        hlp = result.group(3)
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
    else:
        print("There was an error parsing this line: '%s'" % line)
I hope this helps you simplify your code.
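As a rough sketch of how the same regular expression could be wrapped in a reusable function (Python 3; the names LINE_RE and doc_line_to_xml are made up for illustration, and no XML escaping is attempted):

```python
import re

# Hypothetical helper names; the pattern is the one used in the answer above.
LINE_RE = re.compile(r"(.*?)\((.*?)\)(.*)")

def doc_line_to_xml(line):
    """Return the <item> XML for one documentation line, or None if it does not parse."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    name, arg, hlp = match.groups()
    return '<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>' % (name, arg, hlp)

print(doc_line_to_xml("divmod(a, b) Returns quotient and remainder."))
print(doc_line_to_xml("a line without parentheses"))
```

Returning None instead of printing an error lets the caller decide what to do with lines that do not match.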

Patrick Moriarty,
It seems to me that you didn't specifically mention the console, and that your main concern is passing several lines at once to be processed. I could reproduce your problem in only one way: running the program in IDLE, manually copying several lines from a file, and pasting them into raw_input().
Trying to understand your problem led me to the following facts:
When data is copied from a file and pasted into raw_input(), the \r\n newlines are transformed into \n, so the string returned by raw_input() no longer contains \r\n. Hence split('\r\n') cannot split this string.
When data containing isolated \r and \n characters is pasted into a Notepad++ window with the display of special characters enabled, CR LF symbols appear at the end of every line, even where there is only a lone \r or \n. Hence, using Notepad++ to check the kind of newlines leads to an erroneous conclusion.
The first fact is the cause of your problem. I don't know the underlying reason for this transformation of data copied from a file and passed to raw_input(), which is why I posted a question on Stack Overflow:
Strange vanishing of CR in strings coming from a copy of a file's content passed to raw_input()
The second fact is responsible for your confusion. Bad luck...
So, what can you do to solve your problem?
Here's code that reproduces the problem. Note the modified algorithm in it, which replaces the repeated splits applied to each line.
ch = "date_time_of(date) Returns the time part.\r\n"+\
     "divmod(a, b) Returns quotient and remainder.\r\n"+\
     "enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
     "A\rB\nC"
with open('funcdoc.txt','wb') as f:
    f.write(ch)
print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)
print "open 'funcdoc.txt' to manually copy its content, and paste it on the following line"
mainText = raw_input("Enter your text to convert:\n")
print "OK, copy-paste of file 'funcdoc.txt''s content has been performed"
print "\nrepr(mainText)==",repr(mainText)
try:
    for line in mainText.split('\r\n'):
        name,_,arghelp = line.partition("(")
        arg,_,hlp = arghelp.partition(") ")
        print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
    print("Error!")
Here's the solution mentioned by delnan: «read from the source instead of having a human copy and paste it.»
It works with your split('\r\n'):
ch = "date_time_of(date) Returns the time part.\r\n"+\
     "divmod(a, b) Returns quotient and remainder.\r\n"+\
     "enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
     "A\rB\nC"
with open('funcdoc.txt','wb') as f:
    f.write(ch)
print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)
#####################################
with open('funcdoc.txt','rb') as f:
    mainText = f.read()
print "\nfile 'funcdoc.txt' has just been opened and its content copied and put to mainText"
print "\nrepr(mainText)==",repr(mainText)
print
try:
    for line in mainText.split('\r\n'):
        name,_,arghelp = line.partition("(")
        arg,_,hlp = arghelp.partition(") ")
        print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
    print("Error!")
And finally, here's what Python offers for processing the altered copy: the splitlines() method, which treats every kind of newline (\r, \n, or \r\n) as a line separator. So replace
for line in mainText.split('\r\n'):
with
for line in mainText.splitlines():
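To make the difference concrete, here is a small sketch (Python 3, with a made-up sample string) comparing the two calls on text that mixes all three newline styles:

```python
# Sample text invented for illustration: one CRLF, one bare LF, one bare CR.
pasted = "first line\r\nsecond line\nthird line\rfourth line"

# split('\r\n') only breaks on the exact two-character sequence.
by_crlf = pasted.split("\r\n")
# splitlines() breaks on \r\n, \n and \r alike.
by_any_newline = pasted.splitlines()

print(by_crlf)          # ['first line', 'second line\nthird line\rfourth line']
print(by_any_newline)   # ['first line', 'second line', 'third line', 'fourth line']
```

If the pasted text has already had its \r\n converted to \n, split('\r\n') returns a single element, which matches the symptom described in the question.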

Related

Python reading file adds extra characters

I have a python file with some passwords. The problem is when Python prints these passwords or saves them in a variable it adds a random "Â" to the mix and I'm trying to figure out why. I could just replace the "Â" with nothing but I would like to know why this is happening.
Some examples:
^oqi£"HS prints out ^oqiÂ£"HS
rS£g)5Q% prints out rSÂ£g)5Q%
Code:
with open('pass.txt') as f:
    first_line = f.readline().rstrip()
print(first_line)
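No answer is recorded here, but one likely explanation (an assumption, since the file's real encoding is not stated) is that pass.txt is saved as UTF-8 while open() decodes it with a single-byte default such as cp1252. In UTF-8 the pound sign is two bytes, and the first one decodes to "Â" in cp1252. A minimal Python 3 sketch of that effect, using a temporary file as a stand-in for pass.txt:

```python
import os
import tempfile

password = '^oqi\u00a3"HS'               # the first example, with the pound sign written as an escape
utf8_bytes = password.encode("utf-8")    # the pound sign becomes the two bytes C2 A3

# Decoding those bytes with the wrong codec yields one character per byte.
garbled = utf8_bytes.decode("cp1252")    # C2 -> "Â", A3 -> "£"

# Naming the encoding explicitly when opening the file avoids the problem.
path = os.path.join(tempfile.mkdtemp(), "pass.txt")
with open(path, "wb") as f:
    f.write(utf8_bytes + b"\n")
with open(path, encoding="utf-8") as f:
    first_line = f.readline().rstrip()

print(len(password), len(garbled), first_line == password)
```

If this is indeed the cause, passing encoding="utf-8" to open() in the original code should be enough.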

How to replace text in a PDF using Python?

I have taken the code from another thread here that uses the library PyPDF2 to parse and replace the text of a PDF. The given example PDF in the thread is parsed as a PyPDF2.generic.DecodedStreamObject. I am currently working with a PDF that the company has provided me that was created using Microsoft Word's Export to PDF feature. This generates a PyPDF2.generic.EncodedStreamObject. From exploration, the main difference is that there is what appears to be kerning in some places in the text.
This caused two problems for me with the sample code. Firstly, the line if len(contents) > 0: in main seems to get erroneously triggered and attempts to use the key of the EncodedStreamObject dictionary instead of the EncodedStreamObject itself. To work around this, I commented out the if block and used the code in the else block for both cases.
The second problem was that the (what I assume are) kerning markings broke up the text I was trying to replace. I noticed that kerning was not in every line, so I made the assumption that the kerning markers were not strictly necessary, and tried to see what the output would look like with them removed. The text was structured something like so: [(Thi)4(s)-1(is t)2(ext)]. I replaced the line in the sample code replaced_line = line in replace_text with replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ". This preserved the observed structure while allowing the text to be searched for replacements. I verified this was actually replacing the text of the line.
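For readers following along, this is a small sketch of what that replacement expression does to the sample line quoted above (only the standard re module is used; the variable names are illustrative):

```python
import re

# The TJ array from the example: strings in parentheses, numbers between them.
line = "[(Thi)4(s)-1(is t)2(ext)] TJ"

# Collect the text of every parenthesised string, dropping the numbers in between.
segments = re.findall(r'\((.*?)\)', line)
merged = "[(" + "".join(segments) + ")] TJ"

print(segments)  # ['Thi', 's', 'is t', 'ext']
print(merged)    # [(Thisis text)] TJ
```

The numbers in a TJ array are position adjustments, so discarding them is lossy: spacing can change, and this simple pattern would also mis-handle strings that contain escaped parentheses.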
Neither of those changes prevented the code from executing, however the output PDF seems to be completely unchanged despite the code appearing to work using print statements to check if the replaced line has the new text. I initially assumed this was because of the if block in process_data that determined if it was Encoded or Decoded. However, I dug through the actual source code for this library located here, and it seems that if the object is Encoded, it generates a Decoded version of itself which the if block reflects. My only other idea is that the if block that I commented out in main wasn't erroneously catching my scenario, but was instead handling it incorrectly. I have no idea how I would fix it so that it handles it properly.
I feel like I'm incredibly close to solving this, but I'm at my wits end as to what to do from here. I would ask the poster of the linked solution in a comment, but I do not have enough reputation to comment on SO. Does anyone have any leads on how to solve this problem? I don't particularly care what library or file format is used, but it must retain the formatting of the Word document I have been provided. I have already tried exporting to HTML, but that removes most of the formatting and also the header. I have also tried converting the .docx to PDF in Python, but that requires me to actually have Word installed on the machine, which is not a cross-platform solution. I also explored using RTF, but from what I found the solution for that file type is to convert it to a .docx and then to PDF.
Here is the full code that I have so far:
import PyPDF2
import re

def replace_text(content, replacements=dict()):
    lines = content.splitlines()
    result = ""
    in_text = False
    for line in lines:
        if line == "BT":
            in_text = True
        elif line == "ET":
            in_text = False
        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ"
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue
        result += line + "\n"
    return result

def process_data(obj, replacements):
    data = obj.getData()
    decoded_data = data.decode('utf-8')
    replaced_data = replace_text(decoded_data, replacements)
    encoded_data = replaced_data.encode('utf-8')
    if obj.decodedSelf is not None:
        obj.decodedSelf.setData(encoded_data)
    else:
        obj.setData(encoded_data)

pdf = PyPDF2.PdfFileReader("template.pdf")
# pdf = PyPDF2.PdfFileReader("sample.pdf")
writer = PyPDF2.PdfFileWriter()
replacements = {
    "some text": "replacement text"
}
for page in pdf.pages:
    contents = page.getContents()
    # if len(contents) > 0:
    #     for obj in contents:
    #         streamObj = obj.getObject()
    #         process_data(streamObj, replacements)
    # else:
    process_data(contents, replacements)
    writer.addPage(page)
with open("output.pdf", 'wb') as out_file:
    writer.write(out_file)
EDIT:
I've somewhat tracked down the source of my problems. The line obj.decodedSelf.setData(encoded_data) seems to not actually set the data properly. After that line, I added
print(encoded_data[:2000])
print("----------------------")
print(obj.getData()[:2000])
The output of the first print statement differed from that of the second, which definitely should not be the case. To really test whether this was true, I replaced every single line with [()], which I know to be valid as there are many lines that already look like that. For the life of me, though, I can't figure out why this function call fails to make any lasting changes.
EDIT 2:
I have further identified the problem. In the source code for an EncodedStreamObject, the getData method returns self.decodedSelf.getData() if self.decodedSelf is True. HOWEVER, after doing obj.decodedSelf.setData(encoded_data), if I do print(bool(obj.decodedSelf)), it prints False. This means that when the EncodedStreamObject is accessed to be written out to the PDF, it re-parses the old PDF and overrides the self.decodedSelf object! Short of going in and fixing the source code, I'm not sure how I would solve this problem.
EDIT 3:
I have managed to convince the library to use the decoded version that has the replacements! By inserting the line page[PyPDF2.pdf.NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), it forces the page to have the updated contents. Unfortunately, my previous assumption about the text kerning was incorrect. After I replaced things, some of my text mysteriously disappeared from the PDF. I assume this is because the format is incorrect somehow.
FINAL EDIT:
I figure I'd put this in here in case anyone else stumbles across this. I never did manage to get it to finally work as expected. I instead moved to a solution to mimic the PDF with HTML/CSS. If you add the following style tag in your HTML, you can get it to print more like how you'd expect a PDF to print:
<style type="text/css" media="print">
    @page {
        size: auto;
        margin: 0;
    }
</style>
I'd recommend this solution for anyone looking to do what I was doing. There are Python libraries to convert HTML to PDF, but they do not support HTML5 and CSS3 (notably they do not support CSS flex or grid). You can just print the HTML page to PDF from any browser to accomplish the same thing. It definitely doesn't answer the question, so I felt it best to leave it as an edit. If anyone manages to complete what I have attempted, please post an answer for me and any others.

Unable to Format Output Text File to Desired Form using "write" function Python

I am unable to format the output of my text file as I want it. I have fooled around with this for almost an hour, to no avail, and it's driving me mad. I want the first four floats to be on one line, and the next 10 values to be delimited by new lines.
if not (debug_flag>0):
    text_file = open("Markov.txt", "w")
    text_file.write("%.2f,%.2f,%.2f,%.2f" % (prob_not_to_not,prob_not_to_occured, prob_occured_to_not, prob_occured_to_occured))
    for x in xrange(0,10):
        text_file.write("\n%d" % markov_sampler(final_probability))
    text_file.close()
Does anyone know what the issue is? The output I'm getting is all on 1 line.

You have to put the line feed at the end of the first line for it to work.
Also, your text editor may be configured to expect \r\n as the end of line (if you are using Notepad, for example), in which case a file that contains only \n will appear to be all on one line.
Code that produces the desired output might look something like this:
if not (debug_flag>0):
    text_file = open("Markov.txt", "w")
    text_file.write("%.2f,%.2f,%.2f,%.2f\n" % (prob_not_to_not,prob_not_to_occured, prob_occured_to_not, prob_occured_to_occured))
    for x in xrange(0,10):
        text_file.write("%d\n" % markov_sampler(final_probability))
    text_file.close()
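A self-contained variant of the same idea, as a Python 3 sketch: write_report and the sample numbers are invented for illustration and stand in for the probabilities and markov_sampler() values from the question. Reading the file back is a more reliable check of the line breaks than looking at it in an editor.

```python
import os
import tempfile

def write_report(path, probabilities, samples):
    """Write the probabilities on one comma-separated line, then one sample per line."""
    with open(path, "w") as text_file:  # the with block closes the file automatically
        text_file.write(",".join("%.2f" % p for p in probabilities) + "\n")
        for sample in samples:
            text_file.write("%d\n" % sample)

# Placeholder values; the real ones would come from the Markov model.
path = os.path.join(tempfile.mkdtemp(), "Markov.txt")
write_report(path, [0.25, 0.75, 0.5, 0.5], [1, 0, 1])

with open(path) as f:
    lines = f.read().splitlines()
print(lines)  # ['0.25,0.75,0.50,0.50', '1', '0', '1']
```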

How Can I Remove Skipped Lines from Pastebin Output?

I am trying to use Pastebin to host two text files for me to allow any copy of my script to update itself through the internet. My code is working, but the resultant .py file has a blank line added between each line. Here is my script...
import os, inspect, urllib2
runningVersion = "1.00.0v"
versionUrl = "http://pastebin.com/raw.php?i=3JqJtUiX"
codeUrl = "http://pastebin.com/raw.php?i=GWqAQ0Xj"
scriptFilePath = (os.path.abspath(inspect.getfile(inspect.currentframe()))).replace("\\", "/")
def checkUpdate(silent=1):
    # silently attempt to update the script file by default, post messages if silent==0
    # never update if "No_Update.txt" exists in the same folder
    if os.path.exists(os.path.dirname(scriptFilePath)+"/No_Update.txt"):
        return
    try:
        versionData = urllib2.urlopen(versionUrl)
    except urllib2.URLError:
        if silent==0:
            print "Connection failed"
        return
    currentVersion = versionData.read()
    if runningVersion!=currentVersion:
        if silent==0:
            print "There has been an update.\nWould you like to download it?"
        try:
            codeData = urllib2.urlopen(codeUrl)
        except urllib2.URLError:
            if silent==0:
                print "Connection failed"
            return
        currentCode = codeData.read()
        with open(scriptFilePath.replace(".py","_UPDATED.py"), mode="w") as scriptFile:
            scriptFile.write(currentCode)
        if silent==0:
            print "Your program has been updated.\nChanges will take effect after you restart"
    elif silent==0:
        print "Your program is up to date"
checkUpdate()
I stripped the GUI (wxpython) and set the script to update another file instead of the actual running one. The "No_Update" bit is for convenience while working.
I noticed that opening the resultant file with Notepad does not show the skipped lines, opening with Wordpad gives a jumbled mess, and opening with Idle shows the skipped lines. Based on that, this seems to be a formatting problem even though the "raw" Pastebin file does not appear to have any formatting.
EDIT: I could just strip all blank lines or leave it as is without any problems, (that I've noticed) but that would greatly reduce readability.

Try adding the binary qualifier in your open():
with open(scriptFilePath.replace(".py","_UPDATED.py"), mode="wb") as scriptFile:
I notice that your file on Pastebin is in DOS format, so it has \r\n line endings. In text mode on Windows, scriptFile.write() translates each \n to \r\n, so every \r\n becomes \r\r\n, which is terribly confusing.
Specifying "b" in the open() call makes scriptFile skip that translation and write the file in DOS format, exactly as downloaded.
Alternatively, you could ensure that the Pastebin file contains only \n, and keep mode="w" in your script.
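The translation can be reproduced on any platform with a short Python 3 sketch. Here newline="\r\n" is passed to open() to imitate what Windows text mode does by default; the file name and sample text are made up.

```python
import os
import tempfile

dos_text = "line one\r\nline two\r\n"  # stand-in for the downloaded DOS-format script
path = os.path.join(tempfile.mkdtemp(), "demo_UPDATED.py")

# Text mode with "\n" -> "\r\n" translation, as on Windows.
with open(path, "w", newline="\r\n") as f:
    f.write(dos_text)
with open(path, "rb") as f:
    translated = f.read()

# Binary mode writes the bytes exactly as given.
with open(path, "wb") as f:
    f.write(dos_text.encode("ascii"))
with open(path, "rb") as f:
    untouched = f.read()

print(translated)  # b'line one\r\r\nline two\r\r\n'
print(untouched)   # b'line one\r\nline two\r\n'
```

The stray \r before each \r\n is what some editors render as an extra blank line.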

How to read next logical line in python

I would like to read the next logical line from a file into Python, where logical means "according to the syntax of Python".
I have written a small command which reads a set of statements from a file, and then prints out what you would get if you typed the statements into a python shell, complete with prompts and return values. Simple enough -- read each line, then eval. Which works just fine, until you hit a multi-line string.
I'm trying to avoid doing my own lexical analysis.
As a simple example, say I have a file containing
2 + 2
I want to print
>>> 2 + 2
4
and if I have a file with
"""Hello
World"""
I want to print
>>> """Hello
... World"""
'Hello\nWorld'
The first of these is trivial -- read a line, eval, print. But then I need special support for comment lines. And now triple quotes. And so on.

You may want to take a look at the InteractiveInterpreter class from the code module.
The runsource() method shows how to deal with incomplete input.
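As a minimal sketch of that idea (Python 3; the function name transcript is made up), InteractiveConsole.push() from the same module returns True while the statement fed so far is still incomplete, which is enough to choose between the two prompts:

```python
import code

def transcript(lines):
    """Prefix each source line with the prompt an interactive session would show."""
    console = code.InteractiveConsole()
    out = []
    prompt = ">>> "
    for line in lines:
        out.append(prompt + line)
        # push() runs the statement once it is complete and
        # returns True while more input is still required.
        more = console.push(line)
        prompt = "... " if more else ">>> "
    return out

result = transcript(['x = """Hello', 'World"""', 'y = 2 + 2'])
for entry in result:
    print(entry)
```

Assignments are used in the sample so that nothing extra is echoed; with bare expressions the console would also print their values, as the question wants.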

Okay, so resi had the correct idea. Here is my trivial code which does the job.
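#!/usr/bin/python
import sys
import code

class Shell(code.InteractiveConsole):
    # write() is a method, so it needs self
    def write(self, data):
        print(data)

cons = Shell()
file_contents = sys.stdin
prompt = ">>> "
for line in file_contents:
    print prompt + line,
    # strip only the line ending so indentation is preserved;
    # push() returns True while the statement is incomplete
    if cons.push(line.rstrip("\r\n")):
        prompt = "... "
    else:
        prompt = ">>> "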
