Use of readline function with Python StringIO module - python

I am trying to read a file from an FTP site and process one line at a time. I write from the FTP server to a StringIO object and call the readline function, but this returns the entire file, rather that the first line. I downloaded the file to my pc and examined it with a hex editor, and the file uses x0d0a for a newline character, or a carriage return with a line feed. Could somebody point out to me where I might be going wrong here?
Thanks in advance!
#!/usr/bin/python
import ftplib
import StringIO
settles = StringIO.StringIO()
ftp = ftplib.FTP('ftp.cmegroup.com')
ftp.login()
ftp.cwd('pub/settle/')
ftp.retrlines('RETR cbt.settle.s.txt', settles.write)
settles.seek(0)
print settles.readline()

According to the FTP.retrlines documentation:
... The callback function is called for each line with a string argument containing the line with the trailing CRLF stripped. ....
Replace retrlines with retrbinary.
Alternatively, you can ..retrlines .. lines as follow (appending newlines):
ftp.retrlines('RETR cbt.settle.s.txt', lambda line: settles.write(line + '\n'))

Related

Reading a dynamically updated log file via readline

I'm trying to read a log file, written line by line, via readline.
I'm surprised to observe the following behaviour (code executed in the interpreter, but same happens when variations are executed from a file):
f = open('myfile.log')
line = readline()
while line:
print(line)
line = f.readline()
# --> This displays all lines the file contains so far, as expected
# At this point, I open the log file with a text editor (Vim),
# add a line, save and close the editor.
line = f.readline()
print(line)
# --> I'm expecting to see the new line, but this does not print anything!
Is this behaviour standard? Am I missing something?
Note: I know there are better way to deal with an updated file for instance with generators as pointed here: Reading from a frequently updated file. I'm just interested in understanding the issue with this precise use case.
For your specific use case, the explanation is that Vim uses a write-to-temp strategy. This means that all writing operations are performed on a temporary file.
On the contrary, your scripts reads from the original file, so it does not see any change on it.
To further test, instead of Vim, you can try to directly write on the file using:
echo "Hello World" >> myfile.log
You should see the new line from python.
for following your file, you can use this code:
f = open('myfile.log')
while True:
line = readline()
if not line:
print(line)

How am I suppposed to handle the BOM while text processing using sys.stdin in Python 3?

Note: The possible duplicate concerns an older version of Python and this question has already generated unique answers.
I have been working on a script to process Project Gutenberg Texts texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. My regex will always fail on the first Chapter marker at the first line if it includes the ^ caret to require the regex match to be at the beginning of the line because the BOM is consumed as the first character. (Example regex: ^Chapter).
What I've discovered is that if I do not include the caret, it won't fail on the first line, and then <feff> is included in the heading after I've processed it. An example:
<h1><feff>Chapter I</h1>
The advice according to this SO question (from which I learned of the BOM) is to fix your script to not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec but discuss errors I never encounter and do not discuss the syntax for opening a file with the template decoder.
To be clear:
I generally use pipelines of the following format:
cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>
And I am opening the file with the following syntax:
import sys
fin = sys.stdin
if '-i' in sys.argv: # For command line option "-i <infile>"
fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')
for line in fin:
...Processing here...
My question is what is the proper way to handle this? Do I remove the BOM before processing the text? If so, how? Or do I use a decoder on the file before processing it (I am reading from stdin, so how would I accomplish this?)
The files are stored in UTF-8 encoding with DOS endings (\r\n). I convert them in vim to UNIX file format before processing using set ff=unix (I have to do several manual pre-processing tasks before running the script).
As a complement to the existing answer, it is possible to filter the UTF8 BOM from stdin with the codecs module. Simply you must use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader
import sys
import codecs
# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')
if '-i' in sys.argv: # For command line option "-i <infile>"
fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
encoding='utf_8_sig', errors='replace')
for line in fin:
...Processing here...
In Python 3, stdin should be auto-decoded properly, but if it's not working for you (and for Python 2) you need to specify PythonIOEncoding before invoking your script like
PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>
Notice that this setting also makes stdout working with UTF-8-SIG, so your <outfile> will maintain the original encoding.
For your -i parameter, just do open(path, 'rt', encoding="UTF-8-SIG")
You really don't need to import codecs or anything to deal with this. As lenz suggested in comments just check for the BOM and throw it out.
for line in input:
if line[0] == "\ufeff":
line = line[1:] # trim the BOM away
# the rest of your code goes here as usual
In Python 3.9 default encoding for standard input seems to be utf-8, at least on Linux:
In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
sys.stdin has the method reconfigure():
sys.stdin.reconfigure("utf-8-sig")
which should be called before any attempt of reading the standard input. This will decode the BOM, which will no longer appear when reading sys.stdin.

python fileinput line endings

I've got some python code which is getting line endings all wrong:
command = 'svn cat -r {} "{}{}"'.format(svn_revision, svn_repo, svn_filename)
content = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE).stdout.read()
written = False
for line in fileinput.input(out_filename, inplace=1):
if line.startswith("INPUT_TAG") and not written:
print content
written = True
print line,
This fetches a copy of the file called svn_filename, and inserts the content into another file called out_filename at the "INPUT_TAG" location in the file.
The problem is the line endings in out_filename.
They're meant to be \r\n but the block I insert is \r\r\n.
Changing the print statement to:
print content, # just removes the newlines after the content block
or
print content.replace('\r\r','\r') # no change
has no effect. The extra carriage returns are inserted after the content leaves my code. It seems like something is deciding that because I'm on windows it should convert all \n to \r\n.
How can I get around this?
I can "solve" this problem by doing the following:
content = content.replace('\r\n', '\n')
converting the newlines to unix style so when the internal magic converts it again it ends up correct.
This can't be the right/best/pythonic way though....
CRLF = Carriage Return Line Feed.
Python on Windows makes a distinction between text and binary files;
the end-of-line characters in text files are automatically altered
slightly when data is read or written.
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
Can you output as binary file and not as a text file?
If you prefix the string with r to open the file as raw, does this prevent extra \r in the output?

Find a line of text with a pattern using Windows command line or Python

I need to run a command line tool that verifies a file and displays a bunch of information about it. I can export this information to a txt file but it includes a lot of extra data. I just need one line for the file:
"The signature is timestamped: Thu May 24 17:13:16 2012"
The time could be different, but I just need to extract this data and put it into a file. Is there a good way to do this from the command line itself or maybe python? I plan on using Python to locate and download the file to be verified, then run the command line tool to verify it so it can get the data then send that data in an email.
This is on a windows PC.
Thanks for your help
You don't need to use Python to do this. If you're using a Unix environment, you can use fgrep right from the command-line and redirect the output to another file.
fgrep "The signature is timestamped: " input.txt > output.txt
On Windows you can use:
find "The signature is timestamped: " < input.txt > output.txt
You mention the command line utility "displays" some information, so it may well be printing to stdout, so one way is to run the utility within Python, and capture the output.
import subprocess
# Try with some basic commands here maybe...
file_info = subprocess.check_output(['your_command_name', 'input_file'])
for line in file_info.splitlines():
# print line here to see what you get
if file_info.startswith('The signature is timestamped: '):
print line # do something here
This should fit in nicely with the "use python to download and locate" - so that can use urllib.urlretrieve to download (possibly with a temporary name), then run the command line util on the temp file to get the details, then the smtplib to send emails...
In python you can do something like this:
timestamp = ''
with open('./filename', 'r') as f:
timestamp = [line for line in f.readlines() if 'The signature is timestamped: ' in line]
I haven't tested this but I think it'd work. Not sure if there's a better solution.
I'm not too sure about the exact syntax of this exported file you have, but python's readlines() function might be helpful for this.
h=open(pathname,'r') #opens the file for reading
for line in h.readlines():
print line#this will print out the contents of each line of the text file
If the text file has the same format every time, the rest is easy; if it is not, you could do something like
for line in h.readlines():
if line.split()[3] == 'timestamped':
print line
output_string=line
as for writing to a file, you'll want to open the file for writing, h=open(name, "w"), then use h.write(output_string) to write it to a text file

Python 3: receive user input including newline characters

I'm trying to read in the following text from the command-line in Python 3 (copied verbatim, newlines and all):
lcbeika
rraobmlo
grmfina
ontccep
emrlin
tseiboo
edosrgd
mkoeys
eissaml
knaiefr
Using input, I can only read in the first word as once it reads the first newline it stops reading.
Is there a way I could read in them all without iteratively calling input?
You can import sys and use the methods on sys.stdin for example:
text = sys.stdin.read()
or:
lines = sys.stdin.readlines()
or:
for line in sys.stdin:
# Do something with line.
if you are passing the text into your script as a file , you can use readlines()
eg
data=open("file").readlines()
or you can use fileinput
import fileinput
for line in fileinput.input():
print line

Categories