I want to parse logfiles from Rackspace. I'm using the official Python SDK.
Previously I saved the file to disk and then read it from there with gzip.open.
Now I'm on Heroku and can't (and don't want to) save the file to disk; I want to do the unzipping in memory instead.
However, I can't manage to download the object as a string or pseudo file object to handle it.
Does anyone have an idea?
logString = ''
buffer = logfile.stream()
while True:
    try:
        logString += buffer.next()
    except StopIteration:
        break

# logString is always empty here

# I'd like to have something that enables me to do this:
for line in zlib.decompress(logString):
    # having each line of the log here
Update
I've noticed that "empty string" is not entirely accurate. This runs inside a loop, and only the first iteration yields an empty string. On subsequent iterations I do get data (that looks like it's gzipped), but I get this zlib error:
zlib.error: Error -3 while decompressing data: incorrect header check
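For reference (my note, not from the original post): a plain zlib.decompress call expects a zlib header, which is why gzipped data raises "incorrect header check". Passing a wbits value of MAX_WBITS | 16 tells zlib to expect a gzip header instead (MAX_WBITS | 32 auto-detects either format). A minimal sketch:

```python
import gzip
import io
import zlib

# Stand-in for the downloaded log object: gzip-compressed bytes in memory.
raw = b"line one\nline two\n"
gzipped = gzip.compress(raw)

# zlib.decompress(gzipped) would raise "incorrect header check" because it
# expects a zlib header. wbits=MAX_WBITS | 16 selects the gzip header instead
# (MAX_WBITS | 32 auto-detects either format).
decompressed = zlib.decompress(gzipped, zlib.MAX_WBITS | 16)

# Iterating over the lines afterwards:
for line in io.BytesIO(decompressed):
    print(line)
```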
Update II
As suggested, I implemented cStringIO, with the same result:
buffer = logfile.stream()
output = cStringIO.StringIO()
while True:
    try:
        output.write(buffer.next())
    except StopIteration:
        break
print(output.getvalue())
Update III
This does work now:
output = cStringIO.StringIO()
for buffer in logfile.stream():
    output.write(buffer)
# the for loop consumes StopIteration itself, so no try/except is needed
At least nothing crashes here, but it seems I don't get actual lines:

for line in gzip.GzipFile(fileobj=output).readlines():
    # this is never reached
How to proceed here? Is there some easy way to see the incoming data as normal string to know if I'm on the right way?
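One likely cause (my observation on the snippet above, not something from the original post): after writing into the buffer, its file position sits at the end, so GzipFile reads nothing until you rewind it. A Python 3 sketch with io.BytesIO standing in for cStringIO:

```python
import gzip
import io

raw = b"first line\nsecond line\n"

output = io.BytesIO()
output.write(gzip.compress(raw))

# Without this seek, GzipFile would start reading at the end of the
# buffer and readlines() would return nothing.
output.seek(0)

lines = gzip.GzipFile(fileobj=output).readlines()
print(lines)
```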
I found out that read() is also an option, which led to an easy solution like this:

io = cStringIO.StringIO(logfile.read())
for line in GzipFile(fileobj=io).readlines():
    impression = LogParser._parseLine(line)
    if impression is not None:
        impressions.append(impression)
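For the record, a Python 3 sketch of the same approach (my translation: logfile.read() is replaced by in-memory gzip bytes, and the LogParser calls are omitted since they belong to the asker's codebase):

```python
import gzip
import io

# Stand-in for logfile.read(): the gzipped log fetched entirely into memory.
payload = gzip.compress(b"line 1\nline 2\n")

buf = io.BytesIO(payload)
lines = [line.decode().rstrip() for line in gzip.GzipFile(fileobj=buf)]
print(lines)
```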
Related
I use Python to connect multiple processing tools for NLP tasks together, and I also capture the output of each in case something fails, writing it to a log.
Some tools need many hours and output their current status as a progress percentage with carriage returns (\r). They perform many steps, so they mix normal messages and progress messages. The result is sometimes really large log files that are hard to view with less.
My log will look like this (for fast progresses):
[DEBUG ] [FILE] [OUT] ^M4% done^M8% done^M12% done^M15% done^M19% done^M23% done^M27% done^M31% done^M35% done^M38% done^M42% done^M46% done^M50% done^M54% done^M58% done^M62% done^M65% done^M69% done^M73% done^M77% done^M81% done^M85% done^M88% done^M92% done^M96% done^M100% doneFinished
What I want is an easy way to collapse those strings in Python. (I guess it is also possible to do this after the pipeline is finished, replacing the progress messages with e.g. sed ...)
My code for running and capturing the output looks like this:
import os
import subprocess
from tempfile import NamedTemporaryFile

def run_command_of(command):
    try:
        out_file = NamedTemporaryFile(mode='w+b', delete=False, suffix='out')
        err_file = NamedTemporaryFile(mode='w+b', delete=False, suffix='err')
        debug('Redirecting command output to temp files ...', \
              'out =', out_file.name, ', err =', err_file.name)
        p = subprocess.Popen(command, shell=True, \
                             stdout=out_file, stderr=err_file)
        p.communicate()
        status = p.returncode

        def fr_gen(file):
            debug('Reading from %s ...' % file.name)
            file.seek(0)
            for line in file:
                # TODO: UnicodeDecodeError?
                # reload(sys)
                # sys.setdefaultencoding('utf-8')
                # unicode(line, 'utf-8')
                # no decoding ...
                yield line.decode('utf-8', errors='replace').rstrip()
            debug('Closing temp file %s' % file.name)
            file.close()
            os.unlink(file.name)

        return (fr_gen(out_file), fr_gen(err_file), status)
    except:
        from sys import exc_info
        error('Error while running command', command, exc_info()[0], exc_info()[1])
        return (None, None, 1)

def execute(command, check_retcode_null=False):
    debug('run command:', command)
    out, err, status = run_command_of(command)
    debug('-> exit status:', status)

    if out is not None:
        is_empty = True
        for line in out:
            is_empty = False
            debug('[FILE]', '[OUT]', line.encode('utf-8', errors='replace'))
        if is_empty:
            debug('execute: no output')
    else:
        debug('execute: no output?')

    if err is not None:
        is_empty = True
        for line in err:
            is_empty = False
            debug('[FILE]', '[ERR]', line.encode('utf-8', errors='replace'))
        if is_empty:
            debug('execute: no error-output')
    else:
        debug('execute: no error-output?')

    if check_retcode_null:
        return status == 0
    return True
It is some older Python 2 code (a funny time for unicode strings) that I want to rewrite for Python 3 and improve upon. (I'm also open to suggestions on how to process the output in real time rather than after everything is finished. Update: that is too broad and not exactly part of my problem.)
I can think of many approaches but do not know if there is a ready-to-use function/library/etc., and I could not find any. (My google-fu needs work.) The only things I found were ways to remove the CR/LF, not the string portion that gets visually replaced. So I'm open to suggestions and improvements before I invest my time in reinventing the wheel. ;-)
My approach would be to use a regex to find sections in a string/line between \r and remove them. Optionally I would keep a single percentage value for really long processes. Something like \r([^\r]*\r).
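To illustrate that idea (my sketch, not part of the original post): removing every run of characters that ends in a carriage return leaves only the text a terminal would actually display:

```python
import re

line = "4% done\r8% done\r12% done\r100% doneFinished"

# Each "...\r" chunk would have been visually overwritten, so drop it.
collapsed = re.sub(r"[^\r\n]*\r", "", line)
print(collapsed)  # -> 100% doneFinished
```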
Note: a possible duplicate of How to pull the output of the most recent terminal command? That may require a wrapper script, but it can still be used to convert my old log files with script2log. In the end I found/got a suggestion for a plain-Python way that fulfills my needs.
I think the solution for my use case is as simple as this snippet:
# my data
segments = ['abcdef', '567', '1234', 'xy', '\n']
s = '\r'.join(segments)

# really fast approach:
last = s.rstrip().split('\r')[-1]

# or: simulate the overwrites
parts = s.rstrip().split('\r')
last = parts[-1]
last_len = len(last)
for part in reversed(parts):
    if len(part) > last_len:
        last = last + part[last_len:]
        last_len = len(last)

# result
print(last)
Thanks to the comments on my question, I was able to refine my requirements further. In my case the only control characters are carriage returns (CR, \r), and a rather simple solution works, as tripleee suggested.
Why not simply the last part after \r? The output of
echo -e "abcd\r12"
can result in:
12cd
The questions under the subprocess tag (also suggested in a comment from tripleee) should help for realtime/interleaved output but are outside of my current focus. I will have to test the best approach. I was already using stdbuf for switching the buffering when needed.
So I can send strings just fine. What I want to do, though, is send (more or less) a string representation of a list. I know quite a few ways to convert a list into something that can be sent as a string and then converted back.
# on sending
l = [1, 2, 3, 4]
l_str = str(l)

# on receiving
l = ast.literal_eval(received_data)

## or pickle
l = pickle.dumps([1, 2, 3, 4])
## then
l = pickle.loads(received_data)
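Both round-trips do work in isolation, which already suggests the problem lies elsewhere; a quick check (my example):

```python
import ast
import pickle

l = [1, 2, 3, 4]

# str() on the way out, ast.literal_eval on the way back
assert ast.literal_eval(str(l)) == l

# pickle both ways
assert pickle.loads(pickle.dumps(l)) == l
```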
My issue, however, seems to be that something odd is happening between sending and receiving.
Right now I have this
msg = pickle.dumps([sys.stdin.readline(), person])
s.send(msg)
where sys.stdin.readline() is the line typed into the console and person is a variable containing someone's name.
I then receive it like so.
d1 = sock.recv(4096)
pickles = False
try:
    d1 = pickle.loads(d1)
    pickles = True
except Exception:
    pass  # (the post's snippet stops here; the except swallows the error)
It doesn't matter whether I make the list a string by my first method and then use ast.literal_eval, or use pickle: it never actually converts back to the list I want.
I currently have a try statement in place because I know there will be times when what I receive was not actually dumped with pickle or the like, so the idea is that it should fail in those cases and the except should just continue as if the received data were formatted correctly.
The error that is produced when I try to unpickle them for instance is
Traceback (most recent call last):
  File "telnet.py", line 75, in <module>
    d1 = pickle.loads(d1)
  File "/usr/local/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/local/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
KeyError: '\r'
pickle.loads never succeeds, because pickles never becomes True. Any ideas?
EDIT: I have the overall solution. The issue was actually not in the file shown in the error (telnet.py) but in another file. I didn't realize that the intermediate server was receiving the input and changing it. After some suggestions, I realized that was exactly what was happening.
My issue actually came from another file. At the time I did not realize that this being a chat server/client was important. However, the chat server was actually reformatting the data before sending it back to the client. Honestly, I don't know how that didn't hit me sooner, but that's what happened.
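A general takeaway (my suggestion, not the original fix): if pickled bytes must pass through a server that handles traffic as text, encoding them with base64 keeps stray \r/\n bytes out of the payload, so a line-oriented relay can't rewrite them in transit:

```python
import base64
import pickle

payload = ["hello\r\nworld", "person"]

# Sender: pickle, then base64-encode; the encoded form contains no CR/LF,
# so a text-oriented relay has nothing to rewrite.
wire = base64.b64encode(pickle.dumps(payload)) + b"\n"

# Receiver: strip the line terminator, decode, unpickle.
restored = pickle.loads(base64.b64decode(wire.rstrip(b"\n")))
assert restored == payload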
I am using the built-in Python lzma module to decompress chunks of data. Depending on the chunk of data, I get the following exception:
Compressed data ended before the end-of-stream marker was reached
The data is NOT corrupted. It can be decompressed correctly with other tools, so it must be a bug in the library. There are other people experiencing the same issue:
http://bugs.python.org/issue21872
https://github.com/peterjc/backports.lzma/issues/6
Downloading large file in python error: Compressed file ended before the end-of-stream marker was reached
Unfortunately, none seems to have found a solution yet. At least, one that works on Python 3.5.
How can I solve this problem? Is there any work around?
I spent a lot of time trying to understand and solve this problem, so I thought it would be a good idea to share it. The problem seems to be caused by a chunk of data without the EOF marker properly set. To decompress a buffer, I used to use lzma.decompress as provided by the lzma Python lib. However, this method expects each data buffer to contain an EOF marker; otherwise it throws an LZMAError exception.
To work around this limitation, we can implement an alternative decompress function which uses an LZMADecompressor object to extract the data from a buffer. For example:
from lzma import FORMAT_AUTO, LZMADecompressor, LZMAError

def decompress_lzma(data):
    results = []
    while True:
        decomp = LZMADecompressor(FORMAT_AUTO, None, None)
        try:
            res = decomp.decompress(data)
        except LZMAError:
            if results:
                break  # Leftover data is not a valid LZMA/XZ stream; ignore it.
            else:
                raise  # Error on the first iteration; bail out.
        results.append(res)
        data = decomp.unused_data
        if not data:
            break
        if not decomp.eof:
            raise LZMAError("Compressed data ended before the end-of-stream marker was reached")
    return b"".join(results)
This function is similar to the one provided by the standard lzma lib with one key difference. The loop is broken if the entire buffer has been processed, before checking if we reached the EOF mark.
I hope this can be useful to other people.
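A small self-contained demonstration (my example) of the LZMADecompressor behavior the function relies on: after a complete stream, eof is set and any trailing bytes land in unused_data for the caller to inspect:

```python
import lzma

original = b"hello world" * 100

# One complete xz stream followed by bytes that are not valid xz data.
data = lzma.compress(original) + b"\x00trailing"

decomp = lzma.LZMADecompressor(lzma.FORMAT_AUTO)
out = decomp.decompress(data)

assert out == original
assert decomp.eof                             # the first stream ended cleanly
assert decomp.unused_data == b"\x00trailing"  # leftovers preserved for the caller
```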
So I am very new to networking, and I was using the Python socket library to connect to a server that transmits a stream of location data.
Here is the code used.
import socket

BUFFER_SIZE = 1024

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('gump.gatech.edu', 756))

try:
    while 1:
        data = s.recv(BUFFER_SIZE).decode('utf-8')
        print(data)
except KeyboardInterrupt:
    s.close()
The issue is that the data arrives in inconsistent forms.
Most of the time it arrives in the correct form, like this:
2016-01-21 22:40:07,441,-84.404153,33.778685,5,3
Yet other times it can arrive split up into two lines like so:
2016-01-21
22:40:07,404,-84.396004,33.778085,0,0
The interesting thing is that when I establish a raw connection to the server using PuTTY, I only ever get the correct form, never the split one. So I imagine something must be splitting the message, or PuTTY is doing something to always assemble it correctly.
What I need is for the variable data to contain the proper line always. Any idea how to accomplish this?
It is best to think of a socket as a continuous stream of data that may arrive in dribs and drabs, or in a flood.
In particular, it is the receiver's job to break the data up into the "records" it should consist of; the socket does not magically know how to do this for you. Here the records are lines, so you must read the data and split it into lines yourself.
You cannot guarantee that a single recv will be a single full line. It could be:
just part of a line;
or several lines;
or, most probably, several lines and another part line.
Try something like: (untested)
# we'll use this to collate partial data
data = ""
while 1:
    # receive the next batch of data
    data += s.recv(BUFFER_SIZE).decode('utf-8')
    # split the data into lines
    lines = data.splitlines(keepends=True)
    # the last of these may be a part line
    full_lines, last_line = lines[:-1], lines[-1]
    # print (or do something else!) with the full lines
    for l in full_lines:
        print(l, end="")
    # was the last line received a full line, or just half a line?
    if last_line.endswith("\n"):
        # print it (or do something else!)
        print(last_line, end="")
        # and reset our partial data to nothing
        data = ""
    else:
        # reset our partial data to this part line
        data = last_line
The easiest way to fix your code is to print the received data without adding a newline; the print statement (Python 2) and the print() function (Python 3) both add one by default. Like this:
Python 2:
print data,
Python 3:
print(data, end='')
Now print will not add its own newline character to the end of each printed value, and only the newlines present in the received data will be printed. The result is that each line is printed without being split based on the amount of data returned by each socket.recv() call. For example:
from __future__ import print_function
import socket

s = socket.socket()
s.connect(('gump.gatech.edu', 756))
while True:
    data = s.recv(3).decode('utf8')
    if not data:
        break  # socket closed, all data read
    print(data, end='')
Here I have used a very small buffer size of 3 which helps to highlight the problem.
Note that this only fixes the problem from the POV of printing the data. If you wanted to process the data line-by-line then you would need to do your own buffering of the incoming data, and process the line when you receive a new line or the socket is closed.
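As a sketch of that line-by-line buffering (my illustration; read_lines and the in-memory stream are my names, not from the answer): accumulate chunks until a newline appears, then yield complete lines:

```python
import io

def read_lines(recv_chunk):
    """Yield complete lines from a callable returning byte chunks
    (empty bytes meaning end of stream), e.g. a socket's recv."""
    buf = b""
    while True:
        chunk = recv_chunk()
        if not chunk:        # stream closed
            if buf:
                yield buf    # whatever is left, without a trailing newline
            return
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line

# Simulate recv with tiny 3-byte reads from an in-memory stream.
stream = io.BytesIO(b"2016-01-21 22:40:07,441,-84.404153\n2016-01-21 22:40:08,100\n")
lines = list(read_lines(lambda: stream.read(3)))
print(lines)
```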
Edit:
socket.recv() is blocking and, as the others said, you won't get exactly one line each time you call the method. The socket waits for data, gets what it can, and then returns. When you print this with print's default end argument, you may get more newlines than you expected. So to get the raw stream from your server, use this:
import socket

BUFFER_SIZE = 1024

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('gump.gatech.edu', 756))
try:
    while 1:
        data = s.recv(BUFFER_SIZE).decode('utf-8')
        if not data:
            break
        print(data, end="")
except KeyboardInterrupt:
    s.close()
DISCLAIMER: I am a total Python n00b and have never ever written anything in Python, I haven't programmed anything in years, and the last language I learned was Visual Basic 6. So bear with me!
So I have an Android app that transmits my phone's sensor (accelerometer, magnet, light etc) data to my Windows PC via UDP, and I have a Python 3.3 script to display that data on screen, and write it to a CSV:
# include libraries n stuff
import socket
import traceback
import csv

# assign variables n stuff
host = ''
port = 5555
csvf = 'accelerometer.csv'

# do UDP stuff
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
s.bind((host, port))

# do CSV stuff
with open(csvf, 'w', newline='', encoding='ascii') as csv_handle:
    csv_writer = csv.writer(csv_handle, delimiter=',')
    while 1:
        try:
            message, address = s.recvfrom(8192)
            print(message)  # display data on screen
            csv_writer.writerow(message)  # write data to CSV
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            traceback.print_exc()
The data on screen looks like this, which is correct:
b'7407.75961, 3, 0.865, 1.423, 9.022, 5,
The data in the CSV file looks like the numerical values of the ASCII codes of the data (note: the codes won't match the above because the data is slightly different):
57,48,48,50,46,54,51,57,57,57,44,32,51,44,32,32,32,48,46,53,52,57,44,32,32,53,46,54,56,56,44,32,32,56,46,51,53,53
How can I get my CSV to just write the string that the UDP socket is receiving? I tried adding "encoding='ascii'", as you can see, but that didn't make a difference from leaving it out.
writerow expects a sequence or iterable of values. But you just have one value.
The reason it sort of works, but does the wrong thing, is that your one value (a bytes string) is actually itself a sequence. But it's not a sequence of the comma-separated values; it's a sequence of bytes.
So, how do you get a sequence of the separate values?
One option is to use split. Either message.split(b', ') or [v.strip() for v in message.split(b',')] seems reasonable here. That will give you this sequence:
[b'7407.75961', b'3', b'0.865', b'1.423', b'9.022', b'5']
But really, this is exactly what the csv module is for. You can, e.g., wrap the input in a BytesIO and pass it to a csv.reader, and then you just copy rows from that reader to the writer.
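A sketch of that approach (my example; it decodes to text first, since Python 3's csv module works on str rather than bytes):

```python
import csv
import io

message = b'7407.75961, 3, 0.865, 1.423, 9.022, 5'

# Parse the incoming CSV-formatted datagram...
reader = csv.reader(io.StringIO(message.decode('ascii')), skipinitialspace=True)
row = next(reader)
print(row)  # -> ['7407.75961', '3', '0.865', '1.423', '9.022', '5']

# ...and copy the row to a writer in the normal way.
out = io.StringIO()
csv.writer(out).writerow(row)
print(out.getvalue())
```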
But if you think about it, you're getting data in csv format, and you want to write it out in the exact same csv format, without using it in any other way… so you don't even need the csv module here. Just use a plain old binary file:
with open(csvf, 'wb') as csv_handle:
    while True:
        try:
            message, address = s.recvfrom(8192)
            print(message)  # display data on screen
            csv_handle.write(message + b'\n')
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            traceback.print_exc()
While we're at it, you almost never need to do this:
except (KeyboardInterrupt, SystemExit):
    raise
except:
    traceback.print_exc()
The only exceptions in Python 3.x that don't inherit from Exception are KeyboardInterrupt, SystemExit, GeneratorExit (which can't happen here), and any third-party exceptions that go out of their way to act like KeyboardInterrupt and SystemExit. So, just do this:
except Exception:
    traceback.print_exc()
Try:
csv_writer.writerow([message])
csv_writer.writerow expects an iterable as its first argument: a sequence of values to write out as one comma-separated row. In your example, message is a (byte) string, which is itself iterable, so csv_writer writes each character as a separate comma-separated value.