I have been reading the contents of a file which is continuously updated. I'm trying something like this:
from datetime import datetime

offset = 0
now = datetime.now()
filename = str(now.date())     # log file named after today's date
logfile = open(filename, "r")
logfile.seek(offset)
data = logfile.read()
try:
    ...                        # HTTP POST of `data` goes here
except Exception:
    ...                        # handle exceptions
Now I want to read only a specific number of bytes from the file, because if I lose the Ethernet connection and get it back again, it takes a long time to read the whole file. Can someone help me with this?
You can use .read() with a numeric argument to read only a specific number of bytes, e.g. read(10) will read 10 bytes from the current position in the file.
http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
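For example, applied to your polling loop, a minimal sketch could look like this (the chunk size, the date-based file name and the HTTP call are placeholders, not part of your code):

from datetime import datetime

CHUNK_SIZE = 4096                       # read at most this many bytes per pass
offset = 0
filename = str(datetime.now().date())   # log file named after today's date

with open(filename, "r") as logfile:
    logfile.seek(offset)
    data = logfile.read(CHUNK_SIZE)     # reads at most CHUNK_SIZE bytes, not the whole file
    if data:
        try:
            ...                         # HTTP POST of `data` goes here
            offset = logfile.tell()     # advance only after a successful post
        except Exception:
            pass                        # connection lost: offset stays put, the chunk is re-read later

On the next pass you seek to the stored offset again, so after an outage you only re-read the bytes that were never posted instead of the whole file.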
I have a client and a server, where the server needs to send a number of text files to the client.
The send file function receives the socket and the path of the file to send:
CHUNKSIZE = 1_000_000
def send_file(sock, filepath):
with open(filepath, 'rb') as f:
sock.sendall(f'{os.path.getsize(filepath)}'.encode() + b'\r\n')
# Send the file in chunks so large files can be handled.
while True:
data = f.read(CHUNKSIZE)
if not data:
break
sock.send(data)
And the receive file function receives the client socket and the path where to save the incoming file:
CHUNKSIZE = 1_000_000
def receive_file(sock, filepath):
with sock.makefile('rb') as file_socket:
length = int(file_socket.readline())
# Read the data in chunks so it can handle large files.
with open(filepath, 'wb') as f:
while length:
chunk = min(length, CHUNKSIZE)
data = file_socket.read(chunk)
if not data:
break
f.write(data)
length -= len(data)
if length != 0:
print('Invalid download.')
else:
print('Done.')
It works by sending the file size as the first line, then sending the file contents in chunks.
Both are invoked in loops in the client and the server, so that files are sent and saved one by one.
It works fine if I put a breakpoint and invoke these functions slowly. But if I let the program run uninterrupted, it fails when reading the size of the second file:
File "/home/stark/Work/test/networking.py", line 29, in receive_file
length = int(file_socket.readline())
ValueError: invalid literal for int() with base 10: b'00,1851,-34,-58,782,-11.91,13.87,-99.55,1730,-16,-32,545,-12.12,19.70,-99.55,1564,-8,-10,177,-12.53,24.90,-99.55,1564,-8,-5,88,-12.53,25.99,-99.55,1564,-8,-3,43,-12.53,26.54,-99.55,0,60,0\r\n'
Clearly a lot more data is being received by that length = int(file_socket.readline()) line.
My questions: why is that? Shouldn't that line read only the size given that it's always sent with a trailing \n?
How can I fix this so that multiple files can be sent in a row?
Thanks!
It seems you're reusing the same connection, and because file_socket is buffered, you have actually received more data from the socket than you think in your read loop.
That is, the receiver has already consumed more data from the socket than the announced length, so the next time you attempt readline() you end up reading the rest of the previous file up to the first newline it happens to contain, instead of the next file's length line.
This also means your real problem is that part of the file data has effectively been skipped; the effect is that the next line read is not the integer you expected, hence the observed failure.
You can say:
with sock.makefile('rb', buffering=0) as file_socket:
instead, to force the file-like access to be unbuffered. Alternatively, handle the receiving, buffering and parsing of the incoming bytes yourself (keeping track of where one file ends and the next one begins), instead of relying on the file-like wrapper and readline().
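For illustration, a sketch of the receiving side with the unbuffered wrapper (same structure as your receive_file; note that with buffering=0 each read() may return fewer bytes than requested, which the loop already handles):

def receive_file(sock, filepath, chunksize=1_000_000):
    # Unbuffered: the wrapper never reads ahead of what we ask for, so the
    # next readline() on the same socket really sees the next size line.
    with sock.makefile('rb', buffering=0) as file_socket:
        length = int(file_socket.readline())
        with open(filepath, 'wb') as f:
            while length:
                data = file_socket.read(min(length, chunksize))
                if not data:
                    break
                f.write(data)
                length -= len(data)
    return length == 0   # True if the whole file arrived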
You have to understand that socket communication is based on TCP/IP; it does not matter whether it's the same machine (you use the loopback interface in that case) or different machines. So you've got some IP addresses between which the connection is established. Going further, it involves accessing your network adapter, i.e. it takes relatively long compared to accessing, say, RAM. Additionally, the adapter itself manages when to send particular data frames (the lower ISO/OSI layers). Basically, TCP requires ACKs, but on a standard PC this is usually not some industrial, real-time Ethernet.
So, in your code, you've got a while True loop without any sleep, and you don't check what sock.send returns. Even if something goes wrong with a particular data frame, you ignore it and try to send the next one. At first glance it appears that something was cached and the receiver got what was flushed once the connection was re-established.
So, the first thing you should do is check whether sock.send indeed returned the number of bytes you passed to it. If not, the remainder should be re-sent. Another thing I strongly recommend in such cases is to think of some custom protocol (this is usually called the application layer in the context of the OSI/ISO stack). For example, you might have four types of frames: START, FILESIZE, DATA and END; assign each a unique ID and start every frame with that identifier. Then START would be empty, FILESIZE would contain a single uint16, DATA would contain {FILE NUMBER, LINE NUMBER, LINE_LENGTH, LINE} and END would be empty. Then, once you've got an entire frame on the client, you can safely assemble the information you received.
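Purely as an illustration of that idea, framing could be built with the struct module roughly like this (the frame IDs, field sizes and helper names are arbitrary choices, not an established protocol):

import struct

START, FILESIZE, DATA, END = 0, 1, 2, 3   # hypothetical frame type IDs

def make_frame(frame_type, payload=b''):
    # 1-byte type + 4-byte payload length, followed by the payload itself
    return struct.pack('!BI', frame_type, len(payload)) + payload

def read_frame(file_socket):
    frame_type, length = struct.unpack('!BI', file_socket.read(5))
    payload = file_socket.read(length) if length else b''
    return frame_type, payload

# Sender side (sketch):
#   sock.sendall(make_frame(START))
#   sock.sendall(make_frame(FILESIZE, struct.pack('!H', size)))   # single uint16
#   sock.sendall(make_frame(DATA, chunk))
#   sock.sendall(make_frame(END))

The receiver then loops on read_frame() and only assembles a file once it has seen the matching START/END pair.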
I am using my own algorithm and loading data in JSON format from S3. Because of the huge size of the data, I need to set up pipe mode. I have followed the instructions given in: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/pipe_bring_your_own/train.py.
As a result, I am able to set up the pipe and read data successfully. The only problem is that the FIFO pipe is not reading the specified number of bytes. For example, given the path to the s3-fifo-channel,
number_of_bytes_to_read = 555444333
with open(fifo_path, "rb", buffering=0) as fifo:
while True:
data = fifo.read(number_of_bytes_to_read)
The length of data should be 555444333 bytes, but it is always less, around 12,123,123 bytes or so. The data in S3 looks like this:
s3://s3-bucket/1122/part1.json
s3://s3-bucket/1122/part2.json
s3://s3-bucket/1133/part1.json
s3://s3-bucket/1133/part2.json
and so on.
Is there any way to enforce the number of bytes to be read? Any suggestion would be helpful. Thanks.
We just needed to pass a positive value for buffering and the problem was solved. The code will buffer 555444333 bytes and then process 111222333 bytes each time. Since our files are JSON, we can easily convert the incoming bytes to a string and then clean the string by removing incomplete JSON parts. The final code looks like this:
number_of_bytes_to_read = 111222333
number_of_bytes_to_buffer = 555444333
with open(fifo_path, "rb", buffering=number_of_bytes_to_buffer) as fifo:
while True:
data = fifo.read(number_of_bytes_to_read)
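For illustration, assuming the stream is newline-delimited JSON records (that framing is an assumption on my part, not something stated above), the cleanup of incomplete parts could look roughly like this:

import json

leftover = b''
with open(fifo_path, "rb", buffering=number_of_bytes_to_buffer) as fifo:
    while True:
        data = fifo.read(number_of_bytes_to_read)
        if not data:
            break
        chunk = leftover + data
        # keep the trailing partial record for the next iteration
        complete, _, leftover = chunk.rpartition(b'\n')
        for line in complete.splitlines():
            if line.strip():
                record = json.loads(line)   # process one complete JSON record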
I am using Google Protocol Buffers and Python to decode some large data files--200MB each. I have some code below that shows how to decode a delimited stream and it works just fine. However it uses the read() command which loads the whole file into memory and then iterates over it.
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32
with open('/home/working/data/feed.pb', 'rb') as f:
buf = f.read() ## PROBLEM-LOADS ENTIRE FILE TO MEMORY.
n = 0
while n < len(buf):
msg_len, new_pos = _DecodeVarint32(buf, n)
n = new_pos
msg_buf = buf[n:n+msg_len]
n += msg_len
read_row = sfeed.standard_feed()
read_row.ParseFromString(msg_buf)
# do something with read_metric
print(read_row)
Note that this code comes from another SO post, but I don't remember the exact url. I was wondering if there was a readlines() equivalent with protocol buffers that allows me to read in one delimited message at a time and decode it? I basically want a pipeline that is not limited by the RAM I have to load the file.
Seems like there was a pystream-protobuf package that supported some of this functionality, but it has not been updated in a year or two. There is also a post from 7 years ago that asked a similar question. But I was wondering if there was any new information since then.
python example for reading multiple protobuf messages from a stream
If it is ok to load one full message at a time, this is quite simple to implement by modifying the code you posted:
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32
with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read(10)  # Maximum length of length prefix
    while buf:
        msg_len, new_pos = _DecodeVarint32(buf, 0)
        buf = buf[new_pos:]
        # read the rest of the message
        buf += f.read(max(0, msg_len - len(buf)))
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(buf[:msg_len])
        buf = buf[msg_len:]
        # do something with read_row
        print(read_row)
        # read the length prefix for the next message
        buf += f.read(max(0, 10 - len(buf)))
This reads 10 bytes, which is enough to parse the length prefix, and then reads the rest of the message once its length is known.
String mutations are not very efficient in Python (they make a lot of copies of the data), so using bytearray can improve performance if your individual messages are also large.
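For example, a sketch of the same loop with a bytearray as the working buffer (same logic as above, just mutating the buffer in place instead of rebuilding bytes objects; whether this helps depends on your message sizes):

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = bytearray(f.read(10))            # enough for the varint length prefix
    while buf:
        msg_len, new_pos = _DecodeVarint32(buf, 0)
        del buf[:new_pos]                  # drop the prefix in place
        buf += f.read(max(0, msg_len - len(buf)))
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(bytes(buf[:msg_len]))
        del buf[:msg_len]                  # drop the parsed message in place
        print(read_row)
        buf += f.read(max(0, 10 - len(buf)))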
https://github.com/cartoonist/pystream-protobuf/ was updated 6 months ago. I haven't tested it much so far, but it seems to work fine without any need for an update. It provides optional gzip and async.
I have a control box and a Raspberry Pi which communicate over serial (serial to RJ45), and I need the commands that the control box sends every 50 ms. I am able to read the data, but here's the issue: when I start reading, the starting byte is incorrect, so I am unable to parse it.
For example (The output I am currently getting):
b'\x21\x21\x98\x98\x21\x21\x18\x12\x21\x12\x02\x32\x11'
The starting byte I need has to be 0x98, so I need it to be like this
b'\x98\x98\x21\x21\x18\x12\x21\x12\x02\x32\x11\x12\x11'
I need it this way so I can parse the line and, say, grab byte[4] (0x21) or something like that.
In terms of research, I ran into the struct module. I have no idea how to use it, though, and I have no idea if I even need to use it.
I currently don't have a full version of the code on me at this moment, but here is a quick example of what I currently have:
import serial
import time
port = serial.Serial("/dev/ttyS0", baudrate=9600)
while True:
output = port.read(13) # --- In Total there are 13 Bytes
print(output)
Since you are getting another batch of data every 50 ms, you need to be able to sync with the start of the data:
buffer = b''
header = 0x98      # indexing a bytes object gives ints, so compare against an int

while True:
    if port.in_waiting:
        buffer += port.read(port.in_waiting)
        # throw away bytes until the buffer starts with the two header bytes
        while len(buffer) >= 2:
            if buffer[0] == header and buffer[1] == header:
                break
            buffer = buffer[1:]
        if len(buffer) >= 13:
            print(buffer[:13])   # or otherwise process the latest packet
            buffer = buffer[13:]
This code starts with an empty buffer and then reads whatever data arrives at the serial port. While the buffer does not start with the two header bytes, any excess at the front is discarded. Once the buffer starts with the right header and is long enough, the 13 bytes are printed here (but you might want to call another function to process a whole packet), and then that packet is thrown away, ready to start with whatever arrives next.
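If you then want to pull individual fields out of a packet, note that indexing a bytes object in Python 3 already gives you integers; struct is only needed for multi-byte fields. A small sketch with a hypothetical layout (the field positions here are made up for illustration):

import struct

def process_packet(packet):
    # packet is a 13-byte bytes object starting with the two 0x98 header bytes
    value_at_4 = packet[4]                        # a single byte as an int, e.g. 0x21
    field, = struct.unpack_from('>H', packet, 5)  # bytes 5-6 as a big-endian uint16, if that matches your protocol
    print(hex(value_at_4), field)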
Is there a way to download only a partial piece of a file, starting from the end of the file? For example, if a file is over 40 MB, I would like to retrieve only the last block of, say, 2042 bytes. Is there a way to do this with Python 3 and ftplib?
Try using the FTP.retrbinary() method and supply the rest argument, which is an offset into the requested file. Since the offset is from the beginning of the file, you will need to calculate it from the size of the file and the desired number of bytes. Here's an example using Debian's FTP server:
from ftplib import FTP
hostname = 'ftp.debian.org'
filename = 'README'
num_bytes = 500 # how many bytes to retrieve from end of file
ftp = FTP(hostname)
ftp.login()
ftp.cwd('debian')
cmd = 'RETR {}'.format(filename)
offset = max(ftp.size(filename) - num_bytes, 0)
ftp.retrbinary(cmd, open(filename, 'wb').write, rest=offset)
ftp.quit()
This will retrieve the last num_bytes bytes from the end of the requested file and write it to a file of the same name in the current directory.
The second argument to retrbinary() is a callback function, in this case it's the write() method of a writeable file. You can write your own callback to process the retrieved data.
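For example, if you would rather keep the tail in memory instead of writing it to disk, the callback can simply collect the chunks (a small sketch reusing the variables from the example above):

chunks = []
ftp = FTP(hostname)
ftp.login()
ftp.cwd('debian')
offset = max(ftp.size(filename) - num_bytes, 0)
ftp.retrbinary('RETR {}'.format(filename), chunks.append, rest=offset)
ftp.quit()

tail = b''.join(chunks)   # the last num_bytes bytes of the file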
Just use the rest argument to retrbinary to tell the server at which file offset it should start to transfer data. From the documentation:
FTP.retrbinary(command, callback[, maxblocksize[, rest]])
... rest means the same thing as in the transfercmd() method.
FTP.transfercmd(cmd[, rest])
... If optional rest is given, a REST command is sent to the server, passing rest as an argument. rest is usually a byte offset into the requested file, telling the server to restart sending the file’s bytes at the requested offset, skipping over the initial bytes.