Python: Reading a file from a URL in chunks

I need to read a really big file of JSON lines from a URL. The approach I am using is as follows:
import json
import urllib.request

bulk_status_info = _get_bulk_info(shop)
url = bulk_status_info.get('bulk_info').get('url')
file = urllib.request.urlopen(url)
for line in file:
    print(json.loads(line.decode("utf-8")))
However, my CPU and memory are limited, which brings me to two questions:
Is the file loaded all at once, or is there some buffering mechanism that prevents memory from overflowing?
If my task fails, I want to start again from the place where it failed. Is there some sort of cursor I can save? Note that things like seek or tell do not work here, since it is not an actual file.
Some additional info: I am using Python 3 and urllib.

The file is not loaded into memory in its entirety before the for loop runs: urlopen returns a streaming response object, and the data arrives packet by packet as you iterate over it, with the buffering abstracted away by urllib. If you want closer access to the chunks, I'm sure there is a way, similar to how it can be done using the requests library.
Generally there is no way to resume the download of a webpage, or any file request for that matter, unless the server specifically supports it. That requires the server to allow a start point to be specified (for HTTP, this is done with the Range request header); video streaming protocols rely on the same idea.
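If the server does honor Range requests, you can build a crude resume mechanism yourself by tracking how many bytes you have already processed. A minimal sketch, assuming a hypothetical read_jsonl_from helper and a server that answers with 206 Partial Content (if it ignores the header, you simply get the whole file again from byte 0):

import json
import urllib.request

def read_jsonl_from(url, start_byte=0):
    # Ask the server to start at the saved offset; this only works if it
    # honors Range requests and replies with 206 Partial Content.
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-" % start_byte})
    with urllib.request.urlopen(req) as resp:
        offset = start_byte
        for line in resp:
            offset += len(line)   # bytes consumed so far; persist this as your "cursor"
            yield offset, json.loads(line.decode("utf-8"))

After each successfully processed line you would persist offset somewhere durable and pass it back in as start_byte when restarting after a failure.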

Related

Parsing local XML files with Scrapy: DOWNLOAD_TIMEOUT & DOWNLOAD_MAXSIZE not working

I am parsing local XML files with Scrapy, and the code seems to hang on one particular XML file.
The file may be too large (219M) or badly formatted? Either way, the spider doesn't crash; it just freezes. It freezes so badly I can't even ctrl+c out...
I have tried adjusting the DOWNLOAD_TIMEOUT and DOWNLOAD_MAXSIZE settings to get Scrapy to skip this file, and any other similarly problematic files it encounters, but it doesn't seem to work. At least not if I use file:///Users/.../myfile.xml as the URL, which I am doing based on this post.
If I instead start a server with python -m http.server 8002 and access the files through that URL (http://localhost:8002/.../myfile.xml), then Scrapy does skip over the file with a CancelledError, like I want: expected response size larger than download max size.
So I guess that if you use the file protocol, the downloader settings are not used, because you're not actually downloading anything? Something like that? Is there a way to tell Scrapy to time out on or skip over local files?
It seems like launching an HTTP server is one solution, but it adds complexity to running the spider (and may slow things down?), so I'd rather find a different solution.
I'm fairly certain that DOWNLOAD_TIMEOUT and DOWNLOAD_MAXSIZE only apply when making calls over HTTP or another network protocol. Instead, you could override the start_requests method, where you have more control over how you read the files:
def start_requests(self, **kwargs):
    for uri in self.uris:
        ...
You could, for example, use os.read and pass it a length, which tells Python to read at most that many bytes and then return. That would give you roughly the same effect as DOWNLOAD_MAXSIZE.
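A minimal sketch of that idea, assuming a hypothetical self.uris list of local paths and a hypothetical self.max_bytes cap; instead of os.read with a length, this variant simply checks the file size up front and skips anything over the cap, which is the effect the question is after:

import os
import scrapy

class LocalXmlSpider(scrapy.Spider):
    name = "local_xml"
    uris = ["/path/to/myfile.xml"]       # hypothetical local paths
    max_bytes = 50 * 1024 * 1024         # hypothetical per-file size cap

    def start_requests(self):
        for path in self.uris:
            if os.path.getsize(path) > self.max_bytes:
                # too big: skip it instead of letting the spider hang on it
                self.logger.warning("skipping %s: larger than max_bytes", path)
                continue
            yield scrapy.Request("file://" + path, callback=self.parse)

    def parse(self, response):
        ...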

Intentionally cause a read/write timeout?

I'm trying to test some file I/O, and I was wondering if there's a way to emulate the following situation:
I have a block-storage device that is constantly being read from and written to. I want to show users the proper error when they try to read or write a file stored on the block-storage device but the block-storage service/device becomes unavailable or detached mid-write, in which case the read or write call would "time out" or "hang."
I'm trying to write a test case that reads a file, and I want to emulate that situation as closely as possible. That means I don't want to use signal or just some timeout; I want to be able to make some kind of file that will hang a Python file.read() or file.write() call.
Is this possible? I'm testing on a Linux machine and mounting block storage to a folder, pretty simple.
It seems to me that fsdisk is the right tool you're looking for. It can bind your storage and inject errors.
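As a lighter-weight alternative to fault injection at the block layer (not from the answer above, just a common Linux trick for tests): a named pipe gives you a path whose read() blocks for as long as you like, which is often enough to exercise the "hanging read" code path:

import os
import tempfile

# On Linux, a FIFO behaves like a file whose read() hangs as long as a writer
# end is held open but nothing is written. Opening it O_RDWR keeps our own
# writer end alive, so the read below blocks indefinitely.
fifo_path = os.path.join(tempfile.mkdtemp(), "hanging_file")   # hypothetical path
os.mkfifo(fifo_path)

f = os.fdopen(os.open(fifo_path, os.O_RDWR), "rb")
f.read(1)   # hangs until something writes to the FIFO

This does not emulate a detached device exactly (a real detach usually ends in an I/O error rather than an endless hang), but it lets you test timeout handling around file.read() without touching the block layer.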

Use a timeout to prevent deadlock when opening a file in Python?

I need to open a file which is NFS mounted to my server. Sometimes, the NFS mount fails in a manner that causes all file operations to deadlock. In order to prevent this, I need a way to let the open function in python time out after a set period. E.g. something like open('/nfsdrive/foo', timeout=5). Of course, the default open procedure has no timeout or similar keyword.
Does anyone here know of a way to effectively stop trying to open a (local) file if the opening takes too long?
Note: I've already tried the urllib2 module, but its timeout options only work for web requests, not local ones.
You can try using stopit:
from stopit import SignalTimeout as Timeout

with Timeout(5.0) as timeout_ctx:
    with open('/nfsdrive/foo', 'r') as f:
        # do something with f
        pass
There may be some issues with SignalTimeout in multithreaded environments (like Django). ThreadingTimeout, on the other hand, may cause problems with resources on some virtual hosts when you run too many "time-limited" functions.
P.S. My example also limits the processing time of the opened file. To limit only the opening of the file, you should use a different approach, with manual file opening/closing and manual exception handling.
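A rough sketch of that variant, reusing the same stopit import as above and limiting only the open() call (with the usual NFS caveat: on a hard mount the process can be stuck in an uninterruptible wait, in which case no userspace timeout will rescue it):

from stopit import SignalTimeout as Timeout

f = None
with Timeout(5.0):
    f = open('/nfsdrive/foo', 'r')   # only the open() is under the timeout

if f is None:
    # open() did not complete within 5 seconds
    raise IOError("timed out opening /nfsdrive/foo")

try:
    data = f.read()                  # processing is no longer time-limited
finally:
    f.close()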

Using Python, should I cache large data in an array and write it to a file in one go?

I have a gevent-powered crawler that downloads pages all the time. The crawler adopts a producer-consumer pattern, in which I feed the queue with data like this: {method: get, url: xxxx, other_info: yyyy}.
Now I want to assemble some responses into files. The problem is that I can't just open and write whenever a request ends; that is costly in I/O, and the data would not be in the correct order.
I assume maybe I should number all the requests, cache the responses in order, and open a greenlet that loops and assembles the files. The pseudocode might look like this:
import gevent

max_chunk = 1000
data = {}   # responses keyed by request index (a plain list cannot be assigned out of order)

def wait_and_assemble_file():  # a loop running in its own greenlet
    while True:
        if len(data) == 28:    # enough responses cached, flush them to disk
            f = open('test.txt', 'a')
            for i in sorted(data):
                f.write(data[i])
            f.close()
        gevent.sleep(0)

def after_request(response, index):  # Execute after every request ends
    data[index] = response           # every response is about 5-25k
Is there a better solution? There are thousands of concurrent requests, and I worry that memory use may grow too fast, or that too many loops will run at one time, or that something unexpected will happen.
Update:
The code above just demonstrates how the data caching and file writing work. In the practical situation, there may be a hundred loops running, waiting for caching to complete and writing to different files.
Update 2:
@IT Ninja suggests using a queue system, so I wrote an alternative using Redis:
def after_request(response, session_id, total_block_count, index):  # Execute after every request ends
    redis.lpush(session_id, msgpack.packb({'index': index, 'content': response}))  # save data to redis
    if redis.incr(session_id + ':count') == total_block_count:  # all data blocks are prepared
        save(session_id)

def save(session_name):
    data = {}
    texts = redis.lrange(session_name, 0, -1)
    redis.delete(session_name)
    redis.delete(session_name + ':count')
    for t in texts:
        _d = msgpack.unpackb(t)
        data[_d['index']] = _d['content']
    r = open(session_name + '.txt', 'w')
    for i in sorted(data):   # write the blocks back in request order
        r.write(data[i])
    r.close()
This looks a bit better, but I doubt whether saving large data in Redis is a good idea. Hoping for more suggestions!
Something like this may be better handled with a queue system, instead of each thread having its own file handle. You may otherwise run into race conditions when writing the file, because each thread has its own handle.
As far as resources go, this should not consume much beyond your disk writes, assuming that the information being passed to the file is not extremely large (Python is really good about this). If it does pose a problem, though, reading the file into memory in chunks (and writing it in chunks proportionally) can greatly reduce the problem, as long as that is available as an option for file uploads.
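A minimal sketch of the single-writer idea using gevent's own queue (hypothetical names; the request plumbing from the question is omitted). Workers only enqueue results; one greenlet owns the file handle and writes everything in index order once all blocks have arrived:

import gevent
from gevent.queue import Queue

write_queue = Queue()     # every worker pushes here; only the writer greenlet touches the file

def after_request(response, index):
    # workers never open the file themselves; they just hand off their result
    write_queue.put((index, response))

def writer(path, total_count):
    # single consumer: collect out-of-order results, then write them once, in order
    results = {}
    while len(results) < total_count:
        index, response = write_queue.get()   # blocks until a worker enqueues something
        results[index] = response
    with open(path, 'w') as f:
        for i in sorted(results):
            f.write(results[i])

writer_greenlet = gevent.spawn(writer, 'test.txt', 28)   # hypothetical path and block count

This keeps a single open/write/close per file and avoids the race described above, at the cost of holding one file's worth of responses in memory.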
It depends on the size of the data. If it is very big, keeping the whole structure in memory can slow the program down.
If memory is not a problem, you should keep the structure in memory instead of reading from a file all the time. Opening a file again and again for concurrent requests is not a good solution.

Python: Two scripts working with the same file, one updating it and another deleting the data when processed

Firstly, I am new to Python.
Now my question goes like this: I have a callback script running on a remote machine which sends some data and runs a script on my local machine, which processes that data and writes it to a file. Now another script of mine needs to process the file data one by one locally and delete the entries from the file once they are done.
The problem is that the file may be updating continuously. How do I synchronize the work so that it doesn't mess up my file?
Also, please suggest whether the same work can be done in some better way.
I would suggest looking into named pipes or sockets, which seem better suited to your purpose than a file, if it's really just between those two applications and you have control over the source code of both.
For example, on Unix, you could create a pipe like this (see os.mkfifo):
import os
os.mkfifo("/some/unique/path")
And then access it like a file:
dest = open("/some/unique/path", "w") # on the sending side
src = open("/some/unique/path", "r") # on the reading side
The data will be queued between your processes. It's first-in-first-out, really, but it behaves like a file (mostly).
If you cannot go with named pipes like this, I'd suggest using IP sockets over localhost from the socket module, preferably DGRAM sockets, since you don't need to do any connection handling there. You seem to know how to do networking already.
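A minimal sketch of the DGRAM variant (hypothetical port; each datagram carries one unit of work, so there is no shared file to keep in sync):

import socket

ADDR = ("127.0.0.1", 50007)   # hypothetical localhost port

# Receiving side (the script that processes entries):
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(ADDR)
# data, _ = receiver.recvfrom(65535)   # blocks until the other script sends an entry

# Sending side (the script fed by the remote callback):
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"one entry of data", ADDR)

Note that UDP datagrams can in principle be dropped even on localhost (for example if the receive buffer fills up), so if every entry must be processed you may prefer the named pipe or a SOCK_STREAM connection.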
I would suggest using a database whose transactions allow for concurrent processing.
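As a rough illustration of that suggestion using the standard library's sqlite3 (a client/server database would handle concurrent writers more gracefully, and every name here is made up for the sketch): the writing script appends rows instead of lines, and the processing script handles and deletes a row inside one transaction.

import sqlite3

conn = sqlite3.connect("work.db", timeout=10)   # hypothetical shared database file
conn.execute("CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, payload TEXT)")

# In the writing script: append incoming data as rows.
def add_entry(payload):
    with conn:                                   # runs the insert in its own transaction
        conn.execute("INSERT INTO entries (payload) VALUES (?)", (payload,))

# In the processing script: handle the oldest entry and delete it in the same transaction.
def process_one():
    with conn:
        row = conn.execute("SELECT id, payload FROM entries ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return False
        entry_id, payload = row
        # ... process payload here ...
        conn.execute("DELETE FROM entries WHERE id = ?", (entry_id,))
        return True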
