Let's assume I request a large object from a server and I want to write it to disk without using a lot of memory (the object should not be fully loaded into memory; for example, the object is about 5 GB but my memory limit is 100 MB).
Here is some example code in Python:
import requests

response = requests.get('https://example.com/verylargefile.txt', stream=True)
chunk_size = 1024 * 1024 * 15  # 15 MB chunks, for example

with open('some_file_name.txt', 'wb') as file:
    for block in response.iter_content(chunk_size):
        file.write(block)
In this code, we write every 15 MB we receive to disk. But in reality, will the file data pile up in memory, i.e. could more data arrive than I am able to write to disk in a given time?
How does the streaming work? (I am familiar with HTTP streaming.) Do I ask the server for the next 15 MB only after I have handled the previous 15 MB? (I'm sure it doesn't work like that, but how does it work?)
I need to read a really big file of JSON lines (jsonl) from a URL; the approach I am using is as follows:
import json
import urllib.request

bulk_status_info = _get_bulk_info(shop)
url = bulk_status_info.get('bulk_info').get('url')
file = urllib.request.urlopen(url)
for line in file:
    print(json.loads(line.decode("utf-8")))
However, my CPU and memory are limited, which brings me to two questions:
Is the file loaded all at once, or is there some buffering mechanism to prevent memory from overflowing?
If my task fails, I want to restart from the place where it failed. Is there some sort of cursor I can save? Note that things like seek or tell do not work here since it is not an actual file.
Some additional info: I am using Python 3 and urllib.
The file will not be loaded in its entirety before the for loop runs; it is loaded packet by packet, but this is abstracted away by urllib. If you want closer control over the reads, there is a way to do it, similar to how it can be done with the requests library, as shown in the sketch below.
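For example, a minimal sketch (URL, chunk size, and the handle() callback are placeholders, not part of the question's code) that reads the response in fixed-size chunks instead of line by line:

import urllib.request

chunk_size = 1024 * 1024  # 1 MB per read; adjust to your memory budget
resp = urllib.request.urlopen('https://example.com/big.jsonl')  # placeholder URL
while True:
    # read(n) pulls at most n bytes off the socket, so only one chunk
    # is held in memory at a time
    chunk = resp.read(chunk_size)
    if not chunk:
        break
    handle(chunk)  # handle() is a hypothetical callback; note a chunk may end mid-line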
Generally there is no way to resume the download of a web page, or any file request for that matter, unless the server specifically supports it. That requires the server to allow a start point to be specified (for HTTP, via Range requests); this is the case for video streaming protocols.
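If the server does advertise Accept-Ranges: bytes, a rough sketch of resuming from a saved byte offset with urllib could look like this (the URL, offset, and process() handler are illustrative only):

import urllib.request

url = 'https://example.com/big.jsonl'  # placeholder
offset = 123456                        # byte position saved before the failure

req = urllib.request.Request(url, headers={'Range': 'bytes=%d-' % offset})
resp = urllib.request.urlopen(req)
if resp.status != 206:  # 206 = Partial Content; anything else means the server ignored the Range header
    raise RuntimeError('server does not support resuming, restarting from the beginning')
for line in resp:
    process(line)  # process() is a hypothetical per-line handler

You would also have to make sure the saved offset lands on a line boundary so that the first resumed line is valid JSON.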
I want to ask for some advice about real-time audio data processing.
For the moment, I have created a simple server and client using Python sockets, which send and receive audio data from the microphone until I stop it (4096 bytes per packet, but it could be much more).
I have seen two different kinds of analysis:
real-time: perform analysis on each X-byte packet and send the result back in the response
after receiving a lot of bytes (for example every hour), append these bytes and store them in a DB. When the microphone is stopped, concatenate all the previous chunks and perform some action on the result (like creating a waveplot image for the recorded session).
For this kind of usage, which kind of self-hosted DB can I use?
How can I concatenate these large volumes of data at regular intervals and add them to the DB?
For only 6 minutes, I received something like 32 MB of data. Maybe I should put each chunk into Redis as soon as I receive it, rather than keeping it in a Python object. Another option could be to serialize the audio data as base64. I'm just afraid of losing speed, since I'm currently using TCP to send the data.
Thanks for your help!
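For reference, a minimal sketch of the "push each chunk into Redis as soon as it arrives" idea from the question (the key name and connection details are placeholders; Redis lists accept raw bytes, so base64 encoding and its size overhead are not required):

import redis

r = redis.Redis()  # assumes a locally running Redis instance
SESSION_KEY = 'audio:session1'  # hypothetical key, one per recording session

def on_packet(chunk: bytes):
    # append the raw 4096-byte packet as received from the socket
    r.rpush(SESSION_KEY, chunk)

def assemble() -> bytes:
    # concatenate all chunks in arrival order once the microphone stops
    return b''.join(r.lrange(SESSION_KEY, 0, -1))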
On your question about the size: is there any reason not to compress the audio data? It's very easy, and 32 MB for 6 minutes of uncompressed mono audio is normal. You could store smaller chunks and/or append incoming chunks to a bigger file (see the sketch after the links below). Have a look at these, they might help you:
https://realpython.com/playing-and-recording-sound-python/
How to join two wav files using python?
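As a rough sketch of the "append incoming chunks to a bigger file" idea, assuming the packets are raw 16-bit mono PCM at 44.1 kHz (adjust the parameters to match your capture settings):

import wave

def write_session_wav(chunks, path='session.wav'):
    # chunks: iterable of raw PCM byte strings received from the socket
    with wave.open(path, 'wb') as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(44100)  # 44.1 kHz
        for chunk in chunks:
            wav.writeframes(chunk)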
Why should I use iter_content? I'm especially confused about the purpose of chunk_size, because whichever value I try, the file seems to be saved successfully after downloading.
import requests

g = requests.get(url, stream=True)
with open('c:/users/andriken/desktop/tiger.jpg', 'wb') as sav:
    for chunk in g.iter_content(chunk_size=1000000):
        print(chunk)
        sav.write(chunk)
Help me understand the use of iter_content and what will happen. As you can see, I am using 1000000 bytes as the chunk_size; what exactly is its purpose, and what effect does it have?
This is to prevent loading the entire response into memory at once (it also lets you implement some concurrency while you stream the response, so that you can do other work while waiting for the request to finish).
The purpose of a streaming request is usually for media. For example, when downloading a 500 MB .mp4 file with requests, you want to stream the response (and write the stream to disk in chunks of chunk_size) instead of waiting for all 500 MB to be loaded into Python at once.
If you want to implement any UI feedback (such as download progress like "downloaded <chunk_size> bytes..."), you will need to stream and chunk. If the response contains a Content-Length header, you can also calculate the percentage completed on every chunk you save.
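For instance, a rough sketch of progress reporting (the URL and file name are placeholders; the header may be absent, so fall back gracefully):

import requests

response = requests.get('https://example.com/big.mp4', stream=True)
total = int(response.headers.get('Content-Length', 0))  # 0 if the header is missing
downloaded = 0
with open('big.mp4', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)
        downloaded += len(chunk)
        if total:
            print('downloaded %d/%d bytes (%.1f%%)' % (downloaded, total, 100 * downloaded / total))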
From the documentation, chunk_size is the amount of data the app will read into memory at a time when stream=True.
For example, if the size of the response is 1000 bytes and chunk_size is set to 100, the response is split into ten chunks.
In situations where data is delivered without a Content-Length header (using HTTP/1.1 chunked transfer encoding (CTE) mode or HTTP/2/3 data frames) and minimal latency is required, it can be useful to deal with each HTTP chunk as it arrives, as opposed to waiting until the buffer reaches a specific size.
This can be achieved by setting chunk_size = None.
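A small sketch of that mode (the endpoint and the handle() callback are placeholders):

import requests

response = requests.get('https://example.com/event-stream', stream=True)
# with chunk_size=None, each iteration yields a chunk as soon as it arrives,
# in whatever size the server sent it, instead of buffering to a fixed size
for chunk in response.iter_content(chunk_size=None):
    handle(chunk)  # handle() is a hypothetical per-chunk callback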
I am trying to load a JSON file into Google BigQuery using the script at
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/api/load_data_by_post.py with very little modification.
I added
,chunksize=10*1024*1024, resumable=True))
to MediaFileUpload.
The script works fine for a sample file with a few million records. The actual file is about 140 GB with approx 200,000,000 records. insert_request.execute() always fails with
socket.error: `[Errno 32] Broken pipe`
after half an hour or so. How can this be fixed? Each row is less than 1 KB, so it shouldn't be a quota issue.
When handling large files, don't use streaming but batch loading: streaming will easily handle up to 100,000 rows per second, which is pretty good for streaming, but not for loading large files.
The linked sample code is doing the right thing (batch instead of streaming), so what we see is a different problem: the sample code is trying to load all this data straight into BigQuery, but the upload-through-POST step fails.
Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
Update: After talking to the engineering team, POST should work if you try a smaller chunksize.
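A minimal sketch of the GCS-staging approach, using the google-cloud-bigquery client rather than the discovery-based API in the linked sample (the bucket, dataset, and table names are placeholders; recent versions of the client accept a plain table-ID string as the destination):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

# Upload the 140 GB file to GCS first (e.g. with gsutil cp), then point
# BigQuery at it; the load job runs server-side, so no long-lived POST.
load_job = client.load_table_from_uri(
    'gs://your-bucket/verylargefile.json',
    'your_project.your_dataset.your_table',
    job_config=job_config,
)
load_job.result()  # waits for the load job to complete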
I have a gevent-powered crawler that downloads pages all the time. The crawler adopts a producer-consumer pattern, where I feed the queue with data like this: {method: get, url: xxxx, other_info: yyyy}.
Now I want to assemble some responses into files. The problem is, I can't just open a file and write whenever a request ends; that is I/O-costly, and the data would not be in the correct order.
I assume maybe I should number all requests, cache the responses in order, and open a greenlet that loops and assembles the files. The pseudocode might look like this:
import gevent

max_chunk = 1000  # number of responses to buffer before writing
data = {}         # index -> response body

def wait_and_assemble_file():  # a loop running in its own greenlet
    while True:
        if len(data) == max_chunk:
            with open('test.txt', 'a') as f:
                for index in sorted(data):
                    f.write(data[index])
            data.clear()
        gevent.sleep(0)

def after_request(response, index):  # executed after every request ends
    data[index] = response  # every response is about 5-25k
Is there a better solution? There are thousands of concurrent requests, and I worry that memory use may grow too fast, that too many loops run at one time, or that something unexpected happens.
Update:
The code above just demonstrates how the data caching and file writing would work. In a practical situation, there may be a hundred loops running, waiting for caching to complete and writing to different files.
Update 2:
@IT Ninja suggested using a queue system, so I wrote an alternative using Redis:
import msgpack
# "redis" below is an already connected client, e.g. redis.StrictRedis()

def after_request(response, session_id, total_block_count, index):  # executed after every request ends
    redis.lpush(session_id, msgpack.packb({'index': index, 'content': response}))  # save the block to Redis
    if redis.incr(session_id + ':count') == total_block_count:  # all data blocks are prepared
        save(session_id)

def save(session_name):
    texts = redis.lrange(session_name, 0, -1)
    redis.delete(session_name)
    redis.delete(session_name + ':count')
    data_array = {}
    for t in texts:
        _d = msgpack.unpackb(t, raw=False)
        data_array[_d['index']] = _d['content']
    with open(session_name + '.txt', 'w') as r:
        for index in sorted(data_array):
            r.write(data_array[index])
It looks a bit better, but I doubt whether saving large data in Redis is a good idea. Hoping for more suggestions!
Something like this may be better handled with a queue system, instead of each thread having its own file handle. This is because you may run into race conditions when writing the file if each thread has its own handle.
As far as resources go, this should not consume much beyond your disk writes, assuming that the information being passed to the file is not extremely large (Python is quite good about this). If it does pose a problem, though, reading the file into memory in chunks (and writing it out in proportional chunks) can greatly reduce the issue, as long as this is an option for your file uploads.
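A rough sketch of the single-writer-fed-by-a-queue idea with gevent (the file name and item format are illustrative):

import gevent
from gevent.queue import Queue

write_queue = Queue()

def writer():
    # exactly one greenlet owns the file handle, so there is no write race
    with open('output.txt', 'a') as f:
        while True:
            index, content = write_queue.get()  # blocks cooperatively until an item arrives
            f.write(content)

def after_request(response, index):
    # producers never touch the file, they only enqueue
    write_queue.put((index, response))

gevent.spawn(writer)

Ordering by index would still require buffering out-of-order items before writing, as in the question's own sketch.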
It depends on the size of the data. If it is very big, keeping the whole structure in memory can slow down the program.
If memory is not a problem, you should keep the structure in memory instead of reading from a file all the time. Opening a file again and again for concurrent requests is not a good solution.