Python: create a file on S3

I have a function below for generating the rows of a huge text file.
def generate_content(n):
    for _ in range(n):
        yield 'xxx'
Instead of saving the file to disk, then uploading it to S3, is there any way to save the data directly to S3?
One thing to mention is that the data could be so huge that I don't have enough disk space or memory to hold it.

boto3 needs a file, a bytes array, or a file-like object to upload an object to S3. Of those, the only one that doesn't require the entire contents of the object to be in memory or on disk at once is the file-like object, using a custom helper to satisfy the read requests.
Basically, you can call into your generator to satisfy the calls to read(), and boto3 will take care of creating the object for you:
import boto3

def generate_content(n):
    for i in range(n):
        yield 'xxx'

# Convert a generator that returns a series of strings into
# an object that implements 'read()' in a manner similar to how
# a file object operates.
class GenToBytes:
    def __init__(self, generator):
        self._generator = generator
        self._buffers = []
        self._bytes_avail = 0
        self._at_end = False

    # Emulate a file object's read
    def read(self, to_read=1048576):
        # Call the generator to read enough data to satisfy the read request
        while not self._at_end and self._bytes_avail < to_read:
            try:
                row = next(self._generator).encode("utf-8")
                self._bytes_avail += len(row)
                self._buffers.append(row)
            except StopIteration:
                # We're all done reading
                self._at_end = True
        if not self._buffers:
            # Nothing buffered at all (e.g. an empty generator)
            return b''
        if len(self._buffers) > 1:
            # We have more than one pending buffer, concat them together
            self._buffers = [b''.join(self._buffers)]
        # Pull out the requested data, and store the rest
        ret, self._buffers = self._buffers[0][:to_read], [self._buffers[0][to_read:]]
        self._bytes_avail -= len(ret)
        return ret

s3 = boto3.client('s3')
generator = generate_content(100)  # Generate 100 rows
s3.upload_fileobj(GenToBytes(generator), bucket, key)
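One optional refinement for very large objects: upload_fileobj accepts a TransferConfig, which controls the multipart threshold, part size, and concurrency it uses under the hood. A minimal sketch, where 'my-bucket' and 'my-key' are placeholders for your own names:

from boto3.s3.transfer import TransferConfig

# 'my-bucket' and 'my-key' are placeholders for your own bucket and object key
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # switch to multipart above 8 MB
    multipart_chunksize=64 * 1024 * 1024,  # upload in 64 MB parts
    max_concurrency=4,                     # number of parts uploaded in parallel
)

s3 = boto3.client('s3')
s3.upload_fileobj(GenToBytes(generate_content(10_000_000)), 'my-bucket', 'my-key', Config=config)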

Related

Trying to save list data to reuse while testing python scripts

I have a Python script that goes out and pulls a huge chunk of JSON data and then iterates over it to build two lists:
# Get all price data
response = c.get_price_history_every_minute(symbol)

# Build prices list
prices = list()
for i in range(len(response.json()["candles"])):
    prices.append(response.json()["candles"][i]["prices"])

# Build times list
times = list()
for i in range(len(response.json()["candles"])):
    times.append(response.json()["candles"][i]["datetime"])
This works fine, but it takes a LONG time to pull in all of the data and build the lists. I am doing some testing trying to build out a complex script, and would like to save these two lists to two files, and then import the data from those files and recreate the lists when I run subsequent tests to skip generating, iterating and parsing the JSON.
I have been trying the following:
# Write Price to a File
a_file = open("prices7.txt", "w")
content = str(prices)
a_file.write(content)
a_file.close()
And then in future scripts:
# Load Prices from File
prices_test = array('d')
a_file = open("prices7.txt", "r")
prices_test = a_file.read()
The outputs from my json lists and the data loaded into the list created from the file output look identical, but when I try to do anything with the data loaded from a file it is garbage...
print (prices)
The output looks like this: [69.73, 69.72, 69.64, ... 69.85, 69.82, etc.]
print (prices_test)
The output looks identical
If I run a simple query like:
print (prices[1], prices[2])
I get the expected output (69.73, 69.72)
If I do the same on the list created from the file:
print (prices_test[1], prices_test[2])
I get the output ( [,6 )
It is pulling every character in the string individually instead of using the comma separated values as I would have expected...
I've googled every combination of search terms I could think of so any help would be GREATLY appreciated!!
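What is happening here, for reference: a_file.read() returns the entire file as one string, and indexing a string gives single characters, not the original list elements. A minimal illustration, with made-up values standing in for the real data:

s = str([69.73, 69.72, 69.64])   # what gets written to the file
loaded = s                        # what a_file.read() hands back
print(loaded[1], loaded[2])       # prints: 6 9  (single characters, not floats)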
I had to do something like this before. I used pickle to do it.
import pickle

def pickle_the_data(pickle_name, list_to_pickle):
    """This function pickles a given list.

    Args:
        pickle_name (str): name of the resulting pickle.
        list_to_pickle (list): list that you need to pickle
    """
    with open(pickle_name + '.pickle', 'wb') as pikd:
        pickle.dump(list_to_pickle, pikd)
    file_name = pickle_name + '.pickle'
    print(f'{file_name}: Created.')

def unpickle_the_data(pickle_file_name):
    """This will unpickle a pickled file

    Args:
        pickle_file_name (str): file name of the pickle

    Returns:
        list: when we pass a pickled list, it will return an
        unpickled list.
    """
    with open(pickle_file_name, 'rb') as pk_file:
        unpickleddata = pickle.load(pk_file)
    return unpickleddata
So first pickle your list with pickle_the_data(name_for_pickle, your_list), then when you need the list back, call unpickle_the_data(name_of_your_pickle_file).
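A minimal round trip with the helpers above, assuming the prices and times lists from the question:

# Save once, right after the slow JSON pull
pickle_the_data('prices7', prices)
pickle_the_data('times7', times)

# In later test runs, skip the API call and reload instead
prices = unpickle_the_data('prices7.pickle')
times = unpickle_the_data('times7.pickle')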
This is what I was trying to explain in the comments section. Note that I replaced response.json() with jsonData, taking it out of each for-loop, and reduced both loops to a single one for efficiency. Now the code should run faster.
import json

def saveData(filename, data):
    # Convert Data to a JSON String
    data = json.dumps(data)
    # Open the file, then save it
    try:
        file = open(filename, "wt")
    except:
        print("Failed to save the file.")
        return False
    else:
        file.write(data)
        file.close()
        return True

def loadData(filename):
    # Open the file, then load its contents
    try:
        file = open(filename, "rt")
    except:
        print("Failed to load the file.")
        return None
    else:
        data = file.read()
        file.close()
        # Data is a JSON string, so now we convert it back
        # to a Python Structure:
        data = json.loads(data)
        return data

# Get all price data
response = c.get_price_history_every_minute(symbol)
jsonData = response.json()

# Build prices and times list:
#
# As you're iterating over the same "candles" index on both loops
# when building those two lists, just reduce it to a single loop
prices = list()
times = list()
for i in range(len(jsonData["candles"])):
    prices.append(jsonData["candles"][i]["prices"])
    times.append(jsonData["candles"][i]["datetime"])

# Now, when you need, just save each list like this:
saveData("prices_list.json", prices)
saveData("times_list.json", times)

# And retrieve them back when you need it later:
prices = loadData("prices_list.json")
times = loadData("times_list.json")
Btw, pickle does the same thing, but it uses binary data instead of JSON, which is probably faster for saving and loading. I don't know, I haven't tested it.
With JSON you have the advantage of readability, as you can open each file and read it directly, if you understand JSON syntax.
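If you want to check the speed claim yourself, a quick sketch with timeit would look something like this (the stand-in data below replaces the real prices list):

import json
import pickle
import timeit

prices = [float(i) for i in range(1_000_000)]  # stand-in data for the benchmark

def save_json():
    with open("prices.json", "w") as f:
        json.dump(prices, f)

def save_pickle():
    with open("prices.pickle", "wb") as f:
        pickle.dump(prices, f)

print("json:  ", timeit.timeit(save_json, number=5))
print("pickle:", timeit.timeit(save_pickle, number=5))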

How to save and load a large dictionary to storage in python?

I have a 1.5GB dictionary that takes about 90 seconds to calculate, so I want to save it to storage once and load it every time I want to use it again. This creates two challenges:
Loading the file has to take less than 90 seconds.
As RAM is limited (in PyCharm) to ~4GB, it cannot be memory-intensive.
I also need it to be utf-8 capable.
I have tried solutions such as pickle but they always end up throwing a Memory Error.
Notice that my dictionary is made of Strings and thus solutions like in this post do not apply.
Things I do not care about:
Saving time (as long as it's not more than ~20 minutes, as I'm looking to do it once).
How much space it takes in storage to save the dictionary.
How can I do that? thanks
Edit:
I forgot to mention it's a dictionary containing sets, so json.dump() doesn't work as it can't handle sets.
If the dict consumes a lot of memory because it has many items, you could try dumping many smaller dicts and combining them with update:
mk_pickle.py
import pickle

CHUNKSIZE = 10  # you will of course make this number bigger

def mk_chunks(d, chunk_size):
    chunk = {}
    ctr = chunk_size
    for key, val in d.items():
        chunk[key] = val
        ctr -= 1
        if ctr == 0:
            yield chunk
            ctr = chunk_size
            chunk = {}
    if chunk:
        yield chunk

def dump_big_dict(d):
    with open("dump.pkl", "wb") as fout:
        for chunk in mk_chunks(d, CHUNKSIZE):
            pickle.dump(chunk, fout)

# For testing:
N = 1000
big_dict = dict()
for n in range(N):
    big_dict[n] = "entry_" + str(n)

dump_big_dict(big_dict)
read_dict.py
import pickle

d = {}
with open("dump.pkl", "rb") as fin:
    while True:
        try:
            small_dict = pickle.load(fin)
        except EOFError:
            break
        d.update(small_dict)
You could try to generate and save it by parts in several files. I mean generate some key-value pairs, store them in a file with pickle, delete the dict from memory, and then continue until all key-value pairs are exhausted.
Then to load the whole dict, use dict.update for each part, but that could also run into memory trouble, so instead you can make a class derived from dict which reads the corresponding file on demand according to the key (I mean overriding __getitem__), something like this:
class Dict(dict):
    def __init__(self):
        super().__init__()
        self.dict = {}

    def __getitem__(self, key):
        if key in self.dict:
            return self.dict[key]
        else:
            del self.dict  # destroy the old part before the new one is created
            with open(self.getFileName(key), 'rb') as f:
                self.dict = pickle.load(f)
            return self.dict[key]

    filenames = ['key1', 'key1000', 'key2000']

    def getFileName(self, key):
        '''assuming the keys are separated in files by alphabetical order,
        each file name taken from its first key'''
        if key in self.filenames:
            return key
        else:
            A = list(sorted(self.filenames + [key]))
            return A[A.index(key) - 1]
Keep in mind that smaller dicts will load faster, so you should experiment to find the right number of files.
You can also keep more than one part resident in memory, depending on the memory available.
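As a minimal sketch of the writing side that matches this naming scheme (a hypothetical helper, assuming a big_dict of string keys and a fixed number of keys per part):

import pickle

def dump_in_parts(big_dict, keys_per_file=100000):
    '''Write the dict out in sorted-key order, naming each part file after its first key.'''
    filenames = []
    keys = sorted(big_dict)
    for start in range(0, len(keys), keys_per_file):
        part_keys = keys[start:start + keys_per_file]
        part = {k: big_dict[k] for k in part_keys}
        name = str(part_keys[0])
        with open(name, 'wb') as f:
            pickle.dump(part, f)
        filenames.append(name)
    return filenames  # feed this list to the on-demand Dict class above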

how to read a big csv file in Azure Blob

I will get a HUGE csv file as a blob in Azure, and I need to parse it line by line in an Azure function.
I am reading each of the blobs in my container and getting it as a string, but I think that loads everything, and then I split it by newlines.
Is there a smarter way to do this?
container_name = "test"
block_blob_service = BlockBlobService(account_name=container_name, account_key="mykey")
a = block_blob_service.get_container_properties(container_name)
generator = block_blob_service.list_blobs(container_name)
for b in generator:
r = block_blob_service.get_blob_to_text(container_name, b.name)
for i in r.content.split("\n"):
print(i)
I am not sure how huge your huge is, but for very large files > 200MB or so I would use a streaming approach. The call get_blob_to_text downloads the entire file in one go and places it all in memory. Using get_blob_to_stream allows you to read line by line and process individually, with only the current line and your working set in memory. This is very fast and very memory efficient. We use a similar approach to split 1GB files into smaller files. 1GB takes a couple of minutes to process.
Keep in mind that depending on your function app service plan the maximum execution time is 5 mins by default (you can increase this to 10 minutes in the hosts.json). Also, on consumption plan, you are limited to 1.5 GB memory on each function service (not per function - for all functions in your function PaaS). So be aware of these limits.
From the docs:
get_blob_to_stream(container_name, blob_name, stream, snapshot=None, start_range=None, end_range=None, validate_content=False, progress_callback=None, max_connections=2, lease_id=None, if_modified_since=None, if_unmodified_since=None, if_match=None, if_none_match=None, timeout=None)
Here is a good read on the topic
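As a rough sketch of that streaming idea (reusing the block_blob_service and container_name from the question; "huge.csv" and process() are placeholders for your own blob name and per-line handler; chunk boundaries rarely fall on line endings, so any partial last line is carried into the next chunk):

import io

CHUNK = 4 * 1024 * 1024  # read 4 MB at a time
offset = 0
leftover = b""
blob_size = block_blob_service.get_blob_properties(container_name, "huge.csv").properties.content_length

while offset < blob_size:
    stream = io.BytesIO()
    end = min(offset + CHUNK, blob_size) - 1
    block_blob_service.get_blob_to_stream(container_name, "huge.csv", stream,
                                          start_range=offset, end_range=end)
    data = leftover + stream.getvalue()
    lines = data.split(b"\n")
    leftover = lines.pop()              # possibly incomplete last line
    for line in lines:
        process(line.decode("utf-8"))   # process() is your per-line handler
    offset = end + 1

if leftover:
    process(leftover.decode("utf-8"))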
After reading other websites and modifying some of the code from the link above, I ended up with the following:
import io
import datetime
from azure.storage.blob import BlockBlobService

acc_name = 'myaccount'
acc_key = 'my key'
container = 'storeai'
blob = "orderingai2.csv"

block_blob_service = BlockBlobService(account_name=acc_name, account_key=acc_key)
props = block_blob_service.get_blob_properties(container, blob)
blob_size = int(props.properties.content_length)
index = 0
chunk_size = 104858  # ~0.1 MB; don't make this too big or you will get a memory error
output = io.BytesIO()

def worker(data):
    print(data)

while index < blob_size:
    now_chunk = datetime.datetime.now()
    block_blob_service.get_blob_to_stream(container, blob, stream=output,
                                          start_range=index,
                                          end_range=index + chunk_size - 1,
                                          max_connections=50)
    if output is None:
        continue
    output.seek(index)
    data = output.read()
    length = len(data)
    index += length
    if length > 0:
        worker(data)
        if length < chunk_size:
            break
    else:
        break

Optimize parsing of GB sized files in parallel

I have several compressed files with sizes on the order of 2GB compressed. The beginning of each file has a set of headers which I parse to extract a list of ~4,000,000 pointers (pointers).
For each pair of pointers (pointers[i], pointers[i+1]) for 0 <= i < len(pointers) - 1, I
seek to pointers[i]
read pointers[i+1] - pointers[i] bytes
decompress it
do a single-pass operation on that data and update a dictionary with what I find.
The issue is, I can only process roughly 30 pointer pairs a second using a single Python process, which means each file takes more than a day to get through.
Assuming splitting up the pointers list among multiple processes doesn't hurt performance (due to each process looking at the same file, though different non-overlapping parts), how can I use multiprocessing to speed this up?
My single threaded operation looks like this:
def search_clusters(pointers, filepath, automaton, counter):

    def _decompress_lzma(f, pointer, chunk_size=2**14):
        # skipping over this
        ...
        return uncompressed_buffer

    first_pointer, last_pointer = pointers[0], pointers[-1]
    with open(filepath, 'rb') as fh:
        fh.seek(first_pointer)
        f = StringIO(fh.read(last_pointer - first_pointer))

    for pointer1, pointer2 in zip(pointers, pointers[1:]):
        size = pointer2 - pointer1
        f.seek(pointer1 - first_pointer)
        buffer = _decompress_lzma(f, 0)
        # skipping details, ultimately the counter dict is
        # modified passing the uncompressed buffer through
        # an aho corasick automaton
        counter = update_counter_with_buffer(buffer, automaton, counter)

    return counter

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers

counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()

search_clusters(pointers, infile, automaton, counter)
I tried changing this to use multiprocessing.Pool:
from itertools import repeat, izip
import logging
import multiprocessing

logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)

def chunked(pointers, chunksize=1024):
    for i in range(0, len(pointers), chunksize):
        yield list(pointers[i:i + chunksize + 1])

def search_wrapper(args):
    return search_clusters(*args)

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers

counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()

map_args = izip(chunked(pointers), repeat(infile),
                repeat(automaton.copy()), repeat(counter.copy()))

pool = multiprocessing.Pool(20)
results = pool.map(search_wrapper, map_args)
pool.close()
pool.join()
but after a little while of processing, I get the following message and the script just hangs there with no further output:
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-20] child process calling self.run()
However, if I run with a serialized version of map without multiprocessing, things run just fine:
map(search_wrapper, map_args)
Any advice on how to change my multiprocessing code so it doesn't hang? Is it even a good idea to attempt to use multiple processes to read the same file?
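No answer is recorded here, but one commonly suggested pattern for this kind of workload, sketched below and not a verified fix, is to stop shipping the large automaton and counter through map_args and instead build them once per worker with a Pool initializer, letting each worker open its own file handle so only the small pointer chunks are pickled between processes. A hedged sketch reusing the names from the question:

import multiprocessing

_worker_state = {}

def init_worker(filepath):
    # Runs once in each worker process: private filepath and automaton
    _worker_state['filepath'] = filepath
    _worker_state['automaton'] = load_automaton()

def search_chunk(pointer_chunk):
    counter = load_counter_dict()
    return search_clusters(pointer_chunk, _worker_state['filepath'],
                           _worker_state['automaton'], counter)

pool = multiprocessing.Pool(20, initializer=init_worker, initargs=(infile,))
partial_counters = pool.map(search_chunk, chunked(pointers))
pool.close()
pool.join()

# Merge the per-chunk Counters into one
total = load_counter_dict()
for c in partial_counters:
    total.update(c)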

What is the difference between numpy.save( ) and joblib.dump( ) in Python?

I save a lot of offline models/matrices/arrays in Python and came across these functions. Can somebody help me by listing the pros and cons of numpy.save() and joblib.dump()?
Here are the critical sections of code from joblib that should shed some light.
def _write_array(self, array, filename):
    if not self.compress:
        self.np.save(filename, array)
        container = NDArrayWrapper(os.path.basename(filename),
                                   type(array))
    else:
        filename += '.z'
        # Efficient compressed storage:
        # The meta data is stored in the container, and the core
        # numerics in a z-file
        _, init_args, state = array.__reduce__()
        # the last entry of 'state' is the data itself
        zfile = open(filename, 'wb')
        write_zfile(zfile, state[-1],
                    compress=self.compress)
        zfile.close()
        state = state[:-1]
        container = ZNDArrayWrapper(os.path.basename(filename),
                                    init_args, state)
    return container, filename
Basically, joblib.dump can optionally compress an array: it either stores the array to disk with numpy.save, or (with compression) writes it to a compressed z-file. In addition, joblib.dump stores an NDArrayWrapper (or ZNDArrayWrapper for compression), a lightweight object that records the name of the save/zip file holding the array contents, along with the subclass of the array.
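For day-to-day use the two look like this; a minimal sketch, with the filenames being arbitrary placeholders:

import numpy as np
import joblib

arr = np.random.rand(1000, 1000)

# Plain NumPy: one uncompressed .npy file per array
np.save('arr.npy', arr)
arr_back = np.load('arr.npy')

# joblib: handles arbitrary Python objects (dicts of arrays, models, ...)
# and can compress; compress=3 trades save speed for smaller files
joblib.dump({'weights': arr, 'name': 'model-v1'}, 'model.joblib', compress=3)
obj_back = joblib.load('model.joblib')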
