I will get a HUGE CSV file as a blob in Azure, and I need to parse it line by line in an Azure Function.
I am reading each of the blobs in my container and getting it as a string, but I think that loads everything into memory before I split it by newlines.
Is there a smarter way to do this?
account_name = "myaccount"
container_name = "test"
block_blob_service = BlockBlobService(account_name=account_name, account_key="mykey")
a = block_blob_service.get_container_properties(container_name)
generator = block_blob_service.list_blobs(container_name)
for b in generator:
    r = block_blob_service.get_blob_to_text(container_name, b.name)
    for i in r.content.split("\n"):
        print(i)
I am not sure how huge your huge is, but for very large files (say, over 200 MB or so) I would use a streaming approach. The call get_blob_to_text downloads the entire file in one go and holds it all in memory. Using get_blob_to_stream lets you read line by line and process each one individually, with only the current line and your working set in memory. This is very fast and very memory efficient. We use a similar approach to split 1 GB files into smaller files; 1 GB takes a couple of minutes to process.
Keep in mind that, depending on your Function App service plan, the maximum execution time is 5 minutes by default (you can increase this to 10 minutes in host.json). Also, on the Consumption plan, you are limited to 1.5 GB of memory per function app (not per function, but for all functions in your function app). So be aware of these limits.
From the docs:
get_blob_to_stream(container_name, blob_name, stream, snapshot=None, start_range=None, end_range=None, validate_content=False, progress_callback=None, max_connections=2, lease_id=None, if_modified_since=None, if_unmodified_since=None, if_match=None, if_none_match=None, timeout=None)
Here is a good read on the topic
After reading other posts and modifying some of the code from the link above, I ended up with this:
import io

from azure.storage.blob import BlockBlobService

acc_name = 'myaccount'
acc_key = 'my key'
container = 'storeai'
blob = "orderingai2.csv"

block_blob_service = BlockBlobService(account_name=acc_name, account_key=acc_key)
props = block_blob_service.get_blob_properties(container, blob)
blob_size = int(props.properties.content_length)
index = 0
chunk_size = 104858  # ~0.1 MB; don't make this too big or you will get memory errors
output = io.BytesIO()

def worker(data):
    print(data)

while index < blob_size:
    block_blob_service.get_blob_to_stream(container, blob, stream=output,
                                          start_range=index,
                                          end_range=index + chunk_size - 1,
                                          max_connections=50)
    output.seek(index)
    data = output.read()
    length = len(data)
    index += length
    if length > 0:
        worker(data)
        if length < chunk_size:  # short read means this was the last chunk
            break
    else:
        break
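One thing the chunked loop above glosses over is that a chunk boundary can cut a CSV line in half, so worker(data) may receive partial lines. Below is a minimal sketch (my addition, not from the answer) that buffers the partial tail so only complete lines are emitted; any iterable of byte chunks can serve as the source, e.g. successive ranged get_blob_to_stream reads:

```python
def lines_from_chunks(chunks):
    """Yield complete lines from an iterable of byte chunks,
    carrying any partial trailing line over to the next chunk."""
    tail = b''
    for chunk in chunks:
        buf = tail + chunk
        *complete, tail = buf.split(b'\n')
        for line in complete:
            yield line
    if tail:  # final line without a trailing newline
        yield tail

# Example with an in-memory stand-in for ranged blob downloads:
data = b"row1,a,b\nrow2,c,d\nrow3,e,f\n"
chunks = [data[i:i + 7] for i in range(0, len(data), 7)]  # 7-byte "downloads"
print(list(lines_from_chunks(chunks)))  # → [b'row1,a,b', b'row2,c,d', b'row3,e,f']
```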
I have a function below for generating the rows of a huge text file.
def generate_content(n):
    for _ in range(n):
        yield 'xxx'
Instead of saving the file to disk and then uploading it to S3, is there any way to save the data directly to S3?
One thing to mention is that the data could be so huge that I don't have enough disk space or memory to hold it.
boto3 needs a file, a bytes array, or a file-like object to upload an object to S3. Of those, the only one you can reasonably use that doesn't require the entire contents of the object in memory or on disk is the file-like object, using a custom file object helper to satisfy the read requests.
Basically, you can call into your generator to satisfy the requests to read(), and boto3 will take care of creating the object for you:
import boto3

def generate_content(n):
    for i in range(n):
        yield 'xxx'

# Convert a generator that returns a series of strings into an
# object that implements read() much like a file object does.
class GenToBytes:
    def __init__(self, generator):
        self._generator = generator
        self._buffers = []
        self._bytes_avail = 0
        self._at_end = False

    # Emulate a file object's read
    def read(self, to_read=1048576):
        # Call the generator until enough data is buffered to satisfy the read
        while not self._at_end and self._bytes_avail < to_read:
            try:
                row = next(self._generator).encode("utf-8")
                self._bytes_avail += len(row)
                self._buffers.append(row)
            except StopIteration:
                # We're all done reading
                self._at_end = True
        if not self._buffers:
            return b''  # nothing left: signal EOF
        if len(self._buffers) > 1:
            # We have more than one pending buffer, concat them together
            self._buffers = [b''.join(self._buffers)]
        # Pull out the requested data, and store the rest
        ret, self._buffers = self._buffers[0][:to_read], [self._buffers[0][to_read:]]
        self._bytes_avail -= len(ret)
        return ret

s3 = boto3.client('s3')
generator = generate_content(100)  # Generate 100 rows
s3.upload_fileobj(GenToBytes(generator), bucket, key)
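Another way to get a file-like object from a generator (my sketch, not from the answer above) is to subclass io.RawIOBase and wrap it in io.BufferedReader: Python then derives read(), readline(), and iteration from a single readinto() method, and the result can be handed to upload_fileobj the same way:

```python
import io

class GeneratorStream(io.RawIOBase):
    """Expose a generator of strings as a readable binary stream."""
    def __init__(self, generator):
        self._iter = (s.encode("utf-8") for s in generator)
        self._leftover = b''

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer with at most len(b) bytes.
        try:
            chunk = self._leftover or next(self._iter)
        except StopIteration:
            return 0  # EOF
        n = min(len(b), len(chunk))
        b[:n] = chunk[:n]
        self._leftover = chunk[n:]
        return n

def generate_content(n):
    for _ in range(n):
        yield 'xxx\n'

stream = io.BufferedReader(GeneratorStream(generate_content(3)))
print(stream.read())  # → b'xxx\nxxx\nxxx\n'
```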
I'm looking for a data file structure that enables fast reading of random data samples for deep learning, and have been experimenting with lmdb today. However, one thing that seems surprising to me is how inefficiently it seems to store the data.
I have an ASCII file that is around 120 GB with gene sequences.
I would have expected to be able to fit this data in an lmdb database of roughly the same size, or perhaps even a bit smaller, since ASCII is a highly inefficient storage method.
However, what I'm seeing seems to suggest that I need around 350 GB to store this data in an lmdb file, and I simply don't understand that.
Am I not using some setting correctly, or what exactly am I doing wrong here?
import time

import lmdb
import pyarrow as pa

def dumps_pyarrow(obj):
    """
    Serialize an object.
    Returns:
        Implementation-dependent bytes-like object
    """
    return pa.serialize(obj).to_buffer()

t0 = time.time()
filepath = './../../Uniparc/uniparc_active/uniparc_active.fasta'
output_file = './../data/out_lmdb.fasta'

write_freq = 100000
start_line = 2
nprot = 0

db = lmdb.open(output_file, map_size=1e9, readonly=False,
               meminit=False, map_async=True)
txn = db.begin(write=True)
with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    protein = ''
    while line:
        if cnt >= start_line:
            if line[0] == '>':  # Old protein finished, new protein starts on the next line
                txn.put(u'{}'.format(nprot).encode('ascii'), dumps_pyarrow(protein))
                nprot += 1
                if nprot % write_freq == 0:
                    t1 = time.time()
                    print("committing... nprot={}, time={:2.2f}".format(nprot, t1 - t0))
                    txn.commit()
                    txn = db.begin(write=True)
                    line_checkpoint = cnt
                protein = ''
            else:
                protein += line.strip()
        line = fp.readline()
        cnt += 1

txn.commit()
keys = [u'{}'.format(k).encode('ascii') for k in range(nprot + 1)]
with db.begin(write=True) as txn:
    txn.put(b'__keys__', dumps_pyarrow(keys))
    txn.put(b'__len__', dumps_pyarrow(len(keys)))

print("Flushing database ...")
db.sync()
db.close()
t2 = time.time()
print("All done, time taken {:2.2f}s".format(t2 - t0))
Edit:
Some additional information about the data:
In the 120 GB file the data is structured like this (Here I am showing the first 2 proteins):
>UPI00001E0F7B status=inactive
YPRSRSQQQGHHNAAQQAHHPYQLQHSASTVSHHPHAHGPPSQGGPGGPGPPHGGHPHHP
HHGGAGSGGGSGPGSHGGQPHHQKPRRTASQRIRAATAARKLHFVFDPAGRLCYYWSMVV
SMAFLYNFWVIIYRFAFQEINRRTIAIWFCLDYLSDFLYLIDILFHFRTGYLEDGVLQTD
ALKLRTHYMNSTIFYIDCLCLLPLDFLYLSIGFNSILRSFRLVKIYRFWAFMDRTERHTN
YPNLFRSTALIHYLLVIFHWNGCLYHIIHKNNGFGSRNWVYHDSESADVVKQYLQSYYWC
TLALTTIGDLPKPRSKGEYVFVILQLLFGLMLFATVLGHVANIVTSVSAARKEFQGESNL
RRQWVKVVWSAPASG
>UPI00001E0FBF status=active
MWRAQPSLWIWWIFLILVPSIRAVYEDYRLPRSVEPLHYNLRILTHLNSTDQRFEGSVTI
DLLARETTKNITLHAAYLKIDENRTSVVSGQEKFGVNRIEVNEVHNFYILHLGRELVKDQ
IYKLEMHFKAGLNDSQSGYYKSNYTDIVTKEVHHLAVTQFSPTFARQAFPCFDEPSWKAT
FNITLGYHKKYMGLSGMPVLRCQDHDSLTNYVWCDHDTLLRTSTYLVAFAVHDLENAATE
ESKTSNRVIFRNWMQPKLLGQEMISMEIAPKLLSFYENLFQINFPLAKVDQLTVPTHRFT
AMENWGLVTYNEERLPQNQGDYPQKQKDSTAFTVAHEYAHQWFGNLVTMNWWNDLWLKEG
PSTYFGYLALDSLQPEWRRGERFISRDLANFFSKDSNATVPAISKDVKNPAEVLGQFTEY
VYEKGSLTIRMLHKLVGEEAFFHGIRSFLERFSFGNVAQADLWNSLQMAALKNQVISSDF
NLSRAMDSWTLQGGYPLVTLIRNYKTGEVTLNQSRFFQEHGIEKASSCWWVPLRFVRQNL
PDFNQTTPQFWLECPLNTKVLKLPDHLSTDEWVILNPQVATIFRVNYDEHNWRLIIESLR
NDPNSGGIHKLNKAQLLDDLMALAAVRLHKYDKAFDLLEYLKKEQDFLPWQRAIGILNRL
GALLNVAEANKFKNYMQKLLLPLYNRFPKLSGIREAKPAIKDIPFAHFAYSQACRYHVAD
CTDQAKILAITHRTEGQLELPSDFQKVAYCSLLDEGGDAEFLEVFGLFQNSTNGSQRRIL
ASALGCVRNFGNFEQFLNYTLESDEKLLGDCYMLAVKSALNREPLVSPTANYIISHAKKL
GEKFKKKELTGLLLSLAQNLRSTEEIDRLKAQLEDLKEFEEPLKKALYQGKMNQKWQKDC
SSDFIEAIEKHL
When I store the data in the database I concatenate all the lines making up each protein and store those as a single data point. I ignore the header line (the line starting with >).
The reason I believe the data should be smaller when stored in the database is that I expect it to be stored in some binary form, which I would expect to be more compact, though I will admit I don't know whether that is how it actually works (for comparison, the data is only 70 GB when compressed/zipped).
I would be okay with the data taking up a similar amount of space in lmdb format, but I don't understand why it should take up almost three times the space it does in ASCII format.
LMDB does not apply any compression to data: 1 byte in memory is 1 byte on disk.
But its internals can amplify the space required:
data is handled in pages (generally 4 KB)
each "record" needs to store additional b-tree structures (for the key and the data) plus the count of pages occupied by the key and data
Bottom line: LMDB is designed for FAST data access, not to save space.
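Since LMDB stores values verbatim, one workaround (my suggestion, not part of the answer above) is to compress each value yourself before txn.put and decompress after txn.get. Amino-acid sequences over a 20-letter alphabet compress well with zlib:

```python
import zlib

# A stand-in protein sequence; real UniParc entries look similar.
protein = "YPRSRSQQQGHHNAAQQAHHPYQLQHSASTVSHHPHAHGPPSQGGPGGPGPPHGGHPHHP" * 20

raw = protein.encode('ascii')
packed = zlib.compress(raw, 6)
print(len(raw), len(packed))  # the compressed value is much smaller

# Store/load would then look like this (sketch; txn as in the question's code):
#   txn.put(key, zlib.compress(value))
#   value = zlib.decompress(txn.get(key))
assert zlib.decompress(packed) == raw  # round-trips losslessly
```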
I have a function which I want to compute in parallel using multiprocessing. The function takes an argument, but also loads subsets from two very large dataframes which have already been loaded into memory (one is about 1 GB and the other is just over 6 GB).
largeDF1 = pd.read_csv(directory + 'name1.csv')
largeDF2 = pd.read_csv(directory + 'name2.csv')

def f(x):
    load_content1 = largeDF1.loc[largeDF1['FirstRow'] == x]
    load_content2 = largeDF2.loc[largeDF2['FirstRow'] == x]
    # some computation happens here
    new_data.to_csv(directory + 'output.csv', index=False)

def main():
    multiprocessing.set_start_method('spawn', force=True)
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    input = input_data['col']
    pool.map_async(f, input)
    pool.close()
    pool.join()
The problem is that the files are too big, and when I run the function over multiple cores I get a memory issue. I want to know if there is a way for the loaded files to be shared across all processes.
I have tried multiprocessing.Manager() but could not get it to work. Any help is appreciated. Thanks.
If you were running this on a UNIX-like system (which uses the fork start method by default), the data would be shared out of the box. Most operating systems use copy-on-write for memory pages, so even if you fork a process several times, the processes would share most of the memory pages that contain the dataframes, as long as you don't modify those dataframes.
But when using the spawn start method, each worker process has to load the dataframe. I'm not sure if the OS is smart enough in that case to share the memory pages, or indeed whether these spawned processes would all have the same memory layout.
The only portable solution I can think of is to leave the data on disk and use mmap in the workers to map it into memory read-only. That way the OS would notice that multiple processes are mapping the same file, and it would only load one copy.
The downside is that the data would be in memory in on-disk CSV format, which makes reading data from it (without making a copy!) less convenient. So you might want to prepare the data beforehand into a form that is easier to use, e.g. converting the data from 'FirstRow' into a binary file of float or double values that you can iterate over with struct.iter_unpack.
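To make that concrete, here is a minimal sketch (my illustration; the values and temp file are made up): pack a numeric column as native doubles once, then have each worker map the binary file read-only and iterate over it with struct.iter_unpack without copying the whole array:

```python
import mmap
import struct
import tempfile

# One-time preparation: pack a numeric column as native doubles.
values = [1.5, 2.25, 3.0, 4.75]
with tempfile.NamedTemporaryFile(delete=False, suffix='.bin') as f:
    path = f.name
    f.write(struct.pack('%dd' % len(values), *values))

# In each worker: map the file read-only; the OS shares the pages
# between all processes mapping the same file.
with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        total = sum(v for (v,) in struct.iter_unpack('d', mm))

print(total)  # → 11.5
```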
The function below (from my statusline script) uses mmap to count the number of messages in a mailbox file.
import mmap
import os

def mail(storage, mboxname):
    """
    Report unread mail.

    Arguments:
        storage: a dict with keys (unread, time, size) from the previous call,
            or an empty dict. This dict will be *modified* by this function.
        mboxname (str): name of the mailbox to read.

    Returns: A string to display.
    """
    stats = os.stat(mboxname)
    if stats.st_size == 0:
        return 'Mail: 0'
    # When mutt modifies the mailbox, it seems to only change the
    # ctime, not the mtime! This is probably related to how mutt saves the
    # file. See also stat(2).
    newtime = stats.st_ctime
    newsize = stats.st_size
    if not storage or newtime > storage['time'] or newsize != storage['size']:
        with open(mboxname) as mbox:
            with mmap.mmap(mbox.fileno(), 0, prot=mmap.PROT_READ) as mm:
                start, total = 0, 1  # First mail is not found; it starts on the first line...
                while True:
                    rv = mm.find(b'\n\nFrom ', start)
                    if rv == -1:
                        break
                    total += 1
                    start = rv + 7
                start, read = 0, 0
                while True:
                    rv = mm.find(b'\nStatus: R', start)
                    if rv == -1:
                        break
                    read += 1
                    start = rv + 10
        unread = total - read
        # Save values for the next run.
        storage['unread'], storage['time'], storage['size'] = unread, newtime, newsize
    else:
        unread = storage['unread']
    return f'Mail: {unread}'
In this case I used mmap because it was 4x faster than just reading the file. See normal reading versus using mmap.
With the following code, I'm seeing longer and longer execution times as I increase the starting row in islice. For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s. Why does this happen and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and make some calculations.
import csv
import itertools
from collections import deque
import time

my_queue = deque()
start_row = 500004
stop_row = start_row + 50000

with open('test.csv', 'rb') as fin:
    # load into csv's reader
    csv_f = csv.reader(fin)
    # start logging time for performance
    start = time.time()
    for row in itertools.islice(csv_f, start_row, stop_row):
        my_queue.append(float(row[4]) * float(row[10]))
    # stop logging time
    end = time.time()
    # display performance
    print "Initial queue populating time: %.2f" % (end - start)
For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s
That is islice being intelligent. Or lazy, depending on which term you prefer.
Thing is, files are "just" strings of bytes on your hard drive. They don't have any internal organization. \n is just another set of bytes in that long, long string. There is no way to access any particular line without looking at all of the information before it (unless your lines are of the exact same length, in which case you can use file.seek).
Line 4? Finding line 4 is fast: your computer just needs to find three \n. Line 500004? Your computer has to read through the file until it finds 500003 \n. No way around it, and if someone tells you otherwise, they either have some sort of quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.
As for what you can do about it: Try to be smart when trying to grab lines to iterate over. Smart, and lazy. Arrange your requests so you're only iterating through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)
In Python:
lines_I_want = [(start1, stop1), (start2, stop2), ...]

with open(filename) as f:
    for i, line in enumerate(f):
        if i >= lines_I_want[0][0]:
            if i >= lines_I_want[0][1]:
                lines_I_want.pop(0)
                if not lines_I_want:  # list is empty
                    break
            else:
                pass  # line is one I want. Do something
And if you have any control over making that file, make every line the same length so you can seek. Or use a database.
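A quick sketch of the fixed-length idea (my illustration, using io.BytesIO as a stand-in for a file on disk): if every line is padded to the same width, line n starts at byte n * width, so seeking to it is O(1) and no scanning for \n is needed:

```python
import io

WIDTH = 16  # bytes per line, including the newline

def write_fixed(f, rows):
    """Write each row padded to exactly WIDTH bytes."""
    for row in rows:
        f.write(row.ljust(WIDTH - 1).encode('ascii') + b'\n')

def read_line(f, n):
    """Jump straight to line n, no scanning required."""
    f.seek(n * WIDTH)
    return f.read(WIDTH).rstrip().decode('ascii')

buf = io.BytesIO()  # stands in for a real file on disk
write_fixed(buf, ['row%d' % i for i in range(1000)])
print(read_line(buf, 500))  # → row500
```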
The problem with using islice() for what you're doing is that it iterates through all the lines before the first one you want before returning anything. Obviously, the larger the starting row, the longer this will take. Another issue is that you're using a csv.reader to read these lines, which likely incurs unnecessary overhead, since one line of the csv file is usually one row of it. The only time that's not true is when the csv file has string fields containing embedded newline characters, which in my experience is uncommon.
If that is a valid assumption for your data, it would likely be much faster to first index the file, building a table of (filename, offset, number-of-rows) tuples indicating the approximately equally-sized logical chunks of lines/rows in the file. With that, you can process them relatively quickly by first seeking to the starting offset and then reading the specified number of csv rows from that point on.
Another advantage of this approach is that it would allow you to process the chunks in parallel, which I suspect is the real problem you're trying to solve, based on a previous question of yours. So, even though you haven't mentioned multiprocessing here, the following has been written to be compatible with doing that, if that's the case.
import csv
from itertools import islice
import os
import sys

def open_binary_mode(filename, mode='r'):
    """ Open a file the proper way (depends on Python version). """
    kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2 else
              dict(mode=mode, newline=''))
    return open(filename, **kwargs)

def split(infilename, num_chunks):
    infile_size = os.path.getsize(infilename)
    chunk_size = infile_size // num_chunks
    offset = 0
    num_rows = 0
    bytes_read = 0
    chunks = []
    with open_binary_mode(infilename, 'r') as infile:
        for _ in range(num_chunks):
            while bytes_read < chunk_size:
                try:
                    bytes_read += len(next(infile))
                    num_rows += 1
                except StopIteration:  # end of infile
                    break
            chunks.append((infilename, offset, num_rows))
            offset += bytes_read
            num_rows = 0
            bytes_read = 0
    return chunks

chunks = split('sample_simple.csv', num_chunks=4)

for filename, offset, rows in chunks:
    print('processing: {} rows starting at offset {}'.format(rows, offset))
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            print(row)
Is there a limit to memory for Python? I've been using a Python script to calculate the average values from a file which is a minimum of 150 MB big.
Depending on the size of the file, I sometimes encounter a MemoryError.
Can more memory be assigned to Python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A1_B1, file_A2_B2, file_A1_B2, file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count += 1

    for k in range(0, count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4
    for o in range(0, length):
        total = 0
        for p in range(0, count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total / count
        print average
        if print_counter == 4:
            file_write.write(str(average) + '\n')
            print_counter = 0
        print_counter += 1
file_write.write('\n')
(This is my third answer, because I misunderstood what your code was doing in my original and then made a small but crucial mistake in my second; hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications over the years to improve its implementation, most not too major, so that if folks use it as a template it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical RAM and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using that much may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total by the count of lines read. Once that is done, the averages can be printed out, and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to make it understandable.
try:
    from itertools import izip_longest
except ImportError:  # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))
        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]
        for print_counter, average in enumerate(averages):
            print('  {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average) + '\n')

file_write.write('\n')
file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()), which will of course fail if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better to iterate over each line:
for current_line in u:
    do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things, like first counting all the items in a list and then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much more easily.
This is one of the advantages of high-level languages like Python (as opposed to C, where you do have to do these housekeeping tasks yourself): let Python handle iteration for you, and only collect in memory what you actually need at any given time.
Also, since it seems you're processing TSV files (tab-separated values), you should take a look at the csv module, which will handle all the splitting, removal of \ns, etc. for you.
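For example (my made-up sample data), csv.reader with delimiter='\t' splits each row into fields and never leaves a trailing \n in them:

```python
import csv
import io

# Stand-in for one of the input files.
tsv_data = io.StringIO("1.0\t2.0\t3.0\n4.0\t5.0\t6.0\n")

rows = [[float(field) for field in row]
        for row in csv.reader(tsv_data, delimiter='\t')]
print(rows)  # → [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```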
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On Jython 2.5 it crashes earlier:
239000 [MiB]
Probably I could configure Jython to use more memory (it uses the limits from the JVM).
Test app:
import sys

sl = []
i = 0
# some magic: 1024 - overhead of the string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read it line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that use several gigabytes of memory. Most likely, your script is actually using more memory than is available on the machine you're running it on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do the necessary calculations
Not only are you reading the whole of each file into memory, you also laboriously replicate that information in a table called list_of_lines.
You have a secondary problem: your choice of variable names severely obfuscates what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")  # not used

files = [file_A1_B1, file_A2_B2, file_A1_B2, file_A2_B1]

for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n')  # why?
        table.append(values)

    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average) + '\n')
            print_counter = 0
        print_counter += 1
file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages, and (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer-loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n')  # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average) + '\n')
            print_counter = 0
        print_counter += 1