I have a big BLF file, blf_file.blf, and an associated DBC file, dbc_file.dbc. I need to read and decode all the messages and store them in a list. For this I use the python-can and cantools libraries:
import can
import cantools

decoded_mess = []
db = cantools.db.load_file('dbc_file.dbc')
with can.BLFReader('blf_file.blf') as can_log:
    for msg in can_log:
        decoded_mess.append(
            db.decode_message(msg.arbitration_id, msg.data)
        )
However, for my BLF files (> 100 MB), this takes up to 5 minutes.
Is there a way to speed this up? In the end, I want to store every signal in a separate list, so a list comprehension is not an option.
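For reference, a minimal sketch of what "every signal in a separate list" could look like, assuming each decoded message is a dict of signal name to value and that frames not described in the DBC can simply be skipped; this only illustrates the target data layout, not a performance fix:

from collections import defaultdict

import can
import cantools

db = cantools.db.load_file('dbc_file.dbc')
signals = defaultdict(list)  # one list per signal name

with can.BLFReader('blf_file.blf') as can_log:
    for msg in can_log:
        try:
            decoded = db.decode_message(msg.arbitration_id, msg.data)
        except KeyError:
            continue  # frame ID not described in the DBC (assumption: decode raises KeyError here)
        for name, value in decoded.items():
            signals[name].append(value)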
I have a script that loops through ~335k filenames, opens the FITS tables from those filenames, performs a few operations on the tables and writes the results to a file. At the beginning the loop runs relatively fast, but over time it consumes more and more RAM (and CPU resources, I guess) and the script also gets slower. I would like to know how I can improve the performance / make the code quicker. For example: is there a better way to write to the output file (open the output file once and do everything inside that block, versus opening the file every time I want to write)? Is there a better way to loop? Can I dump memory that I don't need anymore? (See also the sketch after the script below.)
My script looks like this:
# spectral is a package for manipulation of spectral data
from spectral import *
import numpy as np
from astropy.table import Table  # assuming astropy's Table provides Table.read(..., format='fits')

# list_of_fnames and df_jpas are assumed to be defined earlier in the script.
# I use this dictionary to cache functions that I don't want to regenerate each time I need them.
# Regenerating them every time would be more time consuming, I figured out.
lam_resample_dic = {}

with open("/home/bla/Downloads/output.txt", "ab") as f:
    for ind, fname in enumerate(list_of_fnames):
        data_s = Table.read('/home/nestor/Downloads/all_eBoss_QSO/' + fname, format='fits')
        # lam_str_identifier is just the dict key I need for finding the corresponding BandResampler function below
        lam_str_identifier = ''.join([str(x) for x in data_s['LOGLAM'].data.astype(str)])
        if lam_str_identifier not in lam_resample_dic:
            # BandResampler is the function I avoid recreating every time.
            # I build it only if necessary, i.e. when lam_str_identifier indicates a new, unique set of data.
            resample = BandResampler(centers1=10**data_s['LOGLAM'], centers2=df_jpas["Filter.wavelength"].values, fwhm2=df_jpas["Filter.width"].values)
            lam_resample_dic[lam_str_identifier] = resample
            photo_spec = np.around(resample(data_s['FLUX']), 4)
        else:
            photo_spec = np.around(lam_resample_dic[lam_str_identifier](data_s['FLUX']), 4)
        np.savetxt(f, [photo_spec], delimiter=',', fmt='%1.4f')
        # this is just to keep track of the progress of the loop
        if ind % 1000 == 0:
            print('num of files processed so far:', ind)
Thanks for any suggestions!
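One hedged idea for the writing and memory questions: batch the rows and call np.savetxt once per batch instead of once per file, and force a garbage collection occasionally. This is only a sketch; process_one_file is a hypothetical helper wrapping the resampling logic above, and it assumes all photo_spec rows have the same length:

import gc
import numpy as np

BATCH = 1000  # hypothetical batch size

with open("/home/bla/Downloads/output.txt", "ab") as f:
    batch = []
    for ind, fname in enumerate(list_of_fnames):
        photo_spec = process_one_file(fname)  # hypothetical helper: read, resample, round
        batch.append(photo_spec)
        if len(batch) == BATCH:
            np.savetxt(f, batch, delimiter=',', fmt='%1.4f')  # one write call per batch
            batch.clear()
        if ind % 1000 == 0:
            print('num of files processed so far:', ind)
            gc.collect()  # usually unnecessary, but this is the explicit way to "dump" unreachable memory
    if batch:
        np.savetxt(f, batch, delimiter=',', fmt='%1.4f')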
I have 1080 .txt files, each of which contains over 100k rows of values in three columns. I have to compute the average of the first column in each of these .txt files.
Any method that loops over the files is proving too slow, since numpy.loadtxt only loads one file at a time.
The kicker is that I have 38 of these folders on which I need to perform this operation, so 38*1080 files in total. Timing each numpy.loadtxt call with the time module gives me around 1.7 seconds per file, so the total time to run over all folders is over 21 hours, which seems like far too much.
So this has me wondering whether there is a way to perform multiple operations at once, by opening multiple txt files simultaneously and averaging their first columns. I also need to store those averages in the same order as the txt files, since the order is important.
Since I am a beginner, I'm not sure if this is even the fastest way. Thanks in advance.
import numpy as np
import glob

i = 0
while i < 39:
    source_directory = "something/" + str(i)  # go to the specific folder with that numbering
    hw_array = sorted(glob.glob(source_directory + "/data_*.txt"))  # read paths of the 1080 txt files
    velocity_array = np.zeros((30, 36, 3))
    for n, data_file in enumerate(hw_array):  # n is the index of the file within the folder
        x = 35 - int((n - 0.0001) / 30)  # position of the probe where the velocities are measured
        y = (30 - int(round(n % 30))) % 30
        velocity_column = np.loadtxt(data_file, usecols=(0))  # step that takes most time
        average_array = np.mean(velocity_column, axis=0)
        velocity_array[y, x, 0] = average_array
        velocity_array[y, x, 1] = y * (2 / 29)
        velocity_array[y, x, 2] = x * 0.5
    np.save(r"C:/Users/md101/Desktop/AE2/Project2/raw/" + "R29" + "/data" + "R29", velocity_array)  # save velocity array for use in later analysis
    i += 1
Python has pretty slow I/O, and most of your time is spent talking to the operating system and on the other costs associated with opening files.
Threading in Python is strange and only provides an improvement in certain situations; here is why it is good for your case. A Python thread only does work while it holds the GIL (the global interpreter lock, which is worth reading about). While a thread is waiting for something, like I/O, it releases the GIL to another thread. So one thread can operate on a file it has already loaded (averaging the first column) while another thread is blocked opening its file. A rough sketch of this idea follows.
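As an illustration of that idea (not the exact code proposed below), a thread pool over the file paths from the question might look like this; hw_array and the worker count of 8 are assumptions:

from concurrent.futures import ThreadPoolExecutor

import numpy as np

def first_column_mean(path):
    # load only the first column of one file and return its mean
    return np.loadtxt(path, usecols=0).mean()

# hw_array is the sorted list of file paths from the question's code
with ThreadPoolExecutor(max_workers=8) as ex:
    # map() yields results in input order, so the averages line up with hw_array
    averages = list(ex.map(first_column_mean, hw_array))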
It's completely possible to write a function that loads the files from a directory, spawn off one multiprocessing worker per directory, and get it done in close to 1/39th of the time it was taking. Or don't parallelize by directory but by file: queue up the work and have workers read from that work queue.
Some pseudocode:
import os
import multiprocessing

pool = multiprocessing.Pool()
workers = []
for d in os.listdir("."):
    for f in os.listdir(d):
        workers.append(pool.apply_async(counter_function_for_file, (os.path.join(d, f),)))
s = sum(worker.get() for worker in workers)
...and put your code for reading from the file in that counter_function_for_file(filename) function.
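A possible body for that function, under the assumption that "the operation" is the first-column average from the question; note that for averaging you would most likely collect the worker.get() results in order rather than summing them:

import numpy as np

def counter_function_for_file(filename):
    # load only the first column and return its mean
    return np.loadtxt(filename, usecols=0).mean()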
The Goal
I have a data generating process which creates one large TSV file on S3 (somewhere between 30 and 40 GB in size). Because of some data processing I want to do on it, it's easier to have it as many smaller files (~1 GB in size or smaller). Unfortunately I don't have much ability to change the original data generating process to partition the files at creation, so I'm trying to write a simple Lambda to do it for me; my attempt is below.
import json
import boto3
import codecs

def lambda_handler(event, context):
    s3 = boto3.client('s3')

    read_bucket_name = 'some-bucket'
    write_bucket_name = 'some-other-bucket'
    original_key = 'some-s3-file-key'

    obj = s3.get_object(Bucket=read_bucket_name, Key=original_key)

    lines = []
    line_count = 0
    file_count = 0
    MAX_LINE_COUNT = 500000

    def create_split_name(file_count):
        return f'{original_key}-{file_count}'

    def create_body(lines):
        return ''.join(lines)

    for ln in codecs.getreader('utf-8')(obj['Body']):
        if line_count > MAX_LINE_COUNT:
            key = create_split_name(file_count)
            s3.put_object(
                Bucket=write_bucket_name,
                Key=key,
                Body=create_body(lines)
            )
            lines = []
            line_count = 0
            file_count += 1
        lines.append(ln)
        line_count += 1

    # flush the remaining lines as the last split
    if len(lines) > 0:
        key = create_split_name(file_count)
        s3.put_object(
            Bucket=write_bucket_name,
            Key=key,
            Body=create_body(lines)
        )
        file_count += 1

    return {
        'statusCode': 200,
        'body': {'file_count': file_count}
    }
This functionally works, which is great, but the problem is that on sufficiently large files it can't finish within the 15 minute run window of an AWS Lambda. So my questions are these:
Can this code be optimized in any appreciable way to reduce run time (I'm not an expert on profiling Lambda code)?
Will porting this to a compiled language provide any real benefit to run time?
Are there other utilities within AWS that can solve this problem? (A quick note here: I know that I could spin up an EC2 server to do this for me, but ideally I'm looking for a serverless solution.)
UPDATE
Another option I have tried is to not split up the file, but to tell different Lambda jobs to simply read different parts of the same file using Range.
I can try to read a file by doing
obj = s3.get_object(Bucket='cradle-smorgasbord-drop', Key=key, Range=bytes_range)
lines = [line for line in codecs.getreader('utf-8')(obj['Body'])]
However, on an approximately 30 GB file I had bytes_range=0-49999999, which is only the first 50 MB, and the download is taking far longer than I would expect for that amount of data (I actually haven't even seen it finish yet).
To avoid hitting the limit of 15 minutes for the execution of AWS Lambda functions you have to ensure that you only read as much data from S3 as you can process in 15 minutes or less.
How much data from S3 you can process in 15 minutes or less depends on your function logic and the CPU and network performance of the AWS Lambda function. The available CPU performance of AWS Lambda functions scales with memory provided to the AWS Lambda function. From the AWS Lambda documentation:
Lambda allocates CPU power linearly in proportion to the amount of memory configured. At 1,792 MB, a function has the equivalent of one full vCPU (one vCPU-second of credits per second).
So as a first step you could try to increase the provided memory to see if that improves the amount of data your function can process in 15 minutes.
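If you manage the function with boto3, bumping the memory is a one-liner; the function name and value below are placeholders:

import boto3

lambda_client = boto3.client('lambda')
lambda_client.update_function_configuration(
    FunctionName='my-split-function',  # placeholder name
    MemorySize=3008,                   # MB; CPU scales with this value
)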
Increasing the CPU performance for AWS Lambda functions might already solve your problem for now, but it doesn't scale well in case you have to process larger files in future.
Fortunately there is a solution for that: When reading objects from S3 you don't have to read the whole object at once, but you can use range requests to only read a part of the object. To do that all you have to do is to specify the range you want to read when calling get_object(). From the boto3 documentation for get_object():
Range (string) -- Downloads the specified range bytes of an object. For more information about the HTTP Range header, go to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
In your case, instead of triggering your AWS Lambda function once per object in S3, you'd trigger it multiple times for the same object, but to process different chunks of that object. Depending on how you invoke your function you might need another AWS Lambda function to examine the size of the objects in S3 to process (using head_object()) and trigger your actual Lambda function once for each chunk of data.
While you need that additional chunking logic, you wouldn't need to split the read data in your original AWS Lambda function anymore, as you could simply ensure that each chunk has a size of 1 GB and that only the data belonging to the chunk is read, thanks to the range request. As you'd invoke a separate AWS Lambda function for each chunk, you'd also parallelize your currently sequential logic, resulting in faster execution.
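A sketch of such a coordinator, assuming the worker Lambda accepts a bucket, key, and byte range in its event payload (the function and field names here are made up):

import json
import boto3

CHUNK_SIZE = 1024 ** 3  # ~1 GB per worker invocation

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

def trigger_chunk_workers(bucket, key, worker_function_name):
    size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']
    start = 0
    while start < size:
        end = min(start + CHUNK_SIZE, size) - 1
        lambda_client.invoke(
            FunctionName=worker_function_name,
            InvocationType='Event',  # asynchronous, so the chunks run in parallel
            Payload=json.dumps({'bucket': bucket, 'key': key, 'range': f'bytes={start}-{end}'}),
        )
        start = end + 1

One detail the sketch glosses over: fixed byte ranges can cut a line in half, so the workers need some convention for handling the partial line at each chunk boundary.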
Finally, you could drastically decrease the amount of memory consumed by your AWS Lambda function by not reading all the data into memory, but using streaming instead.
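For example, the worker can iterate over the ranged response line by line instead of collecting all the lines into a list first; this is only a sketch reusing the codecs pattern from the question:

import codecs
import boto3

s3 = boto3.client('s3')

def stream_lines(bucket, key, byte_range):
    # yield one decoded line at a time; the whole chunk is never held in memory
    obj = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)
    for line in codecs.getreader('utf-8')(obj['Body']):
        yield line

# usage sketch:
# for line in stream_lines('some-bucket', 'some-s3-file-key', 'bytes=0-1073741823'):
#     process(line)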
I would like to understand what is happening with a MemoryError that seems to occur more or less randomly.
I'm running a Python 3 program under Docker on an Azure VM (2 CPUs, 7 GB RAM).
To keep it simple: the program deals with binary files that are read by a specific library (no problem there), then I merge them by pair of files and finally insert the data into a database.
The dataset that I get after the merge (and before the DB insert) is a Pandas dataframe that contains around ~2.8M rows and 36 columns.
For the insertion into the database, I'm using a REST API that obliges me to insert the file in chunks.
Before that, I'm transforming the dataframe into a StringIO buffer using this function:
import io
import logging

LOGGER = logging.getLogger(__name__)

class Utils:
    # static method from the Utils class
    @staticmethod
    def df_to_buffer(my_df):
        count_row, count_col = my_df.shape
        buffer = io.StringIO()  # creating an empty buffer
        my_df.to_csv(buffer, index=False)  # filling that buffer
        LOGGER.info('Current data contains %d rows and %d columns, for a total '
                    'buffer size of %d bytes.', count_row, count_col, buffer.tell())
        buffer.seek(0)  # set to the start of the stream
        return buffer
So in my "main" program the behaviour is:
# transform the dataframe to a StringIO buffer
file_data = Utils.df_to_buffer(file_df)

buffer_chunk_size = 32000000  # 32 MB

while True:
    data = file_data.read(buffer_chunk_size)
    if data:
        ...
        # do the insert stuff
        ...
    else:
        # whole file has been loaded
        break

# loop is over, close the buffer before processing a new file
file_data.close()
The problem:
Sometimes I am able to insert 2 or 3 files in a row, and sometimes a MemoryError occurs at a random moment (but always when it's about to insert a new file).
The error occurs at the first iteration of a file insert (never in the middle of a file). It specifically crashes on the line that does the chunked read, file_data.read(buffer_chunk_size).
I'm monitoring the memory during the process (using htop): it never goes higher than 5.5 GB, and in particular, when the crash occurs the process is only using around ~3.5 GB of memory...
Any information or advice would be appreciated,
thanks. :)
EDIT
I was able to debug and more or less identify the problem, but I have not solved it yet.
It occurs when I read the StringIO buffer by chunk. The data chunk increases RAM consumption a lot, as it is a big str containing 32,000,000 characters of the file.
I tried reducing it from 32000000 to 16000000. I was able to insert some files, but after some time the MemoryError occurred again... I'm trying 8000000 right now.
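One way to avoid both the full-size StringIO buffer and the huge read chunks is to serialize the dataframe in row slices and hand each slice to the API separately. A minimal sketch, assuming the API accepts the chunks independently; CHUNK_ROWS is a made-up value to tune against the API's size limit:

import io

CHUNK_ROWS = 200000  # hypothetical number of rows per chunk

def iter_csv_chunks(df, chunk_rows=CHUNK_ROWS):
    # yield CSV text for successive row slices instead of one huge buffer
    for start in range(0, len(df), chunk_rows):
        buf = io.StringIO()
        df.iloc[start:start + chunk_rows].to_csv(buf, index=False, header=(start == 0))
        yield buf.getvalue()

# usage sketch:
# for chunk_text in iter_csv_chunks(file_df):
#     ...  # do the insert stuff with chunk_text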
I have to read about 300 files to create an association using the following piece of code. Given the association, I then have to read them all into memory.
with util.open_input_file(f) as f_in:
    for l in f_in:
        w = l.split(',')
        # dfm is guaranteed to be unique for each line in the file
        dfm = dk.to_key((idx, i, int(w[0]), int(w[1])))
        cands = w[2].split(':')
        for cand in cands:
            tmp_data.setdefault(cand, []).append(dfm)
Then I need to write out the data structure above in this format:
k1, v1:v2,v3....
k2, v2:v5,v6...
I use the following code:
# Sort / join values.
cand2dfm_data = {}
for k, v in tmp_data.items():
    cand2dfm_data[k] = ':'.join(map(str, sorted(v, key=int)))
tmp_data = {}

# Write cand2dfm CSV file.
with util.open_output_file(cand2dfm_file) as f_out:
    for k in sorted(cand2dfm_data.keys()):
        f_out.write('%s,%s\n' % (k, cand2dfm_data[k]))
Since I have to process a significant number of files, I'm observing two problems:
The memory used to store tmp_data is very big. In my use case, processing 300 files, it uses 42 GB.
Writing out the CSV file takes a long time. This is because I'm calling write() for each item (about 2.2M of them). Furthermore, the output stream uses gzip compression to save disk space.
In my use case, the numbers are guaranteed to be 32-bit unsigned.
Question:
To achieve memory reduction, I think it will be better to use a 32-bit int to store the data. Should I use ctypes.c_int() to store the values in the dict() (right now they are strings) or is there a better way?
For speeding up the writing, should I write to a StringIO object and then dump that to a file or is there a better way?
Alternatively, maybe there is a better way to accomplish the above logic without reading everything in memory?
A few thoughts.
Currently you are duplicating the data multiple times in memory.
You load it for the first time into tmp_data, then copy everything into cand2dfm_data, and then create a list of keys by calling sorted(cand2dfm_data.keys()).
To reduce memory usage:
Get rid of tmp_data; parse and write your data directly into cand2dfm_data
Make cand2dfm_data a list of tuples, not a dict
Use cand2dfm_data.sort(...) instead of sorted(cand2dfm_data) to avoid creating a new list (see the sketch after this list)
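A minimal sketch of those three points together, assuming dfm is stored as an int so the tuple sort orders the values numerically; the parsing step itself is left out:

from itertools import groupby
from operator import itemgetter

cand2dfm_data = []  # list of (cand, dfm) tuples instead of a dict of lists

# while parsing the input files, append pairs directly:
# cand2dfm_data.append((cand, dfm))

def write_grouped(pairs, f_out):
    pairs.sort()  # in-place sort; no extra list as with sorted()
    for cand, group in groupby(pairs, key=itemgetter(0)):
        f_out.write('%s,%s\n' % (cand, ':'.join(str(dfm) for _, dfm in group)))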
To speed up processing:
Convert the keys into ints to improve sorting performance (this will reduce memory usage as well)
Write data to disk in chunks, like 100 or 500 or 1000 records in one go; this should improve I/O performance a bit
Use a profiler to find other performance bottlenecks
If the memory footprint is still too large with the above optimizations, then consider using disk-backed storage for storing and sorting the temporary data, e.g. SQLite.
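For completeness, a rough sketch of the SQLite option, where the database does the sorting and grouping on disk; the table and file names are arbitrary:

import sqlite3

con = sqlite3.connect('tmp_data.db')
con.execute('CREATE TABLE IF NOT EXISTS pairs (cand TEXT, dfm INTEGER)')

def add_pair(cand, dfm):
    con.execute('INSERT INTO pairs VALUES (?, ?)', (cand, dfm))

def write_grouped(f_out):
    # stream the pairs back in sorted order and group them without holding everything in RAM
    cur = con.execute('SELECT cand, dfm FROM pairs ORDER BY cand, dfm')
    current_key, values = None, []
    for cand, dfm in cur:
        if cand != current_key:
            if current_key is not None:
                f_out.write('%s,%s\n' % (current_key, ':'.join(map(str, values))))
            current_key, values = cand, []
        values.append(dfm)
    if current_key is not None:
        f_out.write('%s,%s\n' % (current_key, ':'.join(map(str, values))))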