I have a script that loops through ~335k filenames, opens the FITS table behind each filename, performs a few operations on the table, and writes the results to a file. At the start the loop runs relatively fast, but over time it consumes more and more RAM (and CPU resources, I guess) and the script gets slower. I would like to know how I can improve the performance and make the code quicker. For example, is there a better way to write to the output file (opening it once and doing everything inside that with-open block, vs. reopening the file every time there is something new to write)? Is there a better way to loop? Can I free memory that I no longer need?
My script looks like this:
import numpy as np
from astropy.table import Table  # assuming Table.read(..., format='fits') comes from astropy; adjust if your spectral package provides it
# spectral is a package for manipulation of spectral data
from spectral import *

# list_of_fnames and df_jpas are defined earlier in the script (not shown here)
# I use this dictionary to cache functions that I don't want to regenerate each time I need them.
# Generating them anew every time would be more time-consuming, I found.
lam_resample_dic = {}

with open("/home/bla/Downloads/output.txt", "ab") as f:
    for ind, fname in enumerate(list_of_fnames):
        data_s = Table.read('/home/nestor/Downloads/all_eBoss_QSO/' + fname, format='fits')
        # lam_str_identifier is just the dict key I need for finding the corresponding BandResampler function below
        lam_str_identifier = ''.join([str(x) for x in data_s['LOGLAM'].data.astype(str)])
        if lam_str_identifier not in lam_resample_dic:
            # BandResampler is the function I avoid recreating every time;
            # I build it only when lam_str_identifier indicates a unique new set of data
            resample = BandResampler(centers1=10**data_s['LOGLAM'],
                                     centers2=df_jpas["Filter.wavelength"].values,
                                     fwhm2=df_jpas["Filter.width"].values)
            lam_resample_dic[lam_str_identifier] = resample
            photo_spec = np.around(resample(data_s['FLUX']), 4)
        else:
            photo_spec = np.around(lam_resample_dic[lam_str_identifier](data_s['FLUX']), 4)
        np.savetxt(f, [photo_spec], delimiter=',', fmt='%1.4f')
        # this is just to keep track of the progress of the loop
        if ind % 1000 == 0:
            print('num of files processed so far:', ind)
Thanks for any suggestions!
I have a while loop that collects data from a microphone (replaced here by np.random.random() to make it more reproducible). I do some operations; let's say I take the abs().mean() here, because my output will be a one-dimensional array.
This loop is going to run for a LONG time (e.g., once a second for a week) and I am wondering what my options are for saving the results. My main concerns are saving the data with acceptable performance and keeping the result portable (e.g., .csv beats .npy).
The simple way: just append things into a .txt file. Could be replaced by csv.gz maybe? Maybe using np.savetxt()? Would it be worth it?
The hdf5 way: this should be a nicer way, but reading the whole dataset to append to it doesn't seem like good practice or better performing than dumping into a text file. Is there another way to append to hdf5 files?
The npy way (code not shown): I could save this into a .npy file but I would rather make it portable using a format that could be read from any program.
from collections import deque
import numpy as np
import h5py

# placeholder values; the real ones are not shown in the question
save_interval = save_interval_sec = 5

amplitudes = deque(maxlen=save_interval_sec)

# Read from the microphone in a continuous stream
while True:
    data = np.random.random(100)
    amplitude = np.abs(data).mean()
    print(amplitude, end="\r")
    amplitudes.append(amplitude)

    # Option 1: save the amplitudes to a text file every n iterations
    if len(amplitudes) == save_interval:
        with open("amplitudes.txt", "a") as f:
            for amp in amplitudes:
                f.write(str(amp) + "\n")
        amplitudes.clear()

    # Option 2: save the amplitudes to an HDF5 file every n iterations
    if len(amplitudes) == save_interval:
        # Convert the deque to a NumPy array
        amplitudes_array = np.array(amplitudes)
        # Open an HDF5 file
        with h5py.File("amplitudes.h5", "a") as f:
            # Get the existing dataset or create a new one if it doesn't exist
            dset = f.get("amplitudes")
            if dset is None:
                dset = f.create_dataset("amplitudes", data=amplitudes_array, dtype=np.float32,
                                        maxshape=(None,), chunks=True, compression="gzip")
            else:
                # Get the current size of the dataset
                current_size = dset.shape[0]
                # Resize the dataset to make room for the new data
                dset.resize((current_size + save_interval,))
                # Write the new data to the dataset
                dset[current_size:] = amplitudes_array
        # Clear the deque
        amplitudes.clear()

    # For debug only
    if len(amplitudes) > 3:
        break
Update
I get that the answer might depend a bit on the sampling frequency (once a second might be too slow to matter) and the data dimensions (a single column might be too little). I guess I asked because anything can work, but I always just dump to text, and I am not sure where the breaking points are that tip the decision toward one method or the other.
I have 1080 .txt files, each of which contains over 100k rows of values in three columns. I have to compute the average of the first column in each of these .txt files.
Any method that involves looping is proving too slow, since numpy.loadtxt only loads one file at a time.
The kicker is that I have 38 of these folders on which I need to perform this operation, so 38*1080 files in total. Timing each numpy.loadtxt call with the time module gives me around 1.7 seconds per file, so the total time to run over all folders is over 21 hours, which seems like far too much.
So this has me wondering whether there is a way to perform multiple operations at once, i.e. open several txt files in parallel and average the first column of each, while still storing the averages in the order of the txt files, since the order is important.
Since I am a beginner, I'm not sure if this even is the fastest way. Thanks in advance.
import numpy as np
import glob
import os

i = 0
while i < 39:
    source_directory = "something/" + str(i)  # go to the specific numbered folder
    hw_array = sorted(glob.glob(source_directory + "/data_*.txt"))  # read paths of the 1080 txt files
    velocity_array = np.zeros((30, 36, 3))
    for j, data_file in enumerate(hw_array):
        # position of the probe where the velocities were measured
        x = 35 - int((j - 0.0001) / 30)
        y = (30 - int(round(j % 30))) % 30
        velocity_column = np.loadtxt(data_file, usecols=(0,))  # the step that takes most of the time
        velocity_array[y, x, 0] = np.mean(velocity_column)
        velocity_array[y, x, 1] = y * (2 / 29)
        velocity_array[y, x, 2] = x * 0.5
    # save the velocity array for use in later analysis
    np.save(r"C:/Users/md101/Desktop/AE2/Project2/raw/" + "R29" + "/data" + "R29", velocity_array)
    i += 1
Python has pretty slow I/O, and most of your time is spent talking to the operating system and on the other costs associated with opening files.
Threading in Python is strange and only provides an improvement in certain situations; here is why it is good for your case. A Python thread only does work while it holds the GIL (the global interpreter lock, worth reading about). While a thread is waiting for something, like I/O, it releases the GIL to another thread. That lets one thread operate on a file it has already loaded (averaging the first column) while another thread is blocked opening the next file.
It's entirely possible to write a function that loads the files from one directory, spawn one multiprocessing worker per directory, and get the job done in close to 1/39th of the time it was taking. Or don't parallelize by directory but by file: queue up the work and have workers read from that work queue.
Some pseudocode:
import multiprocessing
import os

pool = multiprocessing.Pool()
workers = []
for d in os.listdir("."):
    for f in os.listdir(d):
        workers.append(pool.apply_async(counter_function_for_file, (os.path.join(d, f),)))
s = sum(worker.get() for worker in workers)
...and put your code for reading from the file in that counter_function_for_file(filename) function.
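For this question's workload, a minimal sketch of what that worker function could look like (the function name is the answer's placeholder; the np.loadtxt call and the first-column average mirror the question's own code):

import numpy as np

def counter_function_for_file(filename):
    # load one probe file and return the average of its first column
    velocity_column = np.loadtxt(filename, usecols=(0,))
    return velocity_column.mean()

Since the workers list preserves submission order, calling worker.get() on each entry in turn gives the averages back in the original file order, which is what the question needs (rather than summing them as in the pseudocode above).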
I have a for loop that does an API call over thousands of rows.
(I know for loops are not recommended, but this API is rate-limited, so slow is better. I know I could also use iterrows, but this is just an example.)
Sometimes I come back to find the loop has failed, or there's something wrong with the API and I need to stop the loop. This means I lose all my data.
I was thinking of pickling the dataframe at the end of each loop, and re-loading it at the start. This would save all updates made to the dataframe.
Fake example (not working code - this is just a 'what if'):
for i in range(len(df1)):
    # check if a df pickle file is in the directory
    if pickle in directory:
        # load file
        df1 = pickle.load(df1)
        # append new data
        df1.loc[i,'api_result'] = requests(http/api/call/data/)
        # dump it to file
        pickle.dump(df1)
    else:
        # start of loop
        # append new data
        df1.loc[i,'api_result'] = requests(http/api/call/data/)
        # dump to file
        pickle.dump(df1)
And if this is not a good way to keep the updated file in case of failure or early stoppage, what is?
I think a good solution would be to append all the updates to a file as you go:
with open("updates.txt", "a") as f_o:
for i in range(len(df1)):
# append new data
f_o.write(requests(http/api/call/data/)+"\n")
If all the rows end up present in the file, you can do a single bulk update of the dataframe. If not, restart the updates from the last record that was written, as sketched below.
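For example, a minimal sketch of the resume logic, under the assumption that updates.txt holds exactly one line per processed row of the question's df1 (fetch_api_result is a made-up stand-in for the real API call):

import os

def fetch_api_result(i):
    # stand-in for the question's rate-limited API call
    return "result_for_row_%d" % i

# count how many results were already written before the failure/stop
done = 0
if os.path.exists("updates.txt"):
    with open("updates.txt") as f_in:
        done = sum(1 for _ in f_in)

# resume from the first unprocessed row, appending one line per result
with open("updates.txt", "a") as f_o:
    for i in range(done, len(df1)):
        f_o.write(str(fetch_api_result(i)) + "\n")
        f_o.flush()  # push each result to disk before the next (slow) API call

Once the file is complete, its lines map one-to-one onto the rows of df1, so the bulk update is just reading them back into the api_result column.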
I'm loading a 4 GB .csv file in Python. Since it's only 4 GB I expected it would be fine to load it all at once, but after a while my 32 GB of RAM gets completely filled.
Am I doing something wrong? Why does 4 GB on disk become so much larger in RAM?
Is there a faster way of loading this data?
fname = "E:\Data\Data.csv"
a = []
with open(fname) as csvfile:
reader = csv.reader(csvfile,delimiter=',')
cont = 0;
for row in reader:
cont = cont+1
print(cont)
a.append(row)
b = np.asarray(a)
You copied the entire content of the CSV at least twice:
once in a and again in b.
Any additional work on that data consumes additional memory to hold intermediate values and such.
You could del a once you have b, but note that the pandas library provides a read_csv function and a way to count the rows of the resulting DataFrame.
You should be able to do an
a = list(reader)
which might be a little better.
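For reference, a minimal sketch of the pandas route mentioned above (pandas.read_csv is real; whether you still need a separate NumPy array at all is an assumption about your downstream code):

import pandas as pd

# read_csv parses straight into compact, typed column buffers
df = pd.read_csv(r"E:\Data\Data.csv", delimiter=',')
print(len(df))      # row count, replacing the manual counter
b = df.to_numpy()   # only if you really still need a plain NumPy array

Keeping the data in the DataFrame (and dropping the Python list of lists entirely) is usually where the big memory savings come from.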
Because it's Python :-D One of the easiest approaches: create your own class for rows that stores the data in __slots__, which can save a couple of hundred bytes per row compared to keeping each row in a dict (a dict is fairly large even when empty). If you want to go further, you could store a binary representation of the data.
But maybe you can process the data without keeping the entire data array in memory? That would consume significantly less memory.
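A minimal sketch of such a slotted row class, assuming (purely for illustration) that the file has three columns; the class and field names are made up:

import csv

class Row:
    # __slots__ replaces the per-instance __dict__, shrinking each object
    __slots__ = ("col1", "col2", "col3")

    def __init__(self, col1, col2, col3):
        self.col1 = col1
        self.col2 = col2
        self.col3 = col3

with open(r"E:\Data\Data.csv", newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    # keep small slotted objects instead of one list per row
    rows = [Row(*record) for record in reader]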
I have a large text file (>1 GB) with three comma-separated values per line that I want to read into a Pandas DataFrame in chunks. An example of the DataFrame is below:
I'd like to filter this file while reading it in and output a "clean" version. One issue is that some timestamps are out of order, but the problem is usually quite local (a tick is typically out of order by only a few positions before or after where it belongs). Are there any ways to do localized, "sliding window" sorting?
Also, as I'm fairly new to Python and still learning about its I/O methods, I'm unsure of the best class/method to use for filtering large data files. TextIOBase?
This is a really interesting question, as the data's big enough to not fit in memory easily.
First of all, about I/O: if it's a CSV I'd use a standard library csv.reader() object, like so (I'm assuming Python 3):
import csv

with open('big.csv', newline='') as f:
    for row in csv.reader(f):
        ...
Then I'd probably keep a sliding window of rows in a collections.deque(maxlen=WINDOW_SIZE) instance with window size set to maybe 20 based on your description. Read the first WINDOW_SIZE rows into the deque, then enter the main read loop which would output the left-most item (row) in the deque, then append the current row.
After appending each row, check whether the current row's timestamp comes before the timestamp of the previous row (window[-2]); if it does, sort the deque. You can't sort a deque in place, but you can do something like:
window = collections.deque(sorted(window), maxlen=WINDOW_SIZE)
Python's Timsort algorithm handles already-sorted runs efficiently, so this should be very fast (linear time).
If the window size and the number of out-of-order rows are small (as it sounds like they can be), I believe the overall algorithm will be O(N) where N is the number of rows in the data file, so linear time.
UPDATE: I wrote some demo code to generate a file like this and then sort it using the above technique -- see this Gist, tested on Python 3.5. It's much faster than the sort utility on the same data, and also faster than Python's sorted() function after about N = 1,000,000. Incidentally the function that generates a demo CSV is significantly slower than the sorting code. :-) My results timing process_sliding() for various N (definitely looks linear-ish):
N = 1,000,000: 3.5s
N = 2,000,000: 6.6s
N = 10,000,000: 32.9s
For reference, here's the code for my version of process_sliding():
import collections
import csv

def process_sliding(in_filename, out_filename, window_size=20):
    with open(in_filename, newline='') as fin, \
         open(out_filename, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        # pre-fill and sort the initial window
        first_window = sorted(next(reader) for _ in range(window_size))
        window = collections.deque(first_window, maxlen=window_size)
        for row in reader:
            # emit the oldest row, then slide the window forward
            writer.writerow(window.popleft())
            window.append(row)
            # if the new row is out of order, re-sort the (small) window
            if row[0] < window[-2][0]:
                window = collections.deque(sorted(window), maxlen=window_size)
        # flush whatever is left in the window
        for row in window:
            writer.writerow(row)
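A minimal usage sketch, with hypothetical input and output filenames:

# rewrite big.csv into big_sorted.csv with locally out-of-order rows fixed
process_sliding('big.csv', 'big_sorted.csv', window_size=20)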