Improving the efficiency (memory/time) of the following Python code

I have to read about 300 files to build an association using the following piece of code, which means reading them all into memory.
with util.open_input_file(f) as f_in:
    for l in f_in:
        w = l.split(',')
        dfm = dk.to_key((idx, i, int(w[0]), int(w[1])))  # guaranteed to be unique for each line in the file
        cands = w[2].split(':')
        for cand in cands:
            tmp_data.setdefault(cand, []).append(dfm)
Then I need to write out the data structure above in this format:
k1, v1:v2,v3....
k2, v2:v5,v6...
I use the following code:
# Sort / join values.
cand2dfm_data = {}
for k, v in tmp_data.items():
    cand2dfm_data[k] = ':'.join(map(str, sorted(v, key=int)))
tmp_data = {}

# Write cand2dfm CSV file.
with util.open_output_file(cand2dfm_file) as f_out:
    for k in sorted(cand2dfm_data.keys()):
        f_out.write('%s,%s\n' % (k, cand2dfm_data[k]))
Since I have to process a significant number of files, I'm observing two problems:
The memory used to store tmp_data is very big. In my use case, processing 300 files, it uses 42 GB.
Writing out the CSV file takes a long time. This is because I'm calling write() for each item (about 2.2M of them). Furthermore, the output stream goes through a gzip compressor to save disk space.
In my use case, the numbers are guaranteed to be 32-bit unsigned.
Question:
To achieve memory reduction, I think it will be better to use a 32-bit int to store the data. Should I use ctypes.c_int() to store the values in the dict() (right now they are strings) or is there a better way?
For speeding up the writing, should I write to a StringIO object and then dump that to a file or is there a better way?
Alternatively, maybe there is a better way to accomplish the above logic without reading everything in memory?

A few thoughts.
Currently you are duplicating the data multiple times in memory:
you load it for the first time into tmp_data, then copy everything into cand2dfm_data, and then create a new list of keys by calling sorted(cand2dfm_data.keys()).
To reduce memory usage:
Get rid of tmp_data: parse and write your data directly into cand2dfm_data.
Make cand2dfm_data a list of tuples, not a dict.
Use cand2dfm_data.sort(...) instead of sorted(cand2dfm_data) to avoid creating a new list.
To speed up processing:
Convert the keys into ints to improve sorting performance (this will reduce memory usage as well).
Write the data to disk in chunks, e.g. 100, 500 or 1000 records in one go; this should improve I/O performance a bit (see the sketch after this list).
Use a profiler to find other performance bottlenecks.
If the memory footprint is still too large after the above optimizations, consider using disk-backed storage, e.g. SQLite, for storing and sorting the temporary data.
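A minimal sketch of these suggestions, not the asker's actual code: it reuses util.open_input_file, util.open_output_file, dk.to_key, idx, i, f and cand2dfm_file from the question, assumes (as the original key=int sorting implies) that both cand and dfm are integer-like, and the chunk size is an arbitrary choice.

import itertools

CHUNK = 1000  # number of output lines per write; tune to taste

# Build a flat list of (cand, dfm) int tuples instead of a dict of lists.
pairs = []
with util.open_input_file(f) as f_in:
    for l in f_in:
        w = l.split(',')
        dfm = int(dk.to_key((idx, i, int(w[0]), int(w[1]))))
        for cand in w[2].split(':'):
            pairs.append((int(cand), dfm))

# Sorting in place groups equal keys together and orders their values,
# so no second dict and no per-key sorted() call is needed.
pairs.sort()

with util.open_output_file(cand2dfm_file) as f_out:
    buf = []
    for cand, group in itertools.groupby(pairs, key=lambda p: p[0]):
        buf.append('%d,%s' % (cand, ':'.join(str(dfm) for _, dfm in group)))
        if len(buf) >= CHUNK:
            f_out.write('\n'.join(buf) + '\n')
            buf = []
    if buf:
        f_out.write('\n'.join(buf) + '\n')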

Related

How to load a large JSON file (multiple objects) into pandas DataFrames in chunks to avoid high memory usage?

I have a very large JSON file that is in the form of multiple objects. For a small dataset this works:
data = pd.read_json(file, lines=True)
but on the same but larger dataset it crashes on an 8 GB RAM computer, so I tried to convert it to a list first with the code below:
data = []
with open(file) as f:
    for line in f:
        d = json.loads(line)
        data.append(d)
then convert the list into a DataFrame with
df = pd.DataFrame(data)
This does convert it into a list fine even with the large dataset file, but it crashes when I try to convert it into a DataFrame, because, I presume, it uses too much memory.
I have also tried:
data = []
with open(file) as f:
    for line in f:
        d = json.loads(line)
        df = pd.DataFrame([d])
I thought it would append the rows one by one, but I think it still creates one large copy in memory at once, so it still crashes.
How would I convert the large JSON file into a DataFrame in chunks so that it limits the memory usage?
There are several possible solutions, depending on your specific case. Since we don't have a data example or information on the data structure, I can offer the following:
If the data in the JSON file is numeric, consider breaking it into chunks, reading each one and converting it to the smallest suitable type (float32/int), as pandas will otherwise use float64, which is more memory intensive (a hedged sketch follows below).
Use Dask for bigger-than-memory datasets like this one.
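A minimal sketch of that chunked approach, assuming the file is newline-delimited JSON; the file name and chunk size are placeholders:

import pandas as pd

chunks = []
# read_json with lines=True and chunksize yields DataFrames of `chunksize` rows each
for chunk in pd.read_json("data.jsonl", lines=True, chunksize=100_000):
    # downcast numeric columns before keeping the chunk around
    for col in chunk.select_dtypes(include=["float64"]).columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast="float")
    for col in chunk.select_dtypes(include=["int64"]).columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)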
To avoid the intermediate data structures you can use a generator.
import json
import pandas as pd

def load_jsonl(filename):
    with open(filename) as fd:
        for line in fd:
            yield json.loads(line)

df = pd.DataFrame(load_jsonl(filename))

Faster alternative to pandas.read_csv for reading txt files with data

I am working on a project where I need to process a large number of txt files (about 7000), each with 2 columns of floats and 12500 rows.
I am using pandas, and it takes about 2 min 20 s, which is a bit long. With MATLAB this takes about 1 min less. I would like to get close to MATLAB's time, or faster.
Is there any faster alternative that I can implement with Python?
I tried Cython and the speed was the same as with pandas.
Here is the code I am using. It reads the files composed of column 1 (time) and column 2 (amplitude). I calculate the envelope and make a list with the resulting envelopes for all files. I extract the time from the first file using plain numpy.loadtxt(), which is slower than pandas but has no impact since it is just one file.
Any ideas?
For coding suggestions please try to use the same format as I use.
Cheers
Data example:
1.7949600e-05 -5.3232106e-03
1.7950000e-05 -5.6231098e-03
1.7950400e-05 -5.9230090e-03
1.7950800e-05 -6.3228746e-03
1.7951200e-05 -6.2978830e-03
1.7951600e-05 -6.6727570e-03
1.7952000e-05 -6.5727906e-03
1.7952400e-05 -6.9726562e-03
1.7952800e-05 -7.0726226e-03
1.7953200e-05 -7.2475638e-03
1.7953600e-05 -7.1725890e-03
1.7954000e-05 -6.9476646e-03
1.7954400e-05 -6.6227738e-03
1.7954800e-05 -6.4228410e-03
1.7955200e-05 -5.8480342e-03
1.7955600e-05 -6.1979166e-03
1.7956000e-05 -5.7980510e-03
1.7956400e-05 -5.6231098e-03
1.7956800e-05 -5.3482022e-03
1.7957200e-05 -5.1732611e-03
1.7957600e-05 -4.6484375e-03
20 files here:
https://1drv.ms/u/s!Ag-tHmG9aFpjcFZPqeTO12FWlMY?e=f6Zk38
folder_tarjet = r"D:\this"

if len(folder_tarjet) > 0:
    print("You chose %s" % folder_tarjet)

list_of_files = os.listdir(folder_tarjet)
list_of_files.sort(key=lambda f: os.path.getmtime(join(folder_tarjet, f)))
num_files = len(list_of_files)

envs_a = []
for elem in list_of_files:
    file_name = os.path.join(folder_tarjet, elem)
    amp = pd.read_csv(file_name, header=None, dtype={'amp': np.float64}, delim_whitespace=True)
    env_amplitudes = np.abs(hilbert(np.array(pd.DataFrame(amp[1]))))
    envs_a.append(env_amplitudes)

envelopes = np.array(envs_a).T

file_name = os.path.join(folder_tarjet, list_of_files[1])
Time = np.loadtxt(file_name, usecols=0)
I would suggest not using CSV to store and load big quantities of data. If you already have your data in CSV you can still convert all of it at once to a faster format, for instance pickle, HDF5, Feather or Parquet; they are not human readable but have much better performance on every other metric.
After the conversion you will be able to load the data in a few seconds (if not less than a second) instead of minutes, so in my opinion it is a much better solution than trying to squeeze out a marginal 50% optimization.
If the data is being generated, make sure to generate it in a compressed format. If you don't know which format to use, Parquet for instance is one of the fastest for reading and writing, and you can also load it from MATLAB.
Here you will find the options already supported by pandas.
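As an illustration of the one-off conversion, a hedged sketch; the file names and the choice of a Parquet backend (pyarrow or fastparquet) are assumptions, not part of the original answer:

import pandas as pd

# one-time conversion: parse the text file once, save it as Parquet
df = pd.read_csv("scan_0001.txt", header=None, names=["time", "amplitude"],
                 delim_whitespace=True)
df.to_parquet("scan_0001.parquet")  # requires pyarrow or fastparquet

# later runs load the binary file, which is much faster than re-parsing text
df = pd.read_parquet("scan_0001.parquet")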
As discussed in @Ziur Olpa's answer and the comments, a binary format is bound to be faster than parsing text.
The quick way to get those gains is to use NumPy's own NPY format, and have your reader function cache those files onto disk; that way, when you re-(re-re-)run your data analysis, it will use the pre-parsed NPY files instead of the "raw" TXT files. Should the TXT files change, you can just remove all the NPY files and wait a while longer for parsing to happen (or maybe add logic to compare the modification times of the NPY files with those of their corresponding TXT files).
Something like this; I hope I got your hilbert/abs/transposition logic the way you wanted.
import glob
import os

import numpy as np
import pandas as pd
from scipy import signal


def read_scans(folder):
    """
    Read scan files from a given folder, yield tuples of filename/matrix.
    This will also create "cached" npy files for each txt file, e.g. a.txt -> a.txt.npy.
    """
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime):
        np_filename = filename + ".npy"
        if os.path.exists(np_filename):
            yield (filename, np.load(np_filename))
        else:
            df = pd.read_csv(filename, header=None, dtype=np.float64, delim_whitespace=True)
            mat = df.values
            np.save(np_filename, mat)
            yield (filename, mat)


def read_data_opt(folder):
    times = None
    envs = []
    for filename, mat in read_scans(folder):
        if times is None:
            # If we hadn't got the times yet, grab them from this
            # matrix (by slicing the first column).
            times = mat[:, 0]
        env_amplitudes = signal.hilbert(mat[:, 1])
        envs.append(env_amplitudes)
    envs = np.abs(np.array(envs)).T
    return (times, envs)
On my machine, the first run (for your 20-file dataset) takes
read_data_opt took 0.13302898406982422 seconds
and the subsequent runs are 4x faster:
read_data_opt took 0.031115055084228516 seconds

Processing data using pandas in a memory efficient manner using Python

I have to read multiple csv files and group them by "event_name". I also might have some duplicates, so I need to drop them. paths contains all the paths of the csv files, and my code is as follows:
data = []
for path in paths:
    csv_file = pd.read_csv(path)
    data.append(csv_file)

events = pd.concat(data)
events = events.drop_duplicates()
event_names = events.groupby('event_name')

ev2 = []
for name, group in event_names:
    a, b = group.shape
    ev2.append([name, a])
This code tells me how many unique event_name values there are, and how many entries there are per event_name. It works wonderfully, except that the csv files are too large and I am having memory problems. Is there a way to do the same using less memory?
I read about using dir() and globals() to delete variables, which I could certainly use, because once I have event_names I don't need the DataFrame events any longer. However, I am still having memory issues. More specifically, my question is: can I read the csv files in a more memory-efficient way, or is there something additional I can do to reduce memory usage? I don't mind sacrificing performance, as long as I can read all the csv files at once instead of going chunk by chunk.
Just keep a hash value of each row to reduce the data size.
csv_file = pd.read_csv(path)
# compute hash (gives a `uint64` value per row)
csv_file["hash"] = pd.util.hash_pandas_object(csv_file)
# keep only the 2 columns relevant to counting
data.append(csv_file[["event_name", "hash"]])
If you cannot risk a hash collision (which would be astronomically unlikely), just use another hash key and check whether the final counting results are identical. The way to change the hash key is as follows.
# compute hash using a different hash key
csv_file["hash2"] = pd.util.hash_pandas_object(csv_file, hash_key='stackoverflow')
Reference: pandas official docs page
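A hedged sketch of how the hashing trick fits into the question's loop; the index=False argument and the final groupby count are additions for illustration, mirroring the question's ev2 list:

import pandas as pd

data = []
for path in paths:
    csv_file = pd.read_csv(path)
    # hash row values only (ignore the index), so identical rows from different files match
    csv_file["hash"] = pd.util.hash_pandas_object(csv_file, index=False)
    data.append(csv_file[["event_name", "hash"]])

events = pd.concat(data).drop_duplicates()      # dedupe on the small two-column frame
counts = events.groupby("event_name").size()    # entries per unique event_name
ev2 = [[name, n] for name, n in counts.items()]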

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
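For illustration, a tiny sketch of the sparse-array idea, using made-up data where most values are zero:

import numpy as np
from scipy import sparse

dense = np.zeros((10_000, 1_000))
dense[::100, ::50] = 1.0  # only a tiny fraction of the entries are non-zero

sp = sparse.csr_matrix(dense)  # stores just the non-zero values and their positions
print(dense.nbytes)
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)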
Another answer (IMO: the correct solution to your problem) is to use a memory-mapped array. This will let numpy load only the necessary parts into RAM when needed and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in Python module that would handle this is struct. Appending more data is as simple as opening the file in append mode and writing more bytes of data. Make sure that any references to the memory-mapped array are re-created any time more data is written to the file, so the information stays fresh.
Finally, there is compression. Numpy can compress arrays with savez_compressed, which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may get you under the threshold, but this could similarly be applied to the other methods to reduce memory usage. Numpy's built-in compression will only save disk space, not memory. There may exist other libraries that perform some sort of streaming compression, but that is beyond the scope of my answer.
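A minimal sketch of the compression route (file and array names are placeholders); saving each column as its own array in the archive means np.load only decompresses the one you ask for:

import numpy as np

col_a = np.random.rand(1_000_000)
col_b = np.random.rand(1_000_000)
np.savez_compressed("columns.npz", col_a=col_a, col_b=col_b)

with np.load("columns.npz") as archive:
    a = archive["col_a"]  # only this array is decompressed into memory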
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np

# open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

# open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)

# read some data
print(np.mean(column_mmap[0:1024]))

# write some data
column_mmap[0:512] = .5

# Deletion closes the memory-mapped file and flushes changes to disk.
# del isn't specifically needed as python will garbage collect objects no
# longer accessible. If for example you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array, then when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap

# write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())
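If you only need a specific part of the array at a time, np.memmap also accepts offset (in bytes) and shape arguments; a small sketch assuming the float64 layout written above:

import numpy as np

start, count = 500_000, 1_024  # element range we want
window = np.memmap('column_data.dat', dtype=np.float64, mode='r',
                   offset=start * 8,  # 8 bytes per float64 element
                   shape=(count,))
print(window.mean())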

Loading a .csv file in python consuming too much memory

I'm loading a 4 GB .csv file in Python. Since it's only 4 GB I expected it would be OK to load it all at once, but after a while my 32 GB of RAM gets completely filled.
Am I doing something wrong? Why does a 4 GB file end up taking so much more space in RAM?
Is there a faster way of loading this data?
fname = "E:\Data\Data.csv"
a = []
with open(fname) as csvfile:
reader = csv.reader(csvfile,delimiter=',')
cont = 0;
for row in reader:
cont = cont+1
print(cont)
a.append(row)
b = np.asarray(a)
You copy the entire content of the CSV at least twice:
once into a, and again into b.
Any additional work over that data will consume additional memory to hold values and such.
You could del a once you have b, but note that the pandas library provides a read_csv function and a way to count the rows of the resulting DataFrame (a sketch follows below).
You should be able to do an
a = list(reader)
which might be a little better.
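A hedged sketch of that pandas route, counting rows chunk by chunk so the whole file never sits in memory (the chunk size is an arbitrary choice):

import pandas as pd

fname = r"E:\Data\Data.csv"
row_count = sum(len(chunk) for chunk in pd.read_csv(fname, chunksize=100_000))
print(row_count)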
Because it's Python :-D One of the easiest approaches: create your own class for rows that stores the data in __slots__, which can save a couple of hundred bytes per row, since the rows produced by the reader are general-purpose Python containers that carry a lot of overhead even when nearly empty (a sketch follows below). If you go further, you may try to store a binary representation of the data.
But maybe you can process the data without keeping the entire data array? That would consume significantly less memory.
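A minimal sketch of the __slots__ idea; the Row class and its field names are made up for illustration, assuming a two-column file:

import csv

class Row:
    __slots__ = ("time", "value")  # no per-instance __dict__, so each row object is much smaller

    def __init__(self, time, value):
        self.time = float(time)
        self.value = float(value)

rows = []
with open("data.csv", newline="") as fh:
    for time, value in csv.reader(fh):
        rows.append(Row(time, value))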
