Using Python generators with lots of data

I have a dataset consisting of 250k items that need to meet certain criteria before being added to a list/generator. To speed up the processing, I want to use generators, but I am uncertain about whether to filter the data with a function that yields the filtered sample, or whether I should just return the filtered sample to a generator object, etc. I would like the final object to only include samples that met the filter criteria, but by default Python will return/yield None for samples that fail the filter. I have included example filter functions, data (the real problem uses strings, but for simplicity I use floats drawn from a random normal distribution), and what I intend to do with the data below.
How should I efficiently use generators in this instance? Is it even logical/efficient to use generators for this purpose? I know that I can check to see if an element from the return function is None to exclude it from the container (list/generator), but how can I do this with the function that yields values?
# For random data
import numpy as np
# For the DataFrame / CSV step at the end
import pandas as pd

# Functions
def filter_and_yield(item_in_data):
    if item_in_data > 0.0:
        yield item_in_data

def filter_and_return(item_in_data):
    if item_in_data > 0.0:
        return item_in_data

# Arbitrary data
num_samples = 250 * 10**3
data = np.random.normal(size=(num_samples,))
# Should I use this: generator with generator elements?
filtered_data_as_gen_with_gen_elements = (filter_and_yield(item) for item in data)

# Should I use this: list with generator elements?
filtered_data_as_lst_with_gen_elements = [filter_and_yield(item) for item in data]

# Should I use this: generator with non-generator elements?
filtered_data_as_gen_with_non_gen_elements = (
    filter_and_return(item) for item in data if filter_and_return(item) is not None)

# Should I use this: list with non-generator elements?
filtered_data_as_lst_with_non_gen_elements = [
    filter_and_return(item) for item in data if filter_and_return(item) is not None]

# Saving the data as csv -- note, `filtered_data` is NOT defined
# but is a placeholder for whatever the correct way of filtering the data is
df = pd.DataFrame({'filtered_data': filtered_data})
df.to_csv('./filtered_data.csv')

The short answer is that none of these are best. Numpy and pandas include a lot of C and Fortran code that works on hardware-level data types stored in contiguous arrays. Python objects, even low-level ones like int and float, are relatively bulky: they include the standard Python object header and are allocated on the heap, and even a simple operation like > requires a call to one of the object's methods.
It's better to use numpy/pandas functions and operators as much as possible. These packages overload the standard Python operators to work on entire sets of data in one call.
df = pd.DataFrame({'filtered_data': data[data > 0.0]})
Here, data > 0.0 creates a new boolean numpy array holding the result of the comparison, and data[...] creates a new array holding only the values of data where that mask is True.
Other notes
filter_and_yield is a generator function that yields 0 or 1 values. Python turned it into a generator because its body contains a yield. When the function returns None (i.e. falls off the end without yielding), Python turns that into a StopIteration exception, so the consumer of the generator never sees the None.
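A quick way to see that behaviour (my own toy check, using the filter_and_yield defined in the question):

list(filter_and_yield(1.5))   # -> [1.5]
list(filter_and_yield(-1.5))  # -> []   (the None never shows up)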
(filter_and_yield(item) for item in data) is a generator expression that yields generators. If you use it, you'll end up with a dataframe column of generator objects.
[filter_and_yield(item) for item in data] is a list of generators (because filter_and_yield is a generator function). When pandas creates a column it needs to know the column size, so it expands generators into lists anyway, much like you've done here. Whether you build that list yourself or let pandas do it doesn't really matter, except that pandas deletes its temporary list when done, which reduces memory usage.
(filter_and_return(item) for item in data if filter_and_return(item) is not None) This one works, but it's pretty slow. data holds a hardware-level array of floats; for item in data has to convert each of those into a Python-level float object, and then filter_and_return(item) is a relatively expensive function call made twice per element. This could be rewritten as (value for value in (filter_and_return(item) for item in data) if value is not None) to halve the number of function calls.
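Spelled out over a few lines, that rewrite (the same expression quoted above) looks like this:

# Call filter_and_return only once per element, then drop the Nones.
filtered_gen = (
    value
    for value in (filter_and_return(item) for item in data)
    if value is not None)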
[filter_and_return(item) for item in data if filter_and_return(item) is not None] As mentioned above, it's okay to do this, but delete the list when you're done with it to conserve memory.

Related

Dynamically creating arrays for multiple datasets

This is a quality-of-life query that I feel there must be an answer to, but I can't find it (maybe I'm using the wrong terms).
Essentially, I have multiple sets of large data files that I would like to perform analysis on. This involves reading each of these datafiles and storing them as an array (of variable length).
So far I have been doing
import numpy as np
input1 = np.genfromtxt('data1.dat')
input2 = np.genfromtxt('data2.dat')
etc. I was wondering if there is a method of dynamically assigning an array to each of these datasets. Since you can read these dynamically with a for loop,
for i in xrange(2):
    input = np.genfromtxt('data%i.dat' % i)
I was hoping to combine the above to create a bunch of arrays (input1, input2, etc.) without typing out genfromtxt multiple times myself. Surely there is a method if I had 100 datasets (aptly named data0, data1, etc.) to import.
A solution I can think of is maybe creating a function,
import numpy as np

def input(a):
    return np.genfromtxt('data%i.dat' % a)
But obviously, I would prefer to store this in memory instead of constantly regenerating the array, and I would be extremely grateful to know if this is possible in Python.
You can choose to store your arrays in either a dict or a list:
Option 1
Using a dict.
data = {}
for i in xrange(2):
    data['input{}'.format(i)] = np.genfromtxt('data{}.dat'.format(i))
You can access each array by key.
Option 2
Using a list.
data = []
for i in xrange(2):
    data.append(np.genfromtxt('data{}.dat'.format(i)))
Alternatively, using a list comprehension:
data = [np.genfromtxt('data{}.dat'.format(i)) for i in xrange(2)]
You can also use map, which returns a list (in Python 2):
data = map(lambda x: np.genfromtxt('data{}.dat'.format(x)), xrange(2))
Now you can access each array by index.
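Either way, getting an array back out afterwards is straightforward (a short usage note of mine, assuming the loading loops above have run):

first = data['input0']   # dict version: access by key
first = data[0]          # list/map version: access by index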

Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array

I have a (large) length-N array of k distinct functions, and a length-N array of abscissas. I want to evaluate the functions at the abscissas to return a length-N array of ordinates, and critically, I need to do it very fast.
I have tried the following loop over a call to np.where, which is too slow:
Create some fake data to illustrate the problem:
def trivial_functional(i):
    return lambda x: i * x

k = 250
func_table = [trivial_functional(j) for j in range(k)]
func_table = np.array(func_table)  # possibly unnecessary
We have a table of 250 distinct functions. Now I create a large array with many repeated entries of those functions, and a set of points of the same length at which these functions should be evaluated.
Npts = int(1e6)  # cast to int so it can be used as an array size
abcissa_array = np.random.random(Npts)
function_indices = np.random.random_integers(0, len(func_table) - 1, Npts)
func_array = func_table[function_indices]
Finally, loop over every function used by the data and evaluate it on the set of relevant points:
desired_output = np.zeros(Npts)
for func_index in set(function_indices):
    idx = np.where(function_indices == func_index)[0]
    desired_output[idx] = func_table[func_index](abcissa_array[idx])
This loop takes ~0.35 seconds on my laptop, the biggest bottleneck in my code by an order of magnitude.
Does anyone see how to avoid the blind lookup call to np.where? Is there a clever use of numba that can speed this loop up?
This does almost the same thing as your (excellent!) self-answer, but with a bit less rigamarole. It seems marginally faster on my machine as well -- about 30ms based on a cursory test.
def apply_indexed_fast(array, func_indices, func_table):
    func_argsort = func_indices.argsort()
    func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
    func_ranges.append(None)
    out = np.zeros_like(array)
    for f, start, end in zip(func_table, func_ranges, func_ranges[1:]):
        ix = func_argsort[start:end]
        out[ix] = f(array[ix])
    return out
Like yours, this splits a sequence of argsort indices into chunks, each of which corresponds to a function in func_table. It then uses each chunk to select input and output indices for its corresponding function. To determine the chunk boundaries, it uses np.searchsorted instead of np.unique -- where searchsorted(a, b) could be thought of as a binary search algorithm that returns the index of the first value in a equal to or greater than the given value or values in b.
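For example (a tiny illustration of searchsorted that is mine, not the poster's):

np.searchsorted([1, 3, 5, 7], [0, 4, 7])   # -> array([0, 2, 3])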
Then the zip function simply iterates over its arguments in parallel, returning a single item from each one, collected together in a tuple, and stringing those together into a list. (So zip([1, 2, 3], ['a', 'b', 'c'], ['b', 'c', 'd']) returns [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'c', 'd')].) This, along with the for statement's built-in ability to "unpack" those tuples, allows for a terse but expressive way to iterate over multiple sequences in parallel.
In this case, I've used it to iterate over the functions in func_table alongside two out-of-sync copies of func_ranges. This ensures that the item from func_ranges in the end variable is always one step ahead of the item in the start variable. By appending None to func_ranges, I ensure that the final chunk is handled gracefully -- zip stops when any one of its arguments runs out of items, which cuts off the final value in the sequence. Conveniently, the None value also serves as an open-ended slice index!
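To make the chunking concrete, here is a toy example of my own with three functions used by five data points:

func_indices = np.array([2, 0, 1, 0, 2])
order = func_indices.argsort()                             # e.g. array([1, 3, 2, 0, 4])
starts = np.searchsorted(func_indices[order], range(3))    # array([0, 2, 3])
# function 0's points are order[0:2], function 1's are order[2:3],
# and function 2's are order[3:None] once None is appended as the final boundary.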
Another trick that does the same thing requires a few more lines, but has lower memory overhead, especially when used with the itertools equivalent of zip, izip:
import itertools

range_iter_a = iter(func_ranges)   # create iterators that walk over the
range_iter_b = iter(func_ranges)   # values in `func_ranges` without making copies
next(range_iter_b, None)           # advance the second iterator by one
for f, start, end in itertools.izip(func_table, range_iter_a, range_iter_b):
    ...
However, these low-overhead generator-based approaches can sometimes be a bit slower than vanilla lists. Also, note that in Python 3, zip behaves more like izip.
Thanks to hpaulj for the suggestion to pursue a groupby approach. There are lots of canned routines out there for this operation, such as Pandas DataFrames, but they all come with the overhead cost of the data structure initialization, which is one-time-only but can be costly if used for just a single calculation.
Here is my pure numpy solution that is a factor of 13 faster than the original where loop I was using. The upshot summary is that I use np.argsort and np.unique together with some fancy indexing gymnastics.
First we sort the function indices, and then find the elements of the sorted array where each new index begins
idx_funcsort = np.argsort(function_indices)
unique_funcs, unique_func_indices = np.unique(function_indices[idx_funcsort], return_index=True)
Now there is no longer a need for blind lookups, since we know exactly which slice of the sorted array corresponds to each unique function. So we still loop over each called function, but without calling where:
for func_index in range(len(unique_funcs) - 1):
    idx_func = idx_funcsort[unique_func_indices[func_index]:unique_func_indices[func_index + 1]]
    func = func_table[unique_funcs[func_index]]
    desired_output[idx_func] = func(abcissa_array[idx_func])
That covers all but the final index, which somewhat annoyingly we need to call individually due to Python indexing conventions:
func_index = len(unique_funcs)-1
idx_func = idx_funcsort[unique_func_indices[func_index]:]
func = func_table[unique_funcs[func_index]]
desired_output[idx_func] = func(abcissa_array[idx_func])
This gives identical results to the where loop (a bookkeeping sanity check), but the runtime of this loop is 0.027 seconds, a speedup of 13x over my original calculation.
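That sanity check is a one-liner, assuming the result of the original where loop was kept in a separate array (here called desired_output_where, a name I'm making up for illustration):

assert np.allclose(desired_output, desired_output_where)  # desired_output_where is hypothetical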
That is a beautiful example of functional programming being somewhat emulated in Python.
Now, if you want to apply your function to a set of points, I'd recommend numpy's ufunc framework, which will allow you to create blazingly fast vectorized versions of your functions.
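For instance, a single Python function can be wrapped as a ufunc with np.frompyfunc (a minimal sketch of that idea; f is just a stand-in for one entry of func_table):

import numpy as np

def f(x):
    return 3.0 * x                       # stand-in for one of the table's functions

f_ufunc = np.frompyfunc(f, 1, 1)         # 1 input, 1 output
points = np.random.random(10)
values = f_ufunc(points).astype(float)   # frompyfunc returns object dtype, so cast back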

Biopython SeqIO to Pandas Dataframe

I have a FASTA file that can easily be parsed by SeqIO.parse.
I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's waaaay too heavy (two iterations, conversions, etc.).
from Bio import SeqIO
import pandas as pd

# parse sequence fasta file
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]

# converting lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')

# Gathering Series into a pandas DataFrame and renaming the index as the ID column
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])
I could do it with only one iteration, but I get a dict :
records = SeqIO.parse(fastaFile, 'fasta')
and I somehow can't get DataFrame.from_dict to work...
My goal is to iterate the FASTA file, and get ids and sequences lengths into a DataFrame through each iteration.
Here is a short FASTA file for those who want to help.
You're spot on - you definitely shouldn't be parsing the file twice, and storing the data in a dictionary is a waste of computing resources when you'll just be converting it to numpy arrays later.
SeqIO.parse() returns a generator, so you can iterate record-by-record, building a list like so:
with open('sequences.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))
See Peter Cock's answer for a more efficient way of parsing just ID's and sequences from a FASTA file.
The rest of your code looks pretty good to me. However, if you really want to optimize for use with pandas, you can read below:
On minimizing memory usage
Consulting the source of pandas.Series, we can see that data is stored internally as a numpy ndarray:
class Series(np.ndarray, Picklable, Groupable):
    """Generic indexed series (time series or otherwise) object.

    Parameters
    ----------
    data : array-like
        Underlying values of Series, preferably as numpy ndarray
If you make identifiers an ndarray, it can be used directly in Series without constructing a new array; the copy parameter (default False) will prevent a new ndarray from being created if it isn't needed. By storing your sequences in a list, you'll force Series to coerce that list to an ndarray.
Avoid initializing lists
If you know in advance exactly how many sequences you have (and how long the longest ID will be), you could initialize an empty ndarray to hold identifiers like so:
import numpy

num_seqs = 50
max_id_len = 60
identifiers = numpy.empty((num_seqs, 1), dtype='S{:d}'.format(max_id_len))  # fixed-width byte strings
Of course, it's pretty hard to know exactly how many sequences you'll have, or what the largest ID is, so it's easiest to just let numpy convert from an existing list. However, this is technically the fastest way to store your data for use in pandas.
David has given you a nice answer on the pandas side. On the Biopython side, you don't need to use SeqRecord objects via Bio.SeqIO if all you want is the record identifiers and their sequence lengths - this should be faster:
from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('sequences.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 1)[0])  # First word is the ID
        lengths.append(len(sequence))
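Either way, once identifiers and lengths are filled, building the DataFrame is a one-liner along the lines of the question's own pandas code:

import pandas as pd

Qfasta = pd.DataFrame({'ID': identifiers, 'length': lengths}).set_index('ID')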

Vectorize iteration over two large numpy arrays in parallel

I have two large arrays of type numpy.core.memmap.memmap, called data and new_data, with > 7 million float32 items.
I need to iterate over them both within the same loop which I'm currently doing like this.
for i in range(0, len(data)):
    if new_data[i] == 0:
        continue
    combo = (data[i], new_data[i])
    if combo not in new_values_map:
        new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
However, this is unreasonably slow, so I gather that using numpy's vectorising functions is the way to go.
Is it possible to vectorize with the index, so that the vectorised array can compare its items to the corresponding item in the other array?
I thought of zipping the two arrays but I guess this would cause unreasonable overhead to prepare?
Is there some other way to optimise this operation?
For context: the goal is to effectively merge the two arrays such that each unique combination of corresponding values between the two arrays is represented by a different value in the resulting array, except zeros in the new_data array which are ignored. The arrays represent 3D bitmap images.
EDIT: available_values is a set of values that have not yet been used in data and persists across calls to this loop. new_values_map on the other hand is reset to an empty dictionary before each time this loop is used.
EDIT2: the data array only contains whole numbers, that is: it's initialised as zeros then with each usage of this loop with a different new_data it is populated with more values drawn from available_values which is initially a range of integers. new_data could theoretically be anything.
In answer to your question about vectorising, the answer is probably yes, though you need to clarify what available_values contains and how it's used, as that is the core of the vectorisation.
Your solution will probably look something like this...
indices = new_data != 0
data[indices] = available_values
In this case, if available_values can be treated as a sequence whose first value is allocated to the first position in data where new_data is not 0, and so on, that should work, as long as available_values is a numpy array.
Let's say new_data and data take values 0-255; then you can construct an available_data array with a unique entry for every possible pair of values in new_data and data, like the following:
available_data = numpy.arange(0, 256 * 256).reshape((256, 256))
indices = new_data != 0
data[indices] = available_data[data[indices], new_data[indices]]
Obviously, available_data can be whatever mapping you want. The above should be very quick whatever is in available_data (especially if you only construct available_data once).
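A toy version of that lookup (my own illustration on tiny 1-D arrays rather than the memmapped images) shows the mechanics:

import numpy as np

data = np.array([0, 1, 0, 2])
new_data = np.array([0, 3, 5, 0])
available_data = np.arange(256 * 256).reshape((256, 256))  # one unique value per (data, new_data) pair

indices = new_data != 0
data[indices] = available_data[data[indices], new_data[indices]]
# data is now [0, 259, 5, 2]: positions where new_data was 0 are untouched,
# the others get the unique value for their (old data, new_data) pair.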
Python gives you powerful tools for handling large arrays of data: generators and iterators.
Basically, they allow you to access your data as if it were a regular list, without fetching everything into memory at once, instead accessing it piece by piece.
To access two large arrays at once, you can write:
from itertools import izip

for item_a, item_b in izip(data, new_data):
    ...  # do your stuff here
izip creates an iterator that walks over the elements of both arrays in step, but it picks up pieces as you need them, not all at once.
It seems that replacing the first two lines of the loop to produce:
for i in numpy.where(new_data != 0)[0]:
    combo = (data[i], new_data[i])
    if combo not in new_values_map:
        new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
has the desired effect.
So most of the time in the loop was spent skipping the entire iteration upon encountering a zero in new_data. I don't really understand why that many null iterations were so expensive; maybe one day I will...

Efficient Array replacement in Python

I'm wondering what the most efficient way is to replace elements in an array with other random elements in the array given some criteria. More specifically, I need to replace each element which doesn't meet a given criterion with another random value from that row. For example, I want to replace each out-of-range element of a row of data with a random cell from data(row) which is between -.8 and .8. My inefficient solution looks something like this:
import numpy as np
import random as r

data = np.random.normal(0, 1, (10, 100))

for index, row in enumerate(data):
    row_copy = np.copy(row)
    outliers = np.logical_or(row > .8, row < -.8)
    for prob in np.where(outliers == 1)[0]:
        fixed = 0
        while fixed == 0:
            random_other_value = r.randint(0, 99)
            if random_other_value in np.where(outliers == 1)[0]:
                fixed = 0
            else:
                row_copy[prob] = row[random_other_value]
                fixed = 1
Obviously, this is not efficient.
I think it would be faster to pull out all the good values, then use random.choice() to pick one whenever you need it. Something like this:
import numpy as np
import random
from itertools import izip

data = np.random.normal(0, 1, (10, 100))

for row in data:
    good_ones = np.logical_and(row >= -0.8, row <= 0.8)
    good = row[good_ones]
    row_copy = np.array([x if f else random.choice(good) for f, x in izip(good_ones, row)])
High-level Python code that you write is slower than the C internals of Python. If you can push work down into the C internals it is usually faster. In other words, try to let Python do the heavy lifting for you rather than writing a lot of code. It's zen... write less code to get faster code.
I added a loop to run your code 1000 times, and to run my code 1000 times, and measured how long they took to execute. According to my test, my code is ten times faster.
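The exact numbers will vary by machine, but a benchmark along these lines reproduces the comparison (a sketch using timeit; original_version and improved_version are hypothetical wrappers that each run one filtering pass):

import timeit

t_orig = timeit.timeit(original_version, number=1000)   # original_version: hypothetical wrapper
t_new = timeit.timeit(improved_version, number=1000)    # improved_version: hypothetical wrapper
print(t_orig)
print(t_new)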
Additional explanation of what this code is doing:
row_copy is being set by building a new list, and then calling np.array() on the new list to convert it to a NumPy array object. The new list is being built by a list comprehension.
The new list is made according to the rule: if the number is good, keep it; else, take a random choice from among the good values.
A list comprehension walks over a sequence of values, but to apply this rule we need two values: the number, and the flag saying whether that number is good or not. The easiest and fastest way to make a list comprehension walk along two sequences at once is to use izip() to "zip" the two sequences together. izip() will yield up tuples, one at a time, where the tuple is (f, x); f in this case is the flag saying good or not, and x is the number. (Python has a built-in feature called zip() which does pretty much the same thing, but actually builds a list of tuples; izip() just makes an iterator that yields up tuple values. But you can play with zip() at a Python prompt to learn more about how it works.)
In Python we can unpack a tuple into variable names like so:
a, b = (2, 3)
In this example, we set a to 2 and b to 3. In the list comprehension we unpack the tuples from izip() into variables f and x.
Then the heart of the list comprehension is a "ternary if" statement like so:
a if flag else b
The above will return the value a if the flag value is true, and otherwise return b. The one in this list comprehension is:
x if f else random.choice(good)
This implements our rule.
