Dynamically creating arrays for multiple datasets - python

This is a quality-of-life question that I feel must have an answer, but I can't find it (maybe I'm using the wrong terms).
Essentially, I have multiple sets of large data files that I would like to analyse. This involves reading each of these data files and storing it as an array (of variable length).
So far I have been doing
import numpy as np
input1 = np.genfromtxt('data1.dat')
input2 = np.genfromtxt('data2.dat')
etc. I was wondering if there is a method of dynamically assigning an array to each of these datasets. Since you can read these dynamically with a for loop,
for i in xrange(2):
    input = np.genfromtxt('data%i.dat' % i)
I was hoping to combine the above to create a bunch of arrays (input1, input2, etc.) without typing out genfromtxt multiple times myself. Surely there is a way to do this if I had 100 datasets (aptly named data0, data1, etc.) to import.
A solution I can think of is maybe creating a function,
import numpy as np
def input(a):
    return np.genfromtxt('data%i.dat' % a)
But obviously, I would prefer to store these arrays in memory instead of regenerating them every time, and I would be extremely grateful to know if this is possible in Python.

You can choose to store your arrays in either a dict or a list:
Option 1
Using a dict.
data = {}
for i in xrange(2):
    data['input{}'.format(i)] = np.genfromtxt('data{}.dat'.format(i))
You can access each array by key.
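For example, assuming the files above loaded without error, each array is then available by its key:
first = data['input0']   # the array parsed from data0.dat
print(first.shape)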
Option 2
Using a list.
data = []
for i in xrange(2):
    data.append(np.genfromtxt('data{}.dat'.format(i)))
Alternatively, using a list comprehension:
data = [np.genfromtxt('data{}.dat'.format(i)) for i in xrange(2)]
You can also use map, which returns a list in Python 2 (in Python 3 it returns an iterator, so wrap it in list()):
data = map(lambda x: np.genfromtxt('data{}.dat'.format(x)), xrange(2))
Now you can access each array by index.
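For example, again assuming the files exist:
first = data[0]     # the array parsed from data0.dat
for arr in data:    # or simply iterate over all of them
    print(arr.shape)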

Related

Using python generators with lots of data

I have a dataset consisting of 250k items that need to meet certain criteria before being added to a list/generator. To speed up the processing, I want to use generators, but I am uncertain about whether to filter the data with a function that yields the filtered sample, or if I should just return the filtered sample to a generator object, etc. I would like the final object to only include samples that met the filter criteria, but by default python will return/yield a NoneType object. I have included example filter functions, data (the real problem uses strings, but for simplicity I use floats from random normal distribution), and what I intend to do with the data below.
How should I efficiently use generators in this instance? Is it even logical/efficient to use generators for this purpose? I know that I can check to see if an element from the return function is None to exclude it from the container (list/generator), but how can I do this with the function that yields values?
# For random data
import numpy as np
import pandas as pd  # needed for the DataFrame step at the end

# Functions
def filter_and_yield(item_in_data):
    if item_in_data > 0.0:
        yield item_in_data

def filter_and_return(item_in_data):
    if item_in_data > 0.0:
        return item_in_data
# Arbitrary data
num_samples = 250 * 10**3
data = np.random.normal(size=(num_samples,))
# Should I use this: generator with generator elements?
filtered_data_as_gen_with_gen_elements = (filter_and_yield(item) for item in data)
# Should I use this: list with generator elements?
filtered_data_as_lst_with_gen_elements = [filter_and_yield(item) for item in data]
# Should I use this: generator with non-generator elements?
filtered_data_as_gen_with_non_gen_elements = (
    filter_and_return(item) for item in data if filter_and_return(item) is not None)
# Should I use this: list with non-generator elements?
filtered_data_as_lst_with_non_gen_elements = [
    filter_and_return(item) for item in data if filter_and_return(item) is not None]
# Saving the data as csv -- note, `filtered_data` is NOT defined
# but is a place holder for whatever correct way of filtering the data is
df = pd.DataFrame({'filtered_data': filtered_data})
df.to_csv('./filtered_data.csv')
The short answer is that none of these are best. Numpy and pandas include a lot of C and Fortran code that works on hardware-level data types stored in contiguous arrays. Python objects, even low-level ones like int and float, are relatively bulky: they include the standard Python object header and are allocated on the heap, and even simple operations like > require a call to one of their methods.
It's better to use numpy/pandas functions and operators as much as possible. These packages overload the standard Python operators to work on entire sets of data in one call.
df = pd.DataFrame({'filtered_data': data[data > 0.0]})
Here, data > 0.0 creates a new boolean numpy array recording the result of the comparison for every element, and data[...] then creates a new array holding only the values of data where that mask is True.
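A tiny, self-contained illustration of that masking step (made-up values, not the question's data):
import numpy as np

data = np.array([-1.2, 0.5, 3.0, -0.1])
mask = data > 0.0    # array([False,  True,  True, False])
print(data[mask])    # [0.5 3. ]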
Other notes
filter_and_yield is a generator function that yields 0 or 1 values. Python turns it into a generator because its body contains a yield. When the function returns (implicitly returning None), Python raises StopIteration, so the consumer of the generator never sees the None.
(filter_and_yield(item) for item in data) is a generator that produces generators. If you use it, you'll end up with a DataFrame column of generator objects.
[filter_and_yield(item) for item in data] is a list of generators (because filter_and_yield is a generator function). When pandas creates a column, it needs to know the column size, so it expands generators into lists anyway. Building the list yourself, as here, doesn't really matter, except that pandas deletes its temporary list when done, which reduces memory usage.
(filter_and_return(item) for item in data if filter_and_return(item) is not None) works, but it's pretty slow. data holds a hardware-level array of floats, so for item in data has to convert each of those values into a Python float object, and then filter_and_return(item) is a relatively expensive function call made twice per element. This could be rewritten as (value for value in (filter_and_return(item) for item in data) if value is not None) to halve the number of function calls.
[filter_and_return(item) for item in data if filter_and_return(item) is not None] As mentioned above, it's okay to do this, but delete the list when done to conserve memory.
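If you do want a generator-based version, the usual pattern is a single generator function that loops over the data and yields only the matching items, rather than one tiny generator per element; a minimal sketch, using the question's data array:
def filter_items(items):
    # Yield only the items that pass the filter; nothing is yielded for the rest.
    for item in items:
        if item > 0.0:
            yield item

filtered = list(filter_items(data))  # materialise only the surviving values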

Is putting a numpy array inside a list pythonic?

I am trying to break a long sequence into sub-sequences with a smaller window size, using a get_slice function I defined.
Then I realized that my code is clumsy: my raw data is already a numpy array, yet I store the slices in a list inside get_slice, and when I read each row of data_matrix I need another list to store the results again.
The code works fine, yet converting between numpy arrays and lists back and forth seems non-Pythonic to me. I wonder if I am doing it right. If not, how can I do it more efficiently and more Pythonically?
Here's my code:
import numpy as np
##Artifical Data Generation##
X_row1 = np.linspace(1,60,60,dtype=int)
X_row2 = np.linspace(101,160,60,dtype=int)
X_row3 = np.linspace(1001,1060,60,dtype=int)
data_matrix = np.append(X_row1.reshape(1,-1),X_row2.reshape(1,-1),axis=0)
data_matrix = np.append(data_matrix,X_row3.reshape(1,-1,),axis=0)
##---------End--------------##
##The function for generating time slice for sequence##
def get_slice(X, windows=5, stride=1):
    x_slice = []
    for i in range(int(len(X)/stride)):
        if i*stride < len(X)-windows+1:
            x_slice.append(X[i*stride:i*stride+windows])
    return np.array(x_slice)
##---------End--------------##
x_list = []
for row in data_matrix:
    temp_data = get_slice(row)  # getting the time slices as a numpy array
    x_list.append(temp_data)    # appending the time slices to a list
X = np.array(x_list)            # converting the list back to a numpy array
Putting this here as a semi-complete answer to address your two points - making the code more "pythonic" and more "efficient."
There are many ways to write code and there's always a balance to be found between the amount of numpy code and pure python code used.
Most of that comes down to experience with numpy and knowing some of the more advanced features, how fast the code needs to run, and personal preference.
Personal preference is the most important - you need to be able to understand what your code does and modify it.
Don't worry about what is pythonic, or even worse - numpythonic.
Find a coding style that works for you (as you seem to have done), and don't stop learning.
You'll pick up some tricks (like #B.M.'s answer uses), but for the most part these should be saved for rare instances.
Most tricks tend to require extra work, or only apply in some circumstances.
That brings up the second part of your question.
How to make code more efficient.
The first step is to benchmark it.
Really.
I've been surprised at the number of things I thought would speed up code that barely changed it, or even made it run slower.
Python's lists are highly optimized and give good performance for many things (Although many users here on stackoverflow remain convinced that using numpy can magically make any code faster).
To address your specific point, mixing lists and arrays is fine in most cases. Particularly if
You don't know the size of your data beforehand (lists expand much more efficiently)
You are creating a large number of views into an array (a list of arrays is often cheaper than one large array in this case)
You have irregularly shaped data (numpy arrays must be rectangular, not ragged)
In your code, case 2 applies. The trick with as_strided would also work, and probably be faster in some cases, but until you've profiled and know what those cases are I would say your code is good enough.
There are very few cases where mixing lists and arrays is necessary. You can build the same data efficiently with array primitives alone:
data_matrix = np.add.outer([0, 100, 1000], np.linspace(1, 60, 60, dtype=int))
itemsize = data_matrix.itemsize
X = np.lib.stride_tricks.as_strided(data_matrix, shape=(3, 56, 5),
                                    strides=(itemsize*60, itemsize, itemsize))
It's just a view. A fresh array can be obtained by X=X.copy().
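On NumPy 1.20 or newer, numpy.lib.stride_tricks.sliding_window_view builds the same windowed view without hand-computed strides; a sketch, assuming the question's defaults of windows=5 and stride=1:
X = np.lib.stride_tricks.sliding_window_view(data_matrix, 5, axis=1)  # shape (3, 56, 5), still a view
# For a stride larger than 1, slice the window axis afterwards, e.g. X[:, ::stride, :]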
Appending to the list will be slow. Try a list comprehension to make the numpy array.
something like below
import numpy as np
##Artifical Data Generation##
X_row1 = np.linspace(1,60,60,dtype=int)
X_row2 = np.linspace(101,160,60,dtype=int)
X_row3 = np.linspace(1001,1060,60,dtype=int)
data_matrix = np.append(X_row1.reshape(1,-1),X_row2.reshape(1,-1),axis=0)
data_matrix = np.append(data_matrix,X_row3.reshape(1,-1,),axis=0)
##---------End--------------##
##The function for generating time slice for sequence##
def get_slice(X, windows=5, stride=1):
    return np.array([X[i*stride:i*stride+windows]
                     for i in range(int(len(X)/stride))
                     if i*stride < len(X)-windows+1])
##---------End--------------##
X = np.array([get_slice(row) for row in data_matrix])
print(X)
This may be odd, because you have a numpy array of numpy arrays. If you want a 3 dimensional array this is perfectly fine. If you don't want a 3 dimensional array then you may want to vstack or append the arrays.
# X = np.array([get_slice(row) for row in data_matrix])
X = np.vstack([get_slice(row) for row in data_matrix])
List Comprehension speed
I am running Python 3.4.4 on Windows 10.
import timeit
TEST_RUNS = 1000
LIST_SIZE = 2000000
def make_list():
    li = []
    for i in range(LIST_SIZE):
        li.append(i)
    return li

def make_list_microopt():
    li = []
    append = li.append
    for i in range(LIST_SIZE):
        append(i)
    return li

def make_list_comp():
    li = [i for i in range(LIST_SIZE)]
    return li
print("List Append:", timeit.timeit(make_list, number=TEST_RUNS))
print("List Comprehension:", timeit.timeit(make_list_comp, number=TEST_RUNS))
print("List Append Micro-optimization:", timeit.timeit(make_list_microopt, number=TEST_RUNS))
Output
List Append: 222.00971377954895
List Comprehension: 125.9705268094408
List Append Micro-optimization: 157.25782340883387
I am very surprised with how much the micro-optimization helps. Still, List Comprehensions are a lot faster for large lists on my system.

Put ordered data back into a dictionary

I have a (normal, unordered) dictionary that holds my data, and I extract some of the data into a numpy array to do some linear algebra. Once that's done, I want to put the resulting ordered numpy vector data back into the dictionary alongside the rest of the data. What's the best, most Pythonic, way to do this?
Joe Kington suggests in his answer to "Writing to numpy array from dictionary" that two solutions include:
Using Ordered Dictionaries
Storing the sorting order in another data structure, such as a dictionary
Here are some (possibly useful) details:
My data is in nested dictionaries. The outer one is for groups: {groupKey: groupDict}, and group keys start at 0 and count up in order to the total number of groups. groupDict contains information about items: {itemKey: itemDict}. itemDict has keys for the actual data, and these keys typically start at 0 but can skip numbers, as not all "item locations" are populated. itemDict keys include things like 'name', 'description', 'x', 'y', ...
Getting to the data is easy, dictionaries are great:
data[groupKey][itemKey]['x'] = 0.12
Then I put data such as x and y into a numpy vectors and arrays, something like this:
xVector = numpy.empty(xLength)
vectorIndex = 0
for groupKey, groupDict in dataDict.items():
    for itemKey, itemDict in groupDict.items():
        xVector[vectorIndex] = itemDict['x']
        vectorIndex += 1
Then I go off and do my linear algebra and calculate a z vector that I want to add back into dataDict. The issue is that dataDict is unordered, so I don't have any way of getting the proper index.
The Ordered Dict method would allow me to know the order and then index through the dataDict structure and put the data back in.
Alternatively, I could create another dictionary while inside the inner for loop above that stores the relationship between vectorIndex, groupKey and itemKey:
sortingDict[vectorIndex] = {'groupKey': groupKey, 'itemKey': itemKey}
Later, when it's time to put the data back, I could just loop through the vectors and add the data:
vectorIndex = 0
for z in numpy.nditer(zVector):
    dataDict[sortingDict[vectorIndex]['groupKey']][sortingDict[vectorIndex]['itemKey']]['z'] = float(z)  # float() copies the value out of the nditer buffer
    vectorIndex += 1
Both methods seem equally straightforward to me. I'm not sure if changing dataDict to an ordered dictionary will have any other effects elsewhere in my code, but probably not. Adding the sorting dictionary also seems pretty easy, as it will get created at the same time as the numpy arrays and vectors. Left on my own, I think I would go with the sortingDict method.
Is one of these methods better than the others? Is there a better way I'm not thinking of? My data structure works well for me, but if there's a way to change that to improve everything else I'm open to it.
I ended up going with option #2 and it works quite well.
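For reference, a minimal sketch of that sortingDict approach, reusing the names from the question (dataDict, xLength, and zVector are assumed to exist; the linear algebra step is elided):
import numpy

# Build the vector and record which (groupKey, itemKey) each slot came from.
sortingDict = {}
xVector = numpy.empty(xLength)
vectorIndex = 0
for groupKey, groupDict in dataDict.items():
    for itemKey, itemDict in groupDict.items():
        xVector[vectorIndex] = itemDict['x']
        sortingDict[vectorIndex] = {'groupKey': groupKey, 'itemKey': itemKey}
        vectorIndex += 1

# ... linear algebra producing zVector goes here ...

# Write the results back using the recorded order.
for vectorIndex, z in enumerate(zVector):
    keys = sortingDict[vectorIndex]
    dataDict[keys['groupKey']][keys['itemKey']]['z'] = float(z)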

Biopython SeqIO to Pandas Dataframe

I have a FASTA file that can easily be parsed by SeqIO.parse.
I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's waaaay too heavy (two iterations, conversions, etc.).
from Bio import SeqIO
import pandas as pd
# parse sequence fasta file
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta",
                                                           "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta",
                                                             "fasta")]
# converting lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
# gathering the Series into a pandas DataFrame and setting the ID column as index
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])
I could do it with only one iteration, but I get a dict :
records = SeqIO.parse(fastaFile, 'fasta')
and I somehow can't get DataFrame.from_dict to work...
My goal is to iterate the FASTA file, and get ids and sequences lengths into a DataFrame through each iteration.
Here is a short FASTA file for those who want to help.
You're spot on - you definitely shouldn't be parsing the file twice, and storing the data in a dictionary is a waste of computing resources when you'll just be converting it to numpy arrays later.
SeqIO.parse() returns a generator, so you can iterate record-by-record, building a list like so:
with open('sequences.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))
See Peter Cock's answer for a more efficient way of parsing just ID's and sequences from a FASTA file.
The rest of your code looks pretty good to me. However, if you really want to optimize for use with pandas, you can read below:
On minimizing memory usage
Consulting the source of pandas.Series, we can see that data is stored internally as a numpy ndarray:
class Series(np.ndarray, Picklable, Groupable):
"""Generic indexed series (time series or otherwise) object.
Parameters
----------
data: array-like
Underlying values of Series, preferably as numpy ndarray
If you make identifiers an ndarray, it can be used directly in Series without constructing a new array; the copy parameter (default False) prevents a new ndarray from being created when it isn't needed. By storing your identifiers in a list, you force Series to coerce that list to an ndarray.
Avoid initializing lists
If you know in advance exactly how many sequences you have (and how long the longest ID will be), you could initialize an empty ndarray to hold identifiers like so:
num_seqs = 50
max_id_len = 60
numpy.empty((num_seqs, 1), dtype='S{:d}'.format(max_id_len))
Of course, it's pretty hard to know exactly how many sequences you'll have, or what the largest ID is, so it's easiest to just let numpy convert from an existing list. However, this is technically the fastest way to store your data for use in pandas.
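For completeness, a sketch of how those preallocated arrays might be filled in a single pass, assuming num_seqs and max_id_len really are known up front, fasta_file is an open handle, and numpy is imported as in the snippet above:
identifiers = numpy.empty(num_seqs, dtype='S{:d}'.format(max_id_len))
lengths = numpy.empty(num_seqs, dtype=numpy.int64)
for i, seq_record in enumerate(SeqIO.parse(fasta_file, 'fasta')):
    identifiers[i] = seq_record.id
    lengths[i] = len(seq_record.seq)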
David has given you a nice answer on the pandas side. On the Biopython side, you don't need to use SeqRecord objects via Bio.SeqIO if all you want is the record identifiers and their sequence lengths; this should be faster:
from Bio.SeqIO.FastaIO import SimpleFastaParser
with open('sequences.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 1)[0])  # First word is ID
        lengths.append(len(sequence))
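Either way, the two lists can then be turned into the same indexed DataFrame as in the question; a one-liner, assuming pandas is imported as pd:
Qfasta = pd.DataFrame({'ID': identifiers, 'length': lengths}).set_index('ID')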

Vectorize iteration over two large numpy arrays in parallel

I have two large arrays of type numpy.core.memmap.memmap, called data and new_data, with > 7 million float32 items.
I need to iterate over them both within the same loop which I'm currently doing like this.
for i in range(0, len(data)):
    if new_data[i] == 0: continue
    combo = (data[i], new_data[i])
    if combo not in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
However, this is unreasonably slow, so I gather that using numpy's vectorising functions is the way to go.
Is it possible to vectorize with the index, so that the vectorised array can compare its items to the corresponding items in the other array?
I thought of zipping the two arrays but I guess this would cause unreasonable overhead to prepare?
Is there some other way to optimise this operation?
For context: the goal is to effectively merge the two arrays such that each unique combination of corresponding values between the two arrays is represented by a different value in the resulting array, except zeros in the new_data array which are ignored. The arrays represent 3D bitmap images.
EDIT: available_values is a set of values that have not yet been used in data and persists across calls to this loop. new_values_map on the other hand is reset to an empty dictionary before each time this loop is used.
EDIT2: the data array only contains whole numbers, that is: it's initialised as zeros then with each usage of this loop with a different new_data it is populated with more values drawn from available_values which is initially a range of integers. new_data could theoretically be anything.
In answer to your question about vectorising, the answer is probably yes, though you need to clarify what available_values contains and how it's used, as that is the core of the vectorisation.
Your solution will probably look something like this...
indices = new_data != 0
data[indices] = available_values
In this case, if available_values can be considered as a set of values in which we allocate the first value to the first value in data in which new_data is not 0, that should work, as long as available_values is a numpy array.
Let's say new_data and data take values 0-255; then you can construct an available_values array with a unique entry for every possible pair of values in new_data and data, like the following:
available_data = numpy.arange(256 * 256).reshape((256, 256))  # 256 possible values per axis
indices = new_data != 0
data[indices] = available_data[data[indices].astype(int), new_data[indices].astype(int)]
Obviously, available_data can be whatever mapping you want. The above should be very quick whatever is in available_data (especially if you only construct available_data once).
Python gives you powerful tools for handling large arrays of data: generators and iterators.
Basically, they allow you to access your data as if it were a regular list, without fetching it all into memory at once, but accessing it piece by piece.
To access two large arrays at once, you can do:
from itertools import izip  # Python 2; on Python 3 use the built-in zip

for item_a, item_b in izip(data, new_data):
    pass  # ... do your stuff here
izip creates an iterator that walks over the elements of both arrays in lockstep, but it fetches items as you need them rather than all at once.
It seems that replacing the first two lines of the loop to produce:
for i in numpy.where(new_data != 0)[0]:
    combo = (data[i], new_data[i])
    if combo not in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
has the desired effect.
So most of the time in the loop was spent skipping the loop body upon encountering a zero in new_data. I don't really understand why that many null iterations were so expensive; maybe one day I will...
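For a fully vectorised version of the mapping itself, one option is to label each unique (data, new_data) pair with numpy.unique. This is a sketch, not a drop-in replacement: it assumes fresh labels can simply be drawn from a contiguous range above data.max() instead of being popped from available_values:
import numpy as np

mask = new_data != 0
pairs = np.stack((data[mask], new_data[mask]), axis=1)
# return_inverse gives one integer label per row, identical for identical (data, new_data) pairs.
_, inverse = np.unique(pairs, axis=0, return_inverse=True)
data[mask] = inverse + data.max() + 1  # offset so new labels don't clash with existing values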
