Biopython SeqIO to Pandas Dataframe - python

I have a FASTA file that can easily be parsed by SeqIO.parse.
I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but it feels way too heavy (two iterations, conversions, etc.):
from Bio import SeqIO
import pandas as pd

# parse sequence fasta file
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]

# convert lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')

# gather Series into a pandas DataFrame and set the ID column as the index
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])
I could do it with only one iteration, but then I get a dict:
records = SeqIO.parse(fastaFile, 'fasta')
and I somehow can't get DataFrame.from_dict to work...
My goal is to iterate the FASTA file, and get ids and sequences lengths into a DataFrame through each iteration.
Here is a short FASTA file for those who want to help.

You're spot on - you definitely shouldn't be parsing the file twice, and storing the data in a dictionary is a waste of computing resources when you'll just be converting it to numpy arrays later.
SeqIO.parse() returns a generator, so you can iterate record-by-record, building a list like so:
with open('sequences.fasta') as fasta_file:  # will close the handle cleanly
    identifiers = []
    lengths = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))
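From there, a minimal sketch of assembling the two lists into the indexed DataFrame the question asks for (column names match the question's code):

import pandas as pd

# build both columns in one step and index the frame by the record ID
Qfasta = pd.DataFrame({'ID': identifiers, 'length': lengths}).set_index('ID')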
See Peter Cock's answer for a more efficient way of parsing just IDs and sequences from a FASTA file.
The rest of your code looks pretty good to me. However, if you really want to optimize for use with pandas, you can read below:
On minimizing memory usage
Consulting the source of pandas.Series, we can see that data is stored internally as a numpy ndarray:
class Series(np.ndarray, Picklable, Groupable):
    """Generic indexed series (time series or otherwise) object.

    Parameters
    ----------
    data: array-like
        Underlying values of Series, preferably as numpy ndarray
If you make identifiers an ndarray, it can be used directly in Series without constructing a new array: the copy parameter (default False) prevents a new ndarray from being created when it isn't needed. By storing your data in lists, you force Series to coerce each list to an ndarray.
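For illustration, a small sketch of the difference, assuming identifiers is the list built above (names are illustrative):

import numpy as np
import pandas as pd

ids_arr = np.array(identifiers)                  # coercion to ndarray happens once, explicitly
s1 = pd.Series(ids_arr, name='ID', copy=False)   # pandas can reuse the existing array instead of copying it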
Avoid initializing lists
If you know in advance exactly how many sequences you have (and how long the longest ID will be), you could initialize an empty ndarray to hold the identifiers like so:
import numpy as np

num_seqs = 50
max_id_len = 60
identifiers = np.empty((num_seqs, 1), dtype='S{:d}'.format(max_id_len))
Of course, it's pretty hard to know exactly how many sequences you'll have, or what the largest ID is, so it's easiest to just let numpy convert from an existing list. However, this is technically the fastest way to store your data for use in pandas.
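If you do happen to know the counts up front, a hedged sketch of filling the pre-allocated arrays in a single pass (it assumes the record count is exact and continues the snippet above):

from Bio import SeqIO
import numpy as np

lengths = np.empty((num_seqs,), dtype=int)
with open('sequences.fasta') as fasta_file:
    for i, seq_record in enumerate(SeqIO.parse(fasta_file, 'fasta')):
        identifiers[i, 0] = seq_record.id   # identifiers pre-allocated with shape (num_seqs, 1) above
        lengths[i] = len(seq_record.seq)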

David has given you a nice answer on the pandas side. On the Biopython side, you don't need to use SeqRecord objects via Bio.SeqIO if all you want is the record identifiers and their sequence lengths - this should be faster:
from Bio.SeqIO.FastaIO import SimpleFastaParser
with open('sequences.fasta') as fasta_file:  # will close the handle cleanly
    identifiers = []
    lengths = []
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 1)[0])  # first word is the ID
        lengths.append(len(sequence))
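If you want to go straight to the DataFrame, a compact sketch of the same idea, producing the same Qfasta as in the question:

import pandas as pd
from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('sequences.fasta') as fasta_file:
    records = [(title.split(None, 1)[0], len(seq))   # (ID, length) per record
               for title, seq in SimpleFastaParser(fasta_file)]

Qfasta = pd.DataFrame(records, columns=['ID', 'length']).set_index('ID')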

Related

Using python generators with lots of data

I have a dataset consisting of 250k items that need to meet certain criteria before being added to a list/generator. To speed up the processing, I want to use generators, but I am uncertain about whether to filter the data with a function that yields the filtered sample, or if I should just return the filtered sample to a generator object, etc. I would like the final object to only include samples that met the filter criteria, but by default python will return/yield a NoneType object. I have included example filter functions, data (the real problem uses strings, but for simplicity I use floats from random normal distribution), and what I intend to do with the data below.
How should I efficiently use generators in this instance? Is it even logical/efficient to use generators for this purpose? I know that I can check to see if an element from the return function is None to exclude it from the container (list/generator), but how can I do this with the function that yields values?
# For random data
import numpy as np
import pandas as pd

# Functions
def filter_and_yield(item_in_data):
    if item_in_data > 0.0:
        yield item_in_data

def filter_and_return(item_in_data):
    if item_in_data > 0.0:
        return item_in_data

# Arbitrary data
num_samples = 250 * 10**3
data = np.random.normal(size=(num_samples,))

# Should I use this: generator with generator elements?
filtered_data_as_gen_with_gen_elements = (filter_and_yield(item) for item in data)

# Should I use this: list with generator elements?
filtered_data_as_lst_with_gen_elements = [filter_and_yield(item) for item in data]

# Should I use this: generator with non-generator elements?
filtered_data_as_gen_with_non_gen_elements = (
    filter_and_return(item) for item in data if filter_and_return(item) is not None)

# Should I use this: list with non-generator elements?
filtered_data_as_lst_with_non_gen_elements = [
    filter_and_return(item) for item in data if filter_and_return(item) is not None]

# Saving the data as csv -- note, `filtered_data` is NOT defined
# but is a placeholder for whatever the correct way of filtering the data is
df = pd.DataFrame({'filtered_data': filtered_data})
df.to_csv('./filtered_data.csv')
The short answer is that none of these is best. Numpy and pandas include a lot of C and Fortran code that works on hardware-level data types stored in contiguous arrays. Python objects, even low-level ones like int and float, are relatively bulky: they include the standard Python object header, they are allocated on the heap, and even a simple operation like > requires a call to one of their methods.
It's better to use numpy/pandas functions and operators as much as possible. These packages have overloaded the standard Python operators to work on entire sets of data in one call.
df = pd.DataFrame({'filtered_data': data[data > 0.0]})
Here, data > 0.0 creates a new boolean numpy array holding the result of the comparison, and data[...] creates a new array holding only the values of data at the positions that are True.
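To make those two steps explicit, a tiny sketch that names the intermediate mask (reusing data and the imports from the question):

mask = data > 0.0              # boolean ndarray, one True/False per element
filtered = data[mask]          # new array with only the elements where mask is True
df = pd.DataFrame({'filtered_data': filtered})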
Other notes
filter_and_yield is a generator function that yields 0 or 1 values; Python makes it a generator because its body contains a yield. When the function body returns (implicitly returning None), Python turns that into a StopIteration, so the consumer of the generator never sees a None.
(filter_and_yield(item) for item in data) is a generator that yields generators. If you use it, you'll end up with a DataFrame column of generator objects.
[filter_and_yield(item) for item in data] is a list of generators (because calling filter_and_yield returns a generator). When pandas creates a column it needs to know the column size, so it expands generators into lists anyway; whether you build the list yourself or let pandas do it makes little difference, except that pandas discards its temporary list when done, which reduces memory usage.
(filter_and_return(item) for item in data if filter_and_return(item) is not None) works, but it's pretty slow. data holds a hardware-level array of floats; for item in data has to convert each of them into a Python-level float, and then filter_and_return(item) is a relatively expensive function call made twice per item. It could be rewritten as (value for value in (filter_and_return(item) for item in data) if value is not None) to halve the number of function calls.
[filter_and_return(item) for item in data if filter_and_return(item) is not None] is okay, as mentioned above, but delete the list when done to conserve memory.
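For completeness, a hedged sketch of the pure-Python route with the halved call count and the cleanup suggested above (reusing filter_and_return and data from the question):

import pandas as pd

filtered_data = [value for value in (filter_and_return(item) for item in data)
                 if value is not None]
df = pd.DataFrame({'filtered_data': filtered_data})
del filtered_data              # drop the intermediate list once the DataFrame holds its own copy
df.to_csv('./filtered_data.csv')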

Find most common string in 2D numpy array

I'm making a 2D numpy array in Python which looks like this:
[['0.001251993149471442' 'herfst']
 ['0.002232327408019874' 'herfst']
 ['0.002232327408019874' 'herfst']
 ['0.002232327408019874' 'winter']
 ['0.002232327408019874' 'winter']]
I want to get the most common string from the entire array.
I did find some ways to do this already, but all of them have the same problem: they won't work because there are two datatypes in the array.
Is there an easier way to get the most common element from an entire column (not row) besides just running it through a for loop and counting?
You can get a count of all the values using numpy and collections. It's not clear from your question whether the numeric values in your 2D list are actually numbers or strings, but this works for both as long as the numeric values are first and the words are second:
import numpy
from collections import Counter

input1 = [['0.001251993149471442', 'herfst'], ['0.002232327408019874', 'herfst'], ['0.002232327408019874', 'herfst'], ['0.002232327408019874', 'winter'], ['0.002232327408019874', 'winter']]
input2 = [[0.001251993149471442, 'herfst'], [0.002232327408019874, 'herfst'], [0.002232327408019874, 'herfst'], [0.002232327408019874, 'winter'], [0.002232327408019874, 'winter']]

def count(input):
    oneDim = list(numpy.ndarray.flatten(numpy.array(input)))  # flatten the list
    del oneDim[0::2]  # remove the 'numbers' (i.e. elements at even indices)
    counts = Counter(oneDim)  # get a count of all unique elements
    maxString = counts.most_common(1)[0]  # find the most common one
    print(maxString)

count(input1)
count(input2)
If you want to also include the numbers in the count, simply skip the line del oneDim[0::2]
Unfortunately, the mode() method exists only in pandas, not in Numpy, so the first step is to flatten your array (arr) and convert it to a pandas Series:
s = pd.Series(arr.flatten())
Then if you want to find the most common string (and note that Numpy arrays have all elements of the same type), the most intuitive solution is to execute:
s.mode()[0]
(s.mode() alone returns a Series, so we just take its first element).
The result is:
'0.002232327408019874'
But if you want to leave out strings that are convertible to numbers, you need a different approach.
Unfortunately, you cannot use s.str.isnumeric(), because it finds strings composed solely of digits, but your "numeric" strings also contain dots.
So you have to narrow down your Series (s) using str.match and then invoke mode:
s[~s.str.match(r'^[+-]?(?:\d+|\d+\.\d*|\d*\.\d+)$')].mode()[0]
This time the result is:
'herfst'
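As an alternative sketch without pandas, numpy's own np.unique with return_counts=True can pick the most common word, assuming the words sit in the second column:

import numpy as np

arr = np.array([['0.001251993149471442', 'herfst'],
                ['0.002232327408019874', 'herfst'],
                ['0.002232327408019874', 'herfst'],
                ['0.002232327408019874', 'winter'],
                ['0.002232327408019874', 'winter']])

words, counts = np.unique(arr[:, 1], return_counts=True)
print(words[np.argmax(counts)])   # 'herfst'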

Python: What's the best way to unpack a struct array from binary data

I'm parsing a binary file format (OpenType font file). The format is a complex tree of many different struct types, but one recurring pattern is to have an array of records of a particular format. I've written code using struct.unpack to get one record at a time. But I'm wondering if there's a way I'm missing to parse the entire array of records?
The following is an example of unpacked results for one particular kind of record array:
[{'glyphID': 288, 'paletteIndex': 0}, {'glyphID': 289, 'paletteIndex': 1}, {'glyphID': 518, 'paletteIndex': 0}, ...] list
This is what I'm doing at present: I've created a generic function to unpack an arbitrary records array (consistent record format in any given call).
import struct

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    recordLength = struct.calcsize(format)
    array = []
    index = 0
    for i in range(numRecords):
        record = {}
        vals = struct.unpack(format, buffer[index : index + recordLength])
        for k, v in zip(fieldNames, vals):
            record[k] = v
        array.append(record)
        index += recordLength
    return array
The buffer parameter is a byte sequence at least the size of the array, with the first record to be unpacked at the start of the sequence.
The format parameter is a struct format string, according to the type of record array being read. In one case, the format string might be ">3H"; in another case, it might be ">4s2H"; etc. For the above example of results, it was ">2H".
The fieldNames parameter is a sequence of strings for the field names in the given record type. For the above example of results, this was ("glyphID", "paletteIndex").
So, I'm stepping through the buffer (byte sequence data), getting sequential slices and unpacking the records one at a time, creating a dict for each record and appending them to the array list.
Is there a better way to do this, a method like unpack in some module that allows defining a format as an array of structs and unpacking the whole shebang at once?
Take a look at kaitai - https://kaitai.io/, a library for parsing binary files across multiple languages, with a skeleton to define the file format in a language independent way.
It is capable of defining conditions inside the file format, and adapt the parsing as needed. While the learning curve isn't immediately trivial, it's not too hard either.
Assuming you want to do it yourself and not use an external library, there are a few things to consider that can improve the performance and the code:
Use struct.unpack_from(format, buffer, offset=0) instead of the current method, as buffer[index : index + recordLength] creates new objects and copies memory around unnecessarily (see the sketch after the iter_unpack example below).
If you want to unpack an array of the same format, you can improve it further with struct.iter_unpack(format, buffer) and then iterate over the results:
import itertools
import struct

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    # note: iter_unpack requires len(buffer) to be a multiple of struct.calcsize(format)
    unpack_iter = struct.iter_unpack(format, buffer)
    return [
        # I like this better than dict(zip(...)) but you can also do that
        {k: v for k, v in zip(fieldNames, vals)}
        # We use `islice` to only take the first numRecords values
        for vals in itertools.islice(unpack_iter, numRecords)
    ]
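For reference, a hedged sketch of the first point on its own: the same function written with struct.unpack_from, so no intermediate slices of buffer are created (same signature as the original):

import struct

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    recordLength = struct.calcsize(format)
    return [
        # unpack each record in place at its offset, no buffer slicing
        dict(zip(fieldNames, struct.unpack_from(format, buffer, i * recordLength)))
        for i in range(numRecords)
    ]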

Dynamically creating arrays for multiple datasets

This is a quality-of-life query that I feel there must be an answer to, but I can't find it (maybe I'm using the wrong search terms).
Essentially, I have multiple sets of large data files that I would like to perform analysis on. This involves reading each of these datafiles and storing them as an array (of variable length).
So far I have been doing
import numpy as np
input1 = np.genfromtxt('data1.dat')
input2 = np.genfromtxt('data2.dat')
etc. I was wondering if there is a method of dynamically assigning an array to each of these datasets. Since you can read these dynamically with a for loop,
for i in xrange(2):
    input = np.genfromtxt('data%i.dat' % i)
I was hoping to combine the above to create a bunch of arrays; input1, input2, etc. without myself typing out genfromtxt multiple times. Surely there is a method if I had 100 datasets (aptly named data0, data1, etc) to import.
A solution I can think of is maybe creating a function,
import numpy as np

def input(a):
    return np.genfromtxt('data%i.dat' % a)
But obviously, I would prefer to store these arrays in memory instead of constantly regenerating them, and would be extremely grateful to know if this is possible in Python.
You can choose to store your arrays in either a dict or a list:
Option 1
Using a dict.
data = {}
for i in xrange(2):
    data['input{}'.format(i)] = np.genfromtxt('data{}.dat'.format(i))
You can access each array by key.
Option 2
Using a list.
data = []
for i in xrange(2):
    data.append(np.genfromtxt('data{}.dat'.format(i)))
Alternatively, using a list comprehension:
data = [np.genfromtxt('data{}.dat'.format(i)) for i in xrange(2)]
You can also use map, which returns a list in Python 2 (in Python 3 you would wrap it in list()):
data = map(lambda x: np.genfromtxt('data{}.dat'.format(x)), xrange(2))
Now you can access each array by index.
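As an aside, if the files follow a naming pattern but you don't know how many there are, a small sketch using the standard glob module (the 'data*.dat' pattern is an assumption):

import glob
import numpy as np

# one entry per matching file, keyed by filename
data = {fname: np.genfromtxt(fname) for fname in sorted(glob.glob('data*.dat'))}
# access by filename, e.g. data['data0.dat']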

How to read contents of datasets of a h5py file into a numpy array given a list of keys?

Inputs to my function are an h5py file and a text file. The text file has two columns: the first column has some utterance information and the second column has the speaker information (for that utterance). The keys of the h5py file (created using create_dataset) are the utterances (first column of the file). Each of these datasets holds a single numpy array of fixed dimension (600,). The h5py file has more utterances than the text file, and some of the utterances in the text file may not be present in the h5py file.
Expected outputs of my function: two numpy arrays
1st array (let us call it X) should be of shape (nutts, 600) with dtype 'float'
2nd array (let us call it y) should be of shape (nutts,) with dtype 'int'
where nutts is the number of utterances in the text file that are actually present in the h5file (nutts <= total_lines_in_text_file)
NOTE: I don't know the nutts in advance. I have to create X and y dynamically.
Method for creating X should be clear from the above description:
For the first utterance in the text file, I check if that utterance exists as a key in the h5file. If it exists, I take the numpy array (600-dimensional), put it in the first row of X, and then iterate. (Reminder: I don't know nutts in advance, so I cannot pre-initialize X with zeros. So I saved the arrays in a dict and tried to convert the dict to a numpy array once I had them all in place.)
More about y:
All the nutts utterances may correspond to nspkrs speakers (nspkrs <= nutts); multiple utterances can map to the same speaker. I want to encode the speaker information of each utterance in a numpy array: for speaker 1 I give a label of 0, and for speaker n a label of n-1.
Here is what I did:
import h5py
import pandas as pd
import numpy as np

h5f = h5py.File('some.h5', 'r')

def load_data(h5f, src_u2s_list):
    with open(src_u2s_list) as f:
        content = f.read().splitlines()
    utt2ivec = {}
    utt2lbl = {}
    spk2spk_class = {}
    spk_id = -1
    for u2s in content:
        utt, spk = u2s.split()
        if spk not in spk2spk_class.keys():
            spk_id += 1
            spk2spk_class[spk] = spk_id
        if utt in h5f.keys():
            utt2ivec[utt] = h5f[utt][:]
            utt2lbl[utt] = spk2spk_class[spk]
        else:
            print("Utterance {0} does not exist in h5file".format(utt))
    data_X = pd.Series(utt2ivec)
    data_y = pd.Series(utt2lbl)
    return data_X.values, data_y.values
Main considerations:
the h5file has around 100,000 utterances and the text file has around 70,000 utterances, so this code runs very slowly
I am new to using h5py files; suggestions or an entire restructuring of the code are welcome
I want to avoid using pandas.
Very Important: The order of the utterances in X and y should be the same. That means the rows of X and y should correspond to same utterance.
Sorry for a long description. I want to make things clear to avoid confusion.
Thank you in advance
Normally when we want to collect a bunch of arrays (of equal size) into a 2d array, we first append them to a list, and then create the array from that at the end:
utt2ilist = []
...
if utt in h5f.keys():
    utt2ilist.append(h5f[utt][:])
...
utt2iarr = np.array(utt2ilist)
I don't know of any way of loading multiple datasets at once, at least not with h5py. You just have to find the relevant key and load it as you do. If you have a large number of datasets, each relatively small (600 elements), this can take time. Watch out if the datasets differ in size - then the result will be an object-dtype array, not a 2d one.
I assume the spk2spk_class[spk] values can be collected in a similar list.
Using lists like this will keep the items in the order they are read. Using the dictionaries and pandas will index them by utt, and with a dictionary intermediary you may lose information about the read order.
I was going to suggest defaultdict for spk2spk_class, but you aren't counting the objects, just giving them unique ids.
My guess is that the slowness comes from reading a large number of utterances, not from the collection mechanism itself. But I don't have your data to test.
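Putting those pieces together, a hedged sketch of the whole function using plain lists and no pandas, keeping the read order and the question's variable names where possible:

import numpy as np

def load_data(h5f, src_u2s_list):
    utt2ilist = []                       # feature vectors, in read order
    lbl_list = []                        # matching speaker labels, same order
    spk2spk_class = {}
    with open(src_u2s_list) as f:
        for line in f:
            utt, spk = line.split()
            if spk not in spk2spk_class:
                spk2spk_class[spk] = len(spk2spk_class)   # next unused integer id
            if utt in h5f:               # membership test on the h5py file/group
                utt2ilist.append(h5f[utt][:])
                lbl_list.append(spk2spk_class[spk])
            else:
                print("Utterance {0} does not exist in h5file".format(utt))
    X = np.array(utt2ilist)              # (nutts, 600) as long as every vector has length 600
    y = np.array(lbl_list, dtype=int)
    return X, y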
